All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH][RFC] Supporting Enlightened Windows 2008Server
  2008-03-06  7:28 ` Keir Fraser
  2008-03-06 10:15   ` Tim Deegan
@ 2008-03-07  1:08   ` Ky Srinivasan
  1 sibling, 0 replies; 14+ messages in thread
From: Ky Srinivasan @ 2008-03-07  1:08 UTC (permalink / raw)
  To: Keir Fraser, xen-devel



>>> On Thu, Mar 6, 2008 at  2:28 AM, in message
<C3F54D9B.14C64%keir.fraser@eu.citrix.com>, Keir Fraser
<keir.fraser@eu.citrix.com> wrote: 
> Personally I think the approach is ugly, and also you have not yet presented
> evidence that supporting the Viridian paravirtualisations improves
> performance.

When I first implemented this (about a year ago), it was not clear if Microsoft would be open sourcing the Veridian specification. Given that, I wanted to have a narrow set of interfaces both to the adapter as well as from the adapter. I take it that you don't care much for this exercise in attempting to isolate the adapter. Now that Veridian specification has been open sourced, I agree there is no need to isolate the adapter from the base hypervisor the way it is currently done. However, given that:
(a) Veridian specification is evolving and Microsoft may define additional functionality to improve guest performance
(b) CPUID namespace, MSR namespace and hypercall namespace collisions between Xen and Veridian. This is the case today and it can only get worse over time.

I feel having a framework that allows you to implement these kinds of mapping layers in complete isolation from the base hypervisor  may in fact be cleaner than trying to implement the mapping code inline in the base Xen code.

With regards to performance, we have only run NetBench and on NetBench we have seen a 10% improvement (on a uniprocessor system). We have had some issues with SMP PV drivers and that is the reason I don't have SMP numbers (the adapter has been tested on SMP machines though). We are currently in the process of running a range of benchmarks and I will keep you posted on what we see. Our goal here is clearly to be competitive (as far as performance goes) with Veridian hosting an enlightened windows guest.  
 
> If it doesn't then it's a waste of time; if it does then it
> raises the question of which hypercalls provide the benefit, and do we get a
> smaller neater patch by supporting just those?

I think the only assumption we can make here is  that the enlightenments will improve the  guest performance. This has been confirmed with the minimal performance testing we have already done.  I am also going to assume that Microsoft will continue to evolve Veridian and the set of enlightenments visible to their guests to improve performance. The question that we need to answer, I think is how are we going to support these enlightenments and not if we are going to support Microsoft specific enlightenments. I will be the first one to admit the patches I submitted need to be cleaned up:

1) Fix coding style
2) Get rid of code that is not being exercised. Based on the Veridian specification I identified a set of functionality that I thought an enlightened guest may depend on. It looks like the current shipping windows server 2008 does not use all the functionality that is currently supported. I am somewhat hesitant to get rid of   unused functionality since I don't know what the next release of windows will use. In fact, the current shipping windows server 2008 (32 bit version)  is already using an undocumented hypercall! 

I do think however that having an environment in which we can implement and evolve the support for windows enlightenments without constantly churning the base Xen code  is the right approach.

Regards,

K. Y  

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH][RFC] Supporting Enlightened Windows 2008Server
  2008-03-06 10:15   ` Tim Deegan
@ 2008-03-07  1:10     ` Ky Srinivasan
  2008-03-07 11:57       ` Tim Deegan
                         ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Ky Srinivasan @ 2008-03-07  1:10 UTC (permalink / raw)
  To: Tim Deegan, Keir Fraser; +Cc: xen-devel



>>> On Thu, Mar 6, 2008 at  5:15 AM, in message
<20080306101542.GA22422@york.uk.xensource.com>, Tim Deegan
<Tim.Deegan@citrix.com> wrote: 
> At 07:28 +0000 on 06 Mar (1204788507), Keir Fraser wrote:
>> Personally I think the approach is ugly, and also you have not yet presented
>> evidence that supporting the Viridian paravirtualisations improves
>> performance. If it doesn't then it's a waste of time; if it does then it
>> raises the question of which hypercalls provide the benefit, and do we get a
>> smaller neater patch by supporting just those? One final comment is that the
>> TLB management code that this slaps on top of the core hypervisor looks a
>> bit scary to me. Tim Deegan may care to comment more on that.
> 
> Some blame lies with the mismatch between the Viridian interface and
> Xen's; there needs to be a way for the TLB flush hypercall to block
> indefinitely.  But I can't see how that turns into more than an atomic_t
> for TlbFlushInhibit and a block-and-schedule operation.  In the current
> patches, there's quite a lot of locking and ownership going on as well.
> I'm confused by the use of wait_on_xen_event_channel(0, xyz); event
> channels don't seem to come into it.

The Veridian API allows the guest to pass in a variable list of arguments to the TLB flush call ( HvFlushVirtualAddressList). Furthermore, both forms of the flush APIs (HvFlushVirtualAddressSpace and HvFlushVirtualAdressList) can specify a list of vcpus that should be involved in the flush process. So, as you have noted we will need a mechanism to co-ordinate the flush operation amongst the set of vcpus involved which means we need to be able give up the physical CPU in the hypervisor waiting for the flush to complete. I have used wait_on_xen_event_channel() to implement this synchronization. Since we don't preserve the stack state when we block in the hypervisor, I have used a seperate per-vcpu page for dealing with hypercall input parameters for calls that can potentially block in the hypervisor. From what I have seen, win2k8 server mostly specifies  all the processors in ProcessorMask. So, I chose to implement TLB flush operations using a single serialization object that keeps track of both the set of vcpus involved in the flush operation as well as the list of pages to be flushed.

> 
> I'll mention now, since I have the patch in front of me, that I dislike
> the addition of an "ext_id" field to the HVM save format header and
> associated special treatment in the save/restore code; you should be
> able to figure out that this is a w2k8 domain from the presence of your
> other records in the save file.

I can fix this.

> 
> 
> 
> More generally, I agree that the approach is very heavyweight.  I don't
> see the need for a framework here, since there's no other proposed user
> of it that would want the same interface.

I agree that there is no need to isolate the shim's dependence on the base Xen code (xen_call_vector_t). I implemented this shim a year ago and at that point it was not clear what Microsoft might do with the Veridian specification.  So, clearly some of the design choices that I made a year ago may not be the right choice today. However, I still think that having an intercept framework where  one can implement Veridian specific functionality without cluttering up the base Xen code is still the right approach. 
    
> It seems to duplicate a lot
> of things (does it really need its own spinlock implementation?)

Clearly not! As I noted in an earlier email to Kier, I will be the first to admit that these patches require significant cleanup and I am willing to clean them up. A lot of what you see has historical baggage and I wanted to get some feedback before I invested the time to clean things up.

> 
> It's certainly not in Xen coding style, even in the framework
> implementation.  (The MS habit of encoding scope and type information in
> variable names annoys the heck out of me.  Why does a lock field in an
> nsPartition_t need to be called "nsLock"?)

Agreed. 

> 
> The naming in general could do with a kicking -- calling everything
> "Novell Shim" is understandable for historical reasons but not really
> descriptive of its function.   But maybe that can wait.
Agreed.

Regards,

K. Y
> 
> Tim

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH][RFC] Supporting Enlightened Windows 2008Server
  2008-03-07  1:10     ` [PATCH][RFC] Supporting Enlightened Windows 2008Server Ky Srinivasan
@ 2008-03-07 11:57       ` Tim Deegan
  2008-03-07 13:19       ` Keir Fraser
  2008-03-07 13:30       ` Keir Fraser
  2 siblings, 0 replies; 14+ messages in thread
From: Tim Deegan @ 2008-03-07 11:57 UTC (permalink / raw)
  To: Ky Srinivasan; +Cc: xen-devel, Keir Fraser

At 18:10 -0700 on 06 Mar (1204827047), Ky Srinivasan wrote:

> The Veridian API allows the guest to pass in a variable list of
> arguments to the TLB flush call
> (HvFlushVirtualAddressList). Furthermore, both forms of the flush APIs
> (HvFlushVirtualAddressSpace and HvFlushVirtualAdressList) can specify
> a list of vcpus that should be involved in the flush process.

I expect that the best way to implement the list-of-addresses feature on
Xen is to do a full TLB flush if there's more than one entry in the list
(that's different from the Hyper-V shadow pagetable design, where
explicit lists of addresses to flush make much more sense).

Then all you need are the existing Xen TLB flush operations, and some
means of gating them.  If you use a generation counter with each vcpu's
inhibit bit, you can probably do that without the need for any locks.

> So, as
> you have noted we will need a mechanism to co-ordinate the flush
> operation amongst the set of vcpus involved which means we need to be
> able give up the physical CPU in the hypervisor waiting for the flush
> to complete. I have used wait_on_xen_event_channel() to implement this
> synchronization.

wait_on_xen_event_channel is more than you need, since you're not going
to wake on events.

> Since we don't preserve the stack state when we block
> in the hypervisor, I have used a seperate per-vcpu page for dealing
> with hypercall input parameters for calls that can potentially block
> in the hypervisor.

Xen already has a system of hypercall continuations that might help here.

Cheers,

Tim

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Citrix Systems (R&D) Ltd.
[Company #02300071, SL9 0DZ, UK.]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH][RFC] Supporting Enlightened Windows 2008Server
  2008-03-07  1:10     ` [PATCH][RFC] Supporting Enlightened Windows 2008Server Ky Srinivasan
  2008-03-07 11:57       ` Tim Deegan
@ 2008-03-07 13:19       ` Keir Fraser
  2008-03-07 13:30       ` Keir Fraser
  2 siblings, 0 replies; 14+ messages in thread
From: Keir Fraser @ 2008-03-07 13:19 UTC (permalink / raw)
  To: Ky Srinivasan, Tim Deegan; +Cc: xen-devel

On 7/3/08 01:10, "Ky Srinivasan" <ksrinivasan@novell.com> wrote:

> The Veridian API allows the guest to pass in a variable list of arguments to
> the TLB flush call ( HvFlushVirtualAddressList). Furthermore, both forms of
> the flush APIs (HvFlushVirtualAddressSpace and HvFlushVirtualAdressList) can
> specify a list of vcpus that should be involved in the flush process. So, as
> you have noted we will need a mechanism to co-ordinate the flush operation
> amongst the set of vcpus involved which means we need to be able give up the
> physical CPU in the hypervisor waiting for the flush to complete. I have used
> wait_on_xen_event_channel() to implement this synchronization. Since we don't
> preserve the stack state when we block in the hypervisor, I have used a
> seperate per-vcpu page for dealing with hypercall input parameters for calls
> that can potentially block in the hypervisor. From what I have seen, win2k8
> server mostly specifies  all the processors in ProcessorMask. So, I chose to
> implement TLB flush operations using a single serialization object that keeps
> track of both the set of vcpus involved in the flush operation as well as the
> list of pages to be flushed.

Clearly avoiding emulating IPI-to-all-CPUs is rather likely to be a win. But
is the very selective subset-of-CPUs and subset-of-addresses really that
useful? Do you get any significant win over just calling
hvmop_flush_tlb_all()?

Also we need to weigh up the likely penetration of NPT and EPT capable
processors by the time w2k8 is shipping in any volume. But even ignoring
that, I bet 95% of the benefit of this patch can be got with a much smaller
patch.

 -- Keir

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH][RFC] Supporting Enlightened Windows 2008Server
  2008-03-07  1:10     ` [PATCH][RFC] Supporting Enlightened Windows 2008Server Ky Srinivasan
  2008-03-07 11:57       ` Tim Deegan
  2008-03-07 13:19       ` Keir Fraser
@ 2008-03-07 13:30       ` Keir Fraser
  2 siblings, 0 replies; 14+ messages in thread
From: Keir Fraser @ 2008-03-07 13:30 UTC (permalink / raw)
  To: Ky Srinivasan, Tim Deegan; +Cc: xen-devel

On 7/3/08 01:10, "Ky Srinivasan" <ksrinivasan@novell.com> wrote:

> I agree that there is no need to isolate the shim's dependence on the base Xen
> code (xen_call_vector_t). I implemented this shim a year ago and at that point
> it was not clear what Microsoft might do with the Veridian specification.  So,
> clearly some of the design choices that I made a year ago may not be the right
> choice today. However, I still think that having an intercept framework where
> one can implement Veridian specific functionality without cluttering up the
> base Xen code is still the right approach.

Clearly putting the Viridian hypercall shims in a different file/directory
makes sense. But I think the shims would need to go on a diet. The TLB
flushing implementation is a good example -- the useful extra features of
the Viridian flush hypercall (if there are any, when partnered with Xen's
shadow code) should be pushed into core Xen HVM TLB-flush handling code.
Otherwise it sits out on the periphery with a correspndingly greater
tendency to rot, and for no benefit (certainly I would strongly argue it is
not cleaner!).

 -- Keir

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH][RFC] Supporting Enlightened Windows 2008Server
@ 2008-04-04 23:41 Ky Srinivasan
  2008-04-05 15:17 ` Daniel P. Berrange
       [not found] ` <20080410112313.GA4628@weybridge.uk.xensource.com>
  0 siblings, 2 replies; 14+ messages in thread
From: Ky Srinivasan @ 2008-04-04 23:41 UTC (permalink / raw)
  To: Tim Deegan, Keir Fraser; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 2052 bytes --]

>>> On Fri, Mar 7, 2008 at  8:30 AM, in message
<C3F6F3FB.1DAE5%keir.fraser@eu.citrix.com>, Keir Fraser
<keir.fraser@eu.citrix.com> wrote: 
> On 7/3/08 01:10, "Ky Srinivasan" <ksrinivasan@novell.com> wrote:
> 
>> I agree that there is no need to isolate the shim's dependence on the base 
> Xen
>> code (xen_call_vector_t). I implemented this shim a year ago and at that 
> point
>> it was not clear what Microsoft might do with the Veridian specification.  
> So,
>> clearly some of the design choices that I made a year ago may not be the 
> right
>> choice today. However, I still think that having an intercept framework 
> where
>> one can implement Veridian specific functionality without cluttering up the
>> base Xen code is still the right approach.
> 
> Clearly putting the Viridian hypercall shims in a different file/directory
> makes sense. But I think the shims would need to go on a diet. The TLB
> flushing implementation is a good example -- the useful extra features of
> the Viridian flush hypercall (if there are any, when partnered with Xen's
> shadow code) should be pushed into core Xen HVM TLB-flush handling code.
> Otherwise it sits out on the periphery with a correspndingly greater
> tendency to rot, and for no benefit (certainly I would strongly argue it is
> not cleaner!).
> 
>  -- Keir

Kier,

Based on the feedback I got from you and Tim, I am enclosing the next version of the patches to support enlightened win2008 server. Here are the changes I have made:

1) I have put the shim on a low calorie diet - I have gotten rid of the framework infrastructure and to the extent possible integrated the shim code with xen.

2) I have tried to cleanup the code. I am sure  more work will be needed here.

3) I am not advertising the TLB related enlightenments. We can revisit this later if needed.

Regards,

K. Y
 
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel




[-- Attachment #2: hv_tools.patch --]
[-- Type: text/plain, Size: 4468 bytes --]

Index: xen-unstable.hg/tools/python/xen/lowlevel/xc/xc.c
===================================================================
--- xen-unstable.hg.orig/tools/python/xen/lowlevel/xc/xc.c	2008-04-04 13:01:00.000000000 -0400
+++ xen-unstable.hg/tools/python/xen/lowlevel/xc/xc.c	2008-04-04 13:45:08.000000000 -0400
@@ -622,14 +622,14 @@
     int i;
 #endif
     char *image;
-    int memsize, vcpus = 1, acpi = 0, apic = 1;
+    int memsize, vcpus = 1, acpi = 0, apic = 1, extid = 0;
 
     static char *kwd_list[] = { "domid",
-				"memsize", "image", "vcpus", "acpi",
+				"memsize", "image", "vcpus", "extid", "acpi",
 				"apic", NULL };
-    if ( !PyArg_ParseTupleAndKeywords(args, kwds, "iis|iii", kwd_list,
+    if ( !PyArg_ParseTupleAndKeywords(args, kwds, "iis|iiii", kwd_list,
                                       &dom, &memsize,
-                                      &image, &vcpus, &acpi, &apic) )
+                                      &image, &vcpus, &extid, &acpi, &apic) )
         return NULL;
 
     if ( xc_hvm_build(self->xc_handle, dom, memsize, image) != 0 )
@@ -654,6 +654,7 @@
     va_hvm->checksum = -sum;
     munmap(va_map, XC_PAGE_SIZE);
 #endif
+    xc_set_hvm_param(self->xc_handle, dom, HVM_PARAM_EXTEND_HYPERVISOR, extid);
 
     return Py_BuildValue("{}");
 }
Index: xen-unstable.hg/tools/python/xen/xend/XendConfig.py
===================================================================
--- xen-unstable.hg.orig/tools/python/xen/xend/XendConfig.py	2008-04-04 13:01:01.000000000 -0400
+++ xen-unstable.hg/tools/python/xen/xend/XendConfig.py	2008-04-04 13:56:58.000000000 -0400
@@ -139,6 +139,7 @@
     'monitor': int,
     'nographic': int,
     'pae' : int,
+    'extid': int, 
     'rtc_timeoffset': int,
     'serial': str,
     'sdl': int,
Index: xen-unstable.hg/tools/python/xen/xend/image.py
===================================================================
--- xen-unstable.hg.orig/tools/python/xen/xend/image.py	2008-04-04 13:01:01.000000000 -0400
+++ xen-unstable.hg/tools/python/xen/xend/image.py	2008-04-04 13:45:08.000000000 -0400
@@ -508,6 +508,7 @@
 
         self.apic = int(vmConfig['platform'].get('apic', 0))
         self.acpi = int(vmConfig['platform'].get('acpi', 0))
+	self.extid  = int(vmConfig['platform'].get('extid',  0))
         self.guest_os_type = vmConfig['platform'].get('guest_os_type')
 
     # Return a list of cmd line args to the device models based on the
@@ -607,6 +608,7 @@
         log.debug("store_evtchn   = %d", store_evtchn)
         log.debug("memsize        = %d", mem_mb)
         log.debug("vcpus          = %d", self.vm.getVCpuCount())
+	log.debug("extid          = %d", self.extid)
         log.debug("acpi           = %d", self.acpi)
         log.debug("apic           = %d", self.apic)
 
@@ -614,6 +616,7 @@
                           image          = self.loader,
                           memsize        = mem_mb,
                           vcpus          = self.vm.getVCpuCount(),
+			  extid          = self.extid,
                           acpi           = self.acpi,
                           apic           = self.apic)
         rc['notes'] = { 'SUSPEND_CANCEL': 1 }
Index: xen-unstable.hg/tools/python/xen/xm/create.py
===================================================================
--- xen-unstable.hg.orig/tools/python/xen/xm/create.py	2008-04-04 13:01:01.000000000 -0400
+++ xen-unstable.hg/tools/python/xen/xm/create.py	2008-04-04 13:45:08.000000000 -0400
@@ -207,6 +207,10 @@
           use="""Timer mode (0=delay virtual time when ticks are missed;
           1=virtual time is always wallclock time.""")
 
+gopts.var('extid', val='EXTID',
+          fn=set_int, default=0,
+          use="Specify extention ID for a HVM domain.")
+
 gopts.var('acpi', val='ACPI',
           fn=set_int, default=1,
           use="Disable or enable ACPI of HVM domain.")
@@ -750,7 +754,7 @@
 def configure_hvm(config_image, vals):
     """Create the config for HVM devices.
     """
-    args = [ 'device_model', 'pae', 'vcpus', 'boot', 'fda', 'fdb', 'timer_mode',
+    args = [ 'device_model', 'pae', 'extid', 'vcpus', 'boot', 'fda', 'fdb', 'timer_mode',
              'localtime', 'serial', 'stdvga', 'isa', 'nographic', 'soundhw',
              'vnc', 'vncdisplay', 'vncunused', 'vncconsole', 'vnclisten',
              'sdl', 'display', 'xauthority', 'rtc_timeoffset', 'monitor',

[-- Attachment #3: hv_xen_base.patch --]
[-- Type: text/plain, Size: 9364 bytes --]

%patch
Index: xen-unstable.hg/xen/arch/x86/hvm/svm/svm.c
===================================================================
--- xen-unstable.hg.orig/xen/arch/x86/hvm/svm/svm.c	2008-04-04 14:28:26.000000000 -0400
+++ xen-unstable.hg/xen/arch/x86/hvm/svm/svm.c	2008-04-04 14:28:44.000000000 -0400
@@ -50,6 +50,7 @@
 #include <asm/hvm/vpt.h>
 #include <asm/hvm/trace.h>
 #include <asm/hap.h>
+#include <asm/hvm/hvm_extensions.h>
 
 u32 svm_feature_flags;
 
@@ -950,7 +951,7 @@
 
 static void svm_vmexit_do_cpuid(struct cpu_user_regs *regs)
 {
-    unsigned int eax, ebx, ecx, edx, inst_len;
+    unsigned int eax, ebx, ecx, edx, inst_len, input;
 
     inst_len = __get_instruction_length(current, INSTR_CPUID, NULL);
     if ( inst_len == 0 ) 
@@ -960,6 +961,7 @@
     ebx = regs->ebx;
     ecx = regs->ecx;
     edx = regs->edx;
+    input = eax;
 
     svm_cpuid_intercept(&eax, &ebx, &ecx, &edx);
 
@@ -968,6 +970,7 @@
     regs->ecx = ecx;
     regs->edx = edx;
 
+    hyperx_intercept_do_cpuid(input, regs);
     __update_guest_eip(regs, inst_len);
 }
 
@@ -984,6 +987,10 @@
     struct vcpu *v = current;
     struct vmcb_struct *vmcb = v->arch.hvm_svm.vmcb;
 
+    if (hyperx_intercept_do_msr_read(ecx, regs))
+    {
+            goto done;
+    }
     switch ( ecx )
     {
     case MSR_IA32_TSC:
@@ -1085,6 +1092,10 @@
     msr_content = (u32)regs->eax | ((u64)regs->edx << 32);
 
     hvmtrace_msr_write(v, ecx, msr_content);
+    if (hyperx_intercept_do_msr_write(ecx, regs))
+    {
+            goto done_msr_write;
+    }
 
     switch ( ecx )
     {
@@ -1141,7 +1152,7 @@
         }
         break;
     }
-
+done_msr_write:
     return X86EMUL_OKAY;
 
  gpf:
Index: xen-unstable.hg/xen/arch/x86/hvm/vmx/vmx.c
===================================================================
--- xen-unstable.hg.orig/xen/arch/x86/hvm/vmx/vmx.c	2008-04-04 14:28:26.000000000 -0400
+++ xen-unstable.hg/xen/arch/x86/hvm/vmx/vmx.c	2008-04-04 14:28:44.000000000 -0400
@@ -49,6 +49,7 @@
 #include <asm/hvm/vpt.h>
 #include <public/hvm/save.h>
 #include <asm/hvm/trace.h>
+#include <asm/hvm/hvm_extensions.h>
 
 enum handler_return { HNDL_done, HNDL_unhandled, HNDL_exception_raised };
 
@@ -1225,12 +1226,13 @@
 
 static void vmx_do_cpuid(struct cpu_user_regs *regs)
 {
-    unsigned int eax, ebx, ecx, edx;
+    unsigned int eax, ebx, ecx, edx, input;
 
     eax = regs->eax;
     ebx = regs->ebx;
     ecx = regs->ecx;
     edx = regs->edx;
+    input = eax;
 
     vmx_cpuid_intercept(&eax, &ebx, &ecx, &edx);
 
@@ -1238,6 +1240,7 @@
     regs->ebx = ebx;
     regs->ecx = ecx;
     regs->edx = edx;
+    hyperx_intercept_do_cpuid(input, regs);
 }
 
 static void vmx_dr_access(unsigned long exit_qualification,
@@ -1506,6 +1509,9 @@
 
     HVM_DBG_LOG(DBG_LEVEL_1, "ecx=%x", ecx);
 
+    if (hyperx_intercept_do_msr_read(ecx, regs))
+        goto done;
+
     switch ( ecx )
     {
     case MSR_IA32_TSC:
@@ -1699,6 +1705,9 @@
     HVM_DBG_LOG(DBG_LEVEL_1, "ecx=%x, eax=%x, edx=%x",
                 ecx, (u32)regs->eax, (u32)regs->edx);
 
+    if (hyperx_intercept_do_msr_write(ecx, regs)) 
+        return X86EMUL_OKAY;
+
     msr_content = (u32)regs->eax | ((u64)regs->edx << 32);
 
     hvmtrace_msr_write(v, ecx, msr_content);
Index: xen-unstable.hg/xen/include/asm-x86/hvm/domain.h
===================================================================
--- xen-unstable.hg.orig/xen/include/asm-x86/hvm/domain.h	2008-04-04 14:28:26.000000000 -0400
+++ xen-unstable.hg/xen/include/asm-x86/hvm/domain.h	2008-04-04 14:28:44.000000000 -0400
@@ -79,6 +79,7 @@
 #endif
     bool_t                 hap_enabled;
     bool_t                 qemu_mapcache_invalidate;
+    void                   *hyperv_handle; /* will be NULL on creation (memset)*/
 };
 
 #endif /* __ASM_X86_HVM_DOMAIN_H__ */
Index: xen-unstable.hg/xen/arch/x86/hvm/Makefile
===================================================================
--- xen-unstable.hg.orig/xen/arch/x86/hvm/Makefile	2008-04-04 14:28:26.000000000 -0400
+++ xen-unstable.hg/xen/arch/x86/hvm/Makefile	2008-04-04 14:28:44.000000000 -0400
@@ -1,5 +1,6 @@
 subdir-y += svm
 subdir-y += vmx
+subdir-y += hyperv 
 
 obj-y += emulate.o
 obj-y += hvm.o
Index: xen-unstable.hg/xen/arch/x86/hvm/hvm.c
===================================================================
--- xen-unstable.hg.orig/xen/arch/x86/hvm/hvm.c	2008-04-04 14:28:26.000000000 -0400
+++ xen-unstable.hg/xen/arch/x86/hvm/hvm.c	2008-04-04 14:28:44.000000000 -0400
@@ -43,6 +43,7 @@
 #include <asm/mc146818rtc.h>
 #include <asm/spinlock.h>
 #include <asm/hvm/hvm.h>
+#include <asm/hvm/hvm_extensions.h>
 #include <asm/hvm/vpt.h>
 #include <asm/hvm/support.h>
 #include <asm/hvm/cacheattr.h>
@@ -156,6 +157,7 @@
     rtc_migrate_timers(v);
     hpet_migrate_timers(v);
     pt_migrate(v);
+    hyperx_intercept_do_migrate_timers(v);
 }
 
 void hvm_do_resume(struct vcpu *v)
@@ -342,6 +344,7 @@
 
 void hvm_domain_destroy(struct domain *d)
 {
+    hyperx_intercept_domain_destroy(d);
     hvm_funcs.domain_destroy(d);
     vioapic_deinit(d);
     hvm_destroy_cacheattr_region_list(d);
@@ -625,8 +628,14 @@
 {
     int rc;
 
+    if ((rc = hyperx_intercept_vcpu_initialize(v)) != 0)
+        goto fail1;
+
     if ( (rc = vlapic_init(v)) != 0 )
+    {
+        hyperx_intercept_vcpu_destroy(v);
         goto fail1;
+    }
 
     if ( (rc = hvm_funcs.vcpu_initialise(v)) != 0 )
         goto fail2;
@@ -674,12 +683,14 @@
     hvm_funcs.vcpu_destroy(v);
  fail2:
     vlapic_destroy(v);
+    hyperx_intercept_vcpu_destroy(v);
  fail1:
     return rc;
 }
 
 void hvm_vcpu_destroy(struct vcpu *v)
 {
+    hyperx_intercept_vcpu_destroy(v);
     vlapic_destroy(v);
     hvm_funcs.vcpu_destroy(v);
 
@@ -1823,6 +1834,10 @@
     case 0:
         break;
     }
+    if (hyperx_intercept_do_hypercall(regs)) 
+    {
+        return HVM_HCALL_completed;
+    }
 
     if ( (eax >= NR_hypercalls) || !hvm_hypercall32_table[eax] )
     {
@@ -1964,6 +1979,7 @@
         vcpu_wake(v);
 
     gdprintk(XENLOG_INFO, "AP %d bringup succeeded.\n", vcpuid);
+    hyperx_intercept_vcpu_up(v);
     return 0;
 }
 
@@ -2210,6 +2226,10 @@
                 if ( a.value > HVMPTM_one_missed_tick_pending )
                     goto param_fail;
                 break;
+            case HVM_PARAM_EXTEND_HYPERVISOR:
+printk("KYS: EXTEND hypervisor id is %d\n", (int)a.value);
+                if ((a.value == 1) && hyperv_initialize(d)) 
+                    goto param_fail;
             }
             d->arch.hvm_domain.params[a.index] = a.value;
             rc = 0;
Index: xen-unstable.hg/xen/arch/x86/hvm/save.c
===================================================================
--- xen-unstable.hg.orig/xen/arch/x86/hvm/save.c	2008-04-04 14:28:26.000000000 -0400
+++ xen-unstable.hg/xen/arch/x86/hvm/save.c	2008-04-04 14:28:44.000000000 -0400
@@ -23,6 +23,8 @@
 
 #include <asm/hvm/support.h>
 #include <public/hvm/save.h>
+#include <public/hvm/params.h>
+#include <asm/hvm/hvm_extensions.h>
 
 void arch_hvm_save(struct domain *d, struct hvm_save_header *hdr)
 {
Index: xen-unstable.hg/xen/include/public/arch-x86/hvm/save.h
===================================================================
--- xen-unstable.hg.orig/xen/include/public/arch-x86/hvm/save.h	2008-04-04 14:28:26.000000000 -0400
+++ xen-unstable.hg/xen/include/public/arch-x86/hvm/save.h	2008-04-04 14:28:44.000000000 -0400
@@ -38,7 +38,7 @@
     uint32_t version;           /* File format version */
     uint64_t changeset;         /* Version of Xen that saved this file */
     uint32_t cpuid;             /* CPUID[0x01][%eax] on the saving machine */
-    uint32_t pad0;
+    uint32_t pad0;		/* extension ID */
 };
 
 DECLARE_HVM_SAVE_TYPE(HEADER, 1, struct hvm_save_header);
@@ -421,9 +421,30 @@
 
 DECLARE_HVM_SAVE_TYPE(MTRR, 14, struct hvm_hw_mtrr);
 
+struct hvm_hyperv_dom {
+    uint64_t guestid_msr;
+    uint64_t hypercall_msr;
+    uint32_t long_mode;
+    uint32_t ext_id;
+};
+DECLARE_HVM_SAVE_TYPE(HYPERV_DOM, 15, struct hvm_hyperv_dom);
+
+struct hvm_hyperv_cpu {
+    uint64_t control_msr;
+    uint64_t version_msr;
+    uint64_t sief_msr;
+    uint64_t simp_msr;
+    uint64_t eom_msr;
+    uint64_t int_msr[16];
+    struct {
+        uint64_t config;
+        uint64_t count;
+    } timers[4];
+};
+DECLARE_HVM_SAVE_TYPE(HYPERV_CPU, 16, struct hvm_hyperv_cpu);
 /* 
  * Largest type-code in use
  */
-#define HVM_SAVE_CODE_MAX 14
+#define HVM_SAVE_CODE_MAX 16
 
 #endif /* __XEN_PUBLIC_HVM_SAVE_X86_H__ */
Index: xen-unstable.hg/xen/include/public/hvm/params.h
===================================================================
--- xen-unstable.hg.orig/xen/include/public/hvm/params.h	2008-04-04 14:28:05.000000000 -0400
+++ xen-unstable.hg/xen/include/public/hvm/params.h	2008-04-04 14:30:49.000000000 -0400
@@ -83,7 +83,8 @@
 
 /* Boolean: Enable virtual HPET (high-precision event timer)? (x86-only) */
 #define HVM_PARAM_HPET_ENABLED 11
+#define HVM_PARAM_EXTEND_HYPERVISOR 12
 
-#define HVM_NR_PARAMS          12
+#define HVM_NR_PARAMS          13
 
 #endif /* __XEN_PUBLIC_HVM_PARAMS_H__ */

[-- Attachment #4: hv_xen_extension.patch --]
[-- Type: text/plain, Size: 86132 bytes --]

%patch
Index: xen-unstable.hg/xen/include/asm-x86/hvm/hvm_extensions.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ xen-unstable.hg/xen/include/asm-x86/hvm/hvm_extensions.h	2008-03-26 13:56:39.000000000 -0400
@@ -0,0 +1,166 @@
+/****************************************************************************
+ |
+ | Copyright (c) [2007, 2008] Novell, Inc.
+ | All Rights Reserved.
+ |
+ | This program is free software; you can redistribute it and/or
+ | modify it under the terms of version 2 of the GNU General Public License as
+ | published by the Free Software Foundation.
+ |
+ | This program is distributed in the hope that it will be useful,
+ | but WITHOUT ANY WARRANTY; without even the implied warranty of
+ | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ | GNU General Public License for more details.
+ |
+ | You should have received a copy of the GNU General Public License
+ | along with this program; if not, contact Novell, Inc.
+ |
+ | To contact Novell about this file by physical or electronic mail,
+ | you may find current contact information at www.novell.com
+ |
+ |***************************************************************************
+*/
+
+/*
+ * hvm_extensions.h  
+ * Implement Hyperv extensions.
+ * Engineering Contact: K. Y. Srinivasan
+ */
+
+#ifndef HVM_EXTENSION_H
+#define HVM_EXTENSION_H 
+
+#include <xen/sched.h>
+#include <asm/domain.h>
+#include <xen/timer.h>
+#include <xen/time.h>
+#include <asm/regs.h>
+#include <asm/types.h>
+#include <asm/hvm/io.h>
+#include <asm/hvm/hvm.h>
+#include <asm/hvm/domain.h>
+
+int
+hyperv_dom_create(struct domain *d);
+void
+hyperv_dom_destroy(struct domain *d);
+int
+hyperv_vcpu_initialize(struct vcpu *v);
+void
+hyperv_vcpu_up(struct vcpu *v);
+void
+hyperv_vcpu_destroy(struct vcpu *v);
+int
+hyperv_do_cpu_id(uint32_t input, struct cpu_user_regs *regs);
+int
+hyperv_do_rd_msr(uint32_t idx, struct cpu_user_regs *regs);
+int
+hyperv_do_wr_msr(uint32_t idx, struct cpu_user_regs *regs);
+int
+hyperv_do_hypercall(struct cpu_user_regs *pregs);
+void
+hyperv_do_migrate_timers(struct vcpu *v);
+int
+hyperv_initialize(struct domain *d);
+
+
+
+			
+static inline int
+hyperx_intercept_domain_create(struct domain *d)
+{
+	if (d->arch.hvm_domain.params[HVM_PARAM_EXTEND_HYPERVISOR] ==1) {
+		return(hyperv_dom_create(d));
+	}
+	return (0);
+}
+
+static inline void 
+hyperx_intercept_domain_destroy(struct domain *d)
+{
+	if (d->arch.hvm_domain.params[HVM_PARAM_EXTEND_HYPERVISOR] ==1) {
+		hyperv_dom_destroy(d);
+	}
+}
+
+static inline int
+hyperx_intercept_vcpu_initialize(struct vcpu *v)
+{
+	struct domain *d = v->domain;
+	if (d->arch.hvm_domain.params[HVM_PARAM_EXTEND_HYPERVISOR] ==1) {
+		return(hyperv_vcpu_initialize(v));
+	}
+	return (0);
+}
+
+
+static inline void 
+hyperx_intercept_vcpu_up(struct vcpu *v)
+{
+	struct domain *d = current->domain;
+	if (d->arch.hvm_domain.params[HVM_PARAM_EXTEND_HYPERVISOR] ==1) {
+		hyperv_vcpu_up(v);
+	}
+}
+	
+static inline void 
+hyperx_intercept_vcpu_destroy(struct vcpu *v)
+{
+	struct domain *d = v->domain;
+	if (d->arch.hvm_domain.params[HVM_PARAM_EXTEND_HYPERVISOR] ==1) {
+		hyperv_vcpu_destroy(v);
+	}
+}
+
+static inline int
+hyperx_intercept_do_cpuid(uint32_t idx, struct cpu_user_regs *regs)
+{
+	struct domain *d = current->domain;
+	if (d->arch.hvm_domain.params[HVM_PARAM_EXTEND_HYPERVISOR] ==1) {
+		return(hyperv_do_cpu_id(idx, regs));
+	}
+	return (0);
+}
+
+static inline int
+hyperx_intercept_do_msr_read(uint32_t idx, struct cpu_user_regs *regs)
+{
+	struct domain *d = current->domain;
+	if (d->arch.hvm_domain.params[HVM_PARAM_EXTEND_HYPERVISOR] ==1) {
+		return(hyperv_do_rd_msr(idx, regs));
+	}
+	return (0);
+}
+static inline int
+hyperx_intercept_do_msr_write(uint32_t idx, struct cpu_user_regs *regs)
+{
+	struct domain *d = current->domain;
+	if (d->arch.hvm_domain.params[HVM_PARAM_EXTEND_HYPERVISOR] ==1) {
+		return(hyperv_do_wr_msr(idx, regs));
+	}
+	return (0);
+}
+
+static inline int
+hyperx_intercept_do_hypercall(struct cpu_user_regs *regs)
+{
+	struct domain *d = current->domain;
+	if (d->arch.hvm_domain.params[HVM_PARAM_EXTEND_HYPERVISOR] ==1) {
+		return(hyperv_do_hypercall(regs));
+	}
+	return (0);
+}
+
+static inline void 
+hyperx_intercept_do_migrate_timers(struct vcpu *v)
+{
+	struct domain *d = current->domain;
+	if (d->arch.hvm_domain.params[HVM_PARAM_EXTEND_HYPERVISOR] ==1) {
+		return(hyperv_do_migrate_timers(v));
+	}
+}
+
+
+int hyperx_initialize(struct domain *d);
+
+#endif
Index: xen-unstable.hg/xen/arch/x86/hvm/hyperv/Makefile
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ xen-unstable.hg/xen/arch/x86/hvm/hyperv/Makefile	2008-03-26 13:56:39.000000000 -0400
@@ -0,0 +1,2 @@
+obj-y += hv_intercept.o  
+obj-y += hv_hypercall.o  
Index: xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_errno.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_errno.h	2008-03-26 13:56:39.000000000 -0400
@@ -0,0 +1,62 @@
+/****************************************************************************
+ |
+ | Copyright (c) [2007, 2008] Novell, Inc.
+ | All Rights Reserved.
+ |
+ | This program is free software; you can redistribute it and/or
+ | modify it under the terms of version 2 of the GNU General Public License as
+ | published by the Free Software Foundation.
+ |
+ | This program is distributed in the hope that it will be useful,
+ | but WITHOUT ANY WARRANTY; without even the implied warranty of
+ | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ | GNU General Public License for more details.
+ |
+ | You should have received a copy of the GNU General Public License
+ | along with this program; if not, contact Novell, Inc.
+ |
+ | To contact Novell about this file by physical or electronic mail,
+ | you may find current contact information at www.novell.com
+ |
+ |***************************************************************************
+*/
+
+/*
+ * hv_errno.h
+ * Error codes for the  Novell Shim.
+ *
+ * Engineering Contact: K. Y. Srinivasan
+ */
+
+#ifndef HV_ERRNO_H
+#define HV_ERRNO_H
+
+#define HV_STATUS_SUCCESS			0x0000
+#define HV_STATUS_INVALID_HYPERCALL_CODE	0x0002
+#define HV_STATUS_INVALID_HYPERCALL_INPUT	0x0003
+#define HV_STATUS_INVALID_ALIGNMENT		0x0004
+#define HV_STATUS_INVALID_PARAMETER		0x0005
+#define HV_STATUS_ACCESS_DENIED			0x0006
+#define HV_STATUS_INVALID_PARTITION_STATE	0x0007
+#define HV_STATUS_OPERATION_DENIED		0x0008
+#define HV_STATUS_UNKNOWN_PROPERTY		0x0009
+#define HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE	0x000A
+#define HV_STATUS_INSUFFICIENT_MEMORY		0x000B
+#define HV_STATUS_PARTITION_TOO_DEEP		0x000C
+#define HV_STATUS_INVALID_PARTITION_ID		0x000D
+#define HV_STATUS_INVALID_VP_INDEX		0x000E
+#define HV_STATUS_UNABLE_TO_RESTORE_STATE	0x000F
+#define HV_STATUS_NOT_FOUND			0x0010
+#define HV_STATUS_INVALID_PORT_ID		0x0011
+#define HV_STATUS_INVALID_CONNECTION_ID		0x0012
+#define HV_STATUS_INSUFFICIENT_BUFFERS		0x0013
+#define HV_STATUS_NOT_ACKNOWLEDGED		0x0014
+#define HV_STATUS_INVALID_VP_STATE		0x0015
+#define HV_STATUS_ACKNOWLEDGED			0x0016
+#define HV_STATUS_INVALID_SAVE_RESTORE_STATE	0x0017
+#define	HV_STATUS_NO_MEMORY_4PAGES		0x0100
+#define	HV_STATUS_NO_MEMORY_16PAGES		0x0101
+#define	HV_STATUS_NO_MEMORY_64PAGES		0x0102
+#define	HV_STATUS_NO_MEMORY_256PAGES		0x0103
+#define	HV_STATUS_NO_MEMORY_1024PAGES		0x0104
+#endif 	
Index: xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_hypercall.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_hypercall.c	2008-03-26 14:18:32.000000000 -0400
@@ -0,0 +1,747 @@
+/****************************************************************************
+ |
+ | Copyright (c) [2007, 2008] Novell, Inc.
+ | All Rights Reserved.
+ |
+ | This program is free software; you can redistribute it and/or
+ | modify it under the terms of version 2 of the GNU General Public License as
+ | published by the Free Software Foundation.
+ |
+ | This program is distributed in the hope that it will be useful,
+ | but WITHOUT ANY WARRANTY; without even the implied warranty of
+ | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ | GNU General Public License for more details.
+ |
+ | You should have received a copy of the GNU General Public License
+ | along with this program; if not, contact Novell, Inc.
+ |
+ | To contact Novell about this file by physical or electronic mail,
+ | you may find current contact information at www.novell.com
+ |
+ |***************************************************************************
+*/
+
+/*
+ * nshypercall.c.
+ * This file implements the hypercall component of the hyperv Shim. 
+ *
+ * Engineering Contact: K. Y. Srinivasan
+ */
+
+#include <asm/page.h>
+#include <asm/processor.h>
+#include <asm/hvm/support.h>
+#include <xen/cpumask.h>
+#include <xen/event.h>
+#include <xen/domain.h>
+
+#include <asm/hvm/hvm_extensions.h>
+#include "hv_shim.h"
+#include "hv_errno.h"
+#include "hv_hypercall.h"
+
+
+void hv_collect_stats(int event, hv_vcpu_stats_t *statsp)
+{
+	switch (event) {
+	case HV_CSWITCH:
+		statsp->num_switches++;
+		return;
+	case HV_FLUSH_VA:
+		statsp->num_flushes++;
+		return;
+	case HV_FLUSH_RANGE:
+		statsp->num_flush_ranges++;
+		return;
+	case HV_FLUSH_VA_POSTED: 
+		statsp->num_flushes_posted++;
+		return;
+	case HV_FLUSH_RANGE_POSTED:
+		statsp->num_flush_ranges_posted++;
+		return;
+	case HV_TPR_READ:
+		statsp->num_tpr_reads++;
+		return;
+	case HV_ICR_READ:
+		statsp->num_icr_reads++;
+		return;
+	case HV_TPR_WRITE: 
+		statsp->num_tpr_writes++;
+		return;
+	case HV_ICR_WRITE:
+		statsp->num_icr_writes++;
+		return;
+	case HV_EOI_WRITE:
+		statsp->num_eoi_writes++;
+		return;
+
+	case HV_GFS_ACQUIRE:
+		statsp->num_gfs_acquires++;
+		return;
+	case HV_GFS_RELEASE:
+		statsp->num_gfs_releases++;
+		return;
+	case HV_TLB_FLUSH:
+		statsp->num_tlb_flushes++;
+		return;
+	case HV_INVL_PG:
+		statsp->num_invl_pages++;
+		return;
+	case HV_TIMEOUTS:
+		statsp->num_time_outs++;
+		return;
+	}
+}
+
+void
+hv_print_stats(hv_partition_t *curp, int i)
+{
+	hv_vcpu_t *v;
+	v = &curp->vcpu_state[i];
+	printk("Printing stats for vcpu ID: %d\n", i);
+
+	printk("Number of context switches: %lu\n", v->stats.num_switches);
+	printk("Number of flushes: %lu\n", v->stats.num_flushes);
+	printk("Number of flushes posted: %lu\n", v->stats.num_flushes_posted);
+	printk("Number of flush ranges: %lu\n", v->stats.num_flush_ranges);
+	printk("Number of flush ranges posted: %lu\n", 
+		v->stats.num_flush_ranges_posted);
+	printk("Number of TPR reads: %lu\n", v->stats.num_tpr_reads);
+	printk("Number of ICR reads: %lu\n", v->stats.num_icr_reads);
+	printk("Number of Eoi writes: %lu\n", v->stats.num_eoi_writes);
+	printk("Number of Tpr writes: %lu\n", v->stats.num_tpr_writes);
+	printk("Number of Icr writes: %lu\n", v->stats.num_icr_writes);
+	printk("Number of GFS acuires: %lu\n", v->stats.num_gfs_acquires);
+	printk("Number of GFS releases: %lu\n", v->stats.num_gfs_releases);
+	printk("Number of TLB flushes: %lu\n", v->stats.num_tlb_flushes);
+	printk("Number of INVLPG flushes: %lu\n", v->stats.num_invl_pages);
+	printk("Number of TIMEOUTS: %lu\n", v->stats.num_time_outs);
+
+}
+
+			
+
+static int
+hv_get_vp_registers(paddr_t input, paddr_t output)
+{
+	hv_vcpu_t        *vcpup, *targetp;
+	hv_partition_t   *curp = hv_get_current_partition();
+	get_vp_registers_input_t	*inbuf;
+	get_vp_registers_output_t	*outbuf;
+	struct vcpu_guest_context	*vcpu_ctx;
+	u32		*reg_indexp;
+	get_vp_registers_output_t		*out_regp;
+	u32		num_output_bytes = 0;
+
+        vcpup = &curp->vcpu_state[hv_get_current_vcpu_index()];
+	inbuf = vcpup->input_buffer;
+	outbuf = vcpup->output_buffer;
+	out_regp = outbuf;
+	/*
+	 * Copy the input data to the per-cpu input buffer.
+	 * This may be an overkill; obviously it is better to only
+	 * copy what we need. XXXKYS: Check with Mike.
+	 */
+	if (hvm_copy_from_guest_phys(inbuf, input, PAGE_SIZE)) {
+		return (HV_STATUS_INVALID_ALIGNMENT);
+	}
+	/*
+	 * If the partition ID specified does not match with the current 
+	 * domain return appropriate error.
+	 */
+	if ((u64)current->domain->domain_id != inbuf->partition_id) {
+		return (HV_STATUS_ACCESS_DENIED);
+	}
+	if (inbuf->vp_index > MAX_VIRT_CPUS) { 
+		return (HV_STATUS_INVALID_VP_INDEX);
+	}
+	targetp = &curp->vcpu_state[inbuf->vp_index]; 
+	if (!(targetp->flags & HV_VCPU_UP)) {
+		return (HV_STATUS_INVALID_VP_STATE);
+	}
+	if ((vcpu_ctx = 
+	     xmalloc_bytes(sizeof(struct vcpu_guest_context))) 
+		== NULL) {
+		return (HV_STATUS_INSUFFICIENT_MEMORY);
+	}
+
+	/*
+	 * Get the register state of the specified vcp.
+	 */
+	if (current->vcpu_id != inbuf->vp_index) {
+		vcpu_pause(targetp->xen_vcpu);
+	}
+	arch_get_info_guest(targetp->xen_vcpu, vcpu_ctx);
+	if (current->vcpu_id != inbuf->vp_index) {
+		vcpu_unpause(targetp->xen_vcpu);
+	}
+	/*
+	 * Now that we have the register state; select what we want and
+	 * populate the output buffer.
+	 */
+	reg_indexp = &inbuf->reg_index;
+	while (*reg_indexp != 0) {
+		switch	(*reg_indexp) {
+			/*
+			 * XXXKYS: need mapping code here; populate
+			 * outbuf.
+			 */
+			panic("hv_get_vp_registers not supported\n");
+		}
+		reg_indexp++;
+		out_regp++ ;	/*128 bit registers */
+		num_output_bytes +=16;
+		if ((char *)reg_indexp > ((char *)inbuf + PAGE_SIZE)) {
+			/*
+			 *input list not reminated correctly; bail out.
+			 */
+			panic("hv_get_vp_registers:input list not terminated\n"); 
+			break;
+		}
+	}
+	if (hvm_copy_to_guest_phys(output, outbuf, num_output_bytes)) {
+		/* Some problem copying data out*/
+		panic("hv_get_vp_registers:copyout problem\n"); 
+	}
+	xfree(vcpu_ctx);
+	return (HV_STATUS_SUCCESS);
+}
+		
+static int
+hv_set_vp_registers(paddr_t input, paddr_t output)
+{
+	hv_vcpu_t        *vcpup, *targetp;
+	hv_partition_t   *curp = hv_get_current_partition();
+	set_vp_registers_input_t	*inbuf;
+	struct vcpu_guest_context	*vcpu_ctx;
+	set_vp_register_spec_t		*reg_indexp;
+	int		ret_val = HV_STATUS_SUCCESS;
+
+        vcpup = &curp->vcpu_state[hv_get_current_vcpu_index()];
+	inbuf = vcpup->input_buffer;
+	/*
+	 * Copy the input data to the per-cpu input buffer.
+	 * This may be an overkill; obviously it is better to only
+	 * copy what we need. XXXKYS: Check with Mike.
+	 */
+	if (hvm_copy_from_guest_phys(inbuf, input, PAGE_SIZE)) {
+		return (HV_STATUS_INVALID_ALIGNMENT);
+	}
+	/*
+	 * If the partition ID specified does not match with the current 
+	 * domain return appropriate error.
+	 */
+	if ((u64)current->domain->domain_id != inbuf-> partition_id) {
+		return (HV_STATUS_ACCESS_DENIED);
+	}
+	if (inbuf->vp_index > MAX_VIRT_CPUS) { 
+		return (HV_STATUS_INVALID_VP_INDEX);
+	}
+	targetp = &curp->vcpu_state[inbuf->vp_index]; 
+	if (!(targetp->flags & HV_VCPU_UP)) {
+		return (HV_STATUS_INVALID_VP_STATE);
+	}
+	if ((vcpu_ctx = 
+	     xmalloc_bytes(sizeof(struct vcpu_guest_context))) 
+		== NULL) {
+		return (HV_STATUS_INSUFFICIENT_MEMORY);
+	}
+	/*
+	 * XXXKYS: Is it sufficient to just pause the target vcpu; on the 
+	 * xen side domain is paused for this call. CHECK.
+	 */
+	if (current->vcpu_id != inbuf->vp_index) {
+		vcpu_pause(targetp->xen_vcpu);
+	}
+
+	arch_get_info_guest(targetp->xen_vcpu, vcpu_ctx);
+	/*
+	 * Now that we have the register state; update the register state
+	 * based on what we are given. 
+	 */
+	reg_indexp = &inbuf->reg_spec;
+	/*
+	 * XXXKYS: Assuming the list is terminated by a reg_name that is 0.
+	 * Check with Mike.
+	 */
+	while (reg_indexp->reg_name != 0) {
+		switch	(reg_indexp->reg_name) {
+			/*
+			 * XXXKYS: need mapping code here; populate
+			 * vcpu_ctx 
+			 */
+			panic("hv_set_vp_registers not supported\n");
+		}
+		reg_indexp++;
+		if ((char *)reg_indexp > ((char *)inbuf + PAGE_SIZE)) {
+			/*
+			 *input list not reminated correctly; bail out.
+			 */
+			panic("hv_set_vp_registers:input list not terminated\n"); 
+			break;
+		}
+	}
+	/*
+	 * Now set register state.
+	 *
+	 * XXXKYS: Is it sufficient to just pause the target vcpu; on the 
+	 * xen side domain is paused for this call. CHECK.
+	 */
+
+	if (arch_set_info_guest(targetp->xen_vcpu, vcpu_ctx)) { 
+		ret_val = HV_STATUS_INVALID_PARAMETER;
+	}
+	if (current->vcpu_id != inbuf->vp_index) {
+		vcpu_unpause(targetp->xen_vcpu);
+	}
+	xfree(vcpu_ctx);
+	return (ret_val);
+}
+
+static int
+hv_switch_va(paddr_t input)
+{
+	hv_partition_t   *curp = hv_get_current_partition();
+        hv_vcpu_t *vcpup = &curp->vcpu_state[hv_get_current_vcpu_index()];
+
+	/*
+	 * XXXKYS: the spec sys the asID is passed via memory at offset 0 of 
+	 * the page whose GPA is in the input register. However, it appears 
+	 * the current build of longhorn (longhorn-2007-02-06-x86_64-fv-02)
+	 * passes the asID in the input register instead. Need to check if 
+	 * future builds do this.
+	 */
+	hvm_set_cr3(input); 
+	HV_STATS_COLLECT(HV_CSWITCH, &vcpup->stats);
+	return (HV_STATUS_SUCCESS);
+}
+
+static int
+hv_flush_va(paddr_t input)
+{
+	hv_partition_t   *curp = hv_get_current_partition();
+	int		i;
+        hv_vcpu_t	*cur_vcpup;
+
+	flush_va_t	*flush_argp;
+	cpumask_t	vcpu_mask;
+	u64		as_id, input_mask, ret_val;
+	int		flush_global = 1;
+
+	cur_vcpup = &curp->vcpu_state[hv_get_current_vcpu_index()];
+	flush_argp = cur_vcpup->input_buffer;
+
+
+	if (hvm_copy_from_guest_phys(flush_argp, input, sizeof(*flush_argp))) {
+		return (HV_STATUS_INVALID_ALIGNMENT);
+	}
+	input_mask = flush_argp->p_mask;
+	as_id = flush_argp->as_handle;
+	cpus_clear(vcpu_mask);
+	/*
+	 * Deal with all trivial error conditions.
+	 */
+	if (flush_argp->flags != 0 && (!(flush_argp->flags & 
+			      (HV_FLUSH_ALL_PROCESSORS | 
+			       HV_FLUSH_ALL_VIRTUAL_ADDRESS_SPACES |
+			       HV_FLUSH_NON_GLOBAL_MAPPINGS_ONLY)))) {
+		return (HV_STATUS_INVALID_PARAMETER);
+	}
+	if (((flush_argp->p_mask) == 0) &&
+	   !(flush_argp->flags & HV_FLUSH_ALL_PROCESSORS)) {
+		return (HV_STATUS_INVALID_PARAMETER);
+	}
+				 
+	if (flush_argp->flags & HV_FLUSH_ALL_PROCESSORS) {
+		for (i=0; i< MAX_VIRT_CPUS; i++) {
+			if (current->domain->vcpu[i] != NULL) {
+				cpu_set(i, vcpu_mask);
+			}
+		}
+	} else {
+		i = 0;
+		while (input_mask) {
+			if (input_mask &0x1) {
+				cpu_set(i, vcpu_mask);
+			}
+			input_mask = (input_mask >> 1);
+			i++;
+		}
+	}
+		
+	if (flush_argp->flags & HV_FLUSH_ALL_VIRTUAL_ADDRESS_SPACES) {
+		as_id = HV_ALL_AS;
+	}
+	if (flush_argp->flags & HV_FLUSH_NON_GLOBAL_MAPPINGS_ONLY) {
+		flush_global = 0;
+	}
+	/*
+	 * Now operate on what we are given
+	 * XXXKYS: For now we are ignoring as_id and fushGlobal flag.
+	 * May have to revisit this. But first stash away the processed 
+	 * parameters for subsequent use.
+	 */
+	flush_argp->as_handle = as_id;
+	flush_argp->flags = flush_global;
+	flush_argp->v_mask = vcpu_mask;
+
+
+	ret_val = hv_build_hcall_retval(HV_STATUS_SUCCESS, 0);
+	hv_set_syscall_retval(guest_cpu_user_regs(),
+                                   curp->long_mode_guest,
+                                   ret_val);
+	HV_STATS_COLLECT(HV_FLUSH_VA_STAT, &cur_vcpup->stats);
+	panic("hv_flush_va not supported\n"); 
+	return (HV_STATUS_SUCCESS);	
+}
+
+static int
+hv_flush_va_range(paddr_t input, unsigned short start_index, 
+unsigned short rep_count, unsigned short *reps_done)
+{
+	hv_vcpu_t        *cur_vcpup;
+	hv_partition_t   *curp = hv_get_current_partition();
+	flush_va_t  	*flush_argp;
+	cpumask_t	vcpu_mask;
+	u64		as_id, input_mask, ret_val;
+	int		flush_global = 1;
+	int		flush_all_proc = 0;
+	int		i;
+
+        cur_vcpup =  &curp->vcpu_state[hv_get_current_vcpu_index()];
+	flush_argp = cur_vcpup->input_buffer;
+	ASSERT(rep_count >=1);
+	ASSERT(((sizeof(*flush_argp)) + 8*(rep_count -1)) <= PAGE_SIZE);
+	if (hvm_copy_from_guest_phys(flush_argp, input, 
+			((sizeof(*flush_argp)) + 8*(rep_count -1)))) {
+		return (HV_STATUS_INVALID_ALIGNMENT);
+	}
+	*reps_done = rep_count;
+	input_mask = flush_argp->p_mask;
+	as_id = flush_argp->as_handle;
+	cpus_clear(vcpu_mask);
+	/*
+	 * Deal with all trivial error conditions.
+	 */
+	if (flush_argp->flags != 0 && (!(flush_argp->flags & 
+			      (HV_FLUSH_ALL_PROCESSORS | 
+			       HV_FLUSH_ALL_VIRTUAL_ADDRESS_SPACES |
+			       HV_FLUSH_NON_GLOBAL_MAPPINGS_ONLY)))) {
+		return (HV_STATUS_INVALID_PARAMETER);
+	}
+	if ((flush_argp->p_mask == 0) &&
+	   !(flush_argp->flags & HV_FLUSH_ALL_PROCESSORS)) {
+		return (HV_STATUS_INVALID_PARAMETER);
+	}
+				 
+	if (flush_argp->flags & HV_FLUSH_ALL_PROCESSORS) {
+		flush_all_proc = 1;
+		for (i=0; i< MAX_VIRT_CPUS; i++) {
+			if (current->domain->vcpu[i] != NULL) {
+				cpu_set(i, vcpu_mask);
+			}
+		}
+	} else {
+		i = 0;
+		/*
+		 * populate the vcpu mask based on the input.
+		 */
+		while (input_mask) {
+			if (input_mask & 0x1) {
+				cpu_set(i, vcpu_mask);
+			}
+			input_mask = (input_mask >> 1);
+			i++;
+		}
+	}
+	if (flush_argp->flags & HV_FLUSH_ALL_VIRTUAL_ADDRESS_SPACES) {
+		as_id = HV_ALL_AS;
+	}
+	if (flush_argp->flags & HV_FLUSH_NON_GLOBAL_MAPPINGS_ONLY) {
+		flush_global = 0;
+	}
+	/*
+	 * Now operate on what we are given
+	 * XXXKYS: For now we are ignoring as_id and fushGlobal flag.
+	 * May have to revisit this.
+	 * May have to revisit this. But first stash away the processed 
+	 * parameters for subsequent use.
+	 */
+	flush_argp->as_handle = as_id;
+	flush_argp->flags = flush_global;
+	flush_argp->v_mask = vcpu_mask;
+	
+	ret_val = hv_build_hcall_retval(HV_STATUS_SUCCESS, rep_count);
+	hv_set_syscall_retval(guest_cpu_user_regs(),
+                                   curp->long_mode_guest,
+                                   ret_val);
+
+
+	HV_STATS_COLLECT(HV_FLUSH_RANGE, &cur_vcpup->stats);
+	panic("hv_flush_vaRange not supported\n"); 
+	return (HV_STATUS_SUCCESS);	
+}
+
+void
+hv_handle_hypercall(u64 opcode, u64 input, u64 output, 
+		  u64 *ret_val)
+{
+	unsigned short	verb;
+	unsigned short	rep_count;
+	unsigned short	reps_done =0;
+	unsigned short	start_index;
+	hv_partition_t   *curp = hv_get_current_partition();
+	u64		partition_id;
+	int		value;
+	
+
+	verb = (short)(opcode & 0xffff);
+	rep_count = (short)((opcode >>32) & 0xfff);
+	start_index = (short)((opcode >> 48) & 0xfff);
+	switch (verb) {
+	case HV_CREATE_PARTITION:
+		/*
+		 * Xen only allows dom0 to create domains.
+		 */	
+		*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
+		return;
+	case HV_INITIALIZE_PARTITION:
+		/*
+		 * We don't support this.
+		 */
+		*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
+		return;
+	case HV_DELETE_PARTITION:
+		/*
+		 * We don't support this.
+		 */
+		*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
+		return;
+	case HV_GET_PARTITION_PROPERTY:
+		/*
+		 * We don't support this.
+		 */
+		*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
+		return;
+	case HV_SET_PARTITION_PROPERTY:
+		/*
+		 * We don't support this.
+		 */
+		*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
+		return;
+	case HV_GET_PARTITION_ID:
+		if (!hv_privilege_check(curp, HV_ACCESS_PARTITION_ID)) {
+			*ret_val = 
+			hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
+			return;
+		}
+		partition_id = (u64)current->domain->domain_id;
+		if (hvm_copy_to_guest_phys(output, 
+			&partition_id, 8)) {
+			/*
+			 * Invalid output area.
+			 */
+			*ret_val = 
+			hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
+			return;
+		}
+		*ret_val = hv_build_hcall_retval(HV_STATUS_SUCCESS, 0);
+		return;
+	case HV_GET_NEXT_CHILD_PARTITION:
+		/*
+		 * We don't support this.
+		 */
+		*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
+		return;
+	case HV_SET_LOGICAL_PROCESSOR_RUN_TIME_GROUP:
+		/*
+		 * We don't support this.
+		 */
+		*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
+		return;
+	case HV_CLEAR_LOGICAL_PROCESSOR_RUN_TIME_GROUP:
+		/*
+		 * We don't support this.
+		 */
+		*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
+		return;
+	case HV_NOTIFY_LOGICAL_PROCESSOR_POWER_STATE:
+		/*
+		 * We don't support this.
+		 */
+		*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
+		return;
+	case HV_GET_LOGICAL_PROCESSOR_RUN_TIME:
+		/*
+		 * We don't support this.
+		 */
+		*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
+		return;
+	case HV_DEPOSIT_MEMORY:
+		/*
+		 * We don't support this.
+		 */
+		*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
+		return;
+	case HV_WITHDRAW_MEMORY:
+		/*
+		 * We don't support this.
+		 */
+		*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
+		return;
+	case HV_GET_MEMORY_BALANCE:	
+		/*
+		 * We don't support this.
+		 */
+		*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
+		return;
+	case HV_MAP_GPA_PAGES:
+		/*
+		 * We don't support this.
+		 */
+		*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
+		return;
+	case HV_UNMAP_GPA_PAGES:	
+		/*
+		 * We don't support this.
+		 */
+		*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
+		return;
+	case HV_INSTALL_INTERCEPT:
+		/*
+		 * We don't support this.
+		 */
+		*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
+		return;
+	case HV_CREATE_VP:
+		/*
+		 * We don't support this.
+		 */
+		*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
+		return;
+	case HV_TERMINATE_VP:
+		/*
+		 * We don't support this.
+		 */
+		*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
+		return;
+	case HV_DELETE_VP:
+		/*
+		 * We don't support this.
+		 */
+		*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
+		return;
+	case HV_GET_NEXT_VP:
+		/*
+		 * We don't support this.
+		 */
+		*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
+		return;
+	case HV_GET_VP_REGISTERS:
+		*ret_val = hv_build_hcall_retval(
+		hv_get_vp_registers(input, output), 0);
+		return;
+	case HV_SET_VP_REGISTERS:
+		*ret_val = hv_build_hcall_retval(
+		hv_set_vp_registers(input, output), 0);
+	case HV_SWITCH_VA:
+		*ret_val = 
+		hv_build_hcall_retval(hv_switch_va(input), 0);
+		return;
+	case HV_FLUSH_VA:
+		*ret_val = 
+		hv_build_hcall_retval(hv_flush_va(input), 0);
+		return;
+	case HV_FLUSH_VA_LIST:
+		value  = hv_flush_va_range(input, start_index, 
+					rep_count, &reps_done);
+		*ret_val = hv_build_hcall_retval(value, reps_done);  
+		return;
+		
+	case HV_TRASLATE_VA:	
+		/*
+		 * We don't support this.
+		 */
+		*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
+		return;
+	case HV_READ_GPA:
+		/*
+		 * We don't support this.
+		 */
+		*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
+		return;
+	case HV_WRITE_GPA:
+		/*
+		 * We don't support this.
+		 */
+		*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
+		return;
+	case HV_ASSERT_VIRTUAL_INTERRUPT:
+		/*
+		 * We don't support this.
+		 */
+		*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
+		return;
+	case HV_CLEAR_VIRTUAL_INTERRUPT:
+		/*
+		 * We don't support this.
+		 */
+		*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
+		return;
+	case HV_CREATE_PORT:
+		/*
+		 * We don't support this.
+		 */
+		*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
+		return;
+	case HV_DELETE_PORT:
+		/*
+		 * We don't support this.
+		 */
+		*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
+		return;
+	case HV_CONNECT_PORT:
+		/*
+		 * We don't support this.
+		 */
+		*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
+		return;
+	case HV_GET_PORT_PROPERTY:
+		/*
+		 * We don't support this.
+		 */
+		*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
+		return;
+	case HV_DISCONNECT_PORT:
+		/*
+		 * We don't support this.
+		 */
+		*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
+		return;
+	case HV_POST_MESSAGE:
+		/*
+		 * We don't support this.
+		 */
+		*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
+		return;
+	case HV_POST_EVENT:
+		/*
+		 * We don't support this.
+		 */
+		*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
+		return;
+	case 0:
+		/*
+		 * 32 bit longhorn invokes hypercall with verb == 0; need to 
+		 * check with Mike (XXXKYS). For now ignore it.
+		 */
+		*ret_val = 
+		hv_build_hcall_retval(HV_STATUS_INVALID_HYPERCALL_CODE, 0);
+		return;
+	default:
+		printk("Unkown hypercall: verb is: %d\n", verb); 
+		*ret_val = 
+		hv_build_hcall_retval(HV_STATUS_INVALID_HYPERCALL_CODE, 0);
+		return;
+	}
+}
Index: xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_hypercall.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_hypercall.h	2008-03-26 13:56:39.000000000 -0400
@@ -0,0 +1,126 @@
+/****************************************************************************
+ |
+ | Copyright (c) [2007, 2008] Novell, Inc.
+ | All Rights Reserved.
+ |
+ | This program is free software; you can redistribute it and/or
+ | modify it under the terms of version 2 of the GNU General Public License as
+ | published by the Free Software Foundation.
+ |
+ | This program is distributed in the hope that it will be useful,
+ | but WITHOUT ANY WARRANTY; without even the implied warranty of
+ | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ | GNU General Public License for more details.
+ |
+ | You should have received a copy of the GNU General Public License
+ | along with this program; if not, contact Novell, Inc.
+ |
+ | To contact Novell about this file by physical or electronic mail,
+ | you may find current contact information at www.novell.com
+ |
+ |***************************************************************************
+*/
+
+/*
+ * nshypercall.h
+ * Memory layouts for the various hypercalls supported. 
+ *
+ * Engineering Contact: K. Y. Srinivasan
+ */
+
+#ifndef HV_HYPERCALL_H
+#define HV_HYPERCALL_H
+
+#include <xen/cpumask.h>
+
+
+typedef struct get_vp_registers_input {
+	u64	partition_id;
+	u64	vp_index;
+	u32	reg_index;
+	u32	pad1;
+	u64	pad2;
+} get_vp_registers_input_t;
+
+typedef struct get_vp_registers_output {
+	u64	low_value;
+	u64	high_value;
+} get_vp_registers_output_t;
+
+typedef struct set_vp_register_spec {
+	u32	reg_name;
+	u32	pad;
+	u64	pad1;
+	u64	low_value;
+	u64	high_value;
+} set_vp_register_spec_t;
+
+typedef struct set_vp_registers_input {
+	u64	partition_id;
+	u64	vp_index;
+	set_vp_register_spec_t	reg_spec;
+} set_vp_registers_input_t;
+
+
+typedef struct flush_va {
+	u64	as_handle;
+	u64	flags;
+	union  {
+		u64		processor_mask;
+		cpumask_t 	vcpu_mask;
+	} proc_mask;
+#define p_mask 	proc_mask.processor_mask
+#define v_mask	proc_mask.vcpu_mask
+	u64	gva;
+} flush_va_t;
+
+#define HV_FLUSH_ALL_PROCESSORS	0x00000001
+#define HV_FLUSH_ALL_VIRTUAL_ADDRESS_SPACES 0x00000002
+#define HV_FLUSH_NON_GLOBAL_MAPPINGS_ONLY 0x00000004
+
+#define HV_ALL_AS	(-1)
+
+/*
+ * Hypercall verbs.
+ */
+
+#define HV_CREATE_PARTITION 	0x0010
+#define HV_INITIALIZE_PARTITION 0x0011
+#define HV_DELETE_PARTITION	0x0014
+#define HV_GET_PARTITION_PROPERTY 0x0017
+#define HV_SET_PARTITION_PROPERTY 0x0018
+#define HV_GET_PARTITION_ID	0x0015
+#define HV_GET_NEXT_CHILD_PARTITION 0x0016
+#define HV_SET_LOGICAL_PROCESSOR_RUN_TIME_GROUP 0x0005
+#define HV_CLEAR_LOGICAL_PROCESSOR_RUN_TIME_GROUP 0x0006
+#define HV_NOTIFY_LOGICAL_PROCESSOR_POWER_STATE	0x0007
+#define HV_GET_LOGICAL_PROCESSOR_RUN_TIME	0x0004
+#define HV_DEPOSIT_MEMORY	0x001C
+#define HV_WITHDRAW_MEMORY	0x001D
+#define HV_GET_MEMORY_BALANCE	0x001E
+#define HV_MAP_GPA_PAGES	0x001A
+#define HV_UNMAP_GPA_PAGES	0x001B
+#define HV_INSTALL_INTERCEPT	0x0019
+#define HV_CREATE_VP		0x001F
+#define HV_TERMINATE_VP		0x0020
+#define HV_DELETE_VP		0x0021
+#define HV_GET_NEXT_VP		0x0027
+#define HV_GET_VP_REGISTERS	0x0022
+#define HV_SET_VP_REGISTERS	0x0023
+#define HV_SWITCH_VA		0x0001
+#define HV_FLUSH_VA		0x0002
+#define HV_FLUSH_VA_LIST	0x0003
+#define HV_TRASLATE_VA		0x0024
+#define HV_READ_GPA		0x0025
+#define HV_WRITE_GPA		0x0026
+#define HV_ASSERT_VIRTUAL_INTERRUPT	0x002A
+#define HV_CLEAR_VIRTUAL_INTERRUPT	0x002C
+#define HV_CREATE_PORT			0x002D
+#define HV_DELETE_PORT			0x002E
+#define HV_CONNECT_PORT			0x002F
+#define HV_GET_PORT_PROPERTY		0x0031
+#define HV_DISCONNECT_PORT		0x0030
+#define HV_POST_MESSAGE			0x0032
+#define HV_POST_EVENT			0x0034
+
+#endif /* HV_HYPERCALL_H */
Index: xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_intercept.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_intercept.c	2008-03-26 14:26:38.000000000 -0400
@@ -0,0 +1,1695 @@
+/****************************************************************************
+ |
+ | Copyright (c) [2007, 2008] Novell, Inc.
+ | All Rights Reserved.
+ |
+ | This program is free software; you can redistribute it and/or
+ | modify it under the terms of version 2 of the GNU General Public License as
+ | published by the Free Software Foundation.
+ |
+ | This program is distributed in the hope that it will be useful,
+ | but WITHOUT ANY WARRANTY; without even the implied warranty of
+ | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ | GNU General Public License for more details.
+ |
+ | You should have received a copy of the GNU General Public License
+ | along with this program; if not, contact Novell, Inc.
+ |
+ | To contact Novell about this file by physical or electronic mail,
+ | you may find current contact information at www.novell.com
+ |
+ |***************************************************************************
+*/
+
+/*
+ * nsintercept.c.
+ * This file implements the intercepts to support the  Hyperv Shim. 
+ *
+ * Engineering Contact: K. Y. Srinivasan
+ */
+
+#include <asm/hvm/hvm_extensions.h>
+
+
+#include <asm/config.h>
+#include <asm/hvm/io.h>
+#include <asm/processor.h>
+#include <asm/page.h>
+#include <asm/apicdef.h>
+#include <asm/regs.h>
+#include <asm/msr.h>
+#include <asm-x86/event.h>
+
+#include <xen/string.h>
+#include <xen/init.h>
+#include <xen/compile.h>
+#include <xen/hvm/save.h>
+#include <public/sched.h>
+
+
+/*
+ * Local includes; extension specific.
+ */
+#include "hv_errno.h"
+#include "hv_shim.h"
+
+
+/*
+ * Implement the Hyperv Shim.
+ */
+
+extern struct cpuinfo_x86 boot_cpu_data;
+extern struct hvm_mmio_handler vlapic_mmio_handler;
+
+static inline void
+hv_inject_exception(int trap);
+
+static inline void
+hv_hypercall_page_initialize(void *hypercall_page,  hv_partition_t *curp);
+
+static inline void
+hv_init_event_page(void *sief_page);
+
+static inline void
+hv_init_message_page(void *sim_page);
+
+static inline void *
+get_virt_from_gmfn(struct domain *d, unsigned long gmfn)
+{
+	unsigned long mfn = gmfn_to_mfn(d, gmfn);
+	if (mfn == INVALID_MFN) {
+		return (NULL);
+	}
+	return (map_domain_page_global(mfn));
+}
+
+static inline void
+inject_interrupt(struct vcpu *v, int vector, int type)
+{
+        struct vlapic *vlapic = vcpu_vlapic(v);
+
+        /*
+         * XXXKYS: Check the trigger mode.
+         */
+        if (vlapic_set_irq(vlapic, vector, 1)) {
+                vcpu_kick(v);
+        }
+}
+
+static inline unsigned long
+get_mfn_from_gva(unsigned long va)
+{
+	uint32_t pfec = PFEC_page_present;
+	unsigned long gfn;
+	gfn = paging_gva_to_gfn(current, va, &pfec);
+	return (gmfn_to_mfn((current->domain), gfn));
+}
+
+static inline void *
+get_virt_from_page_ptr(void *page)
+{
+        struct page_info *pg = page;
+        unsigned long mfn = page_to_mfn(pg);
+        return (map_domain_page_global(mfn));
+}
+
+static inline void 
+hv_write_guestid_msr(hv_partition_t *curp, hv_vcpu_t *cur_vcpu, u64 msr_content)
+{
+	curp->guest_id_msr = msr_content;
+	if (curp->guest_id_msr == 0) {
+		/*
+		 * Guest has cleared the guest ID;
+		 * clear the hypercall page.
+		 */
+		if (curp->hypercall_msr)  {
+			cur_vcpu->flags &= ~HV_VCPU_UP;
+		}
+	}
+}
+
+
+static inline void 
+hv_write_hypercall_msr(hv_partition_t *curp,
+		  hv_vcpu_t	*cur_vcpu,
+		  u64		msr_content)
+{
+	unsigned long gmfn;
+	void	*hypercall_page;
+	struct domain	*d = cur_vcpu->xen_vcpu->domain;
+
+	spin_lock(&curp->lock);
+	gmfn = (msr_content >> 12);
+	if (curp->guest_id_msr == 0) {
+		/* Nothing to do if the guest is not registered*/
+		spin_unlock(&curp->lock);
+		return;
+	}
+	/*
+	 * Guest is registered; see if we can turn-on the 
+	 * hypercall page.
+	 * XXXKYS: Can the guest write the GPA in one call and 
+	 * subsequently enable it? Check. For now assume that all the
+	 * info is specified in one call.
+	 */
+	if (((u32)msr_content & (0x00000001)) == 0) {	
+		/*
+		 * The client is not enabling the hypercall; just
+		 * ignore everything. 
+		 */
+		spin_unlock(&curp->lock);
+		return;
+	}
+	hypercall_page = get_virt_from_gmfn(d,gmfn);
+	if (hypercall_page == NULL) {
+		/*
+		 * The guest specified a bogus GPA; inject a GP fault
+		 * into the guest.
+		 */
+		hv_inject_exception(TRAP_gp_fault);
+		spin_unlock(&curp->lock);
+		return;
+	}
+	hv_hypercall_page_initialize(hypercall_page, curp);
+	curp->hypercall_mfn = gmfn_to_mfn(d, gmfn);
+#ifdef CONFIG_DOMAIN_PAGE
+	unmap_domain_page_global(hypercall_page);
+#endif
+	curp->hypercall_msr = msr_content;
+	spin_unlock(&curp->lock);
+	cur_vcpu->flags |= HV_VCPU_UP;
+}
+
+
+static inline void hv_write_sx_msr(uint32_t idx, hv_partition_t *curp,
+				  hv_vcpu_t	*cur_vcpu,
+				  u64		msr_content)
+{
+	unsigned long gmfn;
+	void            *sx_page;
+	struct domain	*d = cur_vcpu->xen_vcpu->domain;
+	gmfn = (msr_content >> 12);
+	/*
+	 * Can the client enable the siefp and specify 
+	 * the base address in two 
+	 * different calls? XXXKYS: For now assume 
+	 * that it is done in one call.
+	 */
+	if (!((u32)msr_content & (0x00000001))) {	
+		/*
+		 * The client is not enabling the sx page; just
+		 * ignore everything. 
+		 */
+		return;
+	}
+	sx_page = get_virt_from_gmfn(d, gmfn);
+	if (sx_page == NULL) {
+		/*
+		 * The guest specified a bogus GPA; inject a GP fault
+		 * into the guest.
+		 */
+		hv_inject_exception(TRAP_gp_fault);
+		return;
+	}
+	switch (idx) {
+		case HV_MSR_SIEFP:
+			hv_init_event_page(sx_page);
+			cur_vcpu->siefp_msr = msr_content; 
+			cur_vcpu->sief_page = sx_page; 
+			break;
+		case HV_MSR_SIMP:
+			hv_init_message_page(sx_page);
+			cur_vcpu->simp_msr = msr_content;
+			cur_vcpu->sim_page = sx_page;
+			break;
+	}
+
+}
+
+int 
+hyperv_initialize(struct domain *d)
+{
+	int i;
+	printk("Hyperv extensions initialized\n");
+	if (hyperv_dom_create(d)) {
+		printk("Hyperv dom create failed\n");
+		return (1);
+	}
+	for (i=0; i < MAX_VIRT_CPUS; i++) {
+		if (d->vcpu[i] != NULL) {
+			if (hyperv_vcpu_initialize(d->vcpu[i])) {
+				int j;
+				for (j= (i-1); j >=0; j--) {
+					hyperv_vcpu_destroy(d->vcpu[j]);
+				}	
+				hyperv_dom_destroy(d);
+				return (1);
+			}
+		}
+	}
+	return (0);
+}
+
+/*
+ * Time this domain booted.
+ */
+s_time_t hv_domain_boot_time;
+
+
+static inline u64
+hv_get_time_since_boot(void)
+{
+	u64	curTime = get_s_time();
+	return ((curTime - hv_domain_boot_time)/100) ;
+}
+
+static inline int
+hv_call_from_bios(struct cpu_user_regs *regs)
+{
+	if (hvm_paging_enabled(current)) {
+		return (0);
+	} else {
+		return (1);
+	}
+}
+
+ 
+static inline void
+hv_inject_exception(int trap)
+{
+	hvm_funcs.inject_exception(trap, 0, 0);
+}
+
+
+static inline int
+hv_os_registered(void)
+{
+	hv_partition_t	*curp = hv_get_current_partition();
+	return (curp->guest_id_msr != 0?1:0);
+}
+
+
+
+static inline void 
+hv_set_partition_privileges(hv_partition_t *hvpp)
+{
+	/*
+	 * This is based on the hypervisor spec under section 5.2.3. 
+	 */
+	hvpp->privileges = 0x000000020000007f;
+}
+
+static inline u32
+hv_get_recommendations(void)
+{
+	/*
+	 *For now we recommend all the features. Need to validate.
+	 */
+	if ( paging_mode_hap(current->domain)) {
+		/*
+		 * If HAP is enabled; the guest should not use TLB flush
+		 * related enlightenments.
+		 */
+		return (0x19);
+	} else {
+		/*
+		 * For now disable TLB flush enlightenments.
+		 */
+		return (0x19); 
+	}
+}
+
+
+static inline void 
+hv_set_partition_features(hv_partition_t *hvpp)
+{
+	hvpp->supported_features = 0x1f;
+}
+
+static inline u16 
+hv_get_guest_major(void)
+{
+	//KYS: Check!
+	return (0);
+}
+static inline u16
+hv_get_guest_minor(void)
+{
+	//KYS: Check!
+	return (0);
+}
+static inline u32
+hv_get_guest_service_pack(void)
+{
+	//KYS: Check!
+	return (0);
+}
+ 
+static inline u8 
+hv_get_guest_service_branch_info(void)
+{
+	//KYS: Check!
+	return (0);
+}
+static inline u32 
+hv_get_guest_service_number(void)
+{
+	//KYS: Check!
+	return (0);
+}
+
+static inline u32
+hv_get_supported_synthetic_msrs(void)
+{
+	/*
+	 * All MSRS in the spec version 0.83 including RESET MSR. 
+	 */
+	return (0xff);
+}
+
+
+static inline u32
+hv_get_max_vcpus_supported(void)
+{
+	return MAX_VIRT_CPUS;
+}
+
+static inline u32
+hv_get_max_lcpus_supported(void)
+{
+	return NR_CPUS;
+}
+
+
+static inline void
+hv_read_icr(u64 *icr_content)
+{
+	u32	icr_low, icr_high;
+	u64	ret_val;
+
+
+	icr_low = vlapic_mmio_handler.read_handler(current, 
+		 (vlapic_base_address(vcpu_vlapic(current)) + 0x300), 4);
+	icr_high = vlapic_mmio_handler.read_handler(current, 
+		 (vlapic_base_address(vcpu_vlapic(current)) + 0x310), 4);
+	ret_val = icr_high;
+	*icr_content = ((ret_val << 32) | icr_low);
+
+}
+
+static inline void
+hv_read_tpr(u64 *tpr_content)
+{
+	u32	tpr_low;
+
+
+	tpr_low = vlapic_mmio_handler.read_handler(current, 
+		 (vlapic_base_address(vcpu_vlapic(current)) + 0x80), 4);
+	*tpr_content = (u64)tpr_low;
+
+}
+
+static inline void
+hv_write_eoi(u64 msr_content)
+{
+	u32 eoi = (u32)msr_content;
+
+	vlapic_mmio_handler.write_handler(current, 
+		 (vlapic_base_address(vcpu_vlapic(current)) + 0xb0), 4, eoi);
+
+}
+
+static inline void
+hv_write_icr(u64 msr_content)
+{
+	u32	icr_low, icr_high;
+	icr_low = (u32)msr_content;
+	icr_high = (u32)(msr_content >> 32);
+
+	if (icr_high != 0) {
+		vlapic_mmio_handler.write_handler(current, 
+		 (vlapic_base_address(vcpu_vlapic(current)) + 0x310), 4, 
+		icr_high);
+	}
+	if (icr_low != 0) {
+		vlapic_mmio_handler.write_handler(current, 
+		 (vlapic_base_address(vcpu_vlapic(current)) + 0x300), 4, 
+		icr_low);
+	}
+
+}
+
+static inline void
+hv_write_tpr(u64 msr_content)
+{
+	u32 tpr = (u32)msr_content;
+
+
+	vlapic_mmio_handler.write_handler(current, 
+		 (vlapic_base_address(vcpu_vlapic(current)) + 0x80), 4, tpr);
+
+}
+
+static inline void
+hv_hypercall_page_initialize(void *hypercall_page, hv_partition_t *curp)
+{
+	char *p;
+
+	if (hvm_funcs.guest_x86_mode(current) == 8) {
+		curp->long_mode_guest = 1;
+	} else {
+		curp->long_mode_guest = 0;
+	}
+
+	memset(hypercall_page, 0, PAGE_SIZE);
+	p = (char *)(hypercall_page) ;
+	*(u8  *)(p + 0) = 0x0f; /* vmcall */
+       	*(u8  *)(p + 1) = 0x01;
+	if (boot_cpu_data.x86_vendor == 0) {
+       		*(u8  *)(p + 2) = 0xc1;
+	} else { 
+       		*(u8  *)(p + 2) = 0xd9;
+	}
+       	*(u8  *)(p + 3) = 0xc3; /* ret */
+}
+
+static inline void
+hv_init_event_page(void *sief_page)
+{
+	memset(sief_page, 0, PAGE_SIZE);
+}
+
+static inline void
+hv_init_message_page(void *sim_page)
+{
+	memset(sim_page, 0, PAGE_SIZE);
+}
+
+
+static inline void
+hv_process_message_q(hv_partition_t *curp, hv_vcpu_t *cur_vcpu)
+{
+	/*
+	 * XXXKYS: we currently do not support queued messages.
+	 */
+}
+
+static inline void
+hv_schedule_time_out(hv_vcpu_timer_state_t *timer) 
+{
+	/*
+	 * We maintain the count in the units of 100ns. Furthermore,
+	 * this is not relative to NOW() but rather absolute.
+	 */
+	set_timer(&timer->vcpu_timer, (timer->count * 100));
+}
+
+static void
+hv_timeout_handler(void *arg)
+{
+	hv_vcpu_timer_state_t	*timer_data = arg;
+	hv_vcpu_t	*cur_vcpu = timer_data->this_cpu;
+	int		int_num;
+	int		vector;
+	if (!(cur_vcpu->control_msr & 0x9)) {
+		goto timeout_postprocess;
+	}
+	/*
+	 * SynIC is enabled; do further processing. Timeouts are posted as
+	 * messages; verify if the message page is enabled.
+	 */
+	if (!(cur_vcpu->simp_msr & 0x1)) {
+		goto timeout_postprocess;
+	}
+	int_num = (((u32)(timer_data->config >> 16)) & 0x0000000f);
+	/*
+	 * First post the message and then optionally deal with the 
+	 * interrupt notification.
+	 */
+	if (cur_vcpu->sim_page == NULL) {
+		panic("Novell Shim: Sim page not setup\n");
+	}
+	if ((((hv_message_t *)cur_vcpu->sim_page)[int_num]).type !=
+		TYPE_NONE) {
+		/*
+		 * The message slot is not empty just silently return.
+		 */
+		goto timeout_postprocess;
+	}
+	/*
+	 * The slot is available; post the message.
+	 */
+	(((hv_timer_message_t *)cur_vcpu->sim_page)[int_num]).type = 
+	TYPE_TIMER_EXPIRED;
+	(((hv_timer_message_t *)cur_vcpu->sim_page)[int_num]).size = 
+	sizeof(hv_timer_message_t);
+	(((hv_timer_message_t *)cur_vcpu->sim_page)[int_num]).timer_index = 
+	timer_data->timer_index;
+	(((hv_timer_message_t *)cur_vcpu->sim_page)[int_num]).expiration_time = 
+	timer_data->count;
+	if ((cur_vcpu->int_msr[int_num] >> 16) &0x1) {
+		/*
+		 * The designated sintx register is masked; just return.
+		 */
+		goto timeout_postprocess;
+	}
+	vector = ((u32)cur_vcpu->int_msr[int_num] &0xff);
+
+	/*
+	 * Now post the interrupt to the VCPU.
+	 * XXXKYS: What is the delivery mode for interrupts delivered here.
+	 * Check with Mike?
+	 */
+	inject_interrupt(current, vector, APIC_DM_FIXED);
+	
+	/*
+	 * If auto eoi is set; deal with that.
+	 */
+	if (((u32)(cur_vcpu->int_msr[int_num] >> 16)) & 0x1) {
+		hv_write_eoi(0);
+	}
+
+timeout_postprocess:
+	/*
+	 * Prior to returning, deal with all the post timeout issues.
+	 */
+	if (((u32)(timer_data->config))  & 0x00000002) {
+		HV_STATS_COLLECT(HV_TIMEOUTS, &cur_vcpu->stats);
+		hv_schedule_time_out(timer_data);
+	}
+}
+
+static inline void
+hv_timer_init(hv_vcpu_t *vcpup, int timer)
+{
+	vcpup->timers[timer].config = 0;
+	vcpup->timers[timer].count = 0;
+	vcpup->timers[timer].this_cpu = vcpup;
+	vcpup->timers[timer].timer_index = timer;
+	init_timer(&vcpup->timers[timer].vcpu_timer, hv_timeout_handler, 
+		&vcpup->timers[timer], current->processor);
+}
+
+static inline int
+hv_access_time_refcnt(hv_partition_t *curp, u64 *msr_content)
+{
+	if (!hv_privilege_check(curp, HV_ACCESS_TIME_REF_CNT)) {
+		/*
+		 * The partition does not have the privilege to
+		 * read this; return error.
+		 */
+		return (0);
+	}
+	*msr_content = hv_get_time_since_boot();
+	return (1);
+}
+
+void
+hyperv_do_migrate_timers(struct vcpu *v)
+{
+	hv_partition_t	*curp = hv_get_current_partition();
+	hv_vcpu_t        *vcpup;
+	int i;
+	vcpup  =  &curp->vcpu_state[hv_get_current_vcpu_index()];
+
+	for (i=0; i<4; i++) {
+		migrate_timer(&vcpup->timers[i].vcpu_timer, v->processor);
+	}
+}
+		
+
+void
+hyperv_vcpu_up(struct vcpu *v)
+{
+	hv_partition_t	*curp = hv_get_current_partition();
+	hv_vcpu_t        *vcpup;
+	vcpup  =  &curp->vcpu_state[v->vcpu_id];
+	vcpup->flags |= HV_VCPU_UP;
+}
+		
+int
+hyperv_do_hypercall(struct cpu_user_regs *pregs)
+{
+	hv_partition_t	*curp = hv_get_current_partition();
+	hv_vcpu_t        *vcpup;
+	int	long_mode_guest = curp->long_mode_guest;
+	unsigned long hypercall_mfn;
+	unsigned long gmfn;
+	gmfn = (curp->hypercall_msr >> 12);
+
+	hypercall_mfn = get_mfn_from_gva(pregs->eip);
+
+	if (hypercall_mfn == curp->hypercall_mfn) {
+		u64	opcode, input, output, ret_val;
+		vcpup  =  &curp->vcpu_state[hv_get_current_vcpu_index()];
+
+		/* 
+		 * This is an extension hypercall; process it; but first make
+		 * sure that the CPU is in the right state for invoking
+		 * the hypercall - protected mode at CPL 0.
+		 */
+		if (hv_invalid_cpu_state())  {
+			hv_inject_exception(TRAP_gp_fault);
+                	ret_val = hv_build_hcall_retval(
+					HV_STATUS_INVALID_VP_STATE, 0);
+			hv_set_syscall_retval(pregs, long_mode_guest, ret_val);
+			return (1);
+		}
+		if (long_mode_guest) {
+			opcode = pregs->ecx;
+			input = pregs->edx;
+			output = pregs->r8;
+		} else {
+			opcode = 
+			((((u64)pregs->edx) << 32) | ((u64)pregs->eax));
+			input = 
+			((((u64)pregs->ebx) << 32) | ((u64)pregs->ecx));
+			output = 
+			((((u64)pregs->edi) << 32) | ((u64)pregs->esi));
+		}
+		ASSERT(vcpup->nsVcplockDepth == 0);
+		hv_handle_hypercall(opcode, input, output, &ret_val); 
+		hv_set_syscall_retval(pregs, long_mode_guest, ret_val);
+		return (1);
+	}
+	/*
+	 * This hypercall page is not the page for the Veridian extension.
+	 */
+	return (0);
+}
+
+	 
+int 
+hyperv_dom_create(struct domain *d)
+{
+	hv_partition_t	*hvpp;
+	hvpp = xmalloc_bytes(sizeof(hv_partition_t));
+	if (hvpp == NULL) {
+		printk("Hyprv Dom Create: Memory allocation failed\n");
+		return (1);
+	}
+	memset(hvpp, 0, sizeof(*hvpp));
+	spin_lock_init(&hvpp->lock);
+	/*
+	 * Set the partition wide privilege; We can start with no privileges 
+	 * and progressively turn on fancier hypervisor features.
+	 */
+	hv_set_partition_privileges(hvpp);
+	hv_set_partition_features(hvpp);
+	/*
+	 * Stash away pointer to our state in the hvm domain structure.
+	 */
+	d->arch.hvm_domain.hyperv_handle = hvpp;
+	hv_domain_boot_time = get_s_time();
+	return (0);
+}
+
+void 
+hyperv_dom_destroy(struct domain *d)
+{
+	int i;
+	hv_partition_t *curp = d->arch.hvm_domain.hyperv_handle;
+	printk("Hyper-V Domain Being Destroyed\n");
+	ASSERT(curp != NULL);
+#ifdef HV_STATS
+	printk("DUMP STATS\n");
+	printk("GFS cpucount is %d\n", curp->flush_state.count);
+	if (curp->flush_state.owner != NULL) {
+		printk("GFS owner  is %d\n", curp->flush_state.owner->vcpu_id);
+	} else {
+		printk("GFS is free\n");
+	}
+	if (!cpus_empty(curp->flush_state.waiters)) {
+		printk("GFS: waiters not empty\n");
+	} else {
+		printk("GFS: waiters  empty\n");
+	}
+	for (i=0; i < MAX_VIRT_CPUS; i++) {
+		if (d->vcpu[i] != NULL) {
+			hv_print_stats(curp, i);
+		}
+	}
+#endif //HV_STATS
+	
+	xfree(d->arch.hvm_domain.hyperv_handle);	
+	d->arch.hvm_domain.hyperv_handle = NULL;
+}
+
+int
+hyperv_vcpu_initialize(struct vcpu *v)
+{
+	hv_vcpu_t	*vcpup;
+	hv_partition_t	*curp = v->domain->arch.hvm_domain.hyperv_handle;
+	int		i;
+	vcpup = &curp->vcpu_state[v->vcpu_id];
+	atomic_inc(&curp->vcpus_active);
+	if (v->vcpu_id == 0) {
+		vcpup->flags |= HV_VCPU_BOOT_CPU;
+	}
+	/*
+	 * Initialize all the synthetic MSRs corresponding to this VCPU. 
+	 * Note that all state is set to 0 to begin 
+	 * with.
+	 */
+	vcpup->version_msr = 0x00000001;
+	/*
+	 * Initialize the synthetic timet structures.
+	 */
+	for (i=0; i < 4; i++) {
+		hv_timer_init(vcpup, i);
+	}
+	/*
+	 * Setup the input page for handling hypercalls.
+	 *
+	 */
+	vcpup->input_buffer_page = 	
+	alloc_domheap_page(NULL);	
+	if (vcpup->input_buffer_page == NULL) {
+		printk("Hyperv vcpu init: Memory allocation failed\n");
+		return (1);
+	}
+	vcpup->input_buffer =
+	get_virt_from_page_ptr(vcpup->input_buffer_page);	
+	if (vcpup->input_buffer == NULL) {
+		printk("Hyperv vcpu init: Coud not get VA\n");
+		free_domheap_pages(vcpup->input_buffer_page, 0);	
+		return (1);
+	}
+	memset(vcpup->input_buffer, 0, PAGE_SIZE); 
+	vcpup->output_buffer_page = 	
+	alloc_domheap_page(NULL);	
+		
+	if (vcpup->output_buffer_page == NULL) {
+		printk("Hyperv vcpu init: Memory allocation failed\n");
+#ifdef CONFIG_DOMAIN_PAGE
+		unmap_domain_page_global(vcpup->input_buffer);
+#endif
+		free_domheap_pages(vcpup->input_buffer_page, 0);	
+		return (1);
+	}
+	vcpup->output_buffer =
+	get_virt_from_page_ptr(vcpup->output_buffer_page);	
+	if (vcpup->output_buffer == NULL) {
+		printk("Hyperv vcpu init: Coud not get VA\n");
+		free_domheap_pages(vcpup->output_buffer_page, 0);	
+#ifdef CONFIG_DOMAIN_PAGE
+		unmap_domain_page_global(vcpup->input_buffer);
+#endif
+		free_domheap_pages(vcpup->input_buffer_page, 0);	
+		return (1);
+	}
+	vcpup->xen_vcpu = v; 
+
+	return (0);
+}
+
+void 
+hyperv_vcpu_destroy(struct vcpu *v)
+{
+	hv_vcpu_t	*vcpup;
+	hv_partition_t	*curp = v->domain->arch.hvm_domain.hyperv_handle;
+	int 		i;
+
+	vcpup = &curp->vcpu_state[v->vcpu_id];
+	atomic_dec(&curp->vcpus_active);
+	vcpup->flags &= ~HV_VCPU_UP;
+	/*
+	 * Get rid of the pages we have allocated for this VCPU.
+	 */
+#ifdef CONFIG_DOMAIN_PAGE
+	unmap_domain_page_global(vcpup->sief_page);
+	unmap_domain_page_global(vcpup->sim_page);
+	unmap_domain_page_global(vcpup->input_buffer);
+	unmap_domain_page_global(vcpup->output_buffer);
+#endif
+
+	free_domheap_pages(vcpup->input_buffer_page, 0);	
+	free_domheap_pages(vcpup->output_buffer_page, 0);	
+	/*
+	 * Kill the timers 
+	 */
+	for (i=0; i < 4; i++) {
+		kill_timer(&vcpup->timers[i].vcpu_timer);
+	}
+	return;
+}
+
+static int 
+hyperv_vcpu_save(struct domain *d, hvm_domain_context_t *h)
+{
+	struct vcpu *v;
+	struct hvm_hyperv_cpu ctxt;
+
+	hv_vcpu_t	*vcpup;
+	hv_partition_t	*curp = d->arch.hvm_domain.hyperv_handle;
+	int i;
+
+	if (curp == NULL) {
+		return 0;
+	}
+	for_each_vcpu(d, v) {
+		vcpup = &curp->vcpu_state[v->vcpu_id];
+	
+        	/* 
+	 	 * We don't need to save state for a 
+		 * vcpu that is down; the restore
+         	 * code will leave it down if there is nothing saved. 
+	 	 */
+        	if ( test_bit(_VPF_down, &v->pause_flags) )
+            		continue;
+		ctxt.control_msr = vcpup->control_msr;
+		ctxt.version_msr = vcpup->version_msr;
+		ctxt.sief_msr = vcpup->siefp_msr;
+		ctxt.simp_msr = vcpup->simp_msr;
+		ctxt.eom_msr = vcpup->eom_msr;
+		for (i=0; i < 16; i++)
+			ctxt.int_msr[i] = vcpup->int_msr[i];
+		for (i=0; i < 4; i++) {
+			ctxt.timers[i].config = vcpup->timers[i].config;
+			/*
+			 * Save the count in units of 100ns relative to NOW()
+			 * When we restore we will add NOW() to properly
+			 * account for the elapsed time when the timer was
+			 * active.
+			 */ 
+			if (vcpup->timers[i].count > ((NOW())/100)) {
+				ctxt.timers[i].count = 
+				(vcpup->timers[i].count - ((NOW())/100));
+			} else {
+				ctxt.timers[i].count = 0;
+			} 
+		}
+		if ( hvm_save_entry(HYPERV_CPU, 
+			v->vcpu_id, h, &ctxt) != 0 )
+			return 1;
+	}
+
+	return 0;
+}
+
+static int 
+hyperv_vcpu_restore(struct domain *d, hvm_domain_context_t *h)
+{
+	int vcpuid, i;
+	struct hvm_hyperv_cpu ctxt;
+
+	hv_vcpu_t	*vcpup;
+	hv_partition_t	*curp = d->arch.hvm_domain.hyperv_handle;
+
+	if (curp == NULL) {
+		return 0;
+	}
+	/* Which vcpu is this? */
+	vcpuid = hvm_load_instance(h);
+	vcpup = &curp->vcpu_state[vcpuid];
+	ASSERT(vcpup != NULL);
+	if ( hvm_load_entry(HYPERV_CPU, h, &ctxt) != 0 )
+        	return -22;
+
+	vcpup->control_msr = ctxt.control_msr;
+	vcpup->version_msr = ctxt.version_msr;
+
+	hv_write_sx_msr(HV_MSR_SIEFP, curp, vcpup, ctxt.sief_msr); 
+	hv_write_sx_msr(HV_MSR_SIMP, curp, vcpup, ctxt.simp_msr); 
+
+	vcpup->eom_msr = ctxt.eom_msr;
+	for (i=0; i<16; i++)
+		vcpup->int_msr[i] = ctxt.int_msr[i];
+	for (i=0; i < 4; i++) {
+		vcpup->timers[i].config = ctxt.timers[i].config;
+		vcpup->timers[i].count = 
+		(ctxt.timers[i].count + ((NOW())/100)); 
+		if ((vcpup->timers[i].config | 0x9)) {
+			/*
+			 * XXXKYS: Some issues with regards to time
+			 * management here:
+			 * 1) We will ignore the elapsed wall clock time
+			 *    when the domain was not running.
+			 * 2) Clearly we should account fot the time that 
+			 *    has elapsed when the domain was running with 
+			 *    respect to the timeouts that were scheduled
+			 *    prior to saving the domain.
+			 * We will deal with on the save side.
+			 */ 
+			hv_schedule_time_out(&vcpup->timers[i]); 
+			HV_STATS_COLLECT(HV_TIMEOUTS, &vcpup->stats);
+		}
+	}
+
+	vcpup->flags |=  HV_VCPU_UP;
+	return 0;
+}
+
+
+
+
+static int 
+hyperv_dom_save(struct domain *d, hvm_domain_context_t *h)
+{
+	struct hvm_hyperv_dom ctxt;
+	hv_partition_t	*curp = d->arch.hvm_domain.hyperv_handle;
+
+	if (curp == NULL) {
+		return 0;
+	}
+
+	ctxt.guestid_msr = curp->guest_id_msr;
+	ctxt.hypercall_msr = curp->hypercall_msr;
+	ctxt.long_mode = curp->long_mode_guest;
+	ctxt.ext_id = d->arch.hvm_domain.params[HVM_PARAM_EXTEND_HYPERVISOR];
+	return (hvm_save_entry(HYPERV_DOM, 0, h, &ctxt)); 
+}
+
+static int 
+hyperv_dom_restore(struct domain *d, hvm_domain_context_t *h)
+{
+	struct hvm_hyperv_dom ctxt;
+	hv_partition_t	*curp;
+
+	if ( hvm_load_entry(HYPERV_DOM, h, &ctxt) != 0 )
+        	return -22;
+	d->arch.hvm_domain.params[HVM_PARAM_EXTEND_HYPERVISOR] = ctxt.ext_id;
+	if ((ctxt.ext_id == 0) || (ctxt.ext_id > 1)) {
+		return 0;
+	}
+	if (hyperv_initialize(d)) {
+		return -22;
+	}
+	curp = d->arch.hvm_domain.hyperv_handle;
+
+	curp->guest_id_msr = ctxt.guestid_msr;
+	curp->hypercall_msr = ctxt.hypercall_msr;
+	curp->long_mode_guest = ctxt.long_mode;
+	curp->hypercall_mfn =
+	gmfn_to_mfn(d, (ctxt.hypercall_msr >> 12));
+	
+	return 0; 
+}
+
+HVM_REGISTER_SAVE_RESTORE(HYPERV_DOM, hyperv_dom_save, hyperv_dom_restore,
+                          1, HVMSR_PER_DOM);
+	
+
+HVM_REGISTER_SAVE_RESTORE(HYPERV_CPU,hyperv_vcpu_save,hyperv_vcpu_restore,
+                          1, HVMSR_PER_VCPU);
+
+
+static int
+hv_preprocess_cpuid_leaves(unsigned int input, struct cpu_user_regs *regs)
+{
+	uint32_t idx;
+	struct domain	*d = current->domain;
+	int	extid = d->arch.hvm_domain.params[HVM_PARAM_EXTEND_HYPERVISOR];
+
+	if (extid == 1) {
+		/*
+		 * Enlightened Windows guest; need to remap and handle 
+		 * leaves used by PV front-end drivers.
+		 */
+		if ((input >= 0x40000000) && (input <= 0x40000005)) {
+			return (0);
+		}
+		/*
+	 	 * PV drivers use cpuid to query the hypervisor for details. On
+	 	 * Windows we will use the following leaves for this:
+		 *
+		 * 4096: VMM Sinature (corresponds to 0x40000000 on Linux)
+		 * 4097: VMM Version (corresponds to 0x40000001 on Linux)
+		 * 4098: Hypercall details (corresponds to 0x40000002 on Linux)
+		 */
+		if ((input >= 0x40001000) && (input <= 0x40001002)) {
+			idx = (input - 0x40001000);
+			switch (idx) {
+			case 0:
+				regs->eax = 0x40000002; /* Largest leaf  */
+				regs->ebx = 0x566e6558;/*Signature 1: "XenV" */
+				regs->ecx = 0x65584d4d; /*Signature 2: "MMXe" */
+				regs->edx = 0x4d4d566e; /*Signature 3: "nVMM"*/
+				break;
+			case 1:
+				regs->eax = 
+				(XEN_VERSION << 16) | 
+				XEN_SUBVERSION;
+				regs->ebx = 0;          /* Reserved */
+				regs->ecx = 0;          /* Reserved */
+				regs->edx = 0;          /* Reserved */
+				break;
+
+			case 2:
+				regs->eax = 1; /*Number of hypercall-transfer pages*/
+				/*In linux this is 0x40000000 */
+				regs->ebx = 0x40001000; /* MSR base address */
+				regs->ecx = 0;          /* Features 1 */
+				regs->edx = 0;          /* Features 2 */
+				break;
+			}
+		}
+		return (1);
+	} else {
+		/*
+		 * For now this is all other "enlightened guests"
+		 */
+		if ((input >= 0x40000000) && (input <= 0x40000002)) {
+			/*
+			 * These leaves have already been correctly
+			 * processed; just return.
+			 */
+			return (1);
+		} 
+		return (0);
+	}
+}
+
+int 
+hyperv_do_cpu_id(unsigned int input, struct cpu_user_regs *regs)
+{
+	uint32_t idx;
+
+	/*
+	 * hvmloader uses cpuid to set up a hypercall page; we don't want to
+	 * intercept calls coming from the bootstrap (bios) code in the HVM 
+	 * guest; we discriminate based on the instruction pointer.
+	 */
+	if (hv_call_from_bios(regs)) {
+		/*
+		 * We don't intercept this.
+		 */
+		return (0);
+	}
+		
+	if (input == 0x00000001) { 
+		regs->ecx = (regs->ecx | 0x80000000);
+		return (1);
+	} 
+
+	if (hv_preprocess_cpuid_leaves(input, regs)) {
+		return (0);
+	}
+	idx = (input - 0x40000000);
+
+	switch (idx) {
+	case 0:
+		/*
+		 * 0x40000000: Hypervisor identification. 
+		 */
+		regs->eax = 0x40000005; /* For now clamp this */
+		regs->ebx = 0x65766f4e; /* "Nove" */ 
+		regs->ecx = 0x68536c6c; /* "llSh" */
+		regs->edx = 0x76486d69; /* "imHv" */ 
+		break;
+
+	case 1:
+		/*
+		 * 0x40000001: Hypervisor identification. 
+		 */
+		regs->eax = 0x31237648; /* "Hv#1*/
+		regs->ebx = 0; /* Reserved */ 
+		regs->ecx = 0; /* Reserved */
+		regs->edx = 0; /* Reserved */ 
+		break;
+	case 2:
+		/*
+		 * 0x40000002: Guest Info 
+		 */
+		if (hv_os_registered()) {
+			regs->eax = hv_get_guest_major();
+			regs->ebx = 
+			(hv_get_guest_major() << 16) | hv_get_guest_minor();
+			regs->ecx = hv_get_guest_service_pack();
+			regs->edx = 
+			(hv_get_guest_service_branch_info() << 24) |
+			hv_get_guest_service_number();
+		} else {
+			regs->eax = 0;
+			regs->ebx = 0;
+			regs->ecx = 0;
+			regs->edx = 0;
+		}
+		break;
+	case 3:
+		/*
+		 * 0x40000003: Feature identification.
+		 */
+		regs->eax = hv_get_supported_synthetic_msrs();
+		/* We only support AcessSelfPartitionId bit 1 */
+		regs->ebx = 0x2; 
+		regs->ecx = 0; /* Reserved */
+		regs->edx = 0; /*No MWAIT (bit 0), No debugging (bit 1)*/
+		break;
+	case 4:
+		/*
+		 * 0x40000004: Imlementation recommendations.
+		 */
+		regs->eax = hv_get_recommendations();
+		regs->ebx = 0; /* Reserved */
+		regs->ecx = 0; /* Reserved */
+		regs->edx = 0; /* Reserved */
+		break;
+	case 5:
+		/*
+		 * 0x40000005: Implementation limits.
+		 * Currently we retrieve maximum number of vcpus and 
+		 * logical processors (hardware threads) supported.
+		 */
+		regs->eax = hv_get_max_vcpus_supported();
+		regs->ebx = hv_get_max_lcpus_supported();
+		regs->ecx = 0; /* Reserved */
+		regs->edx = 0; /* Reserved */
+		break;
+
+	default:
+		/*
+		 * We don't handle this leaf.
+		 */
+		return (0);
+
+	}
+	return (1);
+}
+
+int
+hyperv_do_rd_msr(uint32_t idx, struct cpu_user_regs *regs)
+{
+	hv_partition_t	*curp = hv_get_current_partition();
+	unsigned int	vcp_index = hv_get_current_vcpu_index();
+	u64 msr_content = 0;
+	hv_vcpu_t	*cur_vcpu = &curp->vcpu_state[vcp_index];
+	int		syn_int, timer;
+	struct domain	*d = current->domain;
+	int	extid = d->arch.hvm_domain.params[HVM_PARAM_EXTEND_HYPERVISOR];
+	u64		timer_count;
+
+	/*
+	 * hvmloader uses rdmsr; we don't want to
+	 * intercept calls coming from the bootstrap (bios) code in the HVM 
+	 * guest; we descriminate based on the instruction pointer.
+	 */
+	if (hv_call_from_bios(regs)) {
+		/*
+		 * We don't intercept this.
+		 */
+		return (0);
+	}
+	if (extid > 1) {
+		/*
+		 * For now this is all other "Enlightened" operating systems
+		 * other than Longhorn.
+		 */
+		if (idx == 0x40000000) {
+			/*
+			 * PV driver hypercall setup. Let xen handle this.
+			 */
+			return (0);
+		}
+		if (idx == 0x40001000) {
+			idx = 0x40000000;
+		}
+	}
+	switch (idx) {
+	case HV_MSR_GUEST_OS_ID: 
+		spin_lock(&curp->lock);
+		regs->eax = (u32)(curp->guest_id_msr & 0xFFFFFFFF);
+		regs->edx = (u32)(curp->guest_id_msr >> 32);
+		spin_unlock(&curp->lock);
+		break;
+	case HV_MSR_HYPERCALL:
+		spin_lock(&curp->lock);
+		regs->eax = (u32)(curp->hypercall_msr & 0xFFFFFFFF);
+		regs->edx = (u32)(curp->hypercall_msr >> 32);
+		spin_unlock(&curp->lock);
+		if ((((u32)curp->hypercall_msr) & (0x00000001)) != 0) {
+			cur_vcpu->flags |= HV_VCPU_UP;
+		}
+		break;
+	case HV_MSR_VP_INDEX:
+		regs->eax = (u32)(vcp_index);
+		regs->edx = (u32)(0x0);
+		break;
+	case HV_MSR_ICR:
+		if (!hv_privilege_check(curp, HV_ACCESS_APIC_MSRS)) {
+			goto msr_read_error;
+		}
+		hv_read_icr(&msr_content);
+		HV_STATS_COLLECT(HV_ICR_READ, &cur_vcpu->stats);
+		regs->eax = (u32)(msr_content & 0xFFFFFFFF);
+		regs->edx = (u32)(msr_content >> 32);
+		break;
+	case HV_MSR_TPR:
+		if (!hv_privilege_check(curp, HV_ACCESS_APIC_MSRS)) {
+			goto msr_read_error;
+		}
+		hv_read_tpr(&msr_content);
+		HV_STATS_COLLECT(HV_TPR_READ, &cur_vcpu->stats);
+		regs->eax = (u32)(msr_content & 0xFFFFFFFF);
+		regs->edx = (u32)(msr_content >> 32);
+		break;
+	/*
+	 * The following synthetic MSRs are implemented in the Novell Shim.
+	 */
+	case HV_MSR_SCONTROL:
+		if (!hv_privilege_check(curp, HV_ACCESS_SYNC_MSRS)) {
+			goto msr_read_error;
+		}
+		regs->eax = (u32)(cur_vcpu->control_msr & 0xFFFFFFFF);
+		regs->edx = (u32)(cur_vcpu->control_msr >> 32);
+		break;
+	case HV_MSR_SVERSION:
+		if (!hv_privilege_check(curp, HV_ACCESS_SYNC_MSRS)) {
+			goto msr_read_error;
+		}
+		regs->eax = (u32)(cur_vcpu->version_msr & 0xFFFFFFFF);
+		regs->edx = (u32)(cur_vcpu->version_msr >> 32);
+		break;
+	case HV_MSR_SIEFP:
+		if (!hv_privilege_check(curp, HV_ACCESS_SYNC_MSRS)) {
+			goto msr_read_error;
+		}
+		regs->eax = (u32)(cur_vcpu->siefp_msr & 0xFFFFFFFF);
+		regs->edx = (u32)(cur_vcpu->siefp_msr >> 32);
+		break;
+	case HV_MSR_SIMP:
+		if (!hv_privilege_check(curp, HV_ACCESS_SYNC_MSRS)) {
+			goto msr_read_error;
+		}
+		regs->eax = (u32)(cur_vcpu->simp_msr & 0xFFFFFFFF);
+		regs->edx = (u32)(cur_vcpu->simp_msr >> 32);
+		break;
+	case HV_MSR_SINT0:
+		syn_int = 0;
+		goto	syn_int_read_process;
+	case HV_MSR_SINT1:
+		syn_int = 1;
+		goto	syn_int_read_process;
+	case HV_MSR_SINT2:
+		syn_int = 2;
+		goto	syn_int_read_process;
+	case HV_MSR_SINT3:
+		syn_int = 3;
+		goto	syn_int_read_process;
+	case HV_MSR_SINT4:
+		syn_int = 4;
+		goto	syn_int_read_process;
+	case HV_MSR_SINT5:
+		syn_int = 5;
+		goto	syn_int_read_process;
+	case HV_MSR_SINT6:
+		syn_int = 6;
+		goto	syn_int_read_process;
+	case HV_MSR_SINT7:
+		syn_int = 7;
+		goto	syn_int_read_process;
+	case HV_MSR_SINT8:
+		syn_int = 8;
+		goto	syn_int_read_process;
+	case HV_MSR_SINT9:
+		syn_int = 9;
+		goto	syn_int_read_process;
+	case HV_MSR_SINT10:
+		syn_int = 10;
+		goto	syn_int_read_process;
+	case HV_MSR_SINT11:
+		syn_int = 11;
+		goto	syn_int_read_process;
+	case HV_MSR_SINT12:
+		syn_int = 12;
+		goto	syn_int_read_process;
+	case HV_MSR_SINT13:
+		syn_int = 13;
+		goto	syn_int_read_process;
+	case HV_MSR_SINT14:
+		syn_int = 14;
+		goto	syn_int_read_process;
+	case HV_MSR_SINT15:
+		syn_int = 15;
+syn_int_read_process:
+		if (!hv_privilege_check(curp, HV_ACCESS_SYNC_MSRS)) {
+			goto msr_read_error;
+		}
+		regs->eax = (u32)(cur_vcpu->int_msr[syn_int] & 0xFFFFFFFF);
+		regs->edx = (u32)(cur_vcpu->int_msr[syn_int] >> 32);
+		break;
+
+	case HV_MSR_SEOM:
+		/*
+		 * This is a write only register; reads return 0.
+		 */
+		regs->eax = 0;
+		regs->edx = 0;
+		break;
+	case HV_MSR_TIME_REF_COUNT:
+		if (!hv_access_time_refcnt(curp, &msr_content)) {
+			goto msr_read_error;
+		}
+		regs->eax = (u32)(msr_content & 0xFFFFFFFF);
+		regs->edx = (u32)(msr_content >> 32);
+		break;
+	/*
+	 * Synthetic timer MSRs.
+	 */
+	case HV_MSR_TIMER0_CONFIG:
+		timer = 0;
+		goto	process_timer_config_read;
+	case HV_MSR_TIMER1_CONFIG:
+		timer = 1;
+		goto	process_timer_config_read;
+	case HV_MSR_TIMER2_CONFIG:
+		timer = 2;
+		goto	process_timer_config_read;
+	case HV_MSR_TIMER3_CONFIG:
+		timer = 3;
+process_timer_config_read:
+		if (!hv_privilege_check(curp, HV_ACCESS_SYNC_TIMERS)) {
+			goto msr_read_error;
+		}
+		regs->eax = 
+		(u32)(cur_vcpu->timers[timer].config & 0xFFFFFFFF);
+		regs->edx = 
+		(u32)(cur_vcpu->timers[timer].config >> 32);
+		break;
+	case HV_MSR_TIMER0_COUNT:
+		timer = 0;
+		goto process_timer_count_read;
+	case HV_MSR_TIMER1_COUNT:
+		timer = 1;
+		goto process_timer_count_read;
+	case HV_MSR_TIMER2_COUNT:
+		timer = 2;
+		goto process_timer_count_read;
+	case HV_MSR_TIMER3_COUNT:
+		timer = 3;
+process_timer_count_read:
+		if (!hv_privilege_check(curp, HV_ACCESS_SYNC_TIMERS)) {
+			goto msr_read_error;
+		}
+		timer_count = cur_vcpu->timers[timer].count;
+		if (timer_count > ((NOW())/100)) {
+			timer_count -= ((NOW())/100);
+		} else {
+			timer_count = 0;
+		}
+		regs->eax = 
+		(u32)(timer_count & 0xFFFFFFFF);
+		regs->edx = 
+		(u32)(timer_count >> 32);
+		break;
+	case HV_MSR_PVDRV_HCALL:
+		regs->eax = 0;
+		regs->edx = 0;
+		break; 
+	case HV_MSR_SYSTEM_RESET:
+		regs->eax = 0;
+		regs->edx = 0;
+		break; 
+	default:
+		/*
+		 * We did not handle the MSR address specified; 
+		 * let the caller figure out
+		 * What to do.
+		 */
+		return (0);
+	}
+	return (1);
+msr_read_error:
+	/*
+	 * Have to inject #GP fault.
+	 */
+	hv_inject_exception(TRAP_gp_fault);
+	return (1);
+}
+
+int
+hyperv_do_wr_msr(uint32_t idx, struct cpu_user_regs *regs)
+{
+	hv_partition_t	*curp = hv_get_current_partition();
+	unsigned int	vcp_index = hv_get_current_vcpu_index();
+	u64 msr_content = 0;
+	hv_vcpu_t	*cur_vcpu = &curp->vcpu_state[vcp_index];
+	int		syn_int, timer;
+	struct domain	*d = current->domain;
+	int	extid = d->arch.hvm_domain.params[HVM_PARAM_EXTEND_HYPERVISOR];
+
+	/*
+	 * hvmloader uses wrmsr; we don't want to
+	 * intercept calls coming from the bootstrap (bios) code in the HVM 
+	 * guest; we descriminate based on the instruction pointer.
+	 */
+	if (hv_call_from_bios(regs)) {
+		/*
+		 * We don't intercept this.
+		 */
+		return (0);
+	}
+	msr_content =
+	(u32)regs->eax | ((u64)regs->edx << 32);
+	if (extid > 1) {
+		/*
+		 * For now this is all other "Enlightened" operating systems
+		 * other than Longhorn.
+		 */
+		if (idx == 0x40000000) {
+			/*
+			 * PV driver hypercall setup. Let xen handle this.
+			 */
+			return (0);
+		}
+		if (idx == 0x40001000) {
+			idx = 0x40000000;
+		}
+	}
+
+	switch (idx) {
+	case HV_MSR_GUEST_OS_ID: 
+		hv_write_guestid_msr(curp, cur_vcpu,  msr_content);
+		break; 
+	case HV_MSR_HYPERCALL:
+		hv_write_hypercall_msr(curp, cur_vcpu, msr_content);
+		break;
+
+	case HV_MSR_VP_INDEX:
+		goto msr_write_error;
+		
+	case HV_MSR_EOI:
+		if (!hv_privilege_check(curp, HV_ACCESS_APIC_MSRS)) {
+			goto msr_write_error;
+		}
+		hv_write_eoi(msr_content);
+		HV_STATS_COLLECT(HV_EOI_WRITE, &cur_vcpu->stats);
+		break;
+	case HV_MSR_ICR:
+		if (!hv_privilege_check(curp, HV_ACCESS_APIC_MSRS)) {
+			goto msr_write_error;
+		}
+		hv_write_icr(msr_content);
+		HV_STATS_COLLECT(HV_ICR_WRITE, &cur_vcpu->stats);
+		break;
+	case HV_MSR_TPR:
+		if (!hv_privilege_check(curp, HV_ACCESS_APIC_MSRS)) {
+			goto msr_write_error;
+		}
+		hv_write_tpr(msr_content);
+		HV_STATS_COLLECT(HV_TPR_WRITE, &cur_vcpu->stats);
+		break;
+
+	/*
+	 * The following MSRs are synthetic MSRs supported in the Novell Shim.
+	 */
+	case HV_MSR_SCONTROL:
+		if (!hv_privilege_check(curp, HV_ACCESS_SYNC_MSRS)) {
+			goto msr_write_error;
+		}
+		cur_vcpu->control_msr = msr_content; 
+		break;
+	case HV_MSR_SVERSION:
+		if (!hv_privilege_check(curp, HV_ACCESS_SYNC_MSRS)) {
+			goto msr_write_error;
+		}
+		/*
+		 * This is a read-only MSR; generate #GP
+		 */
+		hv_inject_exception(TRAP_gp_fault);
+		break;
+	case HV_MSR_SIEFP:
+	case HV_MSR_SIMP:
+		if (!hv_privilege_check(curp, HV_ACCESS_SYNC_MSRS)) {
+			goto msr_write_error;
+		} 
+		hv_write_sx_msr(idx, curp, cur_vcpu, msr_content);
+		break;
+	case HV_MSR_SINT0:
+		syn_int = 0;
+		goto	syn_int_wr_process;
+	case HV_MSR_SINT1:
+		syn_int = 1;
+		goto	syn_int_wr_process;
+	case HV_MSR_SINT2:
+		syn_int = 2;
+		goto	syn_int_wr_process;
+	case HV_MSR_SINT3:
+		syn_int = 3;
+		goto	syn_int_wr_process;
+	case HV_MSR_SINT4:
+		syn_int = 4;
+		goto	syn_int_wr_process;
+	case HV_MSR_SINT5:
+		syn_int = 5;
+		goto	syn_int_wr_process;
+	case HV_MSR_SINT6:
+		syn_int = 6;
+		goto	syn_int_wr_process;
+	case HV_MSR_SINT7:
+		syn_int = 7;
+		goto	syn_int_wr_process;
+	case HV_MSR_SINT8:
+		syn_int = 8;
+		goto	syn_int_wr_process;
+	case HV_MSR_SINT9:
+		syn_int = 9;
+		goto	syn_int_wr_process;
+	case HV_MSR_SINT10:
+		syn_int = 10;
+		goto	syn_int_wr_process;
+	case HV_MSR_SINT11:
+		syn_int = 11;
+		goto	syn_int_wr_process;
+	case HV_MSR_SINT12:
+		syn_int = 12;
+		goto	syn_int_wr_process;
+	case HV_MSR_SINT13:
+		syn_int = 13;
+		goto	syn_int_wr_process;
+	case HV_MSR_SINT14:
+		syn_int = 14;
+		goto	syn_int_wr_process;
+	case HV_MSR_SINT15:
+		syn_int = 15;
+syn_int_wr_process:
+		if (!hv_privilege_check(curp, HV_ACCESS_SYNC_MSRS)) {
+			goto msr_write_error;
+		}
+		/*
+		 * XXXKYS: We assume that the syn_int registers will be 
+		 * first written before the interrupt generation can occur.
+		 * Specifically if SINT is masked all interrupts that may have 
+		 * been generated will be lost. Also when SINT is disabled; 
+		 * its effects will be only felt for subsequent interrupts that 
+		 * may be posted. XXXKYS: CHECK
+		 */
+		cur_vcpu->int_msr[syn_int] = msr_content; 
+		break;
+
+	case HV_MSR_SEOM:
+		if (!hv_privilege_check(curp, HV_ACCESS_SYNC_MSRS)) {
+			goto msr_write_error;
+		}
+		cur_vcpu->eom_msr = msr_content; 
+		hv_process_message_q(curp, cur_vcpu);
+		break;
+	case HV_MSR_TIME_REF_COUNT: 
+		/*
+		 * This is a read-only msr.
+		 */
+		goto msr_write_error;
+	
+	/*
+	 * Synthetic timer MSRs.
+	 */
+	case HV_MSR_TIMER0_CONFIG:
+		timer = 0;
+		goto	process_timer_config;
+	case HV_MSR_TIMER1_CONFIG:
+		timer = 1;
+		goto	process_timer_config;
+	case HV_MSR_TIMER2_CONFIG:
+		timer = 2;
+		goto	process_timer_config;
+	case HV_MSR_TIMER3_CONFIG:
+		timer = 3;
+process_timer_config:
+		if (!hv_privilege_check(curp, HV_ACCESS_SYNC_TIMERS)) {
+			goto msr_write_error;
+		}
+		/*
+		 * Assume that the client is going to write the whole msr. 
+		 */
+		if (!(msr_content & 0x9)) {
+			/*
+			 * We are neither setting Auto Enable or Enable; 
+			 * silently exit.
+			 * Should this be considered to turn off a 
+			 * timer that may be currently 
+			 * active; XXXKYS: Check. For now we are 
+			 * not doing anything here.
+			 */
+			break;
+		}
+		if (!(((u32)(msr_content >> 16)) & 0x0000000f)) {
+			/*
+			 * sintx is 0; clear the enable bit(s).
+			 */
+			msr_content &= ~(0x1);
+		}
+		cur_vcpu->timers[timer].config = msr_content;
+		/*
+		 * XXXKYS: Can any order be assumed here; 
+		 * should we just act on whatever is in the 
+		 * count register. For now act as if the count 
+		 * register is valid and act on it.
+		 */
+		if (msr_content & 0x1) {
+			hv_schedule_time_out(&cur_vcpu->timers[timer]); 
+			HV_STATS_COLLECT(HV_TIMEOUTS, &cur_vcpu->stats);
+		}
+		break;
+	case HV_MSR_TIMER0_COUNT:
+		timer = 0;
+		goto process_timer_count;
+	case HV_MSR_TIMER1_COUNT:
+		timer = 1;
+		goto process_timer_count;
+	case HV_MSR_TIMER2_COUNT:
+		timer = 2;
+		goto process_timer_count;
+	case HV_MSR_TIMER3_COUNT:
+		timer = 3;
+process_timer_count:
+		if (!hv_privilege_check(curp, HV_ACCESS_SYNC_TIMERS)) {
+			goto msr_write_error;
+		}
+		cur_vcpu->timers[timer].count = 
+		(msr_content + ((NOW())/100));
+		if ((cur_vcpu->timers[timer].config | 0x9)) {
+			hv_schedule_time_out(&cur_vcpu->timers[timer]); 
+			HV_STATS_COLLECT(HV_TIMEOUTS, &cur_vcpu->stats);
+		}
+		
+		break;
+	case HV_MSR_PVDRV_HCALL:
+		/*
+		 * Establish the hypercall page for PV drivers.
+		 */
+		wrmsr_hypervisor_regs(0x40000000, regs->eax, regs->edx);
+		break; 
+	case HV_MSR_SYSTEM_RESET:
+		/*
+		 * Shutdown the domain/partition.
+	 	 */
+		if (msr_content & 0x1) {
+			domain_shutdown(d, SHUTDOWN_reboot);
+		}
+		break; 
+		
+	default:
+		/*
+		 * We did not handle the MSR address; 
+		 * let the caller deal with this.
+		 */
+		return (0);
+	}
+	return (1);
+msr_write_error:
+	/*
+	 * Have to inject #GP fault.
+	 */
+	hv_inject_exception(TRAP_gp_fault);
+	return (1);
+}
Index: xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_shim.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_shim.h	2008-03-26 14:29:34.000000000 -0400
@@ -0,0 +1,326 @@
+/****************************************************************************
+ |
+ | Copyright (c) [2007, 2008] Novell, Inc.
+ | All Rights Reserved.
+ |
+ | This program is free software; you can redistribute it and/or
+ | modify it under the terms of version 2 of the GNU General Public License as
+ | published by the Free Software Foundation.
+ |
+ | This program is distributed in the hope that it will be useful,
+ | but WITHOUT ANY WARRANTY; without even the implied warranty of
+ | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ | GNU General Public License for more details.
+ |
+ | You should have received a copy of the GNU General Public License
+ | along with this program; if not, contact Novell, Inc.
+ |
+ | To contact Novell about this file by physical or electronic mail,
+ | you may find current contact information at www.novell.com
+ |
+ |***************************************************************************
+*/
+
+/*
+ * Hyperv Shim Implementation.
+ *
+ * Engineering Contact: K. Y. Srinivasan
+ */
+
+#ifndef HV_SHIM_H
+#define HV_SHIM_H
+
+#include <xen/sched.h>
+#include <xen/types.h>
+#include <xen/timer.h>
+#include <asm/current.h>
+#include <asm/domain.h>
+#include <asm/shadow.h>
+#include <public/xen.h>
+
+#include "hv_hypercall.h"
+
+/*
+ * Synthetic MSR addresses
+ */
+#define HV_MSR_GUEST_OS_ID	0x40000000
+#define HV_MSR_HYPERCALL	0x40000001
+#define HV_MSR_VP_INDEX		0x40000002
+#define HV_MSR_SYSTEM_RESET	0x40000003
+#define HV_MSR_TIME_REF_COUNT	0x40000020
+#define HV_MSR_EOI		0x40000070
+#define HV_MSR_ICR		0x40000071
+#define HV_MSR_TPR		0x40000072
+
+#define HV_MSR_SCONTROL		0x40000080
+#define HV_MSR_SVERSION		0x40000081
+#define HV_MSR_SIEFP		0x40000082
+#define HV_MSR_SIMP		0x40000083
+#define HV_MSR_SEOM		0x40000084
+#define HV_MSR_SINT0		0x40000090
+#define HV_MSR_SINT1		0x40000091
+#define HV_MSR_SINT2		0x40000092
+#define HV_MSR_SINT3		0x40000093
+#define HV_MSR_SINT4		0x40000094
+#define HV_MSR_SINT5		0x40000095
+#define HV_MSR_SINT6		0x40000096
+#define HV_MSR_SINT7		0x40000097
+#define HV_MSR_SINT8		0x40000098
+#define HV_MSR_SINT9		0x40000099
+#define HV_MSR_SINT10		0x4000009A
+#define HV_MSR_SINT11		0x4000009B
+#define HV_MSR_SINT12		0x4000009C
+#define HV_MSR_SINT13		0x4000009D
+#define HV_MSR_SINT14		0x4000009E
+#define HV_MSR_SINT15		0x4000009F
+
+#define HV_MSR_TIMER0_CONFIG	0x400000B0
+#define HV_MSR_TIMER0_COUNT	0x400000B1
+#define HV_MSR_TIMER1_CONFIG	0x400000B2
+#define HV_MSR_TIMER1_COUNT	0x400000B3
+#define HV_MSR_TIMER2_CONFIG	0x400000B4
+#define HV_MSR_TIMER2_COUNT	0x400000B5
+#define HV_MSR_TIMER3_CONFIG	0x400000B6
+#define HV_MSR_TIMER3_COUNT	0x400000B7
+
+/*
+ * MSR for supporting PV drivers on longhorn.
+ */
+#define HV_MSR_PVDRV_HCALL	0x40001000
+
+/*
+ * Hyperv Shim VCPU flags.
+ */
+#define HV_VCPU_BOOT_CPU	0x00000001
+#define HV_VCPU_UP		0x00000002
+
+/*
+ * Hyperv shim flush flags.
+ */
+
+#define HV_FLUSH_TLB		0X01
+#define HV_FLUSH_INVLPG		0X02
+
+/*
+ * We use the following global state to manage TLB flush requests from the 
+ * guest. At most only one flush can be active in the guest; we may have to
+ * revisit this if this is a bottleneck.
+ */
+typedef struct hv_flush_state {
+	int	count; //0 unused; else #cpus participating
+	cpumask_t	waiters; //Cpus waiting for the flush block
+	struct vcpu	*owner;
+	u64		ret_val;
+	flush_va_t	*flush_param;
+	unsigned short	rep_count;
+} hv_flush_state_t;
+	
+/*
+ * hyperv shim message structure.
+ */
+typedef enum {
+	/*
+	 * For now we only support timer messages
+	 */
+	TYPE_NONE = 0x00000000,
+	TYPE_TIMER_EXPIRED = 0x80000010
+} hv_message_type;
+
+typedef struct hv_timer_message {
+	hv_message_type	type;
+	u8		pad1[3];
+	u8		size;
+	u32		timer_index;
+	u32		pad2;
+	u64		expiration_time;
+} hv_timer_message_t;
+
+typedef struct hv_message {
+	hv_message_type	type;
+	uint8_t		size;
+	uint8_t		flags;
+	uint8_t		reserved[2];
+	uint32_t	reserved1;
+	uint64_t	pay_load[30];
+} hv_message_t;
+
+
+typedef struct hv_vcpu_timer_state {
+	u64	config;
+	u64	count;	/*expiration time in 100ns units*/
+	int	timer_index;
+	struct hv_vcpu	*this_cpu;
+	struct timer	vcpu_timer;
+} hv_vcpu_timer_state_t;
+
+/*
+ * Stats structure.
+ */
+
+typedef struct {
+	u64	num_switches;
+	u64	num_flushes;
+	u64	num_flushes_posted;
+	u64	num_flush_ranges;
+	u64	num_flush_ranges_posted;
+
+	u64	num_tpr_reads;
+	u64	num_icr_reads;
+	u64	num_eoi_writes;
+	u64	num_tpr_writes;
+	u64	num_icr_writes;
+
+	u64	num_gfs_acquires;
+	u64	num_gfs_releases;
+	u64	num_tlb_flushes;
+	u64	num_invl_pages;
+	u64	num_time_outs;
+} hv_vcpu_stats_t;
+
+typedef struct hv_vcpu {
+	/*
+	 * Per-vcpu state to support the hyperv shim; 
+	 */
+	unsigned long	flags;
+	/*
+	 * Synthetic msrs.
+	 */
+	u64	control_msr;
+	u64	version_msr;
+	u64	siefp_msr;
+	u64	simp_msr;
+	u64	eom_msr;
+
+	u64	int_msr[16];
+	/*
+	 * Timer MSRs.
+	 */
+	hv_vcpu_timer_state_t	timers[4];
+	void	*sief_page;
+	void	*sim_page;
+	/*
+	 * Hypercall input/output processing.
+	 * We keep these pages mapped in the hypervisor space.
+	 */
+	void	*input_buffer; /*input buffer virt address*/
+	struct page_info *input_buffer_page; /*input buffer struct page */
+	void	*output_buffer; /*output buffer virt address*/
+	struct page_info *output_buffer_page; /*output buffer struct page */
+	struct vcpu	*xen_vcpu; /*corresponding xen vcpu*/
+	hv_vcpu_stats_t	stats;
+} hv_vcpu_t;
+
+/*
+ * Events of interest for gathering stats.
+ */
+#define HV_CSWITCH	1
+#define HV_FLUSH_VA_STAT	2
+#define HV_FLUSH_RANGE	3
+#define HV_FLUSH_VA_POSTED 4
+#define HV_FLUSH_RANGE_POSTED 5
+#define HV_TPR_READ	6
+#define HV_ICR_READ	7
+#define HV_TPR_WRITE	8	
+#define HV_ICR_WRITE	9
+#define HV_EOI_WRITE	10
+
+#define HV_GFS_ACQUIRE	11	
+#define HV_GFS_RELEASE	12
+#define HV_TLB_FLUSH	13
+#define HV_INVL_PG	14	
+#define HV_TIMEOUTS	15	
+
+void hv_collect_stats(int event, hv_vcpu_stats_t *ststp); 
+
+#define HV_STATS //KYS: Temporary
+
+#ifdef HV_STATS
+#define HV_STATS_COLLECT(event, statp) hv_collect_stats(event, statp)
+#else
+define HV_STATS_COLLECT(event, statp)
+#endif
+
+typedef struct hv_partition {
+	/*
+	 * State maintained on a per guest basis to implement 
+	 * the Hyperv shim.
+	 */
+	spinlock_t	lock;
+	atomic_t	vcpus_active;
+	u64		guest_id_msr;
+	u64		hypercall_msr;
+	u64		privileges;
+	u64		supported_features;
+	unsigned long	hypercall_mfn;
+	int		long_mode_guest;
+	/*
+	 * Each VCPU here corresponds to the vcpu in the underlying hypervisor;
+	 * they share the same ID.
+	 */
+	hv_vcpu_t	vcpu_state[MAX_VIRT_CPUS];
+	hv_flush_state_t flush_state;
+} hv_partition_t;
+
+
+/*
+ * Privilege flags.
+ */
+
+#define HV_ACCESS_VP_RUNTIME	(1ULL << 0)
+#define HV_ACCESS_TIME_REF_CNT	(1ULL << 1)
+#define HV_ACCESS_SYNC_MSRS	(1ULL << 2)
+#define HV_ACCESS_SYNC_TIMERS	(1ULL << 3)
+#define HV_ACCESS_APIC_MSRS	(1ULL << 4)
+#define HV_ACCESS_PARTITION_ID	(1ULL << 33)
+	
+#define hv_get_current_partition() \
+((current)->domain->arch.hvm_domain.hyperv_handle)
+
+#define hv_get_current_vcpu_index() (current)->vcpu_id
+
+
+static inline int
+hv_invalid_cpu_state(void)
+{
+	int state;
+	state = hvm_funcs.guest_x86_mode(current);
+	if ((state == 4) || (state == 8)) {
+		return (0);
+	}
+	return (1);
+}
+
+static inline u64
+hv_build_hcall_retval(int code, int reps)
+{
+	u64	ret_val=0;
+	ret_val |= (code & 0xff);
+	ret_val |= (((long long)(reps & 0xfff)) << 32);
+	return (ret_val);
+}
+
+static inline void  hv_set_syscall_retval(struct cpu_user_regs *pregs, 
+				int long_mode, u64 ret_val)
+{
+	if (long_mode) {
+		pregs->eax = ret_val;
+	} else {
+		pregs->edx = (u32)(ret_val >> 32);
+		pregs->eax = (u32)(ret_val);
+	}
+}
+
+static inline int 
+hv_privilege_check(hv_partition_t *curp, u64 flags)
+{
+	return ((curp->privileges & flags)? 1: 0);
+}
+
+void
+hv_handle_hypercall(u64 opcode, u64 input, u64 output, 
+		  u64 *ret_val);
+
+
+void hv_print_stats(hv_partition_t *curp, int i);
+
+#endif /*HV_SHIM_H */

[-- Attachment #5: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH][RFC] Supporting Enlightened Windows 2008Server
       [not found] <47F68017.E57C.0030.0@novell.com>
@ 2008-04-05  9:21 ` Keir Fraser
  0 siblings, 0 replies; 14+ messages in thread
From: Keir Fraser @ 2008-04-05  9:21 UTC (permalink / raw)
  To: Ky Srinivasan, Tim Deegan; +Cc: xen-devel

On 5/4/08 00:24, "Ky Srinivasan" <ksrinivasan@novell.com> wrote:

> Based on the feedback I got from you and Tim, I am enclosing the next version
> of the patches to support enlightened win2008 server. Here are the changes I
> have made:
> 
> 1) I have put the shim on a low calorie diet - I have gotten rid of the
> framework infrastructure and to the extent possible integrated the shim code
> with xen.
> 
> 2) I have tried to cleanup the code. I am sure  more work will be needed here.
> 
> 3) I am not advertising the TLB related enlightenments. We can revisit this
> later if needed.

It's certainly quite a bit shorter which is good. For the remaining stuff,
do you have empirical evidence that performance is improved by it?

Other more minor comments are that the coding style is still off (e.g.,
start braces should go on their own line, spaces inside () for if/for/while
headers), you have at least one big switch statement where most of the cases
could be collapsed to just one shared block of code, and indeed shouldn't
the 'default' case in the hypercall demuxing switch statement be to return
'denied', and that would get rid of most of the individual cases altogether?

 -- Keir

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH][RFC] Supporting Enlightened Windows 2008Server
  2008-04-04 23:41 [PATCH][RFC] Supporting Enlightened Windows 2008Server Ky Srinivasan
@ 2008-04-05 15:17 ` Daniel P. Berrange
  2008-04-07 14:27   ` Ky Srinivasan
  2008-04-07 14:27   ` Ky Srinivasan
       [not found] ` <20080410112313.GA4628@weybridge.uk.xensource.com>
  1 sibling, 2 replies; 14+ messages in thread
From: Daniel P. Berrange @ 2008-04-05 15:17 UTC (permalink / raw)
  To: Ky Srinivasan; +Cc: xen-devel, Tim Deegan, Keir Fraser

On Fri, Apr 04, 2008 at 05:41:32PM -0600, Ky Srinivasan wrote:
> Based on the feedback I got from you and Tim, I am enclosing 
> the next version of the patches to support enlightened win2008
>  server. Here are the changes I have made:
> 
> 1) I have put the shim on a low calorie diet - I have gotten rid
>  of the framework infrastructure and to the extent possible 
>  integrated the shim code with xen.
> 
> 2) I have tried to cleanup the code. I am sure  more work will be 
> needed here.
> 
> 3) I am not advertising the TLB related enlightenments. We can 
> revisit this later if needed.


> Index: xen-unstable.hg/tools/python/xen/xm/create.py
> ===================================================================
> --- xen-unstable.hg.orig/tools/python/xen/xm/create.py	2008-04-04 13:01:01.000000000 -0400
> +++ xen-unstable.hg/tools/python/xen/xm/create.py	2008-04-04 13:45:08.000000000 -0400
> @@ -207,6 +207,10 @@
>            use="""Timer mode (0=delay virtual time when ticks are missed;
>            1=virtual time is always wallclock time.""")
>  
> +gopts.var('extid', val='EXTID',
> +          fn=set_int, default=0,
> +          use="Specify extention ID for a HVM domain.")
> +
>  gopts.var('acpi', val='ACPI',
>            fn=set_int, default=1,
>            use="Disable or enable ACPI of HVM domain.")
> @@ -750,7 +754,7 @@
>  def configure_hvm(config_image, vals):
>      """Create the config for HVM devices.
>      """
> -    args = [ 'device_model', 'pae', 'vcpus', 'boot', 'fda', 'fdb', 'timer_mode',
> +    args = [ 'device_model', 'pae', 'extid', 'vcpus', 'boot', 'fda', 'fdb', 'timer_mode',
>               'localtime', 'serial', 'stdvga', 'isa', 'nographic', 'soundhw',
>               'vnc', 'vncdisplay', 'vncunused', 'vncconsole', 'vnclisten',
>               'sdl', 'display', 'xauthority', 'rtc_timeoffset', 'monitor',


I'd still like to see the userspace tools side changed to not expose
the integrate 'extid' value to administrators configuring guests, and
instead use some semanatically meaningful string param

http://lists.xensource.com/archives/html/xen-devel/2008-03/msg00165.html

Regards,
Daniel.
-- 
|: Red Hat, Engineering, Boston   -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org  -o-  http://virt-manager.org  -o-  http://ovirt.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-  F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH][RFC] Supporting Enlightened Windows 2008Server
  2008-04-05 15:17 ` Daniel P. Berrange
@ 2008-04-07 14:27   ` Ky Srinivasan
  2008-04-07 14:27   ` Ky Srinivasan
  1 sibling, 0 replies; 14+ messages in thread
From: Ky Srinivasan @ 2008-04-07 14:27 UTC (permalink / raw)
  To: Daniel P. Berrange; +Cc: xen-devel, Tim Deegan, Keir Fraser



>>> On Sat, Apr 5, 2008 at 11:17 AM, in message
<20080405151705.GA19777@redhat.com>, "Daniel P. Berrange" <berrange@redhat.com>
wrote: 
> On Fri, Apr 04, 2008 at 05:41:32PM -0600, Ky Srinivasan wrote:
>> Based on the feedback I got from you and Tim, I am enclosing 
>> the next version of the patches to support enlightened win2008
>>  server. Here are the changes I have made:
>> 
>> 1) I have put the shim on a low calorie diet - I have gotten rid
>>  of the framework infrastructure and to the extent possible 
>>  integrated the shim code with xen.
>> 
>> 2) I have tried to cleanup the code. I am sure  more work will be 
>> needed here.
>> 
>> 3) I am not advertising the TLB related enlightenments. We can 
>> revisit this later if needed.
> 
> 
>> Index: xen-unstable.hg/tools/python/xen/xm/create.py
>> ===================================================================
>> --- xen-unstable.hg.orig/tools/python/xen/xm/create.py	2008-04-04 
> 13:01:01.000000000 -0400
>> +++ xen-unstable.hg/tools/python/xen/xm/create.py	2008-04-04 13:45:08.000000000 
> -0400
>> @@ -207,6 +207,10 @@
>>            use="""Timer mode (0=delay virtual time when ticks are missed;
>>            1=virtual time is always wallclock time.""")
>>  
>> +gopts.var('extid', val='EXTID',
>> +          fn=set_int, default=0,
>> +          use="Specify extention ID for a HVM domain.")
>> +
>>  gopts.var('acpi', val='ACPI',
>>            fn=set_int, default=1,
>>            use="Disable or enable ACPI of HVM domain.")
>> @@ -750,7 +754,7 @@
>>  def configure_hvm(config_image, vals):
>>      """Create the config for HVM devices.
>>      """
>> -    args = [ 'device_model', 'pae', 'vcpus', 'boot', 'fda', 'fdb', 
> 'timer_mode',
>> +    args = [ 'device_model', 'pae', 'extid', 'vcpus', 'boot', 'fda', 'fdb', 
> 'timer_mode',
>>               'localtime', 'serial', 'stdvga', 'isa', 'nographic', 
> 'soundhw',
>>               'vnc', 'vncdisplay', 'vncunused', 'vncconsole', 'vnclisten',
>>               'sdl', 'display', 'xauthority', 'rtc_timeoffset', 'monitor',
> 
> 
> I'd still like to see the userspace tools side changed to not expose
> the integrate 'extid' value to administrators configuring guests, and
> instead use some semanatically meaningful string param
> 
> http://lists.xensource.com/archives/html/xen-devel/2008-03/msg00165.html

Absolutely. I will fix this in the next revision.

Regards,

K. Y
> 
> Regards,
> Daniel.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH][RFC] Supporting Enlightened Windows 2008Server
  2008-04-05 15:17 ` Daniel P. Berrange
  2008-04-07 14:27   ` Ky Srinivasan
@ 2008-04-07 14:27   ` Ky Srinivasan
  1 sibling, 0 replies; 14+ messages in thread
From: Ky Srinivasan @ 2008-04-07 14:27 UTC (permalink / raw)
  To: Daniel P. Berrange; +Cc: xen-devel, Tim Deegan, Keir Fraser



>>> On Sat, Apr 5, 2008 at 11:17 AM, in message
<20080405151705.GA19777@redhat.com>, "Daniel P. Berrange" <berrange@redhat.com>
wrote: 
> On Fri, Apr 04, 2008 at 05:41:32PM -0600, Ky Srinivasan wrote:
>> Based on the feedback I got from you and Tim, I am enclosing 
>> the next version of the patches to support enlightened win2008
>>  server. Here are the changes I have made:
>> 
>> 1) I have put the shim on a low calorie diet - I have gotten rid
>>  of the framework infrastructure and to the extent possible 
>>  integrated the shim code with xen.
>> 
>> 2) I have tried to cleanup the code. I am sure  more work will be 
>> needed here.
>> 
>> 3) I am not advertising the TLB related enlightenments. We can 
>> revisit this later if needed.
> 
> 
>> Index: xen-unstable.hg/tools/python/xen/xm/create.py
>> ===================================================================
>> --- xen-unstable.hg.orig/tools/python/xen/xm/create.py	2008-04-04 
> 13:01:01.000000000 -0400
>> +++ xen-unstable.hg/tools/python/xen/xm/create.py	2008-04-04 13:45:08.000000000 
> -0400
>> @@ -207,6 +207,10 @@
>>            use="""Timer mode (0=delay virtual time when ticks are missed;
>>            1=virtual time is always wallclock time.""")
>>  
>> +gopts.var('extid', val='EXTID',
>> +          fn=set_int, default=0,
>> +          use="Specify extention ID for a HVM domain.")
>> +
>>  gopts.var('acpi', val='ACPI',
>>            fn=set_int, default=1,
>>            use="Disable or enable ACPI of HVM domain.")
>> @@ -750,7 +754,7 @@
>>  def configure_hvm(config_image, vals):
>>      """Create the config for HVM devices.
>>      """
>> -    args = [ 'device_model', 'pae', 'vcpus', 'boot', 'fda', 'fdb', 
> 'timer_mode',
>> +    args = [ 'device_model', 'pae', 'extid', 'vcpus', 'boot', 'fda', 'fdb', 
> 'timer_mode',
>>               'localtime', 'serial', 'stdvga', 'isa', 'nographic', 
> 'soundhw',
>>               'vnc', 'vncdisplay', 'vncunused', 'vncconsole', 'vnclisten',
>>               'sdl', 'display', 'xauthority', 'rtc_timeoffset', 'monitor',
> 
> 
> I'd still like to see the userspace tools side changed to not expose
> the integrate 'extid' value to administrators configuring guests, and
> instead use some semanatically meaningful string param
> 
> http://lists.xensource.com/archives/html/xen-devel/2008-03/msg00165.html

Absolutely. I will fix this in the next revision.

Regards,

K. Y
> 
> Regards,
> Daniel.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH][RFC] Supporting Enlightened Windows 2008Server
       [not found] ` <20080410112313.GA4628@weybridge.uk.xensource.com>
@ 2008-04-13 18:43   ` Ky Srinivasan
       [not found]   ` <48DF4074.E57C.0030.0@novell.com>
  1 sibling, 0 replies; 14+ messages in thread
From: Ky Srinivasan @ 2008-04-13 18:43 UTC (permalink / raw)
  To: Steven Smith; +Cc: xen-devel, Tim Deegan, Keir Fraser

Steven,

Thanks for the detailed review. I will address the issues you have raised as part of my next cleanup of these patches. I have responded to your comments in-line.

Regards,

K. Y

>>> Steven Smith <steven.smith@citrix.com> 04/10/08 7:23 AM >>> 
> I have a couple of comments on the new patches:


> Would you mind generating diffs with -p, please?  It makes them a fair
> bit easier to read.

Will do.

>  
>  static void svm_vmexit_do_cpuid(struct cpu_user_regs *regs)
>  {
> -    unsigned int eax, ebx, ecx, edx, inst_len;
> +    unsigned int eax, ebx, ecx, edx, inst_len, input;
>  
>      inst_len = __get_instruction_length(current, INSTR_CPUID, NULL);
>      if ( inst_len == 0 ) 
> ...
> @@ -968,6 +970,7 @@
>      regs->ecx = ecx;
>      regs->edx = edx;
>  
> +    hyperx_intercept_do_cpuid(input, regs);
>      __update_guest_eip(regs, inst_len);
>  }
>
  
> Would it be easier to put this bit in hvm_cpuid, or maybe
> cpuid_hypervisor_leaves?  That would avoid needing to make the same
> change to both the VMX and SVM CPUID infrastructure.

Good point. I will look into it.



> @@ -984,6 +987,10 @@
>      struct vcpu *v = current;
>      struct vmcb_struct *vmcb = v->arch.hvm_svm.vmcb;
>  
> +    if (hyperx_intercept_do_msr_read(ecx, regs))
> +    {
> +            goto done;
> +    }
>      switch ( ecx )
>      {
>      case MSR_IA32_TSC:
> 
> Likewise, it seems like that could be done better in
> rdmsr_hypervisor_regs().

Yep.

> @@ -1085,6 +1092,10 @@
>      msr_content = (u32)regs->eax | ((u64)regs->edx << 32);
>  
>      hvmtrace_msr_write(v, ecx, msr_content);
> +    if (hyperx_intercept_do_msr_write(ecx, regs))
> +    {
> +            goto done_msr_write;
> +    }
>  
>      switch ( ecx )
>      {

> Or wrmsr_hypervisor_regs().

Will do.





> @@ -2210,6 +2226,10 @@
>                  if ( a.value > HVMPTM_one_missed_tick_pending )
>                      goto param_fail;
>                  break;
> +            case HVM_PARAM_EXTEND_HYPERVISOR:
> +printk("KYS: EXTEND hypervisor id is %d\n", (int)a.value);
> +                if ((a.value == 1) && hyperv_initialize(d)) 
> +                    goto param_fail;
>              }
>              d->arch.hvm_domain.params[a.index] = a.value;
>              rc = 0;
> 

> It might be a good idea to fail with -EINVAL or something if
> a.value > 1, so that if anyone ever introduces another hypervisor
> extension it's easy for the tools to tell whether it's available or
> not.

Will do.


> Index: xen-unstable.hg/xen/arch/x86/hvm/save.c
> ===================================================================
> --- xen-unstable.hg.orig/xen/arch/x86/hvm/save.c2008-04-04 14:28:26.000000000 -0400
> +++ xen-unstable.hg/xen/arch/x86/hvm/save.c2008-04-04 14:28:44.000000000 -0400
> @@ -23,6 +23,8 @@
>  
>  #include <asm/hvm/support.h>
>  #include <public/hvm/save.h>
> +#include <public/hvm/params.h>
> +#include <asm/hvm/hvm_extensions.h>
>  
>  void arch_hvm_save(struct domain *d, struct hvm_save_header *hdr)
>  {
> 

> The only change made to this file was to add some #include's.  Why
> were they necessary?

This was necessary at some point in the past. This is the vestiges of a partial cleanup! This will be fixed.



> Index: xen-unstable.hg/xen/include/public/arch-x86/hvm/save.h
> ===================================================================
> --- xen-unstable.hg.orig/xen/include/public/arch-x86/hvm/save.h2008-04-04 14:28:26.000000000 -0400
> +++ xen-unstable.hg/xen/include/public/arch-x86/hvm/save.h2008-04-04 14:28:44.000000000 -0400
> @@ -38,7 +38,7 @@
>      uint32_t version;           /* File format version */
>      uint64_t changeset;         /* Version of Xen that saved this file */
>      uint32_t cpuid;             /* CPUID[0x01][%eax] on the saving machine */
> -    uint32_t pad0;
> +    uint32_t pad0;/* extension ID */
>  };
>  
>  DECLARE_HVM_SAVE_TYPE(HEADER, 1, struct hvm_save_header);
> 
> I think it might be a good idea to give the field a more
> descriptive name than ``pad0'' if it's being used for something
> other than padding.

This again is the result of a partial cleanup. I was stashing away the extension ID and Tim rightly pointed out that we had enough state and did not need the extension ID in the save record. This will be cleaned up.

> It might be an even better idea to move the ID out of the header
> entirely and put it in its own HVM_SAVE_TYPE.  You already have
> hvm_hyperv_dom; can you use that to identify the presence or
> absence of the extensions?

I will look into this.




> --- /dev/null1970-01-01 00:00:00.000000000 +0000
> +++ xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_errno.h2008-03-26 13:56:39.000000000 -0400
> @@ -0,0 +1,4 @@
...
> +/*
> + * hv_errno.h
> + * Error codes for the  Novell Shim.
> 
>> These are just the Viridian error codes, aren't they, rather than
>> something specific to the Novell shim?

Yes. This will be fixed.


> + *
> + * Engineering Contact: K. Y. Srinivasan
> + */
> +
> +#ifndef HV_ERRNO_H
> +#define HV_ERRNO_H
> +
> +#define HV_STATUS_SUCCESS0x0000
> +#define HV_STATUS_INVALID_HYPERCALL_CODE0x0002
> +#define HV_STATUS_INVALID_HYPERCALL_INPUT0x0003
> +#define HV_STATUS_INVALID_ALIGNMENT0x0004
> +#define HV_STATUS_INVALID_PARAMETER0x0005
...
> +#endif 
> Index: xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_hypercall.c
> ===================================================================
> --- /dev/null1970-01-01 00:00:00.000000000 +0000
> +++ xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_hypercall.c2008-03-26 14:18:32.000000000 -0400
> @@ -0,0 +1,13 @@
...
> +void hv_collect_stats(int event, hv_vcpu_stats_t *statsp)
> +{
> +switch (event) {
> +case HV_CSWITCH:
> +statsp->num_switches++;
> +return;
> +case HV_FLUSH_VA:
> +statsp->num_flushes++;
> +return;
> +case HV_FLUSH_RANGE:
> +statsp->num_flush_ranges++;
> +return;
> That looks a bit pointless.  Would it not be easier to just inline
> this whole function?

Now that, I have dropped all the TLB flush enlightenments, I will clean this up.


...
> +}
...
> +static int
> +hv_get_vp_registers(paddr_t input, paddr_t output)
> +{
> +hv_vcpu_t        *vcpup, *targetp;
> +hv_partition_t   *curp = hv_get_current_partition();
> +get_vp_registers_input_t*inbuf;
> +get_vp_registers_output_t*outbuf;
> +struct vcpu_guest_context*vcpu_ctx;
> +u32*reg_indexp;
> +get_vp_registers_output_t*out_regp;
> +u32num_output_bytes = 0;
> +
> +        vcpup = &curp->vcpu_state[hv_get_current_vcpu_index()];
> +inbuf = vcpup->input_buffer;
> +outbuf = vcpup->output_buffer;
> +out_regp = outbuf;
> +/*
> + * Copy the input data to the per-cpu input buffer.
> + * This may be an overkill; obviously it is better to only
> + * copy what we need. XXXKYS: Check with Mike.
> + */
> Who's Mike?  What did he say?

Mike is  the MSFT contact. 




> +if (hvm_copy_from_guest_phys(inbuf, input, PAGE_SIZE)) {
> +return (HV_STATUS_INVALID_ALIGNMENT);
> +}
...
> +/*
> + * Now that we have the register state; select what we want and
> + * populate the output buffer.
> + */
> +reg_indexp = &inbuf->reg_index;
> +while (*reg_indexp != 0) {
> +switch(*reg_indexp) {
> +/*
> + * XXXKYS: need mapping code here; populate
> + * outbuf.
> + */
> +panic("hv_get_vp_registers not supported\n");

> Yeah, that's not really good enough.  It would be better to just
> not support this hypercall at all than have an implementation which
> always crashes Xen.

Early on, I decided what Hypercalls I would support; and from this list I wanted 
to only support those that were used by windows 2008 server. I  had put in a bunch of panics to trap all the hypercalls that the guest would invoke. This hypercall is not used. Thus the implementation is incomplete. I will clean this up.

> +}
> +reg_indexp++;
> +out_regp++ ;/*128 bit registers */
> +num_output_bytes +=16;
> +if ((char *)reg_indexp > ((char *)inbuf + PAGE_SIZE)) {
> +/*
> + *input list not reminated correctly; bail out.
> + */
> +panic("hv_get_vp_registers:input list not terminated\n"); 
> +break;

> It would be nice if this caused the hypercall to fail, rather than
> crashing Xen.

Will be cleaned up.



> +}
> +}
> +if (hvm_copy_to_guest_phys(output, outbuf, num_output_bytes)) {
> +/* Some problem copying data out*/
> +panic("hv_get_vp_registers:copyout problem\n");
 
> And again.

Will be cleaned up.


> +}
> +xfree(vcpu_ctx);
> +return (HV_STATUS_SUCCESS);
> +}

> Has this hypercall had any testing at all?  If it hasn't, then what
> is it doing here?

As I noted earlier, win2k8 does not currently use this API. I will clean this up.



> +
> +static int
> +hv_set_vp_registers(paddr_t input, paddr_t output)

> Again, I don't believe this function has ever been called.  That
> suggests that this may not be a high-value optimisation suitable
> for inclusion in an initial version.

As I noted earlier, win2k8 does not currently use this API. I will clean this up.



...
> 
> +static int
> +hv_switch_va(paddr_t input)
> +{
> +hv_partition_t   *curp = hv_get_current_partition();
> +        hv_vcpu_t *vcpup = &curp->vcpu_state[hv_get_current_vcpu_index()];
> +
> +/*
> + * XXXKYS: the spec sys the asID is passed via memory at offset 0 of 
> + * the page whose GPA is in the input register. However, it appears 
> + * the current build of longhorn (longhorn-2007-02-06-x86_64-fv-02)
> + * passes the asID in the input register instead. Need to check if 
> + * future builds do this.
> + */
> +hvm_set_cr3(input); 
> +HV_STATS_COLLECT(HV_CSWITCH, &vcpup->stats);
> +return (HV_STATUS_SUCCESS);
> +}

> Do you have any evidence that this is actually faster than just
> doing the CR3 write in the guest?  It's a win on real Viridian
> because you avoid blowing the shadow page table cache, but I don't
> think that applies on Xen's current shadow page table
> implementation.

As I noted earlier, initially I identified all the hypercalls that made for a win2k8 guest on our platform. From this list I implemented those that the guest actually invoked. I implemented all the hypercalls that the guest was invoking without regard to the performance value of the enlightenment. I agree with you that this may be more important on Veridian than on our platform. 



> +static int
> +hv_flush_va(paddr_t input)
> +{
...
> +/*
> + * Now operate on what we are given
> + * XXXKYS: For now we are ignoring as_id and fushGlobal flag.
> + * May have to revisit this. But first stash away the processed 
> + * parameters for subsequent use.
> + */
> +flush_argp->as_handle = as_id;
> +flush_argp->flags = flush_global;
> +flush_argp->v_mask = vcpu_mask;
> +
> +
> +ret_val = hv_build_hcall_retval(HV_STATUS_SUCCESS, 0);
> +hv_set_syscall_retval(guest_cpu_user_regs(),
> +                                   curp->long_mode_guest,
> +                                   ret_val);
> +HV_STATS_COLLECT(HV_FLUSH_VA_STAT, &cur_vcpup->stats);
> +panic("hv_flush_va not supported\n"); 

> So if this hypercall is ever correctly used by the guest, Xen will
> crash?

In the original patches that I posted, this hypercall was implemented. As part of my first round of cleanup, I removed the support for the TLB  flush hypercalls. As part of this cleanup, I also disabled TLB related enlightenments. This will be cleaned up.

> +return (HV_STATUS_SUCCESS);
> +}
> +
> +static int
> +hv_flush_va_range(paddr_t input, unsigned short start_index, 
> +unsigned short rep_count, unsigned short *reps_done)
> +{
...
> +ret_val = hv_build_hcall_retval(HV_STATUS_SUCCESS, rep_count);
> +hv_set_syscall_retval(guest_cpu_user_regs(),
> +                                   curp->long_mode_guest,
> +                                   ret_val);
> +
> +
> +HV_STATS_COLLECT(HV_FLUSH_RANGE, &cur_vcpup->stats);
> +panic("hv_flush_vaRange not supported\n");
 
> And again.

Will be cleaned up.


> +return (HV_STATUS_SUCCESS);
> +}
> +
> +void
> +hv_handle_hypercall(u64 opcode, u64 input, u64 output, 
> +  u64 *ret_val)

> Would it be easier to just return the return value?

Perhaps.


> +{
...
> +verb = (short)(opcode & 0xffff);
> +rep_count = (short)((opcode >>32) & 0xfff);
> +start_index = (short)((opcode >> 48) & 0xfff);
> +switch (verb) {
> +case HV_CREATE_PARTITION:
> +/*
> + * Xen only allows dom0 to create domains.
> + */
> +*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
> +return;
> +case HV_INITIALIZE_PARTITION:
> +/*
> + * We don't support this.
> + */
> +*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
> +return;

> Would it be easier to just make HV_STATUS_ACCESS_DENIED the
> default?  Actually, my reading of the spec is that you should
> return HV_STATUS_INVALID_HYPERCALL_CODE if the guest makes a
> hypercall which you don't support, but I can't imagine it makes a
> great deal of difference.

Perhaps not. 


...
> +case HV_GET_PARTITION_ID:
> +if (!hv_privilege_check(curp, HV_ACCESS_PARTITION_ID)) {
> +*ret_val = 
> +hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
> +return;
> +}
> +partition_id = (u64)current->domain->domain_id;
> +if (hvm_copy_to_guest_phys(output, 
> +&partition_id, 8)) {
> +/*
> + * Invalid output area.
> + */
> +*ret_val = 
> +hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);

> I think you mean HV_STATUS_INVALID_ALIGNMENT.

Agreed.


> +return;
> +}
> +*ret_val = hv_build_hcall_retval(HV_STATUS_SUCCESS, 0);
> +return;
> 
> +case HV_GET_VP_REGISTERS:
> +*ret_val = hv_build_hcall_retval(
> +hv_get_vp_registers(input, output), 0);
> +return;
> +case HV_SET_VP_REGISTERS:
> +*ret_val = hv_build_hcall_retval(
> +hv_set_vp_registers(input, output), 0);
...
> +case HV_FLUSH_VA:
> +*ret_val = 
> +hv_build_hcall_retval(hv_flush_va(input), 0);
> +return;
> +case HV_FLUSH_VA_LIST:
> +value  = hv_flush_va_range(input, start_index, 
> +rep_count, &reps_done);
> +*ret_val = hv_build_hcall_retval(value, reps_done);  
> +return;

> As mentioned above, your implementations of these hypercalls are
> really quite broken.

These hypercalls are not invoked by the guest and that is the reason they are broken. I will clean it up though.



...
> 
> +}
> +}
> Index: xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_hypercall.h
> ===================================================================
> --- /dev/null1970-01-01 00:00:00.000000000 +0000
> +++ xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_hypercall.h2008-03-26 13:56:39.000000000 -0400
> @@ -0,0 +1,21 @@
> +
> +/*
> + * nshypercall.h
> + * Memory layouts for the various hypercalls supported. 
> + *
> + * Engineering Contact: K. Y. Srinivasan
> + */
> +
> +#ifndef HV_HYPERCALL_H
> +#define HV_HYPERCALL_H
> +
> +#include <xen/cpumask.h>
> +
> +
> +typedef struct get_vp_registers_input {
> +u64partition_id;
> +u64vp_index;
> +u32reg_index;
> +u32pad1;
> +u64pad2;
> +} get_vp_registers_input_t;

> My reading of the 0.83 specification is that this should have been:

> typedef struct get_vp_registers_input {
> u64 partition_id;
> u32 vp_index;
> u32 pad;
> u32 reg_index[0];
> } get_vp_registers_input_t;

> Is my copy of the specification out of date?

Nope. 



> +typedef struct set_vp_register_spec {
> +u32reg_name;
> +u32pad;
> +u64pad1;
> +u64low_value;
> +u64high_value;
> +} set_vp_register_spec_t;
> +
> +typedef struct set_vp_registers_input {
> +u64partition_id;
> +u64vp_index;
> +set_vp_register_spec_treg_spec;
> +} set_vp_registers_input_t;

> Again, the 0.83 spec says that vp_index should be a u32, and should
> be followed by a u32 MBZ pad.
Right.

> +
> +
> +typedef struct flush_va {
> +u64as_handle;
> +u64flags;
> +union  {
> +u64processor_mask;
> +cpumask_t vcpu_mask;
> +} proc_mask;
> +#define p_mask proc_mask.processor_mask
> +#define v_maskproc_mask.vcpu_mask
> +u64gva;
> +} flush_va_t;

> Is there something wrong with anonymous unions?



> +/*
> + * Hypercall verbs.
> + */
> +
> +#define HV_CREATE_PARTITION 0x0010
> +#define HV_INITIALIZE_PARTITION 0x0011
> +#define HV_DELETE_PARTITION0x0014

> Is it really necesary to include #define's for hypercalls which we
> don't support and will probably never support?

No; I will clean this file up.


> Index: xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_intercept.c
> ===================================================================
> --- /dev/null1970-01-01 00:00:00.000000000 +0000
> +++ xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_intercept.c2008-03-26 14:26:38.000000000 -0400
> @@ -0,0 +1,0 @@
> 
> +/*
> + * Local includes; extension specific.
> + */
> +#include "hv_errno.h"
> +#include "hv_shim.h"
> +
> +
> +/*
> + * Implement the Hyperv Shim.
> + */
> +
> +extern struct cpuinfo_x86 boot_cpu_data;

> Is there something wrong with processor.h?

This will be fixed.

> +extern struct hvm_mmio_handler vlapic_mmio_handler;

> We should probably move that to a header file somewhere.

Will do.

> +static inline void *
> +get_virt_from_gmfn(struct domain *d, unsigned long gmfn)
> +{
> +unsigned long mfn = gmfn_to_mfn(d, gmfn);
> +if (mfn == INVALID_MFN) {
> +return (NULL);
> +}
> +return (map_domain_page_global(mfn));
> +}

> Is it really necessary to use map_domain_page_global() here?
> map_domain_page() is usually quite a bit faster.  To be honest, I'm
> not convinced that this wrapper function has any real value anyway.

I will look into this.


> +static inline unsigned long
> +get_mfn_from_gva(unsigned long va)
> +{
> +uint32_t pfec = PFEC_page_present;
> +unsigned long gfn;
> +gfn = paging_gva_to_gfn(current, va, &pfec);
> +return (gmfn_to_mfn((current->domain), gfn));
> +}

> That's only valid if you can guarantee that the resulting MFN will
> never be written to.

Yes. I use this to figure out if the hypercall is from the PV drivers in the guest or the hyparcall from the guest.


> +static inline void 
> +hv_write_hypercall_msr(hv_partition_t *curp,
> +  hv_vcpu_t*cur_vcpu,
> +  u64msr_content)
> +{
...
> +hv_hypercall_page_initialize(hypercall_page, curp);
> +curp->hypercall_mfn = gmfn_to_mfn(d, gmfn);
> +#ifdef CONFIG_DOMAIN_PAGE
> +unmap_domain_page_global(hypercall_page);
> +#endif

> unmap_domain_page_global() is #define'd to nothing #ifndef
> CONFIG_DOMAIN_PAGE, so you don't need the #ifdef.

Yes.


> +curp->hypercall_msr = msr_content;
> +spin_unlock(&curp->lock);
> +cur_vcpu->flags |= HV_VCPU_UP;
> +}
> +
> +
> +static inline void hv_write_sx_msr(uint32_t idx, hv_partition_t *curp,
> +  hv_vcpu_t*cur_vcpu,
> +  u64msr_content)
> +{
...
> +sx_page = get_virt_from_gmfn(d, gmfn);
> +if (sx_page == NULL) {
> +/*
> + * The guest specified a bogus GPA; inject a GP fault
> + * into the guest.
> + */
> +hv_inject_exception(TRAP_gp_fault);

> The spec says tha SIEFP can be outside the physical address space,
> and you're just supposed to deal with it.  The guest will be unable
> to access the frame, but the MSR write is supposed to succeed
> (section 14.6.3 of version 0.83).  Likewise the SIMP (14.6.4).

You are right. As it turns out, the win2k8 guest on our platform never runs into this situation. I will fix this.


> +return;
> +}
> +switch (idx) {
> +case HV_MSR_SIEFP:
> +hv_init_event_page(sx_page);
> +cur_vcpu->siefp_msr = msr_content; 
> +cur_vcpu->sief_page = sx_page; 
> +break;
> +case HV_MSR_SIMP:
> +hv_init_message_page(sx_page);
> +cur_vcpu->simp_msr = msr_content;
> +cur_vcpu->sim_page = sx_page;
> +break;
> +}
> +
> +}
> +

> +
> +/*
> + * Time this domain booted.
> + */
> +s_time_t hv_domain_boot_time;
> Time which domain booted?  What if you have two domains which both
> use the HyperV extensions?

Oops! Will fix this.



> +static inline void
> +hv_inject_exception(int trap)
> +{
> +hvm_funcs.inject_exception(trap, 0, 0);
> +}
> hvm_inject_exception() might be a better choice.

Ok.

> +static inline void 
> +hv_set_partition_privileges(hv_partition_t *hvpp)
> +{
> +/*
> + * This is based on the hypervisor spec under section 5.2.3. 
> + */
> +hvpp->privileges = 0x000000020000007f;
> +}
> So you allow AccessVpRuntime, AccessTimeCounters, AccessSynicMsrs,
> AccessSyntheticTimers, AccessApicMsrs, AccessHypercallMsrs,
> AccessVpIndex, and AccessSelfPartitionId?  That sounds like a
> pretty reasonable set, but it'd be nice if you'd documented the
> fact rather than just dropping in a magic number.  Something like
> this, maybe:

> hvpp->privileges = HV_PRIV_ACCESS_VP_RUNTIME |
>   HV_PRIV_ACCESS_TIME_COUNTERS |
>                  HV_PRIV_ACCESS_SYNIC_MSRS |
                   ...

Ok.

> +static inline u32
> +hv_get_recommendations(void)
> +{
> +/*
> + *For now we recommend all the features. Need to validate.
> + */
> +if ( paging_mode_hap(current->domain)) {
> +/*
> + * If HAP is enabled; the guest should not use TLB flush
> + * related enlightenments.
> + */
> +return (0x19);
> +} else {
> +/*
> + * For now disable TLB flush enlightenments.
> + */
> +return (0x19); 
> +}
> +}

> So you recommend the use of hypercalls for address switches, MSRs
> for APIC registers, and MSRs for rebooting?  The first two make
> sense, but I'm not so sure that enlightened reboot is a useful
> optimisation.

Originally, I had implemented TLB flush enlightenments as well. Based on some of the performance analysis we are currently doing, I may re-introduce them. As I noted earlier, I just wanted to implement all the features that made sense for the guest on our platform - hence I implemented reboot as well!

> Again, it would have been useful if the actual set of features was
> documented rather than just using a magic number.

Yep.

> +static inline void 
> +hv_set_partition_features(hv_partition_t *hvpp)
> +{
> +hvpp->supported_features = 0x1f;
> +}
>i.e. VP_RUNTIME | PARTITION_REF_COUNT | SYNIC MSRS | SYNTHETIC_TIMERS |
>     APIC_MSRS | HYPERCALL_MSRS.

> It's kind of surprising that you recommend guests use the reboot MSRs,
> but then don't claim to support it.  Was that deliberate?

> Actually, looking some more, it doesn't look like you ever use the
> supported_features field, and have instead hardcoded the value to
> 0xff everywhere you need it.

Will fix this.

> +
> +static inline u16 
> +hv_get_guest_major(void)
> +{
> +//KYS: Check!
> +return (0);
> +}
> +static inline u16
> +hv_get_guest_minor(void)
> +{
> +//KYS: Check!
> +return (0);
> +}
> +static inline u32
> +hv_get_guest_service_pack(void)
> +{
> +//KYS: Check!
> +return (0);
> +}
> Err... are these supposed to return the guest major/minor/sp
> reported in the identification MSR?

Yep. Was not quite sure what numbers made sense!



> +static inline void
> +hv_read_icr(u64 *icr_content)
...
> +static inline void
> +hv_read_tpr(u64 *tpr_content)
...
> +static inline void
> +hv_write_eoi(u64 msr_content)
...
> +static inline void
> +hv_write_icr(u64 msr_content)
...
> +static inline void
> +hv_write_tpr(u64 msr_content)
...

> These all have really weird implementations and lots of gratuitous
> use of hardcoded magic numbers (have a look at apicdef.h).

Will fix it.


> +static void
> +hv_timeout_handler(void *arg)
> +{
...
> +/*
> + * First post the message and then optionally deal with the 
> + * interrupt notification.
> + */
> +if (cur_vcpu->sim_page == NULL) {
> +panic("Novell Shim: Sim page not setup\n");
> +}

> So if a guest requests a timeout before they've set up the sim
> page, Xen crashes?  What if they clear the sim page after setting
> up a timeout, but before it fires?

I will fix this. As I noted earlier, win2k8 currently does not use any of these features.

...
> +}
> +

> +int
> +hyperv_do_hypercall(struct cpu_user_regs *pregs)
> +{
> +hv_partition_t*curp = hv_get_current_partition();
> +hv_vcpu_t        *vcpup;
> +intlong_mode_guest = curp->long_mode_guest;
> +unsigned long hypercall_mfn;
> +unsigned long gmfn;
> +gmfn = (curp->hypercall_msr >> 12);
> +
> +hypercall_mfn = get_mfn_from_gva(pregs->eip);

> That's exciting.  Do you expect that guests will make hypercalls
> from anywhere other than the Viridian hypercall page?

We support PV drivers for win2k8 guests. Given that both HyperV and Xen use the same hypercall numbers to implement different functionality, I needed a way to differentiate HyperV hypercalls. I use the mfn of the hypercall page to figure out who should handle the hypercall.

...

> +int
> +hyperv_vcpu_initialize(struct vcpu *v)
> +{
...
> +/*
> + * Setup the input page for handling hypercalls.
> + *
> + */
> +vcpup->input_buffer_page = 
> +alloc_domheap_page(NULL);
...
> +vcpup->input_buffer =
> +get_virt_from_page_ptr(vcpup->input_buffer_page);
...
> +vcpup->output_buffer_page = 
> +alloc_domheap_page(NULL);
...
> +vcpup->output_buffer =
> +get_virt_from_page_ptr(vcpup->output_buffer_page);

> What exactly are these used for?  The only place I can see it
> referenced are immediately before a panic(), which seems a bit
> pointless.

These were used in implementing TLB flush enlightenments. I got rid of those hypercalls in the current patch set. I will clean these up.


> +vcpup->xen_vcpu = v; 
> +
> +return (0);
> +}
> +

> +
> +static int 
> +hyperv_dom_restore(struct domain *d, hvm_domain_context_t *h)
> +{
> +struct hvm_hyperv_dom ctxt;
> +hv_partition_t*curp;
> +
> +if ( hvm_load_entry(HYPERV_DOM, h, &ctxt) != 0 )
> +        return -22;
> +d->arch.hvm_domain.params[HVM_PARAM_EXTEND_HYPERVISOR] = ctxt.ext_id;
> +if ((ctxt.ext_id == 0) || (ctxt.ext_id > 1)) {
> +return 0;
> +}
> +if (hyperv_initialize(d)) {
> +return -22;
> +}
> ????  Did you mean -EINVAL there?

Yep.

> +curp = d->arch.hvm_domain.hyperv_handle;
> +
> +curp->guest_id_msr = ctxt.guestid_msr;
> +curp->hypercall_msr = ctxt.hypercall_msr;
> +curp->long_mode_guest = ctxt.long_mode;
> +curp->hypercall_mfn =
> +gmfn_to_mfn(d, (ctxt.hypercall_msr >> 12));
> +
> +return 0; 
> +}
> +

> +static int
> +hv_preprocess_cpuid_leaves(unsigned int input, struct cpu_user_regs *regs)
> +{
> +uint32_t idx;
> +struct domain*d = current->domain;
> +intextid = d->arch.hvm_domain.params[HVM_PARAM_EXTEND_HYPERVISOR];
> +
> +if (extid == 1) {
> +/*
> + * Enlightened Windows guest; need to remap and handle 
> + * leaves used by PV front-end drivers.
> + */
> +if ((input >= 0x40000000) && (input <= 0x40000005)) {
> +return (0);
> +}
> +/*
> + * PV drivers use cpuid to query the hypervisor for details. On
> + * Windows we will use the following leaves for this:
> + *
> + * 4096: VMM Sinature (corresponds to 0x40000000 on Linux)
> + * 4097: VMM Version (corresponds to 0x40000001 on Linux)
> + * 4098: Hypercall details (corresponds to 0x40000002 on Linux)
> + */

> I think it would be better to just unconditionally duplicate the
> Xen leaves at 0x40001000, regardless of whether the viridian
> enlightenments are enabled.  That would make it much easier to
> write PV drivers which work regardless of whether the
> enlightenments are enabled, and if we can do that then it might (in
> a few years' time) be possible to default to enlightments-on for
> all domains.  That isn't really an option when there's no common
> Xen API.

> This would also mean that you don't need to duplicate the whole of
> hypervisor_cpuid_leaves() in the Viridian implementation.

I will look into cleaning this up.

...
> +}
> +
> +int 
> +hyperv_do_cpu_id(unsigned int input, struct cpu_user_regs *regs)
> +{
> +uint32_t idx;
> +
> +/*
> + * hvmloader uses cpuid to set up a hypercall page; we don't want to
> + * intercept calls coming from the bootstrap (bios) code in the HVM 
> + * guest; we discriminate based on the instruction pointer.
> + */
> +if (hv_call_from_bios(regs)) {
> +/*
> + * We don't intercept this.
> + */
> +return (0);
> +}
> +
> +if (input == 0x00000001) { 
> +regs->ecx = (regs->ecx | 0x80000000);
> +return (1);
> +} 

> I think we should be doing this even when the shim is disabled.
> That bit is supposed to indicate ``running on a VMM'', not
> ``running on a Viridian-compatible VMM'', and it has the side
> effect of making Windows much more tolerant of vcpus getting
> scheduled at different rates.

Good point.

> +if (hv_preprocess_cpuid_leaves(input, regs)) {
> +return (0);
> +}
> +idx = (input - 0x40000000);
> +
> +switch (idx) {
> +case 0:
> +/*
> + * 0x40000000: Hypervisor identification. 
> + */
> +regs->eax = 0x40000005; /* For now clamp this */
> +regs->ebx = 0x65766f4e; /* "Nove" */ 
> +regs->ecx = 0x68536c6c; /* "llSh" */
> +regs->edx = 0x76486d69; /* "imHv" */ 
> +break;

> I think this is supposed to identify the hypervisor (Xen), rather
> than the Viridian implementation (Novell shim).

You are right.

...
> +case 5:
> +/*
> + * 0x40000005: Implementation limits.
> + * Currently we retrieve maximum number of vcpus and 
> + * logical processors (hardware threads) supported.
> + */
> +regs->eax = hv_get_max_vcpus_supported();
> 
> +regs->ebx = hv_get_max_lcpus_supported();
> a) ebx is marked as reserved in the specification, and
> b) it doesn't make a great deal of sense to expose the maximum
>   number of physical CPUs to a virtualised guest anyway.

I thought this is what the spec mandated.  I will look into this.

> +regs->ecx = 0; /* Reserved */
> +regs->edx = 0; /* Reserved */
> +break; 
> +
> +default:
> +/*
> + * We don't handle this leaf.
> + */
> +return (0);
> +
> +}
> +return (1);
> +}
> +
> +int
> +hyperv_do_rd_msr(uint32_t idx, struct cpu_user_regs *regs)
> +{
...
> +if (extid > 1) {
> +/*
> + * For now this is all other "Enlightened" operating systems
> + * other than Longhorn.
> + */
> +if (idx == 0x40000000) {
> +/*
> + * PV driver hypercall setup. Let xen handle this.
> + */
> +return (0);
> +}
> +if (idx == 0x40001000) {
> +idx = 0x40000000;
> +}

> Eh?  What's this doing?

At one point, I was planning to support other enlightened operating systems. I will clean it up.


> +}
> +switch (idx) {
...
> +case HV_MSR_TIMER0_COUNT:
> +timer = 0;
> +goto process_timer_count_read;
> +case HV_MSR_TIMER1_COUNT:
> +timer = 1;
> +goto process_timer_count_read;
> +case HV_MSR_TIMER2_COUNT:
> +timer = 2;
> +goto process_timer_count_read;
> +case HV_MSR_TIMER3_COUNT:
> +timer = 3;
> +process_timer_count_read:

> How much does timer support actually buy you?  It's pretty complicated
> all by itself, and it also seems to be the only reason you need SYNIC
> support.

The feature is not used currently by the win2k8 guest. I may choose to get rid of this body of code.

> +if (!hv_privilege_check(curp, HV_ACCESS_SYNC_TIMERS)) {
> +goto msr_read_error;
> +}
...
> +}
> +
> +int
> +hyperv_do_wr_msr(uint32_t idx, struct cpu_user_regs *regs)
> +{
...
> +switch (idx) {
...
> +case HV_MSR_SEOM:
> +if (!hv_privilege_check(curp, HV_ACCESS_SYNC_MSRS)) {
> +goto msr_write_error;
> +}
> +cur_vcpu->eom_msr = msr_content; 
> +hv_process_message_q(curp, cur_vcpu);
> +break;

> You don't support queued messages, so there's not much point in
> supporting the SEOM MSR either.  Also, eom_msr never seems to get
> read, except for saving it when you do a suspend.

Right.


> +case HV_MSR_TIMER3_CONFIG:
> +timer = 3;
> +process_timer_config:
> +if (!hv_privilege_check(curp, HV_ACCESS_SYNC_TIMERS)) {
> +goto msr_write_error;
> +}
> +/*
> + * Assume that the client is going to write the whole msr. 
> + */
> +if (!(msr_content & 0x9)) {
> +/*
> + * We are neither setting Auto Enable or Enable; 
> + * silently exit.
> + * Should this be considered to turn off a 
> + * timer that may be currently 
> + * active; XXXKYS: Check. For now we are 
> + * not doing anything here.
> + */
> +break;
> +}
> +if (!(((u32)(msr_content >> 16)) & 0x0000000f)) {
> +/*
> + * sintx is 0; clear the enable bit(s).
> + */
> +msr_content &= ~(0x1);
> +}

> Again, some #define's would make these magic numbers a bit more
> clear.

I will fix the code.

Regards,

K. Y


> Steven.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH][RFC] Supporting Enlightened Windows 2008Server
@ 2008-04-13 19:04 Ky Srinivasan
  0 siblings, 0 replies; 14+ messages in thread
From: Ky Srinivasan @ 2008-04-13 19:04 UTC (permalink / raw)
  To: Steven Smith; +Cc: xen-devel, Tim Deegan, Keir Fraser

Steven,

Thanks for the detailed review. I will address the issues you have raised as part of my next cleanup of these patches. I have responded to your comments in-line.

Regards,

K. Y

>>> Steven Smith <steven.smith@citrix.com> 04/10/08 7:23 AM >>> 
> I have a couple of comments on the new patches:


> Would you mind generating diffs with -p, please?  It makes them a fair
> bit easier to read.

Will do.

>  
>  static void svm_vmexit_do_cpuid(struct cpu_user_regs *regs)
>  {
> -    unsigned int eax, ebx, ecx, edx, inst_len;
> +    unsigned int eax, ebx, ecx, edx, inst_len, input;
>  
>      inst_len = __get_instruction_length(current, INSTR_CPUID, NULL);
>      if ( inst_len == 0 ) 
> ...
> @@ -968,6 +970,7 @@
>      regs->ecx = ecx;
>      regs->edx = edx;
>  
> +    hyperx_intercept_do_cpuid(input, regs);
>      __update_guest_eip(regs, inst_len);
>  }
>
  
> Would it be easier to put this bit in hvm_cpuid, or maybe
> cpuid_hypervisor_leaves?  That would avoid needing to make the same
> change to both the VMX and SVM CPUID infrastructure.

Good point. I will look into it.



> @@ -984,6 +987,10 @@
>      struct vcpu *v = current;
>      struct vmcb_struct *vmcb = v->arch.hvm_svm.vmcb;
>  
> +    if (hyperx_intercept_do_msr_read(ecx, regs))
> +    {
> +            goto done;
> +    }
>      switch ( ecx )
>      {
>      case MSR_IA32_TSC:
> 
> Likewise, it seems like that could be done better in
> rdmsr_hypervisor_regs().

Yep.

> @@ -1085,6 +1092,10 @@
>      msr_content = (u32)regs->eax | ((u64)regs->edx << 32);
>  
>      hvmtrace_msr_write(v, ecx, msr_content);
> +    if (hyperx_intercept_do_msr_write(ecx, regs))
> +    {
> +            goto done_msr_write;
> +    }
>  
>      switch ( ecx )
>      {

> Or wrmsr_hypervisor_regs().

Will do.





> @@ -2210,6 +2226,10 @@
>                  if ( a.value > HVMPTM_one_missed_tick_pending )
>                      goto param_fail;
>                  break;
> +            case HVM_PARAM_EXTEND_HYPERVISOR:
> +printk("KYS: EXTEND hypervisor id is %d\n", (int)a.value);
> +                if ((a.value == 1) && hyperv_initialize(d)) 
> +                    goto param_fail;
>              }
>              d->arch.hvm_domain.params[a.index] = a.value;
>              rc = 0;
> 

> It might be a good idea to fail with -EINVAL or something if
> a.value > 1, so that if anyone ever introduces another hypervisor
> extension it's easy for the tools to tell whether it's available or
> not.

Will do.


> Index: xen-unstable.hg/xen/arch/x86/hvm/save.c
> ===================================================================
> --- xen-unstable.hg.orig/xen/arch/x86/hvm/save.c2008-04-04 14:28:26.000000000 -0400
> +++ xen-unstable.hg/xen/arch/x86/hvm/save.c2008-04-04 14:28:44.000000000 -0400
> @@ -23,6 +23,8 @@
>  
>  #include <asm/hvm/support.h>
>  #include <public/hvm/save.h>
> +#include <public/hvm/params.h>
> +#include <asm/hvm/hvm_extensions.h>
>  
>  void arch_hvm_save(struct domain *d, struct hvm_save_header *hdr)
>  {
> 

> The only change made to this file was to add some #include's.  Why
> were they necessary?

This was necessary at some point in the past. This is the vestiges of a partial cleanup! This will be fixed.



> Index: xen-unstable.hg/xen/include/public/arch-x86/hvm/save.h
> ===================================================================
> --- xen-unstable.hg.orig/xen/include/public/arch-x86/hvm/save.h2008-04-04 14:28:26.000000000 -0400
> +++ xen-unstable.hg/xen/include/public/arch-x86/hvm/save.h2008-04-04 14:28:44.000000000 -0400
> @@ -38,7 +38,7 @@
>      uint32_t version;           /* File format version */
>      uint64_t changeset;         /* Version of Xen that saved this file */
>      uint32_t cpuid;             /* CPUID[0x01][%eax] on the saving machine */
> -    uint32_t pad0;
> +    uint32_t pad0;/* extension ID */
>  };
>  
>  DECLARE_HVM_SAVE_TYPE(HEADER, 1, struct hvm_save_header);
> 
> I think it might be a good idea to give the field a more
> descriptive name than ``pad0'' if it's being used for something
> other than padding.

This again is the result of a partial cleanup. I was stashing away the extension ID and Tim rightly pointed out that we had enough state and did not need the extension ID in the save record. This will be cleaned up.

> It might be an even better idea to move the ID out of the header
> entirely and put it in its own HVM_SAVE_TYPE.  You already have
> hvm_hyperv_dom; can you use that to identify the presence or
> absence of the extensions?

I will look into this.




> --- /dev/null1970-01-01 00:00:00.000000000 +0000
> +++ xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_errno.h2008-03-26 13:56:39.000000000 -0400
> @@ -0,0 +1,4 @@
...
> +/*
> + * hv_errno.h
> + * Error codes for the  Novell Shim.
> 
>> These are just the Viridian error codes, aren't they, rather than
>> something specific to the Novell shim?

Yes. This will be fixed.


> + *
> + * Engineering Contact: K. Y. Srinivasan
> + */
> +
> +#ifndef HV_ERRNO_H
> +#define HV_ERRNO_H
> +
> +#define HV_STATUS_SUCCESS0x0000
> +#define HV_STATUS_INVALID_HYPERCALL_CODE0x0002
> +#define HV_STATUS_INVALID_HYPERCALL_INPUT0x0003
> +#define HV_STATUS_INVALID_ALIGNMENT0x0004
> +#define HV_STATUS_INVALID_PARAMETER0x0005
...
> +#endif 
> Index: xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_hypercall.c
> ===================================================================
> --- /dev/null1970-01-01 00:00:00.000000000 +0000
> +++ xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_hypercall.c2008-03-26 14:18:32.000000000 -0400
> @@ -0,0 +1,13 @@
...
> +void hv_collect_stats(int event, hv_vcpu_stats_t *statsp)
> +{
> +switch (event) {
> +case HV_CSWITCH:
> +statsp->num_switches++;
> +return;
> +case HV_FLUSH_VA:
> +statsp->num_flushes++;
> +return;
> +case HV_FLUSH_RANGE:
> +statsp->num_flush_ranges++;
> +return;
> That looks a bit pointless.  Would it not be easier to just inline
> this whole function?

Now that, I have dropped all the TLB flush enlightenments, I will clean this up.


...
> +}
...
> +static int
> +hv_get_vp_registers(paddr_t input, paddr_t output)
> +{
> +hv_vcpu_t        *vcpup, *targetp;
> +hv_partition_t   *curp = hv_get_current_partition();
> +get_vp_registers_input_t*inbuf;
> +get_vp_registers_output_t*outbuf;
> +struct vcpu_guest_context*vcpu_ctx;
> +u32*reg_indexp;
> +get_vp_registers_output_t*out_regp;
> +u32num_output_bytes = 0;
> +
> +        vcpup = &curp->vcpu_state[hv_get_current_vcpu_index()];
> +inbuf = vcpup->input_buffer;
> +outbuf = vcpup->output_buffer;
> +out_regp = outbuf;
> +/*
> + * Copy the input data to the per-cpu input buffer.
> + * This may be an overkill; obviously it is better to only
> + * copy what we need. XXXKYS: Check with Mike.
> + */
> Who's Mike?  What did he say?

Mike is  the MSFT contact. 




> +if (hvm_copy_from_guest_phys(inbuf, input, PAGE_SIZE)) {
> +return (HV_STATUS_INVALID_ALIGNMENT);
> +}
...
> +/*
> + * Now that we have the register state; select what we want and
> + * populate the output buffer.
> + */
> +reg_indexp = &inbuf->reg_index;
> +while (*reg_indexp != 0) {
> +switch(*reg_indexp) {
> +/*
> + * XXXKYS: need mapping code here; populate
> + * outbuf.
> + */
> +panic("hv_get_vp_registers not supported\n");

> Yeah, that's not really good enough.  It would be better to just
> not support this hypercall at all than have an implementation which
> always crashes Xen.

Early on, I decided what Hypercalls I would support; and from this list I wanted 
to only support those that were used by windows 2008 server. I  had put in a bunch of panics to trap all the hypercalls that the guest would invoke. This hypercall is not used. Thus the implementation is incomplete. I will clean this up.

> +}
> +reg_indexp++;
> +out_regp++ ;/*128 bit registers */
> +num_output_bytes +=16;
> +if ((char *)reg_indexp > ((char *)inbuf + PAGE_SIZE)) {
> +/*
> + *input list not reminated correctly; bail out.
> + */
> +panic("hv_get_vp_registers:input list not terminated\n"); 
> +break;

> It would be nice if this caused the hypercall to fail, rather than
> crashing Xen.

Will be cleaned up.



> +}
> +}
> +if (hvm_copy_to_guest_phys(output, outbuf, num_output_bytes)) {
> +/* Some problem copying data out*/
> +panic("hv_get_vp_registers:copyout problem\n");
 
> And again.

Will be cleaned up.


> +}
> +xfree(vcpu_ctx);
> +return (HV_STATUS_SUCCESS);
> +}

> Has this hypercall had any testing at all?  If it hasn't, then what
> is it doing here?

As I noted earlier, win2k8 does not currently use this API. I will clean this up.



> +
> +static int
> +hv_set_vp_registers(paddr_t input, paddr_t output)

> Again, I don't believe this function has ever been called.  That
> suggests that this may not be a high-value optimisation suitable
> for inclusion in an initial version.

As I noted earlier, win2k8 does not currently use this API. I will clean this up.



...
> 
> +static int
> +hv_switch_va(paddr_t input)
> +{
> +hv_partition_t   *curp = hv_get_current_partition();
> +        hv_vcpu_t *vcpup = &curp->vcpu_state[hv_get_current_vcpu_index()];
> +
> +/*
> + * XXXKYS: the spec sys the asID is passed via memory at offset 0 of 
> + * the page whose GPA is in the input register. However, it appears 
> + * the current build of longhorn (longhorn-2007-02-06-x86_64-fv-02)
> + * passes the asID in the input register instead. Need to check if 
> + * future builds do this.
> + */
> +hvm_set_cr3(input); 
> +HV_STATS_COLLECT(HV_CSWITCH, &vcpup->stats);
> +return (HV_STATUS_SUCCESS);
> +}

> Do you have any evidence that this is actually faster than just
> doing the CR3 write in the guest?  It's a win on real Viridian
> because you avoid blowing the shadow page table cache, but I don't
> think that applies on Xen's current shadow page table
> implementation.

As I noted earlier, initially I identified all the hypercalls that made for a win2k8 guest on our platform. From this list I implemented those that the guest actually invoked. I implemented all the hypercalls that the guest was invoking without regard to the performance value of the enlightenment. I agree with you that this may be more important on Veridian than on our platform. 



> +static int
> +hv_flush_va(paddr_t input)
> +{
...
> +/*
> + * Now operate on what we are given
> + * XXXKYS: For now we are ignoring as_id and fushGlobal flag.
> + * May have to revisit this. But first stash away the processed 
> + * parameters for subsequent use.
> + */
> +flush_argp->as_handle = as_id;
> +flush_argp->flags = flush_global;
> +flush_argp->v_mask = vcpu_mask;
> +
> +
> +ret_val = hv_build_hcall_retval(HV_STATUS_SUCCESS, 0);
> +hv_set_syscall_retval(guest_cpu_user_regs(),
> +                                   curp->long_mode_guest,
> +                                   ret_val);
> +HV_STATS_COLLECT(HV_FLUSH_VA_STAT, &cur_vcpup->stats);
> +panic("hv_flush_va not supported\n"); 

> So if this hypercall is ever correctly used by the guest, Xen will
> crash?

In the original patches that I posted, this hypercall was implemented. As part of my first round of cleanup, I removed the support for the TLB  flush hypercalls. As part of this cleanup, I also disabled TLB related enlightenments. This will be cleaned up.

> +return (HV_STATUS_SUCCESS);
> +}
> +
> +static int
> +hv_flush_va_range(paddr_t input, unsigned short start_index, 
> +unsigned short rep_count, unsigned short *reps_done)
> +{
...
> +ret_val = hv_build_hcall_retval(HV_STATUS_SUCCESS, rep_count);
> +hv_set_syscall_retval(guest_cpu_user_regs(),
> +                                   curp->long_mode_guest,
> +                                   ret_val);
> +
> +
> +HV_STATS_COLLECT(HV_FLUSH_RANGE, &cur_vcpup->stats);
> +panic("hv_flush_vaRange not supported\n");
 
> And again.

Will be cleaned up.


> +return (HV_STATUS_SUCCESS);
> +}
> +
> +void
> +hv_handle_hypercall(u64 opcode, u64 input, u64 output, 
> +  u64 *ret_val)

> Would it be easier to just return the return value?

Perhaps.


> +{
...
> +verb = (short)(opcode & 0xffff);
> +rep_count = (short)((opcode >>32) & 0xfff);
> +start_index = (short)((opcode >> 48) & 0xfff);
> +switch (verb) {
> +case HV_CREATE_PARTITION:
> +/*
> + * Xen only allows dom0 to create domains.
> + */
> +*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
> +return;
> +case HV_INITIALIZE_PARTITION:
> +/*
> + * We don't support this.
> + */
> +*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
> +return;

> Would it be easier to just make HV_STATUS_ACCESS_DENIED the
> default?  Actually, my reading of the spec is that you should
> return HV_STATUS_INVALID_HYPERCALL_CODE if the guest makes a
> hypercall which you don't support, but I can't imagine it makes a
> great deal of difference.

Perhaps not. 


...
> +case HV_GET_PARTITION_ID:
> +if (!hv_privilege_check(curp, HV_ACCESS_PARTITION_ID)) {
> +*ret_val = 
> +hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
> +return;
> +}
> +partition_id = (u64)current->domain->domain_id;
> +if (hvm_copy_to_guest_phys(output, 
> +&partition_id, 8)) {
> +/*
> + * Invalid output area.
> + */
> +*ret_val = 
> +hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);

> I think you mean HV_STATUS_INVALID_ALIGNMENT.

Agreed.


> +return;
> +}
> +*ret_val = hv_build_hcall_retval(HV_STATUS_SUCCESS, 0);
> +return;
> 
> +case HV_GET_VP_REGISTERS:
> +*ret_val = hv_build_hcall_retval(
> +hv_get_vp_registers(input, output), 0);
> +return;
> +case HV_SET_VP_REGISTERS:
> +*ret_val = hv_build_hcall_retval(
> +hv_set_vp_registers(input, output), 0);
...
> +case HV_FLUSH_VA:
> +*ret_val = 
> +hv_build_hcall_retval(hv_flush_va(input), 0);
> +return;
> +case HV_FLUSH_VA_LIST:
> +value  = hv_flush_va_range(input, start_index, 
> +rep_count, &reps_done);
> +*ret_val = hv_build_hcall_retval(value, reps_done);  
> +return;

> As mentioned above, your implementations of these hypercalls are
> really quite broken.

These hypercalls are not invoked by the guest and that is the reason they are broken. I will clean it up though.



...
> 
> +}
> +}
> Index: xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_hypercall.h
> ===================================================================
> --- /dev/null1970-01-01 00:00:00.000000000 +0000
> +++ xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_hypercall.h2008-03-26 13:56:39.000000000 -0400
> @@ -0,0 +1,21 @@
> +
> +/*
> + * nshypercall.h
> + * Memory layouts for the various hypercalls supported. 
> + *
> + * Engineering Contact: K. Y. Srinivasan
> + */
> +
> +#ifndef HV_HYPERCALL_H
> +#define HV_HYPERCALL_H
> +
> +#include <xen/cpumask.h>
> +
> +
> +typedef struct get_vp_registers_input {
> +u64partition_id;
> +u64vp_index;
> +u32reg_index;
> +u32pad1;
> +u64pad2;
> +} get_vp_registers_input_t;

> My reading of the 0.83 specification is that this should have been:

> typedef struct get_vp_registers_input {
> u64 partition_id;
> u32 vp_index;
> u32 pad;
> u32 reg_index[0];
> } get_vp_registers_input_t;

> Is my copy of the specification out of date?

Nope. 



> +typedef struct set_vp_register_spec {
> +u32reg_name;
> +u32pad;
> +u64pad1;
> +u64low_value;
> +u64high_value;
> +} set_vp_register_spec_t;
> +
> +typedef struct set_vp_registers_input {
> +u64partition_id;
> +u64vp_index;
> +set_vp_register_spec_treg_spec;
> +} set_vp_registers_input_t;

> Again, the 0.83 spec says that vp_index should be a u32, and should
> be followed by a u32 MBZ pad.
Right.

> +
> +
> +typedef struct flush_va {
> +u64as_handle;
> +u64flags;
> +union  {
> +u64processor_mask;
> +cpumask_t vcpu_mask;
> +} proc_mask;
> +#define p_mask proc_mask.processor_mask
> +#define v_maskproc_mask.vcpu_mask
> +u64gva;
> +} flush_va_t;

> Is there something wrong with anonymous unions?



> +/*
> + * Hypercall verbs.
> + */
> +
> +#define HV_CREATE_PARTITION 0x0010
> +#define HV_INITIALIZE_PARTITION 0x0011
> +#define HV_DELETE_PARTITION0x0014

> Is it really necesary to include #define's for hypercalls which we
> don't support and will probably never support?

No; I will clean this file up.


> Index: xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_intercept.c
> ===================================================================
> --- /dev/null1970-01-01 00:00:00.000000000 +0000
> +++ xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_intercept.c2008-03-26 14:26:38.000000000 -0400
> @@ -0,0 +1,0 @@
> 
> +/*
> + * Local includes; extension specific.
> + */
> +#include "hv_errno.h"
> +#include "hv_shim.h"
> +
> +
> +/*
> + * Implement the Hyperv Shim.
> + */
> +
> +extern struct cpuinfo_x86 boot_cpu_data;

> Is there something wrong with processor.h?

This will be fixed.

> +extern struct hvm_mmio_handler vlapic_mmio_handler;

> We should probably move that to a header file somewhere.

Will do.

> +static inline void *
> +get_virt_from_gmfn(struct domain *d, unsigned long gmfn)
> +{
> +unsigned long mfn = gmfn_to_mfn(d, gmfn);
> +if (mfn == INVALID_MFN) {
> +return (NULL);
> +}
> +return (map_domain_page_global(mfn));
> +}

> Is it really necessary to use map_domain_page_global() here?
> map_domain_page() is usually quite a bit faster.  To be honest, I'm
> not convinced that this wrapper function has any real value anyway.

I will look into this.


> +static inline unsigned long
> +get_mfn_from_gva(unsigned long va)
> +{
> +uint32_t pfec = PFEC_page_present;
> +unsigned long gfn;
> +gfn = paging_gva_to_gfn(current, va, &pfec);
> +return (gmfn_to_mfn((current->domain), gfn));
> +}

> That's only valid if you can guarantee that the resulting MFN will
> never be written to.

Yes. I use this to figure out if the hypercall is from the PV drivers in the guest or the hyparcall from the guest.


> +static inline void 
> +hv_write_hypercall_msr(hv_partition_t *curp,
> +  hv_vcpu_t*cur_vcpu,
> +  u64msr_content)
> +{
...
> +hv_hypercall_page_initialize(hypercall_page, curp);
> +curp->hypercall_mfn = gmfn_to_mfn(d, gmfn);
> +#ifdef CONFIG_DOMAIN_PAGE
> +unmap_domain_page_global(hypercall_page);
> +#endif

> unmap_domain_page_global() is #define'd to nothing #ifndef
> CONFIG_DOMAIN_PAGE, so you don't need the #ifdef.

Yes.


> +curp->hypercall_msr = msr_content;
> +spin_unlock(&curp->lock);
> +cur_vcpu->flags |= HV_VCPU_UP;
> +}
> +
> +
> +static inline void hv_write_sx_msr(uint32_t idx, hv_partition_t *curp,
> +  hv_vcpu_t*cur_vcpu,
> +  u64msr_content)
> +{
...
> +sx_page = get_virt_from_gmfn(d, gmfn);
> +if (sx_page == NULL) {
> +/*
> + * The guest specified a bogus GPA; inject a GP fault
> + * into the guest.
> + */
> +hv_inject_exception(TRAP_gp_fault);

> The spec says tha SIEFP can be outside the physical address space,
> and you're just supposed to deal with it.  The guest will be unable
> to access the frame, but the MSR write is supposed to succeed
> (section 14.6.3 of version 0.83).  Likewise the SIMP (14.6.4).

You are right. As it turns out, the win2k8 guest on our platform never runs into this situation. I will fix this.


> +return;
> +}
> +switch (idx) {
> +case HV_MSR_SIEFP:
> +hv_init_event_page(sx_page);
> +cur_vcpu->siefp_msr = msr_content; 
> +cur_vcpu->sief_page = sx_page; 
> +break;
> +case HV_MSR_SIMP:
> +hv_init_message_page(sx_page);
> +cur_vcpu->simp_msr = msr_content;
> +cur_vcpu->sim_page = sx_page;
> +break;
> +}
> +
> +}
> +

> +
> +/*
> + * Time this domain booted.
> + */
> +s_time_t hv_domain_boot_time;
> Time which domain booted?  What if you have two domains which both
> use the HyperV extensions?

Oops! Will fix this.



> +static inline void
> +hv_inject_exception(int trap)
> +{
> +hvm_funcs.inject_exception(trap, 0, 0);
> +}
> hvm_inject_exception() might be a better choice.

Ok.

> +static inline void 
> +hv_set_partition_privileges(hv_partition_t *hvpp)
> +{
> +/*
> + * This is based on the hypervisor spec under section 5.2.3. 
> + */
> +hvpp->privileges = 0x000000020000007f;
> +}
> So you allow AccessVpRuntime, AccessTimeCounters, AccessSynicMsrs,
> AccessSyntheticTimers, AccessApicMsrs, AccessHypercallMsrs,
> AccessVpIndex, and AccessSelfPartitionId?  That sounds like a
> pretty reasonable set, but it'd be nice if you'd documented the
> fact rather than just dropping in a magic number.  Something like
> this, maybe:

> hvpp->privileges = HV_PRIV_ACCESS_VP_RUNTIME |
>   HV_PRIV_ACCESS_TIME_COUNTERS |
>                  HV_PRIV_ACCESS_SYNIC_MSRS |
                   ...

Ok.

> +static inline u32
> +hv_get_recommendations(void)
> +{
> +/*
> + *For now we recommend all the features. Need to validate.
> + */
> +if ( paging_mode_hap(current->domain)) {
> +/*
> + * If HAP is enabled; the guest should not use TLB flush
> + * related enlightenments.
> + */
> +return (0x19);
> +} else {
> +/*
> + * For now disable TLB flush enlightenments.
> + */
> +return (0x19); 
> +}
> +}

> So you recommend the use of hypercalls for address switches, MSRs
> for APIC registers, and MSRs for rebooting?  The first two make
> sense, but I'm not so sure that enlightened reboot is a useful
> optimisation.

Originally, I had implemented TLB flush enlightenments as well. Based on some of the performance analysis we are currently doing, I may re-introduce them. As I noted earlier, I just wanted to implement all the features that made sense for the guest on our platform - hence I implemented reboot as well!

> Again, it would have been useful if the actual set of features was
> documented rather than just using a magic number.

Yep.

> +static inline void 
> +hv_set_partition_features(hv_partition_t *hvpp)
> +{
> +hvpp->supported_features = 0x1f;
> +}
>i.e. VP_RUNTIME | PARTITION_REF_COUNT | SYNIC MSRS | SYNTHETIC_TIMERS |
>     APIC_MSRS | HYPERCALL_MSRS.

> It's kind of surprising that you recommend guests use the reboot MSRs,
> but then don't claim to support it.  Was that deliberate?

> Actually, looking some more, it doesn't look like you ever use the
> supported_features field, and have instead hardcoded the value to
> 0xff everywhere you need it.

Will fix this.

> +
> +static inline u16 
> +hv_get_guest_major(void)
> +{
> +//KYS: Check!
> +return (0);
> +}
> +static inline u16
> +hv_get_guest_minor(void)
> +{
> +//KYS: Check!
> +return (0);
> +}
> +static inline u32
> +hv_get_guest_service_pack(void)
> +{
> +//KYS: Check!
> +return (0);
> +}
> Err... are these supposed to return the guest major/minor/sp
> reported in the identification MSR?

Yep. Was not quite sure what numbers made sense!



> +static inline void
> +hv_read_icr(u64 *icr_content)
...
> +static inline void
> +hv_read_tpr(u64 *tpr_content)
...
> +static inline void
> +hv_write_eoi(u64 msr_content)
...
> +static inline void
> +hv_write_icr(u64 msr_content)
...
> +static inline void
> +hv_write_tpr(u64 msr_content)
...

> These all have really weird implementations and lots of gratuitous
> use of hardcoded magic numbers (have a look at apicdef.h).

Will fix it.


> +static void
> +hv_timeout_handler(void *arg)
> +{
...
> +/*
> + * First post the message and then optionally deal with the 
> + * interrupt notification.
> + */
> +if (cur_vcpu->sim_page == NULL) {
> +panic("Novell Shim: Sim page not setup\n");
> +}

> So if a guest requests a timeout before they've set up the sim
> page, Xen crashes?  What if they clear the sim page after setting
> up a timeout, but before it fires?

I will fix this. As I noted earlier, win2k8 currently does not use any of these features.

...
> +}
> +

> +int
> +hyperv_do_hypercall(struct cpu_user_regs *pregs)
> +{
> +hv_partition_t*curp = hv_get_current_partition();
> +hv_vcpu_t        *vcpup;
> +intlong_mode_guest = curp->long_mode_guest;
> +unsigned long hypercall_mfn;
> +unsigned long gmfn;
> +gmfn = (curp->hypercall_msr >> 12);
> +
> +hypercall_mfn = get_mfn_from_gva(pregs->eip);

> That's exciting.  Do you expect that guests will make hypercalls
> from anywhere other than the Viridian hypercall page?

We support PV drivers for win2k8 guests. Given that both HyperV and Xen use the same hypercall numbers to implement different functionality, I needed a way to differentiate HyperV hypercalls. I use the mfn of the hypercall page to figure out who should handle the hypercall.

...

> +int
> +hyperv_vcpu_initialize(struct vcpu *v)
> +{
...
> +/*
> + * Setup the input page for handling hypercalls.
> + *
> + */
> +vcpup->input_buffer_page = 
> +alloc_domheap_page(NULL);
...
> +vcpup->input_buffer =
> +get_virt_from_page_ptr(vcpup->input_buffer_page);
...
> +vcpup->output_buffer_page = 
> +alloc_domheap_page(NULL);
...
> +vcpup->output_buffer =
> +get_virt_from_page_ptr(vcpup->output_buffer_page);

> What exactly are these used for?  The only place I can see it
> referenced are immediately before a panic(), which seems a bit
> pointless.

These were used in implementing TLB flush enlightenments. I got rid of those hypercalls in the current patch set. I will clean these up.


> +vcpup->xen_vcpu = v; 
> +
> +return (0);
> +}
> +

> +
> +static int 
> +hyperv_dom_restore(struct domain *d, hvm_domain_context_t *h)
> +{
> +struct hvm_hyperv_dom ctxt;
> +hv_partition_t*curp;
> +
> +if ( hvm_load_entry(HYPERV_DOM, h, &ctxt) != 0 )
> +        return -22;
> +d->arch.hvm_domain.params[HVM_PARAM_EXTEND_HYPERVISOR] = ctxt.ext_id;
> +if ((ctxt.ext_id == 0) || (ctxt.ext_id > 1)) {
> +return 0;
> +}
> +if (hyperv_initialize(d)) {
> +return -22;
> +}
> ????  Did you mean -EINVAL there?

Yep.

> +curp = d->arch.hvm_domain.hyperv_handle;
> +
> +curp->guest_id_msr = ctxt.guestid_msr;
> +curp->hypercall_msr = ctxt.hypercall_msr;
> +curp->long_mode_guest = ctxt.long_mode;
> +curp->hypercall_mfn =
> +gmfn_to_mfn(d, (ctxt.hypercall_msr >> 12));
> +
> +return 0; 
> +}
> +

> +static int
> +hv_preprocess_cpuid_leaves(unsigned int input, struct cpu_user_regs *regs)
> +{
> +uint32_t idx;
> +struct domain*d = current->domain;
> +intextid = d->arch.hvm_domain.params[HVM_PARAM_EXTEND_HYPERVISOR];
> +
> +if (extid == 1) {
> +/*
> + * Enlightened Windows guest; need to remap and handle 
> + * leaves used by PV front-end drivers.
> + */
> +if ((input >= 0x40000000) && (input <= 0x40000005)) {
> +return (0);
> +}
> +/*
> + * PV drivers use cpuid to query the hypervisor for details. On
> + * Windows we will use the following leaves for this:
> + *
> + * 4096: VMM Sinature (corresponds to 0x40000000 on Linux)
> + * 4097: VMM Version (corresponds to 0x40000001 on Linux)
> + * 4098: Hypercall details (corresponds to 0x40000002 on Linux)
> + */

> I think it would be better to just unconditionally duplicate the
> Xen leaves at 0x40001000, regardless of whether the viridian
> enlightenments are enabled.  That would make it much easier to
> write PV drivers which work regardless of whether the
> enlightenments are enabled, and if we can do that then it might (in
> a few years' time) be possible to default to enlightments-on for
> all domains.  That isn't really an option when there's no common
> Xen API.

> This would also mean that you don't need to duplicate the whole of
> hypervisor_cpuid_leaves() in the Viridian implementation.

I will look into cleaning this up.

...
> +}
> +
> +int 
> +hyperv_do_cpu_id(unsigned int input, struct cpu_user_regs *regs)
> +{
> +uint32_t idx;
> +
> +/*
> + * hvmloader uses cpuid to set up a hypercall page; we don't want to
> + * intercept calls coming from the bootstrap (bios) code in the HVM 
> + * guest; we discriminate based on the instruction pointer.
> + */
> +if (hv_call_from_bios(regs)) {
> +/*
> + * We don't intercept this.
> + */
> +return (0);
> +}
> +
> +if (input == 0x00000001) { 
> +regs->ecx = (regs->ecx | 0x80000000);
> +return (1);
> +} 

> I think we should be doing this even when the shim is disabled.
> That bit is supposed to indicate ``running on a VMM'', not
> ``running on a Viridian-compatible VMM'', and it has the side
> effect of making Windows much more tolerant of vcpus getting
> scheduled at different rates.

Good point.

> +if (hv_preprocess_cpuid_leaves(input, regs)) {
> +return (0);
> +}
> +idx = (input - 0x40000000);
> +
> +switch (idx) {
> +case 0:
> +/*
> + * 0x40000000: Hypervisor identification. 
> + */
> +regs->eax = 0x40000005; /* For now clamp this */
> +regs->ebx = 0x65766f4e; /* "Nove" */ 
> +regs->ecx = 0x68536c6c; /* "llSh" */
> +regs->edx = 0x76486d69; /* "imHv" */ 
> +break;

> I think this is supposed to identify the hypervisor (Xen), rather
> than the Viridian implementation (Novell shim).

You are right.

...
> +case 5:
> +/*
> + * 0x40000005: Implementation limits.
> + * Currently we retrieve maximum number of vcpus and 
> + * logical processors (hardware threads) supported.
> + */
> +regs->eax = hv_get_max_vcpus_supported();
> 
> +regs->ebx = hv_get_max_lcpus_supported();
> a) ebx is marked as reserved in the specification, and
> b) it doesn't make a great deal of sense to expose the maximum
>   number of physical CPUs to a virtualised guest anyway.

I thought this is what the spec mandated.  I will look into this.

> +regs->ecx = 0; /* Reserved */
> +regs->edx = 0; /* Reserved */
> +break; 
> +
> +default:
> +/*
> + * We don't handle this leaf.
> + */
> +return (0);
> +
> +}
> +return (1);
> +}
> +
> +int
> +hyperv_do_rd_msr(uint32_t idx, struct cpu_user_regs *regs)
> +{
...
> +if (extid > 1) {
> +/*
> + * For now this is all other "Enlightened" operating systems
> + * other than Longhorn.
> + */
> +if (idx == 0x40000000) {
> +/*
> + * PV driver hypercall setup. Let xen handle this.
> + */
> +return (0);
> +}
> +if (idx == 0x40001000) {
> +idx = 0x40000000;
> +}

> Eh?  What's this doing?

At one point, I was planning to support other enlightened operating systems. I will clean it up.


> +}
> +switch (idx) {
...
> +case HV_MSR_TIMER0_COUNT:
> +timer = 0;
> +goto process_timer_count_read;
> +case HV_MSR_TIMER1_COUNT:
> +timer = 1;
> +goto process_timer_count_read;
> +case HV_MSR_TIMER2_COUNT:
> +timer = 2;
> +goto process_timer_count_read;
> +case HV_MSR_TIMER3_COUNT:
> +timer = 3;
> +process_timer_count_read:

> How much does timer support actually buy you?  It's pretty complicated
> all by itself, and it also seems to be the only reason you need SYNIC
> support.

The feature is not used currently by the win2k8 guest. I may choose to get rid of this body of code.

> +if (!hv_privilege_check(curp, HV_ACCESS_SYNC_TIMERS)) {
> +goto msr_read_error;
> +}
...
> +}
> +
> +int
> +hyperv_do_wr_msr(uint32_t idx, struct cpu_user_regs *regs)
> +{
...
> +switch (idx) {
...
> +case HV_MSR_SEOM:
> +if (!hv_privilege_check(curp, HV_ACCESS_SYNC_MSRS)) {
> +goto msr_write_error;
> +}
> +cur_vcpu->eom_msr = msr_content; 
> +hv_process_message_q(curp, cur_vcpu);
> +break;

> You don't support queued messages, so there's not much point in
> supporting the SEOM MSR either.  Also, eom_msr never seems to get
> read, except for saving it when you do a suspend.

Right.


> +case HV_MSR_TIMER3_CONFIG:
> +timer = 3;
> +process_timer_config:
> +if (!hv_privilege_check(curp, HV_ACCESS_SYNC_TIMERS)) {
> +goto msr_write_error;
> +}
> +/*
> + * Assume that the client is going to write the whole msr. 
> + */
> +if (!(msr_content & 0x9)) {
> +/*
> + * We are neither setting Auto Enable or Enable; 
> + * silently exit.
> + * Should this be considered to turn off a 
> + * timer that may be currently 
> + * active; XXXKYS: Check. For now we are 
> + * not doing anything here.
> + */
> +break;
> +}
> +if (!(((u32)(msr_content >> 16)) & 0x0000000f)) {
> +/*
> + * sintx is 0; clear the enable bit(s).
> + */
> +msr_content &= ~(0x1);
> +}

> Again, some #define's would make these magic numbers a bit more
> clear.

I will fix the code.

Regards,

K. Y


> Steven.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH][RFC] Supporting Enlightened Windows 2008Server
@ 2008-04-13 21:50 Ky Srinivasan
  0 siblings, 0 replies; 14+ messages in thread
From: Ky Srinivasan @ 2008-04-13 21:50 UTC (permalink / raw)
  To: Steven Smith; +Cc: xen-devel, Tim Deegan, Keir Fraser

Steven,

Thanks for the detailed review. I will address the issues you have raised as part of my next cleanup of these patches. I have responded to your comments in-line.

Regards,

K. Y

>>> Steven Smith <steven.smith@citrix.com> 04/10/08 7:23 AM >>> 
> I have a couple of comments on the new patches:


> Would you mind generating diffs with -p, please?  It makes them a fair
> bit easier to read.

Will do.

>  
>  static void svm_vmexit_do_cpuid(struct cpu_user_regs *regs)
>  {
> -    unsigned int eax, ebx, ecx, edx, inst_len;
> +    unsigned int eax, ebx, ecx, edx, inst_len, input;
>  
>      inst_len = __get_instruction_length(current, INSTR_CPUID, NULL);
>      if ( inst_len == 0 ) 
> ...
> @@ -968,6 +970,7 @@
>      regs->ecx = ecx;
>      regs->edx = edx;
>  
> +    hyperx_intercept_do_cpuid(input, regs);
>      __update_guest_eip(regs, inst_len);
>  }
>
  
> Would it be easier to put this bit in hvm_cpuid, or maybe
> cpuid_hypervisor_leaves?  That would avoid needing to make the same
> change to both the VMX and SVM CPUID infrastructure.

Good point. I will look into it.



> @@ -984,6 +987,10 @@
>      struct vcpu *v = current;
>      struct vmcb_struct *vmcb = v->arch.hvm_svm.vmcb;
>  
> +    if (hyperx_intercept_do_msr_read(ecx, regs))
> +    {
> +            goto done;
> +    }
>      switch ( ecx )
>      {
>      case MSR_IA32_TSC:
> 
> Likewise, it seems like that could be done better in
> rdmsr_hypervisor_regs().

Yep.

> @@ -1085,6 +1092,10 @@
>      msr_content = (u32)regs->eax | ((u64)regs->edx << 32);
>  
>      hvmtrace_msr_write(v, ecx, msr_content);
> +    if (hyperx_intercept_do_msr_write(ecx, regs))
> +    {
> +            goto done_msr_write;
> +    }
>  
>      switch ( ecx )
>      {

> Or wrmsr_hypervisor_regs().

Will do.





> @@ -2210,6 +2226,10 @@
>                  if ( a.value > HVMPTM_one_missed_tick_pending )
>                      goto param_fail;
>                  break;
> +            case HVM_PARAM_EXTEND_HYPERVISOR:
> +printk("KYS: EXTEND hypervisor id is %d\n", (int)a.value);
> +                if ((a.value == 1) && hyperv_initialize(d)) 
> +                    goto param_fail;
>              }
>              d->arch.hvm_domain.params[a.index] = a.value;
>              rc = 0;
> 

> It might be a good idea to fail with -EINVAL or something if
> a.value > 1, so that if anyone ever introduces another hypervisor
> extension it's easy for the tools to tell whether it's available or
> not.

Will do.


> Index: xen-unstable.hg/xen/arch/x86/hvm/save.c
> ===================================================================
> --- xen-unstable.hg.orig/xen/arch/x86/hvm/save.c2008-04-04 14:28:26.000000000 -0400
> +++ xen-unstable.hg/xen/arch/x86/hvm/save.c2008-04-04 14:28:44.000000000 -0400
> @@ -23,6 +23,8 @@
>  
>  #include <asm/hvm/support.h>
>  #include <public/hvm/save.h>
> +#include <public/hvm/params.h>
> +#include <asm/hvm/hvm_extensions.h>
>  
>  void arch_hvm_save(struct domain *d, struct hvm_save_header *hdr)
>  {
> 

> The only change made to this file was to add some #include's.  Why
> were they necessary?

This was necessary at some point in the past. This is the vestiges of a partial cleanup! This will be fixed.



> Index: xen-unstable.hg/xen/include/public/arch-x86/hvm/save.h
> ===================================================================
> --- xen-unstable.hg.orig/xen/include/public/arch-x86/hvm/save.h2008-04-04 14:28:26.000000000 -0400
> +++ xen-unstable.hg/xen/include/public/arch-x86/hvm/save.h2008-04-04 14:28:44.000000000 -0400
> @@ -38,7 +38,7 @@
>      uint32_t version;           /* File format version */
>      uint64_t changeset;         /* Version of Xen that saved this file */
>      uint32_t cpuid;             /* CPUID[0x01][%eax] on the saving machine */
> -    uint32_t pad0;
> +    uint32_t pad0;/* extension ID */
>  };
>  
>  DECLARE_HVM_SAVE_TYPE(HEADER, 1, struct hvm_save_header);
> 
> I think it might be a good idea to give the field a more
> descriptive name than ``pad0'' if it's being used for something
> other than padding.

This again is the result of a partial cleanup. I was stashing away the extension ID and Tim rightly pointed out that we had enough state and did not need the extension ID in the save record. This will be cleaned up.

> It might be an even better idea to move the ID out of the header
> entirely and put it in its own HVM_SAVE_TYPE.  You already have
> hvm_hyperv_dom; can you use that to identify the presence or
> absence of the extensions?

I will look into this.




> --- /dev/null1970-01-01 00:00:00.000000000 +0000
> +++ xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_errno.h2008-03-26 13:56:39.000000000 -0400
> @@ -0,0 +1,4 @@
...
> +/*
> + * hv_errno.h
> + * Error codes for the  Novell Shim.
> 
>> These are just the Viridian error codes, aren't they, rather than
>> something specific to the Novell shim?

Yes. This will be fixed.


> + *
> + * Engineering Contact: K. Y. Srinivasan
> + */
> +
> +#ifndef HV_ERRNO_H
> +#define HV_ERRNO_H
> +
> +#define HV_STATUS_SUCCESS0x0000
> +#define HV_STATUS_INVALID_HYPERCALL_CODE0x0002
> +#define HV_STATUS_INVALID_HYPERCALL_INPUT0x0003
> +#define HV_STATUS_INVALID_ALIGNMENT0x0004
> +#define HV_STATUS_INVALID_PARAMETER0x0005
...
> +#endif 
> Index: xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_hypercall.c
> ===================================================================
> --- /dev/null1970-01-01 00:00:00.000000000 +0000
> +++ xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_hypercall.c2008-03-26 14:18:32.000000000 -0400
> @@ -0,0 +1,13 @@
...
> +void hv_collect_stats(int event, hv_vcpu_stats_t *statsp)
> +{
> +switch (event) {
> +case HV_CSWITCH:
> +statsp->num_switches++;
> +return;
> +case HV_FLUSH_VA:
> +statsp->num_flushes++;
> +return;
> +case HV_FLUSH_RANGE:
> +statsp->num_flush_ranges++;
> +return;
> That looks a bit pointless.  Would it not be easier to just inline
> this whole function?

Now that, I have dropped all the TLB flush enlightenments, I will clean this up.


...
> +}
...
> +static int
> +hv_get_vp_registers(paddr_t input, paddr_t output)
> +{
> +hv_vcpu_t        *vcpup, *targetp;
> +hv_partition_t   *curp = hv_get_current_partition();
> +get_vp_registers_input_t*inbuf;
> +get_vp_registers_output_t*outbuf;
> +struct vcpu_guest_context*vcpu_ctx;
> +u32*reg_indexp;
> +get_vp_registers_output_t*out_regp;
> +u32num_output_bytes = 0;
> +
> +        vcpup = &curp->vcpu_state[hv_get_current_vcpu_index()];
> +inbuf = vcpup->input_buffer;
> +outbuf = vcpup->output_buffer;
> +out_regp = outbuf;
> +/*
> + * Copy the input data to the per-cpu input buffer.
> + * This may be an overkill; obviously it is better to only
> + * copy what we need. XXXKYS: Check with Mike.
> + */
> Who's Mike?  What did he say?

Mike is  the MSFT contact. 




> +if (hvm_copy_from_guest_phys(inbuf, input, PAGE_SIZE)) {
> +return (HV_STATUS_INVALID_ALIGNMENT);
> +}
...
> +/*
> + * Now that we have the register state; select what we want and
> + * populate the output buffer.
> + */
> +reg_indexp = &inbuf->reg_index;
> +while (*reg_indexp != 0) {
> +switch(*reg_indexp) {
> +/*
> + * XXXKYS: need mapping code here; populate
> + * outbuf.
> + */
> +panic("hv_get_vp_registers not supported\n");

> Yeah, that's not really good enough.  It would be better to just
> not support this hypercall at all than have an implementation which
> always crashes Xen.

Early on, I decided what Hypercalls I would support; and from this list I wanted 
to only support those that were used by windows 2008 server. I  had put in a bunch of panics to trap all the hypercalls that the guest would invoke. This hypercall is not used. Thus the implementation is incomplete. I will clean this up.

> +}
> +reg_indexp++;
> +out_regp++ ;/*128 bit registers */
> +num_output_bytes +=16;
> +if ((char *)reg_indexp > ((char *)inbuf + PAGE_SIZE)) {
> +/*
> + *input list not reminated correctly; bail out.
> + */
> +panic("hv_get_vp_registers:input list not terminated\n"); 
> +break;

> It would be nice if this caused the hypercall to fail, rather than
> crashing Xen.

Will be cleaned up.



> +}
> +}
> +if (hvm_copy_to_guest_phys(output, outbuf, num_output_bytes)) {
> +/* Some problem copying data out*/
> +panic("hv_get_vp_registers:copyout problem\n");
 
> And again.

Will be cleaned up.


> +}
> +xfree(vcpu_ctx);
> +return (HV_STATUS_SUCCESS);
> +}

> Has this hypercall had any testing at all?  If it hasn't, then what
> is it doing here?

As I noted earlier, win2k8 does not currently use this API. I will clean this up.



> +
> +static int
> +hv_set_vp_registers(paddr_t input, paddr_t output)

> Again, I don't believe this function has ever been called.  That
> suggests that this may not be a high-value optimisation suitable
> for inclusion in an initial version.

As I noted earlier, win2k8 does not currently use this API. I will clean this up.



...
> 
> +static int
> +hv_switch_va(paddr_t input)
> +{
> +hv_partition_t   *curp = hv_get_current_partition();
> +        hv_vcpu_t *vcpup = &curp->vcpu_state[hv_get_current_vcpu_index()];
> +
> +/*
> + * XXXKYS: the spec sys the asID is passed via memory at offset 0 of 
> + * the page whose GPA is in the input register. However, it appears 
> + * the current build of longhorn (longhorn-2007-02-06-x86_64-fv-02)
> + * passes the asID in the input register instead. Need to check if 
> + * future builds do this.
> + */
> +hvm_set_cr3(input); 
> +HV_STATS_COLLECT(HV_CSWITCH, &vcpup->stats);
> +return (HV_STATUS_SUCCESS);
> +}

> Do you have any evidence that this is actually faster than just
> doing the CR3 write in the guest?  It's a win on real Viridian
> because you avoid blowing the shadow page table cache, but I don't
> think that applies on Xen's current shadow page table
> implementation.

As I noted earlier, initially I identified all the hypercalls that made for a win2k8 guest on our platform. From this list I implemented those that the guest actually invoked. I implemented all the hypercalls that the guest was invoking without regard to the performance value of the enlightenment. I agree with you that this may be more important on Veridian than on our platform. 



> +static int
> +hv_flush_va(paddr_t input)
> +{
...
> +/*
> + * Now operate on what we are given
> + * XXXKYS: For now we are ignoring as_id and fushGlobal flag.
> + * May have to revisit this. But first stash away the processed 
> + * parameters for subsequent use.
> + */
> +flush_argp->as_handle = as_id;
> +flush_argp->flags = flush_global;
> +flush_argp->v_mask = vcpu_mask;
> +
> +
> +ret_val = hv_build_hcall_retval(HV_STATUS_SUCCESS, 0);
> +hv_set_syscall_retval(guest_cpu_user_regs(),
> +                                   curp->long_mode_guest,
> +                                   ret_val);
> +HV_STATS_COLLECT(HV_FLUSH_VA_STAT, &cur_vcpup->stats);
> +panic("hv_flush_va not supported\n"); 

> So if this hypercall is ever correctly used by the guest, Xen will
> crash?

In the original patches that I posted, this hypercall was implemented. As part of my first round of cleanup, I removed the support for the TLB  flush hypercalls. As part of this cleanup, I also disabled TLB related enlightenments. This will be cleaned up.

> +return (HV_STATUS_SUCCESS);
> +}
> +
> +static int
> +hv_flush_va_range(paddr_t input, unsigned short start_index, 
> +unsigned short rep_count, unsigned short *reps_done)
> +{
...
> +ret_val = hv_build_hcall_retval(HV_STATUS_SUCCESS, rep_count);
> +hv_set_syscall_retval(guest_cpu_user_regs(),
> +                                   curp->long_mode_guest,
> +                                   ret_val);
> +
> +
> +HV_STATS_COLLECT(HV_FLUSH_RANGE, &cur_vcpup->stats);
> +panic("hv_flush_vaRange not supported\n");
 
> And again.

Will be cleaned up.


> +return (HV_STATUS_SUCCESS);
> +}
> +
> +void
> +hv_handle_hypercall(u64 opcode, u64 input, u64 output, 
> +  u64 *ret_val)

> Would it be easier to just return the return value?

Perhaps.


> +{
...
> +verb = (short)(opcode & 0xffff);
> +rep_count = (short)((opcode >>32) & 0xfff);
> +start_index = (short)((opcode >> 48) & 0xfff);
> +switch (verb) {
> +case HV_CREATE_PARTITION:
> +/*
> + * Xen only allows dom0 to create domains.
> + */
> +*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
> +return;
> +case HV_INITIALIZE_PARTITION:
> +/*
> + * We don't support this.
> + */
> +*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
> +return;

> Would it be easier to just make HV_STATUS_ACCESS_DENIED the
> default?  Actually, my reading of the spec is that you should
> return HV_STATUS_INVALID_HYPERCALL_CODE if the guest makes a
> hypercall which you don't support, but I can't imagine it makes a
> great deal of difference.

Perhaps not. 


...
> +case HV_GET_PARTITION_ID:
> +if (!hv_privilege_check(curp, HV_ACCESS_PARTITION_ID)) {
> +*ret_val = 
> +hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
> +return;
> +}
> +partition_id = (u64)current->domain->domain_id;
> +if (hvm_copy_to_guest_phys(output, 
> +&partition_id, 8)) {
> +/*
> + * Invalid output area.
> + */
> +*ret_val = 
> +hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);

> I think you mean HV_STATUS_INVALID_ALIGNMENT.

Agreed.


> +return;
> +}
> +*ret_val = hv_build_hcall_retval(HV_STATUS_SUCCESS, 0);
> +return;
> 
> +case HV_GET_VP_REGISTERS:
> +*ret_val = hv_build_hcall_retval(
> +hv_get_vp_registers(input, output), 0);
> +return;
> +case HV_SET_VP_REGISTERS:
> +*ret_val = hv_build_hcall_retval(
> +hv_set_vp_registers(input, output), 0);
...
> +case HV_FLUSH_VA:
> +*ret_val = 
> +hv_build_hcall_retval(hv_flush_va(input), 0);
> +return;
> +case HV_FLUSH_VA_LIST:
> +value  = hv_flush_va_range(input, start_index, 
> +rep_count, &reps_done);
> +*ret_val = hv_build_hcall_retval(value, reps_done);  
> +return;

> As mentioned above, your implementations of these hypercalls are
> really quite broken.

These hypercalls are not invoked by the guest and that is the reason they are broken. I will clean it up though.



...
> 
> +}
> +}
> Index: xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_hypercall.h
> ===================================================================
> --- /dev/null1970-01-01 00:00:00.000000000 +0000
> +++ xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_hypercall.h2008-03-26 13:56:39.000000000 -0400
> @@ -0,0 +1,21 @@
> +
> +/*
> + * nshypercall.h
> + * Memory layouts for the various hypercalls supported. 
> + *
> + * Engineering Contact: K. Y. Srinivasan
> + */
> +
> +#ifndef HV_HYPERCALL_H
> +#define HV_HYPERCALL_H
> +
> +#include <xen/cpumask.h>
> +
> +
> +typedef struct get_vp_registers_input {
> +u64partition_id;
> +u64vp_index;
> +u32reg_index;
> +u32pad1;
> +u64pad2;
> +} get_vp_registers_input_t;

> My reading of the 0.83 specification is that this should have been:

> typedef struct get_vp_registers_input {
> u64 partition_id;
> u32 vp_index;
> u32 pad;
> u32 reg_index[0];
> } get_vp_registers_input_t;

> Is my copy of the specification out of date?

Nope. 



> +typedef struct set_vp_register_spec {
> +u32reg_name;
> +u32pad;
> +u64pad1;
> +u64low_value;
> +u64high_value;
> +} set_vp_register_spec_t;
> +
> +typedef struct set_vp_registers_input {
> +u64partition_id;
> +u64vp_index;
> +set_vp_register_spec_treg_spec;
> +} set_vp_registers_input_t;

> Again, the 0.83 spec says that vp_index should be a u32, and should
> be followed by a u32 MBZ pad.
Right.

> +
> +
> +typedef struct flush_va {
> +u64as_handle;
> +u64flags;
> +union  {
> +u64processor_mask;
> +cpumask_t vcpu_mask;
> +} proc_mask;
> +#define p_mask proc_mask.processor_mask
> +#define v_maskproc_mask.vcpu_mask
> +u64gva;
> +} flush_va_t;

> Is there something wrong with anonymous unions?



> +/*
> + * Hypercall verbs.
> + */
> +
> +#define HV_CREATE_PARTITION 0x0010
> +#define HV_INITIALIZE_PARTITION 0x0011
> +#define HV_DELETE_PARTITION0x0014

> Is it really necesary to include #define's for hypercalls which we
> don't support and will probably never support?

No; I will clean this file up.


> Index: xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_intercept.c
> ===================================================================
> --- /dev/null1970-01-01 00:00:00.000000000 +0000
> +++ xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_intercept.c2008-03-26 14:26:38.000000000 -0400
> @@ -0,0 +1,0 @@
> 
> +/*
> + * Local includes; extension specific.
> + */
> +#include "hv_errno.h"
> +#include "hv_shim.h"
> +
> +
> +/*
> + * Implement the Hyperv Shim.
> + */
> +
> +extern struct cpuinfo_x86 boot_cpu_data;

> Is there something wrong with processor.h?

This will be fixed.

> +extern struct hvm_mmio_handler vlapic_mmio_handler;

> We should probably move that to a header file somewhere.

Will do.

> +static inline void *
> +get_virt_from_gmfn(struct domain *d, unsigned long gmfn)
> +{
> +unsigned long mfn = gmfn_to_mfn(d, gmfn);
> +if (mfn == INVALID_MFN) {
> +return (NULL);
> +}
> +return (map_domain_page_global(mfn));
> +}

> Is it really necessary to use map_domain_page_global() here?
> map_domain_page() is usually quite a bit faster.  To be honest, I'm
> not convinced that this wrapper function has any real value anyway.

I will look into this.


> +static inline unsigned long
> +get_mfn_from_gva(unsigned long va)
> +{
> +uint32_t pfec = PFEC_page_present;
> +unsigned long gfn;
> +gfn = paging_gva_to_gfn(current, va, &pfec);
> +return (gmfn_to_mfn((current->domain), gfn));
> +}

> That's only valid if you can guarantee that the resulting MFN will
> never be written to.

Yes. I use this to figure out if the hypercall is from the PV drivers in the guest or the hyparcall from the guest.


> +static inline void 
> +hv_write_hypercall_msr(hv_partition_t *curp,
> +  hv_vcpu_t*cur_vcpu,
> +  u64msr_content)
> +{
...
> +hv_hypercall_page_initialize(hypercall_page, curp);
> +curp->hypercall_mfn = gmfn_to_mfn(d, gmfn);
> +#ifdef CONFIG_DOMAIN_PAGE
> +unmap_domain_page_global(hypercall_page);
> +#endif

> unmap_domain_page_global() is #define'd to nothing #ifndef
> CONFIG_DOMAIN_PAGE, so you don't need the #ifdef.

Yes.


> +curp->hypercall_msr = msr_content;
> +spin_unlock(&curp->lock);
> +cur_vcpu->flags |= HV_VCPU_UP;
> +}
> +
> +
> +static inline void hv_write_sx_msr(uint32_t idx, hv_partition_t *curp,
> +  hv_vcpu_t*cur_vcpu,
> +  u64msr_content)
> +{
...
> +sx_page = get_virt_from_gmfn(d, gmfn);
> +if (sx_page == NULL) {
> +/*
> + * The guest specified a bogus GPA; inject a GP fault
> + * into the guest.
> + */
> +hv_inject_exception(TRAP_gp_fault);

> The spec says tha SIEFP can be outside the physical address space,
> and you're just supposed to deal with it.  The guest will be unable
> to access the frame, but the MSR write is supposed to succeed
> (section 14.6.3 of version 0.83).  Likewise the SIMP (14.6.4).

You are right. As it turns out, the win2k8 guest on our platform never runs into this situation. I will fix this.


> +return;
> +}
> +switch (idx) {
> +case HV_MSR_SIEFP:
> +hv_init_event_page(sx_page);
> +cur_vcpu->siefp_msr = msr_content; 
> +cur_vcpu->sief_page = sx_page; 
> +break;
> +case HV_MSR_SIMP:
> +hv_init_message_page(sx_page);
> +cur_vcpu->simp_msr = msr_content;
> +cur_vcpu->sim_page = sx_page;
> +break;
> +}
> +
> +}
> +

> +
> +/*
> + * Time this domain booted.
> + */
> +s_time_t hv_domain_boot_time;
> Time which domain booted?  What if you have two domains which both
> use the HyperV extensions?

Oops! Will fix this.



> +static inline void
> +hv_inject_exception(int trap)
> +{
> +hvm_funcs.inject_exception(trap, 0, 0);
> +}
> hvm_inject_exception() might be a better choice.

Ok.

> +static inline void 
> +hv_set_partition_privileges(hv_partition_t *hvpp)
> +{
> +/*
> + * This is based on the hypervisor spec under section 5.2.3. 
> + */
> +hvpp->privileges = 0x000000020000007f;
> +}
> So you allow AccessVpRuntime, AccessTimeCounters, AccessSynicMsrs,
> AccessSyntheticTimers, AccessApicMsrs, AccessHypercallMsrs,
> AccessVpIndex, and AccessSelfPartitionId?  That sounds like a
> pretty reasonable set, but it'd be nice if you'd documented the
> fact rather than just dropping in a magic number.  Something like
> this, maybe:

> hvpp->privileges = HV_PRIV_ACCESS_VP_RUNTIME |
>   HV_PRIV_ACCESS_TIME_COUNTERS |
>                  HV_PRIV_ACCESS_SYNIC_MSRS |
                   ...

Ok.

> +static inline u32
> +hv_get_recommendations(void)
> +{
> +/*
> + *For now we recommend all the features. Need to validate.
> + */
> +if ( paging_mode_hap(current->domain)) {
> +/*
> + * If HAP is enabled; the guest should not use TLB flush
> + * related enlightenments.
> + */
> +return (0x19);
> +} else {
> +/*
> + * For now disable TLB flush enlightenments.
> + */
> +return (0x19); 
> +}
> +}

> So you recommend the use of hypercalls for address switches, MSRs
> for APIC registers, and MSRs for rebooting?  The first two make
> sense, but I'm not so sure that enlightened reboot is a useful
> optimisation.

Originally, I had implemented TLB flush enlightenments as well. Based on some of the performance analysis we are currently doing, I may re-introduce them. As I noted earlier, I just wanted to implement all the features that made sense for the guest on our platform - hence I implemented reboot as well!

> Again, it would have been useful if the actual set of features was
> documented rather than just using a magic number.

Yep.

> +static inline void 
> +hv_set_partition_features(hv_partition_t *hvpp)
> +{
> +hvpp->supported_features = 0x1f;
> +}
>i.e. VP_RUNTIME | PARTITION_REF_COUNT | SYNIC MSRS | SYNTHETIC_TIMERS |
>     APIC_MSRS | HYPERCALL_MSRS.

> It's kind of surprising that you recommend guests use the reboot MSRs,
> but then don't claim to support it.  Was that deliberate?

> Actually, looking some more, it doesn't look like you ever use the
> supported_features field, and have instead hardcoded the value to
> 0xff everywhere you need it.

Will fix this.

> +
> +static inline u16 
> +hv_get_guest_major(void)
> +{
> +//KYS: Check!
> +return (0);
> +}
> +static inline u16
> +hv_get_guest_minor(void)
> +{
> +//KYS: Check!
> +return (0);
> +}
> +static inline u32
> +hv_get_guest_service_pack(void)
> +{
> +//KYS: Check!
> +return (0);
> +}
> Err... are these supposed to return the guest major/minor/sp
> reported in the identification MSR?

Yep. Was not quite sure what numbers made sense!



> +static inline void
> +hv_read_icr(u64 *icr_content)
...
> +static inline void
> +hv_read_tpr(u64 *tpr_content)
...
> +static inline void
> +hv_write_eoi(u64 msr_content)
...
> +static inline void
> +hv_write_icr(u64 msr_content)
...
> +static inline void
> +hv_write_tpr(u64 msr_content)
...

> These all have really weird implementations and lots of gratuitous
> use of hardcoded magic numbers (have a look at apicdef.h).

Will fix it.


> +static void
> +hv_timeout_handler(void *arg)
> +{
...
> +/*
> + * First post the message and then optionally deal with the 
> + * interrupt notification.
> + */
> +if (cur_vcpu->sim_page == NULL) {
> +panic("Novell Shim: Sim page not setup\n");
> +}

> So if a guest requests a timeout before they've set up the sim
> page, Xen crashes?  What if they clear the sim page after setting
> up a timeout, but before it fires?

I will fix this. As I noted earlier, win2k8 currently does not use any of these features.

...
> +}
> +

> +int
> +hyperv_do_hypercall(struct cpu_user_regs *pregs)
> +{
> +hv_partition_t*curp = hv_get_current_partition();
> +hv_vcpu_t        *vcpup;
> +intlong_mode_guest = curp->long_mode_guest;
> +unsigned long hypercall_mfn;
> +unsigned long gmfn;
> +gmfn = (curp->hypercall_msr >> 12);
> +
> +hypercall_mfn = get_mfn_from_gva(pregs->eip);

> That's exciting.  Do you expect that guests will make hypercalls
> from anywhere other than the Viridian hypercall page?

We support PV drivers for win2k8 guests. Given that both HyperV and Xen use the same hypercall numbers to implement different functionality, I needed a way to differentiate HyperV hypercalls. I use the mfn of the hypercall page to figure out who should handle the hypercall.

...

> +int
> +hyperv_vcpu_initialize(struct vcpu *v)
> +{
...
> +/*
> + * Setup the input page for handling hypercalls.
> + *
> + */
> +vcpup->input_buffer_page = 
> +alloc_domheap_page(NULL);
...
> +vcpup->input_buffer =
> +get_virt_from_page_ptr(vcpup->input_buffer_page);
...
> +vcpup->output_buffer_page = 
> +alloc_domheap_page(NULL);
...
> +vcpup->output_buffer =
> +get_virt_from_page_ptr(vcpup->output_buffer_page);

> What exactly are these used for?  The only place I can see it
> referenced are immediately before a panic(), which seems a bit
> pointless.

These were used in implementing TLB flush enlightenments. I got rid of those hypercalls in the current patch set. I will clean these up.


> +vcpup->xen_vcpu = v; 
> +
> +return (0);
> +}
> +

> +
> +static int 
> +hyperv_dom_restore(struct domain *d, hvm_domain_context_t *h)
> +{
> +struct hvm_hyperv_dom ctxt;
> +hv_partition_t*curp;
> +
> +if ( hvm_load_entry(HYPERV_DOM, h, &ctxt) != 0 )
> +        return -22;
> +d->arch.hvm_domain.params[HVM_PARAM_EXTEND_HYPERVISOR] = ctxt.ext_id;
> +if ((ctxt.ext_id == 0) || (ctxt.ext_id > 1)) {
> +return 0;
> +}
> +if (hyperv_initialize(d)) {
> +return -22;
> +}
> ????  Did you mean -EINVAL there?

Yep.

> +curp = d->arch.hvm_domain.hyperv_handle;
> +
> +curp->guest_id_msr = ctxt.guestid_msr;
> +curp->hypercall_msr = ctxt.hypercall_msr;
> +curp->long_mode_guest = ctxt.long_mode;
> +curp->hypercall_mfn =
> +gmfn_to_mfn(d, (ctxt.hypercall_msr >> 12));
> +
> +return 0; 
> +}
> +

> +static int
> +hv_preprocess_cpuid_leaves(unsigned int input, struct cpu_user_regs *regs)
> +{
> +uint32_t idx;
> +struct domain*d = current->domain;
> +intextid = d->arch.hvm_domain.params[HVM_PARAM_EXTEND_HYPERVISOR];
> +
> +if (extid == 1) {
> +/*
> + * Enlightened Windows guest; need to remap and handle 
> + * leaves used by PV front-end drivers.
> + */
> +if ((input >= 0x40000000) && (input <= 0x40000005)) {
> +return (0);
> +}
> +/*
> + * PV drivers use cpuid to query the hypervisor for details. On
> + * Windows we will use the following leaves for this:
> + *
> + * 4096: VMM Sinature (corresponds to 0x40000000 on Linux)
> + * 4097: VMM Version (corresponds to 0x40000001 on Linux)
> + * 4098: Hypercall details (corresponds to 0x40000002 on Linux)
> + */

> I think it would be better to just unconditionally duplicate the
> Xen leaves at 0x40001000, regardless of whether the viridian
> enlightenments are enabled.  That would make it much easier to
> write PV drivers which work regardless of whether the
> enlightenments are enabled, and if we can do that then it might (in
> a few years' time) be possible to default to enlightments-on for
> all domains.  That isn't really an option when there's no common
> Xen API.

> This would also mean that you don't need to duplicate the whole of
> hypervisor_cpuid_leaves() in the Viridian implementation.

I will look into cleaning this up.

...
> +}
> +
> +int 
> +hyperv_do_cpu_id(unsigned int input, struct cpu_user_regs *regs)
> +{
> +uint32_t idx;
> +
> +/*
> + * hvmloader uses cpuid to set up a hypercall page; we don't want to
> + * intercept calls coming from the bootstrap (bios) code in the HVM 
> + * guest; we discriminate based on the instruction pointer.
> + */
> +if (hv_call_from_bios(regs)) {
> +/*
> + * We don't intercept this.
> + */
> +return (0);
> +}
> +
> +if (input == 0x00000001) { 
> +regs->ecx = (regs->ecx | 0x80000000);
> +return (1);
> +} 

> I think we should be doing this even when the shim is disabled.
> That bit is supposed to indicate ``running on a VMM'', not
> ``running on a Viridian-compatible VMM'', and it has the side
> effect of making Windows much more tolerant of vcpus getting
> scheduled at different rates.

Good point.

> +if (hv_preprocess_cpuid_leaves(input, regs)) {
> +return (0);
> +}
> +idx = (input - 0x40000000);
> +
> +switch (idx) {
> +case 0:
> +/*
> + * 0x40000000: Hypervisor identification. 
> + */
> +regs->eax = 0x40000005; /* For now clamp this */
> +regs->ebx = 0x65766f4e; /* "Nove" */ 
> +regs->ecx = 0x68536c6c; /* "llSh" */
> +regs->edx = 0x76486d69; /* "imHv" */ 
> +break;

> I think this is supposed to identify the hypervisor (Xen), rather
> than the Viridian implementation (Novell shim).

You are right.

...
> +case 5:
> +/*
> + * 0x40000005: Implementation limits.
> + * Currently we retrieve maximum number of vcpus and 
> + * logical processors (hardware threads) supported.
> + */
> +regs->eax = hv_get_max_vcpus_supported();
> 
> +regs->ebx = hv_get_max_lcpus_supported();
> a) ebx is marked as reserved in the specification, and
> b) it doesn't make a great deal of sense to expose the maximum
>   number of physical CPUs to a virtualised guest anyway.

I thought this is what the spec mandated.  I will look into this.

> +regs->ecx = 0; /* Reserved */
> +regs->edx = 0; /* Reserved */
> +break; 
> +
> +default:
> +/*
> + * We don't handle this leaf.
> + */
> +return (0);
> +
> +}
> +return (1);
> +}
> +
> +int
> +hyperv_do_rd_msr(uint32_t idx, struct cpu_user_regs *regs)
> +{
...
> +if (extid > 1) {
> +/*
> + * For now this is all other "Enlightened" operating systems
> + * other than Longhorn.
> + */
> +if (idx == 0x40000000) {
> +/*
> + * PV driver hypercall setup. Let xen handle this.
> + */
> +return (0);
> +}
> +if (idx == 0x40001000) {
> +idx = 0x40000000;
> +}

> Eh?  What's this doing?

At one point, I was planning to support other enlightened operating systems. I will clean it up.


> +}
> +switch (idx) {
...
> +case HV_MSR_TIMER0_COUNT:
> +timer = 0;
> +goto process_timer_count_read;
> +case HV_MSR_TIMER1_COUNT:
> +timer = 1;
> +goto process_timer_count_read;
> +case HV_MSR_TIMER2_COUNT:
> +timer = 2;
> +goto process_timer_count_read;
> +case HV_MSR_TIMER3_COUNT:
> +timer = 3;
> +process_timer_count_read:

> How much does timer support actually buy you?  It's pretty complicated
> all by itself, and it also seems to be the only reason you need SYNIC
> support.

The feature is not used currently by the win2k8 guest. I may choose to get rid of this body of code.

> +if (!hv_privilege_check(curp, HV_ACCESS_SYNC_TIMERS)) {
> +goto msr_read_error;
> +}
...
> +}
> +
> +int
> +hyperv_do_wr_msr(uint32_t idx, struct cpu_user_regs *regs)
> +{
...
> +switch (idx) {
...
> +case HV_MSR_SEOM:
> +if (!hv_privilege_check(curp, HV_ACCESS_SYNC_MSRS)) {
> +goto msr_write_error;
> +}
> +cur_vcpu->eom_msr = msr_content; 
> +hv_process_message_q(curp, cur_vcpu);
> +break;

> You don't support queued messages, so there's not much point in
> supporting the SEOM MSR either.  Also, eom_msr never seems to get
> read, except for saving it when you do a suspend.

Right.


> +case HV_MSR_TIMER3_CONFIG:
> +timer = 3;
> +process_timer_config:
> +if (!hv_privilege_check(curp, HV_ACCESS_SYNC_TIMERS)) {
> +goto msr_write_error;
> +}
> +/*
> + * Assume that the client is going to write the whole msr. 
> + */
> +if (!(msr_content & 0x9)) {
> +/*
> + * We are neither setting Auto Enable or Enable; 
> + * silently exit.
> + * Should this be considered to turn off a 
> + * timer that may be currently 
> + * active; XXXKYS: Check. For now we are 
> + * not doing anything here.
> + */
> +break;
> +}
> +if (!(((u32)(msr_content >> 16)) & 0x0000000f)) {
> +/*
> + * sintx is 0; clear the enable bit(s).
> + */
> +msr_content &= ~(0x1);
> +}

> Again, some #define's would make these magic numbers a bit more
> clear.

I will fix the code.

Regards,

K. Y


> Steven.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH][RFC] Supporting Enlightened Windows 2008Server
       [not found]   ` <48DF4074.E57C.0030.0@novell.com>
@ 2008-04-14 11:49     ` Steven Smith
  0 siblings, 0 replies; 14+ messages in thread
From: Steven Smith @ 2008-04-14 11:49 UTC (permalink / raw)
  To: Ky Srinivasan; +Cc: xen-devel, Keir Fraser, Steven Smith, Tim Deegan


[-- Attachment #1.1: Type: text/plain, Size: 15409 bytes --]

> > Would you mind generating diffs with -p, please?  It makes them a fair
> > bit easier to read.
> Will do.
Thanks.

> >  static void svm_vmexit_do_cpuid(struct cpu_user_regs *regs)
> >  {
> > -    unsigned int eax, ebx, ecx, edx, inst_len;
> > +    unsigned int eax, ebx, ecx, edx, inst_len, input;
> >  
> >      inst_len = __get_instruction_length(current, INSTR_CPUID, NULL);
> >      if ( inst_len == 0 ) 
> > ...
> > @@ -968,6 +970,7 @@
> >      regs->ecx = ecx;
> >      regs->edx = edx;
> >  
> > +    hyperx_intercept_do_cpuid(input, regs);
> >      __update_guest_eip(regs, inst_len);
> >  }
> >
>   
> > Would it be easier to put this bit in hvm_cpuid, or maybe
> > cpuid_hypervisor_leaves?  That would avoid needing to make the same
> > change to both the VMX and SVM CPUID infrastructure.
> Good point. I will look into it.
Thanks.


> > +static int
> > +hv_get_vp_registers(paddr_t input, paddr_t output)
> > +{
> > +hv_vcpu_t        *vcpup, *targetp;
> > +hv_partition_t   *curp = hv_get_current_partition();
> > +get_vp_registers_input_t*inbuf;
> > +get_vp_registers_output_t*outbuf;
> > +struct vcpu_guest_context*vcpu_ctx;
> > +u32*reg_indexp;
> > +get_vp_registers_output_t*out_regp;
> > +u32num_output_bytes = 0;
> > +
> > +        vcpup = &curp->vcpu_state[hv_get_current_vcpu_index()];
> > +inbuf = vcpup->input_buffer;
> > +outbuf = vcpup->output_buffer;
> > +out_regp = outbuf;
> > +/*
> > + * Copy the input data to the per-cpu input buffer.
> > + * This may be an overkill; obviously it is better to only
> > + * copy what we need. XXXKYS: Check with Mike.
> > + */
> > Who's Mike?  What did he say?
> Mike is  the MSFT contact. 
Ah, okay.  I think we'd rather not have comments which are meaningless
to other people reading the code, so it'd be better to get rid of that
sort of thing.

> > +if (hvm_copy_from_guest_phys(inbuf, input, PAGE_SIZE)) {
> > +return (HV_STATUS_INVALID_ALIGNMENT);
> > +}
> ...
> > +/*
> > + * Now that we have the register state; select what we want and
> > + * populate the output buffer.
> > + */
> > +reg_indexp = &inbuf->reg_index;
> > +while (*reg_indexp != 0) {
> > +switch(*reg_indexp) {
> > +/*
> > + * XXXKYS: need mapping code here; populate
> > + * outbuf.
> > + */
> > +panic("hv_get_vp_registers not supported\n");
> 
> > Yeah, that's not really good enough.  It would be better to just
> > not support this hypercall at all than have an implementation which
> > always crashes Xen.
> Early on, I decided what Hypercalls I would support; and from this
> list I wanted to only support those that were used by windows 2008
> server. I had put in a bunch of panics to trap all the hypercalls
> that the guest would invoke. This hypercall is not used. Thus the
> implementation is incomplete. I will clean this up.
If the hypercall is never made, it'd probably be best to just remove
it rather than trying to clean it up.

> > +static int
> > +hv_switch_va(paddr_t input)
> > +{
> > +hv_partition_t   *curp = hv_get_current_partition();
> > +        hv_vcpu_t *vcpup = &curp->vcpu_state[hv_get_current_vcpu_index()];
> > +
> > +/*
> > + * XXXKYS: the spec sys the asID is passed via memory at offset 0 of 
> > + * the page whose GPA is in the input register. However, it appears 
> > + * the current build of longhorn (longhorn-2007-02-06-x86_64-fv-02)
> > + * passes the asID in the input register instead. Need to check if 
> > + * future builds do this.
> > + */
> > +hvm_set_cr3(input); 
> > +HV_STATS_COLLECT(HV_CSWITCH, &vcpup->stats);
> > +return (HV_STATUS_SUCCESS);
> > +}
> > Do you have any evidence that this is actually faster than just
> > doing the CR3 write in the guest?  It's a win on real Viridian
> > because you avoid blowing the shadow page table cache, but I don't
> > think that applies on Xen's current shadow page table
> > implementation.
> As I noted earlier, initially I identified all the hypercalls that
> made for a win2k8 guest on our platform. From this list I
> implemented those that the guest actually invoked. I implemented all
> the hypercalls that the guest was invoking without regard to the
> performance value of the enlightenment. I agree with you that this
> may be more important on Veridian than on our platform.
Generally, small patches are easier to merge than big ones.  That
suggests that you should start with a minimal implementation and then
add things which have demonstrated value, rather than trying to get
everything working in the initial merge.

> > +static int
> > +hv_flush_va(paddr_t input)
> > +{
> ...
> > +/*
> > + * Now operate on what we are given
> > + * XXXKYS: For now we are ignoring as_id and fushGlobal flag.
> > + * May have to revisit this. But first stash away the processed 
> > + * parameters for subsequent use.
> > + */
> > +flush_argp->as_handle = as_id;
> > +flush_argp->flags = flush_global;
> > +flush_argp->v_mask = vcpu_mask;
> > +
> > +
> > +ret_val = hv_build_hcall_retval(HV_STATUS_SUCCESS, 0);
> > +hv_set_syscall_retval(guest_cpu_user_regs(),
> > +                                   curp->long_mode_guest,
> > +                                   ret_val);
> > +HV_STATS_COLLECT(HV_FLUSH_VA_STAT, &cur_vcpup->stats);
> > +panic("hv_flush_va not supported\n"); 
> 
> > So if this hypercall is ever correctly used by the guest, Xen will
> > crash?
> In the original patches that I posted, this hypercall was
> implemented. As part of my first round of cleanup, I removed the
> support for the TLB flush hypercalls. As part of this cleanup, I
> also disabled TLB related enlightenments. This will be cleaned up.
Thank you.

> > +static inline unsigned long
> > +get_mfn_from_gva(unsigned long va)
> > +{
> > +uint32_t pfec = PFEC_page_present;
> > +unsigned long gfn;
> > +gfn = paging_gva_to_gfn(current, va, &pfec);
> > +return (gmfn_to_mfn((current->domain), gfn));
> > +}
> 
> > That's only valid if you can guarantee that the resulting MFN will
> > never be written to.
> Yes. I use this to figure out if the hypercall is from the PV
> drivers in the guest or the hyparcall from the guest.
If it's only used in one place, it might be easier to remove the
foot-shooting potential completely and inline it at its call site.

> > +curp->hypercall_msr = msr_content;
> > +spin_unlock(&curp->lock);
> > +cur_vcpu->flags |= HV_VCPU_UP;
> > +}
> > +
> > +
> > +static inline void hv_write_sx_msr(uint32_t idx, hv_partition_t *curp,
> > +  hv_vcpu_t*cur_vcpu,
> > +  u64msr_content)
> > +{
> ...
> > +sx_page = get_virt_from_gmfn(d, gmfn);
> > +if (sx_page == NULL) {
> > +/*
> > + * The guest specified a bogus GPA; inject a GP fault
> > + * into the guest.
> > + */
> > +hv_inject_exception(TRAP_gp_fault);
> 
> > The spec says tha SIEFP can be outside the physical address space,
> > and you're just supposed to deal with it.  The guest will be unable
> > to access the frame, but the MSR write is supposed to succeed
> > (section 14.6.3 of version 0.83).  Likewise the SIMP (14.6.4).
>  You are right. As it turns out, the win2k8 guest on our platform
> never runs into this situation.  I will fix this.
Thank you.

> > +static inline u32
> > +hv_get_recommendations(void)
> > +{
> > +/*
> > + *For now we recommend all the features. Need to validate.
> > + */
> > +if ( paging_mode_hap(current->domain)) {
> > +/*
> > + * If HAP is enabled; the guest should not use TLB flush
> > + * related enlightenments.
> > + */
> > +return (0x19);
> > +} else {
> > +/*
> > + * For now disable TLB flush enlightenments.
> > + */
> > +return (0x19); 
> > +}
> > +}
> > So you recommend the use of hypercalls for address switches, MSRs
> > for APIC registers, and MSRs for rebooting?  The first two make
> > sense, but I'm not so sure that enlightened reboot is a useful
> > optimisation.
> Originally, I had implemented TLB flush enlightenments as
> well. Based on some of the performance analysis we are currently
> doing, I may re-introduce them.
Sure.

> As I noted earlier, I just wanted to implement all the features that
> made sense for the guest on our platform - hence I implemented
> reboot as well!
I think the interesting question, when deciding whether to implement a
feature, is less ``does this feature make sense?'', and more ``does
this feature need to be accelerated?''.  We're never going to be able
to remove the un-enlightened reboot support, and so there needs to be
some compelling advantage to justify duplicating the functionality
like this.  For hot-path operations, like page table updates, that
justification is performance, but it's not worth the extra code, with
all its attendant bug potential and maintenance cost, just to shave a
few milliseconds off reboot times.


> > +static inline void 
> > +hv_set_partition_features(hv_partition_t *hvpp)
> > +{
> > +hvpp->supported_features = 0x1f;
> > +}
> >i.e. VP_RUNTIME | PARTITION_REF_COUNT | SYNIC MSRS | SYNTHETIC_TIMERS |
> >     APIC_MSRS | HYPERCALL_MSRS.
> 
> > It's kind of surprising that you recommend guests use the reboot MSRs,
> > but then don't claim to support it.  Was that deliberate?
> 
> > Actually, looking some more, it doesn't look like you ever use the
> > supported_features field, and have instead hardcoded the value to
> > 0xff everywhere you need it.
> Will fix this.
Thanks.

> 
> > +
> > +static inline u16 
> > +hv_get_guest_major(void)
> > +{
> > +//KYS: Check!
> > +return (0);
> > +}
> > +static inline u16
> > +hv_get_guest_minor(void)
> > +{
> > +//KYS: Check!
> > +return (0);
> > +}
> > +static inline u32
> > +hv_get_guest_service_pack(void)
> > +{
> > +//KYS: Check!
> > +return (0);
> > +}
> > Err... are these supposed to return the guest major/minor/sp
> > reported in the identification MSR?
> Yep. Was not quite sure what numbers made sense!
The guest should tell you what numbers to use when it does interface
discovery.  See section 3.6 of the spec.

Of course, it really isn't obvious why the guest would ever need to
ask the hypervisor what operating system it's running, but that's a
problem with the specification rather than with this particular
implementation.

> > +static inline void
> > +hv_read_icr(u64 *icr_content)
> ...
> > +static inline void
> > +hv_read_tpr(u64 *tpr_content)
> ...
> > +static inline void
> > +hv_write_eoi(u64 msr_content)
> ...
> > +static inline void
> > +hv_write_icr(u64 msr_content)
> ...
> > +static inline void
> > +hv_write_tpr(u64 msr_content)
> ...
> 
> > These all have really weird implementations and lots of gratuitous
> > use of hardcoded magic numbers (have a look at apicdef.h).
> Will fix it.
Thanks.

> > +int
> > +hyperv_do_hypercall(struct cpu_user_regs *pregs)
> > +{
> > +hv_partition_t*curp = hv_get_current_partition();
> > +hv_vcpu_t        *vcpup;
> > +intlong_mode_guest = curp->long_mode_guest;
> > +unsigned long hypercall_mfn;
> > +unsigned long gmfn;
> > +gmfn = (curp->hypercall_msr >> 12);
> > +
> > +hypercall_mfn = get_mfn_from_gva(pregs->eip);
> 
> > That's exciting.  Do you expect that guests will make hypercalls
> > from anywhere other than the Viridian hypercall page?
> We support PV drivers for win2k8 guests. Given that both HyperV and
> Xen use the same hypercall numbers to implement different
> functionality, I needed a way to differentiate HyperV hypercalls. I
> use the mfn of the hypercall page to figure out who should handle
> the hypercall.
Is there any reason you can't make that decision based on eax or some
such?

The Viridian spec only specifies the interface between the guest and
the hypercall stub page, and not the interface between the stub page
and the hypervisor proper.  There's no reason why your stub page
couldn't do something like this:

orl $0x80000000, %eax
vmmcall

rather than just doing a vmmcall directly.  That would allow you to
distinguish Viridian hypercalls from Xen ones just by looking at
eax[1], which seems a bit cleaner to me.


[1] Assuming that neither Xen nor Viridian ever define hypercall
numbers above 0x80000000, which seems likely.

> > +int
> > +hyperv_vcpu_initialize(struct vcpu *v)
> > +{
> ...
> > +/*
> > + * Setup the input page for handling hypercalls.
> > + *
> > + */
> > +vcpup->input_buffer_page = 
> > +alloc_domheap_page(NULL);
> ...
> > +vcpup->input_buffer =
> > +get_virt_from_page_ptr(vcpup->input_buffer_page);
> ...
> > +vcpup->output_buffer_page = 
> > +alloc_domheap_page(NULL);
> ...
> > +vcpup->output_buffer =
> > +get_virt_from_page_ptr(vcpup->output_buffer_page);
> 
> > What exactly are these used for?  The only place I can see it
> > referenced are immediately before a panic(), which seems a bit
> > pointless.
> These were used in implementing TLB flush enlightenments. I got rid
> of those hypercalls in the current patch set. I will clean these up.
Thanks.

> > +static int
> > +hv_preprocess_cpuid_leaves(unsigned int input, struct cpu_user_regs *regs)
> > +{
> > +uint32_t idx;
> > +struct domain*d = current->domain;
> > +intextid = d->arch.hvm_domain.params[HVM_PARAM_EXTEND_HYPERVISOR];
> > +
> > +if (extid == 1) {
> > +/*
> > + * Enlightened Windows guest; need to remap and handle 
> > + * leaves used by PV front-end drivers.
> > + */
> > +if ((input >= 0x40000000) && (input <= 0x40000005)) {
> > +return (0);
> > +}
> > +/*
> > + * PV drivers use cpuid to query the hypervisor for details. On
> > + * Windows we will use the following leaves for this:
> > + *
> > + * 4096: VMM Sinature (corresponds to 0x40000000 on Linux)
> > + * 4097: VMM Version (corresponds to 0x40000001 on Linux)
> > + * 4098: Hypercall details (corresponds to 0x40000002 on Linux)
> > + */
> 
> > I think it would be better to just unconditionally duplicate the
> > Xen leaves at 0x40001000, regardless of whether the viridian
> > enlightenments are enabled.  That would make it much easier to
> > write PV drivers which work regardless of whether the
> > enlightenments are enabled, and if we can do that then it might (in
> > a few years' time) be possible to default to enlightments-on for
> > all domains.  That isn't really an option when there's no common
> > Xen API.
> 
> > This would also mean that you don't need to duplicate the whole of
> > hypervisor_cpuid_leaves() in the Viridian implementation.
> 
> I will look into cleaning this up.
Thanks.

> > +}
> > +switch (idx) {
> ...
> > +case HV_MSR_TIMER0_COUNT:
> > +timer = 0;
> > +goto process_timer_count_read;
> > +case HV_MSR_TIMER1_COUNT:
> > +timer = 1;
> > +goto process_timer_count_read;
> > +case HV_MSR_TIMER2_COUNT:
> > +timer = 2;
> > +goto process_timer_count_read;
> > +case HV_MSR_TIMER3_COUNT:
> > +timer = 3;
> > +process_timer_count_read:
> 
> > How much does timer support actually buy you?  It's pretty complicated
> > all by itself, and it also seems to be the only reason you need SYNIC
> > support.
> The feature is not used currently by the win2k8 guest. I may choose
> to get rid of this body of code.
Please do.  If it's not used then it can't possibly be buying you
anything, and it's a lot of code to carry just on the off chance that
it'll be useful later.

Steven.

[-- Attachment #1.2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2008-04-14 11:49 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-04-04 23:41 [PATCH][RFC] Supporting Enlightened Windows 2008Server Ky Srinivasan
2008-04-05 15:17 ` Daniel P. Berrange
2008-04-07 14:27   ` Ky Srinivasan
2008-04-07 14:27   ` Ky Srinivasan
     [not found] ` <20080410112313.GA4628@weybridge.uk.xensource.com>
2008-04-13 18:43   ` Ky Srinivasan
     [not found]   ` <48DF4074.E57C.0030.0@novell.com>
2008-04-14 11:49     ` Steven Smith
  -- strict thread matches above, loose matches on Subject: below --
2008-04-13 21:50 Ky Srinivasan
2008-04-13 19:04 Ky Srinivasan
     [not found] <47F68017.E57C.0030.0@novell.com>
2008-04-05  9:21 ` Keir Fraser
2008-03-05 22:15 [PATCH][RFC] Supporting Enlightened Windows 2008 Server Ky Srinivasan
2008-03-06  7:28 ` Keir Fraser
2008-03-06 10:15   ` Tim Deegan
2008-03-07  1:10     ` [PATCH][RFC] Supporting Enlightened Windows 2008Server Ky Srinivasan
2008-03-07 11:57       ` Tim Deegan
2008-03-07 13:19       ` Keir Fraser
2008-03-07 13:30       ` Keir Fraser
2008-03-07  1:08   ` Ky Srinivasan

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.