* [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal
@ 2006-03-13 17:58 ` Zachary Amsden
  0 siblings, 0 replies; 38+ messages in thread
From: Zachary Amsden @ 2006-03-13 17:58 UTC (permalink / raw)
  To: Linus Torvalds, Linux Kernel Mailing List,
	Virtualization Mailing List, Xen-devel, Andrew Morton,
	Zachary Amsden, Dan Hecht, Dan Arai, Anne Holler,
	Pratap Subrahmanyam, Christopher Li, Joshua LeVasseur,
	Chris Wright, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy,
	Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn,
	Zachary Amsden

At OLS 2005, we described the work that we have been doing at VMware
on a common interface for paravirtualization of Linux. We shared the
general vision in Rik's virtualization BoF.

This note is an update on our further work on the Virtual Machine
Interface, VMI.  The patches provided have been tested on 2.6.16-rc6.
We are currently re-collecting performance information for the new -rc6
kernel, but expect our numbers to match previous results, which showed
no impact whatsoever on macro benchmarks, and nearly negligible impact
on microbenchmarks.

Unlike the full-virtualization techniques used in the traditional VMware
products, paravirtualization is a technique where the operating system
is modified to enlighten the hypervisor with timely knowledge about the
operating system's activities. Since the hypervisor now depends on the
kernel to tell it about common idioms etc., it does not need to
write-protect OS objects such as page and descriptor tables, as a
solution based on full virtualization must. This has two important
effects: (a) it shortens the critical path, since faulting is expensive
on modern processors, and (b) it simplifies the hypervisor by
eliminating complex heuristics. While the former delivers performance,
the latter is quite important too.

Not surprisingly, paravirtualization's strength, i.e. that it encourages
tighter communication between the kernel and the hypervisor, is also its
weakness. Unless the changes to the operating system are moderated, you
can very quickly find yourself with a kernel that (a) looks and feels
like a brand new kernel or (b) cannot run on native machines or on newer
versions of the hypervisor without a full recompile. The former can
impede innovation in the Linux kernel, and the latter can be a problem
for software vendors. 

VMware proposes VMI as a paravirtualization interface for Linux that
solves these problems. 
  - A VMI'fied Linux kernel runs unmodified on native hardware, and on
    many hypervisors, while simultaneously delivering on the performance
    promise of paravirtualization. 
  - VMI has a rich and low level interface, which allows the kernel to
    cope with future hardware evolution by querying for hardware
    capability. It is our expectation that a single kernel will run
    unmodified on today's processors with limited hardware
    virtualization support and also keep up with any evolution on the
    processor front.
  - VMI Linux is a fairly clean interface, with distinct name spaces
    for objects from the kernel and the hypervisor. Nowhere do we mingle
    names from the hypervisor with those of the kernel. This separation
    allows innovation in the kernel to proceed at the same speed as
    always. For most kernel developers, a VMI kernel looks and feels like
    a regular Linux kernel.  
  - VMI Linux still supports "native" hypervisor device drivers, for
    example a hypervisor vendor's own private network or block device
    drivers which are free to use any interface desired to communicate
    with the hypervisor.

At present, we are sharing a working implementation of the VMI for the
2.6.16-rc6 version of Linux. We have verified that VMI Linux does indeed
run well on native machines (both P4 and Opterons), and on VMware style
hypervisors. VMI Linux has negligible overheads on native machines, so
much so that we are confident that VMI Linux can, in the long run, be
the default Linux for i386.  We believe that this interface is both
cleaner and more powerful than other proposals that have been made
towards virtualization of Linux, and can easily be adapted to work with
other hypervisors.

This is by no means finished work. A few of the areas that need more
attention and exploration are: (a) 64-bit support is still lacking, but
we feel a port of VMI to 64-bit Linux can be done without too much
trouble; (b) the Xen compatibility layer needs some work to bring it
up to the Xen 3.0 interfaces.  Work is underway on this already, and
no major issues are expected at this time.

Two final notes.  This is not an attempt to force a proprietary interface
into the Linux kernel.  This is an attempt to find a common interface
that can be used by many hypervisors by isolating hypervisor specific
idioms into a neutral layer.  This new layer is just what it claims to
be - a virtual machine interface, which allows hypervisor dependent code
to be abstracted in a way that benefits both Linux and hypervisor
development.

This is also not an attempt to define an exact and final specification
of how virtualization should be done in Linux.  This is very much a work
in progress, and it is understood that the interfaces proposed here will
change in time to accommodate the needs of all interested parties.  We 
hope to find a common solution that can eventually become part of the
Linux kernel and serve as a model for other operating systems as well.

We appreciate your feedback on this design and the patches to Linux, and
welcome working with anyone who is interested in making virtualization
in Linux a friendly environment to innovate in.  If you find the ideas
here interesting, please volunteer to help improve them.


* Re: [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal
  2006-03-13 17:58 ` Zachary Amsden
  (?)
@ 2006-03-13 18:09 ` Arjan van de Ven
  2006-03-13 18:22   ` Zachary Amsden
  -1 siblings, 1 reply; 38+ messages in thread
From: Arjan van de Ven @ 2006-03-13 18:09 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Linus Torvalds, Linux Kernel Mailing List,
	Virtualization Mailing List, Xen-devel, Andrew Morton, Dan Hecht,
	Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li,
	Joshua LeVasseur, Chris Wright, Rik Van Riel, Jyothy Reddy,
	Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts,
	Leendert van Doorn

> Two final notes.  This is not an attempt to force a proprietary interface
> into the Linux kernel.  This is an attempt to find a common interface
> that can be used by many hypervisors by isolating hypervisor specific
> idioms into a neutral layer.  This new layer is just what it claims to
> be - a virtual machine interface, which allows hypervisor dependent code
> to be abstracted in a way that benefits both Linux and hypervisor
> development.


such an interface should be defined with source visibility of both sides
though. At least of one user. Can XEN or any of the other open
hypervisors use this? What does it look like? And if not, why not,
wouldn't that make VMI a VMwareInterface instead ? ;)

Why can't vmware use the Xen interface instead?




* Re: [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal
  2006-03-13 18:09 ` Arjan van de Ven
@ 2006-03-13 18:22   ` Zachary Amsden
  2006-03-13 18:26     ` Arjan van de Ven
  2006-03-15 10:25     ` Christoph Hellwig
  0 siblings, 2 replies; 38+ messages in thread
From: Zachary Amsden @ 2006-03-13 18:22 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Linus Torvalds, Linux Kernel Mailing List,
	Virtualization Mailing List, Xen-devel, Andrew Morton, Dan Hecht,
	Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li,
	Joshua LeVasseur, Chris Wright, Rik Van Riel, Jyothy Reddy,
	Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts,
	Leendert van Doorn

Arjan van de Ven wrote:
>> Two final notes.  This is not an attempt to force a proprietary interface
>> into the Linux kernel.  This is an attempt to find a common interface
>> that can be used by many hypervisors by isolating hypervisor specific
>> idioms into a neutral layer.  This new layer is just what it claims to
>> be - a virtual machine interface, which allows hypervisor dependent code
>> to be abstracted in a way that benefits both Linux and hypervisor
>> development.
>>     
>
>
> such an interface should be defined with source visibility of both sides
> though. At least of one user. Can XEN or any of the other open
> hypervisors use this? What does it look like? And if not, why not,
> wouldn't that make VMI a VMwareInterface instead ? ;)
>   

Yes, Xen can use this interface, even without modification to Xen.  The 
interface was used successfully to run a VMI kernel on Xen 2.0.  As it 
stands now, the interface does need to change a bit to accommodate Xen
3.0 - but it is possible to do.  Rather than wait until we have a 
working prototype of that, we thought the interface itself warrants 
discussion now.

> Why can't vmware use the Xen interface instead?
>   

We could.  But it is our opinion that the Xen interface is unnecessarily 
complicated, without a clean separation between the layer of interaction 
with the hypervisor and the kernel proper.  The interface we propose we 
believe is more powerful, and more conducive to performance 
optimizations while providing significant advantages - most 
specifically, a single binary image that is properly virtualizable on 
multiple hypervisors and capable of running on native hardware.

Zach


* Re: [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal
  2006-03-13 18:22   ` Zachary Amsden
@ 2006-03-13 18:26     ` Arjan van de Ven
  2006-03-13 18:30       ` Zachary Amsden
  2006-03-15 10:25     ` Christoph Hellwig
  1 sibling, 1 reply; 38+ messages in thread
From: Arjan van de Ven @ 2006-03-13 18:26 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Linus Torvalds, Linux Kernel Mailing List,
	Virtualization Mailing List, Xen-devel, Andrew Morton, Dan Hecht,
	Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li,
	Joshua LeVasseur, Chris Wright, Rik Van Riel, Jyothy Reddy,
	Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts,
	Leendert van Doorn

>   The interface we propose we 
> believe is more powerful, and more conducive to performance 
> optimizations while providing significant advantages - most 
> specifically, a single binary image that is properly virtualizable on 
> multiple hypervisors and capable of running on native hardware.

that is mostly an advantage in the binary world though.. less so in the
open source world.




* Re: [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal
  2006-03-13 18:26     ` Arjan van de Ven
@ 2006-03-13 18:30       ` Zachary Amsden
  2006-03-13 18:42         ` Arjan van de Ven
  2006-03-13 18:56         ` Hollis Blanchard
  0 siblings, 2 replies; 38+ messages in thread
From: Zachary Amsden @ 2006-03-13 18:30 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Linus Torvalds, Linux Kernel Mailing List,
	Virtualization Mailing List, Xen-devel, Andrew Morton, Dan Hecht,
	Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li,
	Joshua LeVasseur, Chris Wright, Rik Van Riel, Jyothy Reddy,
	Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts,
	Leendert van Doorn

Arjan van de Ven wrote:
>>   The interface we propose we 
>> believe is more powerful, and more conducive to performance 
>> optimizations while providing significant advantages - most 
>> specifically, a single binary image that is properly virtualizable on 
>> multiple hypervisors and capable of running on native hardware.
>>     
>
> that is mostly an advantage in the binary world though.. less so in the
> open source world.
>   

It is an advantage for everyone.  It cuts support and certification 
costs for Linux distributors, software vendors, makes debugging and 
development easier, and gives hypervisors room to grow while maintaining 
binary compatibility with already released kernels.

Zach


* Re: [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal
  2006-03-13 18:30       ` Zachary Amsden
@ 2006-03-13 18:42         ` Arjan van de Ven
  2006-03-13 18:48           ` Zachary Amsden
                             ` (2 more replies)
  2006-03-13 18:56         ` Hollis Blanchard
  1 sibling, 3 replies; 38+ messages in thread
From: Arjan van de Ven @ 2006-03-13 18:42 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Linus Torvalds, Linux Kernel Mailing List,
	Virtualization Mailing List, Xen-devel, Andrew Morton, Dan Hecht,
	Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li,
	Joshua LeVasseur, Chris Wright, Rik Van Riel, Jyothy Reddy,
	Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts,
	Leendert van Doorn

On Mon, 2006-03-13 at 10:30 -0800, Zachary Amsden wrote:
> Arjan van de Ven wrote:
> >>   The interface we propose we 
> >> believe is more powerful, and more conducive to performance 
> >> optimizations while providing significant advantages - most 
> >> specifically, a single binary image that is properly virtualizable on 
> >> multiple hypervisors and capable of running on native hardware.
> >>     
> >
> > that is mostly an advantage in the binary world though.. less so in the
> > open source world.
> >   
> 
> It is an advantage for everyone.  It cuts support and certification 
> costs for Linux distributors,

that I'll buy
>  software vendors, 

that I'll buy a lot less except those with kernel modules (which is
evil ;)
> makes debugging and 
> development easier,

that I don't buy; a fixed interface tends to make debugging harder not
easier since you can't change it to add more information

>  and gives hypervisors room to grow while maintaining 
> binary compatibility with already released kernels.

that I buy for binary only hypervisors. But in an open source world I'll
buy this a LOT less as being relevant.




* Re: [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal
  2006-03-13 18:42         ` Arjan van de Ven
@ 2006-03-13 18:48           ` Zachary Amsden
  2006-03-13 19:02             ` Chris Wright
  2006-03-13 18:52           ` VMI interface documentation Zachary Amsden
  2006-03-13 18:56           ` [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal Joshua LeVasseur
  2 siblings, 1 reply; 38+ messages in thread
From: Zachary Amsden @ 2006-03-13 18:48 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Linus Torvalds, Linux Kernel Mailing List,
	Virtualization Mailing List, Xen-devel, Andrew Morton, Dan Hecht,
	Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li,
	Joshua LeVasseur, Chris Wright, Rik Van Riel, Jyothy Reddy,
	Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts,
	Leendert van Doorn

Arjan van de Ven wrote:
>
>> makes debugging and 
>> development easier,
>>     
>
> that I don't buy; a fixed interface tends to make debugging harder not
> easier since you can't change it to add more information

This we find to be quite true.  Now, you can use a VMI kernel, make 
changes to it, run it on native hardware, and be confident that it will 
run properly in a VM as well.  And you can develop in a VM, with 
confidence that you can run on native hardware.  You can even replace 
the entire "ROM" image with your own custom debugging image to add any 
type of debugging or performance monitoring facility you want - and you 
have some very, very interesting hook points into the kernel that make 
that task much more achievable.


> that I buy for binary only hypervisors. But in an open source world I'll
> buy this a LOT less as being relevant.
>   

This is not about the open source versus the closed source world.  It is 
about the real world, where customers want to make as few changes as 
possible to a working and already deployed system.  If they have to 
recompile a kernel just to get their system to run again, that is a pain 
point that is easily avoided.


Zach


* VMI interface documentation
  2006-03-13 18:42         ` Arjan van de Ven
  2006-03-13 18:48           ` Zachary Amsden
@ 2006-03-13 18:52           ` Zachary Amsden
  2006-03-13 18:56           ` [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal Joshua LeVasseur
  2 siblings, 0 replies; 38+ messages in thread
From: Zachary Amsden @ 2006-03-13 18:52 UTC (permalink / raw)
  To: Linus Torvalds, Linux Kernel Mailing List,
	Virtualization Mailing List, Xen-devel
  Cc: Andrew Morton, Joshua LeVasseur, Leendert van Doorn, Jack Lo,
	Dan Hecht, Christopher Li, Jan Beulich, Wim Coekaerts,
	Chris Wright, Pratap Subrahmanyam, Anne Holler, Jyothy Reddy,
	Kip Macy, Ky Srinivasan, Arjan van de Ven, Dan Arai

[-- Attachment #1: Type: text/plain, Size: 111 bytes --]

Sorry for the spam if you got multiples - vger doesn't seem to like my 
documentation.  I've attached it here.

[-- Attachment #2: vmi_spec.txt --]
[-- Type: text/plain, Size: 91945 bytes --]


	Paravirtualization API Version 2.0

	Zachary Amsden, Daniel Arai, Daniel Hecht, Pratap Subrahmanyam
	Copyright (C) 2005, 2006, VMware, Inc.
	All rights reserved

Revision history:
         1.0: Initial version
         1.1: arai 2005-11-15
              Added SMP-related sections: AP startup and Local APIC support
         1.2: dhecht 2006-02-23
              Added Time Interface section and Time related VMI calls

Contents

1) Motivations
2) Overview
    Initialization
    Privilege model
    Memory management
    Segmentation
    Interrupt and I/O subsystem
    IDT management
    Transparent Paravirtualization
    3rd Party Extensions
    AP Startup
    State Synchronization in SMP systems
    Local APIC Support
    Time Interface
3) Architectural Differences from Native Hardware
4) ROM Implementation
    Detection
    Data layout
    Call convention
    PCI implementation

Appendix A - VMI ROM low level ABI
Appendix B - VMI C prototypes
Appendix C - Sensitive x86 instructions


1) Motivations

   There are several high level goals which must be balanced in designing
   an API for paravirtualization.  The most general concerns are:

   Portability      - it should be easy to port a guest OS to use the API
   High performance - the API must not obstruct a high performance
                      hypervisor implementation
   Maintainability  - it should be easy to maintain and upgrade the guest
                      OS
   Extensibility    - it should be possible to expand the API in the
                      future

   Portability.

     Paravirtualization, unlike full virtualization, works by modifying
     the guest operating system.  This means there is implicitly some
     code cost to port a guest OS to run in a paravirtual environment.
     The more closely the API resembles a native platform which the OS
     supports, the lower the cost of porting.
     Rather than provide an alternative, high level interface for this
     API, the approach is to provide a low level interface which
     encapsulates the sensitive and performance critical parts of the
     system.  Thus, we have direct parallels to most privileged
     instructions, and the process of converting a guest OS to use these
     instructions is in many cases a simple replacement of one function
     for another. Although this is sufficient for CPU virtualization,
     performance concerns have forced us to add additional calls for
     memory management, and notifications about updates to certain CPU
     data structures. Support for this in the Linux operating system has
     proved to be very minimal in cost because of the already somewhat
     portable and modular design of the memory management layer.
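
     The "simple replacement of one function for another" described
     above can be pictured as a table of function pointers that is
     filled in once at boot.  The following is only a minimal sketch
     under stated assumptions: the names cpu_ops, active_ops,
     load_page_directory and the hv_* functions are hypothetical and do
     not come from this specification, and plain variables stand in
     for the real %cr3 register and for hypervisor state so that the
     code can run in user space.

```c
#include <stdint.h>

/* Hypothetical indirection table; the field names are illustrative
 * and are not taken from the VMI specification. */
struct cpu_ops {
    void     (*write_cr3)(uint32_t value);
    uint32_t (*read_cr3)(void);
};

/* Stand-ins for hardware and hypervisor state (assumption: a real
 * kernel would touch %cr3 or call into the hypervisor instead). */
uint32_t hw_cr3;
uint32_t hv_cr3;
unsigned hv_calls;

/* Native flavour: models a direct write to the control register. */
void native_write_cr3(uint32_t value) { hw_cr3 = value; }
uint32_t native_read_cr3(void) { return hw_cr3; }

/* Paravirtual flavour: the same operation becomes a hypervisor call. */
void hv_write_cr3(uint32_t value) { hv_calls++; hv_cr3 = value; }
uint32_t hv_read_cr3(void) { return hv_cr3; }

const struct cpu_ops native_ops = { native_write_cr3, native_read_cr3 };
const struct cpu_ops hv_ops     = { hv_write_cr3, hv_read_cr3 };

/* Selected once at boot; the default is native operation. */
const struct cpu_ops *active_ops = &native_ops;

/* Kernel code calls this instead of writing %cr3 directly. */
void load_page_directory(uint32_t phys_addr)
{
    active_ops->write_cr3(phys_addr);
}
```

     With this shape, the calling code is identical on native hardware
     and under a hypervisor; only the pointer stored in active_ops
     differs.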

   High Performance.

     A low level API that closely resembles hardware does not, by
     itself, provide any support for compound operations; typical
     compound operations on hardware include updating many page table
     entries, flushing system TLBs, or providing floating point safety.
     Since these operations may require several privileged or sensitive
     operations, it becomes important to defer some of these operations
     until explicit flushes are issued, or to provide higher level
     operations around some of these functions.  In order to keep with
     the goal of portability, this has been done only when deemed
     necessary for performance reasons, and we have tried to package
     these compound operations into methods that are typically used in
     guest operating systems.  In the future, we envision that additional
     higher level abstractions will be added as an adjunct to the
     low-level API.  These higher level abstractions will target large
     bulk operations such as creation and destruction of address spaces,
     context switches, and thread creation and control.
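
     One way to picture deferring operations until an explicit flush
     is a small queue of pending page table writes.  This is a sketch
     under stated assumptions, not the mechanism this API defines: the
     names pte_queue, pte_queue_set and pte_queue_flush and the queue
     size are hypothetical, and the "hypervisor call" is modeled as a
     plain loop that applies the queued writes.

```c
#include <stddef.h>
#include <stdint.h>

#define PTE_QUEUE_MAX 8   /* hypothetical queue depth */

struct pte_update {
    uint32_t *ptep;       /* page table entry to modify */
    uint32_t  val;        /* new value for that entry */
};

struct pte_queue {
    struct pte_update pending[PTE_QUEUE_MAX];
    size_t   count;       /* number of queued updates */
    unsigned flushes;     /* how many batches have been applied */
};

/* Apply every queued update in one batch.  In a real system this is
 * where a single call into the hypervisor would be made. */
void pte_queue_flush(struct pte_queue *q)
{
    if (q->count == 0)
        return;
    for (size_t i = 0; i < q->count; i++)
        *q->pending[i].ptep = q->pending[i].val;
    q->count = 0;
    q->flushes++;
}

/* Queue an update; nothing is visible until the next flush.  A full
 * queue is flushed first so that no update is ever dropped. */
void pte_queue_set(struct pte_queue *q, uint32_t *ptep, uint32_t val)
{
    if (q->count == PTE_QUEUE_MAX)
        pte_queue_flush(q);
    q->pending[q->count].ptep = ptep;
    q->pending[q->count].val  = val;
    q->count++;
}
```

     Updating many entries then costs one batch per flush rather than
     one privileged operation per entry, at the price that the caller
     must flush before relying on the new mappings.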

   Maintainability.

     In the course of development with a virtualized environment, it is
     not uncommon for support of new features or higher performance to
     require radical changes to the operation of the system.  If these
     changes are visible to the guest OS in a paravirtualized system,
     this will require updates to the guest kernel, which presents a
     maintenance problem.  In the Linux world, the rapid pace of
     development on the kernel means new kernel versions are produced
     every few months.  This rapid pace is not always appropriate for end
     users, so it is not uncommon to have dozens of different versions of
     the Linux kernel in use that must be actively supported.  To keep
     this many versions in sync with potentially radical changes in the
     paravirtualized system is not a scalable solution.  To reduce the
     maintenance burden as much as possible, while still allowing the
     implementation to accommodate changes, the design provides a stable
     ABI with semantic invariants.  The underlying implementation of the
     ABI, and the details of what data it exchanges with the hypervisor
     and how, are not visible to the guest OS.  As a result, in most
     cases, the guest OS need not even be recompiled to work with a newer
     hypervisor.  This allows performance optimizations, bug fixes,
     debugging, or statistical instrumentation to be added to the API
     implementation without any impact on the guest kernel.  This is
     achieved by publishing a block of code from the hypervisor in the
     form of a ROM.  The guest OS makes calls into this ROM to perform
     privileged or sensitive actions in the system.

   Extensibility.

     In order to provide a vehicle for new features, new device support,
     and general evolution, the API uses feature compartmentalization
     with controlled versioning.  The API is split into sections, with
     each section having independent versions.  Each section has a top
     level version which is incremented for each major revision, with a
     minor version indicating incremental changes.  Version compatibility
     is based on matching the major version field, and changes of the
     major version are assumed to break compatibility.  This allows
     accurate matching of compatibility.  In the event of incompatible
     API changes, multiple APIs may be advertised by the hypervisor if it
     wishes to support older versions of guest kernels.  This provides
     the most general forward / backward compatibility possible.
     Currently, the API has a core section for CPU / MMU virtualization
     support, with additional sections provided for each supported device
     class.
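
     The major-version matching rule can be sketched as follows.  The
     struct layout and the names api_version, api_compatible and
     api_select are assumptions made for illustration; the text above
     only states that compatibility is decided by the major version
     and that a hypervisor may advertise several APIs.  Returning the
     first compatible entry is likewise an arbitrary choice.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-section version record. */
struct api_version {
    uint16_t major_ver;   /* bumped on incompatible changes */
    uint16_t minor_ver;   /* bumped on incremental changes */
};

/* Compatibility is decided by the major version alone. */
bool api_compatible(struct api_version guest, struct api_version offered)
{
    return guest.major_ver == offered.major_ver;
}

/* Return the first advertised version the guest can use, or NULL if
 * the hypervisor advertises nothing compatible. */
const struct api_version *
api_select(struct api_version guest,
           const struct api_version *advertised, size_t count)
{
    for (size_t i = 0; i < count; i++)
        if (api_compatible(guest, advertised[i]))
            return &advertised[i];
    return NULL;
}
```

     A hypervisor that advertises both a 2.x and a 1.x section would,
     under this sketch, keep older guest kernels working while newer
     guests pick up the newer interface.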

2) Overview

   Initialization.

     Initialization is done with a bootstrap loader that creates
     the "start of day" state.  This is a known state, running 32-bit
     protected mode code with paging enabled.  The guest has all the
     standard structures in memory that are provided by a native ROM
     boot environment, including a memory map and ACPI tables.  For
     the native hardware, this bootstrap loader can be run before
     the kernel code proper, and this environment can be created
     readily from within the hypervisor for the virtual case.  At
     some point, the bootstrap loader or the kernel itself invokes
     the initialization call to enter paravirtualized mode.

   Privilege Model.

     The guest kernel must be modified to run at a dynamic privilege
     level, since if entry to paravirtual mode is successful, the kernel
     is no longer allowed to run at the highest hardware privilege level.
     On the IA-32 architecture, this means the kernel will be running at
     CPL 1-2, with the hypervisor running at CPL 0 and user code at
     CPL 3.  The IOPL will be lowered as well to avoid giving the guest
     direct access to hardware ports and control of the interrupt flag.

     This change causes certain IA-32 instructions to become "sensitive",
     so additional support for clearing and setting the hardware
     interrupt flag is present.  Since the switch into paravirtual mode
     may happen dynamically, the guest OS must not rely on testing for a
     specific privilege level by checking the RPL field of segment
     selectors, but should check for privileged execution by performing
     an (RPL != 3 && !EFLAGS_VM) comparison.  This means the DPL of kernel
     ring descriptors in the GDT or LDT may be raised to match the CPL of
     the kernel.  This change is visible by inspecting the segment
     registers while running in privileged code, and by using the LAR
     instruction.
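
     The (RPL != 3 && !EFLAGS_VM) comparison can be written as a small
     helper.  This is a sketch: the function name is_kernel_mode and
     the macro names are hypothetical, while the bit positions (RPL in
     the low two bits of a selector, VM as bit 17 of EFLAGS) are the
     architectural ones on IA-32.

```c
#include <stdbool.h>
#include <stdint.h>

#define SEGMENT_RPL_MASK 0x3u         /* low two bits of a selector */
#define USER_RPL         0x3u         /* ring 3 */
#define EFLAGS_VM        0x00020000u  /* virtual-8086 mode, bit 17 */

/* True when the saved CS and EFLAGS describe privileged execution.
 * Any ring other than 3 counts as kernel, so the test keeps working
 * whether the kernel runs at CPL 0, 1 or 2. */
bool is_kernel_mode(uint16_t cs, uint32_t eflags)
{
    return (cs & SEGMENT_RPL_MASK) != USER_RPL && !(eflags & EFLAGS_VM);
}
```

     A test written as (RPL == 0) would misclassify a kernel that has
     been moved to ring 1 or 2, which is why the comparison is against
     ring 3 instead.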

     The system also cannot be allowed to write directly to the hardware
     GDT, LDT, IDT, or TSS, so these data structures are maintained by the
     hypervisor, and may be shadowed or guest visible structures.  These
     structures are required to be page aligned to support non-shadowed
     operation.

     Currently, the system only provides for two guest security domains,
     kernel (which runs at the equivalent of virtual CPL-0), and user
     (which runs at the equivalent of virtual CPL-3, with no hardware
     access).  Typically, this is not a problem, but if a guest OS relies
     on using multiple hardware rings for privilege isolation, this
     interface would need to be expanded to support that.

   Memory Management.

     Since a virtual machine typically does not have access to all the
     physical memory on the machine, there is a need to redefine the
     physical address space layout for the virtual machine.  The
     spectrum of possibilities ranges from presenting the guest with
     a view of a physically contiguous memory of a boot-time determined
     size, exactly what the guest would see when running on hardware, to
     the opposite, which presents the guest with the actual machine pages
     which the hypervisor has allocated for it.  Using the latter approach
     requires the guest to obtain information about the pages it has
     from the hypervisor; this can be done by using the memory map which
     would normally be passed to the guest by the BIOS.

     The interface is designed to support either mode of operation.
     This allows the implementation to use either direct page tables
     or shadow page tables, or some combination of both.  All writes to
     page table entries are done through calls to the hypervisor
     interface layer.  The guest notifies the hypervisor about page
     table updates, flushes, and invalidations through API calls.

     The guest OS is also responsible for notifying the hypervisor about
     which pages in its physical memory are going to be used to hold page
     tables or page directories.  Both PAE and non-PAE paging modes are
     supported.  When the guest is finished using pages as page tables, it
     should release them promptly to allow the hypervisor to free the
     page table shadows.  Using a page as both a page table and a page
     directory for linear page table access is possible, but currently
     not supported by our implementation.

     The hypervisor lives concurrently in the same address space as the
     guest operating system.  Although this is not strictly necessary on
     IA-32 hardware, performance would be severely degraded if that were
     not the case.  The hypervisor must therefore reserve some portion of
     linear address space for its own use. The implementation currently
     reserves the top 64 megabytes of linear space for the hypervisor.
     This requires the guest to relocate any data in high linear space
     down by 64 megabytes.  For non-paging mode guests, this means the
     high 64 megabytes of physical memory should be reserved.  Because
     page tables are not sensitive to CPL, only to user/supervisor level,
     the hypervisor must additionally use segment protection to ensure
     that the guest cannot access this 64 megabyte region.

     An experimental patch is available to enable boot-time sizing of
     the hypervisor hole.

   Segmentation.

     The IA-32 architecture provides segmented virtual memory, which can
     be used as another form of privilege separation.  Each segment
     contains a base, limit, and properties.  The base is added to the
     virtual address to form a linear address.  The limit determines the
     length of linear space which is addressable through the segment.
     The properties determine read/write, code and data size of the
     region, as well as the direction in which segments grow.  Segments
     are loaded from descriptors in one of two system tables, the GDT or
     the LDT, and the values loaded are cached until the next load of the
     segment.  This property, known as segment caching, allows the
     machine to be put into a non-reversible state by writing over the
     descriptor table entry from which a segment was loaded.  There is no
     efficient way to extract the base field of the segment after it is
     loaded, as it is hidden by the processor.  In a hypervisor
     environment, the guest OS can be interrupted at any point in time by
     interrupts and NMIs which must be serviced by the hypervisor.  The
     hypervisor must be able to recreate the original guest state when it
     is done servicing the external event.

     To avoid creating non-reversible segments, the hypervisor will
     forcibly reload any live segment registers that are updated by
     writes to the descriptor tables.  *N.B - in the event that a segment
     is put into an invalid or not present state by an update to the
     descriptor table, the segment register must be forced to NULL so
     that reloading it will not cause a general protection fault (#GP)
     when restoring the guest state.  This may require the guest to save
     the segment register value before issuing a hypervisor API call
     which will update the descriptor table.*

     Because the hypervisor must protect its own memory space from
     privileged code running in the guest at CPL1-2, descriptors may not
     provide access to the 64 megabyte region of high linear space.  To
     achieve this, the hypervisor will truncate descriptors in the
     descriptor tables.  This means that attempts by the guest to access
     through negative offsets to the segment base will fault, so this is
     highly discouraged (some TLS implementations on Linux do this).
     In addition, this causes the truncated length of the segment to
     become visible to the guest through the LSL instruction.
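
     One way to picture the truncation is the sketch below.  It is an
     assumption about how a hypervisor might clamp an expand-up
     segment, not a description of the actual implementation; the names
     HOLE_START and truncated_limit are hypothetical, and the hole is
     taken to be the top 64 megabytes as stated above.

```c
#include <stdint.h>

#define HOLE_START 0xFC000000u   /* 4 GB - 64 MB: first byte of the hole */

/* Hypothetical helper for an expand-up segment whose base lies below
 * the hole.  'limit' is the last valid offset; the result is clamped
 * so that base + limit never reaches HOLE_START. */
static inline uint32_t truncated_limit(uint32_t base, uint32_t limit)
{
    uint32_t max_limit = HOLE_START - 1u - base;

    return limit < max_limit ? limit : max_limit;
}
```

     A flat segment (base 0, limit 0xFFFFFFFF) comes back with limit
     0xFBFFFFFF, which is the value LSL would then report.  An access at
     offset -4 (0xFFFFFFFC) exceeds any clamped limit, which is why
     negative offsets from the segment base fault.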

   Interrupt and I/O Subsystem.

     For security reasons, the guest operating system is not given
     control over the hardware interrupt flag.  We provide a virtual
     interrupt flag that is under guest control.  The virtual operating
     system always runs with hardware interrupts enabled, but hardware
     interrupts are transparent to the guest.  The API provides calls for
     all instructions which modify the interrupt flag.

     The paravirtualization environment provides a legacy programmable
     interrupt controller (PIC) to the virtual machine.  Future releases
     will provide a virtual interrupt controller (VIC) that provides
     more advanced features.

     In addition to a virtual interrupt flag, there is also a virtual
     IOPL field which the guest can use to enable access to port I/O
     from userspace for privileged applications.

     Generic PCI based device probing is available to detect virtual
     devices.  The use of PCI is pragmatic, since it allows a vendor
     ID, class ID, and device ID to identify the appropriate driver
     for each virtual device.

   IDT Management.

     The paravirtual operating environment provides the traditional x86
     interrupt descriptor table for handling external interrupts,
     software interrupts, and exceptions.  The interrupt descriptor table
     provides the destination code selector and EIP for interruptions.
     The current task state structure (TSS) provides the new stack
     address to use for interruptions that result in a privilege level
     change.  The guest OS is responsible for notifying the hypervisor
     when it updates the stack address in the TSS.

     Two types of indirect control flow are of critical importance to the
     performance of an operating system.  These are system calls and page
     faults.  The guest is also responsible for calling out to the
     hypervisor when it updates gates in the IDT.  Making IDT and TSS
     updates known to the hypervisor in this fashion allows efficient
     delivery through these performance critical gates.

   Transparent Paravirtualization.

     The guest operating system may compile in an alternative
     implementation of the VMI option ROM.  This implementation should
     provide versions of the VMI calls that are suitable for running on
     native x86 hardware.  This code may be used by the guest operating
     system while it is being loaded, and may also be used if the
     operating system is loaded on hardware that does not support
     paravirtualization.

     When the guest detects that the VMI option ROM is available, it
     replaces the compiled-in version of the ROM with the ROM provided by
     the platform.  This can be accomplished by copying the ROM contents,
     or by remapping the virtual address containing the compiled-in ROM
     to point to the platform's ROM.  When booting on a platform that
     does not provide a VMI ROM, the operating system can continue to use
     the compiled-in version to run in a non-paravirtualized fashion.

   3rd Party Extensions.

     If desired, it should be possible for 3rd party virtual machine
     monitors to implement a paravirtualization environment that can run
     guests written to this specification.

     The general mechanism for providing customized features and
     capabilities is to provide notification of these features through
     the CPUID call, and to allow configuration of CPU features
     through RDMSR / WRMSR instructions.  This allows a hypervisor vendor
     ID to be published, and the kernel may enable or disable specific
     features based on this id.  This has the advantage of following
     closely the boot time logic of many operating systems that enables
     certain performance enhancements or bugfixes based on processor
     revision, using exactly the same mechanism.

     An exact formal specification of the new CPUID functions and which
     functions are vendor specific is still needed.

   AP Startup.

     Application Processor startup in paravirtual SMP systems works a bit 
     differently than in a traditional x86 system.

     APs will launch directly in paravirtual mode with initial state
     provided by the BSP.  Rather than the traditional init/startup
     IPI sequence, the BSP must issue the init IPI, a set application
     processor state hypercall, followed by the startup IPI.

     The initial state contains the AP's control registers, general
     purpose registers and segment registers, as well as the IDTR, 
     GDTR, LDTR and EFER.  Any processor state not included in the initial
     AP state (including x87 FPRs, SSE register states, and MSRs other than
     EFER) is left in the power-on state.

     The BSP must construct the initial GDT used by each AP.  The segment
     register hidden state will be loaded from the GDT specified in the
     initial AP state.  The IDT and (if used) LDT may either be constructed by
     the BSP or by the AP.

     Similarly, the initial page tables used by each AP must also be
     constructed by the BSP.

     If an AP's initial state is invalid, or no initial state is provided
     before a start IPI is received by that AP, then the AP will fail to start.
     It is therefore advisable to have a timeout for waiting for APs to start,
     as is recommended for traditional x86 systems.

     See VMI_SetInitialAPState in Appendix A for a description of the
     VMI_SetInitialAPState hypercall and the associated APState data structure.
  
   State Synchronization In SMP Systems.
     
     Some in-memory data structures that may require no special synchronization
     on traditional x86 systems need special handling when run on a
     hypervisor.  Two of particular note are the descriptor tables and page
     tables.

     Each processor in an SMP system should have its own GDT and LDT.  Changes
     to each processor's descriptor tables must be made on that processor
     via the appropriate VMI calls.  There is no VMI interface for updating
     another CPU's descriptor tables (aside from VMI_SetInitialAPState),
     and the result of memory writes to other processors' descriptor tables
     is undefined.
     
     Page tables have slightly different semantics than in a traditional x86
     system.  As in traditional x86 systems, page table writes may not be
     respected by the current CPU until a TLB flush or invlpg is issued.
     In a paravirtual system, the hypervisor implementation is free to 
     provide either shared or private caches of the guest's page tables.
     Page table updates must therefore be propagated to the other CPUs
     before they are guaranteed to be noticed.
     
     In particular, when doing TLB shootdown, the initiating processor
     must ensure that all deferred page table updates are flushed to the
     hypervisor, to ensure that the receiving processor has the most up-to-date
     mapping when it performs its invlpg.

   Local APIC Support.

     A traditional x86 local APIC is provided by the hypervisor.  The local
     APIC is enabled and its address is set via the IA32_APIC_BASE MSR, as
     usual.  APIC registers may be read and written via ordinary memory
     operations.

     For performance reasons, higher performance APIC read and write interfaces
     are provided.  If possible, these interfaces should be used to access
     the local APIC.

     The IO-APIC is not included in this spec, as it is typically not
     performance critical, and used mainly for initial wiring of IRQ pins.
     Currently, we implement a fully functional IO-APIC with all the
     capabilities of real hardware.  This may seem like an unnecessary burden,
     but if the goal is transparent paravirtualization, the kernel must
     provide fallback support for an IO-APIC anyway.  In addition, the
     hypervisor must support an IO-APIC for SMP non-paravirtualized guests.
     The net result is less code on both sides, and an already well defined
     interface between the two.  This avoids the complexity burden of having
     to support two different interfaces to achieve the same task.

     One shortcut we have found most helpful is to simply disable NMI delivery
     to the paravirtualized kernel.  There is no reason NMIs can't be
     supported, but typical uses for them are not as productive in a
     virtualized environment.  Watchdog NMIs are of limited use if the OS is
     already correct and running on stable hardware; profiling NMIs are
     similarly of less use, since this task is accomplished with more accuracy
     in the VMM itself; and NMIs for machine check errors should be handled
     outside of the VM.  The addition of NMI support does create additional
     complexity for the trap handling code in the VM, and although the task is
     surmountable, the value proposition is debatable.  Here, again, feedback
     is desired.

   Time Interface.

     In a virtualized environment, virtual machines (VMs) will time share
     the system with each other and with other processes running on the
     host system.  Therefore, a VM's virtual CPUs (VCPUs) will be
     executing on the host's physical CPUs (PCPUs) for only some portion
     of time.  This section of the VMI exposes a paravirtual view of
     time to the guest operating systems so that they may operate more
     effectively in a virtual environment.  The interface also provides
     a way for the VCPUs to set alarms in this paravirtual view of time.

     Time Domains:

     a) Wallclock Time:

     Wallclock time exposed to the VM through this interface indicates
     the number of nanoseconds since epoch, 1970-01-01T00:00:00Z (ISO
     8601 date format).  If the host's wallclock time changes (say, when
     an error in the host's clock is corrected), so does the wallclock
     time as viewed through this interface.

     b) Real Time:

     Another view of time accessible through this interface is real
     time.  Real time always progresses except for when the VM is
     stopped or suspended.  Real time is presented to the guest as a
     counter which increments at a constant rate defined (and presented)
     by the hypervisor.  All the VCPUs of a VM share the same real time
     counter.

     The unit of the counter is called "cycles".  The unit and initial
     value (corresponding to the time the VM enters para-virtual mode)
     are chosen by the hypervisor so that the real time counter will not
     rollover in any practical length of time.  It is expected that the
     frequency (cycles per second) is chosen such that this clock
     provides a "high-resolution" view of time.  The unit can only
     change when the VM (re)enters paravirtual mode.

     c) Stolen time and Available time:

     A VCPU is always in one of three states: running, halted, or ready.
     The VCPU is in the 'running' state if it is executing.  When the
     VCPU executes the HLT interface, the VCPU enters the 'halted' state
     and remains halted until there is some work pending for the VCPU
     (e.g. an alarm expires, host I/O completes on behalf of virtual
     I/O).  At this point, the VCPU enters the 'ready' state (waiting
     for the hypervisor to reschedule it).  Finally, at any time when
     the VCPU is in neither the 'running' state nor the 'halted' state,
     it is in the 'ready' state.

     For example, consider the following sequence of events, with times
     given in real time:

     (Example 1)

     At 0 ms, VCPU executing guest code.
     At 1 ms, VCPU requests virtual I/O.
     At 2 ms, Host performs I/O for virtual I/O.
     At 3 ms, VCPU executes VMI_Halt.
     At 4 ms, Host completes I/O for virtual I/O request.
     At 5 ms, VCPU begins executing guest code, vectoring to the interrupt 
              handler for the device initiating the virtual I/O.
     At 6 ms, VCPU preempted by hypervisor.
     At 9 ms, VCPU begins executing guest code.

     From 0 ms to 3 ms, VCPU is in the 'running' state.  At 3 ms, VCPU
     enters the 'halted' state and remains in this state until the 4 ms
     mark.  From 4 ms to 5 ms, the VCPU is in the 'ready' state.  At 5
     ms, the VCPU re-enters the 'running' state until it is preempted by
     the hypervisor at the 6 ms mark.  From 6 ms to 9 ms, VCPU is again
     in the 'ready' state, and finally 'running' again after 9 ms.

     Stolen time is defined per VCPU to progress at the rate of real
     time when the VCPU is in the 'ready' state, and does not progress
     otherwise.  Available time is defined per VCPU to progress at the
     rate of real time when the VCPU is in the 'running' and 'halted'
     states, and does not progress when the VCPU is in the 'ready'
     state.

     So, for the above example, the following table indicates these time
     values for the VCPU at each ms boundary:

     Real time    Stolen time    Available time
      0            0              0
      1            0              1
      2            0              2
      3            0              3
      4            0              4
      5            1              4
      6            1              5
      7            2              5
      8            3              5
      9            4              5
     10            4              6

     Notice that at any point:
        real_time == stolen_time + available_time

     Stolen time and available time are also presented as counters in
     "cycles" units.  The initial value of the stolen time counter is 0.
     This implies the initial value of the available time counter is the
     same as the real time counter.
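
     The accounting rules can be checked with a small model.  The sketch
     below is illustrative only; the type and function names are
     hypothetical and not part of the interface.  It replays Example 1
     in millisecond units and reproduces the final row of the table.

```c
#include <stdint.h>

enum vcpu_state { VCPU_RUNNING, VCPU_HALTED, VCPU_READY };

struct vcpu_time {
    uint64_t real;       /* real time counter */
    uint64_t stolen;     /* advances only while 'ready' */
    uint64_t available;  /* advances while 'running' or 'halted' */
};

/* Advance the counters by 'delta' units spent in 'state'. */
static inline void vcpu_account(struct vcpu_time *t, enum vcpu_state state,
                                uint64_t delta)
{
    t->real += delta;
    if (state == VCPU_READY)
        t->stolen += delta;
    else
        t->available += delta;
}

/* Replay Example 1 (times in ms) up to the 10 ms mark. */
static inline struct vcpu_time replay_example1(void)
{
    struct vcpu_time t = { 0, 0, 0 };

    vcpu_account(&t, VCPU_RUNNING, 3);  /* 0-3 ms */
    vcpu_account(&t, VCPU_HALTED, 1);   /* 3-4 ms */
    vcpu_account(&t, VCPU_READY, 1);    /* 4-5 ms */
    vcpu_account(&t, VCPU_RUNNING, 1);  /* 5-6 ms */
    vcpu_account(&t, VCPU_READY, 3);    /* 6-9 ms */
    vcpu_account(&t, VCPU_RUNNING, 1);  /* 9-10 ms */
    return t;
}
```

     Every interval is charged to exactly one of the two counters, so
     real == stolen + available holds by construction.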

     Alarms:

     Alarms can be set (armed) against the real time counter or the
     available time counter. Alarms can be programmed to expire once
     (one-shot) or on a regular period (periodic).  They are armed by
     indicating an absolute counter value expiry, and in the case of a
     periodic alarm, a non-zero relative period counter value.  [TBD:
     The method of wiring the alarms to an interrupt vector is dependent
     upon the virtual interrupt controller portion of the interface.
     Currently, the alarms may be wired as if they are attached to IRQ0
     or the vector in the local APIC LVTT.  This way, the alarms can be
     used as drop in replacements for the PIT or local APIC timer.]

     Alarms are per-vcpu mechanisms.  An alarm set by vcpu0 will fire
     only on vcpu0, while an alarm set by vcpu1 will only fire on vcpu1.
     If an alarm is set relative to available time, its expiry is a
     value relative to the available time counter of the vcpu that set
     it.

     The interface includes a method to cancel (disarm) an alarm.  On
     each vcpu, one alarm can be set against each of the two counters
     (real time and available time).  A vcpu in the 'halted' state
     becomes 'ready' when any of its alarms' counters reaches the
     expiry.

     An alarm "fires" by signaling the virtual interrupt controller.  An
     alarm will fire as soon as possible after the counter value is
     greater than or equal to the alarm's current expiry.  However, an
     alarm can fire only when its vcpu is in the 'running' state.

     If the alarm is periodic, a sequence of expiry values,

      E(i) = e0 + p * i ,  i = 0, 1, 2, 3, ...

     where 'e0' is the expiry specified when setting the alarm and 'p'
     is the period of the alarm, is used to arm the alarm.  Initially,
     E(0) is used as the expiry.  When the alarm fires, the next expiry
     value in the sequence that is greater than the current value of the
     counter is used as the alarm's new expiry.
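
     The re-arming rule for periodic alarms can be written as a short
     function.  This is a sketch under stated assumptions: the name
     next_expiry is hypothetical, 'p' must be non-zero, and counter
     overflow is ignored.

```c
#include <stdint.h>

/* Smallest E(i) = e0 + p * i that is strictly greater than 'now'. */
static inline uint64_t next_expiry(uint64_t e0, uint64_t p, uint64_t now)
{
    if (now < e0)
        return e0;                      /* E(0) has not been reached yet */

    uint64_t i = (now - e0) / p + 1;    /* first index with E(i) > now */
    return e0 + p * i;
}
```

     With e0 = 3 and p = 2, a counter value of 5 yields 7, and a counter
     value of 8 yields 9: expiries that were missed are skipped rather
     than queued.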

     One-shot alarms have only one expiry.  When a one-shot alarm fires,
     it is automatically disarmed.

     Suppose an alarm is set relative to real time with expiry at the 3
     ms mark and a period of 2 ms.  It will expire on these real time
     marks: 3, 5, 7, 9.  Note that even if the alarm does not fire
     during the 5 ms to 7 ms interval, the alarm can fire at most once
     during the 7 ms to 9 ms interval (unless, of course, it is
     reprogrammed).

     If an alarm is set relative to available time with expiry at the 1
     ms mark (in available time) and with a period of 2 ms, then it will
     expire on these available time marks: 1, 3, 5.  In the scenario
     described in example 1, those available time values correspond to
     these values in real time: 1, 3, 6.

3) Architectural Differences from Native Hardware.

     For the sake of performance, some requirements are imposed on kernel
     fault handlers which are not present on real hardware.  Most modern
     operating systems should have no trouble meeting these requirements.
     Failure to meet these requirements may prevent the kernel from
     working properly.

     1) The hardware flags on entry to a fault handler may not match
        the EFLAGS image on the fault handler stack.  The stack image
        is correct, and will have the correct state of the interrupt
        and arithmetic flags.

     2) The stack used for kernel traps must be flat - that is, zero base,
        segment limit determined by the hypervisor.

     3) On entry to any fault handler, the stack must have sufficient space
        to hold 32 bytes of data, or the guest may be terminated.

     4) When calling VMI functions, the kernel must be running on a
        flat 32-bit stack and code segment.

     5) Most VMI functions require flat data and extra segments (DS and
        ES) as well; notable exceptions are IRET and SYSEXIT.
        XXXPara - may need to add STI and CLI to this list.

     6) Interrupts must always be enabled when running code in userspace.

     7) IOPL semantics for userspace are changed; although userspace may be
        granted port access, it can not affect the interrupt flag.

     8) The EIPs at which faults may occur in VMI calls may not match the
        original native instruction EIP; this is a bug in the system
        today, as many guests do rely on lazy fault handling.

     9) On entry to V8086 mode, MSR_SYSENTER_CS is cleared to zero.

     10) Todo - we would like to support these features, but they are not
        fully tested and / or implemented:

        Userspace 16-bit stack support
        Proper handling of faulting IRETs

4) ROM Implementation

   Modularization

     Originally, we envisioned modularizing the ROM API into several
     subsections, but the close coupling between the initial layers
     and the requirement to support native PCI bus devices has made
     ROM components for network or block devices unnecessary so far.

        VMI - the virtual machine interface.  This is the core CPU, I/O
              and MMU virtualization layer.  I/O is currently limited
              to port access to emulated devices.
	
   Detection

      The presence of hypervisor ROMs can be recognized by scanning the
      upper region of the first megabyte of physical memory.  Multiple
      ROMs may be provided to support older API versions for legacy guest
      OS support.  ROM detection is done in the traditional manner, by
      scanning the memory region from C8000h - DFFFFh in 2 kilobyte
      increments.  The romSignature bytes must be '0x55, 0xAA', and the
      checksum of the region indicated by the romLength field must be zero.
      The checksum is a simple 8-bit addition of all bytes in the ROM region.
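
      The scan can be sketched as below.  This is an illustration, not
      the detection code from the patches: the function names are
      hypothetical, and 'region' is assumed to be a copy or mapping of
      the C8000h - DFFFFh range.  The length byte is read from offset 2,
      directly after the two signature bytes.

```c
#include <stddef.h>
#include <stdint.h>

#define ROM_SCAN_STEP  2048u   /* 2 kilobyte increments */
#define ROM_CHUNK      512u    /* unit of the length byte */

/* 8-bit sum of 'len' bytes; a valid ROM sums to zero. */
static inline uint8_t rom_checksum(const uint8_t *p, size_t len)
{
    uint8_t sum = 0;

    while (len--)
        sum = (uint8_t)(sum + *p++);
    return sum;
}

/* Returns the offset of the first valid ROM in 'region', or -1. */
static inline long find_rom(const uint8_t *region, size_t size)
{
    for (size_t off = 0; off + 3 <= size; off += ROM_SCAN_STEP) {
        const uint8_t *rom = region + off;

        if (rom[0] != 0x55 || rom[1] != 0xAA)
            continue;                       /* no signature here */

        size_t len = (size_t)rom[2] * ROM_CHUNK;
        if (len == 0 || len > size - off)
            continue;                       /* length does not fit */

        if (rom_checksum(rom, len) == 0)
            return (long)off;
    }
    return -1;
}
```

      A caller that needs to support several API versions would keep
      scanning past the first hit and compare the version fields of each
      ROM found; the sketch stops at the first valid one for brevity.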

   Data layout

      typedef struct HyperRomHeader {
         uint16_t        romSignature; 
         uint8_t         romLength;
         unsigned char   romEntry[4];
         uint8_t         romPad0;
         uint32_t        hyperSignature;
         uint8_t         APIVersionMinor;
         uint8_t         APIVersionMajor;
         uint8_t         reserved0;
         uint8_t         reserved1;
         uint32_t        reserved2;
         uint32_t        reserved3;
         uint16_t        pciHeaderOffset;
         uint16_t        pnpHeaderOffset;
         uint32_t        romPad3;
         char            reserved[32];
         char            elfHeader[64];
      } HyperRomHeader;

      The first set of fields is defined by the BIOS:

      romSignature - fixed 0xAA55, BIOS ROM signature
      romLength    - the length of the ROM, in 512 byte chunks.
                     Determines the area to be checksummed.
      romEntry     - 16-bit initialization code stub used by BIOS.
      romPad0      - reserved

      The next set of fields is defined by this API:

      hyperSignature  - a 4 byte signature providing recognition of the
                        device class represented by this ROM.  Each
                        device class defines its own unique signature.
      APIVersionMinor - the revision level of this device class' API.
                        This indicates incremental changes to the API.
      APIVersionMajor - the major version.  Used to indicate large
                        revisions or additions to the API which break
                        compatibility with the previous version.
      reserved0,1,2,3 - for future expansion

      The next set of fields is defined by the PCI / PnP BIOS spec:

      pciHeaderOffset - relative offset to the PCI device header from
                        the start of this ROM.
      pnpHeaderOffset - relative offset to the PnP boot header from the
                        start of this ROM.
      romPad3         - reserved by PCI spec.

      Finally, there is space for future header fields, and an area
      reserved for an ELF header to point to symbol information.
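
      As a sanity check on the layout, the header can be compiled with
      compile-time assertions on its offsets.  The sketch below repeats
      the structure so that it stands alone; the assertions and the
      helper rom_length_bytes are additions for illustration, and the
      sketch assumes a compiler that inserts no padding here (none is
      needed, since every field is naturally aligned).

```c
#include <stddef.h>
#include <stdint.h>

typedef struct HyperRomHeader {
    uint16_t        romSignature;
    uint8_t         romLength;      /* unsigned: a count of 512-byte chunks */
    unsigned char   romEntry[4];
    uint8_t         romPad0;
    uint32_t        hyperSignature;
    uint8_t         APIVersionMinor;
    uint8_t         APIVersionMajor;
    uint8_t         reserved0;
    uint8_t         reserved1;
    uint32_t        reserved2;
    uint32_t        reserved3;
    uint16_t        pciHeaderOffset;
    uint16_t        pnpHeaderOffset;
    uint32_t        romPad3;
    char            reserved[32];
    char            elfHeader[64];
} HyperRomHeader;

/* The PCI and PnP expansion ROM formats place these pointers at
 * offsets 18h and 1Ah from the start of the ROM image. */
_Static_assert(offsetof(HyperRomHeader, hyperSignature) == 0x08, "hyperSignature");
_Static_assert(offsetof(HyperRomHeader, pciHeaderOffset) == 0x18, "pciHeaderOffset");
_Static_assert(offsetof(HyperRomHeader, pnpHeaderOffset) == 0x1A, "pnpHeaderOffset");
_Static_assert(sizeof(HyperRomHeader) == 128, "header size");

/* Hypothetical helper: size in bytes of the region to checksum. */
static inline size_t rom_length_bytes(const HyperRomHeader *h)
{
    return (size_t)h->romLength * 512u;
}
```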

Appendix A - VMI ROM Low Level ABI

   OS writers intending to port their OS to the paravirtualizable x86
   processor being modeled by this hypervisor need to access the
   hypervisor through the VMI layer.  It is possible, although currently
   unimplemented, to add or replace the functionality of individual
   hypervisor calls by providing your own ROM images.  This is intended
   to allow third party customizations.
        
   VMI compatible ROMs use the signature "cVmi" in the hyperSignature
   field of the ROM header.

   Many of these calls are compatible with the SVR4 C call ABI, using up
   to three register arguments.   Some calls are not, due to restrictions
   of the native instruction set.  Calls which diverge from this ABI are
   noted.  In GNU terms, this means most of the calls are compatible with
   regparm(3) argument passing.

   Most of these calls behave as standard C functions, and as such, may
   clobber registers EAX, EDX, ECX, flags.  Memory clobbers are noted
   explicitly, since many of the calls may be inlined without a memory
   clobber.

   Most of these calls require well defined segment conventions - that is,
   flat full size 32-bit segments for all the general segments, CS, SS, DS,
   ES.  Exceptions in some cases are noted.

   The net result of these choices is that most of the calls are very
   easy to make from C-code, and calls that are likely to be required in
   low level trap handling code are easy to call from assembler.   Most
   of these calls are also very easily implemented by the hypervisor
   vendor in C code, and only the performance critical calls from
   assembler paths require custom assembly implementations.

   CORE INTERFACE CALLS
   
    This set of calls provides the base functionality to establish running
    the kernel in VMI mode.

    The interface will be expanded to include feature negotiation, more
    explicit control over call bundling and flushing, and hypervisor
    notifications to allow inline code patching.

    VMI_Init
   
       VMICALL VMI_INT VMI_Init(void);

       Initializes the hypervisor environment.  Returns zero on success,
       or -1 if the hypervisor could not be initialized.  Note that this
       is a recoverable error if the guest provides the requisite native
       code to support transparent paravirtualization.

       Inputs:      None
       Outputs:     EAX = result
       Clobbers:    Standard
       Segments:    Standard


   PROCESSOR STATE CALLS

    This set of calls controls the online status of the processor.  It
    includes interrupt control, reboot, halt, and shutdown functionality.
    Future expansions may include deep sleep and hotplug CPU capabilities.

    VMI_DisableInterrupts

       VMICALL void VMI_DisableInterrupts(void);

       Disable maskable interrupts on the processor.

       Inputs:      None
       Outputs:     None
       Clobbers:    Flags only
       Segments:    As this is both performance critical and likely to
          be called from low level interrupt code, this call does not
          require flat DS/ES segments, but uses the stack segment for
          data access.  Therefore only CS/SS must be well defined.

    VMI_EnableInterrupts

       VMICALL void VMI_EnableInterrupts(void);

       Enable maskable interrupts on the processor.  Note that the
       current implementation will always deliver any pending interrupts
       on a call which enables interrupts, for compatibility with kernel
       code which expects this behavior.  Whether this should be required
       is open for debate.

       Inputs:      None
       Outputs:     None
       Clobbers:    Flags only
       Segments:    CS/SS only

    VMI_GetInterruptMask

       VMICALL VMI_UINT VMI_GetInterruptMask(void);

       Returns the current interrupt state mask of the processor.  The
       mask is defined to be 0x200 (matching processor flag IF) to indicate
       interrupts are enabled.

       Inputs:      None
       Outputs:     EAX = mask
       Clobbers:    Flags only
       Segments:    CS/SS only

    VMI_SetInterruptMask
   
       VMICALL void VMI_SetInterruptMask(VMI_UINT mask);

       Set the current interrupt state mask of the processor.  Also
       delivers any pending interrupts if the mask is set to allow
       them.

       Inputs:      EAX = mask
       Outputs:     None
       Clobbers:    Flags only
       Segments:    CS/SS only
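
       A guest might build a save/restore pair on top of these calls,
       in the style of the kernel's local_irq_save / local_irq_restore.
       The sketch below is hypothetical: the three VMI functions are
       stand-ins that model only the virtual interrupt flag so that the
       example compiles on its own, VMI_UINT is assumed to be 32 bits
       wide, and a mask of zero is assumed to mean "disabled".

```c
#include <stdint.h>

typedef uint32_t VMI_UINT;      /* assumed width */
#define VMI_IF_MASK 0x200u      /* hypothetical name; matches flag IF */

/* Stand-ins for the ROM calls: they track the virtual interrupt flag
 * and do not model delivery of pending interrupts. */
static VMI_UINT virtual_if = VMI_IF_MASK;

static inline VMI_UINT VMI_GetInterruptMask(void)
{
    return virtual_if;
}

static inline void VMI_SetInterruptMask(VMI_UINT mask)
{
    virtual_if = mask & VMI_IF_MASK;
}

static inline void VMI_DisableInterrupts(void)
{
    virtual_if = 0;
}

/* Hypothetical guest helpers built on the calls above. */
static inline VMI_UINT guest_irq_save(void)
{
    VMI_UINT mask = VMI_GetInterruptMask();

    VMI_DisableInterrupts();
    return mask;
}

static inline void guest_irq_restore(VMI_UINT mask)
{
    VMI_SetInterruptMask(mask);
}
```

       Saving the mask rather than unconditionally re-enabling lets the
       pair nest: an inner restore leaves interrupts disabled if the
       outer section had already disabled them.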

    VMI_DeliverInterrupts (For future debate)

       Enable and deliver any pending interrupts.  This would remove
       the implicit delivery semantic from the SetInterruptMask and
       EnableInterrupts calls.

    VMI_Pause

       VMICALL void VMI_Pause(void);

       Pause the processor temporarily, to allow a hypertwin or remote
       CPU to continue operation without lock or cache contention.

       Inputs:      None
       Outputs:     None
       Clobbers:    Standard
       Segments:    Standard

    VMI_Halt

       VMICALL void VMI_Halt(void);

       Put the processor into interruptible halt mode.  This is defined
       to be a non-running mode where maskable interrupts are enabled,
       not a deep low power sleep mode.

       Inputs:      None
       Outputs:     None
       Clobbers:    Standard
       Segments:    Standard

    VMI_Shutdown

       VMICALL void VMI_Shutdown(void);

       Put the processor into non-interruptible halt mode.  This is defined
       to be a non-running mode where maskable interrupts are disabled, and
       indicates a power-off event for this CPU.

       Inputs:      None
       Outputs:     None
       Clobbers:    Standard
       Segments:    Standard

    VMI_Reboot

       VMICALL void VMI_Reboot(VMI_INT how);

       Reboot the virtual machine, using a hard or soft reboot.  A soft
       reboot corresponds to the effects of an INIT IPI, and preserves
       some APIC and CR state.  A hard reboot corresponds to a hardware
       reset.

       Inputs:      EAX = reboot mode
                      #define VMI_REBOOT_SOFT 0x0
                      #define VMI_REBOOT_HARD 0x1
       Outputs:     None
       Clobbers:    Standard
       Segments:    Standard

    VMI_SetInitialAPState

       void VMI_SetInitialAPState(APState *apState, VMI_UINT32 apicID);

       Sets the initial state of the application processor with local APIC ID
       "apicID" to the state in apState.  apState must be the page-aligned
       linear address of the APState structure describing the initial state of
       the specified application processor.

       Control register CR0 must have both PE and PG set;  the result of
       either of these bits being cleared is undefined.  It is recommended
       that for best performance, all processors in the system have the same
       setting of the CR4 PAE bit.  LME and LMA in EFER are both currently
       unsupported.  The result of setting either of these bits is undefined.

       Inputs:      EAX = pointer to APState structure for the new AP
                    EDX = APIC ID of processor to initialize
       Outputs:     None
       Clobbers:    Standard
       Segments:    Standard


   DESCRIPTOR RELATED CALLS

    VMI_SetGDT

       VMICALL void VMI_SetGDT(VMI_DTR *gdtr);

       Load the global descriptor table limit and base registers.  In
       addition to the straightforward load of the hardware registers, this
       has the additional side effect of reloading all segment registers in a
       virtual machine.  The reason is that otherwise, the hidden part of
       segment registers (the base field) may be put into a non-reversible
       state.  Non-reversible segments are problematic because they can not be
       reloaded - any subsequent loads of the segment will load the new
       descriptor state.  In general, it is not possible to resume direct
       execution of the virtual machine if certain segments become
       non-reversible.
       
       A load of the GDTR may cause the guest visible memory image of the GDT
       to be changed.  This allows the hypervisor to share the GDT pages with
       the guest, but also continue to maintain appropriate protections on the
       GDT page by transparently adjusting the DPL and RPL of descriptors in
       the GDT.

       Inputs:      EAX = pointer to descriptor limit / base
       Outputs:     None
       Clobbers:    Standard, Memory
       Segments:    Standard
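
       The VMI_DTR type is not defined in this section.  The sketch
       below assumes it mirrors the 6-byte memory operand of the native
       LGDT / SGDT instructions (16-bit limit followed by 32-bit base);
       the struct name dtr_sketch and the helper make_dtr are
       hypothetical.  The packed attribute is GCC / Clang specific.

```c
#include <stdint.h>
#include <stddef.h>

/* Assumed layout: same as the 6-byte LGDT/SGDT memory operand. */
struct dtr_sketch {
    uint16_t limit;  /* size of the table in bytes, minus one */
    uint32_t base;   /* linear address of the table */
} __attribute__((packed));

/* Build a table register image for a table of n_entries 8-byte descriptors. */
static inline struct dtr_sketch make_dtr(uint32_t base, unsigned n_entries)
{
    struct dtr_sketch d;
    d.limit = (uint16_t)(n_entries * 8u - 1u);
    d.base = base;
    return d;
}
```

       For example, a 32-entry table yields a limit of 0xFF, since the
       limit field holds the offset of the last valid byte rather than
       the table size.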

    VMI_SetIDT

       VMICALL void VMI_SetIDT(VMI_DTR *idtr);

       Load the interrupt descriptor table limit and base registers.  The IDT
       format is defined to be the same as native hardware.

       A load of the IDTR may cause the guest visible memory image of the IDT
       to be changed.  This allows the hypervisor to rewrite the IDT pages in
       a format more suitable to the hypervisor, which may include adjusting
       the DPL and RPL of descriptors in the guest IDT.

       Inputs:      EAX = pointer to descriptor limit / base
       Outputs:     None
       Clobbers:    Standard, Memory
       Segments:    Standard

    VMI_SetLDT

       VMICALL void VMI_SetLDT(VMI_SELECTOR ldtSel);

       Load the local descriptor table.  This has the additional side effect
       of reloading all segment registers.  See VMI_SetGDT for an
       explanation of why this is required.  A load of the LDT may cause the
       guest visible memory image of the LDT to be changed, just as GDT and
       IDT loads.

       Inputs:      EAX = GDT selector of LDT descriptor
       Outputs:     None
       Clobbers:    Standard, Memory
       Segments:    Standard

    VMI_SetTR

       VMICALL void VMI_SetTR(VMI_SELECTOR trSel);

       Load the task register.  Functionally equivalent to the LTR
       instruction.

       Inputs:      EAX = GDT selector of TR descriptor
       Outputs:     None
       Clobbers:    Standard, Memory
       Segments:    Standard

    VMI_GetGDT

       VMICALL void VMI_GetGDT(VMI_DTR *gdtr);

       Copy the GDT limit and base fields into the provided pointer.  This is
       equivalent to the SGDT instruction, which is non-virtualizable.
       
       Inputs:      EAX = pointer to descriptor limit / base
       Outputs:     None
       Clobbers:    Standard, Memory
       Segments:    Standard

    VMI_GetIDT

       VMICALL void VMI_GetIDT(VMI_DTR *idtr);

       Copy the IDT limit and base fields into the provided pointer.  This is
       equivalent to the SIDT instruction, which is non-virtualizable.
       
       Inputs:      EAX = pointer to descriptor limit / base
       Outputs:     None
       Clobbers:    Standard, Memory
       Segments:    Standard

    VMI_GetLDT

       VMICALL VMI_SELECTOR VMI_GetLDT(void);

       Return the selector of the current local descriptor table.
       Functionally equivalent to the SLDT instruction, which is
       non-virtualizable.

       Inputs:      None
       Outputs:     EAX = selector of LDT descriptor
       Clobbers:    Standard, Memory
       Segments:    Standard

    VMI_GetTR

       VMICALL VMI_SELECTOR VMI_GetTR(void);

       Return the selector currently held in the task register.
       Functionally equivalent to the STR instruction, which is
       non-virtualizable.

       Inputs:      None
       Outputs:     EAX = selector of TR descriptor
       Clobbers:    Standard, Memory
       Segments:    Standard

    VMI_WriteGDTEntry

       VMICALL void VMI_WriteGDTEntry(void *gdt, VMI_UINT entry,
                                      VMI_UINT32 descLo,
                                      VMI_UINT32 descHi);

       Write a descriptor to a GDT entry.  Note that writes to the GDT itself
       may be disallowed by the hypervisor, in which case this call must be
       converted into a hypercall.  In addition, since the descriptor may need
       to be modified to change limits and / or permissions, the guest kernel
       should not assume the update will be binary identical to the passed
       input.

       Inputs:      EAX   = pointer to GDT base
                    EDX   = GDT entry number
                    ECX   = descriptor low word
                    ST(1) = descriptor high word
       Outputs:     None
       Clobbers:    Standard, Memory
       Segments:    Standard
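
       The descLo / descHi pair follows the native x86 segment
       descriptor layout.  As a sketch, the helper below (the name
       pack_descriptor is hypothetical) packs a base, a 20-bit limit, an
       access byte, and the 4-bit flags nibble into the two 32-bit
       halves that a guest would pass to a call such as
       VMI_WriteGDTEntry.

```c
#include <stdint.h>

/* Pack base, 20-bit limit, access byte, and 4-bit flags nibble into
 * the low and high 32-bit halves of an x86 segment descriptor. */
static inline void pack_descriptor(uint32_t base, uint32_t limit,
                                   uint8_t access, uint8_t flags,
                                   uint32_t *desc_lo, uint32_t *desc_hi)
{
    /* Low half: limit 15:0 in bits 0-15, base 15:0 in bits 16-31. */
    *desc_lo = (limit & 0x0000FFFFu) | ((base & 0x0000FFFFu) << 16);

    /* High half: base 23:16, access byte, limit 19:16, flags, base 31:24. */
    *desc_hi = ((base >> 16) & 0x000000FFu)
             | ((uint32_t)access << 8)
             | (limit & 0x000F0000u)
             | ((uint32_t)(flags & 0x0Fu) << 20)
             | (base & 0xFF000000u);
}
```

       A flat 4GB ring-0 code segment (base 0, limit 0xFFFFF, access
       0x9A, flags 0xC) packs to descLo = 0x0000FFFF and descHi =
       0x00CF9A00.  Since the hypervisor may adjust permissions, the
       value read back from the table may not match what was packed
       here.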

    VMI_WriteLDTEntry

       VMICALL void VMI_WriteLDTEntry(void *ldt, VMI_UINT entry,
                                      VMI_UINT32 descLo,
                                      VMI_UINT32 descHi);

       Write a descriptor to an LDT entry.  Note that writes to the LDT itself
       may be disallowed by the hypervisor, in which case this call must be
       converted into a hypercall.  In addition, since the descriptor may need
       to be modified to change limits and / or permissions, the guest kernel
       should not assume the update will be binary identical to the passed
       input.

       Inputs:      EAX   = pointer to LDT base
                    EDX   = LDT entry number
                    ECX   = descriptor low word
                    ST(1) = descriptor high word
       Outputs:     None
       Clobbers:    Standard, Memory
       Segments:    Standard

    VMI_WriteIDTEntry

       VMICALL void VMI_WriteIDTEntry(void *idt, VMI_UINT entry,
                                      VMI_UINT32 descLo,
                                      VMI_UINT32 descHi);

       Write a descriptor to an IDT entry.  Since the descriptor may need to be
       modified to change limits and / or permissions, the guest kernel should
       not assume the update will be binary identical to the passed input.

       Inputs:      EAX   = pointer to IDT base
                    EDX   = IDT entry number
                    ECX   = descriptor low word
                    ST(1) = descriptor high word
       Outputs:     None
       Clobbers:    Standard, Memory
       Segments:    Standard


   CPU CONTROL CALLS

    These calls encapsulate the set of privileged instructions used to
    manipulate the CPU control state.  These instructions are all properly
    virtualizable using trap and emulate, but for performance reasons, a
    direct call may be more efficient.  With hardware virtualization
    capabilities, many of these calls can be left as IDENT translations, that
    is, inline implementations of the native instructions, which are not
    rewritten by the hypervisor.  Some of these calls are performance critical
    during context switch paths, and some are not, but they are all included
    for completeness, with the exceptions of the obsoleted LMSW and SMSW
    instructions.

    VMI_WRMSR

       VMICALL void VMI_WRMSR(VMI_UINT64 val, VMI_UINT32 reg);

       Write to a model specific register.  This functions identically to the
       hardware WRMSR instruction.  Note that a hypervisor may not implement
       the full set of MSRs supported by native hardware, since many of them
       are not useful in the context of a virtual machine.

       Inputs:      ECX = model specific register index 
                    EAX = low word of register
                    EDX = high word of register
       Outputs:     None
       Clobbers:    Standard, Memory
       Segments:    Standard

    VMI_RDMSR

       VMICALL VMI_UINT64 VMI_RDMSR(VMI_UINT64 dummy, VMI_UINT32 reg);

       Read from a model specific register.  This functions identically to the
       hardware RDMSR instruction.  Note that a hypervisor may not implement
       the full set of MSRs supported by native hardware, since many of them
       are not useful in the context of a virtual machine.

       Inputs:      ECX = model specific register index
       Outputs:     EAX = low word of register
                    EDX = high word of register
       Clobbers:    Standard
       Segments:    Standard
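
       Several calls in this document (VMI_WRMSR, VMI_RDMSR, VMI_RDTSC,
       VMI_RDPMC and the timer calls) carry a 64-bit value as a low word
       in EAX and a high word in EDX.  A minimal sketch of that split,
       using hypothetical helper names:

```c
#include <stdint.h>

/* Combine a low word (EAX) and a high word (EDX) into one 64-bit value. */
static inline uint64_t join_edx_eax(uint32_t eax, uint32_t edx)
{
    return ((uint64_t)edx << 32) | eax;
}

/* Split a 64-bit value into its low (EAX) and high (EDX) words. */
static inline void split_edx_eax(uint64_t val, uint32_t *eax, uint32_t *edx)
{
    *eax = (uint32_t)val;
    *edx = (uint32_t)(val >> 32);
}
```

       The cast to uint64_t before the shift matters; shifting a 32-bit
       value left by 32 is undefined in C.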

    VMI_SetCR0

       VMICALL void VMI_SetCR0(VMI_UINT val);

       Write to control register zero.  This can cause TLB flush and FPU
       handling side effects.  The set of features available to the kernel
       depends on the completeness of the hypervisor.  An explicit list of
       supported functionality or required settings may need to be negotiated
       by the hypervisor and kernel during bootstrapping.  This is likely to
       be implementation or vendor specific, and the precise restrictions are
       not yet worked out.  Our implementation in general supports turning on
       additional functionality - enabling protected mode, paging, page write
       protections; however, once those features have been enabled, they may
       not be disabled on the virtual hardware.

       Inputs:      EAX = input to control register
       Outputs:     None
       Clobbers:    Standard
       Segments:    Standard
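
       The "may be enabled but not disabled" rule described above can be
       expressed as a simple bit test.  The sketch below is only an
       illustration: the set of sticky bits (CR0_STICKY) is an
       assumption drawn from the features named in the text, and the
       function name is hypothetical.  The real restrictions are
       hypervisor specific and, as noted, not yet worked out.

```c
#include <stdint.h>
#include <stdbool.h>

#define CR0_PE 0x00000001u  /* protected mode enable */
#define CR0_WP 0x00010000u  /* supervisor write protect */
#define CR0_PG 0x80000000u  /* paging enable */

/* Bits assumed to be sticky once set; the real set is hypervisor specific. */
#define CR0_STICKY (CR0_PE | CR0_WP | CR0_PG)

/* Return true if moving from old_cr0 to new_cr0 clears no sticky bit. */
static inline bool cr0_update_allowed(uint32_t old_cr0, uint32_t new_cr0)
{
    uint32_t cleared = old_cr0 & ~new_cr0;
    return (cleared & CR0_STICKY) == 0;
}
```

       Under this model, turning on paging is allowed, turning it off
       again is not, and bits outside the sticky set (such as TS) may be
       toggled freely.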

    VMI_SetCR2

       VMICALL void VMI_SetCR2(VMI_UINT val);

       Write to control register two.  This has no side effects other than
       updating the CR2 register value.

       Inputs:      EAX = input to control register
       Outputs:     None
       Clobbers:    Standard
       Segments:    Standard

    VMI_SetCR3

       VMICALL void VMI_SetCR3(VMI_UINT val);

       Write to control register three.  This causes a TLB flush on the local
       processor.  In addition, this update may be queued as part of a lazy
       call invocation, which allows multiple hypercalls to be issued during
       the context switch path.  The queuing convention is to be negotiated
       with the hypervisor during bootstrapping, but the interfaces for this
       negotiation are currently vendor specific.

       Inputs:      EAX = input to control register
       Outputs:     None
       Clobbers:    Standard
       Segments:    Standard
       Queue Class: MMU

    VMI_SetCR4

       VMICALL void VMI_SetCR4(VMI_UINT val);

       Write to control register four.  This can cause TLB flush and many
       other CPU side effects.  The set of features available to the kernel
       depends on the completeness of the hypervisor.  An explicit list of
       supported functionality or required settings may need to be negotiated
       by the hypervisor and kernel during bootstrapping.  This is likely to
       be implementation or vendor specific, and the precise restrictions are
       not yet worked out.  Our implementation in general supports turning on
       additional MMU functionality - enabling global pages, large pages, PAE
       mode, and other features - however, once those features have been
       enabled, they may not be disabled on the virtual hardware.  The
       remaining CPU control bits of CR4 remain active and behave identically
       to real hardware.

       Inputs:      EAX = input to control register
       Outputs:     None
       Clobbers:    Standard
       Segments:    Standard

    VMI_GetCR0
    VMI_GetCR2
    VMI_GetCR3
    VMI_GetCR4

       VMICALL VMI_UINT32 VMI_GetCR0(void);
       VMICALL VMI_UINT32 VMI_GetCR2(void);
       VMICALL VMI_UINT32 VMI_GetCR3(void);
       VMICALL VMI_UINT32 VMI_GetCR4(void);

       Read the value of a control register into EAX.  The register contents
       are identical to the native hardware control registers; CR0 contains
       the control bits and task switched flag, CR2 contains the last page
       fault address, CR3 contains the page directory base pointer, and CR4
       contains various feature control bits.

       Inputs:      None
       Outputs:     EAX = value of control register
       Clobbers:    Standard
       Segments:    Standard

    VMI_CLTS

       VMICALL void VMI_CLTS(void);

       Used to clear the task switched (TS) flag in control register zero.  A
       replacement for the CLTS instruction.

       Inputs:      None
       Outputs:     None
       Clobbers:    Standard
       Segments:    Standard

    VMI_SetDR

       VMICALL void VMI_SetDR(VMI_UINT32 num, VMI_UINT32 val);

       Set the debug register to the given value.  If a hypervisor
       implementation supports debug registers, this functions equivalently to
       native hardware move to DR instructions.

       Inputs:      EAX = debug register number
                    EDX = debug register value
       Outputs:     None
       Clobbers:    Standard
       Segments:    Standard

    VMI_GetDR

       VMICALL VMI_UINT32 VMI_GetDR(VMI_UINT32 num);

       Read a debug register.  If debug registers are not supported, the
       implementation is free to return zero values.

       Inputs:      EAX = debug register number
       Outputs:     EAX = debug register value
       Clobbers:    Standard
       Segments:    Standard


   PROCESSOR INFORMATION CALLS

    These calls provide access to processor identification, performance and
    cycle data, which may be inaccurate due to the nature of running on
    virtual hardware.   This information may be visible in a non-virtualizable
    way to applications running outside of the kernel.  As such, both RDTSC
    and RDPMC should be disabled by kernels or hypervisors where information
    leakage is a concern, and the accuracy of data retrieved by these functions
    is up to the individual hypervisor vendor.

    VMI_CPUID

       /* Not expressible as a C function */

       The CPUID instruction provides processor feature identification in a
       vendor specific manner.  The instruction itself is non-virtualizable
       without hardware support, requiring a hypervisor assisted CPUID call
       that emulates the effect of the native instruction, while masking any
       unsupported CPU feature bits.

       Inputs:       EAX = CPUID number
                     ECX = sub-level query (nonstandard)
       Outputs:      EAX = CPUID dword 0
                     EBX = CPUID dword 1
                     ECX = CPUID dword 2
                     EDX = CPUID dword 3
       Clobbers:     Flags only
       Segments:     Standard

    VMI_RDTSC

       VMICALL VMI_UINT64 VMI_RDTSC(void);

       The RDTSC instruction provides a cycle counter which may be made
       visible to userspace.  For better or worse, many applications have made
       use of this feature to implement userspace timers, database indices, or
       for micro-benchmarking of performance.  This instruction is extremely
       problematic for virtualization, because even though it is selectively 
       virtualizable using trap and emulate, it is much more expensive to
       virtualize it in this fashion.  On the other hand, if this instruction
       is allowed to execute without trapping, the cycle counter provided
       could be wrong in any number of circumstances due to hardware drift,
       migration, suspend/resume, CPU hotplug, and other unforeseen
       consequences of running inside of a virtual machine.  There is no
       standard specification for how this instruction operates when issued
       from userspace programs, but the VMI call here provides a proper
       interface for the kernel to read this cycle counter.

       Inputs:      None
       Outputs:     EAX = low word of TSC cycle counter
                    EDX = high word of TSC cycle counter
       Clobbers:    Standard
       Segments:    Standard

    VMI_RDPMC

       VMICALL VMI_UINT64 VMI_RDPMC(VMI_UINT64 dummy, VMI_UINT32 counter);

       Similar to RDTSC, this call provides the functionality of reading
       processor performance counters.  It also is selectively visible to
       userspace, and maintaining accurate data for the performance counters
       is an extremely difficult task due to the side effects introduced by
       the hypervisor.

       Inputs:      ECX = performance counter index 
       Outputs:     EAX = low word of counter
                    EDX = high word of counter
       Clobbers:    Standard
       Segments:    Standard


   STACK / PRIVILEGE TRANSITION CALLS
    
    This set of calls encapsulates mechanisms required to transfer between
    higher privileged kernel tasks and userspace.  The stack switching and
    return mechanisms are also used to return from interrupt handlers into
    the kernel, which may involve atomic interrupt state and stack
    transitions.

    VMI_UpdateKernelStack

       VMICALL void VMI_UpdateKernelStack(void *tss, VMI_UINT32 esp0);

       Inform the hypervisor that a new kernel stack pointer has been loaded
       in the TSS structure.  This new kernel stack pointer will be used for
       entry into the kernel on interrupts from userspace.

       Inputs:      EAX = pointer to TSS structure
                    EDX = new kernel stack top
       Outputs:     None
       Clobbers:    Standard
       Segments:    Standard

    VMI_IRET

       /* No C prototype provided */

       Perform a near equivalent of the IRET instruction, which atomically
       switches off the current stack and restores the interrupt mask.  This
       may return to userspace or back to the kernel from an interrupt or
       exception handler.  The VMI_IRET call does not restore IOPL from the
       stack image, as the native hardware equivalent would.  Instead, IOPL
       must be explicitly restored using a VMI_SetIOPLMask call.  The VMI_IRET
       call does, however, restore the state of the EFLAGS_VM bit from the
       stack image in the event that the hypervisor and kernel both support
       V8086 execution mode.  If the hypervisor does not support V8086 mode,
       this can be silently ignored, generating an error that the guest must
       deal with.  Note this call is made using a CALL instruction, just as
       all other VMI calls, so the EIP of the call site is available to the
       VMI layer.  This allows faults during the sequence to be properly
       passed back to the guest kernel with the correct EIP.

       Note that returning to userspace with interrupts disabled is an invalid
       operation in a paravirtualized kernel, and the results of an attempt to
       do so are undefined.

       Also note that when issuing the VMI_IRET call, the userspace data
       segments may have already been restored, so only the stack and code
       segments can be assumed valid.

       There is currently no support for IRET calls from a 16-bit stack
       segment, which poses a problem for supporting certain userspace
       applications which make use of high bits of ESP on a 16-bit stack.  How
       to best resolve this is an open question.  One possibility is to
       introduce a new VMI call which can operate on 16-bit segments, since it
       is desirable to make the common case here as fast as possible.

       Inputs:      ST(0) = New EIP
                    ST(1) = New CS
                    ST(2) = New Flags (including interrupt mask)
                    ST(3) = New ESP (for userspace returns)
                    ST(4) = New SS (for userspace returns)
                    ST(5) = New ES (for v8086 returns)
                    ST(6) = New DS (for v8086 returns)
                    ST(7) = New FS (for v8086 returns)
                    ST(8) = New GS (for v8086 returns)
       Outputs:     None (does not return)
       Clobbers:    None (does not return)
       Segments:    CS / SS only
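
       The input listing above can be read as a stack image of 32-bit
       slots.  The following sketch writes that image as a C struct and
       checks the rule that a return to userspace must have interrupts
       enabled.  The struct and function names are hypothetical, and
       treating a v8086 return (EFLAGS.VM set) as a userspace return for
       this check is an assumption.

```c
#include <stdint.h>
#include <stddef.h>

#define EFLAGS_IF 0x00000200u  /* interrupt enable flag */
#define EFLAGS_VM 0x00020000u  /* virtual 8086 mode flag */

/* Stack image consumed by VMI_IRET, one 32-bit slot per ST(n) entry. */
struct iret_frame_sketch {
    uint32_t eip;     /* ST(0) */
    uint32_t cs;      /* ST(1) */
    uint32_t eflags;  /* ST(2), includes the interrupt mask */
    uint32_t esp;     /* ST(3), userspace returns only */
    uint32_t ss;      /* ST(4), userspace returns only */
    uint32_t es;      /* ST(5), v8086 returns only */
    uint32_t ds;      /* ST(6), v8086 returns only */
    uint32_t fs;      /* ST(7), v8086 returns only */
    uint32_t gs;      /* ST(8), v8086 returns only */
};

/* Return nonzero unless the frame returns to userspace with IF clear. */
static inline int iret_frame_valid(const struct iret_frame_sketch *f)
{
    int to_user = (f->eflags & EFLAGS_VM) != 0 || (f->cs & 3u) == 3u;
    return !to_user || (f->eflags & EFLAGS_IF) != 0;
}
```

       A check of this kind is cheap enough to keep in debug builds of a
       guest, since the result of violating the rule is undefined.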

    VMI_SYSEXIT

       /* No C prototype provided */

       For hypervisors and processors which support SYSENTER / SYSEXIT, the
       VMI_SYSEXIT call is provided as a binary equivalent to the native
       SYSEXIT instruction.  Since interrupts must always be enabled in
       userspace, the VMI version of this function always combines atomically
       enabling interrupts with the return to userspace.

       Inputs:      EDX = New EIP
                    ECX = New ESP
       Outputs:     None (does not return)
       Clobbers:    None (does not return)
       Segments:    CS / SS only


   I/O CALLS

    This set of calls incorporates I/O related calls - PIO, setting I/O
    privilege level, and forcing memory writeback for device coherency.

    VMI_INB
    VMI_INW
    VMI_INL

       VMICALL VMI_UINT8  VMI_INB(VMI_UINT dummy, VMI_UINT port);
       VMICALL VMI_UINT16 VMI_INW(VMI_UINT dummy, VMI_UINT port);
       VMICALL VMI_UINT32 VMI_INL(VMI_UINT dummy, VMI_UINT port);

       Input a byte, word, or doubleword from an I/O port.  These
       instructions have binary equivalent semantics to native instructions.

       Inputs:      EDX = port number
                      EDX, rather than EAX is used, because the native
                      encoding of the instruction may use this register
                      implicitly.
       Outputs:     EAX = port value
       Clobbers:    Memory only
       Segments:    Standard

    VMI_OUTB
    VMI_OUTW
    VMI_OUTL
   
       VMICALL void VMI_OUTB(VMI_UINT value, VMI_UINT port);
       VMICALL void VMI_OUTW(VMI_UINT value, VMI_UINT port);
       VMICALL void VMI_OUTL(VMI_UINT value, VMI_UINT port);

       Output a byte, word, or doubleword to an I/O port.  These
       instructions have binary equivalent semantics to native instructions.

       Inputs:      EAX = port value
                    EDX = port number
       Outputs:     None
       Clobbers:    None
       Segments:    Standard

    VMI_INSB
    VMI_INSW
    VMI_INSL

       /* Not expressible as C functions */

       Input a string of bytes, words, or doublewords from an I/O port.  These
       instructions have binary equivalent semantics to native instructions.
       They do not follow a C calling convention, and clobber only the same
       registers as native instructions.

       Inputs:      EDI = destination address
                    EDX = port number
                    ECX = count
       Outputs:     None
       Clobbers:    EDI, ECX, Memory
       Segments:    Standard

    VMI_OUTSB
    VMI_OUTSW
    VMI_OUTSL

       /* Not expressible as C functions */

       Output a string of bytes, words, or doublewords to an I/O port.  These
       instructions have binary equivalent semantics to native instructions.
       They do not follow a C calling convention, and clobber only the same
       registers as native instructions.

       Inputs:      ESI = source address
                    EDX = port number
                    ECX = count
       Outputs:     None
       Clobbers:    ESI, ECX
       Segments:    Standard

    VMI_IODelay

       VMICALL void VMI_IODelay(void);

       Delay the processor by the time required to access a bus register.
       This is easily implemented on native hardware by an access to a bus
       scratch register, but is typically not useful in a virtual machine.
       It is paravirtualized to remove the overhead implied by executing the
       native delay.

       Inputs:      None
       Outputs:     None
       Clobbers:    Standard
       Segments:    Standard

    VMI_SetIOPLMask

       VMICALL void VMI_SetIOPLMask(VMI_UINT32 mask);

       Set the IOPL mask of the processor to allow userspace to access I/O
       ports.  Note the mask is pre-shifted, so an IOPL of 3 would be
       expressed as (3 << 12).  If the guest chooses to use IOPL to allow
       CPL-3 access to I/O ports, it must explicitly set and restore IOPL
       using these calls; attempting to set the IOPL flags with popf or iret
       may produce no result.

       Inputs:      EAX = Mask
       Outputs:     None
       Clobbers:    Standard
       Segments:    Standard
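
       To make the pre-shifted convention concrete, the sketch below
       converts between an IOPL value (0 to 3) and the mask form, which
       occupies bits 12 and 13 as in the native EFLAGS register.  The
       helper names are hypothetical.

```c
#include <stdint.h>

#define EFLAGS_IOPL_SHIFT 12
#define EFLAGS_IOPL_MASK  (3u << EFLAGS_IOPL_SHIFT)

/* Convert an IOPL level (0..3) to the pre-shifted mask form. */
static inline uint32_t iopl_to_mask(unsigned iopl)
{
    return ((uint32_t)iopl << EFLAGS_IOPL_SHIFT) & EFLAGS_IOPL_MASK;
}

/* Extract the IOPL level from a flags image or pre-shifted mask. */
static inline unsigned mask_to_iopl(uint32_t eflags)
{
    return (eflags & EFLAGS_IOPL_MASK) >> EFLAGS_IOPL_SHIFT;
}
```

       With these definitions, iopl_to_mask(3) is 0x3000, which is the
       (3 << 12) value given in the text.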

    VMI_WBINVD

       VMICALL void VMI_WBINVD(void);

       Write back and invalidate the data cache.  This is used to synchronize
       I/O memory.

       Inputs:      None
       Outputs:     None
       Clobbers:    Standard
       Segments:    Standard

    VMI_INVD

       This instruction is deprecated.  It is invalid to execute in a virtual
       machine.  It is documented here only because it is still declared in
       the interface, and dropping it would require a version change.


   APIC CALLS

    APIC virtualization is currently quite simple.  These calls support the
    functionality of the hardware APIC in a form that allows for more
    efficient implementation in a hypervisor, by avoiding trapping access to
    APIC memory.  The calls are kept simple to make the implementation
    compatible with native hardware.  The APIC must be mapped at a page
    boundary in the processor virtual address space.

    VMI_APICWrite
   
       VMICALL void VMI_APICWrite(void *reg, VMI_UINT32 value);

       Write to a local APIC register.  Side effects are the same as native
       hardware APICs.

       Inputs:      EAX = APIC register address
                    EDX = value to write
       Outputs:     None
       Clobbers:    Standard
       Segments:    Standard

    VMI_APICRead

       VMICALL VMI_UINT32 VMI_APICRead(void *reg);

       Read from a local APIC register.  Side effects are the same as native
       hardware APICs.

       Inputs:      EAX = APIC register address
       Outputs:     EAX = APIC register value
       Clobbers:    Standard
       Segments:    Standard


   TIMER CALLS

    The VMI interfaces define a highly accurate and efficient timer interface
    that is available when running inside of a hypervisor.  This is an
    optional but highly recommended feature which avoids many of the problems
    presented by classical timer virtualization.  It provides notions of
    stolen time, counters, and wall clock time which allow the VM to
    get the most accurate information in a way which is free of races and
    legacy hardware dependence.

    VMI_GetWallclockTime

       VMICALL VMI_NANOSECS VMI_GetWallclockTime(void);

       VMI_GetWallclockTime returns the current wallclock time as the number
       of nanoseconds since the epoch.  Nanosecond resolution along with the
       64-bit unsigned type provide over 580 years from epoch until rollover.
       The wallclock time is relative to the host's wallclock time.

       Inputs:      None
       Outputs:     EAX = low word, wallclock time in nanoseconds 
                    EDX = high word, wallclock time in nanoseconds 
       Clobbers:    Standard
       Segments:    Standard
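
       As a worked check of the rollover figure, and as a sketch of how
       a guest might split the returned value into seconds and
       nanoseconds, consider the following.  The type and helper names
       are hypothetical, and a year is taken as 365.25 days.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ull

struct wallclock_sketch {
    uint64_t sec;   /* whole seconds since the epoch */
    uint32_t nsec;  /* remaining nanoseconds, 0..999999999 */
};

/* Split a nanosecond count into seconds and leftover nanoseconds. */
static inline struct wallclock_sketch split_nanosecs(uint64_t ns)
{
    struct wallclock_sketch w;
    w.sec = ns / NSEC_PER_SEC;
    w.nsec = (uint32_t)(ns % NSEC_PER_SEC);
    return w;
}

/* Whole years (365.25 days each) that fit in 64 bits of nanoseconds. */
static inline uint64_t years_until_rollover(void)
{
    /* 36525 * 864 = 31557600 seconds in a 365.25-day year. */
    return UINT64_MAX / NSEC_PER_SEC / (36525ull * 864ull);
}
```

       years_until_rollover() evaluates to 584, consistent with the
       "over 580 years" stated above.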

    VMI_WallclockUpdated

       VMICALL VMI_BOOL VMI_WallclockUpdated(void);
    
       VMI_WallclockUpdated returns TRUE if the wallclock time has changed
       relative to the real cycle counter since the previous time that
       VMI_WallclockUpdated was polled.  For example, while a VM is suspended,
       the real cycle counter will halt, but wallclock time will continue to
       advance.  Upon resuming the VM, the first call to VMI_WallclockUpdated
       will return TRUE.

       Inputs:      None
       Outputs:     EAX = 0 for FALSE, 1 for TRUE
       Clobbers:    Standard
       Segments:    Standard

    VMI_GetCycleFrequency

       VMICALL VMI_CYCLES VMI_GetCycleFrequency(void);

       VMI_GetCycleFrequency returns the number of cycles in one second.  This
       value can be used by the guest to convert between cycles and other time
       units.

       Inputs:      None
       Outputs:     EAX = low word, cycle frequency
                    EDX = high word, cycle frequency
       Clobbers:    Standard
       Segments:    Standard
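
       One way a guest might use the cycle frequency is to convert a
       cycle count into nanoseconds.  The sketch below (the function
       name is hypothetical) splits the division so that the
       intermediate product does not overflow 64 bits for frequencies
       below roughly 18 GHz; the caller must ensure hz is nonzero.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ull

/* Convert a cycle count to nanoseconds given the frequency in Hz. */
static inline uint64_t cycles_to_ns(uint64_t cycles, uint64_t hz)
{
    uint64_t whole = cycles / hz;  /* full seconds */
    uint64_t rem = cycles % hz;    /* leftover cycles, less than hz */

    /* rem * NSEC_PER_SEC fits in 64 bits while hz < 2^64 / 10^9. */
    return whole * NSEC_PER_SEC + rem * NSEC_PER_SEC / hz;
}
```

       Multiplying the full cycle count by 10^9 before dividing would
       overflow after about 18 seconds worth of cycles at 1 GHz, which
       is why the whole-second part is separated first.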

    VMI_GetCycleCounter

       VMICALL VMI_CYCLES VMI_GetCycleCounter(VMI_UINT32 whichCounter);

       VMI_GetCycleCounter returns the current value, in units of cycles, of
       the counter corresponding to 'whichCounter' if it is one of
       VMI_CYCLES_REAL, VMI_CYCLES_AVAILABLE or VMI_CYCLES_STOLEN.
       VMI_GetCycleCounter returns 0 for any other value of 'whichCounter'.

       Inputs:      EAX = counter index, one of
                        #define VMI_CYCLES_REAL        0
                        #define VMI_CYCLES_AVAILABLE   1
                        #define VMI_CYCLES_STOLEN      2
       Outputs:     EAX = low word, cycle counter
                    EDX = high word, cycle counter 
       Clobbers:    Standard
       Segments:    Standard

    VMI_SetAlarm

       VMICALL void VMI_SetAlarm(VMI_UINT32 flags, VMI_CYCLES expiry,
                                 VMI_CYCLES period);

       VMI_SetAlarm is used to arm the vcpu's alarms.  The 'flags' parameter
       is used to specify which counter's alarm is being set (VMI_CYCLES_REAL
       or VMI_CYCLES_AVAILABLE), how to deliver the alarm to the vcpu
       (VMI_ALARM_WIRED_IRQ0 or VMI_ALARM_WIRED_LVTT), and the mode
       (VMI_ALARM_IS_ONESHOT or VMI_ALARM_IS_PERIODIC).  If the alarm is set
       against the VMI_CYCLES_STOLEN counter or an undefined counter number,
       the call is a nop.  The 'expiry' parameter indicates the expiry of the
       alarm, and for periodic alarms, the 'period' parameter indicates the
       period of the alarm.  If the value of 'period' is zero, the alarm is
       armed as a one-shot alarm regardless of the mode specified by 'flags'.
       Finally, a call to VMI_SetAlarm for an alarm that is already armed is
       equivalent to first calling VMI_CancelAlarm and then calling
       VMI_SetAlarm, except that the value returned by VMI_CancelAlarm is not
       accessible.
       
       /* The alarm interface 'flags' bits. [TBD: exact format of 'flags'] */

       Inputs:      EAX   = flags value, cycle counter number or'ed with
                        #define VMI_ALARM_WIRED_IRQ0   0x00000000
                        #define VMI_ALARM_WIRED_LVTT   0x00010000
                        #define VMI_ALARM_IS_ONESHOT   0x00000000
                        #define VMI_ALARM_IS_PERIODIC  0x00000100
                    EDX   = low word, alarm expiry
                    ECX   = high word, alarm expiry
                    ST(0) = low word, alarm period
                    ST(1) = high word, alarm period
       Outputs:     None
       Clobbers:    Standard
       Segments:    Standard
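
       The rules above can be sketched as a small classifier.  This is
       only a model under stated assumptions: the exact 'flags' format
       is still marked TBD, VMI_ALARM_COUNTER_MASK is taken from the
       prototypes in Appendix B, and the enum and function names are
       hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define VMI_CYCLES_REAL        0
#define VMI_CYCLES_AVAILABLE   1
#define VMI_CYCLES_STOLEN      2

#define VMI_ALARM_COUNTER_MASK 0x000000ff
#define VMI_ALARM_WIRED_IRQ0   0x00000000
#define VMI_ALARM_WIRED_LVTT   0x00010000
#define VMI_ALARM_IS_ONESHOT   0x00000000
#define VMI_ALARM_IS_PERIODIC  0x00000100

/* Hypothetical classification of what a VMI_SetAlarm call would do. */
enum alarm_effect { ALARM_NOP, ALARM_ONESHOT, ALARM_PERIODIC };

static enum alarm_effect vmi_alarm_effect(uint32_t flags, uint64_t period)
{
    uint32_t counter = flags & VMI_ALARM_COUNTER_MASK;

    /* Only the REAL and AVAILABLE counters carry alarms. */
    if (counter != VMI_CYCLES_REAL && counter != VMI_CYCLES_AVAILABLE)
        return ALARM_NOP;

    /* A zero period forces one-shot behaviour regardless of the mode bit. */
    if ((flags & VMI_ALARM_IS_PERIODIC) && period != 0)
        return ALARM_PERIODIC;

    return ALARM_ONESHOT;
}
```

       A periodic alarm on the available-time counter delivered through
       the local APIC timer vector would then use
       VMI_CYCLES_AVAILABLE | VMI_ALARM_WIRED_LVTT | VMI_ALARM_IS_PERIODIC.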

    VMI_CancelAlarm

       VMICALL VMI_BOOL VMI_CancelAlarm(VMI_UINT32 flags);

       VMI_CancelAlarm is used to disarm an alarm.  The 'flags' parameter
       indicates which alarm to cancel (VMI_CYCLES_REAL or
       VMI_CYCLES_AVAILABLE).  The return value indicates whether or not the
       cancel succeeded.  A return value of FALSE indicates that the alarm was
       already disarmed either because a) the alarm was never set or b) it was
       a one-shot alarm and has already fired (though perhaps not yet
       delivered to the guest).  TRUE indicates that the alarm was armed and
       either a) the alarm was one-shot and has not yet fired (and will no
       longer fire until it is rearmed) or b) the alarm was periodic.

       Inputs:      EAX = cycle counter number
       Outputs:     EAX = 0 for FALSE, 1 for TRUE
       Clobbers:    Standard
       Segments:    Standard
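
       The return value cases can be checked against a toy state model.
       The structure and function names below are illustrative only and
       are not part of the interface; delivery to the guest is not
       modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one alarm; names are illustrative, not part of VMI. */
struct toy_alarm {
    bool armed;
    bool periodic;
};

static void toy_set_alarm(struct toy_alarm *a, bool periodic)
{
    a->armed = true;
    a->periodic = periodic;
}

/* Called when the expiry is reached: a one-shot alarm disarms itself. */
static void toy_fire(struct toy_alarm *a)
{
    if (a->armed && !a->periodic)
        a->armed = false;
}

/* Mirrors the documented return value: TRUE only if the alarm was armed. */
static bool toy_cancel_alarm(struct toy_alarm *a)
{
    bool was_armed;

    assert(a != NULL);
    was_armed = a->armed;
    a->armed = false;
    return was_armed;
}
```

       In this model a periodic alarm stays armed across firings, so
       cancelling it returns TRUE no matter how often it has fired.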


   MMU CALLS

    The MMU plays a large role in paravirtualization due to the large
    performance opportunities realized by gaining insight into the guest
    machine's use of page tables.  These calls are designed to accommodate the
    existing MMU functionality in the guest OS while providing the hypervisor
    with hints that can be used to optimize performance to a large degree.

    VMI_SetLinearMapping

       VMICALL void VMI_SetLinearMapping(int slot, VMI_UINT32 va,
                                         VMI_UINT32 pages, VMI_UINT32 ppn);

       /* The number of VMI address translation slots */
       #define VMI_LINEAR_MAP_SLOTS    4

       Register a virtual to physical translation of virtual address range to
       physical pages.  This may be used to register single pages or to
       register large ranges.  There is an upper limit on the number of active
       mappings, which should be sufficient to allow the hypervisor and VMI
       layer to perform page translation without requiring dynamic storage.
       Translations are only required to be registered for addresses used to
       access page table entries through the VMI page table access functions.
       The guest is free to use the provided linear map slots in a manner that
       it finds most convenient.  Kernels which linearly map a large chunk of
       physical memory and use page tables in this linear region will only
       need to register one such region after initialization of the VMI.
       Hypervisors which do not require linear to physical conversion hints
       are free to leave these calls as NOPs, which is the default when
       inlined into the native kernel.

       Inputs:      EAX   = linear map slot
                    EDX   = virtual address start of mapping
                    ECX   = number of pages in mapping
                    ST(0) = physical frame number to which pages are mapped
       Outputs:     None
       Clobbers:    Standard
       Segments:    Standard
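
       To show why a fixed number of slots lets translation proceed
       without dynamic storage, here is a minimal sketch of how a VMI
       layer might keep the registered ranges and resolve a virtual
       address to a physical page number.  The table, the helper names
       and the 4 KiB page size are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VMI_LINEAR_MAP_SLOTS 4
#define TOY_PAGE_SHIFT       12   /* assumes 4 KiB pages */

/* Illustrative bookkeeping for the registered translations. */
struct toy_linear_map {
    uint32_t va;      /* virtual address where the mapping starts */
    uint32_t pages;   /* number of pages covered; 0 means unused */
    uint32_t ppn;     /* physical page number backing the first page */
};

static struct toy_linear_map toy_slots[VMI_LINEAR_MAP_SLOTS];

static void toy_set_linear_mapping(int slot, uint32_t va,
                                   uint32_t pages, uint32_t ppn)
{
    assert(slot >= 0 && slot < VMI_LINEAR_MAP_SLOTS);
    toy_slots[slot] = (struct toy_linear_map){ va, pages, ppn };
}

/* Translate 'va' to a physical page number using the fixed slot table. */
static bool toy_va_to_ppn(uint32_t va, uint32_t *ppn)
{
    for (int i = 0; i < VMI_LINEAR_MAP_SLOTS; i++) {
        const struct toy_linear_map *m = &toy_slots[i];
        /* 64-bit arithmetic avoids wrap-around at the top of the
         * 32-bit address space. */
        uint64_t start = m->va;
        uint64_t end = start + ((uint64_t)m->pages << TOY_PAGE_SHIFT);

        if (m->pages != 0 && va >= start && va < end) {
            *ppn = m->ppn + (uint32_t)((va - start) >> TOY_PAGE_SHIFT);
            return true;
        }
    }
    return false;
}
```

       A kernel with a single linear map of low memory would call
       toy_set_linear_mapping(0, 0xC0000000, pages, 0) once and leave the
       other slots unused.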

    VMI_FlushTLB

       VMICALL void VMI_FlushTLB(int how);
   
       Flush all non-global mappings in the TLB, optionally flushing global
       mappings as well.  The VMI_FLUSH_TLB flag should always be specified,
       optionally or'ed with the VMI_FLUSH_GLOBAL flag.

       Inputs:      EAX = flush type
                       #define VMI_FLUSH_TLB            0x01
                       #define VMI_FLUSH_GLOBAL         0x02
       Outputs:     None
       Clobbers:    Standard, memory (implied)
       Segments:    Standard

    VMI_InvalPage

       VMICALL void VMI_InvalPage(VMI_UINT32 va);

       Invalidate the TLB mapping for a single page or large page at the
       given virtual address.

       Inputs:      EAX = virtual address
       Outputs:     None
       Clobbers:    Standard, memory (implied)
       Segments:    Standard

   The remaining documentation here needs updating when the PTE accessors are
   simplified.

    70) VMI_SetPte

        void VMI_SetPte(VMI_PTE pte, VMI_PTE *ptep);

        Assigns a new value to a page table / directory entry. It is a
        requirement that ptep points to a page that has already been
        registered with the hypervisor as a page of the appropriate type
        using the VMI_RegisterPageUsage function.
            
    71) VMI_SwapPte           

        VMI_PTE VMI_SwapPte(VMI_PTE pte, VMI_PTE *ptep);

        Writes 'pte' into the page table entry pointed to by 'ptep' and
        returns the old value of that entry.  This function acts atomically
        on the PTE to provide up to date A/D bit information in the
        returned value.

    72) VMI_TestAndSetPteBit

        VMI_BOOL VMI_TestAndSetPteBit(VMI_INT bit, VMI_PTE *ptep);

        Atomically set a bit in a page table entry.  Returns zero if the bit
        was not set, and non-zero if the bit was set.

    73) VMI_TestAndClearPteBit 

        VMI_BOOL VMI_TestAndClearPteBit(VMI_INT bit, VMI_PTE *ptep);

        Atomically clear a bit in a page table entry.  Returns zero if the bit
        was not set, and non-zero if the bit was set.
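
        On native hardware, the semantics of the 32-bit accessors can
        be approximated with C11 atomics.  The sketch below is not the
        VMI implementation; the toy_ names and the use of <stdatomic.h>
        are assumptions chosen to make the atomicity requirements
        concrete.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef _Atomic uint32_t toy_pte_t;   /* stand-in for a 4-byte VMI_PTE */

/* Store a new value and return the old one in a single atomic step. */
static uint32_t toy_swap_pte(uint32_t pte, toy_pte_t *ptep)
{
    return atomic_exchange(ptep, pte);
}

/* Atomically set 'bit'; report whether it was already set. */
static bool toy_test_and_set_pte_bit(int bit, toy_pte_t *ptep)
{
    assert(bit >= 0 && bit < 32);
    uint32_t mask = UINT32_C(1) << bit;
    return (atomic_fetch_or(ptep, mask) & mask) != 0;
}

/* Atomically clear 'bit'; report whether it was set beforehand. */
static bool toy_test_and_clear_pte_bit(int bit, toy_pte_t *ptep)
{
    assert(bit >= 0 && bit < 32);
    uint32_t mask = UINT32_C(1) << bit;
    return (atomic_fetch_and(ptep, ~mask) & mask) != 0;
}
```

        Because the old value comes back from the same atomic operation
        that writes the new one, an A/D bit set by hardware just before
        the update is not lost.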

    74) VMI_SetPteLong
    75) VMI_SwapPteLong           
    76) VMI_TestAndSetPteBitLong
    77) VMI_TestAndClearPteBitLong

        void VMI_SetPteLong(VMI_PAE_PTE pte, VMI_PAE_PTE *ptep);
        VMI_PAE_PTE VMI_SwapPteLong(VMI_UINT64 pte, VMI_PAE_PTE *ptep);
        VMI_BOOL VMI_TestAndSetPteBitLong(VMI_INT bit, VMI_PAE_PTE *ptep);
        VMI_BOOL VMI_TestAndClearPteBitLong(VMI_INT bit, VMI_PAE_PTE *ptep);
        
        These functions act identically to the 32-bit PTE update functions,
        but provide support for PAE mode.  The calls are guaranteed to never
        create a temporarily invalid but present page mapping that could be
        accidentally prefetched by another processor, and all returned bits
        are guaranteed to be atomically up to date.

        One special exception: the VMI_SwapPteLong function only provides
        synchronization against A/D bit updates from other processors, not
        against other invocations of VMI_SwapPteLong.
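
        One common way for a native implementation to avoid a
        temporarily invalid but present mapping is to order the two
        32-bit halves of the store.  The sketch below assumes the target
        entry is currently not present and uses hypothetical names; it
        illustrates the ordering idea and is not taken from the VMI
        sources.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in for an 8-byte PAE entry, split into two 32-bit halves. */
struct toy_pae_pte {
    _Atomic uint32_t low;    /* holds the present bit (bit 0) */
    _Atomic uint32_t high;
};

/*
 * Install a new entry into a slot that is currently not present.
 * Writing the high half first means the present bit only becomes
 * visible once the whole entry is in place.
 */
static void toy_set_pte_long(uint64_t pte, struct toy_pae_pte *ptep)
{
    assert((atomic_load(&ptep->low) & 1u) == 0);
    atomic_store_explicit(&ptep->high, (uint32_t)(pte >> 32),
                          memory_order_relaxed);
    atomic_store_explicit(&ptep->low, (uint32_t)pte,
                          memory_order_release);
}
```

        Replacing an entry that is already present needs more care, for
        example clearing the low half first or using a 64-bit atomic
        exchange.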

    78) VMI_ClonePageTable
        VMI_ClonePageDirectory

        #define VMI_MKCLONE(start, count) (((start) << 16) | (count))

        void VMI_ClonePageTable(VMI_UINT32 dstPPN, VMI_UINT32 srcPPN,
                                VMI_UINT32 flags);
        void VMI_ClonePageDirectory(VMI_UINT32 dstPPN, VMI_UINT32 srcPPN,
                                VMI_UINT32 flags);

        These functions tell the hypervisor to allocate a page shadow
        at the PT or PD level using a shadow template.  Because spare
        bits are available in 'flags', the two calls may later be merged
        into one, with 'flags' also indicating whether the shadows are
        PAE.
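
        The text does not spell out what 'start' and 'count' mean.  The
        sketch below assumes they are an entry index and a number of
        entries packed into 16 bits each, and shows the value the macro
        produces for one such pair.  The toy_ helpers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define VMI_MKCLONE(start, count) (((start) << 16) | (count))

/* Hypothetical decoders for a value built with VMI_MKCLONE. */
static uint32_t toy_clone_start(uint32_t flags)
{
    return flags >> 16;
}

static uint32_t toy_clone_count(uint32_t flags)
{
    return flags & 0xffffu;
}

/* Build the flags word for 'count' entries beginning at entry 'start'. */
static uint32_t toy_mkclone(uint32_t start, uint32_t count)
{
    assert(start <= 0xffffu && count <= 0xffffu);
    return VMI_MKCLONE(start, count);
}
```

        Under that assumption, toy_mkclone(768, 256) yields 0x03000100,
        which would describe the upper quarter of a 1024-entry page
        directory.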

    80) VMI_RegisterPageUsage
    81) VMI_ReleasePage

        #define VMI_PAGE_PT              0x01
        #define VMI_PAGE_PD              0x02
        #define VMI_PAGE_PDP             0x04
        #define VMI_PAGE_PML4            0x08
        #define VMI_PAGE_GDT             0x10
        #define VMI_PAGE_LDT             0x20
        #define VMI_PAGE_IDT             0x40
        #define VMI_PAGE_TSS             0x80

        void  VMI_RegisterPageUsage(VMI_UINT32 ppn, int flags);
        void  VMI_ReleasePage(VMI_UINT32 ppn, int flags);

        These are used to register a page with the hypervisor as being of a
        particular type, for instance, VMI_PAGE_PT says it is a page table
        page.  

    85) VMI_SetDeferredMode

        void VMI_SetDeferredMode(VMI_UINT32 deferBits); 

        Set the lazy state update mode to the specified set of bits.  This
        allows the processor, hypervisor, or VMI layer to lazily update
        certain CPU and MMU state.  When setting this to a more permissive
        setting, no flush is implied, but when clearing bits in the current
        defer mask, all pending state will be flushed.

        The 'deferBits' is a mask specifying how to flush.

            #define VMI_DEFER_NONE          0x00

        Disallow all asynchronous state updates.  This is the default
        state.

            #define VMI_DEFER_MMU           0x01

        Flush all pending page table updates.  Note that page faults,
        invalidations and TLB flushes will implicitly flush all pending
        updates. 

            #define VMI_DEFER_CPU           0x02

        Allow CPU state updates to control registers to be deferred, with
        the exception of updates that change FPU state.  This is useful
        for combining a reload of the page table base in CR3 with other
        updates, such as the current kernel stack.

            #define VMI_DEFER_DT            0x04

        Allow descriptor table updates to be delayed.  This allows the
        VMI_UpdateGDT / IDT / LDT calls to be asynchronously queued.

    86) VMI_FlushDeferredCalls

        void VMI_FlushDeferredCalls(void);

        Flush all asynchronous state updates which may be queued as
        a result of setting deferred update mode.
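
        The flush rule for VMI_SetDeferredMode can be modelled in a few
        lines.  This is a toy model with hypothetical names: it tracks
        only the current mask and a count of queued updates, and treats
        "clearing a bit that was set" as the sole trigger for a flush.

```c
#include <assert.h>
#include <stdint.h>

#define VMI_DEFER_NONE 0x00
#define VMI_DEFER_MMU  0x01
#define VMI_DEFER_CPU  0x02
#define VMI_DEFER_DT   0x04

/* Toy model: the current defer mask and a count of queued updates. */
struct toy_defer_state {
    uint32_t mask;
    unsigned pending;
    unsigned flushes;
};

static void toy_flush_deferred_calls(struct toy_defer_state *s)
{
    s->pending = 0;
    s->flushes++;
}

static void toy_set_deferred_mode(struct toy_defer_state *s, uint32_t bits)
{
    assert((bits & ~(uint32_t)(VMI_DEFER_MMU | VMI_DEFER_CPU |
                               VMI_DEFER_DT)) == 0);
    /* Flush only when a bit that was set is being cleared. */
    if (s->mask & ~bits)
        toy_flush_deferred_calls(s);
    s->mask = bits;
}
```

        Moving from VMI_DEFER_MMU to VMI_DEFER_MMU | VMI_DEFER_CPU is
        therefore free, while dropping back to VMI_DEFER_NONE drains
        everything that was queued.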


Appendix B - VMI C prototypes

   Most of the VMI calls are properly callable C functions.  Note that for the
   absolute best performance, assembly calls are preferable in some cases, as
   they do not imply all of the side effects of a C function call, such as
   register clobber and memory access.  Nevertheless, these wrappers serve as
   a useful interface definition for higher level languages.

   In some cases, a dummy variable is passed as an unused input to force
   proper alignment of the remaining register values.

   The call convention for these is defined to be standard GCC convention with
   register passing.  The regparm call interface is documented at:

   http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html

   Types used by these calls:

   VMI_UINT64   64 bit unsigned integer
   VMI_UINT32   32 bit unsigned integer
   VMI_UINT16   16 bit unsigned integer
   VMI_UINT8    8 bit unsigned integer
   VMI_INT      32 bit integer
   VMI_UINT     32 bit unsigned integer
   VMI_DTR      6 byte compressed descriptor table limit/base
   VMI_PTE      4 byte page table entry (or page directory)
   VMI_LONG_PTE 8 byte page table entry (or PDE or PDPE)
   VMI_SELECTOR 16 bit segment selector
   VMI_BOOL     32 bit unsigned integer
   VMI_CYCLES   64 bit unsigned integer
   VMI_NANOSECS 64 bit unsigned integer


   #ifndef VMI_PROTOTYPES_H
   #define VMI_PROTOTYPES_H

   /* Insert local type definitions here */
   typedef struct VMI_DTR {
      VMI_UINT16 limit;
      VMI_UINT32 offset __attribute__ ((packed));
   } VMI_DTR;

   typedef struct APState {
      VMI_UINT32 cr0;
      VMI_UINT32 cr2;
      VMI_UINT32 cr3;
      VMI_UINT32 cr4;

      VMI_UINT64 efer;

      VMI_UINT32 eip;
      VMI_UINT32 eflags;
      VMI_UINT32 eax;
      VMI_UINT32 ebx;
      VMI_UINT32 ecx;
      VMI_UINT32 edx;
      VMI_UINT32 esp;
      VMI_UINT32 ebp;
      VMI_UINT32 esi;
      VMI_UINT32 edi;
      VMI_UINT16 cs;
      VMI_UINT16 ss;

      VMI_UINT16 ds;
      VMI_UINT16 es;
      VMI_UINT16 fs;
      VMI_UINT16 gs;
      VMI_UINT16 ldtr;

      VMI_UINT16 gdtrLimit;
      VMI_UINT32 gdtrBase;
      VMI_UINT32 idtrBase;
      VMI_UINT16 idtrLimit;
   } APState;

   #define VMICALL __attribute__((regparm(3)))

   /* CORE INTERFACE CALLS */
   VMICALL void VMI_Init(void);

   /* PROCESSOR STATE CALLS */
   VMICALL void     VMI_DisableInterrupts(void);
   VMICALL void     VMI_EnableInterrupts(void);

   VMICALL VMI_UINT VMI_GetInterruptMask(void);
   VMICALL void     VMI_SetInterruptMask(VMI_UINT mask);

   VMICALL void     VMI_Pause(void);
   VMICALL void     VMI_Halt(void);
   VMICALL void     VMI_Shutdown(void);
   VMICALL void     VMI_Reboot(VMI_INT how);

   #define VMI_REBOOT_SOFT 0x0
   #define VMI_REBOOT_HARD 0x1

   void VMI_SetInitialAPState(APState *apState, VMI_UINT32 apicID);

   /* DESCRIPTOR RELATED CALLS */
   VMICALL void         VMI_SetGDT(VMI_DTR *gdtr);
   VMICALL void         VMI_SetIDT(VMI_DTR *idtr);
   VMICALL void         VMI_SetLDT(VMI_SELECTOR ldtSel);
   VMICALL void         VMI_SetTR(VMI_SELECTOR tssSel);

   VMICALL void         VMI_GetGDT(VMI_DTR *gdtr);
   VMICALL void         VMI_GetIDT(VMI_DTR *idtr);
   VMICALL VMI_SELECTOR VMI_GetLDT(void);
   VMICALL VMI_SELECTOR VMI_GetTR(void);

   VMICALL void         VMI_WriteGDTEntry(void *gdt,
                                          VMI_UINT entry,
                                          VMI_UINT32 descLo,
                                          VMI_UINT32 descHi);
   VMICALL void         VMI_WriteLDTEntry(void *ldt,
                                          VMI_UINT entry,
                                          VMI_UINT32 descLo,
                                          VMI_UINT32 descHi);
   VMICALL void         VMI_WriteIDTEntry(void *idt,
                                          VMI_UINT entry,
                                          VMI_UINT32 descLo,
                                          VMI_UINT32 descHi);

   /* CPU CONTROL CALLS */
   VMICALL void       VMI_WRMSR(VMI_UINT64 val, VMI_UINT32 reg);
   VMICALL void       VMI_WRMSR_SPLIT(VMI_UINT32 valLo, VMI_UINT32 valHi,
                                      VMI_UINT32 reg);

   /* Not truly a proper C function; use dummy to align reg in ECX */
   VMICALL VMI_UINT64 VMI_RDMSR(VMI_UINT64 dummy, VMI_UINT32 reg);

   VMICALL void VMI_SetCR0(VMI_UINT val);
   VMICALL void VMI_SetCR2(VMI_UINT val);
   VMICALL void VMI_SetCR3(VMI_UINT val);
   VMICALL void VMI_SetCR4(VMI_UINT val);

   VMICALL VMI_UINT32 VMI_GetCR0(void);
   VMICALL VMI_UINT32 VMI_GetCR2(void);
   VMICALL VMI_UINT32 VMI_GetCR3(void);
   VMICALL VMI_UINT32 VMI_GetCR4(void);

   VMICALL void       VMI_CLTS(void);

   VMICALL void       VMI_SetDR(VMI_UINT32 num, VMI_UINT32 val);
   VMICALL VMI_UINT32 VMI_GetDR(VMI_UINT32 num);

   /* PROCESSOR INFORMATION CALLS */

   VMICALL VMI_UINT64 VMI_RDTSC(void);
   VMICALL VMI_UINT64 VMI_RDPMC(VMI_UINT64 dummy, VMI_UINT32 counter);

   /* STACK / PRIVILEGE TRANSITION CALLS */
   VMICALL void VMI_UpdateKernelStack(void *tss, VMI_UINT32 esp0);

   /* I/O CALLS */
   /* Native port in EDX - use dummy */
   VMICALL VMI_UINT8  VMI_INB(VMI_UINT dummy, VMI_UINT port);
   VMICALL VMI_UINT16 VMI_INW(VMI_UINT dummy, VMI_UINT port);
   VMICALL VMI_UINT32 VMI_INL(VMI_UINT dummy, VMI_UINT port);

   VMICALL void VMI_OUTB(VMI_UINT value, VMI_UINT port);
   VMICALL void VMI_OUTW(VMI_UINT value, VMI_UINT port);
   VMICALL void VMI_OUTL(VMI_UINT value, VMI_UINT port);

   VMICALL void VMI_IODelay(void);
   VMICALL void VMI_WBINVD(void);
   VMICALL void VMI_SetIOPLMask(VMI_UINT32 mask);

   /* APIC CALLS */
   VMICALL void       VMI_APICWrite(void *reg, VMI_UINT32 value);
   VMICALL VMI_UINT32 VMI_APICRead(void *reg);

   /* TIMER CALLS */
   VMICALL VMI_NANOSECS VMI_GetWallclockTime(void);
   VMICALL VMI_BOOL     VMI_WallclockUpdated(void);

   /* Predefined rate of the wallclock. */
   #define VMI_WALLCLOCK_HZ       1000000000

   VMICALL VMI_CYCLES VMI_GetCycleFrequency(void);
   VMICALL VMI_CYCLES VMI_GetCycleCounter(VMI_UINT32 whichCounter);

   /* Defined cycle counters */
   #define VMI_CYCLES_REAL        0
   #define VMI_CYCLES_AVAILABLE   1
   #define VMI_CYCLES_STOLEN      2

   VMICALL void     VMI_SetAlarm(VMI_UINT32 flags, VMI_CYCLES expiry,
                                 VMI_CYCLES period);
   VMICALL VMI_BOOL VMI_CancelAlarm(VMI_UINT32 flags);

   /* The alarm interface 'flags' bits. [TBD: exact format of 'flags'] */
   #define VMI_ALARM_COUNTER_MASK 0x000000ff

   #define VMI_ALARM_WIRED_IRQ0   0x00000000
   #define VMI_ALARM_WIRED_LVTT   0x00010000

   #define VMI_ALARM_IS_ONESHOT   0x00000000
   #define VMI_ALARM_IS_PERIODIC  0x00000100

   /* MMU CALLS */
   VMICALL void VMI_SetLinearMapping(int slot, VMI_UINT32 va,
                                     VMI_UINT32 pages, VMI_UINT32 ppn);

   /* The number of VMI address translation slots */
   #define VMI_LINEAR_MAP_SLOTS    4

   VMICALL void VMI_InvalPage(VMI_UINT32 va);
   VMICALL void VMI_FlushTLB(int how);
   
   /* Flags used by VMI_FlushTLB call */
   #define VMI_FLUSH_TLB            0x01
   #define VMI_FLUSH_GLOBAL         0x02

   #endif
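
   The VMI_DTR type is described as a 6 byte limit/base pair, and the
   packed attribute on the base member is what keeps it that size.  The
   sketch below restates the structure with <stdint.h> types so it
   compiles on its own, checks the layout at compile time, and adds a
   hypothetical helper (toy_make_dtr) that fills it in for a table of
   8-byte descriptors.  It assumes GCC or a compatible compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same layout as the VMI_DTR above, spelled with <stdint.h> types. */
typedef struct VMI_DTR {
    uint16_t limit;
    uint32_t offset __attribute__ ((packed));
} VMI_DTR;

/* The packed member keeps the structure at 6 bytes. */
_Static_assert(sizeof(VMI_DTR) == 6, "VMI_DTR must be 6 bytes");
_Static_assert(offsetof(VMI_DTR, offset) == 2, "base follows the limit");

/* Hypothetical helper: describe a table of 'entries' 8-byte descriptors. */
static VMI_DTR toy_make_dtr(uint32_t base, unsigned entries)
{
    VMI_DTR dtr;

    assert(entries >= 1 && entries <= 8192);
    dtr.limit = (uint16_t)(entries * 8 - 1);   /* limit is size minus one */
    dtr.offset = base;
    return dtr;
}
```

   Without the packed attribute the compiler would insert two bytes of
   padding after 'limit', and the structure would no longer match the
   in-memory format used by the descriptor table registers.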


Appendix C - Sensitive x86 instructions in the paravirtual environment

  This is a list of x86 instructions which may operate in a different manner
  when run inside of a paravirtual environment.

	ARPL - continues to function as normal, but kernel segment registers
	       may be different, so parameters to this instruction may need
	       to be modified. (System)
	
	IRET - the IRET instruction will be unable to change the IOPL, VM,
	       VIF, VIP, or IF fields. (System)

	       the IRET instruction may #GP if the return CS/SS RPL are
	       below the CPL, or are not equal. (System)

	LAR  - the LAR instruction will reveal changes to the DPL field of
	       descriptors in the GDT and LDT tables. (System, User)

	LSL  - the LSL instruction will reveal changes to the segment limit
	       of descriptors in the GDT and LDT tables. (System, User)

	LSS  - the LSS instruction may #GP if the RPL is not set properly.
	       (System)

	MOV  - the mov %seg, %reg instruction may reveal a different RPL
	       on the segment register. (System)

	       The mov %reg, %ss instruction may #GP if the RPL is not set
	       to the current CPL. (System)

	POP  - the pop %ss instruction may #GP if the RPL is not set to
	       the appropriate CPL. (System)

	POPF - the POPF instruction will be unable to set the hardware
	       interrupt flag. (System)

	PUSH - the push %seg instruction may reveal a different RPL on the
	       segment register. (System)

	PUSHF- the PUSHF instruction will reveal a possibly different IOPL,
	       and the value of the hardware interrupt flag, which is always
	       set.  (System, User)

	SGDT - the SGDT instruction will reveal the location and length of
	       the GDT shadow instead of the guest GDT. (System, User)

	SIDT - the SIDT instruction will reveal the location and length of
	       the IDT shadow instead of the guest IDT. (System, User)

	SLDT - the SLDT instruction will reveal the selector used for
	       the shadow LDT rather than the selector loaded by the guest.
	       (System, User).

	STR  - the STR instruction will reveal the selector used for the
	       shadow TSS rather than the selector loaded by the guest.
	       (System, User).
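
  Several of the entries above come down to the RPL field of a segment
  selector being different from what native code expects.  As a sketch,
  the helpers below split a selector into its architectural fields and
  compare two selectors while ignoring the RPL.  The function names and
  the selector values used with them are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* x86 segment selector layout: index[15:3] | TI[2] | RPL[1:0]. */
static unsigned sel_rpl(uint16_t sel)   { return sel & 0x3u; }
static unsigned sel_ti(uint16_t sel)    { return (sel >> 2) & 0x1u; }
static unsigned sel_index(uint16_t sel) { return sel >> 3; }

/*
 * Compare two selectors while ignoring the RPL, which a paravirtual
 * kernel should not assume to be zero.
 */
static int sel_same_descriptor(uint16_t a, uint16_t b)
{
    return (a & ~0x3u) == (b & ~0x3u);
}
```

  For example, 0x60 and 0x61 name the same GDT entry and differ only in
  RPL, so code that tests a saved selector for equality with a constant
  may need to mask the low two bits first.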


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal
  2006-03-13 18:42         ` Arjan van de Ven
  2006-03-13 18:48           ` Zachary Amsden
  2006-03-13 18:52           ` VMI interface documentation Zachary Amsden
@ 2006-03-13 18:56           ` Joshua LeVasseur
  2006-03-16 18:52             ` Jan Engelhardt
  2 siblings, 1 reply; 38+ messages in thread
From: Joshua LeVasseur @ 2006-03-13 18:56 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Zachary Amsden, Linus Torvalds, Linux Kernel Mailing List,
	Virtualization Mailing List, Xen-devel, Andrew Morton, Dan Hecht,
	Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li,
	Chris Wright, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy,
	Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn

>
> On Mar 13, 2006, at 19:42, Arjan van de Ven wrote:
>
> On Mon, 2006-03-13 at 10:30 -0800, Zachary Amsden wrote:
>>  and gives hypervisors room to grow while maintaining
>> binary compatibility with already released kernels.
>
> that I buy for binary only hypervisors. But in an open source world I'll
> buy this a LOT less as being relevant.
>


Binary compatibility to Linux is pretty important for applications.   
Even though Apache is open source, I don't want to recompile it for  
every new Linux kernel.  Fortunately I don't have to, because glibc  
abstracts the Linux kernel interface.  Consider VMI in the same role  
as glibc -- when the hypervisor changes, VMI maintains compatibility  
with your pre-existing infrastructure, while letting you have some of  
the benefits of the new hypervisor.  The upgrade and recompile game  
can quickly end in a stalemate when you have packages with  
conflicting dependencies (one package requires the old version, and  
the other package requires the new version).

Josh


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal
  2006-03-13 18:30       ` Zachary Amsden
  2006-03-13 18:42         ` Arjan van de Ven
@ 2006-03-13 18:56         ` Hollis Blanchard
  2006-03-13 18:59             ` Zachary Amsden
  1 sibling, 1 reply; 38+ messages in thread
From: Hollis Blanchard @ 2006-03-13 18:56 UTC (permalink / raw)
  To: virtualization
  Cc: Zachary Amsden, Arjan van de Ven, Xen-devel, Wim Coekaerts,
	Chris Wright, Christopher Li, Jan Beulich,
	Linux Kernel Mailing List, Linus Torvalds, Anne Holler,
	Jyothy Reddy, Kip Macy, Ky Srinivasan, Leendert van Doorn

On Monday 13 March 2006 12:30, Zachary Amsden wrote:
> It is an advantage for everyone.  It cuts support and certification 
> costs for Linux distributors, software vendors, makes debugging and 
> development easier, and gives hypervisors room to grow while maintaining 
> binary compatibility with already released kernels.

It certainly is good for kernel developers and end-users.

However, it would be a foolish distributor or ISV who tests with one 
hypervisor and decides that covers all hypervisors which implement the same 
interface. So I'm not sure there's any advantage w.r.t. support and 
certification costs.

-- 
Hollis Blanchard
IBM Linux Technology Center

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal
  2006-03-13 18:56         ` Hollis Blanchard
@ 2006-03-13 18:59             ` Zachary Amsden
  0 siblings, 0 replies; 38+ messages in thread
From: Zachary Amsden @ 2006-03-13 18:59 UTC (permalink / raw)
  To: Hollis Blanchard
  Cc: virtualization, Arjan van de Ven, Xen-devel, Wim Coekaerts,
	Chris Wright, Christopher Li, Jan Beulich,
	Linux Kernel Mailing List, Linus Torvalds, Anne Holler,
	Jyothy Reddy, Kip Macy, Ky Srinivasan, Leendert van Doorn

Hollis Blanchard wrote:
> On Monday 13 March 2006 12:30, Zachary Amsden wrote:
>   
>> It is an advantage for everyone.  It cuts support and certification 
>> costs for Linux distributors, software vendors, makes debugging and 
>> development easier, and gives hypervisors room to grow while maintaining 
>> binary compatibility with already released kernels.
>>     
>
> It certainly is good for kernel developers and end-users.
>
> However, it would be a foolish distributor or ISV who tests with one 
> hypervisor and decides that covers all hypervisors which implement the same 
> interface. So I'm not sure there's any advantage w.r.t. support and 
> certification costs.
>   

Your point is well noted.  I'm not arguing that it would be smart to 
test with just one hypervisor (or worse yet, test only on native 
hardware), and proudly declare your kernel virtualization compatible.  
There are some things you can do (instrument a torture test verification 
module in a native VMI ROM) to help with that test load.

But in the end, having a single binary reduces the complexity and work 
that goes into a certification, which does simplify the process - even 
if you still have to validate against the list of all supported vendors 
/ hardware.

Zach

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal
  2006-03-13 18:48           ` Zachary Amsden
@ 2006-03-13 19:02             ` Chris Wright
  0 siblings, 0 replies; 38+ messages in thread
From: Chris Wright @ 2006-03-13 19:02 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Arjan van de Ven, Linus Torvalds, Linux Kernel Mailing List,
	Virtualization Mailing List, Xen-devel, Andrew Morton, Dan Hecht,
	Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li,
	Joshua LeVasseur, Chris Wright, Rik Van Riel, Jyothy Reddy,
	Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts,
	Leendert van Doorn

* Zachary Amsden (zach@vmware.com) wrote:
> This we find to be quite true.  Now, you can use a VMI kernel, make 
> changes to it, run it on native hardware, and be confident that it will 
> run properly in a VM as well.  And you can develop in a VM, with 
> confidence that you can run on native hardware.  You can even replace 
> the entire "ROM" image with your own custom debugging image to add any 
> type of debugging or performance monitoring facility you want - and you 
> have some very, very interesting hook points into the kernel that make 
> that task much more achievable.

Replacing a ROM image is easier in terms of package management, but still
requires a full validation and certification process.  Swapping out the
underlying core is a major change and needs to be re-validated.

thanks,
-chris

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal
  2006-03-13 17:58 ` Zachary Amsden
  (?)
  (?)
@ 2006-03-13 20:17 ` Sam Vilain
  -1 siblings, 0 replies; 38+ messages in thread
From: Sam Vilain @ 2006-03-13 20:17 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Linus Torvalds, Linux Kernel Mailing List,
	Virtualization Mailing List, Xen-devel, Andrew Morton, Dan Hecht,
	Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li,
	Joshua LeVasseur, Chris Wright, Rik Van Riel, Jyothy Reddy,
	Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts,
	Leendert van Doorn

Zachary Amsden wrote:

>In OLS 2005, we described the work that we have been doing in VMware
>with respect a common interface for paravirtualization of Linux. We
>shared the general vision in Rik's virtualization BoF.
>[...]
>Unlike the full-virtualization techniques used in the traditional VMware
>products, paravirtualization is a technique where the operating system
>is modified to enlighten the hypervisor with timely knowledge about the
>operating system's activities. Since the hypervisor now depends on the
>kernel to tell it about common idioms etc, it does not need to write
>protect OS objects such as page and descriptor tables as a solution
>based on full-virtualization needs. This has two important effects (a)
>it shortens the critical path, since faulting is expensive on modern
>processors (b) by eliminating complex heuristics the hypervisor is
>simplified. While the former delivers performance, the latter is quite
>important too. 
>  
>

An interesting vision, especially if it merges the current VMware / Xen
techno-social rift.

I think there will still be a place for the "complete" (eg, QEMU, or of
course your own product) and the "ultimately lightweight" (eg,
vserver/openvz/jails/containers) approaches to virtualisation, though.

While the same kernel may be able to run in these different situations,
having a "real" hardware emulator like QEMU/VMware will allow you to
test all of those alternate code paths.  As time goes on this will no
doubt seem a more and more superfluous requirement, especially if the
actual code in those places is minimal as you suggest.

Looking the other way, there is no way that a system like this will ever
approach the performance of fork(), as vserver and related technology
does.  No doubt for certain common applications of virtualisation, such
as providing "complete" virtual servers, this will be seen as less and
less important as time goes on.  However, for other applications - such
as jailing services, or systems that make use of advantages of single
kernel virtualisation (such as shared VFS/network, visibility into other
systems' processes, etc) - the extra kernel that a virtualisation
context implies is simply unwanted.  No doubt still other users will
simply prefer the simplicity of system call level virtualisation, such
as only having one set of routing tables/iptables rules/VGs to manage, etc.

There are currently two debates on virtualisation happening here, on and
off.  We have Xen/VMI, and vserver/openvz/jails/containers.  Let's just
try not to get them confused :-).  From the perspective of the vserver
project, I consider your work orthogonal and complementary and wish you
well in the success of your gracious offering to the Linux community.

Sam.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal
  2006-03-13 17:58 ` Zachary Amsden
                   ` (2 preceding siblings ...)
  (?)
@ 2006-03-14  0:39 ` Anthony Liguori
  2006-03-14  4:01   ` Zachary Amsden
  -1 siblings, 1 reply; 38+ messages in thread
From: Anthony Liguori @ 2006-03-14  0:39 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Linus Torvalds, Linux Kernel Mailing List,
	Virtualization Mailing List, Xen-devel, Andrew Morton, Dan Hecht,
	Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li,
	Joshua LeVasseur, Chris Wright, Rik Van Riel, Jyothy Reddy,
	Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts,
	Leendert van Doorn

Zachary Amsden wrote:
> This is by no means finished work. A few of the areas that need more
> attention and exploration are (a) 64bit support is still lacking, but we
> feel a port of VMI to the 64 bit Linux can be done without too much
> trouble (b) the Xen compatibility layer needs some work to bring it
> up to the Xen 3.0 interfaces.  Work is underway on this already, and
> no major issues are expected at this time. 
>   
Hi Zach,

Can you please post the Xen compatibility layer (even if it is for 
2.0.x)?  I think it's important to see that code to understand the 
advantages/disadvantages compared to the existing Xen paravirtualization 
interface.  Likewise, any Xen performance data would be useful as there 
has been some discussion about whether VMI would negatively impact Xen 
performance[1].

Thanks,

Anthony Liguori
> Two final notes.  This is not an attempt to force a proprietary interface
> into the Linux kernel.  This is an attempt to find a common interface
> that can be used by many hypervisors by isolating hypervisor specific
> idioms into a neutral layer.  This new layer is just what it claims to
> be - a virtual machine interface, which allows hypervisor dependent code
> to be abstracted in a way that benefits both Linux and hypervisor
> development.
>
> This is also not an attempt to define an exact and final specification
> of how virtualization should be done in Linux.  This is very much a work
> in progress, and it is understood that the interfaces proposed here will
> change in time to accommodate the needs of all interested parties.  We 
> hope to find a common solution that can eventually become part of the
> Linux kernel and serve as a model for other operating systems as well.
>
> We appreciate your feedback on this design and the patches to Linux, and
> welcome working with anyone who is interested in making virtualization
> in Linux a friendly environment to innovate in.  If you find the ideas
> here interesting, please volunteer to help improve them.
>   


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal
  2006-03-14  0:39 ` Anthony Liguori
@ 2006-03-14  4:01   ` Zachary Amsden
  2006-03-14  4:04     ` Rik van Riel
  0 siblings, 1 reply; 38+ messages in thread
From: Zachary Amsden @ 2006-03-14  4:01 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Linus Torvalds, Linux Kernel Mailing List,
	Virtualization Mailing List, Xen-devel, Andrew Morton, Dan Hecht,
	Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li,
	Joshua LeVasseur, Chris Wright, Rik Van Riel, Jyothy Reddy,
	Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts,
	Leendert van Doorn

Anthony Liguori wrote:
> Zachary Amsden wrote:
>> This is by no means finished work. A few of the areas that need more
>> attention and exploration are (a) 64bit support is still lacking, but we
>> feel a port of VMI to the 64 bit Linux can be done without too much
>> trouble (b) the Xen compatibility layer needs some work to bring it
>> up to the Xen 3.0 interfaces.  Work is underway on this already, and
>> no major issues are expected at this time.   
> Hi Zach,
>
> Can you please post the Xen compatibility layer (even if it is for 
> 2.0.x).  I think it's important to see that code to understand the 
> advantages/disadvantages compared to the existing Xen 
> paravirtualization interface.  Likewise, any Xen performance data 
> would be useful as there has been some discussion about whether VMI 
> would negatively impact Xen performance[1].

About performance - I believe it is possible to implement VMI Linux in 
such a way that it actually has _better_ performance on Xen than the 
current XenoLinux kernels.

I'm working on getting together the older interface pieces now.

Zach

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal
  2006-03-14  4:01   ` Zachary Amsden
@ 2006-03-14  4:04     ` Rik van Riel
  2006-03-14  4:55       ` Zachary Amsden
  0 siblings, 1 reply; 38+ messages in thread
From: Rik van Riel @ 2006-03-14  4:04 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Anthony Liguori, Linus Torvalds, Linux Kernel Mailing List,
	Virtualization Mailing List, Xen-devel, Andrew Morton, Dan Hecht,
	Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li,
	Joshua LeVasseur, Chris Wright, Jyothy Reddy, Jack Lo, Kip Macy,
	Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn

On Mon, 13 Mar 2006, Zachary Amsden wrote:

> About performance - I actually believe that it is possible to implement 
> VMI Linux in such a way that it actually has _better_ performance on Xen 
> than the current XenoLinux kernels.

How would VMI allow page table batching at fault time?
(one of the future optimizations that are probably worth
making for Xen)

-- 
All Rights Reversed

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal
  2006-03-13 17:58 ` Zachary Amsden
                   ` (3 preceding siblings ...)
  (?)
@ 2006-03-14  4:13 ` Anthony Liguori
  2006-03-14  4:26   ` Zachary Amsden
  -1 siblings, 1 reply; 38+ messages in thread
From: Anthony Liguori @ 2006-03-14  4:13 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Linus Torvalds, Linux Kernel Mailing List,
	Virtualization Mailing List, Xen-devel, Andrew Morton, Dan Hecht,
	Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li,
	Joshua LeVasseur, Chris Wright, Rik Van Riel, Jyothy Reddy,
	Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts,
	Leendert van Doorn

Hi Zach,

A number of the files you posted (including vmi_spec.txt) have the 
phrase 'All rights reserved'.  That seems incompatible with the GPL.  In 
particular, it makes it unclear how one can use the actual VMI spec.

In your next round of patches, could you clarify the actual licensing of 
the files?

Regards,

Anthony Liguori

Zachary Amsden wrote:
> In OLS 2005, we described the work that we have been doing in VMware
> with respect to a common interface for paravirtualization of Linux. We
> shared the general vision in Rik's virtualization BoF.
>
> This note is an update on our further work on the Virtual Machine
> Interface, VMI.  The patches provided have been tested on 2.6.16-rc6.
> We are currently recollecting performance information for the new -rc6
> kernel, but expect our numbers to match previous results, which showed
> no impact whatsoever on macro benchmarks, and nearly negligible impact
> on microbenchmarks.
>
> Unlike the full-virtualization techniques used in the traditional VMware
> products, paravirtualization is a technique where the operating system
> is modified to enlighten the hypervisor with timely knowledge about the
> operating system's activities. Since the hypervisor now depends on the
> kernel to tell it about common idioms etc, it does not need to write
> protect OS objects such as page and descriptor tables as a solution
> based on full-virtualization needs. This has two important effects (a)
> it shortens the critical path, since faulting is expensive on modern
> processors (b) by eliminating complex heuristics the hypervisor is
> simplified. While the former delivers performance, the latter is quite
> important too. 
>
> Not surprisingly, paravirtualization's strength, ie that it encourages
> tighter communication between the kernel and the hypervisor, is also its
> weakness. Unless the changes to the operating system are moderated, you
> can very quickly find yourself with a kernel that (a) looks and feels
> like a brand new kernel or (b) cannot run on native machines or on newer
> versions of the hypervisor without a full recompile. The former can
> impede innovation in the Linux kernel, and the latter can be a problem
> for software vendors. 
>
> VMware proposes VMI as a paravirtualization interface for Linux that
> solves these problems. 
>   - A VMI'fied Linux kernel runs unmodified on native hardware, and on
>     many hypervisors, while simultaneously delivering on the performance
>     promise of paravirtualization. 
>   - VMI has a rich and low level interface, which allows the kernel to
>     cope with future hardware evolution by querying for hardware
>     capability. It is our expectation that a single kernel will run
>     unmodified on both today's processors with limited hardware
>     virtualization support and also keep up with any evolution on the
>     processor front 
>   - VMI Linux is a fairly clean interface, with distinct name spaces
>     for objects from the kernel and the hypervisor. Nowhere do we mingle
>     names from the hypervisor with that of the kernel. This separation
>     allows innovation in the kernel to proceed at the same speed as
>     always. For most kernel developers, a VMI kernel looks and feels like
>     a regular Linux kernel.  
>   - VMI Linux still supports "native" hypervisor device drivers, for
>     example a hypervisor vendor's own private network or block device
>     drivers which are free to use any interface desired to communicate
>     with the hypervisor.
>
> At present, we are sharing a working implementation of the VMI for
> 2.6.16-rc6 version of Linux. We have verified that VMI Linux does indeed
> run well on native machines (both P4 and Opterons), and on VMware style
> hypervisors. VMI Linux has negligible overheads on native machines, so
> much so, that we are confident that VMI Linux can, in the long run, be
> the default Linux for i386.  We believe that this interface is both
> cleaner and more powerful than other proposals that have been made
> towards virtualization of Linux, and can easily be adapted to work with
> other hypervisors.
>
> This is by no means finished work. A few of the areas that need more
> attention and exploration are (a) 64bit support is still lacking, but we
> feel a port of VMI to the 64 bit Linux can be done without too much
> trouble (b) the Xen compatibility layer needs some work to bring it
> up to the Xen 3.0 interfaces.  Work is underway on this already, and
> no major issues are expected at this time. 
>
> Two final notes.  This is not an attempt to force a proprietary interface
> into the Linux kernel.  This is an attempt to find a common interface
> that can be used by many hypervisors by isolating hypervisor specific
> idioms into a neutral layer.  This new layer is just what it claims to
> be - a virtual machine interface, which allows hypervisor dependent code
> to be abstracted in a way that benefits both Linux and hypervisor
> development.
>
> This is also not an attempt to define an exact and final specification
> of how virtualization should be done in Linux.  This is very much a work
> in progress, and it is understood that the interfaces proposed here will
> change in time to accommodate the needs of all interested parties.  We 
> hope to find a common solution that can eventually become part of the
> Linux kernel and serve as a model for other operating systems as well.
>
> We appreciate your feedback on this design and the patches to Linux, and
> welcome working with anyone who is interested in making virtualization
> in Linux a friendly environment to innovate in.  If you find the ideas
> here interesting, please volunteer to help improve them.
>   


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal
  2006-03-14  4:13 ` Anthony Liguori
@ 2006-03-14  4:26   ` Zachary Amsden
  2006-03-14  4:30     ` Rik van Riel
  0 siblings, 1 reply; 38+ messages in thread
From: Zachary Amsden @ 2006-03-14  4:26 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Linus Torvalds, Linux Kernel Mailing List,
	Virtualization Mailing List, Xen-devel, Andrew Morton, Dan Hecht,
	Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li,
	Joshua LeVasseur, Chris Wright, Rik Van Riel, Jyothy Reddy,
	Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts,
	Leendert van Doorn

Anthony Liguori wrote:
> Hi Zach,
>
> A number of the files you posted (including the vmi_spec.txt) have the 
> phrase 'All rights reserved'.  That seems incompatible with the GPL.  
> In particular, it makes it unclear about how one can use the actual 
> vmi spec.
>
> In your next round of patches, could you clarify the actual licensing 
> of the files?

I'm sorry about the legalese.  The patches are patches to the Linux 
kernel, and therefore under GPL v2 by default.  I thought that would be 
implicit.

Zach

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal
  2006-03-14  4:26   ` Zachary Amsden
@ 2006-03-14  4:30     ` Rik van Riel
  2006-03-14  5:46       ` Zachary Amsden
  0 siblings, 1 reply; 38+ messages in thread
From: Rik van Riel @ 2006-03-14  4:30 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Anthony Liguori, Linus Torvalds, Linux Kernel Mailing List,
	Virtualization Mailing List, Xen-devel, Andrew Morton, Dan Hecht,
	Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li,
	Joshua LeVasseur, Chris Wright, Jyothy Reddy, Jack Lo, Kip Macy,
	Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn

On Mon, 13 Mar 2006, Zachary Amsden wrote:
> Anthony Liguori wrote:

> > In your next round of patches, could you clarify the actual licensing of the
> > files?
> 
> I'm sorry about the legalese.  The patches are patches to the Linux kernel,
> and therefore under GPL v2 by default.  I thought that would be implicit.

It would be very bad if Linus started applying code with
a dubious license to the kernel, if we want to keep the
kernel GPL v2.

Having an explicit license and a Signed-off-by: line are
things to remember with big patch sets.  At the very least
a Signed-off-by: line.

-- 
All Rights Reversed

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal
  2006-03-14  4:04     ` Rik van Riel
@ 2006-03-14  4:55       ` Zachary Amsden
  0 siblings, 0 replies; 38+ messages in thread
From: Zachary Amsden @ 2006-03-14  4:55 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Anthony Liguori, Linus Torvalds, Linux Kernel Mailing List,
	Virtualization Mailing List, Xen-devel, Andrew Morton, Dan Hecht,
	Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li,
	Joshua LeVasseur, Chris Wright, Jyothy Reddy, Jack Lo, Kip Macy,
	Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn

Rik van Riel wrote:
> On Mon, 13 Mar 2006, Zachary Amsden wrote:
>
>   
>> About performance - I actually believe that it is possible to implement 
>> VMI Linux in such a way that it actually has _better_ performance on Xen 
>> than the current XenoLinux kernels.
>>     
>
> How would VMI allow page table batching at fault time?
> (one of the future optimizations that are probably worth
> making for Xen)
>   

This is exactly what we do.  All page table transitions from present to 
not-present (P->NP) or present to present (P->P) already require a 
flushing call (FlushTLB or InvalPage).  The remaining transitions, 
NP->P, require explicit flushing, and we have added the appropriate call 
sites to do so.  It turns out that the external MMU cache on Sparc 
already provided exactly the required hook point for this case - 
update_mmu_cache().
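
The transition rule described above can be sketched roughly as follows.
This is only an illustration, assuming the i386 convention that bit 0 of
a page table entry is the present bit; the enum and function names are
hypothetical and are not taken from the VMI patches.

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT 0x1u   /* i386 present bit (bit 0 of the PTE) */

/* Hypothetical labels for where a hypervisor would learn of a change. */
enum notify_point {
    NOTIFY_NONE,             /* nothing changed, or NP->NP */
    NOTIFY_TLB_FLUSH,        /* P->NP or P->P: an existing flush call */
    NOTIFY_UPDATE_MMU_CACHE  /* NP->P: explicit hook, update_mmu_cache() */
};

/* Classify a page table update by the present bit before and after. */
static enum notify_point classify_transition(uint32_t old_pte,
                                             uint32_t new_pte)
{
    bool was_present = (old_pte & PTE_PRESENT) != 0;
    bool is_present = (new_pte & PTE_PRESENT) != 0;

    if (old_pte == new_pte)
        return NOTIFY_NONE;            /* no actual update */
    if (was_present)
        return NOTIFY_TLB_FLUSH;       /* stale translation may be cached */
    if (is_present)
        return NOTIFY_UPDATE_MMU_CACHE; /* newly valid mapping */
    return NOTIFY_NONE;                /* hardware never saw either value */
}
```

The point of the sketch is that every update the hardware could observe
falls into one of the two notified cases, so updates made in between can
be queued and handed over in a batch at the next notification point.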

Zach

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal
  2006-03-14  4:30     ` Rik van Riel
@ 2006-03-14  5:46       ` Zachary Amsden
  2006-03-14 12:44         ` Rik van Riel
  2006-03-16 18:58         ` Jan Engelhardt
  0 siblings, 2 replies; 38+ messages in thread
From: Zachary Amsden @ 2006-03-14  5:46 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Anthony Liguori, Linus Torvalds, Linux Kernel Mailing List,
	Virtualization Mailing List, Xen-devel, Andrew Morton, Dan Hecht,
	Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li,
	Joshua LeVasseur, Chris Wright, Jyothy Reddy, Jack Lo, Kip Macy,
	Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn

Rik van Riel wrote:
> It would be very bad if Linus started applying code with
> a dubious license to the kernel, if we want to keep the
> kernel GPL v2.
>   

I believe it says explicitly in our patches that they are licensed under 
GPL v2.

> Having an explicit license and a Signed-off-by: line are
> things to remember with big patch sets.  At the very least
> a Signed-off-by: line.
>   

There is a Signed-off-by line on every patch I send out, with full 
knowledge that this constitutes the work of the author of the said line, 
and full knowledge that this commits the patch into the domain of the 
GPL license.  Sorry for sounding like a lawyer here.  IANAL, but I 
thought that was completely implicit in all patches made to GPL'd 
software.  The signed off by provides accountability and open licensing 
simultaneously.

But most importantly, I really don't understand how it is possible to 
make a patch to the Linux kernel and not release it under GPL.

Zach

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal
  2006-03-14  5:46       ` Zachary Amsden
@ 2006-03-14 12:44         ` Rik van Riel
  2006-03-14 16:22           ` Zachary Amsden
  2006-03-16 18:58         ` Jan Engelhardt
  1 sibling, 1 reply; 38+ messages in thread
From: Rik van Riel @ 2006-03-14 12:44 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Anthony Liguori, Linus Torvalds, Linux Kernel Mailing List,
	Virtualization Mailing List, Xen-devel, Andrew Morton, Dan Hecht,
	Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li,
	Joshua LeVasseur, Chris Wright, Jyothy Reddy, Jack Lo, Kip Macy,
	Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn

On Mon, 13 Mar 2006, Zachary Amsden wrote:

> There is a Signed-off-by line on every patch I send out,

You're right.  It was just the first 1/24 that was missing it;
it was there in the second copy.

> But most importantly, I really don't understand how it is possible to 
> make a patch to the Linux kernel and not release it under GPL.

This can really only be done if the person posting the patch
does not have the right to release the code.  This is what the
Signed-off-by lines are for, IIRC.

-- 
All Rights Reversed

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal
  2006-03-14 12:44         ` Rik van Riel
@ 2006-03-14 16:22           ` Zachary Amsden
  0 siblings, 0 replies; 38+ messages in thread
From: Zachary Amsden @ 2006-03-14 16:22 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Anthony Liguori, Linus Torvalds, Linux Kernel Mailing List,
	Virtualization Mailing List, Xen-devel, Andrew Morton, Dan Hecht,
	Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li,
	Joshua LeVasseur, Chris Wright, Jyothy Reddy, Jack Lo, Kip Macy,
	Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn

Rik van Riel wrote:
> On Mon, 13 Mar 2006, Zachary Amsden wrote:
>
>   
>> There is a Signed-off-by line on every patch I send out,
>>     
>
> You're right.  It was just the first 1/24 that was missing it,
> it was there in the second copy.
>   

BTW, I have no idea why the first 1/24 was missing it.  I checked right 
before sending, and it was there - perhaps I forgot to save my changes.  
The second copy turned out fine, but didn't make it to LKML.  Everyone 
cc'd directly got it, but the LKML filter has a ban on the word 
propasition, and being blackholed by it, I merely assumed the patch was 
too large - so I split it up, and actually ended up binary searching 
down to the problematic section before finding the taboo list.

_Every_  problem eventually turns into a binary search.
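
For illustration, the bisection described above could look something
like the sketch below.  The filter is modeled by a hypothetical
chunk_rejected() helper that simply searches for a banned word; the real
list filter is opaque to the sender, and none of these names come from
any actual tool.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the list filter: nonzero if any line in
 * [lo, hi) contains the banned word. */
static int chunk_rejected(const char *const *lines, size_t lo, size_t hi,
                          const char *banned)
{
    for (size_t i = lo; i < hi; i++)
        if (strstr(lines[i], banned) != NULL)
            return 1;
    return 0;
}

/* Bisect [0, n) down to the first rejected line.  Returns n if the
 * whole message passes the filter. */
static size_t find_rejected_line(const char *const *lines, size_t n,
                                 const char *banned)
{
    size_t lo = 0, hi = n;

    if (!chunk_rejected(lines, lo, hi, banned))
        return n;

    /* Invariant: [lo, hi) contains at least one rejected line. */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;

        if (chunk_rejected(lines, lo, mid, banned))
            hi = mid;   /* culprit is in the first half */
        else
            lo = mid;   /* otherwise it must be in the second half */
    }
    return lo;
}
```

Each probe here corresponds to one trial send of half the remaining
text, so a patch of n lines needs about log2(n) attempts.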

Zach

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal
  2006-03-13 18:22   ` Zachary Amsden
  2006-03-13 18:26     ` Arjan van de Ven
@ 2006-03-15 10:25     ` Christoph Hellwig
  2006-03-15 15:57       ` Zachary Amsden
  2006-03-15 17:38       ` Joshua LeVasseur
  1 sibling, 2 replies; 38+ messages in thread
From: Christoph Hellwig @ 2006-03-15 10:25 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Arjan van de Ven, Linus Torvalds, Linux Kernel Mailing List,
	Virtualization Mailing List, Xen-devel, Andrew Morton, Dan Hecht,
	Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li,
	Joshua LeVasseur, Chris Wright, Rik Van Riel, Jyothy Reddy,
	Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts,
	Leendert van Doorn

On Mon, Mar 13, 2006 at 10:22:15AM -0800, Zachary Amsden wrote:
> >Why can't vmware use the Xen interface instead?
> >  
> 
> We could.  But it is our opinion that the Xen interface is unnecessarily 
> complicated, without a clean separation between the layer of interaction 
> with the hypervisor and the kernel proper.  The interface we propose we 
> believe is more powerful, and more conducive to performance 
> optimizations while providing significant advantages - most 
> specifically, a single binary image that is properly virtualizable on 
> multiple hypervisors and capable of running on native hardware.

I agree with Zach here, the Xen hypervisor <-> kernel interface is
not very nice.  This proposal seems like a step forward, although it'll
probably need to go through a few iterations.  Without an actually
usable open source hypervisor reference implementation it's totally
unacceptable, though.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal
  2006-03-15 10:25     ` Christoph Hellwig
@ 2006-03-15 15:57       ` Zachary Amsden
  2006-03-15 17:38       ` Joshua LeVasseur
  1 sibling, 0 replies; 38+ messages in thread
From: Zachary Amsden @ 2006-03-15 15:57 UTC (permalink / raw)
  To: Christoph Hellwig, Zachary Amsden, Arjan van de Ven,
	Linus Torvalds, Linux Kernel Mailing List,
	Virtualization Mailing List, Xen-devel, Andrew Morton, Dan Hecht,
	Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li,
	Joshua LeVasseur, Chris Wright, Rik Van Riel, Jyothy Reddy,
	Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts,
	Leendert van Doorn

Christoph Hellwig wrote:
> I agree with Zach here, the Xen hypervisor <-> kernel interface is
> not very nice.  This proposal seems like a step forward, although it'll
> probably need to go through a few iterations.  Without an actually
> usable open source hypervisor reference implementation it's totally
> unacceptable, though.
>   

Which is why our top priority is getting VMI Linux to run on Xen.  The 
churn rate on both ends has been very high, and we really wanted to 
release our patches with Xen support, but we also didn't want to wait 
some unknown number of weeks more to release them - and we're actually 
looking for volunteers to help with the port if anyone is interested.

What we are hoping for, in the end, is a Linux kernel with a clean 
virtualization interface, that is maintainable, does not slow hypervisor 
exploitation of new technologies, still offers all of the same 
performance advantages, works for the open source community, and allows 
hypervisor vendors of many creeds to benefit from cross-kernel binary 
compatibility.

Zach

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal
  2006-03-15 10:25     ` Christoph Hellwig
  2006-03-15 15:57       ` Zachary Amsden
@ 2006-03-15 17:38       ` Joshua LeVasseur
  2006-03-15 20:02         ` Andrew Morton
  1 sibling, 1 reply; 38+ messages in thread
From: Joshua LeVasseur @ 2006-03-15 17:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Zachary Amsden, Arjan van de Ven, Linus Torvalds,
	Linux Kernel Mailing List, Virtualization Mailing List, Xen-devel,
	Andrew Morton, Dan Hecht, Dan Arai, Anne Holler,
	Pratap Subrahmanyam, Christopher Li, Chris Wright, Rik Van Riel,
	Jyothy Reddy, Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan,
	Wim Coekaerts, Leendert van Doorn


On Mar 15, 2006, at 11:25 , Christoph Hellwig wrote:

> On Mon, Mar 13, 2006 at 10:22:15AM -0800, Zachary Amsden wrote:
>>> Why can't vmware use the Xen interface instead?
>>>
>>
>> We could.  But it is our opinion that the Xen interface is  
>> unnecessarily
>> complicated, without a clean separation between the layer of  
>> interaction
>> with the hypervisor and the kernel proper.  The interface we  
>> propose we
>> believe is more powerful, and more conducive to performance
>> optimizations while providing significant advantages - most
>> specifically, a single binary image that is properly virtualizable on
>> multiple hypervisors and capable of running on native hardware.
>
> I agree with Zach here, the Xen hypervisor <-> kernel interface is
> not very nice.  This proposal seems like a step forward, although it'll
> probably need to go through a few iterations.  Without an actually
> usable open source hypervisor reference implementation it's totally
> unacceptable, though.
>


As part of our pre-virtualization work, we developed a virtualization  
solution similar to VMI.  We support Xen v2 and v3 with high  
performance.  We added support for the first generation of VMI to our  
project, and are currently adding support for the latest VMI patch.   
Our work is open source.  We'll announce when we finish the VMI updates.

We also experimented with other architectures, such as Itanium, and  
found the approach highly suitable.

Joshua



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal
  2006-03-15 17:38       ` Joshua LeVasseur
@ 2006-03-15 20:02         ` Andrew Morton
  2006-03-16  0:05           ` Joshua LeVasseur
  0 siblings, 1 reply; 38+ messages in thread
From: Andrew Morton @ 2006-03-15 20:02 UTC (permalink / raw)
  To: Joshua LeVasseur
  Cc: hch, zach, arjan, torvalds, linux-kernel, virtualization,
	xen-devel, dhecht, arai, anne, pratap, chrisl, chrisw, riel,
	jreddy, jlo, kmacy, jbeulich, ksrinivasan, wim.coekaerts,
	leendert

Joshua LeVasseur <jtl@ira.uka.de> wrote:
>
> As part of our pre-virtualization work, we developed a virtualization  
> solution similar to VMI.  We support Xen v2 and v3 with high  
> performance.  We added support for the first generation of VMI to our  
> project, and are currently adding support for the latest VMI patch.   
> Our work is open source.  We'll announce when we finish the VMI updates.

Who is "we" and what product are you referring to?

(I think an important part of this discussion is getting an understanding
of which virtualisation products (current or planned) could use a VMI).

Thanks.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal
  2006-03-15 20:02         ` Andrew Morton
@ 2006-03-16  0:05           ` Joshua LeVasseur
  0 siblings, 0 replies; 38+ messages in thread
From: Joshua LeVasseur @ 2006-03-16  0:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: hch, zach, arjan, torvalds, linux-kernel, virtualization,
	xen-devel, dhecht, arai, anne, pratap, chrisl, chrisw, riel,
	jreddy, jlo, kmacy, jbeulich, ksrinivasan, wim.coekaerts,
	leendert


On Mar 15, 2006, at 21:02 , Andrew Morton wrote:

> Joshua LeVasseur <jtl@ira.uka.de> wrote:
>>
>> As part of our pre-virtualization work, we developed a virtualization
>> solution similar to VMI.  We support Xen v2 and v3 with high
>> performance.  We added support for the first generation of VMI to our
>> project, and are currently adding support for the latest VMI patch.
>> Our work is open source.  We'll announce when we finish the VMI  
>> updates.
>
> Who is "we" and what product are you referring to?
>
> (I think an important part of this discussion is getting an  
> understanding
> of which virtualisation products (current or planned) could use a  
> VMI).
>
> Thanks.


This is a project at the University of Karlsruhe. The project web  
page is:
http://l4ka.org/projects/virtualization/afterburn/
and there you'll find documentation and source code. The source code  
supports our pre-virtualization project on L4 and Xen (and we have
some nascent implementations for Linux-as-hypervisor and
Windows-as-hypervisor).  We automate the transformations to the Linux
code, thus
minimizing the number of manual modifications. Due to the  
similarities of our approach and VMI, our virtualization runtime  
supports VMI with some minor additions.

Joshua


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal
  2006-03-13 18:56           ` [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal Joshua LeVasseur
@ 2006-03-16 18:52             ` Jan Engelhardt
  0 siblings, 0 replies; 38+ messages in thread
From: Jan Engelhardt @ 2006-03-16 18:52 UTC (permalink / raw)
  To: Joshua LeVasseur
  Cc: Arjan van de Ven, Zachary Amsden, Linus Torvalds,
	Linux Kernel Mailing List, Virtualization Mailing List, Xen-devel,
	Andrew Morton, Dan Hecht, Dan Arai, Anne Holler,
	Pratap Subrahmanyam, Christopher Li, Chris Wright, Rik Van Riel,
	Jyothy Reddy, Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan,
	Wim Coekaerts, Leendert van Doorn

>> > and gives hypervisors room to grow while maintaining
>> > binary compatibility with already released kernels.
>> 
>> that I buy for binary only hypervisors. But in an open source world I'll
>> buy this a LOT less as being relevant.
>
> Binary compatibility to Linux is pretty important for applications.  Even
> though Apache is open source, I don't want to recompile it for every new Linux
> kernel.  Fortunately I don't have to, because glibc abstracts the Linux kernel
> interface.  Consider VMI in the same role as glibc -- when the hypervisor
> changes, VMI maintains compatibility with your pre-existing infrastructure,

VMI = kernel code (AFAIU)

I would rather have a user-space-based compat layer.


Jan Engelhardt
-- 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal
  2006-03-14  5:46       ` Zachary Amsden
  2006-03-14 12:44         ` Rik van Riel
@ 2006-03-16 18:58         ` Jan Engelhardt
  1 sibling, 0 replies; 38+ messages in thread
From: Jan Engelhardt @ 2006-03-16 18:58 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Rik van Riel, Anthony Liguori, Linus Torvalds,
	Linux Kernel Mailing List, Virtualization Mailing List, Xen-devel,
	Andrew Morton, Dan Hecht, Dan Arai, Anne Holler,
	Pratap Subrahmanyam, Christopher Li, Joshua LeVasseur,
	Chris Wright, Jyothy Reddy, Jack Lo, Kip Macy, Jan Beulich,
	Ky Srinivasan, Wim Coekaerts, Leendert van Doorn

>
> But most importantly, I really don't understand how it is possible to make a
> patch to the Linux kernel and not release it under GPL.
>

If the patch is so utterly trivial that there are only a few possible
solutions (one or two), then there is no use in GPL'ing that bit of patch
code; in that case I think it is (or at least should be) public domain.
In conjunction with the patched function, it will/should become GPL.


Jan Engelhardt
-- 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal
@ 2006-03-17 15:56 Chuck Ebbert
  2006-03-17 17:52   ` Zachary Amsden
  0 siblings, 1 reply; 38+ messages in thread
From: Chuck Ebbert @ 2006-03-17 15:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Zachary Amsden, Arjan van de Ven, Linus Torvalds, linux-kernel,
	Virtualization Mailing List, Xen-devel, Chris Wright

In-Reply-To: <20060315102522.GA5926@infradead.org>

On Wed, 15 Mar 2006 10:25:22 +0000, Christoph Hellwig wrote:

> I agree with Zach here, the Xen hypervisor <-> kernel interface is
> not very nice.  This proposal seems like a step forward, although it'll
> probably need to go through a few iterations.  Without an actually
> usable open-source hypervisor reference implementation it's totally
> unacceptable, though.

I'd like to see a test harness implementation that has no actual
hypervisor functionality and just implements the VMI calls natively.
This could be used to test the interface and would provide a nice
starting point for those who want to write a VMI hypervisor.
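
As a rough illustration of that idea, a pass-through layer might look like
the toy model below.  It is only a sketch: the call names (set_cr3,
write_pte and so on) and the SimulatedCPU class are hypothetical stand-ins,
not the real VMI interface.  Each call does the native operation directly
and logs it, so the interface can be exercised without a hypervisor.

```python
# Toy model only: the call names and the simulated CPU state are hypothetical,
# not the real VMI interface.
from dataclasses import dataclass, field


@dataclass
class SimulatedCPU:
    """Stand-in for the hardware state a native VMI layer would touch."""
    cr3: int = 0
    interrupts_enabled: bool = True
    page_tables: dict = field(default_factory=dict)  # (table, index) -> PTE value


class NativeVMI:
    """Implements each call by doing the native operation directly, and logs it."""

    def __init__(self, cpu):
        self.cpu = cpu
        self.log = []  # ordered record of calls, useful for interface tests

    def set_cr3(self, value):
        self.log.append(("set_cr3", value))
        self.cpu.cr3 = value

    def disable_interrupts(self):
        self.log.append(("disable_interrupts",))
        self.cpu.interrupts_enabled = False

    def enable_interrupts(self):
        self.log.append(("enable_interrupts",))
        self.cpu.interrupts_enabled = True

    def write_pte(self, table, index, value):
        self.log.append(("write_pte", table, index, value))
        self.cpu.page_tables[(table, index)] = value
```

The log is what would make such a layer useful as a starting point: a test
can replay a sequence of calls and compare the recorded order and the
resulting state against what the interface promises.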


-- 
Chuck
"Penguins don't come from next door, they come from the Antarctic!"


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC, PATCH 0/24] VMI i386 Linux virtualization interface  proposal
  2006-03-17 15:56 Chuck Ebbert
@ 2006-03-17 17:52   ` Zachary Amsden
  0 siblings, 0 replies; 38+ messages in thread
From: Zachary Amsden @ 2006-03-17 17:52 UTC (permalink / raw)
  To: Chuck Ebbert
  Cc: Christoph Hellwig, Arjan van de Ven, Linus Torvalds, linux-kernel,
	Virtualization Mailing List, Xen-devel, Chris Wright

Chuck Ebbert wrote:
> In-Reply-To: <20060315102522.GA5926@infradead.org>
>
> On Wed, 15 Mar 2006 10:25:22 +0000, Christoph Hellwig wrote:
>   
> I'd like to see a test harness implementation that has no actual
> hypervisor functionality and just implements the VMI calls natively.
> This could be used to test the interface and would provide a nice
> starting point for those who want to write a VMI hypervisor.
>   

I was going to make one yesterday.  But Fry's electronics stopped 
carrying flashable blank PCI cards. :)  Anyone know of a vendor?

It is possible to do in a software layer, although it really is a lot 
easier to have the BIOS take care of all the fuss of finding a place in 
low memory for you to live, setting up the various memory maps and 
everything else for you.

There is enormous benefit to having such a layer - you have a very powerful
test harness, not just to make sure VMI works, but even more
importantly, to inspect and verify the native kernel operation as well.
You have a plethora of important hooks into the system, which feed you
knowledge you cannot otherwise gain about which page tables have been
made active, when you take IRQs, and where the kernel stack lives.

All of this is ripe for a debug harness that can verify the kernel 
doesn't overflow the kernel stack, doesn't write to active page table 
entries without proper accessors and subsequent invalidations, and obeys 
the rules that are required for correctness when running under a 
hypervisor.  You probably even want to do hypervisor-like things - such
as write protecting the kernel page tables so that you can be confident 
there are no stray raw PTE accesses.

We actually found one such stray access (harmless on native) in i386, in
the code that enables the NX bit.
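
A minimal sketch of the page table part of such a debug harness, assuming
hypothetical hook names (hook_activate, hook_set_pte, hook_invalidate) and
a raw_write_fault callback that stands in for the fault taken when a
write-protected page table is written directly.  None of these names come
from the VMI patches; this is a toy model of the checks described above.

```python
class PageTableChecker:
    """Toy checker; names are hypothetical and nothing here is real VMI."""

    def __init__(self):
        self.active = set()      # page tables made active via the hook
        self.stale = set()       # (table, index) updated but not yet invalidated
        self.violations = []

    def hook_activate(self, table):
        self.active.add(table)

    def hook_set_pte(self, table, index):
        # Proper accessor: allowed, but an active entry must be invalidated later.
        if table in self.active:
            self.stale.add((table, index))

    def hook_invalidate(self, table, index):
        self.stale.discard((table, index))

    def raw_write_fault(self, table, index):
        # What a write-protection fault on a protected page table would report.
        if table in self.active:
            self.violations.append(("raw write", table, index))

    def check(self):
        # Any update still awaiting its invalidation is also a violation.
        missing = [("missing invalidation", t, i) for t, i in sorted(self.stale)]
        return self.violations + missing
```

Calling check() at a quiescent point returns an empty list only if every
update to an active table went through the accessor and was followed by its
invalidation.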

Zach

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal
@ 2006-03-20 18:51 Anne Holler
  0 siblings, 0 replies; 38+ messages in thread
From: Anne Holler @ 2006-03-20 18:51 UTC (permalink / raw)
  To: Zach Amsden, Linus Torvalds, Linux Kernel Mailing List,
	Virtualization Mailing List, Xen-devel, Andrew Morton
  Cc: Anne Holler

[-- Attachment #1: Type: text/plain, Size: 2998 bytes --]

>-----Original Message-----
>From: Zachary Amsden [mailto:zach@vmware.com]
>Sent: Monday, March 13, 2006 9:58 AM
>To: Linus Torvalds; Linux Kernel Mailing List; Virtualization Mailing
>List; Xen-devel; Andrew Morton; Zach Amsden; Daniel Hecht; Daniel Arai;
>Anne Holler; Pratap Subrahmanyam; Christopher Li; Joshua LeVasseur;
>Chris Wright; Rik Van Riel; Jyothy Reddy; Jack Lo; Kip Macy; Jan
>Beulich; Ky Srinivasan; Wim Coekaerts; Leendert van Doorn; Zach Amsden
>Subject: [RFC, PATCH 0/24] VMI i386 Linux virtualization interface
>proposal

>In OLS 2005, we described the work that we have been doing in VMware
>with respect to a common interface for paravirtualization of Linux. We
>shared the general vision in Rik's virtualization BoF.

>This note is an update on our further work on the Virtual Machine
>Interface, VMI.  The patches provided have been tested on 2.6.16-rc6.
>We are currently recollecting performance information for the new -rc6
>kernel, but expect our numbers to match previous results, which showed
>no impact whatsoever on macro benchmarks, and nearly negligible impact
>on microbenchmarks.

Folks,

I'm a member of the performance team at VMware & I recently did a
round of testing measuring the performance of a set of benchmarks
on the following 2 Linux variants, both running natively:
 1) 2.6.16-rc6 including VMI + 64MB hole
 2) 2.6.16-rc6 not including VMI + no 64MB hole
The intent was to measure the overhead of VMI calls on native runs.
Data was collected on both P4 & Opteron boxes.  The workloads used
were dbench/1client, netperf/receive+send, UP+SMP kernel compile,
lmbench, & some VMware in-house kernel microbenchmarks.  The CPU(s)
were pegged for all workloads except netperf, for which I include
CPU utilization measurements.

Attached please find an HTML file presenting the benchmark results
collected in terms of ratio of 1) to 2), along with the raw scores
given in brackets.  System configurations & benchmark descriptions
are given at the end of the webpage; more details are available on
request.  Also attached for reference is an HTML file giving the
width of the 95% confidence interval around the mean of the scores
reported for each benchmark, expressed as a percentage of the mean.
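
For readers who want to reproduce the two reported quantities, here is a
minimal sketch.  It assumes a normal approximation for the interval and
reads "width" as the full width (upper bound minus lower bound); the
message does not say which method was actually used, and the function
names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def ci_width_percent(scores, confidence=0.95):
    """Full width of the confidence interval around the mean, as % of the mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    half_width = z * stdev(scores) / sqrt(len(scores))
    return 100.0 * 2 * half_width / mean(scores)


def score_ratio(vmi_scores, native_scores):
    """Ratio of the mean VMI-Native score to the mean Native score."""
    return mean(vmi_scores) / mean(native_scores)
```

With only a handful of runs per benchmark, a Student's t quantile would
give a wider interval than the normal quantile used here.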

As you can see on the benchmark results webpage, the VMI-Native
& Native scores for almost all workloads match within the 95%
confidence interval.  On the P4, only 4 workloads, all lmbench
microbenchmarks (forkproc,shproc,mmap,pagefault) were outside the
interval & the overheads (2%,1%,2%,1%, respectively) are low.
The Opteron microbenchmark data was a little more ragged than
the P4 in terms of variance, but it appears that only a few
lmbench microbenchmarks (forkproc,execproc,shproc) were outside
their confidence intervals and they show low overheads (4%,3%,2%,
respectively); our in-house segv & divzero seemed to show
measurable overheads as well (8%,9%).

-Regards, Anne Holler (anne@vmware.com)

[-- Attachment #2: score.2.6.16-rc6.html --]
[-- Type: text/html, Size: 4471 bytes --]

[-- Attachment #3: confid.2.6.16-rc6.html --]
[-- Type: text/html, Size: 2325 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal
@ 2006-03-20 22:03 ` Anne Holler
  0 siblings, 0 replies; 38+ messages in thread
From: Anne Holler @ 2006-03-20 22:03 UTC (permalink / raw)
  To: Anne Holler, Zach Amsden, Linus Torvalds,
	Linux Kernel Mailing List, Virtualization Mailing List, Xen-devel,
	Andrew Morton, Zach Amsden, Daniel Hecht, Daniel Arai,
	Pratap Subrahmanyam, Christopher Li, Joshua LeVasseur,
	Chris Wright, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy,
	Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn,
	Zach Amsden

[-- Attachment #1: Type: text/plain, Size: 2593 bytes --]

[Apologies for resend: earlier email with html attachments was
 rejected.  Resending with txt attachments.]

>From: Zachary Amsden [mailto:zach@vmware.com]
>Sent: Monday, March 13, 2006 9:58 AM

>In OLS 2005, we described the work that we have been doing in VMware
>with respect to a common interface for paravirtualization of Linux. We
>shared the general vision in Rik's virtualization BoF.

>This note is an update on our further work on the Virtual Machine
>Interface, VMI.  The patches provided have been tested on 2.6.16-rc6.
>We are currently recollecting performance information for the new -rc6
>kernel, but expect our numbers to match previous results, which showed
>no impact whatsoever on macro benchmarks, and nearly negligible impact
>on microbenchmarks.

Folks,

I'm a member of the performance team at VMware & I recently did a
round of testing measuring the performance of a set of benchmarks
on the following 2 Linux variants, both running natively:
 1) 2.6.16-rc6 including VMI + 64MB hole
 2) 2.6.16-rc6 not including VMI + no 64MB hole
The intent was to measure the overhead of VMI calls on native runs.
Data was collected on both P4 & Opteron boxes.  The workloads used
were dbench/1client, netperf/receive+send, UP+SMP kernel compile,
lmbench, & some VMware in-house kernel microbenchmarks.  The CPU(s)
were pegged for all workloads except netperf, for which I include
CPU utilization measurements.

Attached please find a text file presenting the benchmark results
collected in terms of ratio of 1) to 2), along with the raw scores
given in brackets.  System configurations & benchmark descriptions
are given at the end of the page; more details are available on
request.  Also attached for reference is a text file giving the
width of the 95% confidence interval around the mean of the scores
reported for each benchmark, expressed as a percentage of the mean.

The VMI-Native & Native scores for almost all workloads match
within the 95% confidence interval.  On the P4, only 4 workloads,
all lmbench microbenchmarks (forkproc,shproc,mmap,pagefault) were
outside the interval & the overheads (2%,1%,2%,1%, respectively)
were low.  The Opteron microbenchmark data was a little more
ragged than the P4 in terms of variance, but it appears that only
a few lmbench microbenchmarks (forkproc,execproc,shproc) were
outside their confidence intervals and they show low overheads
(4%,3%,2%, respectively); our in-house segv & divzero seemed to
show measurable overheads as well (8%,9%).
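
One plausible reading of "outside the confidence interval" is that the
intervals around the two means do not overlap.  The sketch below applies
that reading to a mean score plus an interval width given as a percentage
of the mean, treating the width as the full width.  This is an assumption
and may not be the exact criterion used for the summary above; the
function names are illustrative.

```python
def interval(mean_score, width_percent):
    """Interval around the mean, treating width_percent as the full CI width."""
    half = mean_score * width_percent / 100.0 / 2.0
    return (mean_score - half, mean_score + half)


def outside_confidence(vmi_mean, vmi_width, native_mean, native_width):
    """True if the two intervals do not overlap."""
    vmi_lo, vmi_hi = interval(vmi_mean, vmi_width)
    nat_lo, nat_hi = interval(native_mean, native_width)
    return vmi_hi < nat_lo or nat_hi < vmi_lo


# P4 lmbench fork proc from the attachments: 137 vs 134, widths 0.9% and 0.5%
fork_proc_outside = outside_confidence(137, 0.9, 134, 0.5)
# P4 dbench from the attachments: 312 vs 311, widths 1.4% and 5.0%
dbench_outside = outside_confidence(312, 1.4, 311, 5.0)
```

The attached scores are rounded, so a check like this on the published
numbers is only approximate.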

-Regards, Anne Holler (anne@vmware.com)

[-- Attachment #2: score.2.6.16-rc6.txt --]
[-- Type: text/plain, Size: 3892 bytes --]

2.6.16-rc6 Transparent Paravirtualization Performance Scoreboard
Updated: 03/20/2006 * Contact: Anne Holler (anne@vmware.com)

Throughput benchmarks -> HIGHER IS BETTER -> Higher ratio is better
                     P4                  Opteron 
                     VMI-Native/Native   VMI-Native/Native   Comments
 Dbench
  1client            1.00 [312/311]      1.00 [425/425]
 Netperf
  Receive            1.00 [948/947]      1.00 [937/937]      CpuUtil:P4(VMI:43%,Ntv:42%);Opteron(VMI:36%,Ntv:34%)
  Send               1.00 [939/939]      1.00 [937/936]      CpuUtil:P4(VMI:25%,Ntv:25%);Opteron(VMI:62%,Ntv:60%)

Latency benchmarks -> LOWER IS BETTER -> Lower ratio is better
                     P4                  Opteron 
                     VMI-Native/Native   VMI-Native/Native   Comments
 Kernel compile
  UP                 1.00 [221/220]      1.00 [131/131]
  SMP/2way           1.00 [117/117]      1.00 [67/67]
 Lmbench process time latencies
  null call          1.00 [0.17/0.17]    1.00 [0.08/0.08]
  null i/o           1.00 [0.29/0.29]    0.92 [0.23/0.25]    opteron: wide confidence interval
  stat               0.99 [2.14/2.16]    0.94 [2.25/2.39]    opteron: odd, 1% outside wide confidence interval
  open clos          1.01 [3.00/2.96]    0.98 [3.16/3.24]
  slct TCP           1.00 [8.84/8.83]    0.94 [11.8/12.5]    opteron: wide confidence interval
  sig inst           0.99 [0.68/0.69]    1.09 [0.36/0.33]    opteron: best is 1.03 [0.34/0.33]
  sig hndl           0.99 [2.19/2.21]    1.05 [1.20/1.14]    opteron: best is 1.02 [1.13/1.11]
  fork proc          1.02 [137/134]      1.04 [100/96]
  exec proc          1.02 [536/525]      1.03 [309/301]
  sh proc            1.01 [3204/3169]    1.02 [1551/1528]
 Lmbench context switch time latencies
  2p/0K              1.00 [2.84/2.84]    1.14 [0.74/0.65]    opteron: wide confidence interval
  2p/16K             1.01 [2.98/2.95]    0.93 [0.74/0.80]    opteron: wide confidence interval
  2p/64K             1.02 [3.06/3.01]    1.00 [4.19/4.18]
  8p/16K             1.02 [3.31/3.26]    0.97 [1.86/1.91]
  8p/64K             1.01 [30.4/30.0]    1.00 [4.33/4.34]
  16p/16K            0.96 [7.76/8.06]    0.97 [2.03/2.10]
  16p/64K            1.00 [41.5/41.4]    1.00 [15.9/15.9]
 Lmbench system latencies
  Mmap               1.02 [6681/6542]    1.00 [3452/3441]
  Prot Fault         1.06 [0.920/0.872]  1.07 [0.197/0.184]  p4+opteron: wide confidence interval
  Page Fault         1.01 [2.065/2.050]  1.00 [1.10/1.10]
 Kernel Microbenchmarks
  getppid            1.00 [1.70/1.70]    1.00 [0.83/0.83]
  segv               0.99 [7.05/7.09]    1.08 [2.95/2.72]
  forkwaitn          1.02 [3.60/3.54]    1.05 [2.61/2.48]
  divzero            0.99 [5.68/5.73]    1.09 [2.71/2.48]

System Configurations:
 P4:      CPU: 2.4GHz; MEM: 1024MB; DISK: 10K SCSI; Server+Client NICs: Intel e1000 server adapter
 Opteron: CPU: 2.2GHz; MEM: 1024MB; DISK: 10K SCSI; Server+Client NICs: Broadcom NetXtreme BCM5704
 UP kernel used for all workloads except SMP kernel compile

Benchmark Descriptions:
 Dbench: repeat N times until the 95% confidence interval width is within 5% of the mean; report mean
  version 2.0 run as "time ./dbench -c client_plain.txt 1"
 Netperf: best of 5 runs
  MessageSize:8192+SocketSize:65536; netperf -H client-ip -l 60 -t TCP_STREAM
 Kernel compile: best of 3 runs
  Build of 2.6.11 kernel w/gcc 4.0.2 via "time make -j 16 bzImage"
 Lmbench: average of best 18 of 30 runs
  version 3.0-a4; obtained from sourceforge
 Kernel microbenchmarks: average of best 3 of 5 runs
  getppid: loop of 10 calls to getppid, repeated 1,000,000 times
  segv: signal of SIGSEGV, repeated 3,000,000 times
  forkwaitn: fork/wait for child to exit, repeated 40,000 times
  divzero: divide by 0 fault 3,000,000 times
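
The "average of best k of n runs" scoring and the getppid loop can be
sketched as follows.  This is a user-space Python approximation, not the
in-house microbenchmark: the repetition count is cut from 1,000,000 to
1,000 to keep it quick, "best" is taken to mean lowest time, and the
function names are made up for the example.

```python
import os
import time


def time_getppid(outer=1000, inner=10):
    """Seconds taken by `outer` repetitions of a loop of `inner` getppid calls."""
    start = time.perf_counter()
    for _ in range(outer):
        for _ in range(inner):
            os.getppid()
    return time.perf_counter() - start


def average_of_best(samples, keep):
    """Mean of the `keep` lowest samples (lower is better for latencies)."""
    best = sorted(samples)[:keep]
    return sum(best) / len(best)


# "average of best 3 of 5 runs", as in the kernel microbenchmark description
runs = [time_getppid() for _ in range(5)]
score = average_of_best(runs, keep=3)
```

Dropping the slowest runs reduces the effect of one-off disturbances, at
the cost of hiding genuine variance.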

[-- Attachment #3: confid.2.6.16-rc6.txt --]
[-- Type: text/plain, Size: 2123 bytes --]

2.6.16-rc6 Transparent Paravirtualization Performance Confidence Interval Widths
Updated: 03/20/2006 * Contact: Anne Holler (anne@vmware.com)
Values are 95% confidence interval width around mean given in terms of percentage of mean

                   P4                  Opteron
                   Native VMI-Native   Native VMI-Native
 Dbench2.0
  1client            5.0%  1.4%          0.8%  3.6%
 Netperf
  Receive            0.1%  0.0%          0.0%  0.0%
  Send               0.6%  1.8%          0.0%  0.0%
 Kernel compile
  UP                 3.4%  2.6%          2.2%  0.0%
  SMP/2way           2.4%  4.9%          4.3%  4.2%
 Lmbench process time latencies
  null call          0.0%  0.0%          0.0%  0.0%
  null i/o           0.0%  0.0%          5.2% 10.8%
  stat               1.0%  1.0%          1.7%  3.2%
  open clos          1.3%  0.7%          2.4%  3.0%
  slct TCP           0.3%  0.3%         19.9% 20.1%
  sig inst           0.3%  0.5%          0.0%  5.5%
  sig hndl           0.4%  0.4%          2.0%  2.0%
  fork proc          0.5%  0.9%          0.8%  1.0%
  exec proc          0.8%  0.9%          1.0%  0.7%
  sh proc            0.1%  0.2%          0.9%  0.4%
 Lmbench context switch time latencies
  2p/0K              0.8%  1.8%         16.1%  9.9%
  2p/16K             1.5%  1.8%         10.5% 10.1%
  2p/64K             2.4%  3.0%          1.8%  1.4%
  8p/16K             4.5%  4.2%          2.4%  4.2%
  8p/64K             3.0%  2.8%          1.6%  1.5%
  16p/16K            3.1%  6.7%          2.6%  3.2%
  16p/64K            0.5%  0.5%          2.9%  2.9%
 Lmbench system latencies
  Mmap               0.7%  0.3%          2.2%  2.4%
  Prot Fault         7.4%  7.5%         49.4% 38.7%
  Page Fault         0.2%  0.2%          2.4%  2.9%
 Kernel Microbenchmarks
  getppid            1.7%  2.9%          3.5%  3.5%
  segv               2.3%  0.7%          1.8%  1.9%
  forkwaitn          0.8%  0.8%          5.3%  2.2%
  divzero            0.9%  1.3%          1.2%  1.1%

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2006-03-20 22:06 UTC | newest]

Thread overview: 38+ messages
2006-03-13 17:58 [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal Zachary Amsden
2006-03-13 17:58 ` Zachary Amsden
2006-03-13 18:09 ` Arjan van de Ven
2006-03-13 18:22   ` Zachary Amsden
2006-03-13 18:26     ` Arjan van de Ven
2006-03-13 18:30       ` Zachary Amsden
2006-03-13 18:42         ` Arjan van de Ven
2006-03-13 18:48           ` Zachary Amsden
2006-03-13 19:02             ` Chris Wright
2006-03-13 18:52           ` VMI interface documentation Zachary Amsden
2006-03-13 18:56           ` [RFC, PATCH 0/24] VMI i386 Linux virtualization interface proposal Joshua LeVasseur
2006-03-16 18:52             ` Jan Engelhardt
2006-03-13 18:56         ` Hollis Blanchard
2006-03-13 18:59           ` Zachary Amsden
2006-03-13 18:59             ` Zachary Amsden
2006-03-15 10:25     ` Christoph Hellwig
2006-03-15 15:57       ` Zachary Amsden
2006-03-15 17:38       ` Joshua LeVasseur
2006-03-15 20:02         ` Andrew Morton
2006-03-16  0:05           ` Joshua LeVasseur
2006-03-13 20:17 ` Sam Vilain
2006-03-14  0:39 ` Anthony Liguori
2006-03-14  4:01   ` Zachary Amsden
2006-03-14  4:04     ` Rik van Riel
2006-03-14  4:55       ` Zachary Amsden
2006-03-14  4:13 ` Anthony Liguori
2006-03-14  4:26   ` Zachary Amsden
2006-03-14  4:30     ` Rik van Riel
2006-03-14  5:46       ` Zachary Amsden
2006-03-14 12:44         ` Rik van Riel
2006-03-14 16:22           ` Zachary Amsden
2006-03-16 18:58         ` Jan Engelhardt
  -- strict thread matches above, loose matches on Subject: below --
2006-03-17 15:56 Chuck Ebbert
2006-03-17 17:52 ` Zachary Amsden
2006-03-17 17:52   ` Zachary Amsden
2006-03-20 18:51 Anne Holler
2006-03-20 22:03 Anne Holler
2006-03-20 22:03 ` Anne Holler
