* [RFC, PATCH 1/24] i386 Vmi documentation
@ 2006-03-13 17:59 Zachary Amsden
2006-03-13 22:49 ` Chris Wright
` (2 more replies)
0 siblings, 3 replies; 26+ messages in thread
From: Zachary Amsden @ 2006-03-13 17:59 UTC (permalink / raw)
To: Linus Torvalds, Linux Kernel Mailing List,
Virtualization Mailing List, Xen-devel, Andrew Morton,
Zachary Amsden, Dan Hecht, Dan Arai, Anne Holler,
Pratap Subrahmanyam, Christopher Li, Joshua LeVasseur,
Chris Wright, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy,
Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn
Index: linux-2.6.16-rc5/Documentation/vmi_spec.txt
===================================================================
--- linux-2.6.16-rc5.orig/Documentation/vmi_spec.txt 2006-03-09 23:33:29.000000000 -0800
+++ linux-2.6.16-rc5/Documentation/vmi_spec.txt 2006-03-10 12:55:29.000000000 -0800
@@ -0,0 +1,2197 @@
+
+ Paravirtualization API Version 2.0
+
+ Zachary Amsden, Daniel Arai, Daniel Hecht, Pratap Subrahmanyam
+ Copyright (C) 2005, 2006, VMware, Inc.
+ All rights reserved
+
+Revision history:
+ 1.0: Initial version
+ 1.1: arai 2005-11-15
+ Added SMP-related sections: AP startup and Local APIC support
+ 1.2: dhecht 2006-02-23
+ Added Time Interface section and Time related VMI calls
+
+Contents
+
+1) Motivations
+2) Overview
+ Initialization
+ Privilege model
+ Memory management
+ Segmentation
+ Interrupt and I/O subsystem
+ IDT management
+ Transparent Paravirtualization
+ 3rd Party Extensions
+ AP Startup
+ State Synchronization in SMP systems
+ Local APIC Support
+ Time Interface
+3) Architectural Differences from Native Hardware
+4) ROM Implementation
+ Detection
+ Data layout
+ Call convention
+ PCI implementation
+
+Appendix A - VMI ROM low level ABI
+Appendix B - VMI C prototypes
+Appendix C - Sensitive x86 instructions
+
+
+1) Motivations
+
+ There are several high level goals which must be balanced in designing
+ an API for paravirtualization. The most general concerns are:
+
+ Portability - it should be easy to port a guest OS to use the API
+ High performance - the API must not obstruct a high performance
+ hypervisor implementation
+ Maintainability - it should be easy to maintain and upgrade the guest
+ OS
+ Extensibility - it should be possible for future expansion of the
+ API
+
+ Portability.
+
+ The general approach to paravirtualization rather than full
+ virtualization is to modify the guest operating system. This means
+ there is implicitly some code cost to port a guest OS to run in a
+ paravirtual environment. The closer the API resembles a native
+ platform which the OS supports, the lower the cost of porting.
+ Rather than provide an alternative, high level interface for this
+ API, the approach is to provide a low level interface which
+ encapsulates the sensitive and performance critical parts of the
+ system. Thus, we have direct parallels to most privileged
+ instructions, and the process of converting a guest OS to use these
+ instructions is in many cases a simple replacement of one function
+ for another. Although this is sufficient for CPU virtualization,
+ performance concerns have forced us to add additional calls for
+ memory management, and notifications about updates to certain CPU
+ data structures. Support for this in the Linux operating system has
+ proved to be very minimal in cost because of the already somewhat
+ portable and modular design of the memory management layer.
+
+ High Performance.
+
+ Providing a low level API that closely resembles hardware does not
+ provide any support for compound operations; indeed, typical
+ compound operations on hardware can be updating of many page table
+ entries, flushing system TLBs, or providing floating point safety.
+ Since these operations may require several privileged or sensitive
+ operations, it becomes important to defer some of these operations
+ until explicit flushes are issued, or to provide higher level
+ operations around some of these functions. In order to keep with
+ the goal of portability, this has been done only when deemed
+ necessary for performance reasons, and we have tried to package
+ these compound operations into methods that are typically used in
+ guest operating systems. In the future, we envision that additional
+ higher level abstractions will be added as an adjunct to the
+ low-level API. These higher level abstractions will target large
+ bulk operations such as creation, and destruction of address spaces,
+ context switches, thread creation and control.
+
+ Maintainability.
+
+ In the course of development with a virtualized environment, it is
+ not uncommon for support of new features or higher performance to
+ require radical changes to the operation of the system. If these
+ changes are visible to the guest OS in a paravirtualized system,
+ this will require updates to the guest kernel, which presents a
+ maintenance problem. In the Linux world, the rapid pace of
+ development on the kernel means new kernel versions are produced
+ every few months. This rapid pace is not always appropriate for end
+ users, so it is not uncommon to have dozens of different versions of
+ the Linux kernel in use that must be actively supported. To keep
+ this many versions in sync with potentially radical changes in the
+ paravirtualized system is not a scalable solution. To reduce the
+ maintenance burden as much as possible, while still allowing the
+ implementation to accommodate changes, the design provides a stable
+ ABI with semantic invariants. The underlying implementation of the
+ ABI and details of what data or how it communicates with the
+ hypervisor are not visible to the guest OS. As a result, in most
+ cases, the guest OS need not even be recompiled to work with a newer
+ hypervisor. This allows performance optimizations, bug fixes,
+ debugging, or statistical instrumentation to be added to the API
+ implementation without any impact on the guest kernel. This is
+ achieved by publishing a block of code from the hypervisor in the
+ form of a ROM. The guest OS makes calls into this ROM to perform
+ privileged or sensitive actions in the system.
+
+ Extensibility.
+
+ In order to provide a vehicle for new features, new device support,
+ and general evolution, the API uses feature compartmentalization
+ with controlled versioning. The API is split into sections, with
+ each section having independent versions. Each section has a top
+ level version which is incremented for each major revision, with a
+ minor version indicating incremental level. Version compatibility
+ is based on matching the major version field, and changes of the
+ major version are assumed to break compatibility. This allows
+ accurate matching of compatibility. In the event of incompatible
+ API changes, multiple APIs may be advertised by the hypervisor if it
+ wishes to support older versions of guest kernels. This provides
+ the most general forward / backward compatibility possible.
+ Currently, the API has a core section for CPU / MMU virtualization
+ support, with additional sections provided for each supported device
+ class.
+
+2) Overview
+
+ Initialization.
+
+ Initialization is done with a bootstrap loader that creates
+ the "start of day" state. This is a known state, running 32-bit
+ protected mode code with paging enabled. The guest has all the
+ standard structures in memory that are provided by a native ROM
+ boot environment, including a memory map and ACPI tables. For
+ the native hardware, this bootstrap loader can be run before
+ the kernel code proper, and this environment can be created
+ readily from within the hypervisor for the virtual case. At
+ some point, the bootstrap loader or the kernel itself invokes
+ the initialization call to enter paravirtualized mode.
+
+ Privilege Model.
+
+ The guest kernel must be modified to run at a dynamic privilege
+ level, since if entry to paravirtual mode is successful, the kernel
+ is no longer allowed to run at the highest hardware privilege level.
+ On the IA-32 architecture, this means the kernel will be running at
+ CPL 1-2, and with the hypervisor running at CPL0, and user code at
+ CPL3. The IOPL will be lowered as well to avoid giving the guest
+ direct access to hardware ports and control of the interrupt flag.
+
+ This change causes certain IA-32 instructions to become "sensitive",
+ so additional support for clearing and setting the hardware
+ interrupt flag are present. Since the switch into paravirtual mode
+ may happen dynamically, the guest OS must not rely on testing for a
+ specific privilege level by checking the RPL field of segment
+ selectors, but should check for privileged execution by performing
+ an (RPL != 3 && !EFLAGS_VM) comparison. This means the DPL of kernel
+ ring descriptors in the GDT or LDT may be raised to match the CPL of
+ the kernel. This change is visible by inspecting the segments
+ registers while running in privileged code, and by using the LAR
+ instruction.
+
+ The system also cannot be allowed to write directly to the hardware
+ GDT, LDT, IDT, or TSS, so these data structures are maintained by the
+ hypervisor, and may be shadowed or guest visible structures. These
+ structures are required to be page aligned to support non-shadowed
+ operation.
+
+ Currently, the system only provides for two guest security domains,
+ kernel (which runs at the equivalent of virtual CPL-0), and user
+ (which runs at the equivalent of virtual CPL-3, with no hardware
+ access). Typically, this is not a problem, but if a guest OS relies
+ on using multiple hardware rings for privilege isolation, this
+ interface would need to be expanded to support that.
+
+ Memory Management.
+
+ Since a virtual machine typically does not have access to all the
+ physical memory on the machine, there is a need to redefine the
+ physical address space layout for the virtual machine. The
+ spectrum of possibilities ranges from presenting the guest with
+ a view of a physically contiguous memory of a boot-time determined
+ size, exactly what the guest would see when running on hardware, to
+ the opposite, which presents the guest with the actual machine pages
+ which the hypervisor has allocated for it. Using this approach
+ requires the guest to obtain information about the pages it has
+ from the hypervisor; this can be done by using the memory map which
+ would normally be passed to the guest by the BIOS.
+
+ The interface is designed to support either mode of operation.
+ This allows the implementation to use either direct page tables
+ or shadow page tables, or some combination of both. All writes to
+ page table entries are done through calls to the hypervisor
+ interface layer. The guest notifies the hypervisor about page
+ tables updates, flushes, and invalidations through API calls.
+
+ The guest OS is also responsible for notifying the hypervisor about
+ which pages in its physical memory are going to be used to hold page
+ tables or page directories. Both PAE and non-PAE paging modes are
+ supported. When the guest is finished using pages as page tables, it
+ should release them promptly to allow the hypervisor to free the
+ page table shadows. Using a page as both a page table and a page
+ directory for linear page table access is possible, but currently
+ not supported by our implementation.
+
+ The hypervisor lives concurrently in the same address space as the
+ guest operating system. Although this is not strictly necessary on
+ IA-32 hardware, performance would be severely degraded if that were
+ not the case. The hypervisor must therefore reserve some portion of
+ linear address space for its own use. The implementation currently
+ reserves the top 64 megabytes of linear space for the hypervisor.
+ This requires the guest to relocate any data in high linear space
+ down by 64 megabytes. For non-paging mode guests, this means the
+ high 64 megabytes of physical memory should be reserved. Because
+ page tables are not sensitive to CPL, only to user/supervisor level,
+ the hypervisor must combine segment protection to ensure that the
+ guest can not access this 64 megabyte region.
+
+ An experimental patch is available to enable boot-time sizing of
+ the hypervisor hole.
+
+ Segmentation.
+
+ The IA-32 architecture provides segmented virtual memory, which can
+ be used as another form of privilege separation. Each segment
+ contains a base, limit, and properties. The base is added to the
+ virtual address to form a linear address. The limit determines the
+ length of linear space which is addressable through the segment.
+ The properties determine read/write, code and data size of the
+ region, as well as the direction in which segments grow. Segments
+ are loaded from descriptors in one of two system tables, the GDT or
+ the LDT, and the values loaded are cached until the next load of the
+ segment. This property, known as segment caching, allows the
+ machine to be put into a non-reversible state by writing over the
+ descriptor table entry from which a segment was loaded. There is no
+ efficient way to extract the base field of the segment after it is
+ loaded, as it is hidden by the processor. In a hypervisor
+ environment, the guest OS can be interrupted at any point in time by
+ interrupts and NMIs which must be serviced by the hypervisor. The
+ hypervisor must be able to recreate the original guest state when it
+ is done servicing the external event.
+
+ To avoid creating non-reversible segments, the hypervisor will
+ forcibly reload any live segment registers that are updated by
+ writes to the descriptor tables. *N.B - in the event that a segment
+ is put into an invalid or not present state by an update to the
+ descriptor table, the segment register must be forced to NULL so
+ that reloading it will not cause a general protection fault (#GP)
+ when restoring the guest state. This may require the guest to save
+ the segment register value before issuing a hypervisor API call
+ which will update the descriptor table.*
+
+ Because the hypervisor must protect its own memory space from
+ privileged code running in the guest at CPL1-2, descriptors may not
+ provide access to the 64 megabyte region of high linear space. To
+ achieve this, the hypervisor will truncate descriptors in the
+ descriptor tables. This means that attempts by the guest to access
+ through negative offsets to the segment base will fault, so this is
+ highly discouraged (some TLS implementations on Linux do this).
+ In addition, this causes the truncated length of the segment to
+ become visible to the guest through the LSL instruction.
+
+ Interrupt and I/O Subsystem.
+
+ For security reasons, the guest operating system is not given
+ control over the hardware interrupt flag. We provide a virtual
+ interrupt flag that is under guest control. The virtual operating
+ system always runs with hardware interrupts enabled, but hardware
+ interrupts are transparent to the guest. The API provides calls for
+ all instructions which modify the interrupt flag.
+
+ The paravirtualization environment provides a legacy programmable
+ interrupt controller (PIC) to the virtual machine. Future releases
+ will provide a virtual interrupt controller (VIC) that provides
+ more advanced features.
+
+ In addition to a virtual interrupt flag, there is also a virtual
+ IOPL field which the guest can use to enable access to port I/O
+ from userspace for privileged applications.
+
+ Generic PCI based device probing is available to detect virtual
+ devices. The use of PCI is pragmatic, since it allows a vendor
+ ID, class ID, and device ID to identify the appropriate driver
+ for each virtual device.
+
+ IDT Management.
+
+ The paravirtual operating environment provides the traditional x86
+ interrupt descriptor table for handling external interrupts,
+ software interrupts, and exceptions. The interrupt descriptor table
+ provides the destination code selector and EIP for interruptions.
+ The current task state structure (TSS) provides the new stack
+ address to use for interruptions that result in a privilege level
+ change. The guest OS is responsible for notifying the hypervisor
+ when it updates the stack address in the TSS.
+
+ Two types of indirect control flow are of critical importance to the
+ performance of an operating system. These are system calls and page
+ faults. The guest is also responsible for calling out to the
+ hypervisor when it updates gates in the IDT. Making IDT and TSS
+ updates known to the hypervisor in this fashion allows efficient
+ delivery through these performance critical gates.
+
+ Transparent Paravirtualization.
+
+ The guest operating system may provide an alternative implementation
+ of the VMI option rom compiled in. This implementation should
+ provide implementations of the VMI calls that are suitable for
+ running on native x86 hardware. This code may be used by the guest
+ operating system while it is being loaded, and may also be used if
+ the operating system is loaded on hardware that does not support
+ paravirtualization.
+
+ When the guest detects that the VMI option rom is available, it
+ replaces the compiled-in version of the rom with the rom provided by
+ the platform. This can be accomplished by copying the rom contents,
+ or by remapping the virtual address containing the compiled-in rom
+ to point to the platform's ROM. When booting on a platform that
+ does not provide a VMI rom, the operating system can continue to use
+ the compiled-in version to run in a non-paravirtualized fashion.
+
+ 3rd Party Extensions.
+
+ If desired, it should be possible for 3rd party virtual machine
+ monitors to implement a paravirtualization environment that can run
+ guests written to this specification.
+
+ The general mechanism for providing customized features and
+ capabilities is to provide notification of these feature through
+ the CPUID call, and allowing configuration of CPU features
+ through RDMSR / WRMSR instructions. This allows a hypervisor vendor
+ ID to be published, and the kernel may enable or disable specific
+ features based on this id. This has the advantage of following
+ closely the boot time logic of many operating systems that enables
+ certain performance enhancements or bugfixes based on processor
+ revision, using exactly the same mechanism.
+
+ An exact formal specification of the new CPUID functions and which
+ functions are vendor specific is still needed.
+
+ AP Startup.
+
+ Application Processor startup in paravirtual SMP systems works a bit
+ differently than in a traditional x86 system.
+
+ APs will launch directly in paravirtual mode with initial state
+ provided by the BSP. Rather than the traditional init/startup
+ IPI sequence, the BSP must issue the init IPI, a set application
+ processor state hypercall, followed by the startup IPI.
+
+ The initial state contains the AP's control registers, general
+ purpose registers and segment registers, as well as the IDTR,
+ GDTR, LDTR and EFER. Any processor state not included in the initial
+ AP state (including x87 FPRs, SSE register states, and MSRs other than
+ EFER), are left in the poweron state.
+
+ The BSP must construct the initial GDT used by each AP. The segment
+ register hidden state will be loaded from the GDT specified in the
+ initial AP state. The IDT and (if used) LDT may either be constructed by
+ the BSP or by the AP.
+
+ Similarly, the initial page tables used by each AP must also be
+ constructed by the BSP.
+
+ If an AP's initial state is invalid, or no initial state is provided
+ before a start IPI is received by that AP, then the AP will fail to start.
+ It is therefore advisable to have a timeout for waiting for AP's to start,
+ as is recommended for traditional x86 systems.
+
+ See VMI_SetInitialAPState in Appendix A for a description of the
+ VMI_SetInitialAPState hypercall and the associated APState data structure.
+
+ State Synchronization In SMP Systems.
+
+ Some in-memory data structures that may require no special synchronization
+ on a traditional x86 systems need special handling when run on a
+ hypervisor. Two of particular note are the descriptor tables and page
+ tables.
+
+ Each processor in an SMP system should have its own GDT and LDT. Changes
+ to each processor's descriptor tables must be made on that processor
+ via the appropriate VMI calls. There is no VMI interface for updating
+ another CPU's descriptor tables (aside from VMI_SetInitialAPState),
+ and the result of memory writes to other processors' descriptor tables
+ are undefined.
+
+ Page tables have slightly different semantics than in a traditional x86
+ system. As in traditional x86 systems, page table writes may not be
+ respected by the current CPU until a TLB flush or invlpg is issued.
+ In a paravirtual system, the hypervisor implementation is free to
+ provide either shared or private caches of the guest's page tables.
+ Page table updates must therefore be propagated to the other CPUs
+ before they are guaranteed to be noticed.
+
+ In particular, when doing TLB shootdown, the initiating processor
+ must ensure that all deferred page table updates are flushed to the
+ hypervisor, to ensure that the receiving processor has the most up-to-date
+ mapping when it performs its invlpg.
+
+ Local APIC Support.
+
+ A traditional x86 local APIC is provided by the hypervisor. The local
+ APIC is enabled and its address is set via the IA32_APIC_BASE MSR, as
+ usual. APIC registers may be read and written via ordinary memory
+ operations.
+
+ For performance reasons, higher performance APIC read and write interfaces
+ are provided. If possible, these interfaces should be used to access
+ the local APIC.
+
+ The IO-APIC is not included in this spec, as it is typically not
+ performance critical, and used mainly for initial wiring of IRQ pins.
+ Currently, we implement a fully functional IO-APIC with all the
+ capabilities of real hardware. This may seem like an unnecessary burden,
+ but if the goal is transparent paravirtualization, the kernel must
+ provide fallback support for an IO-APIC anyway. In addition, the
+ hypervisor must support an IO-APIC for SMP non-paravirtualized guests.
+ The net result is less code on both sides, and an already well defined
+ interface between the two. This avoids the complexity burden of having
+ to support two different interfaces to achieve the same task.
+
+ One shortcut we have found most helpful is to simply disable NMI delivery
+ to the paravirtualized kernel. There is no reason NMIs can't be
+ supported, but typical uses for them are not as productive in a
+ virtualized environment. Watchdog NMIs are of limited use if the OS is
+ already correct and running on stable hardware; profiling NMIs are
+ similarly of less use, since this task is accomplished with more accuracy
+ in the VMM itself; and NMIs for machine check errors should be handled
+ outside of the VM. The addition of NMI support does create additional
+ complexity for the trap handling code in the VM, and although the task is
+ surmountable, the value proposition is debatable. Here, again, feedback
+ is desired.
+
+ Time Interface.
+
+ In a virtualized environment, virtual machines (VM) will time share
+ the system with each other and with other processes running on the
+ host system. Therefore, a VM's virtual CPUs (VCPUs) will be
+ executing on the host's physical CPUs (PCPUs) for only some portion
+ of time. This section of the VMI exposes a paravirtual view of
+ time to the guest operating systems so that they may operate more
+ effectively in a virtual environment. The interface also provides
+ a way for the VCPUs to set alarms in this paravirtual view of time.
+
+ Time Domains:
+
+ a) Wallclock Time:
+
+ Wallclock time exposed to the VM through this interface indicates
+ the number of nanoseconds since epoch, 1970-01-01T00:00:00Z (ISO
+ 8601 date format). If the host's wallclock time changes (say, when
+ an error in the host's clock is corrected), so does the wallclock
+ time as viewed through this interface.
+
+ b) Real Time:
+
+ Another view of time accessible through this interface is real
+ time. Real time always progresses except for when the VM is
+ stopped or suspended. Real time is presented to the guest as a
+ counter which increments at a constant rate defined (and presented)
+ by the hypervisor. All the VCPUs of a VM share the same real time
+ counter.
+
+ The unit of the counter is called "cycles". The unit and initial
+ value (corresponding to the time the VM enters para-virtual mode)
+ are chosen by the hypervisor so that the real time counter will not
+ rollover in any practical length of time. It is expected that the
+ frequency (cycles per second) is chosen such that this clock
+ provides a "high-resolution" view of time. The unit can only
+ change when the VM (re)enters paravirtual mode.
+
+ c) Stolen time and Available time:
+
+ A VCPU is always in one of three states: running, halted, or ready.
+ The VCPU is in the 'running' state if it is executing. When the
+ VCPU executes the HLT interface, the VCPU enters the 'halted' state
+ and remains halted until there is some work pending for the VCPU
+ (e.g. an alarm expires, host I/O completes on behalf of virtual
+ I/O). At this point, the VCPU enters the 'ready' state (waiting
+ for the hypervisor to reschedule it). Finally, at any time when
+ the VCPU is not in the 'running' state nor the 'halted' state, it
+ is in the 'ready' state.
+
+ For example, consider the following sequence of events, with times
+ given in real time:
+
+ (Example 1)
+
+ At 0 ms, VCPU executing guest code.
+ At 1 ms, VCPU requests virtual I/O.
+ At 2 ms, Host performs I/O for virtual I/0.
+ At 3 ms, VCPU executes VMI_Halt.
+ At 4 ms, Host completes I/O for virtual I/O request.
+ At 5 ms, VCPU begins executing guest code, vectoring to the interrupt
+ handler for the device initiating the virtual I/O.
+ At 6 ms, VCPU preempted by hypervisor.
+ At 9 ms, VCPU begins executing guest code.
+
+ From 0 ms to 3 ms, VCPU is in the 'running' state. At 3 ms, VCPU
+ enters the 'halted' state and remains in this state until the 4 ms
+ mark. From 4 ms to 5 ms, the VCPU is in the 'ready' state. At 5
+ ms, the VCPU re-enters the 'running' state until it is preempted by
+ the hypervisor at the 6 ms mark. From 6 ms to 9 ms, VCPU is again
+ in the 'ready' state, and finally 'running' again after 9 ms.
+
+ Stolen time is defined per VCPU to progress at the rate of real
+ time when the VCPU is in the 'ready' state, and does not progress
+ otherwise. Available time is defined per VCPU to progress at the
+ rate of real time when the VCPU is in the 'running' and 'halted'
+ states, and does not progress when the VCPU is in the 'ready'
+ state.
+
+ So, for the above example, the following table indicates these time
+ values for the VCPU at each ms boundary:
+
+ Real time Stolen time Available time
+ 0 0 0
+ 1 0 1
+ 2 0 2
+ 3 0 3
+ 4 0 4
+ 5 1 4
+ 6 1 5
+ 7 2 5
+ 8 3 5
+ 9 4 5
+ 10 4 6
+
+ Notice that at any point:
+ real_time == stolen_time + available_time
+
+ Stolen time and available time are also presented as counters in
+ "cycles" units. The initial value of the stolen time counter is 0.
+ This implies the initial value of the available time counter is the
+ same as the real time counter.
+
+ Alarms:
+
+ Alarms can be set (armed) against the real time counter or the
+ available time counter. Alarms can be programmed to expire once
+ (one-shot) or on a regular period (periodic). They are armed by
+ indicating an absolute counter value expiry, and in the case of a
+ periodic alarm, a non-zero relative period counter value. [TBD:
+ The method of wiring the alarms to an interrupt vector is dependent
+ upon the virtual interrupt controller portion of the interface.
+ Currently, the alarms may be wired as if they are attached to IRQ0
+ or the vector in the local APIC LVTT. This way, the alarms can be
+ used as drop in replacements for the PIT or local APIC timer.]
+
+ Alarms are per-vcpu mechanisms. An alarm set by vcpu0 will fire
+ only on vcpu0, while an alarm set by vcpu1 will only fire on vcpu1.
+ If an alarm is set relative to available time, its expiry is a
+ value relative to the available time counter of the vcpu that set
+ it.
+
+ The interface includes a method to cancel (disarm) an alarm. On
+ each vcpu, one alarm can be set against each of the two counters
+ (real time and available time). A vcpu in the 'halted' state
+ becomes 'ready' when any of its alarm's counters reaches the
+ expiry.
+
+ An alarm "fires" by signaling the virtual interrupt controller. An
+ alarm will fire as soon as possible after the counter value is
+ greater than or equal to the alarm's current expiry. However, an
+ alarm can fire only when its vcpu is in the 'running' state.
+
+ If the alarm is periodic, a sequence of expiry values,
+
+ E(i) = e0 + p * i , i = 0, 1, 2, 3, ...
+
+ where 'e0' is the expiry specified when setting the alarm and 'p'
+ is the period of the alarm, is used to arm the alarm. Initially,
+ E(0) is used as the expiry. When the alarm fires, the next expiry
+ value in the sequence that is greater than the current value of the
+ counter is used as the alarm's new expiry.
+
+ One-shot alarms have only one expiry. When a one-shot alarm fires,
+ it is automatically disarmed.
+
+ Suppose an alarm is set relative to real time with expiry at the 3
+ ms mark and a period of 2 ms. It will expire on these real time
+ marks: 3, 5, 7, 9. Note that even if the alarm does not fire
+ during the 5 ms to 7 ms interval, the alarm can fire at most once
+ during the 7 ms to 9 ms interval (unless, of course, it is
+ reprogrammed).
+
+ If an alarm is set relative to available time with expiry at the 1
+ ms mark (in available time) and with a period of 2 ms, then it will
+ expire on these available time marks: 1, 3, 5. In the scenario
+ described in example 1, those available time values correspond to
+ these values in real time: 1, 3, 6.
+
+3) Architectural Differences from Native Hardware.
+
+ For the sake of performance, some requirements are imposed on kernel
+ fault handlers which are not present on real hardware. Most modern
+ operating systems should have no trouble meeting these requirements.
+ Failure to meet these requirements may prevent the kernel from
+ working properly.
+
+ 1) The hardware flags on entry to a fault handler may not match
+ the EFLAGS image on the fault handler stack. The stack image
+ is correct, and will have the correct state of the interrupt
+ and arithmetic flags.
+
+ 2) The stack used for kernel traps must be flat - that is, zero base,
+ segment limit determined by the hypervisor.
+
+ 3) On entry to any fault handler, the stack must have sufficient space
+ to hold 32 bytes of data, or the guest may be terminated.
+
+ 4) When calling VMI functions, the kernel must be running on a
+ flat 32-bit stack and code segment.
+
+ 5) Most VMI functions require flat data and extra segment (DS and ES)
+ segments as well; notable exceptions are IRET and SYSEXIT.
+ XXXPara - may need to add STI and CLI to this list.
+
+ 6) Interrupts must always be enabled when running code in userspace.
+
+ 7) IOPL semantics for userspace are changed; although userspace may be
+ granted port access, it can not affect the interrupt flag.
+
+ 8) The EIPs at which faults may occur in VMI calls may not match the
+ original native instruction EIP; this is a bug in the system
+ today, as many guests do rely on lazy fault handling.
+
+ 9) On entry to V8086 mode, MSR_SYSENTER_CS is cleared to zero.
+
+ 10) Todo - we would like to support these features, but they are not
+ fully tested and / or implemented:
+
+ Userspace 16-bit stack support
+ Proper handling of faulting IRETs
+
+4) ROM Implementation
+
+ Modularization
+
+ Originally, we envisioned modularizing the ROM API into several
+ subsections, but the close coupling between the initial layers
+ and the requirement to support native PCI bus devices has made
+ ROM components for network or block devices unnecessary to this
+ point in time.
+
+ VMI - the virtual machine interface. This is the core CPU, I/O
+ and MMU virtualization layer. I/O is currently limited
+ to port access to emulated devices.
+
+ Detection
+
+ The presence of hypervisor ROMs can be recognized by scanning the
+ upper region of the first megabyte of physical memory. Multiple
+ ROMs may be provided to support older API versions for legacy guest
+ OS support. ROM detection is done in the traditional manner, by
+ scanning the memory region from C8000h - DFFFFh in 2 kilobyte
+ increments. The romSignature bytes must be '0x55, 0xAA', and the
+ checksum of the region indicated by the romLength field must be zero.
+ The checksum is a simple 8-bit addition of all bytes in the ROM region.
+
+ Data layout
+
+ typedef struct HyperRomHeader {
+ uint16_t romSignature;
+ int8_t romLength;
+ unsigned char romEntry[4];
+ uint8_t romPad0;
+ uint32_t hyperSignature;
+ uint8_t APIVersionMinor;
+ uint8_t APIVersionMajor;
+ uint8_t reserved0;
+ uint8_t reserved1;
+ uint32_t reserved2;
+ uint32_t reserved3;
+ uint16_t pciHeaderOffset;
+ uint16_t pnpHeaderOffset;
+ uint32_t romPad3;
+ char reserved[32];
+ char elfHeader[64];
+ } HyperRomHeader;
+
+ The first set of fields is defined by the BIOS:
+
+ romSignature - fixed 0xAA55, BIOS ROM signature
+ romLength - the length of the ROM, in 512 byte chunks.
+ Determines the area to be checksummed.
+ romEntry - 16-bit initialization code stub used by BIOS.
+ romPad0 - reserved
+
+ The next set of fields is defined by this API:
+
+ hyperSignature - a 4 byte signature providing recognition of the
+ device class represented by this ROM. Each
+ device class defines its own unique signature.
+ APIVersionMinor - the revision level of this device class' API.
+ This indicates incremental changes to the API.
+ APIVersionMajor - the major version. Used to indicates large
+ revisions or additions to the API which break
+ compatibility with the previous version.
+ reserved0,1,2,3 - for future expansion
+
+ The next set of fields is defined by the PCI / PnP BIOS spec:
+
+ pciHeaderOffset - relative offset to the PCI device header from
+ the start of this ROM.
+ pnpHeaderOffset - relative offset to the PnP boot header from the
+ start of this ROM.
+ romPad3 - reserved by PCI spec.
+
+ Finally, there is space for future header fields, and an area
+ reserved for an ELF header to point to symbol information.
+
+Appendix A - VMI ROM Low Level ABI
+
+ OS writers intending to port their OS to the paravirtualizable x86
+ processor being modeled by this hypervisor need to access the
+ hypervisor through the VMI layer. It is possible although it is
+ currently unimplemented to add or replace the functionality of
+ individual hypervisor calls by providing your own ROM images. This is
+ intended to allow third party customizations.
+
+ VMI compatible ROMs user the signature "cVmi" in the hyperSignature
+ field of the ROM header.
+
+ Many of these calls are compatible with the SVR4 C call ABI, using up
+ to three register arguments. Some calls are not, due to restrictions
+ of the native instruction set. Calls which diverge from this ABI are
+ noted. In GNU terms, this means most of the calls are compatible with
+ regparm(3) argument passing.
+
+ Most of these calls behave as standard C functions, and as such, may
+ clobber registers EAX, EDX, ECX, flags. Memory clobbers are noted
+ explicitly, since many of them may be inlined without a memory clobber.
+
+ Most of these calls require well defined segment conventions - that is,
+ flat full size 32-bit segments for all the general segments, CS, SS, DS,
+ ES. Exceptions in some cases are noted.
+
+ The net result of these choices is that most of the calls are very
+ easy to make from C-code, and calls that are likely to be required in
+ low level trap handling code are easy to call from assembler. Most
+ of these calls are also very easily implemented by the hypervisor
+ vendor in C code, and only the performance critical calls from
+ assembler paths require custom assembly implementations.
+
+ CORE INTERFACE CALLS
+
+ This set of calls provides the base functionality to establish running
+ the kernel in VMI mode.
+
+ The interface will be expanded to include feature negotiation, more
+ explicit control over call bundling and flushing, and hypervisor
+ notifications to allow inline code patching.
+
+ VMI_Init
+
+ VMICALL void VMI_Init(void);
+
+ Initializes the hypervisor environment. Returns zero on success,
+ or -1 if the hypervisor could not be initialized. Note that this
+ is a recoverable error if the guest provides the requisite native
+ code to support transparent paravirtualization.
+
+ Inputs: None
+ Outputs: EAX = result
+ Clobbers: Standard
+ Segments: Standard
+
+
+ PROCESSOR STATE CALLS
+
+ This set of calls controls the online status of the processor. It
+ include interrupt control, reboot, halt, and shutdown functionality.
+ Future expansions may include deep sleep and hotplug CPU capabilities.
+
+ VMI_DisableInterrupts
+
+ VMICALL void VMI_DisableInterrupts(void);
+
+ Disable maskable interrupts on the processor.
+
+ Inputs: None
+ Outputs: None
+ Clobbers: Flags only
+ Segments: As this is both performance critical and likely to
+ be called from low level interrupt code, this call does not
+ require flat DS/ES segments, but uses the stack segment for
+ data access. Therefore only CS/SS must be well defined.
+
+ VMI_EnableInterrupts
+
+ VMICALL void VMI_EnableInterrupts(void);
+
+ Enable maskable interrupts on the processor. Note that the
+ current implementation always will deliver any pending interrupts
+ on a call which enables interrupts, for compatibility with kernel
+ code which expects this behavior. Whether this should be required
+ is open for debate.
+
+ Inputs: None
+ Outputs: None
+ Clobbers: Flags only
+ Segments: CS/SS only
+
+ VMI_GetInterruptMask
+
+ VMICALL VMI_UINT VMI_GetInterruptMask(void);
+
+ Returns the current interrupt state mask of the processor. The
+ mask is defined to be 0x200 (matching processor flag IF) to indicate
+ interrupts are enabled.
+
+ Inputs: None
+ Outputs: EAX = mask
+ Clobbers: Flags only
+ Segments: CS/SS only
+
+ VMI_SetInterruptMask
+
+ VMICALL void VMI_SetInterruptMask(VMI_UINT mask);
+
+ Set the current interrupt state mask of the processor. Also
+ delivers any pending interrupts if the mask is set to allow
+ them.
+
+ Inputs: EAX = mask
+ Outputs: None
+ Clobbers: Flags only
+ Segments: CS/SS only
+
+ VMI_DeliverInterrupts (For future debate)
+
+ Enable and deliver any pending interrupts. This would remove
+ the implicit delivery semantic from the SetInterruptMask and
+ EnableInterrupts calls.
+
+ VMI_Pause
+
+ VMICALL void VMI_Pause(void);
+
+ Pause the processor temporarily, to allow a hypertwin or remote
+ CPU to continue operation without lock or cache contention.
+
+ Inputs: None
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_Halt
+
+ VMICALL void VMI_Halt(void);
+
+ Put the processor into interruptible halt mode. This is defined
+ to be a non-running mode where maskable interrupts are enabled,
+ not a deep low power sleep mode.
+
+ Inputs: None
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_Shutdown
+
+ VMICALL void VMI_Shutdown(void);
+
+ Put the processor into non-interruptible halt mode. This is defined
+ to be a non-running mode where maskable interrupts are disabled,
+ indicates a power-off event for this CPU.
+
+ Inputs: None
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_Reboot:
+
+ VMICALL void VMI_Reboot(VMI_INT how);
+
+ Reboot the virtual machine, using a hard or soft reboot. A soft
+ reboot corresponds to the effects of an INIT IPI, and preserves
+ some APIC and CR state. A hard reboot corresponds to a hardware
+ reset.
+
+ Inputs: EAX = reboot mode
+ #define VMI_REBOOT_SOFT 0x0
+ #define VMI_REBOOT_HARD 0x1
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_SetInitialAPState:
+
+ void VMI_SetInitialAPState(APState *apState, VMI_UINT32 apicID);
+
+ Sets the initial state of the application processor with local APIC ID
+ "apicID" to the state in apState. apState must be the page-aligned
+ linear address of the APState structure describing the initial state of
+ the specified application processor.
+
+ Control register CR0 must have both PE and PG set; the result of
+ either of these bits being cleared is undefined. It is recommended
+ that for best performance, all processors in the system have the same
+ setting of the CR4 PAE bit. LME and LMA in EFER are both currently
+ unsupported. The result of setting either of these bits is undefined.
+
+ Inputs: EAX = pointer to APState structure for new co-processor
+ EDX = APIC ID of processor to initialize
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+
+ DESCRIPTOR RELATED CALLS
+
+ VMI_SetGDT
+
+ VMICALL void VMI_SetGDT(VMI_DTR *gdtr);
+
+ Load the global descriptor table limit and base registers. In
+ addition to the straightforward load of the hardware registers, this
+ has the additional side effect of reloading all segment registers in a
+ virtual machine. The reason is that otherwise, the hidden part of
+ segment registers (the base field) may be put into a non-reversible
+ state. Non-reversible segments are problematic because they can not be
+ reloaded - any subsequent loads of the segment will load the new
+ descriptor state. In general, is not possible to resume direct
+ execution of the virtual machine if certain segments become
+ non-reversible.
+
+ A load of the GDTR may cause the guest visible memory image of the GDT
+ to be changed. This allows the hypervisor to share the GDT pages with
+ the guest, but also continue to maintain appropriate protections on the
+ GDT page by transparently adjusting the DPL and RPL of descriptors in
+ the GDT.
+
+ Inputs: EAX = pointer to descriptor limit / base
+ Outputs: None
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+ VMI_SetIDT
+
+ VMICALL void VMI_SetIDT(VMI_DTR *idtr);
+
+ Load the interrupt descriptor table limit and base registers. The IDT
+ format is defined to be the same as native hardware.
+
+ A load of the IDTR may cause the guest visible memory image of the IDT
+ to be changed. This allows the hypervisor to rewrite the IDT pages in
+ a format more suitable to the hypervisor, which may include adjusting
+ the DPL and RPL of descriptors in the guest IDT.
+
+ Inputs: EAX = pointer to descriptor limit / base
+ Outputs: None
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+ VMI_SetLDT
+
+ VMICALL void VMI_SetLDT(VMI_SELECTOR ldtSel);
+
+ Load the local descriptor table. This has the additional side effect
+ of of reloading all segment registers. See VMI_SetGDT for an
+ explanation of why this is required. A load of the LDT may cause the
+ guest visible memory image of the LDT to be changed, just as GDT and
+ IDT loads.
+
+ Inputs: EAX = GDT selector of LDT descriptor
+ Outputs: None
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+ VMI_SetTR
+
+ VMICALL void VMI_SetTR(VMI_SELECTOR ldtSel);
+
+ Load the task register. Functionally equivalent to the LTR
+ instruction.
+
+ Inputs: EAX = GDT selector of TR descriptor
+ Outputs: None
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+ VMI_GetGDT
+
+ VMICALL void VMI_GetGDT(VMI_DTR *gdtr);
+
+ Copy the GDT limit and base fields into the provided pointer. This is
+ equivalent to the SGDT instruction, which is non-virtualizable.
+
+ Inputs: EAX = pointer to descriptor limit / base
+ Outputs: None
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+ VMI_GetIDT
+
+ VMICALL void VMI_GetIDT(VMI_DTR *idtr);
+
+ Copy the IDT limit and base fields into the provided pointer. This is
+ equivalent to the SIDT instruction, which is non-virtualizable.
+
+ Inputs: EAX = pointer to descriptor limit / base
+ Outputs: None
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+ VMI_GetLDT
+
+ VMICALL VMI_SELECTOR VMI_GetLDT(void);
+
+ Load the task register. Functionally equivalent to the SLDT
+ instruction, which is non-virtualizable.
+
+ Inputs: None
+ Outputs: EAX = selector of LDT descriptor
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+ VMI_GetTR
+
+ VMICALL VMI_SELECTOR VMI_GetTR(void);
+
+ Load the task register. Functionally equivalent to the STR
+ instruction, which is non-virtualizable.
+
+ Inputs: None
+ Outputs: EAX = selector of TR descriptor
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+ VMI_WriteGDTEntry
+
+ VMICALL void VMI_WriteGDTEntry(void *gdt, VMI_UINT entry,
+ VMI_UINT32 descLo,
+ VMI_UINT32 descHi);
+
+ Write a descriptor to a GDT entry. Note that writes to the GDT itself
+ may be disallowed by the hypervisor, in which case this call must be
+ converted into a hypercall. In addition, since the descriptor may need
+ to be modified to change limits and / or permissions, the guest kernel
+ should not assume the update will be binary identical to the passed
+ input.
+
+ Inputs: EAX = pointer to GDT base
+ EDX = GDT entry number
+ ECX = descriptor low word
+ ST(1) = descriptor high word
+ Outputs: None
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+ VMI_WriteLDTEntry
+
+ VMICALL void VMI_WriteLDTEntry(void *gdt, VMI_UINT entry,
+ VMI_UINT32 descLo,
+ VMI_UINT32 descHi);
+
+ Write a descriptor to a LDT entry. Note that writes to the LDT itself
+ may be disallowed by the hypervisor, in which case this call must be
+ converted into a hypercall. In addition, since the descriptor may need
+ to be modified to change limits and / or permissions, the guest kernel
+ should not assume the update will be binary identical to the passed
+ input.
+
+ Inputs: EAX = pointer to LDT base
+ EDX = LDT entry number
+ ECX = descriptor low word
+ ST(1) = descriptor high word
+ Outputs: None
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+ VMI_WriteIDTEntry
+
+ VMICALL void VMI_WriteIDTEntry(void *gdt, VMI_UINT entry,
+ VMI_UINT32 descLo,
+ VMI_UINT32 descHi);
+
+ Write a descriptor to a IDT entry. Since the descriptor may need to be
+ modified to change limits and / or permissions, the guest kernel should
+ not assume the update will be binary identical to the passed input.
+
+ Inputs: EAX = pointer to IDT base
+ EDX = IDT entry number
+ ECX = descriptor low word
+ ST(1) = descriptor high word
+ Outputs: None
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+
+ CPU CONTROL CALLS
+
+ These calls encapsulate the set of privileged instructions used to
+ manipulate the CPU control state. These instructions are all properly
+ virtualizable using trap and emulate, but for performance reasons, a
+ direct call may be more efficient. With hardware virtualization
+ capabilities, many of these calls can be left as IDENT translations, that
+ is, inline implementations of the native instructions, which are not
+ rewritten by the hypervisor. Some of these calls are performance critical
+ during context switch paths, and some are not, but they are all included
+ for completeness, with the exceptions of the obsoleted LMSW and SMSW
+ instructions.
+
+ VMI_WRMSR
+
+ VMICALL void VMI_WRMSR(VMI_UINT64 val, VMI_UINT32 reg);
+
+ Write to a model specific register. This functions identically to the
+ hardware WRMSR instruction. Note that a hypervisor may not implement
+ the full set of MSRs supported by native hardware, since many of them
+ are not useful in the context of a virtual machine.
+
+ Inputs: ECX = model specific register index
+ EAX = low word of register
+ EDX = high word of register
+ Outputs: None
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+ VMI_RDMSR
+
+ VMICALL VMI_UINT64 VMI_RDMSR(VMI_UINT64 dummy, VMI_UINT32 reg);
+
+ Read from a model specific register. This functions identically to the
+ hardware RDMSR instruction. Note that a hypervisor may not implement
+ the full set of MSRs supported by native hardware, since many of them
+ are not useful in the context of a virtual machine.
+
+ Inputs: ECX = machine specific register index
+ Outputs: EAX = low word of register
+ EDX = high word of register
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_SetCR0
+
+ VMICALL void VMI_SetCR0(VMI_UINT val);
+
+ Write to control register zero. This can cause TLB flush and FPU
+ handling side effects. The set of features available to the kernel
+ depend on the completeness of the hypervisor. An explicit list of
+ supported functionality or required settings may need to be negotiated
+ by the hypervisor and kernel during bootstrapping. This is likely to
+ be implementation or vendor specific, and the precise restrictions are
+ not yet worked out. Our implementation in general supports turning on
+ additional functionality - enabling protected mode, paging, page write
+ protections; however, once those features have been enabled, they may
+ not be disabled on the virtual hardware.
+
+ Inputs: EAX = input to control register
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_SetCR2
+
+ VMICALL void VMI_SetCR2(VMI_UINT val);
+
+ Write to control register two. This has no side effects other than
+ updating the CR2 register value.
+
+ Inputs: EAX = input to control register
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_SetCR3
+
+ VMICALL void VMI_SetCR3(VMI_UINT val);
+
+ Write to control register three. This causes a TLB flush on the local
+ processor. In addition, this update may be queued as part of a lazy
+ call invocation, which allows multiple hypercalls to be issued during
+ the context switch path. The queuing convention is to be negotiated
+ with the hypervisor during bootstrapping, but the interfaces for this
+ negotiation are currently vendor specific.
+
+ Inputs: EAX = input to control register
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+ Queue Class: MMU
+
+ VMI_SetCR4
+
+ VMICALL void VMI_SetCR3(VMI_UINT val);
+
+ Write to control register four. This can cause TLB flush and many
+ other CPU side effects. The set of features available to the kernel
+ depend on the completeness of the hypervisor. An explicit list of
+ supported functionality or required settings may need to be negotiated
+ by the hypervisor and kernel during bootstrapping. This is likely to
+ be implementation or vendor specific, and the precise restrictions are
+ not yet worked out. Our implementation in general supports turning on
+ additional MMU functionality - enabling global pages, large pages, PAE
+ mode, and other features - however, once those features have been
+ enabled, they may not be disabled on the virtual hardware. The
+ remaining CPU control bits of CR4 remain active and behave identically
+ to real hardware.
+
+ Inputs: EAX = input to control register
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_GetCR0
+ VMI_GetCR2
+ VMI_GetCR3
+ VMI_GetCR4
+
+ VMICALL VMI_UINT32 VMI_GetCR0(void);
+ VMICALL VMI_UINT32 VMI_GetCR2(void);
+ VMICALL VMI_UINT32 VMI_GetCR3(void);
+ VMICALL VMI_UINT32 VMI_GetCR4(void);
+
+ Read the value of a control register into EAX. The register contents
+ are identical to the native hardware control registers; CR0 contains
+ the control bits and task switched flag, CR2 contains the last page
+ fault address, CR3 contains the page directory base pointer, and CR4
+ contains various feature control bits.
+
+ Inputs: None
+ Outputs: EAX = value of control register
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_CLTS
+
+ VMICALL void VMI_CLTS(void);
+
+ Used to clear the task switched (TS) flag in control register zero. A
+ replacement for the CLTS instruction.
+
+ Inputs: None
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_SetDR
+
+ VMICALL void VMI_SetDR(VMI_UINT32 num, VMI_UINT32 val);
+
+ Set the debug register to the given value. If a hypervisor
+ implementation supports debug registers, this functions equivalently to
+ native hardware move to DR instructions.
+
+ Inputs: EAX = debug register number
+ EDX = debug register value
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_GetDR
+
+ VMICALL VMI_UINT32 VMI_GetDR(VMI_UINT32 num);
+
+ Read a debug register. If debug registers are not supported, the
+ implementation is free to return zero values.
+
+ Inputs: EAX = debug register number
+ Outputs: EAX = debug register value
+ Clobbers: Standard
+ Segments: Standard
+
+
+ PROCESSOR INFORMATION CALLS
+
+ These calls provide access to processor identification, performance and
+ cycle data, which may be inaccurate due to the nature of running on
+ virtual hardware. This information may be visible in a non-virtualizable
+ way to applications running outside of the kernel. As such, both RDTSC
+ and RDPMC should be disabled by kernels or hypervisors where information
+ leakage is a concern, and the accuracy of data retrieved by these functions
+ is up to the individual hypervisor vendor.
+
+ VMI_CPUID
+
+ /* Not expressible as a C function */
+
+ The CPUID instruction provides processor feature identification in a
+ vendor specific manner. The instruction itself is non-virtualizable
+ without hardware support, requiring a hypervisor assisted CPUID call
+ that emulates the effect of the native instruction, while masking any
+ unsupported CPU feature bits.
+
+ Inputs: EAX = CPUID number
+ ECX = sub-level query (nonstandard)
+ Outputs: EAX = CPUID dword 0
+ EBX = CPUID dword 1
+ ECX = CPUID dword 2
+ EDX = CPUID dword 3
+ Clobbers: Flags only
+ Segments: Standard
+
+ VMI_RDTSC
+
+ VMICALL VMI_UINT64 VMI_RDTSC(void);
+
+ The RDTSC instruction provides a cycles counter which may be made
+ visible to userspace. For better or worse, many applications have made
+ use of this feature to implement userspace timers, database indices, or
+ for micro-benchmarking of performance. This instruction is extremely
+ problematic for virtualization, because even though it is selectively
+ virtualizable using trap and emulate, it is much more expensive to
+ virtualize it in this fashion. On the other hand, if this instruction
+ is allowed to execute without trapping, the cycle counter provided
+ could be wrong in any number of circumstances due to hardware drift,
+ migration, suspend/resume, CPU hotplug, and other unforeseen
+ consequences of running inside of a virtual machine. There is no
+ standard specification for how this instruction operates when issued
+ from userspace programs, but the VMI call here provides a proper
+ interface for the kernel to read this cycle counter.
+
+ Inputs: None
+ Outputs: EAX = low word of TSC cycle counter
+ EDX = high word of TSC cycle counter
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_RDPMC
+
+ VMICALL VMI_UINT64 VMI_RDPMC(VMI_UINT64 dummy, VMI_UINT32 counter);
+
+ Similar to RDTSC, this call provides the functionality of reading
+ processor performance counters. It also is selectively visible to
+ userspace, and maintaining accurate data for the performance counters
+ is an extremely difficult task due to the side effects introduced by
+ the hypervisor.
+
+ Inputs: ECX = performance counter index
+ Outputs: EAX = low word of counter
+ EDX = high word of counter
+ Clobbers: Standard
+ Segments: Standard
+
+
+ STACK / PRIVILEGE TRANSITION CALLS
+
+ This set of calls encapsulates mechanisms required to transfer between
+ higher privileged kernel tasks and userspace. The stack switching and
+ return mechanisms are also used to return from interrupt handlers into
+ the kernel, which may involve atomic interrupt state and stack
+ transitions.
+
+ VMI_UpdateKernelStack
+
+ VMICALL void VMI_UpdateKernelStack(void *tss, VMI_UINT32 esp0);
+
+ Inform the hypervisor that a new kernel stack pointer has been loaded
+ in the TSS structure. This new kernel stack pointer will be used for
+ entry into the kernel on interrupts from userspace.
+
+ Inputs: EAX = pointer to TSS structure
+ EDX = new kernel stack top
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_IRET
+
+ /* No C prototype provided */
+
+ Perform a near equivalent of the IRET instruction, which atomically
+ switches off the current stack and restore the interrupt mask. This
+ may return to userspace or back to the kernel from an interrupt or
+ exception handler. The VMI_IRET call does not restore IOPL from the
+ stack image, as the native hardware equivalent would. Instead, IOPL
+ must be explicitly restored using a VMI_SetIOPL call. The VMI_IRET
+ call does, however, restore the state of the EFLAGS_VM bit from the
+ stack image in the event that the hypervisor and kernel both support
+ V8086 execution mode. If the hypervisor does not support V8086 mode,
+ this can be silently ignored, generating an error that the guest must
+ deal with. Note this call is made using a CALL instruction, just as
+ all other VMI calls, so the EIP of the call site is available to the
+ VMI layer. This allows faults during the sequence to be properly
+ passed back to the guest kernel with the correct EIP.
+
+ Note that returning to userspace with interrupts disabled is an invalid
+ operation in a paravirtualized kernel, and the results of an attempt to
+ do so are undefined.
+
+ Also note that when issuing the VMI_IRET call, the userspace data
+ segments may have already been restored, so only the stack and code
+ segments can be assumed valid.
+
+ There is currently no support for IRET calls from a 16-bit stack
+ segment, which poses a problem for supporting certain userspace
+ applications which make use of high bits of ESP on a 16-bit stack. How
+ to best resolve this is an open question. One possibility is to
+ introduce a new VMI call which can operate on 16-bit segments, since it
+ is desirable to make the common case here as fast as possible.
+
+ Inputs: ST(0) = New EIP
+ ST(1) = New CS
+ ST(2) = New Flags (including interrupt mask)
+ ST(3) = New ESP (for userspace returns)
+ ST(4) = New SS (for userspace returns)
+ ST(5) = New ES (for v8086 returns)
+ ST(6) = New DS (for v8086 returns)
+ ST(7) = New FS (for v8086 returns)
+ ST(8) = New GS (for v8086 returns)
+ Outputs: None (does not return)
+ Clobbers: None (does not return)
+ Segments: CS / SS only
+
+ VMI_SYSEXIT
+
+ /* No C prototype provided */
+
+ For hypervisors and processors which support SYSENTER / SYSEXIT, the
+ VMI_SYSEXIT call is provided as a binary equivalent to the native
+ SYSENTER instruction. Since interrupts must always be enabled in
+ userspace, the VMI version of this function always combines atomically
+ enabling interrupts with the return to userspace.
+
+ Inputs: EDX = New EIP
+ ECX = New ESP
+ Outputs: None (does not return)
+ Clobbers: None (does not return)
+ Segments: CS / SS only
+
+
+ I/O CALLS
+
+ This set of calls incorporates I/O related calls - PIO, setting I/O
+ privilege level, and forcing memory writeback for device coherency.
+
+ VMI_INB
+ VMI_INW
+ VMI_INL
+
+ VMICALL VMI_UINT8 VMI_INB(VMI_UINT dummy, VMI_UINT port);
+ VMICALL VMI_UINT16 VMI_INW(VMI_UINT dummy, VMI_UINT port);
+ VMICALL VMI_UINT32 VMI_INL(VMI_UINT dummy, VMI_UINT port);
+
+ Input a byte, word, or doubleword from an I/O port. These
+ instructions have binary equivalent semantics to native instructions.
+
+ Inputs: EDX = port number
+ EDX, rather than EAX is used, because the native
+ encoding of the instruction may use this register
+ implicitly.
+ Outputs: EAX = port value
+ Clobbers: Memory only
+ Segments: Standard
+
+ VMI_OUTB
+ VMI_OUTW
+ VMI_OUTL
+
+ VMICALL void VMI_OUTB(VMI_UINT value, VMI_UINT port);
+ VMICALL void VMI_OUTW(VMI_UINT value, VMI_UINT port);
+ VMICALL void VMI_OUTL(VMI_UINT value, VMI_UINT port);
+
+ Output a byte, word, or doubleword to an I/O port. These
+ instructions have binary equivalent semantics to native instructions.
+
+ Inputs: EAX = port value
+ EDX = port number
+ Outputs: None
+ Clobbers: None
+ Segments: Standard
+
+ VMI_INSB
+ VMI_INSW
+ VMI_INSL
+
+ /* Not expressible as C functions */
+
+ Input a string of bytes, words, or doublewords from an I/O port. These
+ instructions have binary equivalent semantics to native instructions.
+ They do not follow a C calling convention, and clobber only the same
+ registers as native instructions.
+
+ Inputs: EDI = destination address
+ EDX = port number
+ ECX = count
+ Outputs: None
+ Clobbers: ESI, ECX, Memory
+ Segments: Standard
+
+ VMI_OUTSB
+ VMI_OUTSW
+ VMI_OUTSL
+
+ /* Not expressible as C functions */
+
+ Output a string of bytes, words, or doublewords to an I/O port. These
+ instructions have binary equivalent semantics to native instructions.
+ They do not follow a C calling convention, and clobber only the same
+ registers as native instructions.
+
+ Inputs: ESI = source address
+ EDX = port number
+ ECX = count
+ Outputs: None
+ Clobbers: ESI, ECX
+ Segments: Standard
+
+ VMI_IODelay
+
+ VMICALL void VMI_IODelay(void);
+
+ Delay the processor by time required to access a bus register. This is
+ easily implemented on native hardware by an access to a bus scratch
+ register, but is typically not useful in a virtual machine. It is
+ paravirtualized to remove the overhead implied by executing the native
+ delay.
+
+ Inputs: None
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_SetIOPLMask
+
+ VMICALL void VMI_SetIOPLMask(VMI_UINT32 mask);
+
+ Set the IOPL mask of the processor to allow userspace to access I/O
+ ports. Note the mask is pre-shifted, so an IOPL of 3 would be
+ expressed as (3 << 12). If the guest chooses to use IOPL to allow
+ CPL-3 access to I/O ports, it must explicitly set and restore IOPL
+ using these calls; attempting to set the IOPL flags with popf or iret
+ may produce no result.
+
+ Inputs: EAX = Mask
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_WBINVD
+
+ VMICALL void VMI_WBINVD(void);
+
+ Write back and invalidate the data cache. This is used to synchronize
+ I/O memory.
+
+ Inputs: None
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_INVD
+
+ This instruction is deprecated. It is invalid to execute in a virtual
+ machine. It is documented here only because it is still declared in
+ the interface, and dropping it required a version change.
+
+
+ APIC CALLS
+
+ APIC virtualization is currently quite simple. These calls support the
+ functionality of the hardware APIC in a form that allows for more
+ efficient implementation in a hypervisor, by avoiding trapping access to
+ APIC memory. The calls are kept simple to make the implementation
+ compatible with native hardware. The APIC must be mapped at a page
+ boundary in the processor virtual address space.
+
+ VMI_APICWrite
+
+ VMICALL void VMI_APICWrite(void *reg, VMI_UINT32 value);
+
+ Write to a local APIC register. Side effects are the same as native
+ hardware APICs.
+
+ Inputs: EAX = APIC register address
+ EDX = value to write
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_APICRead
+
+ VMICALL VMI_UINT32 VMI_APICRead(void *reg);
+
+ Read from a local APIC register. Side effects are the same as native
+ hardware APICs.
+
+ Inputs: EAX = APIC register address
+ Outputs: EAX = APIC register value
+ Clobbers: Standard
+ Segments: Standard
+
+
+ TIMER CALLS
+
+ The VMI interfaces define a highly accurate and efficient timer interface
+ that is available when running inside of a hypervisor. This is an
+ optional but highly recommended feature which avoids many of the problems
+ presented by classical timer virtualization. It provides notions of
+ stolen time, counters, and wall clock time which allows the VM to
+ get the most accurate information in a way which is free of races and
+ legacy hardware dependence.
+
+ VMI_GetWallclockTime
+
+ VMI_NANOSECS VMICALL VMI_GetWallclockTime(void);
+
+ VMI_GetWallclockTime returns the current wallclock time as the number
+ of nanoseconds since the epoch. Nanosecond resolution along with the
+ 64-bit unsigned type provide over 580 years from epoch until rollover.
+ The wallclock time is relative to the host's wallclock time.
+
+ Inputs: None
+ Outputs: EAX = low word, wallclock time in nanoseconds
+ EDX = high word, wallclock time in nanoseconds
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_WallclockUpdated
+
+ VMI_BOOL VMICALL VMI_WallclockUpdated(void);
+
+ VMI_WallclockUpdated returns TRUE if the wallclock time has changed
+ relative to the real cycle counter since the previous time that
+ VMI_WallclockUpdated was polled. For example, while a VM is suspended,
+ the real cycle counter will halt, but wallclock time will continue to
+ advance. Upon resuming the VM, the first call to VMI_WallclockUpdated
+ will return TRUE.
+
+ Inputs: None
+ Outputs: EAX = 0 for FALSE, 1 for TRUE
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_GetCycleFrequency
+
+ VMICALL VMI_CYCLES VMI_GetCycleFrequency(void);
+
+ VMI_GetCycleFrequency returns the number of cycles in one second. This
+ value can be used by the guest to convert between cycles and other time
+ units.
+
+ Inputs: None
+ Outputs: EAX = low word, cycle frequency
+ EDX = high word, cycle frequency
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_GetCycleCounter
+
+ VMICALL VMI_CYCLES VMI_GetCycleCounter(VMI_UINT32 whichCounter);
+
+ VMI_GetCycleCounter returns the current value, in cycles units, of the
+ counter corresponding to 'whichCounter' if it is one of
+ VMI_CYCLES_REAL, VMI_CYCLES_AVAILABLE or VMI_CYCLES_STOLEN.
+ VMI_GetCycleCounter returns 0 for any other value of 'whichCounter'.
+
+ Inputs: EAX = counter index, one of
+ #define VMI_CYCLES_REAL 0
+ #define VMI_CYCLES_AVAILABLE 1
+ #define VMI_CYCLES_STOLEN 2
+ Outputs: EAX = low word, cycle counter
+ EDX = high word, cycle counter
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_SetAlarm
+
+ VMICALL void VMI_SetAlarm(VMI_UINT32 flags, VMI_CYCLES expiry,
+ VMI_CYCLES period);
+
+ VMI_SetAlarm is used to arm the vcpu's alarms. The 'flags' parameter
+ is used to specify which counter's alarm is being set (VMI_CYCLES_REAL
+ or VMI_CYCLES_AVAILABLE), how to deliver the alarm to the vcpu
+ (VMI_ALARM_WIRED_IRQ0 or VMI_ALARM_WIRED_LVTT), and the mode
+ (VMI_ALARM_IS_ONESHOT or VMI_ALARM_IS_PERIODIC). If the alarm is set
+ against the VMI_ALARM_STOLEN counter or an undefined counter number,
+ the call is a nop. The 'expiry' parameter indicates the expiry of the
+ alarm, and for periodic alarms, the 'period' parameter indicates the
+ period of the alarm. If the value of 'period' is zero, the alarm is
+ armed as a one-shot alarm regardless of the mode specified by 'flags'.
+ Finally, a call to VMI_SetAlarm for an alarm that is already armed is
+ equivalent to first calling VMI_CancelAlarm and then calling
+ VMI_SetAlarm, except that the value returned by VMI_CancelAlarm is not
+ accessible.
+
+ /* The alarm interface 'flags' bits. [TBD: exact format of 'flags'] */
+
+ Inputs: EAX = flags value, cycle counter number or'ed with
+ #define VMI_ALARM_WIRED_IRQ0 0x00000000
+ #define VMI_ALARM_WIRED_LVTT 0x00010000
+ #define VMI_ALARM_IS_ONESHOT 0x00000000
+ #define VMI_ALARM_IS_PERIODIC 0x00000100
+ EDX = low word, alarm expiry
+ ECX = high word, alarm expiry
+ ST(0) = low word, alarm expiry
+ ST(1) = high word, alarm expiry
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_CancelAlarm
+
+ VMICALL VMI_BOOL VMI_CancelAlarm(VMI_UINT32 flags);
+
+ VMI_CancelAlarm is used to disarm an alarm. The 'flags' parameter
+ indicates which alarm to cancel (VMI_CYCLES_REAL or
+ VMI_CYCLES_AVAILABLE). The return value indicates whether or not the
+ cancel succeeded. A return value of FALSE indicates that the alarm was
+ already disarmed either because a) the alarm was never set or b) it was
+ a one-shot alarm and has already fired (though perhaps not yet
+ delivered to the guest). TRUE indicates that the alarm was armed and
+ either a) the alarm was one-shot and has not yet fired (and will no
+ longer fire until it is rearmed) or b) the alarm was periodic.
+
+ Inputs: EAX = cycle counter number
+ Outputs: EAX = 0 for FALSE, 1 for TRUE
+ Clobbers: Standard
+ Segments: Standard
+
+
+ MMU CALLS
+
+ The MMU plays a large role in paravirtualization due to the large
+ performance opportunities realized by gaining insight into the guest
+ machine's use of page tables. These calls are designed to accommodate the
+ existing MMU functionality in the guest OS while providing the hypervisor
+ with hints that can be used to optimize performance to a large degree.
+
+ VMI_SetLinearMapping
+
+ VMICALL void VMI_SetLinearMapping(int slot, VMI_UINT32 va,
+ VMI_UINT32 pages, VMI_UINT32 ppn);
+
+ /* The number of VMI address translation slot */
+ #define VMI_LINEAR_MAP_SLOTS 4
+
+ Register a virtual to physical translation of virtual address range to
+ physical pages. This may be used to register single pages or to
+ register large ranges. There is an upper limit on the number of active
+ mappings, which should be sufficient to allow the hypervisor and VMI
+ layer to perform page translation without requiring dynamic storage.
+ Translations are only required to be registered for addresses used to
+ access page table entries through the VMI page table access functions.
+ The guest is free to use the provided linear map slots in a manner that
+ it finds most convenient. Kernels which linearly map a large chunk of
+ physical memory and use page tables in this linear region will only
+ need to register one such region after initialization of the VMI.
+ Hypervisors which do not require linear to physical conversion hints
+ are free to leave these calls as NOPs, which is the default when
+ inlined into the native kernel.
+
+ Inputs: EAX = linear map slot
+ EDX = virtual address start of mapping
+ ECX = number of pages in mapping
+ ST(0) = physical frame number to which pages are mapped
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_FlushTLB
+
+ VMICALL void VMI_FlushTLB(int how);
+
+ Flush all non-global mappings in the TLB, optionally flushing global
+ mappings as well. The VMI_FLUSH_TLB flag should always be specified,
+ optionally or'ed with the VMI_FLUSH_GLOBAL flag.
+
+ Inputs: EAX = flush type
+ #define VMI_FLUSH_TLB 0x01
+ #define VMI_FLUSH_GLOBAL 0x02
+ Outputs: None
+ Clobbers: Standard, memory (implied)
+ Segments: Standard
+
+ VMI_InvalPage
+
+ VMICALL void VMI_InvalPage(VMI_UINT32 va);
+
+ Invalidate the TLB mapping for a single page or large page at the
+ given virtual address.
+
+ Inputs: EAX = virtual address
+ Outputs: None
+ Clobbers: Standard, memory (implied)
+ Segments: Standard
+
+ The remaining documentation here needs updating when the PTE accessors are
+ simplified.
+
+ 70) VMI_SetPte
+
+ void VMI_SetPte(VMI_PTE pte, VMI_PTE *ptep);
+
+ Assigns a new value to a page table / directory entry. It is a
+ requirement that ptep points to a page that has already been
+ registered with the hypervisor as a page of the appropriate type
+ using the VMI_RegisterPageUsage function.
+
+ 71) VMI_SwapPte
+
+ VMI_PTE VMI_SwapPte(VMI_PTE pte, VMI_PTE *ptep);
+
+ Write 'pte' into the page table entry pointed by 'ptep', and returns
+ the old value in 'ptep'. This function acts atomically on the PTE
+ to provide up to date A/D bit information in the returned value.
+
+ 72) VMI_TestAndSetPteBit
+
+ VMI_BOOL VMI_TestAndSetPteBit(VMI_INT bit, VMI_PTE *ptep);
+
+ Atomically set a bit in a page table entry. Returns zero if the bit
+ was not set, and non-zero if the bit was set.
+
+ 73) VMI_TestAndClearPteBit
+
+ VMI_BOOL VMI_TestAndSetClearBit(VMI_INT bit, VMI_PTE *ptep);
+
+ Atomically clear a bit in a page table entry. Returns zero if the bit
+ was not set, and non-zero if the bit was set.
+
+ 74) VMI_SetPteLong
+ 75) VMI_SwapPteLong
+ 76) VMI_TestAndSetPteBitLong
+ 77) VMI_TestAndClearPteBitLong
+
+ void VMI_SetPteLong(VMI_PAE_PTE pte, VMI_PAE_PTE *ptep);
+ VMI_PAE_PTE VMI_SwapPteLong(VMI_UINT64 pte, VMI_PAE_PTE *ptep);
+ VMI_BOOL VMI_TestAndSetPteBitLong(VMI_INT bit, VMI_PAE_PTE *ptep);
+ VMI_BOOL VMI_TestAndSetClearBitLong(VMI_INT bit, VMI_PAE_PTE *ptep);
+
+ These functions act identically to the 32-bit PTE update functions,
+ but provide support for PAE mode. The calls are guaranteed to never
+ create a temporarily invalid but present page mapping that could be
+ accidentally prefetched by another processor, and all returned bits
+ are guaranteed to be atomically up to date.
+
+ One special exception is the VMI_SwapPteLong function only provides
+ synchronization against A/D bits from other processors, not against
+ other invocations of VMI_SwapPteLong.
+
+ 78) VMI_ClonePageTable
+ VMI_ClonePageDirectory
+
+ #define VMI_MKCLONE(start, count) (((start) << 16) | (count))
+
+ void VMI_ClonePageTable(VMI_UINT32 dstPPN, VMI_UINT32 srcPPN,
+ VMI_UINT32 flags);
+ void VMI_ClonePageDirectory(VMI_UINT32 dstPPN, VMI_UINT32 srcPPN,
+ VMI_UINT32 flags);
+
+ These functions tell the hypervisor to allocate a page shadow
+ at the PT or PD level using a shadow template. Because of the
+ availability of bits in the flags, these calls may be merged
+ together as well as flag the PAE-ness of the shadows.
+
+ 80) VMI_RegisterPageUsage
+ 81) VMI_ReleasePage
+
+ #define VMI_PAGE_PT 0x01
+ #define VMI_PAGE_PD 0x02
+ #define VMI_PAGE_PDP 0x04
+ #define VMI_PAGE_PML4 0x08
+ #define VMI_PAGE_GDT 0x10
+ #define VMI_PAGE_LDT 0x20
+ #define VMI_PAGE_IDT 0x40
+ #define VMI_PAGE_TSS 0x80
+
+ void VMI_RegisterPageUsage(VMI_UINT32 ppn, int flags);
+ void VMI_ReleasePage(VMI_UINT32 ppn, int flags);
+
+ These are used to register a page with the hypervisor as being of a
+ particular type, for instance, VMI_PAGE_PT says it is a page table
+ page.
+
+ 85) VMI_SetDeferredMode
+
+ void VMI_SetDeferredMode(VMI_UINT32 deferBits);
+
+ Set the lazy state update mode to the specified set of bits. This
+ allows the processor, hypervisor, or VMI layer to lazily update
+ certain CPU and MMU state. When setting this to a more permissive
+ setting, no flush is implied, but when clearing bits in the current
+ defer mask, all pending state will be flushed.
+
+ The 'deferBits' is a mask specifying how to flush.
+
+ #define VMI_DEFER_NONE 0x00
+
+ Disallow all asynchronous state updates. This is the default
+ state.
+
+ #define VMI_DEFER_MMU 0x01
+
+ Flush all pending page table updates. Note that page faults,
+ invalidations and TLB flushes will implicitly flush all pending
+ updates.
+
+ #define VMI_DEFER_CPU 0x02
+
+ Allow CPU state updates to control registers to be deferred, with
+ the exception of updates that change FPU state. This is useful
+ for combining a reload of the page table base in CR3 with other
+ updates, such as the current kernel stack.
+
+ #define VMI_DEFER_DT 0x04
+
+ Allow descriptor table updates to be delayed. This allows the
+ VMI_UpdateGDT / IDT / LDT calls to be asynchronously queued.
+
+ 86) VMI_FlushDeferredCalls
+
+ void VMI_FlushDeferredCalls(void);
+
+ Flush all asynchronous state updates which may be queued as
+ a result of setting deferred update mode.
+
+
+Appendix B - VMI C prototypes
+
+ Most of the VMI calls are properly callable C functions. Note that for the
+ absolute best performance, assembly calls are preferable in some cases, as
+ they do not imply all of the side effects of a C function call, such as
+ register clobber and memory access. Nevertheless, these wrappers serve as
+ a useful interface definition for higher level languages.
+
+ In some cases, a dummy variable is passed as an unused input to force
+ proper alignment of the remaining register values.
+
+ The call convention for these is defined to be standard GCC convention with
+ register passing. The regparm call interface is documented at:
+
+ http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html
+
+ Types used by these calls:
+
+ VMI_UINT64 64 bit unsigned integer
+ VMI_UINT32 32 bit unsigned integer
+ VMI_UINT16 16 bit unsigned integer
+ VMI_UINT8 8 bit unsigned integer
+ VMI_INT 32 bit integer
+ VMI_UINT 32 bit unsigned integer
+ VMI_DTR 6 byte compressed descriptor table limit/base
+ VMI_PTE 4 byte page table entry (or page directory)
+ VMI_LONG_PTE 8 byte page table entry (or PDE or PDPE)
+ VMI_SELECTOR 16 bit segment selector
+ VMI_BOOL 32 bit unsigned integer
+ VMI_CYCLES 64 bit unsigned integer
+ VMI_NANOSECS 64 bit unsigned integer
+
+
+ #ifndef VMI_PROTOTYPES_H
+ #define VMI_PROTOTYPES_H
+
+ /* Insert local type definitions here */
+ typedef struct VMI_DTR {
+ uint16 limit;
+ uint32 offset __attribute__ ((packed));
+ } VMI_DTR;
+
+ typedef struct APState {
+ VMI_UINT32 cr0;
+ VMI_UINT32 cr2;
+ VMI_UINT32 cr3;
+ VMI_UINT32 cr4;
+
+ VMI_UINT64 efer;
+
+ VMI_UINT32 eip;
+ VMI_UINT32 eflags;
+ VMI_UINT32 eax;
+ VMI_UINT32 ebx;
+ VMI_UINT32 ecx;
+ VMI_UINT32 edx;
+ VMI_UINT32 esp;
+ VMI_UINT32 ebp;
+ VMI_UINT32 esi;
+ VMI_UINT32 edi;
+ VMI_UINT16 cs;
+ VMI_UINT16 ss;
+
+ VMI_UINT16 ds;
+ VMI_UINT16 es;
+ VMI_UINT16 fs;
+ VMI_UINT16 gs;
+ VMI_UINT16 ldtr;
+
+ VMI_UINT16 gdtrLimit;
+ VMI_UINT32 gdtrBase;
+ VMI_UINT32 idtrBase;
+ VMI_UINT16 idtrLimit;
+ } APState;
+
+ #define VMICALL __attribute__((regparm(3)))
+
+ /* CORE INTERFACE CALLS */
+ VMICALL void VMI_Init(void);
+
+ /* PROCESSOR STATE CALLS */
+ VMICALL void VMI_DisableInterrupts(void);
+ VMICALL void VMI_EnableInterrupts(void);
+
+ VMICALL VMI_UINT VMI_GetInterruptMask(void);
+ VMICALL void VMI_SetInterruptMask(VMI_UINT mask);
+
+ VMICALL void VMI_Pause(void);
+ VMICALL void VMI_Halt(void);
+ VMICALL void VMI_Shutdown(void);
+ VMICALL void VMI_Reboot(VMI_INT how);
+
+ #define VMI_REBOOT_SOFT 0x0
+ #define VMI_REBOOT_HARD 0x1
+
+ void VMI_SetInitialAPState(APState *apState, VMI_UINT32 apicID);
+
+ /* DESCRIPTOR RELATED CALLS */
+ VMICALL void VMI_SetGDT(VMI_DTR *gdtr);
+ VMICALL void VMI_SetIDT(VMI_DTR *idtr);
+ VMICALL void VMI_SetLDT(VMI_SELECTOR ldtSel);
+ VMICALL void VMI_SetTR(VMI_SELECTOR ldtSel);
+
+ VMICALL void VMI_GetGDT(VMI_DTR *gdtr);
+ VMICALL void VMI_GetIDT(VMI_DTR *idtr);
+ VMICALL VMI_SELECTOR VMI_GetLDT(void);
+ VMICALL VMI_SELECTOR VMI_GetTR(void);
+
+ VMICALL void VMI_WriteGDTEntry(void *gdt,
+ VMI_UINT entry,
+ VMI_UINT32 descLo,
+ VMI_UINT32 descHi);
+ VMICALL void VMI_WriteLDTEntry(void *gdt,
+ VMI_UINT entry,
+ VMI_UINT32 descLo,
+ VMI_UINT32 descHi);
+ VMICALL void VMI_WriteIDTEntry(void *gdt,
+ VMI_UINT entry,
+ VMI_UINT32 descLo,
+ VMI_UINT32 descHi);
+
+ /* CPU CONTROL CALLS */
+ VMICALL void VMI_WRMSR(VMI_UINT64 val, VMI_UINT32 reg);
+ VMICALL void VMI_WRMSR_SPLIT(VMI_UINT32 valLo, VMI_UINT32 valHi,
+ VMI_UINT32 reg);
+
+ /* Not truly a proper C function; use dummy to align reg in ECX */
+ VMICALL VMI_UINT64 VMI_RDMSR(VMI_UINT64 dummy, VMI_UINT32 reg);
+
+ VMICALL void VMI_SetCR0(VMI_UINT val);
+ VMICALL void VMI_SetCR2(VMI_UINT val);
+ VMICALL void VMI_SetCR3(VMI_UINT val);
+ VMICALL void VMI_SetCR4(VMI_UINT val);
+
+ VMICALL VMI_UINT32 VMI_GetCR0(void);
+ VMICALL VMI_UINT32 VMI_GetCR2(void);
+ VMICALL VMI_UINT32 VMI_GetCR3(void);
+ VMICALL VMI_UINT32 VMI_GetCR4(void);
+
+ VMICALL void VMI_CLTS(void);
+
+ VMICALL void VMI_SetDR(VMI_UINT32 num, VMI_UINT32 val);
+ VMICALL VMI_UINT32 VMI_GetDR(VMI_UINT32 num);
+
+ /* PROCESSOR INFORMATION CALLS */
+
+ VMICALL VMI_UINT64 VMI_RDTSC(void);
+ VMICALL VMI_UINT64 VMI_RDPMC(VMI_UINT64 dummy, VMI_UINT32 counter);
+
+ /* STACK / PRIVILEGE TRANSITION CALLS */
+ VMICALL void VMI_UpdateKernelStack(void *tss, VMI_UINT32 esp0);
+
+ /* I/O CALLS */
+ /* Native port in EDX - use dummy */
+ VMICALL VMI_UINT8 VMI_INB(VMI_UINT dummy, VMI_UINT port);
+ VMICALL VMI_UINT16 VMI_INW(VMI_UINT dummy, VMI_UINT port);
+ VMICALL VMI_UINT32 VMI_INL(VMI_UINT dummy, VMI_UINT port);
+
+ VMICALL void VMI_OUTB(VMI_UINT value, VMI_UINT port);
+ VMICALL void VMI_OUTW(VMI_UINT value, VMI_UINT port);
+ VMICALL void VMI_OUTL(VMI_UINT value, VMI_UINT port);
+
+ VMICALL void VMI_IODelay(void);
+ VMICALL void VMI_WBINVD(void);
+ VMICALL void VMI_SetIOPLMask(VMI_UINT32 mask);
+
+ /* APIC CALLS */
+ VMICALL void VMI_APICWrite(void *reg, VMI_UINT32 value);
+ VMICALL VMI_UINT32 VMI_APICRead(void *reg);
+
+ /* TIMER CALLS */
+ VMICALL VMI_NANOSECS VMI_GetWallclockTime(void);
+ VMICALL VMI_BOOL VMI_WallclockUpdated(void);
+
+ /* Predefined rate of the wallclock. */
+ #define VMI_WALLCLOCK_HZ 1000000000
+
+ VMICALL VMI_CYCLES VMI_GetCycleFrequency(void);
+ VMICALL VMI_CYCLES VMI_GetCycleCounter(VMI_UINT32 whichCounter);
+
+ /* Defined cycle counters */
+ #define VMI_CYCLES_REAL 0
+ #define VMI_CYCLES_AVAILABLE 1
+ #define VMI_CYCLES_STOLEN 2
+
+ VMICALL void VMI_SetAlarm(VMI_UINT32 flags, VMI_CYCLES expiry,
+ VMI_CYCLES period);
+ VMICALL VMI_BOOL VMI_CancelAlarm(VMI_UINT32 flags);
+
+ /* The alarm interface 'flags' bits. [TBD: exact format of 'flags'] */
+ #define VMI_ALARM_COUNTER_MASK 0x000000ff
+
+ #define VMI_ALARM_WIRED_IRQ0 0x00000000
+ #define VMI_ALARM_WIRED_LVTT 0x00010000
+
+ #define VMI_ALARM_IS_ONESHOT 0x00000000
+ #define VMI_ALARM_IS_PERIODIC 0x00000100
+
+ /* MMU CALLS */
+ VMICALL void VMI_SetLinearMapping(int slot, VMI_UINT32 va,
+ VMI_UINT32 pages, VMI_UINT32 ppn);
+
+ /* The number of VMI address translation slot */
+ #define VMI_LINEAR_MAP_SLOTS 4
+
+ VMICALL void VMI_InvalPage(VMI_UINT32 va);
+ VMICALL void VMI_FlushTLB(int how);
+
+ /* Flags used by VMI_FlushTLB call */
+ #define VMI_FLUSH_TLB 0x01
+ #define VMI_FLUSH_GLOBAL 0x02
+
+ #endif
+
+
+Appendix C - Sensitive x86 instructions in the paravirtual environment
+
+ This is a list of x86 instructions which may operate in a different manner
+ when run inside of a paravirtual environment.
+
+ ARPL - continues to function as normal, but kernel segment registers
+ may be different, so parameters to this instruction may need
+ to be modified. (System)
+
+ IRET - the IRET instruction will be unable to change the IOPL, VM,
+ VIF, VIP, or IF fields. (System)
+
+ the IRET instruction may #GP if the return CS/SS RPL are
+ below the CPL, or are not equal. (System)
+
+ LAR - the LAR instruction will reveal changes to the DPL field of
+ descriptors in the GDT and LDT tables. (System, User)
+
+ LSL - the LSL instruction will reveal changes to the segment limit
+ of descriptors in the GDT and LDT tables. (System, User)
+
+ LSS - the LSS instruction may #GP if the RPL is not set properly.
+ (System)
+
+ MOV - the mov %seg, %reg instruction may reveal a different RPL
+ on the segment register. (System)
+
+ The mov %reg, %ss instruction may #GP if the RPL is not set
+ to the current CPL. (System)
+
+ POP - the pop %ss instruction may #GP if the RPL is not set to
+ the appropriate CPL. (System)
+
+ POPF - the POPF instruction will be unable to set the hardware
+ interrupt flag. (System)
+
+ PUSH - the push %seg instruction may reveal a different RPL on the
+ segment register. (System)
+
+ PUSHF- the PUSHF instruction will reveal a possible different IOPL,
+ and the value of the hardware interrupt flag, which is always
+ set. (System, User)
+
+ SGDT - the SGDT instruction will reveal the location and length of
+ the GDT shadow instead of the guest GDT. (System, User)
+
+ SIDT - the SIDT instruction will reveal the location and length of
+ the IDT shadow instead of the guest IDT. (System, User)
+
+ SLDT - the SLDT instruction will reveal the selector used for
+ the shadow LDT rather than the selector loaded by the guest.
+ (System, User).
+
+ STR - the STR instruction will reveal the selector used for the
+ shadow TSS rather than the selector loaded by the guest.
+ (System, User).
^ permalink raw reply [flat|nested] 26+ messages in thread
* [RFC, PATCH 1/24] i386 Vmi documentation
@ 2006-03-13 18:41 Zachary Amsden
0 siblings, 0 replies; 26+ messages in thread
From: Zachary Amsden @ 2006-03-13 18:41 UTC (permalink / raw)
To: Linus Torvalds, Linux Kernel Mailing List,
Virtualization Mailing List, Xen-devel, Andrew Morton,
Zachary Amsden, Dan Hecht, Dan Arai, Anne Holler,
Pratap Subrahmanyam, Christopher Li, Joshua LeVasseur,
Chris Wright, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy,
Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn
Documentation for the VMI API.
Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Pratap Subrahmanyam <pratap@vmware.com>
Signed-off-by: Daniel Arai <arai@vmware.com>
Signed-off-by: Daniel Hecht <dhecht@vmware.com>
Index: linux-2.6.16-rc5/Documentation/vmi_spec.txt
===================================================================
--- linux-2.6.16-rc5.orig/Documentation/vmi_spec.txt 2006-03-09 23:33:29.000000000 -0800
+++ linux-2.6.16-rc5/Documentation/vmi_spec.txt 2006-03-10 12:55:29.000000000 -0800
@@ -0,0 +1,2197 @@
+
+ Paravirtualization API Version 2.0
+
+ Zachary Amsden, Daniel Arai, Daniel Hecht, Pratap Subrahmanyam
+ Copyright (C) 2005, 2006, VMware, Inc.
+ All rights reserved
+
+Revision history:
+ 1.0: Initial version
+ 1.1: arai 2005-11-15
+ Added SMP-related sections: AP startup and Local APIC support
+ 1.2: dhecht 2006-02-23
+ Added Time Interface section and Time related VMI calls
+
+Contents
+
+1) Motivations
+2) Overview
+ Initialization
+ Privilege model
+ Memory management
+ Segmentation
+ Interrupt and I/O subsystem
+ IDT management
+ Transparent Paravirtualization
+ 3rd Party Extensions
+ AP Startup
+ State Synchronization in SMP systems
+ Local APIC Support
+ Time Interface
+3) Architectural Differences from Native Hardware
+4) ROM Implementation
+ Detection
+ Data layout
+ Call convention
+ PCI implementation
+
+Appendix A - VMI ROM low level ABI
+Appendix B - VMI C prototypes
+Appendix C - Sensitive x86 instructions
+
+
+1) Motivations
+
+ There are several high level goals which must be balanced in designing
+ an API for paravirtualization. The most general concerns are:
+
+ Portability - it should be easy to port a guest OS to use the API
+ High performance - the API must not obstruct a high performance
+ hypervisor implementation
+ Maintainability - it should be easy to maintain and upgrade the guest
+ OS
+ Extensibility - it should be possible for future expansion of the
+ API
+
+ Portability.
+
+ The general approach to paravirtualization rather than full
+ virtualization is to modify the guest operating system. This means
+ there is implicitly some code cost to port a guest OS to run in a
+ paravirtual environment. The closer the API resembles a native
+ platform which the OS supports, the lower the cost of porting.
+ Rather than provide an alternative, high level interface for this
+ API, the approach is to provide a low level interface which
+ encapsulates the sensitive and performance critical parts of the
+ system. Thus, we have direct parallels to most privileged
+ instructions, and the process of converting a guest OS to use these
+ instructions is in many cases a simple replacement of one function
+ for another. Although this is sufficient for CPU virtualization,
+ performance concerns have forced us to add additional calls for
+ memory management, and notifications about updates to certain CPU
+ data structures. Support for this in the Linux operating system has
+ proved to be very minimal in cost because of the already somewhat
+ portable and modular design of the memory management layer.
+
+ High Performance.
+
+ Providing a low level API that closely resembles hardware does not
+ provide any support for compound operations; indeed, typical
+ compound operations on hardware can be updating of many page table
+ entries, flushing system TLBs, or providing floating point safety.
+ Since these operations may require several privileged or sensitive
+ operations, it becomes important to defer some of these operations
+ until explicit flushes are issued, or to provide higher level
+ operations around some of these functions. In order to keep with
+ the goal of portability, this has been done only when deemed
+ necessary for performance reasons, and we have tried to package
+ these compound operations into methods that are typically used in
+ guest operating systems. In the future, we envision that additional
+ higher level abstractions will be added as an adjunct to the
+ low-level API. These higher level abstractions will target large
+ bulk operations such as creation, and destruction of address spaces,
+ context switches, thread creation and control.
+
+ Maintainability.
+
+ In the course of development with a virtualized environment, it is
+ not uncommon for support of new features or higher performance to
+ require radical changes to the operation of the system. If these
+ changes are visible to the guest OS in a paravirtualized system,
+ this will require updates to the guest kernel, which presents a
+ maintenance problem. In the Linux world, the rapid pace of
+ development on the kernel means new kernel versions are produced
+ every few months. This rapid pace is not always appropriate for end
+ users, so it is not uncommon to have dozens of different versions of
+ the Linux kernel in use that must be actively supported. To keep
+ this many versions in sync with potentially radical changes in the
+ paravirtualized system is not a scalable solution. To reduce the
+ maintenance burden as much as possible, while still allowing the
+ implementation to accommodate changes, the design provides a stable
+ ABI with semantic invariants. The underlying implementation of the
+ ABI and details of what data or how it communicates with the
+ hypervisor are not visible to the guest OS. As a result, in most
+ cases, the guest OS need not even be recompiled to work with a newer
+ hypervisor. This allows performance optimizations, bug fixes,
+ debugging, or statistical instrumentation to be added to the API
+ implementation without any impact on the guest kernel. This is
+ achieved by publishing a block of code from the hypervisor in the
+ form of a ROM. The guest OS makes calls into this ROM to perform
+ privileged or sensitive actions in the system.
+
+ Extensibility.
+
+ In order to provide a vehicle for new features, new device support,
+ and general evolution, the API uses feature compartmentalization
+ with controlled versioning. The API is split into sections, with
+ each section having independent versions. Each section has a top
+ level version which is incremented for each major revision, with a
+ minor version indicating incremental level. Version compatibility
+ is based on matching the major version field, and changes of the
+ major version are assumed to break compatibility. This allows
+ accurate matching of compatibility. In the event of incompatible
+ API changes, multiple APIs may be advertised by the hypervisor if it
+ wishes to support older versions of guest kernels. This provides
+ the most general forward / backward compatibility possible.
+ Currently, the API has a core section for CPU / MMU virtualization
+ support, with additional sections provided for each supported device
+ class.
+
+2) Overview
+
+ Initialization.
+
+ Initialization is done with a bootstrap loader that creates
+ the "start of day" state. This is a known state, running 32-bit
+ protected mode code with paging enabled. The guest has all the
+ standard structures in memory that are provided by a native ROM
+ boot environment, including a memory map and ACPI tables. For
+ the native hardware, this bootstrap loader can be run before
+ the kernel code proper, and this environment can be created
+ readily from within the hypervisor for the virtual case. At
+ some point, the bootstrap loader or the kernel itself invokes
+ the initialization call to enter paravirtualized mode.
+
+ Privilege Model.
+
+ The guest kernel must be modified to run at a dynamic privilege
+ level, since if entry to paravirtual mode is successful, the kernel
+ is no longer allowed to run at the highest hardware privilege level.
+ On the IA-32 architecture, this means the kernel will be running at
+ CPL 1-2, and with the hypervisor running at CPL0, and user code at
+ CPL3. The IOPL will be lowered as well to avoid giving the guest
+ direct access to hardware ports and control of the interrupt flag.
+
+ This change causes certain IA-32 instructions to become "sensitive",
+ so additional support for clearing and setting the hardware
+ interrupt flag are present. Since the switch into paravirtual mode
+ may happen dynamically, the guest OS must not rely on testing for a
+ specific privilege level by checking the RPL field of segment
+ selectors, but should check for privileged execution by performing
+ an (RPL != 3 && !EFLAGS_VM) comparison. This means the DPL of kernel
+ ring descriptors in the GDT or LDT may be raised to match the CPL of
+ the kernel. This change is visible by inspecting the segments
+ registers while running in privileged code, and by using the LAR
+ instruction.
+
+ The system also cannot be allowed to write directly to the hardware
+ GDT, LDT, IDT, or TSS, so these data structures are maintained by the
+ hypervisor, and may be shadowed or guest visible structures. These
+ structures are required to be page aligned to support non-shadowed
+ operation.
+
+ Currently, the system only provides for two guest security domains,
+ kernel (which runs at the equivalent of virtual CPL-0), and user
+ (which runs at the equivalent of virtual CPL-3, with no hardware
+ access). Typically, this is not a problem, but if a guest OS relies
+ on using multiple hardware rings for privilege isolation, this
+ interface would need to be expanded to support that.
+
+ Memory Management.
+
+ Since a virtual machine typically does not have access to all the
+ physical memory on the machine, there is a need to redefine the
+ physical address space layout for the virtual machine. The
+ spectrum of possibilities ranges from presenting the guest with
+ a view of a physically contiguous memory of a boot-time determined
+ size, exactly what the guest would see when running on hardware, to
+ the opposite, which presents the guest with the actual machine pages
+ which the hypervisor has allocated for it. Using this approach
+ requires the guest to obtain information about the pages it has
+ from the hypervisor; this can be done by using the memory map which
+ would normally be passed to the guest by the BIOS.
+
+ The interface is designed to support either mode of operation.
+ This allows the implementation to use either direct page tables
+ or shadow page tables, or some combination of both. All writes to
+ page table entries are done through calls to the hypervisor
+ interface layer. The guest notifies the hypervisor about page
+ tables updates, flushes, and invalidations through API calls.
+
+ The guest OS is also responsible for notifying the hypervisor about
+ which pages in its physical memory are going to be used to hold page
+ tables or page directories. Both PAE and non-PAE paging modes are
+ supported. When the guest is finished using pages as page tables, it
+ should release them promptly to allow the hypervisor to free the
+ page table shadows. Using a page as both a page table and a page
+ directory for linear page table access is possible, but currently
+ not supported by our implementation.
+
+ The hypervisor lives concurrently in the same address space as the
+ guest operating system. Although this is not strictly necessary on
+ IA-32 hardware, performance would be severely degraded if that were
+ not the case. The hypervisor must therefore reserve some portion of
+ linear address space for its own use. The implementation currently
+ reserves the top 64 megabytes of linear space for the hypervisor.
+ This requires the guest to relocate any data in high linear space
+ down by 64 megabytes. For non-paging mode guests, this means the
+ high 64 megabytes of physical memory should be reserved. Because
+ page tables are not sensitive to CPL, only to user/supervisor level,
+ the hypervisor must combine segment protection to ensure that the
+ guest can not access this 64 megabyte region.
+
+ An experimental patch is available to enable boot-time sizing of
+ the hypervisor hole.
+
+ Segmentation.
+
+ The IA-32 architecture provides segmented virtual memory, which can
+ be used as another form of privilege separation. Each segment
+ contains a base, limit, and properties. The base is added to the
+ virtual address to form a linear address. The limit determines the
+ length of linear space which is addressable through the segment.
+ The properties determine read/write, code and data size of the
+ region, as well as the direction in which segments grow. Segments
+ are loaded from descriptors in one of two system tables, the GDT or
+ the LDT, and the values loaded are cached until the next load of the
+ segment. This property, known as segment caching, allows the
+ machine to be put into a non-reversible state by writing over the
+ descriptor table entry from which a segment was loaded. There is no
+ efficient way to extract the base field of the segment after it is
+ loaded, as it is hidden by the processor. In a hypervisor
+ environment, the guest OS can be interrupted at any point in time by
+ interrupts and NMIs which must be serviced by the hypervisor. The
+ hypervisor must be able to recreate the original guest state when it
+ is done servicing the external event.
+
+ To avoid creating non-reversible segments, the hypervisor will
+ forcibly reload any live segment registers that are updated by
+ writes to the descriptor tables. *N.B - in the event that a segment
+ is put into an invalid or not present state by an update to the
+ descriptor table, the segment register must be forced to NULL so
+ that reloading it will not cause a general protection fault (#GP)
+ when restoring the guest state. This may require the guest to save
+ the segment register value before issuing a hypervisor API call
+ which will update the descriptor table.*
+
+ Because the hypervisor must protect its own memory space from
+ privileged code running in the guest at CPL1-2, descriptors may not
+ provide access to the 64 megabyte region of high linear space. To
+ achieve this, the hypervisor will truncate descriptors in the
+ descriptor tables. This means that attempts by the guest to access
+ through negative offsets to the segment base will fault, so this is
+ highly discouraged (some TLS implementations on Linux do this).
+ In addition, this causes the truncated length of the segment to
+ become visible to the guest through the LSL instruction.
+
+ Interrupt and I/O Subsystem.
+
+ For security reasons, the guest operating system is not given
+ control over the hardware interrupt flag. We provide a virtual
+ interrupt flag that is under guest control. The virtual operating
+ system always runs with hardware interrupts enabled, but hardware
+ interrupts are transparent to the guest. The API provides calls for
+ all instructions which modify the interrupt flag.
+
+ The paravirtualization environment provides a legacy programmable
+ interrupt controller (PIC) to the virtual machine. Future releases
+ will provide a virtual interrupt controller (VIC) that provides
+ more advanced features.
+
+ In addition to a virtual interrupt flag, there is also a virtual
+ IOPL field which the guest can use to enable access to port I/O
+ from userspace for privileged applications.
+
+ Generic PCI based device probing is available to detect virtual
+ devices. The use of PCI is pragmatic, since it allows a vendor
+ ID, class ID, and device ID to identify the appropriate driver
+ for each virtual device.
+
+ IDT Management.
+
+ The paravirtual operating environment provides the traditional x86
+ interrupt descriptor table for handling external interrupts,
+ software interrupts, and exceptions. The interrupt descriptor table
+ provides the destination code selector and EIP for interruptions.
+ The current task state structure (TSS) provides the new stack
+ address to use for interruptions that result in a privilege level
+ change. The guest OS is responsible for notifying the hypervisor
+ when it updates the stack address in the TSS.
+
+ Two types of indirect control flow are of critical importance to the
+ performance of an operating system. These are system calls and page
+ faults. The guest is also responsible for calling out to the
+ hypervisor when it updates gates in the IDT. Making IDT and TSS
+ updates known to the hypervisor in this fashion allows efficient
+ delivery through these performance critical gates.
+
+ Transparent Paravirtualization.
+
+ The guest operating system may provide an alternative implementation
+ of the VMI option rom compiled in. This implementation should
+ provide implementations of the VMI calls that are suitable for
+ running on native x86 hardware. This code may be used by the guest
+ operating system while it is being loaded, and may also be used if
+ the operating system is loaded on hardware that does not support
+ paravirtualization.
+
+ When the guest detects that the VMI option rom is available, it
+ replaces the compiled-in version of the rom with the rom provided by
+ the platform. This can be accomplished by copying the rom contents,
+ or by remapping the virtual address containing the compiled-in rom
+ to point to the platform's ROM. When booting on a platform that
+ does not provide a VMI rom, the operating system can continue to use
+ the compiled-in version to run in a non-paravirtualized fashion.
+
+ 3rd Party Extensions.
+
+ If desired, it should be possible for 3rd party virtual machine
+ monitors to implement a paravirtualization environment that can run
+ guests written to this specification.
+
+ The general mechanism for providing customized features and
+ capabilities is to provide notification of these feature through
+ the CPUID call, and allowing configuration of CPU features
+ through RDMSR / WRMSR instructions. This allows a hypervisor vendor
+ ID to be published, and the kernel may enable or disable specific
+ features based on this id. This has the advantage of following
+ closely the boot time logic of many operating systems that enables
+ certain performance enhancements or bugfixes based on processor
+ revision, using exactly the same mechanism.
+
+ An exact formal specification of the new CPUID functions and which
+ functions are vendor specific is still needed.
+
+ AP Startup.
+
+ Application Processor startup in paravirtual SMP systems works a bit
+ differently than in a traditional x86 system.
+
+ APs will launch directly in paravirtual mode with initial state
+ provided by the BSP. Rather than the traditional init/startup
+ IPI sequence, the BSP must issue the init IPI, a set application
+ processor state hypercall, followed by the startup IPI.
+
+ The initial state contains the AP's control registers, general
+ purpose registers and segment registers, as well as the IDTR,
+ GDTR, LDTR and EFER. Any processor state not included in the initial
+ AP state (including x87 FPRs, SSE register states, and MSRs other than
+ EFER), are left in the poweron state.
+
+ The BSP must construct the initial GDT used by each AP. The segment
+ register hidden state will be loaded from the GDT specified in the
+ initial AP state. The IDT and (if used) LDT may either be constructed by
+ the BSP or by the AP.
+
+ Similarly, the initial page tables used by each AP must also be
+ constructed by the BSP.
+
+ If an AP's initial state is invalid, or no initial state is provided
+ before a start IPI is received by that AP, then the AP will fail to start.
+ It is therefore advisable to have a timeout for waiting for AP's to start,
+ as is recommended for traditional x86 systems.
+
+ See VMI_SetInitialAPState in Appendix A for a description of the
+ VMI_SetInitialAPState hypercall and the associated APState data structure.
+
+ State Synchronization In SMP Systems.
+
+ Some in-memory data structures that may require no special synchronization
+ on a traditional x86 systems need special handling when run on a
+ hypervisor. Two of particular note are the descriptor tables and page
+ tables.
+
+ Each processor in an SMP system should have its own GDT and LDT. Changes
+ to each processor's descriptor tables must be made on that processor
+ via the appropriate VMI calls. There is no VMI interface for updating
+ another CPU's descriptor tables (aside from VMI_SetInitialAPState),
+ and the result of memory writes to other processors' descriptor tables
+ are undefined.
+
+ Page tables have slightly different semantics than in a traditional x86
+ system. As in traditional x86 systems, page table writes may not be
+ respected by the current CPU until a TLB flush or invlpg is issued.
+ In a paravirtual system, the hypervisor implementation is free to
+ provide either shared or private caches of the guest's page tables.
+ Page table updates must therefore be propagated to the other CPUs
+ before they are guaranteed to be noticed.
+
+ In particular, when doing TLB shootdown, the initiating processor
+ must ensure that all deferred page table updates are flushed to the
+ hypervisor, to ensure that the receiving processor has the most up-to-date
+ mapping when it performs its invlpg.
+
+ Local APIC Support.
+
+ A traditional x86 local APIC is provided by the hypervisor. The local
+ APIC is enabled and its address is set via the IA32_APIC_BASE MSR, as
+ usual. APIC registers may be read and written via ordinary memory
+ operations.
+
+ For performance reasons, higher performance APIC read and write interfaces
+ are provided. If possible, these interfaces should be used to access
+ the local APIC.
+
+ The IO-APIC is not included in this spec, as it is typically not
+ performance critical, and used mainly for initial wiring of IRQ pins.
+ Currently, we implement a fully functional IO-APIC with all the
+ capabilities of real hardware. This may seem like an unnecessary burden,
+ but if the goal is transparent paravirtualization, the kernel must
+ provide fallback support for an IO-APIC anyway. In addition, the
+ hypervisor must support an IO-APIC for SMP non-paravirtualized guests.
+ The net result is less code on both sides, and an already well defined
+ interface between the two. This avoids the complexity burden of having
+ to support two different interfaces to achieve the same task.
+
+ One shortcut we have found most helpful is to simply disable NMI delivery
+ to the paravirtualized kernel. There is no reason NMIs can't be
+ supported, but typical uses for them are not as productive in a
+ virtualized environment. Watchdog NMIs are of limited use if the OS is
+ already correct and running on stable hardware; profiling NMIs are
+ similarly of less use, since this task is accomplished with more accuracy
+ in the VMM itself; and NMIs for machine check errors should be handled
+ outside of the VM. The addition of NMI support does create additional
+ complexity for the trap handling code in the VM, and although the task is
+ surmountable, the value proposition is debatable. Here, again, feedback
+ is desired.
+
+ Time Interface.
+
+ In a virtualized environment, virtual machines (VM) will time share
+ the system with each other and with other processes running on the
+ host system. Therefore, a VM's virtual CPUs (VCPUs) will be
+ executing on the host's physical CPUs (PCPUs) for only some portion
+ of time. This section of the VMI exposes a paravirtual view of
+ time to the guest operating systems so that they may operate more
+ effectively in a virtual environment. The interface also provides
+ a way for the VCPUs to set alarms in this paravirtual view of time.
+
+ Time Domains:
+
+ a) Wallclock Time:
+
+ Wallclock time exposed to the VM through this interface indicates
+ the number of nanoseconds since epoch, 1970-01-01T00:00:00Z (ISO
+ 8601 date format). If the host's wallclock time changes (say, when
+ an error in the host's clock is corrected), so does the wallclock
+ time as viewed through this interface.
+
+ b) Real Time:
+
+ Another view of time accessible through this interface is real
+ time. Real time always progresses except for when the VM is
+ stopped or suspended. Real time is presented to the guest as a
+ counter which increments at a constant rate defined (and presented)
+ by the hypervisor. All the VCPUs of a VM share the same real time
+ counter.
+
+ The unit of the counter is called "cycles". The unit and initial
+ value (corresponding to the time the VM enters para-virtual mode)
+ are chosen by the hypervisor so that the real time counter will not
+ rollover in any practical length of time. It is expected that the
+ frequency (cycles per second) is chosen such that this clock
+ provides a "high-resolution" view of time. The unit can only
+ change when the VM (re)enters paravirtual mode.
+
+ c) Stolen time and Available time:
+
+ A VCPU is always in one of three states: running, halted, or ready.
+ The VCPU is in the 'running' state if it is executing. When the
+ VCPU executes the HLT interface, the VCPU enters the 'halted' state
+ and remains halted until there is some work pending for the VCPU
+ (e.g. an alarm expires, host I/O completes on behalf of virtual
+ I/O). At this point, the VCPU enters the 'ready' state (waiting
+ for the hypervisor to reschedule it). Finally, at any time when
+ the VCPU is not in the 'running' state nor the 'halted' state, it
+ is in the 'ready' state.
+
+ For example, consider the following sequence of events, with times
+ given in real time:
+
+ (Example 1)
+
+ At 0 ms, VCPU executing guest code.
+ At 1 ms, VCPU requests virtual I/O.
+ At 2 ms, Host performs I/O for virtual I/0.
+ At 3 ms, VCPU executes VMI_Halt.
+ At 4 ms, Host completes I/O for virtual I/O request.
+ At 5 ms, VCPU begins executing guest code, vectoring to the interrupt
+ handler for the device initiating the virtual I/O.
+ At 6 ms, VCPU preempted by hypervisor.
+ At 9 ms, VCPU begins executing guest code.
+
+ From 0 ms to 3 ms, VCPU is in the 'running' state. At 3 ms, VCPU
+ enters the 'halted' state and remains in this state until the 4 ms
+ mark. From 4 ms to 5 ms, the VCPU is in the 'ready' state. At 5
+ ms, the VCPU re-enters the 'running' state until it is preempted by
+ the hypervisor at the 6 ms mark. From 6 ms to 9 ms, VCPU is again
+ in the 'ready' state, and finally 'running' again after 9 ms.
+
+ Stolen time is defined per VCPU to progress at the rate of real
+ time when the VCPU is in the 'ready' state, and does not progress
+ otherwise. Available time is defined per VCPU to progress at the
+ rate of real time when the VCPU is in the 'running' and 'halted'
+ states, and does not progress when the VCPU is in the 'ready'
+ state.
+
+ So, for the above example, the following table indicates these time
+ values for the VCPU at each ms boundary:
+
+ Real time Stolen time Available time
+ 0 0 0
+ 1 0 1
+ 2 0 2
+ 3 0 3
+ 4 0 4
+ 5 1 4
+ 6 1 5
+ 7 2 5
+ 8 3 5
+ 9 4 5
+ 10 4 6
+
+ Notice that at any point:
+ real_time == stolen_time + available_time
+
+ Stolen time and available time are also presented as counters in
+ "cycles" units. The initial value of the stolen time counter is 0.
+ This implies the initial value of the available time counter is the
+ same as the real time counter.
+
+ Alarms:
+
+ Alarms can be set (armed) against the real time counter or the
+ available time counter. Alarms can be programmed to expire once
+ (one-shot) or on a regular period (periodic). They are armed by
+ indicating an absolute counter value expiry, and in the case of a
+ periodic alarm, a non-zero relative period counter value. [TBD:
+ The method of wiring the alarms to an interrupt vector is dependent
+ upon the virtual interrupt controller portion of the interface.
+ Currently, the alarms may be wired as if they are attached to IRQ0
+ or the vector in the local APIC LVTT. This way, the alarms can be
+ used as drop in replacements for the PIT or local APIC timer.]
+
+ Alarms are per-vcpu mechanisms. An alarm set by vcpu0 will fire
+ only on vcpu0, while an alarm set by vcpu1 will only fire on vcpu1.
+ If an alarm is set relative to available time, its expiry is a
+ value relative to the available time counter of the vcpu that set
+ it.
+
+ The interface includes a method to cancel (disarm) an alarm. On
+ each vcpu, one alarm can be set against each of the two counters
+ (real time and available time). A vcpu in the 'halted' state
+ becomes 'ready' when any of its alarm's counters reaches the
+ expiry.
+
+ An alarm "fires" by signaling the virtual interrupt controller. An
+ alarm will fire as soon as possible after the counter value is
+ greater than or equal to the alarm's current expiry. However, an
+ alarm can fire only when its vcpu is in the 'running' state.
+
+ If the alarm is periodic, a sequence of expiry values,
+
+ E(i) = e0 + p * i , i = 0, 1, 2, 3, ...
+
+ where 'e0' is the expiry specified when setting the alarm and 'p'
+ is the period of the alarm, is used to arm the alarm. Initially,
+ E(0) is used as the expiry. When the alarm fires, the next expiry
+ value in the sequence that is greater than the current value of the
+ counter is used as the alarm's new expiry.
+
+ One-shot alarms have only one expiry. When a one-shot alarm fires,
+ it is automatically disarmed.
+
+ Suppose an alarm is set relative to real time with expiry at the 3
+ ms mark and a period of 2 ms. It will expire on these real time
+ marks: 3, 5, 7, 9. Note that even if the alarm does not fire
+ during the 5 ms to 7 ms interval, the alarm can fire at most once
+ during the 7 ms to 9 ms interval (unless, of course, it is
+ reprogrammed).
+
+ If an alarm is set relative to available time with expiry at the 1
+ ms mark (in available time) and with a period of 2 ms, then it will
+ expire on these available time marks: 1, 3, 5. In the scenario
+ described in example 1, those available time values correspond to
+ these values in real time: 1, 3, 6.
+
+3) Architectural Differences from Native Hardware.
+
+ For the sake of performance, some requirements are imposed on kernel
+ fault handlers which are not present on real hardware. Most modern
+ operating systems should have no trouble meeting these requirements.
+ Failure to meet these requirements may prevent the kernel from
+ working properly.
+
+ 1) The hardware flags on entry to a fault handler may not match
+ the EFLAGS image on the fault handler stack. The stack image
+ is correct, and will have the correct state of the interrupt
+ and arithmetic flags.
+
+ 2) The stack used for kernel traps must be flat - that is, zero base,
+ segment limit determined by the hypervisor.
+
+ 3) On entry to any fault handler, the stack must have sufficient space
+ to hold 32 bytes of data, or the guest may be terminated.
+
+ 4) When calling VMI functions, the kernel must be running on a
+ flat 32-bit stack and code segment.
+
+ 5) Most VMI functions require flat data and extra segment (DS and ES)
+ segments as well; notable exceptions are IRET and SYSEXIT.
+ XXXPara - may need to add STI and CLI to this list.
+
+ 6) Interrupts must always be enabled when running code in userspace.
+
+ 7) IOPL semantics for userspace are changed; although userspace may be
+ granted port access, it can not affect the interrupt flag.
+
+ 8) The EIPs at which faults may occur in VMI calls may not match the
+ original native instruction EIP; this is a bug in the system
+ today, as many guests do rely on lazy fault handling.
+
+ 9) On entry to V8086 mode, MSR_SYSENTER_CS is cleared to zero.
+
+ 10) Todo - we would like to support these features, but they are not
+ fully tested and / or implemented:
+
+ Userspace 16-bit stack support
+ Proper handling of faulting IRETs
+
+4) ROM Implementation
+
+ Modularization
+
+ Originally, we envisioned modularizing the ROM API into several
+ subsections, but the close coupling between the initial layers
+ and the requirement to support native PCI bus devices has made
+ ROM components for network or block devices unnecessary to this
+ point in time.
+
+ VMI - the virtual machine interface. This is the core CPU, I/O
+ and MMU virtualization layer. I/O is currently limited
+ to port access to emulated devices.
+
+ Detection
+
+ The presence of hypervisor ROMs can be recognized by scanning the
+ upper region of the first megabyte of physical memory. Multiple
+ ROMs may be provided to support older API versions for legacy guest
+ OS support. ROM detection is done in the traditional manner, by
+ scanning the memory region from C8000h - DFFFFh in 2 kilobyte
+ increments. The romSignature bytes must be '0x55, 0xAA', and the
+ checksum of the region indicated by the romLength field must be zero.
+ The checksum is a simple 8-bit addition of all bytes in the ROM region.
+
+ Data layout
+
+ typedef struct HyperRomHeader {
+ uint16_t romSignature;
+ int8_t romLength;
+ unsigned char romEntry[4];
+ uint8_t romPad0;
+ uint32_t hyperSignature;
+ uint8_t APIVersionMinor;
+ uint8_t APIVersionMajor;
+ uint8_t reserved0;
+ uint8_t reserved1;
+ uint32_t reserved2;
+ uint32_t reserved3;
+ uint16_t pciHeaderOffset;
+ uint16_t pnpHeaderOffset;
+ uint32_t romPad3;
+ char reserved[32];
+ char elfHeader[64];
+ } HyperRomHeader;
+
+ The first set of fields is defined by the BIOS:
+
+ romSignature - fixed 0xAA55, BIOS ROM signature
+ romLength - the length of the ROM, in 512 byte chunks.
+ Determines the area to be checksummed.
+ romEntry - 16-bit initialization code stub used by BIOS.
+ romPad0 - reserved
+
+ The next set of fields is defined by this API:
+
+ hyperSignature - a 4 byte signature providing recognition of the
+ device class represented by this ROM. Each
+ device class defines its own unique signature.
+ APIVersionMinor - the revision level of this device class' API.
+ This indicates incremental changes to the API.
+ APIVersionMajor - the major version. Used to indicates large
+ revisions or additions to the API which break
+ compatibility with the previous version.
+ reserved0,1,2,3 - for future expansion
+
+ The next set of fields is defined by the PCI / PnP BIOS spec:
+
+ pciHeaderOffset - relative offset to the PCI device header from
+ the start of this ROM.
+ pnpHeaderOffset - relative offset to the PnP boot header from the
+ start of this ROM.
+ romPad3 - reserved by PCI spec.
+
+ Finally, there is space for future header fields, and an area
+ reserved for an ELF header to point to symbol information.
+
+Appendix A - VMI ROM Low Level ABI
+
+ OS writers intending to port their OS to the paravirtualizable x86
+ processor being modeled by this hypervisor need to access the
+ hypervisor through the VMI layer. It is possible although it is
+ currently unimplemented to add or replace the functionality of
+ individual hypervisor calls by providing your own ROM images. This is
+ intended to allow third party customizations.
+
+ VMI compatible ROMs user the signature "cVmi" in the hyperSignature
+ field of the ROM header.
+
+ Many of these calls are compatible with the SVR4 C call ABI, using up
+ to three register arguments. Some calls are not, due to restrictions
+ of the native instruction set. Calls which diverge from this ABI are
+ noted. In GNU terms, this means most of the calls are compatible with
+ regparm(3) argument passing.
+
+ Most of these calls behave as standard C functions, and as such, may
+ clobber registers EAX, EDX, ECX, flags. Memory clobbers are noted
+ explicitly, since many of them may be inlined without a memory clobber.
+
+ Most of these calls require well defined segment conventions - that is,
+ flat full size 32-bit segments for all the general segments, CS, SS, DS,
+ ES. Exceptions in some cases are noted.
+
+ The net result of these choices is that most of the calls are very
+ easy to make from C-code, and calls that are likely to be required in
+ low level trap handling code are easy to call from assembler. Most
+ of these calls are also very easily implemented by the hypervisor
+ vendor in C code, and only the performance critical calls from
+ assembler paths require custom assembly implementations.
+
+ CORE INTERFACE CALLS
+
+ This set of calls provides the base functionality to establish running
+ the kernel in VMI mode.
+
+ The interface will be expanded to include feature negotiation, more
+ explicit control over call bundling and flushing, and hypervisor
+ notifications to allow inline code patching.
+
+ VMI_Init
+
+ VMICALL void VMI_Init(void);
+
+ Initializes the hypervisor environment. Returns zero on success,
+ or -1 if the hypervisor could not be initialized. Note that this
+ is a recoverable error if the guest provides the requisite native
+ code to support transparent paravirtualization.
+
+ Inputs: None
+ Outputs: EAX = result
+ Clobbers: Standard
+ Segments: Standard
+
+
+ PROCESSOR STATE CALLS
+
+ This set of calls controls the online status of the processor. It
+ include interrupt control, reboot, halt, and shutdown functionality.
+ Future expansions may include deep sleep and hotplug CPU capabilities.
+
+ VMI_DisableInterrupts
+
+ VMICALL void VMI_DisableInterrupts(void);
+
+ Disable maskable interrupts on the processor.
+
+ Inputs: None
+ Outputs: None
+ Clobbers: Flags only
+ Segments: As this is both performance critical and likely to
+ be called from low level interrupt code, this call does not
+ require flat DS/ES segments, but uses the stack segment for
+ data access. Therefore only CS/SS must be well defined.
+
+ VMI_EnableInterrupts
+
+ VMICALL void VMI_EnableInterrupts(void);
+
+ Enable maskable interrupts on the processor. Note that the
+ current implementation always will deliver any pending interrupts
+ on a call which enables interrupts, for compatibility with kernel
+ code which expects this behavior. Whether this should be required
+ is open for debate.
+
+ Inputs: None
+ Outputs: None
+ Clobbers: Flags only
+ Segments: CS/SS only
+
+ VMI_GetInterruptMask
+
+ VMICALL VMI_UINT VMI_GetInterruptMask(void);
+
+ Returns the current interrupt state mask of the processor. The
+ mask is defined to be 0x200 (matching processor flag IF) to indicate
+ interrupts are enabled.
+
+ Inputs: None
+ Outputs: EAX = mask
+ Clobbers: Flags only
+ Segments: CS/SS only
+
+ VMI_SetInterruptMask
+
+ VMICALL void VMI_SetInterruptMask(VMI_UINT mask);
+
+ Set the current interrupt state mask of the processor. Also
+ delivers any pending interrupts if the mask is set to allow
+ them.
+
+ Inputs: EAX = mask
+ Outputs: None
+ Clobbers: Flags only
+ Segments: CS/SS only
+
+ VMI_DeliverInterrupts (For future debate)
+
+ Enable and deliver any pending interrupts. This would remove
+ the implicit delivery semantic from the SetInterruptMask and
+ EnableInterrupts calls.
+
+ VMI_Pause
+
+ VMICALL void VMI_Pause(void);
+
+ Pause the processor temporarily, to allow a hypertwin or remote
+ CPU to continue operation without lock or cache contention.
+
+ Inputs: None
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_Halt
+
+ VMICALL void VMI_Halt(void);
+
+ Put the processor into interruptible halt mode. This is defined
+ to be a non-running mode where maskable interrupts are enabled,
+ not a deep low power sleep mode.
+
+ Inputs: None
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_Shutdown
+
+ VMICALL void VMI_Shutdown(void);
+
+ Put the processor into non-interruptible halt mode. This is defined
+ to be a non-running mode where maskable interrupts are disabled,
+ indicates a power-off event for this CPU.
+
+ Inputs: None
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_Reboot:
+
+ VMICALL void VMI_Reboot(VMI_INT how);
+
+ Reboot the virtual machine, using a hard or soft reboot. A soft
+ reboot corresponds to the effects of an INIT IPI, and preserves
+ some APIC and CR state. A hard reboot corresponds to a hardware
+ reset.
+
+ Inputs: EAX = reboot mode
+ #define VMI_REBOOT_SOFT 0x0
+ #define VMI_REBOOT_HARD 0x1
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_SetInitialAPState:
+
+ void VMI_SetInitialAPState(APState *apState, VMI_UINT32 apicID);
+
+ Sets the initial state of the application processor with local APIC ID
+ "apicID" to the state in apState. apState must be the page-aligned
+ linear address of the APState structure describing the initial state of
+ the specified application processor.
+
+ Control register CR0 must have both PE and PG set; the result of
+ either of these bits being cleared is undefined. It is recommended
+ that for best performance, all processors in the system have the same
+ setting of the CR4 PAE bit. LME and LMA in EFER are both currently
+ unsupported. The result of setting either of these bits is undefined.
+
+ Inputs: EAX = pointer to APState structure for new co-processor
+ EDX = APIC ID of processor to initialize
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+
+ DESCRIPTOR RELATED CALLS
+
+ VMI_SetGDT
+
+ VMICALL void VMI_SetGDT(VMI_DTR *gdtr);
+
+ Load the global descriptor table limit and base registers. In
+ addition to the straightforward load of the hardware registers, this
+ has the additional side effect of reloading all segment registers in a
+ virtual machine. The reason is that otherwise, the hidden part of
+ segment registers (the base field) may be put into a non-reversible
+ state. Non-reversible segments are problematic because they can not be
+ reloaded - any subsequent loads of the segment will load the new
+ descriptor state. In general, is not possible to resume direct
+ execution of the virtual machine if certain segments become
+ non-reversible.
+
+ A load of the GDTR may cause the guest visible memory image of the GDT
+ to be changed. This allows the hypervisor to share the GDT pages with
+ the guest, but also continue to maintain appropriate protections on the
+ GDT page by transparently adjusting the DPL and RPL of descriptors in
+ the GDT.
+
+ Inputs: EAX = pointer to descriptor limit / base
+ Outputs: None
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+ VMI_SetIDT
+
+ VMICALL void VMI_SetIDT(VMI_DTR *idtr);
+
+ Load the interrupt descriptor table limit and base registers. The IDT
+ format is defined to be the same as native hardware.
+
+ A load of the IDTR may cause the guest visible memory image of the IDT
+ to be changed. This allows the hypervisor to rewrite the IDT pages in
+ a format more suitable to the hypervisor, which may include adjusting
+ the DPL and RPL of descriptors in the guest IDT.
+
+ Inputs: EAX = pointer to descriptor limit / base
+ Outputs: None
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+ VMI_SetLDT
+
+ VMICALL void VMI_SetLDT(VMI_SELECTOR ldtSel);
+
+ Load the local descriptor table. This has the additional side effect
+ of of reloading all segment registers. See VMI_SetGDT for an
+ explanation of why this is required. A load of the LDT may cause the
+ guest visible memory image of the LDT to be changed, just as GDT and
+ IDT loads.
+
+ Inputs: EAX = GDT selector of LDT descriptor
+ Outputs: None
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+ VMI_SetTR
+
+ VMICALL void VMI_SetTR(VMI_SELECTOR ldtSel);
+
+ Load the task register. Functionally equivalent to the LTR
+ instruction.
+
+ Inputs: EAX = GDT selector of TR descriptor
+ Outputs: None
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+ VMI_GetGDT
+
+ VMICALL void VMI_GetGDT(VMI_DTR *gdtr);
+
+ Copy the GDT limit and base fields into the provided pointer. This is
+ equivalent to the SGDT instruction, which is non-virtualizable.
+
+ Inputs: EAX = pointer to descriptor limit / base
+ Outputs: None
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+ VMI_GetIDT
+
+ VMICALL void VMI_GetIDT(VMI_DTR *idtr);
+
+ Copy the IDT limit and base fields into the provided pointer. This is
+ equivalent to the SIDT instruction, which is non-virtualizable.
+
+ Inputs: EAX = pointer to descriptor limit / base
+ Outputs: None
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+ VMI_GetLDT
+
+ VMICALL VMI_SELECTOR VMI_GetLDT(void);
+
+ Load the task register. Functionally equivalent to the SLDT
+ instruction, which is non-virtualizable.
+
+ Inputs: None
+ Outputs: EAX = selector of LDT descriptor
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+ VMI_GetTR
+
+ VMICALL VMI_SELECTOR VMI_GetTR(void);
+
+ Load the task register. Functionally equivalent to the STR
+ instruction, which is non-virtualizable.
+
+ Inputs: None
+ Outputs: EAX = selector of TR descriptor
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+ VMI_WriteGDTEntry
+
+ VMICALL void VMI_WriteGDTEntry(void *gdt, VMI_UINT entry,
+ VMI_UINT32 descLo,
+ VMI_UINT32 descHi);
+
+ Write a descriptor to a GDT entry. Note that writes to the GDT itself
+ may be disallowed by the hypervisor, in which case this call must be
+ converted into a hypercall. In addition, since the descriptor may need
+ to be modified to change limits and / or permissions, the guest kernel
+ should not assume the update will be binary identical to the passed
+ input.
+
+ Inputs: EAX = pointer to GDT base
+ EDX = GDT entry number
+ ECX = descriptor low word
+ ST(1) = descriptor high word
+ Outputs: None
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+ VMI_WriteLDTEntry
+
+ VMICALL void VMI_WriteLDTEntry(void *gdt, VMI_UINT entry,
+ VMI_UINT32 descLo,
+ VMI_UINT32 descHi);
+
+ Write a descriptor to a LDT entry. Note that writes to the LDT itself
+ may be disallowed by the hypervisor, in which case this call must be
+ converted into a hypercall. In addition, since the descriptor may need
+ to be modified to change limits and / or permissions, the guest kernel
+ should not assume the update will be binary identical to the passed
+ input.
+
+ Inputs: EAX = pointer to LDT base
+ EDX = LDT entry number
+ ECX = descriptor low word
+ ST(1) = descriptor high word
+ Outputs: None
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+ VMI_WriteIDTEntry
+
+ VMICALL void VMI_WriteIDTEntry(void *gdt, VMI_UINT entry,
+ VMI_UINT32 descLo,
+ VMI_UINT32 descHi);
+
+ Write a descriptor to a IDT entry. Since the descriptor may need to be
+ modified to change limits and / or permissions, the guest kernel should
+ not assume the update will be binary identical to the passed input.
+
+ Inputs: EAX = pointer to IDT base
+ EDX = IDT entry number
+ ECX = descriptor low word
+ ST(1) = descriptor high word
+ Outputs: None
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+
+ CPU CONTROL CALLS
+
+ These calls encapsulate the set of privileged instructions used to
+ manipulate the CPU control state. These instructions are all properly
+ virtualizable using trap and emulate, but for performance reasons, a
+ direct call may be more efficient. With hardware virtualization
+ capabilities, many of these calls can be left as IDENT translations, that
+ is, inline implementations of the native instructions, which are not
+ rewritten by the hypervisor. Some of these calls are performance critical
+ during context switch paths, and some are not, but they are all included
+ for completeness, with the exceptions of the obsoleted LMSW and SMSW
+ instructions.
+
+ VMI_WRMSR
+
+ VMICALL void VMI_WRMSR(VMI_UINT64 val, VMI_UINT32 reg);
+
+ Write to a model specific register. This functions identically to the
+ hardware WRMSR instruction. Note that a hypervisor may not implement
+ the full set of MSRs supported by native hardware, since many of them
+ are not useful in the context of a virtual machine.
+
+ Inputs: ECX = model specific register index
+ EAX = low word of register
+ EDX = high word of register
+ Outputs: None
+ Clobbers: Standard, Memory
+ Segments: Standard
+
+ VMI_RDMSR
+
+ VMICALL VMI_UINT64 VMI_RDMSR(VMI_UINT64 dummy, VMI_UINT32 reg);
+
+ Read from a model specific register. This functions identically to the
+ hardware RDMSR instruction. Note that a hypervisor may not implement
+ the full set of MSRs supported by native hardware, since many of them
+ are not useful in the context of a virtual machine.
+
+ Inputs: ECX = machine specific register index
+ Outputs: EAX = low word of register
+ EDX = high word of register
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_SetCR0
+
+ VMICALL void VMI_SetCR0(VMI_UINT val);
+
+ Write to control register zero. This can cause TLB flush and FPU
+ handling side effects. The set of features available to the kernel
+ depend on the completeness of the hypervisor. An explicit list of
+ supported functionality or required settings may need to be negotiated
+ by the hypervisor and kernel during bootstrapping. This is likely to
+ be implementation or vendor specific, and the precise restrictions are
+ not yet worked out. Our implementation in general supports turning on
+ additional functionality - enabling protected mode, paging, page write
+ protections; however, once those features have been enabled, they may
+ not be disabled on the virtual hardware.
+
+ Inputs: EAX = input to control register
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_SetCR2
+
+ VMICALL void VMI_SetCR2(VMI_UINT val);
+
+ Write to control register two. This has no side effects other than
+ updating the CR2 register value.
+
+ Inputs: EAX = input to control register
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_SetCR3
+
+ VMICALL void VMI_SetCR3(VMI_UINT val);
+
+ Write to control register three. This causes a TLB flush on the local
+ processor. In addition, this update may be queued as part of a lazy
+ call invocation, which allows multiple hypercalls to be issued during
+ the context switch path. The queuing convention is to be negotiated
+ with the hypervisor during bootstrapping, but the interfaces for this
+ negotiation are currently vendor specific.
+
+ Inputs: EAX = input to control register
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+ Queue Class: MMU
+
+ VMI_SetCR4
+
+ VMICALL void VMI_SetCR3(VMI_UINT val);
+
+ Write to control register four. This can cause TLB flush and many
+ other CPU side effects. The set of features available to the kernel
+ depend on the completeness of the hypervisor. An explicit list of
+ supported functionality or required settings may need to be negotiated
+ by the hypervisor and kernel during bootstrapping. This is likely to
+ be implementation or vendor specific, and the precise restrictions are
+ not yet worked out. Our implementation in general supports turning on
+ additional MMU functionality - enabling global pages, large pages, PAE
+ mode, and other features - however, once those features have been
+ enabled, they may not be disabled on the virtual hardware. The
+ remaining CPU control bits of CR4 remain active and behave identically
+ to real hardware.
+
+ Inputs: EAX = input to control register
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_GetCR0
+ VMI_GetCR2
+ VMI_GetCR3
+ VMI_GetCR4
+
+ VMICALL VMI_UINT32 VMI_GetCR0(void);
+ VMICALL VMI_UINT32 VMI_GetCR2(void);
+ VMICALL VMI_UINT32 VMI_GetCR3(void);
+ VMICALL VMI_UINT32 VMI_GetCR4(void);
+
+ Read the value of a control register into EAX. The register contents
+ are identical to the native hardware control registers; CR0 contains
+ the control bits and task switched flag, CR2 contains the last page
+ fault address, CR3 contains the page directory base pointer, and CR4
+ contains various feature control bits.
+
+ Inputs: None
+ Outputs: EAX = value of control register
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_CLTS
+
+ VMICALL void VMI_CLTS(void);
+
+ Used to clear the task switched (TS) flag in control register zero. A
+ replacement for the CLTS instruction.
+
+ Inputs: None
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_SetDR
+
+ VMICALL void VMI_SetDR(VMI_UINT32 num, VMI_UINT32 val);
+
+ Set the debug register to the given value. If a hypervisor
+ implementation supports debug registers, this functions equivalently to
+ native hardware move to DR instructions.
+
+ Inputs: EAX = debug register number
+ EDX = debug register value
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_GetDR
+
+ VMICALL VMI_UINT32 VMI_GetDR(VMI_UINT32 num);
+
+ Read a debug register. If debug registers are not supported, the
+ implementation is free to return zero values.
+
+ Inputs: EAX = debug register number
+ Outputs: EAX = debug register value
+ Clobbers: Standard
+ Segments: Standard
+
+
+ PROCESSOR INFORMATION CALLS
+
+ These calls provide access to processor identification, performance and
+ cycle data, which may be inaccurate due to the nature of running on
+ virtual hardware. This information may be visible in a non-virtualizable
+ way to applications running outside of the kernel. As such, both RDTSC
+ and RDPMC should be disabled by kernels or hypervisors where information
+ leakage is a concern, and the accuracy of data retrieved by these functions
+ is up to the individual hypervisor vendor.
+
+ VMI_CPUID
+
+ /* Not expressible as a C function */
+
+ The CPUID instruction provides processor feature identification in a
+ vendor specific manner. The instruction itself is non-virtualizable
+ without hardware support, requiring a hypervisor assisted CPUID call
+ that emulates the effect of the native instruction, while masking any
+ unsupported CPU feature bits.
+
+ Inputs: EAX = CPUID number
+ ECX = sub-level query (nonstandard)
+ Outputs: EAX = CPUID dword 0
+ EBX = CPUID dword 1
+ ECX = CPUID dword 2
+ EDX = CPUID dword 3
+ Clobbers: Flags only
+ Segments: Standard
+
+ VMI_RDTSC
+
+ VMICALL VMI_UINT64 VMI_RDTSC(void);
+
+ The RDTSC instruction provides a cycles counter which may be made
+ visible to userspace. For better or worse, many applications have made
+ use of this feature to implement userspace timers, database indices, or
+ for micro-benchmarking of performance. This instruction is extremely
+ problematic for virtualization, because even though it is selectively
+ virtualizable using trap and emulate, it is much more expensive to
+ virtualize it in this fashion. On the other hand, if this instruction
+ is allowed to execute without trapping, the cycle counter provided
+ could be wrong in any number of circumstances due to hardware drift,
+ migration, suspend/resume, CPU hotplug, and other unforeseen
+ consequences of running inside of a virtual machine. There is no
+ standard specification for how this instruction operates when issued
+ from userspace programs, but the VMI call here provides a proper
+ interface for the kernel to read this cycle counter.
+
+ Inputs: None
+ Outputs: EAX = low word of TSC cycle counter
+ EDX = high word of TSC cycle counter
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_RDPMC
+
+ VMICALL VMI_UINT64 VMI_RDPMC(VMI_UINT64 dummy, VMI_UINT32 counter);
+
+ Similar to RDTSC, this call provides the functionality of reading
+ processor performance counters. It also is selectively visible to
+ userspace, and maintaining accurate data for the performance counters
+ is an extremely difficult task due to the side effects introduced by
+ the hypervisor.
+
+ Inputs: ECX = performance counter index
+ Outputs: EAX = low word of counter
+ EDX = high word of counter
+ Clobbers: Standard
+ Segments: Standard
+
+
+ STACK / PRIVILEGE TRANSITION CALLS
+
+ This set of calls encapsulates mechanisms required to transfer between
+ higher privileged kernel tasks and userspace. The stack switching and
+ return mechanisms are also used to return from interrupt handlers into
+ the kernel, which may involve atomic interrupt state and stack
+ transitions.
+
+ VMI_UpdateKernelStack
+
+ VMICALL void VMI_UpdateKernelStack(void *tss, VMI_UINT32 esp0);
+
+ Inform the hypervisor that a new kernel stack pointer has been loaded
+ in the TSS structure. This new kernel stack pointer will be used for
+ entry into the kernel on interrupts from userspace.
+
+ Inputs: EAX = pointer to TSS structure
+ EDX = new kernel stack top
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_IRET
+
+ /* No C prototype provided */
+
+ Perform a near equivalent of the IRET instruction, which atomically
+ switches off the current stack and restore the interrupt mask. This
+ may return to userspace or back to the kernel from an interrupt or
+ exception handler. The VMI_IRET call does not restore IOPL from the
+ stack image, as the native hardware equivalent would. Instead, IOPL
+ must be explicitly restored using a VMI_SetIOPL call. The VMI_IRET
+ call does, however, restore the state of the EFLAGS_VM bit from the
+ stack image in the event that the hypervisor and kernel both support
+ V8086 execution mode. If the hypervisor does not support V8086 mode,
+ this can be silently ignored, generating an error that the guest must
+ deal with. Note this call is made using a CALL instruction, just as
+ all other VMI calls, so the EIP of the call site is available to the
+ VMI layer. This allows faults during the sequence to be properly
+ passed back to the guest kernel with the correct EIP.
+
+ Note that returning to userspace with interrupts disabled is an invalid
+ operation in a paravirtualized kernel, and the results of an attempt to
+ do so are undefined.
+
+ Also note that when issuing the VMI_IRET call, the userspace data
+ segments may have already been restored, so only the stack and code
+ segments can be assumed valid.
+
+ There is currently no support for IRET calls from a 16-bit stack
+ segment, which poses a problem for supporting certain userspace
+ applications which make use of high bits of ESP on a 16-bit stack. How
+ to best resolve this is an open question. One possibility is to
+ introduce a new VMI call which can operate on 16-bit segments, since it
+ is desirable to make the common case here as fast as possible.
+
+ Inputs: ST(0) = New EIP
+ ST(1) = New CS
+ ST(2) = New Flags (including interrupt mask)
+ ST(3) = New ESP (for userspace returns)
+ ST(4) = New SS (for userspace returns)
+ ST(5) = New ES (for v8086 returns)
+ ST(6) = New DS (for v8086 returns)
+ ST(7) = New FS (for v8086 returns)
+ ST(8) = New GS (for v8086 returns)
+ Outputs: None (does not return)
+ Clobbers: None (does not return)
+ Segments: CS / SS only
+
+ VMI_SYSEXIT
+
+ /* No C prototype provided */
+
+ For hypervisors and processors which support SYSENTER / SYSEXIT, the
+ VMI_SYSEXIT call is provided as a binary equivalent to the native
+ SYSENTER instruction. Since interrupts must always be enabled in
+ userspace, the VMI version of this function always combines atomically
+ enabling interrupts with the return to userspace.
+
+ Inputs: EDX = New EIP
+ ECX = New ESP
+ Outputs: None (does not return)
+ Clobbers: None (does not return)
+ Segments: CS / SS only
+
+
+ I/O CALLS
+
+ This set of calls incorporates I/O related calls - PIO, setting I/O
+ privilege level, and forcing memory writeback for device coherency.
+
+ VMI_INB
+ VMI_INW
+ VMI_INL
+
+ VMICALL VMI_UINT8 VMI_INB(VMI_UINT dummy, VMI_UINT port);
+ VMICALL VMI_UINT16 VMI_INW(VMI_UINT dummy, VMI_UINT port);
+ VMICALL VMI_UINT32 VMI_INL(VMI_UINT dummy, VMI_UINT port);
+
+ Input a byte, word, or doubleword from an I/O port. These
+ instructions have binary equivalent semantics to native instructions.
+
+ Inputs: EDX = port number
+ EDX, rather than EAX is used, because the native
+ encoding of the instruction may use this register
+ implicitly.
+ Outputs: EAX = port value
+ Clobbers: Memory only
+ Segments: Standard
+
+ VMI_OUTB
+ VMI_OUTW
+ VMI_OUTL
+
+ VMICALL void VMI_OUTB(VMI_UINT value, VMI_UINT port);
+ VMICALL void VMI_OUTW(VMI_UINT value, VMI_UINT port);
+ VMICALL void VMI_OUTL(VMI_UINT value, VMI_UINT port);
+
+ Output a byte, word, or doubleword to an I/O port. These
+ instructions have binary equivalent semantics to native instructions.
+
+ Inputs: EAX = port value
+ EDX = port number
+ Outputs: None
+ Clobbers: None
+ Segments: Standard
+
+ VMI_INSB
+ VMI_INSW
+ VMI_INSL
+
+ /* Not expressible as C functions */
+
+ Input a string of bytes, words, or doublewords from an I/O port. These
+ instructions have binary equivalent semantics to native instructions.
+ They do not follow a C calling convention, and clobber only the same
+ registers as native instructions.
+
+ Inputs: EDI = destination address
+ EDX = port number
+ ECX = count
+ Outputs: None
+ Clobbers: ESI, ECX, Memory
+ Segments: Standard
+
+ VMI_OUTSB
+ VMI_OUTSW
+ VMI_OUTSL
+
+ /* Not expressible as C functions */
+
+ Output a string of bytes, words, or doublewords to an I/O port. These
+ instructions have binary equivalent semantics to native instructions.
+ They do not follow a C calling convention, and clobber only the same
+ registers as native instructions.
+
+ Inputs: ESI = source address
+ EDX = port number
+ ECX = count
+ Outputs: None
+ Clobbers: ESI, ECX
+ Segments: Standard
+
+ VMI_IODelay
+
+ VMICALL void VMI_IODelay(void);
+
+ Delay the processor by time required to access a bus register. This is
+ easily implemented on native hardware by an access to a bus scratch
+ register, but is typically not useful in a virtual machine. It is
+ paravirtualized to remove the overhead implied by executing the native
+ delay.
+
+ Inputs: None
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_SetIOPLMask
+
+ VMICALL void VMI_SetIOPLMask(VMI_UINT32 mask);
+
+ Set the IOPL mask of the processor to allow userspace to access I/O
+ ports. Note the mask is pre-shifted, so an IOPL of 3 would be
+ expressed as (3 << 12). If the guest chooses to use IOPL to allow
+ CPL-3 access to I/O ports, it must explicitly set and restore IOPL
+ using these calls; attempting to set the IOPL flags with popf or iret
+ may produce no result.
+
+ Inputs: EAX = Mask
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_WBINVD
+
+ VMICALL void VMI_WBINVD(void);
+
+ Write back and invalidate the data cache. This is used to synchronize
+ I/O memory.
+
+ Inputs: None
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_INVD
+
+ This instruction is deprecated. It is invalid to execute in a virtual
+ machine. It is documented here only because it is still declared in
+ the interface, and dropping it required a version change.
+
+
+ APIC CALLS
+
+ APIC virtualization is currently quite simple. These calls support the
+ functionality of the hardware APIC in a form that allows for more
+ efficient implementation in a hypervisor, by avoiding trapping access to
+ APIC memory. The calls are kept simple to make the implementation
+ compatible with native hardware. The APIC must be mapped at a page
+ boundary in the processor virtual address space.
+
+ VMI_APICWrite
+
+ VMICALL void VMI_APICWrite(void *reg, VMI_UINT32 value);
+
+ Write to a local APIC register. Side effects are the same as native
+ hardware APICs.
+
+ Inputs: EAX = APIC register address
+ EDX = value to write
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_APICRead
+
+ VMICALL VMI_UINT32 VMI_APICRead(void *reg);
+
+ Read from a local APIC register. Side effects are the same as native
+ hardware APICs.
+
+ Inputs: EAX = APIC register address
+ Outputs: EAX = APIC register value
+ Clobbers: Standard
+ Segments: Standard
+
+
+ TIMER CALLS
+
+ The VMI interfaces define a highly accurate and efficient timer interface
+ that is available when running inside of a hypervisor. This is an
+ optional but highly recommended feature which avoids many of the problems
+ presented by classical timer virtualization. It provides notions of
+ stolen time, counters, and wall clock time which allows the VM to
+ get the most accurate information in a way which is free of races and
+ legacy hardware dependence.
+
+ VMI_GetWallclockTime
+
+ VMI_NANOSECS VMICALL VMI_GetWallclockTime(void);
+
+ VMI_GetWallclockTime returns the current wallclock time as the number
+ of nanoseconds since the epoch. Nanosecond resolution along with the
+ 64-bit unsigned type provide over 580 years from epoch until rollover.
+ The wallclock time is relative to the host's wallclock time.
+
+ Inputs: None
+ Outputs: EAX = low word, wallclock time in nanoseconds
+ EDX = high word, wallclock time in nanoseconds
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_WallclockUpdated
+
+ VMI_BOOL VMICALL VMI_WallclockUpdated(void);
+
+ VMI_WallclockUpdated returns TRUE if the wallclock time has changed
+ relative to the real cycle counter since the previous time that
+ VMI_WallclockUpdated was polled. For example, while a VM is suspended,
+ the real cycle counter will halt, but wallclock time will continue to
+ advance. Upon resuming the VM, the first call to VMI_WallclockUpdated
+ will return TRUE.
+
+ Inputs: None
+ Outputs: EAX = 0 for FALSE, 1 for TRUE
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_GetCycleFrequency
+
+ VMICALL VMI_CYCLES VMI_GetCycleFrequency(void);
+
+ VMI_GetCycleFrequency returns the number of cycles in one second. This
+ value can be used by the guest to convert between cycles and other time
+ units.
+
+ Inputs: None
+ Outputs: EAX = low word, cycle frequency
+ EDX = high word, cycle frequency
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_GetCycleCounter
+
+ VMICALL VMI_CYCLES VMI_GetCycleCounter(VMI_UINT32 whichCounter);
+
+ VMI_GetCycleCounter returns the current value, in cycles units, of the
+ counter corresponding to 'whichCounter' if it is one of
+ VMI_CYCLES_REAL, VMI_CYCLES_AVAILABLE or VMI_CYCLES_STOLEN.
+ VMI_GetCycleCounter returns 0 for any other value of 'whichCounter'.
+
+ Inputs: EAX = counter index, one of
+ #define VMI_CYCLES_REAL 0
+ #define VMI_CYCLES_AVAILABLE 1
+ #define VMI_CYCLES_STOLEN 2
+ Outputs: EAX = low word, cycle counter
+ EDX = high word, cycle counter
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_SetAlarm
+
+ VMICALL void VMI_SetAlarm(VMI_UINT32 flags, VMI_CYCLES expiry,
+ VMI_CYCLES period);
+
+ VMI_SetAlarm is used to arm the vcpu's alarms. The 'flags' parameter
+ is used to specify which counter's alarm is being set (VMI_CYCLES_REAL
+ or VMI_CYCLES_AVAILABLE), how to deliver the alarm to the vcpu
+ (VMI_ALARM_WIRED_IRQ0 or VMI_ALARM_WIRED_LVTT), and the mode
+ (VMI_ALARM_IS_ONESHOT or VMI_ALARM_IS_PERIODIC). If the alarm is set
+ against the VMI_ALARM_STOLEN counter or an undefined counter number,
+ the call is a nop. The 'expiry' parameter indicates the expiry of the
+ alarm, and for periodic alarms, the 'period' parameter indicates the
+ period of the alarm. If the value of 'period' is zero, the alarm is
+ armed as a one-shot alarm regardless of the mode specified by 'flags'.
+ Finally, a call to VMI_SetAlarm for an alarm that is already armed is
+ equivalent to first calling VMI_CancelAlarm and then calling
+ VMI_SetAlarm, except that the value returned by VMI_CancelAlarm is not
+ accessible.
+
+ /* The alarm interface 'flags' bits. [TBD: exact format of 'flags'] */
+
+ Inputs: EAX = flags value, cycle counter number or'ed with
+ #define VMI_ALARM_WIRED_IRQ0 0x00000000
+ #define VMI_ALARM_WIRED_LVTT 0x00010000
+ #define VMI_ALARM_IS_ONESHOT 0x00000000
+ #define VMI_ALARM_IS_PERIODIC 0x00000100
+ EDX = low word, alarm expiry
+ ECX = high word, alarm expiry
+ ST(0) = low word, alarm expiry
+ ST(1) = high word, alarm expiry
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_CancelAlarm
+
+ VMICALL VMI_BOOL VMI_CancelAlarm(VMI_UINT32 flags);
+
+ VMI_CancelAlarm is used to disarm an alarm. The 'flags' parameter
+ indicates which alarm to cancel (VMI_CYCLES_REAL or
+ VMI_CYCLES_AVAILABLE). The return value indicates whether or not the
+ cancel succeeded. A return value of FALSE indicates that the alarm was
+ already disarmed either because a) the alarm was never set or b) it was
+ a one-shot alarm and has already fired (though perhaps not yet
+ delivered to the guest). TRUE indicates that the alarm was armed and
+ either a) the alarm was one-shot and has not yet fired (and will no
+ longer fire until it is rearmed) or b) the alarm was periodic.
+
+ Inputs: EAX = cycle counter number
+ Outputs: EAX = 0 for FALSE, 1 for TRUE
+ Clobbers: Standard
+ Segments: Standard
+
+
+ MMU CALLS
+
+ The MMU plays a large role in paravirtualization due to the large
+ performance opportunities realized by gaining insight into the guest
+ machine's use of page tables. These calls are designed to accommodate the
+ existing MMU functionality in the guest OS while providing the hypervisor
+ with hints that can be used to optimize performance to a large degree.
+
+ VMI_SetLinearMapping
+
+ VMICALL void VMI_SetLinearMapping(int slot, VMI_UINT32 va,
+ VMI_UINT32 pages, VMI_UINT32 ppn);
+
+ /* The number of VMI address translation slot */
+ #define VMI_LINEAR_MAP_SLOTS 4
+
+ Register a virtual to physical translation of virtual address range to
+ physical pages. This may be used to register single pages or to
+ register large ranges. There is an upper limit on the number of active
+ mappings, which should be sufficient to allow the hypervisor and VMI
+ layer to perform page translation without requiring dynamic storage.
+ Translations are only required to be registered for addresses used to
+ access page table entries through the VMI page table access functions.
+ The guest is free to use the provided linear map slots in a manner that
+ it finds most convenient. Kernels which linearly map a large chunk of
+ physical memory and use page tables in this linear region will only
+ need to register one such region after initialization of the VMI.
+ Hypervisors which do not require linear to physical conversion hints
+ are free to leave these calls as NOPs, which is the default when
+ inlined into the native kernel.
+
+ Inputs: EAX = linear map slot
+ EDX = virtual address start of mapping
+ ECX = number of pages in mapping
+ ST(0) = physical frame number to which pages are mapped
+ Outputs: None
+ Clobbers: Standard
+ Segments: Standard
+
+ VMI_FlushTLB
+
+ VMICALL void VMI_FlushTLB(int how);
+
+ Flush all non-global mappings in the TLB, optionally flushing global
+ mappings as well. The VMI_FLUSH_TLB flag should always be specified,
+ optionally or'ed with the VMI_FLUSH_GLOBAL flag.
+
+ Inputs: EAX = flush type
+ #define VMI_FLUSH_TLB 0x01
+ #define VMI_FLUSH_GLOBAL 0x02
+ Outputs: None
+ Clobbers: Standard, memory (implied)
+ Segments: Standard
+
+ VMI_InvalPage
+
+ VMICALL void VMI_InvalPage(VMI_UINT32 va);
+
+ Invalidate the TLB mapping for a single page or large page at the
+ given virtual address.
+
+ Inputs: EAX = virtual address
+ Outputs: None
+ Clobbers: Standard, memory (implied)
+ Segments: Standard
+
+ The remaining documentation here needs updating when the PTE accessors are
+ simplified.
+
+ 70) VMI_SetPte
+
+ void VMI_SetPte(VMI_PTE pte, VMI_PTE *ptep);
+
+ Assigns a new value to a page table / directory entry. It is a
+ requirement that ptep points to a page that has already been
+ registered with the hypervisor as a page of the appropriate type
+ using the VMI_RegisterPageUsage function.
+
+ 71) VMI_SwapPte
+
+ VMI_PTE VMI_SwapPte(VMI_PTE pte, VMI_PTE *ptep);
+
+ Write 'pte' into the page table entry pointed by 'ptep', and returns
+ the old value in 'ptep'. This function acts atomically on the PTE
+ to provide up to date A/D bit information in the returned value.
+
+ 72) VMI_TestAndSetPteBit
+
+ VMI_BOOL VMI_TestAndSetPteBit(VMI_INT bit, VMI_PTE *ptep);
+
+ Atomically set a bit in a page table entry. Returns zero if the bit
+ was not set, and non-zero if the bit was set.
+
+ 73) VMI_TestAndClearPteBit
+
+ VMI_BOOL VMI_TestAndSetClearBit(VMI_INT bit, VMI_PTE *ptep);
+
+ Atomically clear a bit in a page table entry. Returns zero if the bit
+ was not set, and non-zero if the bit was set.
+
+ 74) VMI_SetPteLong
+ 75) VMI_SwapPteLong
+ 76) VMI_TestAndSetPteBitLong
+ 77) VMI_TestAndClearPteBitLong
+
+ void VMI_SetPteLong(VMI_PAE_PTE pte, VMI_PAE_PTE *ptep);
+ VMI_PAE_PTE VMI_SwapPteLong(VMI_UINT64 pte, VMI_PAE_PTE *ptep);
+ VMI_BOOL VMI_TestAndSetPteBitLong(VMI_INT bit, VMI_PAE_PTE *ptep);
+ VMI_BOOL VMI_TestAndSetClearBitLong(VMI_INT bit, VMI_PAE_PTE *ptep);
+
+ These functions act identically to the 32-bit PTE update functions,
+ but provide support for PAE mode. The calls are guaranteed to never
+ create a temporarily invalid but present page mapping that could be
+ accidentally prefetched by another processor, and all returned bits
+ are guaranteed to be atomically up to date.
+
+ One special exception is the VMI_SwapPteLong function only provides
+ synchronization against A/D bits from other processors, not against
+ other invocations of VMI_SwapPteLong.
+
+ 78) VMI_ClonePageTable
+ VMI_ClonePageDirectory
+
+ #define VMI_MKCLONE(start, count) (((start) << 16) | (count))
+
+ void VMI_ClonePageTable(VMI_UINT32 dstPPN, VMI_UINT32 srcPPN,
+ VMI_UINT32 flags);
+ void VMI_ClonePageDirectory(VMI_UINT32 dstPPN, VMI_UINT32 srcPPN,
+ VMI_UINT32 flags);
+
+ These functions tell the hypervisor to allocate a page shadow
+ at the PT or PD level using a shadow template. Because of the
+ availability of bits in the flags, these calls may be merged
+ together as well as flag the PAE-ness of the shadows.
+
+ 80) VMI_RegisterPageUsage
+ 81) VMI_ReleasePage
+
+ #define VMI_PAGE_PT 0x01
+ #define VMI_PAGE_PD 0x02
+ #define VMI_PAGE_PDP 0x04
+ #define VMI_PAGE_PML4 0x08
+ #define VMI_PAGE_GDT 0x10
+ #define VMI_PAGE_LDT 0x20
+ #define VMI_PAGE_IDT 0x40
+ #define VMI_PAGE_TSS 0x80
+
+ void VMI_RegisterPageUsage(VMI_UINT32 ppn, int flags);
+ void VMI_ReleasePage(VMI_UINT32 ppn, int flags);
+
+ These are used to register a page with the hypervisor as being of a
+ particular type, for instance, VMI_PAGE_PT says it is a page table
+ page.
+
+ 85) VMI_SetDeferredMode
+
+ void VMI_SetDeferredMode(VMI_UINT32 deferBits);
+
+ Set the lazy state update mode to the specified set of bits. This
+ allows the processor, hypervisor, or VMI layer to lazily update
+ certain CPU and MMU state. When setting this to a more permissive
+ setting, no flush is implied, but when clearing bits in the current
+ defer mask, all pending state will be flushed.
+
+ The 'deferBits' is a mask specifying how to flush.
+
+ #define VMI_DEFER_NONE 0x00
+
+ Disallow all asynchronous state updates. This is the default
+ state.
+
+ #define VMI_DEFER_MMU 0x01
+
+ Flush all pending page table updates. Note that page faults,
+ invalidations and TLB flushes will implicitly flush all pending
+ updates.
+
+ #define VMI_DEFER_CPU 0x02
+
+ Allow CPU state updates to control registers to be deferred, with
+ the exception of updates that change FPU state. This is useful
+ for combining a reload of the page table base in CR3 with other
+ updates, such as the current kernel stack.
+
+ #define VMI_DEFER_DT 0x04
+
+ Allow descriptor table updates to be delayed. This allows the
+ VMI_UpdateGDT / IDT / LDT calls to be asynchronously queued.
+
+ 86) VMI_FlushDeferredCalls
+
+ void VMI_FlushDeferredCalls(void);
+
+ Flush all asynchronous state updates which may be queued as
+ a result of setting deferred update mode.
+
+
+Appendix B - VMI C prototypes
+
+ Most of the VMI calls are properly callable C functions. Note that for the
+ absolute best performance, assembly calls are preferable in some cases, as
+ they do not imply all of the side effects of a C function call, such as
+ register clobber and memory access. Nevertheless, these wrappers serve as
+ a useful interface definition for higher level languages.
+
+ In some cases, a dummy variable is passed as an unused input to force
+ proper alignment of the remaining register values.
+
+ The call convention for these is defined to be standard GCC convention with
+ register passing. The regparm call interface is documented at:
+
+ http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html
+
+ Types used by these calls:
+
+ VMI_UINT64 64 bit unsigned integer
+ VMI_UINT32 32 bit unsigned integer
+ VMI_UINT16 16 bit unsigned integer
+ VMI_UINT8 8 bit unsigned integer
+ VMI_INT 32 bit integer
+ VMI_UINT 32 bit unsigned integer
+ VMI_DTR 6 byte compressed descriptor table limit/base
+ VMI_PTE 4 byte page table entry (or page directory)
+ VMI_LONG_PTE 8 byte page table entry (or PDE or PDPE)
+ VMI_SELECTOR 16 bit segment selector
+ VMI_BOOL 32 bit unsigned integer
+ VMI_CYCLES 64 bit unsigned integer
+ VMI_NANOSECS 64 bit unsigned integer
+
+
+ #ifndef VMI_PROTOTYPES_H
+ #define VMI_PROTOTYPES_H
+
+ /* Insert local type definitions here */
+ typedef struct VMI_DTR {
+ uint16 limit;
+ uint32 offset __attribute__ ((packed));
+ } VMI_DTR;
+
+ typedef struct APState {
+ VMI_UINT32 cr0;
+ VMI_UINT32 cr2;
+ VMI_UINT32 cr3;
+ VMI_UINT32 cr4;
+
+ VMI_UINT64 efer;
+
+ VMI_UINT32 eip;
+ VMI_UINT32 eflags;
+ VMI_UINT32 eax;
+ VMI_UINT32 ebx;
+ VMI_UINT32 ecx;
+ VMI_UINT32 edx;
+ VMI_UINT32 esp;
+ VMI_UINT32 ebp;
+ VMI_UINT32 esi;
+ VMI_UINT32 edi;
+ VMI_UINT16 cs;
+ VMI_UINT16 ss;
+
+ VMI_UINT16 ds;
+ VMI_UINT16 es;
+ VMI_UINT16 fs;
+ VMI_UINT16 gs;
+ VMI_UINT16 ldtr;
+
+ VMI_UINT16 gdtrLimit;
+ VMI_UINT32 gdtrBase;
+ VMI_UINT32 idtrBase;
+ VMI_UINT16 idtrLimit;
+ } APState;
+
+ #define VMICALL __attribute__((regparm(3)))
+
+ /* CORE INTERFACE CALLS */
+ VMICALL void VMI_Init(void);
+
+ /* PROCESSOR STATE CALLS */
+ VMICALL void VMI_DisableInterrupts(void);
+ VMICALL void VMI_EnableInterrupts(void);
+
+ VMICALL VMI_UINT VMI_GetInterruptMask(void);
+ VMICALL void VMI_SetInterruptMask(VMI_UINT mask);
+
+ VMICALL void VMI_Pause(void);
+ VMICALL void VMI_Halt(void);
+ VMICALL void VMI_Shutdown(void);
+ VMICALL void VMI_Reboot(VMI_INT how);
+
+ #define VMI_REBOOT_SOFT 0x0
+ #define VMI_REBOOT_HARD 0x1
+
+ void VMI_SetInitialAPState(APState *apState, VMI_UINT32 apicID);
+
+ /* DESCRIPTOR RELATED CALLS */
+ VMICALL void VMI_SetGDT(VMI_DTR *gdtr);
+ VMICALL void VMI_SetIDT(VMI_DTR *idtr);
+ VMICALL void VMI_SetLDT(VMI_SELECTOR ldtSel);
+ VMICALL void VMI_SetTR(VMI_SELECTOR ldtSel);
+
+ VMICALL void VMI_GetGDT(VMI_DTR *gdtr);
+ VMICALL void VMI_GetIDT(VMI_DTR *idtr);
+ VMICALL VMI_SELECTOR VMI_GetLDT(void);
+ VMICALL VMI_SELECTOR VMI_GetTR(void);
+
+ VMICALL void VMI_WriteGDTEntry(void *gdt,
+ VMI_UINT entry,
+ VMI_UINT32 descLo,
+ VMI_UINT32 descHi);
+ VMICALL void VMI_WriteLDTEntry(void *gdt,
+ VMI_UINT entry,
+ VMI_UINT32 descLo,
+ VMI_UINT32 descHi);
+ VMICALL void VMI_WriteIDTEntry(void *gdt,
+ VMI_UINT entry,
+ VMI_UINT32 descLo,
+ VMI_UINT32 descHi);
+
+ /* CPU CONTROL CALLS */
+ VMICALL void VMI_WRMSR(VMI_UINT64 val, VMI_UINT32 reg);
+ VMICALL void VMI_WRMSR_SPLIT(VMI_UINT32 valLo, VMI_UINT32 valHi,
+ VMI_UINT32 reg);
+
+ /* Not truly a proper C function; use dummy to align reg in ECX */
+ VMICALL VMI_UINT64 VMI_RDMSR(VMI_UINT64 dummy, VMI_UINT32 reg);
+
+ VMICALL void VMI_SetCR0(VMI_UINT val);
+ VMICALL void VMI_SetCR2(VMI_UINT val);
+ VMICALL void VMI_SetCR3(VMI_UINT val);
+ VMICALL void VMI_SetCR4(VMI_UINT val);
+
+ VMICALL VMI_UINT32 VMI_GetCR0(void);
+ VMICALL VMI_UINT32 VMI_GetCR2(void);
+ VMICALL VMI_UINT32 VMI_GetCR3(void);
+ VMICALL VMI_UINT32 VMI_GetCR4(void);
+
+ VMICALL void VMI_CLTS(void);
+
+ VMICALL void VMI_SetDR(VMI_UINT32 num, VMI_UINT32 val);
+ VMICALL VMI_UINT32 VMI_GetDR(VMI_UINT32 num);
+
+ /* PROCESSOR INFORMATION CALLS */
+
+ VMICALL VMI_UINT64 VMI_RDTSC(void);
+ VMICALL VMI_UINT64 VMI_RDPMC(VMI_UINT64 dummy, VMI_UINT32 counter);
+
+ /* STACK / PRIVILEGE TRANSITION CALLS */
+ VMICALL void VMI_UpdateKernelStack(void *tss, VMI_UINT32 esp0);
+
+ /* I/O CALLS */
+ /* Native port in EDX - use dummy */
+ VMICALL VMI_UINT8 VMI_INB(VMI_UINT dummy, VMI_UINT port);
+ VMICALL VMI_UINT16 VMI_INW(VMI_UINT dummy, VMI_UINT port);
+ VMICALL VMI_UINT32 VMI_INL(VMI_UINT dummy, VMI_UINT port);
+
+ VMICALL void VMI_OUTB(VMI_UINT value, VMI_UINT port);
+ VMICALL void VMI_OUTW(VMI_UINT value, VMI_UINT port);
+ VMICALL void VMI_OUTL(VMI_UINT value, VMI_UINT port);
+
+ VMICALL void VMI_IODelay(void);
+ VMICALL void VMI_WBINVD(void);
+ VMICALL void VMI_SetIOPLMask(VMI_UINT32 mask);
+
+ /* APIC CALLS */
+ VMICALL void VMI_APICWrite(void *reg, VMI_UINT32 value);
+ VMICALL VMI_UINT32 VMI_APICRead(void *reg);
+
+ /* TIMER CALLS */
+ VMICALL VMI_NANOSECS VMI_GetWallclockTime(void);
+ VMICALL VMI_BOOL VMI_WallclockUpdated(void);
+
+ /* Predefined rate of the wallclock. */
+ #define VMI_WALLCLOCK_HZ 1000000000
+
+ VMICALL VMI_CYCLES VMI_GetCycleFrequency(void);
+ VMICALL VMI_CYCLES VMI_GetCycleCounter(VMI_UINT32 whichCounter);
+
+ /* Defined cycle counters */
+ #define VMI_CYCLES_REAL 0
+ #define VMI_CYCLES_AVAILABLE 1
+ #define VMI_CYCLES_STOLEN 2
+
+ VMICALL void VMI_SetAlarm(VMI_UINT32 flags, VMI_CYCLES expiry,
+ VMI_CYCLES period);
+ VMICALL VMI_BOOL VMI_CancelAlarm(VMI_UINT32 flags);
+
+ /* The alarm interface 'flags' bits. [TBD: exact format of 'flags'] */
+ #define VMI_ALARM_COUNTER_MASK 0x000000ff
+
+ #define VMI_ALARM_WIRED_IRQ0 0x00000000
+ #define VMI_ALARM_WIRED_LVTT 0x00010000
+
+ #define VMI_ALARM_IS_ONESHOT 0x00000000
+ #define VMI_ALARM_IS_PERIODIC 0x00000100
+
+ /* MMU CALLS */
+ VMICALL void VMI_SetLinearMapping(int slot, VMI_UINT32 va,
+ VMI_UINT32 pages, VMI_UINT32 ppn);
+
+ /* The number of VMI address translation slot */
+ #define VMI_LINEAR_MAP_SLOTS 4
+
+ VMICALL void VMI_InvalPage(VMI_UINT32 va);
+ VMICALL void VMI_FlushTLB(int how);
+
+ /* Flags used by VMI_FlushTLB call */
+ #define VMI_FLUSH_TLB 0x01
+ #define VMI_FLUSH_GLOBAL 0x02
+
+ #endif
+
+
+Appendix C - Sensitive x86 instructions in the paravirtual environment
+
+ This is a list of x86 instructions which may operate in a different manner
+ when run inside of a paravirtual environment.
+
+ ARPL - continues to function as normal, but kernel segment registers
+ may be different, so parameters to this instruction may need
+ to be modified. (System)
+
+ IRET - the IRET instruction will be unable to change the IOPL, VM,
+ VIF, VIP, or IF fields. (System)
+
+ the IRET instruction may #GP if the return CS/SS RPL are
+ below the CPL, or are not equal. (System)
+
+ LAR - the LAR instruction will reveal changes to the DPL field of
+ descriptors in the GDT and LDT tables. (System, User)
+
+ LSL - the LSL instruction will reveal changes to the segment limit
+ of descriptors in the GDT and LDT tables. (System, User)
+
+ LSS - the LSS instruction may #GP if the RPL is not set properly.
+ (System)
+
+ MOV - the mov %seg, %reg instruction may reveal a different RPL
+ on the segment register. (System)
+
+ The mov %reg, %ss instruction may #GP if the RPL is not set
+ to the current CPL. (System)
+
+ POP - the pop %ss instruction may #GP if the RPL is not set to
+ the appropriate CPL. (System)
+
+ POPF - the POPF instruction will be unable to set the hardware
+ interrupt flag. (System)
+
+ PUSH - the push %seg instruction may reveal a different RPL on the
+ segment register. (System)
+
+ PUSHF- the PUSHF instruction will reveal a possible different IOPL,
+ and the value of the hardware interrupt flag, which is always
+ set. (System, User)
+
+ SGDT - the SGDT instruction will reveal the location and length of
+ the GDT shadow instead of the guest GDT. (System, User)
+
+ SIDT - the SIDT instruction will reveal the location and length of
+ the IDT shadow instead of the guest IDT. (System, User)
+
+ SLDT - the SLDT instruction will reveal the selector used for
+ the shadow LDT rather than the selector loaded by the guest.
+ (System, User).
+
+ STR - the STR instruction will reveal the selector used for the
+ shadow TSS rather than the selector loaded by the guest.
+ (System, User).
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation
2006-03-13 17:59 [RFC, PATCH 1/24] i386 Vmi documentation Zachary Amsden
@ 2006-03-13 22:49 ` Chris Wright
2006-03-14 0:00 ` Zachary Amsden
2006-03-14 4:11 ` Rik van Riel
2006-03-22 20:05 ` Andi Kleen
2 siblings, 1 reply; 26+ messages in thread
From: Chris Wright @ 2006-03-13 22:49 UTC (permalink / raw)
To: Zachary Amsden
Cc: Andrew Morton, Joshua LeVasseur, Xen-devel, Pratap Subrahmanyam,
Wim Coekaerts, Jack Lo, Dan Hecht, Linux Kernel Mailing List,
Jan Beulich, Christopher Li, Chris Wright,
Virtualization Mailing List, Linus Torvalds, Anne Holler,
Jyothy Reddy, Kip Macy, Ky Srinivasan, Leendert van Doorn,
Dan Arai
* Zachary Amsden (zach@vmware.com) wrote:
Thanks for the very complete Documentation! Some comments interspersed
below.
> + High Performance.
> +
> + Providing a low level API that closely resembles hardware does not
> + provide any support for compound operations; indeed, typical
> + compound operations on hardware can be updating of many page table
> + entries, flushing system TLBs, or providing floating point safety.
> + Since these operations may require several privileged or sensitive
> + operations, it becomes important to defer some of these operations
> + until explicit flushes are issued, or to provide higher level
> + operations around some of these functions. In order to keep with
> + the goal of portability, this has been done only when deemed
> + necessary for performance reasons, and we have tried to package
> + these compound operations into methods that are typically used in
> + guest operating systems. In the future, we envision that additional
> + higher level abstractions will be added as an adjunct to the
> + low-level API. These higher level abstractions will target large
> + bulk operations such as creation, and destruction of address spaces,
> + context switches, thread creation and control.
This is an area where in the past VMI hasn't been well-suited to support
Xen. It's the higher level abstractions which make the performance
story of paravirt compelling. I haven't made it through the whole
patchset yet, but the bits you mention above as work to be done are
certainly important to good performance.
> + Maintainability.
> +
> + In the course of development with a virtualized environment, it is
> + not uncommon for support of new features or higher performance to
> + require radical changes to the operation of the system. If these
> + changes are visible to the guest OS in a paravirtualized system,
> + this will require updates to the guest kernel, which presents a
> + maintenance problem. In the Linux world, the rapid pace of
> + development on the kernel means new kernel versions are produced
> + every few months. This rapid pace is not always appropriate for end
> + users, so it is not uncommon to have dozens of different versions of
> + the Linux kernel in use that must be actively supported.
We do not want an interface which slows down the pace. We work with
source and drop cruft as quickly as possible (referring to internal
changes, not user-visible ABI changes here). Making changes that
require a new guest for some significant performance gain is perfectly
reasonable. What we want to avoid is making changes that require a
new guest to simply boot. This is akin to rev'ing hardware w/out any
backwards compatibility. This goal doesn't require VMI and ROMs, but
I agree it requires clear interface definitions.
> + Privilege Model.
> + Currently, the system only provides for two guest security domains,
> + kernel (which runs at the equivalent of virtual CPL-0), and user
> + (which runs at the equivalent of virtual CPL-3, with no hardware
> + access). Typically, this is not a problem, but if a guest OS relies
> + on using multiple hardware rings for privilege isolation, this
> + interface would need to be expanded to support that.
I don't think this is an issue, but good to have noted down.
> + The guest OS is also responsible for notifying the hypervisor about
> + which pages in its physical memory are going to be used to hold page
> + tables or page directories. Both PAE and non-PAE paging modes are
> + supported.
Presumably simultaneously, so single ROM supports PAE and non-PAE guests?
So VMI has PAE specific bits of the interface?
> + An experimental patch is available to enable boot-time sizing of
> + the hypervisor hole.
It'll be nice to have it eventually be dynamic.
> + Interrupt and I/O Subsystem.
> +
> + For security reasons, the guest operating system is not given
> + control over the hardware interrupt flag. We provide a virtual
> + interrupt flag that is under guest control. The virtual operating
> + system always runs with hardware interrupts enabled, but hardware
> + interrupts are transparent to the guest. The API provides calls for
> + all instructions which modify the interrupt flag.
> +
> + The paravirtualization environment provides a legacy programmable
> + interrupt controller (PIC) to the virtual machine. Future releases
> + will provide a virtual interrupt controller (VIC) that provides
> + more advanced features.
VIC is then just a formalized event mechanism between guest and VMM?
> + The general mechanism for providing customized features and
> + capabilities is to provide notification of these feature through
> + the CPUID call, and allowing configuration of CPU features
> + through RDMSR / WRMSR instructions. This allows a hypervisor vendor
> + ID to be published, and the kernel may enable or disable specific
> + features based on this id. This has the advantage of following
> + closely the boot time logic of many operating systems that enables
> + certain performance enhancements or bugfixes based on processor
> + revision, using exactly the same mechanism.
I like this idea, there's been a couple times when it seemed the simplest
way to handle some Xen features, but it's absolutely ripe for basically
unmanaged interface changes.
> + One shortcut we have found most helpful is to simply disable NMI delivery
> + to the paravirtualized kernel. There is no reason NMIs can't be
> + supported, but typical uses for them are not as productive in a
> + virtualized environment. Watchdog NMIs are of limited use if the OS is
> + already correct and running on stable hardware; profiling NMIs are
> + similarly of less use, since this task is accomplished with more accuracy
> + in the VMM itself; and NMIs for machine check errors should be handled
> + outside of the VM. The addition of NMI support does create additional
> + complexity for the trap handling code in the VM, and although the task is
> + surmountable, the value proposition is debatable. Here, again, feedback
> + is desired.
Xen allows propagating NMI's to the privileged dom0. This may make
sense for some errors that aren't fatal, but I'm not sure how much it's
used.
> + Alarms:
> +
> + Alarms can be set (armed) against the real time counter or the
> + available time counter. Alarms can be programmed to expire once
> + (one-shot) or on a regular period (periodic). They are armed by
> + indicating an absolute counter value expiry, and in the case of a
> + periodic alarm, a non-zero relative period counter value. [TBD:
> + The method of wiring the alarms to an interrupt vector is dependent
> + upon the virtual interrupt controller portion of the interface.
> + Currently, the alarms may be wired as if they are attached to IRQ0
> + or the vector in the local APIC LVTT. This way, the alarms can be
> + used as drop in replacements for the PIT or local APIC timer.]
Hmm, makes me wonder what you do in the case of giving physical
access to hardware. Xen makes a distinction between irq types of
physical and virtual, and the timer is virtual. I guess VIC is an area
that warrants more discussion.
> + typedef struct HyperRomHeader {
> + uint16_t romSignature;
> + int8_t romLength;
> + unsigned char romEntry[4];
> + uint8_t romPad0;
> + uint32_t hyperSignature;
> + uint8_t APIVersionMinor;
> + uint8_t APIVersionMajor;
> + uint8_t reserved0;
> + uint8_t reserved1;
> + uint32_t reserved2;
> + uint32_t reserved3;
> + uint16_t pciHeaderOffset;
> + uint16_t pnpHeaderOffset;
> + uint32_t romPad3;
> + char reserved[32];
> + char elfHeader[64];
> + } HyperRomHeader;
As a general rule, all these typedef'd structs and StudlyCaps don't
complement Linux CodingStyle.
> + VMI_Init
> +
> + VMICALL void VMI_Init(void);
> +
> + Initializes the hypervisor environment. Returns zero on success,
> + or -1 if the hypervisor could not be initialized. Note that this
> + is a recoverable error if the guest provides the requisite native
> + code to support transparent paravirtualization.
This provides an interesting support issue, i.e. just what platform are
you runnnig on?
> + Inputs: None
> + Outputs: EAX = result
> + Clobbers: Standard
> + Segments: Standard
> +
> +
> + PROCESSOR STATE CALLS
> +
> + This set of calls controls the online status of the processor. It
> + include interrupt control, reboot, halt, and shutdown functionality.
> + Future expansions may include deep sleep and hotplug CPU capabilities.
> +
> + VMI_DisableInterrupts
> +
> + VMICALL void VMI_DisableInterrupts(void);
> +
> + Disable maskable interrupts on the processor.
> +
> + Inputs: None
> + Outputs: None
> + Clobbers: Flags only
> + Segments: As this is both performance critical and likely to
> + be called from low level interrupt code, this call does not
> + require flat DS/ES segments, but uses the stack segment for
> + data access. Therefore only CS/SS must be well defined.
> +
> + VMI_EnableInterrupts
> +
> + VMICALL void VMI_EnableInterrupts(void);
> +
> + Enable maskable interrupts on the processor. Note that the
> + current implementation always will deliver any pending interrupts
> + on a call which enables interrupts, for compatibility with kernel
> + code which expects this behavior. Whether this should be required
> + is open for debate.
> +
> + Inputs: None
> + Outputs: None
> + Clobbers: Flags only
> + Segments: CS/SS only
> +
> + VMI_GetInterruptMask
> +
> + VMICALL VMI_UINT VMI_GetInterruptMask(void);
> +
> + Returns the current interrupt state mask of the processor. The
> + mask is defined to be 0x200 (matching processor flag IF) to indicate
> + interrupts are enabled.
> +
> + Inputs: None
> + Outputs: EAX = mask
> + Clobbers: Flags only
> + Segments: CS/SS only
> +
> + VMI_SetInterruptMask
> +
> + VMICALL void VMI_SetInterruptMask(VMI_UINT mask);
> +
> + Set the current interrupt state mask of the processor. Also
> + delivers any pending interrupts if the mask is set to allow
> + them.
> +
> + Inputs: EAX = mask
> + Outputs: None
> + Clobbers: Flags only
> + Segments: CS/SS only
> +
> + VMI_DeliverInterrupts (For future debate)
> +
> + Enable and deliver any pending interrupts. This would remove
> + the implicit delivery semantic from the SetInterruptMask and
> + EnableInterrupts calls.
How do you keep forwards and backwards compat here? Guest that's coded
to do simple implicit version would never get interrupts delivered on
newer ROM?
> + CPU CONTROL CALLS
> +
> + These calls encapsulate the set of privileged instructions used to
> + manipulate the CPU control state. These instructions are all properly
> + virtualizable using trap and emulate, but for performance reasons, a
> + direct call may be more efficient. With hardware virtualization
> + capabilities, many of these calls can be left as IDENT translations, that
> + is, inline implementations of the native instructions, which are not
> + rewritten by the hypervisor. Some of these calls are performance critical
> + during context switch paths, and some are not, but they are all included
> + for completeness, with the exceptions of the obsoleted LMSW and SMSW
> + instructions.
Included just for completeness can be beginning of API bloat.
> + VMI_WRMSR
> +
> + VMICALL void VMI_WRMSR(VMI_UINT64 val, VMI_UINT32 reg);
> +
> + Write to a model specific register. This functions identically to the
> + hardware WRMSR instruction. Note that a hypervisor may not implement
> + the full set of MSRs supported by native hardware, since many of them
> + are not useful in the context of a virtual machine.
> +
> + Inputs: ECX = model specific register index
> + EAX = low word of register
> + EDX = high word of register
> + Outputs: None
> + Clobbers: Standard, Memory
> + Segments: Standard
> +
> + VMI_RDMSR
> +
> + VMICALL VMI_UINT64 VMI_RDMSR(VMI_UINT64 dummy, VMI_UINT32 reg);
> +
> + Read from a model specific register. This functions identically to the
> + hardware RDMSR instruction. Note that a hypervisor may not implement
> + the full set of MSRs supported by native hardware, since many of them
> + are not useful in the context of a virtual machine.
> +
> + Inputs: ECX = machine specific register index
> + Outputs: EAX = low word of register
> + EDX = high word of register
> + Clobbers: Standard
> + Segments: Standard
> +
> + VMI_SetCR0
> +
> + VMICALL void VMI_SetCR0(VMI_UINT val);
> +
> + Write to control register zero. This can cause TLB flush and FPU
> + handling side effects. The set of features available to the kernel
> + depend on the completeness of the hypervisor. An explicit list of
> + supported functionality or required settings may need to be negotiated
> + by the hypervisor and kernel during bootstrapping. This is likely to
> + be implementation or vendor specific, and the precise restrictions are
> + not yet worked out. Our implementation in general supports turning on
> + additional functionality - enabling protected mode, paging, page write
> + protections; however, once those features have been enabled, they may
> + not be disabled on the virtual hardware.
> +
> + Inputs: EAX = input to control register
> + Outputs: None
> + Clobbers: Standard
> + Segments: Standard
clts, setcr0, readcr0 are interrelated for typical use. is it expected
the hypervisor uses consitent regsister (either native or shadowed)
here, or is it meant to be undefined?
> + VMI_INVD
> +
> + This instruction is deprecated. It is invalid to execute in a virtual
> + machine. It is documented here only because it is still declared in
> + the interface, and dropping it required a version change.
Rev the version, no need to discuss deprecated interface ;-) Good example
of how this has the ability to carry bloat forward though.
> + MMU CALLS
Many of these will look the same on x86-64, but the API is not
64-bit clean so has to be duplicated.
> + The MMU plays a large role in paravirtualization due to the large
> + performance opportunities realized by gaining insight into the guest
> + machine's use of page tables. These calls are designed to accommodate the
> + existing MMU functionality in the guest OS while providing the hypervisor
> + with hints that can be used to optimize performance to a large degree.
> +
> + VMI_SetLinearMapping
> + VMICALL void VMI_SetLinearMapping(int slot, VMI_UINT32 va,
> + VMI_UINT32 pages, VMI_UINT32 ppn);
> +
> + /* The number of VMI address translation slot */
> + #define VMI_LINEAR_MAP_SLOTS 4
> +
> + Register a virtual to physical translation of virtual address range to
> + physical pages. This may be used to register single pages or to
> + register large ranges. There is an upper limit on the number of active
> + mappings, which should be sufficient to allow the hypervisor and VMI
> + layer to perform page translation without requiring dynamic storage.
> + Translations are only required to be registered for addresses used to
> + access page table entries through the VMI page table access functions.
> + The guest is free to use the provided linear map slots in a manner that
> + it finds most convenient. Kernels which linearly map a large chunk of
> + physical memory and use page tables in this linear region will only
> + need to register one such region after initialization of the VMI.
> + Hypervisors which do not require linear to physical conversion hints
> + are free to leave these calls as NOPs, which is the default when
> + inlined into the native kernel.
> +
> + Inputs: EAX = linear map slot
> + EDX = virtual address start of mapping
> + ECX = number of pages in mapping
> + ST(0) = physical frame number to which pages are mapped
> + Outputs: None
> + Clobbers: Standard
> + Segments: Standard
> +
> + VMI_FlushTLB
> +
> + VMICALL void VMI_FlushTLB(int how);
> +
> + Flush all non-global mappings in the TLB, optionally flushing global
> + mappings as well. The VMI_FLUSH_TLB flag should always be specified,
> + optionally or'ed with the VMI_FLUSH_GLOBAL flag.
> +
> + Inputs: EAX = flush type
> + #define VMI_FLUSH_TLB 0x01
> + #define VMI_FLUSH_GLOBAL 0x02
> + Outputs: None
> + Clobbers: Standard, memory (implied)
> + Segments: Standard
> +
> + VMI_InvalPage
> +
> + VMICALL void VMI_InvalPage(VMI_UINT32 va);
> +
> + Invalidate the TLB mapping for a single page or large page at the
> + given virtual address.
> +
> + Inputs: EAX = virtual address
> + Outputs: None
> + Clobbers: Standard, memory (implied)
> + Segments: Standard
> +
> + The remaining documentation here needs updating when the PTE accessors are
> + simplified.
> +
> + 70) VMI_SetPte
> +
> + void VMI_SetPte(VMI_PTE pte, VMI_PTE *ptep);
> +
> + Assigns a new value to a page table / directory entry. It is a
> + requirement that ptep points to a page that has already been
> + registered with the hypervisor as a page of the appropriate type
> + using the VMI_RegisterPageUsage function.
> +
> + 71) VMI_SwapPte
> +
> + VMI_PTE VMI_SwapPte(VMI_PTE pte, VMI_PTE *ptep);
> +
> + Write 'pte' into the page table entry pointed by 'ptep', and returns
> + the old value in 'ptep'. This function acts atomically on the PTE
> + to provide up to date A/D bit information in the returned value.
> +
> + 72) VMI_TestAndSetPteBit
> +
> + VMI_BOOL VMI_TestAndSetPteBit(VMI_INT bit, VMI_PTE *ptep);
> +
> + Atomically set a bit in a page table entry. Returns zero if the bit
> + was not set, and non-zero if the bit was set.
> +
> + 73) VMI_TestAndClearPteBit
> +
> + VMI_BOOL VMI_TestAndSetClearBit(VMI_INT bit, VMI_PTE *ptep);
> +
> + Atomically clear a bit in a page table entry. Returns zero if the bit
> + was not set, and non-zero if the bit was set.
> +
> + 74) VMI_SetPteLong
> + 75) VMI_SwapPteLong
> + 76) VMI_TestAndSetPteBitLong
> + 77) VMI_TestAndClearPteBitLong
> +
> + void VMI_SetPteLong(VMI_PAE_PTE pte, VMI_PAE_PTE *ptep);
> + VMI_PAE_PTE VMI_SwapPteLong(VMI_UINT64 pte, VMI_PAE_PTE *ptep);
> + VMI_BOOL VMI_TestAndSetPteBitLong(VMI_INT bit, VMI_PAE_PTE *ptep);
> + VMI_BOOL VMI_TestAndSetClearBitLong(VMI_INT bit, VMI_PAE_PTE *ptep);
> +
> + These functions act identically to the 32-bit PTE update functions,
> + but provide support for PAE mode. The calls are guaranteed to never
> + create a temporarily invalid but present page mapping that could be
> + accidentally prefetched by another processor, and all returned bits
> + are guaranteed to be atomically up to date.
Heh, answers that question I had above ;-)
> + 85) VMI_SetDeferredMode
Is this the batching, multi-call analog?
> + void VMI_SetDeferredMode(VMI_UINT32 deferBits);
> +
> + Set the lazy state update mode to the specified set of bits. This
> + allows the processor, hypervisor, or VMI layer to lazily update
> + certain CPU and MMU state. When setting this to a more permissive
> + setting, no flush is implied, but when clearing bits in the current
> + defer mask, all pending state will be flushed.
> +
> + The 'deferBits' is a mask specifying how to flush.
> +
> + #define VMI_DEFER_NONE 0x00
> +
> + Disallow all asynchronous state updates. This is the default
> + state.
> +
> + #define VMI_DEFER_MMU 0x01
> +
> + Flush all pending page table updates. Note that page faults,
> + invalidations and TLB flushes will implicitly flush all pending
> + updates.
> +
> + #define VMI_DEFER_CPU 0x02
> +
> + Allow CPU state updates to control registers to be deferred, with
> + the exception of updates that change FPU state. This is useful
> + for combining a reload of the page table base in CR3 with other
> + updates, such as the current kernel stack.
> +
> + #define VMI_DEFER_DT 0x04
> +
> + Allow descriptor table updates to be delayed. This allows the
> + VMI_UpdateGDT / IDT / LDT calls to be asynchronously queued.
> +
> + 86) VMI_FlushDeferredCalls
> +
> + void VMI_FlushDeferredCalls(void);
> +
> + Flush all asynchronous state updates which may be queued as
> + a result of setting deferred update mode.
> +
> +
> +Appendix B - VMI C prototypes
> +
> + Most of the VMI calls are properly callable C functions. Note that for the
> + absolute best performance, assembly calls are preferable in some cases, as
> + they do not imply all of the side effects of a C function call, such as
> + register clobber and memory access. Nevertheless, these wrappers serve as
> + a useful interface definition for higher level languages.
> +
> + In some cases, a dummy variable is passed as an unused input to force
> + proper alignment of the remaining register values.
> +
> + The call convention for these is defined to be standard GCC convention with
> + register passing. The regparm call interface is documented at:
> +
> + http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html
> +
> + Types used by these calls:
> +
> + VMI_UINT64 64 bit unsigned integer
> + VMI_UINT32 32 bit unsigned integer
> + VMI_UINT16 16 bit unsigned integer
> + VMI_UINT8 8 bit unsigned integer
> + VMI_INT 32 bit integer
> + VMI_UINT 32 bit unsigned integer
> + VMI_DTR 6 byte compressed descriptor table limit/base
> + VMI_PTE 4 byte page table entry (or page directory)
> + VMI_LONG_PTE 8 byte page table entry (or PDE or PDPE)
> + VMI_SELECTOR 16 bit segment selector
> + VMI_BOOL 32 bit unsigned integer
> + VMI_CYCLES 64 bit unsigned integer
> + VMI_NANOSECS 64 bit unsigned integer
All caps typedefs are not very popular w.r.t. CodingStyle.
> + #ifndef VMI_PROTOTYPES_H
> + #define VMI_PROTOTYPES_H
> +
> + /* Insert local type definitions here */
> + typedef struct VMI_DTR {
> + uint16 limit;
> + uint32 offset __attribute__ ((packed));
> + } VMI_DTR;
> +
> + typedef struct APState {
> + VMI_UINT32 cr0;
> + VMI_UINT32 cr2;
> + VMI_UINT32 cr3;
> + VMI_UINT32 cr4;
> +
> + VMI_UINT64 efer;
> +
> + VMI_UINT32 eip;
> + VMI_UINT32 eflags;
> + VMI_UINT32 eax;
> + VMI_UINT32 ebx;
> + VMI_UINT32 ecx;
> + VMI_UINT32 edx;
> + VMI_UINT32 esp;
> + VMI_UINT32 ebp;
> + VMI_UINT32 esi;
> + VMI_UINT32 edi;
> + VMI_UINT16 cs;
> + VMI_UINT16 ss;
> +
> + VMI_UINT16 ds;
> + VMI_UINT16 es;
> + VMI_UINT16 fs;
> + VMI_UINT16 gs;
> + VMI_UINT16 ldtr;
> +
> + VMI_UINT16 gdtrLimit;
> + VMI_UINT32 gdtrBase;
> + VMI_UINT32 idtrBase;
> + VMI_UINT16 idtrLimit;
> + } APState;
> +
> + #define VMICALL __attribute__((regparm(3)))
I understand it's for ABI documentation, but in Linux it's FASTCALL.
> + /* CORE INTERFACE CALLS */
> + VMICALL void VMI_Init(void);
> +
> + /* PROCESSOR STATE CALLS */
> + VMICALL void VMI_DisableInterrupts(void);
> + VMICALL void VMI_EnableInterrupts(void);
> +
> + VMICALL VMI_UINT VMI_GetInterruptMask(void);
> + VMICALL void VMI_SetInterruptMask(VMI_UINT mask);
> +
> + VMICALL void VMI_Pause(void);
> + VMICALL void VMI_Halt(void);
> + VMICALL void VMI_Shutdown(void);
> + VMICALL void VMI_Reboot(VMI_INT how);
> +
> + #define VMI_REBOOT_SOFT 0x0
> + #define VMI_REBOOT_HARD 0x1
> +
> + void VMI_SetInitialAPState(APState *apState, VMI_UINT32 apicID);
> +
> + /* DESCRIPTOR RELATED CALLS */
> + VMICALL void VMI_SetGDT(VMI_DTR *gdtr);
> + VMICALL void VMI_SetIDT(VMI_DTR *idtr);
> + VMICALL void VMI_SetLDT(VMI_SELECTOR ldtSel);
> + VMICALL void VMI_SetTR(VMI_SELECTOR ldtSel);
> +
> + VMICALL void VMI_GetGDT(VMI_DTR *gdtr);
> + VMICALL void VMI_GetIDT(VMI_DTR *idtr);
> + VMICALL VMI_SELECTOR VMI_GetLDT(void);
> + VMICALL VMI_SELECTOR VMI_GetTR(void);
> +
> + VMICALL void VMI_WriteGDTEntry(void *gdt,
> + VMI_UINT entry,
> + VMI_UINT32 descLo,
> + VMI_UINT32 descHi);
> + VMICALL void VMI_WriteLDTEntry(void *gdt,
> + VMI_UINT entry,
> + VMI_UINT32 descLo,
> + VMI_UINT32 descHi);
> + VMICALL void VMI_WriteIDTEntry(void *gdt,
> + VMI_UINT entry,
> + VMI_UINT32 descLo,
> + VMI_UINT32 descHi);
> +
> + /* CPU CONTROL CALLS */
> + VMICALL void VMI_WRMSR(VMI_UINT64 val, VMI_UINT32 reg);
> + VMICALL void VMI_WRMSR_SPLIT(VMI_UINT32 valLo, VMI_UINT32 valHi,
> + VMI_UINT32 reg);
> +
> + /* Not truly a proper C function; use dummy to align reg in ECX */
> + VMICALL VMI_UINT64 VMI_RDMSR(VMI_UINT64 dummy, VMI_UINT32 reg);
> +
> + VMICALL void VMI_SetCR0(VMI_UINT val);
> + VMICALL void VMI_SetCR2(VMI_UINT val);
> + VMICALL void VMI_SetCR3(VMI_UINT val);
> + VMICALL void VMI_SetCR4(VMI_UINT val);
> +
> + VMICALL VMI_UINT32 VMI_GetCR0(void);
> + VMICALL VMI_UINT32 VMI_GetCR2(void);
> + VMICALL VMI_UINT32 VMI_GetCR3(void);
> + VMICALL VMI_UINT32 VMI_GetCR4(void);
> +
> + VMICALL void VMI_CLTS(void);
> +
> + VMICALL void VMI_SetDR(VMI_UINT32 num, VMI_UINT32 val);
> + VMICALL VMI_UINT32 VMI_GetDR(VMI_UINT32 num);
> +
> + /* PROCESSOR INFORMATION CALLS */
> +
> + VMICALL VMI_UINT64 VMI_RDTSC(void);
> + VMICALL VMI_UINT64 VMI_RDPMC(VMI_UINT64 dummy, VMI_UINT32 counter);
> +
> + /* STACK / PRIVILEGE TRANSITION CALLS */
> + VMICALL void VMI_UpdateKernelStack(void *tss, VMI_UINT32 esp0);
> +
> + /* I/O CALLS */
> + /* Native port in EDX - use dummy */
> + VMICALL VMI_UINT8 VMI_INB(VMI_UINT dummy, VMI_UINT port);
> + VMICALL VMI_UINT16 VMI_INW(VMI_UINT dummy, VMI_UINT port);
> + VMICALL VMI_UINT32 VMI_INL(VMI_UINT dummy, VMI_UINT port);
> +
> + VMICALL void VMI_OUTB(VMI_UINT value, VMI_UINT port);
> + VMICALL void VMI_OUTW(VMI_UINT value, VMI_UINT port);
> + VMICALL void VMI_OUTL(VMI_UINT value, VMI_UINT port);
> +
> + VMICALL void VMI_IODelay(void);
> + VMICALL void VMI_WBINVD(void);
> + VMICALL void VMI_SetIOPLMask(VMI_UINT32 mask);
> +
> + /* APIC CALLS */
> + VMICALL void VMI_APICWrite(void *reg, VMI_UINT32 value);
> + VMICALL VMI_UINT32 VMI_APICRead(void *reg);
> +
> + /* TIMER CALLS */
> + VMICALL VMI_NANOSECS VMI_GetWallclockTime(void);
> + VMICALL VMI_BOOL VMI_WallclockUpdated(void);
> +
> + /* Predefined rate of the wallclock. */
> + #define VMI_WALLCLOCK_HZ 1000000000
> +
> + VMICALL VMI_CYCLES VMI_GetCycleFrequency(void);
> + VMICALL VMI_CYCLES VMI_GetCycleCounter(VMI_UINT32 whichCounter);
> +
> + /* Defined cycle counters */
> + #define VMI_CYCLES_REAL 0
> + #define VMI_CYCLES_AVAILABLE 1
> + #define VMI_CYCLES_STOLEN 2
> +
> + VMICALL void VMI_SetAlarm(VMI_UINT32 flags, VMI_CYCLES expiry,
> + VMI_CYCLES period);
> + VMICALL VMI_BOOL VMI_CancelAlarm(VMI_UINT32 flags);
> +
> + /* The alarm interface 'flags' bits. [TBD: exact format of 'flags'] */
> + #define VMI_ALARM_COUNTER_MASK 0x000000ff
> +
> + #define VMI_ALARM_WIRED_IRQ0 0x00000000
> + #define VMI_ALARM_WIRED_LVTT 0x00010000
> +
> + #define VMI_ALARM_IS_ONESHOT 0x00000000
> + #define VMI_ALARM_IS_PERIODIC 0x00000100
> +
> + /* MMU CALLS */
> + VMICALL void VMI_SetLinearMapping(int slot, VMI_UINT32 va,
> + VMI_UINT32 pages, VMI_UINT32 ppn);
> +
> + /* The number of VMI address translation slot */
> + #define VMI_LINEAR_MAP_SLOTS 4
> +
> + VMICALL void VMI_InvalPage(VMI_UINT32 va);
> + VMICALL void VMI_FlushTLB(int how);
> +
> + /* Flags used by VMI_FlushTLB call */
> + #define VMI_FLUSH_TLB 0x01
> + #define VMI_FLUSH_GLOBAL 0x02
> +
> + #endif
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation
2006-03-13 22:49 ` Chris Wright
@ 2006-03-14 0:00 ` Zachary Amsden
2006-03-14 21:27 ` Chris Wright
0 siblings, 1 reply; 26+ messages in thread
From: Zachary Amsden @ 2006-03-14 0:00 UTC (permalink / raw)
To: Chris Wright
Cc: Linus Torvalds, Linux Kernel Mailing List,
Virtualization Mailing List, Xen-devel, Andrew Morton, Dan Hecht,
Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li,
Joshua LeVasseur, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy,
Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn
Chris Wright wrote:
Hi Chris, thank you for your comments. I've tried to answer as much as
I can - hopefully I found all your questions.
>> + guest operating systems. In the future, we envision that additional
>> + higher level abstractions will be added as an adjunct to the
>> + low-level API. These higher level abstractions will target large
>> + bulk operations such as creation, and destruction of address spaces,
>> + context switches, thread creation and control.
>>
>
> This is an area where in the past VMI hasn't been well-suited to support
> Xen. It's the higher level abstractions which make the performance
> story of paravirt compelling. I haven't made it through the whole
> patchset yet, but the bits you mention above as work to be done are
> certainly important to good performance.
>
For example, multicalls, which we support, and batched page table
operations, which we support, and vendor designed virtual devices, which
we support. What is unclear to me is why you need to keep pushing
higher up the stack to get more performance. If you could have any
higher level hypercall you wanted, what would it be? Most people say -
fork() / exec(). But why? You've just radically changed the way the
guest must operate its MMU, and you've radically constrained the way
page tables and memory management structures must be layed out by
putting a ton of commonality in their infrastructure that is shared by
the hypervisor and the kernel. You've likely vastly complicated the
design of a virtualized kernel that still runs on native hardware. But
what can you truly gain, that you can not gain from a simpler, less
complicated interface that just says -
Ok, I'm about to update a whole bunch of pages tables.
Ok, I'm done and I might want to use them now. Please make sure the
hardware TLB will be in sync.
Pushing up the stack with a higher level API is a serious consideration,
but only if you can show serious results from it. I'm not convinced
that you can actually hone in on anything /that isn't already a
performance problem on native kernels/. Consider, for example, that we
don't actually support remote TLB shootdown IPIs via VMI calls. Why is
this a performance problem? Well, very likely, those IPI shootdowns are
going to be synchronous. And if you don't co-schedule the CPUs in your
virtual machine, you might just have issued synchronous IPIs to VCPUs
that aren't even running. A serious performance problem.
Is it? Or is it really, just another case where the _native_ kernel can
be even more clever, and avoid doing those IPI shootdowns in the
firstplace? I've watched IPI shootdown in Linux get drastically better
in the 2.6 series of kernels, and see (anecdotal number quoting) maybe 4
or 5 of them in the course of a kernel compile. There is no longer a
giant performance boon to be gained here.
Similarly, you can almost argue the same thing with spinlocks - if you
really are seeing performance issues because of the wakeup of a
descheduled remote VPU, maybe you really need to think about moving that
lock off a hot path or using a better, lock free synchronization method.
I'm not arguing against these features - in fact, I think they can be
done in a way that doesn't intrude too much inside of the kernel. After
all, locks and IPIs tend to be part of the lower layer architecture
anyways. And they definitely do win back some of the background noise
introduced by virtualization. But if you decide to make the interface
more complicated, you really need to have an accurate measure of exactly
what you can gain by it to justify that complexity.
Personally, I'm all for making lock primitives and shootdowns an
_optional_ extension to the interface. As with many other relatively
straightforward and non-intrusive changes. I know some of you will
disagree with me, but I think a lot of what is being referred to as
"higher level" paravirtualization is really an attempt to solve
pre-existing problems in the performance of the underlying system.
There are advanced and useful things you can do with higher level
paravirtualization, but I am not convinced at all that incredible
performance gain is one of them.
> We do not want an interface which slows down the pace. We work with
> source and drop cruft as quickly as possible (referring to internal
> changes, not user-visible ABI changes here). Making changes that
> require a new guest for some significant performance gain is perfectly
> reasonable. What we want to avoid is making changes that require a
> new guest to simply boot. This is akin to rev'ing hardware w/out any
> backwards compatibility. This goal doesn't require VMI and ROMs, but
> I agree it requires clear interface definitions.
>
This is why we provide the minor / major interface numbers. Bump the
minor number, you get a new feature. Bump the required minor version in
the guest when it relies on that feature. Bump the major number when
you break compatibility. More on this below.
>
>> + VMI_DeliverInterrupts (For future debate)
>> +
>> + Enable and deliver any pending interrupts. This would remove
>> + the implicit delivery semantic from the SetInterruptMask and
>> + EnableInterrupts calls.
>>
>
> How do you keep forwards and backwards compat here? Guest that's coded
> to do simple implicit version would never get interrupts delivered on
> newer ROM?
>
This isn't part of the interface. If it were to be included, you could
do two things - bump the minor version, and add non-delivery semantic
enable and restore interrupt calls, or bump the major version and drop
the delivery semantic from the originals.
I agree this is pretty clumsy. Expect to see more discussion about
using annotations to expand the interface without breaking binary
compatibility, as well as providing more advanced feature control. I
wanted to integrate more advanced feature control / probing into this
version of the VMI, but there are so many possible ways to do it that it
would be much nicer to get feedback from the community on what is the
best interface.
>
>> + CPU CONTROL CALLS
>> +
>> + These calls encapsulate the set of privileged instructions used to
>> + manipulate the CPU control state. These instructions are all properly
>> + virtualizable using trap and emulate, but for performance reasons, a
>> + direct call may be more efficient. With hardware virtualization
>> + capabilities, many of these calls can be left as IDENT translations, that
>> + is, inline implementations of the native instructions, which are not
>> + rewritten by the hypervisor. Some of these calls are performance critical
>> + during context switch paths, and some are not, but they are all included
>> + for completeness, with the exceptions of the obsoleted LMSW and SMSW
>> + instructions.
>>
>
> Included just for completeness can be beginning of API bloat.
>
The design impact of this bloat is zero - if you don't want to implement
virtual methods for, say, debug register access - then you don't need to
do anything. You trap and emulate by default. If on the other hand,
you do want to hook them, you are welcome to. The hypervisor is free to
choose the design costs that are appropriate for their usage scenarios,
as is the kernel - it's not in the spec, but certainly is open for
debate that certain classes of instructions such as these need not even
be converted to VMI calls. We did implement all of these in Linux for
performance and symmetry.
>
> clts, setcr0, readcr0 are interrelated for typical use. is it expected
> the hypervisor uses consitent regsister (either native or shadowed)
> here, or is it meant to be undefined?
>
CLTS allows the elimination of an extra GetCR0 call, and they all
operate on the same (shadowed) register.
> Many of these will look the same on x86-64, but the API is not
> 64-bit clean so has to be duplicated.
>
Yes, register pressure forces the PAE API to be slightly different from
the long mode API. But long mode has different register calling
conventions anyway, so it is not a big deal. The important thing is,
once the MMU mess is sorted out, the same interface can be used from C
code for both platforms, and the details about which lock primitives are
used can be hidden. The cost of which lock primitives to use differs on
32-bit and 64-bit platforms, across vendor, and the style of the
hypervisor implementation (direct / writable / shadowed page tables).
>
>
>> + 85) VMI_SetDeferredMode
>>
>
> Is this the batching, multi-call analog?
>
Yes. This interface needs to be documented in a much better fashion.
But the idea is that VMI calls are mapped into Xen multicalls by
allowing deferred completion of certain classes of operations. That
same mode of deferred operation is used to batch PTE updates in our
implementation (although Xen uses writable page tables now, this used to
provide the same support facility in Xen as well). To complement this,
there is an explicit flush - and it turns out this maps very nicely,
getting rid of a lot of the XenoLinux changes around mmu_context.h.
>> +
>> + VMI_CYCLES 64 bit unsigned integer
>> + VMI_NANOSECS 64 bit unsigned integer
>>
>
> All caps typedefs are not very popular w.r.t. CodingStyle.
>
We know this. This is not a Linux interface. This is the API
documentation, meant to be considerably different in style. Where this
ugliness has crept into our Linux patches, I have been steadily removing
it and making them look nicer. But the vast difference in the style of
the doc is to avoid namespace collision.
>
>> + #define VMICALL __attribute__((regparm(3)))
>>
>
> I understand it's for ABI documentation, but in Linux it's FASTCALL.
>
Actually, FASTCALL is regparm(2), I think.
Cheers,
Zach
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation
2006-03-13 17:59 [RFC, PATCH 1/24] i386 Vmi documentation Zachary Amsden
2006-03-13 22:49 ` Chris Wright
@ 2006-03-14 4:11 ` Rik van Riel
2006-03-22 20:05 ` Andi Kleen
2 siblings, 0 replies; 26+ messages in thread
From: Rik van Riel @ 2006-03-14 4:11 UTC (permalink / raw)
To: Zachary Amsden
Cc: Linus Torvalds, Linux Kernel Mailing List,
Virtualization Mailing List, Xen-devel, Andrew Morton, Dan Hecht,
Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li,
Joshua LeVasseur, Chris Wright, Jyothy Reddy, Jack Lo, Kip Macy,
Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn
On Mon, 13 Mar 2006, Zachary Amsden wrote:
> + Zachary Amsden, Daniel Arai, Daniel Hecht, Pratap Subrahmanyam
> + Copyright (C) 2005, 2006, VMware, Inc.
> + All rights reserved
Btw, this copyright claim doesn't look very GPL compatible.
You might want to get that checked out.
--
All Rights Reversed
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation
2006-03-14 0:00 ` Zachary Amsden
@ 2006-03-14 21:27 ` Chris Wright
2006-03-14 22:29 ` Zachary Amsden
2006-03-16 1:16 ` Chris Wright
0 siblings, 2 replies; 26+ messages in thread
From: Chris Wright @ 2006-03-14 21:27 UTC (permalink / raw)
To: Zachary Amsden
Cc: Chris Wright, Linus Torvalds, Linux Kernel Mailing List,
Virtualization Mailing List, Xen-devel, Andrew Morton, Dan Hecht,
Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li,
Joshua LeVasseur, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy,
Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn
* Zachary Amsden (zach@vmware.com) wrote:
> Pushing up the stack with a higher level API is a serious consideration,
> but only if you can show serious results from it. I'm not convinced
> that you can actually hone in on anything /that isn't already a
> performance problem on native kernels/. Consider, for example, that we
> don't actually support remote TLB shootdown IPIs via VMI calls. Why is
> this a performance problem? Well, very likely, those IPI shootdowns are
> going to be synchronous. And if you don't co-schedule the CPUs in your
> virtual machine, you might just have issued synchronous IPIs to VCPUs
> that aren't even running. A serious performance problem.
>
> Is it? Or is it really, just another case where the _native_ kernel can
> be even more clever, and avoid doing those IPI shootdowns in the
> firstplace? I've watched IPI shootdown in Linux get drastically better
> in the 2.6 series of kernels, and see (anecdotal number quoting) maybe 4
> or 5 of them in the course of a kernel compile. There is no longer a
> giant performance boon to be gained here.
>
> Similarly, you can almost argue the same thing with spinlocks - if you
> really are seeing performance issues because of the wakeup of a
> descheduled remote VPU, maybe you really need to think about moving that
> lock off a hot path or using a better, lock free synchronization method.
>
> I'm not arguing against these features - in fact, I think they can be
> done in a way that doesn't intrude too much inside of the kernel. After
> all, locks and IPIs tend to be part of the lower layer architecture
> anyways. And they definitely do win back some of the background noise
> introduced by virtualization. But if you decide to make the interface
> more complicated, you really need to have an accurate measure of exactly
> what you can gain by it to justify that complexity.
Yes, I completely agree. Without specific performance numbers it's just
hand waving. To make it more concrete, I'll work on a compare/contrast
of the interfaces so we have specifics to discuss.
> >Included just for completeness can be beginning of API bloat.
>
> The design impact of this bloat is zero - if you don't want to implement
> virtual methods for, say, debug register access - then you don't need to
> do anything. You trap and emulate by default. If on the other hand,
> you do want to hook them, you are welcome to. The hypervisor is free to
> choose the design costs that are appropriate for their usage scenarios,
> as is the kernel - it's not in the spec, but certainly is open for
> debate that certain classes of instructions such as these need not even
> be converted to VMI calls. We did implement all of these in Linux for
> performance and symmetry.
Yup. Just noting that API without clear users is the type of thing that
is regularly rejected from Linux.
> >Many of these will look the same on x86-64, but the API is not
> >64-bit clean so has to be duplicated.
>
> Yes, register pressure forces the PAE API to be slightly different from
> the long mode API. But long mode has different register calling
> conventions anyway, so it is not a big deal. The important thing is,
> once the MMU mess is sorted out, the same interface can be used from C
> code for both platforms, and the details about which lock primitives are
> used can be hidden. The cost of which lock primitives to use differs on
> 32-bit and 64-bit platforms, across vendor, and the style of the
> hypervisor implementation (direct / writable / shadowed page tables).
My mistake, it makes perfect sense from ABI point of view.
> >Is this the batching, multi-call analog?
>
> Yes. This interface needs to be documented in a much better fashion.
> But the idea is that VMI calls are mapped into Xen multicalls by
> allowing deferred completion of certain classes of operations. That
> same mode of deferred operation is used to batch PTE updates in our
> implementation (although Xen uses writable page tables now, this used to
> provide the same support facility in Xen as well). To complement this,
> there is an explicit flush - and it turns out this maps very nicely,
> getting rid of a lot of the XenoLinux changes around mmu_context.h.
Are these valid differences? Or did I misunderstand the batching
mechanism?
1) can't use stack based args, so have to allocate each data structure,
which could conceivably fail unless it's some fixed buffer.
2) complicates the rom implementation slightly where implementation of
each deferrable part of the API needs to have switch (am I deferred or
not) to then build the batch, or make direct hypercall.
3) flushing in smp, have to be careful to manage simulataneous defers
and flushes from potentially multiple cpus in guest.
Doesn't seem these are showstoppers, just differences worth noting.
There aren't as many multicalls left in Xen these days anyway.
thanks,
-chris
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation
2006-03-14 21:27 ` Chris Wright
@ 2006-03-14 22:29 ` Zachary Amsden
2006-03-15 2:57 ` Chris Wright
2006-03-16 1:16 ` Chris Wright
1 sibling, 1 reply; 26+ messages in thread
From: Zachary Amsden @ 2006-03-14 22:29 UTC (permalink / raw)
To: Chris Wright
Cc: Andrew Morton, Joshua LeVasseur, Xen-devel, Pratap Subrahmanyam,
Wim Coekaerts, Jack Lo, Dan Hecht, Linux Kernel Mailing List,
Jan Beulich, Christopher Li, Virtualization Mailing List,
Linus Torvalds, Anne Holler, Jyothy Reddy, Kip Macy,
Ky Srinivasan, Leendert van Doorn, Dan Arai
Chris Wright wrote:
> Yup. Just noting that API without clear users is the type of thing that
> is regularly rejected from Linux.
>
Yes. It is becoming clear from feedback from you and Andi that there
are things in the API that are unnecessary for Linux. But keep in mind,
they may be necessary for other operating systems. I think we should
probably drop the Linux changes to issue things like RDTSC and such via
VMI call wrappers. It does simplify the Linux interface.
But I still think they should be part of the spec - an optional part of
the spec, that need not be implemented by Linux or even by the
hypervisor. If some vendor or kernel combination finds that they are a
performance concern, as they readily could become, they can drop in the
functionality when and if they need it. No reason to complicate things
on either end, but also no reason to purposely add asymmetry to the spec
just because the current set of calls is sufficient for the currently
known fast paths.
>
>>> Many of these will look the same on x86-64, but the API is not
>>> 64-bit clean so has to be duplicated.
>>>
>> Yes, register pressure forces the PAE API to be slightly different from
>> the long mode API. But long mode has different register calling
>> conventions anyway, so it is not a big deal. The important thing is,
>> once the MMU mess is sorted out, the same interface can be used from C
>> code for both platforms, and the details about which lock primitives are
>> used can be hidden. The cost of which lock primitives to use differs on
>> 32-bit and 64-bit platforms, across vendor, and the style of the
>> hypervisor implementation (direct / writable / shadowed page tables).
>>
>
> My mistake, it makes perfect sense from ABI point of view.
>
>
>>> Is this the batching, multi-call analog?
>>>
>> Yes. This interface needs to be documented in a much better fashion.
>> But the idea is that VMI calls are mapped into Xen multicalls by
>> allowing deferred completion of certain classes of operations. That
>> same mode of deferred operation is used to batch PTE updates in our
>> implementation (although Xen uses writable page tables now, this used to
>> provide the same support facility in Xen as well). To complement this,
>> there is an explicit flush - and it turns out this maps very nicely,
>> getting rid of a lot of the XenoLinux changes around mmu_context.h.
>>
>
> Are these valid differences? Or did I misunderstand the batching
> mechanism?
>
> 1) can't use stack based args, so have to allocate each data structure,
> which could conceivably fail unless it's some fixed buffer.
>
We use a fixed buffer that is private to our VMI layer. It's a per-cpu
packing struct for hypercalls. Dynamically allocating from the kernel
inside the interface layer is a really great way to get into a whole lot
of trouble.
> 2) complicates the rom implementation slightly where implementation of
> each deferrable part of the API needs to have switch (am I deferred or
> not) to then build the batch, or make direct hypercall.
>
This is an overhead that is easily absorbed by the gain. The direct
hypercalls are mostly either always direct, or always queued. The page
table updates already have conditional logic to do the right thing, and
Xen doesn't require the queueing of these anymore anyways. And the
flush happens at an explicit point. The best approach can still be fine
tuned. You could have separate VMI calls for queued vs. non-queued
operation. But that greatly bloats the interface and doesn't make sense
for everything. I believe the best solution is to annotate this in the
VMI call itself. Consider the VMI call number, not as an integer, but
as an identifier tuple. Perhaps I'm going overboard here. Perhaps not.
31--------24-23---------16-15--------8-7-----------0
| family | call number | reserved | annotation |
---------------------------------------------------
Now, you have multiple families of calls -
0x00 legacy
0x01 CPU
0x02 Segmentation
0x03 MMU
0xFF reserved for experimentation
And each family has children:
0x03 MMU:
0x00 SetPTE
0x01 SetLongPTE
0x02 FlushTLB
Now, lets say I add a new feature, and I don't want to redefine part of
the interface. Lets say that feature is queuing of hypercalls. I have
this private, annotation field as part of the identifier for each
hypercall - in effect, really just the hypercall number.
And I don't want to break binary compatibility of the interface. So
what I do is I define a new annotation that is specific to the affected
calls.
0x00 SetPTE
0x00 - no annotation
0x01 - may be queued !
Now, the hypercall isn't any different. Hypervisors which are unware of
the annotation treat it no differently. But hypervisors that support
PTE queuing recognize it as a hint and use it appropriately.
Queuing is a common enough optimization that it might even make sense to
have a bit set aside in the call ID for it. Having this type of static
annotation allows you to get rid of the dynamic concerns you have.
The really nice thing about defining your interface this way is you have
a hierarchy of different classes of the interface, with the ability to
add new classes, new calls within a class, and new annotations
(upgrades, if you will) or those calls.
And it provides a natural way to query for supported families of support
- do you support a virtual event channel? Should I do some extra work
to give you MMU hints or not? And you can add extra, optional
functionality on to existing call sites. Something vary useful, if you
say, realize that you want to add a hint field to one of your calls
without breaking the old interface or forcing another vendor into
complicating the ir hypervisor. Which is most of what
paravirtualization is anyway. Extra, optionally used hints about how
things are being used that allow the hypervisor implementation to avoid
making costly assumptions to ensure correctness under unknown constraints.
Is this worth threshing out more? I think so, since it does provide a
nice value proposition as well as overcoming the rather clumsy top level
versioning scheme.
Thanks again for your feedback,
Zach
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation
2006-03-14 22:29 ` Zachary Amsden
@ 2006-03-15 2:57 ` Chris Wright
2006-03-15 5:44 ` Zachary Amsden
2006-03-15 22:56 ` Daniel Arai
0 siblings, 2 replies; 26+ messages in thread
From: Chris Wright @ 2006-03-15 2:57 UTC (permalink / raw)
To: Zachary Amsden
Cc: Chris Wright, Linus Torvalds, Linux Kernel Mailing List,
Virtualization Mailing List, Xen-devel, Andrew Morton, Dan Hecht,
Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li,
Joshua LeVasseur, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy,
Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn
* Zachary Amsden (zach@vmware.com) wrote:
> >1) can't use stack based args, so have to allocate each data structure,
> >which could conceivably fail unless it's some fixed buffer.
>
> We use a fixed buffer that is private to our VMI layer. It's a per-cpu
> packing struct for hypercalls. Dynamically allocating from the kernel
> inside the interface layer is a really great way to get into a whole lot
> of trouble.
Heh, indeed that's why I asked. per-cpu buffer means ROM state knows
which vcpu is current. How is this done in OS agnostic method w/out
trapping to hypervisor? Some shared data that ROM and VMM know about,
and VMM updates as it schedules each vcpu?
> >2) complicates the rom implementation slightly where implementation of
> >each deferrable part of the API needs to have switch (am I deferred or
> >not) to then build the batch, or make direct hypercall.
>
> This is an overhead that is easily absorbed by the gain. The direct
> hypercalls are mostly either always direct, or always queued. The page
> table updates already have conditional logic to do the right thing, and
> Xen doesn't require the queueing of these anymore anyways. And the
> flush happens at an explicit point. The best approach can still be fine
> tuned. You could have separate VMI calls for queued vs. non-queued
> operation. But that greatly bloats the interface and doesn't make sense
> for everything. I believe the best solution is to annotate this in the
> VMI call itself. Consider the VMI call number, not as an integer, but
> as an identifier tuple. Perhaps I'm going overboard here. Perhaps not.
>
> 31--------24-23---------16-15--------8-7-----------0
> | family | call number | reserved | annotation |
> ---------------------------------------------------
I agree with your final assessment, needs more threshing out. It does
feel a bit overkill at first blush. I worry about these semantic
changes as an annotation instead of explicit API update. But I guess
we still have more work on finding the right actual interface, not just
the possible ways to annotate the calls.
thanks,
-chris
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation
2006-03-15 2:57 ` Chris Wright
@ 2006-03-15 5:44 ` Zachary Amsden
2006-03-15 22:56 ` Daniel Arai
1 sibling, 0 replies; 26+ messages in thread
From: Zachary Amsden @ 2006-03-15 5:44 UTC (permalink / raw)
To: Chris Wright
Cc: Linus Torvalds, Linux Kernel Mailing List,
Virtualization Mailing List, Xen-devel, Andrew Morton, Dan Hecht,
Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li,
Joshua LeVasseur, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy,
Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn
Chris Wright wrote:
> * Zachary Amsden (zach@vmware.com) wrote:
>
>>> 1) can't use stack based args, so have to allocate each data structure,
>>> which could conceivably fail unless it's some fixed buffer.
>>>
>> We use a fixed buffer that is private to our VMI layer. It's a per-cpu
>> packing struct for hypercalls. Dynamically allocating from the kernel
>> inside the interface layer is a really great way to get into a whole lot
>> of trouble.
>>
>
> Heh, indeed that's why I asked. per-cpu buffer means ROM state knows
> which vcpu is current. How is this done in OS agnostic method w/out
> trapping to hypervisor? Some shared data that ROM and VMM know about,
> and VMM updates as it schedules each vcpu?
>
Yes, we have private mappings per CPU. I don't think that is as
feasible on Xen, since it requires the hypervisor to support a per-CPU
PD shadow for each root. But alternative implementations are possible
using segmentation. The primary advantage is that you don't need to
call back from the interface layer to disable preemption for per-CPU
data access.
It turns out to be really easy if you add the loadsegment / savesegment
macros to the VMI interface, and require the kernel to abstain from
using, say, the GS segment. I think this is the path we are going down
for the VMI on Xen 3 port.
> I agree with your final assessment, needs more threshing out. It does
> feel a bit overkill at first blush. I worry about these semantic
> changes as an annotation instead of explicit API update. But I guess
> we still have more work on finding the right actual interface, not just
> the possible ways to annotate the calls.
>
Yes, lets focus on finding the right interface for now - and just leave
the door open a bit for the future.
Cheers,
Zach
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation
2006-03-15 2:57 ` Chris Wright
2006-03-15 5:44 ` Zachary Amsden
@ 2006-03-15 22:56 ` Daniel Arai
1 sibling, 0 replies; 26+ messages in thread
From: Daniel Arai @ 2006-03-15 22:56 UTC (permalink / raw)
To: Chris Wright
Cc: Zachary Amsden, Linus Torvalds, Linux Kernel Mailing List,
Virtualization Mailing List, Xen-devel, Andrew Morton, Dan Hecht,
Anne Holler, Pratap Subrahmanyam, Christopher Li,
Joshua LeVasseur, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy,
Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn
Chris Wright wrote:
> * Zachary Amsden (zach@vmware.com) wrote:
>
>>>1) can't use stack based args, so have to allocate each data structure,
>>>which could conceivably fail unless it's some fixed buffer.
>>
>>We use a fixed buffer that is private to our VMI layer. It's a per-cpu
>>packing struct for hypercalls. Dynamically allocating from the kernel
>>inside the interface layer is a really great way to get into a whole lot
>>of trouble.
>
>
> Heh, indeed that's why I asked. per-cpu buffer means ROM state knows
> which vcpu is current. How is this done in OS agnostic method w/out
> trapping to hypervisor? Some shared data that ROM and VMM know about,
> and VMM updates as it schedules each vcpu?
Each VCPU gets a private data area at the same linear address. The VMM
constructs private page table shadows for each VCPU, and the shadows magically
contain the right mappings for that VCPU's private data area.
Other hypervisor implementations (especially those that don't make use of shadow
page tables) would have to come up with something along the lines that you're
suggesting.
Dan.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation
2006-03-14 21:27 ` Chris Wright
2006-03-14 22:29 ` Zachary Amsden
@ 2006-03-16 1:16 ` Chris Wright
2006-03-16 3:40 ` Eli Collins
1 sibling, 1 reply; 26+ messages in thread
From: Chris Wright @ 2006-03-16 1:16 UTC (permalink / raw)
To: Chris Wright
Cc: Zachary Amsden, Linus Torvalds, Linux Kernel Mailing List,
Virtualization Mailing List, Xen-devel, Andrew Morton, Dan Hecht,
Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li,
Joshua LeVasseur, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy,
Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn
* Chris Wright (chrisw@sous-sol.org) wrote:
> Yes, I completely agree. Without specific performance numbers it's just
> hand waving. To make it more concrete, I'll work on a compare/contrast
> of the interfaces so we have specifics to discuss.
Here's a comparison of API's. In some cases there's trivial 1-to-1
mappings, and in other cases there's really no mapping. The mapping is
(loosely) annotated below the interface as [ VMI_foo(*) ]. The trailing
asterisk is meant to note the API maps at high-level, but the details
may make the mapping difficult (details such as VA vs. MFN, for example).
Thanks to Christian for doing the bulk of this comparison.
PROCESSOR STATE CALLS
- shared_info->vcpu_info[]->evtchn_upcall_mask
Enable/Disable interrupts and query whether interrupts are enabled or
disabled.
[ VMI_DisableInterrupts, VMI_EnabledInterrupts, VMI_GetInterruptMask,
VMI_SetInterruptMask ]
- shared_info->vcpu_info[]->evtchn_upcall_pending
Query if an interrupt is pending
[ ]
- force_evtchn_callback = HYPERVISOR_xen_version(0, NULL)
Deliver pending interrupts.
[ VMI_DeliverInterrupts ]
(EVENT CHANNEL, virtual interrupts)
- HYPERVISOR_event_channel_op(EVTCHNOP_alloc_unbound, ...)
Allocate a port in domain <dom> and mark as accepting interdomain
bindings from domain <remote_dom>. A fresh port is allocated in <dom>
and returned as <port>.
[ ]
- HYPERVISOR_event_channel_op(EVTCHNOP_bind_interdomain, ...)
Construct an interdomain event channel between the calling domain and
<remote_dom>. <remote_dom,remote_port> must identify a port that is
unbound and marked as accepting bindings from the calling domain. A fresh
port is allocated in the calling domain and returned as <local_port>.
[ ]
- HYPERVISOR_event_channel_op(EVTCHNOP_bind_virq, ...)
Bind a local event channel to VIRQ <irq> on specified vcpu.
[ ]
- HYPERVISOR_event_channel_op(EVTCHNOP_bind_pirq, ...)
Bind a local event channel to PIRQ <irq>.
[ PIC programming* ]
- HYPERVISOR_event_channel_op(EVTCHNOP_bind_ipi, ...)
Bind a local event channel to receive events.
[ ]
- HYPERVISOR_event_channel_op(EVTCHNOP_close, ...)
Close a local event channel <port>. If the channel is interdomain then
the remote end is placed in the unbound state (EVTCHNSTAT_unbound),
awaiting a new connection.
[ ]
- HYPERVISOR_event_channel_op(EVTCHNOP_send, ...)
Send an event to the remote end of the channel whose local endpoint is <port>.
[ ]
- HYPERVISOR_event_channel_op(EVTCHNOP_status, ...)
Get the current status of the communication channel which has an endpoint
at <dom, port>.
[ ]
- HYPERVISOR_event_channel_op(EVTCHNOP_bind_vcpu, ...)
Specify which vcpu a channel should notify when an event is pending.
[ ]
- HYPERVISOR_event_channel_op(EVTCHNOP_unmask, ...)
Unmask the specified local event-channel port and deliver a notification
to the appropriate VCPU if an event is pending.
[ ]
- HYPERVISOR_sched_op(SCHEDOP_yield, ...)
Voluntarily yield the CPU.
[ VMI_Pause ]
- HYPERVISOR_sched_op(SCHEDOP_block, ...)
Block execution of this VCPU until an event is received for processing.
If called with event upcalls masked, this operation will atomically
reenable event delivery and check for pending events before blocking the
VCPU. This avoids a "wakeup waiting" race.
Periodic timer interrupts are not delivered when guest is blocked,
except for explicit timer events setup with HYPERVISOR_set_timer_op.
[ VMI_Halt ]
- HYPERVISOR_sched_op(SCHEDOP_shutdown, ...)
Halt execution of this domain (all VCPUs) and notify the system controller.
[ VMI_Shutdown, VMI_Reboot ]
- HYPERVISOR_sched_op(SCHEDOP_shutdown, SHUTDOWN_suspend, ...)
Clean up, save suspend info, kill
[ ]
- HYPERVISOR_sched_op_new(SCHEDOP_poll, ...)
Poll a set of event-channel ports. Return when one or more are pending. An
optional timeout may be specified.
[ ]
- HYPERVISOR_vcpu_op(VCPUOP_initialise, ...)
Initialise a VCPU. Each VCPU can be initialised only once. A
newly-initialised VCPU will not run until it is brought up by VCPUOP_up.
[ VMI_SetInitialAPState ]
- HYPERVISOR_vcpu_op(VCPUOP_up, ...)
Bring up a VCPU. This makes the VCPU runnable. This operation will fail
if the VCPU has not been initialised (VCPUOP_initialise).
[ ]
- HYPERVISOR_vcpu_op(VCPUOP_down, ...)
Bring down a VCPU (i.e., make it non-runnable).
There are a few caveats that callers should observe:
1. This operation may return, and VCPU_is_up may return false, before the
VCPU stops running (i.e., the command is asynchronous). It is a good
idea to ensure that the VCPU has entered a non-critical loop before
bringing it down. Alternatively, this operation is guaranteed
synchronous if invoked by the VCPU itself.
2. After a VCPU is initialised, there is currently no way to drop all its
references to domain memory. Even a VCPU that is down still holds
memory references via its pagetable base pointer and GDT. It is good
practise to move a VCPU onto an 'idle' or default page table, LDT and
GDT before bringing it down.
[ ]
- HYPERVISOR_vcpu_op(VCPUOP_is_up, ...)
Returns 1 if the given VCPU is up.
[ ]
- HYPERVISOR_vcpu_op(VCPUOP_get_runstate_info, ...)
Return information about the state and running time of a VCPU.
[ ]
- HYPERVISOR_vcpu_op(VCPUOP_register_runstate_memory_area, ...)
Register a shared memory area from which the guest may obtain its own
runstate information without needing to execute a hypercall.
Notes:
1. The registered address may be virtual or physical, depending on the
platform. The virtual address should be registered on x86 systems.
2. Only one shared area may be registered per VCPU. The shared area is
updated by the hypervisor each time the VCPU is scheduled. Thus
runstate.state will always be RUNSTATE_running and
runstate.state_entry_time will indicate the system time at which the
VCPU was last scheduled to run.
[ ]
DESCRIPTOR RELATED CALLS
- HYPERVISOR_set_gdt(unsigned long *frame_list, int entries)
Load the global descriptor table.
For non-shadow-translate mode guests, the frame_list is a list of
machine pages which contain the gdt.
[ VMI_SetGDT* ]
- HYPERVISOR_set_trap_table(struct trap_info *table)
Load the interrupt descriptor table.
The trap table is in a format which allows easier access from C code.
It's easier to build and easier to use in software trap despatch code.
It can easily be converted into a hardware interrupt descriptor table.
[ VMI_SetIDT, VMI_WriteIDTEntry ]
- HYPERVISOR_mmuext_op(MMUEXT_SET_LDT, ...)
Load local descriptor table.
linear_addr: Linear address of LDT base (NB. must be page-aligned).
nr_ents: Number of entries in LDT.
[ VMI_SetLDT* ]
- HYPERVISOR_update_descriptor(u64 pa, u64 desc)
Write a descriptor to a GDT or LDT.
For non-shadow-translate mode guests, the address is a machine address.
[ VMI_WriteGDTEntry*, VMI_WriteLDTEntry* ]
CPU CONTROL CALLS
- HYPERVISOR_mmuext_op(MMUEXT_NEW_BASEPTR, ...)
Write cr3 register.
[ VMI_SetCR3* ]
- shared_info->vcpu_info[]->arch->cr2
Read cr2 register.
[ VMI_GetCR2 ]
- HYPERVISOR_fpu_taskswitch(0)
Clear the taskswitch flag in control register 0.
[ VMI_CLTS ]
- HYPERVISOR_fpu_taskswitch(1)
Set the taskswitch flag in control register 0.
[ VMI_SetCR0* ]
- HYPERVISOR_set_debugreg(int reg, unsigned long value)
Write debug register.
[ VMI_SetDR ]
- HYPERVISOR_get_debugreg(int reg)
Read debug register.
[ VMI_GetDR ]
PROCESSOR INFORMATION CALLS
STACK / PRIVILEGE TRANSITION CALLS
- HYPERVISOR_stack_switch(unsigned long ss, unsigned long esp)
Set the ring1 stack pointer/segment to use when switching to ring1
from ring3.
[ VMI_UpdateKernelStack ]
- HYPERVISOR_iret
[ VMI_IRET ]
I/O CALLS
- HYPERVISOR_physdev_op(PHYSDEVOP_SET_IOPL, ...)
Set the IOPL mask.
[ VMI_SetIOPLMask ]
- HYPERVISOR_mmuext_op(MMUEXT_FLUSH_CACHE)
No additional arguments. Writes back and flushes cache contents.
(Can just trap and emulate here).
[ VMI_WBINVD ]
- HYPERVISOR_physdev_op(PHYSDEVOP_IRQ_UNMASK_NOTIFY, ...)
Advertise unmask of physical interrupt to hypervisor.
[ ]
- HYPERVISOR_physdev_op(PHYSDEVOP_IRQ_STATUS_QUERY,...)
Query if physical interrupt needs unmaks notify.
[ ]
- HYPERVISOR_physdev_op(PHYSDEVOP_SET_IOBITMAP, ...)
Set IO bitmap for guest.
[ ]
- HYPERVISOR_physdev_op(PHYSDEVOP_APIC_READ, ...)
Read IO-APIC register.
[ ]
- HYPERVISOR_physdev_op(PHYSDEVOP_APIC_WRITE, ...)
Write IO-APIC register.
[ ]
- HYPERVISOR_physdev_op(PHYSDEVOP_ASSIGN_VECTOR, ...)
Assign vector to interrupt.
[ ]
APIC CALLS
TIMER CALLS
- HYPERVISOR_set_timer_op(...)
Set timeout when to trigger timer interrupt even if guest is blocked.
MMU CALLS
- HYPERVISOR_mmuext_op(MMUEXT_(UN)PIN_*_TABLE
mfn: Machine frame number to be (un)pinned as a p.t. page.
[ RegisterPageType* ]
- HYPERVISOR_mmuext_op(MMUEXT_TLB_FLUSH_LOCAL)
No additional arguments. Flushes local TLB.
[ VMI_FlushTLB ]
- HYPERVISOR_mmuext_op(MMUEXT_INVLPG_LOCAL)
linear_addr: Linear address to be flushed from the local TLB.
[ VMI_InvalPage ]
- HYPERVISOR_mmuext_op(MMUEXT_TLB_FLUSH_MULTI)
vcpumask: Pointer to bitmap of VCPUs to be flushed.
- HYPERVISOR_mmuext_op(MMUEXT_INVLPG_MULTI)
linear_addr: Linear address to be flushed.
vcpumask: Pointer to bitmap of VCPUs to be flushed.
- HYPERVISOR_mmuext_op(MMUEXT_TLB_FLUSH_ALL)
No additional arguments. Flushes all VCPUs' TLBs.
- HYPERVISOR_mmuext_op(MMUEXT_INVLPG_ALL)
linear_addr: Linear address to be flushed from all VCPUs' TLBs.
- HYPERVISOR_update_va_mapping(...)
Update pagetable entry mapping a given virtual address.
Avoids having to map the pagetable page in the hypervisor by using
a linear pagetable mapping. Also flush the TLB if requested.
[ ]
- HYPERVISOR_mmu_update(MMU_NORMAL_PT_UPDATE, ...)
Update an entry in a page table.
[ VMI_SetPte* ]
- HYPERVISOR_mmu_update(MMU_MACHPHYS_UPDATE, ...)
Update machine -> phys table entry.
[ no machine -> phys in VMI ]
MEMORY
- HYPERVISOR_memory_op(XENMEM_increase_reservation, ...)
Increase number of frames
[ ]
- HYPERVISOR_memory_op(XENMEM_decrease_reservation, ...)
Drop frames from reservation
[ ]
- HYPERVISOR_memory_op(XENMEM_populate_physmap, ...)
[ ]
- HYPERVISOR_memory_op(XENMEM_maximum_ram_page, ...)
Get maximum MFN of mapped RAM in domain
[ ]
- HYPERVISOR_memory_op(XENMEM_current_reservation, ...)
Get current memory reservation (in pags) of domain
[ ]
- HYPERVISOR_memory_op(XENMEM_maximum_reservation, ...)
Get maximum memory reservation (in pags) of domain
[ ]
MISC
- HYPERVISOR_console_io()
read/write to console (privileged)
- HYPERVISOR_xen_version(XENVER_version, NULL)
Return major:minor (16:16).
- HYPERVISOR_xen_version(XENVER_extraversion)
Return extra version (-unstable, .subminor)
- HYPERVISOR_xen_version(XENVER_compile_info)
Return hypervisor compile information.
- HYPERVISOR_xen_version(XENVER_capabilities)
Return list of supported guest interfaces.
- HYPERVISOR_xen_version(XENVER_platform_parameters)
Return information about the platform.
- HYPERVISOR_xen_version(XENVER_get_features)
Return feature maps.
- HYPERVISOR_set_callbacks
Set entry points for upcalls to the guest from the hypervisor.
Used for event delivery and fatal condition notification.
- HYPERVISOR_vm_assist(VMASST_TYPE_4gb_segments)
Enable emulation of wrap around segments.
- HYPERVISOR_vm_assist(VMASST_TYPE_4gb_segments_notify)
Enable notification on wrap around segment event.
- HYPERVISOR_vm_assist(VMASST_TYPE_writable_pagetables)
Enable writable pagetables.
- HYPERVISOR_nmi_op(XENNMI_register_callback)
Register NMI callback for this (calling) VCPU. Currently this only makes
sense for domain 0, vcpu 0. All other callers will be returned EINVAL.
- HYPERVISOR_nmi_op(XENNMI_unregister_callback)
Deregister NMI callback for this (calling) VCPU.
- HYPERVISOR_multicall
Execute batch of hypercalls.
[VMI_SetDeferredMode*, VMI_FlushDeferredCalls*]
There are some more management specific operations for dom0 and security
that are arguably beyond the scope of this comparison.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation
2006-03-16 1:16 ` Chris Wright
@ 2006-03-16 3:40 ` Eli Collins
0 siblings, 0 replies; 26+ messages in thread
From: Eli Collins @ 2006-03-16 3:40 UTC (permalink / raw)
To: Chris Wright
Cc: Xen-devel, Wim Coekaerts, Christopher Li,
Linux Kernel Mailing List, Virtualization Mailing List,
Linus Torvalds, Anne Holler, Jan Beulich, Jyothy Reddy, Kip Macy,
Ky Srinivasan, Leendert van Doorn
Chris Wright wrote:
> * Chris Wright (chrisw@sous-sol.org) wrote:
<snip>
> - HYPERVISOR_event_channel_op(EVTCHNOP_send, ...)
>
> Send an event to the remote end of the channel whose local endpoint is <port>.
>
> [ ]
VMI_APICWrite is used to send IPIs. In general all the event channel
calls (modulo referencing other guests) are not needed when using a
virtual APIC. Using calls rather than a struct shared between the
hypervisor and the guest is a cleaner interface (no messy changes to
entry.S) and easier to maintain and version. This is true of
shared_info_t in general, not just the event channel.
>
> - HYPERVISOR_vcpu_op(VCPUOP_get_runstate_info, ...)
>
> Return information about the state and running time of a VCPU.
>
> [ ]
See the VMI timer interface. Note that the runstate interface above was
added recently after Dan Hecht pointed out the need for properly
paravirtualizing time (reporting stolen time correctly), the Xen 3.0.0/1
interfaces do not include runstate info.
http://lists.xensource.com/archives/html/xen-devel/2006-02/msg00836.html
It's too bad that Xen's vcpu_time_info_t presents the guest with the
variables used to calculate time rather than time itself; requiring that
the guest calculate time complicates the Linux patches and constrains
future changes to time calculation in the hypervisor.
> - HYPERVISOR_set_trap_table(struct trap_info *table)
>
> Load the interrupt descriptor table.
>
> The trap table is in a format which allows easier access from C code.
> It's easier to build and easier to use in software trap despatch code.
> It can easily be converted into a hardware interrupt descriptor table.
>
> [ VMI_SetIDT, VMI_WriteIDTEntry ]
Passing in trap_info structs (like much of the Xen interface) requires
copying to/from the guest when it's not necessary. To handle VT/Pacifica
Xen needs to understand the hardware table format anyway, so it's
simpler to just use the hardware format.
> - HYPERVISOR_set_timer_op(...)
>
> Set timeout when to trigger timer interrupt even if guest is blocked.
See VMI_SetAlarm and VMI_CancelAlarm.
> - HYPERVISOR_memory_op(XENMEM_increase_reservation, ...)
>
> Increase number of frames
>
> [ ]
>
> - HYPERVISOR_memory_op(XENMEM_decrease_reservation, ...)
>
> Drop frames from reservation
>
> [ ]
Ballooning for VMI guests is currently handled by a driver which uses a
special port in the virtual IO space.
The Xen increase reservation interface would be nicer if it took the
pfns that the guest gave up as an argument (better for this logic to be
in the balloon driver than the hypervisor). Relying on the hypervisor's
allocator to get contiguous pages is also gross. From what I can tell
extent_order is always 0 in XenLinux, an interface that just took a list
of pages would be simpler.
> - HYPERVISOR_xen_version(XENVER_compile_info)
>
> Return hypervisor compile information.
This kind of information seems gratuitous.
> - HYPERVISOR_set_callbacks
>
> Set entry points for upcalls to the guest from the hypervisor.
> Used for event delivery and fatal condition notification.
In the VMI "events" are just interrupts, delivered via the virtual IDT.
> - HYPERVISOR_nmi_op(XENNMI_register_callback)
>
> Register NMI callback for this (calling) VCPU. Currently this only makes
> sense for domain 0, vcpu 0. All other callers will be returned EINVAL.
Like the event callback, this could be integrated into the virtual IDT.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation
@ 2006-03-16 3:40 ` Eli Collins
0 siblings, 0 replies; 26+ messages in thread
From: Eli Collins @ 2006-03-16 3:40 UTC (permalink / raw)
To: Chris Wright
Cc: Christopher Li, Xen-devel, Linux Kernel Mailing List, Jan Beulich,
Wim Coekaerts, Virtualization Mailing List, Linus Torvalds,
Anne Holler, Jyothy Reddy, Kip Macy, Ky Srinivasan,
Leendert van Doorn
Chris Wright wrote:
> * Chris Wright (chrisw@sous-sol.org) wrote:
<snip>
> - HYPERVISOR_event_channel_op(EVTCHNOP_send, ...)
>
> Send an event to the remote end of the channel whose local endpoint is <port>.
>
> [ ]
VMI_APICWrite is used to send IPIs. In general all the event channel
calls (modulo referencing other guests) are not needed when using a
virtual APIC. Using calls rather than a struct shared between the
hypervisor and the guest is a cleaner interface (no messy changes to
entry.S) and easier to maintain and version. This is true of
shared_info_t in general, not just the event channel.
>
> - HYPERVISOR_vcpu_op(VCPUOP_get_runstate_info, ...)
>
> Return information about the state and running time of a VCPU.
>
> [ ]
See the VMI timer interface. Note that the runstate interface above was
added recently after Dan Hecht pointed out the need for properly
paravirtualizing time (reporting stolen time correctly), the Xen 3.0.0/1
interfaces do not include runstate info.
http://lists.xensource.com/archives/html/xen-devel/2006-02/msg00836.html
It's too bad that Xen's vcpu_time_info_t presents the guest with the
variables used to calculate time rather than time itself; requiring that
the guest calculate time complicates the Linux patches and constrains
future changes to time calculation in the hypervisor.
> - HYPERVISOR_set_trap_table(struct trap_info *table)
>
> Load the interrupt descriptor table.
>
> The trap table is in a format which allows easier access from C code.
> It's easier to build and easier to use in software trap despatch code.
> It can easily be converted into a hardware interrupt descriptor table.
>
> [ VMI_SetIDT, VMI_WriteIDTEntry ]
Passing in trap_info structs (like much of the Xen interface) requires
copying to/from the guest when it's not necessary. To handle VT/Pacifica
Xen needs to understand the hardware table format anyway, so it's
simpler to just use the hardware format.
> - HYPERVISOR_set_timer_op(...)
>
> Set timeout when to trigger timer interrupt even if guest is blocked.
See VMI_SetAlarm and VMI_CancelAlarm.
> - HYPERVISOR_memory_op(XENMEM_increase_reservation, ...)
>
> Increase number of frames
>
> [ ]
>
> - HYPERVISOR_memory_op(XENMEM_decrease_reservation, ...)
>
> Drop frames from reservation
>
> [ ]
Ballooning for VMI guests is currently handled by a driver which uses a
special port in the virtual IO space.
The Xen increase reservation interface would be nicer if it took the
pfns that the guest gave up as an argument (better for this logic to be
in the balloon driver than the hypervisor). Relying on the hypervisor's
allocator to get contiguous pages is also gross. From what I can tell
extent_order is always 0 in XenLinux, an interface that just took a list
of pages would be simpler.
> - HYPERVISOR_xen_version(XENVER_compile_info)
>
> Return hypervisor compile information.
This kind of information seems gratuitous.
> - HYPERVISOR_set_callbacks
>
> Set entry points for upcalls to the guest from the hypervisor.
> Used for event delivery and fatal condition notification.
In the VMI "events" are just interrupts, delivered via the virtual IDT.
> - HYPERVISOR_nmi_op(XENNMI_register_callback)
>
> Register NMI callback for this (calling) VCPU. Currently this only makes
> sense for domain 0, vcpu 0. All other callers will be returned EINVAL.
Like the event callback, this could be integrated into the virtual IDT.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation
2006-03-13 17:59 [RFC, PATCH 1/24] i386 Vmi documentation Zachary Amsden
2006-03-13 22:49 ` Chris Wright
2006-03-14 4:11 ` Rik van Riel
@ 2006-03-22 20:05 ` Andi Kleen
2006-03-22 21:34 ` Chris Wright
` (2 more replies)
2 siblings, 3 replies; 26+ messages in thread
From: Andi Kleen @ 2006-03-22 20:05 UTC (permalink / raw)
To: virtualization
Cc: Zachary Amsden, Linus Torvalds, Linux Kernel Mailing List,
Xen-devel, Andrew Morton, Dan Hecht, Dan Arai, Anne Holler,
Pratap Subrahmanyam, Christopher Li, Joshua LeVasseur,
Chris Wright, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy,
Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn
On Monday 13 March 2006 18:59, Zachary Amsden wrote:
> + The general mechanism for providing customized features and
> + capabilities is to provide notification of these feature through
> + the CPUID call,
How should that work since CPUID cannot be intercepted by
a Hypervisor (without VMX/SVM)?
> + Watchdog NMIs are of limited use if the OS is
> + already correct and running on stable hardware;
So how would your Hypervisor detect a kernel hung with interrupts
off then?
>> profiling NMIs are
> + similarly of less use, since this task is accomplished with more accuracy
> + in the VMM itself
And how does oprofile know about this?
> ; and NMIs for machine check errors should be handled
> + outside of the VM.
Right now yes, but if we ever implement intelligent memory ECC error handling it's questionable
the hypervisor can do a better job. It has far less information about how memory
is used than the kernel.
> + The net result of these choices is that most of the calls are very
> + easy to make from C-code, and calls that are likely to be required in
> + low level trap handling code are easy to call from assembler. Most
> + of these calls are also very easily implemented by the hypervisor
> + vendor in C code, and only the performance critical calls from
> + assembler paths require custom assembly implementations.
> +
> + CORE INTERFACE CALLS
Did I miss it or do you never describe how to find these entry points?
> + VMI_EnableInterrupts
> +
> + VMICALL void VMI_EnableInterrupts(void);
> +
> + Enable maskable interrupts on the processor. Note that the
> + current implementation always will deliver any pending interrupts
> + on a call which enables interrupts, for compatibility with kernel
> + code which expects this behavior. Whether this should be required
> + is open for debate.
A subtle trap is also that it will do so on the next instruction, not the
followon to next like a real x86. At some point there was code in Linux
that dependend on this.
> + VMICALL VMI_UINT64 VMI_RDMSR(VMI_UINT64 dummy, VMI_UINT32 reg);
> +
> + Read from a model specific register. This functions identically to the
> + hardware RDMSR instruction. Note that a hypervisor may not implement
> + the full set of MSRs supported by native hardware, since many of them
> + are not useful in the context of a virtual machine.
So what happens when the kernel tries to access an unimplemented MSR?
Also we have had occasionally workarounds in the past that required
MSR writes with magic "passwords". How would these be handled?
+
> + VMI_CPUID
> +
> + /* Not expressible as a C function */
> +
> + The CPUID instruction provides processor feature identification in a
> + vendor specific manner. The instruction itself is non-virtualizable
> + without hardware support, requiring a hypervisor assisted CPUID call
> + that emulates the effect of the native instruction, while masking any
> + unsupported CPU feature bits.
Doesn't seem to be very useful because everybody can just call CPUID directly.
> + The RDTSC instruction provides a cycles counter which may be made
> + visible to userspace. For better or worse, many applications have made
> + use of this feature to implement userspace timers, database indices, or
> + for micro-benchmarking of performance. This instruction is extremely
> + problematic for virtualization, because even though it is selectively
> + virtualizable using trap and emulate, it is much more expensive to
> + virtualize it in this fashion. On the other hand, if this instruction
> + is allowed to execute without trapping, the cycle counter provided
> + could be wrong in any number of circumstances due to hardware drift,
> + migration, suspend/resume, CPU hotplug, and other unforeseen
> + consequences of running inside of a virtual machine. There is no
> + standard specification for how this instruction operates when issued
> + from userspace programs, but the VMI call here provides a proper
> + interface for the kernel to read this cycle counter.
Yes, but it will be wrong in a native kernel too so why do you want
to be better than native?
Seems useless to me.
> + VMI_RDPMC
> +
> + VMICALL VMI_UINT64 VMI_RDPMC(VMI_UINT64 dummy, VMI_UINT32 counter);
> +
> + Similar to RDTSC, this call provides the functionality of reading
> + processor performance counters. It also is selectively visible to
> + userspace, and maintaining accurate data for the performance counters
> + is an extremely difficult task due to the side effects introduced by
> + the hypervisor.
Similar.
Overall feeling is you have far too many calls. You seem to try to implement
a full x86 replacement, but that makes it big and likely to be buggy. And
it's likely impossible to implement in any Hypervisor short of a full emulator
like yours.
I would try a diet and only implement facilities that are actually likely
to be used by modern OS.
There was one other point I wanted to make but I forgot it now @)
-Andi
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation
2006-03-22 21:34 ` Chris Wright
@ 2006-03-22 21:13 ` Andi Kleen
2006-03-22 21:57 ` Chris Wright
2006-03-23 0:06 ` Zachary Amsden
0 siblings, 2 replies; 26+ messages in thread
From: Andi Kleen @ 2006-03-22 21:13 UTC (permalink / raw)
To: Chris Wright
Cc: virtualization, Zachary Amsden, Linus Torvalds,
Linux Kernel Mailing List, Xen-devel, Andrew Morton, Dan Hecht,
Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li,
Joshua LeVasseur, Chris Wright, Rik Van Riel, Jyothy Reddy,
Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts,
Leendert van Doorn
On Wednesday 22 March 2006 22:34, Chris Wright wrote:
> * Andi Kleen (ak@suse.de) wrote:
> > On Monday 13 March 2006 18:59, Zachary Amsden wrote:
> >
> > > + The general mechanism for providing customized features and
> > > + capabilities is to provide notification of these feature through
> > > + the CPUID call,
> >
> > How should that work since CPUID cannot be intercepted by
> > a Hypervisor (without VMX/SVM)?
>
> Yeah, it requires guest kernel cooperation/modification.
Even then it's useless for many flags because any user program can (and will)
call CPUID directly.
> > > + The net result of these choices is that most of the calls are very
> > > + easy to make from C-code, and calls that are likely to be required in
> > > + low level trap handling code are easy to call from assembler. Most
> > > + of these calls are also very easily implemented by the hypervisor
> > > + vendor in C code, and only the performance critical calls from
> > > + assembler paths require custom assembly implementations.
> > > +
> > > + CORE INTERFACE CALLS
> >
> > Did I miss it or do you never describe how to find these entry points?
>
> It's the ROM interface. For native they are emitted directly inline.
> For non-native, they are emitted as call stubs, which call to the ROM.
> I don't recall if it's in this doc, but the inline patch has all the
> gory details.
Sure the point was if they write this long fancy document why stop
at documenting the last 5%?
-Andi
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation
2006-03-22 20:05 ` Andi Kleen
@ 2006-03-22 21:34 ` Chris Wright
2006-03-22 21:13 ` Andi Kleen
2006-03-22 21:39 ` [RFC, PATCH 1/24] i386 Vmi documentation II Andi Kleen
2006-03-22 22:04 ` [RFC, PATCH 1/24] i386 Vmi documentation Zachary Amsden
2 siblings, 1 reply; 26+ messages in thread
From: Chris Wright @ 2006-03-22 21:34 UTC (permalink / raw)
To: Andi Kleen
Cc: virtualization, Zachary Amsden, Linus Torvalds,
Linux Kernel Mailing List, Xen-devel, Andrew Morton, Dan Hecht,
Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li,
Joshua LeVasseur, Chris Wright, Rik Van Riel, Jyothy Reddy,
Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts,
Leendert van Doorn
* Andi Kleen (ak@suse.de) wrote:
> On Monday 13 March 2006 18:59, Zachary Amsden wrote:
>
> > + The general mechanism for providing customized features and
> > + capabilities is to provide notification of these feature through
> > + the CPUID call,
>
> How should that work since CPUID cannot be intercepted by
> a Hypervisor (without VMX/SVM)?
Yeah, it requires guest kernel cooperation/modification.
> > + The net result of these choices is that most of the calls are very
> > + easy to make from C-code, and calls that are likely to be required in
> > + low level trap handling code are easy to call from assembler. Most
> > + of these calls are also very easily implemented by the hypervisor
> > + vendor in C code, and only the performance critical calls from
> > + assembler paths require custom assembly implementations.
> > +
> > + CORE INTERFACE CALLS
>
> Did I miss it or do you never describe how to find these entry points?
It's the ROM interface. For native they are emitted directly inline.
For non-native, they are emitted as call stubs, which call to the ROM.
I don't recall if it's in this doc, but the inline patch has all the
gory details.
thanks,
-chris
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation II
2006-03-22 20:05 ` Andi Kleen
2006-03-22 21:34 ` Chris Wright
@ 2006-03-22 21:39 ` Andi Kleen
2006-03-22 22:43 ` Daniel Arai
2006-03-22 22:45 ` Zachary Amsden
2006-03-22 22:04 ` [RFC, PATCH 1/24] i386 Vmi documentation Zachary Amsden
2 siblings, 2 replies; 26+ messages in thread
From: Andi Kleen @ 2006-03-22 21:39 UTC (permalink / raw)
To: virtualization
Cc: Zachary Amsden, Linus Torvalds, Linux Kernel Mailing List,
Xen-devel, Andrew Morton, Dan Hecht, Dan Arai, Anne Holler,
Pratap Subrahmanyam, Christopher Li, Joshua LeVasseur,
Chris Wright, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy,
Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn
> There was one other point I wanted to make but I forgot it now @)
Ah yes the point was that since most of the implementations of the hypercalls
likely need fast access to some per CPU state. How would you plan
to implement that? Should it be covered in the specification?
-Andi
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation
2006-03-22 21:13 ` Andi Kleen
@ 2006-03-22 21:57 ` Chris Wright
2006-03-23 0:06 ` Zachary Amsden
1 sibling, 0 replies; 26+ messages in thread
From: Chris Wright @ 2006-03-22 21:57 UTC (permalink / raw)
To: Andi Kleen
Cc: Chris Wright, virtualization, Zachary Amsden, Linus Torvalds,
Linux Kernel Mailing List, Xen-devel, Andrew Morton, Dan Hecht,
Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li,
Joshua LeVasseur, Chris Wright, Rik Van Riel, Jyothy Reddy,
Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts,
Leendert van Doorn
* Andi Kleen (ak@suse.de) wrote:
> Even then it's useless for many flags because any user program can (and will)
> call CPUID directly.
Yes, doesn't handle userspace at all. It's useful only to get coherent
view of flags in kernel. Right now, for example, Xen goes in and
basically masks off flags retroactively which is not that nice either.
thanks,
-chris
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation
2006-03-22 22:04 ` [RFC, PATCH 1/24] i386 Vmi documentation Zachary Amsden
@ 2006-03-22 21:58 ` Andi Kleen
0 siblings, 0 replies; 26+ messages in thread
From: Andi Kleen @ 2006-03-22 21:58 UTC (permalink / raw)
To: Zachary Amsden
Cc: virtualization, Linus Torvalds, Linux Kernel Mailing List,
Xen-devel, Andrew Morton, Dan Hecht, Dan Arai, Anne Holler,
Pratap Subrahmanyam, Christopher Li, Joshua LeVasseur,
Chris Wright, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy,
Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn
On Wednesday 22 March 2006 23:04, Zachary Amsden wrote:
>
> It doesn't. But consider that oprofile is a time based NMI sampler.
That's one of its modes, mostly used for people with broken APICs.
But the primary mode of operation is an event based sampler using
performance counter events.
> There still is. This is why you have the "sti; sysexit" pair, and why
> safe_halt() is "sti; hlt". You really don't want interrupts in those
> windows. The architectural oddity forced us to make these calls into
> the VMI interface. A third one, used by some operating systems, is
> "sti; nop; cli" - i.e. deliver pending interrupts and disable again. In
> most other cases, it doesn't matter.
Sounds like something that should be discussed in the spec.
> > Seems useless to me.
> >
>
> Agree. TSC is broken in so many ways, that it really should not be used
> for anything other than unreliable cycle counting.
It can be used with an aggressive white list and if you know what you're
doing. The x86-64 kernel follows this approach, which allows to use
it at least on some common classes of systems (AMD single core, Intel
non NUMA P4)
Actually for cycle counting it is useless because on newer Intel CPUs
it always runs at the highest P state no matter which P state you're in.
My evil plan to deal with that was to export the cycle count running in PMC0
for the NMI watchdog to ring 3 so people could just use RDPMC 0 instead.
There was some opposition to this idea unfortunately.
But the hypervisor should keep its fingers out of all that as far as possible.
>
> >
> >> + VMI_RDPMC
> >> +
> >> + VMICALL VMI_UINT64 VMI_RDPMC(VMI_UINT64 dummy, VMI_UINT32 counter);
> >> +
> >> + Similar to RDTSC, this call provides the functionality of reading
> >> + processor performance counters. It also is selectively visible to
> >> + userspace, and maintaining accurate data for the performance counters
> >> + is an extremely difficult task due to the side effects introduced by
> >> + the hypervisor.
> >>
> >
> > Similar.
> >
> > Overall feeling is you have far too many calls. You seem to try to implement
> > a full x86 replacement, but that makes it big and likely to be buggy. And
> > it's likely impossible to implement in any Hypervisor short of a full emulator
> > like yours.
> >
> > I would try a diet and only implement facilities that are actually likely
> > to be used by modern OS.
> >
>
> The interface can't really go on too much of a diet - some kernel
> somewhere, maybe not Linux, under some hypervisor, maybe not VMware or
> Xen, may want to use these features.
This might sound arrogant, but I would expect that near all modern
kernels don't use much more of the x86 subset than Linux is using
(biggest exception I can think of would be interrupt priorities)
>
> Taken to the extreme, where the patch processing is done before the
> kernel runs, in the hypervisor itself, using the annotation table
> provided by the guest kernel, it is even easier. If you see an
> annotation for a feature you don't care to implement, you don't do
> anything at all - you leave the native instructions as they are. In
> this case, neither the kernel nor the hypervisor has any extra code at
> all to deal with cases they don't care about. But the rich interface is
> still there, and if someone wants to bathe in butter, who are we to
> judge.
So basically you're trying to implement VT/Pacifica in software
with all these trap?
I'm not sure that's the right approach.
My feeling would be that for a efficient para virtualized interface a better
approach would be to try to optimize the kernels a bit more
for the emulated case.
Longer term there will be more optimizations (like better interaction
of VM maybe or para drivers that work faster). But if the base interface
is already so big that adding even more stuff might make it explode
at some point.
> There certainly are uses for it. For example, WRMSR is not on
> critical paths in i386 Linux today.
Actually i got a feature request today that would require to optionally
do a wrmsr in the context switch :/
-Andi
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation
2006-03-22 20:05 ` Andi Kleen
2006-03-22 21:34 ` Chris Wright
2006-03-22 21:39 ` [RFC, PATCH 1/24] i386 Vmi documentation II Andi Kleen
@ 2006-03-22 22:04 ` Zachary Amsden
2006-03-22 21:58 ` Andi Kleen
2 siblings, 1 reply; 26+ messages in thread
From: Zachary Amsden @ 2006-03-22 22:04 UTC (permalink / raw)
To: Andi Kleen
Cc: virtualization, Linus Torvalds, Linux Kernel Mailing List,
Xen-devel, Andrew Morton, Dan Hecht, Dan Arai, Anne Holler,
Pratap Subrahmanyam, Christopher Li, Joshua LeVasseur,
Chris Wright, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy,
Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn
Andi Kleen wrote:
> On Monday 13 March 2006 18:59, Zachary Amsden wrote:
>
>
>> + The general mechanism for providing customized features and
>> + capabilities is to provide notification of these feature through
>> + the CPUID call,
>>
>
> How should that work since CPUID cannot be intercepted by
> a Hypervisor (without VMX/SVM)?
>
It can be intercepted with a VMI call. I actually think overloading
this for VM features as well, although convenient, might turn out to be
unwieldy.
>> + Watchdog NMIs are of limited use if the OS is
>> + already correct and running on stable hardware;
>>
>
> So how would your Hypervisor detect a kernel hung with interrupts
> off then?
>
The hypervisor can detect it fine - we never disable hardware interrupts
or NMIs except for very small windows in the fault handlers. I'm
arguing that philosophically, using NMIs to detect a software hang means
you have broken software. NMIs for detecting hardware induced hangs are
common and reasonable things to do, but on virtual hardware, that
shouldn't happen either.
>
>>> profiling NMIs are
>>>
>> + similarly of less use, since this task is accomplished with more accuracy
>> + in the VMM itself
>>
>
> And how does oprofile know about this?
>
It doesn't. But consider that oprofile is a time based NMI sampler.
That is less accurate in a VM when you have virtual time, and, somewhat
skewed spacing between NMI delivery, and less than accurate performance
counter information. You can get a lot better results for benchmarks
using the VMM to sample the guest instead.
>> ; and NMIs for machine check errors should be handled
>> + outside of the VM.
>>
>
> Right now yes, but if we ever implement intelligent memory ECC error handling it's questionable
> the hypervisor can do a better job. It has far less information about how memory
> is used than the kernel.
>
Right. I think I may have been too proactive in my defense of disabling
NMIs. I agree now, it is a bug, and it really should be supported. But
it was a convenient shortcut to getting things working - otherwise you
have to have the NMI avoidance logic in entry.S, which is not properly
virtualizable (checks raw segments without masking RPL). But seeing as
I already fixed that, I think we actually could re-enable NMIs now.
Though the usefulness of common cases may be compromised, having the VM
do machine check handling on its own data pages (so it can figure out
which processes to kill / recover) is an extremely useful case.
>> + CORE INTERFACE CALLS
>>
>
> Did I miss it or do you never describe how to find these entry points?
>
It should be described in the ROM probing section in more detail. Our
documentation is getting better with time ;)
>
>> + VMI_EnableInterrupts
>> +
>> + VMICALL void VMI_EnableInterrupts(void);
>> +
>> + Enable maskable interrupts on the processor. Note that the
>> + current implementation always will deliver any pending interrupts
>> + on a call which enables interrupts, for compatibility with kernel
>> + code which expects this behavior. Whether this should be required
>> + is open for debate.
>>
>
> A subtle trap is also that it will do so on the next instruction, not the
> followon to next like a real x86. At some point there was code in Linux
> that dependend on this.
>
There still is. This is why you have the "sti; sysexit" pair, and why
safe_halt() is "sti; hlt". You really don't want interrupts in those
windows. The architectural oddity forced us to make these calls into
the VMI interface. A third one, used by some operating systems, is
"sti; nop; cli" - i.e. deliver pending interrupts and disable again. In
most other cases, it doesn't matter.
>
>> + VMICALL VMI_UINT64 VMI_RDMSR(VMI_UINT64 dummy, VMI_UINT32 reg);
>> +
>> + Read from a model specific register. This functions identically to the
>> + hardware RDMSR instruction. Note that a hypervisor may not implement
>> + the full set of MSRs supported by native hardware, since many of them
>> + are not useful in the context of a virtual machine.
>>
>
> So what happens when the kernel tries to access an unimplemented MSR?
>
> Also we have had occasionally workarounds in the past that required
> MSR writes with magic "passwords". How would these be handled?
>
I actually already implemented your suggestion on making MSR reads and
writes use trap and emulate - so all of these issues go away. Whether
forcing trap and emulate is a good idea for a minimal open source
hypervisor is another debate.
> +
>
>> + VMI_CPUID
>> +
>> + /* Not expressible as a C function */
>> +
>> + The CPUID instruction provides processor feature identification in a
>> + vendor specific manner. The instruction itself is non-virtualizable
>> + without hardware support, requiring a hypervisor assisted CPUID call
>> + that emulates the effect of the native instruction, while masking any
>> + unsupported CPU feature bits.
>>
>
> Doesn't seem to be very useful because everybody can just call CPUID directly.
>
Which is why the kernel _must_ use the CPUID VMI call. We're a little
bit broken in this respect today, since the boot code in head.S does
CPUID probing before the VMI init call. It works for us because we use
binary translation of the kernel up to this point. In the end, this
will disappear, and the CPUID probing will be done in the alternative
entry point known as the "start of day" state, where the kernel is
already pre-virtualized.
> Yes, but it will be wrong in a native kernel too so why do you want
> to be better than native?
>
> Seems useless to me.
>
Agree. TSC is broken in so many ways, that it really should not be used
for anything other than unreliable cycle counting.
>
>> + VMI_RDPMC
>> +
>> + VMICALL VMI_UINT64 VMI_RDPMC(VMI_UINT64 dummy, VMI_UINT32 counter);
>> +
>> + Similar to RDTSC, this call provides the functionality of reading
>> + processor performance counters. It also is selectively visible to
>> + userspace, and maintaining accurate data for the performance counters
>> + is an extremely difficult task due to the side effects introduced by
>> + the hypervisor.
>>
>
> Similar.
>
> Overall feeling is you have far too many calls. You seem to try to implement
> a full x86 replacement, but that makes it big and likely to be buggy. And
> it's likely impossible to implement in any Hypervisor short of a full emulator
> like yours.
>
> I would try a diet and only implement facilities that are actually likely
> to be used by modern OS.
>
The interface can't really go on too much of a diet - some kernel
somewhere, maybe not Linux, under some hypervisor, maybe not VMware or
Xen, may want to use these features. What the interface can be is an a
la carte menu. By allowing specific instructions to fall back to trap
and emulate, mainstream OSes don't need to be bothered with changing to
match some rich interface. Other OSes may have vastly different
requirements, and might want to make use of these features heavily, if
they are available. And hypervisors don't need to implement anything
special for these either. Our RDPMC implementation in the ROM is quite
simple:
/*
* VMI_RDPMC - Binary RDPMC equivalent
* Must clobber no registers (other than %eax, %edx return)
*/
VMI_ENTRY(RDPMC)
rdpmc
vmireturn
VMI_CALL_END
Taken to the extreme, where the patch processing is done before the
kernel runs, in the hypervisor itself, using the annotation table
provided by the guest kernel, it is even easier. If you see an
annotation for a feature you don't care to implement, you don't do
anything at all - you leave the native instructions as they are. In
this case, neither the kernel nor the hypervisor has any extra code at
all to deal with cases they don't care about. But the rich interface is
still there, and if someone wants to bathe in butter, who are we to
judge. There certainly are uses for it. For example, WRMSR is not on
critical paths in i386 Linux today. That does not mean we should remove
it from the interface. When a new processor core comes along, and all
of a sudden, you really need that interface back, you want it ready for
use. And this case really did happen - FSBASE and GSBASE MSR writes
moved onto the critical path in x86_64.
I think I carried the diet analogy a little far.
> There was one other point I wanted to make but I forgot it now @)
>
Thanks again for your feedback,
Zach
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation II
2006-03-22 22:45 ` Zachary Amsden
@ 2006-03-22 22:38 ` Andi Kleen
2006-03-22 23:54 ` Zachary Amsden
0 siblings, 1 reply; 26+ messages in thread
From: Andi Kleen @ 2006-03-22 22:38 UTC (permalink / raw)
To: Zachary Amsden
Cc: virtualization, Linus Torvalds, Linux Kernel Mailing List,
Xen-devel, Andrew Morton, Dan Hecht, Dan Arai, Anne Holler,
Pratap Subrahmanyam, Christopher Li, Joshua LeVasseur,
Chris Wright, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy,
Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn
On Wednesday 22 March 2006 23:45, Zachary Amsden wrote:
> I propose an entirely different approach - use segmentation.
That would require a lot of changes to save/restore the segmentation
register at kernel entry/exit since there is no swapgs on i386.
And will be likely slower there too and also even slow down the
VMI-kernel-no-hypervisor.
Still might be the best option.
How did that rumoured Xenolinux-over-VMI implementation solve that problem?
-Andi
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation II
2006-03-22 21:39 ` [RFC, PATCH 1/24] i386 Vmi documentation II Andi Kleen
@ 2006-03-22 22:43 ` Daniel Arai
2006-03-22 22:45 ` Zachary Amsden
1 sibling, 0 replies; 26+ messages in thread
From: Daniel Arai @ 2006-03-22 22:43 UTC (permalink / raw)
To: Andi Kleen
Cc: virtualization, Zachary Amsden, Linus Torvalds,
Linux Kernel Mailing List, Xen-devel, Andrew Morton, Dan Hecht,
Anne Holler, Pratap Subrahmanyam, Christopher Li,
Joshua LeVasseur, Chris Wright, Rik Van Riel, Jyothy Reddy,
Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts,
Leendert van Doorn
Andi Kleen wrote:
>>There was one other point I wanted to make but I forgot it now @)
>
>
> Ah yes the point was that since most of the implementations of the hypercalls
> likely need fast access to some per CPU state. How would you plan
> to implement that? Should it be covered in the specification?
I can explain how it works, but it's deliberately not part of the specification.
The whole point of the ROM layer is that it abstracts away the actual hypercall
mechanism for the guest, and the hypervisor can implement whatever is
appropriate for it. This layer allows a VMI guest to run on VMware's
hypervisor, as well as on top of Xen.
We reserve the top 64MB of linear address space for the hypervisor.
Part of this reserved space contains data structures that are shared by the VMI
ROM layer and the hypervisor. Simple VMI interface calls like "read CR 2" are
implemented by reading or writing data from this shared data structure, and
don't require a privilege level change. Things like page table updates go into
a queue in the shared area, so they can easily be batched and processed with
only one actual call into the hypervisor.
Because the guest can manipulate this data page directly, the hypervisor has to
treat any information in it as untrusted. This is similar to how the kernel has
to treat syscall arguments. Guest user code can't touch the shared area, so it
doesn't introduce any new kernel security holes. The guest kernel could
deliberately mess up the shared area contents, but guest kernel code could
corrupt any arbitrary (virtual) machine state anyway.
Because this level of interface is hidden from the guest, we can (and do) make
changes to it without changing VMI itself, or needing to recompile the guest.
We deliberately do not document it. A guest that adheres to the VMI interface
can move to new versions of the ROM/hypervisor interface (that implement the
same VMI interface) without changes.
Dan.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation II
2006-03-22 21:39 ` [RFC, PATCH 1/24] i386 Vmi documentation II Andi Kleen
2006-03-22 22:43 ` Daniel Arai
@ 2006-03-22 22:45 ` Zachary Amsden
2006-03-22 22:38 ` Andi Kleen
1 sibling, 1 reply; 26+ messages in thread
From: Zachary Amsden @ 2006-03-22 22:45 UTC (permalink / raw)
To: Andi Kleen
Cc: virtualization, Linus Torvalds, Linux Kernel Mailing List,
Xen-devel, Andrew Morton, Dan Hecht, Dan Arai, Anne Holler,
Pratap Subrahmanyam, Christopher Li, Joshua LeVasseur,
Chris Wright, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy,
Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn
Andi Kleen wrote:
>> There was one other point I wanted to make but I forgot it now @)
>>
>
> Ah yes the point was that since most of the implementations of the hypercalls
> likely need fast access to some per CPU state. How would you plan
> to implement that? Should it be covered in the specification?
>
Probably. We don't have that issue currently, as we have a private
mapping of CPU state for each VCPU at a fixed address. Seeing as that
is not so feasible under Xen, I would say we need to put something in
the spec.
The way Xen deals with this is rather gruesome today. It needs
callbacks into the kernel to disable preemption so that it can
atomically compute the address of the VCPU area, just so that it can
disable interrupts on the VCPU. These contortions make backbending look
easy.
I propose an entirely different approach - use segmentation. This needs
to be in the spec, as we now need to add VMI hook points for saving and
restoring user segments. But in the end it wins, even if you can't
support per-cpu mappings using paging, you can do it with segmentation.
You'll likely get even better performance. And you don't have to worry
about these unclean callbacks into the guest kernel that really make the
interface between Xen and XenoLinux completely enmeshed. And you can
disable interrupts in one instruction:
movb $0, %gs:hypervisor_intFlags
Zach
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation II
2006-03-22 23:54 ` Zachary Amsden
@ 2006-03-22 23:37 ` Andi Kleen
0 siblings, 0 replies; 26+ messages in thread
From: Andi Kleen @ 2006-03-22 23:37 UTC (permalink / raw)
To: Zachary Amsden
Cc: virtualization, Linus Torvalds, Linux Kernel Mailing List,
Xen-devel, Andrew Morton, Dan Hecht, Dan Arai, Anne Holler,
Pratap Subrahmanyam, Christopher Li, Joshua LeVasseur,
Chris Wright, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy,
Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn
On Thursday 23 March 2006 00:54, Zachary Amsden wrote:
> Andi Kleen wrote:
> > On Wednesday 22 March 2006 23:45, Zachary Amsden wrote:
> >
> >
> >> I propose an entirely different approach - use segmentation.
> >>
> >
> > That would require a lot of changes to save/restore the segmentation
> > register at kernel entry/exit since there is no swapgs on i386.
> > And will be likely slower there too and also even slow down the
> > VMI-kernel-no-hypervisor.
> >
>
> There are no changes required to the kernel entry / exit paths. With
> save/restore segment support in the VMI, reserving one segment for the
> hypervisor data area is easy.
Ok that might work yes.
> > Still might be the best option.
> >
> > How did that rumoured Xenolinux-over-VMI implementation solve that problem?
> >
>
> !CONFIG_SMP -- as I believe I saw in the latest Xen patches sent out as
> well?
Ah, cheating. This means the rumoured benchmark numbers are dubious too I guess.
-Andi
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation II
2006-03-22 22:38 ` Andi Kleen
@ 2006-03-22 23:54 ` Zachary Amsden
2006-03-22 23:37 ` Andi Kleen
0 siblings, 1 reply; 26+ messages in thread
From: Zachary Amsden @ 2006-03-22 23:54 UTC (permalink / raw)
To: Andi Kleen
Cc: virtualization, Linus Torvalds, Linux Kernel Mailing List,
Xen-devel, Andrew Morton, Dan Hecht, Dan Arai, Anne Holler,
Pratap Subrahmanyam, Christopher Li, Joshua LeVasseur,
Chris Wright, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy,
Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn
Andi Kleen wrote:
> On Wednesday 22 March 2006 23:45, Zachary Amsden wrote:
>
>
>> I propose an entirely different approach - use segmentation.
>>
>
> That would require a lot of changes to save/restore the segmentation
> register at kernel entry/exit since there is no swapgs on i386.
> And will be likely slower there too and also even slow down the
> VMI-kernel-no-hypervisor.
>
There are no changes required to the kernel entry / exit paths. With
save/restore segment support in the VMI, reserving one segment for the
hypervisor data area is easy.
I take it back. There is one required change:
kernel_entry:
hypervisor_entry_hook
sti
.... kernel code
This hypervisor_entry_hook can be a nop on native hardware, and the
following for Xen:
push %gs
mov CPU_HYPER_SEL, %gs
pop %gs:SAVED_USER_GS
You already have the IRET / SYSEXIT hooks to restore it on the way
back. And now you have a segment reserved that allows you to deal with
16-bit stack segments during the IRET.
> Still might be the best option.
>
> How did that rumoured Xenolinux-over-VMI implementation solve that problem?
>
!CONFIG_SMP -- as I believe I saw in the latest Xen patches sent out as
well?
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation
2006-03-22 21:13 ` Andi Kleen
2006-03-22 21:57 ` Chris Wright
@ 2006-03-23 0:06 ` Zachary Amsden
1 sibling, 0 replies; 26+ messages in thread
From: Zachary Amsden @ 2006-03-23 0:06 UTC (permalink / raw)
To: Andi Kleen
Cc: Chris Wright, virtualization, Linus Torvalds,
Linux Kernel Mailing List, Xen-devel, Andrew Morton, Dan Hecht,
Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li,
Joshua LeVasseur, Chris Wright, Rik Van Riel, Jyothy Reddy,
Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts,
Leendert van Doorn
Andi Kleen wrote:
> Even then it's useless for many flags because any user program can (and will)
> call CPUID directly.
Turns out not to matter, since userspace can only make use of
capabilities that are already available to userspace. If the feature
bits for system features are visible to it, it doesn't really matter.
Yes, this could be broken in some cases. But it turns out to be safe.
Even sysenter support, which userspace does care about, is done via
setting the vsyscall page up in the kernel, rather than userspace CPUID
detection.
> Sure the point was if they write this long fancy document why stop
> at documenting the last 5%?
>
Because the last 5% is what is changing to meet Xen's needs. Why
document something that you know you are going to break in a week? I
chose to document the stable interfaces first.
^ permalink raw reply [flat|nested] 26+ messages in thread
end of thread, other threads:[~2006-03-23 0:12 UTC | newest]
Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-03-13 17:59 [RFC, PATCH 1/24] i386 Vmi documentation Zachary Amsden
2006-03-13 22:49 ` Chris Wright
2006-03-14 0:00 ` Zachary Amsden
2006-03-14 21:27 ` Chris Wright
2006-03-14 22:29 ` Zachary Amsden
2006-03-15 2:57 ` Chris Wright
2006-03-15 5:44 ` Zachary Amsden
2006-03-15 22:56 ` Daniel Arai
2006-03-16 1:16 ` Chris Wright
2006-03-16 3:40 ` Eli Collins
2006-03-16 3:40 ` Eli Collins
2006-03-14 4:11 ` Rik van Riel
2006-03-22 20:05 ` Andi Kleen
2006-03-22 21:34 ` Chris Wright
2006-03-22 21:13 ` Andi Kleen
2006-03-22 21:57 ` Chris Wright
2006-03-23 0:06 ` Zachary Amsden
2006-03-22 21:39 ` [RFC, PATCH 1/24] i386 Vmi documentation II Andi Kleen
2006-03-22 22:43 ` Daniel Arai
2006-03-22 22:45 ` Zachary Amsden
2006-03-22 22:38 ` Andi Kleen
2006-03-22 23:54 ` Zachary Amsden
2006-03-22 23:37 ` Andi Kleen
2006-03-22 22:04 ` [RFC, PATCH 1/24] i386 Vmi documentation Zachary Amsden
2006-03-22 21:58 ` Andi Kleen
-- strict thread matches above, loose matches on Subject: below --
2006-03-13 18:41 Zachary Amsden
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.