public inbox for linux-kernel@vger.kernel.org
* VMI Interface Proposal Documentation for I386, Part 4
From: Zachary Amsden @ 2006-03-13 19:55 UTC
  To: Linux Kernel Mailing List


3) Architectural Differences from Native Hardware.

     For the sake of performance, some requirements are imposed on kernel
     fault handlers which are not present on real hardware.  Most modern
     operating systems should have no trouble meeting these requirements.
     Failure to meet these requirements may prevent the kernel from
     working properly.

     1) The hardware flags on entry to a fault handler may not match
        the EFLAGS image on the fault handler stack.  The stack image
        is correct, and will have the correct state of the interrupt
        and arithmetic flags.

     2) The stack used for kernel traps must be flat - that is, zero base,
        segment limit determined by the hypervisor.

     3) On entry to any fault handler, the stack must have sufficient space
        to hold 32 bytes of data, or the guest may be terminated.

     4) When calling VMI functions, the kernel must be running on a
        flat 32-bit stack and code segment.

     5) Most VMI functions require flat data and extra (DS and ES)
        segments as well; notable exceptions are IRET and SYSEXIT.
        XXXPara - may need to add STI and CLI to this list.

     6) Interrupts must always be enabled when running code in userspace.

     7) IOPL semantics for userspace are changed; although userspace may be
        granted port access, it can not affect the interrupt flag.

     8) The EIPs at which faults may occur in VMI calls may not match the
        original native instruction EIP; this is a bug in the system
        today, as many guests do rely on lazy fault handling.

     9) On entry to V8086 mode, MSR_SYSENTER_CS is cleared to zero.

     10) Todo - we would like to support these features, but they are not
        fully tested and / or implemented:

        Userspace 16-bit stack support
        Proper handling of faulting IRETs

4) ROM Implementation

   Modularization

     Originally, we envisioned modularizing the ROM API into several
     subsections, but the close coupling between the initial layers
     and the requirement to support native PCI bus devices has made
     ROM components for network or block devices unnecessary to this
     point in time.

    VMI - the virtual machine interface.  This is the core CPU, I/O
          and MMU virtualization layer.  I/O is currently limited
          to port access to emulated devices.
    
   Detection

      The presence of hypervisor ROMs can be recognized by scanning the
      upper region of the first megabyte of physical memory.  Multiple
      ROMs may be provided to support older API versions for legacy guest
      OS support.  ROM detection is done in the traditional manner, by
      scanning the memory region from C8000h - DFFFFh in 2 kilobyte
      increments.  The romSignature bytes must be '0x55, 0xAA', and the
      checksum of the region indicated by the romLength field must be zero.
      The checksum is a simple 8-bit addition of all bytes in the ROM
      region.
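
      The detection steps above can be sketched roughly as follows.  This
      is only an illustration, not code from the proposal: it assumes the
      first megabyte of physical memory is available as a byte array
      (`mem`), and the names `find_hyper_rom`, `rom_checksum` and the
      ROM_* constants are hypothetical.  The length byte is read from
      offset 2, which is where romLength sits in the header layout.

```c
#include <stddef.h>
#include <stdint.h>

#define ROM_SCAN_START  0xC8000u   /* first candidate address */
#define ROM_SCAN_END    0xDFFFFu   /* last byte of the scanned window */
#define ROM_SCAN_STEP   0x800u     /* 2 kilobyte increments */
#define ROM_CHUNK       512u       /* romLength counts 512-byte chunks */
#define FIRST_MEGABYTE  0x100000u  /* size of the assumed memory image */

/* 8-bit sum of all bytes in a region; a valid ROM sums to zero. */
static uint8_t rom_checksum(const uint8_t *p, size_t len)
{
    uint8_t sum = 0;

    while (len--)
        sum = (uint8_t)(sum + *p++);
    return sum;
}

/*
 * Scan `mem` (assumption: a byte image of the first megabyte of physical
 * memory) starting at address `from`.  Returns the address of the first
 * region with a valid signature and checksum, or 0 if none is found.
 */
static uint32_t find_hyper_rom(const uint8_t *mem, uint32_t from)
{
    for (uint32_t addr = from; addr <= ROM_SCAN_END; addr += ROM_SCAN_STEP) {
        const uint8_t *rom = mem + addr;
        size_t len;

        /* romSignature bytes must be 0x55, 0xAA. */
        if (rom[0] != 0x55 || rom[1] != 0xAA)
            continue;

        /* romLength determines the area to be checksummed. */
        len = (size_t)rom[2] * ROM_CHUNK;
        if (len == 0 || addr + len > FIRST_MEGABYTE)
            continue;

        if (rom_checksum(rom, len) == 0)
            return addr;
    }
    return 0;
}
```

      Since multiple ROMs may be present, a caller could resume the scan
      by passing the previous hit plus ROM_SCAN_STEP as `from`; checking
      hyperSignature and the API version is left to the caller.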

   Data layout

      typedef struct HyperRomHeader {
         uint16_t        romSignature;
         uint8_t         romLength;
         unsigned char   romEntry[4];
         uint8_t         romPad0;
         uint32_t        hyperSignature;
         uint8_t         APIVersionMinor;
         uint8_t         APIVersionMajor;
         uint8_t         reserved0;
         uint8_t         reserved1;
         uint32_t        reserved2;
         uint32_t        reserved3;
         uint16_t        pciHeaderOffset;
         uint16_t        pnpHeaderOffset;
         uint32_t        romPad3;
         char            reserved[32];
         char            elfHeader[64];
      } HyperRomHeader;

      The first set of fields is defined by the BIOS:

      romSignature - fixed 0xAA55, BIOS ROM signature
      romLength    - the length of the ROM, in 512 byte chunks.
                     Determines the area to be checksummed.
      romEntry     - 16-bit initialization code stub used by BIOS.
      romPad0      - reserved

      The next set of fields is defined by this API:

      hyperSignature  - a 4 byte signature providing recognition of the
                    device class represented by this ROM.  Each
                    device class defines its own unique signature.
      APIVersionMinor - the revision level of this device class' API.
                    This indicates incremental changes to the API.
      APIVersionMajor - the major version. Used to indicate large
                    revisions or additions to the API which break
                    compatibility with the previous version.
      reserved0,1,2,3 - for future expansion

      The next set of fields is defined by the PCI / PnP BIOS spec:

      pciHeaderOffset - relative offset to the PCI device header from
                  the start of this ROM.
      pnpHeaderOffset - relative offset to the PnP boot header from the
                    start of this ROM.
      romPad3         - reserved by PCI spec.

      Finally, there is space for future header fields, and an area
      reserved for an ELF header to point to symbol information.
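
      As a rough check of the layout above, the sketch below restates
      the header and asserts the offsets that the BIOS and PCI / PnP
      conventions expect (PCI header pointer at 18h, PnP header pointer
      at 1Ah).  Several things here are assumptions rather than part of
      the proposal: romLength is declared unsigned so that lengths above
      127 chunks are representable, the helper names are hypothetical,
      and the compatibility policy in hyper_rom_api_ok() (same major
      version, minor version at least the one required) is only one
      plausible reading of the version fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct HyperRomHeader {
    uint16_t        romSignature;     /* 0xAA55, stored as bytes 55 AA */
    uint8_t         romLength;        /* in 512 byte chunks (assumed unsigned) */
    unsigned char   romEntry[4];
    uint8_t         romPad0;
    uint32_t        hyperSignature;
    uint8_t         APIVersionMinor;
    uint8_t         APIVersionMajor;
    uint8_t         reserved0;
    uint8_t         reserved1;
    uint32_t        reserved2;
    uint32_t        reserved3;
    uint16_t        pciHeaderOffset;
    uint16_t        pnpHeaderOffset;
    uint32_t        romPad3;
    char            reserved[32];
    char            elfHeader[64];
} HyperRomHeader;

/* With natural alignment every field lands on its expected offset. */
static_assert(offsetof(HyperRomHeader, romLength) == 2, "romLength");
static_assert(offsetof(HyperRomHeader, hyperSignature) == 8, "hyperSignature");
static_assert(offsetof(HyperRomHeader, pciHeaderOffset) == 0x18, "PCI offset");
static_assert(offsetof(HyperRomHeader, pnpHeaderOffset) == 0x1A, "PnP offset");
static_assert(offsetof(HyperRomHeader, elfHeader) == 64, "elfHeader");
static_assert(sizeof(HyperRomHeader) == 128, "header size");

/* Size in bytes of the region covered by the checksum. */
static size_t hyper_rom_length_bytes(const HyperRomHeader *hdr)
{
    return (size_t)hdr->romLength * 512u;
}

/*
 * Hypothetical policy: a major version change breaks compatibility,
 * while a newer minor version only adds incremental changes.
 */
static bool hyper_rom_api_ok(const HyperRomHeader *hdr,
                             uint8_t want_major, uint8_t min_minor)
{
    return hdr->APIVersionMajor == want_major &&
           hdr->APIVersionMinor >= min_minor;
}
```

      The static assertions fail at compile time if a compiler inserts
      padding, which is cheaper to find than a guest that silently reads
      the wrong header field.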



* Re: VMI Interface Proposal Documentation for I386, Part 4
From: Pavel Machek @ 2006-03-15 23:37 UTC
  To: Zachary Amsden; +Cc: Linux Kernel Mailing List

Hi!

>     6) Interrupts must always be enabled when running code in userspace.

I'd say this breaks userspace.

This code used to work when ran as root:

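#include <sys/io.h>	/* iopl() */
#include <unistd.h>	/* sleep() */

int
main(void)
{
        int i;

        /* Raise the I/O privilege level so cli/sti are allowed (needs root). */
        if (iopl(3) != 0)
                return 1;
        while (1) {
                asm volatile("cli");
                //              for (i=0; i<20000000; i++)
                for (i=0; i<1000000000; i++)
                        asm volatile("");
                asm volatile("sti");
                sleep(1);
        }
}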

...and was actually useful.

>     7) IOPL semantics for userspace are changed; although userspace may be
>        granted port access, it can not affect the interrupt flag.

I'm not sure how will X like this.

								Pavel

-- 
57:        MD5CryptoServiceProvider MD5 = new MD5CryptoServiceProvider();


* Re: VMI Interface Proposal Documentation for I386, Part 4
From: Zachary Amsden @ 2006-03-16  7:00 UTC
  To: Pavel Machek
  Cc: Linux Kernel Mailing List, Jack Lo, Pratap Subrahmanyam,
	Daniel Arai, Anne Holler, Daniel Hecht, Andrew Morton,
	Virtualization Mailing List, Eli Collins, Xen-devel, Jyothy Reddy,
	Christopher Li, Chris Wright, Kip Macy, Joshua LeVasseur,
	Leendert van Doorn

Pavel Machek wrote:
> Hi!
>
>   
>>     6) Interrupts must always be enabled when running code in userspace.
>>     
>
> I'd say this breaks userspace.
>   

I agree.  My claim is that this is not an issue in a virtual machine.  
What possible reason can you have to disable interrupts in userspace?  
Well, several.  For one, the X server wants to disable interrupts 
temporarily during probing of dot clocks to get accurate timings, and 
also to avoid the kernel interrupting during a sensitive VGA register 
access.  Several other userspace programs, including CMOS time sync 
utilities do this as well.  I contend this is broken, even on native 
hardware, for two reasons.

1) The sensitive VGA register access argument is bogus.  There is 
already a kernel interface that is used by X11 to take control of video 
which lets the kernel know explicitly not to touch the VGA registers.  
The oddity is due to the fact that there are many write only registers, 
and thus, you can't track state of these without explicit handoff.  The 
same interface can be used to avoid these sensitive accesses.

2) Timing dot clocks by disabling interrupts is still broken and subject 
to random variance.  Chipsets which support system management mode can 
cause the processor to enter SMM at any time, even when interrupts 
are disabled and NMIs are masked.  This is deliberately hidden from the 
running code, but it does cause time to elapse, which is visible via the 
TSC and all hardware time counters.  Therefore, you can never get an 
accurate timing in one iteration, and using multiple iterations allows 
you to effectively deal with the same issues you would have if you left 
interrupts enabled.
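
To illustrate the multiple-iteration approach, here is a minimal sketch 
that times a piece of work several times and keeps the smallest sample, 
on the theory that a sample stretched by an SMI or an interrupt is simply 
larger than the others and gets discarded.  The helper names (now_ns, 
time_min_ns, spin) are hypothetical, and the C11 timespec_get() wall 
clock stands in for whatever counter real probing code would use.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Current wall-clock time in nanoseconds (C11 timespec_get). */
static uint64_t now_ns(void)
{
    struct timespec ts;

    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/*
 * Run `work` `iterations` times with interrupts left enabled and return
 * the shortest observed duration.  Samples inflated by SMM, interrupts
 * or rescheduling lose to an undisturbed run.  Returns UINT64_MAX when
 * `iterations` is zero.
 */
static uint64_t time_min_ns(void (*work)(void), size_t iterations)
{
    uint64_t best = UINT64_MAX;

    for (size_t i = 0; i < iterations; i++) {
        uint64_t t0 = now_ns();

        work();

        uint64_t elapsed = now_ns() - t0;
        if (elapsed < best)
            best = elapsed;
    }
    return best;
}

/* Example workload: a short busy loop the compiler cannot remove. */
static void spin(void)
{
    for (volatile int i = 0; i < 100000; i++)
        ;
}
```

Taking the minimum is only one possible filter; a median would also 
reject outliers, at the cost of storing the samples.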

> This code used to work when ran as root:
>
> void
> main(void)
> {
>         int i;
>         iopl(3);
>         while (1) {
>                 asm volatile("cli");
>                 //              for (i=0; i<20000000; i++)
>                 for (i=0; i<1000000000; i++)
>                         asm volatile("");
>                 asm volatile("sti");
>                 sleep(1);
>         }
> }
>
> ...and was actually useful.
>   

The code you show above can be made to work in a virtual machine, and 
you can allow userspace to disable interrupts and still have a perfectly 
fine solution -- if you restrict the enabling and disabling of 
interrupts in userspace to the cli and sti instructions.  But it does 
not work if you start using nested interrupt control, using pushf and popf.

The virtual machine monitor must always leave hardware interrupts 
enabled, since it must service them without allowing the guest VM to 
interfere.  As such, the actual state of the hardware interrupt flag is 
visible to userspace programs.  CLI and STI get away with this, because
they are IOPL-sensitive instructions, and as such, they trap when
userspace has not been granted IOPL.  But PUSHF and POPF do not.  A POPF
instruction which changes the interrupt flag behaves differently,
depending on the IOPL state.  When IOPL has not been granted, and the
POPF would change the state of the interrupt flag - nothing happens.
The interrupt flag is not changed, but most importantly, POPF is not a
privileged instruction, so it does not trap.

Therefore, this instruction is non-virtualizable.  You can not run it 
directly in a virtual machine - you must simulate it.  To simulate it 
requires either straightforward interpretation, hardware virtualization, 
or binary translation.  Therein lies the crux of the problem.  While you 
can allow userspace to enable and disable interrupts using CLI and STI, 
you have no way to simulate its use of the POPF instruction unless you 
use one of these technologies.  This is why we disallow all toggling of 
the interrupt flag from userspace, since one of the design goals of 
paravirtualization is not to change userspace code.
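
The asymmetry between CLI/STI and POPF can be shown with a small model.  
This is a simplified sketch, not VMM code: it covers only protected mode, 
ignores VME/PVI, the virtual interrupt flag and reserved bits, and the 
names (CpuModel, model_set_if, model_popf) are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define EFLAGS_CF  (1u << 0)   /* carry flag, stands in for arithmetic flags */
#define EFLAGS_IF  (1u << 9)   /* interrupt enable flag */

typedef struct {
    uint32_t eflags;
    unsigned cpl;    /* current privilege level, 0..3 */
    unsigned iopl;   /* I/O privilege level, 0..3 */
} CpuModel;

/*
 * CLI (enable == false) or STI (enable == true).  Returns true when the
 * instruction faults, which is what hands control to the monitor.
 */
static bool model_set_if(CpuModel *cpu, bool enable)
{
    if (cpu->cpl > cpu->iopl)
        return true;                /* general protection fault */

    if (enable)
        cpu->eflags |= EFLAGS_IF;
    else
        cpu->eflags &= ~EFLAGS_IF;
    return false;
}

/*
 * POPF with `image` as the value popped from the stack.  It never faults
 * here: when CPL > IOPL the IF bit of the image is silently ignored, so
 * the monitor never learns that the guest tried to change it.
 */
static void model_popf(CpuModel *cpu, uint32_t image)
{
    uint32_t keep = 0;

    if (cpu->cpl > cpu->iopl)
        keep |= EFLAGS_IF;          /* preserve IF, no trap */

    cpu->eflags = (cpu->eflags & keep) | (image & ~keep);
}
```

With cpl = 3 and iopl = 0, model_set_if() reports a fault that a monitor 
could emulate, while model_popf() leaves IF untouched and returns 
normally, which is the case that cannot be caught without 
interpretation, hardware virtualization, or binary translation.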

Combined with the above argument that enabling / disabling is really not 
useful for userspace in a virtual machine, we have found that if you 
just completely disallow IOPL'ed userspace to enable and disable 
interrupts, _but_ never issue faults to it if it tries, everything just 
works.  The alternative allows you to get in a state where you can end 
up in a non-virtualizable userspace scenario, which is highly undesirable.

>   
>>     7) IOPL semantics for userspace are changed; although userspace may be
>>        granted port access, it can not affect the interrupt flag.
>>     
See above for the impact on X.  X11 runs perfectly fine in our 
paravirtual VMM.

Nit:  Dropping cc'd persons is probably not a good thing.  Some of the 
people here don't subscribe to LKML in full, and would still like to be 
copied on these messages.  No offense meant or taken.

Zach


* Re: VMI Interface Proposal Documentation for I386, Part 4
From: Alan Cox @ 2006-03-16 15:18 UTC
  To: Pavel Machek; +Cc: Zachary Amsden, Linux Kernel Mailing List

On Thu, 2006-03-16 at 00:37 +0100, Pavel Machek wrote:
> This code used to work when ran as root:

Unless it page faulted, or was on SMP, or ....

> I'm not sure how will X like this.

X has not used this ability for many years.

Alan



* Re: VMI Interface Proposal Documentation for I386, Part 4
From: Zachary Amsden @ 2006-03-16 15:29 UTC
  To: Alan Cox; +Cc: Pavel Machek, Linux Kernel Mailing List

Alan Cox wrote:
> On Thu, 2006-03-16 at 00:37 +0100, Pavel Machek wrote:
>   
>> This code used to work when ran as root:
>>     
>
> Unless it page faulted, or was on SMP, or ....
>   

Actually, quite interestingly, I believe you can take page faults in 
this scenario - you might end up getting rescheduled and lose the effect 
of disabling interrupts, but I think the kernel lives on just fine - as 
long as it doesn't BUG_ON about this.  On SMP, clearly you can't 
disable IRQs on all processors with it.  But I really think the point 
is to try to eliminate IRQs on a single processor during some critical 
timing sensitive region.  One thing you definitely can't do safely is 
make sysenter based syscalls off the vsyscall page - you will notice 
that you always come back with interrupts enabled.

I just really don't think that is a good idea to do in userspace, when 
writing a kernel module to accomplish this safely is actually really 
quite easy.  I would argue that the various CMOS timer update utilities 
in userspace that do this same thing, really should be moved into the 
kernel as fast as possible - they could race against other CPUs in 
kernel mode that are doing the same thing, and there is no locking 
discipline here whatsoever.

>   
>> I'm not sure how will X like this.
>>     
>
> X has not used this ability for many years.
>   

Good to know.  I thought some piece of xinit still used it to do 
dot-clock probing - but I could be wrong.  We really don't care about 
getting accurate information here, since the dot-clocks don't actually 
exist in a VM.  We simulate virtual SVGA hardware instead of passing 
through any installed card.

Zach


* Re: VMI Interface Proposal Documentation for I386, Part 4
From: Alan Cox @ 2006-03-16 18:40 UTC
  To: Zachary Amsden; +Cc: Pavel Machek, Linux Kernel Mailing List

On Thu, 2006-03-16 at 07:29 -0800, Zachary Amsden wrote:
> quite easy.  I would argue that the various CMOS timer update utilities 
> in userspace that do this same thing, really should be moved into the 
> kernel as fast as possible - they could race against other CPUs in 

They were, something like 8-10 years ago. If your distributor is
shipping code that is doing cli in user space please assist in their
re-education. Several ship code which can fall back if the nvram or rtc
driver is missing but that's compat code.

> Good to know.  I thought some piece of xinit still used it to do 
> dot-clock probing - but I could be wrong.  

It does, but it doesn't disable interrupts. You don't need to.


