* Re: [RFC, PATCH 1/24] i386 Vmi documentation [not found] ` <20060313224902.GD12807@sorel.sous-sol.org> @ 2006-03-14 0:00 ` Zachary Amsden 2006-03-14 21:27 ` Chris Wright 0 siblings, 1 reply; 21+ messages in thread From: Zachary Amsden @ 2006-03-14 0:00 UTC (permalink / raw) To: Chris Wright Cc: Linus Torvalds, Linux Kernel Mailing List, Virtualization Mailing List, Xen-devel, Andrew Morton, Dan Hecht, Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li, Joshua LeVasseur, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn Chris Wright wrote: Hi Chris, thank you for your comments. I've tried to answer as much as I can - hopefully I found all your questions. >> + guest operating systems. In the future, we envision that additional >> + higher level abstractions will be added as an adjunct to the >> + low-level API. These higher level abstractions will target large >> + bulk operations such as creation, and destruction of address spaces, >> + context switches, thread creation and control. >> > > This is an area where in the past VMI hasn't been well-suited to support > Xen. It's the higher level abstractions which make the performance > story of paravirt compelling. I haven't made it through the whole > patchset yet, but the bits you mention above as work to be done are > certainly important to good performance. > For example, multicalls, which we support, and batched page table operations, which we support, and vendor designed virtual devices, which we support. What is unclear to me is why you need to keep pushing higher up the stack to get more performance. If you could have any higher level hypercall you wanted, what would it be? Most people say - fork() / exec(). But why? 
You've just radically changed the way the guest must operate its MMU, and you've radically constrained the way page tables and memory management structures must be laid out by putting a ton of commonality in their infrastructure that is shared by the hypervisor and the kernel. You've likely vastly complicated the design of a virtualized kernel that still runs on native hardware. But what can you truly gain, that you cannot gain from a simpler, less complicated interface that just says - Ok, I'm about to update a whole bunch of page tables. Ok, I'm done and I might want to use them now. Please make sure the hardware TLB will be in sync. Pushing up the stack with a higher level API is a serious consideration, but only if you can show serious results from it. I'm not convinced that you can actually home in on anything /that isn't already a performance problem on native kernels/. Consider, for example, that we don't actually support remote TLB shootdown IPIs via VMI calls. Why is this a performance problem? Well, very likely, those IPI shootdowns are going to be synchronous. And if you don't co-schedule the CPUs in your virtual machine, you might just have issued synchronous IPIs to VCPUs that aren't even running. A serious performance problem. Is it? Or is it really, just another case where the _native_ kernel can be even more clever, and avoid doing those IPI shootdowns in the first place? I've watched IPI shootdown in Linux get drastically better in the 2.6 series of kernels, and see (anecdotal number quoting) maybe 4 or 5 of them in the course of a kernel compile. There is no longer a giant performance boon to be gained here. Similarly, you can almost argue the same thing with spinlocks - if you really are seeing performance issues because of the wakeup of a descheduled remote VCPU, maybe you really need to think about moving that lock off a hot path or using a better, lock-free synchronization method. 
I'm not arguing against these features - in fact, I think they can be done in a way that doesn't intrude too much inside of the kernel. After all, locks and IPIs tend to be part of the lower layer architecture anyways. And they definitely do win back some of the background noise introduced by virtualization. But if you decide to make the interface more complicated, you really need to have an accurate measure of exactly what you can gain by it to justify that complexity. Personally, I'm all for making lock primitives and shootdowns an _optional_ extension to the interface. As with many other relatively straightforward and non-intrusive changes. I know some of you will disagree with me, but I think a lot of what is being referred to as "higher level" paravirtualization is really an attempt to solve pre-existing problems in the performance of the underlying system. There are advanced and useful things you can do with higher level paravirtualization, but I am not convinced at all that incredible performance gain is one of them. > We do not want an interface which slows down the pace. We work with > source and drop cruft as quickly as possible (referring to internal > changes, not user-visible ABI changes here). Making changes that > require a new guest for some significant performance gain is perfectly > reasonable. What we want to avoid is making changes that require a > new guest to simply boot. This is akin to rev'ing hardware w/out any > backwards compatibility. This goal doesn't require VMI and ROMs, but > I agree it requires clear interface definitions. > This is why we provide the minor / major interface numbers. Bump the minor number, you get a new feature. Bump the required minor version in the guest when it relies on that feature. Bump the major number when you break compatibility. More on this below. > >> + VMI_DeliverInterrupts (For future debate) >> + >> + Enable and deliver any pending interrupts. 
This would remove >> + the implicit delivery semantic from the SetInterruptMask and >> + EnableInterrupts calls. >> > > How do you keep forwards and backwards compat here? Guest that's coded > to do simple implicit version would never get interrupts delivered on > newer ROM? > This isn't part of the interface. If it were to be included, you could do two things - bump the minor version, and add non-delivery semantic enable and restore interrupt calls, or bump the major version and drop the delivery semantic from the originals. I agree this is pretty clumsy. Expect to see more discussion about using annotations to expand the interface without breaking binary compatibility, as well as providing more advanced feature control. I wanted to integrate more advanced feature control / probing into this version of the VMI, but there are so many possible ways to do it that it would be much nicer to get feedback from the community on what is the best interface. > >> + CPU CONTROL CALLS >> + >> + These calls encapsulate the set of privileged instructions used to >> + manipulate the CPU control state. These instructions are all properly >> + virtualizable using trap and emulate, but for performance reasons, a >> + direct call may be more efficient. With hardware virtualization >> + capabilities, many of these calls can be left as IDENT translations, that >> + is, inline implementations of the native instructions, which are not >> + rewritten by the hypervisor. Some of these calls are performance critical >> + during context switch paths, and some are not, but they are all included >> + for completeness, with the exceptions of the obsoleted LMSW and SMSW >> + instructions. >> > > Included just for completeness can be beginning of API bloat. > The design impact of this bloat is zero - if you don't want to implement virtual methods for, say, debug register access - then you don't need to do anything. You trap and emulate by default. 
If, on the other hand, you do want to hook them, you are welcome to. The hypervisor is free to choose the design costs that are appropriate for their usage scenarios, as is the kernel - it's not in the spec, but certainly is open for debate that certain classes of instructions such as these need not even be converted to VMI calls. We did implement all of these in Linux for performance and symmetry. > > clts, setcr0, readcr0 are interrelated for typical use. is it expected > the hypervisor uses consistent register (either native or shadowed) > here, or is it meant to be undefined? > CLTS allows the elimination of an extra GetCR0 call, and they all operate on the same (shadowed) register. > Many of these will look the same on x86-64, but the API is not > 64-bit clean so has to be duplicated. > Yes, register pressure forces the PAE API to be slightly different from the long mode API. But long mode has different register calling conventions anyway, so it is not a big deal. The important thing is, once the MMU mess is sorted out, the same interface can be used from C code for both platforms, and the details about which lock primitives are used can be hidden. The cost of which lock primitives to use differs on 32-bit and 64-bit platforms, across vendors, and the style of the hypervisor implementation (direct / writable / shadowed page tables). > > >> + 85) VMI_SetDeferredMode >> > > Is this the batching, multi-call analog? > Yes. This interface needs to be documented in a much better fashion. But the idea is that VMI calls are mapped into Xen multicalls by allowing deferred completion of certain classes of operations. That same mode of deferred operation is used to batch PTE updates in our implementation (although Xen uses writable page tables now, this used to provide the same support facility in Xen as well). To complement this, there is an explicit flush - and it turns out this maps very nicely, getting rid of a lot of the XenoLinux changes around mmu_context.h. 
>> + >> + VMI_CYCLES 64 bit unsigned integer >> + VMI_NANOSECS 64 bit unsigned integer >> > > All caps typedefs are not very popular w.r.t. CodingStyle. > We know this. This is not a Linux interface. This is the API documentation, meant to be considerably different in style. Where this ugliness has crept into our Linux patches, I have been steadily removing it and making them look nicer. But the vast difference in the style of the doc is to avoid namespace collision. > >> + #define VMICALL __attribute__((regparm(3))) >> > > I understand it's for ABI documentation, but in Linux it's FASTCALL. > Actually, FASTCALL is regparm(2), I think. Cheers, Zach ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation 2006-03-14 0:00 ` [RFC, PATCH 1/24] i386 Vmi documentation Zachary Amsden @ 2006-03-14 21:27 ` Chris Wright [not found] ` <441743BD.1070108@vmware.com> 2006-03-16 1:16 ` Chris Wright 0 siblings, 2 replies; 21+ messages in thread From: Chris Wright @ 2006-03-14 21:27 UTC (permalink / raw) To: Zachary Amsden Cc: Chris Wright, Linus Torvalds, Linux Kernel Mailing List, Virtualization Mailing List, Xen-devel, Andrew Morton, Dan Hecht, Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li, Joshua LeVasseur, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn * Zachary Amsden (zach@vmware.com) wrote: > Pushing up the stack with a higher level API is a serious consideration, > but only if you can show serious results from it. I'm not convinced > that you can actually home in on anything /that isn't already a > performance problem on native kernels/. Consider, for example, that we > don't actually support remote TLB shootdown IPIs via VMI calls. Why is > this a performance problem? Well, very likely, those IPI shootdowns are > going to be synchronous. And if you don't co-schedule the CPUs in your > virtual machine, you might just have issued synchronous IPIs to VCPUs > that aren't even running. A serious performance problem. > > Is it? Or is it really, just another case where the _native_ kernel can > be even more clever, and avoid doing those IPI shootdowns in the > first place? I've watched IPI shootdown in Linux get drastically better > in the 2.6 series of kernels, and see (anecdotal number quoting) maybe 4 > or 5 of them in the course of a kernel compile. There is no longer a > giant performance boon to be gained here. 
> > Similarly, you can almost argue the same thing with spinlocks - if you > really are seeing performance issues because of the wakeup of a > descheduled remote VCPU, maybe you really need to think about moving that > lock off a hot path or using a better, lock-free synchronization method. > > I'm not arguing against these features - in fact, I think they can be > done in a way that doesn't intrude too much inside of the kernel. After > all, locks and IPIs tend to be part of the lower layer architecture > anyways. And they definitely do win back some of the background noise > introduced by virtualization. But if you decide to make the interface > more complicated, you really need to have an accurate measure of exactly > what you can gain by it to justify that complexity. Yes, I completely agree. Without specific performance numbers it's just hand waving. To make it more concrete, I'll work on a compare/contrast of the interfaces so we have specifics to discuss. > >Included just for completeness can be beginning of API bloat. > > The design impact of this bloat is zero - if you don't want to implement > virtual methods for, say, debug register access - then you don't need to > do anything. You trap and emulate by default. If on the other hand, > you do want to hook them, you are welcome to. The hypervisor is free to > choose the design costs that are appropriate for their usage scenarios, > as is the kernel - it's not in the spec, but certainly is open for > debate that certain classes of instructions such as these need not even > be converted to VMI calls. We did implement all of these in Linux for > performance and symmetry. Yup. Just noting that API without clear users is the type of thing that is regularly rejected from Linux. > >Many of these will look the same on x86-64, but the API is not > >64-bit clean so has to be duplicated. > > Yes, register pressure forces the PAE API to be slightly different from > the long mode API. 
But long mode has different register calling > conventions anyway, so it is not a big deal. The important thing is, > once the MMU mess is sorted out, the same interface can be used from C > code for both platforms, and the details about which lock primitives are > used can be hidden. The cost of which lock primitives to use differs on > 32-bit and 64-bit platforms, across vendors, and the style of the > hypervisor implementation (direct / writable / shadowed page tables). My mistake, it makes perfect sense from an ABI point of view. > >Is this the batching, multi-call analog? > > Yes. This interface needs to be documented in a much better fashion. > But the idea is that VMI calls are mapped into Xen multicalls by > allowing deferred completion of certain classes of operations. That > same mode of deferred operation is used to batch PTE updates in our > implementation (although Xen uses writable page tables now, this used to > provide the same support facility in Xen as well). To complement this, > there is an explicit flush - and it turns out this maps very nicely, > getting rid of a lot of the XenoLinux changes around mmu_context.h. Are these valid differences? Or did I misunderstand the batching mechanism? 1) can't use stack based args, so have to allocate each data structure, which could conceivably fail unless it's some fixed buffer. 2) complicates the rom implementation slightly where implementation of each deferrable part of the API needs to have switch (am I deferred or not) to then build the batch, or make direct hypercall. 3) flushing in smp, have to be careful to manage simultaneous defers and flushes from potentially multiple cpus in guest. Doesn't seem these are showstoppers, just differences worth noting. There aren't as many multicalls left in Xen these days anyway. thanks, -chris ^ permalink raw reply [flat|nested] 21+ messages in thread
[parent not found: <441743BD.1070108@vmware.com>]
* Re: [RFC, PATCH 1/24] i386 Vmi documentation [not found] ` <441743BD.1070108@vmware.com> @ 2006-03-15 2:57 ` Chris Wright 2006-03-15 5:44 ` Zachary Amsden 2006-03-15 22:56 ` Daniel Arai 0 siblings, 2 replies; 21+ messages in thread From: Chris Wright @ 2006-03-15 2:57 UTC (permalink / raw) To: Zachary Amsden Cc: Chris Wright, Linus Torvalds, Linux Kernel Mailing List, Virtualization Mailing List, Xen-devel, Andrew Morton, Dan Hecht, Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li, Joshua LeVasseur, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn * Zachary Amsden (zach@vmware.com) wrote: > >1) can't use stack based args, so have to allocate each data structure, > >which could conceivably fail unless it's some fixed buffer. > > We use a fixed buffer that is private to our VMI layer. It's a per-cpu > packing struct for hypercalls. Dynamically allocating from the kernel > inside the interface layer is a really great way to get into a whole lot > of trouble. Heh, indeed that's why I asked. per-cpu buffer means ROM state knows which vcpu is current. How is this done in OS agnostic method w/out trapping to hypervisor? Some shared data that ROM and VMM know about, and VMM updates as it schedules each vcpu? > >2) complicates the rom implementation slightly where implementation of > >each deferrable part of the API needs to have switch (am I deferred or > >not) to then build the batch, or make direct hypercall. > > This is an overhead that is easily absorbed by the gain. The direct > hypercalls are mostly either always direct, or always queued. The page > table updates already have conditional logic to do the right thing, and > Xen doesn't require the queueing of these anymore anyways. And the > flush happens at an explicit point. The best approach can still be fine > tuned. You could have separate VMI calls for queued vs. non-queued > operation. 
But that greatly bloats the interface and doesn't make sense for everything. I believe the best solution is to annotate this in the VMI call itself. Consider the VMI call number, not as an integer, but as an identifier tuple. Perhaps I'm going overboard here. Perhaps not.
>
> 31--------24-23---------16-15--------8-7-----------0
> |  family  | call number |  reserved | annotation  |
> ---------------------------------------------------
I agree with your final assessment, needs more threshing out. It does feel a bit overkill at first blush. I worry about these semantic changes as an annotation instead of explicit API update. But I guess we still have more work on finding the right actual interface, not just the possible ways to annotate the calls. thanks, -chris ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation 2006-03-15 2:57 ` Chris Wright @ 2006-03-15 5:44 ` Zachary Amsden 2006-03-15 22:56 ` Daniel Arai 1 sibling, 0 replies; 21+ messages in thread From: Zachary Amsden @ 2006-03-15 5:44 UTC (permalink / raw) To: Chris Wright Cc: Linus Torvalds, Linux Kernel Mailing List, Virtualization Mailing List, Xen-devel, Andrew Morton, Dan Hecht, Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li, Joshua LeVasseur, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn Chris Wright wrote: > * Zachary Amsden (zach@vmware.com) wrote: > >>> 1) can't use stack based args, so have to allocate each data structure, >>> which could conceivably fail unless it's some fixed buffer. >>> >> We use a fixed buffer that is private to our VMI layer. It's a per-cpu >> packing struct for hypercalls. Dynamically allocating from the kernel >> inside the interface layer is a really great way to get into a whole lot >> of trouble. >> > > Heh, indeed that's why I asked. per-cpu buffer means ROM state knows > which vcpu is current. How is this done in OS agnostic method w/out > trapping to hypervisor? Some shared data that ROM and VMM know about, > and VMM updates as it schedules each vcpu? > Yes, we have private mappings per CPU. I don't think that is as feasible on Xen, since it requires the hypervisor to support a per-CPU PD shadow for each root. But alternative implementations are possible using segmentation. The primary advantage is that you don't need to call back from the interface layer to disable preemption for per-CPU data access. It turns out to be really easy if you add the loadsegment / savesegment macros to the VMI interface, and require the kernel to abstain from using, say, the GS segment. I think this is the path we are going down for the VMI on Xen 3 port. > I agree with your final assessment, needs more threshing out. It does > feel a bit overkill at first blush. 
I worry about these semantic > changes as an annotation instead of explicit API update. But I guess > we still have more work on finding the right actual interface, not just > the possible ways to annotate the calls. > Yes, let's focus on finding the right interface for now - and just leave the door open a bit for the future. Cheers, Zach ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation 2006-03-15 2:57 ` Chris Wright 2006-03-15 5:44 ` Zachary Amsden @ 2006-03-15 22:56 ` Daniel Arai 1 sibling, 0 replies; 21+ messages in thread From: Daniel Arai @ 2006-03-15 22:56 UTC (permalink / raw) To: Chris Wright Cc: Zachary Amsden, Linus Torvalds, Linux Kernel Mailing List, Virtualization Mailing List, Xen-devel, Andrew Morton, Dan Hecht, Anne Holler, Pratap Subrahmanyam, Christopher Li, Joshua LeVasseur, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn Chris Wright wrote: > * Zachary Amsden (zach@vmware.com) wrote: > >>>1) can't use stack based args, so have to allocate each data structure, >>>which could conceivably fail unless it's some fixed buffer. >> >>We use a fixed buffer that is private to our VMI layer. It's a per-cpu >>packing struct for hypercalls. Dynamically allocating from the kernel >>inside the interface layer is a really great way to get into a whole lot >>of trouble. > > > Heh, indeed that's why I asked. per-cpu buffer means ROM state knows > which vcpu is current. How is this done in OS agnostic method w/out > trapping to hypervisor? Some shared data that ROM and VMM know about, > and VMM updates as it schedules each vcpu? Each VCPU gets a private data area at the same linear address. The VMM constructs private page table shadows for each VCPU, and the shadows magically contain the right mappings for that VCPU's private data area. Other hypervisor implementations (especially those that don't make use of shadow page tables) would have to come up with something along the lines that you're suggesting. Dan. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation 2006-03-14 21:27 ` Chris Wright [not found] ` <441743BD.1070108@vmware.com> @ 2006-03-16 1:16 ` Chris Wright 2006-03-16 3:40 ` Eli Collins 1 sibling, 1 reply; 21+ messages in thread From: Chris Wright @ 2006-03-16 1:16 UTC (permalink / raw) To: Chris Wright Cc: Zachary Amsden, Linus Torvalds, Linux Kernel Mailing List, Virtualization Mailing List, Xen-devel, Andrew Morton, Dan Hecht, Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li, Joshua LeVasseur, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn * Chris Wright (chrisw@sous-sol.org) wrote: > Yes, I completely agree. Without specific performance numbers it's just > hand waving. To make it more concrete, I'll work on a compare/contrast > of the interfaces so we have specifics to discuss. Here's a comparison of APIs. In some cases there are trivial 1-to-1 mappings, and in other cases there's really no mapping. The mapping is (loosely) annotated below the interface as [ VMI_foo(*) ]. The trailing asterisk is meant to note the API maps at high-level, but the details may make the mapping difficult (details such as VA vs. MFN, for example). Thanks to Christian for doing the bulk of this comparison. PROCESSOR STATE CALLS - shared_info->vcpu_info[]->evtchn_upcall_mask Enable/Disable interrupts and query whether interrupts are enabled or disabled. [ VMI_DisableInterrupts, VMI_EnableInterrupts, VMI_GetInterruptMask, VMI_SetInterruptMask ] - shared_info->vcpu_info[]->evtchn_upcall_pending Query if an interrupt is pending [ ] - force_evtchn_callback = HYPERVISOR_xen_version(0, NULL) Deliver pending interrupts. [ VMI_DeliverInterrupts ] (EVENT CHANNEL, virtual interrupts) - HYPERVISOR_event_channel_op(EVTCHNOP_alloc_unbound, ...) Allocate a port in domain <dom> and mark as accepting interdomain bindings from domain <remote_dom>. A fresh port is allocated in <dom> and returned as <port>. 
[ ] - HYPERVISOR_event_channel_op(EVTCHNOP_bind_interdomain, ...) Construct an interdomain event channel between the calling domain and <remote_dom>. <remote_dom,remote_port> must identify a port that is unbound and marked as accepting bindings from the calling domain. A fresh port is allocated in the calling domain and returned as <local_port>. [ ] - HYPERVISOR_event_channel_op(EVTCHNOP_bind_virq, ...) Bind a local event channel to VIRQ <irq> on specified vcpu. [ ] - HYPERVISOR_event_channel_op(EVTCHNOP_bind_pirq, ...) Bind a local event channel to PIRQ <irq>. [ PIC programming* ] - HYPERVISOR_event_channel_op(EVTCHNOP_bind_ipi, ...) Bind a local event channel to receive events. [ ] - HYPERVISOR_event_channel_op(EVTCHNOP_close, ...) Close a local event channel <port>. If the channel is interdomain then the remote end is placed in the unbound state (EVTCHNSTAT_unbound), awaiting a new connection. [ ] - HYPERVISOR_event_channel_op(EVTCHNOP_send, ...) Send an event to the remote end of the channel whose local endpoint is <port>. [ ] - HYPERVISOR_event_channel_op(EVTCHNOP_status, ...) Get the current status of the communication channel which has an endpoint at <dom, port>. [ ] - HYPERVISOR_event_channel_op(EVTCHNOP_bind_vcpu, ...) Specify which vcpu a channel should notify when an event is pending. [ ] - HYPERVISOR_event_channel_op(EVTCHNOP_unmask, ...) Unmask the specified local event-channel port and deliver a notification to the appropriate VCPU if an event is pending. [ ] - HYPERVISOR_sched_op(SCHEDOP_yield, ...) Voluntarily yield the CPU. [ VMI_Pause ] - HYPERVISOR_sched_op(SCHEDOP_block, ...) Block execution of this VCPU until an event is received for processing. If called with event upcalls masked, this operation will atomically reenable event delivery and check for pending events before blocking the VCPU. This avoids a "wakeup waiting" race. 
Periodic timer interrupts are not delivered when guest is blocked, except for explicit timer events set up with HYPERVISOR_set_timer_op. [ VMI_Halt ] - HYPERVISOR_sched_op(SCHEDOP_shutdown, ...) Halt execution of this domain (all VCPUs) and notify the system controller. [ VMI_Shutdown, VMI_Reboot ] - HYPERVISOR_sched_op(SCHEDOP_shutdown, SHUTDOWN_suspend, ...) Clean up, save suspend info, kill [ ] - HYPERVISOR_sched_op_new(SCHEDOP_poll, ...) Poll a set of event-channel ports. Return when one or more are pending. An optional timeout may be specified. [ ] - HYPERVISOR_vcpu_op(VCPUOP_initialise, ...) Initialise a VCPU. Each VCPU can be initialised only once. A newly-initialised VCPU will not run until it is brought up by VCPUOP_up. [ VMI_SetInitialAPState ] - HYPERVISOR_vcpu_op(VCPUOP_up, ...) Bring up a VCPU. This makes the VCPU runnable. This operation will fail if the VCPU has not been initialised (VCPUOP_initialise). [ ] - HYPERVISOR_vcpu_op(VCPUOP_down, ...) Bring down a VCPU (i.e., make it non-runnable). There are a few caveats that callers should observe: 1. This operation may return, and VCPU_is_up may return false, before the VCPU stops running (i.e., the command is asynchronous). It is a good idea to ensure that the VCPU has entered a non-critical loop before bringing it down. Alternatively, this operation is guaranteed synchronous if invoked by the VCPU itself. 2. After a VCPU is initialised, there is currently no way to drop all its references to domain memory. Even a VCPU that is down still holds memory references via its pagetable base pointer and GDT. It is good practice to move a VCPU onto an 'idle' or default page table, LDT and GDT before bringing it down. [ ] - HYPERVISOR_vcpu_op(VCPUOP_is_up, ...) Returns 1 if the given VCPU is up. [ ] - HYPERVISOR_vcpu_op(VCPUOP_get_runstate_info, ...) Return information about the state and running time of a VCPU. [ ] - HYPERVISOR_vcpu_op(VCPUOP_register_runstate_memory_area, ...) 
Register a shared memory area from which the guest may obtain its own runstate information without needing to execute a hypercall. Notes: 1. The registered address may be virtual or physical, depending on the platform. The virtual address should be registered on x86 systems. 2. Only one shared area may be registered per VCPU. The shared area is updated by the hypervisor each time the VCPU is scheduled. Thus runstate.state will always be RUNSTATE_running and runstate.state_entry_time will indicate the system time at which the VCPU was last scheduled to run. [ ] DESCRIPTOR RELATED CALLS - HYPERVISOR_set_gdt(unsigned long *frame_list, int entries) Load the global descriptor table. For non-shadow-translate mode guests, the frame_list is a list of machine pages which contain the gdt. [ VMI_SetGDT* ] - HYPERVISOR_set_trap_table(struct trap_info *table) Load the interrupt descriptor table. The trap table is in a format which allows easier access from C code. It's easier to build and easier to use in software trap despatch code. It can easily be converted into a hardware interrupt descriptor table. [ VMI_SetIDT, VMI_WriteIDTEntry ] - HYPERVISOR_mmuext_op(MMUEXT_SET_LDT, ...) Load local descriptor table. linear_addr: Linear address of LDT base (NB. must be page-aligned). nr_ents: Number of entries in LDT. [ VMI_SetLDT* ] - HYPERVISOR_update_descriptor(u64 pa, u64 desc) Write a descriptor to a GDT or LDT. For non-shadow-translate mode guests, the address is a machine address. [ VMI_WriteGDTEntry*, VMI_WriteLDTEntry* ] CPU CONTROL CALLS - HYPERVISOR_mmuext_op(MMUEXT_NEW_BASEPTR, ...) Write cr3 register. [ VMI_SetCR3* ] - shared_info->vcpu_info[]->arch->cr2 Read cr2 register. [ VMI_GetCR2 ] - HYPERVISOR_fpu_taskswitch(0) Clear the taskswitch flag in control register 0. [ VMI_CLTS ] - HYPERVISOR_fpu_taskswitch(1) Set the taskswitch flag in control register 0. [ VMI_SetCR0* ] - HYPERVISOR_set_debugreg(int reg, unsigned long value) Write debug register. 
[ VMI_SetDR ] - HYPERVISOR_get_debugreg(int reg) Read debug register. [ VMI_GetDR ] PROCESSOR INFORMATION CALLS STACK / PRIVILEGE TRANSITION CALLS - HYPERVISOR_stack_switch(unsigned long ss, unsigned long esp) Set the ring1 stack pointer/segment to use when switching to ring1 from ring3. [ VMI_UpdateKernelStack ] - HYPERVISOR_iret [ VMI_IRET ] I/O CALLS - HYPERVISOR_physdev_op(PHYSDEVOP_SET_IOPL, ...) Set the IOPL mask. [ VMI_SetIOPLMask ] - HYPERVISOR_mmuext_op(MMUEXT_FLUSH_CACHE) No additional arguments. Writes back and flushes cache contents. (Can just trap and emulate here). [ VMI_WBINVD ] - HYPERVISOR_physdev_op(PHYSDEVOP_IRQ_UNMASK_NOTIFY, ...) Advertise unmask of physical interrupt to hypervisor. [ ] - HYPERVISOR_physdev_op(PHYSDEVOP_IRQ_STATUS_QUERY,...) Query if physical interrupt needs unmask notify. [ ] - HYPERVISOR_physdev_op(PHYSDEVOP_SET_IOBITMAP, ...) Set IO bitmap for guest. [ ] - HYPERVISOR_physdev_op(PHYSDEVOP_APIC_READ, ...) Read IO-APIC register. [ ] - HYPERVISOR_physdev_op(PHYSDEVOP_APIC_WRITE, ...) Write IO-APIC register. [ ] - HYPERVISOR_physdev_op(PHYSDEVOP_ASSIGN_VECTOR, ...) Assign vector to interrupt. [ ] APIC CALLS TIMER CALLS - HYPERVISOR_set_timer_op(...) Set timeout for when to trigger timer interrupt even if guest is blocked. MMU CALLS - HYPERVISOR_mmuext_op(MMUEXT_(UN)PIN_*_TABLE mfn: Machine frame number to be (un)pinned as a p.t. page. [ RegisterPageType* ] - HYPERVISOR_mmuext_op(MMUEXT_TLB_FLUSH_LOCAL) No additional arguments. Flushes local TLB. [ VMI_FlushTLB ] - HYPERVISOR_mmuext_op(MMUEXT_INVLPG_LOCAL) linear_addr: Linear address to be flushed from the local TLB. [ VMI_InvalPage ] - HYPERVISOR_mmuext_op(MMUEXT_TLB_FLUSH_MULTI) vcpumask: Pointer to bitmap of VCPUs to be flushed. - HYPERVISOR_mmuext_op(MMUEXT_INVLPG_MULTI) linear_addr: Linear address to be flushed. vcpumask: Pointer to bitmap of VCPUs to be flushed. - HYPERVISOR_mmuext_op(MMUEXT_TLB_FLUSH_ALL) No additional arguments. Flushes all VCPUs' TLBs. 
- HYPERVISOR_mmuext_op(MMUEXT_INVLPG_ALL)

  linear_addr: Linear address to be flushed from all VCPUs' TLBs.

- HYPERVISOR_update_va_mapping(...)

  Update pagetable entry mapping a given virtual address. Avoids having
  to map the pagetable page in the hypervisor by using a linear pagetable
  mapping. Also flush the TLB if requested.

  [ ]

- HYPERVISOR_mmu_update(MMU_NORMAL_PT_UPDATE, ...)

  Update an entry in a page table.

  [ VMI_SetPte* ]

- HYPERVISOR_mmu_update(MMU_MACHPHYS_UPDATE, ...)

  Update machine -> phys table entry.

  [ no machine -> phys in VMI ]

MEMORY

- HYPERVISOR_memory_op(XENMEM_increase_reservation, ...)

  Increase number of frames

  [ ]

- HYPERVISOR_memory_op(XENMEM_decrease_reservation, ...)

  Drop frames from reservation

  [ ]

- HYPERVISOR_memory_op(XENMEM_populate_physmap, ...)

  [ ]

- HYPERVISOR_memory_op(XENMEM_maximum_ram_page, ...)

  Get maximum MFN of mapped RAM in domain

  [ ]

- HYPERVISOR_memory_op(XENMEM_current_reservation, ...)

  Get current memory reservation (in pages) of domain

  [ ]

- HYPERVISOR_memory_op(XENMEM_maximum_reservation, ...)

  Get maximum memory reservation (in pages) of domain

  [ ]

MISC

- HYPERVISOR_console_io()

  read/write to console (privileged)

- HYPERVISOR_xen_version(XENVER_version, NULL)

  Return major:minor (16:16).

- HYPERVISOR_xen_version(XENVER_extraversion)

  Return extra version (-unstable, .subminor)

- HYPERVISOR_xen_version(XENVER_compile_info)

  Return hypervisor compile information.

- HYPERVISOR_xen_version(XENVER_capabilities)

  Return list of supported guest interfaces.

- HYPERVISOR_xen_version(XENVER_platform_parameters)

  Return information about the platform.

- HYPERVISOR_xen_version(XENVER_get_features)

  Return feature maps.

- HYPERVISOR_set_callbacks

  Set entry points for upcalls to the guest from the hypervisor.
  Used for event delivery and fatal condition notification.

- HYPERVISOR_vm_assist(VMASST_TYPE_4gb_segments)

  Enable emulation of wrap around segments.
- HYPERVISOR_vm_assist(VMASST_TYPE_4gb_segments_notify)

  Enable notification on wrap around segment event.

- HYPERVISOR_vm_assist(VMASST_TYPE_writable_pagetables)

  Enable writable pagetables.

- HYPERVISOR_nmi_op(XENNMI_register_callback)

  Register NMI callback for this (calling) VCPU. Currently this only
  makes sense for domain 0, vcpu 0. All other callers will be returned
  EINVAL.

- HYPERVISOR_nmi_op(XENNMI_unregister_callback)

  Deregister NMI callback for this (calling) VCPU.

- HYPERVISOR_multicall

  Execute batch of hypercalls.

  [ VMI_SetDeferredMode*, VMI_FlushDeferredCalls* ]

There are some more management specific operations for dom0 and security that are arguably beyond the scope of this comparison. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation 2006-03-16 1:16 ` Chris Wright @ 2006-03-16 3:40 ` Eli Collins 0 siblings, 0 replies; 21+ messages in thread From: Eli Collins @ 2006-03-16 3:40 UTC (permalink / raw) To: Chris Wright Cc: Xen-devel, Wim Coekaerts, Christopher Li, Linux Kernel Mailing List, Virtualization Mailing List, Linus Torvalds, Anne Holler, Jan Beulich, Jyothy Reddy, Kip Macy, Ky Srinivasan, Leendert van Doorn Chris Wright wrote: > * Chris Wright (chrisw@sous-sol.org) wrote: <snip> > - HYPERVISOR_event_channel_op(EVTCHNOP_send, ...) > > Send an event to the remote end of the channel whose local endpoint is <port>. > > [ ] VMI_APICWrite is used to send IPIs. In general all the event channel calls (modulo referencing other guests) are not needed when using a virtual APIC. Using calls rather than a struct shared between the hypervisor and the guest is a cleaner interface (no messy changes to entry.S) and easier to maintain and version. This is true of shared_info_t in general, not just the event channel. > > - HYPERVISOR_vcpu_op(VCPUOP_get_runstate_info, ...) > > Return information about the state and running time of a VCPU. > > [ ] See the VMI timer interface. Note that the runstate interface above was added recently after Dan Hecht pointed out the need for properly paravirtualizing time (reporting stolen time correctly), the Xen 3.0.0/1 interfaces do not include runstate info. http://lists.xensource.com/archives/html/xen-devel/2006-02/msg00836.html It's too bad that Xen's vcpu_time_info_t presents the guest with the variables used to calculate time rather than time itself; requiring that the guest calculate time complicates the Linux patches and constrains future changes to time calculation in the hypervisor. > - HYPERVISOR_set_trap_table(struct trap_info *table) > > Load the interrupt descriptor table. > > The trap table is in a format which allows easier access from C code. > It's easier to build and easier to use in software trap despatch code. 
> It can easily be converted into a hardware interrupt descriptor table. > > [ VMI_SetIDT, VMI_WriteIDTEntry ] Passing in trap_info structs (like much of the Xen interface) requires copying to/from the guest when it's not necessary. To handle VT/Pacifica Xen needs to understand the hardware table format anyway, so it's simpler to just use the hardware format. > - HYPERVISOR_set_timer_op(...) > > Set timeout when to trigger timer interrupt even if guest is blocked. See VMI_SetAlarm and VMI_CancelAlarm. > - HYPERVISOR_memory_op(XENMEM_increase_reservation, ...) > > Increase number of frames > > [ ] > > - HYPERVISOR_memory_op(XENMEM_decrease_reservation, ...) > > Drop frames from reservation > > [ ] Ballooning for VMI guests is currently handled by a driver which uses a special port in the virtual IO space. The Xen increase reservation interface would be nicer if it took the pfns that the guest gave up as an argument (better for this logic to be in the balloon driver than the hypervisor). Relying on the hypervisor's allocator to get contiguous pages is also gross. From what I can tell extent_order is always 0 in XenLinux, an interface that just took a list of pages would be simpler. > - HYPERVISOR_xen_version(XENVER_compile_info) > > Return hypervisor compile information. This kind of information seems gratuitous. > - HYPERVISOR_set_callbacks > > Set entry points for upcalls to the guest from the hypervisor. > Used for event delivery and fatal condition notification. In the VMI "events" are just interrupts, delivered via the virtual IDT. > - HYPERVISOR_nmi_op(XENNMI_register_callback) > > Register NMI callback for this (calling) VCPU. Currently this only makes > sense for domain 0, vcpu 0. All other callers will be returned EINVAL. Like the event callback, this could be integrated into the virtual IDT. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation [not found] <200603131759.k2DHxeep005627@zach-dev.vmware.com> [not found] ` <20060313224902.GD12807@sorel.sous-sol.org> @ 2006-03-14 4:11 ` Rik van Riel 2006-03-22 20:05 ` Andi Kleen 2 siblings, 0 replies; 21+ messages in thread From: Rik van Riel @ 2006-03-14 4:11 UTC (permalink / raw) To: Zachary Amsden Cc: Linus Torvalds, Linux Kernel Mailing List, Virtualization Mailing List, Xen-devel, Andrew Morton, Dan Hecht, Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li, Joshua LeVasseur, Chris Wright, Jyothy Reddy, Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn On Mon, 13 Mar 2006, Zachary Amsden wrote: > + Zachary Amsden, Daniel Arai, Daniel Hecht, Pratap Subrahmanyam > + Copyright (C) 2005, 2006, VMware, Inc. > + All rights reserved Btw, this copyright claim doesn't look very GPL compatible. You might want to get that checked out. -- All Rights Reversed ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation [not found] <200603131759.k2DHxeep005627@zach-dev.vmware.com> [not found] ` <20060313224902.GD12807@sorel.sous-sol.org> 2006-03-14 4:11 ` Rik van Riel @ 2006-03-22 20:05 ` Andi Kleen 2006-03-22 21:34 ` Chris Wright ` (2 more replies) 2 siblings, 3 replies; 21+ messages in thread From: Andi Kleen @ 2006-03-22 20:05 UTC (permalink / raw) To: virtualization Cc: Zachary Amsden, Linus Torvalds, Linux Kernel Mailing List, Xen-devel, Andrew Morton, Dan Hecht, Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li, Joshua LeVasseur, Chris Wright, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn On Monday 13 March 2006 18:59, Zachary Amsden wrote: > + The general mechanism for providing customized features and > + capabilities is to provide notification of these feature through > + the CPUID call, How should that work since CPUID cannot be intercepted by a Hypervisor (without VMX/SVM)? > + Watchdog NMIs are of limited use if the OS is > + already correct and running on stable hardware; So how would your Hypervisor detect a kernel hung with interrupts off then? >> profiling NMIs are > + similarly of less use, since this task is accomplished with more accuracy > + in the VMM itself And how does oprofile know about this? > ; and NMIs for machine check errors should be handled > + outside of the VM. Right now yes, but if we ever implement intelligent memory ECC error handling it's questionable the hypervisor can do a better job. It has far less information about how memory is used than the kernel. > + The net result of these choices is that most of the calls are very > + easy to make from C-code, and calls that are likely to be required in > + low level trap handling code are easy to call from assembler. 
Most > + of these calls are also very easily implemented by the hypervisor > + vendor in C code, and only the performance critical calls from > + assembler paths require custom assembly implementations. > + > + CORE INTERFACE CALLS Did I miss it or do you never describe how to find these entry points? > + VMI_EnableInterrupts > + > + VMICALL void VMI_EnableInterrupts(void); > + > + Enable maskable interrupts on the processor. Note that the > + current implementation always will deliver any pending interrupts > + on a call which enables interrupts, for compatibility with kernel > + code which expects this behavior. Whether this should be required > + is open for debate. A subtle trap is also that it will do so on the next instruction, not the one after next like a real x86. At some point there was code in Linux that depended on this. > + VMICALL VMI_UINT64 VMI_RDMSR(VMI_UINT64 dummy, VMI_UINT32 reg); > + > + Read from a model specific register. This functions identically to the > + hardware RDMSR instruction. Note that a hypervisor may not implement > + the full set of MSRs supported by native hardware, since many of them > + are not useful in the context of a virtual machine. So what happens when the kernel tries to access an unimplemented MSR? Also we have had occasional workarounds in the past that required MSR writes with magic "passwords". How would these be handled? + > + VMI_CPUID > + > + /* Not expressible as a C function */ > + > + The CPUID instruction provides processor feature identification in a > + vendor specific manner. The instruction itself is non-virtualizable > + without hardware support, requiring a hypervisor assisted CPUID call > + that emulates the effect of the native instruction, while masking any > + unsupported CPU feature bits. Doesn't seem to be very useful because everybody can just call CPUID directly. > + The RDTSC instruction provides a cycles counter which may be made > + visible to userspace. 
For better or worse, many applications have made > + use of this feature to implement userspace timers, database indices, or > + for micro-benchmarking of performance. This instruction is extremely > + problematic for virtualization, because even though it is selectively > + virtualizable using trap and emulate, it is much more expensive to > + virtualize it in this fashion. On the other hand, if this instruction > + is allowed to execute without trapping, the cycle counter provided > + could be wrong in any number of circumstances due to hardware drift, > + migration, suspend/resume, CPU hotplug, and other unforeseen > + consequences of running inside of a virtual machine. There is no > + standard specification for how this instruction operates when issued > + from userspace programs, but the VMI call here provides a proper > + interface for the kernel to read this cycle counter. Yes, but it will be wrong in a native kernel too so why do you want to be better than native? Seems useless to me. > + VMI_RDPMC > + > + VMICALL VMI_UINT64 VMI_RDPMC(VMI_UINT64 dummy, VMI_UINT32 counter); > + > + Similar to RDTSC, this call provides the functionality of reading > + processor performance counters. It also is selectively visible to > + userspace, and maintaining accurate data for the performance counters > + is an extremely difficult task due to the side effects introduced by > + the hypervisor. Similar. Overall feeling is you have far too many calls. You seem to try to implement a full x86 replacement, but that makes it big and likely to be buggy. And it's likely impossible to implement in any Hypervisor short of a full emulator like yours. I would try a diet and only implement facilities that are actually likely to be used by modern OS. There was one other point I wanted to make but I forgot it now @) -Andi ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation 2006-03-22 20:05 ` Andi Kleen @ 2006-03-22 21:34 ` Chris Wright 2006-03-22 21:13 ` Andi Kleen 2006-03-22 21:39 ` [RFC, PATCH 1/24] i386 Vmi documentation II Andi Kleen 2006-03-22 22:04 ` [RFC, PATCH 1/24] i386 Vmi documentation Zachary Amsden 2 siblings, 1 reply; 21+ messages in thread From: Chris Wright @ 2006-03-22 21:34 UTC (permalink / raw) To: Andi Kleen Cc: virtualization, Zachary Amsden, Linus Torvalds, Linux Kernel Mailing List, Xen-devel, Andrew Morton, Dan Hecht, Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li, Joshua LeVasseur, Chris Wright, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn * Andi Kleen (ak@suse.de) wrote: > On Monday 13 March 2006 18:59, Zachary Amsden wrote: > > > + The general mechanism for providing customized features and > > + capabilities is to provide notification of these feature through > > + the CPUID call, > > How should that work since CPUID cannot be intercepted by > a Hypervisor (without VMX/SVM)? Yeah, it requires guest kernel cooperation/modification. > > + The net result of these choices is that most of the calls are very > > + easy to make from C-code, and calls that are likely to be required in > > + low level trap handling code are easy to call from assembler. Most > > + of these calls are also very easily implemented by the hypervisor > > + vendor in C code, and only the performance critical calls from > > + assembler paths require custom assembly implementations. > > + > > + CORE INTERFACE CALLS > > Did I miss it or do you never describe how to find these entry points? It's the ROM interface. For native they are emitted directly inline. For non-native, they are emitted as call stubs, which call to the ROM. I don't recall if it's in this doc, but the inline patch has all the gory details. thanks, -chris ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation 2006-03-22 21:34 ` Chris Wright @ 2006-03-22 21:13 ` Andi Kleen 2006-03-22 21:57 ` Chris Wright 2006-03-23 0:06 ` Zachary Amsden 0 siblings, 2 replies; 21+ messages in thread From: Andi Kleen @ 2006-03-22 21:13 UTC (permalink / raw) To: Chris Wright Cc: virtualization, Zachary Amsden, Linus Torvalds, Linux Kernel Mailing List, Xen-devel, Andrew Morton, Dan Hecht, Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li, Joshua LeVasseur, Chris Wright, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn On Wednesday 22 March 2006 22:34, Chris Wright wrote: > * Andi Kleen (ak@suse.de) wrote: > > On Monday 13 March 2006 18:59, Zachary Amsden wrote: > > > > > + The general mechanism for providing customized features and > > > + capabilities is to provide notification of these feature through > > > + the CPUID call, > > > > How should that work since CPUID cannot be intercepted by > > a Hypervisor (without VMX/SVM)? > > Yeah, it requires guest kernel cooperation/modification. Even then it's useless for many flags because any user program can (and will) call CPUID directly. > > > + The net result of these choices is that most of the calls are very > > > + easy to make from C-code, and calls that are likely to be required in > > > + low level trap handling code are easy to call from assembler. Most > > > + of these calls are also very easily implemented by the hypervisor > > > + vendor in C code, and only the performance critical calls from > > > + assembler paths require custom assembly implementations. > > > + > > > + CORE INTERFACE CALLS > > > > Did I miss it or do you never describe how to find these entry points? > > It's the ROM interface. For native they are emitted directly inline. > For non-native, they are emitted as call stubs, which call to the ROM. > I don't recall if it's in this doc, but the inline patch has all the > gory details. 
Sure the point was if they write this long fancy document why stop at documenting the last 5%? -Andi ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation 2006-03-22 21:13 ` Andi Kleen @ 2006-03-22 21:57 ` Chris Wright 2006-03-23 0:06 ` Zachary Amsden 1 sibling, 0 replies; 21+ messages in thread From: Chris Wright @ 2006-03-22 21:57 UTC (permalink / raw) To: Andi Kleen Cc: Chris Wright, virtualization, Zachary Amsden, Linus Torvalds, Linux Kernel Mailing List, Xen-devel, Andrew Morton, Dan Hecht, Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li, Joshua LeVasseur, Chris Wright, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn * Andi Kleen (ak@suse.de) wrote: > Even then it's useless for many flags because any user program can (and will) > call CPUID directly. Yes, doesn't handle userspace at all. It's useful only to get coherent view of flags in kernel. Right now, for example, Xen goes in and basically masks off flags retroactively which is not that nice either. thanks, -chris ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation 2006-03-22 21:13 ` Andi Kleen 2006-03-22 21:57 ` Chris Wright @ 2006-03-23 0:06 ` Zachary Amsden 1 sibling, 0 replies; 21+ messages in thread From: Zachary Amsden @ 2006-03-23 0:06 UTC (permalink / raw) To: Andi Kleen Cc: Chris Wright, virtualization, Linus Torvalds, Linux Kernel Mailing List, Xen-devel, Andrew Morton, Dan Hecht, Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li, Joshua LeVasseur, Chris Wright, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn Andi Kleen wrote: > Even then it's useless for many flags because any user program can (and will) > call CPUID directly. Turns out not to matter, since userspace can only make use of capabilities that are already available to userspace. If the feature bits for system features are visible to it, it doesn't really matter. Yes, this could be broken in some cases. But it turns out to be safe. Even sysenter support, which userspace does care about, is done via setting the vsyscall page up in the kernel, rather than userspace CPUID detection. > Sure the point was if they write this long fancy document why stop > at documenting the last 5%? > Because the last 5% is what is changing to meet Xen's needs. Why document something that you know you are going to break in a week? I chose to document the stable interfaces first. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation II 2006-03-22 20:05 ` Andi Kleen 2006-03-22 21:34 ` Chris Wright @ 2006-03-22 21:39 ` Andi Kleen 2006-03-22 22:43 ` Daniel Arai 2006-03-22 22:45 ` Zachary Amsden 2006-03-22 22:04 ` [RFC, PATCH 1/24] i386 Vmi documentation Zachary Amsden 2 siblings, 2 replies; 21+ messages in thread From: Andi Kleen @ 2006-03-22 21:39 UTC (permalink / raw) To: virtualization Cc: Zachary Amsden, Linus Torvalds, Linux Kernel Mailing List, Xen-devel, Andrew Morton, Dan Hecht, Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li, Joshua LeVasseur, Chris Wright, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn > There was one other point I wanted to make but I forgot it now @) Ah yes the point was that since most of the implementations of the hypercalls likely need fast access to some per CPU state. How would you plan to implement that? Should it be covered in the specification? -Andi ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation II 2006-03-22 21:39 ` [RFC, PATCH 1/24] i386 Vmi documentation II Andi Kleen @ 2006-03-22 22:43 ` Daniel Arai 2006-03-22 22:45 ` Zachary Amsden 1 sibling, 0 replies; 21+ messages in thread From: Daniel Arai @ 2006-03-22 22:43 UTC (permalink / raw) To: Andi Kleen Cc: virtualization, Zachary Amsden, Linus Torvalds, Linux Kernel Mailing List, Xen-devel, Andrew Morton, Dan Hecht, Anne Holler, Pratap Subrahmanyam, Christopher Li, Joshua LeVasseur, Chris Wright, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn Andi Kleen wrote: >>There was one other point I wanted to make but I forgot it now @) > > > Ah yes the point was that since most of the implementations of the hypercalls > likely need fast access to some per CPU state. How would you plan > to implement that? Should it be covered in the specification? I can explain how it works, but it's deliberately not part of the specification. The whole point of the ROM layer is that it abstracts away the actual hypercall mechanism for the guest, and the hypervisor can implement whatever is appropriate for it. This layer allows a VMI guest to run on VMware's hypervisor, as well as on top of Xen. We reserve the top 64MB of linear address space for the hypervisor. Part of this reserved space contains data structures that are shared by the VMI ROM layer and the hypervisor. Simple VMI interface calls like "read CR 2" are implemented by reading or writing data from this shared data structure, and don't require a privilege level change. Things like page table updates go into a queue in the shared area, so they can easily be batched and processed with only one actual call into the hypervisor. Because the guest can manipulate this data page directly, the hypervisor has to treat any information in it as untrusted. This is similar to how the kernel has to treat syscall arguments. 
Guest user code can't touch the shared area, so it doesn't introduce any new kernel security holes. The guest kernel could deliberately mess up the shared area contents, but guest kernel code could corrupt any arbitrary (virtual) machine state anyway. Because this level of interface is hidden from the guest, we can (and do) make changes to it without changing VMI itself, or needing to recompile the guest. We deliberately do not document it. A guest that adheres to the VMI interface can move to new versions of the ROM/hypervisor interface (that implement the same VMI interface) without changes. Dan. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation II 2006-03-22 21:39 ` [RFC, PATCH 1/24] i386 Vmi documentation II Andi Kleen 2006-03-22 22:43 ` Daniel Arai @ 2006-03-22 22:45 ` Zachary Amsden 2006-03-22 22:38 ` Andi Kleen 1 sibling, 1 reply; 21+ messages in thread From: Zachary Amsden @ 2006-03-22 22:45 UTC (permalink / raw) To: Andi Kleen Cc: virtualization, Linus Torvalds, Linux Kernel Mailing List, Xen-devel, Andrew Morton, Dan Hecht, Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li, Joshua LeVasseur, Chris Wright, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn Andi Kleen wrote: >> There was one other point I wanted to make but I forgot it now @) >> > > Ah yes the point was that since most of the implementations of the hypercalls > likely need fast access to some per CPU state. How would you plan > to implement that? Should it be covered in the specification? > Probably. We don't have that issue currently, as we have a private mapping of CPU state for each VCPU at a fixed address. Seeing as that is not so feasible under Xen, I would say we need to put something in the spec. The way Xen deals with this is rather gruesome today. It needs callbacks into the kernel to disable preemption so that it can atomically compute the address of the VCPU area, just so that it can disable interrupts on the VCPU. These contortions make backbending look easy. I propose an entirely different approach - use segmentation. This needs to be in the spec, as we now need to add VMI hook points for saving and restoring user segments. But in the end it wins, even if you can't support per-cpu mappings using paging, you can do it with segmentation. You'll likely get even better performance. And you don't have to worry about these unclean callbacks into the guest kernel that really make the interface between Xen and XenoLinux completely enmeshed. 
And you can disable interrupts in one instruction:

    movb $0, %gs:hypervisor_intFlags

Zach ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation II 2006-03-22 22:45 ` Zachary Amsden @ 2006-03-22 22:38 ` Andi Kleen 2006-03-22 23:54 ` Zachary Amsden 0 siblings, 1 reply; 21+ messages in thread From: Andi Kleen @ 2006-03-22 22:38 UTC (permalink / raw) To: Zachary Amsden Cc: virtualization, Linus Torvalds, Linux Kernel Mailing List, Xen-devel, Andrew Morton, Dan Hecht, Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li, Joshua LeVasseur, Chris Wright, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn On Wednesday 22 March 2006 23:45, Zachary Amsden wrote: > I propose an entirely different approach - use segmentation. That would require a lot of changes to save/restore the segmentation register at kernel entry/exit since there is no swapgs on i386. And will be likely slower there too and also even slow down the VMI-kernel-no-hypervisor. Still might be the best option. How did that rumoured Xenolinux-over-VMI implementation solve that problem? -Andi ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation II 2006-03-22 22:38 ` Andi Kleen @ 2006-03-22 23:54 ` Zachary Amsden 2006-03-22 23:37 ` Andi Kleen 0 siblings, 1 reply; 21+ messages in thread From: Zachary Amsden @ 2006-03-22 23:54 UTC (permalink / raw) To: Andi Kleen Cc: virtualization, Linus Torvalds, Linux Kernel Mailing List, Xen-devel, Andrew Morton, Dan Hecht, Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li, Joshua LeVasseur, Chris Wright, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn Andi Kleen wrote: > On Wednesday 22 March 2006 23:45, Zachary Amsden wrote: > >> I propose an entirely different approach - use segmentation. > > That would require a lot of changes to save/restore the segmentation > register at kernel entry/exit since there is no swapgs on i386. > And will be likely slower there too and also even slow down the > VMI-kernel-no-hypervisor. There are no changes required to the kernel entry / exit paths. With save/restore segment support in the VMI, reserving one segment for the hypervisor data area is easy. I take it back. There is one required change:

    kernel_entry:
        hypervisor_entry_hook
        sti
        ....            /* kernel code */

This hypervisor_entry_hook can be a nop on native hardware, and the following for Xen:

    push %gs
    mov CPU_HYPER_SEL, %gs
    pop %gs:SAVED_USER_GS

You already have the IRET / SYSEXIT hooks to restore it on the way back. And now you have a segment reserved that allows you to deal with 16-bit stack segments during the IRET. > Still might be the best option. > > How did that rumoured Xenolinux-over-VMI implementation solve that problem? > !CONFIG_SMP -- as I believe I saw in the latest Xen patches sent out as well? ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation II 2006-03-22 23:54 ` Zachary Amsden @ 2006-03-22 23:37 ` Andi Kleen 0 siblings, 0 replies; 21+ messages in thread From: Andi Kleen @ 2006-03-22 23:37 UTC (permalink / raw) To: Zachary Amsden Cc: virtualization, Linus Torvalds, Linux Kernel Mailing List, Xen-devel, Andrew Morton, Dan Hecht, Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li, Joshua LeVasseur, Chris Wright, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn On Thursday 23 March 2006 00:54, Zachary Amsden wrote: > Andi Kleen wrote: > > On Wednesday 22 March 2006 23:45, Zachary Amsden wrote: > > > > > >> I propose an entirely different approach - use segmentation. > >> > > > > That would require a lot of changes to save/restore the segmentation > > register at kernel entry/exit since there is no swapgs on i386. > > And will be likely slower there too and also even slow down the > > VMI-kernel-no-hypervisor. > > > > There are no changes required to the kernel entry / exit paths. With > save/restore segment support in the VMI, reserving one segment for the > hypervisor data area is easy. Ok that might work yes. > > Still might be the best option. > > > > How did that rumoured Xenolinux-over-VMI implementation solve that problem? > > > > !CONFIG_SMP -- as I believe I saw in the latest Xen patches sent out as > well? Ah, cheating. This means the rumoured benchmark numbers are dubious too I guess. -Andi ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC, PATCH 1/24] i386 Vmi documentation 2006-03-22 20:05 ` Andi Kleen 2006-03-22 21:34 ` Chris Wright 2006-03-22 21:39 ` [RFC, PATCH 1/24] i386 Vmi documentation II Andi Kleen @ 2006-03-22 22:04 ` Zachary Amsden 2006-03-22 21:58 ` Andi Kleen 2 siblings, 1 reply; 21+ messages in thread From: Zachary Amsden @ 2006-03-22 22:04 UTC (permalink / raw) To: Andi Kleen Cc: virtualization, Linus Torvalds, Linux Kernel Mailing List, Xen-devel, Andrew Morton, Dan Hecht, Dan Arai, Anne Holler, Pratap Subrahmanyam, Christopher Li, Joshua LeVasseur, Chris Wright, Rik Van Riel, Jyothy Reddy, Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan, Wim Coekaerts, Leendert van Doorn Andi Kleen wrote: > On Monday 13 March 2006 18:59, Zachary Amsden wrote: > > >> + The general mechanism for providing customized features and >> + capabilities is to provide notification of these feature through >> + the CPUID call, >> > > How should that work since CPUID cannot be intercepted by > a Hypervisor (without VMX/SVM)? > It can be intercepted with a VMI call. I actually think overloading this for VM features as well, although convenient, might turn out to be unwieldy. >> + Watchdog NMIs are of limited use if the OS is >> + already correct and running on stable hardware; >> > > So how would your Hypervisor detect a kernel hung with interrupts > off then? > The hypervisor can detect it fine - we never disable hardware interrupts or NMIs except for very small windows in the fault handlers. I'm arguing that philosophically, using NMIs to detect a software hang means you have broken software. NMIs for detecting hardware induced hangs are common and reasonable things to do, but on virtual hardware, that shouldn't happen either. > >>> profiling NMIs are >>> >> + similarly of less use, since this task is accomplished with more accuracy >> + in the VMM itself >> > > And how does oprofile know about this? > It doesn't. But consider that oprofile is a time based NMI sampler. 
That is less accurate in a VM when you have virtual time, and, somewhat skewed spacing between NMI delivery, and less than accurate performance counter information. You can get a lot better results for benchmarks using the VMM to sample the guest instead. >> ; and NMIs for machine check errors should be handled >> + outside of the VM. >> > > Right now yes, but if we ever implement intelligent memory ECC error handling it's questionable > the hypervisor can do a better job. It has far less information about how memory > is used than the kernel. > Right. I think I may have been too proactive in my defense of disabling NMIs. I agree now, it is a bug, and it really should be supported. But it was a convenient shortcut to getting things working - otherwise you have to have the NMI avoidance logic in entry.S, which is not properly virtualizable (checks raw segments without masking RPL). But seeing as I already fixed that, I think we actually could re-enable NMIs now. Though the usefulness of common cases may be compromised, having the VM do machine check handling on its own data pages (so it can figure out which processes to kill / recover) is an extremely useful case. >> + CORE INTERFACE CALLS >> > > Did I miss it or do you never describe how to find these entry points? > It should be described in the ROM probing section in more detail. Our documentation is getting better with time ;) > >> + VMI_EnableInterrupts >> + >> + VMICALL void VMI_EnableInterrupts(void); >> + >> + Enable maskable interrupts on the processor. Note that the >> + current implementation always will deliver any pending interrupts >> + on a call which enables interrupts, for compatibility with kernel >> + code which expects this behavior. Whether this should be required >> + is open for debate. >> > > A subtle trap is also that it will do so on the next instruction, not the > followon to next like a real x86. At some point there was code in Linux > that dependend on this. > There still is. 
This is why you have the "sti; sysexit" pair, and why safe_halt() is
"sti; hlt".  You really don't want interrupts in those windows.  This
architectural oddity forced us to make these pairs into VMI interface
calls.  A third one, used by some operating systems, is "sti; nop; cli"
- i.e. deliver pending interrupts and disable again.  In most other
cases, it doesn't matter.

>> +      VMICALL VMI_UINT64 VMI_RDMSR(VMI_UINT64 dummy, VMI_UINT32 reg);
>> +
>> +   Read from a model specific register.  This functions identically
>> +   to the hardware RDMSR instruction.  Note that a hypervisor may not
>> +   implement the full set of MSRs supported by native hardware, since
>> +   many of them are not useful in the context of a virtual machine.
>
> So what happens when the kernel tries to access an unimplemented MSR?
>
> Also we have occasionally had workarounds in the past that required
> MSR writes with magic "passwords".  How would these be handled?

I actually already implemented your suggestion of making MSR reads and
writes use trap and emulate - so all of these issues go away.  Whether
forcing trap and emulate is a good idea for a minimal open source
hypervisor is another debate.

>> +   VMI_CPUID
>> +
>> +      /* Not expressible as a C function */
>> +
>> +   The CPUID instruction provides processor feature identification in
>> +   a vendor specific manner.  The instruction itself is
>> +   non-virtualizable without hardware support, requiring a hypervisor
>> +   assisted CPUID call that emulates the effect of the native
>> +   instruction, while masking any unsupported CPU feature bits.
>
> Doesn't seem to be very useful because everybody can just call CPUID
> directly.

Which is why the kernel _must_ use the CPUID VMI call.  We're a little
bit broken in this respect today, since the boot code in head.S does
CPUID probing before the VMI init call.  It works for us because we use
binary translation of the kernel up to this point.
In the end, this will disappear, and the CPUID probing will be done in
the alternative entry point known as the "start of day" state, where the
kernel is already pre-virtualized.

> Yes, but it will be wrong in a native kernel too so why do you want
> to be better than native?
>
> Seems useless to me.

Agree.  TSC is broken in so many ways, that it really should not be used
for anything other than unreliable cycle counting.

>> +   VMI_RDPMC
>> +
>> +      VMICALL VMI_UINT64 VMI_RDPMC(VMI_UINT64 dummy, VMI_UINT32 counter);
>> +
>> +   Similar to RDTSC, this call provides the functionality of reading
>> +   processor performance counters.  It also is selectively visible to
>> +   userspace, and maintaining accurate data for the performance
>> +   counters is an extremely difficult task due to the side effects
>> +   introduced by the hypervisor.
>
> Similar.
>
> Overall feeling is you have far too many calls.  You seem to try to
> implement a full x86 replacement, but that makes it big and likely to
> be buggy.  And it's likely impossible to implement in any Hypervisor
> short of a full emulator like yours.
>
> I would try a diet and only implement facilities that are actually
> likely to be used by modern OS.

The interface can't really go on too much of a diet - some kernel
somewhere, maybe not Linux, under some hypervisor, maybe not VMware or
Xen, may want to use these features.  What the interface can be is an a
la carte menu.  By allowing specific instructions to fall back to trap
and emulate, mainstream OSes don't need to be bothered with changing to
match some rich interface.  Other OSes may have vastly different
requirements, and might want to make use of these features heavily, if
they are available.  And hypervisors don't need to implement anything
special for these either.
Our RDPMC implementation in the ROM is quite simple:

    /*
     * VMI_RDPMC - Binary RDPMC equivalent
     * Must clobber no registers (other than %eax, %edx return)
     */
    VMI_ENTRY(RDPMC)
        rdpmc
        vmireturn
    VMI_CALL_END

Taken to the extreme, where the patch processing is done before the
kernel runs, in the hypervisor itself, using the annotation table
provided by the guest kernel, it is even easier.  If you see an
annotation for a feature you don't care to implement, you don't do
anything at all - you leave the native instructions as they are.  In
this case, neither the kernel nor the hypervisor has any extra code at
all to deal with cases they don't care about.  But the rich interface is
still there, and if someone wants to bathe in butter, who are we to
judge.

There certainly are uses for it.  For example, WRMSR is not on critical
paths in i386 Linux today.  That does not mean we should remove it from
the interface.  When a new processor core comes along and, all of a
sudden, you really need that interface back, you want it ready for use.
And this case really did happen - FSBASE and GSBASE MSR writes moved
onto the critical path in x86_64.

I think I carried the diet analogy a little far.

> There was one other point I wanted to make but I forgot it now @)

Thanks again for your feedback,

Zach

^ permalink raw reply	[flat|nested] 21+ messages in thread
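As an aside, the lost-wakeup hazard behind the "sti; hlt" pairing discussed above can be modeled in a few lines of C.  This is a toy simulation only - the booleans stand in for real interrupt state, and none of it is kernel or VMware code:

```c
#include <stdbool.h>

/* Toy model of the idle path with an interrupt already pending.
 * Each function returns true if the CPU ends up stuck in hlt. */

/* sti ... hlt with a window between them: the interrupt is delivered
 * and handled after sti but before hlt, then hlt runs anyway. */
static bool idle_with_window(void)
{
    bool pending = true;
    bool halted;

    /* sti: interrupts enabled immediately */
    if (pending)
        pending = false;        /* irq handled here, in the window */
    /* hlt: sleep until the next interrupt */
    halted = true;
    if (pending)
        halted = false;         /* nothing pending - never taken */
    return halted;              /* true: the wakeup was lost */
}

/* Atomic "sti; hlt": x86 delays the interrupt-enable by one
 * instruction, so the pending irq interrupts the hlt itself. */
static bool idle_atomic(void)
{
    bool pending = true;
    bool halted;

    /* sti; hlt back to back: enable takes effect once hlt has begun */
    halted = true;
    if (pending) {
        pending = false;        /* irq delivered while halted */
        halted = false;         /* hlt is interrupted - CPU wakes */
    }
    return halted;              /* false: no wakeup is lost */
}
```

The model shows why the pair has to be a single VMI call: split into two calls, any interrupt delivered between them is consumed before the halt.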
* Re: [RFC, PATCH 1/24] i386 Vmi documentation
  2006-03-22 22:04 ` [RFC, PATCH 1/24] i386 Vmi documentation Zachary Amsden
@ 2006-03-22 21:58 ` Andi Kleen
  0 siblings, 0 replies; 21+ messages in thread
From: Andi Kleen @ 2006-03-22 21:58 UTC (permalink / raw)
To: Zachary Amsden
Cc: virtualization, Linus Torvalds, Linux Kernel Mailing List, Xen-devel,
    Andrew Morton, Dan Hecht, Dan Arai, Anne Holler, Pratap Subrahmanyam,
    Christopher Li, Joshua LeVasseur, Chris Wright, Rik Van Riel,
    Jyothy Reddy, Jack Lo, Kip Macy, Jan Beulich, Ky Srinivasan,
    Wim Coekaerts, Leendert van Doorn

On Wednesday 22 March 2006 23:04, Zachary Amsden wrote:

> It doesn't.  But consider that oprofile is a time based NMI sampler.

That's one of its modes, mostly used by people with broken APICs.  But
the primary mode of operation is an event based sampler using
performance counter events.

> There still is.  This is why you have the "sti; sysexit" pair, and why
> safe_halt() is "sti; hlt".  You really don't want interrupts in those
> windows.  The architectural oddity forced us to make these calls into
> the VMI interface.  A third one, used by some operating systems, is
> "sti; nop; cli" - i.e. deliver pending interrupts and disable again.
> In most other cases, it doesn't matter.

Sounds like something that should be discussed in the spec.

>> Seems useless to me.
>
> Agree.  TSC is broken in so many ways, that it really should not be
> used for anything other than unreliable cycle counting.

It can be used with an aggressive white list and if you know what you're
doing.  The x86-64 kernel follows this approach, which allows it to be
used at least on some common classes of systems (AMD single core, Intel
non-NUMA P4).

Actually, for cycle counting it is useless because on newer Intel CPUs
it always runs at the highest P state, no matter which P state you're
in.  My evil plan to deal with that was to export the cycle count
running in PMC0 for the NMI watchdog to ring 3, so people could just use
RDPMC 0 instead.
There was some opposition to this idea, unfortunately.  But the
hypervisor should keep its fingers out of all that as far as possible.

>>> +   VMI_RDPMC
>>> +
>>> +      VMICALL VMI_UINT64 VMI_RDPMC(VMI_UINT64 dummy, VMI_UINT32 counter);
>>> +
>>> +   Similar to RDTSC, this call provides the functionality of reading
>>> +   processor performance counters.  It also is selectively visible
>>> +   to userspace, and maintaining accurate data for the performance
>>> +   counters is an extremely difficult task due to the side effects
>>> +   introduced by the hypervisor.
>>
>> Similar.
>>
>> Overall feeling is you have far too many calls.  You seem to try to
>> implement a full x86 replacement, but that makes it big and likely to
>> be buggy.  And it's likely impossible to implement in any Hypervisor
>> short of a full emulator like yours.
>>
>> I would try a diet and only implement facilities that are actually
>> likely to be used by modern OS.
>
> The interface can't really go on too much of a diet - some kernel
> somewhere, maybe not Linux, under some hypervisor, maybe not VMware or
> Xen, may want to use these features.

This might sound arrogant, but I would expect that nearly all modern
kernels don't use much more of the x86 subset than Linux is using (the
biggest exception I can think of would be interrupt priorities).

> Taken to the extreme, where the patch processing is done before the
> kernel runs, in the hypervisor itself, using the annotation table
> provided by the guest kernel, it is even easier.  If you see an
> annotation for a feature you don't care to implement, you don't do
> anything at all - you leave the native instructions as they are.  In
> this case, neither the kernel nor the hypervisor has any extra code at
> all to deal with cases they don't care about.  But the rich interface
> is still there, and if someone wants to bathe in butter, who are we to
> judge.
So basically you're trying to implement VT/Pacifica in software with all
these traps?  I'm not sure that's the right approach.  My feeling would
be that for an efficient paravirtualized interface, a better approach
would be to optimize the kernels a bit more for the emulated case.
Longer term there will be more optimizations (like better VM
interaction, maybe, or para drivers that work faster).  But the base
interface is already so big that adding even more stuff might make it
explode at some point.

> There certainly are uses for it.  For example, WRMSR is not on
> critical paths in i386 Linux today.

Actually I got a feature request today that would require optionally
doing a wrmsr in the context switch :/

-Andi

^ permalink raw reply	[flat|nested] 21+ messages in thread
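One common way to keep such a context-switch wrmsr off the hot path is to shadow the last value written per CPU and skip redundant writes.  The sketch below is an illustration only - the helper names are hypothetical and the real wrmsr instruction is stubbed out with a call counter:

```c
#include <stdint.h>

/* Stub standing in for the real wrmsr instruction; counting calls
 * makes the effect of the shadow visible. */
static int wrmsr_calls;
static void do_wrmsr(uint32_t msr, uint64_t val)
{
    (void)msr; (void)val;
    wrmsr_calls++;
}

/* Per-CPU shadow of the last value written to one MSR. */
static uint64_t shadow_val;
static int shadow_valid;

/* Write the MSR only when the value actually changes. */
static void wrmsr_lazy(uint32_t msr, uint64_t val)
{
    if (shadow_valid && shadow_val == val)
        return;                 /* identical value - skip the write */
    do_wrmsr(msr, val);
    shadow_val = val;
    shadow_valid = 1;
}
```

When most context switches are between tasks with the same MSR value, almost all of the expensive writes disappear; only genuine changes reach the hardware.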
end of thread
Thread overview: 21+ messages
-- links below jump to the message on this page --
[not found] <200603131759.k2DHxeep005627@zach-dev.vmware.com>
[not found] ` <20060313224902.GD12807@sorel.sous-sol.org>
2006-03-14 0:00 ` [RFC, PATCH 1/24] i386 Vmi documentation Zachary Amsden
2006-03-14 21:27 ` Chris Wright
[not found] ` <441743BD.1070108@vmware.com>
2006-03-15 2:57 ` Chris Wright
2006-03-15 5:44 ` Zachary Amsden
2006-03-15 22:56 ` Daniel Arai
2006-03-16 1:16 ` Chris Wright
2006-03-16 3:40 ` Eli Collins
2006-03-14 4:11 ` Rik van Riel
2006-03-22 20:05 ` Andi Kleen
2006-03-22 21:34 ` Chris Wright
2006-03-22 21:13 ` Andi Kleen
2006-03-22 21:57 ` Chris Wright
2006-03-23 0:06 ` Zachary Amsden
2006-03-22 21:39 ` [RFC, PATCH 1/24] i386 Vmi documentation II Andi Kleen
2006-03-22 22:43 ` Daniel Arai
2006-03-22 22:45 ` Zachary Amsden
2006-03-22 22:38 ` Andi Kleen
2006-03-22 23:54 ` Zachary Amsden
2006-03-22 23:37 ` Andi Kleen
2006-03-22 22:04 ` [RFC, PATCH 1/24] i386 Vmi documentation Zachary Amsden
2006-03-22 21:58 ` Andi Kleen