Re: [RFC, PATCH 1/24] i386 Vmi documentation

From: Zachary Amsden <zach@vmware.com>
To: Chris Wright <chrisw@sous-sol.org>
Cc: Andrew Morton <akpm@osdl.org>, Joshua LeVasseur <jtl@ira.uka.de>,
	Xen-devel <xen-devel@lists.xensource.com>,
	Pratap Subrahmanyam <pratap@vmware.com>,
	Wim Coekaerts <wim.coekaerts@oracle.com>,
	Jack Lo <jlo@vmware.com>, Dan Hecht <dhecht@vmware.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Jan Beulich <jbeulich@novell.com>,
	Christopher Li <chrisl@vmware.com>,
	Virtualization Mailing List <virtualization@lists.osdl.org>,
	Linus Torvalds <torvalds@osdl.org>, Anne Holler <anne@vmware.com>,
	Jyothy Reddy <jreddy@vmware.com>, Kip Macy <kmacy@fsmware.com>,
	Ky Srinivasan <ksrinivasan@novell.com>,
	Leendert van Doorn <leendert@watson.ibm.com>,
	Dan Arai <arai@vmware.com>
Subject: Re: [RFC, PATCH 1/24] i386 Vmi documentation
Date: Tue, 14 Mar 2006 14:29:17 -0800
Message-ID: <441743BD.1070108@vmware.com>
In-Reply-To: <20060314212742.GL12807@sorel.sous-sol.org>

Chris Wright wrote:
> Yup.  Just noting that API without clear users is the type of thing that
> is regularly rejected from Linux.
>   

Yes.  It is becoming clear from feedback from you and Andi that there 
are things in the API that are unnecessary for Linux.  But keep in mind, 
they may be necessary for other operating systems.  I think we should 
probably drop the Linux changes to issue things like RDTSC and such via 
VMI call wrappers.  It does simplify the Linux interface.

But I still think they should be part of the spec - an optional part of 
the spec, that need not be implemented by Linux or even by the 
hypervisor.  If some vendor or kernel combination finds that they are a 
performance concern, as they readily could become, they can drop in the 
functionality when and if they need it.  No reason to complicate things 
on either end, but also no reason to purposely add asymmetry to the spec 
just because the current set of calls is sufficient for the currently 
known fast paths.

>   
>>> Many of these will look the same on x86-64, but the API is not
>>> 64-bit clean so has to be duplicated.
>>>       
>> Yes, register pressure forces the PAE API to be slightly different from 
>> the long mode API.  But long mode has different register calling 
>> conventions anyway, so it is not a big deal.   The important thing is, 
>> once the MMU mess is sorted out, the same interface can be used from C 
>> code for both platforms, and the details about which lock primitives are 
>> used can be hidden.  The cost of which lock primitives to use differs on 
>> 32-bit and 64-bit platforms, across vendor, and the style of the 
>> hypervisor implementation (direct / writable / shadowed page tables).
>>     
>
> My mistake, it makes perfect sense from ABI point of view.
>
>   
>>> Is this the batching, multi-call analog?
>>>       
>> Yes.  This interface needs to be documented in a much better fashion.  
>> But the idea is that VMI calls are mapped into Xen multicalls by 
>> allowing deferred completion of certain classes of operations.  That 
>> same mode of deferred operation is used to batch PTE updates in our 
>> implementation (although Xen uses writable page tables now, this used to 
>> provide the same support facility in Xen as well).  To complement this, 
>> there is an explicit flush - and it turns out this maps very nicely, 
>> getting rid of a lot of the XenoLinux changes around mmu_context.h.
>>     
>
> Are these valid differences?  Or did I misunderstand the batching
> mechanism?
>
> 1) can't use stack based args, so have to allocate each data structure,
> which could conceivably fail unless it's some fixed buffer.
>   

We use a fixed buffer that is private to our VMI layer.  It's a per-cpu 
packing struct for hypercalls.  Dynamically allocating from the kernel 
inside the interface layer is a really great way to get into a whole lot 
of trouble.
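
A minimal sketch of the fixed-buffer idea, not the actual VMI layer: 
every name here (vmi_call_queue, vmi_queue_call, vmi_flush), the slot 
count, and the CPU count are assumptions made up for illustration.  
The point it shows is that queuing a call never allocates, so it 
cannot fail; a full buffer simply forces an early flush.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VMI_QUEUE_SLOTS 32u  /* hypothetical capacity per CPU */
#define VMI_MAX_CPUS     4u  /* hypothetical CPU count */

/* One deferred call: identifier plus two arguments (illustrative layout). */
struct vmi_queued_call {
    uint32_t call_id;
    uint32_t arg0;
    uint32_t arg1;
};

/* Fixed, statically allocated packing buffer: nothing here can fail. */
struct vmi_call_queue {
    struct vmi_queued_call slot[VMI_QUEUE_SLOTS];
    size_t used;
    size_t flushed_total;    /* stands in for calls handed to the hypervisor */
};

/* One private queue per CPU, indexed by CPU number. */
static struct vmi_call_queue vmi_queues[VMI_MAX_CPUS];

/* Stand-in for issuing the batch; a real layer would make one multicall. */
static void vmi_flush(struct vmi_call_queue *q)
{
    q->flushed_total += q->used;
    q->used = 0;
}

/* Queue a call on the given CPU's buffer, flushing first if it is full. */
static void vmi_queue_call(unsigned cpu, uint32_t id, uint32_t a0, uint32_t a1)
{
    assert(cpu < VMI_MAX_CPUS);
    struct vmi_call_queue *q = &vmi_queues[cpu];

    if (q->used == VMI_QUEUE_SLOTS)
        vmi_flush(q);
    q->slot[q->used++] = (struct vmi_queued_call){ id, a0, a1 };
}
```

The explicit flush point then just calls vmi_flush() on the current 
CPU's queue.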

> 2) complicates the rom implementation slightly where implementation of
> each deferrable part of the API needs to have switch (am I deferred or
> not) to then build the batch, or make direct hypercall.
>   

This is an overhead that is easily absorbed by the gain.  The direct 
hypercalls are mostly either always direct, or always queued.  The page 
table updates already have conditional logic to do the right thing, and 
Xen doesn't require the queueing of these anymore anyways.  And the 
flush happens at an explicit point.  The best approach can still be fine 
tuned.  You could have separate VMI calls for queued vs. non-queued 
operation.  But that greatly bloats the interface and doesn't make sense 
for everything.  I believe the best solution is to annotate this in the 
VMI call itself.  Consider the VMI call number, not as an integer, but 
as an identifier tuple.  Perhaps I'm going overboard here.  Perhaps not.

31--------24-23---------16-15--------8-7-----------0
 | family   | call number | reserved  | annotation |
 ---------------------------------------------------

Now, you have multiple families of calls -

0x00 legacy
0x01 CPU
0x02 Segmentation
0x03 MMU
0xFF reserved for experimentation

And each family has children:

0x03 MMU:
   0x00  SetPTE
   0x01  SetLongPTE
   0x02  FlushTLB
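
To make the tuple concrete, here is a rough sketch of packing and 
unpacking the 32-bit identifier using the field layout and the family 
and call numbers listed above.  The enum and function names are 
hypothetical, and requiring the reserved byte to be zero is an 
assumption of this sketch, not something the layout dictates.

```c
#include <assert.h>
#include <stdint.h>

/* Family codes from the list above (names are illustrative). */
enum vmi_family {
    VMI_FAMILY_LEGACY       = 0x00,
    VMI_FAMILY_CPU          = 0x01,
    VMI_FAMILY_SEGMENTATION = 0x02,
    VMI_FAMILY_MMU          = 0x03,
    VMI_FAMILY_EXPERIMENTAL = 0xFF
};

/* Children of the MMU family. */
enum vmi_mmu_call {
    VMI_MMU_SET_PTE      = 0x00,
    VMI_MMU_SET_LONG_PTE = 0x01,
    VMI_MMU_FLUSH_TLB    = 0x02
};

/* Pack: family in bits 31-24, call in 23-16, annotation in 7-0. */
static uint32_t vmi_call_id(uint8_t family, uint8_t call, uint8_t annotation)
{
    return ((uint32_t)family << 24) | ((uint32_t)call << 16) |
           (uint32_t)annotation;
}

static uint8_t vmi_id_family(uint32_t id)
{
    return (uint8_t)(id >> 24);
}

static uint8_t vmi_id_call(uint32_t id)
{
    return (uint8_t)(id >> 16);
}

static uint8_t vmi_id_annotation(uint32_t id)
{
    /* Assumption: the reserved byte (bits 15-8) must stay zero. */
    assert(((id >> 8) & 0xFFu) == 0);
    return (uint8_t)(id & 0xFFu);
}
```

With this packing, FlushTLB with no annotation comes out as 0x03020000.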

Now, let's say I add a new feature, and I don't want to redefine part of 
the interface.  Let's say that feature is queuing of hypercalls.  I have 
this private annotation field as part of the identifier for each 
hypercall - in effect, really just the hypercall number.

And I don't want to break binary compatibility of the interface.  So 
what I do is I define a new annotation that is specific to the affected 
calls. 

   0x00  SetPTE
        0x00 - no annotation
        0x01 - may be queued !

Now, the hypercall isn't any different.  Hypervisors which are unaware of 
the annotation treat it no differently.  But hypervisors that support 
PTE queuing recognize it as a hint and use it appropriately.
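
As a sketch of how that could look on the hypervisor side (a toy 
model, with every name invented for illustration): both hypervisors 
strip the annotation byte to find the call, and only the one that 
understands queuing acts on the hint.  Comparing the annotation for 
equality with 0x01 is an assumption of this sketch; a dedicated bit 
would work just as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VMI_ANNOTATION_MASK  0x000000FFu
#define VMI_ANNOT_NONE       0x00u
#define VMI_ANNOT_MAY_QUEUE  0x01u
#define VMI_CALL_SET_PTE     0x03000000u  /* family 0x03 (MMU), call 0x00 */

enum vmi_action { VMI_RUN_NOW, VMI_QUEUED };

/* Toy hypervisor: records only whether it understands the queuing hint. */
struct vmi_hypervisor {
    bool supports_queuing;
    unsigned queued;
    unsigned executed;
};

static enum vmi_action vmi_dispatch(struct vmi_hypervisor *hv, uint32_t call_id)
{
    uint32_t base  = call_id & ~VMI_ANNOTATION_MASK;  /* which call */
    uint32_t annot = call_id & VMI_ANNOTATION_MASK;   /* optional hint */

    assert(base == VMI_CALL_SET_PTE);  /* this sketch handles one call only */

    /* The hint is advisory: an unaware hypervisor never looks at it. */
    if (hv->supports_queuing && annot == VMI_ANNOT_MAY_QUEUE) {
        hv->queued++;
        return VMI_QUEUED;
    }
    hv->executed++;
    return VMI_RUN_NOW;
}
```

The same annotated call ID is valid against either hypervisor, which 
is what keeps the interface binary compatible.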

Queuing is a common enough optimization that it might even make sense to 
have a bit set aside in the call ID for it.  Having this type of static 
annotation allows you to get rid of the dynamic concerns you have.

The really nice thing about defining your interface this way is you have 
a hierarchy of different classes of the interface, with the ability to 
add new classes, new calls within a class, and new annotations 
(upgrades, if you will) of those calls.

And it provides a natural way to query for supported families of calls 
- do you support a virtual event channel?  Should I do some extra work 
to give you MMU hints or not?  And you can add extra, optional 
functionality on to existing call sites.  Something very useful if, 
say, you realize that you want to add a hint field to one of your calls 
without breaking the old interface or forcing another vendor into 
complicating their hypervisor.  Which is most of what 
paravirtualization is anyway.  Extra, optionally used hints about how 
things are being used that allow the hypervisor implementation to avoid 
making costly assumptions to ensure correctness under unknown constraints.
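
One possible shape for such a query, purely as an illustration: the 
hypervisor advertises a bitmap with one bit per family, and the guest 
checks it before doing optional work.  The bitmap representation and 
all names below are assumptions; nothing like this is defined yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VMI_FAMILY_MMU 0x03u

/* 256 possible families, one bit each (hypothetical representation). */
struct vmi_family_map {
    uint32_t bits[8];
};

/* Hypervisor side: mark a family as implemented. */
static void vmi_family_advertise(struct vmi_family_map *m, uint8_t family)
{
    m->bits[family / 32] |= (uint32_t)1 << (family % 32);
}

/* Guest side: test whether a family is implemented. */
static bool vmi_family_supported(const struct vmi_family_map *m, uint8_t family)
{
    return ((m->bits[family / 32] >> (family % 32)) & 1u) != 0;
}

/* Only spend time computing MMU hints if something will consume them. */
static bool vmi_should_send_mmu_hints(const struct vmi_family_map *m)
{
    assert(m != NULL);
    return vmi_family_supported(m, VMI_FAMILY_MMU);
}
```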

Is this worth threshing out more?  I think so, since it does provide a 
nice value proposition as well as overcoming the rather clumsy top level 
versioning scheme.

Thanks again for your feedback,

Zach
