From: Rob Earhart
To: Avi Kivity
Cc: linux-kernel, KVM list, qemu-devel
Date: Thu, 2 Feb 2012 14:13:12 -0800
Subject: Re: [Qemu-devel] [RFC] Next gen kvm api
In-Reply-To: <4F2AB552.2070909@redhat.com>
References: <4F2AB552.2070909@redhat.com>

On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity wrote:
> The kvm api has been accumulating cruft for several years now.  This is
> due to feature creep, fixing mistakes, experience gained by the
> maintainers and developers on how to do things, ports to new
> architectures, and simply as a side effect of a code base that is
> developed slowly and incrementally.
>
> While I don't think we can justify a complete revamp of the API now, I'm
> writing this as a thought experiment to see where a from-scratch API can
> take us.  Of course, if we do implement this, the new and old APIs will
> have to be supported side by side for several years.
>
> Syscalls
> --------
> kvm currently uses the much-loved ioctl() system call as its entry
> point.  While this made it easy to add kvm to the kernel unintrusively,
> it does have downsides:
>
> - overhead in the entry path, for the ioctl dispatch path and vcpu mutex
>   (low but measurable)
> - semantic mismatch: kvm really wants a vcpu to be tied to a thread, and
>   a vm to be tied to an mm_struct, but the current API ties them to file
>   descriptors, which can move between threads and processes.  We check
>   that they don't, but we don't want to.
>
> Moving to syscalls avoids these problems, but introduces new ones:
>
> - adding new syscalls is generally frowned upon, and kvm will need several
> - syscalls into modules are harder and rarer than into core kernel code
> - will need to add a vcpu pointer to task_struct, and a kvm pointer to
>   mm_struct
>
> Syscalls that operate on the entire guest will pick it up implicitly
> from the mm_struct, and syscalls that operate on a vcpu will pick it up
> from current.

<snipped>

I like the ioctl() interface.  If the overhead matters in your hot path, I
suspect you're doing it wrong; use irq fds & ioevent fds.  You might fix
the semantic mismatch by having a notion of a "current process's VM" and
"current thread's VCPU", and just use the one /dev/kvm file descriptor.
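
To make that concrete, a rough sketch of the shape I have in mind (the
ioctl names and request numbers below are made up; none of them exist
today):

/* Hypothetical: one /dev/kvm fd; the VM is implied by the calling
 * process's mm, the VCPU by the calling thread. */
#include <fcntl.h>
#include <pthread.h>
#include <sys/ioctl.h>

#define KVM_BIND_VM    0xAE80  /* made up: tie a VM to current->mm */
#define KVM_BIND_VCPU  0xAE81  /* made up: tie a VCPU to this thread */
#define KVM_RUN_SELF   0xAE82  /* made up: run the calling thread's VCPU */

static void *vcpu_thread(void *arg)
{
    int kvm = *(int *)arg;

    ioctl(kvm, KVM_BIND_VCPU, 0);     /* binding lives in task_struct */
    for (;;)
        ioctl(kvm, KVM_RUN_SELF, 0);  /* returns to handle intercepts */
    return NULL;
}

int main(void)
{
    int kvm = open("/dev/kvm", O_RDWR);
    pthread_t t;

    ioctl(kvm, KVM_BIND_VM, 0);       /* binding lives in mm_struct */
    pthread_create(&t, NULL, vcpu_thread, &kvm);
    pthread_join(t, NULL);
    return 0;
}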

Or you could go the other way, and break the connection between VMs and
processes / VCPUs and threads: I don't know how easy it is to do in Linux,
but a VCPU might be backed by a kernel thread, operated on via ioctl()s,
indicating that it has exited the guest by having its descriptor become
readable (with read() or mmap() to pull off the reason why the VCPU
exited).  This would allow for a variety of different programming styles
for the VMM--I'm a fan of the CSP model myself, but that's hard to do with
the current API.
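
Under that model the VMM's main loop might look something like this
(struct vcpu_exit and the read() protocol are invented for the sketch):

/* Sketch: each VCPU runs as a kernel thread; its fd becomes readable
 * when the VCPU leaves the guest.  struct vcpu_exit stands in for
 * whatever the real exit record would be.  Assumes nvcpus <= 64. */
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

struct vcpu_exit {
    uint32_t reason;              /* mmio, pio, halt, ... */
    uint64_t data;
};

void handle_exit(int vcpu, const struct vcpu_exit *e);  /* device emulation */

void vmm_loop(const int *vcpu_fds, int nvcpus)
{
    struct pollfd pfds[64];
    struct vcpu_exit e;

    for (int i = 0; i < nvcpus; i++) {
        pfds[i].fd = vcpu_fds[i];
        pfds[i].events = POLLIN;
    }

    for (;;) {
        poll(pfds, nvcpus, -1);   /* readable == "exited, wants service" */
        for (int i = 0; i < nvcpus; i++) {
            if (!(pfds[i].revents & POLLIN))
                continue;
            read(pfds[i].fd, &e, sizeof(e));  /* or mmap a shared page */
            handle_exit(i, &e);
            /* the VCPU thread re-enters the guest once serviced */
        }
    }
}

A CSP-style VMM would instead hand each fd to its own service thread or
channel, but the kernel interface is the same either way.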

It'd be nice to be able to kick a VCPU out of the guest without messing
around with signals.  One possibility would be to tie it to an eventfd;
another might be to add a pseudo-register to indicate whether the VCPU is
explicitly suspended.  (Combined with the decoupling idea, you'd want
another pseudo-register to indicate whether the VCPU is implicitly
suspended due to an intercept; a single "runnable" bit is racy if both the
VMM and VCPU are setting it.)
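
The eventfd flavor might look like this (entirely hypothetical; neither
the ioctl nor its request number exists):

/* Hypothetical: register an eventfd as a kick source; signaling it
 * forces the VCPU out of guest mode, no signals involved. */
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define KVM_SET_KICK_EVENTFD 0xAE90   /* made-up request number */

int attach_kick(int vcpu_fd)
{
    int kick = eventfd(0, 0);

    ioctl(vcpu_fd, KVM_SET_KICK_EVENTFD, &kick);
    return kick;
}

void kick_vcpu(int kick)              /* safe from any VMM thread */
{
    uint64_t one = 1;

    write(kick, &one, sizeof(one));   /* VCPU exits with "kicked" status */
}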

ioevent fds are definitely useful.  It might be cute if they could
synchronously set the VIRTIO_USED_F_NOTIFY bit - the guest could do this
itself, but that'd require giving the guest write access to the used side
of the virtio queue, and I kind of like the idea that it doesn't need
write access there.  Then again, I don't have any perf data to back up the
need for this.
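
For reference, wiring one up today looks roughly like this (going from my
reading of linux/kvm.h; the notify address and queue index are made up):

/* Register an eventfd that fires when the guest writes this queue's
 * index to the device's notify address; the write is consumed in the
 * kernel, with no exit to userspace. */
#include <stdint.h>
#include <string.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int wire_queue_notify(int vm_fd, uint64_t notify_addr, uint16_t queue)
{
    struct kvm_ioeventfd io;
    int fd = eventfd(0, 0);

    memset(&io, 0, sizeof(io));
    io.addr      = notify_addr;   /* MMIO address, made up by the caller */
    io.len       = 2;             /* virtio queue notify is a 16-bit write */
    io.datamatch = queue;         /* only fire for this queue's kicks */
    io.fd        = fd;
    io.flags     = KVM_IOEVENTFD_FLAG_DATAMATCH;

    if (ioctl(vm_fd, KVM_IOEVENTFD, &io) < 0)
        return -1;
    return fd;                    /* poll this fd to service the queue */
}

The synchronous-notify idea would presumably be one more flag here, plus a
way to tell the kernel where the used ring's flags word lives.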

The rest of it sounds great.

)Rob
