From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from [140.186.70.92] (port=60919 helo=eggs.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1OLkjr-0001xe-9c
	for qemu-devel@nongnu.org; Mon, 07 Jun 2010 18:24:12 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.69)
	(envelope-from <anthony@codemonkey.ws>) id 1OLkjp-0002AK-I1
	for qemu-devel@nongnu.org; Mon, 07 Jun 2010 18:24:11 -0400
Received: from mail-iw0-f173.google.com ([209.85.214.173]:55935)
	by eggs.gnu.org with esmtp (Exim 4.69)
	(envelope-from <anthony@codemonkey.ws>) id 1OLkjp-0002A9-93
	for qemu-devel@nongnu.org; Mon, 07 Jun 2010 18:24:09 -0400
Received: by iwn41 with SMTP id 41so4069502iwn.4
	for <qemu-devel@nongnu.org>; Mon, 07 Jun 2010 15:24:08 -0700 (PDT)
Message-ID: <4C0D717D.8010104@codemonkey.ws>
Date: Mon, 07 Jun 2010 17:23:57 -0500
From: Anthony Liguori <anthony@codemonkey.ws>
MIME-Version: 1.0
References: <4C0D0FB7.80709@redhat.com> <4C0D26B5.9030708@codemonkey.ws>
	<4C0D3D80.5060208@redhat.com>
In-Reply-To: <4C0D3D80.5060208@redhat.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Subject: [Qemu-devel] Re: [RFC] Moving the kvm ioapic, pic,
	and pit back to userspace
List-Id: qemu-devel.nongnu.org
List-Unsubscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Avi Kivity <avi@redhat.com>
Cc: qemu-devel <qemu-devel@nongnu.org>, KVM list <kvm@vger.kernel.org>

On 06/07/2010 01:42 PM, Avi Kivity wrote:
> On 06/07/2010 08:04 PM, Anthony Liguori wrote:
>>
>> I think we could also move the local APIC.
>
> I'm not even sure we can safely move the ioapic/pic (mostly due to 
> churn).  But the local APIC is so heavily accessed by the guest that 
> it's impossible to move it.  Run an ftrace one day, especially on an 
> smp guest.  Every IPI requires several APIC accesses.  Before a halt a 
> tickless kernel sets the wakeup timer.  EOIs.
>
>>
>> To optimize device models, we've tended to put the full device model 
>> in the kernel whereas the hardware vendors have tended to put only 
>> the fast paths of the device models in hardware.
>>
>> For instance, we could introduce a userspace interface similar to 
>> vapic support whereas a shared page that mapped the APIC's layout was 
>> used with a mask to select which registers trapped on read/write.
>
> That leads to very problematic interfaces.  When you separate along a 
> device boundary, you have a spec that defines the software 
> interfaces.  When you separate along a boundary that you define, it's 
> up to you to get everything right.
>
> In fact with the ioapic/pic/lapic one of the problems is that the 
> interconnection between the devices is not well defined, and 
> that's where we have bugs.
>
>>
>> That said, I can understand an argument that the local APIC is part 
>> of the CPU state since it's a very special type of device.
>>
>> A better example would be a generic counter kernel mechanism.  I can 
>> envision such a device as doing nothing more than providing a 
>> read-only view of a counter with a userspace configurable divider and 
>> width.  Any write to the counter or read of any other byte outside 
>> the counter register would result in a trap to userspace.
>
> What about latches?  byte access to word registers?  There will be as 
> many special cases as there are timers.
>
> If the kernel supported a bytecode/jit facility I'd happily use that 
> to download portions of the device model into the kernel.
>
>>
>> That should allow both the PIT and the HPET to be accelerated with 
>> minimal effort in the kernel.
>
> IMO it's probably more effort than porting HPET to the kernel.  Try 
> outlining an interface that supports PIT, HPET, RTC, and ACPI PMTIMER.

I was referring specifically to time sources, not time events.

An accelerated counter for HPET is pretty trivial.  It's a 32-bit 
register that's actually a nanosecond value in qemu.  We need to be able 
to set an offset from the host wall clock time, a means to stop it, and 
a means to start it.
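
A minimal sketch of such a counter, assuming the host clock is passed in as a nanosecond value. The struct and function names are hypothetical, not an existing kvm or qemu interface, and the value is kept as a plain 64-bit nanosecond count with no register-width masking:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state for an accelerated time-source counter. */
struct accel_counter {
    uint64_t offset_ns; /* counter minus host clock, modulo 2^64, while running */
    uint64_t frozen_ns; /* value held while the counter is stopped */
    bool running;
};

/* Read the counter; host_now_ns is the host clock supplied by the caller. */
static uint64_t counter_read(const struct accel_counter *c, uint64_t host_now_ns)
{
    if (!c->running)
        return c->frozen_ns;
    return host_now_ns + c->offset_ns; /* unsigned wraparound is intended */
}

/* Set the counter value, i.e. choose the offset from the host clock. */
static void counter_set(struct accel_counter *c, uint64_t host_now_ns,
                        uint64_t value_ns)
{
    c->frozen_ns = value_ns;
    c->offset_ns = value_ns - host_now_ns;
}

/* Stop: freeze the current value so later reads no longer advance. */
static void counter_stop(struct accel_counter *c, uint64_t host_now_ns)
{
    if (c->running) {
        c->frozen_ns = counter_read(c, host_now_ns);
        c->running = false;
    }
}

/* Start: recompute the offset so counting resumes from the frozen value. */
static void counter_start(struct accel_counter *c, uint64_t host_now_ns)
{
    if (!c->running) {
        c->offset_ns = c->frozen_ns - host_now_ns;
        c->running = true;
    }
}
```

Reads never touch anything but the offset and the host clock, which is the part that would make a fast path cheap.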

The PIT is latched so the kernel needs to know enough about how to 
decode the PIT state to understand the latching.  There's very little 
state associated with latching though so I don't think this is a huge 
problem.  It's a fixed value write to a fixed register followed by a 
read to a fixed register.  The act of latching doesn't affect the state 
beyond the fact that you need to save the latched value in the event 
that you have a live migration before reading the latched value.
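
To make the amount of state concrete, here is a rough sketch of a channel 0 latch fast path. It assumes the channel is programmed for low-byte-then-high-byte access; the struct and function names are hypothetical, and in a real implementation the live count would be derived from elapsed time rather than stored:

```c
#include <stdbool.h>
#include <stdint.h>

#define PIT_PORT_CH0  0x40 /* channel 0 data port */
#define PIT_PORT_CTRL 0x43 /* mode/command port */
#define PIT_LATCH_CH0 0x00 /* counter-latch command for channel 0 */

/* Hypothetical fast-path state; the last three fields are what a
 * live migration would have to carry while a latch is pending. */
struct pit_ch0 {
    uint16_t count;   /* live count (stand-in for a time-derived value) */
    uint16_t latched; /* value captured by the latch command */
    bool latch_valid; /* a latched value is waiting to be read */
    bool msb_next;    /* the next read returns the high byte */
};

/* Returns true if handled here; false means "exit to userspace". */
static bool pit_fast_write(struct pit_ch0 *s, uint16_t port, uint8_t val)
{
    if (port != PIT_PORT_CTRL || val != PIT_LATCH_CH0)
        return false;
    if (!s->latch_valid) { /* a second latch before the read is ignored */
        s->latched = s->count;
        s->latch_valid = true;
        s->msb_next = false;
    }
    return true;
}

/* Returns true if handled here; false means "exit to userspace". */
static bool pit_fast_read(struct pit_ch0 *s, uint16_t port, uint8_t *val)
{
    if (port != PIT_PORT_CH0 || !s->latch_valid)
        return false;
    if (!s->msb_next) {
        *val = (uint8_t)(s->latched & 0xff);
        s->msb_next = true;
    } else {
        *val = (uint8_t)(s->latched >> 8);
        s->msb_next = false;
        s->latch_valid = false; /* both bytes consumed, latch released */
    }
    return true;
}
```

Anything other than the latch command or a read of a pending latch falls through to the slow path.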

The PMTIMER is also pretty straightforward.  It sits at a variable port 
address that is fixed during execution.
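
A sketch of that read path, assuming a 24-bit timer at the ACPI PM timer frequency of 3.579545 MHz. The struct, the function names, and the idea of passing the current time in as nanoseconds are illustrative assumptions:

```c
#include <stdbool.h>
#include <stdint.h>

#define PMTMR_HZ     3579545ULL    /* ACPI PM timer frequency */
#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical fast-path state for the PM timer. */
struct pmtimer {
    uint16_t port;     /* chosen at setup, constant while the guest runs */
    uint64_t start_ns; /* time at which the counter read zero */
};

/* Convert elapsed nanoseconds to timer ticks, truncated to 24 bits.
 * Seconds and the sub-second remainder are scaled separately so the
 * multiplication cannot overflow 64 bits. */
static uint32_t pmtimer_ticks(const struct pmtimer *t, uint64_t now_ns)
{
    uint64_t ns = now_ns - t->start_ns;
    uint64_t ticks = (ns / NSEC_PER_SEC) * PMTMR_HZ
                   + (ns % NSEC_PER_SEC) * PMTMR_HZ / NSEC_PER_SEC;
    return (uint32_t)(ticks & 0xffffff);
}

/* Returns true if the access hit the timer port; false means the
 * access belongs to some other device. */
static bool pmtimer_read(const struct pmtimer *t, uint16_t port,
                         uint64_t now_ns, uint32_t *val)
{
    if (port != t->port)
        return false;
    *val = pmtimer_ticks(t, now_ns);
    return true;
}
```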

Even if we require three separate interfaces, the interfaces are so 
simple that it seems like an obvious win.

>>
>> I'd be in favor of a straight port to userspace.  We already have the 
>> interfaces to communicate with an external device model for these 
>> devices so let's just take the kernel code and stick it into 
>> dedicated threads in userspace.
>
> Currently we support an all-or-nothing approach.  I don't think local 
> APIC in userspace is worthwhile.  Esp. as it will slow down vhost and 
> assigned devices significantly - interrupts will have to be mediated 
> by userspace.

Yeah, as I said, I can understand the arguments for keeping the lapic in 
the kernel.

>>
>> I think it's easier to then work to merge the two bits of code in the 
>> same tree than it is to try and take out-of-tree code and merge it 
>> incrementally.
>
> Are you talking about qemu.git/qemu-kvm.git?  That's the least of my 
> concerns, I'm worried about kvm.git.

qemu.git.

>>
>>> 5. Risk
>>>
>>> We may find out after all this is implemented that performance is 
>>> not acceptable and all the work will have to be dropped.
>>
>> That's another advantage to a straight port to userspace.  We can 
>> collect performance data with only a modest amount of engineering 
>> effort.
>
> Port what exactly?  We have a userspace irqchip implementation.  What 
> we don't have is just the ioapic/pic/pit in userspace, and the only 
> way to try it out is to implement the whole thing.

If you take the kernel code and do a pretty straight port: switching 
kernel functions to libc functions and maintaining all the existing 
locking via pthreads, you could then implement a very simple MMIO/PIO 
dispatch mechanism in the kvm code that short-circuits those devices before 
we ever hit the qemu_mutex and the traditional qemu code paths.  It 
should be a relatively easy conversion and it gives a proper vehicle for 
experimentation.
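
Roughly, the dispatch could look like the sketch below. None of these names exist in qemu or kvm; the table layout, the handler signature, and the per-device pthread mutex are assumptions meant only to show the shape of the idea for PIO (MMIO would look the same with wider addresses):

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical handler type: returns true if the access was handled. */
typedef bool (*pio_handler)(void *opaque, uint16_t port, bool is_write,
                            uint32_t *val);

/* One ported device, guarded by its own lock instead of the qemu_mutex. */
struct fast_pio_range {
    uint16_t base;
    uint16_t len;
    pio_handler handler;
    void *opaque;
    pthread_mutex_t lock;
};

/* Try the ported devices first. A false return means the caller should
 * fall back to the traditional, globally locked dispatch path. */
static bool fast_pio_dispatch(struct fast_pio_range *tbl, size_t n,
                              uint16_t port, bool is_write, uint32_t *val)
{
    for (size_t i = 0; i < n; i++) {
        struct fast_pio_range *r = &tbl[i];
        bool handled;

        if (port < r->base || port - r->base >= r->len)
            continue;
        pthread_mutex_lock(&r->lock);
        handled = r->handler(r->opaque, port, is_write, val);
        pthread_mutex_unlock(&r->lock);
        return handled;
    }
    return false;
}

/* Toy device used only to exercise the dispatcher: a scratch register. */
static bool scratch_handler(void *opaque, uint16_t port, bool is_write,
                            uint32_t *val)
{
    uint32_t *reg = opaque;

    (void)port;
    if (is_write)
        *reg = *val;
    else
        *val = *reg;
    return true;
}
```

A linear scan is fine for three or four devices; the point is only that the lookup and the device lock sit in front of the global lock.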

In fact, you could pretty quickly determine viability by porting the PIT 
to userspace and implementing a vpit interface in the kernel that 
allowed the channel 0 counters to be latched and read within lightweight 
exits.

Regards,

Anthony Liguori