qemu-devel.nongnu.org archive mirror
* [Qemu-devel] [RFC] Moving the kvm ioapic, pic, and pit back to userspace
@ 2010-06-07 15:26 Avi Kivity
  2010-06-07 16:31 ` [Qemu-devel] " David S. Ahern
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Avi Kivity @ 2010-06-07 15:26 UTC (permalink / raw)
  To: KVM list; +Cc: qemu-devel

I am currently investigating a problem with a guest running Linux
malfunctioning in the NMI watchdog code.  The problem is that we don't
handle NMI delivery mode for the local APIC LINT0 pin; instead we expect
ExtInt delivery mode or that the line is disabled completely.  In
addition the i8254 timer is tied to the BSP, while in this case the 
timer can broadcast to all vcpus.
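
For reference, the LINT0 behaviour above hinges on the delivery-mode
field of the LVT entry.  The following is a minimal sketch, not the kvm
code: the bit positions (delivery mode in bits 10:8, mask in bit 16)
follow the Intel SDM, while the enum and function names are invented
here for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Delivery-mode encodings for LVT bits 10:8 (Intel SDM). */
enum lvt_delivery_mode {
    LVT_DM_FIXED  = 0x0,
    LVT_DM_SMI    = 0x2,
    LVT_DM_NMI    = 0x4,
    LVT_DM_INIT   = 0x5,
    LVT_DM_EXTINT = 0x7,
};

#define LVT_MASKED (1u << 16)   /* bit 16: entry is masked */

/* What an emulation would do when the pin is asserted (hypothetical names). */
enum lint_action {
    LINT_IGNORE,        /* masked, or a mode this sketch does not model */
    LINT_INJECT_NMI,
    LINT_INJECT_EXTINT, /* vector is fetched from the PIC via INTACK */
    LINT_INJECT_FIXED,  /* vector is taken from LVT bits 7:0 */
};

static enum lvt_delivery_mode lvt_delivery_mode(uint32_t lvt)
{
    return (enum lvt_delivery_mode)((lvt >> 8) & 0x7);
}

/* lvt_lint[0] is LVT LINT0, lvt_lint[1] is LVT LINT1. */
static enum lint_action lint_pin_asserted(const uint32_t lvt_lint[2], int pin)
{
    assert(pin == 0 || pin == 1);
    uint32_t lvt = lvt_lint[pin];

    if (lvt & LVT_MASKED)
        return LINT_IGNORE;

    switch (lvt_delivery_mode(lvt)) {
    case LVT_DM_NMI:    return LINT_INJECT_NMI;
    case LVT_DM_EXTINT: return LINT_INJECT_EXTINT;
    case LVT_DM_FIXED:  return LINT_INJECT_FIXED;
    default:            return LINT_IGNORE;
    }
}
```

A guest that programs LINT0 for the NMI watchdog would take the
LINT_INJECT_NMI branch on every vcpu that has the entry unmasked, which
is the case the current code does not cover.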

There is some code that tries to second-guess the guest and provide it 
the inputs it sees, but this is fragile.  The only way to get reliable 
operation is to emulate the hardware fully.

Now I'd much rather do that in userspace, since it's a lot of sensitive 
work.  I'll enumerate below the general motivation, advantages and 
disadvantages, and a plan for moving forward.

Motivation
==========

The original motivation for moving the PIC and IOAPIC into the kernel 
was performance, especially for assigned devices.  Both devices are high 
interaction since they deal with interrupts; practically after every 
interrupt there is either a PIC ioport write, or an APIC bus message, 
both signalling an EOI operation.  Moving the PIT into the kernel 
allowed us to catch up with missed timer interrupt injections, and 
speeded up guests which read the PIT counters (e.g. tickless guests).

However, modern guests running on modern qemu use MSI extensively; both 
virtio and assigned devices now have MSI support; and the planned VFIO 
only supports kernel delivery via MSI anyway; line based interrupts will 
need to be mediated by userspace.

The only high frequency non-MSI interrupt sources remaining are the
various timers; and the default one, HPET, is in userspace (and has
its own scaling problems as a result).  So in theory we can move PIC,
IOAPIC, and PIT support to userspace and not lose much performance.

Moving the implementation to userspace allows us more flexibility, and 
more consistency in the implementation of timekeeping for the various 
clock chips; it becomes easier to follow the nuances of real hardware in 
this area.

Interestingly, while the IOAPIC/PIC code was written we proposed making 
it independent of the local APIC; had we done so, the move would have 
been much easier (simply dropping the existing code).


Advantages of a move
====================

1. Reduced kernel footprint

Good for security, and allows fixing bugs without reboots.

2. Centralized timekeeping

Instead of having one solution for PIT timekeeping, and another for RTC 
and HPET timekeeping, we can have all timer chips in userspace.  The 
local APIC timer still needs to be in the kernel - it is much too high 
bandwidth to be in userspace; but on the other hand it is very different 
from the other timer chips.

3. Flexibility

Easier to have weird board layouts (multiple IOAPICs, etc.).  Not a very
strong advantage.

Disadvantages
=============

1. Still need to keep the old code around for a long while

We can't just rip it out - old userspace depends on it.  So the security 
advantages are only with cooperating userspace, and the other advantages 
only show up.

2. Need to bring the qemu code up to date

The current qemu ioapic code lags some way behind the kernel; we also
need PIT timekeeping.

3. May need kernel support for interval-timer-follows-thread

Currently the timekeeping code has an optimization which causes the 
hrtimer that models the PIT to follow the BSP (which is most likely to 
receive the interrupt); this reduces cpu cross-talk.

I don't think the kernel interval timer code has such an optimization; 
we may need to implement it.

4. Much churn

This is a lot of work.

5. Risk

We may find out after all this is implemented that performance is not 
acceptable and all the work will have to be dropped.


Proposed interface
==================

1. KVM_SET_LINT_PIN (vcpu ioctl)

Sets the value (0 or 1) that a vcpu's LINT0 or LINT1 senses.

Note: problematic; may be high frequency but ignored due to masking at 
the local APIC LVT level.  Will also be broadcast across all vcpus by 
userspace with typical configurations.  We may need a way to tell 
userspace we'll be ignoring those signals.

May also be extended to emulate thermal interrupts if someone feels the 
need.

An alternative is a couple of new fields in kvm_run which are sampled on 
every entry (unless masked).

2. KVM_EXIT_REASON_INTACK (kvm_run exit reason)

Informs userspace that the vcpu is running an INTACK cycle; userspace 
should provide the interrupt vector on the next KVM_VCPU_RUN.

3. KVM_APIC_MESSAGE (vm ioctl)

Sends an APIC message on the APIC message bus, if the destination is in 
the kernel (typically IOAPIC interrupt messages).

4. KVM_EXIT_REASON_APIC_MESSAGE (kvm_run exit reason)

Sends an APIC message on the APIC message bus, if the destination is not 
in the kernel (typically IOAPIC EOI messages).

The above are all architectural, and correspond to wires on physical 
systems.  This increases the confidence that they are correct.

5. KVM_REQUEST_EOI (vcpu ioctl) / KVM_EXIT_EOI (kvm_run exit reason)

We will get EOI messages via KVM_EXIT_REASON_APIC_MESSAGE for 
level-triggered interrupts.  However, for timekeeping we will also need
an EOI for edge triggered interrupts (if we choose the ack notifier
method for timekeeping).

6. KVM_EXIT_REASON_LVT_MASK (kvm_run exit reason)

A notification that the LVT LINT0 or LVT LINT1 mask bit has changed, and 
thus we don't need to issue useless KVM_SET_LINT_PIN ioctls; also useful 
for timekeeping (can disable PIT if configured with ExtInt mode or lapic 
disabled).

7. KVM_EXIT_REASON_APIC_MESSAGE_ACK (kvm_run exit reason)

If we use the current timekeeping method of detecting coalesced 
interrupts, we'll need an acknowledge when an APIC message is accepted 
by a local APIC, with the result (interrupt queued or interrupt 
coalesced).  This will need to be selectable by vcpu and vector number.

8. KVM_CREATE_IRQCHIP (vm ioctl)

A new flag that tells kvm not to create a PIC and IOAPIC.
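
As an illustration of how items 1 and 6 could interact, here is a
minimal userspace-side sketch.  It assumes the proposed
KVM_SET_LINT_PIN ioctl and KVM_EXIT_REASON_LVT_MASK exit exist as
described; neither is a real kvm interface, and the struct and function
names below are hypothetical.  The code only decides when an ioctl
would be worth issuing; it does not call into kvm.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace's view of one vcpu's LINT0/LINT1 inputs (hypothetical type). */
struct lint_tracker {
    bool masked[2];  /* mask state last reported by an LVT_MASK exit */
    int  level[2];   /* current level of the wire driving each pin */
    int  sent[2];    /* level last pushed to the kernel, -1 if none yet */
};

static void lint_tracker_init(struct lint_tracker *t)
{
    for (int pin = 0; pin < 2; pin++) {
        t->masked[pin] = true;  /* LVT entries are masked after reset */
        t->level[pin] = 0;
        t->sent[pin] = -1;
    }
}

/*
 * The wire changed level.  Returns true if the caller should issue the
 * (proposed) KVM_SET_LINT_PIN ioctl, false if it would be useless.
 */
static bool lint_set_level(struct lint_tracker *t, int pin, int level)
{
    assert(pin == 0 || pin == 1);
    t->level[pin] = level ? 1 : 0;
    if (t->masked[pin] || t->sent[pin] == t->level[pin])
        return false;
    t->sent[pin] = t->level[pin];
    return true;
}

/*
 * The (proposed) LVT_MASK exit reported a new mask bit.  Returns true if
 * the current wire level must now be pushed to the kernel.
 */
static bool lint_mask_changed(struct lint_tracker *t, int pin, bool masked)
{
    assert(pin == 0 || pin == 1);
    t->masked[pin] = masked;
    if (masked || t->sent[pin] == t->level[pin])
        return false;
    t->sent[pin] = t->level[pin];
    return true;
}
```

This sketch only tracks levels; a real implementation would also have
to decide what to do about edges that arrive while the pin is masked.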


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Qemu-devel] Re: [RFC] Moving the kvm ioapic, pic, and pit back to userspace
  2010-06-07 15:26 [Qemu-devel] [RFC] Moving the kvm ioapic, pic, and pit back to userspace Avi Kivity
@ 2010-06-07 16:31 ` David S. Ahern
  2010-06-07 18:46   ` Avi Kivity
  2010-06-07 17:04 ` Anthony Liguori
  2010-06-09 15:59 ` [Qemu-devel] " Dong, Eddie
  2 siblings, 1 reply; 14+ messages in thread
From: David S. Ahern @ 2010-06-07 16:31 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel, KVM list



On 06/07/10 09:26, Avi Kivity wrote:

> The original motivation for moving the PIC and IOAPIC into the kernel
> was performance, especially for assigned devices.  Both devices are high
> interaction since they deal with interrupts; practically after every
> interrupt there is either a PIC ioport write, or an APIC bus message,
> both signalling an EOI operation.  Moving the PIT into the kernel
> allowed us to catch up with missed timer interrupt injections, and
> speeded up guests which read the PIT counters (e.g. tickless guests).
> 
> However, modern guests running on modern qemu use MSI extensively; both
> virtio and assigned devices now have MSI support; and the planned VFIO
> only supports kernel delivery via MSI anyway; line based interrupts will
> need to be mediated by userspace.

The "modern" guest comment is a bit concerning. 2.4 kernels (e.g.,
RHEL3) use the PIT for timekeeping and will still be around for a while.
RHEL4 and RHEL5 will be around for a long time to come. Not sure how
those fit within the "modern" label, though I see my RHEL4 guest is
using the pit as a timesource.

David


* [Qemu-devel] Re: [RFC] Moving the kvm ioapic, pic, and pit back to userspace
  2010-06-07 15:26 [Qemu-devel] [RFC] Moving the kvm ioapic, pic, and pit back to userspace Avi Kivity
  2010-06-07 16:31 ` [Qemu-devel] " David S. Ahern
@ 2010-06-07 17:04 ` Anthony Liguori
  2010-06-07 18:42   ` Avi Kivity
  2010-06-09 15:59 ` [Qemu-devel] " Dong, Eddie
  2 siblings, 1 reply; 14+ messages in thread
From: Anthony Liguori @ 2010-06-07 17:04 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel, KVM list

On 06/07/2010 10:26 AM, Avi Kivity wrote:
> I am currently investigating a problem with a guest running Linux
> malfunctioning in the NMI watchdog code.  The problem is that we don't
> handle NMI delivery mode for the local APIC LINT0 pin; instead we
> expect ExtInt delivery mode or that the line is disabled completely.
> In addition the i8254 timer is tied to the BSP, while in this case the 
> timer can broadcast to all vcpus.
>
> There is some code that tries to second-guess the guest and provide it 
> the inputs it sees, but this is fragile.  The only way to get reliable 
> operation is to emulate the hardware fully.
>
> Now I'd much rather do that in userspace, since it's a lot of 
> sensitive work.  I'll enumerate below the general motivation, 
> advantages and disadvantages, and a plan for moving forward.
>
> Motivation
> ==========
>
> The original motivation for moving the PIC and IOAPIC into the kernel 
> was performance, especially for assigned devices.  Both devices are 
> high interaction since they deal with interrupts; practically after 
> every interrupt there is either a PIC ioport write, or an APIC bus 
> message, both signalling an EOI operation.  Moving the PIT into the 
> kernel allowed us to catch up with missed timer interrupt injections, 
> and speeded up guests which read the PIT counters (e.g. tickless guests).
>
> However, modern guests running on modern qemu use MSI extensively; 
> both virtio and assigned devices now have MSI support; and the planned 
> VFIO only supports kernel delivery via MSI anyway; line based 
> interrupts will need to be mediated by userspace.
>
> The only high frequency non-MSI interrupt sources remaining are the
> various timers; and the default one, HPET, is in userspace (and has
> its own scaling problems as a result).  So in theory we can move PIC,
> IOAPIC, and PIT support to userspace and not lose much performance.

I think we could also move the local APIC.

To optimize device models, we've tended to put the full device model in
the kernel whereas the hardware vendors have tended to put only the fast
paths of the device models in hardware.

For instance, we could introduce a userspace interface similar to vapic
support, in which a shared page mapping the APIC's layout is used
together with a mask that selects which registers trap on read/write.

That said, I can understand an argument that the local APIC is part of 
the CPU state since it's a very special type of device.

A better example would be a generic counter kernel mechanism.  I can 
envision such a device as doing nothing more than providing a read-only 
view of a counter with a userspace configurable divider and width.  Any 
write to the counter or read of any other byte outside the counter 
register would result in a trap to userspace.

That should allow both the PIT and the HPET to be accelerated with 
minimal effort in the kernel.
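
To make the idea concrete, here is a rough sketch of what I have in
mind.  Everything in it is hypothetical (the struct, the field names,
the choice of host nanoseconds as the time base); it only shows the
split between the one access handled in the fast path and everything
else that traps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Configuration supplied by userspace (hypothetical layout). */
struct kcounter {
    uint64_t start_ns;    /* host time at which the counter read zero */
    uint64_t divider_ns;  /* host nanoseconds per counter tick */
    unsigned width;       /* counter width in bits, 1..64 */
    uint64_t reg_offset;  /* offset of the counter register in the device */
};

/* Current counter value, truncated to the configured width. */
static uint64_t kcounter_value(const struct kcounter *c, uint64_t now_ns)
{
    assert(c->divider_ns != 0 && c->width >= 1 && c->width <= 64);
    uint64_t ticks = (now_ns - c->start_ns) / c->divider_ns;
    uint64_t mask = c->width == 64 ? UINT64_MAX
                                   : (UINT64_C(1) << c->width) - 1;
    return ticks & mask;
}

/*
 * Returns true if the access was satisfied in the fast path.  Any write,
 * and any read outside the counter register, returns false, meaning
 * "trap to userspace".
 */
static bool kcounter_access(const struct kcounter *c, uint64_t offset,
                            bool is_write, uint64_t now_ns, uint64_t *val)
{
    if (is_write || offset != c->reg_offset)
        return false;
    *val = kcounter_value(c, now_ns);
    return true;
}
```

With divider_ns = 10 and width = 8, for example, a read 3000 ns after
start_ns sees 300 ticks truncated to 8 bits.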

> Moving the implementation to userspace allows us more flexibility, and 
> more consistency in the implementation of timekeeping for the various 
> clock chips; it becomes easier to follow the nuances of real hardware 
> in this area.
>
> Interestingly, while the IOAPIC/PIC code was written we proposed 
> making it independent of the local APIC; had we done so, the move 
> would have been much easier (simply dropping the existing code).
>
>
> Advantages of a move
> ====================
>
> 1. Reduced kernel footprint
>
> Good for security, and allows fixing bugs without reboots.
>
> 2. Centralized timekeeping
>
> Instead of having one solution for PIT timekeeping, and another for 
> RTC and HPET timekeeping, we can have all timer chips in userspace.  
> The local APIC timer still needs to be in the kernel - it is much too 
> high bandwidth to be in userspace; but on the other hand it is very 
> different from the other timer chips.
>
> 3. Flexibility
>
> Easier to have weird board layouts (multiple IOAPICs, etc.).  Not a
> very strong advantage.
>
> Disadvantages
> =============
>
> 1. Still need to keep the old code around for a long while
>
> We can't just rip it out - old userspace depends on it.  So the 
> security advantages are only with cooperating userspace, and the other 
> advantages only show up.
>
> 2. Need to bring the qemu code up to date
>
> The current qemu ioapic code lags some way behind the kernel; also 
> need PIT timekeeping
>
> 3. May need kernel support for interval-timer-follows-thread
>
> Currently the timekeeping code has an optimization which causes the 
> hrtimer that models the PIT to follow the BSP (which is most likely to 
> receive the interrupt); this reduces cpu cross-talk.
>
> I don't think the kernel interval timer code has such an optimization; 
> we may need to implement it.
>
> 4. Much churn
>
> This is a lot of work.

I'd be in favor of a straight port to userspace.  We already have the 
interfaces to communicate with an external device model for these 
devices so let's just take the kernel code and stick it into dedicated 
threads in userspace.

I think it's easier to then work to merge the two bits of code in the 
same tree than it is to try and take out-of-tree code and merge it 
incrementally.

> 5. Risk
>
> We may find out after all this is implemented that performance is not 
> acceptable and all the work will have to be dropped.

That's another advantage to a straight port to userspace.  We can 
collect performance data with only a modest amount of engineering effort.

Regards,

Anthony Liguori

>
> Proposed interface
> ==================
>
> 1. KVM_SET_LINT_PIN (vcpu ioctl)
>
> Sets the value (0 or 1) that a vcpu's LINT0 or LINT1 senses.
>
> Note: problematic; may be high frequency but ignored due to masking at 
> the local APIC LVT level.  Will also be broadcast across all vcpus by 
> userspace with typical configurations.  We may need a way to tell 
> userspace we'll be ignoring those signals.
>
> May also be extended to emulate thermal interrupts if someone feels 
> the need.
>
> An alternative is a couple of new fields in kvm_run which are sampled 
> on every entry (unless masked).
>
> 2. KVM_EXIT_REASON_INTACK (kvm_run exit reason)
>
> Informs userspace that the vcpu is running an INTACK cycle; userspace 
> should provide the interrupt vector on the next KVM_VCPU_RUN.
>
> 3. KVM_APIC_MESSAGE (vm ioctl)
>
> Sends an APIC message on the APIC message bus, if the destination is 
> in the kernel (typically IOAPIC interrupt messages).
>
> 4. KVM_EXIT_REASON_APIC_MESSAGE (kvm_run exit reason)
>
> Sends an APIC message on the APIC message bus, if the destination is 
> not in the kernel (typically IOAPIC EOI messages).
>
> The above are all architectural, and correspond to wires on physical 
> systems.  This increases the confidence that they are correct.
>
> 5. KVM_REQUEST_EOI (vcpu ioctl) / KVM_EXIT_EOI (kvm_run exit reason)
>
> We will get EOI messages via KVM_EXIT_REASON_APIC_MESSAGE for 
> level-triggered interrupts.  However, for timekeeping we will also
> need an EOI for edge triggered interrupts (if we choose the ack
> notifier method for timekeeping).
>
> 6. KVM_EXIT_REASON_LVT_MASK (kvm_run exit reason)
>
> A notification that the LVT LINT0 or LVT LINT1 mask bit has changed, 
> and thus we don't need to issue useless KVM_SET_LINT_PIN ioctls; also 
> useful for timekeeping (can disable PIT if configured with ExtInt mode 
> or lapic disabled).
>
> 7. KVM_EXIT_REASON_APIC_MESSAGE_ACK (kvm_run exit reason)
>
> If we use the current timekeeping method of detecting coalesced 
> interrupts, we'll need an acknowledge when an APIC message is accepted 
> by a local APIC, with the result (interrupt queued or interrupt 
> coalesced).  This will need to be selectable by vcpu and vector number.
>
> 8. KVM_CREATE_IRQCHIP (vm ioctl)
>
> A new flag that tells kvm not to create a PIC and IOAPIC.
>
>


* [Qemu-devel] Re: [RFC] Moving the kvm ioapic, pic, and pit back to userspace
  2010-06-07 17:04 ` Anthony Liguori
@ 2010-06-07 18:42   ` Avi Kivity
  2010-06-07 22:23     ` Anthony Liguori
  0 siblings, 1 reply; 14+ messages in thread
From: Avi Kivity @ 2010-06-07 18:42 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel, KVM list

On 06/07/2010 08:04 PM, Anthony Liguori wrote:
>
> I think we could also move the local APIC.

I'm not even sure we can safely move the ioapic/pic (mostly due to 
churn).  But the local APIC is so heavily accessed by the guest that 
it's impossible to move it.  Run an ftrace one day, especially on an smp 
guest.  Every IPI requires several APIC accesses.  Before a halt a 
tickless kernel sets the wakeup timer.  EOIs.

>
> To optimize device models, we've tended to put the full device model 
> in the kernel whereas the hardware vendors have tended to put only the 
> fast paths of the devices models in hardware.
>
> For instance, we could introduce a userspace interface similar to 
> vapic support whereas a shared page that mapped the APIC's layout was 
> used with a mask to select which registers trapped on read/write.

That leads to very problematic interfaces.  When you separate along a 
device boundary, you have a spec that defines the software interfaces.  
When you separate along a boundary that you define, it's up to you to 
get everything right.

In fact with the ioapic/pic/lapic one of the problems is that the
interconnection between the devices is not well defined, and that's
where we have bugs.

>
> That said, I can understand an argument that the local APIC is part of 
> the CPU state since it's a very special type of device.
>
> A better example would be a generic counter kernel mechanism.  I can 
> envision such a device as doing nothing more than providing a 
> read-only view of a counter with a userspace configurable divider and 
> width.  Any write to the counter or read of any other byte outside the 
> counter register would result in a trap to userspace.

What about latches?  byte access to word registers?  There will be as 
many special cases as there are timers.

If the kernel supported a bytecode/jit facility I'd happily use that to 
download portions of the device model into the kernel.

>
> That should allow both the PIT and the HPET to be accelerated with 
> minimal effort in the kernel.

IMO it's probably more effort than porting HPET to the kernel.  Try 
outlining an interface that supports PIT, HPET, RTC, and ACPI PMTIMER.

>
> I'd be in favor of a straight port to userspace.  We already have the 
> interfaces to communicate with an external device model for these 
> devices so let's just take the kernel code and stick it into dedicated 
> threads in userspace.

Currently we support an all-or-nothing approach.  I don't think local 
APIC in userspace is worthwhile.  Esp. as it will slow down vhost and 
assigned devices significantly - interrupts will have to be mediated by 
userspace.

>
> I think it's easier to then work to merge the two bits of code in the 
> same tree than it is to try and take out-of-tree code and merge it 
> incrementally.

Are you talking about qemu.git/qemu-kvm.git?  That's the least of my 
concerns, I'm worried about kvm.git.

>
>> 5. Risk
>>
>> We may find out after all this is implemented that performance is not 
>> acceptable and all the work will have to be dropped.
>
> That's another advantage to a straight port to userspace.  We can 
> collect performance data with only a modest amount of engineering effort.

Port what exactly?  We have a userspace irqchip implementation.  What we 
don't have is just the ioapic/pic/pit in userspace, and the only way to 
try it out is to implement the whole thing.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* [Qemu-devel] Re: [RFC] Moving the kvm ioapic, pic, and pit back to userspace
  2010-06-07 16:31 ` [Qemu-devel] " David S. Ahern
@ 2010-06-07 18:46   ` Avi Kivity
  2010-06-07 18:54     ` David S. Ahern
  0 siblings, 1 reply; 14+ messages in thread
From: Avi Kivity @ 2010-06-07 18:46 UTC (permalink / raw)
  To: David S. Ahern; +Cc: qemu-devel, KVM list

On 06/07/2010 07:31 PM, David S. Ahern wrote:
>
> On 06/07/10 09:26, Avi Kivity wrote:
>
>    
>> The original motivation for moving the PIC and IOAPIC into the kernel
>> was performance, especially for assigned devices.  Both devices are high
>> interaction since they deal with interrupts; practically after every
>> interrupt there is either a PIC ioport write, or an APIC bus message,
>> both signalling an EOI operation.  Moving the PIT into the kernel
>> allowed us to catch up with missed timer interrupt injections, and
>> speeded up guests which read the PIT counters (e.g. tickless guests).
>>
>> However, modern guests running on modern qemu use MSI extensively; both
>> virtio and assigned devices now have MSI support; and the planned VFIO
>> only supports kernel delivery via MSI anyway; line based interrupts will
>> need to be mediated by userspace.
>>      
> The "modern" guest comment is a bit concerning. 2.4 kernels (e.g.,
> RHEL3) use the PIT for timekeeping and will still be around for a while.
> RHEL4 and RHEL5 will be around for a long time to come. Not sure how
> those fit within the "modern" label, though I see my RHEL4 guest is
> using the pit as a timesource.
>    

First of all, the existing code will remain for a long while (several 
years).  We still have to support existing userspace.

But, that's not a satisfactory answer.  I don't want users to choose 
which device model to use according to their guest.  As far as I'm 
concerned all guests are triple-boot with the guest rebooting to a 
different OS every half hour.

So it's important to know how often your RHEL3/4 guest queries the PIT 
(not just receives interrupts, actually reads the counter) under a 
realistic load.  If you have such a number (in reads/sec) that would be 
a good input to this discussion.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* [Qemu-devel] Re: [RFC] Moving the kvm ioapic, pic, and pit back to userspace
  2010-06-07 18:46   ` Avi Kivity
@ 2010-06-07 18:54     ` David S. Ahern
  2010-06-07 19:16       ` Avi Kivity
  0 siblings, 1 reply; 14+ messages in thread
From: David S. Ahern @ 2010-06-07 18:54 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel, KVM list



On 06/07/10 12:46, Avi Kivity wrote:
> On 06/07/2010 07:31 PM, David S. Ahern wrote:
>>
>> On 06/07/10 09:26, Avi Kivity wrote:
>>
>>   
>>> The original motivation for moving the PIC and IOAPIC into the kernel
>>> was performance, especially for assigned devices.  Both devices are high
>>> interaction since they deal with interrupts; practically after every
>>> interrupt there is either a PIC ioport write, or an APIC bus message,
>>> both signalling an EOI operation.  Moving the PIT into the kernel
>>> allowed us to catch up with missed timer interrupt injections, and
>>> speeded up guests which read the PIT counters (e.g. tickless guests).
>>>
>>> However, modern guests running on modern qemu use MSI extensively; both
>>> virtio and assigned devices now have MSI support; and the planned VFIO
>>> only supports kernel delivery via MSI anyway; line based interrupts will
>>> need to be mediated by userspace.
>>>      
>> The "modern" guest comment is a bit concerning. 2.4 kernels (e.g.,
>> RHEL3) use the PIT for timekeeping and will still be around for a while.
>> RHEL4 and RHEL5 will be around for a long time to come. Not sure how
>> those fit within the "modern" label, though I see my RHEL4 guest is
>> using the pit as a timesource.
>>    
> 
> First of all, the existing code will remain for a long while (several
> years).  We still have to support existing userspace.
> 
> But, that's not a satisfactory answer.  I don't want users to choose
> which device model to use according to their guest.  As far as I'm
> concerned all guests are triple-boot with the guest rebooting to a
> different OS every half hour.
> 
> So it's important to know how often your RHEL3/4 guest queries the PIT
> (not just receives interrupts, actually reads the counter) under a
> realistic load.  If you have such a number (in reads/sec) that would be
> a good input to this discussion.
> 

Apps that invoke gettimeofday a lot. As I recall RHEL3 uses the TSC
between timer interrupts, but RHEL4 samples counters on each
gettimeofday call:

http://www.mail-archive.com/kvm@vger.kernel.org/msg07231.html

Because of that, the performance of applications that timestamp log
entries (like a certain product I work on) takes a hit on KVM unless the
TSC is the clock source.
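
A rough way to see the per-call cost inside a given guest is a loop
like the one below.  This is only a sketch: the function name is made
up, and it measures the maximum call rate the guest can sustain, not
how often a real workload actually calls gettimeofday.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/*
 * Call gettimeofday() in a tight loop and return the achieved rate in
 * calls per second.  Returns 0.0 if the clock did not advance across
 * the loop (too few iterations for the clock's resolution).
 */
static double gettimeofday_calls_per_sec(unsigned long iterations)
{
    struct timeval start, end, scratch;

    assert(iterations > 0);

    gettimeofday(&start, NULL);
    for (unsigned long i = 0; i < iterations; i++)
        gettimeofday(&scratch, NULL);
    gettimeofday(&end, NULL);

    double elapsed = (double)(end.tv_sec - start.tv_sec)
                   + (double)(end.tv_usec - start.tv_usec) / 1e6;
    if (elapsed <= 0.0)
        return 0.0;
    return (double)iterations / elapsed;
}
```

Comparing the result with the guest clock source set to the PIT and
then to the TSC gives a feel for the size of the hit.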

David


* [Qemu-devel] Re: [RFC] Moving the kvm ioapic, pic, and pit back to userspace
  2010-06-07 18:54     ` David S. Ahern
@ 2010-06-07 19:16       ` Avi Kivity
  0 siblings, 0 replies; 14+ messages in thread
From: Avi Kivity @ 2010-06-07 19:16 UTC (permalink / raw)
  To: David S. Ahern; +Cc: qemu-devel, KVM list

On 06/07/2010 09:54 PM, David S. Ahern wrote:
>
>> So it's important to know how often your RHEL3/4 guest queries the PIT
>> (not just receives interrupts, actually reads the counter) under a
>> realistic load.  If you have such a number (in reads/sec) that would be
>> a good input to this discussion.
>>
>>      
> Aps that invoke gettimeofday a lot.

Ask a stupid question, get an "it depends on the workload" answer.

> As I recall RHEL3 uses the TSC
> between timer interrupts, but RHEL4 samples counters on each
> gettimeofday call:
>
> http://www.mail-archive.com/kvm@vger.kernel.org/msg07231.html
>
> Because of that performance of applications that timestamp log entries
> (like a certain product I work on) takes a hit on KVM unless the TSC is
> the clock source.
>    

So it looks like dropping the PIT out of the kernel, let alone the 
PIC/IOAPIC, is out of the question.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* [Qemu-devel] Re: [RFC] Moving the kvm ioapic, pic, and pit back to userspace
  2010-06-07 18:42   ` Avi Kivity
@ 2010-06-07 22:23     ` Anthony Liguori
  2010-06-08  5:48       ` Avi Kivity
  0 siblings, 1 reply; 14+ messages in thread
From: Anthony Liguori @ 2010-06-07 22:23 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel, KVM list

On 06/07/2010 01:42 PM, Avi Kivity wrote:
> On 06/07/2010 08:04 PM, Anthony Liguori wrote:
>>
>> I think we could also move the local APIC.
>
> I'm not even sure we can safely move the ioapic/pic (mostly due to 
> churn).  But the local APIC is so heavily accessed by the guest that 
> it's impossible to move it.  Run an ftrace one day, especially on an 
> smp guest.  Every IPI requires several APIC accesses.  Before a halt a 
> tickless kernel sets the wakeup timer.  EOIs.
>
>>
>> To optimize device models, we've tended to put the full device model 
>> in the kernel whereas the hardware vendors have tended to put only 
>> the fast paths of the devices models in hardware.
>>
>> For instance, we could introduce a userspace interface similar to 
>> vapic support whereas a shared page that mapped the APIC's layout was 
>> used with a mask to select which registers trapped on read/write.
>
> That leads to very problematic interfaces.  When you separate along a 
> device boundary, you have a spec that defines the software 
> interfaces.  When you separate along a boundary that you define, it's 
> up to you to get everything right.
>
> In fact with the ioapic/pic/lapic one of the problems is that the 
> interconnection between the devices that is not well defined, and 
> that's where we have bugs.
>
>>
>> That said, I can understand an argument that the local APIC is part 
>> of the CPU state since it's a very special type of device.
>>
>> A better example would be a generic counter kernel mechanism.  I can 
>> envision such a device as doing nothing more than providing a 
>> read-only view of a counter with a userspace configurable divider and 
>> width.  Any write to the counter or read of any other byte outside 
>> the counter register would result in a trap to userspace.
>
> What about latches?  byte access to word registers?  There will be as 
> many special cases as there are timers.
>
> If the kernel supported a bytecode/jit facility I'd happily use that 
> to download portions of the device model into the kernel.
>
>>
>> That should allow both the PIT and the HPET to be accelerated with 
>> minimal effort in the kernel.
>
> IMO it's probably more effort than porting HPET to the kernel.  Try 
> outlining an interface that supports PIT, HPET, RTC, and ACPI PMTIMER.

I was referring specifically to time sources, not time events.

An accelerated counter for HPET is pretty trivial.  It's a 32-bit 
register that's actually a nanosecond value in qemu.  We need to be able 
to set an offset from the host wall clock time, a means to stop it, and 
a means to start it.

The PIT is latched so the kernel needs to know enough about how to
decode the PIT state to understand the latching.  There's very little
state associated with latching though so I don't think this is a huge
problem.  It's a fixed value write to a fixed register followed by a
read to a fixed register.  The act of latching doesn't affect the state
beyond the fact that you need to save the latched value in the event
that you have a live migration before reading the latched value.
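
As a sketch of how little state that is, consider channel 0 only.  The
port numbers and the counter latch command encoding (bits 7:6 select
the counter, bits 5:4 = 00 mean "latch") are the standard 8254 ones;
the struct and function names are hypothetical, and the sketch assumes
the channel is programmed for lobyte/hibyte access.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PIT_PORT_CH0 0x40  /* channel 0 data port */
#define PIT_PORT_CMD 0x43  /* mode/command port */

struct pit_ch0 {
    uint16_t count;    /* current channel 0 count (maintained elsewhere) */
    uint16_t latched;  /* value captured by the last latch command */
    bool latch_valid;  /* a latched value is waiting to be read */
    bool high_next;    /* next data read returns the high byte */
};

/* Returns true if handled in the fast path, false to trap to userspace. */
static bool pit_fast_write(struct pit_ch0 *p, uint16_t port, uint8_t val)
{
    assert(p != NULL);
    /* Only the channel 0 counter latch command is handled here. */
    if (port != PIT_PORT_CMD || (val & 0xf0) != 0x00)
        return false;
    /* A second latch before the first is read is ignored by the 8254. */
    if (!p->latch_valid) {
        p->latched = p->count;
        p->latch_valid = true;
        p->high_next = false;
    }
    return true;
}

/* Returns true if handled in the fast path, false to trap to userspace. */
static bool pit_fast_read(struct pit_ch0 *p, uint16_t port, uint8_t *val)
{
    assert(p != NULL && val != NULL);
    if (port != PIT_PORT_CH0 || !p->latch_valid)
        return false;
    if (!p->high_next) {
        *val = (uint8_t)(p->latched & 0xff);
        p->high_next = true;
    } else {
        *val = (uint8_t)(p->latched >> 8);
        p->high_next = false;
        p->latch_valid = false;  /* latch is consumed after both bytes */
    }
    return true;
}
```

The latched, latch_valid and high_next fields are the only extra state
that would have to travel with a live migration.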

The PMTIMER is also pretty straightforward.  It's a variable port
address (that's fixed during execution).

Even if we require three separate interfaces, the interfaces are so
simple that it seems like an obvious win.

>>
>> I'd be in favor of a straight port to userspace.  We already have the 
>> interfaces to communicate with an external device model for these 
>> devices so let's just take the kernel code and stick it into 
>> dedicated threads in userspace.
>
> Currently we support an all-or-nothing approach.  I don't think local 
> APIC in userspace is worthwhile.  Esp. as it will slow down vhost and 
> assigned devices significantly - interrupts will have to be mediated 
> by userspace.

Yeah, as I said, I can understand the arguments for keeping the lapic in 
the kernel.

>>
>> I think it's easier to then work to merge the two bits of code in the 
>> same tree than it is to try and take out-of-tree code and merge it 
>> incrementally.
>
> Are you talking about qemu.git/qemu-kvm.git?  That's the least of my 
> concerns, I'm worried about kvm.git.

qemu.git.

>>
>>> 5. Risk
>>>
>>> We may find out after all this is implemented that performance is 
>>> not acceptable and all the work will have to be dropped.
>>
>> That's another advantage to a straight port to userspace.  We can 
>> collect performance data with only a modest amount of engineering 
>> effort.
>
> Port what exactly?  We have a userspace irqchip implementation.  What 
> we don't have is just the ioapic/pic/pit in userspace, and the only 
> way to try it out is to implement the whole thing.

If you take the kernel code and do a pretty straight port: switching 
kernel functions to libc functions and maintaining all the existing 
locking via pthreads, you could then implement a very simple MMIO/PIO 
dispatch mechanism in the kvm code that handles those devices before we 
ever hit the qemu_mutex and the traditional qemu code paths.  It should 
be a relatively easy conversion and it gives a proper vehicle for 
experiments.
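
As a rough illustration of such a dispatch mechanism, the sketch below 
routes a port I/O access to a small table of fast-path devices, each 
guarded by its own pthread mutex, and reports a miss so the caller can 
fall back to the regular path.  Everything here (`fast_pio_dev`, 
`fast_pio_dispatch`, the scratch-register callbacks) is a hypothetical 
name made up for the example, not existing qemu or kvm code.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fast-path device: a port range with its own lock. */
struct fast_pio_dev {
    uint16_t base, len;
    pthread_mutex_t lock;
    void *opaque;
    uint8_t (*read)(void *opaque, uint16_t offset);
    void (*write)(void *opaque, uint16_t offset, uint8_t val);
};

/*
 * Returns 1 if a fast-path device handled the access, 0 if the caller
 * must fall back to the regular path (the one that takes the global lock).
 */
static int fast_pio_dispatch(struct fast_pio_dev *devs, size_t n,
                             uint16_t port, int is_write, uint8_t *val)
{
    assert(val != NULL);
    for (size_t i = 0; i < n; i++) {
        struct fast_pio_dev *d = &devs[i];

        if (port < d->base || port - d->base >= d->len)
            continue;
        pthread_mutex_lock(&d->lock);      /* per-device lock only */
        if (is_write)
            d->write(d->opaque, port - d->base, *val);
        else
            *val = d->read(d->opaque, port - d->base);
        pthread_mutex_unlock(&d->lock);
        return 1;
    }
    return 0;
}

/* Trivial example device: a single scratch register at every offset. */
static uint8_t scratch_read(void *opaque, uint16_t offset)
{
    (void)offset;
    return *(uint8_t *)opaque;
}

static void scratch_write(void *opaque, uint16_t offset, uint8_t val)
{
    (void)offset;
    *(uint8_t *)opaque = val;
}
```

A vcpu thread would call `fast_pio_dispatch()` first and only take the 
global mutex when it returns 0.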

In fact, you could pretty quickly determine viability by porting the PIT 
to userspace and implementing a vpit interface in the kernel that 
allowed the channel 0 counters to be latched and read within lightweight 
exits.

Regards,

Anthony Liguori

* [Qemu-devel] Re: [RFC] Moving the kvm ioapic, pic, and pit back to userspace
  2010-06-07 22:23     ` Anthony Liguori
@ 2010-06-08  5:48       ` Avi Kivity
  0 siblings, 0 replies; 14+ messages in thread
From: Avi Kivity @ 2010-06-08  5:48 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel, KVM list

On 06/08/2010 01:23 AM, Anthony Liguori wrote:
>>> A better example would be a generic counter kernel mechanism.  I can 
>>> envision such a device as doing nothing more than providing a 
>>> read-only view of a counter with a userspace configurable divider 
>>> and width.  Any write to the counter or read of any other byte 
>>> outside the counter register would result in a trap to userspace.
>>
>> What about latches?  byte access to word registers?  There will be as 
>> many special cases as there are timers.
>>
>> If the kernel supported a bytecode/jit facility I'd happily use that 
>> to download portions of the device model into the kernel.
>>
>>>
>>> That should allow both the PIT and the HPET to be accelerated with 
>>> minimal effort in the kernel.
>>
>> IMO it's probably more effort than porting HPET to the kernel.  Try 
>> outlining an interface that supports PIT, HPET, RTC, and ACPI PMTIMER.
>
>
> I was referring specifically to time sources, not time events.
>
> An accelerated counter for HPET is pretty trivial.  It's a 32-bit 
> register that's actually a nanosecond value in qemu.  We need to be 
> able to set an offset from the host wall clock time, a means to stop 
> it, and a means to start it.
>
> The PIT is latched so the kernel needs to know enough about how to 
> decode the PIT state to understand the latching.  There's very little 
> state associated with latching though so I don't think this is a huge 
> problem.  It's a fixed value write to a fixed register followed by a 
> read to a fixed register.  The act of latching doesn't affect the 
> state beyond the fact that you need to save the latched value in the 
> event that you have a live migration before reading the latched value.
>
> The PMTIMER is also pretty straightforward.  It's a variable port 
> address (that's fixed during execution).
>
> Even if we require three separate interfaces, the interfaces are so 
> simple that it seems like an obvious win.

So a non-generic interface - 4x the interfaces (including RTC).

Those counters raise interrupts when they expire, and set various status 
bits in their hardware.  So we need 4x of:

   set counter value, frequency, and reload interval
   raise alarm to userspace on expiration
   set counter memory/ioport location and availability
   read counter value

and we haven't solved interrupt coalescing.
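
For a sense of what each copy would carry, here is a minimal sketch in C 
of the "read counter value" and expiration parts for one down-counter 
driven from host time.  All names (`gen_counter`, `gen_counter_read`, 
`gen_counter_expirations`) are hypothetical, the port/MMIO placement is 
left out, and nothing here addresses coalescing.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ull

/* Hypothetical per-counter state; all names are illustrative. */
struct gen_counter {
    uint64_t start_ns;  /* host time when the counter was loaded */
    uint64_t freq_hz;   /* tick rate */
    uint64_t initial;   /* value loaded at start_ns */
    uint64_t reload;    /* reload interval in ticks, 0 for one-shot */
};

/* Ticks elapsed since the counter was loaded. */
static uint64_t gen_counter_ticks(const struct gen_counter *c, uint64_t now_ns)
{
    assert(now_ns >= c->start_ns);
    uint64_t elapsed = now_ns - c->start_ns;

    /* Split into seconds and remainder to keep the multiply in range. */
    return (elapsed / NSEC_PER_SEC) * c->freq_hz
         + (elapsed % NSEC_PER_SEC) * c->freq_hz / NSEC_PER_SEC;
}

/* "read counter value": current value of the down-counter. */
static uint64_t gen_counter_read(const struct gen_counter *c, uint64_t now_ns)
{
    uint64_t ticks = gen_counter_ticks(c, now_ns);

    if (ticks < c->initial)
        return c->initial - ticks;
    if (c->reload == 0)
        return 0;                       /* one-shot: stays expired */
    return c->reload - (ticks - c->initial) % c->reload;
}

/* Number of expirations so far; each one would be an alarm to userspace. */
static uint64_t gen_counter_expirations(const struct gen_counter *c,
                                        uint64_t now_ns)
{
    uint64_t ticks = gen_counter_ticks(c, now_ns);

    if (ticks < c->initial)
        return 0;
    if (c->reload == 0)
        return 1;
    return 1 + (ticks - c->initial) / c->reload;
}
```

If userspace is slow to pick up the alarm, `gen_counter_expirations()` 
may have advanced by more than one, which is exactly the coalescing 
question left open above.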

>
>>>
>>>> 5. Risk
>>>>
>>>> We may find out after all this is implemented that performance is 
>>>> not acceptable and all the work will have to be dropped.
>>>
>>> That's another advantage to a straight port to userspace.  We can 
>>> collect performance data with only a modest amount of engineering 
>>> effort.
>>
>> Port what exactly?  We have a userspace irqchip implementation.  What 
>> we don't have is just the ioapic/pic/pit in userspace, and the only 
>> way to try it out is to implement the whole thing.
>
> If you take the kernel code and do a pretty straight port: switching 
> kernel functions to libc functions and maintaining all the existing 
> locking via pthreads, you could then implement a very simple MMIO/PIO 
> dispatch mechanism in the kvm code that handles those devices before 
> we ever hit the qemu_mutex and the traditional qemu code paths.  It 
> should be a relatively easy conversion and it gives a proper vehicle 
> for experiments.

Those devices don't exist independently of the rest of the devices.  If 
they need to post interrupts, they will need the traditional qemu code 
paths.

(I'm trying to view the move from the POV of the kernel first, assuming 
userspace is as efficient as possible; so I'm not arguing qemu 
inefficiencies should prevent us from doing it.  But they do add up 
considerably to the amount of work involved)

>
> In fact, you could pretty quickly determine viability by porting the 
> PIT to userspace and implementing a vpit interface in the kernel that 
> allowed the channel 0 counters to be latched and read within 
> lightweight exits.


Just looking at it shows the interface is incredibly messy.  You have to 
maintain the control word in the kernel (since it tells you which 
counter to read or write), so now you need a userspace interface to read 
and write the control word.  With the current interface, you have the 
entire thing in a black box that you don't need to worry about (except 
for the speaker port...).
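
To show why the control word matters, here is a small sketch in C that 
decodes an i8254 control word written to port 0x43, following the 
standard bit layout (bits 7-6 select the counter, bits 5-4 the access 
mode, bits 3-1 the operating mode, bit 0 BCD).  The struct and function 
names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded i8254 control word (illustrative type, not kvm's). */
struct pit_ctrl {
    unsigned channel;  /* 0..2, or 3 for the read-back command */
    unsigned access;   /* 0 = counter latch, 1 = low byte only,
                          2 = high byte only, 3 = low byte then high byte */
    unsigned mode;     /* operating mode 0..5 */
    unsigned bcd;      /* 1 = BCD counting, 0 = binary */
};

static struct pit_ctrl pit_decode_ctrl(uint8_t val)
{
    struct pit_ctrl c;

    c.channel = (val >> 6) & 0x3;
    c.access  = (val >> 4) & 0x3;
    c.mode    = (val >> 1) & 0x7;
    if (c.mode > 5)
        c.mode &= 0x3;             /* encodings 6 and 7 alias modes 2 and 3 */
    c.bcd     = val & 0x1;
    assert(c.mode <= 5);
    return c;
}
```

Whichever side services the data-port reads needs the `access` field and 
the current byte position for each channel, so that state must either 
live in the kernel or be pushed there through some extra interface.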


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

* [Qemu-devel] RE: [RFC] Moving the kvm ioapic, pic, and pit back to userspace
  2010-06-07 15:26 [Qemu-devel] [RFC] Moving the kvm ioapic, pic, and pit back to userspace Avi Kivity
  2010-06-07 16:31 ` [Qemu-devel] " David S. Ahern
  2010-06-07 17:04 ` Anthony Liguori
@ 2010-06-09 15:59 ` Dong, Eddie
  2010-06-09 16:05   ` [Qemu-devel] " Avi Kivity
  2 siblings, 1 reply; 14+ messages in thread
From: Dong, Eddie @ 2010-06-09 15:59 UTC (permalink / raw)
  To: Avi Kivity, KVM list; +Cc: Dong, Eddie, qemu-devel

Avi Kivity wrote:
> I am currently investigating a problem with a guest running Linux
> malfunctioning in the NMI watchdog code.  The problem is that we don't
> handle NMI delivery mode for the local APIC LINT0 pin; instead we
> expect ExtInt delivery mode or that the line is disabled completely. 
> In addition the i8254 timer is tied to the BSP, while in this case the
> timer can broadcast to all vcpus.
> 
> There is some code that tries to second-guess the guest and provide it
> the inputs it sees, but this is fragile.  The only way to get reliable
> operation is to emulate the hardware fully.
> 
> Now I'd much rather do that in userspace, since it's a lot of
> sensitive work.  I'll enumerate below the general motivation,
> advantages and disadvantages, and a plan for moving forward.
> 
> Motivation
> ==========
> 
> The original motivation for moving the PIC and IOAPIC into the kernel
> was performance, especially for assigned devices.  Both devices are
> high interaction since they deal with interrupts; practically after
> every interrupt there is either a PIC ioport write, or an APIC bus
> message, both signalling an EOI operation.  Moving the PIT into the
> kernel allowed us to catch up with missed timer interrupt injections,
> and speeded up guests which read the PIT counters (e.g. tickless
> guests). 
> 
> However, modern guests running on modern qemu use MSI extensively;
> both virtio and assigned devices now have MSI support; and the
> planned VFIO only supports kernel delivery via MSI anyway; line based
> interrupts will need to be mediated by userspace.
> 
> The only high frequency non-MSI interrupt sources remaining are the
> various timers; and the default one, HPET, is in userspace (and having
> its own scaling problems as a result).  So in theory we can move PIC,
> IOAPIC, and PIT support to userspace and not lose much performance.
> 
> Moving the implementation to userspace allows us more flexibility, and
> more consistency in the implementation of timekeeping for the various
> clock chips; it becomes easier to follow the nuances of real hardware
> in this area.
> 
> Interestingly, while the IOAPIC/PIC code was written we proposed
> making it independent of the local APIC; had we done so, the move
> would have been much easier (simply dropping the existing code).
> 
> 
> Advantages of a move
> ====================
> 
> 1. Reduced kernel footprint
> 
> Good for security, and allows fixing bugs without reboots.
> 
> 2. Centralized timekeeping
> 
> Instead of having one solution for PIT timekeeping, and another for
> RTC and HPET timekeeping, we can have all timer chips in userspace. 
> The local APIC timer still needs to be in the kernel - it is much too
> high bandwidth to be in userspace; but on the other hand it is very
> different from the other timer chips.
> 
> 3. Flexibility
> 
> Easier to have weird board layouts (multiple IOAPICs, etc.).  Not a
> very strong advantage.
> 
> Disadvantages
> =============
> 
> 1. Still need to keep the old code around for a long while
> 
> We can't just rip it out - old userspace depends on it.  So the
> security advantages are only with cooperating userspace, and the
> other advantages only show up.
> 
> 2. Need to bring the qemu code up to date
> 
> The current qemu ioapic code lags some way behind the kernel; also
> need PIT timekeeping
> 
> 3. May need kernel support for interval-timer-follows-thread
> 
> Currently the timekeeping code has an optimization which causes the
> hrtimer that models the PIT to follow the BSP (which is most likely to
> receive the interrupt); this reduces cpu cross-talk.
> 
> I don't think the kernel interval timer code has such an optimization;
> we may need to implement it.
> 
> 4. Much churn
> 
> This is a lot of work.
> 
> 5. Risk
> 
> We may find out after all this is implemented that performance is not
> acceptable and all the work will have to be dropped.
> 
> 

Besides the risk of performance overhead from VF I/O interrupts and timer interrupts, EOI message delivery from the lapic to the ioapic, which now lives in userland, may have a scalability issue. For example, with a 64-VCPU guest where each vcpu takes interrupts (or IPIs) at 1 kHz, guest EOIs would normally involve the ioapic module for clearance at 64 kHz, which may cause long lock contention. You may reduce the ioapic's involvement in EOI by tracking the ioapic pin <-> vector map in the kernel, but I'm not sure that is clean enough.

* [Qemu-devel] Re: [RFC] Moving the kvm ioapic, pic, and pit back to userspace
  2010-06-09 15:59 ` [Qemu-devel] " Dong, Eddie
@ 2010-06-09 16:05   ` Avi Kivity
  2010-06-10  2:37     ` [Qemu-devel] " Dong, Eddie
  0 siblings, 1 reply; 14+ messages in thread
From: Avi Kivity @ 2010-06-09 16:05 UTC (permalink / raw)
  To: Dong, Eddie; +Cc: qemu-devel, KVM list

On 06/09/2010 06:59 PM, Dong, Eddie wrote:
>
> Besides the risk of performance overhead from VF I/O interrupts and timer interrupts,

VF usually uses MSI

>   EOI message delivery from the lapic to the ioapic,

Only for non-MSI

>   which now lives in userland, may have a scalability issue. For example, with a 64-VCPU guest where each vcpu takes interrupts (or IPIs) at 1 kHz, guest EOIs would normally involve the ioapic module for clearance at 64 kHz, which may cause long lock contention.

No, EOI for IPI or for local APIC timer does not involve the IOAPIC.

> You may reduce the ioapic's involvement in EOI by tracking the ioapic pin <-> vector map in the kernel, but I'm not sure that is clean enough.

It's sufficient to look at TMR, no?  For edge triggered I don't think we 
need the EOI.
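
A minimal sketch in C of that check, assuming the TMR is kept as eight 
32-bit words (one bit per vector, set when the interrupt was accepted as 
level triggered); the type and function names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local APIC trigger mode register: 256 bits, one per vector. */
struct lapic_tmr {
    uint32_t word[8];
};

/*
 * On a guest EOI for `vector`, decide whether the IOAPIC has to be told.
 * Only level-triggered vectors (TMR bit set) need it; for edge-triggered
 * interrupts and IPIs the EOI can be completed in the local APIC alone.
 */
static bool eoi_needs_ioapic(const struct lapic_tmr *tmr, unsigned vector)
{
    assert(vector < 256);
    return (tmr->word[vector / 32] >> (vector % 32)) & 1u;
}
```

With a check like this, only EOIs for level-triggered vectors would 
leave the kernel.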

But, the amount of churn and risk worries me, so I don't think the move 
is worthwhile.

-- 
error compiling committee.c: too many arguments to function

* [Qemu-devel] RE: [RFC] Moving the kvm ioapic, pic, and pit back to userspace
  2010-06-09 16:05   ` [Qemu-devel] " Avi Kivity
@ 2010-06-10  2:37     ` Dong, Eddie
  2010-06-10  2:59       ` [Qemu-devel] " Avi Kivity
  0 siblings, 1 reply; 14+ messages in thread
From: Dong, Eddie @ 2010-06-10  2:37 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Dong, Eddie, qemu-devel, KVM list

Avi Kivity wrote:
> On 06/09/2010 06:59 PM, Dong, Eddie wrote:
>> 
>> Besides the risk of performance overhead from VF I/O interrupts and
>> timer interrupts,
> 
> VF usually uses MSI

Typo, I meant PV I/O.
A VF interrupt usually fires at 4-8 kHz. How about virtio?
I assume virtio will be widely used with legacy guests in INTx mode.

> 
>>   EOI message delivery from the lapic to the ioapic,
> 
> Only for non-MSI
> 
>>   which now lives in userland, may have a scalability issue. For
>> example, with a 64-VCPU guest where each vcpu takes interrupts (or
>> IPIs) at 1 kHz, guest EOIs would normally involve the ioapic module
>> for clearance at 64 kHz, which may cause long lock contention.
> 
> No, EOI for IPI or for local APIC timer does not involve the IOAPIC.
> 
>> You may reduce the ioapic's involvement in EOI by tracking the ioapic
>> pin <-> vector map in the kernel, but I'm not sure that is clean enough.
> 
> It's sufficient to look at TMR, no?  For edge triggered I don't think
> we need the EOI.

Hmm, I noticed a difference between the new SDM and the old SDM.
In the old SDM, an IPI can use either edge or level trigger mode, but the new SDM says only INIT can use both.
Given that the new SDM eliminates level trigger mode for other IPIs, it is OK.

> 
> But, the amount of churn and risk worries me, so I don't think the
> move is worthwhile.

This also reminds me of the debate in the early days of KVM, when Gregory Haskins was working on support for an arbitrary choice of irqchip.
Even at that time (without SMP) the patch was very complicated.

I agree that from an ROI point of view, this move may not be worthwhile.
thx, eddie

* [Qemu-devel] Re: [RFC] Moving the kvm ioapic, pic, and pit back to userspace
  2010-06-10  2:37     ` [Qemu-devel] " Dong, Eddie
@ 2010-06-10  2:59       ` Avi Kivity
  2010-06-10 14:42         ` [Qemu-devel] " Dong, Eddie
  0 siblings, 1 reply; 14+ messages in thread
From: Avi Kivity @ 2010-06-10  2:59 UTC (permalink / raw)
  To: Dong, Eddie; +Cc: qemu-devel, KVM list

On 06/10/2010 05:37 AM, Dong, Eddie wrote:
> Avi Kivity wrote:
>    
>> On 06/09/2010 06:59 PM, Dong, Eddie wrote:
>>      
>>> Besides the risk of performance overhead from VF I/O interrupts and
>>> timer interrupts,
>>>        
>> VF usually uses MSI
>>      
> Typo, I meant PV I/O.
>    

That also uses MSI these days.

> A VF interrupt usually fires at 4-8 kHz. How about virtio?
> I assume virtio will be widely used with legacy guests in INTx mode.
>    

True, but in time it will be replaced by MSI.

Note that without vhost, virtio is also in userspace, so there are lots 
of exits anyway for the status register.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

* [Qemu-devel] RE: [RFC] Moving the kvm ioapic, pic, and pit back to userspace
  2010-06-10  2:59       ` [Qemu-devel] " Avi Kivity
@ 2010-06-10 14:42         ` Dong, Eddie
  0 siblings, 0 replies; 14+ messages in thread
From: Dong, Eddie @ 2010-06-10 14:42 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Dong, Eddie, qemu-devel, KVM list


>> A VF interrupt usually fires at 4-8 kHz. How about virtio?
>> I assume virtio will be widely used with legacy guests in INTx
>> mode.
>> 
> 
> True, but in time it will be replaced by MSI.
> 
> Note that without vhost, virtio is also in userspace, so there are lots
> of exits anyway for the status register.

A few months ago, we noticed that the interrupt frequency of PV I/O in our previous solution was almost the same as that of a physical NIC, which fires at ~4 kHz. Each PV I/O frontend driver (or its interrupt source) has a similar interrupt frequency, which means N times more interrupts. I guess virtio is in a similar situation.

We then optimized PV I/O to reduce interrupts to the guest by throttling interrupts on the backend side, as native NICs do (the so-called ITR register in Intel NICs). We saw a 30-90% saving in CPU utilization, depending on how many frontend driver interrupts are in use. I'm not sure whether this has been adopted on the vhost side.

One drawback, of course, is latency, but it is mostly tolerable if the interrupt rate is reduced to ~1 kHz.
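
As a rough illustration of backend-side throttling, the sketch below in C enforces a minimum interval between injected interrupts and defers anything that arrives sooner. It is a generic moderation scheme with hypothetical names (`irq_throttle`, `throttle_event`, `throttle_timer`); it does not reproduce the ITR register semantics or the optimization described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical throttle state for one interrupt source. */
struct irq_throttle {
    uint64_t min_interval_ns;  /* e.g. 1000000 for a ~1 kHz ceiling */
    uint64_t last_inject_ns;   /* time of the last injected interrupt */
    bool     pending;          /* an interrupt was deferred */
};

/*
 * The backend has work for the guest. Returns true if the interrupt
 * should be injected now, false if it was deferred.
 */
static bool throttle_event(struct irq_throttle *t, uint64_t now_ns)
{
    assert(now_ns >= t->last_inject_ns);   /* assumes a monotonic clock */
    if (now_ns - t->last_inject_ns >= t->min_interval_ns) {
        t->last_inject_ns = now_ns;
        t->pending = false;
        return true;
    }
    t->pending = true;   /* a timer is expected to call throttle_timer() */
    return false;
}

/* Timer callback at the end of the interval: flush a deferred interrupt. */
static bool throttle_timer(struct irq_throttle *t, uint64_t now_ns)
{
    assert(now_ns >= t->last_inject_ns);
    if (!t->pending)
        return false;
    t->last_inject_ns = now_ns;
    t->pending = false;
    return true;
}
```

Several events inside one interval collapse into a single deferred interrupt, which is where the CPU saving comes from, and the deferral is the added latency.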

Thx, Eddie
