* [Qemu-devel] [RFC] Moving the kvm ioapic, pic, and pit back to userspace
From: Avi Kivity @ 2010-06-07 15:26 UTC
To: KVM list; +Cc: qemu-devel
I am currently investigating a problem with a guest running Linux
malfunctioning in the NMI watchdog code. The problem is that we don't
handle NMI delivery mode for the local APIC LINT0 pin; instead we expect
ExtInt delivery mode or that the line is disabled completely. In
addition, the i8254 timer is tied to the BSP, while in this case the
timer can broadcast to all vcpus.
There is some code that tries to second-guess the guest and provide it
the inputs it sees, but this is fragile. The only way to get reliable
operation is to emulate the hardware fully.
Now I'd much rather do that in userspace, since it's a lot of sensitive
work. I'll enumerate below the general motivation, advantages and
disadvantages, and a plan for moving forward.
Motivation
==========
The original motivation for moving the PIC and IOAPIC into the kernel
was performance, especially for assigned devices. Both devices are high
interaction since they deal with interrupts; practically after every
interrupt there is either a PIC ioport write, or an APIC bus message,
both signalling an EOI operation. Moving the PIT into the kernel
allowed us to catch up with missed timer interrupt injections, and
speeded up guests which read the PIT counters (e.g. tickless guests).
However, modern guests running on modern qemu use MSI extensively; both
virtio and assigned devices now have MSI support; and the planned VFIO
only supports kernel delivery via MSI anyway; line based interrupts will
need to be mediated by userspace.
The only high frequency non-MSI interrupt sources remaining are the
various timers; and the default one, HPET, is in userspace (and having
its own scaling problems as a result). So in theory we can move PIC,
IOAPIC, and PIT support to userspace and not lose much performance.
Moving the implementation to userspace allows us more flexibility, and
more consistency in the implementation of timekeeping for the various
clock chips; it becomes easier to follow the nuances of real hardware in
this area.
Interestingly, while the IOAPIC/PIC code was written we proposed making
it independent of the local APIC; had we done so, the move would have
been much easier (simply dropping the existing code).
Advantages of a move
====================
1. Reduced kernel footprint
Good for security, and allows fixing bugs without reboots.
2. Centralized timekeeping
Instead of having one solution for PIT timekeeping, and another for RTC
and HPET timekeeping, we can have all timer chips in userspace. The
local APIC timer still needs to be in the kernel - it is much too high
bandwidth to be in userspace; but on the other hand it is very different
from the other timer chips.
3. Flexibility
Easier to have weird board layouts (multiple IOAPICs, etc.). Not a very
strong advantage.
Disadvantages
=============
1. Still need to keep the old code around for a long while
We can't just rip it out - old userspace depends on it. So the security
advantages are only with cooperating userspace, and the other advantages
only show up with new userspace.
2. Need to bring the qemu code up to date
The current qemu ioapic code lags some way behind the kernel; we also
need PIT timekeeping.
3. May need kernel support for interval-timer-follows-thread
Currently the timekeeping code has an optimization which causes the
hrtimer that models the PIT to follow the BSP (which is most likely to
receive the interrupt); this reduces cpu cross-talk.
I don't think the kernel interval timer code has such an optimization;
we may need to implement it.
4. Much churn
This is a lot of work.
5. Risk
We may find out after all this is implemented that performance is not
acceptable and all the work will have to be dropped.
Proposed interface
==================
1. KVM_SET_LINT_PIN (vcpu ioctl)
Sets the value (0 or 1) that a vcpu's LINT0 or LINT1 senses.
Note: problematic; may be high frequency but ignored due to masking at
the local APIC LVT level. Will also be broadcast across all vcpus by
userspace with typical configurations. We may need a way to tell
userspace we'll be ignoring those signals.
May also be extended to emulate thermal interrupts if someone feels the
need.
An alternative is a couple of new fields in kvm_run which are sampled on
every entry (unless masked).
2. KVM_EXIT_REASON_INTACK (kvm_run exit reason)
Informs userspace that the vcpu is running an INTACK cycle; userspace
should provide the interrupt vector on the next KVM_VCPU_RUN.
3. KVM_APIC_MESSAGE (vm ioctl)
Sends an APIC message on the APIC message bus, if the destination is in
the kernel (typically IOAPIC interrupt messages).
4. KVM_EXIT_REASON_APIC_MESSAGE (kvm_run exit reason)
Sends an APIC message on the APIC message bus, if the destination is not
in the kernel (typically IOAPIC EOI messages).
The above are all architectural, and correspond to wires on physical
systems. This increases the confidence that they are correct.
5. KVM_REQUEST_EOI (vcpu ioctl) / KVM_EXIT_EOI (kvm_run exit reason)
We will get EOI messages via KVM_EXIT_REASON_APIC_MESSAGE for
level-triggered interrupts. However, for timekeeping we will also need
an EOI for edge-triggered interrupts (if we choose the ack notifier
method for timekeeping).
6. KVM_EXIT_REASON_LVT_MASK (kvm_run exit reason)
A notification that the LVT LINT0 or LVT LINT1 mask bit has changed, and
thus we don't need to issue useless KVM_SET_LINT_PIN ioctls; also useful
for timekeeping (can disable PIT if configured with ExtInt mode or lapic
disabled).
7. KVM_EXIT_REASON_APIC_MESSAGE_ACK (kvm_run exit reason)
If we use the current timekeeping method of detecting coalesced
interrupts, we'll need an acknowledge when an APIC message is accepted
by a local APIC, with the result (interrupt queued or interrupt
coalesced). This will need to be selectable by vcpu and vector number.
8. KVM_CREATE_IRQCHIP (vm ioctl)
A new flag that tells kvm not to create a PIC and IOAPIC.
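To make the proposal a bit more concrete, here is a rough sketch of how the
definitions might look in a uapi-style header. Everything below is invented
for illustration -- the ioctl numbers, exit reason values, structure layouts,
and flag name are not part of any existing kvm ABI:

#include <linux/kvm.h>  /* KVMIO; everything below is invented for illustration */

/* 1. Set the level a vcpu senses on LINT0 or LINT1 (vcpu ioctl). */
struct kvm_lint_pin {
	__u8 pin;               /* 0 = LINT0, 1 = LINT1 */
	__u8 level;             /* 0 or 1 */
	__u8 pad[6];
};
#define KVM_SET_LINT_PIN        _IOW(KVMIO, 0xf0, struct kvm_lint_pin)

/* 3. Deliver an APIC bus message whose destination is in the kernel
 *    (vm ioctl), e.g. an interrupt message from a userspace IOAPIC.
 */
struct kvm_apic_message {
	__u32 dest_id;
	__u8  dest_mode;
	__u8  delivery_mode;
	__u8  vector;
	__u8  trigger_mode;     /* 0 = edge, 1 = level */
	__u8  level;
	__u8  pad[3];
};
#define KVM_APIC_MESSAGE        _IOW(KVMIO, 0xf1, struct kvm_apic_message)

/* 5. Ask for an EOI exit on a given vector (vcpu ioctl). */
struct kvm_request_eoi {
	__u8 vector;
	__u8 pad[7];
};
#define KVM_REQUEST_EOI         _IOW(KVMIO, 0xf2, struct kvm_request_eoi)

/* 2, 4, 5, 6, 7. New kvm_run exit reasons; the payload (the outgoing
 * struct kvm_apic_message, the INTACK vector, etc.) would live in a
 * new member of the kvm_run union.
 */
#define KVM_EXIT_REASON_INTACK              100
#define KVM_EXIT_REASON_APIC_MESSAGE        101  /* e.g. EOI towards the IOAPIC */
#define KVM_EXIT_EOI                        102
#define KVM_EXIT_REASON_LVT_MASK            103
#define KVM_EXIT_REASON_APIC_MESSAGE_ACK    104

/* 8. Flag for KVM_CREATE_IRQCHIP: create local APICs only. */
#define KVM_IRQCHIP_NO_PIC_IOAPIC           (1 << 0)

Userspace would then issue KVM_APIC_MESSAGE from its IOAPIC model when a
redirection entry fires, and handle KVM_EXIT_REASON_APIC_MESSAGE by routing
the EOI back into that model.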
--
error compiling committee.c: too many arguments to function
* [Qemu-devel] Re: [RFC] Moving the kvm ioapic, pic, and pit back to userspace
From: David S. Ahern @ 2010-06-07 16:31 UTC
To: Avi Kivity; +Cc: qemu-devel, KVM list
On 06/07/10 09:26, Avi Kivity wrote:
> The original motivation for moving the PIC and IOAPIC into the kernel
> was performance, especially for assigned devices. Both devices are high
> interaction since they deal with interrupts; practically after every
> interrupt there is either a PIC ioport write, or an APIC bus message,
> both signalling an EOI operation. Moving the PIT into the kernel
> allowed us to catch up with missed timer interrupt injections, and
> speeded up guests which read the PIT counters (e.g. tickless guests).
>
> However, modern guests running on modern qemu use MSI extensively; both
> virtio and assigned devices now have MSI support; and the planned VFIO
> only supports kernel delivery via MSI anyway; line based interrupts will
> need to be mediated by userspace.
The "modern" guest comment is a bit concerning. 2.4 kernels (e.g.,
RHEL3) use the PIT for timekeeping and will still be around for a while.
RHEL4 and RHEL5 will be around for a long time to come. Not sure how
those fit within the "modern" label, though I see my RHEL4 guest is
using the pit as a timesource.
David
* [Qemu-devel] Re: [RFC] Moving the kvm ioapic, pic, and pit back to userspace
From: Avi Kivity @ 2010-06-07 18:46 UTC
To: David S. Ahern; +Cc: qemu-devel, KVM list
On 06/07/2010 07:31 PM, David S. Ahern wrote:
>
> On 06/07/10 09:26, Avi Kivity wrote:
>
>
>> The original motivation for moving the PIC and IOAPIC into the kernel
>> was performance, especially for assigned devices. Both devices are high
>> interaction since they deal with interrupts; practically after every
>> interrupt there is either a PIC ioport write, or an APIC bus message,
>> both signalling an EOI operation. Moving the PIT into the kernel
>> allowed us to catch up with missed timer interrupt injections, and
>> speeded up guests which read the PIT counters (e.g. tickless guests).
>>
>> However, modern guests running on modern qemu use MSI extensively; both
>> virtio and assigned devices now have MSI support; and the planned VFIO
>> only supports kernel delivery via MSI anyway; line based interrupts will
>> need to be mediated by userspace.
>>
> The "modern" guest comment is a bit concerning. 2.4 kernels (e.g.,
> RHEL3) use the PIT for timekeeping and will still be around for a while.
> RHEL4 and RHEL5 will be around for a long time to come. Not sure how
> those fit within the "modern" label, though I see my RHEL4 guest is
> using the pit as a timesource.
>
First of all, the existing code will remain for a long while (several
years). We still have to support existing userspace.
But, that's not a satisfactory answer. I don't want users to choose
which device model to use according to their guest. As far as I'm
concerned all guests are triple-boot with the guest rebooting to a
different OS every half hour.
So it's important to know how often your RHEL3/4 guest queries the PIT
(not just receives interrupts, actually reads the counter) under a
realistic load. If you have such a number (in reads/sec) that would be
a good input to this discussion.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
* [Qemu-devel] Re: [RFC] Moving the kvm ioapic, pic, and pit back to userspace
From: David S. Ahern @ 2010-06-07 18:54 UTC
To: Avi Kivity; +Cc: qemu-devel, KVM list
On 06/07/10 12:46, Avi Kivity wrote:
> On 06/07/2010 07:31 PM, David S. Ahern wrote:
>>
>> On 06/07/10 09:26, Avi Kivity wrote:
>>
>>
>>> The original motivation for moving the PIC and IOAPIC into the kernel
>>> was performance, especially for assigned devices. Both devices are high
>>> interaction since they deal with interrupts; practically after every
>>> interrupt there is either a PIC ioport write, or an APIC bus message,
>>> both signalling an EOI operation. Moving the PIT into the kernel
>>> allowed us to catch up with missed timer interrupt injections, and
>>> speeded up guests which read the PIT counters (e.g. tickless guests).
>>>
>>> However, modern guests running on modern qemu use MSI extensively; both
>>> virtio and assigned devices now have MSI support; and the planned VFIO
>>> only supports kernel delivery via MSI anyway; line based interrupts will
>>> need to be mediated by userspace.
>>>
>> The "modern" guest comment is a bit concerning. 2.4 kernels (e.g.,
>> RHEL3) use the PIT for timekeeping and will still be around for a while.
>> RHEL4 and RHEL5 will be around for a long time to come. Not sure how
>> those fit within the "modern" label, though I see my RHEL4 guest is
>> using the pit as a timesource.
>>
>
> First of all, the existing code will remain for a long while (several
> years). We still have to support existing userspace.
>
> But, that's not a satisfactory answer. I don't want users to choose
> which device model to use according to their guest. As far as I'm
> concerned all guests are triple-boot with the guest rebooting to a
> different OS every half hour.
>
> So it's important to know how often your RHEL3/4 guest queries the PIT
> (not just receives interrupts, actually reads the counter) under a
> realistic load. If you have such a number (in reads/sec) that would be
> a good input to this discussion.
>
Apps that invoke gettimeofday a lot. As I recall RHEL3 uses the TSC
between timer interrupts, but RHEL4 samples counters on each
gettimeofday call:
http://www.mail-archive.com/kvm@vger.kernel.org/msg07231.html
Because of that, the performance of applications that timestamp log entries
(like a certain product I work on) takes a hit on KVM unless the TSC is
the clock source.
David
* [Qemu-devel] Re: [RFC] Moving the kvm ioapic, pic, and pit back to userspace
From: Avi Kivity @ 2010-06-07 19:16 UTC
To: David S. Ahern; +Cc: qemu-devel, KVM list
On 06/07/2010 09:54 PM, David S. Ahern wrote:
>
>> So it's important to know how often your RHEL3/4 guest queries the PIT
>> (not just receives interrupts, actually reads the counter) under a
>> realistic load. If you have such a number (in reads/sec) that would be
>> a good input to this discussion.
>>
>>
> Apps that invoke gettimeofday a lot.
Ask a stupid question, get an "it depends on the workload" answer.
> As I recall RHEL3 uses the TSC
> between timer interrupts, but RHEL4 samples counters on each
> gettimeofday call:
>
> http://www.mail-archive.com/kvm@vger.kernel.org/msg07231.html
>
> Because of that performance of applications that timestamp log entries
> (like a certain product I work on) takes a hit on KVM unless the TSC is
> the clock source.
>
So it looks like dropping the PIT out of the kernel, let alone the
PIC/IOAPIC, is out of the question.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
* [Qemu-devel] Re: [RFC] Moving the kvm ioapic, pic, and pit back to userspace
From: Anthony Liguori @ 2010-06-07 17:04 UTC
To: Avi Kivity; +Cc: qemu-devel, KVM list
On 06/07/2010 10:26 AM, Avi Kivity wrote:
> I am currently investigating a problem with a guest running Linux
> malfunctioning in the NMI watchdog code. The problem is that we don't
> handle NMI delivery mode for the local APIC LINT0 pin; instead we
> expect ExtInt delivery mode or that the line is disabled completely.
> In addition the i8254 timer is tied to the BSP, while in this case the
> timer can broadcast to all vcpus.
>
> There is some code that tries to second-guess the guest and provide it
> the inputs it sees, but this is fragile. The only way to get reliable
> operation is to emulate the hardware fully.
>
> Now I'd much rather do that in userspace, since it's a lot of
> sensitive work. I'll enumerate below the general motivation,
> advantages and disadvantages, and a plan for moving forward.
>
> Motivation
> ==========
>
> The original motivation for moving the PIC and IOAPIC into the kernel
> was performance, especially for assigned devices. Both devices are
> high interaction since they deal with interrupts; practically after
> every interrupt there is either a PIC ioport write, or an APIC bus
> message, both signalling an EOI operation. Moving the PIT into the
> kernel allowed us to catch up with missed timer interrupt injections,
> and speeded up guests which read the PIT counters (e.g. tickless guests).
>
> However, modern guests running on modern qemu use MSI extensively;
> both virtio and assigned devices now have MSI support; and the planned
> VFIO only supports kernel delivery via MSI anyway; line based
> interrupts will need to be mediated by userspace.
>
> The only high frequency non-MSI interrupt sources remaining are the
> various timers; and the default one, HPET, is in userspace (and having
> its own scaling problems as a result). So in theory we can move PIC,
> IOAPIC, and PIT support to userspace and not lose much performance.
I think we could also move the local APIC.
To optimize device models, we've tended to put the full device model in
the kernel whereas the hardware vendors have tended to put only the fast
paths of the device models in hardware.
For instance, we could introduce a userspace interface similar to vapic
support where a shared page that mapped the APIC's layout was used
with a mask to select which registers trapped on read/write.
That said, I can understand an argument that the local APIC is part of
the CPU state since it's a very special type of device.
A better example would be a generic counter kernel mechanism. I can
envision such a device as doing nothing more than providing a read-only
view of a counter with a userspace configurable divider and width. Any
write to the counter or read of any other byte outside the counter
register would result in a trap to userspace.
That should allow both the PIT and the HPET to be accelerated with
minimal effort in the kernel.
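As a purely hypothetical sketch of such a mechanism (neither the structure
nor any matching ioctl exists), the kernel would only need the parameters
required to compute the counter value on a guest read, and would exit to
userspace for everything else:

#include <linux/types.h>

/* Hypothetical configuration of a kernel-accelerated read-only counter. */
struct kvm_counter_config {
	__u64 base_ns;          /* host monotonic time at which count == start_count */
	__u64 start_count;      /* counter value at base_ns */
	__u32 period_ns;        /* nanoseconds per tick, divider already applied */
	__u8  width;            /* counter width in bits: 16, 24, 32, 64 */
	__u8  running;          /* 0 = stopped: reads return start_count */
	__u8  pad[2];
	__u64 addr;             /* guest ioport or mmio address of the counter */
	__u32 addr_is_mmio;
	__u32 len;              /* access size handled in the kernel */
};

/* On a guest read of [addr, addr + len) the kernel would compute
 *
 *     count = start_count + (now_ns - base_ns) / period_ns    (if running)
 *
 * truncated to 'width' bits; any write, or any access outside that
 * range, would exit to userspace as it does today.
 */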
> Moving the implementation to userspace allows us more flexibility, and
> more consistency in the implementation of timekeeping for the various
> clock chips; it becomes easier to follow the nuances of real hardware
> in this area.
>
> Interestingly, while the IOAPIC/PIC code was written we proposed
> making it independent of the local APIC; had we done so, the move
> would have been much easier (simply dropping the existing code).
>
>
> Advantages of a move
> ====================
>
> 1. Reduced kernel footprint
>
> Good for security, and allows fixing bugs without reboots.
>
> 2. Centralized timekeeping
>
> Instead of having one solution for PIT timekeeping, and another for
> RTC and HPET timekeeping, we can have all timer chips in userspace.
> The local APIC timer still needs to be in the kernel - it is much too
> high bandwidth to be in userspace; but on the other hand it is very
> different from the other timer chips.
>
> 3. Flexibility
>
> Easier to have weird board layouts (multiple IOAPICs, etc.). Not a
> very strong advantage.
>
> Disadvantages
> =============
>
> 1. Still need to keep the old code around for a long while
>
> We can't just rip it out - old userspace depends on it. So the
> security advantages are only with cooperating userspace, and the other
> advantages only show up.
>
> 2. Need to bring the qemu code up to date
>
> The current qemu ioapic code lags some way behind the kernel; also
> need PIT timekeeping
>
> 3. May need kernel support for interval-timer-follows-thread
>
> Currently the timekeeping code has an optimization which causes the
> hrtimer that models the PIT to follow the BSP (which is most likely to
> receive the interrupt); this reduces cpu cross-talk.
>
> I don't think the kernel interval timer code has such an optimization;
> we may need to implement it.
>
> 4. Much churn
>
> This is a lot of work.
I'd be in favor of a straight port to userspace. We already have the
interfaces to communicate with an external device model for these
devices so let's just take the kernel code and stick it into dedicated
threads in userspace.
I think it's easier to then work to merge the two bits of code in the
same tree than it is to try and take out-of-tree code and merge it
incrementally.
> 5. Risk
>
> We may find out after all this is implemented that performance is not
> acceptable and all the work will have to be dropped.
That's another advantage to a straight port to userspace. We can
collect performance data with only a modest amount of engineering effort.
Regards,
Anthony Liguori
>
> Proposed interface
> ==================
>
> 1. KVM_SET_LINT_PIN (vcpu ioctl)
>
> Sets the value (0 or 1) that a vcpu's LINT0 or LINT1 senses.
>
> Note: problematic; may be high frequency but ignored due to masking at
> the local APIC LVT level. Will also be broadcast across all vcpus by
> userspace with typical configurations. We may need a way to tell
> userspace we'll be ignoring those signals.
>
> May also be extended to emulate thermal interrupts if someone feels
> the need.
>
> An alternative is a couple of new fields in kvm_run which are sampled
> on every entry (unless masked).
>
> 2. KVM_EXIT_REASON_INTACK (kvm_run exit reason)
>
> Informs userspace that the vcpu is running an INTACK cycle; userspace
> should provide the interrupt vector on the next KVM_VCPU_RUN.
>
> 3. KVM_APIC_MESSAGE (vm ioctl)
>
> Sends an APIC message on the APIC message bus, if the destination is
> in the kernel (typically IOAPIC interrupt messages).
>
> 4. KVM_EXIT_REASON_APIC_MESSAGE (kvm_run exit reason)
>
> Sends an APIC message on the APIC message bus, if the destination is
> not in the kernel (typically IOAPIC EOI messages).
>
> The above are all architectural, and correspond to wires on physical
> systems. This increases the confidence that they are correct.
>
> 5. KVM_REQUEST_EOI (vcpu ioctl) / KVM_EXIT_EOI (kvm_run exit reason)
>
> We will get EOI messages via KVM_EXIT_REASON_APIC_MESSAGE for
> level-triggered interrupts. However, for timekeeping we will also
> need an EOI for edge triggered interrupts (if we choose the ack
> notifier method for timekeeping).
>
> 6. KVM_EXIT_REASON_LVT_MASK (kvm_run exit reason)
>
> A notification that the LVT LINT0 or LVT LINT1 mask bit has changed,
> and thus we don't need to issue useless KVM_SET_LINT_PIN ioctls; also
> useful for timekeeping (can disable PIT if configured with ExtInt mode
> or lapic disabled).
>
> 7. KVM_EXIT_REASON_APIC_MESSAGE_ACK (kvm_run exit reason)
>
> If we use the current timekeeping method of detecting coalesced
> interrupts, we'll need an acknowledge when an APIC message is accepted
> by a local APIC, with the result (interrupt queued or interrupt
> coalesced). This will need to be selectable by vcpu and vector number.
>
> 8. KVM_CREATE_IRQCHIP (vm ioctl)
>
> A new flag that tells kvm not to create a PIC and IOAPIC.
>
>
* [Qemu-devel] Re: [RFC] Moving the kvm ioapic, pic, and pit back to userspace
From: Avi Kivity @ 2010-06-07 18:42 UTC
To: Anthony Liguori; +Cc: qemu-devel, KVM list
On 06/07/2010 08:04 PM, Anthony Liguori wrote:
>
> I think we could also move the local APIC.
I'm not even sure we can safely move the ioapic/pic (mostly due to
churn). But the local APIC is so heavily accessed by the guest that
it's impossible to move it. Run an ftrace one day, especially on an SMP
guest. Every IPI requires several APIC accesses. Before a halt, a
tickless kernel sets the wakeup timer. EOIs.
>
> To optimize device models, we've tended to put the full device model
> in the kernel whereas the hardware vendors have tended to put only the
> fast paths of the device models in hardware.
>
> For instance, we could introduce a userspace interface similar to
> vapic support where a shared page that mapped the APIC's layout was
> used with a mask to select which registers trapped on read/write.
That leads to very problematic interfaces. When you separate along a
device boundary, you have a spec that defines the software interfaces.
When you separate along a boundary that you define, it's up to you to
get everything right.
In fact, with the ioapic/pic/lapic one of the problems is that the
interconnection between the devices is not well defined, and that's
where we have bugs.
>
> That said, I can understand an argument that the local APIC is part of
> the CPU state since it's a very special type of device.
>
> A better example would be a generic counter kernel mechanism. I can
> envision such a device as doing nothing more than providing a
> read-only view of a counter with a userspace configurable divider and
> width. Any write to the counter or read of any other byte outside the
> counter register would result in a trap to userspace.
What about latches? Byte access to word registers? There will be as
many special cases as there are timers.
If the kernel supported a bytecode/jit facility I'd happily use that to
download portions of the device model into the kernel.
>
> That should allow both the PIT and the HPET to be accelerated with
> minimal effort in the kernel.
IMO it's probably more effort than porting HPET to the kernel. Try
outlining an interface that supports PIT, HPET, RTC, and ACPI PMTIMER.
>
> I'd be in favor of a straight port to userspace. We already have the
> interfaces to communicate with an external device model for these
> devices so let's just take the kernel code and stick it into dedicated
> threads in userspace.
Currently we support an all-or-nothing approach. I don't think local
APIC in userspace is worthwhile. Esp. as it will slow down vhost and
assigned devices significantly - interrupts will have to be mediated by
userspace.
>
> I think it's easier to then work to merge the two bits of code in the
> same tree than it is to try and take out-of-tree code and merge it
> incrementally.
Are you talking about qemu.git/qemu-kvm.git? That's the least of my
concerns, I'm worried about kvm.git.
>
>> 5. Risk
>>
>> We may find out after all this is implemented that performance is not
>> acceptable and all the work will have to be dropped.
>
> That's another advantage to a straight port to userspace. We can
> collect performance data with only a modest amount of engineering effort.
Port what exactly? We have a userspace irqchip implementation. What we
don't have is just the ioapic/pic/pit in userspace, and the only way to
try it out is to implement the whole thing.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
* [Qemu-devel] Re: [RFC] Moving the kvm ioapic, pic, and pit back to userspace
From: Anthony Liguori @ 2010-06-07 22:23 UTC
To: Avi Kivity; +Cc: qemu-devel, KVM list
On 06/07/2010 01:42 PM, Avi Kivity wrote:
> On 06/07/2010 08:04 PM, Anthony Liguori wrote:
>>
>> I think we could also move the local APIC.
>
> I'm not even sure we can safely move the ioapic/pic (mostly due to
> churn). But the local APIC is so heavily accessed by the guest that
> it's impossible to move it. Run an ftrace one day, especially on an
> smp guest. Every IPI requires several APIC accesses. Before a halt a
> tickless kernel sets the wakeup timer. EOIs.
>
>>
>> To optimize device models, we've tended to put the full device model
>> in the kernel whereas the hardware vendors have tended to put only
>> the fast paths of the device models in hardware.
>>
>> For instance, we could introduce a userspace interface similar to
>> vapic support where a shared page that mapped the APIC's layout was
>> used with a mask to select which registers trapped on read/write.
>
> That leads to very problematic interfaces. When you separate along a
> device boundary, you have a spec that defines the software
> interfaces. When you separate along a boundary that you define, it's
> up to you to get everything right.
>
> In fact with the ioapic/pic/lapic one of the problems is that the
> interconnection between the devices is not well defined, and
> that's where we have bugs.
>
>>
>> That said, I can understand an argument that the local APIC is part
>> of the CPU state since it's a very special type of device.
>>
>> A better example would be a generic counter kernel mechanism. I can
>> envision such a device as doing nothing more than providing a
>> read-only view of a counter with a userspace configurable divider and
>> width. Any write to the counter or read of any other byte outside
>> the counter register would result in a trap to userspace.
>
> What about latches? byte access to word registers? There will be as
> many special cases as there are timers.
>
> If the kernel supported a bytecode/jit facility I'd happily use that
> to download portions of the device model into the kernel.
>
>>
>> That should allow both the PIT and the HPET to be accelerated with
>> minimal effort in the kernel.
>
> IMO it's probably more effort than porting HPET to the kernel. Try
> outlining an interface that supports PIT, HPET, RTC, and ACPI PMTIMER.
I was referring specifically to time sources, not time events.
An accelerated counter for HPET is pretty trivial. It's a 32-bit
register that's actually a nanosecond value in qemu. We need to be able
to set an offset from the host wall clock time, a means to stop it, and
a means to start it.
The PIT is latched so the kernel needs to know enough about how to
decode the PIT state to understand the latching. There's very little
state associated with latching though so I don't think this is a huge
problem. It's a fixed value write to a fixed register followed by a
read to a fixed register. The act of latching doesn't affect the state
beyond the fact that you need to save the latched value in the event
that you have a live migration before reading the latched value.
The PMTIMER is also pretty straightforward. It's a variable port
address (that's fixed during execution).
Even if we require three separate interfaces, the interfaces are so
simple that it seems like an obvious win.
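Hypothetically, the extra per-device state on top of such a shared counter
could be as small as the following; none of these structures exist, they
only illustrate how little the kernel would have to understand:

#include <linux/types.h>

struct kvm_pit_counter_state {          /* channel 0 */
	__u16 latch;            /* value captured by the latch command */
	__u8  latched;          /* latch valid; cleared once read out */
	__u8  read_lo_hi;       /* lobyte / hibyte / lobyte-then-hibyte access state */
};

struct kvm_hpet_counter_state {
	__u64 offset_ns;        /* offset from the host wall clock */
	__u8  enabled;          /* main counter running (ENABLE_CNF) */
	__u8  pad[7];
};

struct kvm_pmtimer_state {
	__u16 ioport;           /* variable, but fixed once the guest is running */
	__u8  width;            /* 24- or 32-bit counter */
	__u8  pad[5];
};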
>>
>> I'd be in favor of a straight port to userspace. We already have the
>> interfaces to communicate with an external device model for these
>> devices so let's just take the kernel code and stick it into
>> dedicated threads in userspace.
>
> Currently we support an all-or-nothing approach. I don't think local
> APIC in userspace is worthwhile. Esp. as it will slow down vhost and
> assigned devices significantly - interrupts will have to be mediated
> by userspace.
Yeah, as I said, I can understand the arguments for keeping the lapic in
the kernel.
>>
>> I think it's easier to then work to merge the two bits of code in the
>> same tree than it is to try and take out-of-tree code and merge it
>> incrementally.
>
> Are you talking about qemu.git/qemu-kvm.git? That's the least of my
> concerns, I'm worried about kvm.git.
qemu.git.
>>
>>> 5. Risk
>>>
>>> We may find out after all this is implemented that performance is
>>> not acceptable and all the work will have to be dropped.
>>
>> That's another advantage to a straight port to userspace. We can
>> collect performance data with only a modest amount of engineering
>> effort.
>
> Port what exactly? We have a userspace irqchip implementation. What
> we don't have is just the ioapic/pic/pit in userspace, and the only
> way to try it out is to implement the whole thing.
If you take the kernel code and do a pretty straight port: switching
kernel functions to libc functions and maintaining all the existing
locking via pthreads, you could then implement a very simple MMIO/PIO
dispatch mechanism in the kvm code that short-circuits those devices before
we ever hit the qemu_mutex and the traditional qemu code paths. It
should be a relatively easy conversion and it gives a proper vehicle for
experimentation.
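A minimal sketch of that kind of shortcut dispatch, with invented names and
structure; the real qemu-kvm run loop and its locking differ:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fast-path table consulted before taking qemu_mutex. */
struct fastpath_dev {
	uint64_t addr, len;
	bool     is_pio;
	uint64_t (*read)(void *opaque, uint64_t addr, int size);
	void     (*write)(void *opaque, uint64_t addr, uint64_t val, int size);
	void     *opaque;       /* device state, guarded by its own pthread mutex */
};

static struct fastpath_dev fastpath_devs[8];
static int nr_fastpath_devs;

/* Called from the vcpu thread on an I/O or MMIO exit, before the
 * global lock is taken.
 */
static bool fastpath_dispatch(bool is_pio, uint64_t addr, void *data,
                              int size, bool is_write)
{
	for (int i = 0; i < nr_fastpath_devs; i++) {
		struct fastpath_dev *d = &fastpath_devs[i];

		if (d->is_pio != is_pio || addr < d->addr ||
		    addr + size > d->addr + d->len)
			continue;
		if (is_write) {
			uint64_t val = 0;
			memcpy(&val, data, size);  /* little-endian host assumed */
			d->write(d->opaque, addr, val, size);
		} else {
			uint64_t val = d->read(d->opaque, addr, size);
			memcpy(data, &val, size);
		}
		return true;    /* handled without the global lock */
	}
	return false;           /* fall back to the normal qemu paths */
}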
In fact, you could pretty quickly determine viability by porting the PIT
to userspace and implementing a vpit interface in the kernel that
allowed the channel 0 counters to be latched and read within lightweight
exits.
Regards,
Anthony Liguori
* [Qemu-devel] Re: [RFC] Moving the kvm ioapic, pic, and pit back to userspace
From: Avi Kivity @ 2010-06-08 5:48 UTC
To: Anthony Liguori; +Cc: qemu-devel, KVM list
On 06/08/2010 01:23 AM, Anthony Liguori wrote:
>>> A better example would be a generic counter kernel mechanism. I can
>>> envision such a device as doing nothing more than providing a
>>> read-only view of a counter with a userspace configurable divider
>>> and width. Any write to the counter or read of any other byte
>>> outside the counter register would result in a trap to userspace.
>>
>> What about latches? byte access to word registers? There will be as
>> many special cases as there are timers.
>>
>> If the kernel supported a bytecode/jit facility I'd happily use that
>> to download portions of the device model into the kernel.
>>
>>>
>>> That should allow both the PIT and the HPET to be accelerated with
>>> minimal effort in the kernel.
>>
>> IMO it's probably more effort than porting HPET to the kernel. Try
>> outlining an interface that supports PIT, HPET, RTC, and ACPI PMTIMER.
>
>
> I was referring specifically to time sources, not time events.
>
> An accelerated counter for HPET is pretty trivial. It's a 32-bit
> register that's actually a nanosecond value in qemu. We need to be
> able to set an offset from the host wall clock time, a means to stop
> it, and a means to start it.
>
> The PIT is latched so the kernel needs to know enough about how to
> decode the PIT state to understand the latching. There's very little
> state associated with latching though so I don't think this is a huge
> problem. It's a fixed value write to a fixed register followed by a
> read to a fixed register. The act of latching doesn't affect the
> state beyond the fact that you need to save the latched value in the
> event that you have a live migration before reading the latched value.
>
> The PMTIMER is also pretty straight forward. It's a variable port
> address (that's fixed during execution).
>
> Even if we require three separate interfaces, the interfaces are so
> simple that it seems like an obvious win.
So a non-generic interface - 4x the interfaces (including RTC).
Those counters raise interrupts when they expire, and set various status
bits in their hardware. So we need 4x of:
- set counter value, frequency, and reload interval
- raise alarm to userspace on expiration
- set counter memory/ioport location and availability
- read counter value
and we haven't solved interrupt coalescing.
>
>>>
>>>> 5. Risk
>>>>
>>>> We may find out after all this is implemented that performance is
>>>> not acceptable and all the work will have to be dropped.
>>>
>>> That's another advantage to a straight port to userspace. We can
>>> collect performance data with only a modest amount of engineering
>>> effort.
>>
>> Port what exactly? We have a userspace irqchip implementation. What
>> we don't have is just the ioapic/pic/pit in userspace, and the only
>> way to try it out is to implement the whole thing.
>
> If you take the kernel code and do a pretty straight port: switching
> kernel functions to libc functions and maintaining all the existing
> locking via pthreads, you could then implement a very simple MMIO/PIO
> dispatch mechanism in the kvm code that shortcutted those devices
> before we ever hit the qemu_mutex and the traditional qemu code
> paths. It should be a relatively easy conversion and it gives a
> proper vehicle for doing experimentations.
Those devices don't exist independently of the rest of the devices. If
they need to post interrupts, they will need the traditional qemu code
paths.
(I'm trying to view the move from the POV of the kernel first, assuming
userspace is as efficient as possible; so I'm not arguing qemu
inefficiencies should prevent us from doing it. But they do add up
considerably to the amount of work involved)
>
> In fact, you could pretty quickly determine viability by porting the
> PIT to userspace and implementing a vpit interface in the kernel that
> allowed the channel 0 counters to be latched and read within
> lightweight exits.
Just looking at it shows the interface is incredibly messy. You have to
maintain the control word in the kernel (since it tells you which
counter to read or write), so now you need a userspace interface to read
and write the control word. With the current interface, you have the
entire thing in a black box that you don't need to worry about (except
for the speaker port...).
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
* [Qemu-devel] RE: [RFC] Moving the kvm ioapic, pic, and pit back to userspace
From: Dong, Eddie @ 2010-06-09 15:59 UTC
To: Avi Kivity, KVM list; +Cc: Dong, Eddie, qemu-devel
Avi Kivity wrote:
> I am currently investigating a problem with a guest running Linux
> malfunctioning in the NMI watchdog code. The problem is that we don't
> handle NMI delivery mode for the local APIC LINT0 pin; instead we
> expect ExtInt delivery mode or that the line is disabled completely.
> In addition the i8254 timer is tied to the BSP, while in this case the
> timer can broadcast to all vcpus.
>
> There is some code that tries to second-guess the guest and provide it
> the inputs it sees, but this is fragile. The only way to get reliable
> operation is to emulate the hardware fully.
>
> Now I'd much rather do that in userspace, since it's a lot of
> sensitive work. I'll enumerate below the general motivation,
> advantages and disadvantages, and a plan for moving forward.
>
> Motivation
> ==========
>
> The original motivation for moving the PIC and IOAPIC into the kernel
> was performance, especially for assigned devices. Both devices are
> high interaction since they deal with interrupts; practically after
> every interrupt there is either a PIC ioport write, or an APIC bus
> message, both signalling an EOI operation. Moving the PIT into the
> kernel allowed us to catch up with missed timer interrupt injections,
> and speeded up guests which read the PIT counters (e.g. tickless
> guests).
>
> However, modern guests running on modern qemu use MSI extensively;
> both virtio and assigned devices now have MSI support; and the
> planned VFIO only supports kernel delivery via MSI anyway; line based
> interrupts will need to be mediated by userspace.
>
> The only high frequency non-MSI interrupt sources remaining are the
> various timers; and the default one, HPET, is in userspace (and having
> its own scaling problems as a result). So in theory we can move PIC,
> IOAPIC, and PIT support to userspace and not lose much performance.
>
> Moving the implementation to userspace allows us more flexibility, and
> more consistency in the implementation of timekeeping for the various
> clock chips; it becomes easier to follow the nuances of real hardware
> in this area.
>
> Interestingly, while the IOAPIC/PIC code was written we proposed
> making it independent of the local APIC; had we done so, the move
> would have been much easier (simply dropping the existing code).
>
>
> Advantages of a move
> ====================
>
> 1. Reduced kernel footprint
>
> Good for security, and allows fixing bugs without reboots.
>
> 2. Centralized timekeeping
>
> Instead of having one solution for PIT timekeeping, and another for
> RTC and HPET timekeeping, we can have all timer chips in userspace.
> The local APIC timer still needs to be in the kernel - it is much too
> high bandwidth to be in userspace; but on the other hand it is very
> different from the other timer chips.
>
> 3. Flexibility
>
> Easier to have weird board layouts (multiple IOAPICs, etc.). Not a
> very strong advantage.
>
> Disadvantages
> =============
>
> 1. Still need to keep the old code around for a long while
>
> We can't just rip it out - old userspace depends on it. So the
> security advantages are only with cooperating userspace, and the
> other advantages only show up.
>
> 2. Need to bring the qemu code up to date
>
> The current qemu ioapic code lags some way behind the kernel; also
> need PIT timekeeping
>
> 3. May need kernel support for interval-timer-follows-thread
>
> Currently the timekeeping code has an optimization which causes the
> hrtimer that models the PIT to follow the BSP (which is most likely to
> receive the interrupt); this reduces cpu cross-talk.
>
> I don't think the kernel interval timer code has such an optimization;
> we may need to implement it.
>
> 4. Much churn
>
> This is a lot of work.
>
> 5. Risk
>
> We may find out after all this is implemented that performance is not
> acceptable and all the work will have to be dropped.
>
>
Besides the performance overhead risk introduced by VF I/O interrupts and timer interrupts, EOI message delivery from the lapic to the ioapic, which now happens in userland, may have a potential scalability issue. For example, if we have a 64 VCPU guest and each vcpu has a 1 kHz interrupt (or IPI) rate, the EOIs from the guest will normally have to involve the ioapic module for clearance at 64 kHz, which may mean long lock contention. You may reduce the involvement of the ioapic EOI by tracking the ioapic pin <-> vector map in the kernel, but I am not sure if it is clean enough.
* [Qemu-devel] Re: [RFC] Moving the kvm ioapic, pic, and pit back to userspace
From: Avi Kivity @ 2010-06-09 16:05 UTC
To: Dong, Eddie; +Cc: qemu-devel, KVM list
On 06/09/2010 06:59 PM, Dong, Eddie wrote:
>
> Besides VF IO interrupt and timer interrupt introduced performance overhead risk,
VF usually uses MSI
> EOI message delivery from lapic to ioapic,
Only for non-MSI
> which becomes in user land now, may have potential scalability issue. For example, if we have a 64 VCPU guest, if each vcpu has 1khz interrupt (or ipi), the EOI from guest will normally have to involve ioapic module for clearance in 64khz which may have long lock contention.
No, EOI for IPI or for local APIC timer does not involve the IOAPIC.
> you may reduce the involvement of ioapic eoi by tracking ioapic pin<-> vector map in kernel, but not sure if it is clean enough.
It's sufficient to look at TMR, no? For edge triggered I don't think we
need the EOI.
But, the amount of churn and risk worries me, so I don't think the move
is worthwhile.
--
error compiling committee.c: too many arguments to function
* [Qemu-devel] RE: [RFC] Moving the kvm ioapic, pic, and pit back to userspace
From: Dong, Eddie @ 2010-06-10 2:37 UTC
To: Avi Kivity; +Cc: Dong, Eddie, qemu-devel, KVM list
Avi Kivity wrote:
> On 06/09/2010 06:59 PM, Dong, Eddie wrote:
>>
>> Besides VF IO interrupt and timer interrupt introduced performance
>> overhead risk,
>
> VF usually uses MSI
Typo, I mean PV IO.
A VF interrupt usually happens at 4-8 kHz. How about virtio?
I assume virtio will be widely used together with legacy guests in INTx mode.
>
>> EOI message delivery from lapic to ioapic,
>
> Only for non-MSI
>
>> which becomes in user land now, may have potential scalability
>> issue. For example, if we have a 64 VCPU guest, if each vcpu has
>> 1khz interrupt (or ipi), the EOI from guest will normally have to
>> involve ioapic module for clearance in 64khz which may have long
>> lock contention.
>
> No, EOI for IPI or for local APIC timer does not involve the IOAPIC.
>
>> you may reduce the involvement of ioapic eoi by tracking ioapic
>> pin<-> vector map in kernel, but not sure if it is clean enough.
>
> It's sufficient to look at TMR, no? For edge triggered I don't think
> we need the EOI.
Mmm, I noticed a difference in the statements between the new SDM and the old SDM.
In the old SDM, an IPI can have both edge and level trigger modes, but the new SDM says only INIT can have both choices.
Given that the new SDM eliminates the level trigger mode, it is OK.
>
> But, the amount of churn and risk worries me, so I don't think the
> move is worthwhile.
This also reminds me of the debate at an early stage of KVM, when Gregory Haskins was working on allowing an arbitrary choice of irqchip.
The patch, even at that time (without SMP), was very complicated.
I agree that from an ROI point of view, this move may not be worthwhile.
thx, eddie
* [Qemu-devel] Re: [RFC] Moving the kvm ioapic, pic, and pit back to userspace
From: Avi Kivity @ 2010-06-10 2:59 UTC
To: Dong, Eddie; +Cc: qemu-devel, KVM list
On 06/10/2010 05:37 AM, Dong, Eddie wrote:
> Avi Kivity wrote:
>
>> On 06/09/2010 06:59 PM, Dong, Eddie wrote:
>>
>>> Besides VF IO interrupt and timer interrupt introduced performance
>>> overhead risk,
>>>
>> VF usually uses MSI
>>
> Typo, I mean PV IO.
>
That also uses MSI these days.
> A VF interrupt usually happens in 4-8KHZ. How about the virtio?
> I assume virtio will be widely used together w/ legacy guest with INTx mode.
>
True, but in time it will be replaced by MSI.
Note that without vhost, virtio is also in userspace, so there are lots of
exits anyway for the status register.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
* [Qemu-devel] RE: [RFC] Moving the kvm ioapic, pic, and pit back to userspace
From: Dong, Eddie @ 2010-06-10 14:42 UTC
To: Avi Kivity; +Cc: Dong, Eddie, qemu-devel, KVM list
>> A VF interrupt usually happens in 4-8KHZ. How about the virtio?
>> I assume virtio will be widely used together w/ legacy guest with
>> INTx mode.
>>
>
> True, but in time it will be replaced by MSI.
>
> Note without vhost virtio is also in userspace, so there are lots of
> exits anyway for the status register.
A few months ago, we noticed that the interrupt frequency of PV I/O in our previous solution was almost the same as that of the physical NIC interrupt, which ticks at ~4 kHz. Each PV I/O frontend driver (or its interrupt source) has a similar interrupt frequency, which means Nx more interrupts. I guess virtio is in a similar situation.
We then did an optimization for PV I/O to mitigate the interrupts to the guest by setting an interrupt throttle on the backend side, because native NICs also work that way -- the so-called ITR register in Intel NICs. We saw a 30-90% CPU utilization saving, depending on how many frontend driver interrupts were involved. Not sure if it has been adopted on the vhost side.
One drawback of course is the latency, but it is mostly tolerable if the rate is reduced to ~1 kHz.
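As a rough sketch of that kind of backend-side mitigation (a hypothetical
helper, not vhost or any existing backend code): instead of injecting an
interrupt per completed request, the backend injects at most once per
throttle interval and defers the rest to a one-shot timer:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical interrupt-throttling helper for a PV I/O backend. */
struct irq_throttle {
	uint64_t interval_ns;    /* e.g. 1000000 ns for ~1 kHz */
	uint64_t last_inject_ns;
	bool     pending;        /* work completed since the last injection */
};

/* Called when the backend completes work for the guest. Returns true
 * if the caller should inject the interrupt now; otherwise the caller
 * arms a one-shot timer for throttle_next_deadline().
 */
static bool throttle_notify(struct irq_throttle *t, uint64_t now_ns)
{
	if (now_ns - t->last_inject_ns >= t->interval_ns) {
		t->last_inject_ns = now_ns;
		t->pending = false;
		return true;
	}
	t->pending = true;
	return false;
}

static uint64_t throttle_next_deadline(const struct irq_throttle *t)
{
	return t->last_inject_ns + t->interval_ns;
}

/* Timer callback: inject the deferred interrupt, if any. */
static bool throttle_timer_expired(struct irq_throttle *t, uint64_t now_ns)
{
	if (!t->pending)
		return false;
	t->last_inject_ns = now_ns;
	t->pending = false;
	return true;
}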
Thx, Eddie