public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
* Re: In kernel PIC support: kernel patch
@ 2007-06-25 22:23 Gregory Haskins
       [not found] ` <468008090200005A00026619-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Gregory Haskins @ 2007-06-25 22:23 UTC (permalink / raw)
  To: avi-atKUWr5tajBWk0Htik3J/w; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Mon, 2007-06-25 at 18:08 -0400, Avi Kivity wrote:
> There's a slight issue that Windows with the ACPI HAL 
> is dog slow when virtulalized, but maybe we can find another HAL that 
> supports ACPI but doesn't use the APIC.

Is this in reference to the frequent TPR updates, or is there something
else that is also slow with ACPI enabled?

> 
> > Note that there is another standard that would allow us to use built-in
> > routing without ACPI/IOAPIC: MSI.  I know the battle over
> > virtual-bus-rendering is still raging w.r.t. to PCI or not-PCI.  But I
> > will point out that if we do use PCI, setting the PV devices up as MSI
> > capable in the config space potentially eliminates the need to also wire
> > them to an IOAPIC.  They will get their vector data from the MSI setup
> > and can then send APIC messages directly.  (I think there is some
> > confusion about how we can do MSI later in this thread...I will reply to
> > those mails separately)
> >   
> 
> MSI is also a good solution; but do all semi-modern guest OSes support it?

I can't say for sure, but I would guess "yes" because IIUC PCI-Express
only uses MSI (or at least heavily prefers it).


> No, a single ioapic in the kernel, with one interrupt line that is 
> shared between a kernel-provided device and a userspace-provided 
> device.  The kernel would need to provide the OR function that sharing 
> implies.

Got it.  I was confused earlier.

> 
> 
> > If we do end up needing the IOAPIC in the
> > kernel and userspace, I would suggest using two IOAPICs (we just need to
> > update the ACPI tables to say there are two) so that they can operate
> > independently.  This should be supportable from a virtual system
> > infrastructure standpoint, and is much cleaner IMHO.  If you were
> > talking about a different scenario. please clarify
> 
> I was talking about a different scenario, but your solution is cleaner 
> (though it involves more hand-dirtying with the ACPI tables).  Basically 
> each PCI IRQ would have its own dedicated line.

Right. Or more precisely, *anything* you want can have its own line ;)
IIUC a standard IOAPIC has 24 input lines, so you just need one for
every 24 devices allocated.




-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread
[parent not found: <468E0FD2.2090305@qumranet.com>]
* Re: In kernel PIC support: kernel patch
@ 2007-06-25 22:36 Gregory Haskins
  0 siblings, 0 replies; 57+ messages in thread
From: Gregory Haskins @ 2007-06-25 22:36 UTC (permalink / raw)
  To: avi-atKUWr5tajBWk0Htik3J/w; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Mon, 2007-06-25 at 18:25 -0400, Avi Kivity wrote:
> Gregory Haskins wrote:
> > On Mon, 2007-06-25 at 18:08 -0400, Avi Kivity wrote:
> >   
> >> There's a slight issue that Windows with the ACPI HAL 
> >> is dog slow when virtulalized, but maybe we can find another HAL that 
> >> supports ACPI but doesn't use the APIC.
> >>     
> >
> > Is this in reference to the frequent TPR updates, or is there something
> > else that is also slow with ACPI enabled?
> >
> >   
> 
> Just the good old tpr.

Ah, yes.  I actually find the later versions of KVM to perform quite
well even with XP+ACPI.  I think its all the work that went into the VMX
light-exits, mostly.  Things seem to run even better with the in-kernel
APIC patch enabled, but admittedly I haven't run any benchmarks on it to
tell for sure.  This is all "seat of the pants" ;).  I am not exactly
sure why the in-kernel APIC would/could make a difference since TPRs are
light-exits in either case IIUC...

I am very interested to see what kind of gains we can get when TPR
shadowing is implemented properly (my attempt wasn't even fully
functional if you recall).  I seem to recall reading somewhere that
TPR-shadow features can work even with 32 bit OSes (I guess MOV-TO-CR8
works in 32-bit mode?) but I don't know if XP would use it or the MMIO
path.  Im guessing MMIO, but perhaps MS will patch it since they are
interested in VMM technology now.  Does AMD support a similar notion of
TPR-shadow? 

-Greg


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: In kernel PIC support: kernel patch
@ 2007-06-25 14:10 Gregory Haskins
       [not found] ` <467F94850200005A00026595-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Gregory Haskins @ 2007-06-25 14:10 UTC (permalink / raw)
  To: eddie.dong-ral2JQCrhuEAvxtiuMwx3w
  Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Mon, 2007-06-25 at 21:54 +0800, Dong, Eddie wrote:
>  
> >-----Original Message-----
> >From: Gregory Haskins [mailto:ghaskins@novell.com] 
> >Sent: 2007年6月25日 21:43
> >To: avi@qumranet.com
> >Cc: Dong, Eddie; kvm-devel@lists.sourceforge.net
> >Subject: Re: [kvm-devel] In kernel PIC support: kernel patch
> >
> >On Sat, 2007-06-23 at 20:41 +0300, Avi Kivity wrote:
> >> Dong, Eddie wrote:
> >> > 3: IOAPIC position
> >> > 	Though it looks like neutral to have this one in user or kernel
> >> > space,
> >> > but I'd like to suggest we only support one model. 
> >Considering future
> >> > VT-d
> >> > case, hypervisor need to inject an IRQ directly in KVM (still thru
> >> > IOAPIC)
> >> > without going to user level, so probably moving IOAPIC to 
> >kernel is good
> >> >
> >> > in this point.
> >> >   
> >> 
> >> Even paravirt device drivers will want to inject IRQs via the ioapic 
> >> (when the guest is not paravirt_ops enabled, like older Linux and 
> >> Windows). 
> >
> >Note that its probably not a requirement to do so.  The IOAPIC
> >essentially provides a standard mechanism to map input "irq pins" to
> >APIC messages.  A pv driver could conceivably call kvm_apicbus_send()
> >directly if I knew its vector instead of calling ioapic_setpin() and it
> >would achieve the same goal.
> 
> Basically we are talking about same thing. calling kvm_apicbus_send or
> raise a virtual ioapic pin is basically same. The minor different is who does
> the translation from pin to APIC message, device model or IOAPIC model.
> 
> Using IOAPIC pin probably makes PV driver writter a little bit easier and 
> may be reused by other VMM easily in future.

I fully agree with everything you say here.  I just wanted to point out
that pv-devices dont have a hard-requirement for an IOAPIC connection in
case it makes something easier for someone. ;)

> 
> >
> >What is nice about using an IOAPIC is that the irq routing can be done
> >for you automatically by any ACPI compliant OS.  The device just has to
> >say "raise my output pin" and the IOAPIC translates that into a vector.
> >If you forgo the IOAPIC in favor of direct APIC messages, you have to
> >solve the problem of irq-routing a different way.  
> >
> >Note that there is another standard that would allow us to use built-in
> >routing without ACPI/IOAPIC: MSI.  I know the battle over
> >virtual-bus-rendering is still raging w.r.t. to PCI or not-PCI.  But I
> >will point out that if we do use PCI, setting the PV devices up as MSI
> >capable in the config space potentially eliminates the need to 
> >also wire
> >them to an IOAPIC.  They will get their vector data from the MSI setup
> >and can then send APIC messages directly.  (I think there is some
> >confusion about how we can do MSI later in this thread...I 
> >will reply to
> >those mails separately)
> 
> Yes, all you mentioned here is correct, but it doesn;t impact the
> necessity of moving IOAPIC to kernel, see below.
> 
> >
> >Thoughts?
> >
> >
> >
> >> It's probably okay to implement the device side of a block 
> >> device in qemu, but more difficult for networking. If we have device 
> >> implementations in the kernel then we'll need an ioapic in 
> >the kernel.
> >
> >> 
> >> Also, if we end up sharing interrupts between the kernel and 
> >userspace, 
> >> we'll need the kernel to perform the OR between the level 
> >specified by 
> >> the kernel devices and the level specified by userspace.
> >
> >I not sure I fully understand what you are getting at here.  It sounds
> >like you are talking about splitting a single IOAPIC into both a user
> >and kernel space device?  If we do end up needing the IOAPIC in the
> 
> No, the IOAPIC is in kernel, but some device such as IDE are in user while others
> are in kernel such as pv NIC.
> 
> Due to this, a kernel pv driver will ask for IOAPIC service, so we have to keep 
> it in kernel, otherwise we have to go back to user when kernel device fires, which
> is very slow.
> 
> But, like you mentio> it helps, but then 1: we have to implement MSI now, 2: a little bit more complicated
> in writting pv driver.


Ok, I think I understand what Avi was saying now.  The "or" operation is
in reference to a shared irq-line coming into the IOAPIC. e.g. We need
to maintain a logical OR against the current level of a particular line.
I thought he was talking about maintaining internal IOAPIC state between
user/kern, which is probably messier than two IOAPICs. ;)

> 
> >kernel and userspace, I would suggest using two IOAPICs (we 
> >just need to
> >update the ACPI tables to say there are two) so that they can operate
> >independently.  This should be supportable from a virtual system
> 
> It can, but a little bit more complicated, and probably unable to live
> migration between KVM and Qemu since Qemu only has one IOAPIC.

Good point, but is that a design requirement?


> 
> >infrastructure standpoint, and is much cleaner IMHO.  If you were
> >talking about a different scenario. please clarify.
> >
> 
> Thx,eddie


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: In kernel PIC support: kernel patch
@ 2007-06-25 13:53 Gregory Haskins
  0 siblings, 0 replies; 57+ messages in thread
From: Gregory Haskins @ 2007-06-25 13:53 UTC (permalink / raw)
  To: eddie.dong-ral2JQCrhuEAvxtiuMwx3w
  Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Mon, 2007-06-25 at 11:32 +0800, Dong, Eddie wrote:
> Avi Kivity wrote:
> > Dong, Eddie wrote:
> > We'll add it as soon as it becomes necessary. But isn't it
> > just a memory
> > write?

On real HW, yes.  The Intel guys can correct me if I am wrong, but I
think we can actually simplify this in the virtual-world.  My impression
from reading the specs is that the address that you write to effectively
lets you generate an APIC bus message from a PCI bus-master.  The data
of the message looks very much like an APIC message, which in turn looks
very similar to what you program an IOAPIC pin to map to.

Therefore, if you want to generate MSIs I think you can just have the
pv-device call kvm_apicbus_send() as opposed to emulating the memory
write.  I'm not sure if this helps or not (e.g. perhaps we *want* to
emulate the bus-master write for supporting some pv-device that wouldn't
have access to the KVM_APIC_MSG api for some reason?)





-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: In kernel PIC support: kernel patch
@ 2007-06-25 13:45 Gregory Haskins
  0 siblings, 0 replies; 57+ messages in thread
From: Gregory Haskins @ 2007-06-25 13:45 UTC (permalink / raw)
  To: avi-atKUWr5tajBWk0Htik3J/w; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Sun, 2007-06-24 at 23:28 -0400, Avi Kivity wrote:
> Dong, Eddie wrote:
> >> Need to add a way to detect if this functionality is available, via
> >> KVM_CHECK_EXTENSION. 
> >>     
> >
> > Sure. Question:
> > 	Should we check the extension API basis or feature sets list?
> > I.e. Each API (KVM_CREATE_PIC, KVM_SET_ISA_IRQ_LEVEL) or feature sets:
> > KVM_KERNEL_IRQCHIP.
> >
> >   
> 
> I guess pre-feature.  We want to have fine granularity so we can change 
> small things, but per-API is too fine.
> 
> So in this case I'd suggest KVM_CAP_PIC, KVM_CAP_APIC, etc. (note the 
> KVM_CAP_* namespace).

Yeah, I like this suggestion better than how I currently do it.




-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: In kernel PIC support: kernel patch
@ 2007-06-25 13:44 Gregory Haskins
  0 siblings, 0 replies; 57+ messages in thread
From: Gregory Haskins @ 2007-06-25 13:44 UTC (permalink / raw)
  To: eddie.dong-ral2JQCrhuEAvxtiuMwx3w
  Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Mon, 2007-06-25 at 09:45 +0800, Dong, Eddie wrote:

> Sure. Question:
> 	Should we check the extension API basis or feature sets list?
> I.e. Each API (KVM_CREATE_PIC, KVM_SET_ISA_IRQ_LEVEL) or feature sets:
> KVM_KERNEL_IRQCHIP.

Actually, I have that same question.  I currently implemented the v9
code on an API basis, but I now think it should be the other way.





-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: In kernel PIC support: kernel patch
@ 2007-06-25 13:43 Gregory Haskins
       [not found] ` <467F8E260200005A00026573-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Gregory Haskins @ 2007-06-25 13:43 UTC (permalink / raw)
  To: avi-atKUWr5tajBWk0Htik3J/w; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Sat, 2007-06-23 at 20:41 +0300, Avi Kivity wrote:
> Dong, Eddie wrote:
> > 3: IOAPIC position
> > 	Though it looks like neutral to have this one in user or kernel
> > space,
> > but I'd like to suggest we only support one model. Considering future
> > VT-d
> > case, hypervisor need to inject an IRQ directly in KVM (still thru
> > IOAPIC)
> > without going to user level, so probably moving IOAPIC to kernel is good
> >
> > in this point.
> >   
> 
> Even paravirt device drivers will want to inject IRQs via the ioapic 
> (when the guest is not paravirt_ops enabled, like older Linux and 
> Windows). 

Note that its probably not a requirement to do so.  The IOAPIC
essentially provides a standard mechanism to map input "irq pins" to
APIC messages.  A pv driver could conceivably call kvm_apicbus_send()
directly if I knew its vector instead of calling ioapic_setpin() and it
would achieve the same goal.

What is nice about using an IOAPIC is that the irq routing can be done
for you automatically by any ACPI compliant OS.  The device just has to
say "raise my output pin" and the IOAPIC translates that into a vector.
If you forgo the IOAPIC in favor of direct APIC messages, you have to
solve the problem of irq-routing a different way.  

Note that there is another standard that would allow us to use built-in
routing without ACPI/IOAPIC: MSI.  I know the battle over
virtual-bus-rendering is still raging w.r.t. to PCI or not-PCI.  But I
will point out that if we do use PCI, setting the PV devices up as MSI
capable in the config space potentially eliminates the need to also wire
them to an IOAPIC.  They will get their vector data from the MSI setup
and can then send APIC messages directly.  (I think there is some
confusion about how we can do MSI later in this thread...I will reply to
those mails separately)

Thoughts?



> It's probably okay to implement the device side of a block 
> device in qemu, but more difficult for networking. If we have device 
> implementations in the kernel then we'll need an ioapic in the kernel.

> 
> Also, if we end up sharing interrupts between the kernel and userspace, 
> we'll need the kernel to perform the OR between the level specified by 
> the kernel devices and the level specified by userspace.

I not sure I fully understand what you are getting at here.  It sounds
like you are talking about splitting a single IOAPIC into both a user
and kernel space device?  If we do end up needing the IOAPIC in the
kernel and userspace, I would suggest using two IOAPICs (we just need to
update the ACPI tables to say there are two) so that they can operate
independently.  This should be supportable from a virtual system
infrastructure standpoint, and is much cleaner IMHO.  If you were
talking about a different scenario. please clarify.

> 


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: In kernel PIC support: kernel patch
@ 2007-06-25 13:26 Gregory Haskins
       [not found] ` <467F8A2D0200005A00026567-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Gregory Haskins @ 2007-06-25 13:26 UTC (permalink / raw)
  To: avi-atKUWr5tajBWk0Htik3J/w; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Sat, 2007-06-23 at 10:53 +0300, Avi Kivity wrote:
> Gregory Haskins wrote:
> >
> > In fact, my own implementation puts the ISA-userint model as hanging off
> > the kvm structure as well (which would be replaced with a similar PIC
> > model for layer-2 mode in the future).
> >
> >   
> 
> The PIC is certainly a system resource, but at least in qemu it's 
> connected to cpu 0 only. We want to preserve that.
> 

I think there might be some confusion here ;) The *outputs* of the PIC
in both QEMU and in-kernel-APIC are certainly wired to cpu0 (per the
model that qemu/bochs support).  But the *inputs* are system-wide.  In
qemu, you can call the global function "pic_set_irq" (which you could
construe as per-vm, since there is a 1:1 with qemu-process/vm).
Likewise, the inputs in both in-kernel implementation is also per vm by
hanging off of the kvm struct, as opposed to the vcpu struct.  I hope
this helps to clarify.

-Greg


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: In kernel PIC support: kernel patch
@ 2007-06-22 12:14 Gregory Haskins
       [not found] ` <467B84EF0200005A00026401-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Gregory Haskins @ 2007-06-22 12:14 UTC (permalink / raw)
  To: avi-atKUWr5tajBWk0Htik3J/w; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Fri, 2007-06-22 at 13:50 +0300, Avi Kivity wrote:
> Dong, Eddie wrote:
> 
> > +#include "kvm.h"
> > +#include "virq.h"
> > +
> > +/*
> > + * check if there is pending interrupt without
> > + * intack.
> > + */
> > +int cpu_has_interrupt(struct kvm_vcpu *v)
> > +{
> > +	struct pic_state2 *s = kernel_pic(v->kvm);
> > +
> >   
> 
> Isn't the PIC connected to just one cpu?

I agree with Eddie here.  Its true that only one CPU would ever
associate with PIC (and its usually hardwired to the BSP, but it
technically doesn't have to be depending on the architecture).  But the
PIC(s) are really a system wide resource that should hang off the kvm
struct, IMHO.

In fact, my own implementation puts the ISA-userint model as hanging off
the kvm structure as well (which would be replaced with a similar PIC
model for layer-2 mode in the future).


-Greg


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: In kernel PIC support: kernel patch
@ 2007-06-22 12:06 Gregory Haskins
       [not found] ` <467B830B0200005A000263FB-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Gregory Haskins @ 2007-06-22 12:06 UTC (permalink / raw)
  To: avi-atKUWr5tajBWk0Htik3J/w; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Fri, 2007-06-22 at 12:57 +0300, Avi Kivity wrote:
> Gregory Haskins wrote:
> >
> > Unless I misunderstood Avi, I think he was suggesting that use use
> > KVM_EXTINT/ISA_INTERRUPT as a way to distinguish between the dual modes
> > of ISA_INTERRUPT as I have today (e.g.
> > ISA_INTERRUPT(level-1)/ISA_INTERRUPT(level-2)).  This doesn't really
> > have anything to do with LINT0/1 (directly, anyway).  Hope this helps to
> > clarify. 
> >   
> 
> Well, we don't need to think in terms of modes.  Each ioctl controls
> interrupts at a different point in the architecture.
> 
> KVM_INTERRUPT is at the interface between the processor core and the
> lapic/system bus.
> 
> KVM_APIC_MESSAGE is at the apic bus.
> 
> KVM_EXTINT is at the interface between the lapic and the system bus.  If
> the guest disabled the local apic, it's equivalent to the KVM_INTERRUPT
> (injecting a vector directly); otherwise it is handled by the lapic
> (LINT0/1 handling).

Ok, now I understand what Eddie was saying about LINT0/1 and we were
both describing the same thing in different ways.  I think we are all on
the same page.

> 
> KVM_ISA_INTERRUPT (maybe KVM_IRQ_LINE, to signify it doesn't carry a
> vector) is at the boundary between the interrupt controller chips and
> the rest of the system.
> 

Ack on s/ISA/IRQ or something similar


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: In kernel PIC support: kernel patch
@ 2007-06-22 12:02 Gregory Haskins
  0 siblings, 0 replies; 57+ messages in thread
From: Gregory Haskins @ 2007-06-22 12:02 UTC (permalink / raw)
  To: avi-atKUWr5tajBWk0Htik3J/w; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Fri, 2007-06-22 at 12:46 +0300, Avi Kivity wrote:
> Gregory Haskins wrote:
> > I actually kind of like the distinction you are making between the
> > INTERRUPT, EXTINT, and ISA/IOAPIC APIs and keeping them consistent
> > across modes.  As you suggested below, the ISA/IOAPIC API probably
> > should really stay as one (I vote for retaining the ISA name) since the
> > wiring is an invariant of the platform presented by qemu/bochs.
> >
> >   
> 
> Well, kvm shouldn't really know how the qemu hardware is wired up.  It
> can change; and non-qemu userspaces should always be possible.

I agree, but we have a ways to go to get there to fully support this
notion. :(  In addition to an API that supports it, any in-kernel
components will also likely need to know how to be wired up dynamically.
For instance, today the lapic branch assumes the PIC is wired to LINT0
(as decreed by QEMU), but other architectures might have it wired to an
IOAPIC pin (for routing to other CPUs besides the BSP), for instance).

I guess I was thinking the APIs would evolve as that new capability
emerged.  But you are right...if we can get this part of the API correct
now, lets do it.

> 
> When we change an irq line, there are three things that matter:
> 
> - which chip the line is connected to (PIC 1, PIC2, IOAPIC)
> - which line we're talking about
> - what level the line is to have
> 
> If we allow lists of such triplets to be sent via a single ioctl, then
> we can optimize the case where lines are wired together without teaching
> kvm about the board layout.

Sounds like a good approach.


> 
> (the ISA name might annoy our non-x86 friends.  I see nothing x86
> specific about the interface)
> 

Good point.  Lets go with the other suggestions you and Eddie are
making.

-Greg


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: In kernel PIC support: kernel patch
@ 2007-06-22  3:49 Gregory Haskins
       [not found] ` <467B0E7D0200005A000263E4-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Gregory Haskins @ 2007-06-22  3:49 UTC (permalink / raw)
  To: eddie.dong-ral2JQCrhuEAvxtiuMwx3w
  Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Fri, 2007-06-22 at 10:09 +0800, Dong, Eddie wrote:
> Avi Kivity wrote:
> > 
> > Eddie, is that what you were driving at in your simplification
> > attempt? 
> > 
> Yes, some minor thing:
> KVM_ISA_INTERRUPT: Per VM I/F, and need an irq line and irq level.
> KVM_IOAPIC_INTERRUPT: Per VM I/F, may also need irq level for level
> triggered irq.
> KVM_APIC_MESSAGE: Per VCPU I/F

Note that in my implementation the APIC_MESSAGE is also VM I/F.   In
fact, IMHO the only API that should be VCPU I/F is the original API,
KVM_INTERRUPT.  The reason is, the PIC related interrupts are targeted
at the vm-global ISA resources, and the APIC_MESSAGE is targeted at the
vm-global apic *bus*, not a particular lapic (yet).  Its the bus that
then determines the actual routing to its final destination.  I think
this design distinction is important.


> 
> Per step by step approach way, KVM_EXTINT (assume for LINT0/LINT1) can
> be deferred to future implementation.

Unless I misunderstood Avi, I think he was suggesting that use use
KVM_EXTINT/ISA_INTERRUPT as a way to distinguish between the dual modes
of ISA_INTERRUPT as I have today (e.g.
ISA_INTERRUPT(level-1)/ISA_INTERRUPT(level-2)).  This doesn't really
have anything to do with LINT0/1 (directly, anyway).  Hope this helps to
clarify. 

-Greg



-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: In kernel PIC support: kernel patch
@ 2007-06-21 16:44 Gregory Haskins
       [not found] ` <467A72970200005A00026354-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Gregory Haskins @ 2007-06-21 16:44 UTC (permalink / raw)
  To: avi-atKUWr5tajBWk0Htik3J/w; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Thu, 2007-06-21 at 19:31 +0300, Avi Kivity wrote:

> Wait. Looks like APIs are changing meaning according to some mode. 
> Confusion.
> 
> How about:
> 
> KVM_INTERRUPT == inject this vector now
> KVM_EXTINT == drives the processor's interrupt pin (this is handled by 
> the lapic code). input is a vector and type
> KVM_ISA_INTERRUPT == drives the irq inputs to the PIC. Input is line number.
> KVM_IOAPIC_INTERRUPT == drives the irq inputs to the IOAPIC. Input is 
> line number.
> KVM_APIC_MESSAGE == do something to the apic
> 
> That way, there is no mode (at least in the code), and the APIs keep 
> their meaning.

I actually kind of like the distinction you are making between the
INTERRUPT, EXTINT, and ISA/IOAPIC APIs and keeping them consistent
across modes.  As you suggested below, the ISA/IOAPIC API probably
should really stay as one (I vote for retaining the ISA name) since the
wiring is an invariant of the platform presented by qemu/bochs.

The only unfortunate thing about this change is part of my irqdevice
abstraction goes to waste (e.g. set_pin).  But oh well.  I think the
consistent API is more important.  I will roll this into the fixes I am
making.

-Greg


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: In kernel PIC support: kernel patch
@ 2007-06-21 15:43 Gregory Haskins
  0 siblings, 0 replies; 57+ messages in thread
From: Gregory Haskins @ 2007-06-21 15:43 UTC (permalink / raw)
  To: eddie.dong-ral2JQCrhuEAvxtiuMwx3w
  Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Thu, 2007-06-21 at 23:13 +0800, Dong, Eddie wrote:
> We are in same page now :-)

Good :)  I still need to add that "int level" field to KVM_IRQ_INTERRUPT
as we discussed a few weeks ago to be considered complete, though.

-Greg



-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: In kernel PIC support: kernel patch
@ 2007-06-21 15:42 Gregory Haskins
  0 siblings, 0 replies; 57+ messages in thread
From: Gregory Haskins @ 2007-06-21 15:42 UTC (permalink / raw)
  To: eddie.dong-ral2JQCrhuEAvxtiuMwx3w
  Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Thu, 2007-06-21 at 15:17 +0000, Gregory Haskins wrote:
> On Thu, 2007-06-21 at 22:28 +0800, Dong, Eddie wrote:
> >
> > > I suppose, but it somewhat defeats the purpose IMO.  Every pin in the
> > > 8259 that gets tickled implicitly means an IOAPIC pin was tickled
> > > also. Do we really want to go to userspace for that?  Essentially
> > 
> > User space can handle this, go to IOAPIC first and then decide to go to 
> > in kernel PIC or not. Here the assumption is that an irq line is either
> > serviced
> > by PIC or IOAPIC, it will never be serviced by both.
> > 
> > So no back to user space.
> 

Thinking about it some more, I see some conditions where this
optimization doesn't work (and was one of the reasons driving my
redesign of the subsystem).  For instance, when linux routes the PIT as
an NMI and an IOAPIC based event.  In this case, the PIT raises an
interrupt which tickles an isa interrupt line that should be routed to
both the PIC and the IOAPIC #IRQ0.  Both devices should have IRQ0
unmasked.  The former will get routed to the APIC via tickling the LINT0
line.  The latter will get routed as an APIC-MSG based on the
programming of the IOAPICs IRQ0 line.

I believe this is a scenario which is mis-emulated today at least
partially due to the assumption that only one device is in use at a
time.

All is not lost, however.  This just helps drive the decision to move
the IOAPIC with the PIC ;)

-Greg 


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: In kernel PIC support: kernel patch
@ 2007-06-21 15:17 Gregory Haskins
  0 siblings, 0 replies; 57+ messages in thread
From: Gregory Haskins @ 2007-06-21 15:17 UTC (permalink / raw)
  To: eddie.dong-ral2JQCrhuEAvxtiuMwx3w
  Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Thu, 2007-06-21 at 22:28 +0800, Dong, Eddie wrote:
>
> > I suppose, but it somewhat defeats the purpose IMO.  Every pin in the
> > 8259 that gets tickled implicitly means an IOAPIC pin was tickled
> > also. Do we really want to go to userspace for that?  Essentially
> 
> User space can handle this, go to IOAPIC first and then decide to go to 
> in kernel PIC or not. Here the assumption is that an irq line is either
> serviced
> by PIC or IOAPIC, it will never be serviced by both.
> 
> So no back to user space.

Yeah, I suppose if we adopt that optimization from Xen/Qemu that is
true.  The design in my head is trying to emulate the actual signal path
in hardware so it gets dispatched to both models.  I suppose you can get
away with a shortcut like that here (though I don't really love the
idea ;).

I think it comes down to the decision, like you said.  If we go with
IOAPIC in the kernel, just route to both like I am thinking.  If we
leave it in userspace, optimize the context switch away for performance
reasons.

-Greg





-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: In kernel PIC support: kernel patch
@ 2007-06-21 15:10 Gregory Haskins
       [not found] ` <467A5C8C0200005A00026321-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Gregory Haskins @ 2007-06-21 15:10 UTC (permalink / raw)
  To: eddie.dong-ral2JQCrhuEAvxtiuMwx3w
  Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Thu, 2007-06-21 at 22:57 +0800, Dong, Eddie wrote:
> Gregory Haskins wrote:
> 
> > So assuming this newly enlightened position is true, I think this
> > means we have a choice: 
> > 
> > 1) Drop support for mixed "level-1" mode and move the PIC to the
> > kernel now as Eddie is doing 
> > 
> > 2) Keep the level-1/2 distinction, and add support for making sure
> > that once a vector is acked in the PIC, it truely gets put into
> > service immediately. 
> > 
> > I can think of a really simple interface for (2).  All we
> > really need to
> > do is
> > 
> > a) go back to synchronous injection (as previously suggested)
> 
> For level-2 support, that means no PIC/APIC in user space, it is the
> device want to assert/dessert an irq line request. There is no notion
> of injecting IRQ. That is why I added new APIs.
> Even with level-1 in consideration, we still need these new APIs to
> support level-2. A device such as kerboard could frequently assert
> and dessert the irq line. A single ker strobe will see 5-10 dessert 
> request from device model and 1 assert request.
> 
> So new APIs are a must IMO.

Fully agree, but I already added them as well ;)

1) KVM_ISA_INTERRUPT:  In level-1 mode, this API allows the userspace
pic to dispatch a vector to the kernel.  In level-2 mode, this allows
any userspace app to tickle an isa based irq line (which feeds into the
inputs of the PIC and IOAPIC.  In other words, a level-2 based userspace
would substitute KVM_ISA_INTERRUPT for pic_set_irq().

2) KVM_APIC_MESSAGE allows any entity (presumably an IOAPIC in
userspace, though there is no restriction here) to send APIC messages.
This will probably only be used in level-1 mode, but its there for both
modes nonetheless.

-Greg



-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: In kernel PIC support: kernel patch
@ 2007-06-21 14:01 Gregory Haskins
       [not found] ` <467A4C970200005A000262FD-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Gregory Haskins @ 2007-06-21 14:01 UTC (permalink / raw)
  To: eddie.dong-ral2JQCrhuEAvxtiuMwx3w
  Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Thu, 2007-06-21 at 13:31 +0000, Gregory Haskins wrote:
> On Thu, 2007-06-21 at 21:02 +0800, Dong, Eddie wrote:
> > 
> > Achitectually not only that. A premature IRR->ISR will cause guest OS
> > confuse in many place. A guest (say BIOS) may turn from interrupt
> > enabled mode to polling mode which polls IRR to detect if 
> > there is pending IRQ to handle. In this case we have trouble.
> 
> I completely agree.  But what I am saying is that I can eliminate the
> premature IRR->ISR with the change I am proposing.

After thinking about this some more, I think I finally understand your
point.  I was thinking that you were objecting to the multiple unacked
vectors in ISR.  But really this could be a problem for *any* unacked
vectors?  Hmm..if this is the case, you were right and I was wrong.  My
bad :)

So assuming this newly enlightened position is true, I think this means
we have a choice:

1) Drop support for mixed "level-1" mode and move the PIC to the kernel
now as Eddie is doing

2) Keep the level-1/2 distinction, and add support for making sure that
once a vector is acked in the PIC, it truely gets put into service
immediately.

I can think of a really simple interface for (2).  All we really need to
do is 

a) go back to synchronous injection (as previously suggested)
b) promote kvm_cpuirq_extint to be higher than nmis and localints.

This will ensure the vector is immediately placed into service even over
in-kernel sources.  Deferred interrupts would still take a higher
priority, but we would never intack the pic if deferred interrupts were
pending since irq.pending would be true.

Thoughts?
-Greg

> 
> > 
> > 
> > > However, as I mentioned we can fix that issue with just a few new
> > > lines of 
> > > code and by
> > > reverting the userspace injection model to the one used in prior
> > > releases without having to implement an entire in-kernel PIC to do it.
> > 
> > In-kernel PIC gives us a full in kernel irqchip solution and performance
> > gain. Some OSes may use PIC only.
> 
> Agreed.  I will answer this down below with the level-0/1/2 question
> since they are related.
> 
> 
> > 
> > > 
> > > I think moving the PIC into the kernel has the advantage of allowing
> > > us to move the PIT into the kernel too (which is huge, IMHO), but
> > 
> > Not very big, I just want to do one by one. we have done the code
> > in Xen already.
> > 
> > > that is its sole advantage.  Don't get me wrong: I am in favor of
> > > doing it. However, I wanted to point out that this particular
> > > solution to the problem you found essentially is invalidating
> > > "level-1" mode by only supporting level-0 or level-2, not fixing it. 
> > 
> > Not exactly understand level-0/1/2 meaning? Do u mean we mixed 
> > irqchip mode is a feature requirement? I didn't see the necessity, 
> > maybe I am wrong. But isn't both PIC and APIC in kernel much
> > nature and simple for us?
> 
> At one point there was a debate between moving just the LAPIC, or moving
> everything (LAPIC/PIC/IOAPIC/PIT).  Avi suggested that we start with
> just the LAPIC and see where we were at.  Because we need to remain
> forward and backwards ABI compatible, I added an ioctl for setting the
> in-kernel-pic level.  "0" is backwards compatible mode (everything in
> userspace).  "1" is LAPIC only. "2" is presumable everything in the
> kernel, but today only 0 and 1 are the only two defined so there could
> be more than 3 levels someday if we wanted.
> 
> > 
> > If you really think supporting mixed irqchip mode is a must
> 
> I will leave it to Avi to decide since he implicitly suggested it.  But
> suffice to say, if we *dont* want to support it we will need to get the
> other in-kernel stuff into the lapic branch in its entirety before it
> can be merged so we dont create an ABI issue.
> 
> 
> > , we 
> > need to introduce an intack I/F to notify user level irqchip if the irq 
> > is really ACKed or not. We can do that but I doubt the necessity.
> 
> This is where we disagree, as I know you have mentioned this before.
> The solution I am proposing (which reverts the userspace injection
> method) would not require this ACK that you mention.  The reason is that
> the synchronous nature of the injection takes care of things naturally.
> Here is how:
> 
> With sync-injection, we can only move a vector from the IRR->ISR if
> RFLAGS.IF and !irq.pending.  Once we inject, the vector will move to ISR
> and get queued in kernel (thereby setting an irq.pending bit).  It may
> not inject to the guest immediately, but thats ok because no new vectors
> will be acked in the PIC until that previous vector is processed
> (because irq.pending will remain set).  Therefore, we will never
> prematurely move IRR->ISR until the previous vector really is
> "in-service".  Does this make sense?
> 
> My suggestion assumes that a single vector can stay in ISR without
> actually being in service without ill effect (Todays v9 code could
> presumably have multiple vectors in ISR without being in service, which
> I admit is bogus).  If this assumption is false, then I agree with you
> that we need extra intack I/F.  What are you thoughts here?
> 
> 
> > 
> > > Perhaps we are "ok" with that, but I was under the distinct
> > > impression from Avi et. al. that these variable levels of support
> > > were a design goal for debugging purposes. 
> > > 
> > > I would prefer that we just fix level-1 mode with the changes I
> > > mentioned instead of just disabling it (in addition to adding level-2
> > > mode as Eddie is working on). 
> > > 
> > >> 
> > >> This patch fixes this by introducing new VM level IOCTL
> > >> KVM_SET_ISA_IRQ_LEVEL & KVM_CREATE_PIC to avoid the ABI hole.  The
> > >> original KVM_INTERRUPT is still there to be backward compatible.
> > >> 
> > >> WIth this patch, old Qemu, new qemu (after this patch), old kvm, new
> > >> kvm can work in any combination. Both user level code and kernel
> > >> code will automatically decide the irq source base on irqchip
> > >> location (user or kernel). 
> > >> 
> > >> Some known issues (TODO):
> > >> 	1: SVM support is on going.
> > >> 	The code for VMX to inject irq is same with Xen now since the
> > >> situation is same now (irqchip in hyprevisor), SVM code should be
> > >> able to directly reuse from Xen too.
> > >> 	2: live migration break.
> > >> 	3: Temply APIC support is removed in CPUID to wait for the merge
> > >> of in kernel APIC patch
> > > 
> > > Note that you will need more than just the APIC patch.  My patch only
> > > moves the LAPIC down.  The IOAPIC and 8259s are still in userspace.
> > > Your patch only moves the 8259s down.  This means there is a gap where
> > > the IOAPIC used to be that still needs to be addressed.
> > 
> > Why do u think I/O APIC must be moved down too? A new IOCTL like
> > KVM_IOAPIC_MESSAGE can solve the interface issue IMO and quit 
> > natual.
> 
> I suppose, but it somewhat defeats the purpose IMO.  Every pin in the
> 8259 that gets tickled implicitly means an IOAPIC pin was tickled also.
> Do we really want to go to userspace for that?  Essentially what happens
> then is the PIT ends up having to go through userspace for every tick at
> that point.  On the flip side, the IOAPIC model is very simple so I
> think it makes more sense to move it one to one with the PIC.
> 
> Regards,
> -Greg
> 
> 


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: In kernel PIC support: kernel patch
@ 2007-06-21 13:31 Gregory Haskins
       [not found] ` <467A45690200005A000262F0-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Gregory Haskins @ 2007-06-21 13:31 UTC (permalink / raw)
  To: eddie.dong-ral2JQCrhuEAvxtiuMwx3w
  Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Thu, 2007-06-21 at 21:02 +0800, Dong, Eddie wrote:
> Gregory Haskins wrote:
> > On Wed, 2007-06-20 at 15:43 +0800, Dong, Eddie wrote:
> >> As we discussed, if we move APIC to kernel while leaving PIC in user
> >> level, we have ABI level holes if the interrupt a user level qemu
> >> injected is not immediately injected to guest for some reason. \
> > 
> > Hi Eddie,
> > 
> > I know you worked hard on this, but I still do not agree with this
> > statement.  I agree that v9 introduced an issue related to a)
> > premature IRR->ISR assignment, and b) lack of support for clearing
> > pending ISR with IMR changes (and thank you for pointing it out!). 
> 
> Achitectually not only that. A premature IRR->ISR will cause guest OS
> confuse in many place. A guest (say BIOS) may turn from interrupt
> enabled mode to polling mode which polls IRR to detect if 
> there is pending IRQ to handle. In this case we have trouble.

I completely agree.  But what I am saying is that I can eliminate the
premature IRR->ISR with the change I am proposing.

> 
> 
> > However, as I mentioned we can fix that issue with just a few new
> > lines of 
> > code and by
> > reverting the userspace injection model to the one used in prior
> > releases without having to implement an entire in-kernel PIC to do it.
> 
> In-kernel PIC gives us a full in kernel irqchip solution and performance
> gain. Some OSes may use PIC only.

Agreed.  I will answer this down below with the level-0/1/2 question
since they are related.


> 
> > 
> > I think moving the PIC into the kernel has the advantage of allowing
> > us to move the PIT into the kernel too (which is huge, IMHO), but
> 
> Not very big, I just want to do one by one. we have done the code
> in Xen already.
> 
> > that is its sole advantage.  Don't get me wrong: I am in favor of
> > doing it. However, I wanted to point out that this particular
> > solution to the problem you found essentially is invalidating
> > "level-1" mode by only supporting level-0 or level-2, not fixing it. 
> 
> Not exactly understand level-0/1/2 meaning? Do u mean we mixed 
> irqchip mode is a feature requirement? I didn't see the necessity, 
> maybe I am wrong. But isn't both PIC and APIC in kernel much
> nature and simple for us?

At one point there was a debate between moving just the LAPIC, or moving
everything (LAPIC/PIC/IOAPIC/PIT).  Avi suggested that we start with
just the LAPIC and see where we were at.  Because we need to remain
forward and backwards ABI compatible, I added an ioctl for setting the
in-kernel-pic level.  "0" is backwards compatible mode (everything in
userspace).  "1" is LAPIC only. "2" is presumable everything in the
kernel, but today only 0 and 1 are the only two defined so there could
be more than 3 levels someday if we wanted.

> 
> If you really think supporting mixed irqchip mode is a must

I will leave it to Avi to decide since he implicitly suggested it.  But
suffice to say, if we *dont* want to support it we will need to get the
other in-kernel stuff into the lapic branch in its entirety before it
can be merged so we dont create an ABI issue.


> , we 
> need to introduce an intack I/F to notify user level irqchip if the irq 
> is really ACKed or not. We can do that but I doubt the necessity.

This is where we disagree, as I know you have mentioned this before.
The solution I am proposing (which reverts the userspace injection
method) would not require this ACK that you mention.  The reason is that
the synchronous nature of the injection takes care of things naturally.
Here is how:

With sync-injection, we can only move a vector from the IRR->ISR if
RFLAGS.IF and !irq.pending.  Once we inject, the vector will move to ISR
and get queued in kernel (thereby setting an irq.pending bit).  It may
not inject to the guest immediately, but thats ok because no new vectors
will be acked in the PIC until that previous vector is processed
(because irq.pending will remain set).  Therefore, we will never
prematurely move IRR->ISR until the previous vector really is
"in-service".  Does this make sense?

My suggestion assumes that a single vector can stay in ISR without
actually being in service without ill effect (Todays v9 code could
presumably have multiple vectors in ISR without being in service, which
I admit is bogus).  If this assumption is false, then I agree with you
that we need extra intack I/F.  What are you thoughts here?


> 
> > Perhaps we are "ok" with that, but I was under the distinct
> > impression from Avi et. al. that these variable levels of support
> > were a design goal for debugging purposes. 
> > 
> > I would prefer that we just fix level-1 mode with the changes I
> > mentioned instead of just disabling it (in addition to adding level-2
> > mode as Eddie is working on). 
> > 
> >> 
> >> This patch fixes this by introducing new VM level IOCTL
> >> KVM_SET_ISA_IRQ_LEVEL & KVM_CREATE_PIC to avoid the ABI hole.  The
> >> original KVM_INTERRUPT is still there to be backward compatible.
> >> 
> >> WIth this patch, old Qemu, new qemu (after this patch), old kvm, new
> >> kvm can work in any combination. Both user level code and kernel
> >> code will automatically decide the irq source base on irqchip
> >> location (user or kernel). 
> >> 
> >> Some known issues (TODO):
> >> 	1: SVM support is on going.
> >> 	The code for VMX to inject irq is same with Xen now since the
> >> situation is same now (irqchip in hyprevisor), SVM code should be
> >> able to directly reuse from Xen too.
> >> 	2: live migration break.
> >> 	3: Temply APIC support is removed in CPUID to wait for the merge
> >> of in kernel APIC patch
> > 
> > Note that you will need more than just the APIC patch.  My patch only
> > moves the LAPIC down.  The IOAPIC and 8259s are still in userspace.
> > Your patch only moves the 8259s down.  This means there is a gap where
> > the IOAPIC used to be that still needs to be addressed.
> 
> Why do u think I/O APIC must be moved down too? A new IOCTL like
> KVM_IOAPIC_MESSAGE can solve the interface issue IMO and quit 
> natual.

I suppose, but it somewhat defeats the purpose IMO.  Every pin in the
8259 that gets tickled implicitly means an IOAPIC pin was tickled also.
Do we really want to go to userspace for that?  Essentially what happens
then is the PIT ends up having to go through userspace for every tick at
that point.  On the flip side, the IOAPIC model is very simple so I
think it makes more sense to move it one to one with the PIC.

Regards,
-Greg




-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: In kernel PIC support: kernel patch
@ 2007-06-21 12:36 Gregory Haskins
       [not found] ` <467A38720200005A000262C8-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Gregory Haskins @ 2007-06-21 12:36 UTC (permalink / raw)
  To: eddie.dong-ral2JQCrhuEAvxtiuMwx3w
  Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Wed, 2007-06-20 at 15:43 +0800, Dong, Eddie wrote:
> As we discussed, if we move APIC to kernel while leaving PIC in user
> level, we have ABI level holes if the interrupt a user level qemu
> injected is not immediately injected to guest for some reason. \

Hi Eddie,

I know you worked hard on this, but I still do not agree with this
statement.  I agree that v9 introduced an issue related to a) premature
IRR->ISR assignment, and b) lack of support for clearing pending ISR
with IMR changes (and thank you for pointing it out!).  However, as I
mentioned we can fix that issue with just a few new lines of code and by
reverting the userspace injection model to the one used in prior
releases without having to implement an entire in-kernel PIC to do it.

I think moving the PIC into the kernel has the advantage of allowing us
to move the PIT into the kernel too (which is huge, IMHO), but that is
its sole advantage.  Don't get me wrong: I am in favor of doing it.
However, I wanted to point out that this particular solution to the
problem you found essentially is invalidating "level-1" mode by only
supporting level-0 or level-2, not fixing it.  Perhaps we are "ok" with
that, but I was under the distinct impression from Avi et. al. that
these variable levels of support were a design goal for debugging
purposes.

I would prefer that we just fix level-1 mode with the changes I
mentioned instead of just disabling it (in addition to adding level-2
mode as Eddie is working on).

> 
> This patch fixes this by introducing new VM level IOCTL
> KVM_SET_ISA_IRQ_LEVEL & KVM_CREATE_PIC to avoid the ABI hole.  The
> original KVM_INTERRUPT is still there to be backward compatible.
> 
> WIth this patch, old Qemu, new qemu (after this patch), old kvm, new kvm
> can work in any combination. Both user level code and kernel code will
> automatically decide the irq source base on irqchip location (user or
> kernel).
> 
> Some known issues (TODO):
> 	1: SVM support is on going.
> 	The code for VMX to inject irq is same with Xen now since the
> situation is same now (irqchip in hyprevisor), SVM code should be able
> to directly reuse from Xen too.
> 	2: live migration break.
> 	3: Temply APIC support is removed in CPUID to wait for the merge
> of in kernel APIC patch

Note that you will need more than just the APIC patch.  My patch only
moves the LAPIC down.  The IOAPIC and 8259s are still in userspace.
Your patch only moves the 8259s down.  This means there is a gap where
the IOAPIC used to be that still needs to be addressed.


> 
> thx, eddie
> 
>     Signed-off-by: Yaozu (Eddie) Dong <eddie.dong-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> 
>  drivers/kvm/Makefile   |    2
>  drivers/kvm/i8259.c    |  435
> +++++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/kvm/kvm.h      |    7
>  drivers/kvm/kvm_main.c |   26 ++
>  drivers/kvm/virq.c     |   58 ++++++
>  drivers/kvm/virq.h     |   70 +++++++
>  drivers/kvm/vmx.c      |   84 ++++++++-
>  include/linux/kvm.h    |    9 +
>  8 files changed, 678 insertions(+), 13 deletions(-)
> 
> 
> diff --git a/drivers/kvm/Makefile b/drivers/kvm/Makefile
> index c0a789f..118c79b 100644
> --- a/drivers/kvm/Makefile
> +++ b/drivers/kvm/Makefile
> @@ -2,7 +2,7 @@
>  # Makefile for Kernel-based Virtual Machine module
>  #
>  
> -kvm-objs := kvm_main.o mmu.o x86_emulate.o
> +kvm-objs := kvm_main.o mmu.o x86_emulate.o i8259.o virq.o
>  obj-$(CONFIG_KVM) += kvm.o
>  kvm-intel-objs = vmx.o
>  obj-$(CONFIG_KVM_INTEL) += kvm-intel.o
> diff --git a/drivers/kvm/i8259.c b/drivers/kvm/i8259.c
> new file mode 100644
> index 0000000..9ba8c1e
> --- /dev/null
> +++ b/drivers/kvm/i8259.c
> @@ -0,0 +1,435 @@
> +/*
> + * QEMU 8259 interrupt controller emulation
> + *
> + * Copyright (c) 2003-2004 Fabrice Bellard
> + * Copyright (c) 2007 Intel Corperation
> + *
> + * Permission is hereby granted, free of charge, to any person
> obtaining a copy
> + * of this software and associated documentation files (the
> "Software"), to deal
> + * in the Software without restriction, including without limitation
> the rights
> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or
> sell
> + * copies of the Software, and to permit persons to whom the Software
> is
> + * furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be
> included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT
> SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
> OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> ARISING FROM,
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
> DEALINGS IN
> + * THE SOFTWARE.
> + * Authors:
> + *   Yaozu (Eddie) Dong <Eddie.dong-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> + *   Port from Qemu.
> + */
> +#include "virq.h"
> +#include <linux/mm.h>
> +
> +/* set irq level. If an edge is detected, then the IRR is set to 1 */
> +static inline void pic_set_irq1(struct pic_state *s, int irq, int
> level)
> +{
> +	int mask;
> +	mask = 1 << irq;
> +	if (s->elcr & mask) {
> +		/* level triggered */
> +		if (level) {
> +			s->irr |= mask;
> +			s->last_irr |= mask;
> +		} else {
> +			s->irr &= ~mask;
> +			s->last_irr &= ~mask;
> +		}
> +	} else {
> +		/* edge triggered */
> +		if (level) {
> +			if ((s->last_irr & mask) == 0)
> +				s->irr |= mask;
> +			s->last_irr |= mask;
> +		} else {
> +			s->last_irr &= ~mask;
> +		}
> +	}
> +}
> +
> +/* return the highest priority found in mask (highest = smallest
> +   number). Return 8 if no irq */
> +static inline int get_priority(struct pic_state *s, int mask)
> +{
> +	int priority;
> +	if (mask == 0)
> +		return 8;
> +	priority = 0;
> +	while ((mask & (1 << ((priority + s->priority_add) & 7))) == 0)
> +		priority++;
> +	return priority;
> +}
> +
> +/* return the pic wanted interrupt. return -1 if none */
> +static int pic_get_irq(struct pic_state *s)
> +{
> +	int mask, cur_priority, priority;
> +
> +	mask = s->irr & ~s->imr;
> +	priority = get_priority(s, mask);
> +	if (priority == 8)
> +		return -1;
> +	/* compute current priority. If special fully nested mode on the
> +	   master, the IRQ coming from the slave is not taken into
> account
> +	   for the priority computation. */
> +	mask = s->isr;
> +	if (s->special_fully_nested_mode && s ==
> &s->pics_state->pics[0])
> +		mask &= ~(1 << 2);
> +	cur_priority = get_priority(s, mask);
> +	if (priority < cur_priority) {
> +		/* higher priority found: an irq should be generated */
> +		return (priority + s->priority_add) & 7;
> +	} else {
> +		return -1;
> +	}
> +}
> +
> +/* raise irq to CPU if necessary. must be called every time the active
> +   irq may change */
> +static void pic_update_irq(struct pic_state2 *s)
> +{
> +	int irq2, irq;
> +
> +	/* first look at slave pic */
> +	irq2 = pic_get_irq(&s->pics[1]);
> +	if (irq2 >= 0) {
> +		/* if irq request by slave pic, signal master PIC */
> +		pic_set_irq1(&s->pics[0], 2, 1);
> +		pic_set_irq1(&s->pics[0], 2, 0);
> +	}
> +	/* look at requested irq */
> +	irq = pic_get_irq(&s->pics[0]);
> +	if (irq >= 0) {
> +		s->irq_request(s->irq_request_opaque, 1);
> +	}
> +}
> +
> +void pic_set_irq_new(void *opaque, int irq, int level)
> +{
> +	struct pic_state2 *s = opaque;
> +
> +	pic_set_irq1(&s->pics[irq >> 3], irq & 7, level);
> +	/* used for IOAPIC irqs */
> +	if (s->alt_irq_func)
> +		s->alt_irq_func(s->alt_irq_opaque, irq, level);
> +	pic_update_irq(s);
> +}
> +
> +/* acknowledge interrupt 'irq' */
> +static inline void pic_intack(struct pic_state *s, int irq)
> +{
> +	if (s->auto_eoi) {
> +		if (s->rotate_on_auto_eoi)
> +			s->priority_add = (irq + 1) & 7;
> +	} else {
> +		s->isr |= (1 << irq);
> +	}
> +	/* We don't clear a level sensitive interrupt here */
> +	if (!(s->elcr & (1 << irq)))
> +		s->irr &= ~(1 << irq);
> +}
> +
> +int pic_read_irq(struct pic_state2 *s)
> +{
> +	int irq, irq2, intno;
> +
> +	irq = pic_get_irq(&s->pics[0]);
> +	if (irq >= 0) {
> +		pic_intack(&s->pics[0], irq);
> +		if (irq == 2) {
> +			irq2 = pic_get_irq(&s->pics[1]);
> +			if (irq2 >= 0) {
> +				pic_intack(&s->pics[1], irq2);
> +			} else {
> +				/* spurious IRQ on slave controller */
> +				irq2 = 7;
> +			}
> +			intno = s->pics[1].irq_base + irq2;
> +			irq = irq2 + 8;
> +		} else {
> +			intno = s->pics[0].irq_base + irq;
> +		}
> +	} else {
> +		/* spurious IRQ on host controller */
> +		irq = 7;
> +		intno = s->pics[0].irq_base + irq;
> +	}
> +	pic_update_irq(s);
> +
> +	return intno;
> +}
> +
> +static void pic_reset(void *opaque)
> +{
> +	struct pic_state *s = opaque;
> +
> +	s->last_irr = 0;
> +	s->irr = 0;
> +	s->imr = 0;
> +	s->isr = 0;
> +	s->priority_add = 0;
> +	s->irq_base = 0;
> +	s->read_reg_select = 0;
> +	s->poll = 0;
> +	s->special_mask = 0;
> +	s->init_state = 0;
> +	s->auto_eoi = 0;
> +	s->rotate_on_auto_eoi = 0;
> +	s->special_fully_nested_mode = 0;
> +	s->init4 = 0;
> +	/* Note: ELCR is not reset */
> +}
> +
> +static void pic_ioport_write(void *opaque, uint32_t addr, uint32_t val)
> +{
> +	struct pic_state *s = opaque;
> +	int priority, cmd, irq;
> +
> +	addr &= 1;
> +	if (addr == 0) {
> +		if (val & 0x10) {
> +			/* init */
> +			pic_reset(s);
> +			/* deassert a pending interrupt */
> +			s->pics_state->irq_request(s->pics_state->
> +						   irq_request_opaque,
> 0);
> +			s->init_state = 1;
> +			s->init4 = val & 1;
> +			if (val & 0x02)
> +				printk(KERN_ERR "single mode not
> supported");
> +			if (val & 0x08)
> +				printk(KERN_ERR
> +				       "level sensitive irq not
> supported");
> +		} else if (val & 0x08) {
> +			if (val & 0x04)
> +				s->poll = 1;
> +			if (val & 0x02)
> +				s->read_reg_select = val & 1;
> +			if (val & 0x40)
> +				s->special_mask = (val >> 5) & 1;
> +		} else {
> +			cmd = val >> 5;
> +			switch (cmd) {
> +			case 0:
> +			case 4:
> +				s->rotate_on_auto_eoi = cmd >> 2;
> +				break;
> +			case 1:	/* end of interrupt */
> +			case 5:
> +				priority = get_priority(s, s->isr);
> +				if (priority != 8) {
> +					irq = (priority +
> s->priority_add) & 7;
> +					s->isr &= ~(1 << irq);
> +					if (cmd == 5)
> +						s->priority_add = (irq +
> 1) & 7;
> +					pic_update_irq(s->pics_state);
> +				}
> +				break;
> +			case 3:
> +				irq = val & 7;
> +				s->isr &= ~(1 << irq);
> +				pic_update_irq(s->pics_state);
> +				break;
> +			case 6:
> +				s->priority_add = (val + 1) & 7;
> +				pic_update_irq(s->pics_state);
> +				break;
> +			case 7:
> +				irq = val & 7;
> +				s->isr &= ~(1 << irq);
> +				s->priority_add = (irq + 1) & 7;
> +				pic_update_irq(s->pics_state);
> +				break;
> +			default:
> +				/* no operation */
> +				break;
> +			}
> +		}
> +	} else {
> +		switch (s->init_state) {
> +		case 0:
> +			/* normal mode */
> +			s->imr = val;
> +			pic_update_irq(s->pics_state);
> +			break;
> +		case 1:
> +			s->irq_base = val & 0xf8;
> +			s->init_state = 2;
> +			break;
> +		case 2:
> +			if (s->init4) {
> +				s->init_state = 3;
> +			} else {
> +				s->init_state = 0;
> +			}
> +			break;
> +		case 3:
> +			s->special_fully_nested_mode = (val >> 4) & 1;
> +			s->auto_eoi = (val >> 1) & 1;
> +			s->init_state = 0;
> +			break;
> +		}
> +	}
> +}
> +
> +static uint32_t pic_poll_read(struct pic_state *s, uint32_t addr1)
> +{
> +	int ret;
> +
> +	ret = pic_get_irq(s);
> +	if (ret >= 0) {
> +		if (addr1 >> 7) {
> +			s->pics_state->pics[0].isr &= ~(1 << 2);
> +			s->pics_state->pics[0].irr &= ~(1 << 2);
> +		}
> +		s->irr &= ~(1 << ret);
> +		s->isr &= ~(1 << ret);
> +		if (addr1 >> 7 || ret != 2)
> +			pic_update_irq(s->pics_state);
> +	} else {
> +		ret = 0x07;
> +		pic_update_irq(s->pics_state);
> +	}
> +
> +	return ret;
> +}
> +
> +static uint32_t pic_ioport_read(void *opaque, uint32_t addr1)
> +{
> +	struct pic_state *s = opaque;
> +	unsigned int addr;
> +	int ret;
> +
> +	addr = addr1;
> +	addr &= 1;
> +	if (s->poll) {
> +		ret = pic_poll_read(s, addr1);
> +		s->poll = 0;
> +	} else {
> +		if (addr == 0) {
> +			if (s->read_reg_select)
> +				ret = s->isr;
> +			else
> +				ret = s->irr;
> +		} else {
> +			ret = s->imr;
> +		}
> +	}
> +	return ret;
> +}
> +
> +static void elcr_ioport_write(void *opaque, uint32_t addr, uint32_t
> val)
> +{
> +	struct pic_state *s = opaque;
> +	s->elcr = val & s->elcr_mask;
> +}
> +
> +static uint32_t elcr_ioport_read(void *opaque, uint32_t addr1)
> +{
> +	struct pic_state *s = opaque;
> +	return s->elcr;
> +}
> +
> +static int picdev_in_range(struct kvm_io_device *this, gpa_t addr)
> +{
> +	switch (addr) {
> +		/* Master PIC */
> +	case 0x20:
> +	case 0x21:
> +		/* Slave PIC */
> +	case 0xa0:
> +	case 0xa1:
> +		/* elcr0  & 1 */
> +	case 0x4d0:
> +	case 0x4d1:
> +		return 1;
> +	default:
> +		return 0;
> +	}
> +}
> +
> +static void picdev_write(struct kvm_io_device *this,
> +			 gpa_t addr, int len, const void *val)
> +{
> +	struct pic_state2 *s = this->private;
> +	unsigned char data = *(unsigned char *)val;
> +
> +	if (len != 1) {
> +		printk(KERN_ERR "PIC: non byte write\n");
> +		return;
> +	}
> +	switch (addr) {
> +	case 0x20:
> +	case 0x21:
> +	case 0xa0:
> +	case 0xa1:
> +		pic_ioport_write(&s->pics[addr >> 7], addr, data);
> +		break;
> +	case 0x4d0:
> +	case 0x4d1:
> +		elcr_ioport_write(&s->pics[addr & 1], addr, data);
> +		break;
> +	}
> +}
> +
> +static void picdev_read(struct kvm_io_device *this,
> +			gpa_t addr, int len, void *val)
> +{
> +	struct pic_state2 *s = this->private;
> +	unsigned char data = 0;
> +
> +	if (len != 1) {
> +		printk(KERN_ERR "PIC: non byte read\n");
> +		return;
> +	}
> +	switch (addr) {
> +	case 0x20:
> +	case 0x21:
> +	case 0xa0:
> +	case 0xa1:
> +		data = pic_ioport_read(&s->pics[addr >> 7], addr);
> +		break;
> +	case 0x4d0:
> +	case 0x4d1:
> +		data = elcr_ioport_read(&s->pics[addr & 1], addr);
> +		break;
> +	}
> +	*(unsigned char *)val = data;
> +}
> +
> +/* callback when PIC0 irq status changed */
> +static void pic_irq_request(void *opaque, int level)
> +{
> +	struct kvm *kvm = opaque;
> +
> +	int_output(kvm->vpic) = level;
> +}
> +
> +struct pic_state2 *create_pic(struct kvm *kvm)
> +{
> +	struct pic_state2 *s;
> +	s = kzalloc(sizeof(struct pic_state2), GFP_KERNEL);
> +	if (!s)
> +		return NULL;
> +	s->pics[0].elcr_mask = 0xf8;
> +	s->pics[1].elcr_mask = 0xde;
> +	s->irq_request = pic_irq_request;
> +	s->irq_request_opaque = kvm;
> +	s->pics[0].pics_state = s;
> +	s->pics[1].pics_state = s;
> +
> +	/* Initialize PIO device */
> +	s->dev.read = picdev_read;
> +	s->dev.write = picdev_write;
> +	s->dev.in_range = picdev_in_range;
> +	s->dev.private = s;
> +	kvm_io_bus_register_dev(&kvm->pio_bus, &s->dev);
> +	return s;
> +}

If you can, please take a look at that 8259 patch that I sent you.  Like
yours, its based on qemu, but I cleaned it up quite a bit.  I am not a
huge fan of the way the QEMU code smashes models together like this
(e.g. two 8259s hard coded together).  I much prefer to have clean lines
of delineation (e.g. model a single 8259, and then reuse two models in a
cascaded configuration).  It could just be me ;), but I think this
results in a cleaner and easier to maintain code base with fewer bugs in
the long term.



> diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
> index a7c5e6b..1e2d3ba 100644
> --- a/drivers/kvm/kvm.h
> +++ b/drivers/kvm/kvm.h
> @@ -456,8 +456,15 @@ struct kvm {
>  	struct file *filp;
>  	struct kvm_io_bus mmio_bus;
>  	struct kvm_io_bus pio_bus;
> +	void *vpic;
>  };
>  
> +#define kernel_pic(kvm) ((kvm)->vpic)
> +static inline int irqchip_in_kernel(struct kvm *kvm)
> +{
> +	return (kernel_pic(kvm) != 0);
> +}
> +
>  struct descriptor_table {
>  	u16 limit;
>  	unsigned long base;
> diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
> index 7826f16..25f99fc 100644
> --- a/drivers/kvm/kvm_main.c
> +++ b/drivers/kvm/kvm_main.c
> @@ -16,6 +16,7 @@
>   */
>  
>  #include "kvm.h"
> +#include "virq.h"
>  
>  #include <linux/kvm.h>
>  #include <linux/module.h>
> @@ -476,6 +477,8 @@ static void kvm_destroy_vm(struct kvm *kvm)
>  	spin_lock(&kvm_lock);
>  	list_del(&kvm->vm_list);
>  	spin_unlock(&kvm_lock);
> +	if (kernel_pic(kvm))
> +		kfree(kernel_pic(kvm));
>  	kvm_io_bus_destroy(&kvm->pio_bus);
>  	kvm_io_bus_destroy(&kvm->mmio_bus);
>  	kvm_free_vcpus(kvm);
> @@ -1372,7 +1375,8 @@ EXPORT_SYMBOL_GPL(emulate_instruction);
>  
>  int kvm_emulate_halt(struct kvm_vcpu *vcpu)
>  {
> -	if (vcpu->irq_summary)
> +	if (vcpu->irq_summary ||
> +           (irqchip_in_kernel(vcpu->kvm) && cpu_has_interrupt(vcpu)))
>  		return 1;
>  
>  	vcpu->run->exit_reason = KVM_EXIT_HLT;
> @@ -2855,6 +2859,26 @@ static long kvm_vm_ioctl(struct file *filp,
>  			goto out;
>  		break;
>  	}
> +	case KVM_CREATE_PIC:
> +		r = -ENOMEM;
> +		kernel_pic(kvm) = create_pic(kvm);
> +		if (kernel_pic(kvm)) {
> +			r = 0;
> +		}
> +		break;
> +	case KVM_SET_ISA_IRQ_LEVEL: {
> +		struct kvm_irq_level irq_event;
> +		r = -EFAULT;
> +		if (copy_from_user(&irq_event, argp, sizeof irq_event))
> +			goto out;
> +		if (irqchip_in_kernel(kvm)) {
> +			pic_set_irq_new(kernel_pic(kvm),
> +					irq_event.irq,
> +					irq_event.level);
> +			r = 0;
> +		}
> +		break;
> +	}
>  	default:
>  		;
>  	}
> diff --git a/drivers/kvm/virq.c b/drivers/kvm/virq.c
> new file mode 100644
> index 0000000..a807296
> --- /dev/null
> +++ b/drivers/kvm/virq.c
> @@ -0,0 +1,58 @@
> +/*
> + * virq.c: API for in kernel interrupt controller
> + * Copyright (c) 2007, Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify
> it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but
> WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
> or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
> License for
> + * more details.
> + *
> + * You should have received a copy of the GNU General Public License
> along with
> + * this program; if not, write to the Free Software Foundation, Inc.,
> 59 Temple
> + * Place - Suite 330, Boston, MA 02111-1307 USA.
> + * Authors:
> + *   Yaozu (Eddie) Dong <Eddie.dong-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> + *
> + */
> +
> +#include "kvm.h"
> +#include "virq.h"
> +
> +/*
> + * check if there is pending interrupt without
> + * intack.
> + */
> +int cpu_has_interrupt(struct kvm_vcpu *v)
> +{
> +	struct pic_state2 *s = kernel_pic(v->kvm);
> +
> +	/* PIC */
> +	if (int_output(s))
> +		return 1;
> +	/* TODO: APIC */
> +	return 0;
> +}
> +
> +EXPORT_SYMBOL_GPL(cpu_has_interrupt);
> +
> +/*
> + * Read pending interrupt vector and intack.
> + */
> +int cpu_get_interrupt(struct kvm_vcpu *v)
> +{
> +	struct pic_state2 *s = kernel_pic(v->kvm);
> +	int vector;
> +
> +	int_output(s) = 0;
> +	vector = pic_read_irq(s);
> +	if (vector != -1)
> +		return vector;
> +	/* TODO: APIC */
> +	return -1;
> +}
> +
> +EXPORT_SYMBOL_GPL(cpu_get_interrupt);

Suffice to say I have a problem with this particular injection
methodology, which is why I redesigned it in the APIC series.  Is there
a particular reason you are choosing to propagate it from QEMU to KVM?
I understand that this may create synergy between the Xen and KVM code
bases. which is nice from a bug-fix-sharing perspective.  But, in my
mind that reason alone should not usurp good design.  Lets get this
right while we can (where "right" is what everyone here agrees to, not
my design per se ;)  

> diff --git a/drivers/kvm/virq.h b/drivers/kvm/virq.h
> new file mode 100644
> index 0000000..ed467dc
> --- /dev/null
> +++ b/drivers/kvm/virq.h
> @@ -0,0 +1,70 @@
> +/*
> + * virq.h: in kernel interrupt controller related definitions
> + * Copyright (c) 2007, Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify
> it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but
> WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
> or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
> License for
> + * more details.
> + *
> + * You should have received a copy of the GNU General Public License
> along with
> + * this program; if not, write to the Free Software Foundation, Inc.,
> 59 Temple
> + * Place - Suite 330, Boston, MA 02111-1307 USA.
> + * Authors:
> + *   Yaozu (Eddie) Dong <Eddie.dong-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> + *
> + */
> +
> +#ifndef __VIRQ_H
> +#define __VIRQ_H
> +
> +#include "kvm.h"
> +
> +typedef void SetIRQFunc(void *opaque, int irq_num, int level);
> +typedef void IRQRequestFunc(void *opaque, int level);
> +
> +struct PicState2;
> +struct pic_state {
> +	uint8_t last_irr;	/* edge detection */
> +	uint8_t irr;		/* interrupt request register */
> +	uint8_t imr;		/* interrupt mask register */
> +	uint8_t isr;		/* interrupt service register */
> +	uint8_t priority_add;	/* highest irq priority */
> +	uint8_t irq_base;
> +	uint8_t read_reg_select;
> +	uint8_t poll;
> +	uint8_t special_mask;
> +	uint8_t init_state;
> +	uint8_t auto_eoi;
> +	uint8_t rotate_on_auto_eoi;
> +	uint8_t special_fully_nested_mode;
> +	uint8_t init4;		/* true if 4 byte init */
> +	uint8_t elcr;		/* PIIX edge/trigger selection */
> +	uint8_t elcr_mask;
> +	struct pic_state2 *pics_state;
> +};
> +
> +struct pic_state2 {
> +	/* 0 is master pic, 1 is slave pic */
> +	struct pic_state pics[2];
> +	IRQRequestFunc *irq_request;
> +	void *irq_request_opaque;
> +	/* IOAPIC callback support */
> +	SetIRQFunc *alt_irq_func;
> +	void *alt_irq_opaque;
> +	int output;		/* intr from master PIC */
> +	struct kvm_io_device dev;
> +};
> +
> +#define int_output(s)		(((struct pic_state2 *)s)->output)
> +struct pic_state2 *create_pic(struct kvm *kvm);
> +void pic_set_irq_new(void *opaque, int irq, int level);
> +int pic_read_irq(struct pic_state2 *s);
> +int cpu_get_interrupt(struct kvm_vcpu *v);
> +int cpu_has_interrupt(struct kvm_vcpu *v);
> +
> +#endif
> diff --git a/drivers/kvm/vmx.c b/drivers/kvm/vmx.c
> index b47ddcc..cef0d64 100644
> --- a/drivers/kvm/vmx.c
> +++ b/drivers/kvm/vmx.c
> @@ -16,6 +16,7 @@
>   */
>  
>  #include "kvm.h"
> +#include "virq.h"
>  #include "vmx.h"
>  #include <linux/module.h>
>  #include <linux/kernel.h>
> @@ -1464,6 +1465,16 @@ static void inject_rmode_irq(struct kvm_vcpu
> *vcpu, int irq)
>  	vmcs_writel(GUEST_RSP, (vmcs_readl(GUEST_RSP) & ~0xffff) | (sp -
> 6));
>  }
>  
> +static void vmx_inject_irq(struct kvm_vcpu *vcpu, int irq)
> +{
> +	if (vcpu->rmode.active) {
> +		inject_rmode_irq(vcpu, irq);
> +		return;
> +	}
> +	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
> +			irq | INTR_TYPE_EXT_INTR |
> INTR_INFO_VALID_MASK);
> +}
> +
>  static void kvm_do_inject_irq(struct kvm_vcpu *vcpu)
>  {
>  	int word_index = __ffs(vcpu->irq_summary);
> @@ -1473,13 +1484,7 @@ static void kvm_do_inject_irq(struct kvm_vcpu
> *vcpu)
>  	clear_bit(bit_index, &vcpu->irq_pending[word_index]);
>  	if (!vcpu->irq_pending[word_index])
>  		clear_bit(word_index, &vcpu->irq_summary);
> -
> -	if (vcpu->rmode.active) {
> -		inject_rmode_irq(vcpu, irq);
> -		return;
> -	}
> -	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
> -			irq | INTR_TYPE_EXT_INTR |
> INTR_INFO_VALID_MASK);
> +	vmx_inject_irq(vcpu, irq);
>  }
>  
> 
> @@ -1563,7 +1568,7 @@ static int handle_exception(struct kvm_vcpu *vcpu,
> struct kvm_run *kvm_run)
>  		       "intr info 0x%x\n", __FUNCTION__, vect_info,
> intr_info);
>  	}
>  
> -	if (is_external_interrupt(vect_info)) {
> +	if (!irqchip_in_kernel(vcpu->kvm) &&
> is_external_interrupt(vect_info)) {
>  		int irq = vect_info & VECTORING_INFO_VECTOR_MASK;
>  		set_bit(irq, vcpu->irq_pending);
>  		set_bit(irq / BITS_PER_LONG, &vcpu->irq_summary);
> @@ -1894,6 +1899,12 @@ static void post_kvm_run_save(struct kvm_vcpu
> *vcpu,
>  static int handle_interrupt_window(struct kvm_vcpu *vcpu,
>  				   struct kvm_run *kvm_run)
>  {
> +	u32 cpu_based_vm_exec_control;
> +
> +	/* clear pending irq */
> +	cpu_based_vm_exec_control =
> vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
> +	cpu_based_vm_exec_control &= ~CPU_BASED_VIRTUAL_INTR_PENDING;
> +	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL,
> cpu_based_vm_exec_control);
>  	/*
>  	 * If the user space waits to inject interrupts, exit as soon as
>  	 * possible
> @@ -1986,6 +1997,56 @@ static void vmx_flush_tlb(struct kvm_vcpu *vcpu)
>  	vmcs_writel(GUEST_CR3, vmcs_readl(GUEST_CR3));
>  }
>  
> +void enable_irq_window(struct kvm_vcpu *vcpu)
> +{
> +	u32 cpu_based_vm_exec_control;
> +
> +	cpu_based_vm_exec_control =
> vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
> +	cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
> +	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL,
> cpu_based_vm_exec_control);
> +}
> +
> +void do_intr_assist(struct kvm_vcpu *vcpu)
> +{
> +	u32 idtv_info_field, intr_info_field;
> +	int has_ext_irq, interrupt_window_open;
> +	/* TODO: Move IDT_Vectoring here */
> +
> +	has_ext_irq = cpu_has_interrupt(vcpu);
> +	intr_info_field = vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
> +	idtv_info_field = vmcs_read32(IDT_VECTORING_INFO_FIELD);
> +	if (intr_info_field & INTR_INFO_VALID_MASK) {
> +		if (idtv_info_field & INTR_INFO_VALID_MASK) {
> +			/* TODO: fault when IDT_Vectoring */
> +			printk(KERN_ERR "Fault when IDT_Vectoring\n");
> +		}
> +		if (has_ext_irq)
> +			enable_irq_window(vcpu);
> +		return;
> +	}
> +	if (unlikely(idtv_info_field & INTR_INFO_VALID_MASK)) {
> +		vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, idtv_info_field);
> +		vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
> +				vmcs_read32(VM_EXIT_INSTRUCTION_LEN));
> +
> +		if ( unlikely(idtv_info_field & 0x800) ) /* valid error
> code */
> +			vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
> +				vmcs_read32(IDT_VECTORING_ERROR_CODE));
> +		if ( unlikely(has_ext_irq) )
> +			enable_irq_window(vcpu);
> +		return;
> +	}
> +	if (!has_ext_irq)
> +		return;
> +	interrupt_window_open =
> +		((vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
> +		 (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & 3) == 0);
> +	if (interrupt_window_open)
> +		vmx_inject_irq(vcpu, cpu_get_interrupt(vcpu));
> +	else
> +		enable_irq_window(vcpu);
> +}
> +
>  static int vmx_vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
>  {
>  	u8 fail;
> @@ -1996,9 +2057,6 @@ preempted:
>  		kvm_guest_debug_pre(vcpu);
>  
>  again:
> -	if (!vcpu->mmio_read_completed)
> -		do_interrupt_requests(vcpu, kvm_run);
> -
>  	vmx_save_host_state(vcpu);
>  	kvm_load_guest_fpu(vcpu);
>  
> @@ -2012,6 +2070,10 @@ again:
>  	vmcs_writel(HOST_CR0, read_cr0());
>  
>  	local_irq_disable();
> +	if (irqchip_in_kernel(vcpu->kvm))
> +		do_intr_assist(vcpu);
> +	else if (!vcpu->mmio_read_completed)
> +		do_interrupt_requests(vcpu, kvm_run);

2 comments here:

1) shouldnt this be:

> +	if (!vcpu->mmio_read_completed) {
> +               if (irqchip_in_kernel(vcpu->kvm))
> +		         do_intr_assist(vcpu);
> +		else
> +                        do_interrupt_requests(vcpu, kvm_run);

?

2) I prefer to switch out code like this using indirection instead of if
statements.  The code is IMO cleaner and the run-time overhead is
probably lower, especially as more cases come into play.


Regards,
-Greg


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread
* In kernel PIC support: kernel patch
@ 2007-06-20  7:43 Dong, Eddie
       [not found] ` <10EA09EFD8728347A513008B6B0DA77A01A55F0A-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Dong, Eddie @ 2007-06-20  7:43 UTC (permalink / raw)
  To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

[-- Attachment #1: Type: text/plain, Size: 24457 bytes --]

As we discussed, if we move APIC to kernel while leaving PIC in user
level, we have ABI level holes if the interrupt a user level qemu
injected is not immediately injected to guest for some reason. \

This patch fixes this by introducing new VM level IOCTL
KVM_SET_ISA_IRQ_LEVEL & KVM_CREATE_PIC to avoid the ABI hole.  The
original KVM_INTERRUPT is still there to be backward compatible.

WIth this patch, old Qemu, new qemu (after this patch), old kvm, new kvm
can work in any combination. Both user level code and kernel code will
automatically decide the irq source base on irqchip location (user or
kernel).

Some known issues (TODO):
	1: SVM support is on going.
	The code for VMX to inject irq is same with Xen now since the
situation is same now (irqchip in hyprevisor), SVM code should be able
to directly reuse from Xen too.
	2: live migration break.
	3: Temply APIC support is removed in CPUID to wait for the merge
of in kernel APIC patch

thx, eddie

    Signed-off-by: Yaozu (Eddie) Dong <eddie.dong-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

 drivers/kvm/Makefile   |    2
 drivers/kvm/i8259.c    |  435
+++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/kvm/kvm.h      |    7
 drivers/kvm/kvm_main.c |   26 ++
 drivers/kvm/virq.c     |   58 ++++++
 drivers/kvm/virq.h     |   70 +++++++
 drivers/kvm/vmx.c      |   84 ++++++++-
 include/linux/kvm.h    |    9 +
 8 files changed, 678 insertions(+), 13 deletions(-)


diff --git a/drivers/kvm/Makefile b/drivers/kvm/Makefile
index c0a789f..118c79b 100644
--- a/drivers/kvm/Makefile
+++ b/drivers/kvm/Makefile
@@ -2,7 +2,7 @@
 # Makefile for Kernel-based Virtual Machine module
 #
 
-kvm-objs := kvm_main.o mmu.o x86_emulate.o
+kvm-objs := kvm_main.o mmu.o x86_emulate.o i8259.o virq.o
 obj-$(CONFIG_KVM) += kvm.o
 kvm-intel-objs = vmx.o
 obj-$(CONFIG_KVM_INTEL) += kvm-intel.o
diff --git a/drivers/kvm/i8259.c b/drivers/kvm/i8259.c
new file mode 100644
index 0000000..9ba8c1e
--- /dev/null
+++ b/drivers/kvm/i8259.c
@@ -0,0 +1,435 @@
+/*
+ * QEMU 8259 interrupt controller emulation
+ *
+ * Copyright (c) 2003-2004 Fabrice Bellard
+ * Copyright (c) 2007 Intel Corperation
+ *
+ * Permission is hereby granted, free of charge, to any person
obtaining a copy
+ * of this software and associated documentation files (the
"Software"), to deal
+ * in the Software without restriction, including without limitation
the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or
sell
+ * copies of the Software, and to permit persons to whom the Software
is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be
included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT
SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN
+ * THE SOFTWARE.
+ * Authors:
+ *   Yaozu (Eddie) Dong <Eddie.dong-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
+ *   Port from Qemu.
+ */
+#include "virq.h"
+#include <linux/mm.h>
+
+/* set irq level. If an edge is detected, then the IRR is set to 1 */
+static inline void pic_set_irq1(struct pic_state *s, int irq, int
level)
+{
+	int mask;
+	mask = 1 << irq;
+	if (s->elcr & mask) {
+		/* level triggered */
+		if (level) {
+			s->irr |= mask;
+			s->last_irr |= mask;
+		} else {
+			s->irr &= ~mask;
+			s->last_irr &= ~mask;
+		}
+	} else {
+		/* edge triggered */
+		if (level) {
+			if ((s->last_irr & mask) == 0)
+				s->irr |= mask;
+			s->last_irr |= mask;
+		} else {
+			s->last_irr &= ~mask;
+		}
+	}
+}
+
+/* return the highest priority found in mask (highest = smallest
+   number). Return 8 if no irq */
+static inline int get_priority(struct pic_state *s, int mask)
+{
+	int priority;
+	if (mask == 0)
+		return 8;
+	priority = 0;
+	while ((mask & (1 << ((priority + s->priority_add) & 7))) == 0)
+		priority++;
+	return priority;
+}
+
+/* return the pic wanted interrupt. return -1 if none */
+static int pic_get_irq(struct pic_state *s)
+{
+	int mask, cur_priority, priority;
+
+	mask = s->irr & ~s->imr;
+	priority = get_priority(s, mask);
+	if (priority == 8)
+		return -1;
+	/* compute current priority. If special fully nested mode on the
+	   master, the IRQ coming from the slave is not taken into
account
+	   for the priority computation. */
+	mask = s->isr;
+	if (s->special_fully_nested_mode && s ==
&s->pics_state->pics[0])
+		mask &= ~(1 << 2);
+	cur_priority = get_priority(s, mask);
+	if (priority < cur_priority) {
+		/* higher priority found: an irq should be generated */
+		return (priority + s->priority_add) & 7;
+	} else {
+		return -1;
+	}
+}
+
+/* raise irq to CPU if necessary. must be called every time the active
+   irq may change */
+static void pic_update_irq(struct pic_state2 *s)
+{
+	int irq2, irq;
+
+	/* first look at slave pic */
+	irq2 = pic_get_irq(&s->pics[1]);
+	if (irq2 >= 0) {
+		/* if irq request by slave pic, signal master PIC */
+		pic_set_irq1(&s->pics[0], 2, 1);
+		pic_set_irq1(&s->pics[0], 2, 0);
+	}
+	/* look at requested irq */
+	irq = pic_get_irq(&s->pics[0]);
+	if (irq >= 0) {
+		s->irq_request(s->irq_request_opaque, 1);
+	}
+}
+
+void pic_set_irq_new(void *opaque, int irq, int level)
+{
+	struct pic_state2 *s = opaque;
+
+	pic_set_irq1(&s->pics[irq >> 3], irq & 7, level);
+	/* used for IOAPIC irqs */
+	if (s->alt_irq_func)
+		s->alt_irq_func(s->alt_irq_opaque, irq, level);
+	pic_update_irq(s);
+}
+
+/* acknowledge interrupt 'irq' */
+static inline void pic_intack(struct pic_state *s, int irq)
+{
+	if (s->auto_eoi) {
+		if (s->rotate_on_auto_eoi)
+			s->priority_add = (irq + 1) & 7;
+	} else {
+		s->isr |= (1 << irq);
+	}
+	/* We don't clear a level sensitive interrupt here */
+	if (!(s->elcr & (1 << irq)))
+		s->irr &= ~(1 << irq);
+}
+
+int pic_read_irq(struct pic_state2 *s)
+{
+	int irq, irq2, intno;
+
+	irq = pic_get_irq(&s->pics[0]);
+	if (irq >= 0) {
+		pic_intack(&s->pics[0], irq);
+		if (irq == 2) {
+			irq2 = pic_get_irq(&s->pics[1]);
+			if (irq2 >= 0) {
+				pic_intack(&s->pics[1], irq2);
+			} else {
+				/* spurious IRQ on slave controller */
+				irq2 = 7;
+			}
+			intno = s->pics[1].irq_base + irq2;
+			irq = irq2 + 8;
+		} else {
+			intno = s->pics[0].irq_base + irq;
+		}
+	} else {
+		/* spurious IRQ on host controller */
+		irq = 7;
+		intno = s->pics[0].irq_base + irq;
+	}
+	pic_update_irq(s);
+
+	return intno;
+}
+
+static void pic_reset(void *opaque)
+{
+	struct pic_state *s = opaque;
+
+	s->last_irr = 0;
+	s->irr = 0;
+	s->imr = 0;
+	s->isr = 0;
+	s->priority_add = 0;
+	s->irq_base = 0;
+	s->read_reg_select = 0;
+	s->poll = 0;
+	s->special_mask = 0;
+	s->init_state = 0;
+	s->auto_eoi = 0;
+	s->rotate_on_auto_eoi = 0;
+	s->special_fully_nested_mode = 0;
+	s->init4 = 0;
+	/* Note: ELCR is not reset */
+}
+
+static void pic_ioport_write(void *opaque, uint32_t addr, uint32_t val)
+{
+	struct pic_state *s = opaque;
+	int priority, cmd, irq;
+
+	addr &= 1;
+	if (addr == 0) {
+		if (val & 0x10) {
+			/* init */
+			pic_reset(s);
+			/* deassert a pending interrupt */
+			s->pics_state->irq_request(s->pics_state->
+						   irq_request_opaque,
0);
+			s->init_state = 1;
+			s->init4 = val & 1;
+			if (val & 0x02)
+				printk(KERN_ERR "single mode not
supported");
+			if (val & 0x08)
+				printk(KERN_ERR
+				       "level sensitive irq not
supported");
+		} else if (val & 0x08) {
+			if (val & 0x04)
+				s->poll = 1;
+			if (val & 0x02)
+				s->read_reg_select = val & 1;
+			if (val & 0x40)
+				s->special_mask = (val >> 5) & 1;
+		} else {
+			cmd = val >> 5;
+			switch (cmd) {
+			case 0:
+			case 4:
+				s->rotate_on_auto_eoi = cmd >> 2;
+				break;
+			case 1:	/* end of interrupt */
+			case 5:
+				priority = get_priority(s, s->isr);
+				if (priority != 8) {
+					irq = (priority +
s->priority_add) & 7;
+					s->isr &= ~(1 << irq);
+					if (cmd == 5)
+						s->priority_add = (irq +
1) & 7;
+					pic_update_irq(s->pics_state);
+				}
+				break;
+			case 3:
+				irq = val & 7;
+				s->isr &= ~(1 << irq);
+				pic_update_irq(s->pics_state);
+				break;
+			case 6:
+				s->priority_add = (val + 1) & 7;
+				pic_update_irq(s->pics_state);
+				break;
+			case 7:
+				irq = val & 7;
+				s->isr &= ~(1 << irq);
+				s->priority_add = (irq + 1) & 7;
+				pic_update_irq(s->pics_state);
+				break;
+			default:
+				/* no operation */
+				break;
+			}
+		}
+	} else {
+		switch (s->init_state) {
+		case 0:
+			/* normal mode */
+			s->imr = val;
+			pic_update_irq(s->pics_state);
+			break;
+		case 1:
+			s->irq_base = val & 0xf8;
+			s->init_state = 2;
+			break;
+		case 2:
+			if (s->init4) {
+				s->init_state = 3;
+			} else {
+				s->init_state = 0;
+			}
+			break;
+		case 3:
+			s->special_fully_nested_mode = (val >> 4) & 1;
+			s->auto_eoi = (val >> 1) & 1;
+			s->init_state = 0;
+			break;
+		}
+	}
+}
+
+static uint32_t pic_poll_read(struct pic_state *s, uint32_t addr1)
+{
+	int ret;
+
+	ret = pic_get_irq(s);
+	if (ret >= 0) {
+		if (addr1 >> 7) {
+			s->pics_state->pics[0].isr &= ~(1 << 2);
+			s->pics_state->pics[0].irr &= ~(1 << 2);
+		}
+		s->irr &= ~(1 << ret);
+		s->isr &= ~(1 << ret);
+		if (addr1 >> 7 || ret != 2)
+			pic_update_irq(s->pics_state);
+	} else {
+		ret = 0x07;
+		pic_update_irq(s->pics_state);
+	}
+
+	return ret;
+}
+
+static uint32_t pic_ioport_read(void *opaque, uint32_t addr1)
+{
+	struct pic_state *s = opaque;
+	unsigned int addr;
+	int ret;
+
+	addr = addr1;
+	addr &= 1;
+	if (s->poll) {
+		ret = pic_poll_read(s, addr1);
+		s->poll = 0;
+	} else {
+		if (addr == 0) {
+			if (s->read_reg_select)
+				ret = s->isr;
+			else
+				ret = s->irr;
+		} else {
+			ret = s->imr;
+		}
+	}
+	return ret;
+}
+
+static void elcr_ioport_write(void *opaque, uint32_t addr, uint32_t
val)
+{
+	struct pic_state *s = opaque;
+	s->elcr = val & s->elcr_mask;
+}
+
+static uint32_t elcr_ioport_read(void *opaque, uint32_t addr1)
+{
+	struct pic_state *s = opaque;
+	return s->elcr;
+}
+
+static int picdev_in_range(struct kvm_io_device *this, gpa_t addr)
+{
+	switch (addr) {
+		/* Master PIC */
+	case 0x20:
+	case 0x21:
+		/* Slave PIC */
+	case 0xa0:
+	case 0xa1:
+		/* elcr0  & 1 */
+	case 0x4d0:
+	case 0x4d1:
+		return 1;
+	default:
+		return 0;
+	}
+}
+
+static void picdev_write(struct kvm_io_device *this,
+			 gpa_t addr, int len, const void *val)
+{
+	struct pic_state2 *s = this->private;
+	unsigned char data = *(unsigned char *)val;
+
+	if (len != 1) {
+		printk(KERN_ERR "PIC: non byte write\n");
+		return;
+	}
+	switch (addr) {
+	case 0x20:
+	case 0x21:
+	case 0xa0:
+	case 0xa1:
+		pic_ioport_write(&s->pics[addr >> 7], addr, data);
+		break;
+	case 0x4d0:
+	case 0x4d1:
+		elcr_ioport_write(&s->pics[addr & 1], addr, data);
+		break;
+	}
+}
+
+static void picdev_read(struct kvm_io_device *this,
+			gpa_t addr, int len, void *val)
+{
+	struct pic_state2 *s = this->private;
+	unsigned char data = 0;
+
+	if (len != 1) {
+		printk(KERN_ERR "PIC: non byte read\n");
+		return;
+	}
+	switch (addr) {
+	case 0x20:
+	case 0x21:
+	case 0xa0:
+	case 0xa1:
+		data = pic_ioport_read(&s->pics[addr >> 7], addr);
+		break;
+	case 0x4d0:
+	case 0x4d1:
+		data = elcr_ioport_read(&s->pics[addr & 1], addr);
+		break;
+	}
+	*(unsigned char *)val = data;
+}
+
+/* callback when PIC0 irq status changed */
+static void pic_irq_request(void *opaque, int level)
+{
+	struct kvm *kvm = opaque;
+
+	int_output(kvm->vpic) = level;
+}
+
+struct pic_state2 *create_pic(struct kvm *kvm)
+{
+	struct pic_state2 *s;
+	s = kzalloc(sizeof(struct pic_state2), GFP_KERNEL);
+	if (!s)
+		return NULL;
+	s->pics[0].elcr_mask = 0xf8;
+	s->pics[1].elcr_mask = 0xde;
+	s->irq_request = pic_irq_request;
+	s->irq_request_opaque = kvm;
+	s->pics[0].pics_state = s;
+	s->pics[1].pics_state = s;
+
+	/* Initialize PIO device */
+	s->dev.read = picdev_read;
+	s->dev.write = picdev_write;
+	s->dev.in_range = picdev_in_range;
+	s->dev.private = s;
+	kvm_io_bus_register_dev(&kvm->pio_bus, &s->dev);
+	return s;
+}
diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
index a7c5e6b..1e2d3ba 100644
--- a/drivers/kvm/kvm.h
+++ b/drivers/kvm/kvm.h
@@ -456,8 +456,15 @@ struct kvm {
 	struct file *filp;
 	struct kvm_io_bus mmio_bus;
 	struct kvm_io_bus pio_bus;
+	void *vpic;
 };
 
+#define kernel_pic(kvm) ((kvm)->vpic)
+static inline int irqchip_in_kernel(struct kvm *kvm)
+{
+	return (kernel_pic(kvm) != 0);
+}
+
 struct descriptor_table {
 	u16 limit;
 	unsigned long base;
diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index 7826f16..25f99fc 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -16,6 +16,7 @@
  */
 
 #include "kvm.h"
+#include "virq.h"
 
 #include <linux/kvm.h>
 #include <linux/module.h>
@@ -476,6 +477,8 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	spin_lock(&kvm_lock);
 	list_del(&kvm->vm_list);
 	spin_unlock(&kvm_lock);
+	if (kernel_pic(kvm))
+		kfree(kernel_pic(kvm));
 	kvm_io_bus_destroy(&kvm->pio_bus);
 	kvm_io_bus_destroy(&kvm->mmio_bus);
 	kvm_free_vcpus(kvm);
@@ -1372,7 +1375,8 @@ EXPORT_SYMBOL_GPL(emulate_instruction);
 
 int kvm_emulate_halt(struct kvm_vcpu *vcpu)
 {
-	if (vcpu->irq_summary)
+	if (vcpu->irq_summary ||
+           (irqchip_in_kernel(vcpu->kvm) && cpu_has_interrupt(vcpu)))
 		return 1;
 
 	vcpu->run->exit_reason = KVM_EXIT_HLT;
@@ -2855,6 +2859,26 @@ static long kvm_vm_ioctl(struct file *filp,
 			goto out;
 		break;
 	}
+	case KVM_CREATE_PIC:
+		r = -ENOMEM;
+		kernel_pic(kvm) = create_pic(kvm);
+		if (kernel_pic(kvm)) {
+			r = 0;
+		}
+		break;
+	case KVM_SET_ISA_IRQ_LEVEL: {
+		struct kvm_irq_level irq_event;
+		r = -EFAULT;
+		if (copy_from_user(&irq_event, argp, sizeof irq_event))
+			goto out;
+		if (irqchip_in_kernel(kvm)) {
+			pic_set_irq_new(kernel_pic(kvm),
+					irq_event.irq,
+					irq_event.level);
+			r = 0;
+		}
+		break;
+	}
 	default:
 		;
 	}
diff --git a/drivers/kvm/virq.c b/drivers/kvm/virq.c
new file mode 100644
index 0000000..a807296
--- /dev/null
+++ b/drivers/kvm/virq.c
@@ -0,0 +1,58 @@
+/*
+ * virq.c: API for in kernel interrupt controller
+ * Copyright (c) 2007, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but
WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License
along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ * Authors:
+ *   Yaozu (Eddie) Dong <Eddie.dong-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
+ *
+ */
+
+#include "kvm.h"
+#include "virq.h"
+
+/*
+ * check if there is pending interrupt without
+ * intack.
+ */
+int cpu_has_interrupt(struct kvm_vcpu *v)
+{
+	struct pic_state2 *s = kernel_pic(v->kvm);
+
+	/* PIC */
+	if (int_output(s))
+		return 1;
+	/* TODO: APIC */
+	return 0;
+}
+
+EXPORT_SYMBOL_GPL(cpu_has_interrupt);
+
+/*
+ * Read pending interrupt vector and intack.
+ */
+int cpu_get_interrupt(struct kvm_vcpu *v)
+{
+	struct pic_state2 *s = kernel_pic(v->kvm);
+	int vector;
+
+	int_output(s) = 0;
+	vector = pic_read_irq(s);
+	if (vector != -1)
+		return vector;
+	/* TODO: APIC */
+	return -1;
+}
+
+EXPORT_SYMBOL_GPL(cpu_get_interrupt);
diff --git a/drivers/kvm/virq.h b/drivers/kvm/virq.h
new file mode 100644
index 0000000..ed467dc
--- /dev/null
+++ b/drivers/kvm/virq.h
@@ -0,0 +1,70 @@
+/*
+ * virq.h: in kernel interrupt controller related definitions
+ * Copyright (c) 2007, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but
WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License
along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ * Authors:
+ *   Yaozu (Eddie) Dong <Eddie.dong-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
+ *
+ */
+
+#ifndef __VIRQ_H
+#define __VIRQ_H
+
+#include "kvm.h"
+
+typedef void SetIRQFunc(void *opaque, int irq_num, int level);
+typedef void IRQRequestFunc(void *opaque, int level);
+
+struct PicState2;
+struct pic_state {
+	uint8_t last_irr;	/* edge detection */
+	uint8_t irr;		/* interrupt request register */
+	uint8_t imr;		/* interrupt mask register */
+	uint8_t isr;		/* interrupt service register */
+	uint8_t priority_add;	/* highest irq priority */
+	uint8_t irq_base;
+	uint8_t read_reg_select;
+	uint8_t poll;
+	uint8_t special_mask;
+	uint8_t init_state;
+	uint8_t auto_eoi;
+	uint8_t rotate_on_auto_eoi;
+	uint8_t special_fully_nested_mode;
+	uint8_t init4;		/* true if 4 byte init */
+	uint8_t elcr;		/* PIIX edge/trigger selection */
+	uint8_t elcr_mask;
+	struct pic_state2 *pics_state;
+};
+
+struct pic_state2 {
+	/* 0 is master pic, 1 is slave pic */
+	struct pic_state pics[2];
+	IRQRequestFunc *irq_request;
+	void *irq_request_opaque;
+	/* IOAPIC callback support */
+	SetIRQFunc *alt_irq_func;
+	void *alt_irq_opaque;
+	int output;		/* intr from master PIC */
+	struct kvm_io_device dev;
+};
+
+#define int_output(s)		(((struct pic_state2 *)s)->output)
+struct pic_state2 *create_pic(struct kvm *kvm);
+void pic_set_irq_new(void *opaque, int irq, int level);
+int pic_read_irq(struct pic_state2 *s);
+int cpu_get_interrupt(struct kvm_vcpu *v);
+int cpu_has_interrupt(struct kvm_vcpu *v);
+
+#endif
diff --git a/drivers/kvm/vmx.c b/drivers/kvm/vmx.c
index b47ddcc..cef0d64 100644
--- a/drivers/kvm/vmx.c
+++ b/drivers/kvm/vmx.c
@@ -16,6 +16,7 @@
  */
 
 #include "kvm.h"
+#include "virq.h"
 #include "vmx.h"
 #include <linux/module.h>
 #include <linux/kernel.h>
@@ -1464,6 +1465,16 @@ static void inject_rmode_irq(struct kvm_vcpu
*vcpu, int irq)
 	vmcs_writel(GUEST_RSP, (vmcs_readl(GUEST_RSP) & ~0xffff) | (sp -
6));
 }
 
+static void vmx_inject_irq(struct kvm_vcpu *vcpu, int irq)
+{
+	if (vcpu->rmode.active) {
+		inject_rmode_irq(vcpu, irq);
+		return;
+	}
+	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
+			irq | INTR_TYPE_EXT_INTR |
INTR_INFO_VALID_MASK);
+}
+
 static void kvm_do_inject_irq(struct kvm_vcpu *vcpu)
 {
 	int word_index = __ffs(vcpu->irq_summary);
@@ -1473,13 +1484,7 @@ static void kvm_do_inject_irq(struct kvm_vcpu
*vcpu)
 	clear_bit(bit_index, &vcpu->irq_pending[word_index]);
 	if (!vcpu->irq_pending[word_index])
 		clear_bit(word_index, &vcpu->irq_summary);
-
-	if (vcpu->rmode.active) {
-		inject_rmode_irq(vcpu, irq);
-		return;
-	}
-	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
-			irq | INTR_TYPE_EXT_INTR |
INTR_INFO_VALID_MASK);
+	vmx_inject_irq(vcpu, irq);
 }
 
 
@@ -1563,7 +1568,7 @@ static int handle_exception(struct kvm_vcpu *vcpu,
struct kvm_run *kvm_run)
 		       "intr info 0x%x\n", __FUNCTION__, vect_info,
intr_info);
 	}
 
-	if (is_external_interrupt(vect_info)) {
+	if (!irqchip_in_kernel(vcpu->kvm) &&
is_external_interrupt(vect_info)) {
 		int irq = vect_info & VECTORING_INFO_VECTOR_MASK;
 		set_bit(irq, vcpu->irq_pending);
 		set_bit(irq / BITS_PER_LONG, &vcpu->irq_summary);
@@ -1894,6 +1899,12 @@ static void post_kvm_run_save(struct kvm_vcpu
*vcpu,
 static int handle_interrupt_window(struct kvm_vcpu *vcpu,
 				   struct kvm_run *kvm_run)
 {
+	u32 cpu_based_vm_exec_control;
+
+	/* clear pending irq */
+	cpu_based_vm_exec_control =
vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
+	cpu_based_vm_exec_control &= ~CPU_BASED_VIRTUAL_INTR_PENDING;
+	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL,
cpu_based_vm_exec_control);
 	/*
 	 * If the user space waits to inject interrupts, exit as soon as
 	 * possible
@@ -1986,6 +1997,56 @@ static void vmx_flush_tlb(struct kvm_vcpu *vcpu)
 	vmcs_writel(GUEST_CR3, vmcs_readl(GUEST_CR3));
 }
 
+void enable_irq_window(struct kvm_vcpu *vcpu)
+{
+	u32 cpu_based_vm_exec_control;
+
+	cpu_based_vm_exec_control =
vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
+	cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
+	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL,
cpu_based_vm_exec_control);
+}
+
+void do_intr_assist(struct kvm_vcpu *vcpu)
+{
+	u32 idtv_info_field, intr_info_field;
+	int has_ext_irq, interrupt_window_open;
+	/* TODO: Move IDT_Vectoring here */
+
+	has_ext_irq = cpu_has_interrupt(vcpu);
+	intr_info_field = vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
+	idtv_info_field = vmcs_read32(IDT_VECTORING_INFO_FIELD);
+	if (intr_info_field & INTR_INFO_VALID_MASK) {
+		if (idtv_info_field & INTR_INFO_VALID_MASK) {
+			/* TODO: fault when IDT_Vectoring */
+			printk(KERN_ERR "Fault when IDT_Vectoring\n");
+		}
+		if (has_ext_irq)
+			enable_irq_window(vcpu);
+		return;
+	}
+	if (unlikely(idtv_info_field & INTR_INFO_VALID_MASK)) {
+		vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, idtv_info_field);
+		vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
+				vmcs_read32(VM_EXIT_INSTRUCTION_LEN));
+
+		if ( unlikely(idtv_info_field & 0x800) ) /* valid error
code */
+			vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
+				vmcs_read32(IDT_VECTORING_ERROR_CODE));
+		if ( unlikely(has_ext_irq) )
+			enable_irq_window(vcpu);
+		return;
+	}
+	if (!has_ext_irq)
+		return;
+	interrupt_window_open =
+		((vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
+		 (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & 3) == 0);
+	if (interrupt_window_open)
+		vmx_inject_irq(vcpu, cpu_get_interrupt(vcpu));
+	else
+		enable_irq_window(vcpu);
+}
+
 static int vmx_vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 {
 	u8 fail;
@@ -1996,9 +2057,6 @@ preempted:
 		kvm_guest_debug_pre(vcpu);
 
 again:
-	if (!vcpu->mmio_read_completed)
-		do_interrupt_requests(vcpu, kvm_run);
-
 	vmx_save_host_state(vcpu);
 	kvm_load_guest_fpu(vcpu);
 
@@ -2012,6 +2070,10 @@ again:
 	vmcs_writel(HOST_CR0, read_cr0());
 
 	local_irq_disable();
+	if (irqchip_in_kernel(vcpu->kvm))
+		do_intr_assist(vcpu);
+	else if (!vcpu->mmio_read_completed)
+		do_interrupt_requests(vcpu, kvm_run);
 
 	vcpu->guest_mode = 1;
 	if (vcpu->requests)
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index e6edca8..fbcfc86 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -41,6 +41,12 @@ struct kvm_memory_alias {
 	__u64 target_phys_addr;
 };
 
+/* for KVM_SET_IRQ_LEVEL */
+struct kvm_irq_level {
+	__u32 irq;
+	__u32 level;
+};
+
 enum kvm_exit_reason {
 	KVM_EXIT_UNKNOWN          = 0,
 	KVM_EXIT_EXCEPTION        = 1,
@@ -282,6 +288,9 @@ struct kvm_signal_mask {
 #define KVM_CREATE_VCPU           _IO(KVMIO,  0x41)
 #define KVM_GET_DIRTY_LOG         _IOW(KVMIO, 0x42, struct
kvm_dirty_log)
 #define KVM_SET_MEMORY_ALIAS      _IOW(KVMIO, 0x43, struct
kvm_memory_alias)
+/* Device model IOC */
+#define KVM_CREATE_PIC		  _IO(KVMIO,  0x60)
+#define KVM_SET_ISA_IRQ_LEVEL	  _IO(KVMIO,  0x61)
 
 /*
  * ioctls for vcpu fds

[-- Attachment #2: pic-kernel-v23.patch --]
[-- Type: application/octet-stream, Size: 22491 bytes --]

commit 58df4587880e541057386abf2aea7b63bd7d9d97
Author: root <root@vt32-pae.(none)>
Date:   Wed Jun 20 15:31:31 2007 +0800

        In kernel PIC support.
        Mixed mode of irqchip has a hole in
        injecting device irq which may be not
        immediately processed and thus difficult
        to handle.
        Move PIC to kernel together with APIC can
        both solve the issue and get performance
        gain.
    
    Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>

diff --git a/drivers/kvm/Makefile b/drivers/kvm/Makefile
index c0a789f..118c79b 100644
--- a/drivers/kvm/Makefile
+++ b/drivers/kvm/Makefile
@@ -2,7 +2,7 @@
 # Makefile for Kernel-based Virtual Machine module
 #
 
-kvm-objs := kvm_main.o mmu.o x86_emulate.o
+kvm-objs := kvm_main.o mmu.o x86_emulate.o i8259.o virq.o
 obj-$(CONFIG_KVM) += kvm.o
 kvm-intel-objs = vmx.o
 obj-$(CONFIG_KVM_INTEL) += kvm-intel.o
diff --git a/drivers/kvm/i8259.c b/drivers/kvm/i8259.c
new file mode 100644
index 0000000..9ba8c1e
--- /dev/null
+++ b/drivers/kvm/i8259.c
@@ -0,0 +1,435 @@
+/*
+ * QEMU 8259 interrupt controller emulation
+ *
+ * Copyright (c) 2003-2004 Fabrice Bellard
+ * Copyright (c) 2007 Intel Corperation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ * Authors:
+ *   Yaozu (Eddie) Dong <Eddie.dong@intel.com>
+ *   Port from Qemu.
+ */
+#include "virq.h"
+#include <linux/mm.h>
+
+/* set irq level. If an edge is detected, then the IRR is set to 1 */
+static inline void pic_set_irq1(struct pic_state *s, int irq, int level)
+{
+	int mask;
+	mask = 1 << irq;
+	if (s->elcr & mask) {
+		/* level triggered */
+		if (level) {
+			s->irr |= mask;
+			s->last_irr |= mask;
+		} else {
+			s->irr &= ~mask;
+			s->last_irr &= ~mask;
+		}
+	} else {
+		/* edge triggered */
+		if (level) {
+			if ((s->last_irr & mask) == 0)
+				s->irr |= mask;
+			s->last_irr |= mask;
+		} else {
+			s->last_irr &= ~mask;
+		}
+	}
+}
+
+/* return the highest priority found in mask (highest = smallest
+   number). Return 8 if no irq */
+static inline int get_priority(struct pic_state *s, int mask)
+{
+	int priority;
+	if (mask == 0)
+		return 8;
+	priority = 0;
+	while ((mask & (1 << ((priority + s->priority_add) & 7))) == 0)
+		priority++;
+	return priority;
+}
+
+/* return the pic wanted interrupt. return -1 if none */
+static int pic_get_irq(struct pic_state *s)
+{
+	int mask, cur_priority, priority;
+
+	mask = s->irr & ~s->imr;
+	priority = get_priority(s, mask);
+	if (priority == 8)
+		return -1;
+	/* compute current priority. If special fully nested mode on the
+	   master, the IRQ coming from the slave is not taken into account
+	   for the priority computation. */
+	mask = s->isr;
+	if (s->special_fully_nested_mode && s == &s->pics_state->pics[0])
+		mask &= ~(1 << 2);
+	cur_priority = get_priority(s, mask);
+	if (priority < cur_priority) {
+		/* higher priority found: an irq should be generated */
+		return (priority + s->priority_add) & 7;
+	} else {
+		return -1;
+	}
+}
+
+/* raise irq to CPU if necessary. must be called every time the active
+   irq may change */
+static void pic_update_irq(struct pic_state2 *s)
+{
+	int irq2, irq;
+
+	/* first look at slave pic */
+	irq2 = pic_get_irq(&s->pics[1]);
+	if (irq2 >= 0) {
+		/* if irq request by slave pic, signal master PIC */
+		pic_set_irq1(&s->pics[0], 2, 1);
+		pic_set_irq1(&s->pics[0], 2, 0);
+	}
+	/* look at requested irq */
+	irq = pic_get_irq(&s->pics[0]);
+	if (irq >= 0) {
+		s->irq_request(s->irq_request_opaque, 1);
+	}
+}
+
+void pic_set_irq_new(void *opaque, int irq, int level)
+{
+	struct pic_state2 *s = opaque;
+
+	pic_set_irq1(&s->pics[irq >> 3], irq & 7, level);
+	/* used for IOAPIC irqs */
+	if (s->alt_irq_func)
+		s->alt_irq_func(s->alt_irq_opaque, irq, level);
+	pic_update_irq(s);
+}
+
+/* acknowledge interrupt 'irq' */
+static inline void pic_intack(struct pic_state *s, int irq)
+{
+	if (s->auto_eoi) {
+		if (s->rotate_on_auto_eoi)
+			s->priority_add = (irq + 1) & 7;
+	} else {
+		s->isr |= (1 << irq);
+	}
+	/* We don't clear a level sensitive interrupt here */
+	if (!(s->elcr & (1 << irq)))
+		s->irr &= ~(1 << irq);
+}
+
+int pic_read_irq(struct pic_state2 *s)
+{
+	int irq, irq2, intno;
+
+	irq = pic_get_irq(&s->pics[0]);
+	if (irq >= 0) {
+		pic_intack(&s->pics[0], irq);
+		if (irq == 2) {
+			irq2 = pic_get_irq(&s->pics[1]);
+			if (irq2 >= 0) {
+				pic_intack(&s->pics[1], irq2);
+			} else {
+				/* spurious IRQ on slave controller */
+				irq2 = 7;
+			}
+			intno = s->pics[1].irq_base + irq2;
+			irq = irq2 + 8;
+		} else {
+			intno = s->pics[0].irq_base + irq;
+		}
+	} else {
+		/* spurious IRQ on host controller */
+		irq = 7;
+		intno = s->pics[0].irq_base + irq;
+	}
+	pic_update_irq(s);
+
+	return intno;
+}
+
+static void pic_reset(void *opaque)
+{
+	struct pic_state *s = opaque;
+
+	s->last_irr = 0;
+	s->irr = 0;
+	s->imr = 0;
+	s->isr = 0;
+	s->priority_add = 0;
+	s->irq_base = 0;
+	s->read_reg_select = 0;
+	s->poll = 0;
+	s->special_mask = 0;
+	s->init_state = 0;
+	s->auto_eoi = 0;
+	s->rotate_on_auto_eoi = 0;
+	s->special_fully_nested_mode = 0;
+	s->init4 = 0;
+	/* Note: ELCR is not reset */
+}
+
+static void pic_ioport_write(void *opaque, uint32_t addr, uint32_t val)
+{
+	struct pic_state *s = opaque;
+	int priority, cmd, irq;
+
+	addr &= 1;
+	if (addr == 0) {
+		if (val & 0x10) {
+			/* init */
+			pic_reset(s);
+			/* deassert a pending interrupt */
+			s->pics_state->irq_request(s->pics_state->
+						   irq_request_opaque, 0);
+			s->init_state = 1;
+			s->init4 = val & 1;
+			if (val & 0x02)
+				printk(KERN_ERR "single mode not supported");
+			if (val & 0x08)
+				printk(KERN_ERR
+				       "level sensitive irq not supported");
+		} else if (val & 0x08) {
+			if (val & 0x04)
+				s->poll = 1;
+			if (val & 0x02)
+				s->read_reg_select = val & 1;
+			if (val & 0x40)
+				s->special_mask = (val >> 5) & 1;
+		} else {
+			cmd = val >> 5;
+			switch (cmd) {
+			case 0:
+			case 4:
+				s->rotate_on_auto_eoi = cmd >> 2;
+				break;
+			case 1:	/* end of interrupt */
+			case 5:
+				priority = get_priority(s, s->isr);
+				if (priority != 8) {
+					irq = (priority + s->priority_add) & 7;
+					s->isr &= ~(1 << irq);
+					if (cmd == 5)
+						s->priority_add = (irq + 1) & 7;
+					pic_update_irq(s->pics_state);
+				}
+				break;
+			case 3:
+				irq = val & 7;
+				s->isr &= ~(1 << irq);
+				pic_update_irq(s->pics_state);
+				break;
+			case 6:
+				s->priority_add = (val + 1) & 7;
+				pic_update_irq(s->pics_state);
+				break;
+			case 7:
+				irq = val & 7;
+				s->isr &= ~(1 << irq);
+				s->priority_add = (irq + 1) & 7;
+				pic_update_irq(s->pics_state);
+				break;
+			default:
+				/* no operation */
+				break;
+			}
+		}
+	} else {
+		switch (s->init_state) {
+		case 0:
+			/* normal mode */
+			s->imr = val;
+			pic_update_irq(s->pics_state);
+			break;
+		case 1:
+			s->irq_base = val & 0xf8;
+			s->init_state = 2;
+			break;
+		case 2:
+			if (s->init4) {
+				s->init_state = 3;
+			} else {
+				s->init_state = 0;
+			}
+			break;
+		case 3:
+			s->special_fully_nested_mode = (val >> 4) & 1;
+			s->auto_eoi = (val >> 1) & 1;
+			s->init_state = 0;
+			break;
+		}
+	}
+}
+
+static uint32_t pic_poll_read(struct pic_state *s, uint32_t addr1)
+{
+	int ret;
+
+	ret = pic_get_irq(s);
+	if (ret >= 0) {
+		if (addr1 >> 7) {
+			s->pics_state->pics[0].isr &= ~(1 << 2);
+			s->pics_state->pics[0].irr &= ~(1 << 2);
+		}
+		s->irr &= ~(1 << ret);
+		s->isr &= ~(1 << ret);
+		if (addr1 >> 7 || ret != 2)
+			pic_update_irq(s->pics_state);
+	} else {
+		ret = 0x07;
+		pic_update_irq(s->pics_state);
+	}
+
+	return ret;
+}
+
+static uint32_t pic_ioport_read(void *opaque, uint32_t addr1)
+{
+	struct pic_state *s = opaque;
+	unsigned int addr;
+	int ret;
+
+	addr = addr1;
+	addr &= 1;
+	if (s->poll) {
+		ret = pic_poll_read(s, addr1);
+		s->poll = 0;
+	} else {
+		if (addr == 0) {
+			if (s->read_reg_select)
+				ret = s->isr;
+			else
+				ret = s->irr;
+		} else {
+			ret = s->imr;
+		}
+	}
+	return ret;
+}
+
+static void elcr_ioport_write(void *opaque, uint32_t addr, uint32_t val)
+{
+	struct pic_state *s = opaque;
+	s->elcr = val & s->elcr_mask;
+}
+
+static uint32_t elcr_ioport_read(void *opaque, uint32_t addr1)
+{
+	struct pic_state *s = opaque;
+	return s->elcr;
+}
+
+static int picdev_in_range(struct kvm_io_device *this, gpa_t addr)
+{
+	switch (addr) {
+		/* Master PIC */
+	case 0x20:
+	case 0x21:
+		/* Slave PIC */
+	case 0xa0:
+	case 0xa1:
+		/* elcr0  & 1 */
+	case 0x4d0:
+	case 0x4d1:
+		return 1;
+	default:
+		return 0;
+	}
+}
+
+static void picdev_write(struct kvm_io_device *this,
+			 gpa_t addr, int len, const void *val)
+{
+	struct pic_state2 *s = this->private;
+	unsigned char data = *(unsigned char *)val;
+
+	if (len != 1) {
+		printk(KERN_ERR "PIC: non byte write\n");
+		return;
+	}
+	switch (addr) {
+	case 0x20:
+	case 0x21:
+	case 0xa0:
+	case 0xa1:
+		pic_ioport_write(&s->pics[addr >> 7], addr, data);
+		break;
+	case 0x4d0:
+	case 0x4d1:
+		elcr_ioport_write(&s->pics[addr & 1], addr, data);
+		break;
+	}
+}
+
+static void picdev_read(struct kvm_io_device *this,
+			gpa_t addr, int len, void *val)
+{
+	struct pic_state2 *s = this->private;
+	unsigned char data = 0;
+
+	if (len != 1) {
+		printk(KERN_ERR "PIC: non byte read\n");
+		return;
+	}
+	switch (addr) {
+	case 0x20:
+	case 0x21:
+	case 0xa0:
+	case 0xa1:
+		data = pic_ioport_read(&s->pics[addr >> 7], addr);
+		break;
+	case 0x4d0:
+	case 0x4d1:
+		data = elcr_ioport_read(&s->pics[addr & 1], addr);
+		break;
+	}
+	*(unsigned char *)val = data;
+}
+
+/* callback when PIC0 irq status changed */
+static void pic_irq_request(void *opaque, int level)
+{
+	struct kvm *kvm = opaque;
+
+	int_output(kvm->vpic) = level;
+}
+
+struct pic_state2 *create_pic(struct kvm *kvm)
+{
+	struct pic_state2 *s;
+	s = kzalloc(sizeof(struct pic_state2), GFP_KERNEL);
+	if (!s)
+		return NULL;
+	s->pics[0].elcr_mask = 0xf8;
+	s->pics[1].elcr_mask = 0xde;
+	s->irq_request = pic_irq_request;
+	s->irq_request_opaque = kvm;
+	s->pics[0].pics_state = s;
+	s->pics[1].pics_state = s;
+
+	/* Initialize PIO device */
+	s->dev.read = picdev_read;
+	s->dev.write = picdev_write;
+	s->dev.in_range = picdev_in_range;
+	s->dev.private = s;
+	kvm_io_bus_register_dev(&kvm->pio_bus, &s->dev);
+	return s;
+}
diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
index a7c5e6b..1e2d3ba 100644
--- a/drivers/kvm/kvm.h
+++ b/drivers/kvm/kvm.h
@@ -456,8 +456,15 @@ struct kvm {
 	struct file *filp;
 	struct kvm_io_bus mmio_bus;
 	struct kvm_io_bus pio_bus;
+	void *vpic;
 };
 
+#define kernel_pic(kvm) ((kvm)->vpic)
+static inline int irqchip_in_kernel(struct kvm *kvm)
+{
+	return (kernel_pic(kvm) != 0);
+}
+
 struct descriptor_table {
 	u16 limit;
 	unsigned long base;
diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index 7826f16..25f99fc 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -16,6 +16,7 @@
  */
 
 #include "kvm.h"
+#include "virq.h"
 
 #include <linux/kvm.h>
 #include <linux/module.h>
@@ -476,6 +477,8 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	spin_lock(&kvm_lock);
 	list_del(&kvm->vm_list);
 	spin_unlock(&kvm_lock);
+	if (kernel_pic(kvm))
+		kfree(kernel_pic(kvm));
 	kvm_io_bus_destroy(&kvm->pio_bus);
 	kvm_io_bus_destroy(&kvm->mmio_bus);
 	kvm_free_vcpus(kvm);
@@ -1372,7 +1375,8 @@ EXPORT_SYMBOL_GPL(emulate_instruction);
 
 int kvm_emulate_halt(struct kvm_vcpu *vcpu)
 {
-	if (vcpu->irq_summary)
+	if (vcpu->irq_summary ||
+           (irqchip_in_kernel(vcpu->kvm) && cpu_has_interrupt(vcpu)))
 		return 1;
 
 	vcpu->run->exit_reason = KVM_EXIT_HLT;
@@ -2855,6 +2859,26 @@ static long kvm_vm_ioctl(struct file *filp,
 			goto out;
 		break;
 	}
+	case KVM_CREATE_PIC:
+		r = -ENOMEM;
+		kernel_pic(kvm) = create_pic(kvm);
+		if (kernel_pic(kvm)) {
+			r = 0;
+		}
+		break;
+	case KVM_SET_ISA_IRQ_LEVEL: {
+		struct kvm_irq_level irq_event;
+		r = -EFAULT;
+		if (copy_from_user(&irq_event, argp, sizeof irq_event))
+			goto out;
+		if (irqchip_in_kernel(kvm)) {
+			pic_set_irq_new(kernel_pic(kvm),
+					irq_event.irq,
+					irq_event.level);
+			r = 0;
+		}
+		break;
+	}
 	default:
 		;
 	}
diff --git a/drivers/kvm/virq.c b/drivers/kvm/virq.c
new file mode 100644
index 0000000..a807296
--- /dev/null
+++ b/drivers/kvm/virq.c
@@ -0,0 +1,58 @@
+/*
+ * virq.c: API for in kernel interrupt controller
+ * Copyright (c) 2007, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ * Authors:
+ *   Yaozu (Eddie) Dong <Eddie.dong@intel.com>
+ *
+ */
+
+#include "kvm.h"
+#include "virq.h"
+
+/*
+ * check if there is pending interrupt without
+ * intack.
+ */
+int cpu_has_interrupt(struct kvm_vcpu *v)
+{
+	struct pic_state2 *s = kernel_pic(v->kvm);
+
+	/* PIC */
+	if (int_output(s))
+		return 1;
+	/* TODO: APIC */
+	return 0;
+}
+
+EXPORT_SYMBOL_GPL(cpu_has_interrupt);
+
+/*
+ * Read pending interrupt vector and intack.
+ */
+int cpu_get_interrupt(struct kvm_vcpu *v)
+{
+	struct pic_state2 *s = kernel_pic(v->kvm);
+	int vector;
+
+	int_output(s) = 0;
+	vector = pic_read_irq(s);
+	if (vector != -1)
+		return vector;
+	/* TODO: APIC */
+	return -1;
+}
+
+EXPORT_SYMBOL_GPL(cpu_get_interrupt);
diff --git a/drivers/kvm/virq.h b/drivers/kvm/virq.h
new file mode 100644
index 0000000..ed467dc
--- /dev/null
+++ b/drivers/kvm/virq.h
@@ -0,0 +1,70 @@
+/*
+ * virq.h: in kernel interrupt controller related definitions
+ * Copyright (c) 2007, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ * Authors:
+ *   Yaozu (Eddie) Dong <Eddie.dong@intel.com>
+ *
+ */
+
+#ifndef __VIRQ_H
+#define __VIRQ_H
+
+#include "kvm.h"
+
+typedef void SetIRQFunc(void *opaque, int irq_num, int level);
+typedef void IRQRequestFunc(void *opaque, int level);
+
+struct PicState2;
+struct pic_state {
+	uint8_t last_irr;	/* edge detection */
+	uint8_t irr;		/* interrupt request register */
+	uint8_t imr;		/* interrupt mask register */
+	uint8_t isr;		/* interrupt service register */
+	uint8_t priority_add;	/* highest irq priority */
+	uint8_t irq_base;
+	uint8_t read_reg_select;
+	uint8_t poll;
+	uint8_t special_mask;
+	uint8_t init_state;
+	uint8_t auto_eoi;
+	uint8_t rotate_on_auto_eoi;
+	uint8_t special_fully_nested_mode;
+	uint8_t init4;		/* true if 4 byte init */
+	uint8_t elcr;		/* PIIX edge/trigger selection */
+	uint8_t elcr_mask;
+	struct pic_state2 *pics_state;
+};
+
+struct pic_state2 {
+	/* 0 is master pic, 1 is slave pic */
+	struct pic_state pics[2];
+	IRQRequestFunc *irq_request;
+	void *irq_request_opaque;
+	/* IOAPIC callback support */
+	SetIRQFunc *alt_irq_func;
+	void *alt_irq_opaque;
+	int output;		/* intr from master PIC */
+	struct kvm_io_device dev;
+};
+
+#define int_output(s)		(((struct pic_state2 *)s)->output)
+struct pic_state2 *create_pic(struct kvm *kvm);
+void pic_set_irq_new(void *opaque, int irq, int level);
+int pic_read_irq(struct pic_state2 *s);
+int cpu_get_interrupt(struct kvm_vcpu *v);
+int cpu_has_interrupt(struct kvm_vcpu *v);
+
+#endif
diff --git a/drivers/kvm/vmx.c b/drivers/kvm/vmx.c
index b47ddcc..cef0d64 100644
--- a/drivers/kvm/vmx.c
+++ b/drivers/kvm/vmx.c
@@ -16,6 +16,7 @@
  */
 
 #include "kvm.h"
+#include "virq.h"
 #include "vmx.h"
 #include <linux/module.h>
 #include <linux/kernel.h>
@@ -1464,6 +1465,16 @@ static void inject_rmode_irq(struct kvm_vcpu *vcpu, int irq)
 	vmcs_writel(GUEST_RSP, (vmcs_readl(GUEST_RSP) & ~0xffff) | (sp - 6));
 }
 
+static void vmx_inject_irq(struct kvm_vcpu *vcpu, int irq)
+{
+	if (vcpu->rmode.active) {
+		inject_rmode_irq(vcpu, irq);
+		return;
+	}
+	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
+			irq | INTR_TYPE_EXT_INTR | INTR_INFO_VALID_MASK);
+}
+
 static void kvm_do_inject_irq(struct kvm_vcpu *vcpu)
 {
 	int word_index = __ffs(vcpu->irq_summary);
@@ -1473,13 +1484,7 @@ static void kvm_do_inject_irq(struct kvm_vcpu *vcpu)
 	clear_bit(bit_index, &vcpu->irq_pending[word_index]);
 	if (!vcpu->irq_pending[word_index])
 		clear_bit(word_index, &vcpu->irq_summary);
-
-	if (vcpu->rmode.active) {
-		inject_rmode_irq(vcpu, irq);
-		return;
-	}
-	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
-			irq | INTR_TYPE_EXT_INTR | INTR_INFO_VALID_MASK);
+	vmx_inject_irq(vcpu, irq);
 }
 
 
@@ -1563,7 +1568,7 @@ static int handle_exception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 		       "intr info 0x%x\n", __FUNCTION__, vect_info, intr_info);
 	}
 
-	if (is_external_interrupt(vect_info)) {
+	if (!irqchip_in_kernel(vcpu->kvm) && is_external_interrupt(vect_info)) {
 		int irq = vect_info & VECTORING_INFO_VECTOR_MASK;
 		set_bit(irq, vcpu->irq_pending);
 		set_bit(irq / BITS_PER_LONG, &vcpu->irq_summary);
@@ -1894,6 +1899,12 @@ static void post_kvm_run_save(struct kvm_vcpu *vcpu,
 static int handle_interrupt_window(struct kvm_vcpu *vcpu,
 				   struct kvm_run *kvm_run)
 {
+	u32 cpu_based_vm_exec_control;
+
+	/* clear pending irq */
+	cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
+	cpu_based_vm_exec_control &= ~CPU_BASED_VIRTUAL_INTR_PENDING;
+	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control);
 	/*
 	 * If the user space waits to inject interrupts, exit as soon as
 	 * possible
@@ -1986,6 +1997,56 @@ static void vmx_flush_tlb(struct kvm_vcpu *vcpu)
 	vmcs_writel(GUEST_CR3, vmcs_readl(GUEST_CR3));
 }
 
+void enable_irq_window(struct kvm_vcpu *vcpu)
+{
+	u32 cpu_based_vm_exec_control;
+
+	cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
+	cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
+	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control);
+}
+
+void do_intr_assist(struct kvm_vcpu *vcpu)
+{
+	u32 idtv_info_field, intr_info_field;
+	int has_ext_irq, interrupt_window_open;
+	/* TODO: Move IDT_Vectoring here */
+
+	has_ext_irq = cpu_has_interrupt(vcpu);
+	intr_info_field = vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
+	idtv_info_field = vmcs_read32(IDT_VECTORING_INFO_FIELD);
+	if (intr_info_field & INTR_INFO_VALID_MASK) {
+		if (idtv_info_field & INTR_INFO_VALID_MASK) {
+			/* TODO: fault when IDT_Vectoring */
+			printk(KERN_ERR "Fault when IDT_Vectoring\n");
+		}
+		if (has_ext_irq)
+			enable_irq_window(vcpu);
+		return;
+	}
+	if (unlikely(idtv_info_field & INTR_INFO_VALID_MASK)) {
+		vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, idtv_info_field);
+		vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
+				vmcs_read32(VM_EXIT_INSTRUCTION_LEN));
+
+		if ( unlikely(idtv_info_field & 0x800) ) /* valid error code */
+			vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
+				vmcs_read32(IDT_VECTORING_ERROR_CODE));
+		if ( unlikely(has_ext_irq) )
+			enable_irq_window(vcpu);
+		return;
+	}
+	if (!has_ext_irq)
+		return;
+	interrupt_window_open =
+		((vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
+		 (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & 3) == 0);
+	if (interrupt_window_open)
+		vmx_inject_irq(vcpu, cpu_get_interrupt(vcpu));
+	else
+		enable_irq_window(vcpu);
+}
+
 static int vmx_vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 {
 	u8 fail;
@@ -1996,9 +2057,6 @@ preempted:
 		kvm_guest_debug_pre(vcpu);
 
 again:
-	if (!vcpu->mmio_read_completed)
-		do_interrupt_requests(vcpu, kvm_run);
-
 	vmx_save_host_state(vcpu);
 	kvm_load_guest_fpu(vcpu);
 
@@ -2012,6 +2070,10 @@ again:
 	vmcs_writel(HOST_CR0, read_cr0());
 
 	local_irq_disable();
+	if (irqchip_in_kernel(vcpu->kvm))
+		do_intr_assist(vcpu);
+	else if (!vcpu->mmio_read_completed)
+		do_interrupt_requests(vcpu, kvm_run);
 
 	vcpu->guest_mode = 1;
 	if (vcpu->requests)
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index e6edca8..fbcfc86 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -41,6 +41,12 @@ struct kvm_memory_alias {
 	__u64 target_phys_addr;
 };
 
+/* for KVM_SET_IRQ_LEVEL */
+struct kvm_irq_level {
+	__u32 irq;
+	__u32 level;
+};
+
 enum kvm_exit_reason {
 	KVM_EXIT_UNKNOWN          = 0,
 	KVM_EXIT_EXCEPTION        = 1,
@@ -282,6 +288,9 @@ struct kvm_signal_mask {
 #define KVM_CREATE_VCPU           _IO(KVMIO,  0x41)
 #define KVM_GET_DIRTY_LOG         _IOW(KVMIO, 0x42, struct kvm_dirty_log)
 #define KVM_SET_MEMORY_ALIAS      _IOW(KVMIO, 0x43, struct kvm_memory_alias)
+/* Device model IOC */
+#define KVM_CREATE_PIC		  _IO(KVMIO,  0x60)
+#define KVM_SET_ISA_IRQ_LEVEL	  _IO(KVMIO,  0x61)
 
 /*
  * ioctls for vcpu fds

[-- Attachment #3: Type: text/plain, Size: 286 bytes --]

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

[-- Attachment #4: Type: text/plain, Size: 186 bytes --]

_______________________________________________
kvm-devel mailing list
kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org
https://lists.sourceforge.net/lists/listinfo/kvm-devel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2007-07-08  8:18 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-06-25 22:23 In kernel PIC support: kernel patch Gregory Haskins
     [not found] ` <468008090200005A00026619-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
2007-06-25 22:25   ` Avi Kivity
2007-06-26  0:30   ` Dong, Eddie
2007-06-27 15:50   ` Luca
     [not found] <468E0FD2.2090305@qumranet.com>
     [not found] ` <468E0FD2.2090305-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-07-06 17:46   ` Dong, Eddie
     [not found]     ` <10EA09EFD8728347A513008B6B0DA77A01B8F3C5-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2007-07-08  8:18       ` Avi Kivity
  -- strict thread matches above, loose matches on Subject: below --
2007-06-25 22:36 Gregory Haskins
2007-06-25 14:10 Gregory Haskins
     [not found] ` <467F94850200005A00026595-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
2007-06-25 21:13   ` Dor Laor
     [not found]     ` <64F9B87B6B770947A9F8391472E032160C654F85-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org>
2007-06-25 21:57       ` Avi Kivity
2007-06-25 13:53 Gregory Haskins
2007-06-25 13:45 Gregory Haskins
2007-06-25 13:44 Gregory Haskins
2007-06-25 13:43 Gregory Haskins
     [not found] ` <467F8E260200005A00026573-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
2007-06-25 13:54   ` Dong, Eddie
2007-06-25 22:08   ` Avi Kivity
2007-06-25 13:26 Gregory Haskins
     [not found] ` <467F8A2D0200005A00026567-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
2007-06-25 22:00   ` Avi Kivity
2007-06-22 12:14 Gregory Haskins
     [not found] ` <467B84EF0200005A00026401-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
2007-06-23  7:53   ` Avi Kivity
2007-06-22 12:06 Gregory Haskins
     [not found] ` <467B830B0200005A000263FB-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
2007-06-22 15:09   ` Dong, Eddie
     [not found]     ` <10EA09EFD8728347A513008B6B0DA77A014E8AC7-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2007-06-23 17:41       ` Avi Kivity
     [not found]         ` <467D5B5B.9090702-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-06-25  3:32           ` Dong, Eddie
     [not found]             ` <10EA09EFD8728347A513008B6B0DA77A01AA46F5-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2007-06-25  3:39               ` Avi Kivity
     [not found]                 ` <467F38F7.9090101-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-06-25  4:11                   ` Dong, Eddie
2007-06-22 12:02 Gregory Haskins
2007-06-22  3:49 Gregory Haskins
     [not found] ` <467B0E7D0200005A000263E4-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
2007-06-22  7:25   ` Dong, Eddie
2007-06-22  9:57   ` Avi Kivity
2007-06-21 16:44 Gregory Haskins
     [not found] ` <467A72970200005A00026354-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
2007-06-22  9:46   ` Avi Kivity
2007-06-21 15:43 Gregory Haskins
2007-06-21 15:42 Gregory Haskins
2007-06-21 15:17 Gregory Haskins
2007-06-21 15:10 Gregory Haskins
     [not found] ` <467A5C8C0200005A00026321-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
2007-06-21 15:13   ` Dong, Eddie
2007-06-21 16:31   ` Avi Kivity
     [not found]     ` <467AA7FF.5000701-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-06-22  2:09       ` Dong, Eddie
     [not found]         ` <10EA09EFD8728347A513008B6B0DA77A01A56750-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2007-06-22 10:32           ` Avi Kivity
2007-06-21 14:01 Gregory Haskins
     [not found] ` <467A4C970200005A000262FD-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
2007-06-21 14:57   ` Dong, Eddie
2007-06-21 13:31 Gregory Haskins
     [not found] ` <467A45690200005A000262F0-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
2007-06-21 14:28   ` Dong, Eddie
2007-06-21 12:36 Gregory Haskins
     [not found] ` <467A38720200005A000262C8-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
2007-06-21 13:02   ` Dong, Eddie
2007-06-20  7:43 Dong, Eddie
     [not found] ` <10EA09EFD8728347A513008B6B0DA77A01A55F0A-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2007-06-22 10:50   ` Avi Kivity
     [not found]     ` <467BA95A.704-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-06-22 14:21       ` Rusty Russell
     [not found]         ` <1182522074.17478.45.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2007-06-23  7:50           ` Avi Kivity
2007-06-25  1:45       ` Dong, Eddie
     [not found]         ` <10EA09EFD8728347A513008B6B0DA77A01AA45AD-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2007-06-25  3:28           ` Avi Kivity
2007-06-27 15:25       ` Dong, Eddie
     [not found]         ` <10EA09EFD8728347A513008B6B0DA77A01AE9255-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2007-07-02 23:53           ` Avi Kivity
     [not found]             ` <46898FE6.7040709-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-07-03  4:45               ` Dong, Eddie
     [not found]                 ` <10EA09EFD8728347A513008B6B0DA77A01B40B4A-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2007-07-03 14:52                   ` Avi Kivity
2007-07-03  7:19               ` Dong, Eddie

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox