Re: In kernel PIC support: kernel patch

public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed

* Re: In kernel PIC support: kernel patch
@ 2007-06-25 22:23 Gregory Haskins
       [not found] ` <468008090200005A00026619-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Gregory Haskins @ 2007-06-25 22:23 UTC (permalink / raw)
  To: avi-atKUWr5tajBWk0Htik3J/w; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Mon, 2007-06-25 at 18:08 -0400, Avi Kivity wrote:
> There's a slight issue that Windows with the ACPI HAL 
> is dog slow when virtulalized, but maybe we can find another HAL that 
> supports ACPI but doesn't use the APIC.

Is this in reference to the frequent TPR updates, or is there something
else that is also slow with ACPI enabled?

> 
> > Note that there is another standard that would allow us to use built-in
> > routing without ACPI/IOAPIC: MSI.  I know the battle over
> > virtual-bus-rendering is still raging w.r.t. to PCI or not-PCI.  But I
> > will point out that if we do use PCI, setting the PV devices up as MSI
> > capable in the config space potentially eliminates the need to also wire
> > them to an IOAPIC.  They will get their vector data from the MSI setup
> > and can then send APIC messages directly.  (I think there is some
> > confusion about how we can do MSI later in this thread...I will reply to
> > those mails separately)
> >   
> 
> MSI is also a good solution; but do all semi-modern guest OSes support it?

I can't say for sure, but I would guess "yes" because IIUC PCI-Express
only uses MSI (or at least heavily prefers it).


> No, a single ioapic in the kernel, with one interrupt line that is 
> shared between a kernel-provided device and a userspace-provided 
> device.  The kernel would need to provide the OR function that sharing 
> implies.

Got it.  I was confused earlier.

> 
> 
> > If we do end up needing the IOAPIC in the
> > kernel and userspace, I would suggest using two IOAPICs (we just need to
> > update the ACPI tables to say there are two) so that they can operate
> > independently.  This should be supportable from a virtual system
> > infrastructure standpoint, and is much cleaner IMHO.  If you were
> > talking about a different scenario. please clarify
> 
> I was talking about a different scenario, but your solution is cleaner 
> (though it involves more hand-dirtying with the ACPI tables).  Basically 
> each PCI IRQ would have its own dedicated line.

Right. Or more precisely, *anything* you want can have its own line ;)
IIUC a standard IOAPIC has 24 input lines, so you just need one for
every 24 devices allocated.




-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

[parent not found: <468008090200005A00026619-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>]

* Re: In kernel PIC support: kernel patch
       [not found] ` <468008090200005A00026619-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
@ 2007-06-25 22:25   ` Avi Kivity
  2007-06-26  0:30   ` Dong, Eddie
  2007-06-27 15:50   ` Luca
  2 siblings, 0 replies; 57+ messages in thread
From: Avi Kivity @ 2007-06-25 22:25 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Gregory Haskins wrote:
> On Mon, 2007-06-25 at 18:08 -0400, Avi Kivity wrote:
>   
>> There's a slight issue that Windows with the ACPI HAL 
>> is dog slow when virtulalized, but maybe we can find another HAL that 
>> supports ACPI but doesn't use the APIC.
>>     
>
> Is this in reference to the frequent TPR updates, or is there something
> else that is also slow with ACPI enabled?
>
>   

Just the good old tpr.



-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: In kernel PIC support: kernel patch
       [not found] ` <468008090200005A00026619-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
  2007-06-25 22:25   ` Avi Kivity
@ 2007-06-26  0:30   ` Dong, Eddie
  2007-06-27 15:50   ` Luca
  2 siblings, 0 replies; 57+ messages in thread
From: Dong, Eddie @ 2007-06-26  0:30 UTC (permalink / raw)
  To: Gregory Haskins, avi-atKUWr5tajBWk0Htik3J/w
  Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

>> 
>> MSI is also a good solution; but do all semi-modern guest 
>OSes support it?

Linux support MSI from 2.6.18, Vista has full support. But XP doesn't.

>
>I can't say for sure, but I would guess "yes" because IIUC PCI-Express
>only uses MSI (or at least heavily prefers it).
>
>
>> 
>> I was talking about a different scenario, but your solution 
>is cleaner 
>> (though it involves more hand-dirtying with the ACPI 
>tables).  Basically 
>> each PCI IRQ would have its own dedicated line.
>
>Right. Or more precisely, *anything* you want can have its own line ;)
>IIUC a standard IOAPIC has 24 input lines, so you just need one for
>every 24 devices allocated.
>

This can work but I think it is unnecessary. The benfit I can see is
that
we don't need to share irq with pv driver, but it still has limit of pv
irq lines. 
Here if we add an IOAPIC, we got 24 additional pv irq lines, while it
still 
may be not enough, then we need to go back to share irq line policy
again.

So I think simplicity is better until we see performance gain.

BTW, what I say migration issue is that if we add additional IOAPIC,
even we 
don't use any pv drivers, migration between Qemu and KVM becomes
difficult.
thx, eddie

>

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: In kernel PIC support: kernel patch
       [not found] ` <468008090200005A00026619-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
  2007-06-25 22:25   ` Avi Kivity
  2007-06-26  0:30   ` Dong, Eddie
@ 2007-06-27 15:50   ` Luca
  2 siblings, 0 replies; 57+ messages in thread
From: Luca @ 2007-06-27 15:50 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On 6/26/07, Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org> wrote:
> On Mon, 2007-06-25 at 18:08 -0400, Avi Kivity wrote:
> > > Note that there is another standard that would allow us to use built-in
> > > routing without ACPI/IOAPIC: MSI.  I know the battle over
> > > virtual-bus-rendering is still raging w.r.t. to PCI or not-PCI.  But I
> > > will point out that if we do use PCI, setting the PV devices up as MSI
> > > capable in the config space potentially eliminates the need to also wire
> > > them to an IOAPIC.  They will get their vector data from the MSI setup
> > > and can then send APIC messages directly.  (I think there is some
> > > confusion about how we can do MSI later in this thread...I will reply to
> > > those mails separately)
> > >
> >
> > MSI is also a good solution; but do all semi-modern guest OSes support it?
>
> I can't say for sure, but I would guess "yes" because IIUC PCI-Express
> only uses MSI (or at least heavily prefers it).

AFAIK for PCI-E devices MSI support is mandated by the standard. OTHO
there's also the  INTx emulation (the device *always* uses a write
transaction to signal the interrupt condition, and the host bridge may
turn the message into a legacy interrupt) so that the OS can drive the
device even without fiddling with MSI.
As Eddie said XP probably would be confused by MSI-only devices.

Luca

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

[parent not found: <468E0FD2.2090305@qumranet.com>]

[parent not found: <468E0FD2.2090305-atKUWr5tajBWk0Htik3J/w@public.gmane.org>]

* Re: In kernel PIC support: kernel patch
       [not found] ` <468E0FD2.2090305-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-07-06 17:46   ` Dong, Eddie
       [not found]     ` <10EA09EFD8728347A513008B6B0DA77A01B8F3C5-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Dong, Eddie @ 2007-07-06 17:46 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Avi Kivity wrote:
> Merged both into the 'lapic2' branch.  Got the following oops running
> Windows (i386 on Intel): 
> 

Avi:
	This one should fixed the bug.
	Against	lapic2  dc03e1b70198e0e4aadfc9063399590fa42cf55e.
thx,eddie

Destroy io bus earlier than all the registered bus node get freed
including vpic.

Signed-off-by: Yaozu (Eddie) Dong <eddie.dong-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index d785b2e..06e5804 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -427,9 +427,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
        spin_lock(&kvm_lock);
        list_del(&kvm->vm_list);
        spin_unlock(&kvm_lock);
-       kfree(kvm->vpic);
        kvm_io_bus_destroy(&kvm->pio_bus);
        kvm_io_bus_destroy(&kvm->mmio_bus);
+       kfree(kvm->vpic);
        kvm_free_vcpus(kvm);
        kvm_free_physmem(kvm);
        kfree(kvm);

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply related	[flat|nested] 57+ messages in thread

[parent not found: <10EA09EFD8728347A513008B6B0DA77A01B8F3C5-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>]

* Re: In kernel PIC support: kernel patch
       [not found]     ` <10EA09EFD8728347A513008B6B0DA77A01B8F3C5-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2007-07-08  8:18       ` Avi Kivity
  0 siblings, 0 replies; 57+ messages in thread
From: Avi Kivity @ 2007-07-08  8:18 UTC (permalink / raw)
  To: Dong, Eddie; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Dong, Eddie wrote:
> Avi Kivity wrote:
>   
>> Merged both into the 'lapic2' branch.  Got the following oops running
>> Windows (i386 on Intel): 
>>
>>     
>
> Avi:
> 	This one should fixed the bug.
> 	Against	lapic2  dc03e1b70198e0e4aadfc9063399590fa42cf55e.
> thx,eddie
>
> Destroy io bus earlier than all the registered bus node get freed
> including vpic.
>
>   

Applied to lapic2 branch.

-- 
error compiling committee.c: too many arguments to function


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: In kernel PIC support: kernel patch
@ 2007-06-25 22:36 Gregory Haskins
  0 siblings, 0 replies; 57+ messages in thread
From: Gregory Haskins @ 2007-06-25 22:36 UTC (permalink / raw)
  To: avi-atKUWr5tajBWk0Htik3J/w; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Mon, 2007-06-25 at 18:25 -0400, Avi Kivity wrote:
> Gregory Haskins wrote:
> > On Mon, 2007-06-25 at 18:08 -0400, Avi Kivity wrote:
> >   
> >> There's a slight issue that Windows with the ACPI HAL 
> >> is dog slow when virtulalized, but maybe we can find another HAL that 
> >> supports ACPI but doesn't use the APIC.
> >>     
> >
> > Is this in reference to the frequent TPR updates, or is there something
> > else that is also slow with ACPI enabled?
> >
> >   
> 
> Just the good old tpr.

Ah, yes.  I actually find the later versions of KVM to perform quite
well even with XP+ACPI.  I think its all the work that went into the VMX
light-exits, mostly.  Things seem to run even better with the in-kernel
APIC patch enabled, but admittedly I haven't run any benchmarks on it to
tell for sure.  This is all "seat of the pants" ;).  I am not exactly
sure why the in-kernel APIC would/could make a difference since TPRs are
light-exits in either case IIUC...

I am very interested to see what kind of gains we can get when TPR
shadowing is implemented properly (my attempt wasn't even fully
functional if you recall).  I seem to recall reading somewhere that
TPR-shadow features can work even with 32 bit OSes (I guess MOV-TO-CR8
works in 32-bit mode?) but I don't know if XP would use it or the MMIO
path.  Im guessing MMIO, but perhaps MS will patch it since they are
interested in VMM technology now.  Does AMD support a similar notion of
TPR-shadow? 

-Greg

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: In kernel PIC support: kernel patch
@ 2007-06-25 14:10 Gregory Haskins
       [not found] ` <467F94850200005A00026595-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Gregory Haskins @ 2007-06-25 14:10 UTC (permalink / raw)
  To: eddie.dong-ral2JQCrhuEAvxtiuMwx3w
  Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Mon, 2007-06-25 at 21:54 +0800, Dong, Eddie wrote:
>  
> >-----Original Message-----
> >From: Gregory Haskins [mailto:ghaskins@novell.com] 
> >Sent: 2007年6月25日 21:43
> >To: avi@qumranet.com
> >Cc: Dong, Eddie; kvm-devel@lists.sourceforge.net
> >Subject: Re: [kvm-devel] In kernel PIC support: kernel patch
> >
> >On Sat, 2007-06-23 at 20:41 +0300, Avi Kivity wrote:
> >> Dong, Eddie wrote:
> >> > 3: IOAPIC position
> >> > 	Though it looks like neutral to have this one in user or kernel
> >> > space,
> >> > but I'd like to suggest we only support one model. 
> >Considering future
> >> > VT-d
> >> > case, hypervisor need to inject an IRQ directly in KVM (still thru
> >> > IOAPIC)
> >> > without going to user level, so probably moving IOAPIC to 
> >kernel is good
> >> >
> >> > in this point.
> >> >   
> >> 
> >> Even paravirt device drivers will want to inject IRQs via the ioapic 
> >> (when the guest is not paravirt_ops enabled, like older Linux and 
> >> Windows). 
> >
> >Note that its probably not a requirement to do so.  The IOAPIC
> >essentially provides a standard mechanism to map input "irq pins" to
> >APIC messages.  A pv driver could conceivably call kvm_apicbus_send()
> >directly if I knew its vector instead of calling ioapic_setpin() and it
> >would achieve the same goal.
> 
> Basically we are talking about same thing. calling kvm_apicbus_send or
> raise a virtual ioapic pin is basically same. The minor different is who does
> the translation from pin to APIC message, device model or IOAPIC model.
> 
> Using IOAPIC pin probably makes PV driver writter a little bit easier and 
> may be reused by other VMM easily in future.

I fully agree with everything you say here.  I just wanted to point out
that pv-devices dont have a hard-requirement for an IOAPIC connection in
case it makes something easier for someone. ;)

> 
> >
> >What is nice about using an IOAPIC is that the irq routing can be done
> >for you automatically by any ACPI compliant OS.  The device just has to
> >say "raise my output pin" and the IOAPIC translates that into a vector.
> >If you forgo the IOAPIC in favor of direct APIC messages, you have to
> >solve the problem of irq-routing a different way.  
> >
> >Note that there is another standard that would allow us to use built-in
> >routing without ACPI/IOAPIC: MSI.  I know the battle over
> >virtual-bus-rendering is still raging w.r.t. to PCI or not-PCI.  But I
> >will point out that if we do use PCI, setting the PV devices up as MSI
> >capable in the config space potentially eliminates the need to 
> >also wire
> >them to an IOAPIC.  They will get their vector data from the MSI setup
> >and can then send APIC messages directly.  (I think there is some
> >confusion about how we can do MSI later in this thread...I 
> >will reply to
> >those mails separately)
> 
> Yes, all you mentioned here is correct, but it doesn;t impact the
> necessity of moving IOAPIC to kernel, see below.
> 
> >
> >Thoughts?
> >
> >
> >
> >> It's probably okay to implement the device side of a block 
> >> device in qemu, but more difficult for networking. If we have device 
> >> implementations in the kernel then we'll need an ioapic in 
> >the kernel.
> >
> >> 
> >> Also, if we end up sharing interrupts between the kernel and 
> >userspace, 
> >> we'll need the kernel to perform the OR between the level 
> >specified by 
> >> the kernel devices and the level specified by userspace.
> >
> >I not sure I fully understand what you are getting at here.  It sounds
> >like you are talking about splitting a single IOAPIC into both a user
> >and kernel space device?  If we do end up needing the IOAPIC in the
> 
> No, the IOAPIC is in kernel, but some device such as IDE are in user while others
> are in kernel such as pv NIC.
> 
> Due to this, a kernel pv driver will ask for IOAPIC service, so we have to keep 
> it in kernel, otherwise we have to go back to user when kernel device fires, which
> is very slow.
> 
> But, like you mentio> it helps, but then 1: we have to implement MSI now, 2: a little bit more complicated
> in writting pv driver.


Ok, I think I understand what Avi was saying now.  The "or" operation is
in reference to a shared irq-line coming into the IOAPIC. e.g. We need
to maintain a logical OR against the current level of a particular line.
I thought he was talking about maintaining internal IOAPIC state between
user/kern, which is probably messier than two IOAPICs. ;)

> 
> >kernel and userspace, I would suggest using two IOAPICs (we 
> >just need to
> >update the ACPI tables to say there are two) so that they can operate
> >independently.  This should be supportable from a virtual system
> 
> It can, but a little bit more complicated, and probably unable to live
> migration between KVM and Qemu since Qemu only has one IOAPIC.

Good point, but is that a design requirement?


> 
> >infrastructure standpoint, and is much cleaner IMHO.  If you were
> >talking about a different scenario. please clarify.
> >
> 
> Thx,eddie


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

[parent not found: <467F94850200005A00026595-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>]

* Re: In kernel PIC support: kernel patch
       [not found] ` <467F94850200005A00026595-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
@ 2007-06-25 21:13   ` Dor Laor
       [not found]     ` <64F9B87B6B770947A9F8391472E032160C654F85-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Dor Laor @ 2007-06-25 21:13 UTC (permalink / raw)
  To: Gregory Haskins, eddie.dong-ral2JQCrhuEAvxtiuMwx3w
  Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

>>
>> It can, but a little bit more complicated, and probably unable to
live
>> migration between KVM and Qemu since Qemu only has one IOAPIC.
>
>Good point, but is that a design requirement?
>

Migration between KVM and qemu is not a requirement and I don't see
important value there. When many PV devices will be around it wouldn't
be possible also.

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

[parent not found: <64F9B87B6B770947A9F8391472E032160C654F85-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org>]

* Re: In kernel PIC support: kernel patch
       [not found]     ` <64F9B87B6B770947A9F8391472E032160C654F85-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org>
@ 2007-06-25 21:57       ` Avi Kivity
  0 siblings, 0 replies; 57+ messages in thread
From: Avi Kivity @ 2007-06-25 21:57 UTC (permalink / raw)
  To: Dor Laor; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Dor Laor wrote:
>>> It can, but a little bit more complicated, and probably unable to
>>>       
> live
>   
>>> migration between KVM and Qemu since Qemu only has one IOAPIC.
>>>       
>> Good point, but is that a design requirement?
>>
>>     
>
> Migration between KVM and qemu is not a requirement and I don't see
> important value there. When many PV devices will be around it wouldn't
> be possible also.
>   

If device emulation for these pv devices is added to qemu, or if the 
in-kernel pv devices have an interface that allows qemu to pass a 
hypercall, then it can be made to work.



-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: In kernel PIC support: kernel patch
@ 2007-06-25 13:53 Gregory Haskins
  0 siblings, 0 replies; 57+ messages in thread
From: Gregory Haskins @ 2007-06-25 13:53 UTC (permalink / raw)
  To: eddie.dong-ral2JQCrhuEAvxtiuMwx3w
  Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Mon, 2007-06-25 at 11:32 +0800, Dong, Eddie wrote:
> Avi Kivity wrote:
> > Dong, Eddie wrote:
> > We'll add it as soon as it becomes necessary. But isn't it
> > just a memory
> > write?

On real HW, yes.  The Intel guys can correct me if I am wrong, but I
think we can actually simplify this in the virtual-world.  My impression
from reading the specs is that the address that you write to effectively
lets you generate an APIC bus message from a PCI bus-master.  The data
of the message looks very much like an APIC message, which in turn looks
very similar to what you program an IOAPIC pin to map to.

Therefore, if you want to generate MSIs I think you can just have the
pv-device call kvm_apicbus_send() as opposed to emulating the memory
write.  I'm not sure if this helps or not (e.g. perhaps we *want* to
emulate the bus-master write for supporting some pv-device that wouldn't
have access to the KVM_APIC_MSG api for some reason?)

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: In kernel PIC support: kernel patch
@ 2007-06-25 13:45 Gregory Haskins
  0 siblings, 0 replies; 57+ messages in thread
From: Gregory Haskins @ 2007-06-25 13:45 UTC (permalink / raw)
  To: avi-atKUWr5tajBWk0Htik3J/w; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Sun, 2007-06-24 at 23:28 -0400, Avi Kivity wrote:
> Dong, Eddie wrote:
> >> Need to add a way to detect if this functionality is available, via
> >> KVM_CHECK_EXTENSION. 
> >>     
> >
> > Sure. Question:
> > 	Should we check the extension API basis or feature sets list?
> > I.e. Each API (KVM_CREATE_PIC, KVM_SET_ISA_IRQ_LEVEL) or feature sets:
> > KVM_KERNEL_IRQCHIP.
> >
> >   
> 
> I guess pre-feature.  We want to have fine granularity so we can change 
> small things, but per-API is too fine.
> 
> So in this case I'd suggest KVM_CAP_PIC, KVM_CAP_APIC, etc. (note the 
> KVM_CAP_* namespace).

Yeah, I like this suggestion better than how I currently do it.




-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: In kernel PIC support: kernel patch
@ 2007-06-25 13:44 Gregory Haskins
  0 siblings, 0 replies; 57+ messages in thread
From: Gregory Haskins @ 2007-06-25 13:44 UTC (permalink / raw)
  To: eddie.dong-ral2JQCrhuEAvxtiuMwx3w
  Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Mon, 2007-06-25 at 09:45 +0800, Dong, Eddie wrote:

> Sure. Question:
> 	Should we check the extension API basis or feature sets list?
> I.e. Each API (KVM_CREATE_PIC, KVM_SET_ISA_IRQ_LEVEL) or feature sets:
> KVM_KERNEL_IRQCHIP.

Actually, I have that same question.  I currently implemented the v9
code on an API basis, but I now think it should be the other way.





-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: In kernel PIC support: kernel patch
@ 2007-06-25 13:43 Gregory Haskins
       [not found] ` <467F8E260200005A00026573-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Gregory Haskins @ 2007-06-25 13:43 UTC (permalink / raw)
  To: avi-atKUWr5tajBWk0Htik3J/w; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Sat, 2007-06-23 at 20:41 +0300, Avi Kivity wrote:
> Dong, Eddie wrote:
> > 3: IOAPIC position
> > 	Though it looks like neutral to have this one in user or kernel
> > space,
> > but I'd like to suggest we only support one model. Considering future
> > VT-d
> > case, hypervisor need to inject an IRQ directly in KVM (still thru
> > IOAPIC)
> > without going to user level, so probably moving IOAPIC to kernel is good
> >
> > in this point.
> >   
> 
> Even paravirt device drivers will want to inject IRQs via the ioapic 
> (when the guest is not paravirt_ops enabled, like older Linux and 
> Windows). 

Note that its probably not a requirement to do so.  The IOAPIC
essentially provides a standard mechanism to map input "irq pins" to
APIC messages.  A pv driver could conceivably call kvm_apicbus_send()
directly if I knew its vector instead of calling ioapic_setpin() and it
would achieve the same goal.

What is nice about using an IOAPIC is that the irq routing can be done
for you automatically by any ACPI compliant OS.  The device just has to
say "raise my output pin" and the IOAPIC translates that into a vector.
If you forgo the IOAPIC in favor of direct APIC messages, you have to
solve the problem of irq-routing a different way.  

Note that there is another standard that would allow us to use built-in
routing without ACPI/IOAPIC: MSI.  I know the battle over
virtual-bus-rendering is still raging w.r.t. to PCI or not-PCI.  But I
will point out that if we do use PCI, setting the PV devices up as MSI
capable in the config space potentially eliminates the need to also wire
them to an IOAPIC.  They will get their vector data from the MSI setup
and can then send APIC messages directly.  (I think there is some
confusion about how we can do MSI later in this thread...I will reply to
those mails separately)

Thoughts?

> It's probably okay to implement the device side of a block 
> device in qemu, but more difficult for networking. If we have device 
> implementations in the kernel then we'll need an ioapic in the kernel.

> 
> Also, if we end up sharing interrupts between the kernel and userspace, 
> we'll need the kernel to perform the OR between the level specified by 
> the kernel devices and the level specified by userspace.

I not sure I fully understand what you are getting at here.  It sounds
like you are talking about splitting a single IOAPIC into both a user
and kernel space device?  If we do end up needing the IOAPIC in the
kernel and userspace, I would suggest using two IOAPICs (we just need to
update the ACPI tables to say there are two) so that they can operate
independently.  This should be supportable from a virtual system
infrastructure standpoint, and is much cleaner IMHO.  If you were
talking about a different scenario. please clarify.

> 

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

[parent not found: <467F8E260200005A00026573-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>]

* Re: In kernel PIC support: kernel patch
       [not found] ` <467F8E260200005A00026573-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
@ 2007-06-25 13:54   ` Dong, Eddie
  2007-06-25 22:08   ` Avi Kivity
  1 sibling, 0 replies; 57+ messages in thread
From: Dong, Eddie @ 2007-06-25 13:54 UTC (permalink / raw)
  To: Gregory Haskins, avi-atKUWr5tajBWk0Htik3J/w
  Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

[-- Attachment #1: Type: text/plain, Size: 4407 bytes --]

 

>-----Original Message-----
>From: Gregory Haskins [mailto:ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org] 
>Sent: 2007年6月25日 21:43
>To: avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org
>Cc: Dong, Eddie; kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org
>Subject: Re: [kvm-devel] In kernel PIC support: kernel patch
>
>On Sat, 2007-06-23 at 20:41 +0300, Avi Kivity wrote:
>> Dong, Eddie wrote:
>> > 3: IOAPIC position
>> > 	Though it looks like neutral to have this one in user or kernel
>> > space,
>> > but I'd like to suggest we only support one model. 
>Considering future
>> > VT-d
>> > case, hypervisor need to inject an IRQ directly in KVM (still thru
>> > IOAPIC)
>> > without going to user level, so probably moving IOAPIC to 
>kernel is good
>> >
>> > in this point.
>> >   
>> 
>> Even paravirt device drivers will want to inject IRQs via the ioapic 
>> (when the guest is not paravirt_ops enabled, like older Linux and 
>> Windows). 
>
>Note that its probably not a requirement to do so.  The IOAPIC
>essentially provides a standard mechanism to map input "irq pins" to
>APIC messages.  A pv driver could conceivably call kvm_apicbus_send()
>directly if I knew its vector instead of calling ioapic_setpin() and it
>would achieve the same goal.

Basically we are talking about same thing. calling kvm_apicbus_send or
raise a virtual ioapic pin is basically same. The minor different is who does
the translation from pin to APIC message, device model or IOAPIC model.

Using IOAPIC pin probably makes PV driver writter a little bit easier and 
may be reused by other VMM easily in future.

>
>What is nice about using an IOAPIC is that the irq routing can be done
>for you automatically by any ACPI compliant OS.  The device just has to
>say "raise my output pin" and the IOAPIC translates that into a vector.
>If you forgo the IOAPIC in favor of direct APIC messages, you have to
>solve the problem of irq-routing a different way.  
>
>Note that there is another standard that would allow us to use built-in
>routing without ACPI/IOAPIC: MSI.  I know the battle over
>virtual-bus-rendering is still raging w.r.t. to PCI or not-PCI.  But I
>will point out that if we do use PCI, setting the PV devices up as MSI
>capable in the config space potentially eliminates the need to 
>also wire
>them to an IOAPIC.  They will get their vector data from the MSI setup
>and can then send APIC messages directly.  (I think there is some
>confusion about how we can do MSI later in this thread...I 
>will reply to
>those mails separately)

Yes, all you mentioned here is correct, but it doesn;t impact the
necessity of moving IOAPIC to kernel, see below.

>
>Thoughts?
>
>
>
>> It's probably okay to implement the device side of a block 
>> device in qemu, but more difficult for networking. If we have device 
>> implementations in the kernel then we'll need an ioapic in 
>the kernel.
>
>> 
>> Also, if we end up sharing interrupts between the kernel and 
>userspace, 
>> we'll need the kernel to perform the OR between the level 
>specified by 
>> the kernel devices and the level specified by userspace.
>
>I not sure I fully understand what you are getting at here.  It sounds
>like you are talking about splitting a single IOAPIC into both a user
>and kernel space device?  If we do end up needing the IOAPIC in the

No, the IOAPIC is in kernel, but some device such as IDE are in user while others
are in kernel such as pv NIC.

Due to this, a kernel pv driver will ask for IOAPIC service, so we have to keep 
it in kernel, otherwise we have to go back to user when kernel device fires, which
is very slow.

But, like you mentioned above, if pv device driver uses MSI or APIC message, 
it helps, but then 1: we have to implement MSI now, 2: a little bit more complicated
in writting pv driver.

>kernel and userspace, I would suggest using two IOAPICs (we 
>just need to
>update the ACPI tables to say there are two) so that they can operate
>independently.  This should be supportable from a virtual system

It can, but a little bit more complicated, and probably unable to live
migration between KVM and Qemu since Qemu only has one IOAPIC.

>infrastructure standpoint, and is much cleaner IMHO.  If you were
>talking about a different scenario. please clarify.
>

Thx,eddie


[-- Attachment #2: Type: text/plain, Size: 286 bytes --]

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

[-- Attachment #3: Type: text/plain, Size: 186 bytes --]

_______________________________________________
kvm-devel mailing list
kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org
https://lists.sourceforge.net/lists/listinfo/kvm-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: In kernel PIC support: kernel patch
       [not found] ` <467F8E260200005A00026573-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
  2007-06-25 13:54   ` Dong, Eddie
@ 2007-06-25 22:08   ` Avi Kivity
  1 sibling, 0 replies; 57+ messages in thread
From: Avi Kivity @ 2007-06-25 22:08 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Gregory Haskins wrote:
>> n paravirt device drivers will want to inject IRQs via the ioapic 
>> (when the guest is not paravirt_ops enabled, like older Linux and 
>> Windows). 
>>     
>
> Note that its probably not a requirement to do so.  The IOAPIC
> essentially provides a standard mechanism to map input "irq pins" to
> APIC messages.  A pv driver could conceivably call kvm_apicbus_send()
> directly if I knew its vector instead of calling ioapic_setpin() and it
> would achieve the same goal.
>
> What is nice about using an IOAPIC is that the irq routing can be done
> for you automatically by any ACPI compliant OS.  The device just has to
> say "raise my output pin" and the IOAPIC translates that into a vector.
> If you forgo the IOAPIC in favor of direct APIC messages, you have to
> solve the problem of irq-routing a different way.  
>
>   

Very good point.  There's a slight issue that Windows with the ACPI HAL 
is dog slow when virtulalized, but maybe we can find another HAL that 
supports ACPI but doesn't use the APIC.

> Note that there is another standard that would allow us to use built-in
> routing without ACPI/IOAPIC: MSI.  I know the battle over
> virtual-bus-rendering is still raging w.r.t. to PCI or not-PCI.  But I
> will point out that if we do use PCI, setting the PV devices up as MSI
> capable in the config space potentially eliminates the need to also wire
> them to an IOAPIC.  They will get their vector data from the MSI setup
> and can then send APIC messages directly.  (I think there is some
> confusion about how we can do MSI later in this thread...I will reply to
> those mails separately)
>   

MSI is also a good solution; but do all semi-modern guest OSes support it?

>   
>> It's probably okay to implement the device side of a block 
>> device in qemu, but more difficult for networking. If we have device 
>> implementations in the kernel then we'll need an ioapic in the kernel.
>>     
>
>   
>> Also, if we end up sharing interrupts between the kernel and userspace, 
>> we'll need the kernel to perform the OR between the level specified by 
>> the kernel devices and the level specified by userspace.
>>     
>
> I not sure I fully understand what you are getting at here.  It sounds
> like you are talking about splitting a single IOAPIC into both a user
> and kernel space device?  

No, a single ioapic in the kernel, with one interrupt line that is 
shared between a kernel-provided device and a userspace-provided 
device.  The kernel would need to provide the OR function that sharing 
implies.


> If we do end up needing the IOAPIC in the
> kernel and userspace, I would suggest using two IOAPICs (we just need to
> update the ACPI tables to say there are two) so that they can operate
> independently.  This should be supportable from a virtual system
> infrastructure standpoint, and is much cleaner IMHO.  If you were
> talking about a different scenario. please clarify

I was talking about a different scenario, but your solution is cleaner 
(though it involves more hand-dirtying with the ACPI tables).  Basically 
each PCI IRQ would have its own dedicated line.


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: In kernel PIC support: kernel patch
@ 2007-06-25 13:26 Gregory Haskins
       [not found] ` <467F8A2D0200005A00026567-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Gregory Haskins @ 2007-06-25 13:26 UTC (permalink / raw)
  To: avi-atKUWr5tajBWk0Htik3J/w; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Sat, 2007-06-23 at 10:53 +0300, Avi Kivity wrote:
> Gregory Haskins wrote:
> >
> > In fact, my own implementation puts the ISA-userint model as hanging off
> > the kvm structure as well (which would be replaced with a similar PIC
> > model for layer-2 mode in the future).
> >
> >   
> 
> The PIC is certainly a system resource, but at least in qemu it's 
> connected to cpu 0 only. We want to preserve that.
> 

I think there might be some confusion here ;) The *outputs* of the PIC
in both QEMU and in-kernel-APIC are certainly wired to cpu0 (per the
model that qemu/bochs support).  But the *inputs* are system-wide.  In
qemu, you can call the global function "pic_set_irq" (which you could
construe as per-vm, since there is a 1:1 with qemu-process/vm).
Likewise, the inputs in both in-kernel implementation is also per vm by
hanging off of the kvm struct, as opposed to the vcpu struct.  I hope
this helps to clarify.

-Greg

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

[parent not found: <467F8A2D0200005A00026567-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>]

* Re: In kernel PIC support: kernel patch
       [not found] ` <467F8A2D0200005A00026567-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
@ 2007-06-25 22:00   ` Avi Kivity
  0 siblings, 0 replies; 57+ messages in thread
From: Avi Kivity @ 2007-06-25 22:00 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Gregory Haskins wrote:
> On Sat, 2007-06-23 at 10:53 +0300, Avi Kivity wrote:
>   
>> Gregory Haskins wrote:
>>     
>>> In fact, my own implementation puts the ISA-userint model as hanging off
>>> the kvm structure as well (which would be replaced with a similar PIC
>>> model for layer-2 mode in the future).
>>>
>>>   
>>>       
>> The PIC is certainly a system resource, but at least in qemu it's 
>> connected to cpu 0 only. We want to preserve that.
>>
>>     
>
> I think there might be some confusion here ;) The *outputs* of the PIC
> in both QEMU and in-kernel-APIC are certainly wired to cpu0 (per the
> model that qemu/bochs support).  But the *inputs* are system-wide.  In
> qemu, you can call the global function "pic_set_irq" (which you could
> construe as per-vm, since there is a 1:1 with qemu-process/vm).
> Likewise, the inputs in both in-kernel implementation is also per vm by
> hanging off of the kvm struct, as opposed to the vcpu struct.  I hope
> this helps to clarify.
>
>   


This is my understanding too.  I didn't suggest the PIC ioctls should be 
per-vcpu, just that current semantics should be preserved.


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: In kernel PIC support: kernel patch
@ 2007-06-22 12:14 Gregory Haskins
       [not found] ` <467B84EF0200005A00026401-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Gregory Haskins @ 2007-06-22 12:14 UTC (permalink / raw)
  To: avi-atKUWr5tajBWk0Htik3J/w; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Fri, 2007-06-22 at 13:50 +0300, Avi Kivity wrote:
> Dong, Eddie wrote:
> 
> > +#include "kvm.h"
> > +#include "virq.h"
> > +
> > +/*
> > + * check if there is pending interrupt without
> > + * intack.
> > + */
> > +int cpu_has_interrupt(struct kvm_vcpu *v)
> > +{
> > +	struct pic_state2 *s = kernel_pic(v->kvm);
> > +
> >   
> 
> Isn't the PIC connected to just one cpu?

I agree with Eddie here.  Its true that only one CPU would ever
associate with PIC (and its usually hardwired to the BSP, but it
technically doesn't have to be depending on the architecture).  But the
PIC(s) are really a system wide resource that should hang off the kvm
struct, IMHO.

In fact, my own implementation puts the ISA-userint model as hanging off
the kvm structure as well (which would be replaced with a similar PIC
model for layer-2 mode in the future).


-Greg


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

[parent not found: <467B84EF0200005A00026401-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>]

* Re: In kernel PIC support: kernel patch
       [not found] ` <467B84EF0200005A00026401-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
@ 2007-06-23  7:53   ` Avi Kivity
  0 siblings, 0 replies; 57+ messages in thread
From: Avi Kivity @ 2007-06-23  7:53 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Gregory Haskins wrote:
> On Fri, 2007-06-22 at 13:50 +0300, Avi Kivity wrote:
>   
>> Dong, Eddie wrote:
>>
>>     
>>> +#include "kvm.h"
>>> +#include "virq.h"
>>> +
>>> +/*
>>> + * check if there is pending interrupt without
>>> + * intack.
>>> + */
>>> +int cpu_has_interrupt(struct kvm_vcpu *v)
>>> +{
>>> +	struct pic_state2 *s = kernel_pic(v->kvm);
>>> +
>>>   
>>>       
>> Isn't the PIC connected to just one cpu?
>>     
>
> I agree with Eddie here.  Its true that only one CPU would ever
> associate with PIC (and its usually hardwired to the BSP, but it
> technically doesn't have to be depending on the architecture).  But the
> PIC(s) are really a system wide resource that should hang off the kvm
> struct, IMHO.
>
> In fact, my own implementation puts the ISA-userint model as hanging off
> the kvm structure as well (which would be replaced with a similar PIC
> model for layer-2 mode in the future).
>
>   

The PIC is certainly a system resource, but at least in qemu it's 
connected to cpu 0 only. We want to preserve that.

I guess it doesn't really matter because smp configurations will use the 
ioapic.


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: In kernel PIC support: kernel patch
@ 2007-06-22 12:06 Gregory Haskins
       [not found] ` <467B830B0200005A000263FB-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Gregory Haskins @ 2007-06-22 12:06 UTC (permalink / raw)
  To: avi-atKUWr5tajBWk0Htik3J/w; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Fri, 2007-06-22 at 12:57 +0300, Avi Kivity wrote:
> Gregory Haskins wrote:
> >
> > Unless I misunderstood Avi, I think he was suggesting that use use
> > KVM_EXTINT/ISA_INTERRUPT as a way to distinguish between the dual modes
> > of ISA_INTERRUPT as I have today (e.g.
> > ISA_INTERRUPT(level-1)/ISA_INTERRUPT(level-2)).  This doesn't really
> > have anything to do with LINT0/1 (directly, anyway).  Hope this helps to
> > clarify. 
> >   
> 
> Well, we don't need to think in terms of modes.  Each ioctl controls
> interrupts at a different point in the architecture.
> 
> KVM_INTERRUPT is at the interface between the processor core and the
> lapic/system bus.
> 
> KVM_APIC_MESSAGE is at the apic bus.
> 
> KVM_EXTINT is at the interface between the lapic and the system bus.  If
> the guest disabled the local apic, it's equivalent to the KVM_INTERRUPT
> (injecting a vector directly); otherwise it is handled by the lapic
> (LINT0/1 handling).

Ok, now I understand what Eddie was saying about LINT0/1 and we were
both describing the same thing in different ways.  I think we are all on
the same page.

> 
> KVM_ISA_INTERRUPT (maybe KVM_IRQ_LINE, to signify it doesn't carry a
> vector) is at the boundary between the interrupt controller chips and
> the rest of the system.
> 

Ack on s/ISA/IRQ or something similar


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

[parent not found: <467B830B0200005A000263FB-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>]

* Re: In kernel PIC support: kernel patch
       [not found] ` <467B830B0200005A000263FB-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
@ 2007-06-22 15:09   ` Dong, Eddie
       [not found]     ` <10EA09EFD8728347A513008B6B0DA77A014E8AC7-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Dong, Eddie @ 2007-06-22 15:09 UTC (permalink / raw)
  To: Gregory Haskins, avi-atKUWr5tajBWk0Htik3J/w
  Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Woo, we have gone through so many threads within single day and get
almost 
everything in hand now :-) Should we make a short notes on the decision 
to be referenced in future (Any place to KVM I/F)? 

Basically we have solved the I/Fs issue and part IRQ delivery issue.

Several other minor points in my mind:

1: Cross architecture consideration
If we want to consider other platform too such as IA64 and PPC like 
that in Xen, then use GSI is better for KVM_IOAPIC_INTERRUPT. 
BTW, IA64 platform doesn;t have PIC usually.(The work for IA64 is
under going.)

2: MSI.
	Not sure if there will have requirement to support a virtual
device
with MSI support, if yes, then we may add new APIs in future. So let us
reserve some I/F for future irq injection.

3: IOAPIC position
	Though it looks like neutral to have this one in user or kernel
space,
but I'd like to suggest we only support one model. Considering future
VT-d
case, hypervisor need to inject an IRQ directly in KVM (still thru
IOAPIC)
without going to user level, so probably moving IOAPIC to kernel is good

in this point.
	Any suggestion?

thanks, eddie

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

[parent not found: <10EA09EFD8728347A513008B6B0DA77A014E8AC7-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>]

* Re: In kernel PIC support: kernel patch
       [not found]     ` <10EA09EFD8728347A513008B6B0DA77A014E8AC7-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2007-06-23 17:41       ` Avi Kivity
       [not found]         ` <467D5B5B.9090702-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Avi Kivity @ 2007-06-23 17:41 UTC (permalink / raw)
  To: Dong, Eddie; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Dong, Eddie wrote:
> Woo, we have gone through so many threads within single day and get
> almost 
> everything in hand now :-) Should we make a short notes on the decision 
> to be referenced in future (Any place to KVM I/F)? 
>
> Basically we have solved the I/Fs issue and part IRQ delivery issue.
>
> Several other minor points in my mind:
>
> 1: Cross architecture consideration
> If we want to consider other platform too such as IA64 and PPC like 
> that in Xen, then use GSI is better for KVM_IOAPIC_INTERRUPT. 
> BTW, IA64 platform doesn;t have PIC usually.(The work for IA64 is
> under going.)
>
>   

What is GSI?

> 2: MSI.
> 	Not sure if there will have requirement to support a virtual
> device
> with MSI support, if yes, then we may add new APIs in future. So let us
> reserve some I/F for future irq injection.
>   

We'll add it as soon as it becomes necessary. But isn't it just a memory 
write?

> 3: IOAPIC position
> 	Though it looks like neutral to have this one in user or kernel
> space,
> but I'd like to suggest we only support one model. Considering future
> VT-d
> case, hypervisor need to inject an IRQ directly in KVM (still thru
> IOAPIC)
> without going to user level, so probably moving IOAPIC to kernel is good
>
> in this point.
>   

Even paravirt device drivers will want to inject IRQs via the ioapic 
(when the guest is not paravirt_ops enabled, like older Linux and 
Windows). It's probably okay to implement the device side of a block 
device in qemu, but more difficult for networking. If we have device 
implementations in the kernel then we'll need an ioapic in the kernel.

Also, if we end up sharing interrupts between the kernel and userspace, 
we'll need the kernel to perform the OR between the level specified by 
the kernel devices and the level specified by userspace.


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

[parent not found: <467D5B5B.9090702-atKUWr5tajBWk0Htik3J/w@public.gmane.org>]

* Re: In kernel PIC support: kernel patch
       [not found]         ` <467D5B5B.9090702-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-06-25  3:32           ` Dong, Eddie
       [not found]             ` <10EA09EFD8728347A513008B6B0DA77A01AA46F5-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Dong, Eddie @ 2007-06-25  3:32 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Avi Kivity wrote:
> Dong, Eddie wrote:
>> Woo, we have gone through so many threads within single day and get
>> almost everything in hand now :-) Should we make a short notes on
>> the decision to be referenced in future (Any place to KVM I/F)?
>> 
>> Basically we have solved the I/Fs issue and part IRQ delivery issue.
>> 
>> Several other minor points in my mind:
>> 
>> 1: Cross architecture consideration
>> If we want to consider other platform too such as IA64 and PPC like
>> that in Xen, then use GSI is better for KVM_IOAPIC_INTERRUPT.
>> BTW, IA64 platform doesn;t have PIC usually.(The work for IA64 is
>> under going.) 
>> 
>> 
> 
> What is GSI?

GSI is ACPI notion of "Global System Interrupt".
(Just parameter passing difference: GSI use GSI #, 
our previous discussion use: PIC/IOAPIC # + line #)


> 
>> 2: MSI.
>> 	Not sure if there will have requirement to support a virtual
>> device
>> with MSI support, if yes, then we may add new APIs in future. So let
>> us reserve some I/F for future irq injection.
>> 
> 
> We'll add it as soon as it becomes necessary. But isn't it
> just a memory
> write?

Yes. So you mean we just add a memory Write I/F? It could be
another good alternative.
(I suppose MSI emulation is in kernel too if other irqchip is in
kernel.)


> 
>> 3: IOAPIC position
>> 	Though it looks like neutral to have this one in user or kernel
>> space, but I'd like to suggest we only support one model.
>> Considering future VT-d case, hypervisor need to inject an IRQ
>> directly in KVM (still thru IOAPIC) without going to user level, so
>> probably moving IOAPIC to kernel is good 
>> 
>> in this point.
>> 
> 
> Even paravirt device drivers will want to inject IRQs via the ioapic
> (when the guest is not paravirt_ops enabled, like older Linux and
> Windows). It's probably okay to implement the device side of a block
> device in qemu, but more difficult for networking. If we have device
> implementations in the kernel then we'll need an ioapic in the kernel.
> 
> Also, if we end up sharing interrupts between the kernel and
> userspace, we'll need the kernel to perform the OR between the level
> specified by the kernel devices and the level specified by userspace.

Yes, so basically in kernel IOAPIC has stronger reason :-) Any strong
reason
to support user level IOAPIC + kernel level APIC/PIC/MSI ?
thx,eddie

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

[parent not found: <10EA09EFD8728347A513008B6B0DA77A01AA46F5-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>]

* Re: In kernel PIC support: kernel patch
       [not found]             ` <10EA09EFD8728347A513008B6B0DA77A01AA46F5-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2007-06-25  3:39               ` Avi Kivity
       [not found]                 ` <467F38F7.9090101-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Avi Kivity @ 2007-06-25  3:39 UTC (permalink / raw)
  To: Dong, Eddie; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Dong, Eddie wrote:
> Avi Kivity wrote:
>   
>> Dong, Eddie wrote:
>>     
>>> Woo, we have gone through so many threads within single day and get
>>> almost everything in hand now :-) Should we make a short notes on
>>> the decision to be referenced in future (Any place to KVM I/F)?
>>>
>>> Basically we have solved the I/Fs issue and part IRQ delivery issue.
>>>
>>> Several other minor points in my mind:
>>>
>>> 1: Cross architecture consideration
>>> If we want to consider other platform too such as IA64 and PPC like
>>> that in Xen, then use GSI is better for KVM_IOAPIC_INTERRUPT.
>>> BTW, IA64 platform doesn;t have PIC usually.(The work for IA64 is
>>> under going.) 
>>>
>>>
>>>       
>> What is GSI?
>>     
>
> GSI is ACPI notion of "Global System Interrupt".
> (Just parameter passing difference: GSI use GSI #, 
> our previous discussion use: PIC/IOAPIC # + line #)
>
>
>   

Can you refer me to any documentation about it?

>>> 2: MSI.
>>> 	Not sure if there will have requirement to support a virtual
>>> device
>>> with MSI support, if yes, then we may add new APIs in future. So let
>>> us reserve some I/F for future irq injection.
>>>
>>>       
>> We'll add it as soon as it becomes necessary. But isn't it
>> just a memory
>> write?
>>     
>
> Yes. So you mean we just add a memory Write I/F? It could be
> another good alternative.
> (I suppose MSI emulation is in kernel too if other irqchip is in
> kernel.)
>   

We already have kernel->user write (normal mmio)
If both the interrupt originator and the irqchip are in the kernel there 
is no need for an api.
So all that is left is an mmio api from user to kernel.  Which makes 
sense: if we have memory resources in the kernel, we need to be able to 
access them from userspace (one example already exists, RAM, which is 
accessed in userspace through mmap()).


>
>   
>>> 3: IOAPIC position
>>> 	Though it looks like neutral to have this one in user or kernel
>>> space, but I'd like to suggest we only support one model.
>>> Considering future VT-d case, hypervisor need to inject an IRQ
>>> directly in KVM (still thru IOAPIC) without going to user level, so
>>> probably moving IOAPIC to kernel is good 
>>>
>>> in this point.
>>>
>>>       
>> Even paravirt device drivers will want to inject IRQs via the ioapic
>> (when the guest is not paravirt_ops enabled, like older Linux and
>> Windows). It's probably okay to implement the device side of a block
>> device in qemu, but more difficult for networking. If we have device
>> implementations in the kernel then we'll need an ioapic in the kernel.
>>
>> Also, if we end up sharing interrupts between the kernel and
>> userspace, we'll need the kernel to perform the OR between the level
>> specified by the kernel devices and the level specified by userspace.
>>     
>
> Yes, so basically in kernel IOAPIC has stronger reason :-) Any strong
> reason
> to support user level IOAPIC + kernel level APIC/PIC/MSI ?
>   

No strong reason.  It may be desirable for debuggability, but not much else.

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

[parent not found: <467F38F7.9090101-atKUWr5tajBWk0Htik3J/w@public.gmane.org>]

* Re: In kernel PIC support: kernel patch
       [not found]                 ` <467F38F7.9090101-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-06-25  4:11                   ` Dong, Eddie
  0 siblings, 0 replies; 57+ messages in thread
From: Dong, Eddie @ 2007-06-25  4:11 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Avi Kivity wrote:

>> 
>> GSI is ACPI notion of "Global System Interrupt".
>> (Just parameter passing difference: GSI use GSI #,
>> our previous discussion use: PIC/IOAPIC # + line #)
>> 
>> 
>> 
> 
> Can you refer me to any documentation about it?


ACPI spec 3.0a, section 5.2.12.
http://www.acpi.info/spec.htm has a copy. I don't where is formal
release org.

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: In kernel PIC support: kernel patch
@ 2007-06-22 12:02 Gregory Haskins
  0 siblings, 0 replies; 57+ messages in thread
From: Gregory Haskins @ 2007-06-22 12:02 UTC (permalink / raw)
  To: avi-atKUWr5tajBWk0Htik3J/w; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Fri, 2007-06-22 at 12:46 +0300, Avi Kivity wrote:
> Gregory Haskins wrote:
> > I actually kind of like the distinction you are making between the
> > INTERRUPT, EXTINT, and ISA/IOAPIC APIs and keeping them consistent
> > across modes.  As you suggested below, the ISA/IOAPIC API probably
> > should really stay as one (I vote for retaining the ISA name) since the
> > wiring is an invariant of the platform presented by qemu/bochs.
> >
> >   
> 
> Well, kvm shouldn't really know how the qemu hardware is wired up.  It
> can change; and non-qemu userspaces should always be possible.

I agree, but we have a ways to go to get there to fully support this
notion. :(  In addition to an API that supports it, any in-kernel
components will also likely need to know how to be wired up dynamically.
For instance, today the lapic branch assumes the PIC is wired to LINT0
(as decreed by QEMU), but other architectures might have it wired to an
IOAPIC pin (for routing to other CPUs besides the BSP), for instance).

I guess I was thinking the APIs would evolve as that new capability
emerged.  But you are right...if we can get this part of the API correct
now, lets do it.

> 
> When we change an irq line, there are three things that matter:
> 
> - which chip the line is connected to (PIC 1, PIC2, IOAPIC)
> - which line we're talking about
> - what level the line is to have
> 
> If we allow lists of such triplets to be sent via a single ioctl, then
> we can optimize the case where lines are wired together without teaching
> kvm about the board layout.

Sounds like a good approach.


> 
> (the ISA name might annoy our non-x86 friends.  I see nothing x86
> specific about the interface)
> 

Good point.  Lets go with the other suggestions you and Eddie are
making.

-Greg


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: In kernel PIC support: kernel patch
@ 2007-06-22  3:49 Gregory Haskins
       [not found] ` <467B0E7D0200005A000263E4-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Gregory Haskins @ 2007-06-22  3:49 UTC (permalink / raw)
  To: eddie.dong-ral2JQCrhuEAvxtiuMwx3w
  Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Fri, 2007-06-22 at 10:09 +0800, Dong, Eddie wrote:
> Avi Kivity wrote:
> > 
> > Eddie, is that what you were driving at in your simplification
> > attempt? 
> > 
> Yes, some minor thing:
> KVM_ISA_INTERRUPT: Per VM I/F, and need an irq line and irq level.
> KVM_IOAPIC_INTERRUPT: Per VM I/F, may also need irq level for level
> triggered irq.
> KVM_APIC_MESSAGE: Per VCPU I/F

Note that in my implementation the APIC_MESSAGE is also VM I/F.   In
fact, IMHO the only API that should be VCPU I/F is the original API,
KVM_INTERRUPT.  The reason is, the PIC related interrupts are targeted
at the vm-global ISA resources, and the APIC_MESSAGE is targeted at the
vm-global apic *bus*, not a particular lapic (yet).  Its the bus that
then determines the actual routing to its final destination.  I think
this design distinction is important.

> 
> Per step by step approach way, KVM_EXTINT (assume for LINT0/LINT1) can
> be deferred to future implementation.

Unless I misunderstood Avi, I think he was suggesting that use use
KVM_EXTINT/ISA_INTERRUPT as a way to distinguish between the dual modes
of ISA_INTERRUPT as I have today (e.g.
ISA_INTERRUPT(level-1)/ISA_INTERRUPT(level-2)).  This doesn't really
have anything to do with LINT0/1 (directly, anyway).  Hope this helps to
clarify. 

-Greg

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

[parent not found: <467B0E7D0200005A000263E4-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>]

* Re: In kernel PIC support: kernel patch
       [not found] ` <467B0E7D0200005A000263E4-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
@ 2007-06-22  7:25   ` Dong, Eddie
  2007-06-22  9:57   ` Avi Kivity
  1 sibling, 0 replies; 57+ messages in thread
From: Dong, Eddie @ 2007-06-22  7:25 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Gregory Haskins wrote:
> On Fri, 2007-06-22 at 10:09 +0800, Dong, Eddie wrote:
>> Avi Kivity wrote:
>>> 
>>> Eddie, is that what you were driving at in your simplification
>>> attempt? 
>>> 
>> Yes, some minor thing:
>> KVM_ISA_INTERRUPT: Per VM I/F, and need an irq line and irq level.
>> KVM_IOAPIC_INTERRUPT: Per VM I/F, may also need irq level for level
>> triggered irq. KVM_APIC_MESSAGE: Per VCPU I/F
> 
> Note that in my implementation the APIC_MESSAGE is also VM I/F.   In
> fact, IMHO the only API that should be VCPU I/F is the original API,
> KVM_INTERRUPT.  The reason is, the PIC related interrupts are targeted
> at the vm-global ISA resources, and the APIC_MESSAGE is targeted at
> the vm-global apic *bus*, not a particular lapic (yet).  Its the bus
> that then determines the actual routing to its final destination.  I
> think this design distinction is important.
> 
> 
>> 
>> Per step by step approach way, KVM_EXTINT (assume for LINT0/LINT1)
>> can be deferred to future implementation.
> 
> Unless I misunderstood Avi, I think he was suggesting that use use
> KVM_EXTINT/ISA_INTERRUPT as a way to distinguish between the dual
> modes of ISA_INTERRUPT as I have today (e.g.
> ISA_INTERRUPT(level-1)/ISA_INTERRUPT(level-2)).  This doesn't really
> have anything to do with LINT0/1 (directly, anyway).  Hope
> this helps to
> clarify.
> 
Sure, that is OK too.
Thx, Eddie

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: In kernel PIC support: kernel patch
       [not found] ` <467B0E7D0200005A000263E4-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
  2007-06-22  7:25   ` Dong, Eddie
@ 2007-06-22  9:57   ` Avi Kivity
  1 sibling, 0 replies; 57+ messages in thread
From: Avi Kivity @ 2007-06-22  9:57 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Gregory Haskins wrote:
> On Fri, 2007-06-22 at 10:09 +0800, Dong, Eddie wrote:
>   
>> Avi Kivity wrote:
>>     
>>> Eddie, is that what you were driving at in your simplification
>>> attempt? 
>>>
>>>       
>> Yes, some minor thing:
>> KVM_ISA_INTERRUPT: Per VM I/F, and need an irq line and irq level.
>> KVM_IOAPIC_INTERRUPT: Per VM I/F, may also need irq level for level
>> triggered irq.
>> KVM_APIC_MESSAGE: Per VCPU I/F
>>     
>
> Note that in my implementation the APIC_MESSAGE is also VM I/F.   In
> fact, IMHO the only API that should be VCPU I/F is the original API,
> KVM_INTERRUPT.  The reason is, the PIC related interrupts are targeted
> at the vm-global ISA resources, and the APIC_MESSAGE is targeted at the
> vm-global apic *bus*, not a particular lapic (yet).  Its the bus that
> then determines the actual routing to its final destination.  I think
> this design distinction is important.
>
>   

Good point.  I agree.

>   
>> Per step by step approach way, KVM_EXTINT (assume for LINT0/LINT1) can
>> be deferred to future implementation.
>>     
>
> Unless I misunderstood Avi, I think he was suggesting that use use
> KVM_EXTINT/ISA_INTERRUPT as a way to distinguish between the dual modes
> of ISA_INTERRUPT as I have today (e.g.
> ISA_INTERRUPT(level-1)/ISA_INTERRUPT(level-2)).  This doesn't really
> have anything to do with LINT0/1 (directly, anyway).  Hope this helps to
> clarify. 
>   

Well, we don't need to think in terms of modes.  Each ioctl controls
interrupts at a different point in the architecture.

KVM_INTERRUPT is at the interface between the processor core and the
lapic/system bus.

KVM_APIC_MESSAGE is at the apic bus.

KVM_EXTINT is at the interface between the lapic and the system bus.  If
the guest disabled the local apic, it's equivalent to the KVM_INTERRUPT
(injecting a vector directly); otherwise it is handled by the lapic
(LINT0/1 handling).

KVM_ISA_INTERRUPT (maybe KVM_IRQ_LINE, to signify it doesn't carry a
vector) is at the boundary between the interrupt controller chips and
the rest of the system.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: In kernel PIC support: kernel patch
@ 2007-06-21 16:44 Gregory Haskins
       [not found] ` <467A72970200005A00026354-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Gregory Haskins @ 2007-06-21 16:44 UTC (permalink / raw)
  To: avi-atKUWr5tajBWk0Htik3J/w; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Thu, 2007-06-21 at 19:31 +0300, Avi Kivity wrote:

> Wait. Looks like APIs are changing meaning according to some mode. 
> Confusion.
> 
> How about:
> 
> KVM_INTERRUPT == inject this vector now
> KVM_EXTINT == drives the processor's interrupt pin (this is handled by 
> the lapic code). input is a vector and type
> KVM_ISA_INTERRUPT == drives the irq inputs to the PIC. Input is line number.
> KVM_IOAPIC_INTERRUPT == drives the irq inputs to the IOAPIC. Input is 
> line number.
> KVM_APIC_MESSAGE == do something to the apic
> 
> That way, there is no mode (at least in the code), and the APIs keep 
> their meaning.

I actually kind of like the distinction you are making between the
INTERRUPT, EXTINT, and ISA/IOAPIC APIs and keeping them consistent
across modes.  As you suggested below, the ISA/IOAPIC API probably
should really stay as one (I vote for retaining the ISA name) since the
wiring is an invariant of the platform presented by qemu/bochs.

The only unfortunate thing about this change is part of my irqdevice
abstraction goes to waste (e.g. set_pin).  But oh well.  I think the
consistent API is more important.  I will roll this into the fixes I am
making.

-Greg


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

[parent not found: <467A72970200005A00026354-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>]

* Re: In kernel PIC support: kernel patch
       [not found] ` <467A72970200005A00026354-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
@ 2007-06-22  9:46   ` Avi Kivity
  0 siblings, 0 replies; 57+ messages in thread
From: Avi Kivity @ 2007-06-22  9:46 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Gregory Haskins wrote:
> On Thu, 2007-06-21 at 19:31 +0300, Avi Kivity wrote:
>
>   
>> Wait. Looks like APIs are changing meaning according to some mode. 
>> Confusion.
>>
>> How about:
>>
>> KVM_INTERRUPT == inject this vector now
>> KVM_EXTINT == drives the processor's interrupt pin (this is handled by 
>> the lapic code). input is a vector and type
>> KVM_ISA_INTERRUPT == drives the irq inputs to the PIC. Input is line number.
>> KVM_IOAPIC_INTERRUPT == drives the irq inputs to the IOAPIC. Input is 
>> line number.
>> KVM_APIC_MESSAGE == do something to the apic
>>
>> That way, there is no mode (at least in the code), and the APIs keep 
>> their meaning.
>>     
>
> I actually kind of like the distinction you are making between the
> INTERRUPT, EXTINT, and ISA/IOAPIC APIs and keeping them consistent
> across modes.  As you suggested below, the ISA/IOAPIC API probably
> should really stay as one (I vote for retaining the ISA name) since the
> wiring is an invariant of the platform presented by qemu/bochs.
>
>   

Well, kvm shouldn't really know how the qemu hardware is wired up.  It
can change; and non-qemu userspaces should always be possible.

When we change an irq line, there are three things that matter:

- which chip the line is connected to (PIC 1, PIC2, IOAPIC)
- which line we're talking about
- what level the line is to have

If we allow lists of such triplets to be sent via a single ioctl, then
we can optimize the case where lines are wired together without teaching
kvm about the board layout.

(the ISA name might annoy our non-x86 friends.  I see nothing x86
specific about the interface)

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: In kernel PIC support: kernel patch
@ 2007-06-21 15:43 Gregory Haskins
  0 siblings, 0 replies; 57+ messages in thread
From: Gregory Haskins @ 2007-06-21 15:43 UTC (permalink / raw)
  To: eddie.dong-ral2JQCrhuEAvxtiuMwx3w
  Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Thu, 2007-06-21 at 23:13 +0800, Dong, Eddie wrote:
> We are in same page now :-)

Good :)  I still need to add that "int level" field to KVM_IRQ_INTERRUPT
as we discussed a few weeks ago to be considered complete, though.

-Greg



-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: In kernel PIC support: kernel patch
@ 2007-06-21 15:42 Gregory Haskins
  0 siblings, 0 replies; 57+ messages in thread
From: Gregory Haskins @ 2007-06-21 15:42 UTC (permalink / raw)
  To: eddie.dong-ral2JQCrhuEAvxtiuMwx3w
  Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Thu, 2007-06-21 at 15:17 +0000, Gregory Haskins wrote:
> On Thu, 2007-06-21 at 22:28 +0800, Dong, Eddie wrote:
> >
> > > I suppose, but it somewhat defeats the purpose IMO.  Every pin in the
> > > 8259 that gets tickled implicitly means an IOAPIC pin was tickled
> > > also. Do we really want to go to userspace for that?  Essentially
> > 
> > User space can handle this, go to IOAPIC first and then decide to go to 
> > in kernel PIC or not. Here the assumption is that an irq line is either
> > serviced
> > by PIC or IOAPIC, it will never be serviced by both.
> > 
> > So no back to user space.
> 

Thinking about it some more, I see some conditions where this
optimization doesn't work (and was one of the reasons driving my
redesign of the subsystem).  For instance, when linux routes the PIT as
an NMI and an IOAPIC based event.  In this case, the PIT raises an
interrupt which tickles an isa interrupt line that should be routed to
both the PIC and the IOAPIC #IRQ0.  Both devices should have IRQ0
unmasked.  The former will get routed to the APIC via tickling the LINT0
line.  The latter will get routed as an APIC-MSG based on the
programming of the IOAPICs IRQ0 line.

I believe this is a scenario which is mis-emulated today at least
partially due to the assumption that only one device is in use at a
time.

All is not lost, however.  This just helps drive the decision to move
the IOAPIC with the PIC ;)

-Greg 

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: In kernel PIC support: kernel patch
@ 2007-06-21 15:17 Gregory Haskins
  0 siblings, 0 replies; 57+ messages in thread
From: Gregory Haskins @ 2007-06-21 15:17 UTC (permalink / raw)
  To: eddie.dong-ral2JQCrhuEAvxtiuMwx3w
  Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Thu, 2007-06-21 at 22:28 +0800, Dong, Eddie wrote:
>
> > I suppose, but it somewhat defeats the purpose IMO.  Every pin in the
> > 8259 that gets tickled implicitly means an IOAPIC pin was tickled
> > also. Do we really want to go to userspace for that?  Essentially
> 
> User space can handle this, go to IOAPIC first and then decide to go to 
> in kernel PIC or not. Here the assumption is that an irq line is either
> serviced
> by PIC or IOAPIC, it will never be serviced by both.
> 
> So no back to user space.

Yeah, I suppose if we adopt that optimization from Xen/Qemu that is
true.  The design in my head is trying to emulate the actual signal path
in hardware so it gets dispatched to both models.  I suppose you can get
away with a shortcut like that here (though I don't really love the
idea ;).

I think it comes down to the decision, like you said.  If we go with
IOAPIC in the kernel, just route to both like I am thinking.  If we
leave it in userspace, optimize the context switch away for performance
reasons.

-Greg

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: In kernel PIC support: kernel patch
@ 2007-06-21 15:10 Gregory Haskins
       [not found] ` <467A5C8C0200005A00026321-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Gregory Haskins @ 2007-06-21 15:10 UTC (permalink / raw)
  To: eddie.dong-ral2JQCrhuEAvxtiuMwx3w
  Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Thu, 2007-06-21 at 22:57 +0800, Dong, Eddie wrote:
> Gregory Haskins wrote:
> 
> > So assuming this newly enlightened position is true, I think this
> > means we have a choice: 
> > 
> > 1) Drop support for mixed "level-1" mode and move the PIC to the
> > kernel now as Eddie is doing 
> > 
> > 2) Keep the level-1/2 distinction, and add support for making sure
> > that once a vector is acked in the PIC, it truely gets put into
> > service immediately. 
> > 
> > I can think of a really simple interface for (2).  All we
> > really need to
> > do is
> > 
> > a) go back to synchronous injection (as previously suggested)
> 
> For level-2 support, that means no PIC/APIC in user space, it is the
> device want to assert/dessert an irq line request. There is no notion
> of injecting IRQ. That is why I added new APIs.
> Even with level-1 in consideration, we still need these new APIs to
> support level-2. A device such as kerboard could frequently assert
> and dessert the irq line. A single ker strobe will see 5-10 dessert 
> request from device model and 1 assert request.
> 
> So new APIs are a must IMO.

Fully agree, but I already added them as well ;)

1) KVM_ISA_INTERRUPT:  In level-1 mode, this API allows the userspace
pic to dispatch a vector to the kernel.  In level-2 mode, this allows
any userspace app to tickle an isa based irq line (which feeds into the
inputs of the PIC and IOAPIC.  In other words, a level-2 based userspace
would substitute KVM_ISA_INTERRUPT for pic_set_irq().

2) KVM_APIC_MESSAGE allows any entity (presumably an IOAPIC in
userspace, though there is no restriction here) to send APIC messages.
This will probably only be used in level-1 mode, but its there for both
modes nonetheless.

-Greg



-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

[parent not found: <467A5C8C0200005A00026321-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>]

* Re: In kernel PIC support: kernel patch
       [not found] ` <467A5C8C0200005A00026321-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
@ 2007-06-21 15:13   ` Dong, Eddie
  2007-06-21 16:31   ` Avi Kivity
  1 sibling, 0 replies; 57+ messages in thread
From: Dong, Eddie @ 2007-06-21 15:13 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

We are in same page now :-)


Gregory Haskins wrote:
> On Thu, 2007-06-21 at 22:57 +0800, Dong, Eddie wrote:
>> Gregory Haskins wrote:
>> 
>>> So assuming this newly enlightened position is true, I think this
>>> means we have a choice: 
>>> 
>>> 1) Drop support for mixed "level-1" mode and move the PIC to the
>>> kernel now as Eddie is doing 
>>> 
>>> 2) Keep the level-1/2 distinction, and add support for making sure
>>> that once a vector is acked in the PIC, it truely gets put into
>>> service immediately. 
>>> 
>>> I can think of a really simple interface for (2).  All we
>>> really need to
>>> do is
>>> 
>>> a) go back to synchronous injection (as previously suggested)
>> 
>> For level-2 support, that means no PIC/APIC in user space, it is the
>> device want to assert/dessert an irq line request. There is no notion
>> of injecting IRQ. That is why I added new APIs.
>> Even with level-1 in consideration, we still need these new APIs to
>> support level-2. A device such as kerboard could frequently assert
>> and dessert the irq line. A single ker strobe will see 5-10 dessert
>> request from device model and 1 assert request.
>> 
>> So new APIs are a must IMO.
> 
> Fully agree, but I already added them as well ;)
> 
> 1) KVM_ISA_INTERRUPT:  In level-1 mode, this API allows the userspace
> pic to dispatch a vector to the kernel.  In level-2 mode, this allows
> any userspace app to tickle an isa based irq line (which feeds into
> the inputs of the PIC and IOAPIC.  In other words, a level-2 based
> userspace would substitute KVM_ISA_INTERRUPT for pic_set_irq().
> 
> 2) KVM_APIC_MESSAGE allows any entity (presumably an IOAPIC in
> userspace, though there is no restriction here) to send APIC messages.
> This will probably only be used in level-1 mode, but its there for
> both modes nonetheless. 
> 
> -Greg

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: In kernel PIC support: kernel patch
       [not found] ` <467A5C8C0200005A00026321-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
  2007-06-21 15:13   ` Dong, Eddie
@ 2007-06-21 16:31   ` Avi Kivity
       [not found]     ` <467AA7FF.5000701-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  1 sibling, 1 reply; 57+ messages in thread
From: Avi Kivity @ 2007-06-21 16:31 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Gregory Haskins wrote:
> 1) KVM_ISA_INTERRUPT:  In level-1 mode, this API allows the userspace
> pic to dispatch a vector to the kernel.  In level-2 mode, this allows
> any userspace app to tickle an isa based irq line (which feeds into the
> inputs of the PIC and IOAPIC.  In other words, a level-2 based userspace
> would substitute KVM_ISA_INTERRUPT for pic_set_irq().
>
>   

Wait. Looks like APIs are changing meaning according to some mode. 
Confusion.

How about:

KVM_INTERRUPT == inject this vector now
KVM_EXTINT == drives the processor's interrupt pin (this is handled by 
the lapic code). input is a vector and type
KVM_ISA_INTERRUPT == drives the irq inputs to the PIC. Input is line number.
KVM_IOAPIC_INTERRUPT == drives the irq inputs to the IOAPIC. Input is 
line number.
KVM_APIC_MESSAGE == do something to the apic

That way, there is no mode (at least in the code), and the APIs keep 
their meaning.

Note that if an irq line is wired to both the ioapic and the PIC, we may 
be better off having a single ioctl that can combine the ISA and IOAPIC 
variants (driving two inputs at once).

Eddie, is that what you were driving at in your simplification attempt?

-- 
error compiling committee.c: too many arguments to function

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

[parent not found: <467AA7FF.5000701-atKUWr5tajBWk0Htik3J/w@public.gmane.org>]

* Re: In kernel PIC support: kernel patch
       [not found]     ` <467AA7FF.5000701-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-06-22  2:09       ` Dong, Eddie
       [not found]         ` <10EA09EFD8728347A513008B6B0DA77A01A56750-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Dong, Eddie @ 2007-06-22  2:09 UTC (permalink / raw)
  To: Avi Kivity, Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Avi Kivity wrote:
> Gregory Haskins wrote:
>> 1) KVM_ISA_INTERRUPT:  In level-1 mode, this API allows the userspace
>> pic to dispatch a vector to the kernel.  In level-2 mode, this allows
>> any userspace app to tickle an isa based irq line (which feeds into
>> the inputs of the PIC and IOAPIC.  In other words, a level-2 based
>> userspace would substitute KVM_ISA_INTERRUPT for pic_set_irq().
>> 
>> 
> 
> Wait. Looks like APIs are changing meaning according to some mode.
> Confusion. 
> 
> How about:
> 
> KVM_INTERRUPT == inject this vector now
> KVM_EXTINT == drives the processor's interrupt pin (this is handled by
> the lapic code). input is a vector and type
> KVM_ISA_INTERRUPT == drives the irq inputs to the PIC. Input
> is line number.
> KVM_IOAPIC_INTERRUPT == drives the irq inputs to the IOAPIC. Input is
> line number. KVM_APIC_MESSAGE == do something to the apic
> 
> That way, there is no mode (at least in the code), and the APIs keep
> their meaning. 
> 
> Note that if an irq line is wired to both the ioapic and the
> PIC, we may
> be better off having a single ioctl that can combine the ISA
> and IOAPIC
> variants (driving two inputs at once).
> 
> Eddie, is that what you were driving at in your simplification
> attempt? 
> 
Yes, some minor thing:
KVM_ISA_INTERRUPT: Per VM I/F, and need an irq line and irq level.
KVM_IOAPIC_INTERRUPT: Per VM I/F, may also need irq level for level
triggered irq.
KVM_APIC_MESSAGE: Per VCPU I/F

Per step by step approach way, KVM_EXTINT (assume for LINT0/LINT1) can
be deferred to future implementation.

The PIC patch currently implemented KVM_ISA_INTERRUPT (original named
KVM_SET_ISA_IRQ_LEVEL). Need to confirm for the name considering level
information in the cmmand.

thx,eddie

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

[parent not found: <10EA09EFD8728347A513008B6B0DA77A01A56750-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>]

* Re: In kernel PIC support: kernel patch
       [not found]         ` <10EA09EFD8728347A513008B6B0DA77A01A56750-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2007-06-22 10:32           ` Avi Kivity
  0 siblings, 0 replies; 57+ messages in thread
From: Avi Kivity @ 2007-06-22 10:32 UTC (permalink / raw)
  To: Dong, Eddie; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Dong, Eddie wrote:
> Avi Kivity wrote:
>   
>> Gregory Haskins wrote:
>>     
>>> 1) KVM_ISA_INTERRUPT:  In level-1 mode, this API allows the userspace
>>> pic to dispatch a vector to the kernel.  In level-2 mode, this allows
>>> any userspace app to tickle an isa based irq line (which feeds into
>>> the inputs of the PIC and IOAPIC.  In other words, a level-2 based
>>> userspace would substitute KVM_ISA_INTERRUPT for pic_set_irq().
>>>
>>>
>>>       
>> Wait. Looks like APIs are changing meaning according to some mode.
>> Confusion. 
>>
>> How about:
>>
>> KVM_INTERRUPT == inject this vector now
>> KVM_EXTINT == drives the processor's interrupt pin (this is handled by
>> the lapic code). input is a vector and type
>> KVM_ISA_INTERRUPT == drives the irq inputs to the PIC. Input
>> is line number.
>> KVM_IOAPIC_INTERRUPT == drives the irq inputs to the IOAPIC. Input is
>> line number. KVM_APIC_MESSAGE == do something to the apic
>>
>> That way, there is no mode (at least in the code), and the APIs keep
>> their meaning. 
>>
>> Note that if an irq line is wired to both the ioapic and the
>> PIC, we may
>> be better off having a single ioctl that can combine the ISA
>> and IOAPIC
>> variants (driving two inputs at once).
>>
>> Eddie, is that what you were driving at in your simplification
>> attempt? 
>>
>>     
> Yes, some minor thing:
> KVM_ISA_INTERRUPT: Per VM I/F, and need an irq line and irq level.
> KVM_IOAPIC_INTERRUPT: Per VM I/F, may also need irq level for level
> triggered irq.
> KVM_APIC_MESSAGE: Per VCPU I/F
>
> Per step by step approach way, KVM_EXTINT (assume for LINT0/LINT1) can
> be deferred to future implementation.
>
> The PIC patch currently implemented KVM_ISA_INTERRUPT (original named
> KVM_SET_ISA_IRQ_LEVEL). Need to confirm for the name considering level
> information in the cmmand.
>   

If we add a chip ID, then KVM_SET_IRQ_LEVEL is best.  If not, then
KVM_SET_ISA_IRQ_LEVEL is better.


-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: In kernel PIC support: kernel patch
@ 2007-06-21 14:01 Gregory Haskins
       [not found] ` <467A4C970200005A000262FD-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Gregory Haskins @ 2007-06-21 14:01 UTC (permalink / raw)
  To: eddie.dong-ral2JQCrhuEAvxtiuMwx3w
  Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Thu, 2007-06-21 at 13:31 +0000, Gregory Haskins wrote:
> On Thu, 2007-06-21 at 21:02 +0800, Dong, Eddie wrote:
> > 
> > Achitectually not only that. A premature IRR->ISR will cause guest OS
> > confuse in many place. A guest (say BIOS) may turn from interrupt
> > enabled mode to polling mode which polls IRR to detect if 
> > there is pending IRQ to handle. In this case we have trouble.
> 
> I completely agree.  But what I am saying is that I can eliminate the
> premature IRR->ISR with the change I am proposing.

After thinking about this some more, I think I finally understand your
point.  I was thinking that you were objecting to the multiple unacked
vectors in ISR.  But really this could be a problem for *any* unacked
vectors?  Hmm..if this is the case, you were right and I was wrong.  My
bad :)

So assuming this newly enlightened position is true, I think this means
we have a choice:

1) Drop support for mixed "level-1" mode and move the PIC to the kernel
now as Eddie is doing

2) Keep the level-1/2 distinction, and add support for making sure that
once a vector is acked in the PIC, it truely gets put into service
immediately.

I can think of a really simple interface for (2).  All we really need to
do is 

a) go back to synchronous injection (as previously suggested)
b) promote kvm_cpuirq_extint to be higher than nmis and localints.

This will ensure the vector is immediately placed into service even over
in-kernel sources.  Deferred interrupts would still take a higher
priority, but we would never intack the pic if deferred interrupts were
pending since irq.pending would be true.

Thoughts?
-Greg

> 
> > 
> > 
> > > However, as I mentioned we can fix that issue with just a few new
> > > lines of 
> > > code and by
> > > reverting the userspace injection model to the one used in prior
> > > releases without having to implement an entire in-kernel PIC to do it.
> > 
> > In-kernel PIC gives us a full in kernel irqchip solution and performance
> > gain. Some OSes may use PIC only.
> 
> Agreed.  I will answer this down below with the level-0/1/2 question
> since they are related.
> 
> 
> > 
> > > 
> > > I think moving the PIC into the kernel has the advantage of allowing
> > > us to move the PIT into the kernel too (which is huge, IMHO), but
> > 
> > Not very big, I just want to do one by one. we have done the code
> > in Xen already.
> > 
> > > that is its sole advantage.  Don't get me wrong: I am in favor of
> > > doing it. However, I wanted to point out that this particular
> > > solution to the problem you found essentially is invalidating
> > > "level-1" mode by only supporting level-0 or level-2, not fixing it. 
> > 
> > Not exactly understand level-0/1/2 meaning? Do u mean we mixed 
> > irqchip mode is a feature requirement? I didn't see the necessity, 
> > maybe I am wrong. But isn't both PIC and APIC in kernel much
> > nature and simple for us?
> 
> At one point there was a debate between moving just the LAPIC, or moving
> everything (LAPIC/PIC/IOAPIC/PIT).  Avi suggested that we start with
> just the LAPIC and see where we were at.  Because we need to remain
> forward and backwards ABI compatible, I added an ioctl for setting the
> in-kernel-pic level.  "0" is backwards compatible mode (everything in
> userspace).  "1" is LAPIC only. "2" is presumable everything in the
> kernel, but today only 0 and 1 are the only two defined so there could
> be more than 3 levels someday if we wanted.
> 
> > 
> > If you really think supporting mixed irqchip mode is a must
> 
> I will leave it to Avi to decide since he implicitly suggested it.  But
> suffice to say, if we *dont* want to support it we will need to get the
> other in-kernel stuff into the lapic branch in its entirety before it
> can be merged so we dont create an ABI issue.
> 
> 
> > , we 
> > need to introduce an intack I/F to notify user level irqchip if the irq 
> > is really ACKed or not. We can do that but I doubt the necessity.
> 
> This is where we disagree, as I know you have mentioned this before.
> The solution I am proposing (which reverts the userspace injection
> method) would not require this ACK that you mention.  The reason is that
> the synchronous nature of the injection takes care of things naturally.
> Here is how:
> 
> With sync-injection, we can only move a vector from the IRR->ISR if
> RFLAGS.IF and !irq.pending.  Once we inject, the vector will move to ISR
> and get queued in kernel (thereby setting an irq.pending bit).  It may
> not inject to the guest immediately, but thats ok because no new vectors
> will be acked in the PIC until that previous vector is processed
> (because irq.pending will remain set).  Therefore, we will never
> prematurely move IRR->ISR until the previous vector really is
> "in-service".  Does this make sense?
> 
> My suggestion assumes that a single vector can stay in ISR without
> actually being in service without ill effect (Todays v9 code could
> presumably have multiple vectors in ISR without being in service, which
> I admit is bogus).  If this assumption is false, then I agree with you
> that we need extra intack I/F.  What are you thoughts here?
> 
> 
> > 
> > > Perhaps we are "ok" with that, but I was under the distinct
> > > impression from Avi et. al. that these variable levels of support
> > > were a design goal for debugging purposes. 
> > > 
> > > I would prefer that we just fix level-1 mode with the changes I
> > > mentioned instead of just disabling it (in addition to adding level-2
> > > mode as Eddie is working on). 
> > > 
> > >> 
> > >> This patch fixes this by introducing new VM level IOCTL
> > >> KVM_SET_ISA_IRQ_LEVEL & KVM_CREATE_PIC to avoid the ABI hole.  The
> > >> original KVM_INTERRUPT is still there to be backward compatible.
> > >> 
> > >> WIth this patch, old Qemu, new qemu (after this patch), old kvm, new
> > >> kvm can work in any combination. Both user level code and kernel
> > >> code will automatically decide the irq source base on irqchip
> > >> location (user or kernel). 
> > >> 
> > >> Some known issues (TODO):
> > >> 	1: SVM support is on going.
> > >> 	The code for VMX to inject irq is same with Xen now since the
> > >> situation is same now (irqchip in hyprevisor), SVM code should be
> > >> able to directly reuse from Xen too.
> > >> 	2: live migration break.
> > >> 	3: Temply APIC support is removed in CPUID to wait for the merge
> > >> of in kernel APIC patch
> > > 
> > > Note that you will need more than just the APIC patch.  My patch only
> > > moves the LAPIC down.  The IOAPIC and 8259s are still in userspace.
> > > Your patch only moves the 8259s down.  This means there is a gap where
> > > the IOAPIC used to be that still needs to be addressed.
> > 
> > Why do u think I/O APIC must be moved down too? A new IOCTL like
> > KVM_IOAPIC_MESSAGE can solve the interface issue IMO and quit 
> > natual.
> 
> I suppose, but it somewhat defeats the purpose IMO.  Every pin in the
> 8259 that gets tickled implicitly means an IOAPIC pin was tickled also.
> Do we really want to go to userspace for that?  Essentially what happens
> then is the PIT ends up having to go through userspace for every tick at
> that point.  On the flip side, the IOAPIC model is very simple so I
> think it makes more sense to move it one to one with the PIC.
> 
> Regards,
> -Greg
> 
> 


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

[parent not found: <467A4C970200005A000262FD-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>]

* Re: In kernel PIC support: kernel patch
       [not found] ` <467A4C970200005A000262FD-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
@ 2007-06-21 14:57   ` Dong, Eddie
  0 siblings, 0 replies; 57+ messages in thread
From: Dong, Eddie @ 2007-06-21 14:57 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Gregory Haskins wrote:

> So assuming this newly enlightened position is true, I think this
> means we have a choice: 
> 
> 1) Drop support for mixed "level-1" mode and move the PIC to the
> kernel now as Eddie is doing 
> 
> 2) Keep the level-1/2 distinction, and add support for making sure
> that once a vector is acked in the PIC, it truely gets put into
> service immediately. 
> 
> I can think of a really simple interface for (2).  All we
> really need to
> do is
> 
> a) go back to synchronous injection (as previously suggested)

For level-2 support, that means no PIC/APIC in user space, it is the
device want to assert/dessert an irq line request. There is no notion
of injecting IRQ. That is why I added new APIs.
Even with level-1 in consideration, we still need these new APIs to
support level-2. A device such as kerboard could frequently assert
and dessert the irq line. A single ker strobe will see 5-10 dessert 
request from device model and 1 assert request.

So new APIs are a must IMO.

thx, eddie

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: In kernel PIC support: kernel patch
@ 2007-06-21 13:31 Gregory Haskins
       [not found] ` <467A45690200005A000262F0-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Gregory Haskins @ 2007-06-21 13:31 UTC (permalink / raw)
  To: eddie.dong-ral2JQCrhuEAvxtiuMwx3w
  Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Thu, 2007-06-21 at 21:02 +0800, Dong, Eddie wrote:
> Gregory Haskins wrote:
> > On Wed, 2007-06-20 at 15:43 +0800, Dong, Eddie wrote:
> >> As we discussed, if we move APIC to kernel while leaving PIC in user
> >> level, we have ABI level holes if the interrupt a user level qemu
> >> injected is not immediately injected to guest for some reason. \
> > 
> > Hi Eddie,
> > 
> > I know you worked hard on this, but I still do not agree with this
> > statement.  I agree that v9 introduced an issue related to a)
> > premature IRR->ISR assignment, and b) lack of support for clearing
> > pending ISR with IMR changes (and thank you for pointing it out!). 
> 
> Achitectually not only that. A premature IRR->ISR will cause guest OS
> confuse in many place. A guest (say BIOS) may turn from interrupt
> enabled mode to polling mode which polls IRR to detect if 
> there is pending IRQ to handle. In this case we have trouble.

I completely agree.  But what I am saying is that I can eliminate the
premature IRR->ISR with the change I am proposing.

> 
> 
> > However, as I mentioned we can fix that issue with just a few new
> > lines of 
> > code and by
> > reverting the userspace injection model to the one used in prior
> > releases without having to implement an entire in-kernel PIC to do it.
> 
> In-kernel PIC gives us a full in kernel irqchip solution and performance
> gain. Some OSes may use PIC only.

Agreed.  I will answer this down below with the level-0/1/2 question
since they are related.

> 
> > 
> > I think moving the PIC into the kernel has the advantage of allowing
> > us to move the PIT into the kernel too (which is huge, IMHO), but
> 
> Not very big, I just want to do one by one. we have done the code
> in Xen already.
> 
> > that is its sole advantage.  Don't get me wrong: I am in favor of
> > doing it. However, I wanted to point out that this particular
> > solution to the problem you found essentially is invalidating
> > "level-1" mode by only supporting level-0 or level-2, not fixing it. 
> 
> Not exactly understand level-0/1/2 meaning? Do u mean we mixed 
> irqchip mode is a feature requirement? I didn't see the necessity, 
> maybe I am wrong. But isn't both PIC and APIC in kernel much
> nature and simple for us?

At one point there was a debate between moving just the LAPIC, or moving
everything (LAPIC/PIC/IOAPIC/PIT).  Avi suggested that we start with
just the LAPIC and see where we were at.  Because we need to remain
forward and backwards ABI compatible, I added an ioctl for setting the
in-kernel-pic level.  "0" is backwards compatible mode (everything in
userspace).  "1" is LAPIC only. "2" is presumable everything in the
kernel, but today only 0 and 1 are the only two defined so there could
be more than 3 levels someday if we wanted.

> 
> If you really think supporting mixed irqchip mode is a must

I will leave it to Avi to decide since he implicitly suggested it.  But
suffice to say, if we *dont* want to support it we will need to get the
other in-kernel stuff into the lapic branch in its entirety before it
can be merged so we dont create an ABI issue.

> , we 
> need to introduce an intack I/F to notify user level irqchip if the irq 
> is really ACKed or not. We can do that but I doubt the necessity.

This is where we disagree, as I know you have mentioned this before.
The solution I am proposing (which reverts the userspace injection
method) would not require this ACK that you mention.  The reason is that
the synchronous nature of the injection takes care of things naturally.
Here is how:

With sync-injection, we can only move a vector from the IRR->ISR if
RFLAGS.IF and !irq.pending.  Once we inject, the vector will move to ISR
and get queued in kernel (thereby setting an irq.pending bit).  It may
not inject to the guest immediately, but thats ok because no new vectors
will be acked in the PIC until that previous vector is processed
(because irq.pending will remain set).  Therefore, we will never
prematurely move IRR->ISR until the previous vector really is
"in-service".  Does this make sense?

My suggestion assumes that a single vector can stay in ISR without
actually being in service without ill effect (Todays v9 code could
presumably have multiple vectors in ISR without being in service, which
I admit is bogus).  If this assumption is false, then I agree with you
that we need extra intack I/F.  What are you thoughts here?

> 
> > Perhaps we are "ok" with that, but I was under the distinct
> > impression from Avi et. al. that these variable levels of support
> > were a design goal for debugging purposes. 
> > 
> > I would prefer that we just fix level-1 mode with the changes I
> > mentioned instead of just disabling it (in addition to adding level-2
> > mode as Eddie is working on). 
> > 
> >> 
> >> This patch fixes this by introducing new VM level IOCTL
> >> KVM_SET_ISA_IRQ_LEVEL & KVM_CREATE_PIC to avoid the ABI hole.  The
> >> original KVM_INTERRUPT is still there to be backward compatible.
> >> 
> >> WIth this patch, old Qemu, new qemu (after this patch), old kvm, new
> >> kvm can work in any combination. Both user level code and kernel
> >> code will automatically decide the irq source base on irqchip
> >> location (user or kernel). 
> >> 
> >> Some known issues (TODO):
> >> 	1: SVM support is on going.
> >> 	The code for VMX to inject irq is same with Xen now since the
> >> situation is same now (irqchip in hyprevisor), SVM code should be
> >> able to directly reuse from Xen too.
> >> 	2: live migration break.
> >> 	3: Temply APIC support is removed in CPUID to wait for the merge
> >> of in kernel APIC patch
> > 
> > Note that you will need more than just the APIC patch.  My patch only
> > moves the LAPIC down.  The IOAPIC and 8259s are still in userspace.
> > Your patch only moves the 8259s down.  This means there is a gap where
> > the IOAPIC used to be that still needs to be addressed.
> 
> Why do u think I/O APIC must be moved down too? A new IOCTL like
> KVM_IOAPIC_MESSAGE can solve the interface issue IMO and quit 
> natual.

I suppose, but it somewhat defeats the purpose IMO.  Every pin in the
8259 that gets tickled implicitly means an IOAPIC pin was tickled also.
Do we really want to go to userspace for that?  Essentially what happens
then is the PIT ends up having to go through userspace for every tick at
that point.  On the flip side, the IOAPIC model is very simple so I
think it makes more sense to move it one to one with the PIC.

Regards,
-Greg

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

[parent not found: <467A45690200005A000262F0-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>]

* Re: In kernel PIC support: kernel patch
       [not found] ` <467A45690200005A000262F0-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
@ 2007-06-21 14:28   ` Dong, Eddie
  0 siblings, 0 replies; 57+ messages in thread
From: Dong, Eddie @ 2007-06-21 14:28 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

>> If you really think supporting mixed irqchip mode is a must
> 
> I will leave it to Avi to decide since he implicitly suggested it. 
> But suffice to say, if we *dont* want to support it we will need to
> get the other in-kernel stuff into the lapic branch in its entirety
> before it can be merged so we dont create an ABI issue.
> 

Agree, anyway the device model side code (majority for both PIC
patch and APIC patch) can be reused. The only debate is the I/F 
issue. 

> 
>> , we
>> need to introduce an intack I/F to notify user level irqchip if the
>> irq is really ACKed or not. We can do that but I doubt the necessity.
> 
> This is where we disagree, as I know you have mentioned this before.
> The solution I am proposing (which reverts the userspace injection
> method) would not require this ACK that you mention.  The
> reason is that
> the synchronous nature of the injection takes care of things
> naturally. Here is how: 
> 
> With sync-injection, we can only move a vector from the IRR->ISR if
> RFLAGS.IF and !irq.pending.  Once we inject, the vector will
> move to ISR
> and get queued in kernel (thereby setting an irq.pending bit).  It may

That means IRR->ISR happen, right? If yes, the question I raised
is still valid, if not when will be moved?


> not inject to the guest immediately, but thats ok because no
> new vectors
> will be acked in the PIC until that previous vector is processed

That violates the IRQ priority. When the previous irq is still queued, a

new irq may be generated which may have higher priority than the 
previous one. 

> (because irq.pending will remain set).  Therefore, we will never
> prematurely move IRR->ISR until the previous vector really is
> "in-service".  Does this make sense?
>
> My suggestion assumes that a single vector can stay in ISR without
> actually being in service without ill effect (Todays v9 code could
> presumably have multiple vectors in ISR without being in service,
> which I admit is bogus).  If this assumption is false, then I agree
> with you that we need extra intack I/F.  What are you thoughts here?
> 

Yes, only one but this violates irq priority and my concern is still
valid 
if IRR->ISR happened in previous statement.

> 
>>> 
>>> Note that you will need more than just the APIC patch.  My patch
>>> only moves the LAPIC down.  The IOAPIC and 8259s are still in
>>> userspace. Your patch only moves the 8259s down.  This means there
>>> is a gap where the IOAPIC used to be that still needs to be
>>> addressed. 
>> 
>> Why do u think I/O APIC must be moved down too? A new IOCTL like
>> KVM_IOAPIC_MESSAGE can solve the interface issue IMO and quit
>> natual.
> 
> I suppose, but it somewhat defeats the purpose IMO.  Every pin in the
> 8259 that gets tickled implicitly means an IOAPIC pin was tickled
> also. Do we really want to go to userspace for that?  Essentially

User space can handle this, go to IOAPIC first and then decide to go to 
in kernel PIC or not. Here the assumption is that an irq line is either
serviced
by PIC or IOAPIC, it will never be serviced by both.

So no back to user space.

> what happens
> then is the PIT ends up having to go through userspace for
> every tick at

After PIT is moved to kernel, no this issue. For now, PIT is in user
level.


> that point.  On the flip side, the IOAPIC model is very simple so I
> think it makes more sense to move it one to one with the PIC.

I agree too, either are OK and just depend on decision. 
PIC device model is about 400-500
lines of code, IOAPIC should be less than that, APIC is about 800-900
lines of code in Xen but 1500-1800 lines in V09 patch.

size of irq abstraction layer could be sperated from above device model
for now.

Thx,eddie

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: In kernel PIC support: kernel patch
@ 2007-06-21 12:36 Gregory Haskins
       [not found] ` <467A38720200005A000262C8-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Gregory Haskins @ 2007-06-21 12:36 UTC (permalink / raw)
  To: eddie.dong-ral2JQCrhuEAvxtiuMwx3w
  Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Wed, 2007-06-20 at 15:43 +0800, Dong, Eddie wrote:
> As we discussed, if we move APIC to kernel while leaving PIC in user
> level, we have ABI level holes if the interrupt a user level qemu
> injected is not immediately injected to guest for some reason. \

Hi Eddie,

I know you worked hard on this, but I still do not agree with this
statement.  I agree that v9 introduced an issue related to a) premature
IRR->ISR assignment, and b) lack of support for clearing pending ISR
with IMR changes (and thank you for pointing it out!).  However, as I
mentioned we can fix that issue with just a few new lines of code and by
reverting the userspace injection model to the one used in prior
releases without having to implement an entire in-kernel PIC to do it.

I think moving the PIC into the kernel has the advantage of allowing us
to move the PIT into the kernel too (which is huge, IMHO), but that is
its sole advantage.  Don't get me wrong: I am in favor of doing it.
However, I wanted to point out that this particular solution to the
problem you found essentially is invalidating "level-1" mode by only
supporting level-0 or level-2, not fixing it.  Perhaps we are "ok" with
that, but I was under the distinct impression from Avi et. al. that
these variable levels of support were a design goal for debugging
purposes.

I would prefer that we just fix level-1 mode with the changes I
mentioned instead of just disabling it (in addition to adding level-2
mode as Eddie is working on).

> 
> This patch fixes this by introducing new VM level IOCTL
> KVM_SET_ISA_IRQ_LEVEL & KVM_CREATE_PIC to avoid the ABI hole.  The
> original KVM_INTERRUPT is still there to be backward compatible.
> 
> WIth this patch, old Qemu, new qemu (after this patch), old kvm, new kvm
> can work in any combination. Both user level code and kernel code will
> automatically decide the irq source base on irqchip location (user or
> kernel).
> 
> Some known issues (TODO):
> 	1: SVM support is on going.
> 	The code for VMX to inject irq is same with Xen now since the
> situation is same now (irqchip in hyprevisor), SVM code should be able
> to directly reuse from Xen too.
> 	2: live migration break.
> 	3: Temply APIC support is removed in CPUID to wait for the merge
> of in kernel APIC patch

Note that you will need more than just the APIC patch.  My patch only
moves the LAPIC down.  The IOAPIC and 8259s are still in userspace.
Your patch only moves the 8259s down.  This means there is a gap where
the IOAPIC used to be that still needs to be addressed.


> 
> thx, eddie
> 
>     Signed-off-by: Yaozu (Eddie) Dong <eddie.dong-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> 
>  drivers/kvm/Makefile   |    2
>  drivers/kvm/i8259.c    |  435
> +++++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/kvm/kvm.h      |    7
>  drivers/kvm/kvm_main.c |   26 ++
>  drivers/kvm/virq.c     |   58 ++++++
>  drivers/kvm/virq.h     |   70 +++++++
>  drivers/kvm/vmx.c      |   84 ++++++++-
>  include/linux/kvm.h    |    9 +
>  8 files changed, 678 insertions(+), 13 deletions(-)
> 
> 
> diff --git a/drivers/kvm/Makefile b/drivers/kvm/Makefile
> index c0a789f..118c79b 100644
> --- a/drivers/kvm/Makefile
> +++ b/drivers/kvm/Makefile
> @@ -2,7 +2,7 @@
>  # Makefile for Kernel-based Virtual Machine module
>  #
>  
> -kvm-objs := kvm_main.o mmu.o x86_emulate.o
> +kvm-objs := kvm_main.o mmu.o x86_emulate.o i8259.o virq.o
>  obj-$(CONFIG_KVM) += kvm.o
>  kvm-intel-objs = vmx.o
>  obj-$(CONFIG_KVM_INTEL) += kvm-intel.o
> diff --git a/drivers/kvm/i8259.c b/drivers/kvm/i8259.c
> new file mode 100644
> index 0000000..9ba8c1e
> --- /dev/null
> +++ b/drivers/kvm/i8259.c
> @@ -0,0 +1,435 @@
> +/*
> + * QEMU 8259 interrupt controller emulation
> + *
> + * Copyright (c) 2003-2004 Fabrice Bellard
> + * Copyright (c) 2007 Intel Corperation
> + *
> + * Permission is hereby granted, free of charge, to any person
> obtaining a copy
> + * of this software and associated documentation files (the
> "Software"), to deal
> + * in the Software without restriction, including without limitation
> the rights
> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or
> sell
> + * copies of the Software, and to permit persons to whom the Software
> is
> + * furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be
> included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT
> SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
> OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> ARISING FROM,
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
> DEALINGS IN
> + * THE SOFTWARE.
> + * Authors:
> + *   Yaozu (Eddie) Dong <Eddie.dong-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> + *   Port from Qemu.
> + */
> +#include "virq.h"
> +#include <linux/mm.h>
> +
> +/* set irq level. If an edge is detected, then the IRR is set to 1 */
> +static inline void pic_set_irq1(struct pic_state *s, int irq, int
> level)
> +{
> +	int mask;
> +	mask = 1 << irq;
> +	if (s->elcr & mask) {
> +		/* level triggered */
> +		if (level) {
> +			s->irr |= mask;
> +			s->last_irr |= mask;
> +		} else {
> +			s->irr &= ~mask;
> +			s->last_irr &= ~mask;
> +		}
> +	} else {
> +		/* edge triggered */
> +		if (level) {
> +			if ((s->last_irr & mask) == 0)
> +				s->irr |= mask;
> +			s->last_irr |= mask;
> +		} else {
> +			s->last_irr &= ~mask;
> +		}
> +	}
> +}
> +
> +/* return the highest priority found in mask (highest = smallest
> +   number). Return 8 if no irq */
> +static inline int get_priority(struct pic_state *s, int mask)
> +{
> +	int priority;
> +	if (mask == 0)
> +		return 8;
> +	priority = 0;
> +	while ((mask & (1 << ((priority + s->priority_add) & 7))) == 0)
> +		priority++;
> +	return priority;
> +}
> +
> +/* return the pic wanted interrupt. return -1 if none */
> +static int pic_get_irq(struct pic_state *s)
> +{
> +	int mask, cur_priority, priority;
> +
> +	mask = s->irr & ~s->imr;
> +	priority = get_priority(s, mask);
> +	if (priority == 8)
> +		return -1;
> +	/* compute current priority. If special fully nested mode on the
> +	   master, the IRQ coming from the slave is not taken into
> account
> +	   for the priority computation. */
> +	mask = s->isr;
> +	if (s->special_fully_nested_mode && s ==
> &s->pics_state->pics[0])
> +		mask &= ~(1 << 2);
> +	cur_priority = get_priority(s, mask);
> +	if (priority < cur_priority) {
> +		/* higher priority found: an irq should be generated */
> +		return (priority + s->priority_add) & 7;
> +	} else {
> +		return -1;
> +	}
> +}
> +
> +/* raise irq to CPU if necessary. must be called every time the active
> +   irq may change */
> +static void pic_update_irq(struct pic_state2 *s)
> +{
> +	int irq2, irq;
> +
> +	/* first look at slave pic */
> +	irq2 = pic_get_irq(&s->pics[1]);
> +	if (irq2 >= 0) {
> +		/* if irq request by slave pic, signal master PIC */
> +		pic_set_irq1(&s->pics[0], 2, 1);
> +		pic_set_irq1(&s->pics[0], 2, 0);
> +	}
> +	/* look at requested irq */
> +	irq = pic_get_irq(&s->pics[0]);
> +	if (irq >= 0) {
> +		s->irq_request(s->irq_request_opaque, 1);
> +	}
> +}
> +
> +void pic_set_irq_new(void *opaque, int irq, int level)
> +{
> +	struct pic_state2 *s = opaque;
> +
> +	pic_set_irq1(&s->pics[irq >> 3], irq & 7, level);
> +	/* used for IOAPIC irqs */
> +	if (s->alt_irq_func)
> +		s->alt_irq_func(s->alt_irq_opaque, irq, level);
> +	pic_update_irq(s);
> +}
> +
> +/* acknowledge interrupt 'irq' */
> +static inline void pic_intack(struct pic_state *s, int irq)
> +{
> +	if (s->auto_eoi) {
> +		if (s->rotate_on_auto_eoi)
> +			s->priority_add = (irq + 1) & 7;
> +	} else {
> +		s->isr |= (1 << irq);
> +	}
> +	/* We don't clear a level sensitive interrupt here */
> +	if (!(s->elcr & (1 << irq)))
> +		s->irr &= ~(1 << irq);
> +}
> +
> +int pic_read_irq(struct pic_state2 *s)
> +{
> +	int irq, irq2, intno;
> +
> +	irq = pic_get_irq(&s->pics[0]);
> +	if (irq >= 0) {
> +		pic_intack(&s->pics[0], irq);
> +		if (irq == 2) {
> +			irq2 = pic_get_irq(&s->pics[1]);
> +			if (irq2 >= 0) {
> +				pic_intack(&s->pics[1], irq2);
> +			} else {
> +				/* spurious IRQ on slave controller */
> +				irq2 = 7;
> +			}
> +			intno = s->pics[1].irq_base + irq2;
> +			irq = irq2 + 8;
> +		} else {
> +			intno = s->pics[0].irq_base + irq;
> +		}
> +	} else {
> +		/* spurious IRQ on host controller */
> +		irq = 7;
> +		intno = s->pics[0].irq_base + irq;
> +	}
> +	pic_update_irq(s);
> +
> +	return intno;
> +}
> +
> +static void pic_reset(void *opaque)
> +{
> +	struct pic_state *s = opaque;
> +
> +	s->last_irr = 0;
> +	s->irr = 0;
> +	s->imr = 0;
> +	s->isr = 0;
> +	s->priority_add = 0;
> +	s->irq_base = 0;
> +	s->read_reg_select = 0;
> +	s->poll = 0;
> +	s->special_mask = 0;
> +	s->init_state = 0;
> +	s->auto_eoi = 0;
> +	s->rotate_on_auto_eoi = 0;
> +	s->special_fully_nested_mode = 0;
> +	s->init4 = 0;
> +	/* Note: ELCR is not reset */
> +}
> +
> +static void pic_ioport_write(void *opaque, uint32_t addr, uint32_t val)
> +{
> +	struct pic_state *s = opaque;
> +	int priority, cmd, irq;
> +
> +	addr &= 1;
> +	if (addr == 0) {
> +		if (val & 0x10) {
> +			/* init */
> +			pic_reset(s);
> +			/* deassert a pending interrupt */
> +			s->pics_state->irq_request(s->pics_state->
> +						   irq_request_opaque,
> 0);
> +			s->init_state = 1;
> +			s->init4 = val & 1;
> +			if (val & 0x02)
> +				printk(KERN_ERR "single mode not
> supported");
> +			if (val & 0x08)
> +				printk(KERN_ERR
> +				       "level sensitive irq not
> supported");
> +		} else if (val & 0x08) {
> +			if (val & 0x04)
> +				s->poll = 1;
> +			if (val & 0x02)
> +				s->read_reg_select = val & 1;
> +			if (val & 0x40)
> +				s->special_mask = (val >> 5) & 1;
> +		} else {
> +			cmd = val >> 5;
> +			switch (cmd) {
> +			case 0:
> +			case 4:
> +				s->rotate_on_auto_eoi = cmd >> 2;
> +				break;
> +			case 1:	/* end of interrupt */
> +			case 5:
> +				priority = get_priority(s, s->isr);
> +				if (priority != 8) {
> +					irq = (priority +
> s->priority_add) & 7;
> +					s->isr &= ~(1 << irq);
> +					if (cmd == 5)
> +						s->priority_add = (irq +
> 1) & 7;
> +					pic_update_irq(s->pics_state);
> +				}
> +				break;
> +			case 3:
> +				irq = val & 7;
> +				s->isr &= ~(1 << irq);
> +				pic_update_irq(s->pics_state);
> +				break;
> +			case 6:
> +				s->priority_add = (val + 1) & 7;
> +				pic_update_irq(s->pics_state);
> +				break;
> +			case 7:
> +				irq = val & 7;
> +				s->isr &= ~(1 << irq);
> +				s->priority_add = (irq + 1) & 7;
> +				pic_update_irq(s->pics_state);
> +				break;
> +			default:
> +				/* no operation */
> +				break;
> +			}
> +		}
> +	} else {
> +		switch (s->init_state) {
> +		case 0:
> +			/* normal mode */
> +			s->imr = val;
> +			pic_update_irq(s->pics_state);
> +			break;
> +		case 1:
> +			s->irq_base = val & 0xf8;
> +			s->init_state = 2;
> +			break;
> +		case 2:
> +			if (s->init4) {
> +				s->init_state = 3;
> +			} else {
> +				s->init_state = 0;
> +			}
> +			break;
> +		case 3:
> +			s->special_fully_nested_mode = (val >> 4) & 1;
> +			s->auto_eoi = (val >> 1) & 1;
> +			s->init_state = 0;
> +			break;
> +		}
> +	}
> +}
> +
> +static uint32_t pic_poll_read(struct pic_state *s, uint32_t addr1)
> +{
> +	int ret;
> +
> +	ret = pic_get_irq(s);
> +	if (ret >= 0) {
> +		if (addr1 >> 7) {
> +			s->pics_state->pics[0].isr &= ~(1 << 2);
> +			s->pics_state->pics[0].irr &= ~(1 << 2);
> +		}
> +		s->irr &= ~(1 << ret);
> +		s->isr &= ~(1 << ret);
> +		if (addr1 >> 7 || ret != 2)
> +			pic_update_irq(s->pics_state);
> +	} else {
> +		ret = 0x07;
> +		pic_update_irq(s->pics_state);
> +	}
> +
> +	return ret;
> +}
> +
> +static uint32_t pic_ioport_read(void *opaque, uint32_t addr1)
> +{
> +	struct pic_state *s = opaque;
> +	unsigned int addr;
> +	int ret;
> +
> +	addr = addr1;
> +	addr &= 1;
> +	if (s->poll) {
> +		ret = pic_poll_read(s, addr1);
> +		s->poll = 0;
> +	} else {
> +		if (addr == 0) {
> +			if (s->read_reg_select)
> +				ret = s->isr;
> +			else
> +				ret = s->irr;
> +		} else {
> +			ret = s->imr;
> +		}
> +	}
> +	return ret;
> +}
> +
> +static void elcr_ioport_write(void *opaque, uint32_t addr, uint32_t
> val)
> +{
> +	struct pic_state *s = opaque;
> +	s->elcr = val & s->elcr_mask;
> +}
> +
> +static uint32_t elcr_ioport_read(void *opaque, uint32_t addr1)
> +{
> +	struct pic_state *s = opaque;
> +	return s->elcr;
> +}
> +
> +static int picdev_in_range(struct kvm_io_device *this, gpa_t addr)
> +{
> +	switch (addr) {
> +		/* Master PIC */
> +	case 0x20:
> +	case 0x21:
> +		/* Slave PIC */
> +	case 0xa0:
> +	case 0xa1:
> +		/* elcr0  & 1 */
> +	case 0x4d0:
> +	case 0x4d1:
> +		return 1;
> +	default:
> +		return 0;
> +	}
> +}
> +
> +static void picdev_write(struct kvm_io_device *this,
> +			 gpa_t addr, int len, const void *val)
> +{
> +	struct pic_state2 *s = this->private;
> +	unsigned char data = *(unsigned char *)val;
> +
> +	if (len != 1) {
> +		printk(KERN_ERR "PIC: non byte write\n");
> +		return;
> +	}
> +	switch (addr) {
> +	case 0x20:
> +	case 0x21:
> +	case 0xa0:
> +	case 0xa1:
> +		pic_ioport_write(&s->pics[addr >> 7], addr, data);
> +		break;
> +	case 0x4d0:
> +	case 0x4d1:
> +		elcr_ioport_write(&s->pics[addr & 1], addr, data);
> +		break;
> +	}
> +}
> +
> +static void picdev_read(struct kvm_io_device *this,
> +			gpa_t addr, int len, void *val)
> +{
> +	struct pic_state2 *s = this->private;
> +	unsigned char data = 0;
> +
> +	if (len != 1) {
> +		printk(KERN_ERR "PIC: non byte read\n");
> +		return;
> +	}
> +	switch (addr) {
> +	case 0x20:
> +	case 0x21:
> +	case 0xa0:
> +	case 0xa1:
> +		data = pic_ioport_read(&s->pics[addr >> 7], addr);
> +		break;
> +	case 0x4d0:
> +	case 0x4d1:
> +		data = elcr_ioport_read(&s->pics[addr & 1], addr);
> +		break;
> +	}
> +	*(unsigned char *)val = data;
> +}
> +
> +/* callback when PIC0 irq status changed */
> +static void pic_irq_request(void *opaque, int level)
> +{
> +	struct kvm *kvm = opaque;
> +
> +	int_output(kvm->vpic) = level;
> +}
> +
> +struct pic_state2 *create_pic(struct kvm *kvm)
> +{
> +	struct pic_state2 *s;
> +	s = kzalloc(sizeof(struct pic_state2), GFP_KERNEL);
> +	if (!s)
> +		return NULL;
> +	s->pics[0].elcr_mask = 0xf8;
> +	s->pics[1].elcr_mask = 0xde;
> +	s->irq_request = pic_irq_request;
> +	s->irq_request_opaque = kvm;
> +	s->pics[0].pics_state = s;
> +	s->pics[1].pics_state = s;
> +
> +	/* Initialize PIO device */
> +	s->dev.read = picdev_read;
> +	s->dev.write = picdev_write;
> +	s->dev.in_range = picdev_in_range;
> +	s->dev.private = s;
> +	kvm_io_bus_register_dev(&kvm->pio_bus, &s->dev);
> +	return s;
> +}

If you can, please take a look at that 8259 patch that I sent you.  Like
yours, its based on qemu, but I cleaned it up quite a bit.  I am not a
huge fan of the way the QEMU code smashes models together like this
(e.g. two 8259s hard coded together).  I much prefer to have clean lines
of delineation (e.g. model a single 8259, and then reuse two models in a
cascaded configuration).  It could just be me ;), but I think this
results in a cleaner and easier to maintain code base with fewer bugs in
the long term.



> diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
> index a7c5e6b..1e2d3ba 100644
> --- a/drivers/kvm/kvm.h
> +++ b/drivers/kvm/kvm.h
> @@ -456,8 +456,15 @@ struct kvm {
>  	struct file *filp;
>  	struct kvm_io_bus mmio_bus;
>  	struct kvm_io_bus pio_bus;
> +	void *vpic;
>  };
>  
> +#define kernel_pic(kvm) ((kvm)->vpic)
> +static inline int irqchip_in_kernel(struct kvm *kvm)
> +{
> +	return (kernel_pic(kvm) != 0);
> +}
> +
>  struct descriptor_table {
>  	u16 limit;
>  	unsigned long base;
> diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
> index 7826f16..25f99fc 100644
> --- a/drivers/kvm/kvm_main.c
> +++ b/drivers/kvm/kvm_main.c
> @@ -16,6 +16,7 @@
>   */
>  
>  #include "kvm.h"
> +#include "virq.h"
>  
>  #include <linux/kvm.h>
>  #include <linux/module.h>
> @@ -476,6 +477,8 @@ static void kvm_destroy_vm(struct kvm *kvm)
>  	spin_lock(&kvm_lock);
>  	list_del(&kvm->vm_list);
>  	spin_unlock(&kvm_lock);
> +	if (kernel_pic(kvm))
> +		kfree(kernel_pic(kvm));
>  	kvm_io_bus_destroy(&kvm->pio_bus);
>  	kvm_io_bus_destroy(&kvm->mmio_bus);
>  	kvm_free_vcpus(kvm);
> @@ -1372,7 +1375,8 @@ EXPORT_SYMBOL_GPL(emulate_instruction);
>  
>  int kvm_emulate_halt(struct kvm_vcpu *vcpu)
>  {
> -	if (vcpu->irq_summary)
> +	if (vcpu->irq_summary ||
> +           (irqchip_in_kernel(vcpu->kvm) && cpu_has_interrupt(vcpu)))
>  		return 1;
>  
>  	vcpu->run->exit_reason = KVM_EXIT_HLT;
> @@ -2855,6 +2859,26 @@ static long kvm_vm_ioctl(struct file *filp,
>  			goto out;
>  		break;
>  	}
> +	case KVM_CREATE_PIC:
> +		r = -ENOMEM;
> +		kernel_pic(kvm) = create_pic(kvm);
> +		if (kernel_pic(kvm)) {
> +			r = 0;
> +		}
> +		break;
> +	case KVM_SET_ISA_IRQ_LEVEL: {
> +		struct kvm_irq_level irq_event;
> +		r = -EFAULT;
> +		if (copy_from_user(&irq_event, argp, sizeof irq_event))
> +			goto out;
> +		if (irqchip_in_kernel(kvm)) {
> +			pic_set_irq_new(kernel_pic(kvm),
> +					irq_event.irq,
> +					irq_event.level);
> +			r = 0;
> +		}
> +		break;
> +	}
>  	default:
>  		;
>  	}
> diff --git a/drivers/kvm/virq.c b/drivers/kvm/virq.c
> new file mode 100644
> index 0000000..a807296
> --- /dev/null
> +++ b/drivers/kvm/virq.c
> @@ -0,0 +1,58 @@
> +/*
> + * virq.c: API for in kernel interrupt controller
> + * Copyright (c) 2007, Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify
> it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but
> WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
> or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
> License for
> + * more details.
> + *
> + * You should have received a copy of the GNU General Public License
> along with
> + * this program; if not, write to the Free Software Foundation, Inc.,
> 59 Temple
> + * Place - Suite 330, Boston, MA 02111-1307 USA.
> + * Authors:
> + *   Yaozu (Eddie) Dong <Eddie.dong-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> + *
> + */
> +
> +#include "kvm.h"
> +#include "virq.h"
> +
> +/*
> + * check if there is pending interrupt without
> + * intack.
> + */
> +int cpu_has_interrupt(struct kvm_vcpu *v)
> +{
> +	struct pic_state2 *s = kernel_pic(v->kvm);
> +
> +	/* PIC */
> +	if (int_output(s))
> +		return 1;
> +	/* TODO: APIC */
> +	return 0;
> +}
> +
> +EXPORT_SYMBOL_GPL(cpu_has_interrupt);
> +
> +/*
> + * Read pending interrupt vector and intack.
> + */
> +int cpu_get_interrupt(struct kvm_vcpu *v)
> +{
> +	struct pic_state2 *s = kernel_pic(v->kvm);
> +	int vector;
> +
> +	int_output(s) = 0;
> +	vector = pic_read_irq(s);
> +	if (vector != -1)
> +		return vector;
> +	/* TODO: APIC */
> +	return -1;
> +}
> +
> +EXPORT_SYMBOL_GPL(cpu_get_interrupt);

Suffice to say I have a problem with this particular injection
methodology, which is why I redesigned it in the APIC series.  Is there
a particular reason you are choosing to propagate it from QEMU to KVM?
I understand that this may create synergy between the Xen and KVM code
bases. which is nice from a bug-fix-sharing perspective.  But, in my
mind that reason alone should not usurp good design.  Lets get this
right while we can (where "right" is what everyone here agrees to, not
my design per se ;)  

> diff --git a/drivers/kvm/virq.h b/drivers/kvm/virq.h
> new file mode 100644
> index 0000000..ed467dc
> --- /dev/null
> +++ b/drivers/kvm/virq.h
> @@ -0,0 +1,70 @@
> +/*
> + * virq.h: in kernel interrupt controller related definitions
> + * Copyright (c) 2007, Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify
> it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but
> WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
> or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
> License for
> + * more details.
> + *
> + * You should have received a copy of the GNU General Public License
> along with
> + * this program; if not, write to the Free Software Foundation, Inc.,
> 59 Temple
> + * Place - Suite 330, Boston, MA 02111-1307 USA.
> + * Authors:
> + *   Yaozu (Eddie) Dong <Eddie.dong-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> + *
> + */
> +
> +#ifndef __VIRQ_H
> +#define __VIRQ_H
> +
> +#include "kvm.h"
> +
> +typedef void SetIRQFunc(void *opaque, int irq_num, int level);
> +typedef void IRQRequestFunc(void *opaque, int level);
> +
> +struct PicState2;
> +struct pic_state {
> +	uint8_t last_irr;	/* edge detection */
> +	uint8_t irr;		/* interrupt request register */
> +	uint8_t imr;		/* interrupt mask register */
> +	uint8_t isr;		/* interrupt service register */
> +	uint8_t priority_add;	/* highest irq priority */
> +	uint8_t irq_base;
> +	uint8_t read_reg_select;
> +	uint8_t poll;
> +	uint8_t special_mask;
> +	uint8_t init_state;
> +	uint8_t auto_eoi;
> +	uint8_t rotate_on_auto_eoi;
> +	uint8_t special_fully_nested_mode;
> +	uint8_t init4;		/* true if 4 byte init */
> +	uint8_t elcr;		/* PIIX edge/trigger selection */
> +	uint8_t elcr_mask;
> +	struct pic_state2 *pics_state;
> +};
> +
> +struct pic_state2 {
> +	/* 0 is master pic, 1 is slave pic */
> +	struct pic_state pics[2];
> +	IRQRequestFunc *irq_request;
> +	void *irq_request_opaque;
> +	/* IOAPIC callback support */
> +	SetIRQFunc *alt_irq_func;
> +	void *alt_irq_opaque;
> +	int output;		/* intr from master PIC */
> +	struct kvm_io_device dev;
> +};
> +
> +#define int_output(s)		(((struct pic_state2 *)s)->output)
> +struct pic_state2 *create_pic(struct kvm *kvm);
> +void pic_set_irq_new(void *opaque, int irq, int level);
> +int pic_read_irq(struct pic_state2 *s);
> +int cpu_get_interrupt(struct kvm_vcpu *v);
> +int cpu_has_interrupt(struct kvm_vcpu *v);
> +
> +#endif
> diff --git a/drivers/kvm/vmx.c b/drivers/kvm/vmx.c
> index b47ddcc..cef0d64 100644
> --- a/drivers/kvm/vmx.c
> +++ b/drivers/kvm/vmx.c
> @@ -16,6 +16,7 @@
>   */
>  
>  #include "kvm.h"
> +#include "virq.h"
>  #include "vmx.h"
>  #include <linux/module.h>
>  #include <linux/kernel.h>
> @@ -1464,6 +1465,16 @@ static void inject_rmode_irq(struct kvm_vcpu
> *vcpu, int irq)
>  	vmcs_writel(GUEST_RSP, (vmcs_readl(GUEST_RSP) & ~0xffff) | (sp -
> 6));
>  }
>  
> +static void vmx_inject_irq(struct kvm_vcpu *vcpu, int irq)
> +{
> +	if (vcpu->rmode.active) {
> +		inject_rmode_irq(vcpu, irq);
> +		return;
> +	}
> +	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
> +			irq | INTR_TYPE_EXT_INTR |
> INTR_INFO_VALID_MASK);
> +}
> +
>  static void kvm_do_inject_irq(struct kvm_vcpu *vcpu)
>  {
>  	int word_index = __ffs(vcpu->irq_summary);
> @@ -1473,13 +1484,7 @@ static void kvm_do_inject_irq(struct kvm_vcpu
> *vcpu)
>  	clear_bit(bit_index, &vcpu->irq_pending[word_index]);
>  	if (!vcpu->irq_pending[word_index])
>  		clear_bit(word_index, &vcpu->irq_summary);
> -
> -	if (vcpu->rmode.active) {
> -		inject_rmode_irq(vcpu, irq);
> -		return;
> -	}
> -	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
> -			irq | INTR_TYPE_EXT_INTR |
> INTR_INFO_VALID_MASK);
> +	vmx_inject_irq(vcpu, irq);
>  }
>  
> 
> @@ -1563,7 +1568,7 @@ static int handle_exception(struct kvm_vcpu *vcpu,
> struct kvm_run *kvm_run)
>  		       "intr info 0x%x\n", __FUNCTION__, vect_info,
> intr_info);
>  	}
>  
> -	if (is_external_interrupt(vect_info)) {
> +	if (!irqchip_in_kernel(vcpu->kvm) &&
> is_external_interrupt(vect_info)) {
>  		int irq = vect_info & VECTORING_INFO_VECTOR_MASK;
>  		set_bit(irq, vcpu->irq_pending);
>  		set_bit(irq / BITS_PER_LONG, &vcpu->irq_summary);
> @@ -1894,6 +1899,12 @@ static void post_kvm_run_save(struct kvm_vcpu
> *vcpu,
>  static int handle_interrupt_window(struct kvm_vcpu *vcpu,
>  				   struct kvm_run *kvm_run)
>  {
> +	u32 cpu_based_vm_exec_control;
> +
> +	/* clear pending irq */
> +	cpu_based_vm_exec_control =
> vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
> +	cpu_based_vm_exec_control &= ~CPU_BASED_VIRTUAL_INTR_PENDING;
> +	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL,
> cpu_based_vm_exec_control);
>  	/*
>  	 * If the user space waits to inject interrupts, exit as soon as
>  	 * possible
> @@ -1986,6 +1997,56 @@ static void vmx_flush_tlb(struct kvm_vcpu *vcpu)
>  	vmcs_writel(GUEST_CR3, vmcs_readl(GUEST_CR3));
>  }
>  
> +void enable_irq_window(struct kvm_vcpu *vcpu)
> +{
> +	u32 cpu_based_vm_exec_control;
> +
> +	cpu_based_vm_exec_control =
> vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
> +	cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
> +	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL,
> cpu_based_vm_exec_control);
> +}
> +
> +void do_intr_assist(struct kvm_vcpu *vcpu)
> +{
> +	u32 idtv_info_field, intr_info_field;
> +	int has_ext_irq, interrupt_window_open;
> +	/* TODO: Move IDT_Vectoring here */
> +
> +	has_ext_irq = cpu_has_interrupt(vcpu);
> +	intr_info_field = vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
> +	idtv_info_field = vmcs_read32(IDT_VECTORING_INFO_FIELD);
> +	if (intr_info_field & INTR_INFO_VALID_MASK) {
> +		if (idtv_info_field & INTR_INFO_VALID_MASK) {
> +			/* TODO: fault when IDT_Vectoring */
> +			printk(KERN_ERR "Fault when IDT_Vectoring\n");
> +		}
> +		if (has_ext_irq)
> +			enable_irq_window(vcpu);
> +		return;
> +	}
> +	if (unlikely(idtv_info_field & INTR_INFO_VALID_MASK)) {
> +		vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, idtv_info_field);
> +		vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
> +				vmcs_read32(VM_EXIT_INSTRUCTION_LEN));
> +
> +		if ( unlikely(idtv_info_field & 0x800) ) /* valid error
> code */
> +			vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
> +				vmcs_read32(IDT_VECTORING_ERROR_CODE));
> +		if ( unlikely(has_ext_irq) )
> +			enable_irq_window(vcpu);
> +		return;
> +	}
> +	if (!has_ext_irq)
> +		return;
> +	interrupt_window_open =
> +		((vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
> +		 (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & 3) == 0);
> +	if (interrupt_window_open)
> +		vmx_inject_irq(vcpu, cpu_get_interrupt(vcpu));
> +	else
> +		enable_irq_window(vcpu);
> +}
> +
>  static int vmx_vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
>  {
>  	u8 fail;
> @@ -1996,9 +2057,6 @@ preempted:
>  		kvm_guest_debug_pre(vcpu);
>  
>  again:
> -	if (!vcpu->mmio_read_completed)
> -		do_interrupt_requests(vcpu, kvm_run);
> -
>  	vmx_save_host_state(vcpu);
>  	kvm_load_guest_fpu(vcpu);
>  
> @@ -2012,6 +2070,10 @@ again:
>  	vmcs_writel(HOST_CR0, read_cr0());
>  
>  	local_irq_disable();
> +	if (irqchip_in_kernel(vcpu->kvm))
> +		do_intr_assist(vcpu);
> +	else if (!vcpu->mmio_read_completed)
> +		do_interrupt_requests(vcpu, kvm_run);

2 comments here:

1) shouldnt this be:

> +	if (!vcpu->mmio_read_completed) {
> +               if (irqchip_in_kernel(vcpu->kvm))
> +		         do_intr_assist(vcpu);
> +		else
> +                        do_interrupt_requests(vcpu, kvm_run);

?

2) I prefer to switch out code like this using indirection instead of if
statements.  The code is IMO cleaner and the run-time overhead is
probably lower, especially as more cases come into play.


Regards,
-Greg


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

[parent not found: <467A38720200005A000262C8-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>]

* Re: In kernel PIC support: kernel patch
       [not found] ` <467A38720200005A000262C8-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
@ 2007-06-21 13:02   ` Dong, Eddie
  0 siblings, 0 replies; 57+ messages in thread
From: Dong, Eddie @ 2007-06-21 13:02 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Gregory Haskins wrote:
> On Wed, 2007-06-20 at 15:43 +0800, Dong, Eddie wrote:
>> As we discussed, if we move APIC to kernel while leaving PIC in user
>> level, we have ABI level holes if the interrupt a user level qemu
>> injected is not immediately injected to guest for some reason. \
> 
> Hi Eddie,
> 
> I know you worked hard on this, but I still do not agree with this
> statement.  I agree that v9 introduced an issue related to a)
> premature IRR->ISR assignment, and b) lack of support for clearing
> pending ISR with IMR changes (and thank you for pointing it out!). 

Achitectually not only that. A premature IRR->ISR will cause guest OS
confuse in many place. A guest (say BIOS) may turn from interrupt
enabled mode to polling mode which polls IRR to detect if 
there is pending IRQ to handle. In this case we have trouble.


> However, as I mentioned we can fix that issue with just a few new
> lines of 
> code and by
> reverting the userspace injection model to the one used in prior
> releases without having to implement an entire in-kernel PIC to do it.

In-kernel PIC gives us a full in kernel irqchip solution and performance
gain. Some OSes may use PIC only.

> 
> I think moving the PIC into the kernel has the advantage of allowing
> us to move the PIT into the kernel too (which is huge, IMHO), but

Not very big, I just want to do one by one. we have done the code
in Xen already.

> that is its sole advantage.  Don't get me wrong: I am in favor of
> doing it. However, I wanted to point out that this particular
> solution to the problem you found essentially is invalidating
> "level-1" mode by only supporting level-0 or level-2, not fixing it. 

Not exactly understand level-0/1/2 meaning? Do u mean we mixed 
irqchip mode is a feature requirement? I didn't see the necessity, 
maybe I am wrong. But isn't both PIC and APIC in kernel much
nature and simple for us?

If you really think supporting mixed irqchip mode is a must, we 
need to introduce an intack I/F to notify user level irqchip if the irq 
is really ACKed or not. We can do that but I doubt the necessity.

> Perhaps we are "ok" with that, but I was under the distinct
> impression from Avi et. al. that these variable levels of support
> were a design goal for debugging purposes. 
> 
> I would prefer that we just fix level-1 mode with the changes I
> mentioned instead of just disabling it (in addition to adding level-2
> mode as Eddie is working on). 
> 
>> 
>> This patch fixes this by introducing new VM level IOCTL
>> KVM_SET_ISA_IRQ_LEVEL & KVM_CREATE_PIC to avoid the ABI hole.  The
>> original KVM_INTERRUPT is still there to be backward compatible.
>> 
>> WIth this patch, old Qemu, new qemu (after this patch), old kvm, new
>> kvm can work in any combination. Both user level code and kernel
>> code will automatically decide the irq source base on irqchip
>> location (user or kernel). 
>> 
>> Some known issues (TODO):
>> 	1: SVM support is on going.
>> 	The code for VMX to inject irq is same with Xen now since the
>> situation is same now (irqchip in hyprevisor), SVM code should be
>> able to directly reuse from Xen too.
>> 	2: live migration break.
>> 	3: Temply APIC support is removed in CPUID to wait for the merge
>> of in kernel APIC patch
> 
> Note that you will need more than just the APIC patch.  My patch only
> moves the LAPIC down.  The IOAPIC and 8259s are still in userspace.
> Your patch only moves the 8259s down.  This means there is a gap where
> the IOAPIC used to be that still needs to be addressed.

Why do u think I/O APIC must be moved down too? A new IOCTL like
KVM_IOAPIC_MESSAGE can solve the interface issue IMO and quit 
natual.

Thx,
Eddie

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* In kernel PIC support: kernel patch
@ 2007-06-20  7:43 Dong, Eddie
       [not found] ` <10EA09EFD8728347A513008B6B0DA77A01A55F0A-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Dong, Eddie @ 2007-06-20  7:43 UTC (permalink / raw)
  To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

[-- Attachment #1: Type: text/plain, Size: 24457 bytes --]

As we discussed, if we move APIC to kernel while leaving PIC in user
level, we have ABI level holes if the interrupt a user level qemu
injected is not immediately injected to guest for some reason. \

This patch fixes this by introducing new VM level IOCTL
KVM_SET_ISA_IRQ_LEVEL & KVM_CREATE_PIC to avoid the ABI hole.  The
original KVM_INTERRUPT is still there to be backward compatible.

WIth this patch, old Qemu, new qemu (after this patch), old kvm, new kvm
can work in any combination. Both user level code and kernel code will
automatically decide the irq source base on irqchip location (user or
kernel).

Some known issues (TODO):
	1: SVM support is on going.
	The code for VMX to inject irq is same with Xen now since the
situation is same now (irqchip in hyprevisor), SVM code should be able
to directly reuse from Xen too.
	2: live migration break.
	3: Temply APIC support is removed in CPUID to wait for the merge
of in kernel APIC patch

thx, eddie

    Signed-off-by: Yaozu (Eddie) Dong <eddie.dong-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

 drivers/kvm/Makefile   |    2
 drivers/kvm/i8259.c    |  435
+++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/kvm/kvm.h      |    7
 drivers/kvm/kvm_main.c |   26 ++
 drivers/kvm/virq.c     |   58 ++++++
 drivers/kvm/virq.h     |   70 +++++++
 drivers/kvm/vmx.c      |   84 ++++++++-
 include/linux/kvm.h    |    9 +
 8 files changed, 678 insertions(+), 13 deletions(-)


diff --git a/drivers/kvm/Makefile b/drivers/kvm/Makefile
index c0a789f..118c79b 100644
--- a/drivers/kvm/Makefile
+++ b/drivers/kvm/Makefile
@@ -2,7 +2,7 @@
 # Makefile for Kernel-based Virtual Machine module
 #
 
-kvm-objs := kvm_main.o mmu.o x86_emulate.o
+kvm-objs := kvm_main.o mmu.o x86_emulate.o i8259.o virq.o
 obj-$(CONFIG_KVM) += kvm.o
 kvm-intel-objs = vmx.o
 obj-$(CONFIG_KVM_INTEL) += kvm-intel.o
diff --git a/drivers/kvm/i8259.c b/drivers/kvm/i8259.c
new file mode 100644
index 0000000..9ba8c1e
--- /dev/null
+++ b/drivers/kvm/i8259.c
@@ -0,0 +1,435 @@
+/*
+ * QEMU 8259 interrupt controller emulation
+ *
+ * Copyright (c) 2003-2004 Fabrice Bellard
+ * Copyright (c) 2007 Intel Corperation
+ *
+ * Permission is hereby granted, free of charge, to any person
obtaining a copy
+ * of this software and associated documentation files (the
"Software"), to deal
+ * in the Software without restriction, including without limitation
the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or
sell
+ * copies of the Software, and to permit persons to whom the Software
is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be
included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT
SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN
+ * THE SOFTWARE.
+ * Authors:
+ *   Yaozu (Eddie) Dong <Eddie.dong-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
+ *   Port from Qemu.
+ */
+#include "virq.h"
+#include <linux/mm.h>
+
+/* set irq level. If an edge is detected, then the IRR is set to 1 */
+static inline void pic_set_irq1(struct pic_state *s, int irq, int
level)
+{
+	int mask;
+	mask = 1 << irq;
+	if (s->elcr & mask) {
+		/* level triggered */
+		if (level) {
+			s->irr |= mask;
+			s->last_irr |= mask;
+		} else {
+			s->irr &= ~mask;
+			s->last_irr &= ~mask;
+		}
+	} else {
+		/* edge triggered */
+		if (level) {
+			if ((s->last_irr & mask) == 0)
+				s->irr |= mask;
+			s->last_irr |= mask;
+		} else {
+			s->last_irr &= ~mask;
+		}
+	}
+}
+
+/* return the highest priority found in mask (highest = smallest
+   number). Return 8 if no irq */
+static inline int get_priority(struct pic_state *s, int mask)
+{
+	int priority;
+	if (mask == 0)
+		return 8;
+	priority = 0;
+	while ((mask & (1 << ((priority + s->priority_add) & 7))) == 0)
+		priority++;
+	return priority;
+}
+
+/* return the pic wanted interrupt. return -1 if none */
+static int pic_get_irq(struct pic_state *s)
+{
+	int mask, cur_priority, priority;
+
+	mask = s->irr & ~s->imr;
+	priority = get_priority(s, mask);
+	if (priority == 8)
+		return -1;
+	/* compute current priority. If special fully nested mode on the
+	   master, the IRQ coming from the slave is not taken into
account
+	   for the priority computation. */
+	mask = s->isr;
+	if (s->special_fully_nested_mode && s ==
&s->pics_state->pics[0])
+		mask &= ~(1 << 2);
+	cur_priority = get_priority(s, mask);
+	if (priority < cur_priority) {
+		/* higher priority found: an irq should be generated */
+		return (priority + s->priority_add) & 7;
+	} else {
+		return -1;
+	}
+}
+
+/* raise irq to CPU if necessary. must be called every time the active
+   irq may change */
+static void pic_update_irq(struct pic_state2 *s)
+{
+	int irq2, irq;
+
+	/* first look at slave pic */
+	irq2 = pic_get_irq(&s->pics[1]);
+	if (irq2 >= 0) {
+		/* if irq request by slave pic, signal master PIC */
+		pic_set_irq1(&s->pics[0], 2, 1);
+		pic_set_irq1(&s->pics[0], 2, 0);
+	}
+	/* look at requested irq */
+	irq = pic_get_irq(&s->pics[0]);
+	if (irq >= 0) {
+		s->irq_request(s->irq_request_opaque, 1);
+	}
+}
+
+void pic_set_irq_new(void *opaque, int irq, int level)
+{
+	struct pic_state2 *s = opaque;
+
+	pic_set_irq1(&s->pics[irq >> 3], irq & 7, level);
+	/* used for IOAPIC irqs */
+	if (s->alt_irq_func)
+		s->alt_irq_func(s->alt_irq_opaque, irq, level);
+	pic_update_irq(s);
+}
+
+/* acknowledge interrupt 'irq' */
+static inline void pic_intack(struct pic_state *s, int irq)
+{
+	if (s->auto_eoi) {
+		if (s->rotate_on_auto_eoi)
+			s->priority_add = (irq + 1) & 7;
+	} else {
+		s->isr |= (1 << irq);
+	}
+	/* We don't clear a level sensitive interrupt here */
+	if (!(s->elcr & (1 << irq)))
+		s->irr &= ~(1 << irq);
+}
+
+int pic_read_irq(struct pic_state2 *s)
+{
+	int irq, irq2, intno;
+
+	irq = pic_get_irq(&s->pics[0]);
+	if (irq >= 0) {
+		pic_intack(&s->pics[0], irq);
+		if (irq == 2) {
+			irq2 = pic_get_irq(&s->pics[1]);
+			if (irq2 >= 0) {
+				pic_intack(&s->pics[1], irq2);
+			} else {
+				/* spurious IRQ on slave controller */
+				irq2 = 7;
+			}
+			intno = s->pics[1].irq_base + irq2;
+			irq = irq2 + 8;
+		} else {
+			intno = s->pics[0].irq_base + irq;
+		}
+	} else {
+		/* spurious IRQ on host controller */
+		irq = 7;
+		intno = s->pics[0].irq_base + irq;
+	}
+	pic_update_irq(s);
+
+	return intno;
+}
+
+static void pic_reset(void *opaque)
+{
+	struct pic_state *s = opaque;
+
+	s->last_irr = 0;
+	s->irr = 0;
+	s->imr = 0;
+	s->isr = 0;
+	s->priority_add = 0;
+	s->irq_base = 0;
+	s->read_reg_select = 0;
+	s->poll = 0;
+	s->special_mask = 0;
+	s->init_state = 0;
+	s->auto_eoi = 0;
+	s->rotate_on_auto_eoi = 0;
+	s->special_fully_nested_mode = 0;
+	s->init4 = 0;
+	/* Note: ELCR is not reset */
+}
+
+static void pic_ioport_write(void *opaque, uint32_t addr, uint32_t val)
+{
+	struct pic_state *s = opaque;
+	int priority, cmd, irq;
+
+	addr &= 1;
+	if (addr == 0) {
+		if (val & 0x10) {
+			/* init */
+			pic_reset(s);
+			/* deassert a pending interrupt */
+			s->pics_state->irq_request(s->pics_state->
+						   irq_request_opaque,
0);
+			s->init_state = 1;
+			s->init4 = val & 1;
+			if (val & 0x02)
+				printk(KERN_ERR "single mode not
supported");
+			if (val & 0x08)
+				printk(KERN_ERR
+				       "level sensitive irq not
supported");
+		} else if (val & 0x08) {
+			if (val & 0x04)
+				s->poll = 1;
+			if (val & 0x02)
+				s->read_reg_select = val & 1;
+			if (val & 0x40)
+				s->special_mask = (val >> 5) & 1;
+		} else {
+			cmd = val >> 5;
+			switch (cmd) {
+			case 0:
+			case 4:
+				s->rotate_on_auto_eoi = cmd >> 2;
+				break;
+			case 1:	/* end of interrupt */
+			case 5:
+				priority = get_priority(s, s->isr);
+				if (priority != 8) {
+					irq = (priority +
s->priority_add) & 7;
+					s->isr &= ~(1 << irq);
+					if (cmd == 5)
+						s->priority_add = (irq +
1) & 7;
+					pic_update_irq(s->pics_state);
+				}
+				break;
+			case 3:
+				irq = val & 7;
+				s->isr &= ~(1 << irq);
+				pic_update_irq(s->pics_state);
+				break;
+			case 6:
+				s->priority_add = (val + 1) & 7;
+				pic_update_irq(s->pics_state);
+				break;
+			case 7:
+				irq = val & 7;
+				s->isr &= ~(1 << irq);
+				s->priority_add = (irq + 1) & 7;
+				pic_update_irq(s->pics_state);
+				break;
+			default:
+				/* no operation */
+				break;
+			}
+		}
+	} else {
+		switch (s->init_state) {
+		case 0:
+			/* normal mode */
+			s->imr = val;
+			pic_update_irq(s->pics_state);
+			break;
+		case 1:
+			s->irq_base = val & 0xf8;
+			s->init_state = 2;
+			break;
+		case 2:
+			if (s->init4) {
+				s->init_state = 3;
+			} else {
+				s->init_state = 0;
+			}
+			break;
+		case 3:
+			s->special_fully_nested_mode = (val >> 4) & 1;
+			s->auto_eoi = (val >> 1) & 1;
+			s->init_state = 0;
+			break;
+		}
+	}
+}
+
+static uint32_t pic_poll_read(struct pic_state *s, uint32_t addr1)
+{
+	int ret;
+
+	ret = pic_get_irq(s);
+	if (ret >= 0) {
+		if (addr1 >> 7) {
+			s->pics_state->pics[0].isr &= ~(1 << 2);
+			s->pics_state->pics[0].irr &= ~(1 << 2);
+		}
+		s->irr &= ~(1 << ret);
+		s->isr &= ~(1 << ret);
+		if (addr1 >> 7 || ret != 2)
+			pic_update_irq(s->pics_state);
+	} else {
+		ret = 0x07;
+		pic_update_irq(s->pics_state);
+	}
+
+	return ret;
+}
+
+static uint32_t pic_ioport_read(void *opaque, uint32_t addr1)
+{
+	struct pic_state *s = opaque;
+	unsigned int addr;
+	int ret;
+
+	addr = addr1;
+	addr &= 1;
+	if (s->poll) {
+		ret = pic_poll_read(s, addr1);
+		s->poll = 0;
+	} else {
+		if (addr == 0) {
+			if (s->read_reg_select)
+				ret = s->isr;
+			else
+				ret = s->irr;
+		} else {
+			ret = s->imr;
+		}
+	}
+	return ret;
+}
+
+static void elcr_ioport_write(void *opaque, uint32_t addr, uint32_t
val)
+{
+	struct pic_state *s = opaque;
+	s->elcr = val & s->elcr_mask;
+}
+
+static uint32_t elcr_ioport_read(void *opaque, uint32_t addr1)
+{
+	struct pic_state *s = opaque;
+	return s->elcr;
+}
+
+static int picdev_in_range(struct kvm_io_device *this, gpa_t addr)
+{
+	switch (addr) {
+		/* Master PIC */
+	case 0x20:
+	case 0x21:
+		/* Slave PIC */
+	case 0xa0:
+	case 0xa1:
+		/* elcr0  & 1 */
+	case 0x4d0:
+	case 0x4d1:
+		return 1;
+	default:
+		return 0;
+	}
+}
+
+static void picdev_write(struct kvm_io_device *this,
+			 gpa_t addr, int len, const void *val)
+{
+	struct pic_state2 *s = this->private;
+	unsigned char data = *(unsigned char *)val;
+
+	if (len != 1) {
+		printk(KERN_ERR "PIC: non byte write\n");
+		return;
+	}
+	switch (addr) {
+	case 0x20:
+	case 0x21:
+	case 0xa0:
+	case 0xa1:
+		pic_ioport_write(&s->pics[addr >> 7], addr, data);
+		break;
+	case 0x4d0:
+	case 0x4d1:
+		elcr_ioport_write(&s->pics[addr & 1], addr, data);
+		break;
+	}
+}
+
+static void picdev_read(struct kvm_io_device *this,
+			gpa_t addr, int len, void *val)
+{
+	struct pic_state2 *s = this->private;
+	unsigned char data = 0;
+
+	if (len != 1) {
+		printk(KERN_ERR "PIC: non byte read\n");
+		return;
+	}
+	switch (addr) {
+	case 0x20:
+	case 0x21:
+	case 0xa0:
+	case 0xa1:
+		data = pic_ioport_read(&s->pics[addr >> 7], addr);
+		break;
+	case 0x4d0:
+	case 0x4d1:
+		data = elcr_ioport_read(&s->pics[addr & 1], addr);
+		break;
+	}
+	*(unsigned char *)val = data;
+}
+
+/* callback when PIC0 irq status changed */
+static void pic_irq_request(void *opaque, int level)
+{
+	struct kvm *kvm = opaque;
+
+	int_output(kvm->vpic) = level;
+}
+
+struct pic_state2 *create_pic(struct kvm *kvm)
+{
+	struct pic_state2 *s;
+	s = kzalloc(sizeof(struct pic_state2), GFP_KERNEL);
+	if (!s)
+		return NULL;
+	s->pics[0].elcr_mask = 0xf8;
+	s->pics[1].elcr_mask = 0xde;
+	s->irq_request = pic_irq_request;
+	s->irq_request_opaque = kvm;
+	s->pics[0].pics_state = s;
+	s->pics[1].pics_state = s;
+
+	/* Initialize PIO device */
+	s->dev.read = picdev_read;
+	s->dev.write = picdev_write;
+	s->dev.in_range = picdev_in_range;
+	s->dev.private = s;
+	kvm_io_bus_register_dev(&kvm->pio_bus, &s->dev);
+	return s;
+}
diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
index a7c5e6b..1e2d3ba 100644
--- a/drivers/kvm/kvm.h
+++ b/drivers/kvm/kvm.h
@@ -456,8 +456,15 @@ struct kvm {
 	struct file *filp;
 	struct kvm_io_bus mmio_bus;
 	struct kvm_io_bus pio_bus;
+	void *vpic;
 };
 
+#define kernel_pic(kvm) ((kvm)->vpic)
+static inline int irqchip_in_kernel(struct kvm *kvm)
+{
+	return (kernel_pic(kvm) != 0);
+}
+
 struct descriptor_table {
 	u16 limit;
 	unsigned long base;
diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index 7826f16..25f99fc 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -16,6 +16,7 @@
  */
 
 #include "kvm.h"
+#include "virq.h"
 
 #include <linux/kvm.h>
 #include <linux/module.h>
@@ -476,6 +477,8 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	spin_lock(&kvm_lock);
 	list_del(&kvm->vm_list);
 	spin_unlock(&kvm_lock);
+	if (kernel_pic(kvm))
+		kfree(kernel_pic(kvm));
 	kvm_io_bus_destroy(&kvm->pio_bus);
 	kvm_io_bus_destroy(&kvm->mmio_bus);
 	kvm_free_vcpus(kvm);
@@ -1372,7 +1375,8 @@ EXPORT_SYMBOL_GPL(emulate_instruction);
 
 int kvm_emulate_halt(struct kvm_vcpu *vcpu)
 {
-	if (vcpu->irq_summary)
+	if (vcpu->irq_summary ||
+           (irqchip_in_kernel(vcpu->kvm) && cpu_has_interrupt(vcpu)))
 		return 1;
 
 	vcpu->run->exit_reason = KVM_EXIT_HLT;
@@ -2855,6 +2859,26 @@ static long kvm_vm_ioctl(struct file *filp,
 			goto out;
 		break;
 	}
+	case KVM_CREATE_PIC:
+		r = -ENOMEM;
+		kernel_pic(kvm) = create_pic(kvm);
+		if (kernel_pic(kvm)) {
+			r = 0;
+		}
+		break;
+	case KVM_SET_ISA_IRQ_LEVEL: {
+		struct kvm_irq_level irq_event;
+		r = -EFAULT;
+		if (copy_from_user(&irq_event, argp, sizeof irq_event))
+			goto out;
+		if (irqchip_in_kernel(kvm)) {
+			pic_set_irq_new(kernel_pic(kvm),
+					irq_event.irq,
+					irq_event.level);
+			r = 0;
+		}
+		break;
+	}
 	default:
 		;
 	}
diff --git a/drivers/kvm/virq.c b/drivers/kvm/virq.c
new file mode 100644
index 0000000..a807296
--- /dev/null
+++ b/drivers/kvm/virq.c
@@ -0,0 +1,58 @@
+/*
+ * virq.c: API for in kernel interrupt controller
+ * Copyright (c) 2007, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but
WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License
along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ * Authors:
+ *   Yaozu (Eddie) Dong <Eddie.dong-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
+ *
+ */
+
+#include "kvm.h"
+#include "virq.h"
+
+/*
+ * check if there is pending interrupt without
+ * intack.
+ */
+int cpu_has_interrupt(struct kvm_vcpu *v)
+{
+	struct pic_state2 *s = kernel_pic(v->kvm);
+
+	/* PIC */
+	if (int_output(s))
+		return 1;
+	/* TODO: APIC */
+	return 0;
+}
+
+EXPORT_SYMBOL_GPL(cpu_has_interrupt);
+
+/*
+ * Read pending interrupt vector and intack.
+ */
+int cpu_get_interrupt(struct kvm_vcpu *v)
+{
+	struct pic_state2 *s = kernel_pic(v->kvm);
+	int vector;
+
+	int_output(s) = 0;
+	vector = pic_read_irq(s);
+	if (vector != -1)
+		return vector;
+	/* TODO: APIC */
+	return -1;
+}
+
+EXPORT_SYMBOL_GPL(cpu_get_interrupt);
diff --git a/drivers/kvm/virq.h b/drivers/kvm/virq.h
new file mode 100644
index 0000000..ed467dc
--- /dev/null
+++ b/drivers/kvm/virq.h
@@ -0,0 +1,70 @@
+/*
+ * virq.h: in kernel interrupt controller related definitions
+ * Copyright (c) 2007, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but
WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License
along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ * Authors:
+ *   Yaozu (Eddie) Dong <Eddie.dong-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
+ *
+ */
+
+#ifndef __VIRQ_H
+#define __VIRQ_H
+
+#include "kvm.h"
+
+typedef void SetIRQFunc(void *opaque, int irq_num, int level);
+typedef void IRQRequestFunc(void *opaque, int level);
+
+struct PicState2;
+struct pic_state {
+	uint8_t last_irr;	/* edge detection */
+	uint8_t irr;		/* interrupt request register */
+	uint8_t imr;		/* interrupt mask register */
+	uint8_t isr;		/* interrupt service register */
+	uint8_t priority_add;	/* highest irq priority */
+	uint8_t irq_base;
+	uint8_t read_reg_select;
+	uint8_t poll;
+	uint8_t special_mask;
+	uint8_t init_state;
+	uint8_t auto_eoi;
+	uint8_t rotate_on_auto_eoi;
+	uint8_t special_fully_nested_mode;
+	uint8_t init4;		/* true if 4 byte init */
+	uint8_t elcr;		/* PIIX edge/trigger selection */
+	uint8_t elcr_mask;
+	struct pic_state2 *pics_state;
+};
+
+struct pic_state2 {
+	/* 0 is master pic, 1 is slave pic */
+	struct pic_state pics[2];
+	IRQRequestFunc *irq_request;
+	void *irq_request_opaque;
+	/* IOAPIC callback support */
+	SetIRQFunc *alt_irq_func;
+	void *alt_irq_opaque;
+	int output;		/* intr from master PIC */
+	struct kvm_io_device dev;
+};
+
+#define int_output(s)		(((struct pic_state2 *)s)->output)
+struct pic_state2 *create_pic(struct kvm *kvm);
+void pic_set_irq_new(void *opaque, int irq, int level);
+int pic_read_irq(struct pic_state2 *s);
+int cpu_get_interrupt(struct kvm_vcpu *v);
+int cpu_has_interrupt(struct kvm_vcpu *v);
+
+#endif
diff --git a/drivers/kvm/vmx.c b/drivers/kvm/vmx.c
index b47ddcc..cef0d64 100644
--- a/drivers/kvm/vmx.c
+++ b/drivers/kvm/vmx.c
@@ -16,6 +16,7 @@
  */
 
 #include "kvm.h"
+#include "virq.h"
 #include "vmx.h"
 #include <linux/module.h>
 #include <linux/kernel.h>
@@ -1464,6 +1465,16 @@ static void inject_rmode_irq(struct kvm_vcpu
*vcpu, int irq)
 	vmcs_writel(GUEST_RSP, (vmcs_readl(GUEST_RSP) & ~0xffff) | (sp -
6));
 }
 
+static void vmx_inject_irq(struct kvm_vcpu *vcpu, int irq)
+{
+	if (vcpu->rmode.active) {
+		inject_rmode_irq(vcpu, irq);
+		return;
+	}
+	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
+			irq | INTR_TYPE_EXT_INTR |
INTR_INFO_VALID_MASK);
+}
+
 static void kvm_do_inject_irq(struct kvm_vcpu *vcpu)
 {
 	int word_index = __ffs(vcpu->irq_summary);
@@ -1473,13 +1484,7 @@ static void kvm_do_inject_irq(struct kvm_vcpu
*vcpu)
 	clear_bit(bit_index, &vcpu->irq_pending[word_index]);
 	if (!vcpu->irq_pending[word_index])
 		clear_bit(word_index, &vcpu->irq_summary);
-
-	if (vcpu->rmode.active) {
-		inject_rmode_irq(vcpu, irq);
-		return;
-	}
-	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
-			irq | INTR_TYPE_EXT_INTR |
INTR_INFO_VALID_MASK);
+	vmx_inject_irq(vcpu, irq);
 }
 
 
@@ -1563,7 +1568,7 @@ static int handle_exception(struct kvm_vcpu *vcpu,
struct kvm_run *kvm_run)
 		       "intr info 0x%x\n", __FUNCTION__, vect_info,
intr_info);
 	}
 
-	if (is_external_interrupt(vect_info)) {
+	if (!irqchip_in_kernel(vcpu->kvm) &&
is_external_interrupt(vect_info)) {
 		int irq = vect_info & VECTORING_INFO_VECTOR_MASK;
 		set_bit(irq, vcpu->irq_pending);
 		set_bit(irq / BITS_PER_LONG, &vcpu->irq_summary);
@@ -1894,6 +1899,12 @@ static void post_kvm_run_save(struct kvm_vcpu
*vcpu,
 static int handle_interrupt_window(struct kvm_vcpu *vcpu,
 				   struct kvm_run *kvm_run)
 {
+	u32 cpu_based_vm_exec_control;
+
+	/* clear pending irq */
+	cpu_based_vm_exec_control =
vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
+	cpu_based_vm_exec_control &= ~CPU_BASED_VIRTUAL_INTR_PENDING;
+	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL,
cpu_based_vm_exec_control);
 	/*
 	 * If the user space waits to inject interrupts, exit as soon as
 	 * possible
@@ -1986,6 +1997,56 @@ static void vmx_flush_tlb(struct kvm_vcpu *vcpu)
 	vmcs_writel(GUEST_CR3, vmcs_readl(GUEST_CR3));
 }
 
+void enable_irq_window(struct kvm_vcpu *vcpu)
+{
+	u32 cpu_based_vm_exec_control;
+
+	cpu_based_vm_exec_control =
vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
+	cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
+	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL,
cpu_based_vm_exec_control);
+}
+
+void do_intr_assist(struct kvm_vcpu *vcpu)
+{
+	u32 idtv_info_field, intr_info_field;
+	int has_ext_irq, interrupt_window_open;
+	/* TODO: Move IDT_Vectoring here */
+
+	has_ext_irq = cpu_has_interrupt(vcpu);
+	intr_info_field = vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
+	idtv_info_field = vmcs_read32(IDT_VECTORING_INFO_FIELD);
+	if (intr_info_field & INTR_INFO_VALID_MASK) {
+		if (idtv_info_field & INTR_INFO_VALID_MASK) {
+			/* TODO: fault when IDT_Vectoring */
+			printk(KERN_ERR "Fault when IDT_Vectoring\n");
+		}
+		if (has_ext_irq)
+			enable_irq_window(vcpu);
+		return;
+	}
+	if (unlikely(idtv_info_field & INTR_INFO_VALID_MASK)) {
+		vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, idtv_info_field);
+		vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
+				vmcs_read32(VM_EXIT_INSTRUCTION_LEN));
+
+		if ( unlikely(idtv_info_field & 0x800) ) /* valid error
code */
+			vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
+				vmcs_read32(IDT_VECTORING_ERROR_CODE));
+		if ( unlikely(has_ext_irq) )
+			enable_irq_window(vcpu);
+		return;
+	}
+	if (!has_ext_irq)
+		return;
+	interrupt_window_open =
+		((vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
+		 (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & 3) == 0);
+	if (interrupt_window_open)
+		vmx_inject_irq(vcpu, cpu_get_interrupt(vcpu));
+	else
+		enable_irq_window(vcpu);
+}
+
 static int vmx_vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 {
 	u8 fail;
@@ -1996,9 +2057,6 @@ preempted:
 		kvm_guest_debug_pre(vcpu);
 
 again:
-	if (!vcpu->mmio_read_completed)
-		do_interrupt_requests(vcpu, kvm_run);
-
 	vmx_save_host_state(vcpu);
 	kvm_load_guest_fpu(vcpu);
 
@@ -2012,6 +2070,10 @@ again:
 	vmcs_writel(HOST_CR0, read_cr0());
 
 	local_irq_disable();
+	if (irqchip_in_kernel(vcpu->kvm))
+		do_intr_assist(vcpu);
+	else if (!vcpu->mmio_read_completed)
+		do_interrupt_requests(vcpu, kvm_run);
 
 	vcpu->guest_mode = 1;
 	if (vcpu->requests)
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index e6edca8..fbcfc86 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -41,6 +41,12 @@ struct kvm_memory_alias {
 	__u64 target_phys_addr;
 };
 
+/* for KVM_SET_IRQ_LEVEL */
+struct kvm_irq_level {
+	__u32 irq;
+	__u32 level;
+};
+
 enum kvm_exit_reason {
 	KVM_EXIT_UNKNOWN          = 0,
 	KVM_EXIT_EXCEPTION        = 1,
@@ -282,6 +288,9 @@ struct kvm_signal_mask {
 #define KVM_CREATE_VCPU           _IO(KVMIO,  0x41)
 #define KVM_GET_DIRTY_LOG         _IOW(KVMIO, 0x42, struct
kvm_dirty_log)
 #define KVM_SET_MEMORY_ALIAS      _IOW(KVMIO, 0x43, struct
kvm_memory_alias)
+/* Device model IOC */
+#define KVM_CREATE_PIC		  _IO(KVMIO,  0x60)
+#define KVM_SET_ISA_IRQ_LEVEL	  _IO(KVMIO,  0x61)
 
 /*
  * ioctls for vcpu fds

[-- Attachment #2: pic-kernel-v23.patch --]
[-- Type: application/octet-stream, Size: 22491 bytes --]

commit 58df4587880e541057386abf2aea7b63bd7d9d97
Author: root <root@vt32-pae.(none)>
Date:   Wed Jun 20 15:31:31 2007 +0800

        In kernel PIC support.
        Mixed mode of irqchip has a hole in
        injecting device irq which may be not
        immediately processed and thus difficult
        to handle.
        Move PIC to kernel together with APIC can
        both solve the issue and get performance
        gain.
    
    Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>

diff --git a/drivers/kvm/Makefile b/drivers/kvm/Makefile
index c0a789f..118c79b 100644
--- a/drivers/kvm/Makefile
+++ b/drivers/kvm/Makefile
@@ -2,7 +2,7 @@
 # Makefile for Kernel-based Virtual Machine module
 #
 
-kvm-objs := kvm_main.o mmu.o x86_emulate.o
+kvm-objs := kvm_main.o mmu.o x86_emulate.o i8259.o virq.o
 obj-$(CONFIG_KVM) += kvm.o
 kvm-intel-objs = vmx.o
 obj-$(CONFIG_KVM_INTEL) += kvm-intel.o
diff --git a/drivers/kvm/i8259.c b/drivers/kvm/i8259.c
new file mode 100644
index 0000000..9ba8c1e
--- /dev/null
+++ b/drivers/kvm/i8259.c
@@ -0,0 +1,435 @@
+/*
+ * QEMU 8259 interrupt controller emulation
+ *
+ * Copyright (c) 2003-2004 Fabrice Bellard
+ * Copyright (c) 2007 Intel Corperation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ * Authors:
+ *   Yaozu (Eddie) Dong <Eddie.dong@intel.com>
+ *   Port from Qemu.
+ */
+#include "virq.h"
+#include <linux/mm.h>
+
+/* set irq level. If an edge is detected, then the IRR is set to 1 */
+static inline void pic_set_irq1(struct pic_state *s, int irq, int level)
+{
+	int mask;
+	mask = 1 << irq;
+	if (s->elcr & mask) {
+		/* level triggered */
+		if (level) {
+			s->irr |= mask;
+			s->last_irr |= mask;
+		} else {
+			s->irr &= ~mask;
+			s->last_irr &= ~mask;
+		}
+	} else {
+		/* edge triggered */
+		if (level) {
+			if ((s->last_irr & mask) == 0)
+				s->irr |= mask;
+			s->last_irr |= mask;
+		} else {
+			s->last_irr &= ~mask;
+		}
+	}
+}
+
+/* return the highest priority found in mask (highest = smallest
+   number). Return 8 if no irq */
+static inline int get_priority(struct pic_state *s, int mask)
+{
+	int priority;
+	if (mask == 0)
+		return 8;
+	priority = 0;
+	while ((mask & (1 << ((priority + s->priority_add) & 7))) == 0)
+		priority++;
+	return priority;
+}
+
+/* return the pic wanted interrupt. return -1 if none */
+static int pic_get_irq(struct pic_state *s)
+{
+	int mask, cur_priority, priority;
+
+	mask = s->irr & ~s->imr;
+	priority = get_priority(s, mask);
+	if (priority == 8)
+		return -1;
+	/* compute current priority. If special fully nested mode on the
+	   master, the IRQ coming from the slave is not taken into account
+	   for the priority computation. */
+	mask = s->isr;
+	if (s->special_fully_nested_mode && s == &s->pics_state->pics[0])
+		mask &= ~(1 << 2);
+	cur_priority = get_priority(s, mask);
+	if (priority < cur_priority) {
+		/* higher priority found: an irq should be generated */
+		return (priority + s->priority_add) & 7;
+	} else {
+		return -1;
+	}
+}
+
+/* raise irq to CPU if necessary. must be called every time the active
+   irq may change */
+static void pic_update_irq(struct pic_state2 *s)
+{
+	int irq2, irq;
+
+	/* first look at slave pic */
+	irq2 = pic_get_irq(&s->pics[1]);
+	if (irq2 >= 0) {
+		/* if irq request by slave pic, signal master PIC */
+		pic_set_irq1(&s->pics[0], 2, 1);
+		pic_set_irq1(&s->pics[0], 2, 0);
+	}
+	/* look at requested irq */
+	irq = pic_get_irq(&s->pics[0]);
+	if (irq >= 0) {
+		s->irq_request(s->irq_request_opaque, 1);
+	}
+}
+
+void pic_set_irq_new(void *opaque, int irq, int level)
+{
+	struct pic_state2 *s = opaque;
+
+	pic_set_irq1(&s->pics[irq >> 3], irq & 7, level);
+	/* used for IOAPIC irqs */
+	if (s->alt_irq_func)
+		s->alt_irq_func(s->alt_irq_opaque, irq, level);
+	pic_update_irq(s);
+}
+
+/* acknowledge interrupt 'irq' */
+static inline void pic_intack(struct pic_state *s, int irq)
+{
+	if (s->auto_eoi) {
+		if (s->rotate_on_auto_eoi)
+			s->priority_add = (irq + 1) & 7;
+	} else {
+		s->isr |= (1 << irq);
+	}
+	/* We don't clear a level sensitive interrupt here */
+	if (!(s->elcr & (1 << irq)))
+		s->irr &= ~(1 << irq);
+}
+
+int pic_read_irq(struct pic_state2 *s)
+{
+	int irq, irq2, intno;
+
+	irq = pic_get_irq(&s->pics[0]);
+	if (irq >= 0) {
+		pic_intack(&s->pics[0], irq);
+		if (irq == 2) {
+			irq2 = pic_get_irq(&s->pics[1]);
+			if (irq2 >= 0) {
+				pic_intack(&s->pics[1], irq2);
+			} else {
+				/* spurious IRQ on slave controller */
+				irq2 = 7;
+			}
+			intno = s->pics[1].irq_base + irq2;
+			irq = irq2 + 8;
+		} else {
+			intno = s->pics[0].irq_base + irq;
+		}
+	} else {
+		/* spurious IRQ on host controller */
+		irq = 7;
+		intno = s->pics[0].irq_base + irq;
+	}
+	pic_update_irq(s);
+
+	return intno;
+}
+
+static void pic_reset(void *opaque)
+{
+	struct pic_state *s = opaque;
+
+	s->last_irr = 0;
+	s->irr = 0;
+	s->imr = 0;
+	s->isr = 0;
+	s->priority_add = 0;
+	s->irq_base = 0;
+	s->read_reg_select = 0;
+	s->poll = 0;
+	s->special_mask = 0;
+	s->init_state = 0;
+	s->auto_eoi = 0;
+	s->rotate_on_auto_eoi = 0;
+	s->special_fully_nested_mode = 0;
+	s->init4 = 0;
+	/* Note: ELCR is not reset */
+}
+
+static void pic_ioport_write(void *opaque, uint32_t addr, uint32_t val)
+{
+	struct pic_state *s = opaque;
+	int priority, cmd, irq;
+
+	addr &= 1;
+	if (addr == 0) {
+		if (val & 0x10) {
+			/* init */
+			pic_reset(s);
+			/* deassert a pending interrupt */
+			s->pics_state->irq_request(s->pics_state->
+						   irq_request_opaque, 0);
+			s->init_state = 1;
+			s->init4 = val & 1;
+			if (val & 0x02)
+				printk(KERN_ERR "single mode not supported");
+			if (val & 0x08)
+				printk(KERN_ERR
+				       "level sensitive irq not supported");
+		} else if (val & 0x08) {
+			if (val & 0x04)
+				s->poll = 1;
+			if (val & 0x02)
+				s->read_reg_select = val & 1;
+			if (val & 0x40)
+				s->special_mask = (val >> 5) & 1;
+		} else {
+			cmd = val >> 5;
+			switch (cmd) {
+			case 0:
+			case 4:
+				s->rotate_on_auto_eoi = cmd >> 2;
+				break;
+			case 1:	/* end of interrupt */
+			case 5:
+				priority = get_priority(s, s->isr);
+				if (priority != 8) {
+					irq = (priority + s->priority_add) & 7;
+					s->isr &= ~(1 << irq);
+					if (cmd == 5)
+						s->priority_add = (irq + 1) & 7;
+					pic_update_irq(s->pics_state);
+				}
+				break;
+			case 3:
+				irq = val & 7;
+				s->isr &= ~(1 << irq);
+				pic_update_irq(s->pics_state);
+				break;
+			case 6:
+				s->priority_add = (val + 1) & 7;
+				pic_update_irq(s->pics_state);
+				break;
+			case 7:
+				irq = val & 7;
+				s->isr &= ~(1 << irq);
+				s->priority_add = (irq + 1) & 7;
+				pic_update_irq(s->pics_state);
+				break;
+			default:
+				/* no operation */
+				break;
+			}
+		}
+	} else {
+		switch (s->init_state) {
+		case 0:
+			/* normal mode */
+			s->imr = val;
+			pic_update_irq(s->pics_state);
+			break;
+		case 1:
+			s->irq_base = val & 0xf8;
+			s->init_state = 2;
+			break;
+		case 2:
+			if (s->init4) {
+				s->init_state = 3;
+			} else {
+				s->init_state = 0;
+			}
+			break;
+		case 3:
+			s->special_fully_nested_mode = (val >> 4) & 1;
+			s->auto_eoi = (val >> 1) & 1;
+			s->init_state = 0;
+			break;
+		}
+	}
+}
+
+static uint32_t pic_poll_read(struct pic_state *s, uint32_t addr1)
+{
+	int ret;
+
+	ret = pic_get_irq(s);
+	if (ret >= 0) {
+		if (addr1 >> 7) {
+			s->pics_state->pics[0].isr &= ~(1 << 2);
+			s->pics_state->pics[0].irr &= ~(1 << 2);
+		}
+		s->irr &= ~(1 << ret);
+		s->isr &= ~(1 << ret);
+		if (addr1 >> 7 || ret != 2)
+			pic_update_irq(s->pics_state);
+	} else {
+		ret = 0x07;
+		pic_update_irq(s->pics_state);
+	}
+
+	return ret;
+}
+
+static uint32_t pic_ioport_read(void *opaque, uint32_t addr1)
+{
+	struct pic_state *s = opaque;
+	unsigned int addr;
+	int ret;
+
+	addr = addr1;
+	addr &= 1;
+	if (s->poll) {
+		ret = pic_poll_read(s, addr1);
+		s->poll = 0;
+	} else {
+		if (addr == 0) {
+			if (s->read_reg_select)
+				ret = s->isr;
+			else
+				ret = s->irr;
+		} else {
+			ret = s->imr;
+		}
+	}
+	return ret;
+}
+
+static void elcr_ioport_write(void *opaque, uint32_t addr, uint32_t val)
+{
+	struct pic_state *s = opaque;
+	s->elcr = val & s->elcr_mask;
+}
+
+static uint32_t elcr_ioport_read(void *opaque, uint32_t addr1)
+{
+	struct pic_state *s = opaque;
+	return s->elcr;
+}
+
+static int picdev_in_range(struct kvm_io_device *this, gpa_t addr)
+{
+	switch (addr) {
+		/* Master PIC */
+	case 0x20:
+	case 0x21:
+		/* Slave PIC */
+	case 0xa0:
+	case 0xa1:
+		/* elcr0  & 1 */
+	case 0x4d0:
+	case 0x4d1:
+		return 1;
+	default:
+		return 0;
+	}
+}
+
+static void picdev_write(struct kvm_io_device *this,
+			 gpa_t addr, int len, const void *val)
+{
+	struct pic_state2 *s = this->private;
+	unsigned char data = *(unsigned char *)val;
+
+	if (len != 1) {
+		printk(KERN_ERR "PIC: non byte write\n");
+		return;
+	}
+	switch (addr) {
+	case 0x20:
+	case 0x21:
+	case 0xa0:
+	case 0xa1:
+		pic_ioport_write(&s->pics[addr >> 7], addr, data);
+		break;
+	case 0x4d0:
+	case 0x4d1:
+		elcr_ioport_write(&s->pics[addr & 1], addr, data);
+		break;
+	}
+}
+
+static void picdev_read(struct kvm_io_device *this,
+			gpa_t addr, int len, void *val)
+{
+	struct pic_state2 *s = this->private;
+	unsigned char data = 0;
+
+	if (len != 1) {
+		printk(KERN_ERR "PIC: non byte read\n");
+		return;
+	}
+	switch (addr) {
+	case 0x20:
+	case 0x21:
+	case 0xa0:
+	case 0xa1:
+		data = pic_ioport_read(&s->pics[addr >> 7], addr);
+		break;
+	case 0x4d0:
+	case 0x4d1:
+		data = elcr_ioport_read(&s->pics[addr & 1], addr);
+		break;
+	}
+	*(unsigned char *)val = data;
+}
+
+/* callback when PIC0 irq status changed */
+static void pic_irq_request(void *opaque, int level)
+{
+	struct kvm *kvm = opaque;
+
+	int_output(kvm->vpic) = level;
+}
+
+struct pic_state2 *create_pic(struct kvm *kvm)
+{
+	struct pic_state2 *s;
+	s = kzalloc(sizeof(struct pic_state2), GFP_KERNEL);
+	if (!s)
+		return NULL;
+	s->pics[0].elcr_mask = 0xf8;
+	s->pics[1].elcr_mask = 0xde;
+	s->irq_request = pic_irq_request;
+	s->irq_request_opaque = kvm;
+	s->pics[0].pics_state = s;
+	s->pics[1].pics_state = s;
+
+	/* Initialize PIO device */
+	s->dev.read = picdev_read;
+	s->dev.write = picdev_write;
+	s->dev.in_range = picdev_in_range;
+	s->dev.private = s;
+	kvm_io_bus_register_dev(&kvm->pio_bus, &s->dev);
+	return s;
+}
diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
index a7c5e6b..1e2d3ba 100644
--- a/drivers/kvm/kvm.h
+++ b/drivers/kvm/kvm.h
@@ -456,8 +456,15 @@ struct kvm {
 	struct file *filp;
 	struct kvm_io_bus mmio_bus;
 	struct kvm_io_bus pio_bus;
+	void *vpic;
 };
 
+#define kernel_pic(kvm) ((kvm)->vpic)
+static inline int irqchip_in_kernel(struct kvm *kvm)
+{
+	return (kernel_pic(kvm) != 0);
+}
+
 struct descriptor_table {
 	u16 limit;
 	unsigned long base;
diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index 7826f16..25f99fc 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -16,6 +16,7 @@
  */
 
 #include "kvm.h"
+#include "virq.h"
 
 #include <linux/kvm.h>
 #include <linux/module.h>
@@ -476,6 +477,8 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	spin_lock(&kvm_lock);
 	list_del(&kvm->vm_list);
 	spin_unlock(&kvm_lock);
+	if (kernel_pic(kvm))
+		kfree(kernel_pic(kvm));
 	kvm_io_bus_destroy(&kvm->pio_bus);
 	kvm_io_bus_destroy(&kvm->mmio_bus);
 	kvm_free_vcpus(kvm);
@@ -1372,7 +1375,8 @@ EXPORT_SYMBOL_GPL(emulate_instruction);
 
 int kvm_emulate_halt(struct kvm_vcpu *vcpu)
 {
-	if (vcpu->irq_summary)
+	if (vcpu->irq_summary ||
+           (irqchip_in_kernel(vcpu->kvm) && cpu_has_interrupt(vcpu)))
 		return 1;
 
 	vcpu->run->exit_reason = KVM_EXIT_HLT;
@@ -2855,6 +2859,26 @@ static long kvm_vm_ioctl(struct file *filp,
 			goto out;
 		break;
 	}
+	case KVM_CREATE_PIC:
+		r = -ENOMEM;
+		kernel_pic(kvm) = create_pic(kvm);
+		if (kernel_pic(kvm)) {
+			r = 0;
+		}
+		break;
+	case KVM_SET_ISA_IRQ_LEVEL: {
+		struct kvm_irq_level irq_event;
+		r = -EFAULT;
+		if (copy_from_user(&irq_event, argp, sizeof irq_event))
+			goto out;
+		if (irqchip_in_kernel(kvm)) {
+			pic_set_irq_new(kernel_pic(kvm),
+					irq_event.irq,
+					irq_event.level);
+			r = 0;
+		}
+		break;
+	}
 	default:
 		;
 	}
diff --git a/drivers/kvm/virq.c b/drivers/kvm/virq.c
new file mode 100644
index 0000000..a807296
--- /dev/null
+++ b/drivers/kvm/virq.c
@@ -0,0 +1,58 @@
+/*
+ * virq.c: API for in kernel interrupt controller
+ * Copyright (c) 2007, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ * Authors:
+ *   Yaozu (Eddie) Dong <Eddie.dong@intel.com>
+ *
+ */
+
+#include "kvm.h"
+#include "virq.h"
+
+/*
+ * check if there is pending interrupt without
+ * intack.
+ */
+int cpu_has_interrupt(struct kvm_vcpu *v)
+{
+	struct pic_state2 *s = kernel_pic(v->kvm);
+
+	/* PIC */
+	if (int_output(s))
+		return 1;
+	/* TODO: APIC */
+	return 0;
+}
+
+EXPORT_SYMBOL_GPL(cpu_has_interrupt);
+
+/*
+ * Read pending interrupt vector and intack.
+ */
+int cpu_get_interrupt(struct kvm_vcpu *v)
+{
+	struct pic_state2 *s = kernel_pic(v->kvm);
+	int vector;
+
+	int_output(s) = 0;
+	vector = pic_read_irq(s);
+	if (vector != -1)
+		return vector;
+	/* TODO: APIC */
+	return -1;
+}
+
+EXPORT_SYMBOL_GPL(cpu_get_interrupt);
diff --git a/drivers/kvm/virq.h b/drivers/kvm/virq.h
new file mode 100644
index 0000000..ed467dc
--- /dev/null
+++ b/drivers/kvm/virq.h
@@ -0,0 +1,70 @@
+/*
+ * virq.h: in kernel interrupt controller related definitions
+ * Copyright (c) 2007, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ * Authors:
+ *   Yaozu (Eddie) Dong <Eddie.dong@intel.com>
+ *
+ */
+
+#ifndef __VIRQ_H
+#define __VIRQ_H
+
+#include "kvm.h"
+
+typedef void SetIRQFunc(void *opaque, int irq_num, int level);
+typedef void IRQRequestFunc(void *opaque, int level);
+
+struct PicState2;
+struct pic_state {
+	uint8_t last_irr;	/* edge detection */
+	uint8_t irr;		/* interrupt request register */
+	uint8_t imr;		/* interrupt mask register */
+	uint8_t isr;		/* interrupt service register */
+	uint8_t priority_add;	/* highest irq priority */
+	uint8_t irq_base;
+	uint8_t read_reg_select;
+	uint8_t poll;
+	uint8_t special_mask;
+	uint8_t init_state;
+	uint8_t auto_eoi;
+	uint8_t rotate_on_auto_eoi;
+	uint8_t special_fully_nested_mode;
+	uint8_t init4;		/* true if 4 byte init */
+	uint8_t elcr;		/* PIIX edge/trigger selection */
+	uint8_t elcr_mask;
+	struct pic_state2 *pics_state;
+};
+
+struct pic_state2 {
+	/* 0 is master pic, 1 is slave pic */
+	struct pic_state pics[2];
+	IRQRequestFunc *irq_request;
+	void *irq_request_opaque;
+	/* IOAPIC callback support */
+	SetIRQFunc *alt_irq_func;
+	void *alt_irq_opaque;
+	int output;		/* intr from master PIC */
+	struct kvm_io_device dev;
+};
+
+#define int_output(s)		(((struct pic_state2 *)s)->output)
+struct pic_state2 *create_pic(struct kvm *kvm);
+void pic_set_irq_new(void *opaque, int irq, int level);
+int pic_read_irq(struct pic_state2 *s);
+int cpu_get_interrupt(struct kvm_vcpu *v);
+int cpu_has_interrupt(struct kvm_vcpu *v);
+
+#endif
diff --git a/drivers/kvm/vmx.c b/drivers/kvm/vmx.c
index b47ddcc..cef0d64 100644
--- a/drivers/kvm/vmx.c
+++ b/drivers/kvm/vmx.c
@@ -16,6 +16,7 @@
  */
 
 #include "kvm.h"
+#include "virq.h"
 #include "vmx.h"
 #include <linux/module.h>
 #include <linux/kernel.h>
@@ -1464,6 +1465,16 @@ static void inject_rmode_irq(struct kvm_vcpu *vcpu, int irq)
 	vmcs_writel(GUEST_RSP, (vmcs_readl(GUEST_RSP) & ~0xffff) | (sp - 6));
 }
 
+static void vmx_inject_irq(struct kvm_vcpu *vcpu, int irq)
+{
+	if (vcpu->rmode.active) {
+		inject_rmode_irq(vcpu, irq);
+		return;
+	}
+	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
+			irq | INTR_TYPE_EXT_INTR | INTR_INFO_VALID_MASK);
+}
+
 static void kvm_do_inject_irq(struct kvm_vcpu *vcpu)
 {
 	int word_index = __ffs(vcpu->irq_summary);
@@ -1473,13 +1484,7 @@ static void kvm_do_inject_irq(struct kvm_vcpu *vcpu)
 	clear_bit(bit_index, &vcpu->irq_pending[word_index]);
 	if (!vcpu->irq_pending[word_index])
 		clear_bit(word_index, &vcpu->irq_summary);
-
-	if (vcpu->rmode.active) {
-		inject_rmode_irq(vcpu, irq);
-		return;
-	}
-	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
-			irq | INTR_TYPE_EXT_INTR | INTR_INFO_VALID_MASK);
+	vmx_inject_irq(vcpu, irq);
 }
 
 
@@ -1563,7 +1568,7 @@ static int handle_exception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 		       "intr info 0x%x\n", __FUNCTION__, vect_info, intr_info);
 	}
 
-	if (is_external_interrupt(vect_info)) {
+	if (!irqchip_in_kernel(vcpu->kvm) && is_external_interrupt(vect_info)) {
 		int irq = vect_info & VECTORING_INFO_VECTOR_MASK;
 		set_bit(irq, vcpu->irq_pending);
 		set_bit(irq / BITS_PER_LONG, &vcpu->irq_summary);
@@ -1894,6 +1899,12 @@ static void post_kvm_run_save(struct kvm_vcpu *vcpu,
 static int handle_interrupt_window(struct kvm_vcpu *vcpu,
 				   struct kvm_run *kvm_run)
 {
+	u32 cpu_based_vm_exec_control;
+
+	/* clear pending irq */
+	cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
+	cpu_based_vm_exec_control &= ~CPU_BASED_VIRTUAL_INTR_PENDING;
+	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control);
 	/*
 	 * If the user space waits to inject interrupts, exit as soon as
 	 * possible
@@ -1986,6 +1997,56 @@ static void vmx_flush_tlb(struct kvm_vcpu *vcpu)
 	vmcs_writel(GUEST_CR3, vmcs_readl(GUEST_CR3));
 }
 
+void enable_irq_window(struct kvm_vcpu *vcpu)
+{
+	u32 cpu_based_vm_exec_control;
+
+	cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
+	cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
+	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control);
+}
+
+void do_intr_assist(struct kvm_vcpu *vcpu)
+{
+	u32 idtv_info_field, intr_info_field;
+	int has_ext_irq, interrupt_window_open;
+	/* TODO: Move IDT_Vectoring here */
+
+	has_ext_irq = cpu_has_interrupt(vcpu);
+	intr_info_field = vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
+	idtv_info_field = vmcs_read32(IDT_VECTORING_INFO_FIELD);
+	if (intr_info_field & INTR_INFO_VALID_MASK) {
+		if (idtv_info_field & INTR_INFO_VALID_MASK) {
+			/* TODO: fault when IDT_Vectoring */
+			printk(KERN_ERR "Fault when IDT_Vectoring\n");
+		}
+		if (has_ext_irq)
+			enable_irq_window(vcpu);
+		return;
+	}
+	if (unlikely(idtv_info_field & INTR_INFO_VALID_MASK)) {
+		vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, idtv_info_field);
+		vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
+				vmcs_read32(VM_EXIT_INSTRUCTION_LEN));
+
+		if ( unlikely(idtv_info_field & 0x800) ) /* valid error code */
+			vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
+				vmcs_read32(IDT_VECTORING_ERROR_CODE));
+		if ( unlikely(has_ext_irq) )
+			enable_irq_window(vcpu);
+		return;
+	}
+	if (!has_ext_irq)
+		return;
+	interrupt_window_open =
+		((vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
+		 (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & 3) == 0);
+	if (interrupt_window_open)
+		vmx_inject_irq(vcpu, cpu_get_interrupt(vcpu));
+	else
+		enable_irq_window(vcpu);
+}
+
 static int vmx_vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 {
 	u8 fail;
@@ -1996,9 +2057,6 @@ preempted:
 		kvm_guest_debug_pre(vcpu);
 
 again:
-	if (!vcpu->mmio_read_completed)
-		do_interrupt_requests(vcpu, kvm_run);
-
 	vmx_save_host_state(vcpu);
 	kvm_load_guest_fpu(vcpu);
 
@@ -2012,6 +2070,10 @@ again:
 	vmcs_writel(HOST_CR0, read_cr0());
 
 	local_irq_disable();
+	if (irqchip_in_kernel(vcpu->kvm))
+		do_intr_assist(vcpu);
+	else if (!vcpu->mmio_read_completed)
+		do_interrupt_requests(vcpu, kvm_run);
 
 	vcpu->guest_mode = 1;
 	if (vcpu->requests)
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index e6edca8..fbcfc86 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -41,6 +41,12 @@ struct kvm_memory_alias {
 	__u64 target_phys_addr;
 };
 
+/* for KVM_SET_IRQ_LEVEL */
+struct kvm_irq_level {
+	__u32 irq;
+	__u32 level;
+};
+
 enum kvm_exit_reason {
 	KVM_EXIT_UNKNOWN          = 0,
 	KVM_EXIT_EXCEPTION        = 1,
@@ -282,6 +288,9 @@ struct kvm_signal_mask {
 #define KVM_CREATE_VCPU           _IO(KVMIO,  0x41)
 #define KVM_GET_DIRTY_LOG         _IOW(KVMIO, 0x42, struct kvm_dirty_log)
 #define KVM_SET_MEMORY_ALIAS      _IOW(KVMIO, 0x43, struct kvm_memory_alias)
+/* Device model IOC */
+#define KVM_CREATE_PIC		  _IO(KVMIO,  0x60)
+#define KVM_SET_ISA_IRQ_LEVEL	  _IO(KVMIO,  0x61)
 
 /*
  * ioctls for vcpu fds

[-- Attachment #3: Type: text/plain, Size: 286 bytes --]

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

[-- Attachment #4: Type: text/plain, Size: 186 bytes --]

_______________________________________________
kvm-devel mailing list
kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org
https://lists.sourceforge.net/lists/listinfo/kvm-devel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

[parent not found: <10EA09EFD8728347A513008B6B0DA77A01A55F0A-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>]

* Re: In kernel PIC support: kernel patch
       [not found] ` <10EA09EFD8728347A513008B6B0DA77A01A55F0A-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2007-06-22 10:50   ` Avi Kivity
       [not found]     ` <467BA95A.704-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Avi Kivity @ 2007-06-22 10:50 UTC (permalink / raw)
  To: Dong, Eddie; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Dong, Eddie wrote:

First, sorry for the delayed review.

> +#include "virq.h"
> +#include <linux/mm.h>
>   

Includes from <linux/> before includes from local directory please.

> +
> +/* set irq level. If an edge is detected, then the IRR is set to 1 */
>   

Kernel comments like this:

/*
 * blah blah blah
 */

This comment applies to the rest of the patch.

> diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
> index a7c5e6b..1e2d3ba 100644
> --- a/drivers/kvm/kvm.h
> +++ b/drivers/kvm/kvm.h
> @@ -456,8 +456,15 @@ struct kvm {
>  	struct file *filp;
>  	struct kvm_io_bus mmio_bus;
>  	struct kvm_io_bus pio_bus;
> +	void *vpic;
>   

Why void?

>  };
>  
> +#define kernel_pic(kvm) ((kvm)->vpic)
>   

Please, no function-like macros.  In this case, just put kvm->vpic
everywhere.

>  	spin_lock(&kvm_lock);
>  	list_del(&kvm->vm_list);
>  	spin_unlock(&kvm_lock);
> +	if (kernel_pic(kvm))
> +		kfree(kernel_pic(kvm));
>   

kfree() knows about null pointers, no need for the check.

> +	case KVM_CREATE_PIC:
> +		r = -ENOMEM;
> +		kernel_pic(kvm) = create_pic(kvm);
>   

Ow!

> +		if (kernel_pic(kvm)) {
> +			r = 0;
> +		}
>   

No {} for single-line if statements in the kernel.

> +		break;
> +	case KVM_SET_ISA_IRQ_LEVEL: {
> +		struct kvm_irq_level irq_event;
>   

blank line here.

> +#include "kvm.h"
> +#include "virq.h"
> +
> +/*
> + * check if there is pending interrupt without
> + * intack.
> + */
> +int cpu_has_interrupt(struct kvm_vcpu *v)
> +{
> +	struct pic_state2 *s = kernel_pic(v->kvm);
> +
>   

Isn't the PIC connected to just one cpu?

> +	/* PIC */
> +	if (int_output(s))
> +		return 1;
> +	/* TODO: APIC */
> +	return 0;
> +}
> +
> +EXPORT_SYMBOL_GPL(cpu_has_interrupt);
>   

exported symbols (or even non-static symbols) should begin with kvm_.

> +
> +/*
> + * Read pending interrupt vector and intack.
> + */
> +int cpu_get_interrupt(struct kvm_vcpu *v)
> +{
> +	struct pic_state2 *s = kernel_pic(v->kvm);
> +	int vector;
> +
> +	int_output(s) = 0;
> +	vector = pic_read_irq(s);
> +	if (vector != -1)
> +		return vector;
> +	/* TODO: APIC */
> +	return -1;
> +}
> +
> +EXPORT_SYMBOL_GPL(cpu_get_interrupt);
>   

kvm_.

> +
> +typedef void SetIRQFunc(void *opaque, int irq_num, int level);
> +typedef void IRQRequestFunc(void *opaque, int level);
>   

Lower case names.

> +
> +struct PicState2;
> +struct pic_state {
> +	uint8_t last_irr;	/* edge detection */
> +	uint8_t irr;		/* interrupt request register */
> +	uint8_t imr;		/* interrupt mask register */
> +	uint8_t isr;		/* interrupt service register */
> +	uint8_t priority_add;	/* highest irq priority */
> +	uint8_t irq_base;
> +	uint8_t read_reg_select;
> +	uint8_t poll;
> +	uint8_t special_mask;
> +	uint8_t init_state;
> +	uint8_t auto_eoi;
> +	uint8_t rotate_on_auto_eoi;
> +	uint8_t special_fully_nested_mode;
> +	uint8_t init4;		/* true if 4 byte init */
> +	uint8_t elcr;		/* PIIX edge/trigger selection */
> +	uint8_t elcr_mask;
> +	struct pic_state2 *pics_state;
> +};
>   

Use u8, not uint8_t in kernel code.

> +
> +struct pic_state2 {
> +	/* 0 is master pic, 1 is slave pic */
> +	struct pic_state pics[2];
> +	IRQRequestFunc *irq_request;
> +	void *irq_request_opaque;
> +	/* IOAPIC callback support */
> +	SetIRQFunc *alt_irq_func;
> +	void *alt_irq_opaque;
> +	int output;		/* intr from master PIC */
> +	struct kvm_io_device dev;
> +};
>   

Prefix structure name with kvm_.

> +
> +#define int_output(s)		(((struct pic_state2 *)s)->output)
>   

This is a dangerous macro!  It takes anything and treats it as a pic_state2.

Please remove.


> +struct pic_state2 *create_pic(struct kvm *kvm);
> +void pic_set_irq_new(void *opaque, int irq, int level);
> +int pic_read_irq(struct pic_state2 *s);
> +int cpu_get_interrupt(struct kvm_vcpu *v);
> +int cpu_has_interrupt(struct kvm_vcpu *v);
>   

Prefix globally visible names with kvm_.

>  
> +void enable_irq_window(struct kvm_vcpu *vcpu)
> +{
> +	u32 cpu_based_vm_exec_control;
> +
> +	cpu_based_vm_exec_control =
> vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
> +	cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
> +	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL,
> cpu_based_vm_exec_control);
> +}
>   

static.
> +
> +void do_intr_assist(struct kvm_vcpu *vcpu)
> +{
> +	u32 idtv_info_field, intr_info_field;
> +	int has_ext_irq, interrupt_window_open;
> +	/* TODO: Move IDT_Vectoring here */
> +
> +	has_ext_irq = cpu_has_interrupt(vcpu);
> +	intr_info_field = vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
> +	idtv_info_field = vmcs_read32(IDT_VECTORING_INFO_FIELD);
> +	if (intr_info_field & INTR_INFO_VALID_MASK) {
> +		if (idtv_info_field & INTR_INFO_VALID_MASK) {
> +			/* TODO: fault when IDT_Vectoring */
> +			printk(KERN_ERR "Fault when IDT_Vectoring\n");
> +		}
> +		if (has_ext_irq)
> +			enable_irq_window(vcpu);
> +		return;
> +	}
> +	if (unlikely(idtv_info_field & INTR_INFO_VALID_MASK)) {
> +		vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, idtv_info_field);
> +		vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
> +				vmcs_read32(VM_EXIT_INSTRUCTION_LEN));
> +
> +		if ( unlikely(idtv_info_field & 0x800) ) /* valid error
> code */
> +			vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
> +				vmcs_read32(IDT_VECTORING_ERROR_CODE));
> +		if ( unlikely(has_ext_irq) )
> +			enable_irq_window(vcpu);
> +		return;
> +	}
>   

This isn't Xen.  No spaces inside if () condition.

>  
> +/* for KVM_SET_IRQ_LEVEL */
> +struct kvm_irq_level {
> +	__u32 irq;
> +	__u32 level;
> +};
>   

As discussed, we might want to make this a list, and add a chip ID so
this can be used for the ioapic too.

> +
>  enum kvm_exit_reason {
>  	KVM_EXIT_UNKNOWN          = 0,
>  	KVM_EXIT_EXCEPTION        = 1,
> @@ -282,6 +288,9 @@ struct kvm_signal_mask {
>  #define KVM_CREATE_VCPU           _IO(KVMIO,  0x41)
>  #define KVM_GET_DIRTY_LOG         _IOW(KVMIO, 0x42, struct
> kvm_dirty_log)
>  #define KVM_SET_MEMORY_ALIAS      _IOW(KVMIO, 0x43, struct
> kvm_memory_alias)
> +/* Device model IOC */
> +#define KVM_CREATE_PIC		  _IO(KVMIO,  0x60)
> +#define KVM_SET_ISA_IRQ_LEVEL	  _IO(KVMIO,  0x61)
>  
>   

Need to add a way to detect if this functionality is available, via
KVM_CHECK_EXTENSION.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

[parent not found: <467BA95A.704-atKUWr5tajBWk0Htik3J/w@public.gmane.org>]

* Re: In kernel PIC support: kernel patch
       [not found]     ` <467BA95A.704-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-06-22 14:21       ` Rusty Russell
       [not found]         ` <1182522074.17478.45.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  2007-06-25  1:45       ` Dong, Eddie
  2007-06-27 15:25       ` Dong, Eddie
  2 siblings, 1 reply; 57+ messages in thread
From: Rusty Russell @ 2007-06-22 14:21 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Fri, 2007-06-22 at 13:50 +0300, Avi Kivity wrote:
> Dong, Eddie wrote:
> > +
> > +/* set irq level. If an edge is detected, then the IRR is set to 1 */
> >   
> 
> Kernel comments like this:
> 
> /*
>  * blah blah blah
>  */
> 
> This comment applies to the rest of the patch.

For one-liners?  Last I paid attention the main debate was about
multi-line comments.

Cheers,
Rusty.



-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

[parent not found: <1182522074.17478.45.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>]

* Re: In kernel PIC support: kernel patch
       [not found]         ` <1182522074.17478.45.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-06-23  7:50           ` Avi Kivity
  0 siblings, 0 replies; 57+ messages in thread
From: Avi Kivity @ 2007-06-23  7:50 UTC (permalink / raw)
  To: Rusty Russell; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Rusty Russell wrote:
> On Fri, 2007-06-22 at 13:50 +0300, Avi Kivity wrote:
>   
>> Dong, Eddie wrote:
>>     
>>> +
>>> +/* set irq level. If an edge is detected, then the IRR is set to 1 */
>>>   
>>>       
>> Kernel comments like this:
>>
>> /*
>>  * blah blah blah
>>  */
>>
>> This comment applies to the rest of the patch.
>>     
>
> For one-liners?  Last I paid attention the main debate was about
> multi-line comments.
>   

I'd say for multiline comments and function headers.



-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: In kernel PIC support: kernel patch
       [not found]     ` <467BA95A.704-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  2007-06-22 14:21       ` Rusty Russell
@ 2007-06-25  1:45       ` Dong, Eddie
       [not found]         ` <10EA09EFD8728347A513008B6B0DA77A01AA45AD-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2007-06-27 15:25       ` Dong, Eddie
  2 siblings, 1 reply; 57+ messages in thread
From: Dong, Eddie @ 2007-06-25  1:45 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

 
>> +#include "kvm.h"
>> +#include "virq.h"
>> +
>> +/*
>> + * check if there is pending interrupt without
>> + * intack.
>> + */
>> +int cpu_has_interrupt(struct kvm_vcpu *v)
>> +{
>> +	struct pic_state2 *s = kernel_pic(v->kvm);
>> +
>> 
> 
> Isn't the PIC connected to just one cpu?

Yes, but will have APIC irq after merge.




>> +
>>  enum kvm_exit_reason {
>>  	KVM_EXIT_UNKNOWN          = 0,
>>  	KVM_EXIT_EXCEPTION        = 1,
>> @@ -282,6 +288,9 @@ struct kvm_signal_mask {
>>  #define KVM_CREATE_VCPU           _IO(KVMIO,  0x41)
>>  #define KVM_GET_DIRTY_LOG         _IOW(KVMIO, 0x42, struct
>>  kvm_dirty_log) #define KVM_SET_MEMORY_ALIAS      _IOW(KVMIO, 0x43,
>> struct kvm_memory_alias) +/* Device model IOC */
>> +#define KVM_CREATE_PIC		  _IO(KVMIO,  0x60)
>> +#define KVM_SET_ISA_IRQ_LEVEL	  _IO(KVMIO,  0x61)
>> 
>> 
> 
> Need to add a way to detect if this functionality is available, via
> KVM_CHECK_EXTENSION. 

Sure. Question:
	Should we check the extension API basis or feature sets list?
I.e. Each API (KVM_CREATE_PIC, KVM_SET_ISA_IRQ_LEVEL) or feature sets:
KVM_KERNEL_IRQCHIP.

Thx, eddie

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

[parent not found: <10EA09EFD8728347A513008B6B0DA77A01AA45AD-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>]

* Re: In kernel PIC support: kernel patch
       [not found]         ` <10EA09EFD8728347A513008B6B0DA77A01AA45AD-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2007-06-25  3:28           ` Avi Kivity
  0 siblings, 0 replies; 57+ messages in thread
From: Avi Kivity @ 2007-06-25  3:28 UTC (permalink / raw)
  To: Dong, Eddie; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Dong, Eddie wrote:
>> Need to add a way to detect if this functionality is available, via
>> KVM_CHECK_EXTENSION. 
>>     
>
> Sure. Question:
> 	Should we check the extension API basis or feature sets list?
> I.e. Each API (KVM_CREATE_PIC, KVM_SET_ISA_IRQ_LEVEL) or feature sets:
> KVM_KERNEL_IRQCHIP.
>
>   

I guess pre-feature.  We want to have fine granularity so we can change 
small things, but per-API is too fine.

So in this case I'd suggest KVM_CAP_PIC, KVM_CAP_APIC, etc. (note the 
KVM_CAP_* namespace).


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: In kernel PIC support: kernel patch
       [not found]     ` <467BA95A.704-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  2007-06-22 14:21       ` Rusty Russell
  2007-06-25  1:45       ` Dong, Eddie
@ 2007-06-27 15:25       ` Dong, Eddie
       [not found]         ` <10EA09EFD8728347A513008B6B0DA77A01AE9255-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2 siblings, 1 reply; 57+ messages in thread
From: Dong, Eddie @ 2007-06-27 15:25 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

[-- Attachment #1: Type: text/plain, Size: 24430 bytes --]

Here is the updated patch with all comments fixed.


Meanwhile I enabled SVM support by copying code from Xen and  managed to
find an AMD box to have a shoot, it works!
Please also find the svm patch.




 drivers/kvm/Makefile   |    2
 drivers/kvm/i8259.c    |  449
+++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/kvm/kvm.h      |   12 +
 drivers/kvm/kvm_irq.c  |   59 ++++++
 drivers/kvm/kvm_irq.h  |   64 ++++++
 drivers/kvm/kvm_main.c |   43 ++++
 drivers/kvm/vmx.c      |   83 +++++++--
 include/linux/kvm.h    |   14 +
 8 files changed, 708 insertions(+), 18 deletions(-)

Signed-off-by: Yaozu (Eddie) Dong <eddie.dong-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

commit d62bc09142edab756538a82f6152cf71ee362ead
Author: root <root@vt32-pae.(none)>
Date:   Mon Jun 25 14:44:59 2007 +0800

    In kernel PIC support

diff --git a/drivers/kvm/Makefile b/drivers/kvm/Makefile
index c0a789f..48a3076 100644
--- a/drivers/kvm/Makefile
+++ b/drivers/kvm/Makefile
@@ -2,7 +2,7 @@
 # Makefile for Kernel-based Virtual Machine module
 #
 
-kvm-objs := kvm_main.o mmu.o x86_emulate.o
+kvm-objs := kvm_main.o mmu.o x86_emulate.o i8259.o kvm_irq.o
 obj-$(CONFIG_KVM) += kvm.o
 kvm-intel-objs = vmx.o
 obj-$(CONFIG_KVM_INTEL) += kvm-intel.o
diff --git a/drivers/kvm/i8259.c b/drivers/kvm/i8259.c
new file mode 100644
index 0000000..8bb94e3
--- /dev/null
+++ b/drivers/kvm/i8259.c
@@ -0,0 +1,449 @@
+/*
+ * QEMU 8259 interrupt controller emulation
+ *
+ * Copyright (c) 2003-2004 Fabrice Bellard
+ * Copyright (c) 2007 Intel Corperation
+ *
+ * Permission is hereby granted, free of charge, to any person
obtaining a copy
+ * of this software and associated documentation files (the
"Software"), to deal
+ * in the Software without restriction, including without limitation
the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or
sell
+ * copies of the Software, and to permit persons to whom the Software
is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be
included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT
SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN
+ * THE SOFTWARE.
+ * Authors:
+ *   Yaozu (Eddie) Dong <Eddie.dong-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
+ *   Port from Qemu.
+ */
+#include <linux/mm.h>
+#include "kvm_irq.h"
+
+/*
+ * set irq level. If an edge is detected, then the IRR is set to 1
+ */
+static inline void pic_set_irq1(struct kvm_pic_state *s, int irq, int
level)
+{
+	int mask;
+	mask = 1 << irq;
+	if (s->elcr & mask) {	/* level triggered */
+		if (level) {
+			s->irr |= mask;
+			s->last_irr |= mask;
+		} else {
+			s->irr &= ~mask;
+			s->last_irr &= ~mask;
+		}
+	} else {	/* edge triggered */
+		if (level) {
+			if ((s->last_irr & mask) == 0)
+				s->irr |= mask;
+			s->last_irr |= mask;
+		} else {
+			s->last_irr &= ~mask;
+		}
+	}
+}
+
+/*
+ * return the highest priority found in mask (highest = smallest
+ * number). Return 8 if no irq
+ */
+static inline int get_priority(struct kvm_pic_state *s, int mask)
+{
+	int priority;
+	if (mask == 0)
+		return 8;
+	priority = 0;
+	while ((mask & (1 << ((priority + s->priority_add) & 7))) == 0)
+		priority++;
+	return priority;
+}
+
+/*
+ * return the pic wanted interrupt. return -1 if none
+ */
+static int pic_get_irq(struct kvm_pic_state *s)
+{
+	int mask, cur_priority, priority;
+
+	mask = s->irr & ~s->imr;
+	priority = get_priority(s, mask);
+	if (priority == 8)
+		return -1;
+	/*
+	 * compute current priority. If special fully nested mode on the
+	 * master, the IRQ coming from the slave is not taken into
account
+	 * for the priority computation.
+	 */
+	mask = s->isr;
+	if (s->special_fully_nested_mode && s ==
&s->pics_state->pics[0])
+		mask &= ~(1 << 2);
+	cur_priority = get_priority(s, mask);
+	if (priority < cur_priority) {
+		/*
+		 * higher priority found: an irq should be generated
+		 */
+		return (priority + s->priority_add) & 7;
+	} else {
+		return -1;
+	}
+}
+
+/*
+ * raise irq to CPU if necessary. must be called every time the active
+ * irq may change
+ */
+static void pic_update_irq(struct kvm_pic *s)
+{
+	int irq2, irq;
+
+	irq2 = pic_get_irq(&s->pics[1]);
+	if (irq2 >= 0) {
+		/*
+		 * if irq request by slave pic, signal master PIC
+		 */
+		pic_set_irq1(&s->pics[0], 2, 1);
+		pic_set_irq1(&s->pics[0], 2, 0);
+	}
+	irq = pic_get_irq(&s->pics[0]);
+	if (irq >= 0) {
+		s->irq_request(s->irq_request_opaque, 1);
+	}
+}
+
+void kvm_pic_set_irq_new(void *opaque, int irq, int level)
+{
+	struct kvm_pic *s = opaque;
+
+	pic_set_irq1(&s->pics[irq >> 3], irq & 7, level);
+	pic_update_irq(s);
+}
+
+/*
+ * acknowledge interrupt 'irq'
+ */
+static inline void pic_intack(struct kvm_pic_state *s, int irq)
+{
+	if (s->auto_eoi) {
+		if (s->rotate_on_auto_eoi)
+			s->priority_add = (irq + 1) & 7;
+	} else {
+		s->isr |= (1 << irq);
+	}
+	/*
+	 * We don't clear a level sensitive interrupt here
+	 */
+	if (!(s->elcr & (1 << irq)))
+		s->irr &= ~(1 << irq);
+}
+
+int kvm_pic_read_irq(struct kvm_pic *s)
+{
+	int irq, irq2, intno;
+
+	irq = pic_get_irq(&s->pics[0]);
+	if (irq >= 0) {
+		pic_intack(&s->pics[0], irq);
+		if (irq == 2) {
+			irq2 = pic_get_irq(&s->pics[1]);
+			if (irq2 >= 0) {
+				pic_intack(&s->pics[1], irq2);
+			} else {
+				/*
+				 * spurious IRQ on slave controller
+				 */
+				irq2 = 7;
+			}
+			intno = s->pics[1].irq_base + irq2;
+			irq = irq2 + 8;
+		} else {
+			intno = s->pics[0].irq_base + irq;
+		}
+	} else {
+		/*
+		 * spurious IRQ on host controller
+		 */
+		irq = 7;
+		intno = s->pics[0].irq_base + irq;
+	}
+	pic_update_irq(s);
+
+	return intno;
+}
+
+static void pic_reset(void *opaque)
+{
+	struct kvm_pic_state *s = opaque;
+
+	s->last_irr = 0;
+	s->irr = 0;
+	s->imr = 0;
+	s->isr = 0;
+	s->priority_add = 0;
+	s->irq_base = 0;
+	s->read_reg_select = 0;
+	s->poll = 0;
+	s->special_mask = 0;
+	s->init_state = 0;
+	s->auto_eoi = 0;
+	s->rotate_on_auto_eoi = 0;
+	s->special_fully_nested_mode = 0;
+	s->init4 = 0;
+}
+
+static void pic_ioport_write(void *opaque, uint32_t addr, uint32_t val)
+{
+	struct kvm_pic_state *s = opaque;
+	int priority, cmd, irq;
+
+	addr &= 1;
+	if (addr == 0) {
+		if (val & 0x10) {
+			pic_reset(s);	/* init */
+			/*
+			 * deassert a pending interrupt
+			 */
+			s->pics_state->irq_request(s->pics_state->
+						   irq_request_opaque,
0);
+			s->init_state = 1;
+			s->init4 = val & 1;
+			if (val & 0x02)
+				printk(KERN_ERR "single mode not
supported");
+			if (val & 0x08)
+				printk(KERN_ERR
+				       "level sensitive irq not
supported");
+		} else if (val & 0x08) {
+			if (val & 0x04)
+				s->poll = 1;
+			if (val & 0x02)
+				s->read_reg_select = val & 1;
+			if (val & 0x40)
+				s->special_mask = (val >> 5) & 1;
+		} else {
+			cmd = val >> 5;
+			switch (cmd) {
+			case 0:
+			case 4:
+				s->rotate_on_auto_eoi = cmd >> 2;
+				break;
+			case 1:	/* end of interrupt */
+			case 5:
+				priority = get_priority(s, s->isr);
+				if (priority != 8) {
+					irq = (priority +
s->priority_add) & 7;
+					s->isr &= ~(1 << irq);
+					if (cmd == 5)
+						s->priority_add = (irq +
1) & 7;
+					pic_update_irq(s->pics_state);
+				}
+				break;
+			case 3:
+				irq = val & 7;
+				s->isr &= ~(1 << irq);
+				pic_update_irq(s->pics_state);
+				break;
+			case 6:
+				s->priority_add = (val + 1) & 7;
+				pic_update_irq(s->pics_state);
+				break;
+			case 7:
+				irq = val & 7;
+				s->isr &= ~(1 << irq);
+				s->priority_add = (irq + 1) & 7;
+				pic_update_irq(s->pics_state);
+				break;
+			default:
+				break;	/* no operation */
+			}
+		}
+	} else {
+		switch (s->init_state) {
+		case 0:		/* normal mode */
+			s->imr = val;
+			pic_update_irq(s->pics_state);
+			break;
+		case 1:
+			s->irq_base = val & 0xf8;
+			s->init_state = 2;
+			break;
+		case 2:
+			if (s->init4) {
+				s->init_state = 3;
+			} else {
+				s->init_state = 0;
+			}
+			break;
+		case 3:
+			s->special_fully_nested_mode = (val >> 4) & 1;
+			s->auto_eoi = (val >> 1) & 1;
+			s->init_state = 0;
+			break;
+		}
+	}
+}
+
+static uint32_t pic_poll_read(struct kvm_pic_state *s, uint32_t addr1)
+{
+	int ret;
+
+	ret = pic_get_irq(s);
+	if (ret >= 0) {
+		if (addr1 >> 7) {
+			s->pics_state->pics[0].isr &= ~(1 << 2);
+			s->pics_state->pics[0].irr &= ~(1 << 2);
+		}
+		s->irr &= ~(1 << ret);
+		s->isr &= ~(1 << ret);
+		if (addr1 >> 7 || ret != 2)
+			pic_update_irq(s->pics_state);
+	} else {
+		ret = 0x07;
+		pic_update_irq(s->pics_state);
+	}
+
+	return ret;
+}
+
+static uint32_t pic_ioport_read(void *opaque, uint32_t addr1)
+{
+	struct kvm_pic_state *s = opaque;
+	unsigned int addr;
+	int ret;
+
+	addr = addr1;
+	addr &= 1;
+	if (s->poll) {
+		ret = pic_poll_read(s, addr1);
+		s->poll = 0;
+	} else {
+		if (addr == 0) {
+			if (s->read_reg_select)
+				ret = s->isr;
+			else
+				ret = s->irr;
+		} else {
+			ret = s->imr;
+		}
+	}
+	return ret;
+}
+
+static void elcr_ioport_write(void *opaque, uint32_t addr, uint32_t
val)
+{
+	struct kvm_pic_state *s = opaque;
+	s->elcr = val & s->elcr_mask;
+}
+
+static uint32_t elcr_ioport_read(void *opaque, uint32_t addr1)
+{
+	struct kvm_pic_state *s = opaque;
+	return s->elcr;
+}
+
+static int picdev_in_range(struct kvm_io_device *this, gpa_t addr)
+{
+	switch (addr) {
+	case 0x20:
+	case 0x21:
+	case 0xa0:
+	case 0xa1:
+	case 0x4d0:
+	case 0x4d1:
+		return 1;
+	default:
+		return 0;
+	}
+}
+
+static void picdev_write(struct kvm_io_device *this,
+			 gpa_t addr, int len, const void *val)
+{
+	struct kvm_pic *s = this->private;
+	unsigned char data = *(unsigned char *)val;
+
+	if (len != 1) {
+		printk(KERN_ERR "PIC: non byte write\n");
+		return;
+	}
+	switch (addr) {
+	case 0x20:
+	case 0x21:
+	case 0xa0:
+	case 0xa1:
+		pic_ioport_write(&s->pics[addr >> 7], addr, data);
+		break;
+	case 0x4d0:
+	case 0x4d1:
+		elcr_ioport_write(&s->pics[addr & 1], addr, data);
+		break;
+	}
+}
+
+static void picdev_read(struct kvm_io_device *this,
+			gpa_t addr, int len, void *val)
+{
+	struct kvm_pic *s = this->private;
+	unsigned char data = 0;
+
+	if (len != 1) {
+		printk(KERN_ERR "PIC: non byte read\n");
+		return;
+	}
+	switch (addr) {
+	case 0x20:
+	case 0x21:
+	case 0xa0:
+	case 0xa1:
+		data = pic_ioport_read(&s->pics[addr >> 7], addr);
+		break;
+	case 0x4d0:
+	case 0x4d1:
+		data = elcr_ioport_read(&s->pics[addr & 1], addr);
+		break;
+	}
+	*(unsigned char *)val = data;
+}
+
+/*
+ * callback when PIC0 irq status changed
+ */
+static void pic_irq_request(void *opaque, int level)
+{
+	struct kvm *kvm = opaque;
+
+	pic_irqchip(kvm)->output = level;
+}
+
+struct kvm_pic *kvm_create_pic(struct kvm *kvm)
+{
+	struct kvm_pic *s;
+	s = kzalloc(sizeof(struct kvm_pic), GFP_KERNEL);
+	if (!s)
+		return NULL;
+	s->pics[0].elcr_mask = 0xf8;
+	s->pics[1].elcr_mask = 0xde;
+	s->irq_request = pic_irq_request;
+	s->irq_request_opaque = kvm;
+	s->pics[0].pics_state = s;
+	s->pics[1].pics_state = s;
+
+	/*
+	 * Initialize PIO device
+	 */
+	s->dev.read = picdev_read;
+	s->dev.write = picdev_write;
+	s->dev.in_range = picdev_in_range;
+	s->dev.private = s;
+	kvm_io_bus_register_dev(&kvm->pio_bus, &s->dev);
+	return s;
+}
diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
index a7c5e6b..f1a6773 100644
--- a/drivers/kvm/kvm.h
+++ b/drivers/kvm/kvm.h
@@ -435,6 +435,7 @@ struct kvm_memory_slot {
 	unsigned long *dirty_bitmap;
 };
 
+struct kvm_pic;
 struct kvm {
 	spinlock_t lock; /* protects everything except vcpus */
 	int naliases;
@@ -456,8 +457,19 @@ struct kvm {
 	struct file *filp;
 	struct kvm_io_bus mmio_bus;
 	struct kvm_io_bus pio_bus;
+	struct kvm_pic *vpic;
 };
 
+static inline struct kvm_pic *pic_irqchip(struct kvm *kvm)
+{
+	return kvm->vpic;
+}
+
+static inline int irqchip_in_kernel(struct kvm *kvm)
+{
+	return pic_irqchip(kvm) != 0;
+}
+
 struct descriptor_table {
 	u16 limit;
 	unsigned long base;
diff --git a/drivers/kvm/kvm_irq.c b/drivers/kvm/kvm_irq.c
new file mode 100644
index 0000000..38b6a14
--- /dev/null
+++ b/drivers/kvm/kvm_irq.c
@@ -0,0 +1,59 @@
+/*
+ * kvm_irq.c: API for in kernel interrupt controller
+ * Copyright (c) 2007, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but
WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License
along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ * Authors:
+ *   Yaozu (Eddie) Dong <Eddie.dong-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
+ *
+ */
+
+#include "kvm.h"
+#include "kvm_irq.h"
+
+/*
+ * check if there is pending interrupt without
+ * intack.
+ */
+int kvm_cpu_has_interrupt(struct kvm_vcpu *v)
+{
+	struct kvm_pic *s = pic_irqchip(v->kvm);
+
+	if (s->output)	/* PIC */
+		return 1;
+	/*
+	 * TODO: APIC
+	 */
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_cpu_has_interrupt);
+
+/*
+ * Read pending interrupt vector and intack.
+ */
+int kvm_cpu_get_interrupt(struct kvm_vcpu *v)
+{
+	struct kvm_pic *s = pic_irqchip(v->kvm);
+	int vector;
+
+	s->output = 0;
+	vector = kvm_pic_read_irq(s);
+	if (vector != -1)
+		return vector;
+	/*
+	 * TODO: APIC
+	 */
+	return -1;
+}
+EXPORT_SYMBOL_GPL(kvm_cpu_get_interrupt);
diff --git a/drivers/kvm/kvm_irq.h b/drivers/kvm/kvm_irq.h
new file mode 100644
index 0000000..f4d4213
--- /dev/null
+++ b/drivers/kvm/kvm_irq.h
@@ -0,0 +1,64 @@
+/*
+ * kvm_irq.h: in kernel interrupt controller related definitions
+ * Copyright (c) 2007, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but
WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License
along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ * Authors:
+ *   Yaozu (Eddie) Dong <Eddie.dong-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
+ *
+ */
+
+#ifndef __VIRQ_H
+#define __VIRQ_H
+
+#include "kvm.h"
+
+typedef void irq_request_func(void *opaque, int level);
+
+struct kvm_pic;
+struct kvm_pic_state {
+	u8 last_irr;	/* edge detection */
+	u8 irr;		/* interrupt request register */
+	u8 imr;		/* interrupt mask register */
+	u8 isr;		/* interrupt service register */
+	u8 priority_add;	/* highest irq priority */
+	u8 irq_base;
+	u8 read_reg_select;
+	u8 poll;
+	u8 special_mask;
+	u8 init_state;
+	u8 auto_eoi;
+	u8 rotate_on_auto_eoi;
+	u8 special_fully_nested_mode;
+	u8 init4;		/* true if 4 byte init */
+	u8 elcr;		/* PIIX edge/trigger selection */
+	u8 elcr_mask;
+	struct kvm_pic *pics_state;
+};
+
+struct kvm_pic {
+	struct kvm_pic_state pics[2]; /* 0 is master pic, 1 is slave pic
*/
+	irq_request_func *irq_request;
+	void *irq_request_opaque;
+	int output;		/* intr from master PIC */
+	struct kvm_io_device dev;
+};
+
+struct kvm_pic *kvm_create_pic(struct kvm *kvm);
+void kvm_pic_set_irq_new(void *opaque, int irq, int level);
+int kvm_pic_read_irq(struct kvm_pic *s);
+int kvm_cpu_get_interrupt(struct kvm_vcpu *v);
+int kvm_cpu_has_interrupt(struct kvm_vcpu *v);
+
+#endif
diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index 5603000..d7e5ca8 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -46,6 +46,7 @@
 
 #include "x86_emulate.h"
 #include "segment_descriptor.h"
+#include "kvm_irq.h"
 
 MODULE_AUTHOR("Qumranet");
 MODULE_LICENSE("GPL");
@@ -476,6 +477,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	spin_lock(&kvm_lock);
 	list_del(&kvm->vm_list);
 	spin_unlock(&kvm_lock);
+	kfree(kvm->vpic);
 	kvm_io_bus_destroy(&kvm->pio_bus);
 	kvm_io_bus_destroy(&kvm->mmio_bus);
 	kvm_free_vcpus(kvm);
@@ -1374,7 +1376,8 @@ EXPORT_SYMBOL_GPL(emulate_instruction);
 
 int kvm_emulate_halt(struct kvm_vcpu *vcpu)
 {
-	if (vcpu->irq_summary)
+	if (vcpu->irq_summary ||
+		(irqchip_in_kernel(vcpu->kvm) &&
kvm_cpu_has_interrupt(vcpu)))
 		return 1;
 
 	vcpu->run->exit_reason = KVM_EXIT_HLT;
@@ -2857,6 +2860,27 @@ static long kvm_vm_ioctl(struct file *filp,
 			goto out;
 		break;
 	}
+	case KVM_CREATE_PIC:
+		r = -ENOMEM;
+		kvm->vpic = kvm_create_pic(kvm);
+		if (kvm->vpic) {
+			r = 0;
+		}
+		break;
+	case KVM_IRQ_LINE: {
+		struct kvm_irq_level irq_event;
+
+		r = -EFAULT;
+		if (copy_from_user(&irq_event, argp, sizeof irq_event))
+			goto out;
+		if (irqchip_in_kernel(kvm)) {
+			kvm_pic_set_irq_new(pic_irqchip(kvm),
+					irq_event.irq,
+					irq_event.level);
+			r = 0;
+		}
+		break;
+	}
 	default:
 		;
 	}
@@ -2988,12 +3012,19 @@ static long kvm_dev_ioctl(struct file *filp,
 		r = 0;
 		break;
 	}
-	case KVM_CHECK_EXTENSION:
-		/*
-		 * No extensions defined at present.
-		 */
-		r = 0;
+	case KVM_CHECK_EXTENSION: {
+		int ext = (long)argp;
+
+		switch (ext) {
+		case KVM_CAP_PIC:
+			r = 1;
+			break;
+		default:
+			r = 0;
+			break;
+		}
 		break;
+	}
 	case KVM_GET_VCPU_MMAP_SIZE:
 		r = -EINVAL;
 		if (arg)
diff --git a/drivers/kvm/vmx.c b/drivers/kvm/vmx.c
index 7d04ffa..2d42c73 100644
--- a/drivers/kvm/vmx.c
+++ b/drivers/kvm/vmx.c
@@ -16,6 +16,7 @@
  */
 
 #include "kvm.h"
+#include "kvm_irq.h"
 #include "vmx.h"
 #include <linux/module.h>
 #include <linux/kernel.h>
@@ -1468,6 +1469,16 @@ static void inject_rmode_irq(struct kvm_vcpu
*vcpu, int irq)
 	vmcs_writel(GUEST_RSP, (vmcs_readl(GUEST_RSP) & ~0xffff) | (sp -
6));
 }
 
+static void vmx_inject_irq(struct kvm_vcpu *vcpu, int irq)
+{
+	if (vcpu->rmode.active) {
+		inject_rmode_irq(vcpu, irq);
+		return;
+	}
+	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
+			irq | INTR_TYPE_EXT_INTR |
INTR_INFO_VALID_MASK);
+}
+
 static void kvm_do_inject_irq(struct kvm_vcpu *vcpu)
 {
 	int word_index = __ffs(vcpu->irq_summary);
@@ -1477,13 +1488,7 @@ static void kvm_do_inject_irq(struct kvm_vcpu
*vcpu)
 	clear_bit(bit_index, &vcpu->irq_pending[word_index]);
 	if (!vcpu->irq_pending[word_index])
 		clear_bit(word_index, &vcpu->irq_summary);
-
-	if (vcpu->rmode.active) {
-		inject_rmode_irq(vcpu, irq);
-		return;
-	}
-	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
-			irq | INTR_TYPE_EXT_INTR |
INTR_INFO_VALID_MASK);
+	vmx_inject_irq(vcpu, irq);
 }
 
 
@@ -1567,7 +1572,7 @@ static int handle_exception(struct kvm_vcpu *vcpu,
struct kvm_run *kvm_run)
 		       "intr info 0x%x\n", __FUNCTION__, vect_info,
intr_info);
 	}
 
-	if (is_external_interrupt(vect_info)) {
+	if (!irqchip_in_kernel(vcpu->kvm) &&
is_external_interrupt(vect_info)) {
 		int irq = vect_info & VECTORING_INFO_VECTOR_MASK;
 		set_bit(irq, vcpu->irq_pending);
 		set_bit(irq / BITS_PER_LONG, &vcpu->irq_summary);
@@ -1898,6 +1903,12 @@ static void post_kvm_run_save(struct kvm_vcpu
*vcpu,
 static int handle_interrupt_window(struct kvm_vcpu *vcpu,
 				   struct kvm_run *kvm_run)
 {
+	u32 cpu_based_vm_exec_control;
+
+	/* clear pending irq */
+	cpu_based_vm_exec_control =
vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
+	cpu_based_vm_exec_control &= ~CPU_BASED_VIRTUAL_INTR_PENDING;
+	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL,
cpu_based_vm_exec_control);
 	/*
 	 * If the user space waits to inject interrupts, exit as soon as
 	 * possible
@@ -1989,6 +2000,55 @@ static void vmx_flush_tlb(struct kvm_vcpu *vcpu)
 {
 }
 
+static void enable_irq_window(struct kvm_vcpu *vcpu)
+{
+	u32 cpu_based_vm_exec_control;
+
+	cpu_based_vm_exec_control =
vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
+	cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
+	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL,
cpu_based_vm_exec_control);
+}
+
+static void vmx_intr_assist(struct kvm_vcpu *vcpu)
+{
+	u32 idtv_info_field, intr_info_field;
+	int has_ext_irq, interrupt_window_open;
+
+	has_ext_irq = kvm_cpu_has_interrupt(vcpu);
+	intr_info_field = vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
+	idtv_info_field = vmcs_read32(IDT_VECTORING_INFO_FIELD);
+	if (intr_info_field & INTR_INFO_VALID_MASK) {
+		if (idtv_info_field & INTR_INFO_VALID_MASK) {
+			/* TODO: fault when IDT_Vectoring */
+			printk(KERN_ERR "Fault when IDT_Vectoring\n");
+		}
+		if (has_ext_irq)
+			enable_irq_window(vcpu);
+		return;
+	}
+	if (unlikely(idtv_info_field & INTR_INFO_VALID_MASK)) {
+		vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, idtv_info_field);
+		vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
+				vmcs_read32(VM_EXIT_INSTRUCTION_LEN));
+
+		if (unlikely(idtv_info_field &
INTR_INFO_DELIEVER_CODE_MASK))
+			vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
+				vmcs_read32(IDT_VECTORING_ERROR_CODE));
+		if (unlikely(has_ext_irq))
+			enable_irq_window(vcpu);
+		return;
+	}
+	if (!has_ext_irq)
+		return;
+	interrupt_window_open =
+		((vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
+		 (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & 3) == 0);
+	if (interrupt_window_open)
+		vmx_inject_irq(vcpu, kvm_cpu_get_interrupt(vcpu));
+	else
+		enable_irq_window(vcpu);
+}
+
 static int vmx_vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 {
 	u8 fail;
@@ -1999,9 +2059,6 @@ preempted:
 		kvm_guest_debug_pre(vcpu);
 
 again:
-	if (!vcpu->mmio_read_completed)
-		do_interrupt_requests(vcpu, kvm_run);
-
 	vmx_save_host_state(vcpu);
 	kvm_load_guest_fpu(vcpu);
 
@@ -2015,6 +2072,10 @@ again:
 	vmcs_writel(HOST_CR0, read_cr0());
 
 	local_irq_disable();
+	if (irqchip_in_kernel(vcpu->kvm))
+		vmx_intr_assist(vcpu);
+	else if (!vcpu->mmio_read_completed)
+		do_interrupt_requests(vcpu, kvm_run);
 
 	vcpu->guest_mode = 1;
 	if (vcpu->requests)
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index e6edca8..1738f92 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -41,6 +41,12 @@ struct kvm_memory_alias {
 	__u64 target_phys_addr;
 };
 
+/* for KVM_SET_IRQ_LEVEL */
+struct kvm_irq_level {
+	__u32 irq;
+	__u32 level;
+};
+
 enum kvm_exit_reason {
 	KVM_EXIT_UNKNOWN          = 0,
 	KVM_EXIT_EXCEPTION        = 1,
@@ -272,6 +278,11 @@ struct kvm_signal_mask {
 #define KVM_GET_VCPU_MMAP_SIZE    _IO(KVMIO,   0x04) /* in bytes */
 
 /*
+ * Extension capability list.
+ */
+#define KVM_CAP_PIC	  0
+
+/*
  * ioctls for VM fds
  */
 #define KVM_SET_MEMORY_REGION     _IOW(KVMIO, 0x40, struct
kvm_memory_region)
@@ -282,6 +293,9 @@ struct kvm_signal_mask {
 #define KVM_CREATE_VCPU           _IO(KVMIO,  0x41)
 #define KVM_GET_DIRTY_LOG         _IOW(KVMIO, 0x42, struct
kvm_dirty_log)
 #define KVM_SET_MEMORY_ALIAS      _IOW(KVMIO, 0x43, struct
kvm_memory_alias)
+/* Device model IOC */
+#define KVM_CREATE_PIC		  _IO(KVMIO,  0x60)
+#define KVM_IRQ_LINE		  _IO(KVMIO,  0x61)
 
 /*
  * ioctls for vcpu fds

[-- Attachment #2: pic-user-v25.patch.rebase --]
[-- Type: application/octet-stream, Size: 5468 bytes --]

In kernel PIC support in user level.

Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>

diff --git a/kernel/Kbuild b/kernel/Kbuild
index e9bcda7..ea64fe3 100644
--- a/kernel/Kbuild
+++ b/kernel/Kbuild
@@ -1,5 +1,5 @@
 EXTRA_CFLAGS := -I$(src)/include -include $(src)/external-module-compat.h
 obj-m := kvm.o kvm-intel.o kvm-amd.o
-kvm-objs := kvm_main.o mmu.o x86_emulate.o
+kvm-objs := kvm_main.o mmu.o x86_emulate.o kvm_irq.o i8259.o
 kvm-intel-objs := vmx.o vmx-debug.o
 kvm-amd-objs := svm.o
diff --git a/qemu/cpu-defs.h b/qemu/cpu-defs.h
index 0b49c89..67ba41e 100644
--- a/qemu/cpu-defs.h
+++ b/qemu/cpu-defs.h
@@ -75,6 +75,7 @@ typedef unsigned long ram_addr_t;
 #define EXCP_HLT        0x10001 /* hlt instruction reached */
 #define EXCP_DEBUG      0x10002 /* cpu stopped after a breakpoint or singlestep */
 #define EXCP_HALTED     0x10003 /* cpu is halted (waiting for external event) */
+#define EXCP_IO_WINDOW  0x10004 /* IO Window for user irq injection */
 #define MAX_BREAKPOINTS 32
 
 #define TB_JMP_CACHE_BITS 12
diff --git a/qemu/hw/i8259.c b/qemu/hw/i8259.c
index a2a8187..117340c 100644
--- a/qemu/hw/i8259.c
+++ b/qemu/hw/i8259.c
@@ -177,7 +177,12 @@ int64_t irq_time[16];
 void pic_set_irq_new(void *opaque, int irq, int level)
 {
     PicState2 *s = opaque;
+#ifdef USE_KVM
+    extern int kvm_set_irq(int irq, int level);
 
+    if (kvm_set_irq(irq, level))
+        return;
+#endif
 #if defined(DEBUG_PIC) || defined(DEBUG_IRQ_COUNT)
     if (level != irq_level[irq]) {
 #if defined(DEBUG_PIC)
diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c
index 1849997..811b320 100644
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -601,6 +601,7 @@ static int kvm_writeq(void *opaque, uint64_t addr, uint64_t data)
 
 static int kvm_io_window(void *opaque)
 {
+    env->exception_index = EXCP_IO_WINDOW;
     return 1;
 }
 
@@ -612,7 +613,8 @@ static int kvm_halt(void *opaque, int vcpu)
     env = envs[0];
     if (!((env->interrupt_request & CPU_INTERRUPT_HARD) &&
 	  (env->eflags & IF_MASK))) {
-	    env->hflags |= HF_HALTED_MASK;
+	    /* TODO: for halt emulation, temply walkaround now */
+	    /* env->hflags |= HF_HALTED_MASK; */
 	    env->exception_index = EXCP_HLT;
     }
 
@@ -923,4 +925,9 @@ int kvm_get_phys_ram_page_bitmap(unsigned char *bitmap)
  out:
     return r;
 }
+
+int kvm_set_irq(int irq, int level)
+{
+    return kvm_set_irq_level(kvm_context, irq, level);
+}
 #endif
diff --git a/user/kvmctl.c b/user/kvmctl.c
index 054547c..d610111 100644
--- a/user/kvmctl.c
+++ b/user/kvmctl.c
@@ -66,6 +66,7 @@ struct kvm_context {
 	int dirty_pages_log_all;
 	/// memory regions parameters
 	struct kvm_memory_region mem_regions[KVM_MAX_NUM_MEM_REGIONS];
+	int irqchip_in_kernel;
 };
 
 /*
@@ -288,6 +289,15 @@ int kvm_create(kvm_context_t kvm, unsigned long memory, void **vm_mem)
 	     MAP_PRIVATE|MAP_FIXED, zfd, 0);
 	close(zfd);
 
+	kvm->irqchip_in_kernel = 0;
+	r = ioctl(kvm->fd, KVM_CHECK_EXTENSION, KVM_CAP_PIC);
+	if (r > 0) {	/* kernel irqchip supported */
+		r = ioctl(fd, KVM_CREATE_PIC);
+		if (r >= 0)
+			kvm->irqchip_in_kernel = 1;
+		else
+			printf("Create kernel PIC irqchip failed\n");
+	}
 	r = kvm_create_vcpu(kvm, 0);
 	if (r < 0)
 		return r;
@@ -407,6 +417,21 @@ int kvm_get_mem_map(kvm_context_t kvm, int slot, void *buf)
 #endif /* KVM_GET_MEM_MAP */
 }
 
+int kvm_set_irq_level(kvm_context_t kvm, int irq, int level)
+{
+	struct kvm_irq_level event;
+	int r;
+
+	if (!kvm->irqchip_in_kernel)
+		return 0;
+	event.level = level;
+	event.irq = irq;
+	r = ioctl(kvm->vm_fd, KVM_IRQ_LINE, &event);
+	if (r == -1)
+		perror("kvm_set_irq_level");
+	return 1;
+}
+
 static int handle_io_abi10(kvm_context_t kvm, struct kvm_run_abi10 *run,
 			   int vcpu)
 {
@@ -944,7 +969,8 @@ int kvm_run(kvm_context_t kvm, int vcpu)
 		return kvm_run_abi10(kvm, vcpu);
 
 again:
-	run->request_interrupt_window = try_push_interrupts(kvm);
+	if (!kvm->irqchip_in_kernel)
+		run->request_interrupt_window = try_push_interrupts(kvm);
 	pre_kvm_run(kvm, vcpu);
 	r = ioctl(fd, KVM_RUN, 0);
 	post_kvm_run(kvm, vcpu);
@@ -1022,6 +1048,25 @@ int kvm_guest_debug(kvm_context_t kvm, int vcpu, struct kvm_debug_guest *dbg)
 	return ioctl(kvm->vcpu_fd[vcpu], KVM_DEBUG_GUEST, dbg);
 }
 
+static void cpuid_remove_apic(struct kvm_cpuid *cpuid)
+{
+	int i;
+	struct kvm_cpuid_entry *e, *entry;
+
+	entry = NULL;
+	for (i = 0; i < cpuid->nent; ++i) {
+		e = &cpuid->entries[i];
+		if (e->function == 1) {
+			entry = e;
+			break;
+		}
+	}
+	if (entry) {
+		entry->edx &= ~(1 << 9);
+		printf("Guest APIC capibility removed\n");
+	}
+}
+
 int kvm_setup_cpuid(kvm_context_t kvm, int vcpu, int nent,
 		    struct kvm_cpuid_entry *entries)
 {
@@ -1034,6 +1079,8 @@ int kvm_setup_cpuid(kvm_context_t kvm, int vcpu, int nent,
 
 	cpuid->nent = nent;
 	memcpy(cpuid->entries, entries, nent * sizeof(*entries));
+	/* temply walkaround before merge of in-kernel APIC */
+	cpuid_remove_apic(cpuid);
 	r = ioctl(kvm->vcpu_fd[vcpu], KVM_SET_CPUID, cpuid);
 
 	free(cpuid);
diff --git a/user/kvmctl.h b/user/kvmctl.h
index 956e194..68e1664 100644
--- a/user/kvmctl.h
+++ b/user/kvmctl.h
@@ -395,6 +395,7 @@ int kvm_destroy_memory_alias(kvm_context_t, int slot);
  * \param bitmap Long aligned address of a big enough bitmap (one bit per page)
  */
 int kvm_get_mem_map(kvm_context_t kvm, int slot, void *bitmap);
+int kvm_set_irq_level(kvm_context_t kvm, int irq, int level);
 
 /*!
  * \brief Enable dirty-pages-logging for all memory regions

[-- Attachment #3: svm2.patch --]
[-- Type: application/octet-stream, Size: 3690 bytes --]


Enable SVM for in kernel PIC with a new irq injection path.

Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>

diff --git a/drivers/kvm/svm.c b/drivers/kvm/svm.c
index a0d4428..e3f7d03 100644
--- a/drivers/kvm/svm.c
+++ b/drivers/kvm/svm.c
@@ -23,6 +23,7 @@
 #include <asm/desc.h>
 
 #include "kvm_svm.h"
+#include "kvm_irq.h"
 #include "x86_emulate.h"
 
 MODULE_AUTHOR("Qumranet");
@@ -904,7 +905,8 @@ static int pf_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 	enum emulation_result er;
 	int r;
 
-	if (is_external_interrupt(exit_int_info))
+	if (!irqchip_in_kernel(vcpu->kvm) &&
+		is_external_interrupt(exit_int_info))
 		push_irq(vcpu, exit_int_info & SVM_EVTINJ_VEC_MASK);
 
 	spin_lock(&vcpu->kvm->lock);
@@ -1276,6 +1278,8 @@ static int msr_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 static int interrupt_window_interception(struct kvm_vcpu *vcpu,
 				   struct kvm_run *kvm_run)
 {
+	vcpu->svm->vmcb->control.intercept &= ~(1ULL << INTERCEPT_VINTR);
+	vcpu->svm->vmcb->control.int_ctl &= ~V_IRQ_MASK;
 	/*
 	 * If the user space waits to inject interrupts, exit as soon as
 	 * possible
@@ -1380,22 +1384,55 @@ static void pre_svm_run(struct kvm_vcpu *vcpu)
 }
 
 
-static inline void kvm_do_inject_irq(struct kvm_vcpu *vcpu)
+static inline void svm_inject_irq(struct kvm_vcpu *vcpu, int irq)
 {
 	struct vmcb_control_area *control;
 
 	control = &vcpu->svm->vmcb->control;
-	control->int_vector = pop_irq(vcpu);
+	control->int_vector = irq;
 	control->int_ctl &= ~V_INTR_PRIO_MASK;
 	control->int_ctl |= V_IRQ_MASK |
 		((/*control->int_vector >> 4*/ 0xf) << V_INTR_PRIO_SHIFT);
 }
 
+static void svm_intr_assist(struct kvm_vcpu *vcpu)
+{
+	struct vmcb *vmcb = vcpu->svm->vmcb;
+	int intr_vector = -1;
+
+	if ((vmcb->control.exit_int_info & SVM_EVTINJ_VALID) &&
+	    ((vmcb->control.exit_int_info & SVM_EVTINJ_TYPE_MASK) == 0)) {
+		intr_vector = vmcb->control.exit_int_info &
+			      SVM_EVTINJ_VEC_MASK;
+		vmcb->control.exit_int_info = 0;
+		svm_inject_irq(vcpu, intr_vector);
+		return;
+	}
+
+	if (vmcb->control.int_ctl & V_IRQ_MASK)
+		return;
+
+	if (!kvm_cpu_has_interrupt(vcpu))
+		return;
+
+	if (!(vmcb->save.rflags & X86_EFLAGS_IF) ||
+	    (vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK) ||
+	    (vmcb->control.event_inj & SVM_EVTINJ_VALID)) {
+		/* unable to deliver irq, set pending irq */
+		vmcb->control.intercept |= (1ULL << INTERCEPT_VINTR);
+		svm_inject_irq(vcpu, 0x0);
+		return;
+	}
+	/* Okay, we can deliver the interrupt: grab it and update PIC state. */
+	intr_vector = kvm_cpu_get_interrupt(vcpu);
+	svm_inject_irq(vcpu, intr_vector);
+}
+
 static void kvm_reput_irq(struct kvm_vcpu *vcpu)
 {
 	struct vmcb_control_area *control = &vcpu->svm->vmcb->control;
 
-	if (control->int_ctl & V_IRQ_MASK) {
+	if ((control->int_ctl & V_IRQ_MASK) && !irqchip_in_kernel(vcpu->kvm)) {
 		control->int_ctl &= ~V_IRQ_MASK;
 		push_irq(vcpu, control->int_vector);
 	}
@@ -1417,7 +1454,7 @@ static void do_interrupt_requests(struct kvm_vcpu *vcpu,
 		/*
 		 * If interrupts enabled, and not blocked by sti or mov ss. Good.
 		 */
-		kvm_do_inject_irq(vcpu);
+		svm_inject_irq(vcpu, pop_irq(vcpu));
 
 	/*
 	 * Interrupts blocked.  Wait for unblock.
@@ -1487,11 +1524,13 @@ again:
 	if (unlikely(r))
 		return r;
 
-	if (!vcpu->mmio_read_completed)
-		do_interrupt_requests(vcpu, kvm_run);
-
 	clgi();
 
+	if (irqchip_in_kernel(vcpu->kvm))
+		svm_intr_assist(vcpu);
+	else if (!vcpu->mmio_read_completed)
+		do_interrupt_requests(vcpu, kvm_run);
+
 	vcpu->guest_mode = 1;
 	if (vcpu->requests)
 		if (test_and_clear_bit(KVM_TLB_FLUSH, &vcpu->requests))
diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl

[-- Attachment #4: pic-kernel-v25.patch.rebase --]
[-- Type: application/octet-stream, Size: 22763 bytes --]

In kernel PIC support.

Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>

commit d62bc09142edab756538a82f6152cf71ee362ead
Author: root <root@vt32-pae.(none)>
Date:   Mon Jun 25 14:44:59 2007 +0800

    In kernel PIC support

diff --git a/drivers/kvm/Makefile b/drivers/kvm/Makefile
index c0a789f..48a3076 100644
--- a/drivers/kvm/Makefile
+++ b/drivers/kvm/Makefile
@@ -2,7 +2,7 @@
 # Makefile for Kernel-based Virtual Machine module
 #
 
-kvm-objs := kvm_main.o mmu.o x86_emulate.o
+kvm-objs := kvm_main.o mmu.o x86_emulate.o i8259.o kvm_irq.o
 obj-$(CONFIG_KVM) += kvm.o
 kvm-intel-objs = vmx.o
 obj-$(CONFIG_KVM_INTEL) += kvm-intel.o
diff --git a/drivers/kvm/i8259.c b/drivers/kvm/i8259.c
new file mode 100644
index 0000000..8bb94e3
--- /dev/null
+++ b/drivers/kvm/i8259.c
@@ -0,0 +1,449 @@
+/*
+ * QEMU 8259 interrupt controller emulation
+ *
+ * Copyright (c) 2003-2004 Fabrice Bellard
+ * Copyright (c) 2007 Intel Corperation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ * Authors:
+ *   Yaozu (Eddie) Dong <Eddie.dong@intel.com>
+ *   Port from Qemu.
+ */
+#include <linux/mm.h>
+#include "kvm_irq.h"
+
+/*
+ * set irq level. If an edge is detected, then the IRR is set to 1
+ */
+static inline void pic_set_irq1(struct kvm_pic_state *s, int irq, int level)
+{
+	int mask;
+	mask = 1 << irq;
+	if (s->elcr & mask) {	/* level triggered */
+		if (level) {
+			s->irr |= mask;
+			s->last_irr |= mask;
+		} else {
+			s->irr &= ~mask;
+			s->last_irr &= ~mask;
+		}
+	} else {	/* edge triggered */
+		if (level) {
+			if ((s->last_irr & mask) == 0)
+				s->irr |= mask;
+			s->last_irr |= mask;
+		} else {
+			s->last_irr &= ~mask;
+		}
+	}
+}
+
+/*
+ * return the highest priority found in mask (highest = smallest
+ * number). Return 8 if no irq
+ */
+static inline int get_priority(struct kvm_pic_state *s, int mask)
+{
+	int priority;
+	if (mask == 0)
+		return 8;
+	priority = 0;
+	while ((mask & (1 << ((priority + s->priority_add) & 7))) == 0)
+		priority++;
+	return priority;
+}
+
+/*
+ * return the pic wanted interrupt. return -1 if none
+ */
+static int pic_get_irq(struct kvm_pic_state *s)
+{
+	int mask, cur_priority, priority;
+
+	mask = s->irr & ~s->imr;
+	priority = get_priority(s, mask);
+	if (priority == 8)
+		return -1;
+	/*
+	 * compute current priority. If special fully nested mode on the
+	 * master, the IRQ coming from the slave is not taken into account
+	 * for the priority computation.
+	 */
+	mask = s->isr;
+	if (s->special_fully_nested_mode && s == &s->pics_state->pics[0])
+		mask &= ~(1 << 2);
+	cur_priority = get_priority(s, mask);
+	if (priority < cur_priority) {
+		/*
+		 * higher priority found: an irq should be generated
+		 */
+		return (priority + s->priority_add) & 7;
+	} else {
+		return -1;
+	}
+}
+
+/*
+ * raise irq to CPU if necessary. must be called every time the active
+ * irq may change
+ */
+static void pic_update_irq(struct kvm_pic *s)
+{
+	int irq2, irq;
+
+	irq2 = pic_get_irq(&s->pics[1]);
+	if (irq2 >= 0) {
+		/*
+		 * if irq request by slave pic, signal master PIC
+		 */
+		pic_set_irq1(&s->pics[0], 2, 1);
+		pic_set_irq1(&s->pics[0], 2, 0);
+	}
+	irq = pic_get_irq(&s->pics[0]);
+	if (irq >= 0) {
+		s->irq_request(s->irq_request_opaque, 1);
+	}
+}
+
+void kvm_pic_set_irq_new(void *opaque, int irq, int level)
+{
+	struct kvm_pic *s = opaque;
+
+	pic_set_irq1(&s->pics[irq >> 3], irq & 7, level);
+	pic_update_irq(s);
+}
+
+/*
+ * acknowledge interrupt 'irq'
+ */
+static inline void pic_intack(struct kvm_pic_state *s, int irq)
+{
+	if (s->auto_eoi) {
+		if (s->rotate_on_auto_eoi)
+			s->priority_add = (irq + 1) & 7;
+	} else {
+		s->isr |= (1 << irq);
+	}
+	/*
+	 * We don't clear a level sensitive interrupt here
+	 */
+	if (!(s->elcr & (1 << irq)))
+		s->irr &= ~(1 << irq);
+}
+
+int kvm_pic_read_irq(struct kvm_pic *s)
+{
+	int irq, irq2, intno;
+
+	irq = pic_get_irq(&s->pics[0]);
+	if (irq >= 0) {
+		pic_intack(&s->pics[0], irq);
+		if (irq == 2) {
+			irq2 = pic_get_irq(&s->pics[1]);
+			if (irq2 >= 0) {
+				pic_intack(&s->pics[1], irq2);
+			} else {
+				/*
+				 * spurious IRQ on slave controller
+				 */
+				irq2 = 7;
+			}
+			intno = s->pics[1].irq_base + irq2;
+			irq = irq2 + 8;
+		} else {
+			intno = s->pics[0].irq_base + irq;
+		}
+	} else {
+		/*
+		 * spurious IRQ on host controller
+		 */
+		irq = 7;
+		intno = s->pics[0].irq_base + irq;
+	}
+	pic_update_irq(s);
+
+	return intno;
+}
+
+static void pic_reset(void *opaque)
+{
+	struct kvm_pic_state *s = opaque;
+
+	s->last_irr = 0;
+	s->irr = 0;
+	s->imr = 0;
+	s->isr = 0;
+	s->priority_add = 0;
+	s->irq_base = 0;
+	s->read_reg_select = 0;
+	s->poll = 0;
+	s->special_mask = 0;
+	s->init_state = 0;
+	s->auto_eoi = 0;
+	s->rotate_on_auto_eoi = 0;
+	s->special_fully_nested_mode = 0;
+	s->init4 = 0;
+}
+
+static void pic_ioport_write(void *opaque, uint32_t addr, uint32_t val)
+{
+	struct kvm_pic_state *s = opaque;
+	int priority, cmd, irq;
+
+	addr &= 1;
+	if (addr == 0) {
+		if (val & 0x10) {
+			pic_reset(s);	/* init */
+			/*
+			 * deassert a pending interrupt
+			 */
+			s->pics_state->irq_request(s->pics_state->
+						   irq_request_opaque, 0);
+			s->init_state = 1;
+			s->init4 = val & 1;
+			if (val & 0x02)
+				printk(KERN_ERR "single mode not supported");
+			if (val & 0x08)
+				printk(KERN_ERR
+				       "level sensitive irq not supported");
+		} else if (val & 0x08) {
+			if (val & 0x04)
+				s->poll = 1;
+			if (val & 0x02)
+				s->read_reg_select = val & 1;
+			if (val & 0x40)
+				s->special_mask = (val >> 5) & 1;
+		} else {
+			cmd = val >> 5;
+			switch (cmd) {
+			case 0:
+			case 4:
+				s->rotate_on_auto_eoi = cmd >> 2;
+				break;
+			case 1:	/* end of interrupt */
+			case 5:
+				priority = get_priority(s, s->isr);
+				if (priority != 8) {
+					irq = (priority + s->priority_add) & 7;
+					s->isr &= ~(1 << irq);
+					if (cmd == 5)
+						s->priority_add = (irq + 1) & 7;
+					pic_update_irq(s->pics_state);
+				}
+				break;
+			case 3:
+				irq = val & 7;
+				s->isr &= ~(1 << irq);
+				pic_update_irq(s->pics_state);
+				break;
+			case 6:
+				s->priority_add = (val + 1) & 7;
+				pic_update_irq(s->pics_state);
+				break;
+			case 7:
+				irq = val & 7;
+				s->isr &= ~(1 << irq);
+				s->priority_add = (irq + 1) & 7;
+				pic_update_irq(s->pics_state);
+				break;
+			default:
+				break;	/* no operation */
+			}
+		}
+	} else {
+		switch (s->init_state) {
+		case 0:		/* normal mode */
+			s->imr = val;
+			pic_update_irq(s->pics_state);
+			break;
+		case 1:
+			s->irq_base = val & 0xf8;
+			s->init_state = 2;
+			break;
+		case 2:
+			if (s->init4) {
+				s->init_state = 3;
+			} else {
+				s->init_state = 0;
+			}
+			break;
+		case 3:
+			s->special_fully_nested_mode = (val >> 4) & 1;
+			s->auto_eoi = (val >> 1) & 1;
+			s->init_state = 0;
+			break;
+		}
+	}
+}
+
+static uint32_t pic_poll_read(struct kvm_pic_state *s, uint32_t addr1)
+{
+	int ret;
+
+	ret = pic_get_irq(s);
+	if (ret >= 0) {
+		if (addr1 >> 7) {
+			s->pics_state->pics[0].isr &= ~(1 << 2);
+			s->pics_state->pics[0].irr &= ~(1 << 2);
+		}
+		s->irr &= ~(1 << ret);
+		s->isr &= ~(1 << ret);
+		if (addr1 >> 7 || ret != 2)
+			pic_update_irq(s->pics_state);
+	} else {
+		ret = 0x07;
+		pic_update_irq(s->pics_state);
+	}
+
+	return ret;
+}
+
+static uint32_t pic_ioport_read(void *opaque, uint32_t addr1)
+{
+	struct kvm_pic_state *s = opaque;
+	unsigned int addr;
+	int ret;
+
+	addr = addr1;
+	addr &= 1;
+	if (s->poll) {
+		ret = pic_poll_read(s, addr1);
+		s->poll = 0;
+	} else {
+		if (addr == 0) {
+			if (s->read_reg_select)
+				ret = s->isr;
+			else
+				ret = s->irr;
+		} else {
+			ret = s->imr;
+		}
+	}
+	return ret;
+}
+
+static void elcr_ioport_write(void *opaque, uint32_t addr, uint32_t val)
+{
+	struct kvm_pic_state *s = opaque;
+	s->elcr = val & s->elcr_mask;
+}
+
+static uint32_t elcr_ioport_read(void *opaque, uint32_t addr1)
+{
+	struct kvm_pic_state *s = opaque;
+	return s->elcr;
+}
+
+static int picdev_in_range(struct kvm_io_device *this, gpa_t addr)
+{
+	switch (addr) {
+	case 0x20:
+	case 0x21:
+	case 0xa0:
+	case 0xa1:
+	case 0x4d0:
+	case 0x4d1:
+		return 1;
+	default:
+		return 0;
+	}
+}
+
+static void picdev_write(struct kvm_io_device *this,
+			 gpa_t addr, int len, const void *val)
+{
+	struct kvm_pic *s = this->private;
+	unsigned char data = *(unsigned char *)val;
+
+	if (len != 1) {
+		printk(KERN_ERR "PIC: non byte write\n");
+		return;
+	}
+	switch (addr) {
+	case 0x20:
+	case 0x21:
+	case 0xa0:
+	case 0xa1:
+		pic_ioport_write(&s->pics[addr >> 7], addr, data);
+		break;
+	case 0x4d0:
+	case 0x4d1:
+		elcr_ioport_write(&s->pics[addr & 1], addr, data);
+		break;
+	}
+}
+
+static void picdev_read(struct kvm_io_device *this,
+			gpa_t addr, int len, void *val)
+{
+	struct kvm_pic *s = this->private;
+	unsigned char data = 0;
+
+	if (len != 1) {
+		printk(KERN_ERR "PIC: non byte read\n");
+		return;
+	}
+	switch (addr) {
+	case 0x20:
+	case 0x21:
+	case 0xa0:
+	case 0xa1:
+		data = pic_ioport_read(&s->pics[addr >> 7], addr);
+		break;
+	case 0x4d0:
+	case 0x4d1:
+		data = elcr_ioport_read(&s->pics[addr & 1], addr);
+		break;
+	}
+	*(unsigned char *)val = data;
+}
+
+/*
+ * callback when PIC0 irq status changed
+ */
+static void pic_irq_request(void *opaque, int level)
+{
+	struct kvm *kvm = opaque;
+
+	pic_irqchip(kvm)->output = level;
+}
+
+struct kvm_pic *kvm_create_pic(struct kvm *kvm)
+{
+	struct kvm_pic *s;
+	s = kzalloc(sizeof(struct kvm_pic), GFP_KERNEL);
+	if (!s)
+		return NULL;
+	s->pics[0].elcr_mask = 0xf8;
+	s->pics[1].elcr_mask = 0xde;
+	s->irq_request = pic_irq_request;
+	s->irq_request_opaque = kvm;
+	s->pics[0].pics_state = s;
+	s->pics[1].pics_state = s;
+
+	/*
+	 * Initialize PIO device
+	 */
+	s->dev.read = picdev_read;
+	s->dev.write = picdev_write;
+	s->dev.in_range = picdev_in_range;
+	s->dev.private = s;
+	kvm_io_bus_register_dev(&kvm->pio_bus, &s->dev);
+	return s;
+}
diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
index a7c5e6b..f1a6773 100644
--- a/drivers/kvm/kvm.h
+++ b/drivers/kvm/kvm.h
@@ -435,6 +435,7 @@ struct kvm_memory_slot {
 	unsigned long *dirty_bitmap;
 };
 
+struct kvm_pic;
 struct kvm {
 	spinlock_t lock; /* protects everything except vcpus */
 	int naliases;
@@ -456,8 +457,19 @@ struct kvm {
 	struct file *filp;
 	struct kvm_io_bus mmio_bus;
 	struct kvm_io_bus pio_bus;
+	struct kvm_pic *vpic;
 };
 
+static inline struct kvm_pic *pic_irqchip(struct kvm *kvm)
+{
+	return kvm->vpic;
+}
+
+static inline int irqchip_in_kernel(struct kvm *kvm)
+{
+	return pic_irqchip(kvm) != 0;
+}
+
 struct descriptor_table {
 	u16 limit;
 	unsigned long base;
diff --git a/drivers/kvm/kvm_irq.c b/drivers/kvm/kvm_irq.c
new file mode 100644
index 0000000..38b6a14
--- /dev/null
+++ b/drivers/kvm/kvm_irq.c
@@ -0,0 +1,59 @@
+/*
+ * kvm_irq.c: API for in kernel interrupt controller
+ * Copyright (c) 2007, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ * Authors:
+ *   Yaozu (Eddie) Dong <Eddie.dong@intel.com>
+ *
+ */
+
+#include "kvm.h"
+#include "kvm_irq.h"
+
+/*
+ * check if there is pending interrupt without
+ * intack.
+ */
+int kvm_cpu_has_interrupt(struct kvm_vcpu *v)
+{
+	struct kvm_pic *s = pic_irqchip(v->kvm);
+
+	if (s->output)	/* PIC */
+		return 1;
+	/*
+	 * TODO: APIC
+	 */
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_cpu_has_interrupt);
+
+/*
+ * Read pending interrupt vector and intack.
+ */
+int kvm_cpu_get_interrupt(struct kvm_vcpu *v)
+{
+	struct kvm_pic *s = pic_irqchip(v->kvm);
+	int vector;
+
+	s->output = 0;
+	vector = kvm_pic_read_irq(s);
+	if (vector != -1)
+		return vector;
+	/*
+	 * TODO: APIC
+	 */
+	return -1;
+}
+EXPORT_SYMBOL_GPL(kvm_cpu_get_interrupt);
diff --git a/drivers/kvm/kvm_irq.h b/drivers/kvm/kvm_irq.h
new file mode 100644
index 0000000..f4d4213
--- /dev/null
+++ b/drivers/kvm/kvm_irq.h
@@ -0,0 +1,64 @@
+/*
+ * kvm_irq.h: in kernel interrupt controller related definitions
+ * Copyright (c) 2007, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ * Authors:
+ *   Yaozu (Eddie) Dong <Eddie.dong@intel.com>
+ *
+ */
+
+#ifndef __VIRQ_H
+#define __VIRQ_H
+
+#include "kvm.h"
+
+typedef void irq_request_func(void *opaque, int level);
+
+struct kvm_pic;
+struct kvm_pic_state {
+	u8 last_irr;	/* edge detection */
+	u8 irr;		/* interrupt request register */
+	u8 imr;		/* interrupt mask register */
+	u8 isr;		/* interrupt service register */
+	u8 priority_add;	/* highest irq priority */
+	u8 irq_base;
+	u8 read_reg_select;
+	u8 poll;
+	u8 special_mask;
+	u8 init_state;
+	u8 auto_eoi;
+	u8 rotate_on_auto_eoi;
+	u8 special_fully_nested_mode;
+	u8 init4;		/* true if 4 byte init */
+	u8 elcr;		/* PIIX edge/trigger selection */
+	u8 elcr_mask;
+	struct kvm_pic *pics_state;
+};
+
+struct kvm_pic {
+	struct kvm_pic_state pics[2]; /* 0 is master pic, 1 is slave pic */
+	irq_request_func *irq_request;
+	void *irq_request_opaque;
+	int output;		/* intr from master PIC */
+	struct kvm_io_device dev;
+};
+
+struct kvm_pic *kvm_create_pic(struct kvm *kvm);
+void kvm_pic_set_irq_new(void *opaque, int irq, int level);
+int kvm_pic_read_irq(struct kvm_pic *s);
+int kvm_cpu_get_interrupt(struct kvm_vcpu *v);
+int kvm_cpu_has_interrupt(struct kvm_vcpu *v);
+
+#endif
diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index 5603000..d7e5ca8 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -46,6 +46,7 @@
 
 #include "x86_emulate.h"
 #include "segment_descriptor.h"
+#include "kvm_irq.h"
 
 MODULE_AUTHOR("Qumranet");
 MODULE_LICENSE("GPL");
@@ -476,6 +477,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	spin_lock(&kvm_lock);
 	list_del(&kvm->vm_list);
 	spin_unlock(&kvm_lock);
+	kfree(kvm->vpic);
 	kvm_io_bus_destroy(&kvm->pio_bus);
 	kvm_io_bus_destroy(&kvm->mmio_bus);
 	kvm_free_vcpus(kvm);
@@ -1374,7 +1376,8 @@ EXPORT_SYMBOL_GPL(emulate_instruction);
 
 int kvm_emulate_halt(struct kvm_vcpu *vcpu)
 {
-	if (vcpu->irq_summary)
+	if (vcpu->irq_summary ||
+		(irqchip_in_kernel(vcpu->kvm) && kvm_cpu_has_interrupt(vcpu)))
 		return 1;
 
 	vcpu->run->exit_reason = KVM_EXIT_HLT;
@@ -2857,6 +2860,27 @@ static long kvm_vm_ioctl(struct file *filp,
 			goto out;
 		break;
 	}
+	case KVM_CREATE_PIC:
+		r = -ENOMEM;
+		kvm->vpic = kvm_create_pic(kvm);
+		if (kvm->vpic) {
+			r = 0;
+		}
+		break;
+	case KVM_IRQ_LINE: {
+		struct kvm_irq_level irq_event;
+
+		r = -EFAULT;
+		if (copy_from_user(&irq_event, argp, sizeof irq_event))
+			goto out;
+		if (irqchip_in_kernel(kvm)) {
+			kvm_pic_set_irq_new(pic_irqchip(kvm),
+					irq_event.irq,
+					irq_event.level);
+			r = 0;
+		}
+		break;
+	}
 	default:
 		;
 	}
@@ -2988,12 +3012,19 @@ static long kvm_dev_ioctl(struct file *filp,
 		r = 0;
 		break;
 	}
-	case KVM_CHECK_EXTENSION:
-		/*
-		 * No extensions defined at present.
-		 */
-		r = 0;
+	case KVM_CHECK_EXTENSION: {
+		int ext = (long)argp;
+
+		switch (ext) {
+		case KVM_CAP_PIC:
+			r = 1;
+			break;
+		default:
+			r = 0;
+			break;
+		}
 		break;
+	}
 	case KVM_GET_VCPU_MMAP_SIZE:
 		r = -EINVAL;
 		if (arg)
diff --git a/drivers/kvm/vmx.c b/drivers/kvm/vmx.c
index 7d04ffa..2d42c73 100644
--- a/drivers/kvm/vmx.c
+++ b/drivers/kvm/vmx.c
@@ -16,6 +16,7 @@
  */
 
 #include "kvm.h"
+#include "kvm_irq.h"
 #include "vmx.h"
 #include <linux/module.h>
 #include <linux/kernel.h>
@@ -1468,6 +1469,16 @@ static void inject_rmode_irq(struct kvm_vcpu *vcpu, int irq)
 	vmcs_writel(GUEST_RSP, (vmcs_readl(GUEST_RSP) & ~0xffff) | (sp - 6));
 }
 
+static void vmx_inject_irq(struct kvm_vcpu *vcpu, int irq)
+{
+	if (vcpu->rmode.active) {
+		inject_rmode_irq(vcpu, irq);
+		return;
+	}
+	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
+			irq | INTR_TYPE_EXT_INTR | INTR_INFO_VALID_MASK);
+}
+
 static void kvm_do_inject_irq(struct kvm_vcpu *vcpu)
 {
 	int word_index = __ffs(vcpu->irq_summary);
@@ -1477,13 +1488,7 @@ static void kvm_do_inject_irq(struct kvm_vcpu *vcpu)
 	clear_bit(bit_index, &vcpu->irq_pending[word_index]);
 	if (!vcpu->irq_pending[word_index])
 		clear_bit(word_index, &vcpu->irq_summary);
-
-	if (vcpu->rmode.active) {
-		inject_rmode_irq(vcpu, irq);
-		return;
-	}
-	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
-			irq | INTR_TYPE_EXT_INTR | INTR_INFO_VALID_MASK);
+	vmx_inject_irq(vcpu, irq);
 }
 
 
@@ -1567,7 +1572,7 @@ static int handle_exception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 		       "intr info 0x%x\n", __FUNCTION__, vect_info, intr_info);
 	}
 
-	if (is_external_interrupt(vect_info)) {
+	if (!irqchip_in_kernel(vcpu->kvm) && is_external_interrupt(vect_info)) {
 		int irq = vect_info & VECTORING_INFO_VECTOR_MASK;
 		set_bit(irq, vcpu->irq_pending);
 		set_bit(irq / BITS_PER_LONG, &vcpu->irq_summary);
@@ -1898,6 +1903,12 @@ static void post_kvm_run_save(struct kvm_vcpu *vcpu,
 static int handle_interrupt_window(struct kvm_vcpu *vcpu,
 				   struct kvm_run *kvm_run)
 {
+	u32 cpu_based_vm_exec_control;
+
+	/* clear pending irq */
+	cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
+	cpu_based_vm_exec_control &= ~CPU_BASED_VIRTUAL_INTR_PENDING;
+	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control);
 	/*
 	 * If the user space waits to inject interrupts, exit as soon as
 	 * possible
@@ -1989,6 +2000,55 @@ static void vmx_flush_tlb(struct kvm_vcpu *vcpu)
 {
 }
 
+static void enable_irq_window(struct kvm_vcpu *vcpu)
+{
+	u32 cpu_based_vm_exec_control;
+
+	cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
+	cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
+	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control);
+}
+
+static void vmx_intr_assist(struct kvm_vcpu *vcpu)
+{
+	u32 idtv_info_field, intr_info_field;
+	int has_ext_irq, interrupt_window_open;
+
+	has_ext_irq = kvm_cpu_has_interrupt(vcpu);
+	intr_info_field = vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
+	idtv_info_field = vmcs_read32(IDT_VECTORING_INFO_FIELD);
+	if (intr_info_field & INTR_INFO_VALID_MASK) {
+		if (idtv_info_field & INTR_INFO_VALID_MASK) {
+			/* TODO: fault when IDT_Vectoring */
+			printk(KERN_ERR "Fault when IDT_Vectoring\n");
+		}
+		if (has_ext_irq)
+			enable_irq_window(vcpu);
+		return;
+	}
+	if (unlikely(idtv_info_field & INTR_INFO_VALID_MASK)) {
+		vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, idtv_info_field);
+		vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
+				vmcs_read32(VM_EXIT_INSTRUCTION_LEN));
+
+		if (unlikely(idtv_info_field & INTR_INFO_DELIEVER_CODE_MASK))
+			vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
+				vmcs_read32(IDT_VECTORING_ERROR_CODE));
+		if (unlikely(has_ext_irq))
+			enable_irq_window(vcpu);
+		return;
+	}
+	if (!has_ext_irq)
+		return;
+	interrupt_window_open =
+		((vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
+		 (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & 3) == 0);
+	if (interrupt_window_open)
+		vmx_inject_irq(vcpu, kvm_cpu_get_interrupt(vcpu));
+	else
+		enable_irq_window(vcpu);
+}
+
 static int vmx_vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 {
 	u8 fail;
@@ -1999,9 +2059,6 @@ preempted:
 		kvm_guest_debug_pre(vcpu);
 
 again:
-	if (!vcpu->mmio_read_completed)
-		do_interrupt_requests(vcpu, kvm_run);
-
 	vmx_save_host_state(vcpu);
 	kvm_load_guest_fpu(vcpu);
 
@@ -2015,6 +2072,10 @@ again:
 	vmcs_writel(HOST_CR0, read_cr0());
 
 	local_irq_disable();
+	if (irqchip_in_kernel(vcpu->kvm))
+		vmx_intr_assist(vcpu);
+	else if (!vcpu->mmio_read_completed)
+		do_interrupt_requests(vcpu, kvm_run);
 
 	vcpu->guest_mode = 1;
 	if (vcpu->requests)
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index e6edca8..1738f92 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -41,6 +41,12 @@ struct kvm_memory_alias {
 	__u64 target_phys_addr;
 };
 
+/* for KVM_SET_IRQ_LEVEL */
+struct kvm_irq_level {
+	__u32 irq;
+	__u32 level;
+};
+
 enum kvm_exit_reason {
 	KVM_EXIT_UNKNOWN          = 0,
 	KVM_EXIT_EXCEPTION        = 1,
@@ -272,6 +278,11 @@ struct kvm_signal_mask {
 #define KVM_GET_VCPU_MMAP_SIZE    _IO(KVMIO,   0x04) /* in bytes */
 
 /*
+ * Extension capability list.
+ */
+#define KVM_CAP_PIC	  0
+
+/*
  * ioctls for VM fds
  */
 #define KVM_SET_MEMORY_REGION     _IOW(KVMIO, 0x40, struct kvm_memory_region)
@@ -282,6 +293,9 @@ struct kvm_signal_mask {
 #define KVM_CREATE_VCPU           _IO(KVMIO,  0x41)
 #define KVM_GET_DIRTY_LOG         _IOW(KVMIO, 0x42, struct kvm_dirty_log)
 #define KVM_SET_MEMORY_ALIAS      _IOW(KVMIO, 0x43, struct kvm_memory_alias)
+/* Device model IOC */
+#define KVM_CREATE_PIC		  _IO(KVMIO,  0x60)
+#define KVM_IRQ_LINE		  _IO(KVMIO,  0x61)
 
 /*
  * ioctls for vcpu fds

[-- Attachment #5: Type: text/plain, Size: 286 bytes --]

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

[-- Attachment #6: Type: text/plain, Size: 186 bytes --]

_______________________________________________
kvm-devel mailing list
kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org
https://lists.sourceforge.net/lists/listinfo/kvm-devel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

[parent not found: <10EA09EFD8728347A513008B6B0DA77A01AE9255-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>]

* Re: In kernel PIC support: kernel patch
       [not found]         ` <10EA09EFD8728347A513008B6B0DA77A01AE9255-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2007-07-02 23:53           ` Avi Kivity
       [not found]             ` <46898FE6.7040709-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 57+ messages in thread
From: Avi Kivity @ 2007-07-02 23:53 UTC (permalink / raw)
  To: Dong, Eddie; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Dong, Eddie wrote:
> Here is the updated patch with all comments fixed.
>
>
> Meanwhile I enabled SVM support by copying code from Xen and  managed to
> find an AMD box to have a shoot, it works!
> Please also find the svm patch.
>
>
>
>
>  drivers/kvm/kvm_irq.c  |   59 ++++++
>  drivers/kvm/kvm_irq.h  |   64 ++++++
>   

These can be named as irq.c and irq.h.  No need for the kvm prefix.

> +void kvm_pic_set_irq_new(void *opaque, int irq, int level)
>   

Remove _new from the name.  It's just an artifact of qemu's development 
history; we don't need it.

> +
> +int kvm_pic_read_irq(struct kvm_pic *s)
> +{
> +	int irq, irq2, intno;
> +
> +	irq = pic_get_irq(&s->pics[0]);
> +	if (irq >= 0) {
> +		pic_intack(&s->pics[0], irq);
> +		if (irq == 2) {
> +			irq2 = pic_get_irq(&s->pics[1]);
> +			if (irq2 >= 0) {
> +				pic_intack(&s->pics[1], irq2);
> +			} else {
> +				/*
> +				 * spurious IRQ on slave controller
> +				 */
> +				irq2 = 7;
> +			}
>   

A few braces can be removed here.


> +
> +static uint32_t pic_poll_read(struct kvm_pic_state *s, uint32_t addr1)
>   

u32

> +static uint32_t pic_ioport_read(void *opaque, uint32_t addr1)
>   

here too, and a few other places.

> +
> +static void picdev_write(struct kvm_io_device *this,
> +			 gpa_t addr, int len, const void *val)
> +{
> +	struct kvm_pic *s = this->private;
> +	unsigned char data = *(unsigned char *)val;
> +
> +	if (len != 1) {
> +		printk(KERN_ERR "PIC: non byte write\n");
> +		return;
> +	}
>   

This can be used to spam the host kernel log, if the guest issues lots 
of nonbyte accesses.  You can use printk_ratelimit() or something 
similar.  This goes for the read function too.

>  
> +struct kvm_pic;
>  struct kvm {
>  	spinlock_t lock; /* protects everything except vcpus */
>  	int naliases;
> @@ -456,8 +457,19 @@ struct kvm {
>  	struct file *filp;
>  	struct kvm_io_bus mmio_bus;
>  	struct kvm_io_bus pio_bus;
> +	struct kvm_pic *vpic;
>  };
>   

Why not include it inline and avoid the pointer dereference?

>  
> +static inline struct kvm_pic *pic_irqchip(struct kvm *kvm)
> +{
> +	return kvm->vpic;
> +}
> +
> +static inline int irqchip_in_kernel(struct kvm *kvm)
> +{
> +	return pic_irqchip(kvm) != 0;
> +}
>   

That's why...

>  	}
> +	case KVM_CREATE_PIC:
> +		r = -ENOMEM;
> +		kvm->vpic = kvm_create_pic(kvm);
> +		if (kvm->vpic) {
> +			r = 0;
> +		}
>   

braces.  also, the usual logic is to 'goto out' if an error occured.

>  };
>  
> +/* for KVM_SET_IRQ_LEVEL */
> +struct kvm_irq_level {
> +	__u32 irq;
> +	__u32 level;
> +};
>   

Please add an irqchip index, so this can be used for configurations with 
multiple irqchips (likely ioapics).  You'll also need to add padding so 
the structure is 64-bit aligned.

Other than these minor issues, the patch looks fine.


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

[parent not found: <46898FE6.7040709-atKUWr5tajBWk0Htik3J/w@public.gmane.org>]

* Re: In kernel PIC support: kernel patch
       [not found]             ` <46898FE6.7040709-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-07-03  4:45               ` Dong, Eddie
       [not found]                 ` <10EA09EFD8728347A513008B6B0DA77A01B40B4A-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2007-07-03  7:19               ` Dong, Eddie
  1 sibling, 1 reply; 57+ messages in thread
From: Dong, Eddie @ 2007-07-03  4:45 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f


>> +/* for KVM_SET_IRQ_LEVEL */
>> +struct kvm_irq_level {
>> +	__u32 irq;
>> +	__u32 level;
>> +};
>> 
> 
> Please add an irqchip index, so this can be used for
> configurations with
> multiple irqchips (likely ioapics).  You'll also need to add
> padding so
> the structure is 64-bit aligned.
> 
Adding irqchip index is one way, another way is to just use gsi index
(probably rename irq to gsi). I.e. In X86, gsi 0-15 is for both PIC and
IOAPIC
 and gsi 16-23 is for IOAPIC0 only, gsi 24---xxx can be for ioapic 1,2,
.. n.

In IA64, gsi 0-xxx is just for IOAPIC 0,1, .. n.

Since ACPI already defined gsi, so do we really want to use irqchip
index? 
They are functionally equvalent but gsi is cross architecture neutral,
while
irqchip index is very X86 specific.

Eddie

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

[parent not found: <10EA09EFD8728347A513008B6B0DA77A01B40B4A-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>]

* Re: In kernel PIC support: kernel patch
       [not found]                 ` <10EA09EFD8728347A513008B6B0DA77A01B40B4A-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2007-07-03 14:52                   ` Avi Kivity
  0 siblings, 0 replies; 57+ messages in thread
From: Avi Kivity @ 2007-07-03 14:52 UTC (permalink / raw)
  To: Dong, Eddie; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Dong, Eddie wrote:
>>> +/* for KVM_SET_IRQ_LEVEL */
>>> +struct kvm_irq_level {
>>> +	__u32 irq;
>>> +	__u32 level;
>>> +};
>>>
>>>       
>> Please add an irqchip index, so this can be used for
>> configurations with
>> multiple irqchips (likely ioapics).  You'll also need to add
>> padding so
>> the structure is 64-bit aligned.
>>
>>     
> Adding irqchip index is one way, another way is to just use gsi index
> (probably rename irq to gsi). I.e. In X86, gsi 0-15 is for both PIC and
> IOAPIC
>  and gsi 16-23 is for IOAPIC0 only, gsi 24---xxx can be for ioapic 1,2,
> .. n.
>
> In IA64, gsi 0-xxx is just for IOAPIC 0,1, .. n.
>
> Since ACPI already defined gsi, so do we really want to use irqchip
> index? 
> They are functionally equvalent but gsi is cross architecture neutral,
> while
> irqchip index is very X86 specific.
>
>   

GSI is good enough.   However, we want not to use the name gsi in the 
code, since other archs will not hace it.  Just add a comment indicating 
that for x86 and ia64 the irqchip number is interpreted according to the 
gsi conventions.



-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: In kernel PIC support: kernel patch
       [not found]             ` <46898FE6.7040709-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  2007-07-03  4:45               ` Dong, Eddie
@ 2007-07-03  7:19               ` Dong, Eddie
  1 sibling, 0 replies; 57+ messages in thread
From: Dong, Eddie @ 2007-07-03  7:19 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

[-- Attachment #1: Type: text/plain, Size: 2347 bytes --]

Avi Kivity wrote:
> 
> Other than these minor issues, the patch looks fine.

Here is the rebased one. 

If gsi/level is not enough, then following patch can fix this. But not
sure so far.
thx,eddie

against: cfc329b216bc3e54fe1107e8f714c7b3bc133224

 drivers/kvm/Makefile   |    2
 drivers/kvm/i8259.c    |  440
+++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/kvm/irq.c      |   59 ++++++
 drivers/kvm/irq.h      |   64 +++++++
 drivers/kvm/kvm.h      |   12 +
 drivers/kvm/kvm_main.c |   46 ++++-
 drivers/kvm/svm.c      |   55 +++++-
 drivers/kvm/vmx.c      |   83 ++++++++-
 include/linux/kvm.h    |   14 +
 9 files changed, 749 insertions(+), 26 deletions(-)

diff --git a/user/kvmctl.c b/user/kvmctl.c
index b9c9f2b..2713c82 100644
--- a/user/kvmctl.c
+++ b/user/kvmctl.c
@@ -425,7 +425,8 @@ int kvm_set_irq_level(kvm_context_t kvm, int irq,
int level)
        if (!kvm->irqchip_in_kernel)
                return 0;
        event.level = level;
-       event.gsi = irq;
+       event.pin = irq % 8;
+       event.chip = irq / 8;
        r = ioctl(kvm->vm_fd, KVM_IRQ_LINE, &event);
        if (r == -1)
                perror("kvm_set_irq_level");


diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index 02bf208..642b3cd 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -2875,9 +2875,9 @@ static long kvm_vm_ioctl(struct file *filp,
                if (copy_from_user(&irq_event, argp, sizeof irq_event))
                        goto out;
                if (irqchip_in_kernel(kvm)) {
-                       if (irq_event.gsi < 16)
+                       if (irq_event.chip < 2)
                                kvm_pic_set_irq(pic_irqchip(kvm),
-                                       irq_event.gsi,
+                                       irq_event.chip * 8 +
irq_event.pin,
                                        irq_event.level);
                        /* TODO: IOAPIC */
                        r = 0;
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 1d1270a..d27dbca 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -43,7 +43,8 @@ struct kvm_memory_alias {

 /* for KVM_SET_IRQ_LEVEL */
 struct kvm_irq_level {
-       __u32 gsi;
+       __u16 chip;
+       __u16 pin;
        __u32 level;
 };

[-- Attachment #2: pic-kernel-v26.patch.rebase --]
[-- Type: application/octet-stream, Size: 26028 bytes --]

In kernel PIC support.

Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>

diff --git a/drivers/kvm/Makefile b/drivers/kvm/Makefile
index c0a789f..952dff3 100644
--- a/drivers/kvm/Makefile
+++ b/drivers/kvm/Makefile
@@ -2,7 +2,7 @@
 # Makefile for Kernel-based Virtual Machine module
 #
 
-kvm-objs := kvm_main.o mmu.o x86_emulate.o
+kvm-objs := kvm_main.o mmu.o x86_emulate.o i8259.o irq.o
 obj-$(CONFIG_KVM) += kvm.o
 kvm-intel-objs = vmx.o
 obj-$(CONFIG_KVM_INTEL) += kvm-intel.o
diff --git a/drivers/kvm/i8259.c b/drivers/kvm/i8259.c
new file mode 100644
index 0000000..f4ae5f7
--- /dev/null
+++ b/drivers/kvm/i8259.c
@@ -0,0 +1,440 @@
+/*
+ * 8259 interrupt controller emulation
+ *
+ * Copyright (c) 2003-2004 Fabrice Bellard
+ * Copyright (c) 2007 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ * Authors:
+ *   Yaozu (Eddie) Dong <Eddie.dong@intel.com>
+ *   Port from Qemu.
+ */
+#include <linux/mm.h>
+#include "irq.h"
+
+/*
+ * set irq level. If an edge is detected, then the IRR is set to 1
+ */
+static inline void pic_set_irq1(struct kvm_pic_state *s, int irq, int level)
+{
+	int mask;
+	mask = 1 << irq;
+	if (s->elcr & mask)	/* level triggered */
+		if (level) {
+			s->irr |= mask;
+			s->last_irr |= mask;
+		} else {
+			s->irr &= ~mask;
+			s->last_irr &= ~mask;
+		}
+	else	/* edge triggered */
+		if (level) {
+			if ((s->last_irr & mask) == 0)
+				s->irr |= mask;
+			s->last_irr |= mask;
+		} else
+			s->last_irr &= ~mask;
+}
+
+/*
+ * return the highest priority found in mask (highest = smallest
+ * number). Return 8 if no irq
+ */
+static inline int get_priority(struct kvm_pic_state *s, int mask)
+{
+	int priority;
+	if (mask == 0)
+		return 8;
+	priority = 0;
+	while ((mask & (1 << ((priority + s->priority_add) & 7))) == 0)
+		priority++;
+	return priority;
+}
+
+/*
+ * return the pic wanted interrupt. return -1 if none
+ */
+static int pic_get_irq(struct kvm_pic_state *s)
+{
+	int mask, cur_priority, priority;
+
+	mask = s->irr & ~s->imr;
+	priority = get_priority(s, mask);
+	if (priority == 8)
+		return -1;
+	/*
+	 * compute current priority. If special fully nested mode on the
+	 * master, the IRQ coming from the slave is not taken into account
+	 * for the priority computation.
+	 */
+	mask = s->isr;
+	if (s->special_fully_nested_mode && s == &s->pics_state->pics[0])
+		mask &= ~(1 << 2);
+	cur_priority = get_priority(s, mask);
+	if (priority < cur_priority)
+		/*
+		 * higher priority found: an irq should be generated
+		 */
+		return (priority + s->priority_add) & 7;
+	else
+		return -1;
+}
+
+/*
+ * raise irq to CPU if necessary. must be called every time the active
+ * irq may change
+ */
+static void pic_update_irq(struct kvm_pic *s)
+{
+	int irq2, irq;
+
+	irq2 = pic_get_irq(&s->pics[1]);
+	if (irq2 >= 0) {
+		/*
+		 * if irq request by slave pic, signal master PIC
+		 */
+		pic_set_irq1(&s->pics[0], 2, 1);
+		pic_set_irq1(&s->pics[0], 2, 0);
+	}
+	irq = pic_get_irq(&s->pics[0]);
+	if (irq >= 0)
+		s->irq_request(s->irq_request_opaque, 1);
+}
+
+void kvm_pic_set_irq(void *opaque, int irq, int level)
+{
+	struct kvm_pic *s = opaque;
+
+	pic_set_irq1(&s->pics[irq >> 3], irq & 7, level);
+	pic_update_irq(s);
+}
+
+/*
+ * acknowledge interrupt 'irq'
+ */
+static inline void pic_intack(struct kvm_pic_state *s, int irq)
+{
+	if (s->auto_eoi) {
+		if (s->rotate_on_auto_eoi)
+			s->priority_add = (irq + 1) & 7;
+	} else
+		s->isr |= (1 << irq);
+	/*
+	 * We don't clear a level sensitive interrupt here
+	 */
+	if (!(s->elcr & (1 << irq)))
+		s->irr &= ~(1 << irq);
+}
+
+int kvm_pic_read_irq(struct kvm_pic *s)
+{
+	int irq, irq2, intno;
+
+	irq = pic_get_irq(&s->pics[0]);
+	if (irq >= 0) {
+		pic_intack(&s->pics[0], irq);
+		if (irq == 2) {
+			irq2 = pic_get_irq(&s->pics[1]);
+			if (irq2 >= 0)
+				pic_intack(&s->pics[1], irq2);
+			else
+				/*
+				 * spurious IRQ on slave controller
+				 */
+				irq2 = 7;
+			intno = s->pics[1].irq_base + irq2;
+			irq = irq2 + 8;
+		} else
+			intno = s->pics[0].irq_base + irq;
+	} else {
+		/*
+		 * spurious IRQ on host controller
+		 */
+		irq = 7;
+		intno = s->pics[0].irq_base + irq;
+	}
+	pic_update_irq(s);
+
+	return intno;
+}
+
+static void pic_reset(void *opaque)
+{
+	struct kvm_pic_state *s = opaque;
+
+	s->last_irr = 0;
+	s->irr = 0;
+	s->imr = 0;
+	s->isr = 0;
+	s->priority_add = 0;
+	s->irq_base = 0;
+	s->read_reg_select = 0;
+	s->poll = 0;
+	s->special_mask = 0;
+	s->init_state = 0;
+	s->auto_eoi = 0;
+	s->rotate_on_auto_eoi = 0;
+	s->special_fully_nested_mode = 0;
+	s->init4 = 0;
+}
+
+static void pic_ioport_write(void *opaque, u32 addr, u32 val)
+{
+	struct kvm_pic_state *s = opaque;
+	int priority, cmd, irq;
+
+	addr &= 1;
+	if (addr == 0) {
+		if (val & 0x10) {
+			pic_reset(s);	/* init */
+			/*
+			 * deassert a pending interrupt
+			 */
+			s->pics_state->irq_request(s->pics_state->
+						   irq_request_opaque, 0);
+			s->init_state = 1;
+			s->init4 = val & 1;
+			if (val & 0x02)
+				printk(KERN_ERR "single mode not supported");
+			if (val & 0x08)
+				printk(KERN_ERR
+				       "level sensitive irq not supported");
+		} else if (val & 0x08) {
+			if (val & 0x04)
+				s->poll = 1;
+			if (val & 0x02)
+				s->read_reg_select = val & 1;
+			if (val & 0x40)
+				s->special_mask = (val >> 5) & 1;
+		} else {
+			cmd = val >> 5;
+			switch (cmd) {
+			case 0:
+			case 4:
+				s->rotate_on_auto_eoi = cmd >> 2;
+				break;
+			case 1:	/* end of interrupt */
+			case 5:
+				priority = get_priority(s, s->isr);
+				if (priority != 8) {
+					irq = (priority + s->priority_add) & 7;
+					s->isr &= ~(1 << irq);
+					if (cmd == 5)
+						s->priority_add = (irq + 1) & 7;
+					pic_update_irq(s->pics_state);
+				}
+				break;
+			case 3:
+				irq = val & 7;
+				s->isr &= ~(1 << irq);
+				pic_update_irq(s->pics_state);
+				break;
+			case 6:
+				s->priority_add = (val + 1) & 7;
+				pic_update_irq(s->pics_state);
+				break;
+			case 7:
+				irq = val & 7;
+				s->isr &= ~(1 << irq);
+				s->priority_add = (irq + 1) & 7;
+				pic_update_irq(s->pics_state);
+				break;
+			default:
+				break;	/* no operation */
+			}
+		}
+	} else
+		switch (s->init_state) {
+		case 0:		/* normal mode */
+			s->imr = val;
+			pic_update_irq(s->pics_state);
+			break;
+		case 1:
+			s->irq_base = val & 0xf8;
+			s->init_state = 2;
+			break;
+		case 2:
+			if (s->init4)
+				s->init_state = 3;
+			else
+				s->init_state = 0;
+			break;
+		case 3:
+			s->special_fully_nested_mode = (val >> 4) & 1;
+			s->auto_eoi = (val >> 1) & 1;
+			s->init_state = 0;
+			break;
+		}
+}
+
+static u32 pic_poll_read(struct kvm_pic_state *s, u32 addr1)
+{
+	int ret;
+
+	ret = pic_get_irq(s);
+	if (ret >= 0) {
+		if (addr1 >> 7) {
+			s->pics_state->pics[0].isr &= ~(1 << 2);
+			s->pics_state->pics[0].irr &= ~(1 << 2);
+		}
+		s->irr &= ~(1 << ret);
+		s->isr &= ~(1 << ret);
+		if (addr1 >> 7 || ret != 2)
+			pic_update_irq(s->pics_state);
+	} else {
+		ret = 0x07;
+		pic_update_irq(s->pics_state);
+	}
+
+	return ret;
+}
+
+static u32 pic_ioport_read(void *opaque, u32 addr1)
+{
+	struct kvm_pic_state *s = opaque;
+	unsigned int addr;
+	int ret;
+
+	addr = addr1;
+	addr &= 1;
+	if (s->poll) {
+		ret = pic_poll_read(s, addr1);
+		s->poll = 0;
+	} else
+		if (addr == 0)
+			if (s->read_reg_select)
+				ret = s->isr;
+			else
+				ret = s->irr;
+		else
+			ret = s->imr;
+	return ret;
+}
+
+static void elcr_ioport_write(void *opaque, u32 addr, u32 val)
+{
+	struct kvm_pic_state *s = opaque;
+	s->elcr = val & s->elcr_mask;
+}
+
+static u32 elcr_ioport_read(void *opaque, u32 addr1)
+{
+	struct kvm_pic_state *s = opaque;
+	return s->elcr;
+}
+
+static int picdev_in_range(struct kvm_io_device *this, gpa_t addr)
+{
+	switch (addr) {
+	case 0x20:
+	case 0x21:
+	case 0xa0:
+	case 0xa1:
+	case 0x4d0:
+	case 0x4d1:
+		return 1;
+	default:
+		return 0;
+	}
+}
+
+static void picdev_write(struct kvm_io_device *this,
+			 gpa_t addr, int len, const void *val)
+{
+	struct kvm_pic *s = this->private;
+	unsigned char data = *(unsigned char *)val;
+
+	if (len != 1) {
+		if (printk_ratelimit())
+			printk(KERN_ERR "PIC: non byte write\n");
+		return;
+	}
+	switch (addr) {
+	case 0x20:
+	case 0x21:
+	case 0xa0:
+	case 0xa1:
+		pic_ioport_write(&s->pics[addr >> 7], addr, data);
+		break;
+	case 0x4d0:
+	case 0x4d1:
+		elcr_ioport_write(&s->pics[addr & 1], addr, data);
+		break;
+	}
+}
+
+static void picdev_read(struct kvm_io_device *this,
+			gpa_t addr, int len, void *val)
+{
+	struct kvm_pic *s = this->private;
+	unsigned char data = 0;
+
+	if (len != 1) {
+		if (printk_ratelimit())
+			printk(KERN_ERR "PIC: non byte read\n");
+		return;
+	}
+	switch (addr) {
+	case 0x20:
+	case 0x21:
+	case 0xa0:
+	case 0xa1:
+		data = pic_ioport_read(&s->pics[addr >> 7], addr);
+		break;
+	case 0x4d0:
+	case 0x4d1:
+		data = elcr_ioport_read(&s->pics[addr & 1], addr);
+		break;
+	}
+	*(unsigned char *)val = data;
+}
+
+/*
+ * callback when PIC0 irq status changed
+ */
+static void pic_irq_request(void *opaque, int level)
+{
+	struct kvm *kvm = opaque;
+
+	pic_irqchip(kvm)->output = level;
+}
+
+struct kvm_pic *kvm_create_pic(struct kvm *kvm)
+{
+	struct kvm_pic *s;
+	s = kzalloc(sizeof(struct kvm_pic), GFP_KERNEL);
+	if (!s)
+		return NULL;
+	s->pics[0].elcr_mask = 0xf8;
+	s->pics[1].elcr_mask = 0xde;
+	s->irq_request = pic_irq_request;
+	s->irq_request_opaque = kvm;
+	s->pics[0].pics_state = s;
+	s->pics[1].pics_state = s;
+
+	/*
+	 * Initialize PIO device
+	 */
+	s->dev.read = picdev_read;
+	s->dev.write = picdev_write;
+	s->dev.in_range = picdev_in_range;
+	s->dev.private = s;
+	kvm_io_bus_register_dev(&kvm->pio_bus, &s->dev);
+	return s;
+}
diff --git a/drivers/kvm/irq.c b/drivers/kvm/irq.c
new file mode 100644
index 0000000..92a65bc
--- /dev/null
+++ b/drivers/kvm/irq.c
@@ -0,0 +1,59 @@
+/*
+ * irq.c: API for in kernel interrupt controller
+ * Copyright (c) 2007, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ * Authors:
+ *   Yaozu (Eddie) Dong <Eddie.dong@intel.com>
+ *
+ */
+
+#include "kvm.h"
+#include "irq.h"
+
+/*
+ * check if there is pending interrupt without
+ * intack.
+ */
+int kvm_cpu_has_interrupt(struct kvm_vcpu *v)
+{
+	struct kvm_pic *s = pic_irqchip(v->kvm);
+
+	if (s->output)	/* PIC */
+		return 1;
+	/*
+	 * TODO: APIC
+	 */
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_cpu_has_interrupt);
+
+/*
+ * Read pending interrupt vector and intack.
+ */
+int kvm_cpu_get_interrupt(struct kvm_vcpu *v)
+{
+	struct kvm_pic *s = pic_irqchip(v->kvm);
+	int vector;
+
+	s->output = 0;
+	vector = kvm_pic_read_irq(s);
+	if (vector != -1)
+		return vector;
+	/*
+	 * TODO: APIC
+	 */
+	return -1;
+}
+EXPORT_SYMBOL_GPL(kvm_cpu_get_interrupt);
diff --git a/drivers/kvm/irq.h b/drivers/kvm/irq.h
new file mode 100644
index 0000000..a6b3869
--- /dev/null
+++ b/drivers/kvm/irq.h
@@ -0,0 +1,64 @@
+/*
+ * irq.h: in kernel interrupt controller related definitions
+ * Copyright (c) 2007, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ * Authors:
+ *   Yaozu (Eddie) Dong <Eddie.dong@intel.com>
+ *
+ */
+
+#ifndef __IRQ_H
+#define __IRQ_H
+
+#include "kvm.h"
+
+typedef void irq_request_func(void *opaque, int level);
+
+struct kvm_pic;
+struct kvm_pic_state {
+	u8 last_irr;	/* edge detection */
+	u8 irr;		/* interrupt request register */
+	u8 imr;		/* interrupt mask register */
+	u8 isr;		/* interrupt service register */
+	u8 priority_add;	/* highest irq priority */
+	u8 irq_base;
+	u8 read_reg_select;
+	u8 poll;
+	u8 special_mask;
+	u8 init_state;
+	u8 auto_eoi;
+	u8 rotate_on_auto_eoi;
+	u8 special_fully_nested_mode;
+	u8 init4;		/* true if 4 byte init */
+	u8 elcr;		/* PIIX edge/trigger selection */
+	u8 elcr_mask;
+	struct kvm_pic *pics_state;
+};
+
+struct kvm_pic {
+	struct kvm_pic_state pics[2]; /* 0 is master pic, 1 is slave pic */
+	irq_request_func *irq_request;
+	void *irq_request_opaque;
+	int output;		/* intr from master PIC */
+	struct kvm_io_device dev;
+};
+
+struct kvm_pic *kvm_create_pic(struct kvm *kvm);
+void kvm_pic_set_irq(void *opaque, int irq, int level);
+int kvm_pic_read_irq(struct kvm_pic *s);
+int kvm_cpu_get_interrupt(struct kvm_vcpu *v);
+int kvm_cpu_has_interrupt(struct kvm_vcpu *v);
+
+#endif
diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
index a7c5e6b..f1a6773 100644
--- a/drivers/kvm/kvm.h
+++ b/drivers/kvm/kvm.h
@@ -435,6 +435,7 @@ struct kvm_memory_slot {
 	unsigned long *dirty_bitmap;
 };
 
+struct kvm_pic;
 struct kvm {
 	spinlock_t lock; /* protects everything except vcpus */
 	int naliases;
@@ -456,8 +457,19 @@ struct kvm {
 	struct file *filp;
 	struct kvm_io_bus mmio_bus;
 	struct kvm_io_bus pio_bus;
+	struct kvm_pic *vpic;
 };
 
+static inline struct kvm_pic *pic_irqchip(struct kvm *kvm)
+{
+	return kvm->vpic;
+}
+
+static inline int irqchip_in_kernel(struct kvm *kvm)
+{
+	return pic_irqchip(kvm) != 0;
+}
+
 struct descriptor_table {
 	u16 limit;
 	unsigned long base;
diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index 5603000..02bf208 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -46,6 +46,7 @@
 
 #include "x86_emulate.h"
 #include "segment_descriptor.h"
+#include "irq.h"
 
 MODULE_AUTHOR("Qumranet");
 MODULE_LICENSE("GPL");
@@ -476,6 +477,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	spin_lock(&kvm_lock);
 	list_del(&kvm->vm_list);
 	spin_unlock(&kvm_lock);
+	kfree(kvm->vpic);
 	kvm_io_bus_destroy(&kvm->pio_bus);
 	kvm_io_bus_destroy(&kvm->mmio_bus);
 	kvm_free_vcpus(kvm);
@@ -1374,7 +1376,8 @@ EXPORT_SYMBOL_GPL(emulate_instruction);
 
 int kvm_emulate_halt(struct kvm_vcpu *vcpu)
 {
-	if (vcpu->irq_summary)
+	if (vcpu->irq_summary ||
+		(irqchip_in_kernel(vcpu->kvm) && kvm_cpu_has_interrupt(vcpu)))
 		return 1;
 
 	vcpu->run->exit_reason = KVM_EXIT_HLT;
@@ -2857,6 +2860,30 @@ static long kvm_vm_ioctl(struct file *filp,
 			goto out;
 		break;
 	}
+	case KVM_CREATE_PIC:
+		r = -ENOMEM;
+		kvm->vpic = kvm_create_pic(kvm);
+		if (kvm->vpic)
+			r = 0;
+		else
+			goto out;
+		break;
+	case KVM_IRQ_LINE: {
+		struct kvm_irq_level irq_event;
+
+		r = -EFAULT;
+		if (copy_from_user(&irq_event, argp, sizeof irq_event))
+			goto out;
+		if (irqchip_in_kernel(kvm)) {
+			if (irq_event.gsi < 16)
+				kvm_pic_set_irq(pic_irqchip(kvm),
+					irq_event.gsi,
+					irq_event.level);
+			/* TODO: IOAPIC */
+			r = 0;
+		}
+		break;
+	}
 	default:
 		;
 	}
@@ -2988,12 +3015,19 @@ static long kvm_dev_ioctl(struct file *filp,
 		r = 0;
 		break;
 	}
-	case KVM_CHECK_EXTENSION:
-		/*
-		 * No extensions defined at present.
-		 */
-		r = 0;
+	case KVM_CHECK_EXTENSION: {
+		int ext = (long)argp;
+
+		switch (ext) {
+		case KVM_CAP_PIC:
+			r = 1;
+			break;
+		default:
+			r = 0;
+			break;
+		}
 		break;
+	}
 	case KVM_GET_VCPU_MMAP_SIZE:
 		r = -EINVAL;
 		if (arg)
diff --git a/drivers/kvm/svm.c b/drivers/kvm/svm.c
index a0d4428..f614800 100644
--- a/drivers/kvm/svm.c
+++ b/drivers/kvm/svm.c
@@ -23,6 +23,7 @@
 #include <asm/desc.h>
 
 #include "kvm_svm.h"
+#include "irq.h"
 #include "x86_emulate.h"
 
 MODULE_AUTHOR("Qumranet");
@@ -904,7 +905,8 @@ static int pf_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 	enum emulation_result er;
 	int r;
 
-	if (is_external_interrupt(exit_int_info))
+	if (!irqchip_in_kernel(vcpu->kvm) &&
+		is_external_interrupt(exit_int_info))
 		push_irq(vcpu, exit_int_info & SVM_EVTINJ_VEC_MASK);
 
 	spin_lock(&vcpu->kvm->lock);
@@ -1276,6 +1278,8 @@ static int msr_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 static int interrupt_window_interception(struct kvm_vcpu *vcpu,
 				   struct kvm_run *kvm_run)
 {
+	vcpu->svm->vmcb->control.intercept &= ~(1ULL << INTERCEPT_VINTR);
+	vcpu->svm->vmcb->control.int_ctl &= ~V_IRQ_MASK;
 	/*
 	 * If the user space waits to inject interrupts, exit as soon as
 	 * possible
@@ -1380,22 +1384,55 @@ static void pre_svm_run(struct kvm_vcpu *vcpu)
 }
 
 
-static inline void kvm_do_inject_irq(struct kvm_vcpu *vcpu)
+static inline void svm_inject_irq(struct kvm_vcpu *vcpu, int irq)
 {
 	struct vmcb_control_area *control;
 
 	control = &vcpu->svm->vmcb->control;
-	control->int_vector = pop_irq(vcpu);
+	control->int_vector = irq;
 	control->int_ctl &= ~V_INTR_PRIO_MASK;
 	control->int_ctl |= V_IRQ_MASK |
 		((/*control->int_vector >> 4*/ 0xf) << V_INTR_PRIO_SHIFT);
 }
 
+static void svm_intr_assist(struct kvm_vcpu *vcpu)
+{
+	struct vmcb *vmcb = vcpu->svm->vmcb;
+	int intr_vector = -1;
+
+	if ((vmcb->control.exit_int_info & SVM_EVTINJ_VALID) &&
+	    ((vmcb->control.exit_int_info & SVM_EVTINJ_TYPE_MASK) == 0)) {
+		intr_vector = vmcb->control.exit_int_info &
+			      SVM_EVTINJ_VEC_MASK;
+		vmcb->control.exit_int_info = 0;
+		svm_inject_irq(vcpu, intr_vector);
+		return;
+	}
+
+	if (vmcb->control.int_ctl & V_IRQ_MASK)
+		return;
+
+	if (!kvm_cpu_has_interrupt(vcpu))
+		return;
+
+	if (!(vmcb->save.rflags & X86_EFLAGS_IF) ||
+	    (vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK) ||
+	    (vmcb->control.event_inj & SVM_EVTINJ_VALID)) {
+		/* unable to deliver irq, set pending irq */
+		vmcb->control.intercept |= (1ULL << INTERCEPT_VINTR);
+		svm_inject_irq(vcpu, 0x0);
+		return;
+	}
+	/* Okay, we can deliver the interrupt: grab it and update PIC state. */
+	intr_vector = kvm_cpu_get_interrupt(vcpu);
+	svm_inject_irq(vcpu, intr_vector);
+}
+
 static void kvm_reput_irq(struct kvm_vcpu *vcpu)
 {
 	struct vmcb_control_area *control = &vcpu->svm->vmcb->control;
 
-	if (control->int_ctl & V_IRQ_MASK) {
+	if ((control->int_ctl & V_IRQ_MASK) && !irqchip_in_kernel(vcpu->kvm)) {
 		control->int_ctl &= ~V_IRQ_MASK;
 		push_irq(vcpu, control->int_vector);
 	}
@@ -1417,7 +1454,7 @@ static void do_interrupt_requests(struct kvm_vcpu *vcpu,
 		/*
 		 * If interrupts enabled, and not blocked by sti or mov ss. Good.
 		 */
-		kvm_do_inject_irq(vcpu);
+		svm_inject_irq(vcpu, pop_irq(vcpu));
 
 	/*
 	 * Interrupts blocked.  Wait for unblock.
@@ -1487,11 +1524,13 @@ again:
 	if (unlikely(r))
 		return r;
 
-	if (!vcpu->mmio_read_completed)
-		do_interrupt_requests(vcpu, kvm_run);
-
 	clgi();
 
+	if (irqchip_in_kernel(vcpu->kvm))
+		svm_intr_assist(vcpu);
+	else if (!vcpu->mmio_read_completed)
+		do_interrupt_requests(vcpu, kvm_run);
+
 	vcpu->guest_mode = 1;
 	if (vcpu->requests)
 		if (test_and_clear_bit(KVM_TLB_FLUSH, &vcpu->requests))
diff --git a/drivers/kvm/vmx.c b/drivers/kvm/vmx.c
index 7d04ffa..598b2b2 100644
--- a/drivers/kvm/vmx.c
+++ b/drivers/kvm/vmx.c
@@ -16,6 +16,7 @@
  */
 
 #include "kvm.h"
+#include "irq.h"
 #include "vmx.h"
 #include <linux/module.h>
 #include <linux/kernel.h>
@@ -1468,6 +1469,16 @@ static void inject_rmode_irq(struct kvm_vcpu *vcpu, int irq)
 	vmcs_writel(GUEST_RSP, (vmcs_readl(GUEST_RSP) & ~0xffff) | (sp - 6));
 }
 
+static void vmx_inject_irq(struct kvm_vcpu *vcpu, int irq)
+{
+	if (vcpu->rmode.active) {
+		inject_rmode_irq(vcpu, irq);
+		return;
+	}
+	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
+			irq | INTR_TYPE_EXT_INTR | INTR_INFO_VALID_MASK);
+}
+
 static void kvm_do_inject_irq(struct kvm_vcpu *vcpu)
 {
 	int word_index = __ffs(vcpu->irq_summary);
@@ -1477,13 +1488,7 @@ static void kvm_do_inject_irq(struct kvm_vcpu *vcpu)
 	clear_bit(bit_index, &vcpu->irq_pending[word_index]);
 	if (!vcpu->irq_pending[word_index])
 		clear_bit(word_index, &vcpu->irq_summary);
-
-	if (vcpu->rmode.active) {
-		inject_rmode_irq(vcpu, irq);
-		return;
-	}
-	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
-			irq | INTR_TYPE_EXT_INTR | INTR_INFO_VALID_MASK);
+	vmx_inject_irq(vcpu, irq);
 }
 
 
@@ -1567,7 +1572,7 @@ static int handle_exception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 		       "intr info 0x%x\n", __FUNCTION__, vect_info, intr_info);
 	}
 
-	if (is_external_interrupt(vect_info)) {
+	if (!irqchip_in_kernel(vcpu->kvm) && is_external_interrupt(vect_info)) {
 		int irq = vect_info & VECTORING_INFO_VECTOR_MASK;
 		set_bit(irq, vcpu->irq_pending);
 		set_bit(irq / BITS_PER_LONG, &vcpu->irq_summary);
@@ -1898,6 +1903,12 @@ static void post_kvm_run_save(struct kvm_vcpu *vcpu,
 static int handle_interrupt_window(struct kvm_vcpu *vcpu,
 				   struct kvm_run *kvm_run)
 {
+	u32 cpu_based_vm_exec_control;
+
+	/* clear pending irq */
+	cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
+	cpu_based_vm_exec_control &= ~CPU_BASED_VIRTUAL_INTR_PENDING;
+	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control);
 	/*
 	 * If the user space waits to inject interrupts, exit as soon as
 	 * possible
@@ -1989,6 +2000,55 @@ static void vmx_flush_tlb(struct kvm_vcpu *vcpu)
 {
 }
 
+static void enable_irq_window(struct kvm_vcpu *vcpu)
+{
+	u32 cpu_based_vm_exec_control;
+
+	cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
+	cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
+	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control);
+}
+
+static void vmx_intr_assist(struct kvm_vcpu *vcpu)
+{
+	u32 idtv_info_field, intr_info_field;
+	int has_ext_irq, interrupt_window_open;
+
+	has_ext_irq = kvm_cpu_has_interrupt(vcpu);
+	intr_info_field = vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
+	idtv_info_field = vmcs_read32(IDT_VECTORING_INFO_FIELD);
+	if (intr_info_field & INTR_INFO_VALID_MASK) {
+		if (idtv_info_field & INTR_INFO_VALID_MASK) {
+			/* TODO: fault when IDT_Vectoring */
+			printk(KERN_ERR "Fault when IDT_Vectoring\n");
+		}
+		if (has_ext_irq)
+			enable_irq_window(vcpu);
+		return;
+	}
+	if (unlikely(idtv_info_field & INTR_INFO_VALID_MASK)) {
+		vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, idtv_info_field);
+		vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
+				vmcs_read32(VM_EXIT_INSTRUCTION_LEN));
+
+		if (unlikely(idtv_info_field & INTR_INFO_DELIEVER_CODE_MASK))
+			vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
+				vmcs_read32(IDT_VECTORING_ERROR_CODE));
+		if (unlikely(has_ext_irq))
+			enable_irq_window(vcpu);
+		return;
+	}
+	if (!has_ext_irq)
+		return;
+	interrupt_window_open =
+		((vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
+		 (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & 3) == 0);
+	if (interrupt_window_open)
+		vmx_inject_irq(vcpu, kvm_cpu_get_interrupt(vcpu));
+	else
+		enable_irq_window(vcpu);
+}
+
 static int vmx_vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 {
 	u8 fail;
@@ -1999,9 +2059,6 @@ preempted:
 		kvm_guest_debug_pre(vcpu);
 
 again:
-	if (!vcpu->mmio_read_completed)
-		do_interrupt_requests(vcpu, kvm_run);
-
 	vmx_save_host_state(vcpu);
 	kvm_load_guest_fpu(vcpu);
 
@@ -2015,6 +2072,10 @@ again:
 	vmcs_writel(HOST_CR0, read_cr0());
 
 	local_irq_disable();
+	if (irqchip_in_kernel(vcpu->kvm))
+		vmx_intr_assist(vcpu);
+	else if (!vcpu->mmio_read_completed)
+		do_interrupt_requests(vcpu, kvm_run);
 
 	vcpu->guest_mode = 1;
 	if (vcpu->requests)
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index e6edca8..1d1270a 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -41,6 +41,12 @@ struct kvm_memory_alias {
 	__u64 target_phys_addr;
 };
 
+/* for KVM_SET_IRQ_LEVEL */
+struct kvm_irq_level {
+	__u32 gsi;
+	__u32 level;
+};
+
 enum kvm_exit_reason {
 	KVM_EXIT_UNKNOWN          = 0,
 	KVM_EXIT_EXCEPTION        = 1,
@@ -272,6 +278,11 @@ struct kvm_signal_mask {
 #define KVM_GET_VCPU_MMAP_SIZE    _IO(KVMIO,   0x04) /* in bytes */
 
 /*
+ * Extension capability list.
+ */
+#define KVM_CAP_PIC	  0
+
+/*
  * ioctls for VM fds
  */
 #define KVM_SET_MEMORY_REGION     _IOW(KVMIO, 0x40, struct kvm_memory_region)
@@ -282,6 +293,9 @@ struct kvm_signal_mask {
 #define KVM_CREATE_VCPU           _IO(KVMIO,  0x41)
 #define KVM_GET_DIRTY_LOG         _IOW(KVMIO, 0x42, struct kvm_dirty_log)
 #define KVM_SET_MEMORY_ALIAS      _IOW(KVMIO, 0x43, struct kvm_memory_alias)
+/* Device model IOC */
+#define KVM_CREATE_PIC		  _IO(KVMIO,  0x60)
+#define KVM_IRQ_LINE		  _IO(KVMIO,  0x61)
 
 /*
  * ioctls for vcpu fds

[-- Attachment #3: pic-user-v26.patch.rebase --]
[-- Type: application/octet-stream, Size: 5450 bytes --]

In kernel PIC support.

Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>

diff --git a/kernel/Kbuild b/kernel/Kbuild
index e9bcda7..ebb55cd 100644
--- a/kernel/Kbuild
+++ b/kernel/Kbuild
@@ -1,5 +1,5 @@
 EXTRA_CFLAGS := -I$(src)/include -include $(src)/external-module-compat.h
 obj-m := kvm.o kvm-intel.o kvm-amd.o
-kvm-objs := kvm_main.o mmu.o x86_emulate.o
+kvm-objs := kvm_main.o mmu.o x86_emulate.o irq.o i8259.o
 kvm-intel-objs := vmx.o vmx-debug.o
 kvm-amd-objs := svm.o
diff --git a/qemu/cpu-defs.h b/qemu/cpu-defs.h
index 0b49c89..67ba41e 100644
--- a/qemu/cpu-defs.h
+++ b/qemu/cpu-defs.h
@@ -75,6 +75,7 @@ typedef unsigned long ram_addr_t;
 #define EXCP_HLT        0x10001 /* hlt instruction reached */
 #define EXCP_DEBUG      0x10002 /* cpu stopped after a breakpoint or singlestep */
 #define EXCP_HALTED     0x10003 /* cpu is halted (waiting for external event) */
+#define EXCP_IO_WINDOW  0x10004 /* IO Window for user irq injection */
 #define MAX_BREAKPOINTS 32
 
 #define TB_JMP_CACHE_BITS 12
diff --git a/qemu/hw/i8259.c b/qemu/hw/i8259.c
index a2a8187..117340c 100644
--- a/qemu/hw/i8259.c
+++ b/qemu/hw/i8259.c
@@ -177,7 +177,12 @@ int64_t irq_time[16];
 void pic_set_irq_new(void *opaque, int irq, int level)
 {
     PicState2 *s = opaque;
+#ifdef USE_KVM
+    extern int kvm_set_irq(int irq, int level);
 
+    if (kvm_set_irq(irq, level))
+        return;
+#endif
 #if defined(DEBUG_PIC) || defined(DEBUG_IRQ_COUNT)
     if (level != irq_level[irq]) {
 #if defined(DEBUG_PIC)
diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c
index 1849997..811b320 100644
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -601,6 +601,7 @@ static int kvm_writeq(void *opaque, uint64_t addr, uint64_t data)
 
 static int kvm_io_window(void *opaque)
 {
+    env->exception_index = EXCP_IO_WINDOW;
     return 1;
 }
 
@@ -612,7 +613,8 @@ static int kvm_halt(void *opaque, int vcpu)
     env = envs[0];
     if (!((env->interrupt_request & CPU_INTERRUPT_HARD) &&
 	  (env->eflags & IF_MASK))) {
-	    env->hflags |= HF_HALTED_MASK;
+	    /* TODO: for halt emulation, temply walkaround now */
+	    /* env->hflags |= HF_HALTED_MASK; */
 	    env->exception_index = EXCP_HLT;
     }
 
@@ -923,4 +925,9 @@ int kvm_get_phys_ram_page_bitmap(unsigned char *bitmap)
  out:
     return r;
 }
+
+int kvm_set_irq(int irq, int level)
+{
+    return kvm_set_irq_level(kvm_context, irq, level);
+}
 #endif
diff --git a/user/kvmctl.c b/user/kvmctl.c
index 054547c..b9c9f2b 100644
--- a/user/kvmctl.c
+++ b/user/kvmctl.c
@@ -66,6 +66,7 @@ struct kvm_context {
 	int dirty_pages_log_all;
 	/// memory regions parameters
 	struct kvm_memory_region mem_regions[KVM_MAX_NUM_MEM_REGIONS];
+	int irqchip_in_kernel;
 };
 
 /*
@@ -288,6 +289,15 @@ int kvm_create(kvm_context_t kvm, unsigned long memory, void **vm_mem)
 	     MAP_PRIVATE|MAP_FIXED, zfd, 0);
 	close(zfd);
 
+	kvm->irqchip_in_kernel = 0;
+	r = ioctl(kvm->fd, KVM_CHECK_EXTENSION, KVM_CAP_PIC);
+	if (r > 0) {	/* kernel irqchip supported */
+		r = ioctl(fd, KVM_CREATE_PIC);
+		if (r >= 0)
+			kvm->irqchip_in_kernel = 1;
+		else
+			printf("Create kernel PIC irqchip failed\n");
+	}
 	r = kvm_create_vcpu(kvm, 0);
 	if (r < 0)
 		return r;
@@ -407,6 +417,21 @@ int kvm_get_mem_map(kvm_context_t kvm, int slot, void *buf)
 #endif /* KVM_GET_MEM_MAP */
 }
 
+int kvm_set_irq_level(kvm_context_t kvm, int irq, int level)
+{
+	struct kvm_irq_level event;
+	int r;
+
+	if (!kvm->irqchip_in_kernel)
+		return 0;
+	event.level = level;
+	event.gsi = irq;
+	r = ioctl(kvm->vm_fd, KVM_IRQ_LINE, &event);
+	if (r == -1)
+		perror("kvm_set_irq_level");
+	return 1;
+}
+
 static int handle_io_abi10(kvm_context_t kvm, struct kvm_run_abi10 *run,
 			   int vcpu)
 {
@@ -944,7 +969,8 @@ int kvm_run(kvm_context_t kvm, int vcpu)
 		return kvm_run_abi10(kvm, vcpu);
 
 again:
-	run->request_interrupt_window = try_push_interrupts(kvm);
+	if (!kvm->irqchip_in_kernel)
+		run->request_interrupt_window = try_push_interrupts(kvm);
 	pre_kvm_run(kvm, vcpu);
 	r = ioctl(fd, KVM_RUN, 0);
 	post_kvm_run(kvm, vcpu);
@@ -1022,6 +1048,25 @@ int kvm_guest_debug(kvm_context_t kvm, int vcpu, struct kvm_debug_guest *dbg)
 	return ioctl(kvm->vcpu_fd[vcpu], KVM_DEBUG_GUEST, dbg);
 }
 
+static void cpuid_remove_apic(struct kvm_cpuid *cpuid)
+{
+	int i;
+	struct kvm_cpuid_entry *e, *entry;
+
+	entry = NULL;
+	for (i = 0; i < cpuid->nent; ++i) {
+		e = &cpuid->entries[i];
+		if (e->function == 1) {
+			entry = e;
+			break;
+		}
+	}
+	if (entry) {
+		entry->edx &= ~(1 << 9);
+		printf("Guest APIC capibility removed\n");
+	}
+}
+
 int kvm_setup_cpuid(kvm_context_t kvm, int vcpu, int nent,
 		    struct kvm_cpuid_entry *entries)
 {
@@ -1034,6 +1079,8 @@ int kvm_setup_cpuid(kvm_context_t kvm, int vcpu, int nent,
 
 	cpuid->nent = nent;
 	memcpy(cpuid->entries, entries, nent * sizeof(*entries));
+	/* temply walkaround before merge of in-kernel APIC */
+	cpuid_remove_apic(cpuid);
 	r = ioctl(kvm->vcpu_fd[vcpu], KVM_SET_CPUID, cpuid);
 
 	free(cpuid);
diff --git a/user/kvmctl.h b/user/kvmctl.h
index 956e194..68e1664 100644
--- a/user/kvmctl.h
+++ b/user/kvmctl.h
@@ -395,6 +395,7 @@ int kvm_destroy_memory_alias(kvm_context_t, int slot);
  * \param bitmap Long aligned address of a big enough bitmap (one bit per page)
  */
 int kvm_get_mem_map(kvm_context_t kvm, int slot, void *bitmap);
+int kvm_set_irq_level(kvm_context_t kvm, int irq, int level);
 
 /*!
  * \brief Enable dirty-pages-logging for all memory regions

[-- Attachment #4: Type: text/plain, Size: 286 bytes --]

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

[-- Attachment #5: Type: text/plain, Size: 186 bytes --]

_______________________________________________
kvm-devel mailing list
kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org
https://lists.sourceforge.net/lists/listinfo/kvm-devel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2007-07-08  8:18 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-06-25 22:23 In kernel PIC support: kernel patch Gregory Haskins
     [not found] ` <468008090200005A00026619-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
2007-06-25 22:25   ` Avi Kivity
2007-06-26  0:30   ` Dong, Eddie
2007-06-27 15:50   ` Luca
     [not found] <468E0FD2.2090305@qumranet.com>
     [not found] ` <468E0FD2.2090305-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-07-06 17:46   ` Dong, Eddie
     [not found]     ` <10EA09EFD8728347A513008B6B0DA77A01B8F3C5-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2007-07-08  8:18       ` Avi Kivity
  -- strict thread matches above, loose matches on Subject: below --
2007-06-25 22:36 Gregory Haskins
2007-06-25 14:10 Gregory Haskins
     [not found] ` <467F94850200005A00026595-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
2007-06-25 21:13   ` Dor Laor
     [not found]     ` <64F9B87B6B770947A9F8391472E032160C654F85-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org>
2007-06-25 21:57       ` Avi Kivity
2007-06-25 13:53 Gregory Haskins
2007-06-25 13:45 Gregory Haskins
2007-06-25 13:44 Gregory Haskins
2007-06-25 13:43 Gregory Haskins
     [not found] ` <467F8E260200005A00026573-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
2007-06-25 13:54   ` Dong, Eddie
2007-06-25 22:08   ` Avi Kivity
2007-06-25 13:26 Gregory Haskins
     [not found] ` <467F8A2D0200005A00026567-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
2007-06-25 22:00   ` Avi Kivity
2007-06-22 12:14 Gregory Haskins
     [not found] ` <467B84EF0200005A00026401-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
2007-06-23  7:53   ` Avi Kivity
2007-06-22 12:06 Gregory Haskins
     [not found] ` <467B830B0200005A000263FB-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
2007-06-22 15:09   ` Dong, Eddie
     [not found]     ` <10EA09EFD8728347A513008B6B0DA77A014E8AC7-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2007-06-23 17:41       ` Avi Kivity
     [not found]         ` <467D5B5B.9090702-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-06-25  3:32           ` Dong, Eddie
     [not found]             ` <10EA09EFD8728347A513008B6B0DA77A01AA46F5-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2007-06-25  3:39               ` Avi Kivity
     [not found]                 ` <467F38F7.9090101-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-06-25  4:11                   ` Dong, Eddie
2007-06-22 12:02 Gregory Haskins
2007-06-22  3:49 Gregory Haskins
     [not found] ` <467B0E7D0200005A000263E4-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
2007-06-22  7:25   ` Dong, Eddie
2007-06-22  9:57   ` Avi Kivity
2007-06-21 16:44 Gregory Haskins
     [not found] ` <467A72970200005A00026354-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
2007-06-22  9:46   ` Avi Kivity
2007-06-21 15:43 Gregory Haskins
2007-06-21 15:42 Gregory Haskins
2007-06-21 15:17 Gregory Haskins
2007-06-21 15:10 Gregory Haskins
     [not found] ` <467A5C8C0200005A00026321-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
2007-06-21 15:13   ` Dong, Eddie
2007-06-21 16:31   ` Avi Kivity
     [not found]     ` <467AA7FF.5000701-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-06-22  2:09       ` Dong, Eddie
     [not found]         ` <10EA09EFD8728347A513008B6B0DA77A01A56750-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2007-06-22 10:32           ` Avi Kivity
2007-06-21 14:01 Gregory Haskins
     [not found] ` <467A4C970200005A000262FD-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
2007-06-21 14:57   ` Dong, Eddie
2007-06-21 13:31 Gregory Haskins
     [not found] ` <467A45690200005A000262F0-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
2007-06-21 14:28   ` Dong, Eddie
2007-06-21 12:36 Gregory Haskins
     [not found] ` <467A38720200005A000262C8-Igcdv/6uVdMHoYOw/+koYqIwWpluYiW7@public.gmane.org>
2007-06-21 13:02   ` Dong, Eddie
2007-06-20  7:43 Dong, Eddie
     [not found] ` <10EA09EFD8728347A513008B6B0DA77A01A55F0A-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2007-06-22 10:50   ` Avi Kivity
     [not found]     ` <467BA95A.704-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-06-22 14:21       ` Rusty Russell
     [not found]         ` <1182522074.17478.45.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2007-06-23  7:50           ` Avi Kivity
2007-06-25  1:45       ` Dong, Eddie
     [not found]         ` <10EA09EFD8728347A513008B6B0DA77A01AA45AD-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2007-06-25  3:28           ` Avi Kivity
2007-06-27 15:25       ` Dong, Eddie
     [not found]         ` <10EA09EFD8728347A513008B6B0DA77A01AE9255-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2007-07-02 23:53           ` Avi Kivity
     [not found]             ` <46898FE6.7040709-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-07-03  4:45               ` Dong, Eddie
     [not found]                 ` <10EA09EFD8728347A513008B6B0DA77A01B40B4A-wq7ZOvIWXbNpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2007-07-03 14:52                   ` Avi Kivity
2007-07-03  7:19               ` Dong, Eddie

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox