* KVM: Patch series for in-kernel APIC support
@ 2007-04-20 3:09 Gregory Haskins
From: Gregory Haskins @ 2007-04-20 3:09 UTC (permalink / raw)
To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
The following is my patch series for adding in-kernel APIC support. It
supports three "levels" of dynamic configuration (via a new ioctl):
* level 0 = (default) compatibility mode (everything in userspace)
* level 1 = LAPIC in kernel, IOAPIC/i8259 in userspace
* level 2 = All three in kernel
This patch series adds the basic framework for the new PIC models
(level 0) as well as an implementation of level-1.
level-0 is "code complete" and fully tested. I have run this patchset
using existing QEMU on 64-bit Linux and 32-bit XP. Both ran fine with no
discernible difference in behavior.
level-1 is "code complete" and compiles/links error-free, but is otherwise
untested since I still do not have a functioning userspace component. I
include it here for review/feedback purposes.
level-2 is partially implemented downstream in my queue, but I did not include
it here since it is still TBD whether we will ever need it.
Note that the first patch (in-kernel-mmio.patch) is completely unchanged
through the last few rounds of review. However, patches 2-5 are heavily
reworked from the last time, so pay particular attention there. Most notably:
Patch #2: irqdevice changes:
1) pending+read_vector are now combined into one call: ack(). Feedback and
my own discoveries downstream indicated this was a superior design.
2) raise_intr() is now set_intr(), which can address more than one "pin" and
which can assert/de-assert an edge- or level-triggered signal. This
significantly simplified the NMI handling logic (some of which you will
see here in the series) as well as created a much more extensible model
to work with.
3) I merged a previously unpublished patch (deferred-irq.patch) into this
one because it no longer made sense to keep them separate with the new
design. This provides "push/pop" operations for IRQs to better handle
injection failure scenarios.
Patch #3 (preemptible-cpu) you are familiar with, but it has changed slightly
to accommodate the changes in #2.
#4 and #5 are debuting for the first time. Feedback/comments/bugfixes on any
of the code are more than welcome, but I am particularly interested in comments
on the handling of HRTIMERs in the lapic.c code. I ran into a brick wall
with the SLEx 2.6.16 kernel not supporting them fully (which made it worse).
However, the extern-module-compat methodology seemed inadequate to solve the
problem. Please advise if there is a better way to solve it.
From my perspective, this code could be considered for inclusion at this point
(pending review cycles, etc) since it can fully support the existing system.
I will leave it to the powers that be whether they would prefer to see level-1
in action first.
Signed-off-by: Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
* [PATCH 1/5] Adds support for in-kernel mmio handlers
@ 2007-04-20 3:09 ` Gregory Haskins
2007-04-20 3:09 ` [PATCH 2/5] KVM: Add irqdevice object Gregory Haskins
From: Gregory Haskins @ 2007-04-20 3:09 UTC (permalink / raw)
To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
---
drivers/kvm/kvm.h | 31 ++++++++++++++++++
drivers/kvm/kvm_main.c | 82 +++++++++++++++++++++++++++++++++++++++++-------
2 files changed, 101 insertions(+), 12 deletions(-)
diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
index fceeb84..181099f 100644
--- a/drivers/kvm/kvm.h
+++ b/drivers/kvm/kvm.h
@@ -236,6 +236,36 @@ struct kvm_pio_request {
int rep;
};
+struct kvm_io_device {
+ unsigned long (*read)(struct kvm_io_device *this,
+ gpa_t addr,
+ int length);
+ void (*write)(struct kvm_io_device *this,
+ gpa_t addr,
+ int length,
+ unsigned long val);
+ int (*in_range)(struct kvm_io_device *this, gpa_t addr);
+
+ void *private;
+};
+
+/*
+ * It would be nice to use something smarter than a linear search, TBD...
+ * Thankfully we don't expect many devices to register (famous last words :),
+ * so until then it will suffice. At least it's abstracted so we can change
+ * it in one place.
+ */
+struct kvm_io_bus {
+ int dev_count;
+#define NR_IOBUS_DEVS 6
+ struct kvm_io_device *devs[NR_IOBUS_DEVS];
+};
+
+void kvm_io_bus_init(struct kvm_io_bus *bus);
+struct kvm_io_device *kvm_io_bus_find_dev(struct kvm_io_bus *bus, gpa_t addr);
+void kvm_io_bus_register_dev(struct kvm_io_bus *bus,
+ struct kvm_io_device *dev);
+
struct kvm_vcpu {
struct kvm *kvm;
union {
@@ -345,6 +375,7 @@ struct kvm {
unsigned long rmap_overflow;
struct list_head vm_list;
struct file *filp;
+ struct kvm_io_bus mmio_bus;
};
struct kvm_stat {
diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index 4473174..c3c0059 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -294,6 +294,7 @@ static struct kvm *kvm_create_vm(void)
spin_lock_init(&kvm->lock);
INIT_LIST_HEAD(&kvm->active_mmu_pages);
+ kvm_io_bus_init(&kvm->mmio_bus);
for (i = 0; i < KVM_MAX_VCPUS; ++i) {
struct kvm_vcpu *vcpu = &kvm->vcpus[i];
@@ -1015,12 +1016,25 @@ static int emulator_write_std(unsigned long addr,
return X86EMUL_UNHANDLEABLE;
}
+static struct kvm_io_device *vcpu_find_mmio_dev(struct kvm_vcpu *vcpu,
+ gpa_t addr)
+{
+ /*
+ * Note that it's important to have this wrapper function because
+ * in the very near future we will be checking for MMIOs against
+ * the LAPIC as well as the general MMIO bus
+ */
+ return kvm_io_bus_find_dev(&vcpu->kvm->mmio_bus, addr);
+}
+
static int emulator_read_emulated(unsigned long addr,
unsigned long *val,
unsigned int bytes,
struct x86_emulate_ctxt *ctxt)
{
- struct kvm_vcpu *vcpu = ctxt->vcpu;
+ struct kvm_vcpu *vcpu = ctxt->vcpu;
+ struct kvm_io_device *mmio_dev;
+ gpa_t gpa;
if (vcpu->mmio_read_completed) {
memcpy(val, vcpu->mmio_data, bytes);
@@ -1029,18 +1043,26 @@ static int emulator_read_emulated(unsigned long addr,
} else if (emulator_read_std(addr, val, bytes, ctxt)
== X86EMUL_CONTINUE)
return X86EMUL_CONTINUE;
- else {
- gpa_t gpa = vcpu->mmu.gva_to_gpa(vcpu, addr);
- if (gpa == UNMAPPED_GVA)
- return X86EMUL_PROPAGATE_FAULT;
- vcpu->mmio_needed = 1;
- vcpu->mmio_phys_addr = gpa;
- vcpu->mmio_size = bytes;
- vcpu->mmio_is_write = 0;
+ gpa = vcpu->mmu.gva_to_gpa(vcpu, addr);
+ if (gpa == UNMAPPED_GVA)
+ return X86EMUL_PROPAGATE_FAULT;
- return X86EMUL_UNHANDLEABLE;
+ /*
+ * Is this MMIO handled locally?
+ */
+ mmio_dev = vcpu_find_mmio_dev(vcpu, gpa);
+ if (mmio_dev) {
+ *val = mmio_dev->read(mmio_dev, gpa, bytes);
+ return X86EMUL_CONTINUE;
}
+
+ vcpu->mmio_needed = 1;
+ vcpu->mmio_phys_addr = gpa;
+ vcpu->mmio_size = bytes;
+ vcpu->mmio_is_write = 0;
+
+ return X86EMUL_UNHANDLEABLE;
}
static int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
@@ -1068,8 +1090,9 @@ static int emulator_write_emulated(unsigned long addr,
unsigned int bytes,
struct x86_emulate_ctxt *ctxt)
{
- struct kvm_vcpu *vcpu = ctxt->vcpu;
- gpa_t gpa = vcpu->mmu.gva_to_gpa(vcpu, addr);
+ struct kvm_vcpu *vcpu = ctxt->vcpu;
+ struct kvm_io_device *mmio_dev;
+ gpa_t gpa = vcpu->mmu.gva_to_gpa(vcpu, addr);
if (gpa == UNMAPPED_GVA)
return X86EMUL_PROPAGATE_FAULT;
@@ -1077,6 +1100,15 @@ static int emulator_write_emulated(unsigned long addr,
if (emulator_write_phys(vcpu, gpa, val, bytes))
return X86EMUL_CONTINUE;
+ /*
+ * Is this MMIO handled locally?
+ */
+ mmio_dev = vcpu_find_mmio_dev(vcpu, gpa);
+ if (mmio_dev) {
+ mmio_dev->write(mmio_dev, gpa, bytes, val);
+ return X86EMUL_CONTINUE;
+ }
+
vcpu->mmio_needed = 1;
vcpu->mmio_phys_addr = gpa;
vcpu->mmio_size = bytes;
@@ -2911,6 +2943,32 @@ static int kvm_cpu_hotplug(struct notifier_block *notifier, unsigned long val,
return NOTIFY_OK;
}
+void kvm_io_bus_init(struct kvm_io_bus *bus)
+{
+ memset(bus, 0, sizeof(*bus));
+}
+
+struct kvm_io_device *kvm_io_bus_find_dev(struct kvm_io_bus *bus, gpa_t addr)
+{
+ int i;
+
+ for (i = 0; i < bus->dev_count; i++) {
+ struct kvm_io_device *pos = bus->devs[i];
+
+ if (pos->in_range(pos, addr))
+ return pos;
+ }
+
+ return NULL;
+}
+
+void kvm_io_bus_register_dev(struct kvm_io_bus *bus, struct kvm_io_device *dev)
+{
+ BUG_ON(bus->dev_count > (NR_IOBUS_DEVS-1));
+
+ bus->devs[bus->dev_count++] = dev;
+}
+
static struct notifier_block kvm_cpu_notifier = {
.notifier_call = kvm_cpu_hotplug,
.priority = 20, /* must be > scheduler priority */
* [PATCH 2/5] KVM: Add irqdevice object
2007-04-20 3:09 ` [PATCH 1/5] Adds support for in-kernel mmio handlers Gregory Haskins
@ 2007-04-20 3:09 ` Gregory Haskins
2007-04-20 3:09 ` [PATCH 3/5] KVM: Adds ability to preempt an executing VCPU Gregory Haskins
From: Gregory Haskins @ 2007-04-20 3:09 UTC (permalink / raw)
To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
The current code is geared toward using a user-mode (A)PIC. This patch adds
an "irqdevice" abstraction, and implements a "userint" model to handle the
duties of the original code. Later, we can develop other irqdevice models
to handle objects like the LAPIC, IOAPIC, i8259, etc., as appropriate.
Signed-off-by: Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
---
drivers/kvm/Makefile | 2
drivers/kvm/irqdevice.h | 174 ++++++++++++++++++++++++++++++++++++++++
drivers/kvm/kvm.h | 107 ++++++++++++++++++++++++
drivers/kvm/kvm_main.c | 85 ++++++++++++++++---
drivers/kvm/svm.c | 140 ++++++++++++++++++++++----------
drivers/kvm/userint.c | 206 +++++++++++++++++++++++++++++++++++++++++++++++
drivers/kvm/vmx.c | 137 +++++++++++++++++++++++--------
7 files changed, 753 insertions(+), 98 deletions(-)
diff --git a/drivers/kvm/Makefile b/drivers/kvm/Makefile
index c0a789f..540afbc 100644
--- a/drivers/kvm/Makefile
+++ b/drivers/kvm/Makefile
@@ -2,7 +2,7 @@
# Makefile for Kernel-based Virtual Machine module
#
-kvm-objs := kvm_main.o mmu.o x86_emulate.o
+kvm-objs := kvm_main.o mmu.o x86_emulate.o userint.o
obj-$(CONFIG_KVM) += kvm.o
kvm-intel-objs = vmx.o
obj-$(CONFIG_KVM_INTEL) += kvm-intel.o
diff --git a/drivers/kvm/irqdevice.h b/drivers/kvm/irqdevice.h
new file mode 100644
index 0000000..caf5f64
--- /dev/null
+++ b/drivers/kvm/irqdevice.h
@@ -0,0 +1,174 @@
+/*
+ * Defines an interface for an abstract interrupt controller. The model
+ * consists of a unit with an arbitrary number of input lines N (IRQ0-(N-1)),
+ * an arbitrary number of output lines (INTR) (LINT, EXTINT, NMI, etc), and
+ * methods for completing an interrupt-acknowledge cycle (INTA). A particular
+ * implementation of this model will define various policies, such as
+ * irq-to-vector translation, INTA/auto-EOI policy, etc.
+ *
+ * In addition, the INTR callback mechanism allows the unit to be "wired" to
+ * an interruptible source in a very flexible manner. For instance, an
+ * irqdevice could have its INTR wired to a VCPU (ala LAPIC), or another
+ * interrupt controller (ala cascaded i8259s)
+ *
+ * Copyright (C) 2007 Novell
+ *
+ * Authors:
+ * Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef __IRQDEVICE_H
+#define __IRQDEVICE_H
+
+struct kvm_irqdevice;
+
+typedef enum {
+ kvm_irqpin_localint,
+ kvm_irqpin_extint,
+ kvm_irqpin_smi,
+ kvm_irqpin_nmi,
+ kvm_irqpin_invalid, /* must always be last */
+} kvm_irqpin_t;
+
+#define KVM_IRQACK_VALID (1 << 0)
+#define KVM_IRQACK_AGAIN (1 << 1)
+#define KVM_IRQACK_TPRMASK (1 << 2)
+
+struct kvm_irqsink {
+ void (*set_intr)(struct kvm_irqsink *this,
+ struct kvm_irqdevice *dev,
+ kvm_irqpin_t pin, int trigger, int value);
+
+ void *private;
+};
+
+struct kvm_irqdevice {
+ int (*ack)(struct kvm_irqdevice *this, int *vector);
+ int (*set_pin)(struct kvm_irqdevice *this, int pin, int level);
+ int (*summary)(struct kvm_irqdevice *this, void *data);
+ void (*destructor)(struct kvm_irqdevice *this);
+
+ void *private;
+ struct kvm_irqsink sink;
+};
+
+/**
+ * kvm_irqdevice_init - initialize the kvm_irqdevice for use
+ * @dev: The device
+ *
+ * Description: Initialize the kvm_irqdevice for use. Should be called before
+ * calling any derived implementation init functions
+ *
+ * Returns: (void)
+ */
+static inline void kvm_irqdevice_init(struct kvm_irqdevice *dev)
+{
+ memset(dev, 0, sizeof(*dev));
+}
+
+/**
+ * kvm_irqdevice_ack - read and ack the highest priority vector from the device
+ * @dev: The device
+ * @vector: Retrieves the highest priority pending vector
+ * [ NULL = Don't ack a vector, just check pending status]
+ * [ non-NULL = Pointer to receive vector data (out only)]
+ *
+ * Description: Read the highest priority pending vector from the device,
+ * potentially invoking auto-EOI depending on device policy
+ *
+ * Returns: (int)
+ * [ -1 = failure]
+ * [>=0 = bitmap as follows: ]
+ * [ KVM_IRQACK_VALID = vector is valid]
+ * [ KVM_IRQACK_AGAIN = more unmasked vectors are available]
+ * [ KVM_IRQACK_TPRMASK = TPR masked vectors are blocked]
+ */
+static inline int kvm_irqdevice_ack(struct kvm_irqdevice *dev,
+ int *vector)
+{
+ return dev->ack(dev, vector);
+}
+
+/**
+ * kvm_irqdevice_set_pin - allows the caller to assert/deassert an IRQ
+ * @dev: The device
+ * @pin: The input pin to alter
+ * @level: The value to set (1 = assert, 0 = deassert)
+ *
+ * Description: Allows the caller to assert/deassert an IRQ input pin to the
+ * device according to device policy.
+ *
+ * Returns: (int)
+ * [-1 = failure]
+ * [ 0 = success]
+ */
+static inline int kvm_irqdevice_set_pin(struct kvm_irqdevice *dev, int pin,
+ int level)
+{
+ return dev->set_pin(dev, pin, level);
+}
+
+/**
+ * kvm_irqdevice_summary - loads a summary bitmask
+ * @dev: The device
+ * @data: A pointer to a region capable of holding a 256 bit bitmap
+ *
+ * Description: Loads a summary bitmask of all pending vectors (0-255)
+ *
+ * Returns: (int)
+ * [-1 = failure]
+ * [ 0 = success]
+ */
+static inline int kvm_irqdevice_summary(struct kvm_irqdevice *dev, void *data)
+{
+ return dev->summary(dev, data);
+}
+
+/**
+ * kvm_irqdevice_register_sink - registers a kvm_irqsink object
+ * @dev: The device
+ * @sink: The sink to register. Data will be copied so building object from
+ * transient storage is ok.
+ *
+ * Description: Registers a kvm_irqsink object as an INTR callback
+ *
+ * Returns: (void)
+ */
+static inline void kvm_irqdevice_register_sink(struct kvm_irqdevice *dev,
+ const struct kvm_irqsink *sink)
+{
+ dev->sink = *sink;
+}
+
+/**
+ * kvm_irqdevice_set_intr - invokes a registered INTR callback
+ * @dev: The device
+ * @pin: Identifies the pin to alter -
+ * [ KVM_IRQPIN_LOCALINT (default) - a vector is pending on this
+ * device]
+ * [ KVM_IRQPIN_EXTINT - a vector is pending on an external device]
+ * [ KVM_IRQPIN_SMI - system-management-interrupt pin]
+ * [ KVM_IRQPIN_NMI - non-maskable-interrupt pin]
+ * @trigger: sensitivity [0 = edge, 1 = level]
+ * @val: [0 = deassert (ignored for edge-trigger), 1 = assert]
+ *
+ * Description: Invokes a registered INTR callback (if present). This
+ * function is meant to be used privately by an irqdevice
+ * implementation.
+ *
+ * Returns: (void)
+ */
+static inline void kvm_irqdevice_set_intr(struct kvm_irqdevice *dev,
+ kvm_irqpin_t pin, int trigger,
+ int val)
+{
+ struct kvm_irqsink *sink = &dev->sink;
+ if (sink->set_intr)
+ sink->set_intr(sink, dev, pin, trigger, val);
+}
+
+#endif /* __IRQDEVICE_H */
diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
index 181099f..ef8f986 100644
--- a/drivers/kvm/kvm.h
+++ b/drivers/kvm/kvm.h
@@ -13,6 +13,7 @@
#include <linux/mm.h>
#include "vmx.h"
+#include "irqdevice.h"
#include <linux/kvm.h>
#include <linux/kvm_para.h>
@@ -157,6 +158,9 @@ struct vmcs {
struct kvm_vcpu;
+int kvm_user_irqdev_init(struct kvm_irqdevice *dev);
+int kvm_userint_init(struct kvm_vcpu *vcpu);
+
/*
* x86 supports 3 paging modes (4-level 64-bit, 3-level 64-bit, and 2-level
* 32-bit). The kvm_mmu structure abstracts the details of the current mmu
@@ -266,6 +270,19 @@ struct kvm_io_device *kvm_io_bus_find_dev(struct kvm_io_bus *bus, gpa_t addr);
void kvm_io_bus_register_dev(struct kvm_io_bus *bus,
struct kvm_io_device *dev);
+#define NR_IRQ_WORDS KVM_IRQ_BITMAP_SIZE(unsigned long)
+
+/*
+ * structure for maintaining info for interrupting an executing VCPU
+ */
+struct kvm_vcpu_irq {
+ spinlock_t lock;
+ struct kvm_irqdevice dev;
+ unsigned long pending;
+ unsigned long trigger;
+ int deferred;
+};
+
struct kvm_vcpu {
struct kvm *kvm;
union {
@@ -278,9 +295,7 @@ struct kvm_vcpu {
u64 host_tsc;
struct kvm_run *run;
int interrupt_window_open;
- unsigned long irq_summary; /* bit vector: 1 per word in irq_pending */
-#define NR_IRQ_WORDS KVM_IRQ_BITMAP_SIZE(unsigned long)
- unsigned long irq_pending[NR_IRQ_WORDS];
+ struct kvm_vcpu_irq irq;
unsigned long regs[NR_VCPU_REGS]; /* for rsp: vcpu_load_rsp_rip() */
unsigned long rip; /* needs vcpu_load_rsp_rip() */
@@ -343,6 +358,92 @@ struct kvm_vcpu {
struct kvm_cpuid_entry cpuid_entries[KVM_MAX_CPUID_ENTRIES];
};
+/*
+ * Assumes lock already held
+ */
+static inline int __kvm_vcpu_irq_all_pending(struct kvm_vcpu *vcpu)
+{
+ unsigned long pending = vcpu->irq.pending;
+
+ if (vcpu->irq.deferred != -1)
+ __set_bit(kvm_irqpin_localint, &pending);
+
+ return pending;
+}
+
+/*
+ * These two functions are helpers for determining if a standard interrupt
+ * is pending to replace the old "if (vcpu->irq_summary)" logic. If the
+ * caller wants to know about some of the new advanced interrupt types
+ * (SMI, NMI, etc) or to differentiate between localint and extint they will
+ * have to use the new API
+ */
+static inline int __kvm_vcpu_irq_pending(struct kvm_vcpu *vcpu)
+{
+ unsigned long pending = __kvm_vcpu_irq_all_pending(vcpu);
+
+ if (test_bit(kvm_irqpin_localint, &pending) ||
+ test_bit(kvm_irqpin_extint, &pending))
+ return 1;
+
+ return 0;
+}
+
+static inline int kvm_vcpu_irq_pending(struct kvm_vcpu *vcpu)
+{
+ int ret = 0;
+ unsigned long flags;
+
+ spin_lock_irqsave(&vcpu->irq.lock, flags);
+ ret = __kvm_vcpu_irq_pending(vcpu);
+ spin_unlock_irqrestore(&vcpu->irq.lock, flags);
+
+ return ret;
+}
+
+/*
+ * Assumes lock already held
+ */
+static inline int kvm_vcpu_irq_pop(struct kvm_vcpu *vcpu, int *vector)
+{
+ int ret = 0;
+
+ if (vcpu->irq.deferred != -1) {
+ if (vector) {
+ ret |= KVM_IRQACK_VALID;
+ *vector = vcpu->irq.deferred;
+ vcpu->irq.deferred = -1;
+ }
+ ret |= kvm_irqdevice_ack(&vcpu->irq.dev, NULL);
+ } else
+ ret = kvm_irqdevice_ack(&vcpu->irq.dev, vector);
+
+ /*
+ * If there are no more interrupts and we are edge triggered,
+ * we must clear the status flag
+ */
+ if (!(ret & KVM_IRQACK_AGAIN))
+ __clear_bit(kvm_irqpin_localint, &vcpu->irq.pending);
+
+ return ret;
+}
+
+static inline void __kvm_vcpu_irq_push(struct kvm_vcpu *vcpu, int irq)
+{
+ BUG_ON(vcpu->irq.deferred != -1); /* We can only hold one deferred */
+
+ vcpu->irq.deferred = irq;
+}
+
+static inline void kvm_vcpu_irq_push(struct kvm_vcpu *vcpu, int irq)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&vcpu->irq.lock, flags);
+ __kvm_vcpu_irq_push(vcpu, irq);
+ spin_unlock_irqrestore(&vcpu->irq.lock, flags);
+}
+
struct kvm_mem_alias {
gfn_t base_gfn;
unsigned long npages;
diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index c3c0059..32d456d 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -299,6 +299,11 @@ static struct kvm *kvm_create_vm(void)
struct kvm_vcpu *vcpu = &kvm->vcpus[i];
mutex_init(&vcpu->mutex);
+
+ memset(&vcpu->irq, 0, sizeof(vcpu->irq));
+ spin_lock_init(&vcpu->irq.lock);
+ vcpu->irq.deferred = -1;
+
vcpu->cpu = -1;
vcpu->kvm = kvm;
vcpu->mmu.root_hpa = INVALID_PAGE;
@@ -1989,8 +1994,7 @@ static int kvm_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu,
sregs->efer = vcpu->shadow_efer;
sregs->apic_base = vcpu->apic_base;
- memcpy(sregs->interrupt_bitmap, vcpu->irq_pending,
- sizeof sregs->interrupt_bitmap);
+ kvm_irqdevice_summary(&vcpu->irq.dev, &sregs->interrupt_bitmap);
vcpu_put(vcpu);
@@ -2044,13 +2048,17 @@ static int kvm_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
if (mmu_reset_needed)
kvm_mmu_reset_context(vcpu);
- memcpy(vcpu->irq_pending, sregs->interrupt_bitmap,
- sizeof vcpu->irq_pending);
- vcpu->irq_summary = 0;
- for (i = 0; i < NR_IRQ_WORDS; ++i)
- if (vcpu->irq_pending[i])
- __set_bit(i, &vcpu->irq_summary);
-
+ /*
+ * walk the interrupt-bitmap and inject an IRQ for each bit found
+ *
+ * note that we skip the first 16 vectors since they are reserved
+ * and should never be set by an interrupt source
+ */
+ for (i = 16; i < 256; ++i) {
+ int val = test_bit(i, &sregs->interrupt_bitmap[0]);
+ kvm_irqdevice_set_pin(&vcpu->irq.dev, i, val);
+ }
+
set_segment(vcpu, &sregs->cs, VCPU_SREG_CS);
set_segment(vcpu, &sregs->ds, VCPU_SREG_DS);
set_segment(vcpu, &sregs->es, VCPU_SREG_ES);
@@ -2210,14 +2218,8 @@ static int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu,
{
if (irq->irq < 0 || irq->irq >= 256)
return -EINVAL;
- vcpu_load(vcpu);
-
- set_bit(irq->irq, vcpu->irq_pending);
- set_bit(irq->irq / BITS_PER_LONG, &vcpu->irq_summary);
- vcpu_put(vcpu);
-
- return 0;
+ return kvm_irqdevice_set_pin(&vcpu->irq.dev, irq->irq, 1);
}
static int kvm_vcpu_ioctl_debug_guest(struct kvm_vcpu *vcpu,
@@ -2319,6 +2321,51 @@ out1:
}
/*
+ * This function will be invoked whenever the vcpu->irq.dev raises its INTR
+ * line
+ */
+static void kvm_vcpu_intr(struct kvm_irqsink *this,
+ struct kvm_irqdevice *dev,
+ kvm_irqpin_t pin, int trigger, int val)
+{
+ struct kvm_vcpu *vcpu = (struct kvm_vcpu*)this->private;
+ unsigned long flags;
+
+ spin_lock_irqsave(&vcpu->irq.lock, flags);
+
+ if (val && !test_bit(pin, &vcpu->irq.pending)) {
+ /*
+ * if the line is being asserted and we currently have
+ * it deasserted, we must record
+ */
+ __set_bit(pin, &vcpu->irq.pending);
+
+ if (trigger)
+ __set_bit(pin, &vcpu->irq.trigger);
+ else
+ __clear_bit(pin, &vcpu->irq.trigger);
+
+ } else if (!val && trigger)
+ /*
+ * if the level-sensitive line is being deasserted,
+ * record it.
+ */
+ __clear_bit(pin, &vcpu->irq.pending);
+
+ spin_unlock_irqrestore(&vcpu->irq.lock, flags);
+}
+
+static void kvm_vcpu_irqsink_init(struct kvm_vcpu *vcpu)
+{
+ struct kvm_irqsink sink = {
+ .set_intr = kvm_vcpu_intr,
+ .private = vcpu
+ };
+
+ kvm_irqdevice_register_sink(&vcpu->irq.dev, &sink);
+}
+
+/*
* Creates some virtual cpus. Good luck creating more than one.
*/
static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, int n)
@@ -2364,6 +2411,12 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, int n)
if (r < 0)
goto out_free_vcpus;
+ kvm_irqdevice_init(&vcpu->irq.dev);
+ kvm_vcpu_irqsink_init(vcpu);
+ r = kvm_userint_init(vcpu);
+ if (r < 0)
+ goto out_free_vcpus;
+
kvm_arch_ops->vcpu_load(vcpu);
r = kvm_mmu_setup(vcpu);
if (r >= 0)
diff --git a/drivers/kvm/svm.c b/drivers/kvm/svm.c
index b7e1410..b6b96cb 100644
--- a/drivers/kvm/svm.c
+++ b/drivers/kvm/svm.c
@@ -106,24 +106,6 @@ static unsigned get_addr_size(struct kvm_vcpu *vcpu)
(cs_attrib & SVM_SELECTOR_DB_MASK) ? 4 : 2;
}
-static inline u8 pop_irq(struct kvm_vcpu *vcpu)
-{
- int word_index = __ffs(vcpu->irq_summary);
- int bit_index = __ffs(vcpu->irq_pending[word_index]);
- int irq = word_index * BITS_PER_LONG + bit_index;
-
- clear_bit(bit_index, &vcpu->irq_pending[word_index]);
- if (!vcpu->irq_pending[word_index])
- clear_bit(word_index, &vcpu->irq_summary);
- return irq;
-}
-
-static inline void push_irq(struct kvm_vcpu *vcpu, u8 irq)
-{
- set_bit(irq, vcpu->irq_pending);
- set_bit(irq / BITS_PER_LONG, &vcpu->irq_summary);
-}
-
static inline void clgi(void)
{
asm volatile (SVM_CLGI);
@@ -892,7 +874,12 @@ static int pf_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
int r;
if (is_external_interrupt(exit_int_info))
- push_irq(vcpu, exit_int_info & SVM_EVTINJ_VEC_MASK);
+ /*
+ * An exception was taken while we were trying to inject an
+ * IRQ. We must defer the injection of the vector until
+ * the next window.
+ */
+ kvm_vcpu_irq_push(vcpu, exit_int_info & SVM_EVTINJ_VEC_MASK);
spin_lock(&vcpu->kvm->lock);
@@ -1092,7 +1079,7 @@ static int halt_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
{
vcpu->svm->next_rip = vcpu->svm->vmcb->save.rip + 1;
skip_emulated_instruction(vcpu);
- if (vcpu->irq_summary)
+ if (kvm_vcpu_irq_pending(vcpu))
return 1;
kvm_run->exit_reason = KVM_EXIT_HLT;
@@ -1263,7 +1250,7 @@ static int interrupt_window_interception(struct kvm_vcpu *vcpu,
* possible
*/
if (kvm_run->request_interrupt_window &&
- !vcpu->irq_summary) {
+ !kvm_vcpu_irq_pending(vcpu)) {
++kvm_stat.irq_window_exits;
kvm_run->exit_reason = KVM_EXIT_IRQ_WINDOW_OPEN;
return 0;
@@ -1366,60 +1353,121 @@ static void pre_svm_run(struct kvm_vcpu *vcpu)
}
-static inline void kvm_do_inject_irq(struct kvm_vcpu *vcpu)
-{
- struct vmcb_control_area *control;
-
- control = &vcpu->svm->vmcb->control;
- control->int_vector = pop_irq(vcpu);
- control->int_ctl &= ~V_INTR_PRIO_MASK;
- control->int_ctl |= V_IRQ_MASK |
- ((/*control->int_vector >> 4*/ 0xf) << V_INTR_PRIO_SHIFT);
-}
-
static void kvm_reput_irq(struct kvm_vcpu *vcpu)
{
struct vmcb_control_area *control = &vcpu->svm->vmcb->control;
if (control->int_ctl & V_IRQ_MASK) {
control->int_ctl &= ~V_IRQ_MASK;
- push_irq(vcpu, control->int_vector);
+ kvm_vcpu_irq_push(vcpu, control->int_vector);
}
vcpu->interrupt_window_open =
!(control->int_state & SVM_INTERRUPT_SHADOW_MASK);
}
-static void do_interrupt_requests(struct kvm_vcpu *vcpu,
- struct kvm_run *kvm_run)
+static int do_intr_requests(struct kvm_vcpu *vcpu,
+ struct kvm_run *kvm_run,
+ kvm_irqpin_t pin)
{
struct vmcb_control_area *control = &vcpu->svm->vmcb->control;
+ int r = 0;
+ int handled = 0;
vcpu->interrupt_window_open =
(!(control->int_state & SVM_INTERRUPT_SHADOW_MASK) &&
(vcpu->svm->vmcb->save.rflags & X86_EFLAGS_IF));
- if (vcpu->interrupt_window_open && vcpu->irq_summary)
+ if (vcpu->interrupt_window_open) {
+ int irq;
+
/*
- * If interrupts enabled, and not blocked by sti or mov ss. Good.
+ * If interrupts enabled, and not blocked by sti or mov ss.
+ * Good.
*/
- kvm_do_inject_irq(vcpu);
+
+ switch (pin) {
+ case kvm_irqpin_localint:
+ r = kvm_vcpu_irq_pop(vcpu, &irq);
+ break;
+ case kvm_irqpin_extint:
+ printk(KERN_WARNING "KVM: external-interrupts not " \
+ "handled yet\n");
+ __clear_bit(pin, &vcpu->irq.pending);
+ break;
+ default:
+ panic("KVM: unknown interrupt pin raised: %d\n", pin);
+ break;
+ }
+
+ BUG_ON(r < 0);
+
+ if (r & KVM_IRQACK_VALID) {
+ control = &vcpu->svm->vmcb->control;
+ control->int_vector = irq;
+ control->int_ctl &= ~V_INTR_PRIO_MASK;
+ control->int_ctl |= V_IRQ_MASK |
+ ((/*control->int_vector >> 4*/ 0xf) <<
+ V_INTR_PRIO_SHIFT);
+
+ handled = 1;
+ }
+ }
/*
* Interrupts blocked. Wait for unblock.
*/
if (!vcpu->interrupt_window_open &&
- (vcpu->irq_summary || kvm_run->request_interrupt_window)) {
+ ((r & KVM_IRQACK_AGAIN) ||
+ __kvm_vcpu_irq_pending(vcpu) ||
+ kvm_run->request_interrupt_window)) {
control->intercept |= 1ULL << INTERCEPT_VINTR;
} else
control->intercept &= ~(1ULL << INTERCEPT_VINTR);
+
+ return handled;
+}
+
+static void do_interrupt_requests(struct kvm_vcpu *vcpu,
+ struct kvm_run *kvm_run)
+{
+ unsigned long pending = __kvm_vcpu_irq_all_pending(vcpu);
+ int handled = 0;
+
+ while (pending && !handled) {
+ kvm_irqpin_t pin = __fls(pending);
+
+ switch (pin) {
+ case kvm_irqpin_localint:
+ case kvm_irqpin_extint:
+ handled = do_intr_requests(vcpu, kvm_run, pin);
+ break;
+ case kvm_irqpin_smi:
+ case kvm_irqpin_nmi:
+ /* ignored (for now) */
+ printk(KERN_WARNING
+ "KVM: dropping unhandled SMI/NMI: %d\n",
+ pin);
+ __clear_bit(pin, &vcpu->irq.pending);
+ break;
+ case kvm_irqpin_invalid:
+ /* drop */
+ break;
+ default:
+ panic("KVM: unknown interrupt pin raised: %d\n", pin);
+ break;
+ }
+
+ __clear_bit(pin, &pending);
+ }
}
static void post_kvm_run_save(struct kvm_vcpu *vcpu,
struct kvm_run *kvm_run)
{
- kvm_run->ready_for_interrupt_injection = (vcpu->interrupt_window_open &&
- vcpu->irq_summary == 0);
+ kvm_run->ready_for_interrupt_injection =
+ (vcpu->interrupt_window_open &&
+ !kvm_vcpu_irq_pending(vcpu));
kvm_run->if_flag = (vcpu->svm->vmcb->save.rflags & X86_EFLAGS_IF) != 0;
kvm_run->cr8 = vcpu->cr8;
kvm_run->apic_base = vcpu->apic_base;
@@ -1434,7 +1482,7 @@ static void post_kvm_run_save(struct kvm_vcpu *vcpu,
static int dm_request_for_irq_injection(struct kvm_vcpu *vcpu,
struct kvm_run *kvm_run)
{
- return (!vcpu->irq_summary &&
+ return (!kvm_vcpu_irq_pending(vcpu) &&
kvm_run->request_interrupt_window &&
vcpu->interrupt_window_open &&
(vcpu->svm->vmcb->save.rflags & X86_EFLAGS_IF));
@@ -1464,9 +1512,17 @@ static int svm_vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
int r;
again:
+ spin_lock(&vcpu->irq.lock);
+
+ /*
+ * We must inject interrupts (if any) while the irq_lock
+ * is held
+ */
if (!vcpu->mmio_read_completed)
do_interrupt_requests(vcpu, kvm_run);
+ spin_unlock(&vcpu->irq.lock);
+
clgi();
pre_svm_run(vcpu);
diff --git a/drivers/kvm/userint.c b/drivers/kvm/userint.c
new file mode 100644
index 0000000..d4385d6
--- /dev/null
+++ b/drivers/kvm/userint.c
@@ -0,0 +1,206 @@
+/*
+ * User Interrupts IRQ device
+ *
+ * This acts as an extension of an interrupt controller that exists elsewhere
+ * (typically in userspace/QEMU). Because this PIC is a pseudo device that
+ * is downstream from a real emulated PIC, the "IRQ-to-vector" mapping has
+ * already occurred. Therefore, this PIC has the following unusual properties:
+ *
+ * 1) It has 256 "pins" which are literal vectors (i.e. no translation)
+ * 2) It only supports "auto-EOI" behavior since it is expected that the
+ * upstream emulated PIC will handle the real EOIs (if applicable)
+ * 3) It only listens to "asserts" on the pins (deasserts are dropped)
+ * because it's an auto-EOI device anyway.
+ *
+ * Copyright (C) 2007 Novell
+ *
+ * bitarray code based on original vcpu->irq_pending code,
+ * Copyright (C) 2007 Qumranet
+ *
+ * Authors:
+ * Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include "kvm.h"
+
+/*
+ *----------------------------------------------------------------------
+ * optimized bitarray object - works like bitarrays in bitops, but uses
+ * a summary field to accelerate lookups. Assumes external locking
+ *---------------------------------------------------------------------
+ */
+
+struct bitarray {
+ unsigned long summary; /* 1 per word in pending */
+ unsigned long pending[NR_IRQ_WORDS];
+};
+
+static inline int bitarray_pending(struct bitarray *this)
+{
+ return this->summary ? 1 : 0;
+}
+
+static inline int bitarray_findhighest(struct bitarray *this)
+{
+ if (!this->summary)
+ return -1;
+ else {
+ int word_index = __fls(this->summary);
+ int bit_index = __fls(this->pending[word_index]);
+
+ return word_index * BITS_PER_LONG + bit_index;
+ }
+}
+
+static inline void bitarray_set(struct bitarray *this, int nr)
+{
+ __set_bit(nr, this->pending);
+ __set_bit(nr / BITS_PER_LONG, &this->summary);
+}
+
+static inline void bitarray_clear(struct bitarray *this, int nr)
+{
+ int word = nr / BITS_PER_LONG;
+
+ __clear_bit(nr, this->pending);
+ if (!this->pending[word])
+ __clear_bit(word, &this->summary);
+}
+
+static inline int bitarray_test(struct bitarray *this, int nr)
+{
+ return test_bit(nr, this->pending);
+}
+
+static inline int bitarray_test_and_set(struct bitarray *this, int nr, int val)
+{
+ if (bitarray_test(this, nr) != val) {
+ if (val)
+ bitarray_set(this, nr);
+ else
+ bitarray_clear(this, nr);
+ return 1;
+ }
+
+ return 0;
+}
+
+/*
+ *----------------------------------------------------------------------
+ * userint interface - provides the actual kvm_irqdevice implementation
+ *---------------------------------------------------------------------
+ */
+
+struct kvm_user_irqdev {
+ spinlock_t lock;
+ atomic_t ref_count;
+ struct bitarray pending;
+};
+
+static int user_irqdev_ack(struct kvm_irqdevice *this, int *vector)
+{
+ struct kvm_user_irqdev *s = (struct kvm_user_irqdev*)this->private;
+ int irq;
+ int ret = 0;
+
+ spin_lock(&s->lock);
+
+ if (vector) {
+ irq = bitarray_findhighest(&s->pending);
+
+ if (irq > -1) {
+ /*
+ * Automatically clear the interrupt as the EOI
+ * mechanism (if any) will take place in userspace
+ */
+ bitarray_clear(&s->pending, irq);
+
+ ret |= KVM_IRQACK_VALID;
+ }
+
+ *vector = irq;
+ }
+
+ if (bitarray_pending(&s->pending))
+ ret |= KVM_IRQACK_AGAIN;
+
+ spin_unlock(&s->lock);
+
+ return ret;
+}
+
+static int user_irqdev_set_pin(struct kvm_irqdevice* this, int irq, int level)
+{
+ struct kvm_user_irqdev *s = (struct kvm_user_irqdev*)this->private;
+ int forward = 0;
+
+ /*
+ * FIXME: We shouldn't allow HW vectors to come through here
+ */
+ spin_lock(&s->lock);
+ forward = bitarray_test_and_set(&s->pending, irq, level);
+ spin_unlock(&s->lock);
+
+ /*
+ * alert the higher layer software we have changes
+ */
+ if (forward)
+ kvm_irqdevice_set_intr(this, kvm_irqpin_localint, 0,
+ bitarray_pending(&s->pending));
+
+ return 0;
+}
+
+static int user_irqdev_summary(struct kvm_irqdevice* this, void *data)
+{
+ struct kvm_user_irqdev *s = (struct kvm_user_irqdev*)this->private;
+
+ spin_lock(&s->lock);
+ memcpy(data, s->pending.pending, sizeof s->pending.pending);
+ spin_unlock(&s->lock);
+
+ return 0;
+}
+
+static void user_irqdev_dropref(struct kvm_user_irqdev *s)
+{
+ if (atomic_dec_and_test(&s->ref_count))
+ kfree(s);
+}
+
+static void user_irqdev_destructor(struct kvm_irqdevice *this)
+{
+ struct kvm_user_irqdev *s = (struct kvm_user_irqdev*)this->private;
+ user_irqdev_dropref(s);
+}
+
+int kvm_user_irqdev_init(struct kvm_irqdevice *irqdev)
+{
+ struct kvm_user_irqdev *s;
+
+ s = kzalloc(sizeof(*s), GFP_KERNEL);
+ if (!s)
+ return -ENOMEM;
+
+ spin_lock_init(&s->lock);
+
+ irqdev->ack = user_irqdev_ack;
+ irqdev->set_pin = user_irqdev_set_pin;
+ irqdev->summary = user_irqdev_summary;
+ irqdev->destructor = user_irqdev_destructor;
+
+ irqdev->private = s;
+ atomic_inc(&s->ref_count);
+
+ return 0;
+}
+
+int kvm_userint_init(struct kvm_vcpu *vcpu)
+{
+ return kvm_user_irqdev_init(&vcpu->irq.dev);
+}
+
diff --git a/drivers/kvm/vmx.c b/drivers/kvm/vmx.c
index 61a6116..afd2d69 100644
--- a/drivers/kvm/vmx.c
+++ b/drivers/kvm/vmx.c
@@ -1217,45 +1217,61 @@ static void inject_rmode_irq(struct kvm_vcpu *vcpu, int irq)
vmcs_writel(GUEST_RSP, (vmcs_readl(GUEST_RSP) & ~0xffff) | (sp - 6));
}
-static void kvm_do_inject_irq(struct kvm_vcpu *vcpu)
-{
- int word_index = __ffs(vcpu->irq_summary);
- int bit_index = __ffs(vcpu->irq_pending[word_index]);
- int irq = word_index * BITS_PER_LONG + bit_index;
-
- clear_bit(bit_index, &vcpu->irq_pending[word_index]);
- if (!vcpu->irq_pending[word_index])
- clear_bit(word_index, &vcpu->irq_summary);
-
- if (vcpu->rmode.active) {
- inject_rmode_irq(vcpu, irq);
- return;
- }
- vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
- irq | INTR_TYPE_EXT_INTR | INTR_INFO_VALID_MASK);
-}
-
-
-static void do_interrupt_requests(struct kvm_vcpu *vcpu,
- struct kvm_run *kvm_run)
+static int do_intr_requests(struct kvm_vcpu *vcpu,
+ struct kvm_run *kvm_run,
+ kvm_irqpin_t pin)
{
u32 cpu_based_vm_exec_control;
-
+ int r = 0;
+ int handled = 0;
+
vcpu->interrupt_window_open =
((vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & 3) == 0);
if (vcpu->interrupt_window_open &&
- vcpu->irq_summary &&
- !(vmcs_read32(VM_ENTRY_INTR_INFO_FIELD) & INTR_INFO_VALID_MASK))
+ !(vmcs_read32(VM_ENTRY_INTR_INFO_FIELD) & INTR_INFO_VALID_MASK)) {
+ int irq;
+
/*
- * If interrupts enabled, and not blocked by sti or mov ss. Good.
+ * If interrupts enabled, and not blocked by sti or mov ss.
+ * Good.
*/
- kvm_do_inject_irq(vcpu);
+ switch (pin) {
+ case kvm_irqpin_localint:
+ r = kvm_vcpu_irq_pop(vcpu, &irq);
+ break;
+ case kvm_irqpin_extint:
+ printk(KERN_WARNING "KVM: external-interrupts not " \
+ "handled yet\n");
+ __clear_bit(pin, &vcpu->irq.pending);
+ break;
+ default:
+ panic("KVM: unknown interrupt pin raised: %d\n", pin);
+ break;
+ }
+
+ BUG_ON(r < 0);
+
+ if (r & KVM_IRQACK_VALID) {
+ if (vcpu->rmode.active)
+ inject_rmode_irq(vcpu, irq);
+ else
+ vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
+ irq |
+ INTR_TYPE_EXT_INTR |
+ INTR_INFO_VALID_MASK);
+
+ handled = 1;
+ }
+ }
+
cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
if (!vcpu->interrupt_window_open &&
- (vcpu->irq_summary || kvm_run->request_interrupt_window))
+ ((r & KVM_IRQACK_AGAIN) ||
+ __kvm_vcpu_irq_pending(vcpu) ||
+ kvm_run->request_interrupt_window))
/*
* Interrupts blocked. Wait for unblock.
*/
@@ -1263,6 +1279,42 @@ static void do_interrupt_requests(struct kvm_vcpu *vcpu,
else
cpu_based_vm_exec_control &= ~CPU_BASED_VIRTUAL_INTR_PENDING;
vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control);
+
+ return handled;
+}
+
+static void do_interrupt_requests(struct kvm_vcpu *vcpu,
+ struct kvm_run *kvm_run)
+{
+ int pending = __kvm_vcpu_irq_all_pending(vcpu);
+ int handled = 0;
+
+ while (pending && !handled) {
+ kvm_irqpin_t pin = __fls(pending);
+
+ switch (pin) {
+ case kvm_irqpin_localint:
+ case kvm_irqpin_extint:
+ handled = do_intr_requests(vcpu, kvm_run, pin);
+ break;
+ case kvm_irqpin_smi:
+ case kvm_irqpin_nmi:
+ /* ignored (for now) */
+ printk(KERN_WARNING
+ "KVM: dropping unhandled SMI/NMI: %d\n",
+ pin);
+ __clear_bit(pin, &vcpu->irq.pending);
+ break;
+ case kvm_irqpin_invalid:
+ /* drop */
+ break;
+ default:
+ panic("KVM: unknown interrupt pin raised: %d\n", pin);
+ break;
+ }
+
+ __clear_bit(pin, &pending);
+ }
}
static void kvm_guest_debug_pre(struct kvm_vcpu *vcpu)
@@ -1313,9 +1365,13 @@ static int handle_exception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
}
if (is_external_interrupt(vect_info)) {
+ /*
+ * An exception was taken while we were trying to inject an
+ * IRQ. We must defer the injection of the vector until
+ * the next window.
+ */
int irq = vect_info & VECTORING_INFO_VECTOR_MASK;
- set_bit(irq, vcpu->irq_pending);
- set_bit(irq / BITS_PER_LONG, &vcpu->irq_summary);
+ kvm_vcpu_irq_push(vcpu, irq);
}
if ((intr_info & INTR_INFO_INTR_TYPE_MASK) == 0x200) { /* nmi */
@@ -1619,8 +1675,9 @@ static void post_kvm_run_save(struct kvm_vcpu *vcpu,
kvm_run->if_flag = (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) != 0;
kvm_run->cr8 = vcpu->cr8;
kvm_run->apic_base = vcpu->apic_base;
- kvm_run->ready_for_interrupt_injection = (vcpu->interrupt_window_open &&
- vcpu->irq_summary == 0);
+ kvm_run->ready_for_interrupt_injection =
+ (vcpu->interrupt_window_open &&
+ !kvm_vcpu_irq_pending(vcpu));
}
static int handle_interrupt_window(struct kvm_vcpu *vcpu,
@@ -1631,7 +1688,7 @@ static int handle_interrupt_window(struct kvm_vcpu *vcpu,
* possible
*/
if (kvm_run->request_interrupt_window &&
- !vcpu->irq_summary) {
+ !kvm_vcpu_irq_pending(vcpu)) {
kvm_run->exit_reason = KVM_EXIT_IRQ_WINDOW_OPEN;
++kvm_stat.irq_window_exits;
return 0;
@@ -1642,7 +1699,7 @@ static int handle_interrupt_window(struct kvm_vcpu *vcpu,
static int handle_halt(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
{
skip_emulated_instruction(vcpu);
- if (vcpu->irq_summary)
+ if (kvm_vcpu_irq_pending(vcpu))
return 1;
kvm_run->exit_reason = KVM_EXIT_HLT;
@@ -1713,7 +1770,7 @@ static int kvm_handle_exit(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
static int dm_request_for_irq_injection(struct kvm_vcpu *vcpu,
struct kvm_run *kvm_run)
{
- return (!vcpu->irq_summary &&
+ return (!kvm_vcpu_irq_pending(vcpu) &&
kvm_run->request_interrupt_window &&
vcpu->interrupt_window_open &&
(vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF));
@@ -1751,11 +1808,19 @@ again:
vmcs_writel(HOST_GS_BASE, segment_base(gs_sel));
#endif
+ if (vcpu->guest_debug.enabled)
+ kvm_guest_debug_pre(vcpu);
+
+ spin_lock(&vcpu->irq.lock);
+
+ /*
+ * We must inject interrupts (if any) while the irq.lock
+ * is held
+ */
if (!vcpu->mmio_read_completed)
do_interrupt_requests(vcpu, kvm_run);
- if (vcpu->guest_debug.enabled)
- kvm_guest_debug_pre(vcpu);
+ spin_unlock(&vcpu->irq.lock);
fx_save(vcpu->host_fx_image);
fx_restore(vcpu->guest_fx_image);
-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH 3/5] KVM: Adds ability to preempt an executing VCPU
[not found] ` <20070420030905.12408.40403.stgit-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
2007-04-20 3:09 ` [PATCH 1/5] Adds support for in-kernel mmio handlers Gregory Haskins
2007-04-20 3:09 ` [PATCH 2/5] KVM: Add irqdevice object Gregory Haskins
@ 2007-04-20 3:09 ` Gregory Haskins
[not found] ` <20070420030921.12408.97321.stgit-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
2007-04-20 3:09 ` [PATCH 4/5] KVM: Local-APIC interface cleanup Gregory Haskins
` (2 subsequent siblings)
5 siblings, 1 reply; 22+ messages in thread
From: Gregory Haskins @ 2007-04-20 3:09 UTC (permalink / raw)
To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
The VCPU executes synchronously w.r.t. userspace today, and therefore
interrupt injection is straightforward. However, we will soon need
to be able to inject interrupts asynchronously to the execution of the VCPU
due to the introduction of SMP, paravirtualized drivers, and asynchronous
hypercalls. This patch adds support to the interrupt mechanism to force
a VCPU to VMEXIT when a new interrupt is pending.
Signed-off-by: Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
---
drivers/kvm/kvm.h | 5 ++++
drivers/kvm/kvm_main.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++
drivers/kvm/svm.c | 43 ++++++++++++++++++++++++++++++++++++
drivers/kvm/vmx.c | 43 ++++++++++++++++++++++++++++++++++++
4 files changed, 148 insertions(+), 0 deletions(-)
diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
index ef8f986..64916fc 100644
--- a/drivers/kvm/kvm.h
+++ b/drivers/kvm/kvm.h
@@ -272,6 +272,8 @@ void kvm_io_bus_register_dev(struct kvm_io_bus *bus,
#define NR_IRQ_WORDS KVM_IRQ_BITMAP_SIZE(unsigned long)
+#define KVM_SIGNAL_VIRTUAL_INTERRUPT 33 /* Hardcoded for now */
+
/*
* structure for maintaining info for interrupting an executing VCPU
*/
@@ -281,6 +283,9 @@ struct kvm_vcpu_irq {
int pending;
int trigger;
int deferred;
+ struct task_struct *task;
+ int signo;
+ int guest_mode;
};
struct kvm_vcpu {
diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index 32d456d..f51d036 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -303,6 +303,10 @@ static struct kvm *kvm_create_vm(void)
memset(&vcpu->irq, 0, sizeof(vcpu->irq));
spin_lock_init(&vcpu->irq.lock);
vcpu->irq.deferred = -1;
+ /*
+ * This should be settable by userspace someday
+ */
+ vcpu->irq.signo = KVM_SIGNAL_VIRTUAL_INTERRUPT;
vcpu->cpu = -1;
vcpu->kvm = kvm;
@@ -1843,6 +1847,7 @@ static int kvm_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
{
int r;
sigset_t sigsaved;
+ unsigned long irqsaved;
vcpu_load(vcpu);
@@ -1871,6 +1876,10 @@ static int kvm_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
kvm_arch_ops->decache_regs(vcpu);
}
+ spin_lock_irqsave(&vcpu->irq.lock, irqsaved);
+ vcpu->irq.task = current;
+ spin_unlock_irqrestore(&vcpu->irq.lock, irqsaved);
+
r = kvm_arch_ops->run(vcpu, kvm_run);
out:
@@ -2321,6 +2330,20 @@ out1:
}
/*
+ * This function is invoked whenever we want to interrupt a vcpu that is
+ * currently executing in guest-mode. It currently is a no-op because
+ * the simple delivery of the IPI to execute this function accomplishes our
+ * goal: to cause a VMEXIT. We pass the vcpu (which contains the
+ * vcpu->irq.task, etc) for future use
+ */
+static void kvm_vcpu_guest_intr(void *info)
+{
+#ifdef NOT_YET
+ struct kvm_vcpu *vcpu = (struct kvm_vcpu*)info;
+#endif
+}
+
+/*
* This function will be invoked whenever the vcpu->irq.dev raises its INTR
* line
*/
@@ -2345,6 +2368,40 @@ static void kvm_vcpu_intr(struct kvm_irqsink *this,
else
__clear_bit(pin, &vcpu->irq.trigger);
+ /*
+ * then wake up the vcpu
+ */
+ if (vcpu->irq.task && (vcpu->irq.task != current)) {
+ if (vcpu->irq.guest_mode) {
+ /*
+ * If we are in guest mode, we can optimize
+ * the IPI by executing a function on the
+ * owning processor.
+ */
+ int cpu;
+
+ /*
+ * Not sure if disabling preemption is needed
+ * since we are in a spin-lock anyway? The
+ * kick_process() code does this so I copied it
+ */
+ preempt_disable();
+ cpu = task_cpu(vcpu->irq.task);
+ BUG_ON(cpu == smp_processor_id());
+ smp_call_function_single(cpu,
+ kvm_vcpu_guest_intr,
+ vcpu, 0, 0);
+ preempt_enable();
+ } else
+ /*
+ * If we are not in guest mode, we must assume
+ * that we could be blocked anywhere,
+ * including userspace. Send a signal to give
+ * everyone a chance to get notification
+ */
+ send_sig(vcpu->irq.signo, vcpu->irq.task, 0);
+ }
+
} else if (!val && trigger)
/*
* if the level-sensitive line is being deasserted,
diff --git a/drivers/kvm/svm.c b/drivers/kvm/svm.c
index b6b96cb..4eb65bf 100644
--- a/drivers/kvm/svm.c
+++ b/drivers/kvm/svm.c
@@ -1510,11 +1510,40 @@ static int svm_vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
u16 gs_selector;
u16 ldt_selector;
int r;
+ unsigned long irq_flags;
again:
+ /*
+ * We disable interrupts until the next VMEXIT to eliminate a race
+ * condition for delivery of virtual interrupts. Note that this is
+ * probably not as bad as it sounds, as interrupts will still invoke
+ * a VMEXIT once transitioned to GUEST mode (and thus exit this lock
+ * scope) even if they are disabled.
+ *
+ * FIXME: Do we need to do anything additional to mask IPI/NMIs?
+ */
+ local_irq_save(irq_flags);
+
spin_lock(&vcpu->irq.lock);
/*
+ * If there are any signals pending (virtual interrupt related or
+ * otherwise), don't even bother trying to enter guest mode...
+ */
+ if (signal_pending(current)) {
+ kvm_run->exit_reason = KVM_EXIT_INTR;
+ spin_unlock(&vcpu->irq.lock);
+ local_irq_restore(irq_flags);
+ return -EINTR;
+ }
+
+ /*
+ * There are optimizations we can make when signaling interrupts
+ * if we know the VCPU is in GUEST mode, so mark that here
+ */
+ vcpu->irq.guest_mode = 1;
+
+ /*
* We must inject interrupts (if any) while the irq_lock
* is held
*/
@@ -1654,6 +1683,13 @@ again:
#endif
: "cc", "memory" );
+ /*
+ * FIXME: We'd like to turn on interrupts ASAP, but is this so early
+ * that we will mess up the state of the CPU before we fully
+ * transition from guest to host?
+ */
+ local_irq_restore(irq_flags);
+
fx_save(vcpu->guest_fx_image);
fx_restore(vcpu->host_fx_image);
@@ -1674,6 +1710,13 @@ again:
reload_tss(vcpu);
/*
+ * Signal that we have transitioned back to host mode
+ */
+ spin_lock_irqsave(&vcpu->irq.lock, irq_flags);
+ vcpu->irq.guest_mode = 0;
+ spin_unlock_irqrestore(&vcpu->irq.lock, irq_flags);
+
+ /*
* Profile KVM exit RIPs:
*/
if (unlikely(prof_on == KVM_PROFILING))
diff --git a/drivers/kvm/vmx.c b/drivers/kvm/vmx.c
index afd2d69..496ecc5 100644
--- a/drivers/kvm/vmx.c
+++ b/drivers/kvm/vmx.c
@@ -1782,6 +1782,7 @@ static int vmx_vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
u16 fs_sel, gs_sel, ldt_sel;
int fs_gs_ldt_reload_needed;
int r;
+ unsigned long irq_flags;
again:
/*
@@ -1811,9 +1812,37 @@ again:
if (vcpu->guest_debug.enabled)
kvm_guest_debug_pre(vcpu);
+ /*
+ * We disable interrupts until the next VMEXIT to eliminate a race
+ * condition for delivery of virtual interrutps. Note that this is
+ * probably not as bad as it sounds, as interrupts will still invoke
+ * a VMEXIT once transitioned to GUEST mode (and thus exit this lock
+ * scope) even if they are disabled.
+ *
+ * FIXME: Do we need to do anything additional to mask IPI/NMIs?
+ */
+ local_irq_save(irq_flags);
+
spin_lock(&vcpu->irq.lock);
/*
+ * If there are any signals pending (virtual interrupt related or
+ * otherwise), don't even bother trying to enter guest mode...
+ */
+ if (signal_pending(current)) {
+ kvm_run->exit_reason = KVM_EXIT_INTR;
+ spin_unlock(&vcpu->irq.lock);
+ local_irq_restore(irq_flags);
+ return -EINTR;
+ }
+
+ /*
+ * There are optimizations we can make when signaling interrupts
+ * if we know the VCPU is in GUEST mode, so mark that here
+ */
+ vcpu->irq.guest_mode = 1;
+
+ /*
* We must inject interrupts (if any) while the irq.lock
* is held
*/
@@ -1947,6 +1976,13 @@ again:
[cr2]"i"(offsetof(struct kvm_vcpu, cr2))
: "cc", "memory" );
+ /*
+ * FIXME: We'd like to turn on interrupts ASAP, but is this so early
+ * that we will mess up the state of the CPU before we fully
+ * transition from guest to host?
+ */
+ local_irq_restore(irq_flags);
+
/*
* Reload segment selectors ASAP. (it's needed for a functional
* kernel: x86 relies on having __KERNEL_PDA in %fs and x86_64
@@ -1979,6 +2015,13 @@ again:
asm ("mov %0, %%ds; mov %0, %%es" : : "r"(__USER_DS));
+ /*
+ * Signal that we have transitioned back to host mode
+ */
+ spin_lock_irqsave(&vcpu->irq.lock, irq_flags);
+ vcpu->irq.guest_mode = 0;
+ spin_unlock_irqrestore(&vcpu->irq.lock, irq_flags);
+
if (fail) {
kvm_run->exit_reason = KVM_EXIT_FAIL_ENTRY;
kvm_run->fail_entry.hardware_entry_failure_reason
* [PATCH 4/5] KVM: Local-APIC interface cleanup
[not found] ` <20070420030905.12408.40403.stgit-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
` (2 preceding siblings ...)
2007-04-20 3:09 ` [PATCH 3/5] KVM: Adds ability to preempt an executing VCPU Gregory Haskins
@ 2007-04-20 3:09 ` Gregory Haskins
[not found] ` <20070420030926.12408.27637.stgit-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
2007-04-20 3:09 ` [PATCH 5/5] KVM: Add support for in-kernel LAPIC model Gregory Haskins
2007-04-22 9:06 ` KVM: Patch series for in-kernel APIC support Avi Kivity
5 siblings, 1 reply; 22+ messages in thread
From: Gregory Haskins @ 2007-04-20 3:09 UTC (permalink / raw)
To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
Adds an abstraction to the LAPIC logic so that we can later substitute it
for an in-kernel model.
Signed-off-by: Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
---
drivers/kvm/kvm.h | 4 +-
drivers/kvm/kvm_main.c | 19 ++++----
drivers/kvm/lapic.h | 76 ++++++++++++++++++++++++++++++
drivers/kvm/svm.c | 7 +--
drivers/kvm/userint.c | 120 ++++++++++++++++++++++++++++++++++++++++++++++--
drivers/kvm/vmx.c | 10 +---
6 files changed, 208 insertions(+), 28 deletions(-)
diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
index 64916fc..f8c092c 100644
--- a/drivers/kvm/kvm.h
+++ b/drivers/kvm/kvm.h
@@ -14,6 +14,7 @@
#include "vmx.h"
#include "irqdevice.h"
+#include "lapic.h"
#include <linux/kvm.h>
#include <linux/kvm_para.h>
@@ -301,6 +302,7 @@ struct kvm_vcpu {
struct kvm_run *run;
int interrupt_window_open;
struct kvm_vcpu_irq irq;
+ struct kvm_lapic apic;
unsigned long regs[NR_VCPU_REGS]; /* for rsp: vcpu_load_rsp_rip() */
unsigned long rip; /* needs vcpu_load_rsp_rip() */
@@ -311,10 +313,8 @@ struct kvm_vcpu {
struct page *para_state_page;
gpa_t hypercall_gpa;
unsigned long cr4;
- unsigned long cr8;
u64 pdptrs[4]; /* pae */
u64 shadow_efer;
- u64 apic_base;
u64 ia32_misc_enable_msr;
int nmsrs;
struct vmx_msr_entry *guest_msrs;
diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index f51d036..55be172 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -607,7 +607,7 @@ void set_cr8(struct kvm_vcpu *vcpu, unsigned long cr8)
inject_gp(vcpu);
return;
}
- vcpu->cr8 = cr8;
+ kvm_lapic_set_tpr(&vcpu->apic, cr8, 0);
}
EXPORT_SYMBOL_GPL(set_cr8);
@@ -1520,7 +1520,7 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
data = 3;
break;
case MSR_IA32_APICBASE:
- data = vcpu->apic_base;
+ data = kvm_lapic_get_base(&vcpu->apic);
break;
case MSR_IA32_MISC_ENABLE:
data = vcpu->ia32_misc_enable_msr;
@@ -1598,7 +1598,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
case 0x200 ... 0x2ff: /* MTRRs */
break;
case MSR_IA32_APICBASE:
- vcpu->apic_base = data;
+ kvm_lapic_set_base(&vcpu->apic, data, 0);
break;
case MSR_IA32_MISC_ENABLE:
vcpu->ia32_misc_enable_msr = data;
@@ -1855,7 +1855,7 @@ static int kvm_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
sigprocmask(SIG_SETMASK, &vcpu->sigset, &sigsaved);
/* re-sync apic's tpr */
- vcpu->cr8 = kvm_run->cr8;
+ kvm_lapic_set_tpr(&vcpu->apic, kvm_run->cr8, KVM_LAPICFLAGS_USERMODE);
if (kvm_run->io_completed) {
if (vcpu->pio.count) {
@@ -1999,9 +1999,9 @@ static int kvm_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu,
sregs->cr2 = vcpu->cr2;
sregs->cr3 = vcpu->cr3;
sregs->cr4 = vcpu->cr4;
- sregs->cr8 = vcpu->cr8;
+ sregs->cr8 = kvm_lapic_get_tpr(&vcpu->apic);
sregs->efer = vcpu->shadow_efer;
- sregs->apic_base = vcpu->apic_base;
+ sregs->apic_base = kvm_lapic_get_base(&vcpu->apic);
kvm_irqdevice_summary(&vcpu->irq.dev, &sregs->interrupt_bitmap);
@@ -2036,13 +2036,14 @@ static int kvm_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
mmu_reset_needed |= vcpu->cr3 != sregs->cr3;
vcpu->cr3 = sregs->cr3;
- vcpu->cr8 = sregs->cr8;
+ kvm_lapic_set_tpr(&vcpu->apic, sregs->cr8, KVM_LAPICFLAGS_USERMODE);
mmu_reset_needed |= vcpu->shadow_efer != sregs->efer;
#ifdef CONFIG_X86_64
kvm_arch_ops->set_efer(vcpu, sregs->efer);
#endif
- vcpu->apic_base = sregs->apic_base;
+ kvm_lapic_set_base(&vcpu->apic, sregs->apic_base,
+ KVM_LAPICFLAGS_USERMODE);
kvm_arch_ops->decache_cr0_cr4_guest_bits(vcpu);
@@ -2470,6 +2471,8 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, int n)
kvm_irqdevice_init(&vcpu->irq.dev);
kvm_vcpu_irqsink_init(vcpu);
+ kvm_lapic_init(&vcpu->apic);
+
r = kvm_userint_init(vcpu);
if (r < 0)
goto out_free_vcpus;
diff --git a/drivers/kvm/lapic.h b/drivers/kvm/lapic.h
new file mode 100644
index 0000000..fa36bba
--- /dev/null
+++ b/drivers/kvm/lapic.h
@@ -0,0 +1,76 @@
+/*
+ * Defines an interface for an abstract Local APIC.
+ *
+ * Copyright (C) 2007 Novell
+ *
+ * Authors:
+ * Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef __LAPIC_H
+#define __LAPIC_H
+
+#define KVM_LAPICFLAGS_USERMODE (1 << 0)
+
+struct kvm_lapic {
+ void (*set_tpr)(struct kvm_lapic *this, u64 cr8, int flags);
+ u64 (*get_tpr)(struct kvm_lapic *this);
+ void (*set_base)(struct kvm_lapic *this, u64 base, int flags);
+ u64 (*get_base)(struct kvm_lapic *this);
+ void (*reset)(struct kvm_lapic *this);
+ int (*enabled)(struct kvm_lapic *this);
+ void (*destructor)(struct kvm_lapic *this);
+
+ void *private;
+};
+
+/**
+ * kvm_lapic_init - initialize the kvm_lapic for use
+ * @dev: The device
+ *
+ * Description: Initialize the kvm_lapic for use. Should be called before
+ * calling any derived implementation init functions
+ *
+ * Returns: (void)
+ */
+static inline void kvm_lapic_init(struct kvm_lapic *dev)
+{
+ memset(dev, 0, sizeof(*dev));
+}
+
+static inline void kvm_lapic_set_tpr(struct kvm_lapic *dev, u64 cr8, int flags)
+{
+ dev->set_tpr(dev, cr8, flags);
+}
+
+static inline u64 kvm_lapic_get_tpr(struct kvm_lapic *dev)
+{
+ return dev->get_tpr(dev);
+}
+
+static inline void kvm_lapic_set_base(struct kvm_lapic *dev, u64 base,
+ int flags)
+{
+ dev->set_base(dev, base, flags);
+}
+
+static inline u64 kvm_lapic_get_base(struct kvm_lapic *dev)
+{
+ return dev->get_base(dev);
+}
+
+static inline void kvm_lapic_reset(struct kvm_lapic *dev)
+{
+ dev->reset(dev);
+}
+
+static inline int kvm_lapic_enabled(struct kvm_lapic *dev)
+{
+ return dev->enabled(dev);
+}
+
+#endif /* __LAPIC_H */
diff --git a/drivers/kvm/svm.c b/drivers/kvm/svm.c
index 4eb65bf..48b50e0 100644
--- a/drivers/kvm/svm.c
+++ b/drivers/kvm/svm.c
@@ -569,9 +569,6 @@ static int svm_create_vcpu(struct kvm_vcpu *vcpu)
init_vmcb(vcpu->svm->vmcb);
fx_init(vcpu);
- vcpu->apic_base = 0xfee00000 |
- /*for vcpu 0*/ MSR_IA32_APICBASE_BSP |
- MSR_IA32_APICBASE_ENABLE;
return 0;
@@ -1469,8 +1466,8 @@ static void post_kvm_run_save(struct kvm_vcpu *vcpu,
(vcpu->interrupt_window_open &&
!kvm_vcpu_irq_pending(vcpu));
kvm_run->if_flag = (vcpu->svm->vmcb->save.rflags & X86_EFLAGS_IF) != 0;
- kvm_run->cr8 = vcpu->cr8;
- kvm_run->apic_base = vcpu->apic_base;
+ kvm_run->cr8 = kvm_lapic_get_tpr(&vcpu->apic);
+ kvm_run->apic_base = kvm_lapic_get_base(&vcpu->apic);
}
/*
diff --git a/drivers/kvm/userint.c b/drivers/kvm/userint.c
index d4385d6..8398d9e 100644
--- a/drivers/kvm/userint.c
+++ b/drivers/kvm/userint.c
@@ -91,10 +91,16 @@ static inline int bitarray_test_and_set(struct bitarray *this, int nr, int val)
/*
*----------------------------------------------------------------------
- * userint interface - provides the actual kvm_irqdevice implementation
+ * userint - provides the actual "user interrupts" implementation
*---------------------------------------------------------------------
*/
+/*
+ *-----
+ * irqdevice
+ *-----
+ */
+
struct kvm_user_irqdev {
spinlock_t lock;
atomic_t ref_count;
@@ -166,16 +172,87 @@ static int user_irqdev_summary(struct kvm_irqdevice* this, void *data)
return 0;
}
-static void user_irqdev_dropref(struct kvm_user_irqdev *s)
+static void user_irqdev_destructor(struct kvm_irqdevice *this)
{
+ struct kvm_user_irqdev *s = (struct kvm_user_irqdev*)this->private;
+
if (atomic_dec_and_test(&s->ref_count))
kfree(s);
}
-static void user_irqdev_destructor(struct kvm_irqdevice *this)
+/*
+ *-----
+ * lapic
+ *-----
+ */
+
+struct kvm_user_lapic {
+ spinlock_t lock;
+ atomic_t ref_count;
+ unsigned long cr8;
+ u64 apic_base;
+};
+
+
+static void user_lapic_set_tpr(struct kvm_lapic *this, u64 cr8, int flags)
+{
+ struct kvm_user_lapic *s = (struct kvm_user_lapic*)this->private;
+
+ spin_lock(&s->lock);
+ s->cr8 = cr8;
+ spin_unlock(&s->lock);
+}
+
+static u64 user_lapic_get_tpr(struct kvm_lapic *this)
+{
+ struct kvm_user_lapic *s = (struct kvm_user_lapic*)this->private;
+ u64 cr8;
+
+ spin_lock(&s->lock);
+ cr8 = s->cr8;
+ spin_unlock(&s->lock);
+
+ return cr8;
+}
+
+static void user_lapic_set_base(struct kvm_lapic *this, u64 base,
+ int flags)
+{
+ struct kvm_user_lapic *s = (struct kvm_user_lapic*)this->private;
+
+ spin_lock(&s->lock);
+ s->apic_base = base;
+ spin_unlock(&s->lock);
+}
+
+static u64 user_lapic_get_base(struct kvm_lapic *this)
+{
+ struct kvm_user_lapic *s = (struct kvm_user_lapic*)this->private;
+ u64 base;
+
+ spin_lock(&s->lock);
+ base = s->apic_base;
+ spin_unlock(&s->lock);
+
+ return base;
+}
+
+static void user_lapic_reset(struct kvm_lapic *this)
+{
+ /* no-op */
+}
+
+static int user_lapic_enabled(struct kvm_lapic *this)
+{
+ return 1;
+}
+
+static void user_lapic_destructor(struct kvm_lapic *this)
{
struct kvm_user_irqdev *s = (struct kvm_user_irqdev*)this->private;
- user_irqdev_dropref(s);
+
+ if (atomic_dec_and_test(&s->ref_count))
+ kfree(s);
}
int kvm_user_irqdev_init(struct kvm_irqdevice *irqdev)
@@ -195,12 +272,43 @@ int kvm_user_irqdev_init(struct kvm_irqdevice *irqdev)
irqdev->private = s;
atomic_inc(&s->ref_count);
-
+
return 0;
}
int kvm_userint_init(struct kvm_vcpu *vcpu)
{
- return kvm_user_irqdev_init(&vcpu->irq.dev);
+ struct kvm_lapic *apic = &vcpu->apic;
+ struct kvm_user_lapic *s;
+ int ret;
+
+ s = kzalloc(sizeof(*s), GFP_KERNEL);
+ if (!s)
+ return -ENOMEM;
+
+ ret = kvm_user_irqdev_init(&vcpu->irq.dev);
+ if (ret < 0) {
+ kfree(s);
+ return ret;
+ }
+
+ spin_lock_init(&s->lock);
+ s->cr8 = 0;
+ s->apic_base = 0xfee00000 |
+ /*for vcpu 0*/ MSR_IA32_APICBASE_BSP |
+ MSR_IA32_APICBASE_ENABLE;
+
+ apic->set_tpr = user_lapic_set_tpr;
+ apic->get_tpr = user_lapic_get_tpr;
+ apic->set_base = user_lapic_set_base;
+ apic->get_base = user_lapic_get_base;
+ apic->reset = user_lapic_reset;
+ apic->enabled = user_lapic_enabled;
+ apic->destructor = user_lapic_destructor;
+
+ apic->private = s;
+ atomic_inc(&s->ref_count);
+
+ return 0;
}
diff --git a/drivers/kvm/vmx.c b/drivers/kvm/vmx.c
index 496ecc5..2979fcf 100644
--- a/drivers/kvm/vmx.c
+++ b/drivers/kvm/vmx.c
@@ -994,10 +994,6 @@ static int vmx_vcpu_setup(struct kvm_vcpu *vcpu)
memset(vcpu->regs, 0, sizeof(vcpu->regs));
vcpu->regs[VCPU_REGS_RDX] = get_rdx_init_val();
- vcpu->cr8 = 0;
- vcpu->apic_base = 0xfee00000 |
- /*for vcpu 0*/ MSR_IA32_APICBASE_BSP |
- MSR_IA32_APICBASE_ENABLE;
fx_init(vcpu);
@@ -1576,7 +1572,7 @@ static int handle_cr(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
printk(KERN_DEBUG "handle_cr: read CR8 "
"cpu erratum AA15\n");
vcpu_load_rsp_rip(vcpu);
- vcpu->regs[reg] = vcpu->cr8;
+ vcpu->regs[reg] = kvm_lapic_get_tpr(&vcpu->apic);
vcpu_put_rsp_rip(vcpu);
skip_emulated_instruction(vcpu);
return 1;
@@ -1673,8 +1669,8 @@ static void post_kvm_run_save(struct kvm_vcpu *vcpu,
struct kvm_run *kvm_run)
{
kvm_run->if_flag = (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) != 0;
- kvm_run->cr8 = vcpu->cr8;
- kvm_run->apic_base = vcpu->apic_base;
+ kvm_run->cr8 = kvm_lapic_get_tpr(&vcpu->apic);
+ kvm_run->apic_base = kvm_lapic_get_base(&vcpu->apic);
kvm_run->ready_for_interrupt_injection =
(vcpu->interrupt_window_open &&
!kvm_vcpu_irq_pending(vcpu));
* [PATCH 5/5] KVM: Add support for in-kernel LAPIC model
[not found] ` <20070420030905.12408.40403.stgit-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
` (3 preceding siblings ...)
2007-04-20 3:09 ` [PATCH 4/5] KVM: Local-APIC interface cleanup Gregory Haskins
@ 2007-04-20 3:09 ` Gregory Haskins
[not found] ` <20070420030931.12408.88158.stgit-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
2007-04-22 9:06 ` KVM: Patch series for in-kernel APIC support Avi Kivity
5 siblings, 1 reply; 22+ messages in thread
From: Gregory Haskins @ 2007-04-20 3:09 UTC (permalink / raw)
To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
Signed-off-by: Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
---
drivers/kvm/Makefile | 2
drivers/kvm/kernint.c | 168 +++++
drivers/kvm/kvm.h | 14
drivers/kvm/kvm_main.c | 142 +++++
drivers/kvm/lapic.c | 1472 ++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/kvm.h | 16 -
6 files changed, 1808 insertions(+), 6 deletions(-)
diff --git a/drivers/kvm/Makefile b/drivers/kvm/Makefile
index 540afbc..1aad737 100644
--- a/drivers/kvm/Makefile
+++ b/drivers/kvm/Makefile
@@ -2,7 +2,7 @@
# Makefile for Kernel-based Virtual Machine module
#
-kvm-objs := kvm_main.o mmu.o x86_emulate.o userint.o
+kvm-objs := kvm_main.o mmu.o x86_emulate.o userint.o lapic.o kernint.o
obj-$(CONFIG_KVM) += kvm.o
kvm-intel-objs = vmx.o
obj-$(CONFIG_KVM_INTEL) += kvm-intel.o
diff --git a/drivers/kvm/kernint.c b/drivers/kvm/kernint.c
new file mode 100644
index 0000000..979a4aa
--- /dev/null
+++ b/drivers/kvm/kernint.c
@@ -0,0 +1,168 @@
+/*
+ * Kernel Interrupt IRQ device
+ *
+ * Provides a model for connecting in-kernel interrupt resources to a VCPU.
+ *
+ * A typical modern x86 processor has the concept of an internal Local-APIC
+ * and some external signal pins. The way in which interrupts are injected is
+ * dependent on whether software enables the LAPIC or not. When enabled,
+ * interrupts are acknowledged through the LAPIC. Otherwise they come through
+ * an externally connected PIC (typically an i8259 on the BSP).
+ *
+ * Copyright (C) 2007 Novell
+ *
+ * Authors:
+ * Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include "kvm.h"
+
+extern int kvm_kern_lapic_init(struct kvm_vcpu *vcpu,
+ struct kvm_irqdevice *irq_dev);
+struct kvm_kernint {
+ spinlock_t lock;
+ atomic_t ref_count;
+ struct kvm_vcpu *vcpu;
+ struct kvm_irqdevice *self_irq;
+ struct kvm_irqdevice *ext_irq;
+ struct kvm_irqdevice apic_irq;
+ struct kvm_lapic *apic_dev;
+
+};
+
+static void kernint_dropref(struct kvm_kernint *s)
+{
+ if (atomic_dec_and_test(&s->ref_count))
+ kfree(s);
+}
+
+static struct kvm_irqdevice *get_irq_dev(struct kvm_kernint *s)
+{
+ struct kvm_irqdevice *dev;
+
+ if (kvm_lapic_enabled(s->apic_dev))
+ dev = &s->apic_irq;
+ else
+ dev = s->ext_irq;
+
+ if (!dev)
+ kvm_crash_guest(s->vcpu->kvm);
+
+ return dev;
+}
+
+static int kernint_irqdev_ack(struct kvm_irqdevice *this, int *vector)
+{
+ struct kvm_kernint *s = (struct kvm_kernint*)this->private;
+
+ return kvm_irqdevice_ack(get_irq_dev(s), vector);
+}
+
+static int kernint_irqdev_set_pin(struct kvm_irqdevice* this,
+ int irq, int level)
+{
+ /* no-op */
+ return 0;
+}
+
+static int kernint_irqdev_summary(struct kvm_irqdevice* this, void *data)
+{
+ struct kvm_kernint *s = (struct kvm_kernint*)this->private;
+
+ return kvm_irqdevice_summary(get_irq_dev(s), data);
+}
+
+static void kernint_irqdev_destructor(struct kvm_irqdevice *this)
+{
+ struct kvm_kernint *s = (struct kvm_kernint*)this->private;
+ kernint_dropref(s);
+}
+
+static void kvm_apic_intr(struct kvm_irqsink *this,
+ struct kvm_irqdevice *dev,
+ kvm_irqpin_t pin, int trigger, int val)
+{
+ struct kvm_kernint *s = (struct kvm_kernint*)this->private;
+
+ /*
+ * If the LAPIC sent us an interrupt, it *must* be enabled,
+ * so just forward it on to the CPU
+ */
+ kvm_irqdevice_set_intr(s->self_irq, pin, trigger, val);
+}
+
+static void kvm_ext_intr(struct kvm_irqsink *this,
+ struct kvm_irqdevice *dev,
+ kvm_irqpin_t pin, int trigger, int val)
+{
+ struct kvm_kernint *s = (struct kvm_kernint*)this->private;
+
+ /*
+ * If the EXTINT device sent us an interrupt, only forward it if
+ * the LAPIC is disabled
+ */
+ if (!kvm_lapic_enabled(s->apic_dev))
+ return kvm_irqdevice_set_intr(s->self_irq, pin, trigger, val);
+}
+
+int kvm_kernint_init(struct kvm_vcpu *vcpu)
+{
+ struct kvm_irqdevice *irqdev = &vcpu->irq.dev;
+ struct kvm_kernint *s;
+
+ s = kzalloc(sizeof(*s), GFP_KERNEL);
+ if (!s)
+ return -ENOMEM;
+
+ spin_lock_init(&s->lock);
+ s->vcpu = vcpu;
+
+ /*
+ * Configure the irqdevice interface
+ */
+ irqdev->ack = kernint_irqdev_ack;
+ irqdev->set_pin = kernint_irqdev_set_pin;
+ irqdev->summary = kernint_irqdev_summary;
+ irqdev->destructor = kernint_irqdev_destructor;
+
+ irqdev->private = s;
+ atomic_inc(&s->ref_count);
+ s->self_irq = irqdev;
+
+ /*
+ * Configure the EXTINT device if this is the BSP processor
+ */
+ if (!vcpu_slot(vcpu)) {
+ struct kvm_irqsink sink = {
+ .set_intr = kvm_ext_intr,
+ .private = s
+ };
+
+ s->ext_irq = &vcpu->kvm->isa_irq;
+ kvm_irqdevice_register_sink(s->ext_irq, &sink);
+ }
+
+ /*
+ * Configure the LAPIC device
+ */
+ kvm_irqdevice_init(&s->apic_irq);
+
+ {
+ struct kvm_irqsink sink = {
+ .set_intr = kvm_apic_intr,
+ .private = s
+ };
+
+ kvm_irqdevice_register_sink(&s->apic_irq, &sink);
+ }
+
+ kvm_kern_lapic_init(vcpu, &s->apic_irq);
+ s->apic_dev = &vcpu->apic;
+
+ return 0;
+}
+
diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
index f8c092c..004bcb2 100644
--- a/drivers/kvm/kvm.h
+++ b/drivers/kvm/kvm.h
@@ -161,6 +161,7 @@ struct kvm_vcpu;
int kvm_user_irqdev_init(struct kvm_irqdevice *dev);
int kvm_userint_init(struct kvm_vcpu *vcpu);
+int kvm_kernint_init(struct kvm_vcpu *vcpu);
/*
* x86 supports 3 paging modes (4-level 64-bit, 3-level 64-bit, and 2-level
@@ -303,6 +304,7 @@ struct kvm_vcpu {
int interrupt_window_open;
struct kvm_vcpu_irq irq;
struct kvm_lapic apic;
+ struct kvm_io_device *apic_mmio;
unsigned long regs[NR_VCPU_REGS]; /* for rsp: vcpu_load_rsp_rip() */
unsigned long rip; /* needs vcpu_load_rsp_rip() */
@@ -482,6 +484,8 @@ struct kvm {
struct list_head vm_list;
struct file *filp;
struct kvm_io_bus mmio_bus;
+ int enable_kernel_pic;
+ struct kvm_irqdevice isa_irq;
};
struct kvm_stat {
@@ -570,6 +574,9 @@ extern struct kvm_arch_ops *kvm_arch_ops;
int kvm_init_arch(struct kvm_arch_ops *ops, struct module *module);
void kvm_exit_arch(void);
+int kvm_apicbus_send(struct kvm *kvm, int dest, int trig_mode, int level,
+ int dest_mode, int delivery_mode, int vector);
+
void kvm_mmu_destroy(struct kvm_vcpu *vcpu);
int kvm_mmu_create(struct kvm_vcpu *vcpu);
int kvm_mmu_setup(struct kvm_vcpu *vcpu);
@@ -701,6 +708,13 @@ static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
return (struct kvm_mmu_page *)page_private(page);
}
+static inline int vcpu_slot(struct kvm_vcpu *vcpu)
+{
+ return vcpu - vcpu->kvm->vcpus;
+}
+
+void kvm_crash_guest(struct kvm *kvm);
+
static inline u16 read_fs(void)
{
u16 seg;
diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index 55be172..e7c7661 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -295,6 +295,7 @@ static struct kvm *kvm_create_vm(void)
spin_lock_init(&kvm->lock);
INIT_LIST_HEAD(&kvm->active_mmu_pages);
kvm_io_bus_init(&kvm->mmio_bus);
+ kvm_irqdevice_init(&kvm->isa_irq);
for (i = 0; i < KVM_MAX_VCPUS; ++i) {
struct kvm_vcpu *vcpu = &kvm->vcpus[i];
@@ -391,6 +392,23 @@ static void kvm_free_vcpus(struct kvm *kvm)
kvm_free_vcpu(&kvm->vcpus[i]);
}
+/*
+ * This function kills a guest while a userspace process still holds
+ * a descriptor to it
+ */
+void kvm_crash_guest(struct kvm *kvm)
+{
+ unsigned int i;
+
+ for (i = 0; i < KVM_MAX_VCPUS; ++i) {
+ /*
+ * FIXME: in the future it should send IPI to gracefully
+ * stop the other vCPUs
+ */
+ kvm_free_vcpu(&kvm->vcpus[i]);
+ }
+}
+
static int kvm_dev_release(struct inode *inode, struct file *filp)
{
return 0;
@@ -908,6 +926,64 @@ out:
return r;
}
+static int kvm_vm_ioctl_enable_kernel_pic(struct kvm *kvm, __u32 val)
+{
+ /*
+ * FIXME: We should not allow this if VCPUs have already been created
+ */
+ if (kvm->enable_kernel_pic)
+ return -EINVAL;
+
+ /* Someday we may offer two levels of in-kernel PIC support:
+ *
+ * level 0 = (default) compatibility mode (everything in userspace)
+ * level 1 = LAPIC in kernel, IOAPIC/i8259 in userspace
+ * level 2 = All three in kernel
+ *
+ * For now we only support levels 0 and 1. However, level 0 cannot
+ * be set explicitly since it is the default
+ */
+ if (val != 1)
+ return -EINVAL;
+
+ kvm->enable_kernel_pic = val;
+
+ /*
+ * Installing a user_irqdev model on the kvm->isa_irq device
+ * creates a level-1 environment, where userspace completely
+ * controls the ISA-domain interrupts in the IOAPIC/i8259.
+ * Interrupts reach the VCPU either as an ISA vector through
+ * this controller, or as an APIC bus message (or both)
+ */
+ kvm_user_irqdev_init(&kvm->isa_irq);
+
+ return 0;
+}
+
+static int kvm_vm_ioctl_isa_interrupt(struct kvm *kvm,
+ struct kvm_interrupt *irq)
+{
+ if (irq->irq < 0 || irq->irq >= 256)
+ return -EINVAL;
+
+ if (!kvm->enable_kernel_pic)
+ return -EINVAL;
+
+ return kvm_irqdevice_set_pin(&kvm->isa_irq, irq->irq, 1);
+}
+
+static int kvm_vm_ioctl_apic_msg(struct kvm *kvm,
+ struct kvm_apic_msg *msg)
+{
+ if (!kvm->enable_kernel_pic)
+ return -EINVAL;
+
+ kvm_apicbus_send(kvm, msg->dest, msg->trig_mode, 1, msg->dest_mode,
+ msg->delivery_mode, msg->vector);
+
+ return 0;
+}
+
static gfn_t unalias_gfn(struct kvm *kvm, gfn_t gfn)
{
int i;
@@ -1028,10 +1104,16 @@ static int emulator_write_std(unsigned long addr,
static struct kvm_io_device *vcpu_find_mmio_dev(struct kvm_vcpu *vcpu,
gpa_t addr)
{
+ struct kvm_io_device *dev = vcpu->apic_mmio;
+
+ /*
+ * First check if the LAPIC will snarf this request
+ */
+ if (dev && dev->in_range(dev, addr))
+ return dev;
+
/*
- * Note that its important to have this wrapper function because
- * in the very near future we will be checking for MMIOs against
- * the LAPIC as well as the general MMIO bus
+ * Then fall back and allow any device on the bus to claim it
*/
return kvm_io_bus_find_dev(&vcpu->kvm->mmio_bus, addr);
}
@@ -2473,7 +2555,11 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, int n)
kvm_vcpu_irqsink_init(vcpu);
kvm_lapic_init(&vcpu->apic);
- r = kvm_userint_init(vcpu);
+ if (kvm->enable_kernel_pic)
+ r = kvm_kernint_init(vcpu);
+ else
+ r = kvm_userint_init(vcpu);
+
if (r < 0)
goto out_free_vcpus;
@@ -2595,6 +2681,12 @@ static int kvm_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
return 0;
}
+static int kvm_vcpu_ioctl_apic_reset(struct kvm_vcpu *vcpu)
+{
+ kvm_lapic_reset(&vcpu->apic);
+ return 0;
+}
+
static long kvm_vcpu_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
@@ -2764,6 +2856,13 @@ static long kvm_vcpu_ioctl(struct file *filp,
r = 0;
break;
}
+ case KVM_APIC_RESET: {
+ r = kvm_vcpu_ioctl_apic_reset(vcpu);
+ if (r)
+ goto out;
+ r = 0;
+ break;
+ }
default:
;
}
@@ -2817,6 +2916,41 @@ static long kvm_vm_ioctl(struct file *filp,
goto out;
break;
}
+ case KVM_ENABLE_KERNEL_PIC: {
+ __u32 val;
+
+ r = -EFAULT;
+ if (copy_from_user(&val, argp, sizeof val))
+ goto out;
+ r = kvm_vm_ioctl_enable_kernel_pic(kvm, val);
+ if (r)
+ goto out;
+ break;
+ }
+ case KVM_ISA_INTERRUPT: {
+ struct kvm_interrupt irq;
+
+ r = -EFAULT;
+ if (copy_from_user(&irq, argp, sizeof irq))
+ goto out;
+ r = kvm_vm_ioctl_isa_interrupt(kvm, &irq);
+ if (r)
+ goto out;
+ r = 0;
+ break;
+ }
+ case KVM_APIC_MSG: {
+ struct kvm_apic_msg msg;
+
+ r = -EFAULT;
+ if (copy_from_user(&msg, argp, sizeof msg))
+ goto out;
+ r = kvm_vm_ioctl_apic_msg(kvm, &msg);
+ if (r)
+ goto out;
+ r = 0;
+ break;
+ }
default:
;
}
diff --git a/drivers/kvm/lapic.c b/drivers/kvm/lapic.c
new file mode 100644
index 0000000..3ec33af
--- /dev/null
+++ b/drivers/kvm/lapic.c
@@ -0,0 +1,1472 @@
+/*
+ * Local APIC virtualization
+ *
+ * Copyright (C) 2006 Qumranet, Inc.
+ * Copyright (C) 2007 Novell
+ *
+ * Authors:
+ * Dor Laor <dor.laor-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
+ * Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
+ *
+ * Based on Xen 3.0 code, Copyright (c) 2004, Intel Corporation.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "kvm.h"
+#include <linux/kvm.h>
+#include <linux/mm.h>
+#include <linux/highmem.h>
+#include <linux/smp.h>
+#include <linux/hrtimer.h>
+#include <asm/processor.h>
+#include <asm/io.h>
+#include <asm/msr.h>
+#include <asm/page.h>
+#include <asm/current.h>
+
+/* XXX: remove this definition once guest firmware (GFW) is enabled */
+#define APIC_NO_BIOS
+
+#define PRId64 "d"
+#define PRIx64 "llx"
+#define PRIu64 "u"
+#define PRIo64 "o"
+
+#define APIC_BUS_CYCLE_NS 1
+
+/*
+ *-----------------------------------------------------------------------
+ * KERNEL-TIMERS
+ *
+ * Unfortunately we really need HRTIMERs to do this right, but they might
+ * not exist on older kernels (sigh). So we roughly abstract this interface
+ * to support nanosecond resolution, and then emulate it as best we can
+ * when real HRTIMERs aren't available
+ *-----------------------------------------------------------------------
+ */
+
+struct kvm_apic_timer {
+ int (*function)(void *private);
+ void *private;
+#ifdef KVM_NO_HRTIMER
+ struct timer_list dev;
+#else
+ struct hrtimer dev;
+#endif
+};
+
+#ifdef KVM_NO_HRTIMER
+
+/*
+ *----------------
+ * low-res version
+ *----------------
+ */
+
+static void kvm_apictimer_lr_cb(unsigned long data)
+{
+ struct kvm_apic_timer *timer = (struct kvm_apic_timer*)data;
+
+ /*
+ * If the callback returns >0, it's cyclic
+ */
+ if (timer->function(timer->private))
+ add_timer(&timer->dev);
+}
+
+static ktime_t kvm_apictimer_now(struct kvm_apic_timer *timer)
+{
+ struct timespec ts;
+
+ getnstimeofday(&ts);
+ return timespec_to_ktime(ts);
+}
+
+#define TICKS_PER_MS (HZ/1000)
+#define TICKS_PER_US (TICKS_PER_MS/1000)
+#define TICKS_PER_NS (TICKS_PER_US/1000)
+
+static void kvm_apictimer_update(struct kvm_apic_timer *timer, ktime_t val)
+{
+ /* FIXME: I'm sure this is broken */
+ ktime_t offset = ktime_sub(val, kvm_apictimer_now(timer));
+ unsigned long timeout = TICKS_PER_NS * ktime_to_ns(offset);
+ if (!timeout)
+ timeout++;
+ timer->dev.expires = jiffies + timeout;
+}
+
+static void kvm_apictimer_start(struct kvm_apic_timer *timer, ktime_t val)
+{
+ kvm_apictimer_update(timer, val);
+ add_timer(&timer->dev);
+}
+
+static void kvm_apictimer_stop(struct kvm_apic_timer *timer)
+{
+ del_timer(&timer->dev);
+}
+
+static int kvm_apictimer_init(struct kvm_apic_timer *timer)
+{
+ memset(timer, 0, sizeof(*timer));
+
+ init_timer(&timer->dev);
+ timer->dev.function = kvm_apictimer_lr_cb;
+ timer->dev.data = (unsigned long)timer;
+
+ return 0;
+}
+
+#else
+
+/*
+ *----------------
+ * hi-res version
+ *----------------
+ */
+
+static enum hrtimer_restart kvm_apictimer_hr_cb(struct hrtimer* data)
+{
+ struct kvm_apic_timer *timer;
+
+ timer = container_of(data, struct kvm_apic_timer, dev);
+
+ /*
+ * If the callback returns >0, it's cyclic
+ */
+ if (timer->function(timer->private))
+ return HRTIMER_RESTART;
+ else
+ return HRTIMER_NORESTART;
+}
+
+static ktime_t kvm_apictimer_now(struct kvm_apic_timer *timer)
+{
+ return timer->dev.base->get_time();
+}
+
+static void kvm_apictimer_update(struct kvm_apic_timer *timer, ktime_t val)
+{
+ timer->dev.expires = val;
+}
+
+static void kvm_apictimer_start(struct kvm_apic_timer *timer, ktime_t val)
+{
+ hrtimer_start(&timer->dev, val, HRTIMER_ABS);
+}
+
+static void kvm_apictimer_stop(struct kvm_apic_timer *timer)
+{
+ hrtimer_cancel(&timer->dev);
+}
+
+static int kvm_apictimer_init(struct kvm_apic_timer *timer)
+{
+ hrtimer_init(&timer->dev, CLOCK_MONOTONIC, HRTIMER_ABS);
+ timer->dev.function = kvm_apictimer_hr_cb;
+
+ return 0;
+}
+
+#endif
+
+/*
+ *-----------------------------------------------------------------------
+ * Actual LAPIC model - Enough of that nutty timer junk
+ *-----------------------------------------------------------------------
+ */
+struct kvm_kern_apic {
+ spinlock_t lock;
+ atomic_t ref_count;
+ u32 status;
+ u32 vcpu_id;
+ u64 base_msr;
+ unsigned long base_address;
+ struct kvm_io_device mmio_dev;
+ struct {
+ unsigned long pending;
+ u32 divide_count;
+ ktime_t last_update;
+ struct kvm_apic_timer dev;
+
+ } timer;
+ u32 err_status;
+ u32 err_write_count;
+ struct kvm_vcpu *vcpu;
+ struct kvm_irqdevice *irq_dev;
+ struct kvm_lapic *dev;
+ struct page *regs_page;
+ void *regs;
+};
+
+static __inline__ int find_highest_bit(unsigned long *data, int nr_bits)
+{
+ int length = BITS_TO_LONGS(nr_bits);
+ while (length && !data[--length])
+ continue;
+ return __ffs(data[length]) + (length * BITS_PER_LONG);
+}
+
+#define APIC_LVT_NUM 6
+/* 0x14 is the version for Xeon and Pentium; see IA-32 SDM 8.4.8 */
+#define APIC_VERSION (0x14UL | ((APIC_LVT_NUM - 1) << 16))
+#define VLOCAL_APIC_MEM_LENGTH (1 << 12)
+/* the following defines are not in apicdef.h */
+#define APIC_SHORT_MASK 0xc0000
+#define APIC_DEST_NOSHORT 0x0
+#define APIC_DEST_MASK 0x800
+#define _APIC_GLOB_DISABLE 0x0
+#define APIC_GLOB_DISABLE_MASK 0x1
+#define APIC_SOFTWARE_DISABLE_MASK 0x2
+#define _APIC_BSP_ACCEPT_PIC 0x3
+#define MAX_APIC_INT_VECTOR 256
+
+#define apic_enabled(apic) \
+ (!((apic)->status & \
+ (APIC_GLOB_DISABLE_MASK | APIC_SOFTWARE_DISABLE_MASK)))
+
+#define apic_global_enabled(apic) \
+ (!(test_bit(_APIC_GLOB_DISABLE, &(apic)->status)))
+
+#define LVT_MASK \
+ APIC_LVT_MASKED | APIC_SEND_PENDING | APIC_VECTOR_MASK
+
+#define LINT_MASK \
+ LVT_MASK | APIC_MODE_MASK | APIC_INPUT_POLARITY |\
+ APIC_LVT_REMOTE_IRR | APIC_LVT_LEVEL_TRIGGER
+
+#define KVM_APIC_ID(apic) \
+ (GET_APIC_ID(apic_get_reg(apic, APIC_ID)))
+
+#define apic_lvt_enabled(apic, lvt_type) \
+ (!(apic_get_reg(apic, lvt_type) & APIC_LVT_MASKED))
+
+#define apic_lvt_vector(apic, lvt_type) \
+ (apic_get_reg(apic, lvt_type) & APIC_VECTOR_MASK)
+
+#define apic_lvt_dm(apic, lvt_type) \
+ (apic_get_reg(apic, lvt_type) & APIC_MODE_MASK)
+
+#define apic_lvtt_period(apic) \
+ (apic_get_reg(apic, APIC_LVTT) & APIC_LVT_TIMER_PERIODIC)
+
+static int apic_reset(struct kvm_kern_apic *apic);
+
+static inline u32 apic_get_reg(struct kvm_kern_apic *apic, u32 reg)
+{
+ return *((u32 *)(apic->regs + reg));
+}
+
+static inline void apic_set_reg(struct kvm_kern_apic *apic,
+ u32 reg, u32 val)
+{
+ *((u32 *)(apic->regs + reg)) = val;
+}
+
+static unsigned int apic_lvt_mask[APIC_LVT_NUM] =
+{
+ LVT_MASK | APIC_LVT_TIMER_PERIODIC, /* LVTT */
+ LVT_MASK | APIC_MODE_MASK, /* LVTTHMR */
+ LVT_MASK | APIC_MODE_MASK, /* LVTPC */
+ LINT_MASK, LINT_MASK, /* LVT0-1 */
+ LVT_MASK /* LVTERR */
+};
+
+#define ASSERT(x) \
+ if (!(x)) { \
+ printk(KERN_EMERG "assertion failed %s: %d: %s\n", \
+ __FILE__, __LINE__, #x); \
+ BUG(); \
+ }
+
+static int apic_find_highest_irr(struct kvm_kern_apic *apic)
+{
+ int result;
+
+ result = find_highest_bit((unsigned long *)(apic->regs + APIC_IRR),
+ MAX_APIC_INT_VECTOR);
+
+ ASSERT(result == 0 || result >= 16);
+
+ return result;
+}
+
+
+static int apic_find_highest_isr(struct kvm_kern_apic *apic)
+{
+ int result;
+
+ result = find_highest_bit((unsigned long *)(apic->regs + APIC_ISR),
+ MAX_APIC_INT_VECTOR);
+
+ ASSERT(result == 0 || result >= 16);
+
+ return result;
+}
+
+static void apic_dropref(struct kvm_kern_apic *apic)
+{
+ if (atomic_dec_and_test(&apic->ref_count)) {
+ if (apic->regs_page) {
+ kvm_apictimer_stop(&apic->timer.dev);
+ __free_page(apic->regs_page);
+ apic->regs_page = NULL;
+ }
+
+ kfree(apic);
+ }
+}
+
+#if 0
+static void apic_dump_state(struct kvm_kern_apic *apic)
+{
+ u64 *tmp;
+
+ printk(KERN_INFO "%s begin\n", __FUNCTION__);
+
+ printk(KERN_INFO "status = 0x%08x\n", apic->status);
+ printk(KERN_INFO "base_msr=0x%016llx, apicbase = 0x%08lx\n",
+ apic->base_msr, apic->base_address);
+
+ tmp = (u64*)(apic->regs + APIC_IRR);
+ printk(KERN_INFO "IRR = 0x%016llx 0x%016llx 0x%016llx 0x%016llx\n",
+ tmp[3], tmp[2], tmp[1], tmp[0]);
+ tmp = (u64*)(apic->regs + APIC_ISR);
+ printk(KERN_INFO "ISR = 0x%016llx 0x%016llx 0x%016llx 0x%016llx\n",
+ tmp[3], tmp[2], tmp[1], tmp[0]);
+ tmp = (u64*)(apic->regs + APIC_TMR);
+ printk(KERN_INFO "TMR = 0x%016llx 0x%016llx 0x%016llx 0x%016llx\n",
+ tmp[3], tmp[2], tmp[1], tmp[0]);
+
+ printk(KERN_INFO "APIC_ID=0x%08x\n", apic_get_reg(apic, APIC_ID));
+ printk(KERN_INFO "APIC_TASKPRI=0x%08x\n",
+ apic_get_reg(apic, APIC_TASKPRI) & 0xff);
+ printk(KERN_INFO "APIC_PROCPRI=0x%08x\n",
+ apic_get_reg(apic, APIC_PROCPRI));
+
+ printk(KERN_INFO "APIC_DFR=0x%08x\n",
+ apic_get_reg(apic, APIC_DFR) | 0x0FFFFFFF);
+ printk(KERN_INFO "APIC_LDR=0x%08x\n",
+ apic_get_reg(apic, APIC_LDR) & APIC_LDR_MASK);
+ printk(KERN_INFO "APIC_SPIV=0x%08x\n",
+ apic_get_reg(apic, APIC_SPIV) & 0x3ff);
+ printk(KERN_INFO "APIC_ESR=0x%08x\n",
+ apic_get_reg(apic, APIC_ESR));
+ printk(KERN_INFO "APIC_ICR=0x%08x\n",
+ apic_get_reg(apic, APIC_ICR) & ~(1 << 12));
+ printk(KERN_INFO "APIC_ICR2=0x%08x\n",
+ apic_get_reg(apic, APIC_ICR2) & 0xff000000);
+
+ printk(KERN_INFO "APIC_LVTERR=0x%08x\n",
+ apic_get_reg(apic, APIC_LVTERR));
+ printk(KERN_INFO "APIC_LVT1=0x%08x\n",
+ apic_get_reg(apic, APIC_LVT1));
+ printk(KERN_INFO "APIC_LVT0=0x%08x\n",
+ apic_get_reg(apic, APIC_LVT0));
+ printk(KERN_INFO "APIC_LVTPC=0x%08x\n",
+ apic_get_reg(apic, APIC_LVTPC));
+ printk(KERN_INFO "APIC_LVTTHMR=0x%08x\n",
+ apic_get_reg(apic, APIC_LVTTHMR));
+ printk(KERN_INFO "APIC_LVTT=0x%08x\n",
+ apic_get_reg(apic, APIC_LVTT));
+
+ printk(KERN_INFO "APIC_TMICT=0x%08x\n",
+ apic_get_reg(apic, APIC_TMICT));
+ printk(KERN_INFO "APIC_TDCR=0x%08x\n",
+ apic_get_reg(apic, APIC_TDCR));
+
+ printk(KERN_INFO "%s end\n", __FUNCTION__);
+}
+#endif
+
+
+static u32 apic_update_ppr(struct kvm_kern_apic *apic)
+{
+ u32 tpr, isrv, ppr;
+ int isr;
+
+ tpr = apic_get_reg(apic, APIC_TASKPRI);
+ isr = apic_find_highest_isr(apic);
+ isrv = (isr >> 4) & 0xf;
+
+ if ((tpr >> 4) >= isrv)
+ ppr = tpr & 0xff;
+ else
+ ppr = isrv << 4; /* low 4 bits of PPR have to be cleared */
+
+ apic_set_reg(apic, APIC_PROCPRI, ppr);
+
+ pr_debug("%s: ppr 0x%x, isr 0x%x, isrv 0x%x\n",
+ __FUNCTION__, ppr, isr, isrv);
+
+ return ppr;
+}
+
+static int apic_match_dest(struct kvm_kern_apic *target,
+ int dest,
+ int dest_mode,
+ int delivery_mode)
+{
+ int result = 0;
+
+ spin_lock_bh(&target->lock);
+
+ if (!dest_mode) /* Physical */
+ result = (GET_APIC_ID(apic_get_reg(target, APIC_ID)) == dest);
+ else { /* Logical */
+ u32 ldr = apic_get_reg(target, APIC_LDR);
+
+ /* Flat mode */
+ if (apic_get_reg(target, APIC_DFR) == APIC_DFR_FLAT)
+ result = GET_APIC_LOGICAL_ID(ldr) & dest;
+ else {
+ if ((delivery_mode == APIC_DM_LOWEST) &&
+ (dest == 0xff)) {
+ printk(KERN_ALERT "Broadcast IPI " \
+ "with lowest priority "
+ "delivery mode\n");
+ spin_unlock_bh(&target->lock);
+ kvm_crash_guest(target->vcpu->kvm);
+ return 0;
+ }
+ if (GET_APIC_LOGICAL_ID(ldr) == (dest & 0xf))
+ result = (GET_APIC_LOGICAL_ID(ldr) >> 4) &
+ (dest >> 4);
+ else
+ result = 0;
+ }
+ }
+
+ spin_unlock_bh(&target->lock);
+
+ return result;
+}
+
+/*
+ * Add a pending IRQ to the LAPIC.
+ * Return 1 if successfully added and 0 if discarded.
+ */
+static int __apic_accept_irq(struct kvm_kern_apic *apic,
+ int delivery_mode,
+ int vector,
+ int level,
+ int trig_mode)
+{
+ int result = 0;
+
+ switch (delivery_mode) {
+ case APIC_DM_FIXED:
+ case APIC_DM_LOWEST:
+ /* FIXME add logic for vcpu on reset */
+ if (unlikely(apic == NULL || !apic_enabled(apic)))
+ break;
+
+ if (test_and_set_bit(vector, apic->regs + APIC_IRR) &&
+ trig_mode) {
+ pr_debug("level trig mode repeatedly for vector %d\n",
+ vector);
+ break;
+ }
+
+ if (trig_mode) {
+ pr_debug("level trig mode for vector %d\n", vector);
+ set_bit(vector, apic->regs + APIC_TMR);
+ }
+
+
+ kvm_irqdevice_set_intr(apic->irq_dev,
+ kvm_irqpin_localint,
+ trig_mode, level);
+ result = 1;
+ break;
+
+ case APIC_DM_REMRD:
+ printk(KERN_WARNING "%s: ignoring delivery mode %d\n",
+ __FUNCTION__, delivery_mode);
+ break;
+ case APIC_DM_EXTINT:
+ case APIC_DM_SMI:
+ case APIC_DM_NMI: {
+ kvm_irqpin_t pin = kvm_irqpin_invalid;
+
+ switch (delivery_mode) {
+ case APIC_DM_EXTINT:
+ pin = kvm_irqpin_extint;
+ break;
+ case APIC_DM_SMI:
+ pin = kvm_irqpin_smi;
+ break;
+ case APIC_DM_NMI:
+ pin = kvm_irqpin_nmi;
+ break;
+ default:
+ panic("KVM: illegal delivery_mode");
+ }
+
+ kvm_irqdevice_set_intr(apic->irq_dev, pin, trig_mode, level);
+ result = 1;
+ break;
+ }
+ case APIC_DM_INIT:
+ case APIC_DM_STARTUP: /* FIXME: currently no support for SMP */
+ default:
+ printk(KERN_ALERT "TODO: unsupported interrupt type %x\n",
+ delivery_mode);
+ kvm_crash_guest(apic->vcpu->kvm);
+ break;
+ }
+
+ return result;
+}
+
+static int apic_accept_irq(struct kvm_kern_apic *apic,
+ int delivery_mode,
+ int vector,
+ int level,
+ int trig_mode)
+{
+ int ret;
+
+ spin_lock_bh(&apic->lock);
+ ret = __apic_accept_irq(apic, delivery_mode, vector,
+ level, trig_mode);
+ spin_unlock_bh(&apic->lock);
+
+ return ret;
+}
+
+static void apic_EOI_set(struct kvm_kern_apic *apic)
+{
+ int vector = apic_find_highest_isr(apic);
+
+ /*
+ * Not every EOI write has a corresponding ISR bit set;
+ * one example is the kernel's timer check in setup_IO_APIC
+ */
+ if (!vector)
+ return;
+
+ __clear_bit(vector, apic->regs + APIC_ISR);
+ apic_update_ppr(apic);
+
+ __clear_bit(vector, apic->regs + APIC_TMR);
+}
+
+static int apic_check_vector(struct kvm_kern_apic *apic, u32 dm, u32 vector)
+{
+ if ((dm == APIC_DM_FIXED) && (vector < 16)) {
+ apic->err_status |= 0x40;
+ __apic_accept_irq(apic, APIC_DM_FIXED,
+ apic_lvt_vector(apic, APIC_LVTERR), 0, 0);
+ pr_debug("%s: check failed "
+ " dm %x vector %x\n", __FUNCTION__, dm, vector);
+ return 0;
+ }
+ return 1;
+}
+
+int kvm_apicbus_send(struct kvm *kvm, int dest, int trig_mode, int level,
+ int dest_mode, int delivery_mode, int vector)
+{
+ int i;
+ u32 lpr_map = 0;
+
+ for (i = 0; i < KVM_MAX_VCPUS; ++i) {
+ struct kvm_kern_apic *target;
+ target = kvm->vcpus[i].apic.private;
+
+ if (!target)
+ continue;
+
+ if (apic_match_dest(target, dest, dest_mode, delivery_mode)) {
+ if (delivery_mode == APIC_DM_LOWEST)
+ __set_bit(target->vcpu_id, &lpr_map);
+ else
+ apic_accept_irq(target, delivery_mode,
+ vector, level, trig_mode);
+ }
+ }
+
+ if (delivery_mode == APIC_DM_LOWEST) {
+ struct kvm_kern_apic *target;
+
+ /* Currently only UP is supported */
+ target = kvm->vcpus[0].apic.private;
+
+ if (target)
+ apic_accept_irq(target, delivery_mode,
+ vector, level, trig_mode);
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_apicbus_send);
+
+static void apic_ipi(struct kvm_kern_apic *apic)
+{
+ u32 icr_low = apic_get_reg(apic, APIC_ICR);
+ u32 icr_high = apic_get_reg(apic, APIC_ICR2);
+
+ unsigned int dest = GET_APIC_DEST_FIELD(icr_high);
+ unsigned int short_hand = icr_low & APIC_SHORT_MASK;
+ unsigned int trig_mode = icr_low & APIC_INT_LEVELTRIG;
+ unsigned int level = icr_low & APIC_INT_ASSERT;
+ unsigned int dest_mode = icr_low & APIC_DEST_MASK;
+ unsigned int delivery_mode = icr_low & APIC_MODE_MASK;
+ unsigned int vector = icr_low & APIC_VECTOR_MASK;
+
+ pr_debug("icr_high 0x%x, icr_low 0x%x, "
+ "short_hand 0x%x, dest 0x%x, trig_mode 0x%x, level 0x%x, "
+ "dest_mode 0x%x, delivery_mode 0x%x, vector 0x%x\n",
+ icr_high, icr_low, short_hand, dest,
+ trig_mode, level, dest_mode, delivery_mode, vector);
+
+ /*
+ * We unlock here because we enter this function with the lock
+ * held and we don't want to hold it while we transmit
+ */
+ spin_unlock_bh(&apic->lock);
+
+ if (short_hand == APIC_DEST_NOSHORT)
+ /*
+ * If no short-hand notation is in use, just forward the
+ * message onto the apicbus and let the bus handle the routing.
+ */
+ kvm_apicbus_send(apic->vcpu->kvm, dest, trig_mode, level,
+ dest_mode, delivery_mode, vector);
+ else {
+ /*
+ * Otherwise we need to consider the short-hand to find the
+ * correct targets.
+ */
+ unsigned int i;
+
+ for (i = 0; i < KVM_MAX_VCPUS; ++i) {
+ struct kvm_kern_apic *target;
+ int result = 0;
+
+ target = apic->vcpu->kvm->vcpus[i].apic.private;
+
+ if (!target)
+ continue;
+
+ switch (short_hand) {
+ case APIC_DEST_SELF:
+ if (target == apic)
+ result = 1;
+ break;
+ case APIC_DEST_ALLINC:
+ result = 1;
+ break;
+
+ case APIC_DEST_ALLBUT:
+ if (target != apic)
+ result = 1;
+ break;
+ }
+
+ if (result)
+ apic_accept_irq(target, delivery_mode,
+ vector, level, trig_mode);
+ }
+ }
+
+ /*
+ * Relock before returning
+ */
+ spin_lock_bh(&apic->lock);
+
+}
+
+static u32 apic_get_tmcct(struct kvm_kern_apic *apic)
+{
+ u32 counter_passed;
+ ktime_t passed, now = kvm_apictimer_now(&apic->timer.dev);
+ u32 tmcct = apic_get_reg(apic, APIC_TMCCT);
+
+ ASSERT(apic != NULL);
+
+ if (unlikely(ktime_to_ns(now) <=
+ ktime_to_ns(apic->timer.last_update))) {
+ /* Wrap around */
+ passed = ktime_add(
+ ({ (ktime_t){
+ .tv64 = KTIME_MAX -
+ (apic->timer.last_update).tv64 };
+ }), now);
+ pr_debug("time elapsed\n");
+ } else
+ passed = ktime_sub(now, apic->timer.last_update);
+
+ counter_passed = ktime_to_ns(passed) /
+ (APIC_BUS_CYCLE_NS * apic->timer.divide_count);
+ tmcct -= counter_passed;
+
+ if (tmcct <= 0) {
+ if (unlikely(!apic_lvtt_period(apic))) {
+ tmcct = 0;
+ } else {
+ do {
+ tmcct += apic_get_reg(apic, APIC_TMICT);
+ } while ( tmcct <= 0 );
+ }
+ }
+
+ apic->timer.last_update = now;
+ apic_set_reg(apic, APIC_TMCCT, tmcct);
+
+ return tmcct;
+}
+
+static void apic_read_aligned(struct kvm_kern_apic *apic,
+ unsigned int offset,
+ unsigned int len,
+ unsigned int *result)
+{
+ ASSERT(len == 4 && offset > 0 && offset <= APIC_TDCR);
+ *result = 0;
+
+ switch (offset) {
+ case APIC_ARBPRI:
+ printk(KERN_WARNING "access local APIC ARBPRI register " \
+ "which is for P6\n");
+ break;
+
+ case APIC_TMCCT: /* Timer CCR */
+ *result = apic_get_tmcct(apic);
+ break;
+
+ case APIC_ESR:
+ apic->err_write_count = 0;
+ *result = apic_get_reg(apic, offset);
+ break;
+
+ default:
+ *result = apic_get_reg(apic, offset);
+ break;
+ }
+}
+
+static unsigned long __apic_read(struct kvm_kern_apic *apic,
+ unsigned long address,
+ unsigned long len)
+{
+ unsigned int alignment;
+ unsigned int tmp;
+ unsigned long result;
+ unsigned int offset = address - apic->base_address;
+
+ if (offset > APIC_TDCR)
+ return 0;
+
+ /* some kernel bugs cause this to be read with byte width */
+ if (len != 4)
+ pr_debug("read with len=0x%lx, should be 4 instead.\n", len);
+
+ alignment = offset & 0x3;
+
+ apic_read_aligned(apic, offset & ~0x3, 4, &tmp);
+ switch (len) {
+ case 1:
+ result = *((unsigned char *)&tmp + alignment);
+ break;
+
+ case 2:
+ ASSERT(alignment != 3);
+ result = *(unsigned short *)((unsigned char *)&tmp +
+ alignment);
+ break;
+
+ case 4:
+ ASSERT(alignment == 0);
+ result = *(unsigned int *)((unsigned char *)&tmp + alignment);
+ break;
+
+ default:
+ printk(KERN_ALERT "Local APIC read with len=0x%lx, should " \
+ "be 4 instead.\n", len);
+ kvm_crash_guest(apic->vcpu->kvm);
+ result = 0; /* to make gcc happy */
+ break;
+ }
+
+ pr_debug("%s: offset 0x%x with length 0x%lx, "
+ "and the result is 0x%lx\n", __FUNCTION__,
+ offset, len, result);
+
+ return result;
+}
+
+/*
+ *----------------------------------------------------------------------
+ * MMIO
+ *----------------------------------------------------------------------
+ */
+
+static unsigned long apic_mmio_read(struct kvm_io_device *this,
+ gpa_t address,
+ int len)
+{
+ struct kvm_kern_apic *apic = (struct kvm_kern_apic*)this->private;
+ unsigned long result;
+
+ spin_lock_bh(&apic->lock);
+ result = __apic_read(apic, address, len);
+ spin_unlock_bh(&apic->lock);
+
+ return result;
+}
+
+static void apic_mmio_write(struct kvm_io_device *this,
+ gpa_t address,
+ int len,
+ unsigned long val)
+{
+ struct kvm_kern_apic *apic = (struct kvm_kern_apic*)this->private;
+ unsigned int offset = address - apic->base_address;
+
+ spin_lock_bh(&apic->lock);
+
+ /* EOI writes are too frequent to log */
+ if (offset != APIC_EOI)
+ pr_debug("%s: offset 0x%x with length 0x%x, and value is " \
+ "0x%lx\n",
+ __FUNCTION__, offset, len, val);
+
+ /*
+ * According to the IA-32 manual, all registers should be accessed
+ * with 32-bit alignment.
+ */
+ if (len != 4) {
+ unsigned int tmp;
+ unsigned char alignment;
+
+ /* Some kernels will access with byte/word alignment */
+ pr_debug("Notice: Local APIC write with len = %x\n", len);
+ alignment = offset & 0x3;
+ tmp = __apic_read(apic, offset & ~0x3, 4);
+ switch (len) {
+ case 1:
+ /*
+ * XXX: val is a temporary from the caller, so modifying it
+ * here is safe for now, but the read-modify-write below
+ * should use a local variable eventually
+ */
+ val = (tmp & ~(0xff << (8*alignment))) |
+ ((val & 0xff) << (8*alignment));
+ break;
+
+ case 2:
+ if (alignment != 0x0 && alignment != 0x2) {
+ printk(KERN_ALERT "alignment error for apic " \
+ "with len == 2\n");
+ kvm_crash_guest(apic->vcpu->kvm);
+ }
+
+ val = (tmp & ~(0xffff << (8*alignment))) |
+ ((val & 0xffff) << (8*alignment));
+ break;
+
+ case 3:
+ /* can this actually happen? */
+ printk(KERN_ALERT "apic_write with len = 3 !!!\n");
+ kvm_crash_guest(apic->vcpu->kvm);
+ break;
+
+ default:
+ printk(KERN_ALERT "Local APIC write with len = %x, " \
+ "should be 4 instead\n", len);
+ kvm_crash_guest(apic->vcpu->kvm);
+ break;
+ }
+ }
+
+ offset &= 0xff0;
+
+ switch (offset) {
+ case APIC_ID: /* Local APIC ID */
+ apic_set_reg(apic, APIC_ID, val);
+ break;
+
+ case APIC_TASKPRI:
+ apic_set_reg(apic, APIC_TASKPRI, val & 0xff);
+ apic_update_ppr(apic);
+ break;
+
+ case APIC_EOI:
+ apic_EOI_set(apic);
+ break;
+
+ case APIC_LDR:
+ apic_set_reg(apic, APIC_LDR, val & APIC_LDR_MASK);
+ break;
+
+ case APIC_DFR:
+ apic_set_reg(apic, APIC_DFR, val | 0x0FFFFFFF);
+ break;
+
+ case APIC_SPIV:
+ apic_set_reg(apic, APIC_SPIV, val & 0x3ff);
+ if (!(val & APIC_SPIV_APIC_ENABLED)) {
+ int i;
+ u32 lvt_val;
+
+ apic->status |= APIC_SOFTWARE_DISABLE_MASK;
+ for (i = 0; i < APIC_LVT_NUM; i++) {
+ lvt_val = apic_get_reg(apic,
+ APIC_LVTT +
+ 0x10 * i);
+ apic_set_reg(apic, APIC_LVTT + 0x10 * i,
+ lvt_val | APIC_LVT_MASKED);
+ }
+
+ if ((apic_get_reg(apic, APIC_LVT0) &
+ APIC_MODE_MASK) == APIC_DM_EXTINT)
+ clear_bit(_APIC_BSP_ACCEPT_PIC, &apic->status);
+ } else {
+ apic->status &= ~APIC_SOFTWARE_DISABLE_MASK;
+ if ((apic_get_reg(apic, APIC_LVT0) &
+ APIC_MODE_MASK) == APIC_DM_EXTINT)
+ set_bit(_APIC_BSP_ACCEPT_PIC, &apic->status);
+ }
+ break;
+
+ case APIC_ESR:
+ apic->err_write_count = !apic->err_write_count;
+ if (!apic->err_write_count)
+ apic->err_status = 0;
+ break;
+
+ case APIC_ICR:
+ /* No delay here, so we always clear the pending bit */
+ apic_set_reg(apic, APIC_ICR, val & ~(1 << 12));
+ apic_ipi(apic);
+ break;
+
+ case APIC_ICR2:
+ apic_set_reg(apic, APIC_ICR2, val & 0xff000000);
+ break;
+
+ case APIC_LVTT:
+ case APIC_LVTTHMR:
+ case APIC_LVTPC:
+ case APIC_LVT0:
+ case APIC_LVT1:
+ case APIC_LVTERR:
+ {
+ if (apic->status & APIC_SOFTWARE_DISABLE_MASK)
+ val |= APIC_LVT_MASKED;
+
+ val &= apic_lvt_mask[(offset - APIC_LVTT) >> 4];
+ apic_set_reg(apic, offset, val);
+
+ /* On real hardware, writing a vector below 0x20 raises an error */
+ if (!(val & APIC_LVT_MASKED))
+ apic_check_vector(apic, apic_lvt_dm(apic, offset),
+ apic_lvt_vector(apic, offset));
+ if (!apic->vcpu_id && (offset == APIC_LVT0)) {
+ if ((val & APIC_MODE_MASK) == APIC_DM_EXTINT) {
+ if (val & APIC_LVT_MASKED)
+ clear_bit(_APIC_BSP_ACCEPT_PIC,
+ &apic->status);
+ else
+ set_bit(_APIC_BSP_ACCEPT_PIC,
+ &apic->status);
+ } else
+ clear_bit(_APIC_BSP_ACCEPT_PIC,
+ &apic->status);
+ }
+ }
+ break;
+
+ case APIC_TMICT:
+ {
+ ktime_t now = kvm_apictimer_now(&apic->timer.dev);
+ u32 offset;
+
+ apic_set_reg(apic, APIC_TMICT, val);
+ apic_set_reg(apic, APIC_TMCCT, val);
+ apic->timer.last_update = now;
+ offset = APIC_BUS_CYCLE_NS * apic->timer.divide_count * val;
+
+ /* Make sure the lock ordering is coherent */
+ spin_unlock_bh(&apic->lock);
+ kvm_apictimer_stop(&apic->timer.dev);
+ kvm_apictimer_start(&apic->timer.dev,
+ ktime_add_ns(now, offset));
+
+ pr_debug("%s: bus cycle is %"PRId64"ns, now 0x%016"PRIx64", "
+ "timer initial count 0x%x, offset 0x%x, "
+ "expire @ 0x%016"PRIx64".\n", __FUNCTION__,
+ APIC_BUS_CYCLE_NS, ktime_to_ns(now),
+ apic_get_reg(apic, APIC_TMICT),
+ offset, ktime_to_ns(ktime_add_ns(now, offset)));
+ }
+ return;
+
+ case APIC_TDCR:
+ {
+ unsigned int tmp1, tmp2;
+
+ tmp1 = val & 0xf;
+ tmp2 = ((tmp1 & 0x3) | ((tmp1 & 0x8) >> 1)) + 1;
+ apic->timer.divide_count = 0x1 << (tmp2 & 0x7);
+
+ apic_set_reg(apic, APIC_TDCR, val);
+
+ pr_debug("timer divide count is 0x%x\n",
+ apic->timer.divide_count);
+ }
+ break;
+
+ default:
+ printk(KERN_WARNING "Local APIC Write to read-only register\n");
+ break;
+ }
+
+ spin_unlock_bh(&apic->lock);
+}
+
+static int apic_mmio_range(struct kvm_io_device *this, gpa_t addr)
+{
+ struct kvm_kern_apic *apic = (struct kvm_kern_apic*)this->private;
+ int ret = 0;
+
+ spin_lock_bh(&apic->lock);
+
+ if (apic_global_enabled(apic) &&
+ (addr >= apic->base_address) &&
+ (addr < apic->base_address + VLOCAL_APIC_MEM_LENGTH))
+ ret = 1;
+
+ spin_unlock_bh(&apic->lock);
+
+ return ret;
+}
+
+static void apic_mmio_register(struct kvm_kern_apic *apic)
+{
+ /* Register ourselves with the MMIO subsystem */
+ struct kvm_io_device *dev = &apic->mmio_dev;
+
+ dev->read = apic_mmio_read;
+ dev->write = apic_mmio_write;
+ dev->in_range = apic_mmio_range;
+
+ dev->private = apic;
+ atomic_inc(&apic->ref_count);
+
+ apic->vcpu->apic_mmio = dev;
+}
+
+/*
+ *----------------------------------------------------------------------
+ * LAPIC interface
+ *----------------------------------------------------------------------
+ */
+
+static void apic_lapic_set_tpr(struct kvm_lapic *this, u64 cr8, int flags)
+{
+ struct kvm_kern_apic* apic = (struct kvm_kern_apic*)this->private;
+
+ /*
+ * Usermode applications should not be trying to modify the TPR if
+ * in-kernel interrupts are enabled
+ */
+ BUG_ON(flags & KVM_LAPICFLAGS_USERMODE);
+
+ spin_lock_bh(&apic->lock);
+ apic_set_reg(apic, APIC_TASKPRI, ((cr8 & 0x0f) << 4));
+ apic_update_ppr(apic);
+ spin_unlock_bh(&apic->lock);
+}
+
+static u64 apic_lapic_get_tpr(struct kvm_lapic *this)
+{
+ struct kvm_kern_apic* apic = (struct kvm_kern_apic*)this->private;
+ u64 tpr;
+
+ spin_lock_bh(&apic->lock);
+ tpr = (u64)apic_get_reg(apic, APIC_TASKPRI);
+ spin_unlock_bh(&apic->lock);
+
+ return (tpr & 0xf0) >> 4;
+}
+
+static void apic_lapic_set_base(struct kvm_lapic *this, u64 value, int flags)
+{
+ struct kvm_kern_apic *apic = (struct kvm_kern_apic*)this->private;
+
+ /*
+ * Usermode applications should not be trying to modify the APICBASE if
+ * in-kernel interrupts are enabled
+ */
+ BUG_ON(flags & KVM_LAPICFLAGS_USERMODE);
+
+ spin_lock_bh(&apic->lock);
+ if (apic->vcpu_id)
+ value &= ~MSR_IA32_APICBASE_BSP;
+
+ apic->base_msr = value;
+ apic->base_address = apic->base_msr & MSR_IA32_APICBASE_BASE;
+
+ /* with FSB delivery interrupt, we can restart APIC functionality */
+ if (!(value & MSR_IA32_APICBASE_ENABLE))
+ set_bit(_APIC_GLOB_DISABLE, &apic->status);
+ else
+ clear_bit(_APIC_GLOB_DISABLE, &apic->status);
+
+ pr_debug("apic base msr is 0x%016"PRIx64", and base address is " \
+ "0x%lx.\n", apic->base_msr, apic->base_address);
+
+ spin_unlock_bh(&apic->lock);
+}
+
+static u64 apic_lapic_get_base(struct kvm_lapic *this)
+{
+ struct kvm_kern_apic *apic = (struct kvm_kern_apic*)this->private;
+ u64 base;
+
+ spin_lock_bh(&apic->lock);
+ base = apic->base_msr;
+ spin_unlock_bh(&apic->lock);
+
+ return base;
+}
+
+static void apic_lapic_reset(struct kvm_lapic *this)
+{
+ struct kvm_kern_apic *apic = (struct kvm_kern_apic*)this->private;
+
+ apic_reset(apic);
+}
+
+static int apic_lapic_enabled(struct kvm_lapic *this)
+{
+ struct kvm_kern_apic *apic = (struct kvm_kern_apic*)this->private;
+ int ret;
+
+ spin_lock_bh(&apic->lock);
+ ret = apic_enabled(apic);
+ spin_unlock_bh(&apic->lock);
+
+ return ret;
+}
+
+static void apic_lapic_destructor(struct kvm_lapic *this)
+{
+ struct kvm_kern_apic *apic = (struct kvm_kern_apic*)this->private;
+ apic_dropref(apic);
+}
+
+static void apic_lapic_register(struct kvm_kern_apic *apic)
+{
+ struct kvm_lapic *dev = apic->dev;
+
+ dev->set_tpr = apic_lapic_set_tpr;
+ dev->get_tpr = apic_lapic_get_tpr;
+ dev->set_base = apic_lapic_set_base;
+ dev->get_base = apic_lapic_get_base;
+ dev->reset = apic_lapic_reset;
+ dev->enabled = apic_lapic_enabled;
+ dev->destructor = apic_lapic_destructor;
+
+ dev->private = apic;
+ atomic_inc(&apic->ref_count);
+}
+
+/*
+ *----------------------------------------------------------------------
+ * timer interface
+ *----------------------------------------------------------------------
+ */
+static int __apic_timer_fn(struct kvm_kern_apic *apic)
+{
+ u32 vector;
+ ktime_t now;
+ int result = 0;
+
+ if (unlikely(!apic_enabled(apic) ||
+ !apic_lvt_enabled(apic, APIC_LVTT))) {
+ pr_debug("%s: timer interrupt although apic is down\n",
+ __FUNCTION__);
+ return 0;
+ }
+
+ vector = apic_lvt_vector(apic, APIC_LVTT);
+ now = kvm_apictimer_now(&apic->timer.dev);
+ apic->timer.last_update = now;
+ apic->timer.pending++;
+
+ __apic_accept_irq(apic, APIC_DM_FIXED, vector, 1, 0);
+
+ if (apic_lvtt_period(apic)) {
+ u32 offset;
+ u32 tmict = apic_get_reg(apic, APIC_TMICT);
+
+ apic_set_reg(apic, APIC_TMCCT, tmict);
+ offset = APIC_BUS_CYCLE_NS * apic->timer.divide_count * tmict;
+
+ result = 1;
+ kvm_apictimer_update(&apic->timer.dev,
+ ktime_add_ns(now, offset));
+
+ pr_debug("%s: now 0x%016"PRIx64", expire @ 0x%016"PRIx64", "
+ "timer initial count 0x%x, timer current count 0x%x.\n",
+ __FUNCTION__,
+ ktime_to_ns(now), ktime_to_ns(ktime_add_ns(now, offset)),
+ apic_get_reg(apic, APIC_TMICT),
+ apic_get_reg(apic, APIC_TMCCT));
+ } else {
+ apic_set_reg(apic, APIC_TMCCT, 0);
+ pr_debug("%s: now 0x%016"PRIx64", "
+ "timer initial count 0x%x, timer current count 0x%x.\n",
+ __FUNCTION__,
+ ktime_to_ns(now), apic_get_reg(apic, APIC_TMICT),
+ apic_get_reg(apic, APIC_TMCCT));
+ }
+
+ return result;
+}
+
+static int apic_timer_fn(void *private)
+{
+ struct kvm_kern_apic *apic = private;
+ int restart_timer = 0;
+
+ spin_lock_bh(&apic->lock);
+ restart_timer = __apic_timer_fn(apic);
+ spin_unlock_bh(&apic->lock);
+
+ return restart_timer;
+}
+
+/*
+ *----------------------------------------------------------------------
+ * IRQDEVICE interface
+ *----------------------------------------------------------------------
+ */
+
+static int apic_irqdev_ack(struct kvm_irqdevice *this, int *vector)
+{
+ struct kvm_kern_apic *apic = (struct kvm_kern_apic*)this->private;
+ int ret = 0;
+ int irq;
+
+ spin_lock_bh(&apic->lock);
+
+ if (!apic_enabled(apic))
+ goto out;
+
+ if (vector) {
+ irq = apic_find_highest_irr(apic);
+ if ((irq & 0xf0) > apic_get_reg(apic, APIC_PROCPRI)) {
+ BUG_ON (irq < 0x10);
+
+ __set_bit(irq, apic->regs + APIC_ISR);
+ __clear_bit(irq, apic->regs + APIC_IRR);
+ apic_update_ppr(apic);
+
+ /*
+ * We have to special case the timer interrupt
+ * because we want the vector to stay pending
+ * for each tick of the clock, even for a backlog.
+ * Therefore, if this was a timer vector and we
+ * still have ticks pending, keep IRR set
+ */
+ if (irq == apic_lvt_vector(apic, APIC_LVTT)) {
+ BUG_ON(!apic->timer.pending);
+ apic->timer.pending--;
+ if (apic->timer.pending)
+ __set_bit(irq, apic->regs + APIC_IRR);
+ }
+
+ ret |= KVM_IRQACK_VALID;
+ *vector = irq;
+ }
+ else
+ *vector = -1;
+ }
+
+ /*
+ * Read it again to see if anything is still pending above TPR
+ */
+ irq = apic_find_highest_irr(apic);
+ if ((irq & 0xf0) > apic_get_reg(apic, APIC_PROCPRI))
+ ret |= KVM_IRQACK_AGAIN;
+ else {
+ /*
+ * See if there is anything masked by TPR
+ */
+ /* find_first_bit() takes its size argument in bits */
+ irq = find_first_bit(apic->regs + APIC_IRR,
+ MAX_APIC_INT_VECTOR);
+ if (irq < MAX_APIC_INT_VECTOR &&
+ ((irq & 0xf0) <= apic_get_reg(apic, APIC_PROCPRI))) {
+ ret |= KVM_IRQACK_TPRMASK;
+ }
+ }
+
+ out:
+ spin_unlock_bh(&apic->lock);
+
+ return ret;
+}
+
+static int apic_irqdev_set_pin(struct kvm_irqdevice* this, int irq, int level)
+{
+ /*
+ * set_pin() is a no-op on the APIC. You must inject interrupts by
+ * passing messages on the APIC bus.
+ */
+ return 0;
+}
+
+static int apic_irqdev_summary(struct kvm_irqdevice *this, void *data)
+{
+ /* FIXME */
+ return 0;
+}
+
+static void apic_irqdev_destructor(struct kvm_irqdevice *this)
+{
+ struct kvm_kern_apic *apic = (struct kvm_kern_apic*)this->private;
+
+ apic_dropref(apic);
+}
+
+static void apic_irqdev_register(struct kvm_kern_apic *apic)
+{
+ struct kvm_irqdevice *dev = apic->irq_dev;
+
+ dev->ack = apic_irqdev_ack;
+ dev->set_pin = apic_irqdev_set_pin;
+ dev->summary = apic_irqdev_summary;
+ dev->destructor = apic_irqdev_destructor;
+
+ dev->private = apic;
+ atomic_inc(&apic->ref_count);
+}
+
+static int apic_reset(struct kvm_kern_apic *apic)
+{
+ struct kvm_vcpu *vcpu;
+ int i;
+
+ printk(KERN_INFO "%s\n", __FUNCTION__);
+ ASSERT(apic != NULL);
+ vcpu = apic->vcpu;
+ ASSERT(vcpu != NULL);
+
+ /* Stop the timer in case it's a reset to an active apic */
+ kvm_apictimer_stop(&apic->timer.dev);
+
+ spin_lock_bh(&apic->lock);
+
+ apic_set_reg(apic, APIC_ID, vcpu_slot(vcpu) << 24);
+ apic_set_reg(apic, APIC_LVR, APIC_VERSION);
+
+ for (i = 0; i < APIC_LVT_NUM; i++)
+ apic_set_reg(apic, APIC_LVTT + 0x10 * i, APIC_LVT_MASKED);
+
+ apic_set_reg(apic, APIC_DFR, 0xffffffffU);
+ apic_set_reg(apic, APIC_SPIV, 0xff);
+ apic_set_reg(apic, APIC_TASKPRI, 0);
+ apic_set_reg(apic, APIC_LDR, 0);
+ apic_set_reg(apic, APIC_ESR, 0);
+ apic_set_reg(apic, APIC_ICR, 0);
+ apic_set_reg(apic, APIC_ICR2, 0);
+ apic_set_reg(apic, APIC_TDCR, 0);
+ apic_set_reg(apic, APIC_TMICT, 0);
+ memset((void*)(apic->regs + APIC_IRR), 0, KVM_IRQ_BITMAP_SIZE(u8));
+ memset((void*)(apic->regs + APIC_ISR), 0, KVM_IRQ_BITMAP_SIZE(u8));
+ memset((void*)(apic->regs + APIC_TMR), 0, KVM_IRQ_BITMAP_SIZE(u8));
+
+ apic->base_msr =
+ MSR_IA32_APICBASE_ENABLE |
+ APIC_DEFAULT_PHYS_BASE;
+ if (vcpu_slot(vcpu) == 0)
+ apic->base_msr |= MSR_IA32_APICBASE_BSP;
+ apic->base_address = apic->base_msr & MSR_IA32_APICBASE_BASE;
+
+ kvm_apictimer_init(&apic->timer.dev);
+ apic->timer.dev.function = apic_timer_fn;
+ apic->timer.divide_count = 0;
+ apic->timer.pending = 0;
+ apic->status = 0;
+
+#ifdef APIC_NO_BIOS
+ /*
+ * XXX According to mp specification, BIOS will enable LVT0/1,
+ * remove it after BIOS enabled
+ */
+ if (!vcpu_slot(vcpu)) {
+ apic_set_reg(apic, APIC_LVT0, APIC_MODE_EXTINT << 8);
+ apic_set_reg(apic, APIC_LVT1, APIC_MODE_NMI << 8);
+ set_bit(_APIC_BSP_ACCEPT_PIC, &apic->status);
+ }
+#endif
+
+ spin_unlock_bh(&apic->lock);
+
+ printk(KERN_INFO "%s: vcpu=%p, id=%d, base_msr=" \
+ "0x%016"PRIx64", base_address=0x%0lx.\n", __FUNCTION__, vcpu,
+ GET_APIC_ID(apic_get_reg(apic, APIC_ID)),
+ apic->base_msr, apic->base_address);
+
+ return 1;
+}
+
+int kvm_kern_lapic_init(struct kvm_vcpu *vcpu,
+ struct kvm_irqdevice *irq_dev)
+{
+ struct kvm_kern_apic *apic = NULL;
+
+ ASSERT(vcpu != NULL);
+ pr_debug("apic_init %d\n", vcpu_slot(vcpu));
+
+ apic = kzalloc(sizeof(*apic), GFP_KERNEL);
+ if (!apic)
+ goto nomem;
+
+ spin_lock_init(&apic->lock);
+ atomic_inc(&apic->ref_count);
+ apic->vcpu_id = vcpu_slot(vcpu);
+
+ apic->regs_page = alloc_page(GFP_KERNEL);
+ if (apic->regs_page == NULL) {
+ printk(KERN_ALERT "failed to allocate apic regs for vcpu %x\n",
+ vcpu_slot(vcpu));
+ goto nomem;
+ }
+ apic->regs = page_address(apic->regs_page);
+ memset(apic->regs, 0, PAGE_SIZE);
+
+ apic->vcpu = vcpu;
+ apic->irq_dev = irq_dev;
+ apic->dev = &vcpu->apic;
+
+ apic_irqdev_register(apic);
+ apic_lapic_register(apic);
+ apic_mmio_register(apic);
+
+ apic_reset(apic);
+ return 0;
+
+ nomem:
+ if (apic)
+ apic_dropref(apic);
+
+ return -ENOMEM;
+}
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 07bf353..4b0bd39 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -233,6 +233,17 @@ struct kvm_dirty_log {
};
};
+/* for KVM_APIC */
+struct kvm_apic_msg {
+ /* in */
+ __u32 dest;
+ __u32 trig_mode;
+ __u32 dest_mode;
+ __u32 delivery_mode;
+ __u32 vector;
+ __u32 padding;
+};
+
struct kvm_cpuid_entry {
__u32 function;
__u32 eax;
@@ -284,6 +295,9 @@ struct kvm_signal_mask {
#define KVM_CREATE_VCPU _IO(KVMIO, 0x41)
#define KVM_GET_DIRTY_LOG _IOW(KVMIO, 0x42, struct kvm_dirty_log)
#define KVM_SET_MEMORY_ALIAS _IOW(KVMIO, 0x43, struct kvm_memory_alias)
+#define KVM_ENABLE_KERNEL_PIC _IOW(KVMIO, 0x44, __u32)
+#define KVM_ISA_INTERRUPT _IOW(KVMIO, 0x45, struct kvm_interrupt)
+#define KVM_APIC_MSG _IOW(KVMIO, 0x46, struct kvm_apic_msg)
/*
* ioctls for vcpu fds
@@ -302,5 +316,5 @@ struct kvm_signal_mask {
#define KVM_SET_SIGNAL_MASK _IOW(KVMIO, 0x8b, struct kvm_signal_mask)
#define KVM_GET_FPU _IOR(KVMIO, 0x8c, struct kvm_fpu)
#define KVM_SET_FPU _IOW(KVMIO, 0x8d, struct kvm_fpu)
-
+#define KVM_APIC_RESET _IO(KVMIO, 0x8e)
#endif
-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
* Re: [PATCH 2/5] KVM: Add irqdevice object
[not found] ` <20070420030916.12408.80159.stgit-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
@ 2007-04-22 8:42 ` Avi Kivity
[not found] ` <462B1FD8.4080004-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 22+ messages in thread
From: Avi Kivity @ 2007-04-22 8:42 UTC (permalink / raw)
To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
Gregory Haskins wrote:
> The current code is geared towards using a user-mode (A)PIC. This patch adds
> an "irqdevice" abstraction, and implements a "userint" model to handle the
> duties of the original code. Later, we can develop other irqdevice models
> to handle objects like LAPIC, IOAPIC, i8259, etc., as appropriate.
>
> +
> +typedef enum {
> + kvm_irqpin_localint,
> + kvm_irqpin_extint,
> + kvm_irqpin_smi,
> + kvm_irqpin_nmi,
> + kvm_irqpin_invalid, /* must always be last */
> +}kvm_irqpin_t;
>
This describes the processor irq pins, as opposed to an interrupt
controller's irq pins, yes? If so, let the name reflect that (and let
there be a space after the closing brace).
> +
> +#define KVM_IRQACK_VALID (1 << 0)
> +#define KVM_IRQACK_AGAIN (1 << 1)
> +#define KVM_IRQACK_TPRMASK (1 << 2)
> +
> +struct kvm_irqsink {
> + void (*set_intr)(struct kvm_irqsink *this,
> + struct kvm_irqdevice *dev,
> + kvm_irqpin_t pin, int trigger, int value);
> +
> + void *private;
> +};
> +
> +struct kvm_irqdevice {
> + int (*ack)(struct kvm_irqdevice *this, int *vector);
> + int (*set_pin)(struct kvm_irqdevice *this, int pin, int level);
> + int (*summary)(struct kvm_irqdevice *this, void *data);
> + void (*destructor)(struct kvm_irqdevice *this);
>
[do we actually need a virtual destructor?]
> +/**
> + * kvm_irqdevice_ack - read and ack the highest priority vector from the device
> + * @dev: The device
> + * @vector: Retrieves the highest priority pending vector
> + * [ NULL = Don't ack a vector, just check pending status]
> + * [ non-NULL = Pointer to receive vector data (out only)]
> + *
> + * Description: Read the highest priority pending vector from the device,
> + * potentially invoking auto-EOI depending on device policy
> + *
> + * Returns: (int)
> + * [ -1 = failure]
> + * [>=0 = bitmap as follows: ]
> + * [ KVM_IRQACK_VALID = vector is valid]
> + * [ KVM_IRQACK_AGAIN = more unmasked vectors are available]
> + * [ KVM_IRQACK_TPRMASK = TPR masked vectors are blocked]
> + */
> +static inline int kvm_irqdevice_ack(struct kvm_irqdevice *dev,
> + int *vector)
> +{
> + return dev->ack(dev, vector);
> +}
>
This is an improvement over the previous patch, but I'm vaguely
disturbed by the complexity of the return code. I don't have an
alternative to suggest at this time, though.
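For readers following the thread, the three flag bits compose roughly like this; a hypothetical caller-side decoder (the helper name and shape are invented here, not part of the patch) might be:

```c
#include <assert.h>

#define KVM_IRQACK_VALID   (1 << 0)	/* *vector holds a valid vector   */
#define KVM_IRQACK_AGAIN   (1 << 1)	/* more unmasked vectors pending  */
#define KVM_IRQACK_TPRMASK (1 << 2)	/* pending vectors blocked by TPR */

/* Hypothetical caller-side decoder for the ->ack() return bitmap. */
static int decode_ack(int ret, int vector, int *inject, int *rearm)
{
	if (ret < 0)
		return -1;		/* device failure */
	*inject = (ret & KVM_IRQACK_VALID) ? vector : -1;
	/* either flavor of "more work pending" asks for another window */
	*rearm = !!(ret & (KVM_IRQACK_AGAIN | KVM_IRQACK_TPRMASK));
	return 0;
}
```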
> +
> +/**
> + * kvm_irqdevice_summary - loads a summary bitmask
> + * @dev: The device
> + * @data: A pointer to a region capable of holding a 256 bit bitmap
> + *
> + * Description: Loads a summary bitmask of all pending vectors (0-255)
> + *
> + * Returns: (int)
> + * [-1 = failure]
> + * [ 0 = success]
> + */
> +static inline int kvm_irqdevice_summary(struct kvm_irqdevice *dev, void *data)
> +{
> + return dev->summary(dev, data);
> +}
>
This really works only for the userint case. It can be dropped from the
generic interface IMO. Each interrupt controller will have its own save
restore interface which userspace will have to know about (as it has to
know about configuring the interrupt controller).
> +/**
> + * kvm_irqdevice_set_intr - invokes a registered INTR callback
> + * @dev: The device
> + * @pin: Identifies the pin to alter -
> + * [ KVM_IRQPIN_LOCALINT (default) - a vector is pending on this
> + * device]
> + * [ KVM_IRQPIN_EXTINT - a vector is pending on an external device]
> + * [ KVM_IRQPIN_SMI - system-management-interrupt pin]
> + * [ KVM_IRQPIN_NMI - non-maskable-interrupt pin]
> + * @trigger: sensitivity [0 = edge, 1 = level]
> + * @val: [0 = deassert (ignored for edge-trigger), 1 = assert]
> + *
> + * Description: Invokes a registered INTR callback (if present). This
> + * function is meant to be used privately by an irqdevice
> + * implementation.
> + *
> + * Returns: (void)
> + */
> +static inline void kvm_irqdevice_set_intr(struct kvm_irqdevice *dev,
> + kvm_irqpin_t pin, int trigger,
> + int val)
> +{
> + struct kvm_irqsink *sink = &dev->sink;
> + if (sink->set_intr)
> + sink->set_intr(sink, dev, pin, trigger, val);
> +}
>
Do you see more than one implementation for ->set_intr (e.g. for
cascading)? If not, it can be de-pointered.
Shouldn't 'trigger' be part of the pin configuration rather than passed
on every invocation?
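A minimal sketch of both suggestions combined (direct call instead of a pointer, trigger mode stored once at pin-configuration time); all names here are illustrative, not the patch's API:

```c
#include <assert.h>

enum irqpin { PIN_LOCALINT, PIN_EXTINT, PIN_SMI, PIN_NMI, PIN_MAX };

struct irqsink {
	unsigned long level_mask;	/* bit set => pin is level-sensitive */
	unsigned long pending;
};

/* trigger mode is configured once, per pin */
static void sink_config_pin(struct irqsink *s, enum irqpin pin, int level)
{
	if (level)
		s->level_mask |= 1UL << pin;
	else
		s->level_mask &= ~(1UL << pin);
}

/* set_intr() no longer needs a 'trigger' argument on every call */
static void sink_set_intr(struct irqsink *s, enum irqpin pin, int val)
{
	if (val)
		s->pending |= 1UL << pin;
	else if (s->level_mask & (1UL << pin))
		/* deassert is meaningful only for level-sensitive pins */
		s->pending &= ~(1UL << pin);
}
```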
>
> +/*
> + * Assumes lock already held
> + */
> +static inline int __kvm_vcpu_irq_all_pending(struct kvm_vcpu *vcpu)
> +{
> + unsigned long pending = vcpu->irq.pending;
> +
> + if (vcpu->irq.deferred != -1)
> + __set_bit(kvm_irqpin_localint, &pending);
> +
> + return pending;
> +}
> +
> +/*
> + * These two functions are helpers for determining if a standard interrupt
> + * is pending to replace the old "if (vcpu->irq_summary)" logic. If the
> + * caller wants to know about some of the new advanced interrupt types
> + * (SMI, NMI, etc) or to differentiate between localint and extint they will
> + * have to use the new API
> + */
> +static inline int __kvm_vcpu_irq_pending(struct kvm_vcpu *vcpu)
> +{
> + unsigned long pending = __kvm_vcpu_irq_all_pending(vcpu);
> +
> + if (test_bit(kvm_irqpin_localint, &pending) ||
> + test_bit(kvm_irqpin_extint, &pending))
> + return 1;
> +
> + return 0;
> +}
> +
> +static inline int kvm_vcpu_irq_pending(struct kvm_vcpu *vcpu)
> +{
> + int ret = 0;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&vcpu->irq.lock, flags);
> + ret = __kvm_vcpu_irq_pending(vcpu);
> + spin_unlock_irqrestore(&vcpu->irq.lock, flags);
>
The locking seems superfluous.
> +
> + return ret;
> +}
> +
> +/*
> + * Assumes lock already held
> + */
> +static inline int kvm_vcpu_irq_pop(struct kvm_vcpu *vcpu, int *vector)
> +{
> + int ret = 0;
> +
> + if (vcpu->irq.deferred != -1) {
> + if (vector) {
> + ret |= KVM_IRQACK_VALID;
> + *vector = vcpu->irq.deferred;
> + vcpu->irq.deferred = -1;
> + }
> + ret |= kvm_irqdevice_ack(&vcpu->irq.dev, NULL);
> + } else
> + ret = kvm_irqdevice_ack(&vcpu->irq.dev, vector);
> +
> + /*
> + * If there are no more interrupts and we are edge triggered,
> + * we must clear the status flag
> + */
> + if (!(ret & KVM_IRQACK_AGAIN))
> + __clear_bit(kvm_irqpin_localint, &vcpu->irq.pending);
>
Can localint actually be edge-triggered?
> +
> + return ret;
> +}
> +
> +static inline void __kvm_vcpu_irq_push(struct kvm_vcpu *vcpu, int irq)
> +{
> + BUG_ON(vcpu->irq.deferred != -1); /* We can only hold one deferred */
> +
> + vcpu->irq.deferred = irq;
> +}
> +
> +static inline void kvm_vcpu_irq_push(struct kvm_vcpu *vcpu, int irq)
> +{
> + unsigned long flags;
> +
> + spin_lock_irqsave(&vcpu->irq.lock, flags);
> + __kvm_vcpu_irq_push(vcpu, irq);
> + spin_unlock_irqrestore(&vcpu->irq.lock, flags);
> +}
> +
>
Can you explain the logic behind push()/pop()? I realize you inherited
it, but I don't think it fits well into the new model.
> @@ -2044,13 +2048,17 @@ static int kvm_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
> if (mmu_reset_needed)
> kvm_mmu_reset_context(vcpu);
>
> - memcpy(vcpu->irq_pending, sregs->interrupt_bitmap,
> - sizeof vcpu->irq_pending);
> - vcpu->irq_summary = 0;
> - for (i = 0; i < NR_IRQ_WORDS; ++i)
> - if (vcpu->irq_pending[i])
> - __set_bit(i, &vcpu->irq_summary);
> -
> + /*
> + * walk the interrupt-bitmap and inject an IRQ for each bit found
> + *
> + * note that we skip the first 16 vectors since they are reserved
> + * and should never be set by an interrupt source
> + */
> + for (i = 16; i < 256; ++i) {
> + int val = test_bit(i, &sregs->interrupt_bitmap[0]);
> + kvm_irqdevice_set_pin(&vcpu->irq.dev, i, val);
> + }
> +
>
Theory vs. practice. The bios will set the first pic to vectors 8-15.
> @@ -2319,6 +2321,51 @@ out1:
> }
>
> /*
> + * This function will be invoked whenever the vcpu->irq.dev raises its INTR
> + * line
> + */
> +static void kvm_vcpu_intr(struct kvm_irqsink *this,
> + struct kvm_irqdevice *dev,
> + kvm_irqpin_t pin, int trigger, int val)
> +{
> + struct kvm_vcpu *vcpu = (struct kvm_vcpu*)this->private;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&vcpu->irq.lock, flags);
> +
> + if (val && !test_bit(pin, &vcpu->irq.pending)) {
> + /*
> + * if the line is being asserted and we currently have
> + * it deasserted, we must record
> + */
> + __set_bit(pin, &vcpu->irq.pending);
> +
> + if (trigger)
> + __set_bit(pin, &vcpu->irq.trigger);
> + else
> + __clear_bit(pin, &vcpu->irq.trigger);
> +
> + } else if (!val && trigger)
> + /*
> + * if the level-sensitive line is being deasserted,
> + * record it.
> + */
> + __clear_bit(pin, &vcpu->irq.pending);
> +
> + spin_unlock_irqrestore(&vcpu->irq.lock, flags);
> +}
>
That's quite a mouthful :)
> * Creates some virtual cpus. Good luck creating more than one.
> */
> static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, int n)
> @@ -2364,6 +2411,12 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, int n)
> if (r < 0)
> goto out_free_vcpus;
>
> + kvm_irqdevice_init(&vcpu->irq.dev);
> + kvm_vcpu_irqsink_init(vcpu);
> + r = kvm_userint_init(vcpu);
> + if (r < 0)
> + goto out_free_vcpus;
>
Bad indent.
> static inline void clgi(void)
> {
> asm volatile (SVM_CLGI);
> @@ -892,7 +874,12 @@ static int pf_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
> int r;
>
> if (is_external_interrupt(exit_int_info))
> - push_irq(vcpu, exit_int_info & SVM_EVTINJ_VEC_MASK);
> + /*
> + * An exception was taken while we were trying to inject an
> + * IRQ. We must defer the injection of the vector until
> + * the next window.
> + */
> + kvm_vcpu_irq_push(vcpu, exit_int_info & SVM_EVTINJ_VEC_MASK);
>
Ah, I remember what push/pop is for now. We actually have ->ack() to
deal with this now. Unfortunately with auto-eoi we don't have a good
place to call it. So push() is a kind of unack() for eoi interrupts.
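The deferred-vector slot as a toy model (an exception preempted an injection, so the vector is parked and handed back ahead of the irqdevice on the next pop); names and shapes below are invented for illustration:

```c
#include <assert.h>

struct vcpu_irq_model {
	int deferred;		/* -1 = empty; at most one vector held */
	int device_next;	/* stand-in for the irqdevice's ack()  */
};

static void irq_push(struct vcpu_irq_model *v, int vec)
{
	assert(v->deferred == -1);	/* only one deferred vector fits */
	v->deferred = vec;
}

static int irq_pop(struct vcpu_irq_model *v)
{
	int vec;

	if (v->deferred != -1) {	/* the deferred vector wins */
		vec = v->deferred;
		v->deferred = -1;
	} else {
		vec = v->device_next;
		v->device_next = -1;
	}
	return vec;
}
```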
> @@ -1434,7 +1482,7 @@ static void post_kvm_run_save(struct kvm_vcpu *vcpu,
> static int dm_request_for_irq_injection(struct kvm_vcpu *vcpu,
> struct kvm_run *kvm_run)
> {
> - return (!vcpu->irq_summary &&
> + return (!kvm_vcpu_irq_pending(vcpu) &&
>
whoops.
Overall this seems to be improving, but I'm concerned about the much
increased complexity of it all. Probably much of it is unavoidable, but
I'd like not to see any unnecessary stuff as debugging in this area is
pretty much impossible.
--
error compiling committee.c: too many arguments to function
* Re: [PATCH 3/5] KVM: Adds ability to preempt an executing VCPU
[not found] ` <20070420030921.12408.97321.stgit-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
@ 2007-04-22 8:50 ` Avi Kivity
[not found] ` <462B21C7.2060007-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 22+ messages in thread
From: Avi Kivity @ 2007-04-22 8:50 UTC (permalink / raw)
To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
Gregory Haskins wrote:
> The VCPU executes synchronously w.r.t. userspace today, and therefore
> interrupt injection is pretty straightforward. However, we will soon need
> to be able to inject interrupts asynchronous to the execution of the VCPU
> due to the introduction of SMP, paravirtualized drivers, and asynchronous
> hypercalls. This patch adds support to the interrupt mechanism to force
> a VCPU to VMEXIT when a new interrupt is pending.
>
> Signed-off-by: Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
> ---
>
[vmresume/return code]
> /*
> + * Signal that we have transitioned back to host mode
> + */
> + spin_lock_irqsave(&vcpu->irq.lock, irq_flags);
> + vcpu->irq.guest_mode = 0;
> + spin_unlock_irqrestore(&vcpu->irq.lock, irq_flags);
> +
>
You need to check for an interrupt here. Otherwise you might go back to
user mode and sleep there, no?
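The race Avi points out, as a toy model: clear guest_mode, then re-check pending interrupts before returning, or a vector that arrived mid-transition is only noticed after userspace sleeps. Locking is elided and all names are illustrative:

```c
#include <assert.h>

struct vcpu_state {
	int guest_mode;
	unsigned long pending;
};

/* returns nonzero if we must bounce back out to inject */
static int leave_guest_mode(struct vcpu_state *s)
{
	s->guest_mode = 0;
	return s->pending != 0;	/* the re-check Avi asks for */
}
```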
--
error compiling committee.c: too many arguments to function
* Re: [PATCH 4/5] KVM: Local-APIC interface cleanup
[not found] ` <20070420030926.12408.27637.stgit-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
@ 2007-04-22 8:54 ` Avi Kivity
[not found] ` <462B22AE.4090108-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 22+ messages in thread
From: Avi Kivity @ 2007-04-22 8:54 UTC (permalink / raw)
To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
Gregory Haskins wrote:
> Adds an abstraction to the LAPIC logic so that we can later substitute it
> for an in-kernel model.
>
>
This is overly abstracted. It's not like you can (on real hardware)
wire your own lapic and plug it into the processor. It's well defined,
and there are just three modes of operation:
- emulated by userspace
- emulated in-kernel, but disabled
- emulated in-kernel
The differentiation can be made by installing or not installing the mmio
handler and the irqdevice stuff.
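A sketch of the three modes Avi describes, with the mode expressed purely by which handlers get installed; the enum and helper are invented for illustration:

```c
#include <assert.h>

enum lapic_mode {
	LAPIC_USERSPACE,	/* emulated by userspace */
	LAPIC_KERNEL_OFF,	/* in-kernel model present but disabled */
	LAPIC_KERNEL,		/* in-kernel, active */
};

/* install the mmio handler and irqdevice hooks only in the last mode */
static int lapic_install_handlers(enum lapic_mode mode)
{
	return mode == LAPIC_KERNEL;
}
```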
--
error compiling committee.c: too many arguments to function
* Re: [PATCH 5/5] KVM: Add support for in-kernel LAPIC model
[not found] ` <20070420030931.12408.88158.stgit-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
@ 2007-04-22 9:04 ` Avi Kivity
[not found] ` <462B250E.6050603-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 22+ messages in thread
From: Avi Kivity @ 2007-04-22 9:04 UTC (permalink / raw)
To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
Gregory Haskins wrote:
> Signed-off-by: Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
> ---
>
> drivers/kvm/Makefile | 2
> drivers/kvm/kernint.c | 168 +++++
> drivers/kvm/kvm.h | 14
> drivers/kvm/kvm_main.c | 142 +++++
> drivers/kvm/lapic.c | 1472 ++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/kvm.h | 16 -
> 6 files changed, 1808 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/kvm/Makefile b/drivers/kvm/Makefile
> index 540afbc..1aad737 100644
> --- a/drivers/kvm/Makefile
> +++ b/drivers/kvm/Makefile
> @@ -2,7 +2,7 @@
> # Makefile for Kernel-based Virtual Machine module
> #
>
> -kvm-objs := kvm_main.o mmu.o x86_emulate.o userint.o
> +kvm-objs := kvm_main.o mmu.o x86_emulate.o userint.o lapic.o kernint.o
> obj-$(CONFIG_KVM) += kvm.o
> kvm-intel-objs = vmx.o
> obj-$(CONFIG_KVM_INTEL) += kvm-intel.o
> diff --git a/drivers/kvm/kernint.c b/drivers/kvm/kernint.c
> new file mode 100644
> index 0000000..979a4aa
> --- /dev/null
> +++ b/drivers/kvm/kernint.c
> @@ -0,0 +1,168 @@
> +/*
> + * Kernel Interrupt IRQ device
> + *
> + * Provides a model for connecting in-kernel interrupt resources to a VCPU.
> + *
> + * A typical modern x86 processor has the concept of an internal Local-APIC
> + * and some external signal pins. The way in which interrupts are injected is
> + * dependent on whether software enables the LAPIC or not. When enabled,
> + * interrupts are acknowledged through the LAPIC. Otherwise they are through
> + * an externally connected PIC (typically an i8259 on the BSP)
> + *
> + * Copyright (C) 2007 Novell
> + *
> + * Authors:
> + * Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2. See
> + * the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "kvm.h"
> +
> +extern int kvm_kern_lapic_init(struct kvm_vcpu *vcpu,
> + struct kvm_irqdevice *irq_dev);
> +struct kvm_kernint {
> + spinlock_t lock;
> + atomic_t ref_count;
> + struct kvm_vcpu *vcpu;
> + struct kvm_irqdevice *self_irq;
> + struct kvm_irqdevice *ext_irq;
> + struct kvm_irqdevice apic_irq;
> + struct kvm_lapic *apic_dev;
> +
> +};
>
This is nice. I don't think a ref count is really necessary though, as
the configuration is fairly static.
> struct kvm_stat {
> @@ -570,6 +574,9 @@ extern struct kvm_arch_ops *kvm_arch_ops;
> int kvm_init_arch(struct kvm_arch_ops *ops, struct module *module);
> void kvm_exit_arch(void);
>
> +int kvm_apicbus_send(struct kvm *kvm, int dest, int trig_mode, int level,
> + int dest_mode, int delivery_mode, int vector);
> +
>
Pack'em into a struct?
[... actual lapic code ...]
> struct kvm_cpuid_entry {
> __u32 function;
> __u32 eax;
> @@ -284,6 +295,9 @@ struct kvm_signal_mask {
> #define KVM_CREATE_VCPU _IO(KVMIO, 0x41)
> #define KVM_GET_DIRTY_LOG _IOW(KVMIO, 0x42, struct kvm_dirty_log)
> #define KVM_SET_MEMORY_ALIAS _IOW(KVMIO, 0x43, struct kvm_memory_alias)
> +#define KVM_ENABLE_KERNEL_PIC _IOW(KVMIO, 0x44, __u32)
> +#define KVM_ISA_INTERRUPT _IOW(KVMIO, 0x45, struct kvm_interrupt)
> +#define KVM_APIC_MSG _IOW(KVMIO, 0x46, struct kvm_apic_msg)
>
> /*
> * ioctls for vcpu fds
> @@ -302,5 +316,5 @@ struct kvm_signal_mask {
> #define KVM_SET_SIGNAL_MASK _IOW(KVMIO, 0x8b, struct kvm_signal_mask)
> #define KVM_GET_FPU _IOR(KVMIO, 0x8c, struct kvm_fpu)
> #define KVM_SET_FPU _IOW(KVMIO, 0x8d, struct kvm_fpu)
> -
> +#define KVM_APIC_RESET _IO(KVMIO, 0x8e)
> #endif
>
>
You need to advertise the lapic ioctls via KVM_CHECK_EXTENSION.
Overall looks good.
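For reference, advertising the new capability could look roughly like the sketch below. KVM_CAP_KERNEL_PIC is a hypothetical capability number chosen purely for illustration; the posted patch does not define one, and the real check lives in the KVM_CHECK_EXTENSION ioctl handler.

```c
#include <assert.h>

/*
 * Hedged sketch of advertising the new lapic ioctls through
 * KVM_CHECK_EXTENSION.  KVM_CAP_KERNEL_PIC is a hypothetical
 * capability number used for illustration only.
 */
#define KVM_CAP_KERNEL_PIC 100 /* hypothetical */

/* Modeled on the check_extension ioctl path: return nonzero iff the
 * queried capability is supported by this kernel module. */
long check_extension(long ext)
{
	switch (ext) {
	case KVM_CAP_KERNEL_PIC:
		return 1; /* KVM_ENABLE_KERNEL_PIC et al. are available */
	default:
		return 0; /* unknown capability */
	}
}
```

Userspace would probe this once on the /dev/kvm fd before issuing any of the new ioctls.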
--
error compiling committee.c: too many arguments to function
-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: KVM: Patch series for in-kernel APIC support
[not found] ` <20070420030905.12408.40403.stgit-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
` (4 preceding siblings ...)
2007-04-20 3:09 ` [PATCH 5/5] KVM: Add support for in-kernel LAPIC model Gregory Haskins
@ 2007-04-22 9:06 ` Avi Kivity
5 siblings, 0 replies; 22+ messages in thread
From: Avi Kivity @ 2007-04-22 9:06 UTC (permalink / raw)
To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
Gregory Haskins wrote:
> The following is my patch series for adding in-kernel APIC support. It
> supports three "levels" of dynamic configuration (via a new ioctl):
>
> * level 0 = (default) compatibility mode (everything in userspace)
> * level 1 = LAPIC in kernel, IOAPIC/i8259 in userspace
> * level 2 = All three in kernel
>
> This patch adds support for the basic framework for the new PIC models
> (level 0) as well as an implementation of level-1.
>
> level-0 is "code complete" and fully tested. I have run this patchset
> using existing QEMU on 64-bit Linux, and 32-bit XP. Both ran fine with no
> discernible difference in behavior.
>
> level-1 is "code complete" and compiles/links error free, but is otherwise
> untested since I still do not have a functioning userspace component. I
> include it here for review/feedback purposes.
>
It would be nice if the Intel folks could review the lapic code, as I
have very little experience in this area.
> level-2 is partially implemented downstream in my queue, but I did not include
> it here since it is still TBD whether we will ever need it.
>
> Note that the first patch (in-kernel-mmio.patch) is completely unchanged
> through the last few rounds of review. However, patch 2-5 are heavily
> re-worked from the last time so pay particular attention there. Most notably:
>
> Patch #2: irqdevice changes:
> 1) pending+read_vector are now combined into one call: ack(). Feedback and
> my own discoveries downstream indicated this was a superior design.
> 2) raise_intr() is now set_intr() which can define more than one "pin" and
> which can be assert/de-asserted an edge or level triggered signal. This
> significantly simplified the NMI handling logic (some of which you will
> see here in the series) as well as created a much more extensible model
> to work with.
> 3) I merged a previously unpublished patch (deferred-irq.patch) into this
> one because it no longer made sense to keep them separate with the new
> design. This provides "push/pop" operations for IRQs to better handle
> injection failure scenarios.
>
> Patch #3 (preemptible-cpu) you are familiar with, but it changed slightly
> to accommodate the changes in #2
>
> #4 and #5 are debuting for the first time. Feedback/comments/bugfixes on any
> of the code are more than welcome, but I am particularly interested in comments
> on the handling of HRTIMERs in the lapic.c code. I ran into a brick wall
> with the SLEx 2.6.16 kernel not fully supporting them (which made it worse).
> However, the extern-module-compat methodology seemed inadequate to solve the
> problem. Please advise if there is a better way to solve that.
>
> From my perspective, this code could be considered for inclusion at this point
> (pending review cycles, etc) since it can fully support the existing system.
> I will leave that to the powers that be if they would prefer to see level-1 in
> action first.
>
I want to see code working (stressed, even) before it's merged,
regardless of review status.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/5] KVM: Add irqdevice object
[not found] ` <462B1FD8.4080004-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-04-23 13:58 ` Gregory Haskins
[not found] ` <462C8333.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 22+ messages in thread
From: Gregory Haskins @ 2007-04-23 13:58 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
>>> On Sun, Apr 22, 2007 at 4:42 AM, in message <462B1FD8.4080004-atKUWr5tajBWk0Htik3J/w@public.gmane.org>,
Avi Kivity <avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org> wrote:
> Gregory Haskins wrote:
>> The current code is geared towards using a user-mode (A)PIC. This patch adds
>> an "irqdevice" abstraction, and implements a "userint" model to handle the
>> duties of the original code. Later, we can develop other irqdevice models
>> to handle objects like LAPIC, IOAPIC, i8259, etc, as appropriate
>>
>> +
>> +typedef enum {
>> + kvm_irqpin_localint,
>> + kvm_irqpin_extint,
>> + kvm_irqpin_smi,
>> + kvm_irqpin_nmi,
>> + kvm_irqpin_invalid, /* must always be last */
>> +}kvm_irqpin_t;
>>
>
> This describes the processor irq pins, as opposed to an interrupt
> controller's irq pins, yes? If so, let the name reflect that (and let
> there be a space after the closing brace).
>
Ack. I will change to something like kvm_cpuirq_t
>> +
>> +#define KVM_IRQACK_VALID (1 << 0)
>> +#define KVM_IRQACK_AGAIN (1 << 1)
>> +#define KVM_IRQACK_TPRMASK (1 << 2)
>> +
>> +struct kvm_irqsink {
>> + void (*set_intr)(struct kvm_irqsink *this,
>> + struct kvm_irqdevice *dev,
>> + kvm_irqpin_t pin, int trigger, int value);
>> +
>> + void *private;
>> +};
>> +
>> +struct kvm_irqdevice {
>> + int (*ack)(struct kvm_irqdevice *this, int *vector);
>> + int (*set_pin)(struct kvm_irqdevice *this, int pin, int level);
>> + int (*summary)(struct kvm_irqdevice *this, void *data);
>> + void (*destructor)(struct kvm_irqdevice *this);
>>
>
> [do we actually need a virtual destructor?]
I believe it is the right thing to do, yes. The implementation of the irqdevice destructor may be as simple as a kfree(), or it could be arbitrarily complex. Don't forget that we will have multiple models; we already have three (userint, kernint, and lapic), and there may also be i8259 and i8259_cascaded in the future.
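The pattern being discussed is the classic C "virtual destructor": generic teardown code calls through the pointer so each model can clean up its own allocations. A minimal, self-contained sketch follows; the names are illustrative, not the actual KVM structures.

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch of the irqdevice virtual-destructor pattern; illustrative
 * names, not the actual KVM structures. */
struct irqdevice {
	void (*destructor)(struct irqdevice *this);
	int *teardown_count; /* lets a caller observe that cleanup ran */
};

/* One concrete model's cleanup: note that it ran, then free itself. */
static void simple_destructor(struct irqdevice *dev)
{
	if (dev->teardown_count)
		(*dev->teardown_count)++;
	free(dev);
}

/* Generic code can tear down any model without knowing its type. */
void irqdevice_destroy(struct irqdevice *dev)
{
	if (dev && dev->destructor)
		dev->destructor(dev);
}

/* Construct a device wired to the simple model. */
struct irqdevice *make_simple_device(int *counter)
{
	struct irqdevice *dev = malloc(sizeof(*dev));

	if (dev) {
		dev->destructor = simple_destructor;
		dev->teardown_count = counter;
	}
	return dev;
}
```

A model with more state (e.g. a cascaded pair of i8259s) would simply install a different destructor that walks and frees its children first.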
>
>> +/**
>> + * kvm_irqdevice_ack - read and ack the highest priority vector from the device
>> + * @dev: The device
>> + * @vector: Retrieves the highest priority pending vector
>> + * [ NULL = Don't ack a vector, just check pending status]
>> + * [ non-NULL = Pointer to receive vector data (out only)]
>> + *
>> + * Description: Read the highest priority pending vector from the device,
>> + * potentially invoking auto-EOI depending on device policy
>> + *
>> + * Returns: (int)
>> + * [ -1 = failure]
>> + * [>=0 = bitmap as follows: ]
>> + * [ KVM_IRQACK_VALID = vector is valid]
>> + * [ KVM_IRQACK_AGAIN = more unmasked vectors are available]
>> + * [ KVM_IRQACK_TPRMASK = TPR masked vectors are blocked]
>> + */
>> +static inline int kvm_irqdevice_ack(struct kvm_irqdevice *dev,
>> + int *vector)
>> +{
>> + return dev->ack(dev, vector);
>> +}
>>
>
> This is an improvement over the previous patch, but I'm vaguely
> disturbed by the complexity of the return code. I don't have an
> alternative to suggest at this time, though.
Would you prefer to see a by-reference flags field passed in, coupled with a more traditional return code?
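To make the alternative concrete, here is a sketch of the two conventions side by side. The flag bits mirror the posted KVM_IRQACK_* definitions, but the function names and the hard-coded vector are illustrative only.

```c
#include <assert.h>

/* Flag bits mirroring the posted KVM_IRQACK_* definitions. */
#define ACK_VALID   (1 << 0) /* vector is valid */
#define ACK_AGAIN   (1 << 1) /* more unmasked vectors are available */
#define ACK_TPRMASK (1 << 2) /* TPR-masked vectors are blocked */

/* Convention 1 (the posted patch): the status bitmap is packed into
 * the return value, with -1 reserved for failure. */
int ack_bitmap_style(int *vector)
{
	*vector = 32; /* pretend vector 32 was pending */
	return ACK_VALID | ACK_AGAIN;
}

/* Convention 2 (the suggested alternative): traditional 0/-errno
 * return, with the status bits passed back by reference. */
int ack_flags_style(int *vector, unsigned int *flags)
{
	*vector = 32;
	*flags = ACK_VALID | ACK_AGAIN;
	return 0; /* success */
}
```

The second form keeps the error path conventional at the cost of an extra out-parameter; the information carried is identical either way.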
>
>> +
>> +/**
>> + * kvm_irqdevice_summary - loads a summary bitmask
>> + * @dev: The device
>> + * @data: A pointer to a region capable of holding a 256 bit bitmap
>> + *
>> + * Description: Loads a summary bitmask of all pending vectors (0-255)
>> + *
>> + * Returns: (int)
>> + * [-1 = failure]
>> + * [ 0 = success]
>> + */
>> +static inline int kvm_irqdevice_summary(struct kvm_irqdevice *dev, void *data)
>> +{
>> + return dev->summary(dev, data);
>> +}
>>
>
> This really works only for the userint case. It can be dropped from the
> generic interface IMO. Each interrupt controller will have its own save
> restore interface which userspace will have to know about (as it has to
> know about configuring the interrupt controller).
Hmm... let me give some thought to how I can do this differently.
>
>> +/**
>> + * kvm_irqdevice_set_intr - invokes a registered INTR callback
>> + * @dev: The device
>> + * @pin: Identifies the pin to alter -
>> + * [ KVM_IRQPIN_LOCALINT (default) - a vector is pending on this
>> + * device]
>> + * [ KVM_IRQPIN_EXTINT - a vector is pending on an external device]
>> + * [ KVM_IRQPIN_SMI - system-management-interrupt pin]
>> + * [ KVM_IRQPIN_NMI - non-maskable-interrupt pin]
>> + * @trigger: sensitivity [0 = edge, 1 = level]
>> + * @val: [0 = deassert (ignored for edge-trigger), 1 = assert]
>> + *
>> + * Description: Invokes a registered INTR callback (if present). This
>> + * function is meant to be used privately by a irqdevice
>> + * implementation.
>> + *
>> + * Returns: (void)
>> + */
>> +static inline void kvm_irqdevice_set_intr(struct kvm_irqdevice *dev,
>> + kvm_irqpin_t pin, int trigger,
>> + int val)
>> +{
>> + struct kvm_irqsink *sink = &dev->sink;
>> + if (sink->set_intr)
>> + sink->set_intr(sink, dev, pin, trigger, val);
>> +}
>>
>
> Do you see more than one implementation for ->set_intr (e.g. for
> cascading)? If not, it can be de-pointered.
Yeah, I definitely see more than one consumer. Case in point, the kernint module that was included in this series registers intr() handlers for its two irqdevices (apic, and ext). Also, if we end up having level-2 support we will be using it even more for the cascaded i8259s
>
> Shouldn't 'trigger' be part of the pin configuration rather than passed
> on every invocation?
Hmm, this is a good point. I was trying to accommodate the flexibility of the APIC message format, where a vector can be arbitrarily sensitive. But your question made me realize that the way I did this is flawed anyway. The receiving software isn't going to respond appropriately to a vector (i.e. localint) that changes sensitivity on the fly. I will fix this.
>
>>
>> +/*
>> + * Assumes lock already held
>> + */
>> +static inline int __kvm_vcpu_irq_all_pending(struct kvm_vcpu *vcpu)
>> +{
>> + int pending = vcpu->irq.pending;
>> +
>> + if (vcpu->irq.deferred != -1)
>> + __set_bit(kvm_irqpin_localint, &pending);
>> +
>> + return pending;
>> +}
>> +
>> +/*
>> + * These two functions are helpers for determining if a standard interrupt
>> + * is pending to replace the old "if (vcpu->irq_summary)" logic. If the
>> + * caller wants to know about some of the new advanced interrupt types
>> + * (SMI, NMI, etc) or to differentiate between localint and extint they will
>> + * have to use the new API
>> + */
>> +static inline int __kvm_vcpu_irq_pending(struct kvm_vcpu *vcpu)
>> +{
>> + int pending = __kvm_vcpu_irq_all_pending(vcpu);
>> +
>> + if (test_bit(kvm_irqpin_localint, &pending) ||
>> + test_bit(kvm_irqpin_extint, &pending))
>> + return 1;
>> +
>> + return 0;
>> +}
>> +
>> +static inline int kvm_vcpu_irq_pending(struct kvm_vcpu *vcpu)
>> +{
>> + int ret = 0;
>> + int flags;
>> +
>> + spin_lock_irqsave(&vcpu->irq.lock, flags);
>> + ret = __kvm_vcpu_irq_pending(vcpu);
>> + spin_unlock_irqrestore(&vcpu->irq.lock, flags);
>>
>
> The locking seems superfluous.
I believe there are places where we need to call the locked version of kvm_vcpu_irq_pending in the code, but I will review this to make sure.
>
>> +
>> + return ret;
>> +}
>> +
>> +/*
>> + * Assumes lock already held
>> + */
>> +static inline int kvm_vcpu_irq_pop(struct kvm_vcpu *vcpu, int *vector)
>> +{
>> + int ret = 0;
>> +
>> + if (vcpu->irq.deferred != -1) {
>> + if (vector) {
>> + ret |= KVM_IRQACK_VALID;
>> + *vector = vcpu->irq.deferred;
>> + vcpu->irq.deferred = -1;
>> + }
>> + ret |= kvm_irqdevice_ack(&vcpu->irq.dev, NULL);
>> + } else
>> + ret = kvm_irqdevice_ack(&vcpu->irq.dev, vector);
>> +
>> + /*
>> + * If there are no more interrupts and we are edge triggered,
>> + * we must clear the status flag
>> + */
>> + if (!(ret & KVM_IRQACK_AGAIN))
>> + __clear_bit(kvm_irqpin_localint, &vcpu->irq.pending);
>>
>
> Can localint actually be edge-triggered?
See my earlier comment. This needs review.
>
>> +
>> + return ret;
>> +}
>> +
>> +static inline void __kvm_vcpu_irq_push(struct kvm_vcpu *vcpu, int irq)
>> +{
>> + BUG_ON(vcpu->irq.deferred != -1); /* We can only hold one deferred */
>> +
>> + vcpu->irq.deferred = irq;
>> +}
>> +
>> +static inline void kvm_vcpu_irq_push(struct kvm_vcpu *vcpu, int irq)
>> +{
>> + int flags;
>> +
>> + spin_lock_irqsave(&vcpu->irq.lock, flags);
>> + __kvm_vcpu_irq_push(vcpu, irq);
>> + spin_unlock_irqrestore(&vcpu->irq.lock, flags);
>> +}
>> +
>>
>
> Can you explain the logic behind push()/pop()? I realize you inherited
> it, but I don't think it fits well into the new model.
It seems you have already figured this out in your later comments, but just to make sure we are clear I will answer your question anyway: The problem as I see it is that real-world PICs have the notion of an interrupt being accepted by the CPU during the acknowledgment cycle. What happens during that cycle is PIC-dependent, but for something like an i8259 or LAPIC it generally means at least moving the pending bit from the IRR to the ISR register. Once the vector is acknowledged, it is considered dispatched to the CPU. However, for VMs this is not always an atomic operation (e.g. the injection may fail under a certain set of circumstances, such as those that cause a VMEXIT before the injection is complete). In those cases we don't want to lose the interrupt, so something must be done to preserve our current state for the next injection window.
In the original KVM code, the vector was simply re-inserted back into the (effective) userint model's state. This solved the problem neatly, albeit somewhat unnaturally compared to the real world. When you introduce models of actual PICs, things get more complex. I had a choice between somehow aborting the previously accepted vector, or adding a new layer between the PIC and the VCPU (e.g. irq.deferred). Since real-world PICs have no notion of "abort-ack", it would have been unnatural to add that feature at that layer. In addition, the operation would have to be supported by each model. The irq.deferred code works with all models and doesn't require a hack to the emulation of the PIC(s). It moves the problem to the VCPU, which is the layer where the difference lies (PCPU vs. VCPU).
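The deferred-IRQ layer described above boils down to a one-slot buffer sitting between the PIC model and the VCPU. A self-contained sketch of the push/pop behavior (illustrative names and simplified locking, not the posted code):

```c
#include <assert.h>

/* One-slot deferred-vector buffer between the PIC model and the
 * VCPU; illustrative names, not the posted code. */
struct vcpu_irq {
	int deferred; /* -1 = empty */
};

/* Injection failed (e.g. a fault caused a VMEXIT mid-injection):
 * park the already-acked vector for the next injection window. */
void irq_push(struct vcpu_irq *irq, int vector)
{
	assert(irq->deferred == -1); /* only one deferred slot exists */
	irq->deferred = vector;
}

/* At the next injection window, drain the deferred slot before
 * asking the PIC model for a fresh vector; -1 means nothing parked. */
int irq_pop_deferred(struct vcpu_irq *irq)
{
	int vector = irq->deferred;

	irq->deferred = -1;
	return vector;
}
```

Because the buffer lives at the VCPU layer, no PIC model ever needs an "abort-ack" operation: an acked-but-uninjected vector is simply parked and replayed.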
>
>> @@ -2044,13 +2048,17 @@ static int kvm_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
>> if (mmu_reset_needed)
>> kvm_mmu_reset_context(vcpu);
>>
>> - memcpy(vcpu->irq_pending, sregs->interrupt_bitmap,
>> - sizeof vcpu->irq_pending);
>> - vcpu->irq_summary = 0;
>> - for (i = 0; i < NR_IRQ_WORDS; ++i)
>> - if (vcpu->irq_pending[i])
>> - __set_bit(i, &vcpu->irq_summary);
>> -
>> + /*
>> + * walk the interrupt-bitmap and inject an IRQ for each bit found
>> + *
>> + * note that we skip the first 16 vectors since they are reserved
>> + * and should never be set by an interrupt source
>> + */
>> + for (i = 16; i < 256; ++i) {
>> + int val = test_bit(i, &sregs->interrupt_bitmap[0]);
>> + kvm_irqdevice_set_pin(&vcpu->irq.dev, i, val);
>> + }
>> +
>>
>
> Theory vs. practice. The BIOS will set the first PIC to vectors 8-15.
Doh! Didn't realize that. Will fix.
>
>> @@ -2319,6 +2321,51 @@ out1:
>> }
>>
>> /*
>> + * This function will be invoked whenever the vcpu->irq.dev raises its INTR
>> + * line
>> + */
>> +static void kvm_vcpu_intr(struct kvm_irqsink *this,
>> + struct kvm_irqdevice *dev,
>> + kvm_irqpin_t pin, int trigger, int val)
>> +{
>> + struct kvm_vcpu *vcpu = (struct kvm_vcpu*)this->private;
>> + unsigned long flags;
>> +
>> + spin_lock_irqsave(&vcpu->irq.lock, flags);
>> +
>> + if (val && !test_bit(pin, &vcpu->irq.pending)) {
>> + /*
>> + * if the line is being asserted and we currently have
>> + * it deasserted, we must record
>> + */
>> + __set_bit(pin, &vcpu->irq.pending);
>> +
>> + if (trigger)
>> + __set_bit(pin, &vcpu->irq.trigger);
>> + else
>> + __clear_bit(pin, &vcpu->irq.trigger);
>> +
>> + } else if (!val && trigger)
>> + /*
>> + * if the level-sensitive line is being deasserted,
>> + * record it.
>> + */
>> + __clear_bit(pin, &vcpu->irq.pending);
>> +
>> + spin_unlock_irqrestore(&vcpu->irq.lock, flags);
>> +}
>>
>
> That's quite a mouthful :)
>
>
>> * Creates some virtual cpus. Good luck creating more than one.
>> */
>> static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, int n)
>> @@ -2364,6 +2411,12 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, int n)
>> if (r < 0)
>> goto out_free_vcpus;
>>
>> + kvm_irqdevice_init(&vcpu->irq.dev);
>> + kvm_vcpu_irqsink_init(vcpu);
>> + r = kvm_userint_init(vcpu);
>> + if (r < 0)
>> + goto out_free_vcpus;
>>
>
> Bad indent.
Ack.
>
>> static inline void clgi(void)
>> {
>> asm volatile (SVM_CLGI);
>> @@ -892,7 +874,12 @@ static int pf_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
>> int r;
>>
>> if (is_external_interrupt(exit_int_info))
>> - push_irq(vcpu, exit_int_info & SVM_EVTINJ_VEC_MASK);
>> + /*
>> + * An exception was taken while we were trying to inject an
>> + * IRQ. We must defer the injection of the vector until
>> + * the next window.
>> + */
>> + kvm_vcpu_irq_push(vcpu, exit_int_info & SVM_EVTINJ_VEC_MASK);
>>
>
> Ah, I remember what push/pop is for now. We actually have - >ack() to
> deal with this now. Unfortunately with auto- eoi we don't have a good
> place to call it. So push() is a kind of unack() for eoi interrupts.
Sort of. I think my explanation above covers this, so I won't go deeper into it here.
>
>> @@ -1434,7 +1482,7 @@ static void post_kvm_run_save(struct kvm_vcpu *vcpu,
>> static int dm_request_for_irq_injection(struct kvm_vcpu *vcpu,
>> struct kvm_run *kvm_run)
>> {
>> - return (!vcpu->irq_summary &&
>> + return (!kvm_vcpu_irq_pending(vcpu) &&
>>
>
> whoops.
Ack.
>
>
> Overall this seems to be improving, but I'm concerned about the much
> increased complexity of it all. Probably much of it is unavoidable, but
> I'd like not to see any unnecessary stuff as debugging in this area is
> pretty much impossible.
Agreed. One of my primary goals is to make this as simple, clean, and maintainable as possible. Adding significant new functionality will inevitably complicate the code to some degree, but hopefully we are doing it in a reasonable manner. I believe I have achieved that, but comments regarding better ways to do some or all of it are always welcome.
-Greg
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 3/5] KVM: Adds ability to preempt an executing VCPU
[not found] ` <462B21C7.2060007-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-04-23 15:42 ` Gregory Haskins
[not found] ` <462C9B94.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 22+ messages in thread
From: Gregory Haskins @ 2007-04-23 15:42 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
>>> On Sun, Apr 22, 2007 at 4:50 AM, in message <462B21C7.2060007-atKUWr5tajBWk0Htik3J/w@public.gmane.org>,
Avi Kivity <avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org> wrote:
> Gregory Haskins wrote:
>> /*
>> + * Signal that we have transitioned back to host mode
>> + */
>> + spin_lock_irqsave(&vcpu->irq.lock, irq_flags);
>> + vcpu->irq.guest_mode = 0;
>> + spin_unlock_irqrestore(&vcpu->irq.lock, irq_flags);
>> +
>>
>
> You need to check for an interrupt here. Otherwise you might go back to
> user mode and sleep there, no?
It's subtle, but I'm not sure if you need to or not. I'm glad you brought it up, because it's something I wanted to talk about. The case where interrupts are raised outside the guest_mode brackets is obviously handled, since we inject a signal; the area in question is while guest_mode = 1.
In the simple case, an async-interrupt comes in and sends an IPI to cause a VMEXIT. The guest exits with an EXTERNAL_INTERRUPT exception (IIUC), and the current handler causes the system to loop back into the VMENTER code (and thus injecting the interrupt).
If the guest exits because of HLT, this is also handled, since the current handle_halt() code checks whether there are pending interrupts before allowing the halt. If there are interrupts, it loops back into VMENTER.
For other cases (e.g. the guest is VMEXITing for a reason other than the IPI), the guest may need some userspace assistance: e.g. MMIO servicing, exception handling, etc. In this case, looping back to VMENTER would be incorrect. The current code already handles this correctly too. However, the potential problem (as you pointed out) is if userspace wants to sleep after servicing those types of requests but doesn't realize that it cannot due to a pending interrupt. I am currently not sure whether this is a problem, since halt is handled.
Are there any other userspace sleeps that we need to handle (e.g. maybe AIO)? If so, one way to handle this is to mark the exportable state of the VCPU such that userspace can tell if interrupts are pending. However, I'm not really sure if this is the best way to do it, or if it can be easily done in a way that doesn't break ABI compatibility. Please advise.
-Greg
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 4/5] KVM: Local-APIC interface cleanup
[not found] ` <462B22AE.4090108-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-04-23 15:55 ` Gregory Haskins
[not found] ` <462C9EAE.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 22+ messages in thread
From: Gregory Haskins @ 2007-04-23 15:55 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
>>> On Sun, Apr 22, 2007 at 4:54 AM, in message <462B22AE.4090108-atKUWr5tajBWk0Htik3J/w@public.gmane.org>,
Avi Kivity <avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org> wrote:
> Gregory Haskins wrote:
>> Adds an abstraction to the LAPIC logic so that we can later substitute it
>> for an in-kernel model.
>>
>>
>
> This is overly abstracted. It's not like you can (on real hardware)
> wire your own lapic and plug it into the processor.
Agreed, but the key point is that under KVM, you *can* plug in more than one LAPIC (e.g. userint and kernint). Notice that I did not try to completely abstract an LAPIC. Rather, I tried to identify the common touchpoints between the userspace and kernel versions. In short, that basically came down to CR8/TPR and the APIC_BASE_MSR handling. The other functions of the APIC (e.g. kvm_apicbus_send()) which were not common between the two I simply defined as non-virtuals. I figured that my minimalist abstraction was preferable to doing this all over the place:
if (!vcpu->kvm->enable_kernel_pic)
        vcpu->cr8 = cr8;
else
        apic_set_tpr(cr8);
Instead, you can just do:
kvm_lapic_set_tpr(&vcpu->apic, cr8);
and let the model figure out the right action. This is easily reversible if you prefer. I just figured mine was a cleaner way of accomplishing the same thing. Perhaps I am a bit overzealous ;)
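The abstraction being argued for is a single indirect call whose meaning each model defines. A sketch of that dispatch (illustrative names and deliberately simplified behavior, not the posted patch; the "rescan" is reduced to a flag):

```c
#include <assert.h>

/* Sketch of per-model TPR dispatch; illustrative names and
 * simplified behavior, not the posted patch. */
struct lapic_model {
	void (*set_tpr)(struct lapic_model *this, unsigned long cr8);
	unsigned long tpr;
	int reevaluated; /* stand-in for "rescan pending vectors" */
};

/* Userspace model: a TPR write is just VCPU state to remember. */
static void user_set_tpr(struct lapic_model *m, unsigned long cr8)
{
	m->tpr = cr8;
}

/* In-kernel model: record it, then (in the real code) check whether
 * lowering TPR unmasked pending vectors that must now be injected. */
static void kern_set_tpr(struct lapic_model *m, unsigned long cr8)
{
	m->tpr = cr8;
	m->reevaluated = 1; /* placeholder for the IRR rescan */
}

/* The one entry point the core code calls, regardless of model. */
void lapic_set_tpr(struct lapic_model *m, unsigned long cr8)
{
	m->set_tpr(m, cr8);
}

struct lapic_model user_model = { user_set_tpr, 0, 0 };
struct lapic_model kern_model = { kern_set_tpr, 0, 0 };
```

The core code's obligation shrinks to "report that TPR changed"; whether that triggers an injection pass is the installed model's business.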
> The differentiation can be made by installing or not installing the mmio
> handler and the irqdevice stuff.
Well, it's actually a bit more complicated than that. Per my previous example, handling TPR is simple in the userspace case (just save it as part of the VCPU state), whereas it's complex in the in-kernel case (if lowering TPR unmasks pending vectors, inject them). However, the core code doesn't have to care. It simply notes that TPR changed, and the model handles the rest.
Thoughts?
-Greg
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 5/5] KVM: Add support for in-kernel LAPIC model
[not found] ` <462B250E.6050603-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-04-23 15:57 ` Gregory Haskins
0 siblings, 0 replies; 22+ messages in thread
From: Gregory Haskins @ 2007-04-23 15:57 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
>>> On Sun, Apr 22, 2007 at 5:04 AM, in message <462B250E.6050603-atKUWr5tajBWk0Htik3J/w@public.gmane.org>,
Avi Kivity <avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org> wrote:
> Gregory Haskins wrote:
>> Signed-off-by: Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
>> ---
>>
>> drivers/kvm/Makefile | 2
>> drivers/kvm/kernint.c | 168 +++++
>> drivers/kvm/kvm.h | 14
>> drivers/kvm/kvm_main.c | 142 +++++
>> drivers/kvm/lapic.c | 1472 ++++++++++++++++++++++++++++++++++++++++++++++++
>> include/linux/kvm.h | 16 -
>> 6 files changed, 1808 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/kvm/Makefile b/drivers/kvm/Makefile
>> index 540afbc..1aad737 100644
>> --- a/drivers/kvm/Makefile
>> +++ b/drivers/kvm/Makefile
>> @@ -2,7 +2,7 @@
>> # Makefile for Kernel-based Virtual Machine module
>> #
>>
>> -kvm-objs := kvm_main.o mmu.o x86_emulate.o userint.o
>> +kvm-objs := kvm_main.o mmu.o x86_emulate.o userint.o lapic.o kernint.o
>> obj-$(CONFIG_KVM) += kvm.o
>> kvm-intel-objs = vmx.o
>> obj-$(CONFIG_KVM_INTEL) += kvm-intel.o
>> diff --git a/drivers/kvm/kernint.c b/drivers/kvm/kernint.c
>> new file mode 100644
>> index 0000000..979a4aa
>> --- /dev/null
>> +++ b/drivers/kvm/kernint.c
>> @@ -0,0 +1,168 @@
>> +/*
>> + * Kernel Interrupt IRQ device
>> + *
>> + * Provides a model for connecting in-kernel interrupt resources to a VCPU.
>> + *
>> + * A typical modern x86 processor has the concept of an internal Local-APIC
>> + * and some external signal pins. The way in which interrupts are injected is
>> + * dependent on whether software enables the LAPIC or not. When enabled,
>> + * interrupts are acknowledged through the LAPIC. Otherwise they are through
>> + * an externally connected PIC (typically an i8259 on the BSP)
>> + *
>> + * Copyright (C) 2007 Novell
>> + *
>> + * Authors:
>> + * Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2. See
>> + * the COPYING file in the top-level directory.
>> + *
>> + */
>> +
>> +#include "kvm.h"
>> +
>> +extern int kvm_kern_lapic_init(struct kvm_vcpu *vcpu,
>> + struct kvm_irqdevice *irq_dev);
>> +struct kvm_kernint {
>> + spinlock_t lock;
>> + atomic_t ref_count;
>> + struct kvm_vcpu *vcpu;
>> + struct kvm_irqdevice *self_irq;
>> + struct kvm_irqdevice *ext_irq;
>> + struct kvm_irqdevice apic_irq;
>> + struct kvm_lapic *apic_dev;
>> +
>> +};
>>
>
> This is nice. I don't think a ref count is really necessary though, as
> the configuration is fairly static.
Ack.
>
>> struct kvm_stat {
>> @@ -570,6 +574,9 @@ extern struct kvm_arch_ops *kvm_arch_ops;
>> int kvm_init_arch(struct kvm_arch_ops *ops, struct module *module);
>> void kvm_exit_arch(void);
>>
>> +int kvm_apicbus_send(struct kvm *kvm, int dest, int trig_mode, int level,
>> + int dest_mode, int delivery_mode, int vector);
>> +
>>
>
> Pack'em into a struct?
I see no reason why not. I'll make this change.
>
>
> [... actual lapic code ...]
>
>> struct kvm_cpuid_entry {
>> __u32 function;
>> __u32 eax;
>> @@ -284,6 +295,9 @@ struct kvm_signal_mask {
>> #define KVM_CREATE_VCPU _IO(KVMIO, 0x41)
>> #define KVM_GET_DIRTY_LOG _IOW(KVMIO, 0x42, struct kvm_dirty_log)
>> #define KVM_SET_MEMORY_ALIAS _IOW(KVMIO, 0x43, struct kvm_memory_alias)
>> +#define KVM_ENABLE_KERNEL_PIC _IOW(KVMIO, 0x44, __u32)
>> +#define KVM_ISA_INTERRUPT _IOW(KVMIO, 0x45, struct kvm_interrupt)
>> +#define KVM_APIC_MSG _IOW(KVMIO, 0x46, struct kvm_apic_msg)
>>
>> /*
>> * ioctls for vcpu fds
>> @@ -302,5 +316,5 @@ struct kvm_signal_mask {
>> #define KVM_SET_SIGNAL_MASK _IOW(KVMIO, 0x8b, struct kvm_signal_mask)
>> #define KVM_GET_FPU _IOR(KVMIO, 0x8c, struct kvm_fpu)
>> #define KVM_SET_FPU _IOW(KVMIO, 0x8d, struct kvm_fpu)
>> -
>> +#define KVM_APIC_RESET _IO(KVMIO, 0x8e)
>> #endif
>>
>>
>
> You need to advertise the lapic ioctls via KVM_CHECK_EXTENSION.
Ack.
>
> Overall looks good.
Thanks!
-Greg
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/5] KVM: Add irqdevice object
[not found] ` <462C8333.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
@ 2007-04-24 9:09 ` Avi Kivity
[not found] ` <462DC954.1020400-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 22+ messages in thread
From: Avi Kivity @ 2007-04-24 9:09 UTC (permalink / raw)
To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
Gregory Haskins wrote:
>>> +
>>> +struct kvm_irqdevice {
>>> + int (*ack)(struct kvm_irqdevice *this, int *vector);
>>> + int (*set_pin)(struct kvm_irqdevice *this, int pin, int level);
>>> + int (*summary)(struct kvm_irqdevice *this, void *data);
>>> + void (*destructor)(struct kvm_irqdevice *this);
>>>
>>>
>> [do we actually need a virtual destructor?]
>>
>
> I believe it is the right thing to do, yes. The implementation of the irqdevice destructor may be as simple as a kfree(), or could be arbitrarily complex (don't forget that we will have multiple models..we already have three: userint, kernint, and lapic. There may also be i8259 and i8259_cascaded in the future).
>
>
Yes, but does it need to be a function pointer? IOW, is the point it is
called generic code or already irqdevice-specific?
>
>>> +/**
>>> + * kvm_irqdevice_ack - read and ack the highest priority vector from the device
>>> + * @dev: The device
>>> + * @vector: Retrieves the highest priority pending vector
>>> + *    [ NULL = Don't ack a vector, just check pending status]
>>> + *    [ non-NULL = Pointer to receive vector data (out only)]
>>> + *
>>> + * Description: Read the highest priority pending vector from the device,
>>> + *    potentially invoking auto-EOI depending on device policy
>>> + *
>>> + * Returns: (int)
>>> + *    [ -1 = failure]
>>> + *    [ >=0 = bitmap as follows: ]
>>> + *    [ KVM_IRQACK_VALID = vector is valid]
>>> + *    [ KVM_IRQACK_AGAIN = more unmasked vectors are available]
>>> + *    [ KVM_IRQACK_TPRMASK = TPR masked vectors are blocked]
>>> + */
>>> +static inline int kvm_irqdevice_ack(struct kvm_irqdevice *dev,
>>> +                                    int *vector)
>>> +{
>>> +	return dev->ack(dev, vector);
>>> +}
>>>
>>>
>> This is an improvement over the previous patch, but I'm vaguely
>> disturbed by the complexity of the return code. I don't have an
>> alternative to suggest at this time, though.
>>
>
> Would you prefer to see a by-ref flags field passed in coupled with a more traditional return code?
>
>
While I enjoy nitpicking on the names and types of parameters, my
concern here is the exploding number of combinations, each of which can
be used by the arch to hide bugs in.
Bugs in this code are going to be exceedingly hard to debug; they'll be
by nature non-repeatable and timing-sensitive, and as the OS that makes
heaviest use of the APIC and tends to crash at the slightest
mis-emulation is closed source, much of the debugging is done by staring
at the code.
We already have a report about missing mouse clicks, which is
possibly caused by interrupt mis-emulation. If you want to know exactly
why I'm worried about increasing complexity, try to debug it.
[Of course, complexity inevitably grows, and even when people remove
code and simplify things, usually it is in order to add even more code
and more complexity. But I want to be on the right side of the
complexity/performance/flexibility/stability tradeoff.]
>
>
>>> +/**
>>> + * kvm_irqdevice_set_intr - invokes a registered INTR callback
>>> + * @dev: The device
>>> + * @pin: Identifies the pin to alter -
>>> + *    [ KVM_IRQPIN_LOCALINT (default) - a vector is pending on this
>>> + *      device]
>>> + *    [ KVM_IRQPIN_EXTINT - a vector is pending on an external device]
>>> + *    [ KVM_IRQPIN_SMI - system-management-interrupt pin]
>>> + *    [ KVM_IRQPIN_NMI - non-maskable-interrupt pin]
>>> + * @trigger: sensitivity [0 = edge, 1 = level]
>>> + * @val: [0 = deassert (ignored for edge-trigger), 1 = assert]
>>> + *
>>> + * Description: Invokes a registered INTR callback (if present). This
>>> + *    function is meant to be used privately by an irqdevice
>>> + *    implementation.
>>> + *
>>> + * Returns: (void)
>>> + */
>>> +static inline void kvm_irqdevice_set_intr(struct kvm_irqdevice *dev,
>>> +                                          kvm_irqpin_t pin, int trigger,
>>> +                                          int val)
>>> +{
>>> +	struct kvm_irqsink *sink = &dev->sink;
>>> +	if (sink->set_intr)
>>> +		sink->set_intr(sink, dev, pin, trigger, val);
>>> +}
>>>
>>>
>> Do you see more than one implementation for ->set_intr (e.g. for
>> cascading)? If not, it can be de-pointered.
>>
>
> Yeah, I definitely see more than one consumer. Case in point, the kernint module that was included in this series registers intr() handlers for its two irqdevices (apic, and ext). Also, if we end up having level-2 support we will be using it even more for the cascaded i8259s
>
Okay.
>>
>>> + * have to use the new API
>>> + */
>>> +static inline int __kvm_vcpu_irq_pending(struct kvm_vcpu *vcpu)
>>> +{
>>> + int pending = __kvm_vcpu_irq_all_pending(vcpu);
>>> +
>>> + if (test_bit(kvm_irqpin_localint, &pending) ||
>>> + test_bit(kvm_irqpin_extint, &pending))
>>> + return 1;
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static inline int kvm_vcpu_irq_pending(struct kvm_vcpu *vcpu)
>>> +{
>>> + int ret = 0;
>>> + int flags;
>>> +
>>> + spin_lock_irqsave(&vcpu->irq.lock, flags);
>>> + ret = __kvm_vcpu_irq_pending(vcpu);
>>> + spin_unlock_irqrestore(&vcpu->irq.lock, flags);
>>>
>>>
>> The locking seems superfluous.
>>
>
> I believe there are places where we need to call the locked version of kvm_vcpu_irq_pending in the code, but I will review this to make sure.
>
>
I meant, __kvm_vcpu_irq_pending is just reading stuff.
>
>>> +
>>> + return ret;
>>> +}
>>> +
>>> +static inline void __kvm_vcpu_irq_push(struct kvm_vcpu *vcpu, int irq)
>>> +{
>>> + BUG_ON(vcpu->irq.deferred != -1); /* We can only hold one deferred */
>>> +
>>> + vcpu->irq.deferred = irq;
>>> +}
>>> +
>>> +static inline void kvm_vcpu_irq_push(struct kvm_vcpu *vcpu, int irq)
>>> +{
>>> + int flags;
>>> +
>>> + spin_lock_irqsave(&vcpu->irq.lock, flags);
>>> + __kvm_vcpu_irq_push(vcpu, irq);
>>> + spin_unlock_irqrestore(&vcpu->irq.lock, flags);
>>> +}
>>> +
>>>
>>>
>> Can you explain the logic behind push()/pop()? I realize you inherited
>> it, but I don't think it fits well into the new model.
>>
>
> It seems you have already figured this out in your later comments, but just to make sure we are clear I will answer your question anyway: The problem as I see it is that real-world PICs have the notion of an interrupt being accepted by the CPU during the acknowledgment cycle. What happens during that cycle is PIC dependent, but for something like an 8259 or LAPIC, generally it means at least moving the pending bit from the IRR to the ISR register. Once the vector is acknowledged, it is considered dispatched to the CPU. However, for VMs this is not always an atomic operation (e.g. the injection may fail under a certain set of circumstances such as those that cause a VMEXIT before the injection is complete). During those cases, we don't want to lose the interrupt so something must be done to preserve our current state for the next injection window.
>
> In the original KVM code, the vector was simply re-inserted back into the (effective) userint model's state. This solved the problem neatly albeit potentially unnaturally when compared to the real-world. When you introduce the models of actual PICs things get more complex. I had a choice between somehow aborting the previously accepted vector, or adding a new layer between the PIC and the vCPU (e.g. irq.deferred). Since the real-world PICs have no notion of "abort-ack", it would have been unnatural to add that feature at that layer. In addition, the operation would have to be supported with each model. The irq.deferred code works with all models and doesn't require a hack to the emulation of the PIC(s). It moves the problem to the VCPU which is the layer where the difference is (PCPU vs VCPU).
>
>
But, once the vcpu gets back to the deferred irq, the tpr may have
changed and no longer allow acceptance of this irq.
Thinking a bit about this, the current code suffers from the same
problem. I guess it works because no OS is insane enough to page out
the IDT or GDT, so the only faults we can get are handled by kvm, not
the guest.
So it seems the correct description is not 'un-ack the interrupt', as we
have effectively acked it, but actually queue it pending host-only kvm
processing. I'm not 100% sure that's the only case, though.
>
>>> static inline void clgi(void)
>>> {
>>> asm volatile (SVM_CLGI);
>>> @@ -892,7 +874,12 @@ static int pf_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
>>> int r;
>>>
>>> if (is_external_interrupt(exit_int_info))
>>> - push_irq(vcpu, exit_int_info & SVM_EVTINJ_VEC_MASK);
>>> + /*
>>> + * An exception was taken while we were trying to inject an
>>> + * IRQ. We must defer the injection of the vector until
>>> + * the next window.
>>> + */
>>> + kvm_vcpu_irq_push(vcpu, exit_int_info & SVM_EVTINJ_VEC_MASK);
>>>
>>>
>> Ah, I remember what push/pop is for now. We actually have ->ack() to
>> deal with this now. Unfortunately with auto-eoi we don't have a good
>> place to call it. So push() is a kind of unack() for eoi interrupts.
>>
>
> Sort of. I think my explanation above covers this, so I wont go into it deeper here.
>
>
Yeah. Well, at least some of the uses are not unack() related, and we
can't really do unack(), so I was wrong.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 3/5] KVM: Adds ability to preempt an executing VCPU
[not found] ` <462C9B94.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
@ 2007-04-24 9:17 ` Avi Kivity
[not found] ` <462DCB3E.6070802-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 22+ messages in thread
From: Avi Kivity @ 2007-04-24 9:17 UTC (permalink / raw)
To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
Gregory Haskins wrote:
>>>> On Sun, Apr 22, 2007 at 4:50 AM, in message <462B21C7.2060007-atKUWr5tajBWk0Htik3J/w@public.gmane.org>,
>>>>
> Avi Kivity <avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org> wrote:
>
>> Gregory Haskins wrote:
>>
>>> /*
>>> + * Signal that we have transitioned back to host mode
>>> + */
>>> + spin_lock_irqsave(&vcpu->irq.lock, irq_flags);
>>> + vcpu->irq.guest_mode = 0;
>>> + spin_unlock_irqrestore(&vcpu->irq.lock, irq_flags);
>>> +
>>>
>>>
>> You need to check for an interrupt here. Otherwise you might go back to
>> user mode and sleep there, no?
>>
>
>
> It's subtle, but I'm not sure if you need to or not. I'm glad you brought it up because it's something I wanted to talk about. The case where interrupts are raised outside the guest_mode brackets is obviously handled since we inject a signal, so the area in question is while guest_mode = 1.
>
> In the simple case, an async-interrupt comes in and sends an IPI to cause a VMEXIT. The guest exits with an EXTERNAL_INTERRUPT exception (IIUC), and the current handler causes the system to loop back into the VMENTER code (and thus injecting the interrupt).
>
Okay.
> If the guest exits because of HLT, this is also handled since the current handle_halt() code checks if there are pending interrupts first before allowing the halt. If there are interrupts, it loops back into VMENTER.
>
Okay.
> For other cases, (e.g. the guest is VMEXITing for a reason other than the IPI) the guest may need some userspace assistance: e.g. MMIO servicing, exception handling, etc.. In this case, looping back to VMENTER would be incorrect. The current code already handles this correctly too. However, the potential problem (as you pointed out) is if the userspace wants to sleep after servicing those types of requests, but doesn't realize that it cannot due to a pending interrupt. I currently am not sure if this is a problem or not since halt is handled.
>
I'm not sure either. It seems natural to assume that qemu will re-enter
the guest without sleeping except for hlt.
> Are there any other userspace sleeps that we need to handle (e.g. maybe AIO)? If so, one way to handle this is to mark the exportable state of the VCPU such that userspace can tell if interrupts are pending. However, I'm not really sure if this is the best way to do it or if it can be easily done in a way that doesn't break ABI compatibility. Please advise.
>
The ABI can be extended by adding fields to struct kvm_vcpu_run and an
extension check to indicate their availability.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 4/5] KVM: Local-APIC interface cleanup
[not found] ` <462C9EAE.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
@ 2007-04-24 9:26 ` Avi Kivity
[not found] ` <462DCD31.4030108-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 22+ messages in thread
From: Avi Kivity @ 2007-04-24 9:26 UTC (permalink / raw)
To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
Gregory Haskins wrote:
>>>> On Sun, Apr 22, 2007 at 4:54 AM, in message <462B22AE.4090108-atKUWr5tajBWk0Htik3J/w@public.gmane.org>,
>>>>
> Avi Kivity <avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org> wrote:
>
>> Gregory Haskins wrote:
>>
>>> Adds an abstraction to the LAPIC logic so that we can later substitute it
>>> for an in-kernel model.
>>>
>>>
>>>
>> This is overly abstracted. It's not like you can (on real hardware)
>> wire your own lapic and plug it into the processor.
>>
>
> Agreed, but the key point is that under KVM, you *can* plug in more than one LAPIC (e.g. userint and kernint). Notice that I did not try to completely abstract an LAPIC. Rather, I tried to identify the common touchpoints between the userspace and kernel version. In short, that basically came down to CR8/TPR and the APIC_BASE_MSR handling. The other functions of the APIC (e.g. kvm_apicbus_send()) which were not common between the two I simply defined as non-virtuals. I figured that my minimalist abstraction was preferable to doing this all over the place:
>
> if (!vcpu->kvm->enable_kernel_pic)
> vcpu->cr8 = cr8;
> else
> apic_set_tpr(cr8)
>
> Instead, you can just do:
>
> kvm_lapic_set_tpr(&vcpu->apic, cr8)
>
You can put the if statement into kvm_lapic_set_tpr().
And please keep cr8 in the vcpu, people may want to read it.
> and let the model figure out the right action. This is easily reversible if you prefer. I just figured mine was a cleaner way of accomplishing the same thing. Perhaps I am a bit overzealous ;)
>
>
>> The differentiation can be made by installing or not installing the mmio
>> handler and the irqdevice stuff.
>>
>
> Well, it's actually a bit more complicated than that. Per my previous example, handling TPR is simple in the userspace case (just save it as part of the VCPU state), whereas it's complex in the in-kernel case (if lowering TPR unmasks pending vectors, inject them). However, the core code doesn't have to care. It simply notes that TPR changed and the model handles the rest.
>
#include "diatribes/complexity.h"
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/5] KVM: Add irqdevice object
[not found] ` <462DC954.1020400-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-04-26 14:37 ` Gregory Haskins
[not found] ` <463080C8.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 22+ messages in thread
From: Gregory Haskins @ 2007-04-26 14:37 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
Hi Avi, Sorry for the delay. I have been traveling this week. See inline...
>>> On Tue, Apr 24, 2007 at 5:09 AM, in message <462DC954.1020400-atKUWr5tajBWk0Htik3J/w@public.gmane.org>,
Avi Kivity <avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org> wrote:
> Gregory Haskins wrote:
>>>> +
>>>> +struct kvm_irqdevice {
>>>> + int (*ack)(struct kvm_irqdevice *this, int *vector);
>>>> + int (*set_pin)(struct kvm_irqdevice *this, int pin, int level);
>>>> + int (*summary)(struct kvm_irqdevice *this, void *data);
>>>> + void (*destructor)(struct kvm_irqdevice *this);
>>>>
>>>>
>>> [do we actually need a virtual destructor?]
>>>
>>
>> I believe it is the right thing to do, yes. The implementation of the
> irqdevice destructor may be as simple as a kfree(), or could be arbitrarily
> complex (don't forget that we will have multiple models..we already have
> three: userint, kernint, and lapic. There may also be i8259 and
> i8259_cascaded in the future).
>>
>>
>
> Yes, but does it need to be a function pointer? IOW, is the point it is
> called generic code or already irqdevice-specific?
The code can be (and is) irqdevice specific, thus the virtual. In some cases, it will be as simple as a kfree(). In others (kernint, for instance), it might need to drop references to the apic/ext devices and do other cleanup (which reminds me that I should look at this to make sure it's done right today) ;)
>
>>
>>>> +/**
>>>> + * kvm_irqdevice_ack - read and ack the highest priority vector from the device
>>>> + * @dev: The device
>>>> + * @vector: Retrieves the highest priority pending vector
>>>> + *    [ NULL = Don't ack a vector, just check pending status]
>>>> + *    [ non-NULL = Pointer to receive vector data (out only)]
>>>> + *
>>>> + * Description: Read the highest priority pending vector from the device,
>>>> + *    potentially invoking auto-EOI depending on device policy
>>>> + *
>>>> + * Returns: (int)
>>>> + *    [ -1 = failure]
>>>> + *    [ >=0 = bitmap as follows: ]
>>>> + *    [ KVM_IRQACK_VALID = vector is valid]
>>>> + *    [ KVM_IRQACK_AGAIN = more unmasked vectors are available]
>>>> + *    [ KVM_IRQACK_TPRMASK = TPR masked vectors are blocked]
>>>> + */
>>>> +static inline int kvm_irqdevice_ack(struct kvm_irqdevice *dev,
>>>> +                                    int *vector)
>>>> +{
>>>> +	return dev->ack(dev, vector);
>>>> +}
>>>>
>>>>
>>> This is an improvement over the previous patch, but I'm vaguely
>>> disturbed by the complexity of the return code. I don't have an
>>> alternative to suggest at this time, though.
>>>
>>
>> Would you prefer to see a by-ref flags field passed in coupled with a more
> traditional return code?
>>
>>
>
> While I enjoy nitpicking on the names and types of parameters, my
> concern here is the exploding number of combinations, each of which can
> be used by the arch to hide bugs in.
>
> Bugs in this code are going to be exceedingly hard to debug; they'll be
> by nature non-repeatable and timing-sensitive, and as the OS that makes
> heaviest use of the APIC and tends to crash at the slightest
> mis-emulation is closed source, much of the debugging is done by staring
> at the code.
>
> We already have a report about missing mouse clicks, which is
> possibly caused by interrupt mis-emulation. If you want to know exactly
> why I'm worried about increasing complexity, try to debug it.
>
> [Of course, complexity inevitably grows, and even when people remove
> code and simplify things, usually it is in order to add even more code
> and more complexity. But I want to be on the right side of the
> complexity/performance/flexibility/stability tradeoff.]
We are on the same page here. I have and will continue to strive to make design choices here that are sensitive to these and other similar issues. As always, comments on ways to improve these choices are always welcome.
>
>>>
>>>> + * have to use the new API
>>>> + */
>>>> +static inline int __kvm_vcpu_irq_pending(struct kvm_vcpu *vcpu)
>>>> +{
>>>> + int pending = __kvm_vcpu_irq_all_pending(vcpu);
>>>> +
>>>> + if (test_bit(kvm_irqpin_localint, &pending) ||
>>>> + test_bit(kvm_irqpin_extint, &pending))
>>>> + return 1;
>>>> +
>>>> + return 0;
>>>> +}
>>>> +
>>>> +static inline int kvm_vcpu_irq_pending(struct kvm_vcpu *vcpu)
>>>> +{
>>>> + int ret = 0;
>>>> + int flags;
>>>> +
>>>> + spin_lock_irqsave(&vcpu->irq.lock, flags);
>>>> + ret = __kvm_vcpu_irq_pending(vcpu);
>>>> + spin_unlock_irqrestore(&vcpu->irq.lock, flags);
>>>>
>>>>
>>> The locking seems superfluous.
>>>
>>
>> I believe there are places where we need to call the locked version of
> kvm_vcpu_irq_pending in the code, but I will review this to make sure.
>>
>>
>
> I meant, __kvm_vcpu_irq_pending is just reading stuff.
Ah, I see. I am not 100% sure about this, but I think you can make the same argument here as you can with that "double-checked locking is broken" article that you sent out. If I got anything out of that article (it was very interesting, BTW), it's that the locks do more than protect critical sections: they are an implicit memory barrier also. I am under the impression that we want that behavior here. I can be convinced otherwise....
>
>>
>>>> +
>>>> + return ret;
>>>> +}
>>>> +
>>>> +static inline void __kvm_vcpu_irq_push(struct kvm_vcpu *vcpu, int irq)
>>>> +{
>>>> + BUG_ON(vcpu->irq.deferred != -1); /* We can only hold one deferred */
>>>> +
>>>> + vcpu->irq.deferred = irq;
>>>> +}
>>>> +
>>>> +static inline void kvm_vcpu_irq_push(struct kvm_vcpu *vcpu, int irq)
>>>> +{
>>>> + int flags;
>>>> +
>>>> + spin_lock_irqsave(&vcpu->irq.lock, flags);
>>>> + __kvm_vcpu_irq_push(vcpu, irq);
>>>> + spin_unlock_irqrestore(&vcpu->irq.lock, flags);
>>>> +}
>>>> +
>>>>
>>>>
>>> Can you explain the logic behind push()/pop()? I realize you inherited
>>> it, but I don't think it fits well into the new model.
>>>
>>
>> It seems you have already figured this out in your later comments, but just
> to make sure we are clear I will answer your question anyway: The problem as
> I see it is that real-world PICs have the notion of an interrupt being
> accepted by the CPU during the acknowledgment cycle. What happens during
> that cycle is PIC dependent, but for something like an 8259 or LAPIC,
> generally it means at least moving the pending bit from the IRR to the ISR
> register. Once the vector is acknowledged, it is considered dispatched to
> the CPU. However, for VMs this is not always an atomic operation (e.g. the
> injection may fail under a certain set of circumstances such as those that
> cause a VMEXIT before the injection is complete). During those cases, we
> don't want to lose the interrupt so something must be done to preserve our
> current state for the next injection window.
>>
>> In the original KVM code, the vector was simply re-inserted back into the
> (effective) userint model's state. This solved the problem neatly albeit
> potentially unnaturally when compared to the real-world. When you introduce
> the models of actual PICs things get more complex. I had a choice between
> somehow aborting the previously accepted vector, or adding a new layer
> between the PIC and the vCPU (e.g. irq.deferred). Since the real-world PICs
> have no notion of "abort-ack", it would have been unnatural to add that
> feature at that layer. In addition, the operation would have to be supported
> with each model. The irq.deferred code works with all models and doesn't
> require a hack to the emulation of the PIC(s). It moves the problem to the
> VCPU which is the layer where the difference is (PCPU vs VCPU).
>>
>>
>
> But, once the vcpu gets back to the deferred irq, the tpr may have
> changed and no longer allow acceptance of this irq.
True, but I am not convinced this is a problem. (see below)
>
> Thinking a bit about this, the current code suffers from the same
> problem.
Right
> I guess it works because no OS is insane enough to page out
> the IDT or GDT, so the only faults we can get are handled by kvm, not
> the guest.
This is my thinking as well. The conditions that cause an injection failure are probably relatively lightweight w.r.t. the guest's execution context. Like for instance, maybe an NMI comes in during the VMENTRY and causes an immediate VMEXIT (e.g. the guest never made any forward progress, and therefore nothing else (e.g. TPR) has changed).
>
> So it seems the correct description is not 'un-ack the interrupt', as we
> have effectively acked it, but actually queue it pending host-only kvm
> processing.
This is exactly what I have done (if I understood what you were saying). When the injection fails we push the vector to the irq.deferred entry which takes a higher priority in the queue than the backing irqdevice (since it believes the vector is already dispatched).
> I'm not 100% sure that's the only case, though.
Yeah, me either. Let's hope so for now and we can address it when something comes along to reveal that this was an incorrect assumption. Otherwise we could be doing something ugly to the emulation for no good reason.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 3/5] KVM: Adds ability to preempt an executing VCPU
[not found] ` <462DCB3E.6070802-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-04-26 14:40 ` Gregory Haskins
0 siblings, 0 replies; 22+ messages in thread
From: Gregory Haskins @ 2007-04-26 14:40 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
>>> On Tue, Apr 24, 2007 at 5:17 AM, in message <462DCB3E.6070802-atKUWr5tajBWk0Htik3J/w@public.gmane.org>,
Avi Kivity <avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org> wrote:
> Gregory Haskins wrote:
>
>> Are there any other userspace sleeps that we need to handle (e.g. maybe
> AIO)? If so, one way to handle this is to mark the exportable state of the
> VCPU such that userspace can tell if interrupts are pending. However, I'm
> not really sure if this is the best way to do it or if it can be easily done
> in a way that doesn't break ABI compatibility. Please advise.
>>
>
> The ABI can be extended by adding fields to struct kvm_vcpu_run and an
> extension check to indicate their availability.
Cool. I was not aware of that capability. In that case, I think the reasonable thing to do is just that (extend to allow "pending" status). If the new userspace (which will implicitly understand the presence of an in-kernel APIC) needs to know, it can be coded to check this state.
-Greg
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 4/5] KVM: Local-APIC interface cleanup
[not found] ` <462DCD31.4030108-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-04-26 14:43 ` Gregory Haskins
0 siblings, 0 replies; 22+ messages in thread
From: Gregory Haskins @ 2007-04-26 14:43 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
>>> On Tue, Apr 24, 2007 at 5:26 AM, in message <462DCD31.4030108-atKUWr5tajBWk0Htik3J/w@public.gmane.org>,
Avi Kivity <avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org> wrote:
> Gregory Haskins wrote:
>
> #include "diatribes/complexity.h"
Hehe...ok, point taken. I will make the changes you recommended here:
1) leave cr8 state in VCPU
2) convert LAPIC abstraction from using a vtable to non-virtuals
3) handle LAPIC present/enabled in the abstraction using checks instead of function pointers.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/5] KVM: Add irqdevice object
[not found] ` <463080C8.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
@ 2007-04-26 16:26 ` Avi Kivity
0 siblings, 0 replies; 22+ messages in thread
From: Avi Kivity @ 2007-04-26 16:26 UTC (permalink / raw)
To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
Gregory Haskins wrote:
> Hi Avi, Sorry for the delay. I have been traveling this week. See inline...
>
>>>> [do we actually need a virtual destructor?]
>>>>
>>>>
>>> I believe it is the right thing to do, yes. The implementation of the
>>>
>> irqdevice destructor may be as simple as a kfree(), or could be arbitrarily
>> complex (don't forget that we will have multiple models..we already have
>> three: userint, kernint, and lapic. There may also be i8259 and
>> i8259_cascaded in the future).
>>
>>>
>>>
>> Yes, but does it need to be a function pointer? IOW, is the point it is
>> called generic code or already irqdevice-specific?
>>
>
> The code can be (and is) irqdevice specific, thus the virtual. In some cases, it will be as simple as a kfree(). In others (kernint, for instance), it might need to drop references to the apic/ext devices and do other cleanup (which reminds me that I should look at this to make sure it's done right today) ;)
>
>
I mean, "called from generic code". e.g. in C++ a destructor need not
be virtual unless you're destroying the object via a pointer to the base
class.
>> I meant, __kvm_vcpu_irq_pending is just reading stuff.
>>
>
> Ah, I see. I am not 100% sure about this, but I think you can make the same argument here as you can with that "double-checked locking is broken" article that you sent out. If I got anything out of that article (it was very interesting, BTW), it's that the locks do more than protect critical sections: they are an implicit memory barrier also. I am under the impression that we want that behavior here. I can be convinced otherwise....
>
>
There's a whole bunch of memory barriers available (smp_rmb() seems to
be indicated here, which is a noop on x86 IIRC).
>
>> I guess it works because no OS is insane enough to page out
>> the IDT or GDT, so the only faults we can get are handled by kvm, not
>> the guest.
>>
>
> This is my thinking as well. The conditions that cause an injection failure are probably relatively lightweight w.r.t. the guest's execution context. Like for instance, maybe an NMI comes in during the VMENTRY and causes an immediate VMEXIT (e.g. the guest never made any forward progress, and therefore nothing else (e.g. TPR) has changed).
>
>
That, or a shadow pagefault.
>> So it seems the correct description is not 'un-ack the interrupt', as we
>> have effectively acked it, but actually queue it pending host-only kvm
>> processing.
>>
>
> This is exactly what I have done (if I understood what you were saying). When the injection fails we push the vector to the irq.deferred entry which takes a higher priority in the queue than the backing irqdevice (since it believes the vector is already dispatched).
>
>
Okay. Windows can demand page drivers, but probably not the IDT :) We
shall have to live with the knowledge that emulation is incorrect wrt
taking exceptions while delivering external interrupts to the guest.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2007-04-26 16:26 UTC | newest]
Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-04-20  3:09 KVM: Patch series for in-kernel APIC support Gregory Haskins
2007-04-20  3:09 ` [PATCH 1/5] Adds support for in-kernel mmio handlers Gregory Haskins
2007-04-20  3:09 ` [PATCH 2/5] KVM: Add irqdevice object Gregory Haskins
2007-04-22  8:42   ` Avi Kivity
2007-04-23 13:58     ` Gregory Haskins
2007-04-24  9:09       ` Avi Kivity
2007-04-26 14:37         ` Gregory Haskins
2007-04-26 16:26           ` Avi Kivity
2007-04-20  3:09 ` [PATCH 3/5] KVM: Adds ability to preempt an executing VCPU Gregory Haskins
2007-04-22  8:50   ` Avi Kivity
2007-04-23 15:42     ` Gregory Haskins
2007-04-24  9:17       ` Avi Kivity
2007-04-26 14:40         ` Gregory Haskins
2007-04-20  3:09 ` [PATCH 4/5] KVM: Local-APIC interface cleanup Gregory Haskins
2007-04-22  8:54   ` Avi Kivity
2007-04-23 15:55     ` Gregory Haskins
2007-04-24  9:26       ` Avi Kivity
2007-04-26 14:43         ` Gregory Haskins
2007-04-20  3:09 ` [PATCH 5/5] KVM: Add support for in-kernel LAPIC model Gregory Haskins
2007-04-22  9:04   ` Avi Kivity
2007-04-23 15:57     ` Gregory Haskins
2007-04-22  9:06 ` KVM: Patch series for in-kernel APIC support Avi Kivity