public inbox for kvm@vger.kernel.org
* [PATCH 0/4] Kernel side patches for in-kernel APIC
@ 2007-05-02 21:43 Gregory Haskins
       [not found] ` <20070502212713.16738.8133.stgit-sLgBBP33vUGnsjUZhwzVf9HuzzzSOjJt@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Gregory Haskins @ 2007-05-02 21:43 UTC (permalink / raw)
  To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

The following is my kernel side patch series for adding in-kernel APIC logic.
By default, the code enters "level-0" mode and should be compatible with
existing userspace.  I have a patch series for userspace which enables
"level-1" mode which I will forward after this one.

I have incorporated most of the feedback I have received to date.  There were
a few things that I had initially agreed to do that you may find missing from
the changes.  I found a few places where my original decisions made more sense
to me than what I agreed to change, so I left them pending further
discussion.  E.g. "kvm_irqpin_t" was going to change to "kvm_cpuirq_t" but I
decided against it for reasons I can discuss if anyone is so inclined.

level-0 has been tested with both 32-bit Windows and 64-bit Linux *before* I
moved to git-HEAD.  Both worked without any discernible differences in
behavior.

I then bumped up to git-HEAD and adjusted all my patches to get ready for
submission.  Unfortunately, I seem to have run into a (possibly known)
regression(*) in the KVM codebase with that update where things aren't working
quite right.  What I did confirm is that the system behaves the same both
with and without my patches, for both level-0 and level-1 behavior.  It is now
at a point where testers could start to look at my patches and provide
bug/performance feedback in addition to code review comments.  All are
welcome.

(*) I see a guest-crash exception very early on in 64-bit Ubuntu 6.x, and I
get a blank screen in the SLED-10 64-bit graphical screen.  I have seen some
emails floating around potentially discussing screen-drawing issues, so
perhaps it's another occurrence of that.  In any case, I have been able to get
a significant amount of testing/bug-fixing done just with the SLED-10
installer's kernel booting, so I think things are relatively working (even in
level-1 mode).  To reiterate, I see the exact same behavior both with and
without my patches, so it's not necessarily anything I have done.  Otherwise,
I would not request testing help until things were reasonably stable.

Signed-off-by: Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/


* [PATCH 1/4] KVM: Adds support for in-kernel mmio handlers
       [not found] ` <20070502212713.16738.8133.stgit-sLgBBP33vUGnsjUZhwzVf9HuzzzSOjJt@public.gmane.org>
@ 2007-05-02 21:43   ` Gregory Haskins
       [not found]     ` <20070502214315.16738.68984.stgit-sLgBBP33vUGnsjUZhwzVf9HuzzzSOjJt@public.gmane.org>
  2007-05-02 21:43   ` [PATCH 2/4] KVM: Add irqdevice object Gregory Haskins
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 28+ messages in thread
From: Gregory Haskins @ 2007-05-02 21:43 UTC (permalink / raw)
  To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Signed-off-by: Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
---

 drivers/kvm/kvm.h      |   60 +++++++++++++++++++++++++++++++
 drivers/kvm/kvm_main.c |   94 ++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 142 insertions(+), 12 deletions(-)

diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
index 9c20d5d..b76631b 100644
--- a/drivers/kvm/kvm.h
+++ b/drivers/kvm/kvm.h
@@ -254,6 +254,65 @@ struct kvm_stat {
 	u32 light_exits;
 };
 
+struct kvm_io_device {
+	void (*read)(struct kvm_io_device *this,
+		     gpa_t addr,
+		     int len,
+		     void *val);
+	void (*write)(struct kvm_io_device *this,
+		      gpa_t addr,
+		      int len,
+		      const void *val);
+	int (*in_range)(struct kvm_io_device *this, gpa_t addr);
+	void (*destructor)(struct kvm_io_device *this);
+
+	void             *private;
+};
+
+static inline void kvm_iodevice_read(struct kvm_io_device *dev,
+				     gpa_t addr,
+				     int len,
+				     void *val)
+{
+	dev->read(dev, addr, len, val);
+}
+
+static inline void kvm_iodevice_write(struct kvm_io_device *dev,
+				      gpa_t addr,
+				      int len,
+				      const void *val)
+{
+	dev->write(dev, addr, len, val);
+}
+
+static inline int kvm_iodevice_inrange(struct kvm_io_device *dev, gpa_t addr)
+{
+	return dev->in_range(dev, addr);
+}
+
+static inline void kvm_iodevice_destructor(struct kvm_io_device *dev)
+{
+	dev->destructor(dev);
+}
+
+/*
+ * It would be nice to use something smarter than a linear search, TBD...
+ * Thankfully we don't expect many devices to register (famous last words :),
+ * so until then it will suffice.  At least it's abstracted so we can change
+ * it in one place.
+ */
+struct kvm_io_bus {
+	int                   dev_count;
+#define NR_IOBUS_DEVS 6
+	struct kvm_io_device *devs[NR_IOBUS_DEVS];
+};
+
+void kvm_io_bus_init(struct kvm_io_bus *bus);
+void kvm_io_bus_destroy(struct kvm_io_bus *bus);
+struct kvm_io_device *kvm_io_bus_find_dev(struct kvm_io_bus *bus, gpa_t addr);
+void kvm_io_bus_register_dev(struct kvm_io_bus *bus,
+			     struct kvm_io_device *dev);
+
 struct kvm_vcpu {
 	struct kvm *kvm;
 	union {
@@ -367,6 +426,7 @@ struct kvm {
 	unsigned long rmap_overflow;
 	struct list_head vm_list;
 	struct file *filp;
+	struct kvm_io_bus mmio_bus;
 };
 
 struct descriptor_table {
diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index a3723dd..089fad6 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -295,6 +295,7 @@ static struct kvm *kvm_create_vm(void)
 
 	spin_lock_init(&kvm->lock);
 	INIT_LIST_HEAD(&kvm->active_mmu_pages);
+	kvm_io_bus_init(&kvm->mmio_bus);
 	for (i = 0; i < KVM_MAX_VCPUS; ++i) {
 		struct kvm_vcpu *vcpu = &kvm->vcpus[i];
 
@@ -392,6 +393,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	spin_lock(&kvm_lock);
 	list_del(&kvm->vm_list);
 	spin_unlock(&kvm_lock);
+	kvm_io_bus_destroy(&kvm->mmio_bus);
 	kvm_free_vcpus(kvm);
 	kvm_free_physmem(kvm);
 	kfree(kvm);
@@ -1015,12 +1017,25 @@ static int emulator_write_std(unsigned long addr,
 	return X86EMUL_UNHANDLEABLE;
 }
 
+static struct kvm_io_device *vcpu_find_mmio_dev(struct kvm_vcpu *vcpu,
+						gpa_t addr)
+{
+	/*
+	 * Note that it's important to have this wrapper function because
+	 * in the very near future we will be checking for MMIOs against
+	 * the LAPIC as well as the general MMIO bus.
+	 */
+	return kvm_io_bus_find_dev(&vcpu->kvm->mmio_bus, addr);
+}
+
 static int emulator_read_emulated(unsigned long addr,
 				  void *val,
 				  unsigned int bytes,
 				  struct x86_emulate_ctxt *ctxt)
 {
-	struct kvm_vcpu *vcpu = ctxt->vcpu;
+	struct kvm_vcpu      *vcpu = ctxt->vcpu;
+	struct kvm_io_device *mmio_dev;
+	gpa_t                 gpa;
 
 	if (vcpu->mmio_read_completed) {
 		memcpy(val, vcpu->mmio_data, bytes);
@@ -1029,18 +1044,26 @@ static int emulator_read_emulated(unsigned long addr,
 	} else if (emulator_read_std(addr, val, bytes, ctxt)
 		   == X86EMUL_CONTINUE)
 		return X86EMUL_CONTINUE;
-	else {
-		gpa_t gpa = vcpu->mmu.gva_to_gpa(vcpu, addr);
 
-		if (gpa == UNMAPPED_GVA)
-			return X86EMUL_PROPAGATE_FAULT;
-		vcpu->mmio_needed = 1;
-		vcpu->mmio_phys_addr = gpa;
-		vcpu->mmio_size = bytes;
-		vcpu->mmio_is_write = 0;
+	gpa = vcpu->mmu.gva_to_gpa(vcpu, addr);
+	if (gpa == UNMAPPED_GVA)
+		return X86EMUL_PROPAGATE_FAULT;
 
-		return X86EMUL_UNHANDLEABLE;
+	/*
+	 * Is this MMIO handled locally?
+	 */
+	mmio_dev = vcpu_find_mmio_dev(vcpu, gpa);
+	if (mmio_dev) {
+		kvm_iodevice_read(mmio_dev, gpa, bytes, val);
+		return X86EMUL_CONTINUE;
 	}
+
+	vcpu->mmio_needed = 1;
+	vcpu->mmio_phys_addr = gpa;
+	vcpu->mmio_size = bytes;
+	vcpu->mmio_is_write = 0;
+
+	return X86EMUL_UNHANDLEABLE;
 }
 
 static int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
@@ -1068,8 +1091,9 @@ static int emulator_write_emulated(unsigned long addr,
 				   unsigned int bytes,
 				   struct x86_emulate_ctxt *ctxt)
 {
-	struct kvm_vcpu *vcpu = ctxt->vcpu;
-	gpa_t gpa = vcpu->mmu.gva_to_gpa(vcpu, addr);
+	struct kvm_vcpu      *vcpu = ctxt->vcpu;
+	struct kvm_io_device *mmio_dev;
+	gpa_t                 gpa = vcpu->mmu.gva_to_gpa(vcpu, addr);
 
 	if (gpa == UNMAPPED_GVA) {
 		kvm_arch_ops->inject_page_fault(vcpu, addr, 2);
@@ -1079,6 +1103,15 @@ static int emulator_write_emulated(unsigned long addr,
 	if (emulator_write_phys(vcpu, gpa, val, bytes))
 		return X86EMUL_CONTINUE;
 
+	/*
+	 * Is this MMIO handled locally?
+	 */
+	mmio_dev = vcpu_find_mmio_dev(vcpu, gpa);
+	if (mmio_dev) {
+		kvm_iodevice_write(mmio_dev, gpa, bytes, val);
+		return X86EMUL_CONTINUE;
+	}
+
 	vcpu->mmio_needed = 1;
 	vcpu->mmio_phys_addr = gpa;
 	vcpu->mmio_size = bytes;
@@ -2907,6 +2940,43 @@ static int kvm_cpu_hotplug(struct notifier_block *notifier, unsigned long val,
 	return NOTIFY_OK;
 }
 
+void kvm_io_bus_init(struct kvm_io_bus *bus)
+{
+	memset(bus, 0, sizeof(*bus));
+}
+
+void kvm_io_bus_destroy(struct kvm_io_bus *bus)
+{
+	int i;
+
+	for (i = 0; i < bus->dev_count; i++) {
+		struct kvm_io_device *pos = bus->devs[i];
+
+		kvm_iodevice_destructor(pos);
+	}
+}
+
+struct kvm_io_device *kvm_io_bus_find_dev(struct kvm_io_bus *bus, gpa_t addr)
+{
+	int i;
+
+	for (i = 0; i < bus->dev_count; i++) {
+		struct kvm_io_device *pos = bus->devs[i];
+
+		if (pos->in_range(pos, addr))
+			return pos;
+	}
+
+	return NULL;
+}
+
+void kvm_io_bus_register_dev(struct kvm_io_bus *bus, struct kvm_io_device *dev)
+{
+	BUG_ON(bus->dev_count >= NR_IOBUS_DEVS);
+
+	bus->devs[bus->dev_count++] = dev;
+}
+
 static struct notifier_block kvm_cpu_notifier = {
 	.notifier_call = kvm_cpu_hotplug,
 	.priority = 20, /* must be > scheduler priority */




* [PATCH 2/4] KVM: Add irqdevice object
       [not found] ` <20070502212713.16738.8133.stgit-sLgBBP33vUGnsjUZhwzVf9HuzzzSOjJt@public.gmane.org>
  2007-05-02 21:43   ` [PATCH 1/4] KVM: Adds support for in-kernel mmio handlers Gregory Haskins
@ 2007-05-02 21:43   ` Gregory Haskins
       [not found]     ` <20070502214320.16738.21505.stgit-sLgBBP33vUGnsjUZhwzVf9HuzzzSOjJt@public.gmane.org>
  2007-05-02 21:43   ` [PATCH 3/4] KVM: Adds ability to preempt an executing VCPU Gregory Haskins
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 28+ messages in thread
From: Gregory Haskins @ 2007-05-02 21:43 UTC (permalink / raw)
  To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

The current code is geared towards using a user-mode (A)PIC.  This patch adds
an "irqdevice" abstraction, and implements a "userint" model to handle the
duties of the original code.  Later, we can develop other irqdevice models
to handle objects like the LAPIC, IOAPIC, i8259, etc., as appropriate.

Signed-off-by: Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
---

 drivers/kvm/Makefile    |    2 
 drivers/kvm/irqdevice.h |  165 +++++++++++++++++++++++++++++++++++
 drivers/kvm/kvm.h       |  108 ++++++++++++++++++++++-
 drivers/kvm/kvm_main.c  |   58 +++++++++---
 drivers/kvm/svm.c       |  140 +++++++++++++++++++++--------
 drivers/kvm/userint.c   |  224 +++++++++++++++++++++++++++++++++++++++++++++++
 drivers/kvm/vmx.c       |  144 +++++++++++++++++++++++-------
 7 files changed, 744 insertions(+), 97 deletions(-)

diff --git a/drivers/kvm/Makefile b/drivers/kvm/Makefile
index c0a789f..540afbc 100644
--- a/drivers/kvm/Makefile
+++ b/drivers/kvm/Makefile
@@ -2,7 +2,7 @@
 # Makefile for Kernel-based Virtual Machine module
 #
 
-kvm-objs := kvm_main.o mmu.o x86_emulate.o
+kvm-objs := kvm_main.o mmu.o x86_emulate.o userint.o
 obj-$(CONFIG_KVM) += kvm.o
 kvm-intel-objs = vmx.o
 obj-$(CONFIG_KVM_INTEL) += kvm-intel.o
diff --git a/drivers/kvm/irqdevice.h b/drivers/kvm/irqdevice.h
new file mode 100644
index 0000000..a7dfcc7
--- /dev/null
+++ b/drivers/kvm/irqdevice.h
@@ -0,0 +1,165 @@
+/*
+ * Defines an interface for an abstract interrupt controller.  The model
+ * consists of a unit with an arbitrary number of input lines N (IRQ0-(N-1)),
+ * an arbitrary number of output lines (INTR) (LINT, EXTINT, NMI, etc), and
+ * methods for completing an interrupt-acknowledge cycle (INTA).  A particular
+ * implementation of this model will define various policies, such as
+ * irq-to-vector translation, INTA/auto-EOI policy, etc.
+ *
+ * In addition, the INTR callback mechanism allows the unit to be "wired" to
+ * an interruptible source in a very flexible manner. For instance, an
+ * irqdevice could have its INTR wired to a VCPU (ala LAPIC), or another
+ * interrupt controller (ala cascaded i8259s)
+ *
+ * Copyright (C) 2007 Novell
+ *
+ * Authors:
+ *   Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef __IRQDEVICE_H
+#define __IRQDEVICE_H
+
+struct kvm_irqdevice;
+
+typedef enum {
+	kvm_irqpin_localint,
+	kvm_irqpin_extint,
+	kvm_irqpin_smi,
+	kvm_irqpin_nmi,
+	kvm_irqpin_invalid, /* must always be last */
+} kvm_irqpin_t;
+
+#define KVM_IRQACK_VALID    (1 << 0)
+#define KVM_IRQACK_AGAIN    (1 << 1)
+#define KVM_IRQACK_TPRMASK  (1 << 2)
+
+struct kvm_irqsink {
+	void (*set_intr)(struct kvm_irqsink *this,
+			 struct kvm_irqdevice *dev,
+			 kvm_irqpin_t pin);
+
+	void *private;
+};
+
+struct kvm_irqdevice {
+	int  (*ack)(struct kvm_irqdevice *this, int *vector);
+	int  (*set_pin)(struct kvm_irqdevice *this, int pin, int level);
+	void (*destructor)(struct kvm_irqdevice *this);
+
+	void               *private;
+	struct kvm_irqsink  sink;
+};
+
+/**
+ * kvm_irqdevice_init - initialize the kvm_irqdevice for use
+ * @dev: The device
+ *
+ * Description: Initialize the kvm_irqdevice for use.  Should be called before
+ *              calling any derived implementation init functions
+ *
+ * Returns: (void)
+ */
+static inline void kvm_irqdevice_init(struct kvm_irqdevice *dev)
+{
+	memset(dev, 0, sizeof(*dev));
+}
+
+/**
+ * kvm_irqdevice_ack - read and ack the highest priority vector from the device
+ * @dev: The device
+ * @vector: Retrieves the highest priority pending vector
+ *                [ NULL = Don't ack a vector, just check pending status]
+ *                [ non-NULL = Pointer to receive vector data (out only)]
+ *
+ * Description: Read the highest priority pending vector from the device,
+ *              potentially invoking auto-EOI depending on device policy
+ *
+ * Returns: (int)
+ *   [ -1 = failure]
+ *   [>=0 = bitmap as follows: ]
+ *         [ KVM_IRQACK_VALID   = vector is valid]
+ *         [ KVM_IRQACK_AGAIN   = more unmasked vectors are available]
+ *         [ KVM_IRQACK_TPRMASK = TPR masked vectors are blocked]
+ */
+static inline int kvm_irqdevice_ack(struct kvm_irqdevice *dev,
+					    int *vector)
+{
+	return dev->ack(dev, vector);
+}
+
+/**
+ * kvm_irqdevice_set_pin - allows the caller to assert/deassert an IRQ
+ * @dev: The device
+ * @pin: The input pin to alter
+ * @level: The value to set (1 = assert, 0 = deassert)
+ *
+ * Description: Allows the caller to assert/deassert an IRQ input pin to the
+ *              device according to device policy.
+ *
+ * Returns: (int)
+ *   [-1 = failure]
+ *   [ 0 = success]
+ */
+static inline int kvm_irqdevice_set_pin(struct kvm_irqdevice *dev, int pin,
+				  int level)
+{
+	return dev->set_pin(dev, pin, level);
+}
+
+/**
+ * kvm_irqdevice_register_sink - registers a kvm_irqsink object
+ * @dev: The device
+ * @sink: The sink to register.  Data will be copied so building object from
+ *        transient storage is ok.
+ *
+ * Description: Registers a kvm_irqsink object as an INTR callback
+ *
+ * Returns: (void)
+ */
+static inline void kvm_irqdevice_register_sink(struct kvm_irqdevice *dev,
+					       const struct kvm_irqsink *sink)
+{
+	dev->sink = *sink;
+}
+
+/**
+ * kvm_irqdevice_destructor - destroys an irqdevice
+ * @dev: The device
+ *
+ * Returns: (void)
+ */
+static inline void kvm_irqdevice_destructor(struct kvm_irqdevice *dev)
+{
+	dev->destructor(dev);
+}
+
+/**
+ * kvm_irqdevice_set_intr - invokes a registered INTR callback
+ * @dev: The device
+ * @pin: Identifies the pin to alter -
+ *           [ KVM_IRQPIN_LOCALINT (default) - a vector is pending on this
+ *                                             device]
+ *           [ KVM_IRQPIN_EXTINT - a vector is pending on an external device]
+ *           [ KVM_IRQPIN_SMI - system-management-interrupt pin]
+ *           [ KVM_IRQPIN_NMI - non-maskable-interrupt pin]
+ *
+ * Description: Invokes a registered INTR callback (if present).  This
+ *              function is meant to be used privately by an irqdevice
+ *              implementation.
+ *
+ * Returns: (void)
+ */
+static inline void kvm_irqdevice_set_intr(struct kvm_irqdevice *dev,
+					  kvm_irqpin_t pin)
+{
+	struct kvm_irqsink *sink = &dev->sink;
+	if (sink->set_intr)
+		sink->set_intr(sink, dev, pin);
+}
+
+#endif /*  __IRQDEVICE_H */
diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
index b76631b..d41d653 100644
--- a/drivers/kvm/kvm.h
+++ b/drivers/kvm/kvm.h
@@ -13,6 +13,7 @@
 #include <linux/mm.h>
 
 #include "vmx.h"
+#include "irqdevice.h"
 #include <linux/kvm.h>
 #include <linux/kvm_para.h>
 
@@ -158,6 +159,11 @@ struct vmcs {
 
 struct kvm_vcpu;
 
+int kvm_user_irqdev_init(struct kvm_irqdevice *dev);
+int kvm_user_irqdev_save(struct kvm_irqdevice *this, void *data);
+int kvm_user_irqdev_restore(struct kvm_irqdevice *this, void *data);
+int kvm_userint_init(struct kvm_vcpu *vcpu);
+
 /*
  * x86 supports 3 paging modes (4-level 64-bit, 3-level 64-bit, and 2-level
  * 32-bit).  The kvm_mmu structure abstracts the details of the current mmu
@@ -313,6 +319,18 @@ struct kvm_io_device *kvm_io_bus_find_dev(struct kvm_io_bus *bus, gpa_t addr);
 void kvm_io_bus_register_dev(struct kvm_io_bus *bus,
 			     struct kvm_io_device *dev);
 
+#define NR_IRQ_WORDS KVM_IRQ_BITMAP_SIZE(unsigned long)
+
+/*
+ * structure for maintaining info for interrupting an executing VCPU
+ */
+struct kvm_vcpu_irq {
+	spinlock_t           lock;
+	struct kvm_irqdevice dev;
+	unsigned long        pending;
+	int                  deferred;
+};
+
 struct kvm_vcpu {
 	struct kvm *kvm;
 	union {
@@ -325,9 +343,7 @@ struct kvm_vcpu {
 	u64 host_tsc;
 	struct kvm_run *run;
 	int interrupt_window_open;
-	unsigned long irq_summary; /* bit vector: 1 per word in irq_pending */
-#define NR_IRQ_WORDS KVM_IRQ_BITMAP_SIZE(unsigned long)
-	unsigned long irq_pending[NR_IRQ_WORDS];
+	struct kvm_vcpu_irq irq;
 	unsigned long regs[NR_VCPU_REGS]; /* for rsp: vcpu_load_rsp_rip() */
 	unsigned long rip;      /* needs vcpu_load_rsp_rip() */
 
@@ -394,6 +410,92 @@ struct kvm_vcpu {
 	struct kvm_cpuid_entry cpuid_entries[KVM_MAX_CPUID_ENTRIES];
 };
 
+/*
+ * Assumes lock already held
+ */
+static inline int __kvm_vcpu_irq_all_pending(struct kvm_vcpu *vcpu)
+{
+	unsigned long pending = vcpu->irq.pending;
+
+	if (vcpu->irq.deferred != -1)
+		__set_bit(kvm_irqpin_localint, &pending);
+
+	return pending;
+}
+
+/*
+ * These two functions are helpers for determining if a standard interrupt
+ * is pending to replace the old "if (vcpu->irq_summary)" logic.  If the
+ * caller wants to know about some of the new advanced interrupt types
+ * (SMI, NMI, etc) or to differentiate between localint and extint they will
+ *  have to use the new API
+ */
+static inline int __kvm_vcpu_irq_pending(struct kvm_vcpu *vcpu)
+{
+	unsigned long pending = __kvm_vcpu_irq_all_pending(vcpu);
+
+	if (test_bit(kvm_irqpin_localint, &pending) ||
+	    test_bit(kvm_irqpin_extint, &pending))
+		return 1;
+
+	return 0;
+}
+
+static inline int kvm_vcpu_irq_pending(struct kvm_vcpu *vcpu)
+{
+	int ret = 0;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vcpu->irq.lock, flags);
+	ret = __kvm_vcpu_irq_pending(vcpu);
+	spin_unlock_irqrestore(&vcpu->irq.lock, flags);
+
+	return ret;
+}
+
+/*
+ * Assumes lock already held
+ */
+static inline int kvm_vcpu_irq_pop(struct kvm_vcpu *vcpu, int *vector)
+{
+	int ret = 0;
+
+	if (vcpu->irq.deferred != -1) {
+		if (vector) {
+			ret |= KVM_IRQACK_VALID;
+			*vector = vcpu->irq.deferred;
+			vcpu->irq.deferred = -1;
+		}
+		ret |= kvm_irqdevice_ack(&vcpu->irq.dev, NULL);
+	} else
+		ret = kvm_irqdevice_ack(&vcpu->irq.dev, vector);
+
+	/*
+	 * If there are no more interrupts and we are edge triggered,
+	 * we must clear the status flag
+	 */
+	if (!(ret & KVM_IRQACK_AGAIN))
+		__clear_bit(kvm_irqpin_localint, &vcpu->irq.pending);
+
+	return ret;
+}
+
+static inline void __kvm_vcpu_irq_push(struct kvm_vcpu *vcpu, int irq)
+{
+	BUG_ON(vcpu->irq.deferred != -1); /* We can only hold one deferred */
+
+	vcpu->irq.deferred = irq;
+}
+
+static inline void kvm_vcpu_irq_push(struct kvm_vcpu *vcpu, int irq)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&vcpu->irq.lock, flags);
+	__kvm_vcpu_irq_push(vcpu, irq);
+	spin_unlock_irqrestore(&vcpu->irq.lock, flags);
+}
+
 struct kvm_mem_alias {
 	gfn_t base_gfn;
 	unsigned long npages;
diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index 089fad6..9aeb2f7 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -300,6 +300,11 @@ static struct kvm *kvm_create_vm(void)
 		struct kvm_vcpu *vcpu = &kvm->vcpus[i];
 
 		mutex_init(&vcpu->mutex);
+
+		memset(&vcpu->irq, 0, sizeof(vcpu->irq));
+		spin_lock_init(&vcpu->irq.lock);
+		vcpu->irq.deferred = -1;
+
 		vcpu->cpu = -1;
 		vcpu->kvm = kvm;
 		vcpu->mmu.root_hpa = INVALID_PAGE;
@@ -367,6 +372,7 @@ static void kvm_free_vcpu(struct kvm_vcpu *vcpu)
 	vcpu_load(vcpu);
 	kvm_mmu_destroy(vcpu);
 	vcpu_put(vcpu);
+	kvm_irqdevice_destructor(&vcpu->irq.dev);
 	kvm_arch_ops->vcpu_free(vcpu);
 	free_page((unsigned long)vcpu->run);
 	vcpu->run = NULL;
@@ -1985,8 +1991,7 @@ static int kvm_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu,
 	sregs->efer = vcpu->shadow_efer;
 	sregs->apic_base = vcpu->apic_base;
 
-	memcpy(sregs->interrupt_bitmap, vcpu->irq_pending,
-	       sizeof sregs->interrupt_bitmap);
+	kvm_user_irqdev_save(&vcpu->irq.dev, &sregs->interrupt_bitmap);
 
 	vcpu_put(vcpu);
 
@@ -2003,7 +2008,6 @@ static int kvm_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
 				    struct kvm_sregs *sregs)
 {
 	int mmu_reset_needed = 0;
-	int i;
 	struct descriptor_table dt;
 
 	vcpu_load(vcpu);
@@ -2040,12 +2044,8 @@ static int kvm_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
 	if (mmu_reset_needed)
 		kvm_mmu_reset_context(vcpu);
 
-	memcpy(vcpu->irq_pending, sregs->interrupt_bitmap,
-	       sizeof vcpu->irq_pending);
-	vcpu->irq_summary = 0;
-	for (i = 0; i < NR_IRQ_WORDS; ++i)
-		if (vcpu->irq_pending[i])
-			__set_bit(i, &vcpu->irq_summary);
+	kvm_user_irqdev_restore(&vcpu->irq.dev,
+				&sregs->interrupt_bitmap[0]);
 
 	set_segment(vcpu, &sregs->cs, VCPU_SREG_CS);
 	set_segment(vcpu, &sregs->ds, VCPU_SREG_DS);
@@ -2206,14 +2206,8 @@ static int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu,
 {
 	if (irq->irq < 0 || irq->irq >= 256)
 		return -EINVAL;
-	vcpu_load(vcpu);
-
-	set_bit(irq->irq, vcpu->irq_pending);
-	set_bit(irq->irq / BITS_PER_LONG, &vcpu->irq_summary);
 
-	vcpu_put(vcpu);
-
-	return 0;
+	return kvm_irqdevice_set_pin(&vcpu->irq.dev, irq->irq, 1);
 }
 
 static int kvm_vcpu_ioctl_debug_guest(struct kvm_vcpu *vcpu,
@@ -2315,6 +2309,32 @@ out1:
 }
 
 /*
+ * This function will be invoked whenever the vcpu->irq.dev raises its INTR
+ * line
+ */
+static void kvm_vcpu_intr(struct kvm_irqsink *this,
+			  struct kvm_irqdevice *dev,
+			  kvm_irqpin_t pin)
+{
+	struct kvm_vcpu *vcpu = (struct kvm_vcpu*)this->private;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vcpu->irq.lock, flags);
+	__set_bit(pin, &vcpu->irq.pending);
+	spin_unlock_irqrestore(&vcpu->irq.lock, flags);
+}
+
+static void kvm_vcpu_irqsink_init(struct kvm_vcpu *vcpu)
+{
+	struct kvm_irqsink sink = {
+		.set_intr   = kvm_vcpu_intr,
+		.private    = vcpu
+	};
+
+	kvm_irqdevice_register_sink(&vcpu->irq.dev, &sink);
+}
+
+/*
  * Creates some virtual cpus.  Good luck creating more than one.
  */
 static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, int n)
@@ -2361,6 +2381,12 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, int n)
 	if (r < 0)
 		goto out_free_vcpus;
 
+	kvm_irqdevice_init(&vcpu->irq.dev);
+	kvm_vcpu_irqsink_init(vcpu);
+	r = kvm_userint_init(vcpu);
+	if (r < 0)
+		goto out_free_vcpus;
+
 	kvm_arch_ops->vcpu_load(vcpu);
 	r = kvm_mmu_setup(vcpu);
 	if (r >= 0)
diff --git a/drivers/kvm/svm.c b/drivers/kvm/svm.c
index b621403..39cc596 100644
--- a/drivers/kvm/svm.c
+++ b/drivers/kvm/svm.c
@@ -106,24 +106,6 @@ static unsigned get_addr_size(struct kvm_vcpu *vcpu)
 				(cs_attrib & SVM_SELECTOR_DB_MASK) ? 4 : 2;
 }
 
-static inline u8 pop_irq(struct kvm_vcpu *vcpu)
-{
-	int word_index = __ffs(vcpu->irq_summary);
-	int bit_index = __ffs(vcpu->irq_pending[word_index]);
-	int irq = word_index * BITS_PER_LONG + bit_index;
-
-	clear_bit(bit_index, &vcpu->irq_pending[word_index]);
-	if (!vcpu->irq_pending[word_index])
-		clear_bit(word_index, &vcpu->irq_summary);
-	return irq;
-}
-
-static inline void push_irq(struct kvm_vcpu *vcpu, u8 irq)
-{
-	set_bit(irq, vcpu->irq_pending);
-	set_bit(irq / BITS_PER_LONG, &vcpu->irq_summary);
-}
-
 static inline void clgi(void)
 {
 	asm volatile (SVM_CLGI);
@@ -904,7 +886,12 @@ static int pf_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 	int r;
 
 	if (is_external_interrupt(exit_int_info))
-		push_irq(vcpu, exit_int_info & SVM_EVTINJ_VEC_MASK);
+		/*
+		 * An exception was taken while we were trying to inject an
+		 * IRQ.  We must defer the injection of the vector until
+		 * the next window.
+		 */
+		kvm_vcpu_irq_push(vcpu, exit_int_info & SVM_EVTINJ_VEC_MASK);
 
 	spin_lock(&vcpu->kvm->lock);
 
@@ -1114,7 +1101,7 @@ static int halt_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 {
 	vcpu->svm->next_rip = vcpu->svm->vmcb->save.rip + 1;
 	skip_emulated_instruction(vcpu);
-	if (vcpu->irq_summary)
+	if (kvm_vcpu_irq_pending(vcpu))
 		return 1;
 
 	kvm_run->exit_reason = KVM_EXIT_HLT;
@@ -1285,7 +1272,7 @@ static int interrupt_window_interception(struct kvm_vcpu *vcpu,
 	 * possible
 	 */
 	if (kvm_run->request_interrupt_window &&
-	    !vcpu->irq_summary) {
+	    !kvm_vcpu_irq_pending(vcpu)) {
 		++vcpu->stat.irq_window_exits;
 		kvm_run->exit_reason = KVM_EXIT_IRQ_WINDOW_OPEN;
 		return 0;
@@ -1384,60 +1371,121 @@ static void pre_svm_run(struct kvm_vcpu *vcpu)
 }
 
 
-static inline void kvm_do_inject_irq(struct kvm_vcpu *vcpu)
-{
-	struct vmcb_control_area *control;
-
-	control = &vcpu->svm->vmcb->control;
-	control->int_vector = pop_irq(vcpu);
-	control->int_ctl &= ~V_INTR_PRIO_MASK;
-	control->int_ctl |= V_IRQ_MASK |
-		((/*control->int_vector >> 4*/ 0xf) << V_INTR_PRIO_SHIFT);
-}
-
 static void kvm_reput_irq(struct kvm_vcpu *vcpu)
 {
 	struct vmcb_control_area *control = &vcpu->svm->vmcb->control;
 
 	if (control->int_ctl & V_IRQ_MASK) {
 		control->int_ctl &= ~V_IRQ_MASK;
-		push_irq(vcpu, control->int_vector);
+		kvm_vcpu_irq_push(vcpu, control->int_vector);
 	}
 
 	vcpu->interrupt_window_open =
 		!(control->int_state & SVM_INTERRUPT_SHADOW_MASK);
 }
 
-static void do_interrupt_requests(struct kvm_vcpu *vcpu,
-				       struct kvm_run *kvm_run)
+static int do_intr_requests(struct kvm_vcpu *vcpu,
+			    struct kvm_run *kvm_run,
+			    kvm_irqpin_t pin)
 {
 	struct vmcb_control_area *control = &vcpu->svm->vmcb->control;
+	int r = 0;
+	int handled = 0;
 
 	vcpu->interrupt_window_open =
 		(!(control->int_state & SVM_INTERRUPT_SHADOW_MASK) &&
 		 (vcpu->svm->vmcb->save.rflags & X86_EFLAGS_IF));
 
-	if (vcpu->interrupt_window_open && vcpu->irq_summary)
+	if (vcpu->interrupt_window_open) {
+		int irq;
+
 		/*
-		 * If interrupts enabled, and not blocked by sti or mov ss. Good.
+		 * If interrupts enabled, and not blocked by sti or mov ss.
+		 * Good.
 		 */
-		kvm_do_inject_irq(vcpu);
+
+		switch (pin) {
+		case kvm_irqpin_localint:
+			r = kvm_vcpu_irq_pop(vcpu, &irq);
+			break;
+		case kvm_irqpin_extint:
+			printk(KERN_WARNING "KVM: external-interrupts not " \
+			       "handled yet\n");
+			__clear_bit(pin, &vcpu->irq.pending);
+			break;
+		default:
+			panic("KVM: unknown interrupt pin raised: %d\n", pin);
+			break;
+		}
+
+		BUG_ON(r < 0);
+
+		if (r & KVM_IRQACK_VALID) {
+			control = &vcpu->svm->vmcb->control;
+			control->int_vector = irq;
+			control->int_ctl &= ~V_INTR_PRIO_MASK;
+			control->int_ctl |= V_IRQ_MASK |
+				((/*control->int_vector >> 4*/ 0xf) <<
+				 V_INTR_PRIO_SHIFT);
+
+			handled = 1;
+		}
+	}
 
 	/*
 	 * Interrupts blocked.  Wait for unblock.
 	 */
 	if (!vcpu->interrupt_window_open &&
-	    (vcpu->irq_summary || kvm_run->request_interrupt_window)) {
+	    ((r & KVM_IRQACK_AGAIN) ||
+	     __kvm_vcpu_irq_pending(vcpu) ||
+	     kvm_run->request_interrupt_window)) {
 		control->intercept |= 1ULL << INTERCEPT_VINTR;
 	} else
 		control->intercept &= ~(1ULL << INTERCEPT_VINTR);
+
+	return handled;
+}
+
+static void do_interrupt_requests(struct kvm_vcpu *vcpu,
+				  struct kvm_run *kvm_run)
+{
+	unsigned long pending = __kvm_vcpu_irq_all_pending(vcpu);
+	int handled = 0;
+
+	while (pending && !handled) {
+		kvm_irqpin_t pin = __fls(pending);
+
+		switch (pin) {
+		case kvm_irqpin_localint:
+		case kvm_irqpin_extint:
+			handled = do_intr_requests(vcpu, kvm_run, pin);
+			break;
+		case kvm_irqpin_smi:
+		case kvm_irqpin_nmi:
+			/* ignored (for now) */
+			printk(KERN_WARNING
+			       "KVM: dropping unhandled SMI/NMI: %d\n",
+			       pin);
+			__clear_bit(pin, &vcpu->irq.pending);
+			break;
+		case kvm_irqpin_invalid:
+			/* drop */
+			break;
+		default:
+			panic("KVM: unknown interrupt pin raised: %d\n", pin);
+			break;
+		}
+
+		__clear_bit(pin, &pending);
+	}
 }
 
 static void post_kvm_run_save(struct kvm_vcpu *vcpu,
 			      struct kvm_run *kvm_run)
 {
-	kvm_run->ready_for_interrupt_injection = (vcpu->interrupt_window_open &&
-						  vcpu->irq_summary == 0);
+	kvm_run->ready_for_interrupt_injection =
+		(vcpu->interrupt_window_open &&
+		 !kvm_vcpu_irq_pending(vcpu));
 	kvm_run->if_flag = (vcpu->svm->vmcb->save.rflags & X86_EFLAGS_IF) != 0;
 	kvm_run->cr8 = vcpu->cr8;
 	kvm_run->apic_base = vcpu->apic_base;
@@ -1452,7 +1500,7 @@ static void post_kvm_run_save(struct kvm_vcpu *vcpu,
 static int dm_request_for_irq_injection(struct kvm_vcpu *vcpu,
 					  struct kvm_run *kvm_run)
 {
-	return (!vcpu->irq_summary &&
+	return (!kvm_vcpu_irq_pending(vcpu) &&
 		kvm_run->request_interrupt_window &&
 		vcpu->interrupt_window_open &&
 		(vcpu->svm->vmcb->save.rflags & X86_EFLAGS_IF));
@@ -1482,9 +1530,17 @@ static int svm_vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 	int r;
 
 again:
+	spin_lock(&vcpu->irq.lock);
+
+	/*
+	 * We must inject interrupts (if any) while the irq_lock
+	 * is held
+	 */
 	if (!vcpu->mmio_read_completed)
 		do_interrupt_requests(vcpu, kvm_run);
 
+	spin_unlock(&vcpu->irq.lock);
+
 	clgi();
 
 	pre_svm_run(vcpu);
diff --git a/drivers/kvm/userint.c b/drivers/kvm/userint.c
new file mode 100644
index 0000000..eeac005
--- /dev/null
+++ b/drivers/kvm/userint.c
@@ -0,0 +1,224 @@
+/*
+ * User Interrupts IRQ device
+ *
+ * This acts as an extention of an interrupt controller that exists elsewhere
+ * (typically in userspace/QEMU).  Because this PIC is a pseudo device that
+ * is downstream from a real emulated PIC, the "IRQ-to-vector" mapping has
+ * already occurred.  Therefore, this PIC has the following unusual properties:
+ *
+ * 1) It has 256 "pins" which are literal vectors (i.e. no translation)
+ * 2) It only supports "auto-EOI" behavior since it is expected that the
+ *    upstream emulated PIC will handle the real EOIs (if applicable)
+ * 3) It only listens to "asserts" on the pins (deasserts are dropped)
+ *    because it is an auto-EOI device anyway.
+ *
+ * Copyright (C) 2007 Novell
+ *
+ * bitarray code based on original vcpu->irq_pending code,
+ *     Copyright (C) 2007 Qumranet
+ *
+ * Authors:
+ *   Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include "kvm.h"
+
+/*
+ *----------------------------------------------------------------------
+ * optimized bitarray object - works like bitarrays in bitops, but uses
+ * a summary field to accelerate lookups.  Assumes external locking
+ *---------------------------------------------------------------------
+ */
+
+struct bitarray {
+	unsigned long summary; /* 1 per word in pending */
+	unsigned long pending[NR_IRQ_WORDS];
+};
+
+static inline int bitarray_pending(struct bitarray *this)
+{
+	return this->summary ? 1 : 0;
+}
+
+static inline int bitarray_findhighest(struct bitarray *this)
+{
+	if (!this->summary)
+		return -1;
+	else {
+		int word_index = __fls(this->summary);
+		int bit_index  = __fls(this->pending[word_index]);
+
+		return word_index * BITS_PER_LONG + bit_index;
+	}
+}
+
+static inline void bitarray_set(struct bitarray *this, int nr)
+{
+	__set_bit(nr, this->pending);
+	__set_bit(nr / BITS_PER_LONG, &this->summary);
+}
+
+static inline void bitarray_clear(struct bitarray *this, int nr)
+{
+	int word = nr / BITS_PER_LONG;
+
+	__clear_bit(nr, this->pending);
+	if (!this->pending[word])
+		__clear_bit(word, &this->summary);
+}
+
+static inline int bitarray_test(struct bitarray *this, int nr)
+{
+	return test_bit(nr, this->pending);
+}
+
+static inline int bitarray_test_and_set(struct bitarray *this, int nr, int val)
+{
+	if (bitarray_test(this, nr) != val) {
+		if (val)
+			bitarray_set(this, nr);
+		else
+			bitarray_clear(this, nr);
+		return 1;
+	}
+
+	return 0;
+}
+
+/*
+ *----------------------------------------------------------------------
+ * userint interface - provides the actual kvm_irqdevice implementation
+ *---------------------------------------------------------------------
+ */
+
+struct kvm_user_irqdev {
+	spinlock_t      lock;
+	atomic_t        ref_count;
+	struct bitarray pending;
+};
+
+static int user_irqdev_ack(struct kvm_irqdevice *this, int *vector)
+{
+	struct kvm_user_irqdev *s = (struct kvm_user_irqdev*)this->private;
+	int          irq;
+	int          ret = 0;
+
+	spin_lock(&s->lock);
+
+	if (vector) {
+		irq = bitarray_findhighest(&s->pending);
+
+		if (irq > -1) {
+			/*
+			 * Automatically clear the interrupt as the EOI
+			 * mechanism (if any) will take place in userspace
+			 */
+			bitarray_clear(&s->pending, irq);
+
+			ret |= KVM_IRQACK_VALID;
+		}
+
+		*vector = irq;
+	}
+
+	if (bitarray_pending(&s->pending))
+		ret |= KVM_IRQACK_AGAIN;
+
+	spin_unlock(&s->lock);
+
+	return ret;
+}
+
+static int user_irqdev_set_pin(struct kvm_irqdevice *this, int irq, int level)
+{
+	struct kvm_user_irqdev *s = (struct kvm_user_irqdev*)this->private;
+	int forward = 0;
+
+	spin_lock(&s->lock);
+	forward = bitarray_test_and_set(&s->pending, irq, level);
+	spin_unlock(&s->lock);
+
+	/*
+	 * alert the higher layer software we have changes
+	 */
+	if (forward)
+		kvm_irqdevice_set_intr(this, kvm_irqpin_localint);
+
+	return 0;
+}
+
+static void user_irqdev_destructor(struct kvm_irqdevice *this)
+{
+	struct kvm_user_irqdev *s = (struct kvm_user_irqdev*)this->private;
+
+	if (atomic_dec_and_test(&s->ref_count))
+		kfree(s);
+}
+
+int kvm_user_irqdev_init(struct kvm_irqdevice *irqdev)
+{
+	struct kvm_user_irqdev *s;
+
+	s = kzalloc(sizeof(*s), GFP_KERNEL);
+	if (!s)
+		return -ENOMEM;
+
+	spin_lock_init(&s->lock);
+
+	irqdev->ack         = user_irqdev_ack;
+	irqdev->set_pin     = user_irqdev_set_pin;
+	irqdev->destructor  = user_irqdev_destructor;
+
+	irqdev->private = s;
+	atomic_inc(&s->ref_count);
+
+	return 0;
+}
+
+int kvm_user_irqdev_save(struct kvm_irqdevice *this, void *data)
+{
+	struct kvm_user_irqdev *s = (struct kvm_user_irqdev*)this->private;
+
+	spin_lock(&s->lock);
+	memcpy(data, s->pending.pending, sizeof s->pending.pending);
+	spin_unlock(&s->lock);
+
+	return 0;
+}
+
+int kvm_user_irqdev_restore(struct kvm_irqdevice *this, void *data)
+{
+	struct kvm_user_irqdev *s = (struct kvm_user_irqdev*)this->private;
+	int i;
+	int forward = 0;
+
+	spin_lock(&s->lock);
+
+	/*
+	 * walk the saved interrupt bitmap and update each pending bit to
+	 * match the restored state
+	 */
+	for (i = 0; i < 256; ++i) {
+		int val = test_bit(i, data);
+		forward |= bitarray_test_and_set(&s->pending, i, val);
+	}
+
+	spin_unlock(&s->lock);
+
+	/*
+	 * alert the higher layer software we have changes
+	 */
+	if (forward)
+		kvm_irqdevice_set_intr(this, kvm_irqpin_localint);
+
+	return 0;
+}
+
+int kvm_userint_init(struct kvm_vcpu *vcpu)
+{
+	return kvm_user_irqdev_init(&vcpu->irq.dev);
+}
+
diff --git a/drivers/kvm/vmx.c b/drivers/kvm/vmx.c
index 19edb34..5665286 100644
--- a/drivers/kvm/vmx.c
+++ b/drivers/kvm/vmx.c
@@ -1301,45 +1301,71 @@ static void inject_rmode_irq(struct kvm_vcpu *vcpu, int irq)
 	vmcs_writel(GUEST_RSP, (vmcs_readl(GUEST_RSP) & ~0xffff) | (sp - 6));
 }
 
-static void kvm_do_inject_irq(struct kvm_vcpu *vcpu)
-{
-	int word_index = __ffs(vcpu->irq_summary);
-	int bit_index = __ffs(vcpu->irq_pending[word_index]);
-	int irq = word_index * BITS_PER_LONG + bit_index;
-
-	clear_bit(bit_index, &vcpu->irq_pending[word_index]);
-	if (!vcpu->irq_pending[word_index])
-		clear_bit(word_index, &vcpu->irq_summary);
-
-	if (vcpu->rmode.active) {
-		inject_rmode_irq(vcpu, irq);
-		return;
-	}
-	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
-			irq | INTR_TYPE_EXT_INTR | INTR_INFO_VALID_MASK);
-}
-
-
-static void do_interrupt_requests(struct kvm_vcpu *vcpu,
-				       struct kvm_run *kvm_run)
+static int do_intr_requests(struct kvm_vcpu *vcpu,
+			    struct kvm_run *kvm_run,
+			    kvm_irqpin_t pin)
 {
 	u32 cpu_based_vm_exec_control;
+	int r = 0;
+	int handled = 0;
 
 	vcpu->interrupt_window_open =
 		((vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
 		 (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & 3) == 0);
 
 	if (vcpu->interrupt_window_open &&
-	    vcpu->irq_summary &&
-	    !(vmcs_read32(VM_ENTRY_INTR_INFO_FIELD) & INTR_INFO_VALID_MASK))
+	    !(vmcs_read32(VM_ENTRY_INTR_INFO_FIELD) & INTR_INFO_VALID_MASK)) {
+		int irq;
+
 		/*
-		 * If interrupts enabled, and not blocked by sti or mov ss. Good.
+		 * If interrupts enabled, and not blocked by sti or mov ss.
+		 * Good.
 		 */
-		kvm_do_inject_irq(vcpu);
+
+		switch (pin) {
+		case kvm_irqpin_localint:
+			r = kvm_vcpu_irq_pop(vcpu, &irq);
+			break;
+		case kvm_irqpin_extint:
+			printk(KERN_WARNING
+			       "KVM: external interrupts not handled yet\n");
+			__clear_bit(pin, &vcpu->irq.pending);
+			break;
+		case kvm_irqpin_nmi:
+			/*
+			 * FIXME: Someday we will handle this using the
+			 * specific VMX NMI features.  For now, just inject
+			 * the NMI as a standard interrupt on vector 2
+			 */
+			r |= KVM_IRQACK_VALID;
+			__clear_bit(pin, &vcpu->irq.pending);
+			irq = 2;
+			break;
+		default:
+			panic("KVM: unknown interrupt pin raised: %d\n", pin);
+			break;
+		}
+
+		BUG_ON(r < 0);
+
+		if (r & KVM_IRQACK_VALID) {
+			if (vcpu->rmode.active)
+				inject_rmode_irq(vcpu, irq);
+			else
+				vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
+					     irq |
+					     INTR_TYPE_EXT_INTR |
+					     INTR_INFO_VALID_MASK);
+
+			handled = 1;
+		}
+	}
 
 	cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
 	if (!vcpu->interrupt_window_open &&
-	    (vcpu->irq_summary || kvm_run->request_interrupt_window))
+	    ((r & KVM_IRQACK_AGAIN) ||
+	     __kvm_vcpu_irq_pending(vcpu) ||
+	     kvm_run->request_interrupt_window))
 		/*
 		 * Interrupts blocked.  Wait for unblock.
 		 */
@@ -1347,6 +1373,41 @@ static void do_interrupt_requests(struct kvm_vcpu *vcpu,
 	else
 		cpu_based_vm_exec_control &= ~CPU_BASED_VIRTUAL_INTR_PENDING;
 	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control);
+
+	return handled;
+}
+
+static void do_interrupt_requests(struct kvm_vcpu *vcpu,
+				  struct kvm_run *kvm_run)
+{
+	int pending = __kvm_vcpu_irq_all_pending(vcpu);
+	int handled = 0;
+
+	while (pending && !handled) {
+		kvm_irqpin_t pin = __fls(pending);
+
+		switch (pin) {
+		case kvm_irqpin_localint:
+		case kvm_irqpin_extint:
+		case kvm_irqpin_nmi:
+			handled = do_intr_requests(vcpu, kvm_run, pin);
+			break;
+		case kvm_irqpin_smi:
+			/* ignored (for now) */
+			printk(KERN_WARNING
+			       "KVM: dropping unhandled SMI\n");
+			__clear_bit(pin, &vcpu->irq.pending);
+			break;
+		case kvm_irqpin_invalid:
+			/* drop */
+			break;
+		default:
+			panic("KVM: unknown interrupt pin raised: %d\n", pin);
+			break;
+		}
+
+		__clear_bit(pin, &pending);
+	}
 }
 
 static void kvm_guest_debug_pre(struct kvm_vcpu *vcpu)
@@ -1397,9 +1458,13 @@ static int handle_exception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 	}
 
 	if (is_external_interrupt(vect_info)) {
+		/*
+		 * An exception was taken while we were trying to inject an
+		 * IRQ.  We must defer the injection of the vector until
+		 * the next window.
+		 */
 		int irq = vect_info & VECTORING_INFO_VECTOR_MASK;
-		set_bit(irq, vcpu->irq_pending);
-		set_bit(irq / BITS_PER_LONG, &vcpu->irq_summary);
+		kvm_vcpu_irq_push(vcpu, irq);
 	}
 
 	if ((intr_info & INTR_INFO_INTR_TYPE_MASK) == 0x200) { /* nmi */
@@ -1719,8 +1784,9 @@ static void post_kvm_run_save(struct kvm_vcpu *vcpu,
 	kvm_run->if_flag = (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) != 0;
 	kvm_run->cr8 = vcpu->cr8;
 	kvm_run->apic_base = vcpu->apic_base;
-	kvm_run->ready_for_interrupt_injection = (vcpu->interrupt_window_open &&
-						  vcpu->irq_summary == 0);
+	kvm_run->ready_for_interrupt_injection =
+		(vcpu->interrupt_window_open &&
+		 !kvm_vcpu_irq_pending(vcpu));
 }
 
 static int handle_interrupt_window(struct kvm_vcpu *vcpu,
@@ -1731,7 +1797,7 @@ static int handle_interrupt_window(struct kvm_vcpu *vcpu,
 	 * possible
 	 */
 	if (kvm_run->request_interrupt_window &&
-	    !vcpu->irq_summary) {
+	    !kvm_vcpu_irq_pending(vcpu)) {
 		kvm_run->exit_reason = KVM_EXIT_IRQ_WINDOW_OPEN;
 		++vcpu->stat.irq_window_exits;
 		return 0;
@@ -1742,7 +1808,7 @@ static int handle_interrupt_window(struct kvm_vcpu *vcpu,
 static int handle_halt(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 {
 	skip_emulated_instruction(vcpu);
-	if (vcpu->irq_summary)
+	if (kvm_vcpu_irq_pending(vcpu))
 		return 1;
 
 	kvm_run->exit_reason = KVM_EXIT_HLT;
@@ -1812,7 +1878,7 @@ static int kvm_handle_exit(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
 static int dm_request_for_irq_injection(struct kvm_vcpu *vcpu,
 					  struct kvm_run *kvm_run)
 {
-	return (!vcpu->irq_summary &&
+	return (!kvm_vcpu_irq_pending(vcpu) &&
 		kvm_run->request_interrupt_window &&
 		vcpu->interrupt_window_open &&
 		(vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF));
@@ -1855,11 +1921,19 @@ preempted:
 	vmcs_writel(HOST_GS_BASE, segment_base(gs_sel));
 #endif
 
+	if (vcpu->guest_debug.enabled)
+		kvm_guest_debug_pre(vcpu);
+
+	spin_lock(&vcpu->irq.lock);
+
+	/*
+	 * We must inject interrupts (if any) while the irq.lock
+	 * is held
+	 */
 	if (!vcpu->mmio_read_completed)
 		do_interrupt_requests(vcpu, kvm_run);
 
-	if (vcpu->guest_debug.enabled)
-		kvm_guest_debug_pre(vcpu);
+	spin_unlock(&vcpu->irq.lock);
 
 	if (vcpu->fpu_active) {
 		fx_save(vcpu->host_fx_image);


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 3/4] KVM: Adds ability to preempt an executing VCPU
       [not found] ` <20070502212713.16738.8133.stgit-sLgBBP33vUGnsjUZhwzVf9HuzzzSOjJt@public.gmane.org>
  2007-05-02 21:43   ` [PATCH 1/4] KVM: Adds support for in-kernel mmio handlers Gregory Haskins
  2007-05-02 21:43   ` [PATCH 2/4] KVM: Add irqdevice object Gregory Haskins
@ 2007-05-02 21:43   ` Gregory Haskins
       [not found]     ` <20070502214325.16738.42702.stgit-sLgBBP33vUGnsjUZhwzVf9HuzzzSOjJt@public.gmane.org>
  2007-05-02 21:43   ` [PATCH 4/4] KVM: Add support for in-kernel LAPIC model Gregory Haskins
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 28+ messages in thread
From: Gregory Haskins @ 2007-05-02 21:43 UTC (permalink / raw)
  To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

The VCPU executes synchronously w.r.t. userspace today, and therefore
interrupt injection is pretty straightforward.  However, we will soon need
to be able to inject interrupts asynchronous to the execution of the VCPU
due to the introduction of SMP, paravirtualized drivers, and asynchronous
hypercalls.  This patch adds support to the interrupt mechanism to force
a VCPU to VMEXIT when a new interrupt is pending.

Signed-off-by: Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
---

 drivers/kvm/kvm.h      |    5 +++
 drivers/kvm/kvm_main.c |   74 +++++++++++++++++++++++++++++++++++++++++++++++-
 drivers/kvm/svm.c      |   43 ++++++++++++++++++++++++++++
 drivers/kvm/vmx.c      |   43 ++++++++++++++++++++++++++++
 4 files changed, 164 insertions(+), 1 deletions(-)

diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
index d41d653..15c8bec 100644
--- a/drivers/kvm/kvm.h
+++ b/drivers/kvm/kvm.h
@@ -321,6 +321,8 @@ void kvm_io_bus_register_dev(struct kvm_io_bus *bus,
 
 #define NR_IRQ_WORDS KVM_IRQ_BITMAP_SIZE(unsigned long)
 
+#define KVM_SIGNAL_VIRTUAL_INTERRUPT 33 /* Hardcoded for now */
+
 /*
  * structure for maintaining info for interrupting an executing VCPU
  */
@@ -329,6 +331,9 @@ struct kvm_vcpu_irq {
 	struct kvm_irqdevice dev;
 	int                  pending;
 	int                  deferred;
+	struct task_struct  *task;
+	int                  signo;
+	int                  guest_mode;
 };
 
 struct kvm_vcpu {
diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index 9aeb2f7..6acbd9b 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -304,6 +304,10 @@ static struct kvm *kvm_create_vm(void)
 		memset(&vcpu->irq, 0, sizeof(vcpu->irq));
 		spin_lock_init(&vcpu->irq.lock);
 		vcpu->irq.deferred = -1;
+		/*
+		 * This should be settable by userspace someday
+		 */
+		vcpu->irq.signo = KVM_SIGNAL_VIRTUAL_INTERRUPT;
 
 		vcpu->cpu = -1;
 		vcpu->kvm = kvm;
@@ -366,13 +370,20 @@ static void free_pio_guest_pages(struct kvm_vcpu *vcpu)
 
 static void kvm_free_vcpu(struct kvm_vcpu *vcpu)
 {
+	unsigned long irqsave;
+
 	if (!vcpu->vmcs)
 		return;
 
 	vcpu_load(vcpu);
 	kvm_mmu_destroy(vcpu);
 	vcpu_put(vcpu);
+
+	spin_lock_irqsave(&vcpu->irq.lock, irqsave);
+	vcpu->irq.task = NULL;
+	spin_unlock_irqrestore(&vcpu->irq.lock, irqsave);
 	kvm_irqdevice_destructor(&vcpu->irq.dev);
+
 	kvm_arch_ops->vcpu_free(vcpu);
 	free_page((unsigned long)vcpu->run);
 	vcpu->run = NULL;
@@ -1831,6 +1842,7 @@ static int kvm_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 {
 	int r;
 	sigset_t sigsaved;
+	unsigned long irqsaved;
 
 	vcpu_load(vcpu);
 
@@ -1868,6 +1880,10 @@ static int kvm_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 		kvm_arch_ops->decache_regs(vcpu);
 	}
 
+	spin_lock_irqsave(&vcpu->irq.lock, irqsaved);
+	vcpu->irq.task = current;
+	spin_unlock_irqrestore(&vcpu->irq.lock, irqsaved);
+
 	r = kvm_arch_ops->run(vcpu, kvm_run);
 
 out:
@@ -2309,6 +2325,20 @@ out1:
 }
 
 /*
+ * This function is invoked whenever we want to interrupt a vcpu that is
+ * currently executing in guest-mode.  It currently is a no-op because
+ * the simple delivery of the IPI to execute this function accomplishes our
+ * goal: To cause a VMEXIT.  We pass the vcpu (which contains the
+ * vcpu->irq.task, etc) for future use
+ */
+static void kvm_vcpu_guest_intr(void *info)
+{
+#ifdef NOT_YET
+	struct kvm_vcpu *vcpu = (struct kvm_vcpu*)info;
+#endif
+}
+
+/*
  * This function will be invoked whenever the vcpu->irq.dev raises its INTR
  * line
  */
@@ -2318,10 +2348,52 @@ static void kvm_vcpu_intr(struct kvm_irqsink *this,
 {
 	struct kvm_vcpu *vcpu = (struct kvm_vcpu*)this->private;
 	unsigned long flags;
+	int direct_ipi = -1;
 
 	spin_lock_irqsave(&vcpu->irq.lock, flags);
-	__set_bit(pin, &vcpu->irq.pending);
+
+	if (!test_bit(pin, &vcpu->irq.pending)) {
+		/*
+		 * Record the change..
+		 */
+		__set_bit(pin, &vcpu->irq.pending);
+
+		/*
+		 * then wake up the vcpu (if necessary)
+		 */
+		if (vcpu->irq.task && (vcpu->irq.task != current)) {
+			if (vcpu->irq.guest_mode) {
+				/*
+				 * If we are in guest mode, we can optimize
+				 * the IPI by executing a function directly
+				 * on the owning processor.
+				 */
+				direct_ipi = task_cpu(vcpu->irq.task);
+				BUG_ON(direct_ipi == smp_processor_id());
+			} else
+				/*
+				 * otherwise, we must assume that we could be
+				 * blocked anywhere, including userspace. Send
+				 * a signal to give everyone a chance to get
+				 * notification
+				 */
+				send_sig(vcpu->irq.signo, vcpu->irq.task, 0);
+		}
+	}
+
 	spin_unlock_irqrestore(&vcpu->irq.lock, flags);
+
+	if (direct_ipi != -1) {
+		/*
+		 * Not sure if disabling preemption is needed.
+		 * The kick_process() code does this so I copied it
+		 */
+		preempt_disable();
+		smp_call_function_single(direct_ipi,
+					 kvm_vcpu_guest_intr,
+					 vcpu, 0, 0);
+		preempt_enable();
+	}
 }
 
 static void kvm_vcpu_irqsink_init(struct kvm_vcpu *vcpu)
diff --git a/drivers/kvm/svm.c b/drivers/kvm/svm.c
index 39cc596..a73232b 100644
--- a/drivers/kvm/svm.c
+++ b/drivers/kvm/svm.c
@@ -1528,11 +1528,40 @@ static int svm_vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 	u16 gs_selector;
 	u16 ldt_selector;
 	int r;
+	unsigned long irq_flags;
 
 again:
+	/*
+	 * We disable interrupts until the next VMEXIT to eliminate a race
+	 * condition for delivery of virtual interrupts.  Note that this is
+	 * probably not as bad as it sounds, as interrupts will still invoke
+	 * a VMEXIT once transitioned to GUEST mode (and thus exit this lock
+	 * scope) even if they are disabled.
+	 *
+	 * FIXME: Do we need to do anything additional to mask IPI/NMIs?
+	 */
+	local_irq_save(irq_flags);
+
 	spin_lock(&vcpu->irq.lock);
 
 	/*
+	 * If there are any signals pending (virtual interrupt related or
+	 * otherwise), don't even bother trying to enter guest mode...
+	 */
+	if (signal_pending(current)) {
+		kvm_run->exit_reason = KVM_EXIT_INTR;
+		spin_unlock(&vcpu->irq.lock);
+		local_irq_restore(irq_flags);
+		return -EINTR;
+	}
+
+	/*
+	 * There are optimizations we can make when signaling interrupts
+	 * if we know the VCPU is in GUEST mode, so mark that here
+	 */
+	vcpu->irq.guest_mode = 1;
+
+	/*
 	 * We must inject interrupts (if any) while vcpu->irq.lock
 	 * is held
 	 */
@@ -1674,6 +1703,13 @@ again:
 #endif
 		: "cc", "memory" );
 
+	/*
+	 * FIXME: We'd like to turn on interrupts ASAP, but is this so early
+	 * that we will mess up the state of the CPU before we fully
+	 * transition from guest to host?
+	 */
+	local_irq_restore(irq_flags);
+
 	if (vcpu->fpu_active) {
 		fx_save(vcpu->guest_fx_image);
 		fx_restore(vcpu->host_fx_image);
@@ -1696,6 +1732,13 @@ again:
 	reload_tss(vcpu);
 
 	/*
+	 * Signal that we have transitioned back to host mode
+	 */
+	spin_lock_irqsave(&vcpu->irq.lock, irq_flags);
+	vcpu->irq.guest_mode = 0;
+	spin_unlock_irqrestore(&vcpu->irq.lock, irq_flags);
+
+	/*
 	 * Profile KVM exit RIPs:
 	 */
 	if (unlikely(prof_on == KVM_PROFILING))
diff --git a/drivers/kvm/vmx.c b/drivers/kvm/vmx.c
index 5665286..1aac001 100644
--- a/drivers/kvm/vmx.c
+++ b/drivers/kvm/vmx.c
@@ -1890,6 +1890,7 @@ static int vmx_vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 	u16 fs_sel, gs_sel, ldt_sel;
 	int fs_gs_ldt_reload_needed;
 	int r;
+	unsigned long irq_flags;
 
 preempted:
 	/*
@@ -1924,9 +1925,37 @@ preempted:
 	if (vcpu->guest_debug.enabled)
 		kvm_guest_debug_pre(vcpu);
 
+	/*
+	 * We disable interrupts until the next VMEXIT to eliminate a race
+	 * condition for delivery of virtual interrupts.  Note that this is
+	 * probably not as bad as it sounds, as interrupts will still invoke
+	 * a VMEXIT once transitioned to GUEST mode (and thus exit this lock
+	 * scope) even if they are disabled.
+	 *
+	 * FIXME: Do we need to do anything additional to mask IPI/NMIs?
+	 */
+	local_irq_save(irq_flags);
+
 	spin_lock(&vcpu->irq.lock);
 
 	/*
+	 * If there are any signals pending (virtual interrupt related or
+	 * otherwise), don't even bother trying to enter guest mode...
+	 */
+	if (signal_pending(current)) {
+		kvm_run->exit_reason = KVM_EXIT_INTR;
+		spin_unlock(&vcpu->irq.lock);
+		local_irq_restore(irq_flags);
+		return -EINTR;
+	}
+
+	/*
+	 * There are optimizations we can make when signaling interrupts
+	 * if we know the VCPU is in GUEST mode, so mark that here
+	 */
+	vcpu->irq.guest_mode = 1;
+
+	/*
 	 * We must inject interrupts (if any) while the irq.lock
 	 * is held
 	 */
@@ -2067,12 +2096,26 @@ again:
 		[cr2]"i"(offsetof(struct kvm_vcpu, cr2))
 	      : "cc", "memory" );
 
+	/*
+	 * FIXME: We'd like to turn on interrupts ASAP, but is this so early
+	 * that we will mess up the state of the CPU before we fully
+	 * transition from guest to host?
+	 */
+	local_irq_restore(irq_flags);
+
 	++vcpu->stat.exits;
 
 	vcpu->interrupt_window_open = (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & 3) == 0;
 
 	asm ("mov %0, %%ds; mov %0, %%es" : : "r"(__USER_DS));
 
+	/*
+	 * Signal that we have transitioned back to host mode
+	 */
+	spin_lock_irqsave(&vcpu->irq.lock, irq_flags);
+	vcpu->irq.guest_mode = 0;
+	spin_unlock_irqrestore(&vcpu->irq.lock, irq_flags);
+
 	if (unlikely(fail)) {
 		kvm_run->exit_reason = KVM_EXIT_FAIL_ENTRY;
 		kvm_run->fail_entry.hardware_entry_failure_reason



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 4/4] KVM: Add support for in-kernel LAPIC model
       [not found] ` <20070502212713.16738.8133.stgit-sLgBBP33vUGnsjUZhwzVf9HuzzzSOjJt@public.gmane.org>
                     ` (2 preceding siblings ...)
  2007-05-02 21:43   ` [PATCH 3/4] KVM: Adds ability to preempt an executing VCPU Gregory Haskins
@ 2007-05-02 21:43   ` Gregory Haskins
       [not found]     ` <20070502214330.16738.51436.stgit-sLgBBP33vUGnsjUZhwzVf9HuzzzSOjJt@public.gmane.org>
  2007-05-03 18:57   ` [PATCH 0/4] Kernel side patches for in-kernel APIC Nakajima, Jun
  2007-05-06  7:45   ` Avi Kivity
  5 siblings, 1 reply; 28+ messages in thread
From: Gregory Haskins @ 2007-05-02 21:43 UTC (permalink / raw)
  To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Signed-off-by: Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
---

 drivers/kvm/Makefile   |    2 
 drivers/kvm/kernint.c  |  154 +++++
 drivers/kvm/kvm.h      |   35 +
 drivers/kvm/kvm_main.c |  175 +++++
 drivers/kvm/lapic.c    | 1582 ++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/kvm/svm.c      |   22 -
 drivers/kvm/userint.c  |    8 
 drivers/kvm/vmx.c      |   16 
 8 files changed, 1954 insertions(+), 40 deletions(-)

diff --git a/drivers/kvm/Makefile b/drivers/kvm/Makefile
index 540afbc..1aad737 100644
--- a/drivers/kvm/Makefile
+++ b/drivers/kvm/Makefile
@@ -2,7 +2,7 @@
 # Makefile for Kernel-based Virtual Machine module
 #
 
-kvm-objs := kvm_main.o mmu.o x86_emulate.o userint.o
+kvm-objs := kvm_main.o mmu.o x86_emulate.o userint.o lapic.o kernint.o
 obj-$(CONFIG_KVM) += kvm.o
 kvm-intel-objs = vmx.o
 obj-$(CONFIG_KVM_INTEL) += kvm-intel.o
diff --git a/drivers/kvm/kernint.c b/drivers/kvm/kernint.c
new file mode 100644
index 0000000..5b2d844
--- /dev/null
+++ b/drivers/kvm/kernint.c
@@ -0,0 +1,154 @@
+/*
+ * Kernel Interrupt IRQ device
+ *
+ * Provides a model for connecting in-kernel interrupt resources to a VCPU.
+ *
+ * A typical modern x86 processor has the concept of an internal Local-APIC
+ * and some external signal pins.  The way in which interrupts are injected is
+ * dependent on whether software enables the LAPIC or not.  When enabled,
+ * interrupts are acknowledged through the LAPIC.  Otherwise they are
+ * acknowledged through an externally connected PIC (typically an i8259
+ * on the BSP).
+ *
+ * Copyright (C) 2007 Novell
+ *
+ * Authors:
+ *   Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include "kvm.h"
+
+struct kvm_kernint {
+	struct kvm_vcpu              *vcpu;
+	struct kvm_irqdevice         *self_irq;
+	struct kvm_irqdevice         *ext_irq;
+	struct kvm_irqdevice          apic_irq;
+};
+
+static struct kvm_irqdevice *get_irq_dev(struct kvm_kernint *s)
+{
+	struct kvm_irqdevice *dev;
+
+	if (kvm_lapic_enabled(s->vcpu))
+		dev = &s->apic_irq;
+	else
+		dev = s->ext_irq;
+
+	if (!dev)
+		kvm_crash_guest(s->vcpu->kvm);
+
+	return dev;
+}
+
+static int kernint_irqdev_ack(struct kvm_irqdevice *this, int *vector)
+{
+	struct kvm_kernint *s = (struct kvm_kernint*)this->private;
+
+	return kvm_irqdevice_ack(get_irq_dev(s), vector);
+}
+
+static int kernint_irqdev_set_pin(struct kvm_irqdevice *this,
+				  int irq, int level)
+{
+	/* no-op */
+	return 0;
+}
+
+static void kernint_irqdev_destructor(struct kvm_irqdevice *this)
+{
+	struct kvm_kernint *s = (struct kvm_kernint*)this->private;
+
+	kvm_irqdevice_destructor(&s->apic_irq);
+	kvm_lapic_destroy(s->vcpu);
+	kfree(s);
+}
+
+static void kvm_apic_intr(struct kvm_irqsink *this,
+			  struct kvm_irqdevice *dev,
+			  kvm_irqpin_t pin)
+{
+	struct kvm_kernint *s = (struct kvm_kernint*)this->private;
+
+	/*
+	 * If the LAPIC sent us an interrupt it *must* be enabled,
+	 * just forward it on to the CPU
+	 */
+	kvm_irqdevice_set_intr(s->self_irq, pin);
+}
+
+static void kvm_ext_intr(struct kvm_irqsink *this,
+			 struct kvm_irqdevice *dev,
+			 kvm_irqpin_t pin)
+{
+	struct kvm_kernint *s = (struct kvm_kernint*)this->private;
+
+	/*
+	 * If the EXTINT device sent us an interrupt, forward it to the LINT0
+	 * pin of the LAPIC
+	 */
+	if (pin != kvm_irqpin_localint)
+		return;
+
+	/*
+	 * "irq 0" = LINT0, 1 = LINT1
+	 */
+	kvm_irqdevice_set_pin(&s->apic_irq, 0, 1);
+}
+
+int kvm_kernint_init(struct kvm_vcpu *vcpu)
+{
+	struct kvm_irqdevice *irqdev = &vcpu->irq.dev;
+	struct kvm_kernint *s;
+
+	s = kzalloc(sizeof(*s), GFP_KERNEL);
+	if (!s)
+		return -ENOMEM;
+
+	s->vcpu = vcpu;
+
+	/*
+	 * Configure the irqdevice interface
+	 */
+	irqdev->ack         = kernint_irqdev_ack;
+	irqdev->set_pin     = kernint_irqdev_set_pin;
+	irqdev->destructor  = kernint_irqdev_destructor;
+
+	irqdev->private = s;
+	s->self_irq = irqdev;
+
+	/*
+	 * Configure the EXTINT device if this is the BSP processor
+	 */
+	if (!vcpu_slot(vcpu)) {
+		struct kvm_irqsink sink = {
+			.set_intr   = kvm_ext_intr,
+			.private    = s
+		};
+
+		s->ext_irq = &vcpu->kvm->isa_irq;
+		kvm_irqdevice_register_sink(s->ext_irq, &sink);
+	}
+
+	/*
+	 * Configure the LAPIC device
+	 */
+	kvm_irqdevice_init(&s->apic_irq);
+
+	{
+		struct kvm_irqsink sink = {
+			.set_intr   = kvm_apic_intr,
+			.private    = s
+		};
+
+		kvm_irqdevice_register_sink(&s->apic_irq, &sink);
+	}
+
+	kvm_lapic_init(vcpu, &s->apic_irq, 0);
+
+	return 0;
+}
+
diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
index 15c8bec..24aad0f 100644
--- a/drivers/kvm/kvm.h
+++ b/drivers/kvm/kvm.h
@@ -163,6 +163,21 @@ int kvm_user_irqdev_init(struct kvm_irqdevice *dev);
 int kvm_user_irqdev_save(struct kvm_irqdevice *this, void *data);
 int kvm_user_irqdev_restore(struct kvm_irqdevice *this, void *data);
 int kvm_userint_init(struct kvm_vcpu *vcpu);
+int kvm_kernint_init(struct kvm_vcpu *vcpu);
+
+#define KVM_LAPIC_OPTION_USERMODE (1 << 0)
+
+int kvm_lapic_init(struct kvm_vcpu *vcpu, struct kvm_irqdevice *dev,
+		   int flags);
+void kvm_lapic_destroy(struct kvm_vcpu *vcpu);
+void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, u64 cr8);
+u64  kvm_lapic_get_tpr(struct kvm_vcpu *vcpu);
+void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 base);
+u64  kvm_lapic_get_base(struct kvm_vcpu *vcpu);
+void kvm_lapic_save(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs);
+void kvm_lapic_restore(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs);
+void kvm_lapic_reset(struct kvm_vcpu *vcpu);
+int  kvm_lapic_enabled(struct kvm_vcpu *vcpu);
 
 /*
  * x86 supports 3 paging modes (4-level 64-bit, 3-level 64-bit, and 2-level
@@ -336,6 +351,11 @@ struct kvm_vcpu_irq {
 	int                  guest_mode;
 };
 
+struct kvm_lapic {
+	void                 *dev;
+	struct kvm_io_device *mmio;
+};
+
 struct kvm_vcpu {
 	struct kvm *kvm;
 	union {
@@ -349,6 +369,7 @@ struct kvm_vcpu {
 	struct kvm_run *run;
 	int interrupt_window_open;
 	struct kvm_vcpu_irq irq;
+	struct kvm_lapic apic;
 	unsigned long regs[NR_VCPU_REGS]; /* for rsp: vcpu_load_rsp_rip() */
 	unsigned long rip;      /* needs vcpu_load_rsp_rip() */
 
@@ -359,10 +380,8 @@ struct kvm_vcpu {
 	struct page *para_state_page;
 	gpa_t hypercall_gpa;
 	unsigned long cr4;
-	unsigned long cr8;
 	u64 pdptrs[4]; /* pae */
 	u64 shadow_efer;
-	u64 apic_base;
 	u64 ia32_misc_enable_msr;
 	int nmsrs;
 	struct vmx_msr_entry *guest_msrs;
@@ -534,6 +553,8 @@ struct kvm {
 	struct list_head vm_list;
 	struct file *filp;
 	struct kvm_io_bus mmio_bus;
+	int enable_kernel_pic;
+	struct kvm_irqdevice isa_irq;
 };
 
 struct descriptor_table {
@@ -608,6 +629,9 @@ void kvm_exit_arch(void);
 int kvm_mmu_module_init(void);
 void kvm_mmu_module_exit(void);
 
+int kvm_apicbus_send(struct kvm *kvm, int dest, int trig_mode, int level,
+		     int dest_mode, int delivery_mode, int vector);
+
 void kvm_mmu_destroy(struct kvm_vcpu *vcpu);
 int kvm_mmu_create(struct kvm_vcpu *vcpu);
 int kvm_mmu_setup(struct kvm_vcpu *vcpu);
@@ -739,6 +763,13 @@ static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
 	return (struct kvm_mmu_page *)page_private(page);
 }
 
+static inline int vcpu_slot(struct kvm_vcpu *vcpu)
+{
+	return vcpu - vcpu->kvm->vcpus;
+}
+
+void kvm_crash_guest(struct kvm *kvm);
+
 static inline u16 read_fs(void)
 {
 	u16 seg;
diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index 6acbd9b..57f77c4 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -296,6 +296,7 @@ static struct kvm *kvm_create_vm(void)
 	spin_lock_init(&kvm->lock);
 	INIT_LIST_HEAD(&kvm->active_mmu_pages);
 	kvm_io_bus_init(&kvm->mmio_bus);
+	kvm_irqdevice_init(&kvm->isa_irq);
 	for (i = 0; i < KVM_MAX_VCPUS; ++i) {
 		struct kvm_vcpu *vcpu = &kvm->vcpus[i];
 
@@ -400,6 +401,23 @@ static void kvm_free_vcpus(struct kvm *kvm)
 		kvm_free_vcpu(&kvm->vcpus[i]);
 }
 
+/*
+ * This function kills a guest while a userspace process still holds
+ * a descriptor to it
+ */
+void kvm_crash_guest(struct kvm *kvm)
+{
+	unsigned int i;
+
+	for (i = 0; i < KVM_MAX_VCPUS; ++i) {
+		/*
+		 * FIXME: in the future it should send an IPI to gracefully
+		 * stop the other vCPUs
+		 */
+		kvm_free_vcpu(&kvm->vcpus[i]);
+	}
+}
+
 static int kvm_dev_release(struct inode *inode, struct file *filp)
 {
 	return 0;
@@ -411,6 +429,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	list_del(&kvm->vm_list);
 	spin_unlock(&kvm_lock);
 	kvm_io_bus_destroy(&kvm->mmio_bus);
+	kvm_irqdevice_destructor(&kvm->isa_irq);
 	kvm_free_vcpus(kvm);
 	kvm_free_physmem(kvm);
 	kfree(kvm);
@@ -616,7 +635,7 @@ void set_cr8(struct kvm_vcpu *vcpu, unsigned long cr8)
 		inject_gp(vcpu);
 		return;
 	}
-	vcpu->cr8 = cr8;
+	kvm_lapic_set_tpr(vcpu, cr8);
 }
 EXPORT_SYMBOL_GPL(set_cr8);
 
@@ -917,6 +936,66 @@ out:
 	return r;
 }
 
+static int kvm_vm_ioctl_enable_kernel_pic(struct kvm *kvm, __u32 val)
+{
+	/*
+	 * FIXME: We should not allow this if VCPUs have already been created
+	 */
+	if (kvm->enable_kernel_pic)
+		return -EINVAL;
+
+	/* Someday we may offer three levels of in-kernel PIC support:
+	 *
+	 *  level 0 = (default) compatibility mode (everything in userspace)
+	 *  level 1 = LAPIC in kernel, IOAPIC/i8259 in userspace
+	 *  level 2 = all three in kernel
+	 *
+	 * For now we only support levels 0 and 1; level 0 cannot be
+	 * set explicitly
+	 */
+	if (val != 1)
+		return -EINVAL;
+
+	kvm->enable_kernel_pic = val;
+
+	/*
+	 * Installing a user_irqdev model on the kvm->isa_irq device
+	 * creates a level-1 environment, where userspace completely
+	 * controls the ISA-domain interrupts in the IOAPIC/i8259.
+	 * Interrupts come down to the VCPU either as an ISA vector to
+	 * this controller, or as an APIC bus message (or both)
+	 */
+	kvm_user_irqdev_init(&kvm->isa_irq);
+
+	return 0;
+}
+
+static int kvm_vm_ioctl_isa_interrupt(struct kvm *kvm,
+				      struct kvm_interrupt *irq)
+{
+	if (irq->irq < 0 || irq->irq >= 256)
+		return -EINVAL;
+
+	if (!kvm->enable_kernel_pic)
+		return -EINVAL;
+
+	return kvm_irqdevice_set_pin(&kvm->isa_irq, irq->irq, 1);
+}
+
+static int kvm_vm_ioctl_apic_msg(struct kvm *kvm,
+				 struct kvm_apic_msg *msg)
+{
+	if (!kvm->enable_kernel_pic)
+		return -EINVAL;
+
+	/* position the delivery mode in bits 8-11, as the APIC bus expects */
+	msg->delivery_mode = (msg->delivery_mode << 8) & 0xF00;
+
+	kvm_apicbus_send(kvm, msg->dest, msg->trig_mode, 1, msg->dest_mode,
+			 msg->delivery_mode, msg->vector);
+
+	return 0;
+}
+
 static gfn_t unalias_gfn(struct kvm *kvm, gfn_t gfn)
 {
 	int i;
@@ -1037,10 +1116,16 @@ static int emulator_write_std(unsigned long addr,
 static struct kvm_io_device *vcpu_find_mmio_dev(struct kvm_vcpu *vcpu,
 						gpa_t addr)
 {
+	struct kvm_io_device *dev = vcpu->apic.mmio;
+
+	/*
+	 * First check if the LAPIC will snarf this request
+	 */
+	if (dev && dev->in_range(dev, addr))
+		return dev;
+
 	/*
-	 * Note that its important to have this wrapper function because
-	 * in the very near future we will be checking for MMIOs against
-	 * the LAPIC as well as the general MMIO bus
+	 * And then fall back to allow any device to participate
 	 */
 	return kvm_io_bus_find_dev(&vcpu->kvm->mmio_bus, addr);
 }
@@ -1506,7 +1591,7 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
 		data = 3;
 		break;
 	case MSR_IA32_APICBASE:
-		data = vcpu->apic_base;
+		data = kvm_lapic_get_base(vcpu);
 		break;
 	case MSR_IA32_MISC_ENABLE:
 		data = vcpu->ia32_misc_enable_msr;
@@ -1584,7 +1669,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 	case 0x200 ... 0x2ff: /* MTRRs */
 		break;
 	case MSR_IA32_APICBASE:
-		vcpu->apic_base = data;
+		kvm_lapic_set_base(vcpu, data);
 		break;
 	case MSR_IA32_MISC_ENABLE:
 		vcpu->ia32_misc_enable_msr = data;
@@ -1849,8 +1934,9 @@ static int kvm_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 	if (vcpu->sigset_active)
 		sigprocmask(SIG_SETMASK, &vcpu->sigset, &sigsaved);
 
-	/* re-sync apic's tpr */
-	vcpu->cr8 = kvm_run->cr8;
+	if (!vcpu->kvm->enable_kernel_pic)
+		/* re-sync apic's tpr if the APIC is in userspace */
+		kvm_lapic_set_tpr(vcpu, kvm_run->cr8);
 
 	if (vcpu->pio.cur_count) {
 		r = complete_pio(vcpu);
@@ -2003,11 +2089,12 @@ static int kvm_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu,
 	sregs->cr2 = vcpu->cr2;
 	sregs->cr3 = vcpu->cr3;
 	sregs->cr4 = vcpu->cr4;
-	sregs->cr8 = vcpu->cr8;
 	sregs->efer = vcpu->shadow_efer;
-	sregs->apic_base = vcpu->apic_base;
 
-	kvm_user_irqdev_save(&vcpu->irq.dev, &sregs->interrupt_bitmap);
+	kvm_lapic_save(vcpu, sregs);
+
+	if (!vcpu->kvm->enable_kernel_pic)
+		kvm_user_irqdev_save(&vcpu->irq.dev, &sregs->interrupt_bitmap);
 
 	vcpu_put(vcpu);
 
@@ -2039,14 +2126,10 @@ static int kvm_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
 	mmu_reset_needed |= vcpu->cr3 != sregs->cr3;
 	vcpu->cr3 = sregs->cr3;
 
-	vcpu->cr8 = sregs->cr8;
-
 	mmu_reset_needed |= vcpu->shadow_efer != sregs->efer;
 #ifdef CONFIG_X86_64
 	kvm_arch_ops->set_efer(vcpu, sregs->efer);
 #endif
-	vcpu->apic_base = sregs->apic_base;
-
 	kvm_arch_ops->decache_cr4_guest_bits(vcpu);
 
 	mmu_reset_needed |= vcpu->cr0 != sregs->cr0;
@@ -2060,8 +2143,11 @@ static int kvm_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
 	if (mmu_reset_needed)
 		kvm_mmu_reset_context(vcpu);
 
-	kvm_user_irqdev_restore(&vcpu->irq.dev,
-				&sregs->interrupt_bitmap[0]);
+	kvm_lapic_restore(vcpu, sregs);
+
+	if (!vcpu->kvm->enable_kernel_pic)
+		kvm_user_irqdev_restore(&vcpu->irq.dev,
+					&sregs->interrupt_bitmap[0]);
 
 	set_segment(vcpu, &sregs->cs, VCPU_SREG_CS);
 	set_segment(vcpu, &sregs->ds, VCPU_SREG_DS);
@@ -2455,7 +2541,12 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, int n)
 
 	kvm_irqdevice_init(&vcpu->irq.dev);
 	kvm_vcpu_irqsink_init(vcpu);
-	r = kvm_userint_init(vcpu);
+
+	if (kvm->enable_kernel_pic)
+		r = kvm_kernint_init(vcpu);
+	else
+		r = kvm_userint_init(vcpu);
+
 	if (r < 0)
 		goto out_free_vcpus;
 
@@ -2577,6 +2668,12 @@ static int kvm_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
 	return 0;
 }
 
+static int kvm_vcpu_ioctl_apic_reset(struct kvm_vcpu *vcpu)
+{
+	kvm_lapic_reset(vcpu);
+	return 0;
+}
+
 static long kvm_vcpu_ioctl(struct file *filp,
 			   unsigned int ioctl, unsigned long arg)
 {
@@ -2746,6 +2843,13 @@ static long kvm_vcpu_ioctl(struct file *filp,
 		r = 0;
 		break;
 	}
+	case KVM_APIC_RESET: {
+		r = kvm_vcpu_ioctl_apic_reset(vcpu);
+		if (r)
+			goto out;
+		r = 0;
+		break;
+	}
 	default:
 		;
 	}
@@ -2799,6 +2903,41 @@ static long kvm_vm_ioctl(struct file *filp,
 			goto out;
 		break;
 	}
+	case KVM_ENABLE_KERNEL_PIC: {
+		__u32 val;
+
+		r = -EFAULT;
+		if (copy_from_user(&val, argp, sizeof val))
+			goto out;
+		r = kvm_vm_ioctl_enable_kernel_pic(kvm, val);
+		if (r)
+			goto out;
+		break;
+	}
+	case KVM_ISA_INTERRUPT: {
+		struct kvm_interrupt irq;
+
+		r = -EFAULT;
+		if (copy_from_user(&irq, argp, sizeof irq))
+			goto out;
+		r = kvm_vm_ioctl_isa_interrupt(kvm, &irq);
+		if (r)
+			goto out;
+		r = 0;
+		break;
+	}
+	case KVM_APIC_MSG: {
+		struct kvm_apic_msg msg;
+
+		r = -EFAULT;
+		if (copy_from_user(&msg, argp, sizeof msg))
+			goto out;
+		r = kvm_vm_ioctl_apic_msg(kvm, &msg);
+		if (r)
+			goto out;
+		r = 0;
+		break;
+	}
 	default:
 		;
 	}
diff --git a/drivers/kvm/lapic.c b/drivers/kvm/lapic.c
new file mode 100644
index 0000000..8ea6b67
--- /dev/null
+++ b/drivers/kvm/lapic.c
@@ -0,0 +1,1582 @@
+/*
+ * Local APIC virtualization
+ *
+ * Copyright (C) 2006 Qumranet, Inc.
+ * Copyright (C) 2007 Novell
+ *
+ * Authors:
+ *   Dor Laor <dor.laor-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
+ *   Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
+ *
+ * Based on Xen 3.0 code, Copyright (c) 2004, Intel Corporation.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "kvm.h"
+#include <linux/kvm.h>
+#include <linux/mm.h>
+#include <linux/highmem.h>
+#include <linux/smp.h>
+#include <linux/hrtimer.h>
+#include <asm/processor.h>
+#include <asm/io.h>
+#include <asm/msr.h>
+#include <asm/page.h>
+#include <asm/current.h>
+
+/*XXX remove this definition after GFW enabled */
+#define APIC_NO_BIOS
+
+#define PRId64 "d"
+#define PRIx64 "llx"
+#define PRIu64 "u"
+#define PRIo64 "o"
+
+#define APIC_BUS_CYCLE_NS 1
+
+/* #define apic_debug(fmt,arg...) printk(KERN_WARNING fmt,##arg) */
+#define apic_debug(fmt,arg...)
+
+/*
+ *-----------------------------------------------------------------------
+ * KERNEL-TIMERS
+ *
+ * Unfortunately we really need hrtimers to do this right, but they might
+ * not exist on older kernels (sigh).  So we roughly abstract this interface
+ * to support nanosecond resolution, and then emulate it as best we can if
+ * real hrtimers aren't available
+ *-----------------------------------------------------------------------
+ */
+
+struct kvm_apic_timer {
+	int (*function)(void *private);
+	void *private;
+#ifdef KVM_NO_HRTIMER
+	struct timer_list dev;
+#else
+	struct hrtimer dev;
+#endif
+};
+
+#ifdef KVM_NO_HRTIMER
+
+/*
+ *----------------
+ * low-res version
+ *----------------
+ */
+
+static void kvm_apictimer_lr_cb(unsigned long data)
+{
+	struct kvm_apic_timer *timer = (struct kvm_apic_timer*)data;
+
+	/*
+	 * If the callback returns >0, it's cyclic
+	 */
+	if (timer->function(timer->private))
+		add_timer(&timer->dev);
+}
+
+static ktime_t kvm_apictimer_now(struct kvm_apic_timer *timer)
+{
+	struct timespec ts;
+
+	getnstimeofday(&ts);
+	return timespec_to_ktime(ts);
+}
+
+#define NSECS_PER_TICK (NSEC_PER_SEC / HZ)
+
+static void kvm_apictimer_update(struct kvm_apic_timer *timer, ktime_t val)
+{
+	/* HZ is far below 1GHz, so convert ns to ticks by division */
+	ktime_t offset = ktime_sub(val, kvm_apictimer_now(timer));
+	unsigned long timeout = ktime_to_ns(offset) / NSECS_PER_TICK;
+
+	if (!timeout)
+		timeout++;
+	timer->dev.expires = jiffies + timeout;
+}
+
+static void kvm_apictimer_start(struct kvm_apic_timer *timer, ktime_t val)
+{
+	kvm_apictimer_update(timer, val);
+	add_timer(&timer->dev);
+}
+
+static void kvm_apictimer_stop(struct kvm_apic_timer *timer)
+{
+	del_timer(&timer->dev);
+}
+
+static int kvm_apictimer_init(struct kvm_apic_timer *timer)
+{
+	memset(timer, 0, sizeof(*timer));
+
+	init_timer(&timer->dev);
+	timer->dev.function = kvm_apictimer_lr_cb;
+	timer->dev.data = (unsigned long)timer;
+
+	return 0;
+}
+
+#else
+
+/*
+ *----------------
+ * hi-res version
+ *----------------
+ */
+
+static enum hrtimer_restart kvm_apictimer_hr_cb(struct hrtimer *data)
+{
+	struct kvm_apic_timer *timer;
+
+	timer = container_of(data, struct kvm_apic_timer, dev);
+
+	/*
+	 * If the callback returns >0, it's cyclic
+	 */
+	if (timer->function(timer->private))
+		return HRTIMER_RESTART;
+	else
+		return HRTIMER_NORESTART;
+}
+
+static ktime_t kvm_apictimer_now(struct kvm_apic_timer *timer)
+{
+	return timer->dev.base->get_time();
+}
+
+static void kvm_apictimer_update(struct kvm_apic_timer *timer, ktime_t val)
+{
+	timer->dev.expires = val;
+}
+
+static void kvm_apictimer_start(struct kvm_apic_timer *timer, ktime_t val)
+{
+	hrtimer_start(&timer->dev, val, HRTIMER_MODE_ABS);
+}
+
+static void kvm_apictimer_stop(struct kvm_apic_timer *timer)
+{
+	hrtimer_cancel(&timer->dev);
+}
+
+static int kvm_apictimer_init(struct kvm_apic_timer *timer)
+{
+	hrtimer_init(&timer->dev, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+	timer->dev.function = kvm_apictimer_hr_cb;
+
+	return 0;
+}
+
+#endif
+
+/*
+ *-----------------------------------------------------------------------
+ * Actual LAPIC model
+ *-----------------------------------------------------------------------
+ */
+struct kvm_kern_apic {
+	spinlock_t              lock;
+	atomic_t                ref_count;
+	int                     usermode;
+	u32                     status;
+	u32                     vcpu_id;
+	u64                     base_msr;
+	unsigned long           base_address;
+	struct kvm_io_device    mmio_dev;
+	struct {
+		unsigned long   pending;
+		u32             divide_count;
+		ktime_t         last_update;
+		struct kvm_apic_timer dev;
+
+	} timer;
+	u32                     err_status;
+	u32                     err_write_count;
+	struct kvm_vcpu         *vcpu;
+	struct kvm_irqdevice    *irq_dev;
+	struct page             *regs_page;
+	void                    *regs;
+};
+
+static __inline__ int find_highest_bit(unsigned long *data, int nr_bits)
+{
+	int length = BITS_TO_LONGS(nr_bits);
+	while (length && !data[--length])
+		continue;
+	if (!data[length])
+		return 0;	/* empty bitmap: vector 0 means "none" */
+	return fls64(data[length]) - 1 + (length * BITS_PER_LONG);
+}
+
+#define APIC_LVT_NUM			6
+/* 0x14 is the version for Xeon and Pentium; see SDM section 8.4.8 */
+#define APIC_VERSION			(0x14UL | ((APIC_LVT_NUM - 1) << 16))
+#define VLOCAL_APIC_MEM_LENGTH		(1 << 12)
+/* the following defines are not in apicdef.h */
+#define APIC_SHORT_MASK			0xc0000
+#define APIC_DEST_NOSHORT		0x0
+#define APIC_DEST_MASK			0x800
+#define _APIC_GLOB_DISABLE		0x0
+#define APIC_GLOB_DISABLE_MASK		0x1
+#define APIC_SOFTWARE_DISABLE_MASK	0x2
+#define _APIC_BSP_ACCEPT_PIC		0x3
+#define MAX_APIC_INT_VECTOR             256
+
+#define inject_gp(vcpu) kvm_arch_ops->inject_gp(vcpu, 0)
+
+#define apic_enabled(apic)              \
+	(!((apic)->status &                   \
+	   (APIC_GLOB_DISABLE_MASK | APIC_SOFTWARE_DISABLE_MASK)))
+
+#define apic_global_enabled(apic)       \
+	(!(test_bit(_APIC_GLOB_DISABLE, &(apic)->status)))
+
+#define LVT_MASK \
+	(APIC_LVT_MASKED | APIC_SEND_PENDING | APIC_VECTOR_MASK)
+
+#define LINT_MASK \
+	(LVT_MASK | APIC_MODE_MASK | APIC_INPUT_POLARITY | \
+	 APIC_LVT_REMOTE_IRR | APIC_LVT_LEVEL_TRIGGER)
+
+#define KVM_APIC_ID(apic)   \
+	(GET_APIC_ID(apic_get_reg(apic, APIC_ID)))
+
+#define apic_lvt_enabled(apic, lvt_type)    \
+	(!(apic_get_reg(apic, lvt_type) & APIC_LVT_MASKED))
+
+#define apic_lvt_vector(apic, lvt_type)     \
+	(apic_get_reg(apic, lvt_type) & APIC_VECTOR_MASK)
+
+#define apic_lvt_dm(apic, lvt_type)           \
+	(apic_get_reg(apic, lvt_type) & APIC_MODE_MASK)
+
+#define apic_lvtt_period(apic)     \
+	(apic_get_reg(apic, APIC_LVTT) & APIC_LVT_TIMER_PERIODIC)
+
+static inline u32 apic_get_reg(struct kvm_kern_apic *apic, u32 reg)
+{
+	return *((u32 *)(apic->regs + reg));
+}
+
+static inline void apic_set_reg(struct kvm_kern_apic *apic,
+				u32 reg, u32 val)
+{
+	*((u32 *)(apic->regs + reg)) = val;
+}
+
+static unsigned int apic_lvt_mask[APIC_LVT_NUM] =
+{
+	LVT_MASK | APIC_LVT_TIMER_PERIODIC, 	/* LVTT */
+	LVT_MASK | APIC_MODE_MASK, 		/* LVTTHMR */
+	LVT_MASK | APIC_MODE_MASK, 		/* LVTPC */
+	LINT_MASK, LINT_MASK, 			/* LVT0-1 */
+	LVT_MASK 				/* LVTERR */
+};
+
+#define ASSERT(x)  							     \
+	if (!(x)) { 							     \
+		printk(KERN_EMERG "assertion failed %s: %d: %s\n",           \
+		       __FILE__, __LINE__, #x);                              \
+		BUG();                                                       \
+	}
+
+static int apic_timer_fn(void *private);
+
+static int apic_find_highest_irr(struct kvm_kern_apic *apic)
+{
+	int result;
+
+	result = find_highest_bit((unsigned long *)(apic->regs + APIC_IRR),
+				  MAX_APIC_INT_VECTOR);
+
+	ASSERT(result == 0 || result >= 16);
+
+	return result;
+}
+
+
+static int apic_find_highest_isr(struct kvm_kern_apic *apic)
+{
+	int result;
+
+	result = find_highest_bit((unsigned long *)(apic->regs + APIC_ISR),
+				  MAX_APIC_INT_VECTOR);
+
+	ASSERT(result == 0 || result >= 16);
+
+	return result;
+}
+
+static void apic_dropref(struct kvm_kern_apic *apic)
+{
+	if (atomic_dec_and_test(&apic->ref_count)) {
+
+		spin_lock_bh(&apic->lock);
+
+		kvm_apictimer_stop(&apic->timer.dev);
+
+		if (apic->regs_page) {
+			__free_page(apic->regs_page);
+			apic->regs_page = NULL;
+		}
+
+		spin_unlock_bh(&apic->lock);
+
+		kfree(apic);
+	}
+}
+
+#if 0
+static void apic_dump_state(struct kvm_kern_apic *apic)
+{
+	u64 *tmp;
+
+	printk(KERN_INFO "%s begin\n", __FUNCTION__);
+
+	printk(KERN_INFO "status = 0x%08x\n", apic->status);
+	printk(KERN_INFO "base_msr=0x%016llx, apicbase = 0x%08lx\n",
+	       apic->base_msr, apic->base_address);
+
+	tmp = (u64*)(apic->regs + APIC_IRR);
+	printk(KERN_INFO "IRR = 0x%016llx 0x%016llx 0x%016llx 0x%016llx\n",
+	       tmp[3], tmp[2], tmp[1], tmp[0]);
+	tmp = (u64*)(apic->regs + APIC_ISR);
+	printk(KERN_INFO "ISR = 0x%016llx 0x%016llx 0x%016llx 0x%016llx\n",
+	       tmp[3], tmp[2], tmp[1], tmp[0]);
+	tmp = (u64*)(apic->regs + APIC_TMR);
+	printk(KERN_INFO "TMR = 0x%016llx 0x%016llx 0x%016llx 0x%016llx\n",
+	       tmp[3], tmp[2], tmp[1], tmp[0]);
+
+	printk(KERN_INFO "APIC_ID=0x%08x\n", apic_get_reg(apic, APIC_ID));
+	printk(KERN_INFO "APIC_TASKPRI=0x%08x\n",
+	       apic_get_reg(apic, APIC_TASKPRI) & 0xff);
+	printk(KERN_INFO "APIC_PROCPRI=0x%08x\n",
+	       apic_get_reg(apic, APIC_PROCPRI));
+
+	printk(KERN_INFO "APIC_DFR=0x%08x\n",
+	       apic_get_reg(apic, APIC_DFR) | 0x0FFFFFFF);
+	printk(KERN_INFO "APIC_LDR=0x%08x\n",
+	       apic_get_reg(apic, APIC_LDR) & APIC_LDR_MASK);
+	printk(KERN_INFO "APIC_SPIV=0x%08x\n",
+	       apic_get_reg(apic, APIC_SPIV) & 0x3ff);
+	printk(KERN_INFO "APIC_ESR=0x%08x\n",
+	       apic_get_reg(apic, APIC_ESR));
+	printk(KERN_INFO "APIC_ICR=0x%08x\n",
+	       apic_get_reg(apic, APIC_ICR) & ~(1 << 12));
+	printk(KERN_INFO "APIC_ICR2=0x%08x\n",
+	       apic_get_reg(apic, APIC_ICR2) & 0xff000000);
+
+	printk(KERN_INFO "APIC_LVTERR=0x%08x\n",
+	       apic_get_reg(apic, APIC_LVTERR));
+	printk(KERN_INFO "APIC_LVT1=0x%08x\n",
+	       apic_get_reg(apic, APIC_LVT1));
+	printk(KERN_INFO "APIC_LVT0=0x%08x\n",
+	       apic_get_reg(apic, APIC_LVT0));
+	printk(KERN_INFO "APIC_LVTPC=0x%08x\n",
+	       apic_get_reg(apic, APIC_LVTPC));
+	printk(KERN_INFO "APIC_LVTTHMR=0x%08x\n",
+	       apic_get_reg(apic, APIC_LVTTHMR));
+	printk(KERN_INFO "APIC_LVTT=0x%08x\n",
+	       apic_get_reg(apic, APIC_LVTT));
+
+	printk(KERN_INFO "APIC_TMICT=0x%08x\n",
+	       apic_get_reg(apic, APIC_TMICT));
+	printk(KERN_INFO "APIC_TDCR=0x%08x\n",
+	       apic_get_reg(apic, APIC_TDCR));
+
+	printk(KERN_INFO "%s end\n", __FUNCTION__);
+}
+#endif
+
+
+static int apic_update_ppr(struct kvm_kern_apic *apic)
+{
+	u32 tpr, isrv, ppr, orig_ppr;
+	int irq;
+	int masked = 0;
+	int forward = 0;
+
+	ppr = apic_get_reg(apic, APIC_PROCPRI);
+	orig_ppr = ppr;
+
+	/*
+	 * Before we change anything, note whether any pending vectors
+	 * are currently masked by the PPR
+	 */
+	irq = apic_find_highest_irr(apic);
+	if (irq && ((irq & 0xf0) <= ppr))
+		masked = true;
+
+	/*
+	 * Compute the PPR value based on the current settings of TPR/ISR
+	 */
+	tpr = apic_get_reg(apic, APIC_TASKPRI);
+	irq = apic_find_highest_isr(apic);
+	isrv = (irq >> 4) & 0xf;
+
+	if ((tpr >> 4) >= isrv)
+		ppr = tpr & 0xff;
+	else
+		ppr = isrv << 4;  /* low 4 bits of PPR have to be cleared */
+
+	apic_set_reg(apic, APIC_PROCPRI, ppr);
+
+	if (masked) {
+		/*
+		 * If we get here, it's because vectors were masked by
+		 * the PPR.  Check again to see if anything is now
+		 * available
+		 */
+		irq = apic_find_highest_irr(apic);
+		if ((irq & 0xf0) > ppr)
+			forward = 1;
+	}
+
+	apic_debug("%s: ppr 0x%x (old) 0x%x (new), isr 0x%x, isrv 0x%x\n",
+	       __FUNCTION__, orig_ppr, ppr, irq, isrv);
+
+	return forward;
+}
+
+static void apic_set_tpr(struct kvm_kern_apic *apic, u32 tpr)
+{
+	int forward = 0;
+
+	apic_debug("new value = %x\n", tpr);
+
+	apic_set_reg(apic, APIC_TASKPRI, tpr);
+	forward = apic_update_ppr(apic);
+
+	if (forward) {
+		spin_unlock_bh(&apic->lock);
+		kvm_irqdevice_set_intr(apic->irq_dev, kvm_irqpin_localint);
+		spin_lock_bh(&apic->lock);
+	}
+}
+
+static int apic_match_dest(struct kvm_kern_apic *target,
+			   int dest,
+			   int dest_mode,
+			   int delivery_mode)
+{
+	int result = 0;
+
+	spin_lock_bh(&target->lock);
+
+	if (!dest_mode) /* Physical */
+		result = (GET_APIC_ID(apic_get_reg(target, APIC_ID)) == dest);
+	else { /* Logical */
+		u32 ldr = apic_get_reg(target, APIC_LDR);
+
+		/* Flat mode */
+		if (apic_get_reg(target, APIC_DFR) == APIC_DFR_FLAT)
+			result = GET_APIC_LOGICAL_ID(ldr) & dest;
+		else {
+			if ((delivery_mode == APIC_DM_LOWEST) &&
+			    (dest == 0xff)) {
+				printk(KERN_ALERT "Broadcast IPI " \
+				       "with lowest priority "
+				       "delivery mode\n");
+				spin_unlock_bh(&target->lock);
+				kvm_crash_guest(target->vcpu->kvm);
+				return 0;
+			}
+			if (GET_APIC_LOGICAL_ID(ldr) == (dest & 0xf))
+				result = (GET_APIC_LOGICAL_ID(ldr) >> 4) &
+					(dest >> 4);
+			else
+				result = 0;
+		}
+	}
+
+	spin_unlock_bh(&target->lock);
+
+	return result;
+}
+
+/*
+ * Add a pending IRQ into lapic.
+ * Return 1 if successfully added and 0 if discarded.
+ */
+static int __apic_accept_irq(struct kvm_kern_apic *apic,
+			     int delivery_mode,
+			     int vector,
+			     int level,
+			     int trig_mode)
+{
+	int result = 0;
+
+	switch (delivery_mode) {
+	case APIC_DM_FIXED:
+	case APIC_DM_LOWEST:
+		/* FIXME add logic for vcpu on reset */
+		if (unlikely(apic == NULL || !apic_enabled(apic)))
+			break;
+
+		if (test_and_set_bit(vector, apic->regs + APIC_IRR) &&
+		    trig_mode) {
+			apic_debug("level trig mode repeatedly for vector %d\n",
+			       vector);
+			break;
+		}
+
+		if (trig_mode) {
+			apic_debug("level trig mode for vector %d\n", vector);
+			set_bit(vector, apic->regs + APIC_TMR);
+		}
+
+		apic_debug("accepted FIXED/LOWEST interrupt for vector %d\n",
+			   vector);
+
+		if ((vector & 0xf0) > apic_get_reg(apic, APIC_PROCPRI)) {
+			apic_debug("signaling vector %d to output pin\n",
+				   vector);
+			/*
+			 * temp release of the lock to transmit
+			 */
+			spin_unlock_bh(&apic->lock);
+			kvm_irqdevice_set_intr(apic->irq_dev,
+					       kvm_irqpin_localint);
+			spin_lock_bh(&apic->lock);
+		} else {
+			apic_debug("vector %d is lower than TPR\n",
+				   vector);
+		}
+
+		result = 1;
+		break;
+
+	case APIC_DM_REMRD:
+		printk(KERN_WARNING "%s: ignoring delivery mode %d\n",
+		       __FUNCTION__, delivery_mode);
+		break;
+	case APIC_DM_EXTINT:
+	case APIC_DM_SMI:
+	case APIC_DM_NMI: {
+		kvm_irqpin_t pin = kvm_irqpin_invalid;
+
+		switch (delivery_mode) {
+		case APIC_DM_EXTINT:
+			pin = kvm_irqpin_extint;
+			break;
+		case APIC_DM_SMI:
+			pin = kvm_irqpin_smi;
+			break;
+		case APIC_DM_NMI:
+			pin = kvm_irqpin_nmi;
+			break;
+		default:
+			panic("KVM: illegal delivery_mode");
+		}
+
+		/*
+		 * temp release of the lock to transmit
+		 */
+		spin_unlock_bh(&apic->lock);
+		kvm_irqdevice_set_intr(apic->irq_dev, pin);
+		spin_lock_bh(&apic->lock);
+		result = 1;
+		break;
+	}
+	case APIC_DM_INIT:
+	case APIC_DM_STARTUP: /* FIXME: currently no support for SMP */
+	default:
+		printk(KERN_ALERT "TODO: unsupported interrupt type %x\n",
+		       delivery_mode);
+		kvm_crash_guest(apic->vcpu->kvm);
+		break;
+	}
+
+	return result;
+}
+
+static int apic_accept_irq(struct kvm_kern_apic *apic,
+			   int delivery_mode,
+			   int vector,
+			   int level,
+			   int trig_mode)
+{
+	int ret;
+
+	spin_lock_bh(&apic->lock);
+	ret = __apic_accept_irq(apic, delivery_mode, vector,
+				level, trig_mode);
+	spin_unlock_bh(&apic->lock);
+
+	return ret;
+}
+
+static void apic_set_eoi(struct kvm_kern_apic *apic)
+{
+	int vector = apic_find_highest_isr(apic);
+	int forward;
+
+	/*
+	 * Not every EOI write has a corresponding ISR bit; one example
+	 * is when the kernel checks the timer in setup_IO_APIC
+	 */
+	if (!vector)
+		return;
+
+	__clear_bit(vector, apic->regs + APIC_ISR);
+	forward = apic_update_ppr(apic);
+
+	__clear_bit(vector, apic->regs + APIC_TMR);
+
+	if (forward) {
+		spin_unlock_bh(&apic->lock);
+		kvm_irqdevice_set_intr(apic->irq_dev, kvm_irqpin_localint);
+		spin_lock_bh(&apic->lock);
+	}
+}
+
+static int apic_check_vector(struct kvm_kern_apic *apic, u32 dm, u32 vector)
+{
+	if ((dm == APIC_DM_FIXED) && (vector < 16)) {
+		apic->err_status |= 0x40;
+		__apic_accept_irq(apic, APIC_DM_FIXED,
+				  apic_lvt_vector(apic, APIC_LVTERR), 0, 0);
+		apic_debug("%s: check failed, dm %x vector %x\n",
+			   __FUNCTION__, dm, vector);
+		return 0;
+	}
+	return 1;
+}
+
+int kvm_apicbus_send(struct kvm *kvm, int dest, int trig_mode, int level,
+		     int dest_mode, int delivery_mode, int vector)
+{
+	int i;
+	u32 lpr_map = 0;
+
+	apic_debug("%s: %d %d %d %d %d %d\n", __FUNCTION__,
+		   dest, trig_mode, level, dest_mode, delivery_mode, vector);
+
+	for (i = 0; i < KVM_MAX_VCPUS; ++i) {
+		struct kvm_kern_apic *target;
+		target = kvm->vcpus[i].apic.dev;
+
+		if (!target)
+			continue;
+
+		if (apic_match_dest(target, dest, dest_mode, delivery_mode)) {
+			if (delivery_mode == APIC_DM_LOWEST)
+				__set_bit(target->vcpu_id, &lpr_map);
+			else
+				apic_accept_irq(target, delivery_mode,
+						vector, level, trig_mode);
+		}
+	}
+
+	if (delivery_mode == APIC_DM_LOWEST) {
+		struct kvm_kern_apic *target;
+
+		/* Currently only UP is supported */
+		target = kvm->vcpus[0].apic.dev;
+
+		if (target)
+			apic_accept_irq(target, delivery_mode,
+					vector, level, trig_mode);
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_apicbus_send);
+
+static void apic_send_ipi(struct kvm_kern_apic *apic)
+{
+	u32 icr_low = apic_get_reg(apic, APIC_ICR);
+	u32 icr_high = apic_get_reg(apic, APIC_ICR2);
+
+	unsigned int dest =          GET_APIC_DEST_FIELD(icr_high);
+	unsigned int short_hand =    icr_low & APIC_SHORT_MASK;
+	unsigned int trig_mode =     icr_low & APIC_INT_LEVELTRIG;
+	unsigned int level =         icr_low & APIC_INT_ASSERT;
+	unsigned int dest_mode =     icr_low & APIC_DEST_MASK;
+	unsigned int delivery_mode = icr_low & APIC_MODE_MASK;
+	unsigned int vector =        icr_low & APIC_VECTOR_MASK;
+
+	apic_debug("icr_high 0x%x, icr_low 0x%x, "
+		 "short_hand 0x%x, dest 0x%x, trig_mode 0x%x, level 0x%x, "
+		 "dest_mode 0x%x, delivery_mode 0x%x, vector 0x%x\n",
+		 icr_high, icr_low, short_hand, dest,
+		 trig_mode, level, dest_mode, delivery_mode, vector);
+
+	/*
+	 * We unlock here because we enter this function with the lock
+	 * held and we don't want to hold it while we transmit
+	 */
+	spin_unlock_bh(&apic->lock);
+
+	if (short_hand == APIC_DEST_NOSHORT)
+		/*
+		 * If no short-hand notation is in use, just forward the
+		 * message onto the apicbus and let the bus handle the routing.
+		 */
+		kvm_apicbus_send(apic->vcpu->kvm, dest, trig_mode, level,
+				 dest_mode, delivery_mode, vector);
+	else {
+		/*
+		 * Otherwise we need to consider the short-hand to find the
+		 * correct targets.
+		 */
+		unsigned int i;
+
+		for (i = 0; i < KVM_MAX_VCPUS; ++i) {
+			struct kvm_kern_apic *target;
+			int result = 0;
+
+			target = apic->vcpu->kvm->vcpus[i].apic.dev;
+
+			if (!target)
+				continue;
+
+			switch (short_hand) {
+			case APIC_DEST_SELF:
+				if (target == apic)
+					result = 1;
+				break;
+			case APIC_DEST_ALLINC:
+				result = 1;
+				break;
+
+			case APIC_DEST_ALLBUT:
+				if (target != apic)
+					result = 1;
+				break;
+			}
+
+			if (result)
+				apic_accept_irq(target, delivery_mode,
+						vector, level, trig_mode);
+		}
+	}
+
+	/*
+	 * Relock before returning
+	 */
+	spin_lock_bh(&apic->lock);
+
+}
+
+static u32 apic_get_tmcct(struct kvm_kern_apic *apic)
+{
+	u32 counter_passed;
+	ktime_t passed, now = kvm_apictimer_now(&apic->timer.dev);
+	s64 tmcct = apic_get_reg(apic, APIC_TMCCT);
+
+	ASSERT(apic != NULL);
+
+	if (unlikely(ktime_to_ns(now) <=
+		     ktime_to_ns(apic->timer.last_update))) {
+		/* Wrapped: add the time remaining to KTIME_MAX onto now */
+		ktime_t rem = { .tv64 = KTIME_MAX -
+				apic->timer.last_update.tv64 };
+
+		passed = ktime_add(rem, now);
+		apic_debug("time wrapped\n");
+	} else
+		passed = ktime_sub(now, apic->timer.last_update);
+
+	counter_passed = ktime_to_ns(passed) /
+		(APIC_BUS_CYCLE_NS * apic->timer.divide_count);
+	tmcct -= counter_passed;
+
+	if (tmcct <= 0) {
+		if (unlikely(!apic_lvtt_period(apic))) {
+			tmcct =  0;
+		} else {
+			do {
+				tmcct += apic_get_reg(apic, APIC_TMICT);
+			} while (tmcct <= 0);
+		}
+	}
+
+	apic->timer.last_update = now;
+	apic_set_reg(apic, APIC_TMCCT, tmcct);
+
+	return tmcct;
+}
+
+/*
+ *----------------------------------------------------------------------
+ * MMIO
+ *----------------------------------------------------------------------
+ */
+
+#define align_down(val, len) ((val) & ~((len) - 1))
+
+static int validate_mmio(struct kvm_kern_apic *apic, gpa_t address, int len)
+{
+	/*
+	 * According to the IA-32 manual, all APIC registers should be
+	 * accessed with 32-bit alignment, i.e. an access must not
+	 * straddle a 32-bit boundary
+	 */
+	if (align_down(address, 4) != align_down(address + len - 1, 4)) {
+		printk(KERN_WARNING "KVM: MMIO request of %d bytes at 0x%llx " \
+		       "is not 32 bit aligned.  Injecting #GP\n",
+		       len, (unsigned long long)address);
+		inject_gp(apic->vcpu);
+		return 0;
+	}
+
+	return 1;
+}
+
+static u32 __apic_read(struct kvm_kern_apic *apic,
+				unsigned int offset)
+{
+	u32 val = 0;
+
+	if (offset > APIC_TDCR)
+		return 0;
+
+	switch (offset) {
+	case APIC_ARBPRI:
+		printk(KERN_WARNING "access local APIC ARBPRI register " \
+		       "which is for P6\n");
+		break;
+
+	case APIC_TMCCT:        /* Timer CCR */
+		val = apic_get_tmcct(apic);
+		break;
+
+	case APIC_ESR:
+		apic->err_write_count = 0;
+		/* fall through */
+	default:
+		val = apic_get_reg(apic, offset);
+		break;
+	}
+
+	return val;
+}
+
+static void apic_mmio_read(struct kvm_io_device *this,
+			   gpa_t address,
+			   int len,
+			   void *data)
+{
+	struct kvm_kern_apic *apic = (struct kvm_kern_apic*)this->private;
+	unsigned int          offset = address - apic->base_address;
+	unsigned char         alignment = offset & 0x3;
+	u32                   val;
+
+	if (!validate_mmio(apic, address, len))
+		return;
+
+	spin_lock_bh(&apic->lock);
+	val = __apic_read(apic, offset & ~0x3);
+	spin_unlock_bh(&apic->lock);
+
+	switch (len) {
+	case 1:
+	case 2:
+	case 4:
+		memcpy(data, (char *)&val + alignment, len);
+		break;
+	default:
+		printk(KERN_ALERT "Local APIC read with len = %x, " \
+		       "should be 1,2, or 4 instead\n", len);
+		inject_gp(apic->vcpu);
+		break;
+	}
+}
+
+static void apic_mmio_write(struct kvm_io_device *this,
+			    gpa_t address,
+			    int len,
+			    const void *data)
+{
+	struct kvm_kern_apic *apic = (struct kvm_kern_apic*)this->private;
+	unsigned int          offset = address - apic->base_address;
+	unsigned char         alignment = offset & 0x3;
+	u32                   val;
+
+	if (!validate_mmio(apic, address, len))
+		return;
+
+	spin_lock_bh(&apic->lock);
+
+	switch (len) {
+	case 1:
+	case 2: {
+		unsigned int tmp;
+
+		/*
+		 * Some kernels will access with byte/word alignment
+		 */
+		apic_debug("Notice: Local APIC write with len = %x\n", len);
+		tmp = __apic_read(apic, offset & ~0x3);
+		switch (len) {
+		case 1:
+			val = *(u8*)data;
+
+			val = (tmp & ~(0xff << (8*alignment))) |
+			      ((val & 0xff) << (8*alignment));
+			break;
+
+		case 2:
+			if (alignment != 0x0 && alignment != 0x2) {
+				printk(KERN_ALERT "alignment error for apic " \
+				       "with len == 2\n");
+				inject_gp(apic->vcpu);
+			}
+
+			/*
+			 * Assumes 16-bit alignment of the source
+			 * pointer; misalignment here is a host-side
+			 * bug, so we BUG_ON it
+			 */
+			BUG_ON(((long)data & 0x1));
+
+			val = *(u16*)data;
+
+			val = (tmp & ~(0xffff << (8*alignment))) |
+			      ((val & 0xffff) << (8*alignment));
+			break;
+		}
+
+		break;
+	}
+	case 4:
+		memcpy(&val, data, 4);
+		break;
+	default:
+		printk(KERN_ALERT "Local APIC write with len = %x, " \
+		       "should be 1,2, or 4 instead\n", len);
+		inject_gp(apic->vcpu);
+		break;
+	}
+
+	/* EOI writes are too frequent to log */
+	if (offset != APIC_EOI)
+		apic_debug("%s: offset 0x%x with length 0x%x, and value is "
+			   "0x%x\n", __FUNCTION__, offset, len, val);
+
+	offset &= 0xff0;
+
+	switch (offset) {
+	case APIC_ID:   /* Local APIC ID */
+		apic_set_reg(apic, APIC_ID, val);
+		break;
+
+	case APIC_TASKPRI:
+		apic_set_tpr(apic, val & 0xff);
+		break;
+
+	case APIC_EOI:
+		apic_set_eoi(apic);
+		break;
+
+	case APIC_LDR:
+		apic_set_reg(apic, APIC_LDR, val & APIC_LDR_MASK);
+		break;
+
+	case APIC_DFR:
+		apic_set_reg(apic, APIC_DFR, val | 0x0FFFFFFF);
+		break;
+
+	case APIC_SPIV:
+		apic_set_reg(apic, APIC_SPIV, val & 0x3ff);
+		if (!(val & APIC_SPIV_APIC_ENABLED)) {
+			int i;
+			u32 lvt_val;
+
+			apic->status |= APIC_SOFTWARE_DISABLE_MASK;
+			for (i = 0; i < APIC_LVT_NUM; i++) {
+				lvt_val = apic_get_reg(apic,
+							   APIC_LVTT +
+							   0x10 * i);
+				apic_set_reg(apic, APIC_LVTT + 0x10 * i,
+						 lvt_val | APIC_LVT_MASKED);
+			}
+
+			if ((apic_get_reg(apic, APIC_LVT0) &
+			     APIC_MODE_MASK) == APIC_DM_EXTINT)
+				clear_bit(_APIC_BSP_ACCEPT_PIC, &apic->status);
+		} else {
+			apic->status &= ~APIC_SOFTWARE_DISABLE_MASK;
+			if ((apic_get_reg(apic, APIC_LVT0) &
+			     APIC_MODE_MASK) == APIC_DM_EXTINT)
+				set_bit(_APIC_BSP_ACCEPT_PIC, &apic->status);
+		}
+		break;
+
+	case APIC_ESR:
+		apic->err_write_count = !apic->err_write_count;
+		if (!apic->err_write_count)
+			apic->err_status = 0;
+		break;
+
+	case APIC_ICR:
+		/* No delay here, so we always clear the pending bit */
+		apic_set_reg(apic, APIC_ICR, val & ~(1 << 12));
+		apic_send_ipi(apic);
+		break;
+
+	case APIC_ICR2:
+		apic_set_reg(apic, APIC_ICR2, val & 0xff000000);
+		break;
+
+	case APIC_LVTT:
+	case APIC_LVTTHMR:
+	case APIC_LVTPC:
+	case APIC_LVT0:
+	case APIC_LVT1:
+	case APIC_LVTERR:
+	{
+		if (apic->status & APIC_SOFTWARE_DISABLE_MASK)
+			val |= APIC_LVT_MASKED;
+
+		val &= apic_lvt_mask[(offset - APIC_LVTT) >> 4];
+		apic_set_reg(apic, offset, val);
+
+		/* On real hardware, writing a vector below 0x20 is an error */
+		if (!(val & APIC_LVT_MASKED))
+			apic_check_vector(apic, apic_lvt_dm(apic, offset),
+					  apic_lvt_vector(apic, offset));
+		if (!apic->vcpu_id && (offset == APIC_LVT0)) {
+			if ((val & APIC_MODE_MASK) == APIC_DM_EXTINT) {
+				if (val & APIC_LVT_MASKED)
+					clear_bit(_APIC_BSP_ACCEPT_PIC,
+						  &apic->status);
+				else
+					set_bit(_APIC_BSP_ACCEPT_PIC,
+						&apic->status);
+			} else
+				clear_bit(_APIC_BSP_ACCEPT_PIC,
+					  &apic->status);
+		}
+	}
+		break;
+
+	case APIC_TMICT:
+	{
+		ktime_t now = kvm_apictimer_now(&apic->timer.dev);
+		u32 offset;
+
+		apic_set_reg(apic, APIC_TMICT, val);
+		apic_set_reg(apic, APIC_TMCCT, val);
+		apic->timer.last_update = now;
+		offset = APIC_BUS_CYCLE_NS * apic->timer.divide_count * val;
+
+		/* Make sure the lock ordering is coherent */
+		spin_unlock_bh(&apic->lock);
+		kvm_apictimer_stop(&apic->timer.dev);
+		kvm_apictimer_start(&apic->timer.dev,
+				    ktime_add_ns(now, offset));
+
+		apic_debug("%s: bus cycle is %"PRId64"ns, now 0x%016"PRIx64", "
+			 "timer initial count 0x%x, offset 0x%x, "
+			 "expire @ 0x%016"PRIx64".\n", __FUNCTION__,
+			 APIC_BUS_CYCLE_NS, ktime_to_ns(now),
+			 apic_get_reg(apic, APIC_TMICT),
+			 offset, ktime_to_ns(ktime_add_ns(now, offset)));
+	}
+		return;
+
+	case APIC_TDCR:
+	{
+		unsigned int tmp1, tmp2;
+
+		tmp1 = val & 0xf;
+		tmp2 = ((tmp1 & 0x3) | ((tmp1 & 0x8) >> 1)) + 1;
+		apic->timer.divide_count = 0x1 << (tmp2 & 0x7);
+
+		apic_set_reg(apic, APIC_TDCR, val);
+
+		apic_debug("timer divide count is 0x%x\n",
+		       apic->timer.divide_count);
+	}
+		break;
+
+	default:
+		printk(KERN_WARNING "Local APIC write to read-only register\n");
+		break;
+	}
+
+	spin_unlock_bh(&apic->lock);
+}
+
+static int apic_mmio_range(struct kvm_io_device *this, gpa_t addr)
+{
+	struct kvm_kern_apic *apic = (struct kvm_kern_apic*)this->private;
+	int ret = 0;
+
+	spin_lock_bh(&apic->lock);
+
+	if (apic_global_enabled(apic) &&
+	    (addr >= apic->base_address) &&
+	    (addr < apic->base_address + VLOCAL_APIC_MEM_LENGTH))
+		ret = 1;
+
+	spin_unlock_bh(&apic->lock);
+
+	return ret;
+}
+
+static void apic_mmio_destructor(struct kvm_io_device *this)
+{
+	struct kvm_kern_apic *apic = (struct kvm_kern_apic*)this->private;
+
+	apic_dropref(apic);
+}
+
+static void apic_mmio_register(struct kvm_kern_apic *apic)
+{
+	/* Register ourselves with the MMIO subsystem */
+	struct kvm_io_device *dev = &apic->mmio_dev;
+
+	dev->read       = apic_mmio_read;
+	dev->write      = apic_mmio_write;
+	dev->in_range   = apic_mmio_range;
+	dev->destructor = apic_mmio_destructor;
+
+	dev->private = apic;
+	atomic_inc(&apic->ref_count);
+
+	apic->vcpu->apic.mmio = dev;
+}
+
+/*
+ *----------------------------------------------------------------------
+ * LAPIC interface
+ *----------------------------------------------------------------------
+ */
+
+void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, u64 cr8)
+{
+	struct kvm_kern_apic *apic = (struct kvm_kern_apic*)vcpu->apic.dev;
+
+	spin_lock_bh(&apic->lock);
+	apic_set_tpr(apic, ((cr8 & 0x0f) << 4));
+	spin_unlock_bh(&apic->lock);
+}
+
+u64 kvm_lapic_get_tpr(struct kvm_vcpu *vcpu)
+{
+	struct kvm_kern_apic *apic = (struct kvm_kern_apic*)vcpu->apic.dev;
+	u64 tpr;
+
+	spin_lock_bh(&apic->lock);
+	tpr = (u64)apic_get_reg(apic, APIC_TASKPRI);
+	spin_unlock_bh(&apic->lock);
+
+	return (tpr & 0xf0) >> 4;
+}
+EXPORT_SYMBOL_GPL(kvm_lapic_get_tpr);
+
+void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
+{
+	struct kvm_kern_apic *apic = (struct kvm_kern_apic*)vcpu->apic.dev;
+
+	spin_lock_bh(&apic->lock);
+	if (apic->vcpu_id)
+		value &= ~MSR_IA32_APICBASE_BSP;
+
+	apic->base_msr = value;
+	apic->base_address = apic->base_msr & MSR_IA32_APICBASE_BASE;
+
+	/* with FSB delivery interrupt, we can restart APIC functionality */
+	if (!(value & MSR_IA32_APICBASE_ENABLE))
+		set_bit(_APIC_GLOB_DISABLE, &apic->status);
+	else
+		clear_bit(_APIC_GLOB_DISABLE, &apic->status);
+
+	apic_debug("apic base msr is 0x%016"PRIx64", and base address is " \
+		 "0x%lx.\n", apic->base_msr, apic->base_address);
+
+	spin_unlock_bh(&apic->lock);
+}
+
+u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu)
+{
+	struct kvm_kern_apic *apic = (struct kvm_kern_apic*)vcpu->apic.dev;
+	u64 base;
+
+	spin_lock_bh(&apic->lock);
+	base = apic->base_msr;
+	spin_unlock_bh(&apic->lock);
+
+	return base;
+}
+EXPORT_SYMBOL_GPL(kvm_lapic_get_base);
+
+void kvm_lapic_save(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
+{
+	/*
+	 * FIXME: This needs to support the entire register set when
+	 * enabled
+	 */
+	sregs->cr8       = kvm_lapic_get_tpr(vcpu);
+	sregs->apic_base = kvm_lapic_get_base(vcpu);
+}
+
+void kvm_lapic_restore(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
+{
+	/*
+	 * FIXME: This needs to support the entire register set when
+	 * enabled
+	 */
+	kvm_lapic_set_tpr(vcpu, sregs->cr8);
+	kvm_lapic_set_base(vcpu, sregs->apic_base);
+}
+
+void kvm_lapic_reset(struct kvm_vcpu *vcpu)
+{
+	struct kvm_kern_apic *apic;
+	int i;
+
+	apic_debug("%s\n", __FUNCTION__);
+
+	ASSERT(vcpu);
+	apic = vcpu->apic.dev;
+	ASSERT(apic != NULL);
+
+	/* Stop the timer in case it's a reset to an active apic */
+	kvm_apictimer_stop(&apic->timer.dev);
+
+	spin_lock_bh(&apic->lock);
+
+	apic_set_reg(apic, APIC_ID, vcpu_slot(vcpu) << 24);
+	apic_set_reg(apic, APIC_LVR, APIC_VERSION);
+
+	for (i = 0; i < APIC_LVT_NUM; i++)
+		apic_set_reg(apic, APIC_LVTT + 0x10 * i, APIC_LVT_MASKED);
+
+	apic_set_reg(apic, APIC_DFR, 0xffffffffU);
+	apic_set_reg(apic, APIC_SPIV, 0xff);
+	apic_set_reg(apic, APIC_TASKPRI, 0);
+	apic_set_reg(apic, APIC_LDR, 0);
+	apic_set_reg(apic, APIC_ESR, 0);
+	apic_set_reg(apic, APIC_ICR, 0);
+	apic_set_reg(apic, APIC_ICR2, 0);
+	apic_set_reg(apic, APIC_TDCR, 0);
+	apic_set_reg(apic, APIC_TMICT, 0);
+	memset((void*)(apic->regs + APIC_IRR), 0, KVM_IRQ_BITMAP_SIZE(u8));
+	memset((void*)(apic->regs + APIC_ISR), 0, KVM_IRQ_BITMAP_SIZE(u8));
+	memset((void*)(apic->regs + APIC_TMR), 0, KVM_IRQ_BITMAP_SIZE(u8));
+
+	apic->base_msr =
+		MSR_IA32_APICBASE_ENABLE |
+		APIC_DEFAULT_PHYS_BASE;
+	if (vcpu_slot(vcpu) == 0)
+		apic->base_msr |= MSR_IA32_APICBASE_BSP;
+	apic->base_address = apic->base_msr & MSR_IA32_APICBASE_BASE;
+
+	apic->timer.divide_count = 0;
+	apic->timer.pending = 0;
+	apic->status = 0;
+
+#ifdef APIC_NO_BIOS
+	/*
+	 * XXX According to the MP specification, the BIOS enables LVT0/1;
+	 * remove this once the BIOS does so itself
+	 */
+	if (!vcpu_slot(vcpu)) {
+		apic_set_reg(apic, APIC_LVT0, APIC_MODE_EXTINT << 8);
+		apic_set_reg(apic, APIC_LVT1, APIC_MODE_NMI << 8);
+		set_bit(_APIC_BSP_ACCEPT_PIC, &apic->status);
+	}
+#endif
+
+	spin_unlock_bh(&apic->lock);
+
+	printk(KERN_INFO  "%s: vcpu=%p, id=%d, base_msr=" \
+	       "0x%016"PRIx64", base_address=0x%0lx.\n", __FUNCTION__, vcpu,
+	       GET_APIC_ID(apic_get_reg(apic, APIC_ID)),
+	       apic->base_msr, apic->base_address);
+}
+
+int kvm_lapic_enabled(struct kvm_vcpu *vcpu)
+{
+	struct kvm_kern_apic *apic = (struct kvm_kern_apic*)vcpu->apic.dev;
+	int ret = 0;
+
+	spin_lock_bh(&apic->lock);
+	if (!apic->usermode)
+		ret = apic_enabled(apic);
+	spin_unlock_bh(&apic->lock);
+
+	return ret;
+}
+
+/*
+ *----------------------------------------------------------------------
+ * timer interface
+ *----------------------------------------------------------------------
+ */
+static int __apic_timer_fn(struct kvm_kern_apic *apic)
+{
+	u32 vector;
+	ktime_t now;
+	int result = 0;
+
+	if (unlikely(!apic_enabled(apic) ||
+		     !apic_lvt_enabled(apic, APIC_LVTT))) {
+		apic_debug("%s: timer interrupt although apic is down\n",
+			 __FUNCTION__);
+		return 0;
+	}
+
+	vector                  = apic_lvt_vector(apic, APIC_LVTT);
+	now                     = kvm_apictimer_now(&apic->timer.dev);
+	apic->timer.last_update = now;
+	apic->timer.pending++;
+
+	__apic_accept_irq(apic, APIC_DM_FIXED, vector, 1, 0);
+
+	if (apic_lvtt_period(apic)) {
+		u32 offset;
+		u32 tmict = apic_get_reg(apic, APIC_TMICT);
+
+		apic_set_reg(apic, APIC_TMCCT, tmict);
+		offset = APIC_BUS_CYCLE_NS * apic->timer.divide_count * tmict;
+
+		result = 1;
+		kvm_apictimer_update(&apic->timer.dev,
+				     ktime_add_ns(now, offset));
+
+		apic_debug("%s: now 0x%016"PRIx64", expire @ 0x%016"PRIx64", "
+		       "timer initial count 0x%x, timer current count 0x%x.\n",
+		       __FUNCTION__,
+		       ktime_to_ns(now), ktime_to_ns(ktime_add_ns(now, offset)),
+		       apic_get_reg(apic, APIC_TMICT),
+		       apic_get_reg(apic, APIC_TMCCT));
+	} else {
+		apic_set_reg(apic, APIC_TMCCT, 0);
+		apic_debug("%s: now 0x%016"PRIx64", "
+		       "timer initial count 0x%x, timer current count 0x%x.\n",
+		       __FUNCTION__,
+		       ktime_to_ns(now), apic_get_reg(apic, APIC_TMICT),
+		       apic_get_reg(apic, APIC_TMCCT));
+	}
+
+	return result;
+}
+
+static int apic_timer_fn(void *private)
+{
+	struct kvm_kern_apic *apic = private;
+	int restart_timer = 0;
+
+	spin_lock_bh(&apic->lock);
+	restart_timer = __apic_timer_fn(apic);
+	spin_unlock_bh(&apic->lock);
+
+	return restart_timer;
+}
+
+/*
+ *----------------------------------------------------------------------
+ * IRQDEVICE interface
+ *----------------------------------------------------------------------
+ */
+
+static int apic_irqdev_ack(struct kvm_irqdevice *this, int *vector)
+{
+	struct kvm_kern_apic *apic = (struct kvm_kern_apic*)this->private;
+	int ret = 0;
+	int irq;
+
+	apic_debug("LAPIC ACK attempt\n");
+
+	spin_lock_bh(&apic->lock);
+
+	if (!apic_enabled(apic))
+		goto out;
+
+	if (vector) {
+		irq = apic_find_highest_irr(apic);
+		if ((irq & 0xf0) > apic_get_reg(apic, APIC_PROCPRI)) {
+			BUG_ON(irq < 0x10);
+
+			__set_bit(irq, apic->regs + APIC_ISR);
+			__clear_bit(irq, apic->regs + APIC_IRR);
+			apic_update_ppr(apic);
+
+			/*
+			 * We have to special case the timer interrupt
+			 * because we want the vector to stay pending
+			 * for each tick of the clock, even for a backlog.
+			 * Therefore, if this was a timer vector and we
+			 * still have ticks pending, keep IRR set
+			 */
+			if (irq == apic_lvt_vector(apic, APIC_LVTT)) {
+				BUG_ON(!apic->timer.pending);
+				apic->timer.pending--;
+				if (apic->timer.pending)
+					__set_bit(irq, apic->regs + APIC_IRR);
+			}
+
+			ret |= KVM_IRQACK_VALID;
+			*vector = irq;
+		} else
+			*vector = -1;
+
+		apic_debug("ACK for vector %d\n", *vector);
+	}
+
+	/*
+	 * Read it again to see if anything is still pending above TPR
+	 */
+	irq = apic_find_highest_irr(apic);
+	if ((irq & 0xf0) > apic_get_reg(apic, APIC_PROCPRI))
+		ret |= KVM_IRQACK_AGAIN;
+	else {
+		/*
+		 * See if there is anything masked by TPR
+		 */
+		irq = find_first_bit(apic->regs + APIC_IRR,
+				     MAX_APIC_INT_VECTOR/8);
+		if (irq &&
+		    ((irq & 0xf0) <= apic_get_reg(apic, APIC_PROCPRI))) {
+			ret |= KVM_IRQACK_TPRMASK;
+		}
+	}
+
+ out:
+	spin_unlock_bh(&apic->lock);
+
+	return ret;
+}
+
+static int apic_irqdev_set_pin(struct kvm_irqdevice *this, int irq, int level)
+{
+	struct kvm_kern_apic *apic = (struct kvm_kern_apic*)this->private;
+	int lvt = 0;
+
+	spin_lock_bh(&apic->lock);
+
+	if (!apic_enabled(apic)) {
+		/*
+		 * If the LAPIC is disabled, we simply forward the interrupt
+		 * on to the output line
+		 */
+		__apic_accept_irq(apic, APIC_DM_EXTINT, 0, level, 1);
+		goto out;
+	}
+
+	/*
+	 * pin "0" is LINT0, and "1" is LINT1
+	 */
+	BUG_ON(irq > 1);
+
+	switch (irq) {
+	case 0:
+		lvt = APIC_LVT0;
+		break;
+	case 1:
+		lvt = APIC_LVT1;
+		break;
+	}
+
+	if (apic_lvt_enabled(apic, lvt))
+		__apic_accept_irq(apic,
+				  apic_lvt_dm(apic, lvt),
+				  apic_lvt_vector(apic, lvt),
+				  level,
+				  1);
+
+
+ out:
+	spin_unlock_bh(&apic->lock);
+
+	return 0;
+}
+
+static void apic_irqdev_destructor(struct kvm_irqdevice *this)
+{
+	struct kvm_kern_apic *apic = (struct kvm_kern_apic*)this->private;
+
+	apic_dropref(apic);
+}
+
+static void apic_irqdev_register(struct kvm_kern_apic *apic,
+				 struct kvm_irqdevice *dev)
+{
+	dev->ack         = apic_irqdev_ack;
+	dev->set_pin     = apic_irqdev_set_pin;
+	dev->destructor  = apic_irqdev_destructor;
+
+	dev->private = apic;
+	atomic_inc(&apic->ref_count);
+
+	apic->irq_dev = dev;
+}
+
+int kvm_lapic_init(struct kvm_vcpu *vcpu,
+		   struct kvm_irqdevice *irq_dev, int flags)
+{
+	struct kvm_kern_apic *apic = NULL;
+	struct kvm_io_device *mmio_dev = NULL;
+
+	ASSERT(vcpu != NULL);
+	apic_debug("apic_init %d\n", vcpu_slot(vcpu));
+
+	apic = kzalloc(sizeof(*apic), GFP_KERNEL);
+	if (!apic)
+		goto nomem;
+
+	spin_lock_init(&apic->lock);
+	atomic_inc(&apic->ref_count);
+	apic->vcpu_id = vcpu_slot(vcpu);
+
+	apic->regs_page = alloc_page(GFP_KERNEL);
+	if (apic->regs_page == NULL) {
+		printk(KERN_ALERT "malloc apic regs error for vcpu %x\n",
+		       vcpu_slot(vcpu));
+		goto nomem;
+	}
+	apic->regs = page_address(apic->regs_page);
+	memset(apic->regs, 0, PAGE_SIZE);
+
+	apic->vcpu = vcpu;
+	vcpu->apic.dev = apic;
+
+	if (!(flags & KVM_LAPIC_OPTION_USERMODE)) {
+		apic_irqdev_register(apic, irq_dev);
+		apic_mmio_register(apic);
+	} else
+		apic->usermode = 1;
+
+	kvm_apictimer_init(&apic->timer.dev);
+	apic->timer.dev.function = apic_timer_fn;
+	apic->timer.dev.private  = apic;
+
+	kvm_lapic_reset(vcpu);
+	return 0;
+
+ nomem:
+	if (mmio_dev)
+		kfree(mmio_dev);
+
+	if (apic)
+		apic_dropref(apic);
+
+	return -ENOMEM;
+}
+
+void kvm_lapic_destroy(struct kvm_vcpu *vcpu)
+{
+	struct kvm_kern_apic *apic = vcpu->apic.dev;
+
+	if (vcpu->apic.mmio)
+		kvm_iodevice_destructor(vcpu->apic.mmio);
+
+	apic_dropref(apic);
+}
+
+
+
+
+
diff --git a/drivers/kvm/svm.c b/drivers/kvm/svm.c
index a73232b..42d662e 100644
--- a/drivers/kvm/svm.c
+++ b/drivers/kvm/svm.c
@@ -570,9 +570,6 @@ static int svm_create_vcpu(struct kvm_vcpu *vcpu)
 
 	fx_init(vcpu);
 	vcpu->fpu_active = 1;
-	vcpu->apic_base = 0xfee00000 |
-			/*for vcpu 0*/ MSR_IA32_APICBASE_BSP |
-			MSR_IA32_APICBASE_ENABLE;
 
 	return 0;
 
@@ -1413,6 +1410,16 @@ static int do_intr_requests(struct kvm_vcpu *vcpu,
 			       "handled yet\n");
 			__clear_bit(pin, &vcpu->irq.pending);
 			break;
+		case kvm_irqpin_nmi:
+			/*
+			 * FIXME: Someday we will handle this using the
+			 * SVM-specific NMI features.  For now, just inject
+			 * the NMI as a standard interrupt on vector 2
+			 */
+			r |= KVM_IRQACK_VALID;
+			__clear_bit(pin, &vcpu->irq.pending);
+			irq = 2;
+			break;
 		default:
 			panic("KVM: unknown interrupt pin raised: %d\n", pin);
 			break;
@@ -1458,14 +1465,13 @@ static void do_interrupt_requests(struct kvm_vcpu *vcpu,
 		switch (pin) {
 		case kvm_irqpin_localint:
 		case kvm_irqpin_extint:
+		case kvm_irqpin_nmi:
 			handled = do_intr_requests(vcpu, kvm_run, pin);
 			break;
 		case kvm_irqpin_smi:
-		case kvm_irqpin_nmi:
 			/* ignored (for now) */
 			printk(KERN_WARNING
-			       "KVM: dropping unhandled SMI/NMI: %d\n",
-			       pin);
+			       "KVM: dropping unhandled SMI\n");
 			__clear_bit(pin, &vcpu->irq.pending);
 			break;
 		case kvm_irqpin_invalid:
@@ -1487,8 +1493,8 @@ static void post_kvm_run_save(struct kvm_vcpu *vcpu,
 		(vcpu->interrupt_window_open &&
 		 !kvm_vcpu_irq_pending(vcpu));
 	kvm_run->if_flag = (vcpu->svm->vmcb->save.rflags & X86_EFLAGS_IF) != 0;
-	kvm_run->cr8 = vcpu->cr8;
-	kvm_run->apic_base = vcpu->apic_base;
+	kvm_run->cr8 = kvm_lapic_get_tpr(vcpu);
+	kvm_run->apic_base = kvm_lapic_get_base(vcpu);
 }
 
 /*
diff --git a/drivers/kvm/userint.c b/drivers/kvm/userint.c
index eeac005..c842737 100644
--- a/drivers/kvm/userint.c
+++ b/drivers/kvm/userint.c
@@ -219,6 +219,12 @@ int kvm_user_irqdev_restore(struct kvm_irqdevice *this, void *data)
 
 int kvm_userint_init(struct kvm_vcpu *vcpu)
 {
-	return kvm_user_irqdev_init(&vcpu->irq.dev);
+	int ret;
+
+	ret = kvm_user_irqdev_init(&vcpu->irq.dev);
+	if (ret < 0)
+		return ret;
+
+	return kvm_lapic_init(vcpu, NULL, KVM_LAPIC_OPTION_USERMODE);
 }
 
diff --git a/drivers/kvm/vmx.c b/drivers/kvm/vmx.c
index 1aac001..c8a3e60 100644
--- a/drivers/kvm/vmx.c
+++ b/drivers/kvm/vmx.c
@@ -1083,10 +1083,6 @@ static int vmx_vcpu_setup(struct kvm_vcpu *vcpu)
 
 	memset(vcpu->regs, 0, sizeof(vcpu->regs));
 	vcpu->regs[VCPU_REGS_RDX] = get_rdx_init_val();
-	vcpu->cr8 = 0;
-	vcpu->apic_base = 0xfee00000 |
-			/*for vcpu 0*/ MSR_IA32_APICBASE_BSP |
-			MSR_IA32_APICBASE_ENABLE;
 
 	fx_init(vcpu);
 
@@ -1327,9 +1323,9 @@ static int do_intr_requests(struct kvm_vcpu *vcpu,
 			r = kvm_vcpu_irq_pop(vcpu, &irq);
 			break;
 		case kvm_irqpin_extint:
-			printk(KERN_WARNING "KVM: external-interrupts not " \
-			       "handled yet\n");
-			__clear_bit(pin, &vcpu->irq.pending);
+			r = kvm_irqdevice_ack(&vcpu->kvm->isa_irq, &irq);
+			if (!(r & KVM_IRQACK_AGAIN))
+				__clear_bit(pin, &vcpu->irq.pending);
 			break;
 		case kvm_irqpin_nmi:
 			/*
@@ -1685,7 +1681,7 @@ static int handle_cr(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 			return 1;
 		case 8:
 			vcpu_load_rsp_rip(vcpu);
-			vcpu->regs[reg] = vcpu->cr8;
+			vcpu->regs[reg] = kvm_lapic_get_tpr(vcpu);
 			vcpu_put_rsp_rip(vcpu);
 			skip_emulated_instruction(vcpu);
 			return 1;
@@ -1782,8 +1778,8 @@ static void post_kvm_run_save(struct kvm_vcpu *vcpu,
 			      struct kvm_run *kvm_run)
 {
 	kvm_run->if_flag = (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) != 0;
-	kvm_run->cr8 = vcpu->cr8;
-	kvm_run->apic_base = vcpu->apic_base;
+	kvm_run->cr8 = kvm_lapic_get_tpr(vcpu);
+	kvm_run->apic_base = kvm_lapic_get_base(vcpu);
 	kvm_run->ready_for_interrupt_injection =
 		(vcpu->interrupt_window_open &&
 		 !kvm_vcpu_irq_pending(vcpu));
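
As an aside, the APIC_TDCR handler's bit-twiddling in the lapic patch above is easier to sanity-check in isolation.  A minimal user-space sketch (the helper name here is made up, not from the patch) reproduces the architectural divide table:

```c
#include <assert.h>

/*
 * Decode an APIC timer divide-configuration value the same way the
 * APIC_TDCR handler does: bits [1:0] and bit 3 of the register select
 * a power-of-two divisor, with the 0b1011 encoding meaning divide-by-1.
 */
static unsigned int tdcr_divide_count(unsigned int val)
{
	unsigned int tmp1 = val & 0xf;
	unsigned int tmp2 = ((tmp1 & 0x3) | ((tmp1 & 0x8) >> 1)) + 1;

	return 0x1 << (tmp2 & 0x7);
}
```

Running the architectural table (b3 b1 b0 -> 2, 4, 8, 16, 32, 64, 128, 1) through this matches the expected divisors, so the `(tmp2 & 0x7)` wrap-around is what makes the all-ones encoding come out as divide-by-1.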


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] Kernel side patches for in-kernel APIC
       [not found] ` <20070502212713.16738.8133.stgit-sLgBBP33vUGnsjUZhwzVf9HuzzzSOjJt@public.gmane.org>
                     ` (3 preceding siblings ...)
  2007-05-02 21:43   ` [PATCH 4/4] KVM: Add support for in-kernel LAPIC model Gregory Haskins
@ 2007-05-03 18:57   ` Nakajima, Jun
       [not found]     ` <8FFF7E42E93CC646B632AB40643802A8028B114D-1a9uaKK1+wJcIJlls4ac1rfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2007-05-06  7:45   ` Avi Kivity
  5 siblings, 1 reply; 28+ messages in thread
From: Nakajima, Jun @ 2007-05-03 18:57 UTC (permalink / raw)
  To: Gregory Haskins, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f


This sounds good, but when will this get incorporated in the tree?

The reason I'm asking is that I've got x64 Vista up on KVM as well (with
patches to EFER and CR8 handling), and I'm wondering if I should hold off,
because the code for CR8 handling would be different (and this in-kernel
APIC code would be preferable for that).  

Jun
---
Intel Open Source Technology Center

Gregory Haskins wrote:
> The following is my kernel side patch series for adding in-kernel
> APIC logic. By default, the code enters "level-0" mode and should be
> compatible with existing userspace.  I have a patch series for
> userspace which enables "level-1" mode which I will forward after
> this one. 
> 
> I have incorporated most of the feedback I have received to date. 
> There were a few things that I had initially agreed to do that you
> may find missing from the changes.  I found a few places where my
> original decisions made more sense to me than what I agreed to
> change, so I left them pending further discussion.  E.g.
> "kvm_irqpin_t" was going to change to "kvm_cpuirq_t" but I decided
> against it for reasons I can discuss if anyone is so inclined. 
> 
> level-0 has been tested with both 32 bit windows and 64-bit linux
> *before* I moved to git-HEAD.  They both worked without any
> discernable differences in behavior.
> 
> I then bumped up to git-HEAD and adjusted all my patches to get ready
> for submission.  Unfortunately I seem to have run into a
> (known/unknown) regression(*) in the KVM codebase with that update
> where things aren't working quite right.  What I did confirm was that
> the system behaves the same both with and without my patches for both
> level-0 and level-1 behavior.  It is now at a point where testers
> could start to look at my patches and provide bug/performance
> feedback in addition to the code review comments.  All are welcome.
> 
> (*) I see a guest-crash exception very early on in 64 bit Ubuntu 6.x,
> and I get a blank screen in the SLED-10 64 bit graphical screen.  I
> have seen some emails floating around potentially talking about some
> screen drawing issues, so perhaps it's another occurrence of that. In
> any case, I have been able to get quite a significant amount of
> testing/bug-fixing done just with the SLED-10 installer's kernel
> booting, so I think things are relatively working (even in level-1
> mode).  To reiterate, I see the exact same behavior in both git-HEAD and
> with my patches, so it's not necessarily anything I have done. 
> Otherwise, I would not request testing help until it was reasonably
> stable.  
> 
> Signed-off-by: Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>



* Re: [PATCH 0/4] Kernel side patches for in-kernel APIC
       [not found]     ` <8FFF7E42E93CC646B632AB40643802A8028B114D-1a9uaKK1+wJcIJlls4ac1rfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2007-05-03 21:18       ` Gregory Haskins
       [not found]         ` <463A1935.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Gregory Haskins @ 2007-05-03 21:18 UTC (permalink / raw)
  To: Jun Nakajima, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

>>> On Thu, May 3, 2007 at  2:57 PM, in message
<8FFF7E42E93CC646B632AB40643802A8028B114D-1a9uaKK1+wJcIJlls4ac1rfspsVTdybXVpNB7YpNyf8@public.gmane.org>,
"Nakajima, Jun" <jun.nakajima-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote: 

> This sounds good, but when will this get incorporated in the tree?
> 
> The reason I'm asking is that I've got X64 Vista up on KVM as well (with
> patches to EFER and CR8 handling), and I'm wondering if I should hold on
> because the code for CR8 handling would be different (and this in-kernel
> APIC code would be preferable for that).  

While I can't comment on when it goes into the tree (that's the Qumranet guys' call), additional testing will help push the ball forward.  So if you have some spare cycles, please give it a whirl and let us know what you find.  

Note that I have found and fixed a few issues in the patches since I sent the mail yesterday (including a nasty host-kernel oops during level-0 shutdown).  However, in the interest of not spamming the list more than once a week with a relatively large patch queue, I didn't send them out. ;) If you're interested in testing, email me directly and I will get you the latest.

Also, please share your APIC-related patches (if you can).  If it's something applicable to my work, I will see about getting it integrated.

Regards,
-Greg

 



* Re: [PATCH 0/4] Kernel side patches for in-kernel APIC
       [not found] ` <20070502212713.16738.8133.stgit-sLgBBP33vUGnsjUZhwzVf9HuzzzSOjJt@public.gmane.org>
                     ` (4 preceding siblings ...)
  2007-05-03 18:57   ` [PATCH 0/4] Kernel side patches for in-kernel APIC Nakajima, Jun
@ 2007-05-06  7:45   ` Avi Kivity
  5 siblings, 0 replies; 28+ messages in thread
From: Avi Kivity @ 2007-05-06  7:45 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Gregory Haskins wrote:
> The following is my kernel side patch series for adding in-kernel APIC logic.
> By default, the code enters "level-0" mode and should be compatible with
> existing userspace.  I have a patch series for userspace which enables
> "level-1" mode which I will forward after this one.
>
> I have incorporated most of the feedback I have received to date.  There were
> a few things that I had initially agreed to do that you may find missing from
> the changes.  I found a few places where my original decisions made more sense
> to me than what I agreed to change, so I left them pending further
> discussion.  E.g. "kvm_irqpin_t" was going to change to "kvm_cpuirq_t" but I
> decided against it for reasons I can discuss if anyone is so inclined.
>   

Please do (it doesn't really matter; just curious).  Also, in the future 
please provide a changelog relative to previous patches so that it is 
easier to review.  It's fairly difficult to keep track of things with a 
long release cycle.


> level-0 has been tested with both 32 bit windows and 64-bit linux *before* I
> moved to git-HEAD.  They both worked without any discernable differences in
> behavior.
>
> I then bumped up to git-HEAD and adjusted all my patches to get ready for
> submission.  Unfortunately I seem to have run into a (known/unknown)
> regression(*) in the KVM codebase with that update where things arent working
> quite right.  What I did confirm was that the system behaves the same both
> with and without my patches for both level-0 and level-1 behavior.  It is now
> at a point where testers could start to look at my patches and provide
> bug/performance feedback in addition to the code review comments.  All are
> welcome.  
>   

Yes, we did have some regressions; I think I fixed all of them, but of 
course I can't tell for sure.  Please try with the latest HEAD, and let 
me know if you still see problems.

I'll review once I manage to get kvm-22 out; probably Monday or 
Tuesday.  Sorry about the delay.

Incidentally, the latest changes should make the performance advantages 
of the in-kernel lapic more noticeable (or maybe, noticeable at all), as 
the difference between guest->kernel switch times and guest->userspace 
switch times is increasing.



-- 
error compiling committee.c: too many arguments to function




* Re: [PATCH 0/4] Kernel side patches for in-kernel APIC
       [not found]         ` <463A1935.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
@ 2007-05-06  7:49           ` Avi Kivity
       [not found]             ` <463D8887.5020007-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Avi Kivity @ 2007-05-06  7:49 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Gregory Haskins wrote:
>>>> On Thu, May 3, 2007 at  2:57 PM, in message
>>>>         
> <8FFF7E42E93CC646B632AB40643802A8028B114D-1a9uaKK1+wJcIJlls4ac1rfspsVTdybXVpNB7YpNyf8@public.gmane.org>,
> "Nakajima, Jun" <jun.nakajima-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote: 
>
>   
>> This sounds good, but when will this get incorporated in the tree?
>>
>> The reason I'm asking is that I've got X64 Vista up on KVM as well (with
>> patches to EFER and CR8 handling), and I'm wondering if I should hold on
>> because the code for CR8 handling would be different (and this in-kernel
>> APIC code would be preferable for that).  
>>     
>
> While I can't comment on when it goes into the tree (that's the Qumranet guys' call), additional testing will help push the ball forward.  So if you have some spare cycles, please give it a whirl and let us know what you find.  
>
> Note that I have found/fixed a few issues in the patches since I sent the mail yesterday (including a nasty host kernel OOPS during level-0 shutdown).  However, in the interest of trying not to spam the list more than once a week with a relatively large patch queue, I didn't send it out. ;) If interested in testing, email me directly and I will get you the latest.
>   

Please do post (perhaps just the incremental).  People can ignore it if 
they're not interested.

By the way, Windows is by far the guest most sensitive to apic issues, 
and has the most to gain, so testing with it is quite important.

> Also, please share your APIC-related patches (if you can).  If it's something that is applicable to my work, I will see if I can get it integrated in.
>   

Yes please.  I'd also like to keep the old model working so we can 
isolate problems more easily.


-- 
error compiling committee.c: too many arguments to function




* Re: [PATCH 1/4] KVM: Adds support for in-kernel mmio handlers
       [not found]     ` <20070502214315.16738.68984.stgit-sLgBBP33vUGnsjUZhwzVf9HuzzzSOjJt@public.gmane.org>
@ 2007-05-07  9:30       ` Avi Kivity
       [not found]         ` <463EF198.1050303-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Avi Kivity @ 2007-05-07  9:30 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Gregory Haskins wrote:
>  
> +
> +void kvm_io_bus_register_dev(struct kvm_io_bus *bus, struct kvm_io_device *dev)
> +{
> +	BUG_ON(bus->dev_count >= (NR_IOBUS_DEVS-1));
> +
> +	bus->devs[bus->dev_count++] = dev;
> +}
> +
>   

Off by one?
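
For reference, a standalone sketch of what I mean (names and the NR_IOBUS_DEVS value here are simplified, not taken from the tree): the posted `BUG_ON(bus->dev_count >= (NR_IOBUS_DEVS-1))` trips while one slot in devs[] is still unused, whereas testing against NR_IOBUS_DEVS fills the array exactly.

```c
#include <assert.h>

#define NR_IOBUS_DEVS 6	/* illustrative size, not the tree's value */

struct kvm_io_bus {
	int dev_count;
	void *devs[NR_IOBUS_DEVS];
};

/* Register a device; returns 0 on success, -1 when the bus is full. */
static int io_bus_register_dev(struct kvm_io_bus *bus, void *dev)
{
	/*
	 * Testing dev_count >= NR_IOBUS_DEVS - 1 (as posted) would
	 * reject a registration while devs[] still has a free slot;
	 * >= NR_IOBUS_DEVS uses the whole array.
	 */
	if (bus->dev_count >= NR_IOBUS_DEVS)
		return -1;
	bus->devs[bus->dev_count++] = dev;
	return 0;
}
```

With the corrected check, exactly NR_IOBUS_DEVS registrations succeed and the next one fails.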


Otherwise ok.


-- 
error compiling committee.c: too many arguments to function




* Re: [PATCH 2/4] KVM: Add irqdevice object
       [not found]     ` <20070502214320.16738.21505.stgit-sLgBBP33vUGnsjUZhwzVf9HuzzzSOjJt@public.gmane.org>
@ 2007-05-07  9:42       ` Avi Kivity
       [not found]         ` <463EF493.9070300-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Avi Kivity @ 2007-05-07  9:42 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Gregory Haskins wrote:
> +/**
> + * kvm_irqdevice_ack - read and ack the highest priority vector from the device
> + * @dev: The device
> + * @vector: Retrieves the highest priority pending vector
> + *                [ NULL = Don't ack a vector, just check pending status]
> + *                [ non-NULL = Pointer to receive vector data (out only)]
> + *
> + * Description: Read the highest priority pending vector from the device,
> + *              potentially invoking auto-EOI depending on device policy
> + *
> + * Returns: (int)
> + *   [ -1 = failure]
> + *   [>=0 = bitmap as follows: ]
> + *         [ KVM_IRQACK_VALID   = vector is valid]
> + *         [ KVM_IRQACK_AGAIN   = more unmasked vectors are available]
> + *         [ KVM_IRQACK_TPRMASK = TPR masked vectors are blocked]
> + */
> +static inline int kvm_irqdevice_ack(struct kvm_irqdevice *dev,
> +					    int *vector)
> +{
> +	return dev->ack(dev, vector);
> +}
>   

I still don't like this, but have nothing better to offer.
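For readers following the quoted kernel-doc, the return bitmap can be exercised with a toy irqdevice.  The KVM_IRQACK_* bit values below are stand-ins (the real ones are defined in the patch), and `toy_ack()` is a hypothetical backend used only to show the caller's drain loop:

```c
#include <assert.h>

/* Illustrative stand-ins for the patch's KVM_IRQACK_* bits. */
enum {
	KVM_IRQACK_VALID   = 1 << 0,	/* *vector holds a valid vector */
	KVM_IRQACK_AGAIN   = 1 << 1,	/* more unmasked vectors pending */
	KVM_IRQACK_TPRMASK = 1 << 2,	/* TPR is blocking some vectors */
};

/* Toy device: acks vectors from a fixed two-entry queue. */
static int toy_queue[] = { 0x31, 0x40 };
static int toy_head;

static int toy_ack(int *vector)
{
	int ret;

	if (toy_head >= 2)
		return -1;			/* failure: nothing pending */
	if (vector)
		*vector = toy_queue[toy_head++];/* non-NULL: consume vector */
	ret = KVM_IRQACK_VALID;
	if (toy_head < 2)
		ret |= KVM_IRQACK_AGAIN;	/* more work for the caller */
	return ret;
}

/* Drain loop a caller might use: keep acking while AGAIN is set.
 * Returns the number of vectors consumed. */
static int drain(void)
{
	int v, n = 0, ret;

	do {
		ret = toy_ack(&v);
		if (ret < 0 || !(ret & KVM_IRQACK_VALID))
			break;
		n++;
	} while (ret & KVM_IRQACK_AGAIN);
	return n;
}
```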


> +static void do_interrupt_requests(struct kvm_vcpu *vcpu,
> +				  struct kvm_run *kvm_run)
> +{
> +	int pending = __kvm_vcpu_irq_all_pending(vcpu);
> +	int handled = 0;
> +
> +	while (pending && !handled) {
> +		kvm_irqpin_t pin = __fls(pending);
> +
> +		switch (pin) {
> +		case kvm_irqpin_localint:
> +		case kvm_irqpin_extint:
> +		case kvm_irqpin_nmi:
> +			handled = do_intr_requests(vcpu, kvm_run, pin);
> +			break;
> +		case kvm_irqpin_smi:
> +			/* ignored (for now) */
> +			printk(KERN_WARNING
> +			       "KVM: dropping unhandled SMI\n");
> +			__clear_bit(pin, &vcpu->irq.pending);
> +			break;
> +		case kvm_irqpin_invalid:
> +			/* drop */
> +			break;
> +		default:
> +			panic("KVM: unknown interrupt pin raised: %d\n", pin);
> +			break;
> +		}
> +
> +		__clear_bit(pin, &pending);
> +	}
>  }
>   

This code can be shared with the svm function of the same name and moved 
to kvm_main, no?  You'll need a new arch op for do_intr_requests().


Otherwise ok.

-- 
error compiling committee.c: too many arguments to function



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/4] KVM: Adds ability to preempt an executing VCPU
       [not found]     ` <20070502214325.16738.42702.stgit-sLgBBP33vUGnsjUZhwzVf9HuzzzSOjJt@public.gmane.org>
@ 2007-05-07  9:57       ` Avi Kivity
       [not found]         ` <463EF7F4.9020106-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Avi Kivity @ 2007-05-07  9:57 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Gregory Haskins wrote:
> The VCPU executes synchronously w.r.t. userspace today, and therefore
> interrupt injection is pretty straightforward.  However, we will soon need
> to be able to inject interrupts asynchronous to the execution of the VCPU
> due to the introduction of SMP, paravirtualized drivers, and asynchronous
> hypercalls.  This patch adds support to the interrupt mechanism to force
> a VCPU to VMEXIT when a new interrupt is pending.
>
> Signed-off-by: Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
> ---
>
>  drivers/kvm/kvm.h      |    5 +++
>  drivers/kvm/kvm_main.c |   74 +++++++++++++++++++++++++++++++++++++++++++++++-
>  drivers/kvm/svm.c      |   43 ++++++++++++++++++++++++++++
>  drivers/kvm/vmx.c      |   43 ++++++++++++++++++++++++++++
>  4 files changed, 164 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
> index d41d653..15c8bec 100644
> --- a/drivers/kvm/kvm.h
> +++ b/drivers/kvm/kvm.h
> @@ -321,6 +321,8 @@ void kvm_io_bus_register_dev(struct kvm_io_bus *bus,
>  
>  #define NR_IRQ_WORDS KVM_IRQ_BITMAP_SIZE(unsigned long)
>  
> +#define KVM_SIGNAL_VIRTUAL_INTERRUPT 33 /* Hardcoded for now */
> +
>  /*
>   * structure for maintaining info for interrupting an executing VCPU
>   */
> @@ -329,6 +331,9 @@ struct kvm_vcpu_irq {
>  	struct kvm_irqdevice dev;
>  	int                  pending;
>  	int                  deferred;
> +	struct task_struct  *task;
> +	int                  signo;
> +	int                  guest_mode;
>  };
>  
>  struct kvm_vcpu {
> diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
> index 9aeb2f7..6acbd9b 100644
> --- a/drivers/kvm/kvm_main.c
> +++ b/drivers/kvm/kvm_main.c
> @@ -304,6 +304,10 @@ static struct kvm *kvm_create_vm(void)
>  		memset(&vcpu->irq, 0, sizeof(vcpu->irq));
>  		spin_lock_init(&vcpu->irq.lock);
>  		vcpu->irq.deferred = -1;
> +		/*
> +		 * This should be settable by userspace someday
> +		 */
> +		vcpu->irq.signo = KVM_SIGNAL_VIRTUAL_INTERRUPT;
>   

This needs to be fixed prior to merging.  Hopefully not by setting the 
signal number, but by making the vcpu fd writable (userspace can attach 
a signal to the fd if it wishes).

>  
>  		vcpu->cpu = -1;
>  		vcpu->kvm = kvm;
> @@ -366,13 +370,20 @@ static void free_pio_guest_pages(struct kvm_vcpu *vcpu)
>  
>  static void kvm_free_vcpu(struct kvm_vcpu *vcpu)
>  {
> +	unsigned long irqsave;
> +
>  	if (!vcpu->vmcs)
>  		return;
>  
>  	vcpu_load(vcpu);
>  	kvm_mmu_destroy(vcpu);
>  	vcpu_put(vcpu);
> +
> +	spin_lock_irqsave(&vcpu->irq.lock, irqsave);
> +	vcpu->irq.task = NULL;
> +	spin_unlock_irqrestore(&vcpu->irq.lock, irqsave);
>   

Can irq.task be non-NULL here at all?  Also, we only free vcpus when we 
destroy the vm, and paravirt drivers would hopefully hold a ref to the 
vm, so there's nobody to race against here.

>  	kvm_irqdevice_destructor(&vcpu->irq.dev);
>  
> @@ -1868,6 +1880,10 @@ static int kvm_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
>  		kvm_arch_ops->decache_regs(vcpu);
>  	}
>  
> +	spin_lock_irqsave(&vcpu->irq.lock, irqsaved);
> +	vcpu->irq.task = current;
> +	spin_unlock_irqrestore(&vcpu->irq.lock, irqsaved);
> +
>   

Just assignment + __smp_wmb().

> +/*
>   * This function will be invoked whenever the vcpu->irq.dev raises its INTR
>   * line
>   */
> @@ -2318,10 +2348,52 @@ static void kvm_vcpu_intr(struct kvm_irqsink *this,
>  {
>  	struct kvm_vcpu *vcpu = (struct kvm_vcpu*)this->private;
>  	unsigned long flags;
> +	int direct_ipi = -1;
>  
>  	spin_lock_irqsave(&vcpu->irq.lock, flags);
> -	__set_bit(pin, &vcpu->irq.pending);
> +
> +	if (!test_bit(pin, &vcpu->irq.pending)) {
> +		/*
> +		 * Record the change..
> +		 */
> +		__set_bit(pin, &vcpu->irq.pending);
> +
> +		/*
> +		 * then wake up the vcpu (if necessary)
> +		 */
> +		if (vcpu->irq.task && (vcpu->irq.task != current)) {
> +			if (vcpu->irq.guest_mode) {
> +				/*
> +				 * If we are in guest mode, we can optimize
> +				 * the IPI by executing a function directly
> +				 * on the owning processor.
> +				 */
> +				direct_ipi = task_cpu(vcpu->irq.task);
> +				BUG_ON(direct_ipi == smp_processor_id());
> +			} else
> +				/*
> +				 * otherwise, we must assume that we could be
> +				 * blocked anywhere, including userspace. Send
> +				 * a signal to give everyone a chance to get
> +				 * notification
> +				 */
> +				send_sig(vcpu->irq.signo, vcpu->irq.task, 0);
> +		}
> +	}
> +
>  	spin_unlock_irqrestore(&vcpu->irq.lock, flags);
> +
> +	if (direct_ipi != -1) {
> +		/*
> +		 * Not sure if disabling preemption is needed.
> +		 * The kick_process() code does this so I copied it
> +		 */
> +		preempt_disable();
> +		smp_call_function_single(direct_ipi,
> +					 kvm_vcpu_guest_intr,
> +					 vcpu, 0, 0);
> +		preempt_enable();
> +	}
>   

I see why you must issue the IPI outside the spin_lock_irqsave(), but 
aren't you now opening a race?  vcpu enters guest mode, irq on other 
cpu, irq sets direct_ipi to wakeup guest, releases lock, vcpu exits to 
userspace (or migrates to another cpu), ipi is issued but nobody cares.


>  	/*
> +	 * Signal that we have transitioned back to host mode
> +	 */
> +	spin_lock_irqsave(&vcpu->irq.lock, irq_flags);
> +	vcpu->irq.guest_mode = 0;
> +	spin_unlock_irqrestore(&vcpu->irq.lock, irq_flags);
> +
>   

Assign + __smp_wmb().

> +	/*
> +	 * Signal that we have transitioned back to host mode
> +	 */
> +	spin_lock_irqsave(&vcpu->irq.lock, irq_flags);
> +	vcpu->irq.guest_mode = 0;
> +	spin_unlock_irqrestore(&vcpu->irq.lock, irq_flags);
>   

Again.


-- 
error compiling committee.c: too many arguments to function



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] KVM: Add support for in-kernel LAPIC model
       [not found]     ` <20070502214330.16738.51436.stgit-sLgBBP33vUGnsjUZhwzVf9HuzzzSOjJt@public.gmane.org>
@ 2007-05-07 10:17       ` Avi Kivity
       [not found]         ` <463EFCC5.8050408-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Avi Kivity @ 2007-05-07 10:17 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Gregory Haskins wrote:
> +
> +struct kvm_kernint {
> +	struct kvm_vcpu              *vcpu;
> +	struct kvm_irqdevice         *self_irq;
> +	struct kvm_irqdevice         *ext_irq;
> +	struct kvm_irqdevice          apic_irq;
> +
> +};
>   

Can you elaborate on what these are? I sort of understand ext_irq and 
apic_irq, but what is self_irq? And why have the vcpu member?

> +
> +static struct kvm_irqdevice *get_irq_dev(struct kvm_kernint *s)
> +{
> +	struct kvm_irqdevice *dev;
> +
> +	if (kvm_lapic_enabled(s->vcpu))
> +		dev = &s->apic_irq;
> +	else
> +		dev = s->ext_irq;
> +
> +	if (!dev)
> +		kvm_crash_guest(s->vcpu->kvm);
>   

Can this happen? Doesn't there always have to be an external interrupt 
controller when the lapic is disabled?

> +
> +static void kvm_ext_intr(struct kvm_irqsink *this,
> +			 struct kvm_irqdevice *dev,
> +			 kvm_irqpin_t pin)
> +{
> +	struct kvm_kernint *s = (struct kvm_kernint*)this->private;
> +
> +	/*
> +	 * If the EXTINT device sent us an interrupt, forward it to the LINT0
> +	 * pin of the LAPIC
> +	 */
> +	if (pin != kvm_irqpin_localint)
> +	    return;
>   

Whitespace.

> +
> +	/*
> +	 * "irq 0" = LINT0, 1 = LINT1
> +	 */
> +	kvm_irqdevice_set_pin(&s->apic_irq, 0, 1);
> +}
> +
> +int kvm_kernint_init(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_irqdevice *irqdev = &vcpu->irq.dev;
> +	struct kvm_kernint *s;
> +
> +	s = kzalloc(sizeof(*s), GFP_KERNEL);
> +	if (!s)
> +	    return -ENOMEM;
>   

Whitespace.

> +
> +	s->vcpu = vcpu;
> +
> +	/*
> +	 * Configure the irqdevice interface
> +	 */
> +	irqdev->ack         = kernint_irqdev_ack;
> +	irqdev->set_pin     = kernint_irqdev_set_pin;
> +	irqdev->destructor  = kernint_irqdev_destructor;
> +
> +	irqdev->private = s;
> +	s->self_irq = irqdev;
> +
> +	/*
> +	 * Configure the EXTINT device if this is the BSP processor
> +	 */
> +	if (!vcpu_slot(vcpu)) {
> +		struct kvm_irqsink sink = {
> +			.set_intr   = kvm_ext_intr,
> +			.private    = s
> +		};
> +
> +		s->ext_irq = &vcpu->kvm->isa_irq;
> +		kvm_irqdevice_register_sink(s->ext_irq, &sink);
> +	}
> +
> +	/*
> +	 * Configure the LAPIC device
> +	 */
> +	kvm_irqdevice_init(&s->apic_irq);
> +
> +	{
> +		struct kvm_irqsink sink = {
> +			.set_intr   = kvm_apic_intr,
> +			.private    = s
> +		};
> +
> +		kvm_irqdevice_register_sink(&s->apic_irq, &sink);
> +	}
>   

Just move the variable definition to the top to avoid the weird indentation.


> +
> +/*
> + *-----------------------------------------------------------------------
> + * KERNEL-TIMERS
> + *
> + * Unfortunately we really need HRTIMERs to do this right, but they might
> + * not exist on older kernels (sigh).  So we roughly abstract this interface
> + * to support nanosecond resolution, and then emulate it as best we can if
> + * the real HRTIMERs aren't available
> + *-----------------------------------------------------------------------
> + */
>   

The correct way to use non-hrtimer kernels is to write the hrtimer api 
in terms of the old api, in external-module-compat.h.

> +
> +struct kvm_apic_timer {
> +	int (*function)(void *private);
> +	void *private;
> +#ifdef KVM_NO_HRTIMER
> +	struct timer_list dev;
> +#else
> +	struct hrtimer dev;
> +#endif
> +};
>   

... for example:
#define hrtimer timer_list

I'd just drop it for now; it makes the code hard to read. We can re-add 
it later.


[I didn't audit the lapic code]

Where's vcpu->cr8 gone?


-- 
error compiling committee.c: too many arguments to function



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] KVM: Add support for in-kernel LAPIC model
       [not found]         ` <463EFCC5.8050408-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-05-07 10:22           ` Avi Kivity
  2007-05-07 15:10           ` Gregory Haskins
  1 sibling, 0 replies; 28+ messages in thread
From: Avi Kivity @ 2007-05-07 10:22 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Avi Kivity wrote:
>
>   
>> +
>> +/*
>> + *-----------------------------------------------------------------------
>> + * KERNEL-TIMERS
>> + *
>> + * Unfortunately we really need HRTIMERs to do this right, but they might
>> + * not exist on older kernels (sigh).  So we roughly abstract this interface
>> + * to support nanosecond resolution, and then emulate it as best we can if
>> + * the real HRTIMERs aren't available
>> + *-----------------------------------------------------------------------
>> + */
>>   
>>     
>
> The correct way to use non-hrtimer kernels is to write the hrtimer api 
> in terms of the old api, in external-module-compat.h.
>
>   

An alternative is to just put a patch adding the #ifdef and the 
non-hrtimer case in the kernel/ directory, and to have 'make sync' apply 
the patch.  This is annoying to maintain but easier to generate.

-- 
error compiling committee.c: too many arguments to function



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] Kernel side patches for in-kernel APIC
       [not found]             ` <463D8887.5020007-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-05-07 14:36               ` Gregory Haskins
  0 siblings, 0 replies; 28+ messages in thread
From: Gregory Haskins @ 2007-05-07 14:36 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

>>> On Sun, May 6, 2007 at  3:49 AM, in message <463D8887.5020007-atKUWr5tajBWk0Htik3J/w@public.gmane.org>,
Avi Kivity <avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org> wrote: 
>
> Please do post (perhaps just the incremental).  People can ignore it if 
> they're not interested.

Will do.  As long as I am not annoying you guys, I'll send updates when they are significant or timely enough.





^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/4] KVM: Adds support for in-kernel mmio handlers
       [not found]         ` <463EF198.1050303-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-05-07 14:37           ` Gregory Haskins
  0 siblings, 0 replies; 28+ messages in thread
From: Gregory Haskins @ 2007-05-07 14:37 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

>>> On Mon, May 7, 2007 at  5:30 AM, in message <463EF198.1050303-atKUWr5tajBWk0Htik3J/w@public.gmane.org>,
Avi Kivity <avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org> wrote: 
> Gregory Haskins wrote:
>>  
>> +
>> +void kvm_io_bus_register_dev(struct kvm_io_bus *bus, struct kvm_io_device *dev)
>> +{
>> +	BUG_ON(bus->dev_count >= (NR_IOBUS_DEVS-1));
>> +
>> +	bus->devs[bus->dev_count++] = dev;
>> +}
>> +
>>   
> 
> Off by one?
> 
> 
> Otherwise ok.
> 


Good eyes.  Fixed


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/4] KVM: Add irqdevice object
       [not found]         ` <463EF493.9070300-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-05-07 14:39           ` Gregory Haskins
  0 siblings, 0 replies; 28+ messages in thread
From: Gregory Haskins @ 2007-05-07 14:39 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

>>> On Mon, May 7, 2007 at  5:42 AM, in message <463EF493.9070300-atKUWr5tajBWk0Htik3J/w@public.gmane.org>,
Avi Kivity <avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org> wrote: 

> 
>> +static void do_interrupt_requests(struct kvm_vcpu *vcpu,
>> +				  struct kvm_run *kvm_run)
>> +{
>> +	int pending = __kvm_vcpu_irq_all_pending(vcpu);
>> +	int handled = 0;
>> +
>> +	while (pending && !handled) {
>> +		kvm_irqpin_t pin = __fls(pending);
>> +
>> +		switch (pin) {
>> +		case kvm_irqpin_localint:
>> +		case kvm_irqpin_extint:
>> +		case kvm_irqpin_nmi:
>> +			handled = do_intr_requests(vcpu, kvm_run, pin);
>> +			break;
>> +		case kvm_irqpin_smi:
>> +			/* ignored (for now) */
>> +			printk(KERN_WARNING
>> +			       "KVM: dropping unhandled SMI\n");
>> +			__clear_bit(pin, &vcpu->irq.pending);
>> +			break;
>> +		case kvm_irqpin_invalid:
>> +			/* drop */
>> +			break;
>> +		default:
>> +			panic("KVM: unknown interrupt pin raised: %d\n", pin);
>> +			break;
>> +		}
>> +
>> +		__clear_bit(pin, &pending);
>> +	}
>>  }
>>   
> 
> This code can be shared with the svm function of the same name and moved 
> to kvm_main, no?  You'll need a new arch op for do_intr_requests().
> 

Originally I was going to do exactly what you are proposing, but later (as you will see with my refresh coming shortly) they start to diverge between VMX and SVM.  I will still be on the lookout for places where we can centralize, however.




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/4] KVM: Adds ability to preempt an executing VCPU
       [not found]         ` <463EF7F4.9020106-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-05-07 14:52           ` Gregory Haskins
       [not found]             ` <463F04D8.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
  2007-05-07 15:17           ` Gregory Haskins
  1 sibling, 1 reply; 28+ messages in thread
From: Gregory Haskins @ 2007-05-07 14:52 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

>>> On Mon, May 7, 2007 at  5:57 AM, in message <463EF7F4.9020106-atKUWr5tajBWk0Htik3J/w@public.gmane.org>,
Avi Kivity <avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org> wrote: 
> Gregory Haskins wrote:
>> The VCPU executes synchronously w.r.t. userspace today, and therefore
>> interrupt injection is pretty straightforward.  However, we will soon need
>> to be able to inject interrupts asynchronous to the execution of the VCPU
>> due to the introduction of SMP, paravirtualized drivers, and asynchronous
>> hypercalls.  This patch adds support to the interrupt mechanism to force
>> a VCPU to VMEXIT when a new interrupt is pending.
>>
>> Signed-off-by: Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
>> ---
>>
>>  drivers/kvm/kvm.h      |    5 +++
>>  drivers/kvm/kvm_main.c |   74 +++++++++++++++++++++++++++++++++++++++++++++++-
>>  drivers/kvm/svm.c      |   43 ++++++++++++++++++++++++++++
>>  drivers/kvm/vmx.c      |   43 ++++++++++++++++++++++++++++
>>  4 files changed, 164 insertions(+), 1 deletions(-)
>>
>> diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
>> index d41d653..15c8bec 100644
>> --- a/drivers/kvm/kvm.h
>> +++ b/drivers/kvm/kvm.h
>> @@ -321,6 +321,8 @@ void kvm_io_bus_register_dev(struct kvm_io_bus *bus,
>>  
>>  #define NR_IRQ_WORDS KVM_IRQ_BITMAP_SIZE(unsigned long)
>>  
>> +#define KVM_SIGNAL_VIRTUAL_INTERRUPT 33 /* Hardcoded for now */
>> +
>>  /*
>>   * structure for maintaining info for interrupting an executing VCPU
>>   */
>> @@ -329,6 +331,9 @@ struct kvm_vcpu_irq {
>>  	struct kvm_irqdevice dev;
>>  	int                  pending;
>>  	int                  deferred;
>> +	struct task_struct  *task;
>> +	int                  signo;
>> +	int                  guest_mode;
>>  };
>>  
>>  struct kvm_vcpu {
>> diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
>> index 9aeb2f7..6acbd9b 100644
>> --- a/drivers/kvm/kvm_main.c
>> +++ b/drivers/kvm/kvm_main.c
>> @@ -304,6 +304,10 @@ static struct kvm *kvm_create_vm(void)
>>  		memset(&vcpu->irq, 0, sizeof(vcpu->irq));
>>  		spin_lock_init(&vcpu->irq.lock);
>>  		vcpu->irq.deferred = -1;
>> +		/*
>> +		 * This should be settable by userspace someday
>> +		 */
>> +		vcpu->irq.signo = KVM_SIGNAL_VIRTUAL_INTERRUPT;
>>   
> 
> This needs to be fixed prior to merging.  Hopefully not by setting the 
> signal number, but by making the vcpu fd writable (userspace can attach 
> a signal to the fd if it wishes).
> 
>>  
>>  		vcpu->cpu = -1;
>>  		vcpu->kvm = kvm;
>> @@ -366,13 +370,20 @@ static void free_pio_guest_pages(struct kvm_vcpu *vcpu)
>>  
>>  static void kvm_free_vcpu(struct kvm_vcpu *vcpu)
>>  {
>> +	unsigned long irqsave;
>> +
>>  	if (!vcpu->vmcs)
>>  		return;
>>  
>>  	vcpu_load(vcpu);
>>  	kvm_mmu_destroy(vcpu);
>>  	vcpu_put(vcpu);
>> +
>> +	spin_lock_irqsave(&vcpu->irq.lock, irqsave);
>> +	vcpu->irq.task = NULL;
>> +	spin_unlock_irqrestore(&vcpu->irq.lock, irqsave);
>>   
> 
> Can irq.task be non-NULL here at all?  Also, we only free vcpus when we 
> destroy the vm, and paravirt drivers would hopefully hold a ref to the 
> vm, so there's nobody to race against here.

I am perhaps being a bit overzealous here.  What I found in practice is that the LVTT can screw things up on shutdown, so I was being pretty conservative on the synchronization here.  

> 
>>  	kvm_irqdevice_destructor(&vcpu->irq.dev);
>>  
>> @@ -1868,6 +1880,10 @@ static int kvm_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
>>  		kvm_arch_ops->decache_regs(vcpu);
>>  	}
>>  
>> +	spin_lock_irqsave(&vcpu->irq.lock, irqsaved);
>> +	vcpu->irq.task = current;
>> +	spin_unlock_irqrestore(&vcpu->irq.lock, irqsaved);
>> +
>>   
> 
> Just assignment + __smp_wmb().

(This comment applies to all of the subsequent reviews where memory barriers are recommended instead of locks:)

I can't quite wrap my head around whether all these critical sections are correct with just a barrier instead of a full-blown lock.  I would prefer to be conservative and leave them as locks for now.  Someone with better insight could make a second pass and optimize the locks into barriers where appropriate.  I am just uncomfortable doing it myself, since I cannot convince myself that I am not introducing races.  If you insist on making the changes before the code is accepted, OK.  Just note that I am not comfortable ;)
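For what it's worth, the transformation Avi is suggesting can be modeled in userspace with C11 atomics: the lock/unlock pair around a single flag store collapses to a plain store plus a write barrier, paired with a read barrier on the consumer side.  This is a sketch of the memory-ordering idea only; all names are invented, and the kernel would use `__smp_wmb()`/`smp_rmb()` rather than C11 atomics:

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int irq_pending;	/* data published by the writer */
static atomic_int guest_mode;	/* flag the reader tests */

static void vcpu_enter_guest(void)
{
	atomic_store_explicit(&irq_pending, 1, memory_order_relaxed);
	/* analogue of __smp_wmb(): order the store above before the flag */
	atomic_store_explicit(&guest_mode, 1, memory_order_release);
}

static int observer_sees_consistent_state(void)
{
	/* analogue of smp_rmb() on the reader side */
	if (atomic_load_explicit(&guest_mode, memory_order_acquire))
		return atomic_load_explicit(&irq_pending, memory_order_relaxed);
	return 1;	/* flag not yet visible: nothing to check */
}

/* Publish, then observe: with the barrier pairing, a reader that sees
 * guest_mode set is guaranteed to also see irq_pending set. */
static int publish_and_check(void)
{
	vcpu_enter_guest();
	return observer_sees_consistent_state();
}
```

The open question in the thread is exactly whether each critical section only publishes a flag like this (barrier suffices) or also guards a read-modify-write (lock required).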

> 
>> +/*
>>   * This function will be invoked whenever the vcpu->irq.dev raises its INTR
>>   * line
>>   */
>> @@ -2318,10 +2348,52 @@ static void kvm_vcpu_intr(struct kvm_irqsink *this,
>>  {
>>  	struct kvm_vcpu *vcpu = (struct kvm_vcpu*)this->private;
>>  	unsigned long flags;
>> +	int direct_ipi = -1;
>>  
>>  	spin_lock_irqsave(&vcpu->irq.lock, flags);
>> -	__set_bit(pin, &vcpu->irq.pending);
>> +
>> +	if (!test_bit(pin, &vcpu->irq.pending)) {
>> +		/*
>> +		 * Record the change..
>> +		 */
>> +		__set_bit(pin, &vcpu->irq.pending);
>> +
>> +		/*
>> +		 * then wake up the vcpu (if necessary)
>> +		 */
>> +		if (vcpu->irq.task && (vcpu->irq.task != current)) {
>> +			if (vcpu->irq.guest_mode) {
>> +				/*
>> +				 * If we are in guest mode, we can optimize
>> +				 * the IPI by executing a function directly
>> +				 * on the owning processor.
>> +				 */
>> +				direct_ipi = task_cpu(vcpu->irq.task);
>> +				BUG_ON(direct_ipi == smp_processor_id());
>> +			} else
>> +				/*
>> +				 * otherwise, we must assume that we could be
>> +				 * blocked anywhere, including userspace. Send
>> +				 * a signal to give everyone a chance to get
>> +				 * notification
>> +				 */
>> +				send_sig(vcpu->irq.signo, vcpu->irq.task, 0);
>> +		}
>> +	}
>> +
>>  	spin_unlock_irqrestore(&vcpu->irq.lock, flags);
>> +
>> +	if (direct_ipi != -1) {
>> +		/*
>> +		 * Not sure if disabling preemption is needed.
>> +		 * The kick_process() code does this so I copied it
>> +		 */
>> +		preempt_disable();
>> +		smp_call_function_single(direct_ipi,
>> +					 kvm_vcpu_guest_intr,
>> +					 vcpu, 0, 0);
>> +		preempt_enable();
>> +	}
>>   
> 
> I see why you must issue the IPI outside the spin_lock_irqsave(), but 
> aren't you now opening a race?  vcpu enters guest mode, irq on other 
> cpu, irq sets direct_ipi to wakeup guest, releases lock, vcpu exits to 
> userspace (or migrates to another cpu), ipi is issued but nobody cares.

It's subtle, but I think it's OK.  The race is actually against the setting of irq.pending.  This *must* happen inside the lock, or the guest could exit and miss the interrupt.  Once the pending bit is set, however, the guest can be woken up in any old fashion and the behavior should be correct.  If the guest has already exited before the IPI is issued, it's effectively a no-op (well, really it's just a wasted IPI/reschedule event, but no harm is done).  Does this make sense?  Did I miss something else?
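The invariant being described here (pending bit published under the lock, the IPI merely a wakeup hint) can be sketched in userspace with a tiny spinlock model.  All names below are illustrative, not the kernel code:

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_flag irq_lock = ATOMIC_FLAG_INIT;
static unsigned long pending;
static int guest_mode;

static void lock(void)   { while (atomic_flag_test_and_set(&irq_lock)) ; }
static void unlock(void) { atomic_flag_clear(&irq_lock); }

/* Interrupt source: the pending bit is published under the lock; the
 * return value says whether a kick (IPI) should be sent after
 * unlocking.  A stale kick (guest already exited) is just a wasted
 * wakeup, not a correctness problem. */
static int raise_irq(int pin)
{
	int need_kick;

	lock();
	pending |= 1UL << pin;
	need_kick = guest_mode;
	unlock();
	return need_kick;
}

/* vcpu exit path: because it takes the same lock, it cannot miss a
 * bit that raise_irq() set before it got here. */
static unsigned long vcpu_exit_guest(void)
{
	unsigned long seen;

	lock();
	guest_mode = 0;
	seen = pending;
	unlock();
	return seen;
}

static unsigned long demo(void)
{
	guest_mode = 1;
	(void)raise_irq(3);		/* the IPI would be sent here... */
	return vcpu_exit_guest();	/* ...but the exit still sees pin 3 */
}
```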




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] KVM: Add support for in-kernel LAPIC model
       [not found]         ` <463EFCC5.8050408-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  2007-05-07 10:22           ` Avi Kivity
@ 2007-05-07 15:10           ` Gregory Haskins
       [not found]             ` <463F0914.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 28+ messages in thread
From: Gregory Haskins @ 2007-05-07 15:10 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

>>> On Mon, May 7, 2007 at  6:17 AM, in message <463EFCC5.8050408-atKUWr5tajBWk0Htik3J/w@public.gmane.org>,
Avi Kivity <avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org> wrote: 
> Gregory Haskins wrote:
>> +
>> +struct kvm_kernint {
>> +	struct kvm_vcpu              *vcpu;
>> +	struct kvm_irqdevice         *self_irq;
>> +	struct kvm_irqdevice         *ext_irq;
>> +	struct kvm_irqdevice          apic_irq;
>> +
>> +};
>>   
> 
> Can you elaborate on what these are? I sort of understand ext_irq and 
> apic_irq, but what is self_irq? And why have the vcpu member?

vcpu is there so that the kernint object has the proper VCPU/LAPIC context for functions like kvm_lapic_enabled().

ext/apic_irq you already understand.

self_irq is a self-reference to the kernint's own irqdevice.  This is because the kernint object is really just an intermediate "glue" component that glues the LAPIC, EXTINT, and VCPU together.  When interrupts come into the KERNINT from the LAPIC, they need to be forwarded to the VCPU as if they came from the KERNINT.  See kvm_apic_intr().

> 
>> +
>> +static struct kvm_irqdevice *get_irq_dev(struct kvm_kernint *s)
>> +{
>> +	struct kvm_irqdevice *dev;
>> +
>> +	if (kvm_lapic_enabled(s->vcpu))
>> +		dev = &s->apic_irq;
>> +	else
>> +		dev = s->ext_irq;
>> +
>> +	if (!dev)
>> +		kvm_crash_guest(s->vcpu->kvm);
>>   
> 
> Can this happen? Doesn't there always have to be an external interrupt 
> controller when the lapic is disabled?

Well, if a non-BSP processor calls this, ext_irq could be NULL.  It's a pathological condition, thus the harsh punishment to the guest.  If it's an AP, it had *better* be using its LAPIC.  If it's not, something is busted.


> 
>> +
>> +static void kvm_ext_intr(struct kvm_irqsink *this,
>> +			 struct kvm_irqdevice *dev,
>> +			 kvm_irqpin_t pin)
>> +{
>> +	struct kvm_kernint *s = (struct kvm_kernint*)this->private;
>> +
>> +	/*
>> +	 * If the EXTINT device sent us an interrupt, forward it to the LINT0
>> +	 * pin of the LAPIC
>> +	 */
>> +	if (pin != kvm_irqpin_localint)
>> +	    return;
>>   
> 
> Whitespace.

Fixed

> 
>> +
>> +	/*
>> +	 * "irq 0" = LINT0, 1 = LINT1
>> +	 */
>> +	kvm_irqdevice_set_pin(&s->apic_irq, 0, 1);
>> +}
>> +
>> +int kvm_kernint_init(struct kvm_vcpu *vcpu)
>> +{
>> +	struct kvm_irqdevice *irqdev = &vcpu->irq.dev;
>> +	struct kvm_kernint *s;
>> +
>> +	s = kzalloc(sizeof(*s), GFP_KERNEL);
>> +	if (!s)
>> +	    return -ENOMEM;
>>   
> 
> Whitespace.

Fixed

> 
>> +
>> +	s->vcpu = vcpu;
>> +
>> +	/*
>> +	 * Configure the irqdevice interface
>> +	 */
>> +	irqdev->ack         = kernint_irqdev_ack;
>> +	irqdev->set_pin     = kernint_irqdev_set_pin;
>> +	irqdev->destructor  = kernint_irqdev_destructor;
>> +
>> +	irqdev->private = s;
>> +	s->self_irq = irqdev;
>> +
>> +	/*
>> +	 * Configure the EXTINT device if this is the BSP processor
>> +	 */
>> +	if (!vcpu_slot(vcpu)) {
>> +		struct kvm_irqsink sink = {
>> +			.set_intr   = kvm_ext_intr,
>> +			.private    = s
>> +		};
>> +
>> +		s->ext_irq = &vcpu->kvm->isa_irq;
>> +		kvm_irqdevice_register_sink(s->ext_irq, &sink);
>> +	}
>> +
>> +	/*
>> +	 * Configure the LAPIC device
>> +	 */
>> +	kvm_irqdevice_init(&s->apic_irq);
>> +
>> +	{
>> +		struct kvm_irqsink sink = {
>> +			.set_intr   = kvm_apic_intr,
>> +			.private    = s
>> +		};
>> +
>> +		kvm_irqdevice_register_sink(&s->apic_irq, &sink);
>> +	}
>>   
> 
> Just move the variable definition to the top to avoid the weird indentation.

Weirdness fixed ;)

> 
> 
>> +
>> +/*
>> + *-----------------------------------------------------------------------
>> + * KERNEL-TIMERS
>> + *
>> + * Unfortunately we really need HRTIMERs to do this right, but they might
>> + * not exist on older kernels (sigh).  So we roughly abstract this interface
>> + * to support nanosecond resolution, and then emulate it as best we can if
>> + * the real HRTIMERs aren't available
>> + *-----------------------------------------------------------------------
>> + */
>>   
> 
> The correct way to use non-hrtimer kernels is to write the hrtimer api 
> in terms of the old api, in external-module-compat.h.
> 
>> +
>> +struct kvm_apic_timer {
>> +	int (*function)(void *private);
>> +	void *private;
>> +#ifdef KVM_NO_HRTIMER
>> +	struct timer_list dev;
>> +#else
>> +	struct hrtimer dev;
>> +#endif
>> +};
>>   
> 
> ... for example:
> #define hrtimer timer_list
> 
> I'd just drop it for now; it makes the code hard to read. We can re-add 
> it later.

By "drop it", I assume you mean drop the abstraction and just use native HRTIMER directly for now?  

> 
> 
> [I didn't audit the lapic code]
> 
> Where's vcpu->cr8 gone?

I know you requested that this entry remain in the VCPU structure.  However, I couldn't make this work reasonably.  The APIC wants to maintain this value itself, so the two can very easily get out of sync.  I suppose there could be ways to make the APIC use the vcpu->cr8 variable as storage for the TPR (albeit messy), but this idea falls apart when we start looking at optimizations like TPR shadowing.

Based on all that, I felt it was best to just maintain CR8 as the TPR register in the model.





-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/4] KVM: Adds ability to preempt an executing VCPU
       [not found]         ` <463EF7F4.9020106-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  2007-05-07 14:52           ` Gregory Haskins
@ 2007-05-07 15:17           ` Gregory Haskins
       [not found]             ` <463F0AC7.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 28+ messages in thread
From: Gregory Haskins @ 2007-05-07 15:17 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Oops... missed one:

>>> On Mon, May 7, 2007 at  5:57 AM, in message <463EF7F4.9020106-atKUWr5tajBWk0Htik3J/w@public.gmane.org>,
Avi Kivity <avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org> wrote: 
>
> 
> This needs to be fixed prior to merging.  

Agreed.

> Hopefully not by setting the 
> signal number, but by making the vcpu fd writable (userspace can attach 
> a signal to the fd if it wishes).

Can you provide an example of what you would like here?  I am not quite sure what you mean by making the fd writable.



* Re: [PATCH 3/4] KVM: Adds ability to preempt an executing VCPU
       [not found]             ` <463F04D8.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
@ 2007-05-08  8:13               ` Avi Kivity
       [not found]                 ` <46403117.2070000-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Avi Kivity @ 2007-05-08  8:13 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Gregory Haskins wrote:
>>     
>>>  
>>>  		vcpu->cpu = -1;
>>>  		vcpu->kvm = kvm;
>>> @@ -366,13 +370,20 @@ static void free_pio_guest_pages(struct kvm_vcpu *vcpu)
>>>  
>>>  static void kvm_free_vcpu(struct kvm_vcpu *vcpu)
>>>  {
>>> +	unsigned long irqsave;
>>> +
>>>  	if (!vcpu->vmcs)
>>>  		return;
>>>  
>>>  	vcpu_load(vcpu);
>>>  	kvm_mmu_destroy(vcpu);
>>>  	vcpu_put(vcpu);
>>> +
>>> +	spin_lock_irqsave(&vcpu->irq.lock, irqsave);
>>> +	vcpu->irq.task = NULL;
>>> +	spin_unlock_irqrestore(&vcpu->irq.lock, irqsave);
>>>   
>>>       
>> Can irq.task be non-NULL here at all?  Also, we only free vcpus when we 
>> destroy the vm, and paravirt drivers would hopefully hold a ref to the 
>> vm, so there's nobody to race against here.
>>     
>
> I am perhaps being a bit overzealous here.  What I found in practice is that the LVTT can screw things up on shutdown, so I was being pretty conservative on the synchronization here.  
>
>   

That may point to a different sync problem.  All pending timers 
ought to have been canceled before we reach here.  Please check to make 
sure this isn't papering over another problem.

>>>  	kvm_irqdevice_destructor(&vcpu->irq.dev);
>>>  
>>> @@ -1868,6 +1880,10 @@ static int kvm_vcpu_ioctl_run(struct kvm_vcpu *vcpu, 
>>>       
>> struct kvm_run *kvm_run)
>>     
>>>  		kvm_arch_ops->decache_regs(vcpu);
>>>  	}
>>>  
>>> +	spin_lock_irqsave(&vcpu->irq.lock, irqsaved);
>>> +	vcpu->irq.task = current;
>>> +	spin_unlock_irqrestore(&vcpu->irq.lock, irqsaved);
>>> +
>>>   
>>>       
>> Just assignment + __smp_wmb().
>>     
>
> (This comment applies to all of the subsequent reviews where memory barriers are recommended instead of locks:)
>
> I can't quite wrap my head around whether all these critical sections are correct with just a barrier instead of a full-blown lock.  I would prefer to be conservative and leave them as locks for now.  Someone with better insight could make a second pass and optimize the locks into barriers where appropriate.  I am just not comfortable doing it myself with confidence that I am not causing races.  If you insist on making the changes before the code is accepted, ok.  Just note that I am not comfortable ;)
>
>   

I approach it from the other direction: to me, a locked assignment says 
that something is fundamentally wrong.  Usually anything under a lock is 
a read-modify-write operation, otherwise the writes just stomp on each 
other.

This is the source of all the race-after-vmexit-irq-check comments 
you've been getting from me.  No matter how many times you explain it, 
every time I see it the automatic race alarm pops up.

>>> +/*
>>>   * This function will be invoked whenever the vcpu->irq.dev raises its INTR
>>>   * line
>>>   */
>>> @@ -2318,10 +2348,52 @@ static void kvm_vcpu_intr(struct kvm_irqsink *this,
>>>  {
>>>  	struct kvm_vcpu *vcpu = (struct kvm_vcpu*)this->private;
>>>  	unsigned long flags;
>>> +	int direct_ipi = -1;
>>>  
>>>  	spin_lock_irqsave(&vcpu->irq.lock, flags);
>>> -	__set_bit(pin, &vcpu->irq.pending);
>>> +
>>> +	if (!test_bit(pin, &vcpu->irq.pending)) {
>>> +		/*
>>> +		 * Record the change..
>>> +		 */
>>> +		__set_bit(pin, &vcpu->irq.pending);
>>> +
>>> +		/*
>>> +		 * then wake up the vcpu (if necessary)
>>> +		 */
>>> +		if (vcpu->irq.task && (vcpu->irq.task != current)) {
>>> +			if (vcpu->irq.guest_mode) {
>>> +				/*
>>> +				 * If we are in guest mode, we can optimize
>>> +				 * the IPI by executing a function directly
>>> +				 * on the owning processor.
>>> +				 */
>>> +				direct_ipi = task_cpu(vcpu->irq.task);
>>> +				BUG_ON(direct_ipi == smp_processor_id());
>>> +			} else
>>> +				/*
>>> +				 * otherwise, we must assume that we could be
>>> +				 * blocked anywhere, including userspace. Send
>>> +				 * a signal to give everyone a chance to get
>>> +				 * notification
>>> +				 */
>>> +				send_sig(vcpu->irq.signo, vcpu->irq.task, 0);
>>> +		}
>>> +	}
>>> +
>>>  	spin_unlock_irqrestore(&vcpu->irq.lock, flags);
>>> +
>>> +	if (direct_ipi != -1) {
>>> +		/*
>>> +		 * Not sure if disabling preemption is needed.
>>> +		 * The kick_process() code does this so I copied it
>>> +		 */
>>> +		preempt_disable();
>>>       

[preemption is disabled here anyway]

>>> +		smp_call_function_single(direct_ipi,
>>> +					 kvm_vcpu_guest_intr,
>>> +					 vcpu, 0, 0);
>>> +		preempt_enable();
>>> +	}
>>>   
>>>       
>> I see why you must issue the IPI outside the spin_lock_irqsave(), but 
>> aren't you now opening a race?  vcpu enters guest mode, irq on other 
>> cpu, irq sets direct_ipi to wakeup guest, releases lock, vcpu exits to 
>> userspace (or migrates to another cpu), ipi is issued but nobody cares.
>>     
>
> It's subtle, but I think it's ok.  The race is actually against the setting of irq.pending.  This *must* happen inside the lock or the guest could exit and miss the interrupt.  Once the pending bit is set, however, the guest can be woken up in any old fashion and the behavior should be correct.  If the guest has already exited before the IPI is issued, it's effectively a no-op (well, really it's just a wasted IPI/reschedule event, but no harm is done).  Does this make sense?  Did I miss something else?
>   

No, you are correct wrt the vcpu migrating to another cpu.

What about vs. exit to userspace where we may sleep?

-- 
error compiling committee.c: too many arguments to function




* Re: [PATCH 4/4] KVM: Add support for in-kernel LAPIC model
       [not found]             ` <463F0914.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
@ 2007-05-08  8:19               ` Avi Kivity
       [not found]                 ` <464032A3.2010303-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Avi Kivity @ 2007-05-08  8:19 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Gregory Haskins wrote:
>   
>>> +
>>> +static struct kvm_irqdevice *get_irq_dev(struct kvm_kernint *s)
>>> +{
>>> +	struct kvm_irqdevice *dev;
>>> +
>>> +	if (kvm_lapic_enabled(s->vcpu))
>>> +		dev = &s->apic_irq;
>>> +	else
>>> +		dev = s->ext_irq;
>>> +
>>> +	if (!dev)
>>> +		kvm_crash_guest(s->vcpu->kvm);
>>>   
>>>       
>> Can this happen? Doesn't there always have to be an external interrupt 
>> controller when the lapic is disabled?
>>     
>
> Well, if a non-BSP processor calls this, ext_irq could be null.  It's a pathological condition, thus the harsh punishment to the guest.  If it's an AP, it *better* be using its LAPIC.  If it's not, something is busted. 
>
>
>   

Yes, I agree.

>>>   
>>>       
>> ... for example:
>> #define hrtimer timer_list
>>
>> I'd just drop it for now; it makes the code hard to read. We can re-add 
>> it later.
>>     
>
> By "drop it", I assume you mean drop the abstraction and just use native HRTIMER directly for now?  
>
>   

Yes, we can worry about compatibility when this is merged.

>> [I didn't audit the lapic code]
>>
>> Where's vcpu->cr8 gone?
>>     
>
> I know you requested that this entry remain in the VCPU structure.  However, I couldn't make this work reasonably.  The APIC wants to maintain this value itself, so the two can very easily get out of sync.  I suppose there could be ways to make the APIC use the vcpu->cr8 variable as storage for the TPR (albeit messy), but this idea falls apart when we start looking at optimizations like TPR shadowing.
>
> Based on all that, I felt it was best to just maintain CR8 as the TPR register in the model.
>
>   

I don't understand.  Isn't the tpr read-only from the point of view of 
the lapic?

A simple set_cr8 helper should do the trick.


-- 
error compiling committee.c: too many arguments to function




* Re: [PATCH 3/4] KVM: Adds ability to preempt an executing VCPU
       [not found]             ` <463F0AC7.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
@ 2007-05-08  8:26               ` Avi Kivity
       [not found]                 ` <4640341E.6090905-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Avi Kivity @ 2007-05-08  8:26 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Gregory Haskins wrote:
>   
>> Hopefully not by setting the 
>> signal number, but by making the vcpu fd writable (userspace can attach 
>> a signal to the fd if it wishes).
>>     
>
> Can you provide an example of what you would like here?  I am not quite sure what you mean by making the fd writable.
>   

Making it respond to poll(2) as a writable fd.

See http://lwn.net/Articles/226252/ for an example.  It makes the fd 
readable instead of writable, but it's the same mechanism.

The larger picture is that fds and the poll() family are the closest 
thing Linux has to a generic event framework.  If the patchset in the 
article above is accepted, fds _will_ be the generic event framework.  
If something else is accepted, we'll just switch to that.

Qemu currently depends on signals, but you can have a writable fd 
generate a signal, and the mechanism for that is optional and 
configurable from userspace, which is what we want anyway.

-- 
error compiling committee.c: too many arguments to function




* Re: [PATCH 3/4] KVM: Adds ability to preempt an executing VCPU
       [not found]                 ` <46403117.2070000-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-05-08 11:48                   ` Gregory Haskins
       [not found]                     ` <46402B25.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Gregory Haskins @ 2007-05-08 11:48 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

>>> On Tue, May 8, 2007 at  4:13 AM, in message <46403117.2070000-atKUWr5tajBWk0Htik3J/w@public.gmane.org>,
Avi Kivity <avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org> wrote: 
> Gregory Haskins wrote:
>>
>> I am perhaps being a bit overzealous here.  What I found in practice is that 
> the LVTT can screw things up on shutdown, so I was being pretty conservative 
> on the synchronization here.  
>>
>>   
> 
> That may point out to a different sync problem.  All pending timers 
> ought to have been canceled before we reach here.  Please check to make 
> sure this isn't papering over another problem.
> 

You are definitely right there.  I had added this logic in the early stage of debugging.  It turned out that I was missing an apic_dropref, which effectively meant the hrtimer_cancel() was never being issued.  That was the root-cause of my "LVTT expiration after guest shutdown" bug.  I left the sync code in as a conservative measure, but I will clean this up.

>>
>>   
> 
> I approach it from the other direction: to me, a locked assignment says 
> that something is fundamentally wrong.  Usually anything under a lock is 
> a read-modify-write operation, otherwise the writes just stomp on each 
> other.
> 

Interesting.  That makes sense.  So if I replace the assignment cases with wmb, do I need to sprinkle rmbs anywhere, or is that taken care of naturally by the places where we take the lock for a compound operation?

>
> 
> [preemption is disabled here anyway]
>

Ack.  I will remove the calls.
 
>>>> +		smp_call_function_single(direct_ipi,
>>>> +					 kvm_vcpu_guest_intr,
>>>> +					 vcpu, 0, 0);
>>>> +		preempt_enable();
>>>> +	}
>>>>   
>>>>       
>>> I see why you must issue the IPI outside the spin_lock_irqsave(), but 
>>> aren't you now opening a race?  vcpu enters guest mode, irq on other 
>>> cpu, irq sets direct_ipi to wakeup guest, releases lock, vcpu exits to 
>>> userspace (or migrates to another cpu), ipi is issued but nobody cares.
>>>     
>>
>> It's subtle, but I think it's ok.  The race is actually against the setting of 
> the irq.pending.  This *must* happen inside the lock or the guest could exit 
> and miss the interrupt.  Once the pending bit is set, however, the guest can 
> be woken up in any old fashion and the behavior should be correct.  If the 
> guest has already exited before the IPI is issued, it's effectively a no-op 
> (well, really it's just a wasted IPI/reschedule event, but no harm is done).  
> Does this make sense?  Did I miss something else?
>>   
> 
> No, you are correct wrt the vcpu migrating to another cpu.
> 
> What about vs. exit to userspace where we may sleep?

My logic being correct is predicated on the assumption that you and I made a week or two ago: that userspace will not sleep for anything but HLT.  If userspace *can* sleep on other things besides HLT, I agree that there is a race here.  If it is limited to HLT, we will be taken care of by virtue of the fact that irq.pending is set before the handle_halt() logic is checked.  I admit that I was coding against an assumption that I do not yet know for a fact to be true.  I will update the comments to note this assumption so it's clearer, and we can address it in the future if it's ever revealed to be false.







* Re: [PATCH 3/4] KVM: Adds ability to preempt an executing VCPU
       [not found]                     ` <46402B25.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
@ 2007-05-08 11:56                       ` Avi Kivity
  0 siblings, 0 replies; 28+ messages in thread
From: Avi Kivity @ 2007-05-08 11:56 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Gregory Haskins wrote:
>>>> On Tue, May 8, 2007 at  4:13 AM, in message <46403117.2070000-atKUWr5tajBWk0Htik3J/w@public.gmane.org>,
>>>>         
> Avi Kivity <avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org> wrote: 
>   
>> Gregory Haskins wrote:
>>     
>>> I am perhaps being a bit overzealous here.  What I found in practice is that 
>>>       
>> the LVTT can screw things up on shutdown, so I was being pretty conservative 
>> on the synchronization here.  
>>     
>>>   
>>>       
>> That may point out to a different sync problem.  All pending timers 
>> ought to have been canceled before we reach here.  Please check to make 
>> sure this isn't papering over another problem.
>>
>>     
>
> You are definitely right there.  I had added this logic in the early stage of debugging.  It turned out that I was missing an apic_dropref, which effectively meant the hrtimer_cancel() was never being issued.  That was the root-cause of my "LVTT expiration after guest shutdown" bug.  I left the sync code in as a conservative measure, but I will clean this up.
>
>   

Okay.  An alternative to removing it is replacing it with a BUG_ON() to 
make sure the constraint is checked.

>>>   
>>>       
>> I approach it from the other direction: to me, a locked assignment says 
>> that something is fundamentally wrong.  Usually anything under a lock is 
>> a read-modify-write operation, otherwise the writes just stomp on each 
>> other.
>>
>>     
>
> Interesting.  That makes sense.  So if I replace the assignment cases with wmb, do I need to sprinkle rmbs anywhere, or is that taken care of naturally by the places where we take the lock for a compound operation?
>   

I was going to say yes, but I'm not so sure now.  In any case I'm still 
uneasy about the lack of rmw in there.

See Documentation/memory-barriers.txt for an interesting, if difficult, 
discussion of the subject.

>>>   
>>>       
>> No, you are correct wrt the vcpu migrating to another cpu.
>>
>> What about vs. exit to userspace where we may sleep?
>>     
>
> My logic being correct is predicated on the assumption that you and I made a week or two ago: that userspace will not sleep for anything but HLT.  If userspace *can* sleep on other things besides HLT, I agree that there is a race here.  If it is limited to HLT, we will be taken care of by virtue of the fact that irq.pending is set before the handle_halt() logic is checked.  I admit that I was coding against an assumption that I do not yet know for a fact to be true.  I will update the comments to note this assumption so it's clearer, and we can address it in the future if it's ever revealed to be false.
>   

Yeah, I keep forgetting this.

-- 
error compiling committee.c: too many arguments to function




* Re: [PATCH 4/4] KVM: Add support for in-kernel LAPIC model
       [not found]                 ` <464032A3.2010303-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-05-08 11:59                   ` Gregory Haskins
       [not found]                     ` <46402DC0.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 28+ messages in thread
From: Gregory Haskins @ 2007-05-08 11:59 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

>>> On Tue, May 8, 2007 at  4:19 AM, in message <464032A3.2010303-atKUWr5tajBWk0Htik3J/w@public.gmane.org>,
Avi Kivity <avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org> wrote: 
> Gregory Haskins wrote:
>
>> By "drop it", I assume you mean drop the abstraction and just use native 
> HRTIMER directly for now?  
>>
>>   
> 
> Yes, we can worry about compatibility when this is merged.
>

Ack.  I have removed the abstraction.
 
>>> [I didn't audit the lapic code]
>>>
>>> Where's vcpu->cr8 gone?
>>>     
>>
>> I know you requested that this entry remain in the VCPU structure.  However, 
> I couldn't make this work reasonably.  The APIC wants to maintain this value 
> itself so the two can very easily get out of sync.  I suppose there could be 
> ways to make the APIC use the vcpu->cr8 variable as storage for TPR (albeit 
> messy), but this idea falls apart when we start looking at optimizations like 
> TPR shadowing.
>>
>> Based on all that, I felt it was best to just maintain CR8 as the TPR 
> register in the model.
>>
>>   
> 
> I don't understand.  Isn't the tpr read-only from the point of view of 
> the lapic?
> 

Not quite.  It's true that the APIC proper views the TPR as read-only.  However, the TPR can be set by the CPU using both MOV to CR8 as well as an MMIO operation to the TPR register, and MMIOs are handled by the APIC code (on behalf of the vCPU).

> A simple set_cr8 helper should do the trick.

That would certainly solve the simple MOV to CR8 problem, yes.  It gets a little goofier when we start looking at the MMIO access path, but even that isn't insurmountable.  Where I really hit the wall was when I was thinking about TPR shadowing, which really wants direct access to the contiguous register file of the LAPIC.  I suppose we could give the CPU a real shadow of the registers and simply sync the TPR value on exit.  But it just seemed to be getting hacky for the sake of having vcpu->cr8.  What's wrong with simply looking at the LAPIC registers when you want to know? ;)




* Re: [PATCH 3/4] KVM: Adds ability to preempt an executing VCPU
       [not found]                 ` <4640341E.6090905-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-05-08 12:00                   ` Gregory Haskins
  0 siblings, 0 replies; 28+ messages in thread
From: Gregory Haskins @ 2007-05-08 12:00 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

>>> On Tue, May 8, 2007 at  4:26 AM, in message <4640341E.6090905-atKUWr5tajBWk0Htik3J/w@public.gmane.org>,
Avi Kivity <avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org> wrote: 
> Gregory Haskins wrote:
>>   
>>> Hopefully not by setting the 
>>> signal number, but by making the vcpu fd writable (userspace can attach 
>>> a signal to the fd if it wishes).
>>>     
>>
>> Can you provide an example of what you would like here?  I am not quite sure 
> what you mean by making the fd writable.
>>   
> 
> Making it respond to poll(2) as a writable fd.
> 
> See http://lwn.net/Articles/226252/ for an example.  It makes the fd 
> readable instead of writable, but it's the same mechanism.
> 
> The larger picture is that fds and the poll() family are the closest 
> thing Linux has to a generic event framework.  If the patchset in the 
> article above is accepted, fds _will_ be the generic event framework.  
> If something else is accepted, we'll just switch to that.
> 
> Qemu currently depends on signals, but you can have a writable fd 
> generate a signal, and the mechanism for that is optional and 
> configurable from userspace, which is what we want anyway.


Thanks for the link!  I will take a look and add this for my next drop.

-Greg



* Re: [PATCH 4/4] KVM: Add support for in-kernel LAPIC model
       [not found]                     ` <46402DC0.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
@ 2007-05-13 10:21                       ` Avi Kivity
  0 siblings, 0 replies; 28+ messages in thread
From: Avi Kivity @ 2007-05-13 10:21 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Gregory Haskins wrote:

  

>> I don't understand.  Isn't the tpr read-only from the point of view of 
>> the lapic?
>>
>>     
>
> Not quite.  It's true that the APIC proper views the TPR as read-only.  However, the TPR can be set by the CPU using both MOV to CR8 as well as an MMIO operation to the TPR register, and MMIOs are handled by the APIC code (on behalf of the vCPU).
>   

Right, I forgot.

>   
>> A simple set_cr8 helper should do the trick.
>>     
>
> That would certainly solve the simple MOV to CR8 problem, yes.  It gets a little goofier when we start looking at the MMIO access path, but even that isn't insurmountable.  Where I really hit the wall was when I was thinking about TPR shadowing, which really wants direct access to the contiguous register file of the LAPIC.  I suppose we could give the CPU a real shadow of the registers and simply sync the TPR value on exit.  But it just seemed to be getting hacky for the sake of having vcpu->cr8.  What's wrong with simply looking at the LAPIC registers when you want to know? ;)
>   

It's weird that you look at the kernel lapic even when you're using the 
userint code.  I'd also like to be consistent with the other registers 
which have shadows in the vcpu struct.

-- 
error compiling committee.c: too many arguments to function




end of thread, other threads:[~2007-05-13 10:21 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-05-02 21:43 [PATCH 0/4] Kernel side patches for in-kernel APIC Gregory Haskins
     [not found] ` <20070502212713.16738.8133.stgit-sLgBBP33vUGnsjUZhwzVf9HuzzzSOjJt@public.gmane.org>
2007-05-02 21:43   ` [PATCH 1/4] KVM: Adds support for in-kernel mmio handlers Gregory Haskins
     [not found]     ` <20070502214315.16738.68984.stgit-sLgBBP33vUGnsjUZhwzVf9HuzzzSOjJt@public.gmane.org>
2007-05-07  9:30       ` Avi Kivity
     [not found]         ` <463EF198.1050303-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-05-07 14:37           ` Gregory Haskins
2007-05-02 21:43   ` [PATCH 2/4] KVM: Add irqdevice object Gregory Haskins
     [not found]     ` <20070502214320.16738.21505.stgit-sLgBBP33vUGnsjUZhwzVf9HuzzzSOjJt@public.gmane.org>
2007-05-07  9:42       ` Avi Kivity
     [not found]         ` <463EF493.9070300-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-05-07 14:39           ` Gregory Haskins
2007-05-02 21:43   ` [PATCH 3/4] KVM: Adds ability to preempt an executing VCPU Gregory Haskins
     [not found]     ` <20070502214325.16738.42702.stgit-sLgBBP33vUGnsjUZhwzVf9HuzzzSOjJt@public.gmane.org>
2007-05-07  9:57       ` Avi Kivity
     [not found]         ` <463EF7F4.9020106-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-05-07 14:52           ` Gregory Haskins
     [not found]             ` <463F04D8.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
2007-05-08  8:13               ` Avi Kivity
     [not found]                 ` <46403117.2070000-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-05-08 11:48                   ` Gregory Haskins
     [not found]                     ` <46402B25.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
2007-05-08 11:56                       ` Avi Kivity
2007-05-07 15:17           ` Gregory Haskins
     [not found]             ` <463F0AC7.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
2007-05-08  8:26               ` Avi Kivity
     [not found]                 ` <4640341E.6090905-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-05-08 12:00                   ` Gregory Haskins
2007-05-02 21:43   ` [PATCH 4/4] KVM: Add support for in-kernel LAPIC model Gregory Haskins
     [not found]     ` <20070502214330.16738.51436.stgit-sLgBBP33vUGnsjUZhwzVf9HuzzzSOjJt@public.gmane.org>
2007-05-07 10:17       ` Avi Kivity
     [not found]         ` <463EFCC5.8050408-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-05-07 10:22           ` Avi Kivity
2007-05-07 15:10           ` Gregory Haskins
     [not found]             ` <463F0914.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
2007-05-08  8:19               ` Avi Kivity
     [not found]                 ` <464032A3.2010303-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-05-08 11:59                   ` Gregory Haskins
     [not found]                     ` <46402DC0.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
2007-05-13 10:21                       ` Avi Kivity
2007-05-03 18:57   ` [PATCH 0/4] Kernel side patches for in-kernel APIC Nakajima, Jun
     [not found]     ` <8FFF7E42E93CC646B632AB40643802A8028B114D-1a9uaKK1+wJcIJlls4ac1rfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2007-05-03 21:18       ` Gregory Haskins
     [not found]         ` <463A1935.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
2007-05-06  7:49           ` Avi Kivity
     [not found]             ` <463D8887.5020007-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-05-07 14:36               ` Gregory Haskins
2007-05-06  7:45   ` Avi Kivity
