KVM in-kernel APIC update

public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed

* KVM in-kernel APIC update
@ 2007-04-03 22:31 Gregory Haskins
       [not found] ` <46128F80.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 22+ messages in thread
From: Gregory Haskins @ 2007-04-03 22:31 UTC (permalink / raw)
  To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

[-- Attachment #1: Type: text/plain, Size: 3633 bytes --]


Hi all,

Attached is a snapshot of my current efforts on the kernel side for the in-kernel APIC work.  Feedback welcome.  

This patch pretty much only deals explicitly with the LAPIC.  We need to make some decisions about the extent of the in-kernel emulation we want to support.  For instance, do we bring the entire ISA interrupt subsystem down (e.g. the singleton IOAPIC and dual 8259s) or a subset?  The QEMU patch I posted previously should allow us to support any combination.

This patch merges the existing kernel-apic branch, and then adds the following

*) Support for SVN+VMX breakout.  The original code was written back when VMX was the one and only platform.

*) Refactored vcpu->irq_XX into an abstract interface.  The goal is to simplify the core code away from IRQ handling so we can support both "user-mode" interrupts as well as in-kernel.  This allows us to preserve ABI compatibility with older user-space that might still have the userspace IRQ handling.  Additionally, this abstraction should be able to make ports to platforms that dont use xAPICs (or a different flavor e.g. IA64 SAPIC) a little easier.  

*) Cursory NMI handling.  The existing code in both the kernel-apic and trunk for KVM doesn't really have the notion of NMIs.  Everything uses a flat model with all interrupts being subject to current masking conditions.  IIRC, IPIs tend to use NMI vectoring, so I think this will be key later if we want a better modeling of SMP system behavior.  Even if they don't, we may want NMIs in the future for a different set of reasons.  The new IRQ abstraction has the notion of NMI filtering so that it should be fairly easy to do the right thing at IRQ injection time at some point down the road.  Right now this feature of the interface is unused.

My current thoughts are that we at least move the IOAPIC into the kernel as well.  That will give sufficient control to generate ISA bus interrupts for guests that understand APICs.  If we want to be able to generate ISA interrupts for legacy guests which talk to the 8259s that will prove to be insufficient.  The good news is that moving the 8259s down as well is probably not a huge deal either, especially since I have already prepped the usermode side.  Thoughts?

So heres a question for you guys out there.  What is the expected use of the in-kernel APIC?  My interests lie in the ability to send IPIs for SMP, as well as being able to inject asynchronous hypercall interrupts.  I assume there are other reasons too, such as PV device interrupts, etc and I would like to make sure I am seeing the big picture before making any bad design decisions.  My question is, how do we expect the PV devices to look from a bus perspective?  

The current Bochs/QEMU system model paints a fairly simple ISA architecture utilizing a single IOAPIC + dual 8259 setup.  Do we expect in-kernel injected IRQs to follow the ISA model (e.g. either legacy or PCI interrupts only limited to IRQ0-15) or do we want to expand on this?  The PCI hypercall device introduced a while back would be an example of something ISA based.  Alternatives would be to utilize unused "pins" (such as IRQ16-23) on IOAPIC #0, or introducing new an entirely new bus/IOAPICs just for KVM, etc.  

If the latter, we also need to decide what the resource conveyance model and vector allocation policy should be.  For instance, do we publish said resources formally in the MP/ACPI tables in Bochs?  Doing so would allow MP/ACPI compliant OSs like linux to naturally route the IRQ.  Conversely, do we do something more direct just like we do for KVM discovery via wrmsr?

Cheers,
-Greg    

[-- Attachment #2: kvm-apic.patch --]
[-- Type: text/plain, Size: 70703 bytes --]

commit e1ff23affe598e90ffe21f5b0942296da89741ee
Author: Gregory Haskins <ghaskins-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
Date:   Tue Apr 3 16:32:50 2007 -0400

    KVM - Baseline support for in-kernel APIC
    
    This patch is not yet complete.  It is a starting point based upon the original work done by Dor Laor.  There are additional changes on the kernel side that
    are needed to complete this work in addition to usermode side changes.
    
    Signed-off by: Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>

diff --git a/drivers/kvm/Makefile b/drivers/kvm/Makefile
index c0a789f..7532eae 100644
--- a/drivers/kvm/Makefile
+++ b/drivers/kvm/Makefile
@@ -2,7 +2,7 @@
 # Makefile for Kernel-based Virtual Machine module
 #
 
-kvm-objs := kvm_main.o mmu.o x86_emulate.o
+kvm-objs := kvm_main.o mmu.o x86_emulate.o kvm_apic.o kvm_userint.o
 obj-$(CONFIG_KVM) += kvm.o
 kvm-intel-objs = vmx.o
 obj-$(CONFIG_KVM_INTEL) += kvm-intel.o
diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
index fceeb84..080cb1b 100644
--- a/drivers/kvm/kvm.h
+++ b/drivers/kvm/kvm.h
@@ -157,6 +157,51 @@ struct vmcs {
 
 struct kvm_vcpu;
 
+struct kvm_irqinfo {
+	int         vector;
+	int         nmi;
+};
+
+#define KVM_IRQFLAGS_NMI  (1 << 0)
+#define KVM_IRQFLAGS_PEEK (1 << 1)
+
+struct kvm_irqdevice {
+	int  (*pending)(struct kvm_irqdevice *this, int flags);
+	int  (*read)(struct kvm_irqdevice *this, int flags, 
+		     struct kvm_irqinfo *info);
+	int  (*inject)(struct kvm_irqdevice *this, int irq, int flags);
+	int  (*summary)(struct kvm_irqdevice *this, void *data);
+	void (*destructor)(struct kvm_irqdevice *this);
+
+	void *private;
+};
+
+#define MAX_APIC_INT_VECTOR      256
+
+struct kvm_apic {
+	u32	           	status;
+	u32	           	vcpu_id;
+	spinlock_t 		lock;
+	u32			pcpu_lock_owner;
+	atomic_t		timer_pending;
+	u64	           	apic_base_msr;
+	unsigned long      	base_address;
+	u32	           	timer_divide_count;
+	struct hrtimer       	apic_timer;
+	int                	intr_pending_count[MAX_APIC_INT_VECTOR];
+	ktime_t           	timer_last_update;
+	struct {
+		int deliver_mode;
+		int source[6];
+	} direct_intr;
+	u32	           	err_status;
+	u32	           	err_write_count;
+	struct kvm_vcpu        	*vcpu;
+	struct page	   	*regs_page;
+	void               	*regs;
+	struct kvm_irqdevice    ext; /* Used for external/NMI interrupts */
+};
+
 /*
  * x86 supports 3 paging modes (4-level 64-bit, 3-level 64-bit, and 2-level
  * 32-bit).  The kvm_mmu structure abstracts the details of the current mmu
@@ -236,6 +281,11 @@ struct kvm_pio_request {
 	int rep;
 };
 
+#define KVM_VCPU_INIT_SIPI_SIPI_STATE_NORM          0
+#define KVM_VCPU_INIT_SIPI_SIPI_STATE_WAIT_SIPI     1
+
+#define NR_IRQ_WORDS KVM_IRQ_BITMAP_SIZE(unsigned long)
+
 struct kvm_vcpu {
 	struct kvm *kvm;
 	union {
@@ -248,12 +298,9 @@ struct kvm_vcpu {
 	u64 host_tsc;
 	struct kvm_run *run;
 	int interrupt_window_open;
-	unsigned long irq_summary; /* bit vector: 1 per word in irq_pending */
-#define NR_IRQ_WORDS KVM_IRQ_BITMAP_SIZE(unsigned long)
-	unsigned long irq_pending[NR_IRQ_WORDS];
 	unsigned long regs[NR_VCPU_REGS]; /* for rsp: vcpu_load_rsp_rip() */
 	unsigned long rip;      /* needs vcpu_load_rsp_rip() */
-
+	struct kvm_irqdevice irq_dev;
 	unsigned long cr0;
 	unsigned long cr2;
 	unsigned long cr3;
@@ -261,10 +308,8 @@ struct kvm_vcpu {
 	struct page *para_state_page;
 	gpa_t hypercall_gpa;
 	unsigned long cr4;
-	unsigned long cr8;
 	u64 pdptrs[4]; /* pae */
 	u64 shadow_efer;
-	u64 apic_base;
 	u64 ia32_misc_enable_msr;
 	int nmsrs;
 	struct vmx_msr_entry *guest_msrs;
@@ -298,6 +343,11 @@ struct kvm_vcpu {
 	int sigset_active;
 	sigset_t sigset;
 
+	struct kvm_apic apic;
+	wait_queue_head_t halt_wq;
+	/* For AP startup */
+	unsigned long init_sipi_sipi_state;
+
 	struct {
 		int active;
 		u8 save_iopl;
@@ -319,6 +369,23 @@ struct kvm_mem_alias {
 	gfn_t target_gfn;
 };
 
+#define kvm_irq_pending(dev, flags)     (dev)->pending(dev, flags)
+#define kvm_irq_read(dev, flags, info)  (dev)->read(dev, flags, info)
+#define kvm_irq_inject(dev, irq, flags) (dev)->inject(dev, irq, flags)
+#define kvm_irq_summary(dev, data)      (dev)->summary(dev, data)
+
+#define kvm_vcpu_irq_pending(vcpu, flags) \
+   kvm_irq_pending(&vcpu->irq_dev, flags)
+#define kvm_vcpu_irq_read(vcpu, flags, info) \
+   kvm_irq_read(&vcpu->irq_dev, flags, info)
+#define kvm_vcpu_irq_inject(vcpu, irq, flags) \
+   kvm_irq_inject(&vcpu->irq_dev, irq, flags)
+#define kvm_vcpu_irq_summary(vcpu, data)  \
+   kvm_irq_summary(&vcpu->irq_dev, data)
+
+
+int kvm_userint_init(struct kvm_vcpu *vcpu);
+
 struct kvm_memory_slot {
 	gfn_t base_gfn;
 	unsigned long npages;
@@ -345,6 +412,7 @@ struct kvm {
 	unsigned long rmap_overflow;
 	struct list_head vm_list;
 	struct file *filp;
+	struct kvm_options options;
 };
 
 struct kvm_stat {
@@ -564,6 +632,13 @@ static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
 	return (struct kvm_mmu_page *)page_private(page);
 }
 
+static inline int vcpu_slot(struct kvm_vcpu *vcpu)
+{
+	return vcpu - vcpu->kvm->vcpus;
+}
+
+void kvm_crash_guest(struct kvm *kvm);
+
 static inline u16 read_fs(void)
 {
 	u16 seg;
diff --git a/drivers/kvm/kvm_apic.c b/drivers/kvm/kvm_apic.c
new file mode 100644
index 0000000..f3e7edc
--- /dev/null
+++ b/drivers/kvm/kvm_apic.c
@@ -0,0 +1,1289 @@
+/*
+ * kvm_apic.c: Local APIC virtualization
+ *
+ *
+ * Copyright (C) 2006 Qumranet, Inc.
+ *
+ * Authors:
+ *   Dor Laor   <dor.laor-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
+ *   Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
+ *
+ * Copyright (c) 2004, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ *
+ */
+
+#include "kvm.h"
+#include "kvm_apic.h"
+#include <linux/kvm.h>
+#include <linux/mm.h>
+#include <linux/highmem.h>
+#include <linux/smp.h>
+#include <linux/hrtimer.h>
+#include <asm/processor.h>
+#include <asm/io.h>
+#include <asm/msr.h>
+#include <asm/page.h>
+#include <asm/current.h>
+
+/*XXX remove this definition after GFW enabled */
+#define APIC_NO_BIOS
+
+#define PRId64 "d"
+#define PRIx64 "llx"
+#define PRIu64 "u"
+#define PRIo64 "o"
+
+#define APIC_BUS_CYCLE_NS 1
+
+static unsigned int apic_lvt_mask[APIC_LVT_NUM] =
+{
+	LVT_MASK | APIC_LVT_TIMER_PERIODIC, 	/* LVTT */
+	LVT_MASK | APIC_MODE_MASK, 		/* LVTTHMR */
+	LVT_MASK | APIC_MODE_MASK, 		/* LVTPC */
+	LINT_MASK, LINT_MASK, 			/* LVT0-1 */
+	LVT_MASK 				/* LVTERR */
+};
+
+
+#define ASSERT(x)  							     \
+	if (!(x)) { 							     \
+		printk(KERN_EMERG "assertion failed %s: %d: %s\n", __FILE__, __LINE__, #x);\
+		BUG();\
+	}
+
+static int apic_find_highest_irr(struct kvm_apic *apic)
+{
+	int result;
+	
+	result = find_highest_bit((unsigned long *)(apic->regs + APIC_IRR),
+				  MAX_APIC_INT_VECTOR);
+	
+	ASSERT( result == 0 || result >= 16);
+	
+	return result;
+}
+
+
+static int apic_find_highest_isr(struct kvm_apic *apic)
+{
+	int result;
+	
+	result = find_highest_bit((unsigned long *)(apic->regs + APIC_ISR),
+				  MAX_APIC_INT_VECTOR);
+	
+	ASSERT( result == 0 || result >= 16);
+	
+	return result;
+}
+
+static u32 apic_update_ppr(struct kvm_apic *apic)
+{
+	u32 tpr, isrv, ppr;
+	int isr;
+	
+	tpr = kvm_apic_get_reg(apic, APIC_TASKPRI);
+        isr = apic_find_highest_isr(apic);
+	isrv = (isr >> 4) & 0xf;
+	
+	if ((tpr >> 4) >= isrv)
+		ppr = tpr & 0xff;
+	else
+		ppr = isrv << 4;  /* low 4 bits of PPR have to be cleared */
+	
+	kvm_apic_set_reg(apic, APIC_PROCPRI, ppr);
+	
+	pr_debug("%s: ppr 0x%x, isr 0x%x, isrv 0x%x\n",
+	       __FUNCTION__, ppr, isr, isrv);
+	
+	return ppr;
+}
+
+void kvm_apic_update_tpr(struct kvm_apic *apic, unsigned long cr8)
+{
+	spin_lock_bh(&apic->lock);
+	kvm_apic_set_reg(apic, APIC_TASKPRI, ((cr8 & 0x0f) << 4));
+	apic_update_ppr(apic);
+	spin_unlock_bh(&apic->lock);
+}
+
+/*
+ * The function do not need a lock because it's atomic value
+ */
+unsigned long kvm_apic_read_tpr(struct kvm_apic* apic)
+{
+	unsigned long tpr = (unsigned long)kvm_apic_get_reg(apic, APIC_TASKPRI);
+	return (tpr & 0xf0) >> 4;
+}
+EXPORT_SYMBOL_GPL(kvm_apic_read_tpr);
+
+
+/*
+ * This only for fixed delivery mode
+ */
+static int apic_match_dest(struct kvm_vcpu *vcpu,
+			   struct kvm_apic *source,
+			   int short_hand,
+			   int dest,
+			   int dest_mode,
+			   int delivery_mode)
+{
+	int result = 0;
+	struct kvm_apic *target = &vcpu->apic;
+	
+	pr_debug("target %p, source %p, dest 0x%x, dest_mode 0x%x, "
+			  "short_hand 0x%x, delivery_mode 0x%x\n",
+			  target, source, dest, dest_mode, short_hand, 
+			  delivery_mode);
+
+	if (unlikely(target == NULL) &&
+	    ((delivery_mode != APIC_DM_INIT) &&
+	     (delivery_mode != APIC_DM_STARTUP) &&
+	     (delivery_mode != APIC_DM_NMI))) {
+
+		pr_debug("uninitialized target vcpu %p, "
+				  "delivery_mode 0x%x, dest 0x%x.\n", 
+				  vcpu, delivery_mode, dest);
+		return result;
+	}
+	
+	switch (short_hand) {
+	case APIC_DEST_NOSHORT:             /* no shorthand */
+		if (!dest_mode) /* Physical */
+			result = ( ((target != NULL) ?
+				GET_APIC_ID(kvm_apic_get_reg(target, APIC_ID)):
+				vcpu_slot(vcpu))) == dest;
+		else { /* Logical */
+			u32 ldr;
+			if (target == NULL)
+				break;
+			ldr = kvm_apic_get_reg(target, APIC_LDR);
+	
+			/* Flat mode */
+			if (kvm_apic_get_reg(target, APIC_DFR) == APIC_DFR_FLAT)
+				result = GET_APIC_LOGICAL_ID(ldr) & dest;
+			else {
+				if (delivery_mode == APIC_DM_LOWEST &&
+				    dest == 0xff) {
+					printk(KERN_ALERT "Broadcast IPI with lowest priority "
+					       "delivery mode\n");
+					kvm_crash_guest(vcpu->kvm);
+				}
+				result = (GET_APIC_LOGICAL_ID(ldr) == (dest & 0xf)) ?
+					  (GET_APIC_LOGICAL_ID(ldr) >> 4) & (dest >> 4) : 0;
+			}
+		}
+		break;
+
+	case APIC_DEST_SELF:
+		if (target == source)
+			result = 1;
+		break;
+	case APIC_DEST_ALLINC:
+		result = 1;
+		break;
+	
+	case APIC_DEST_ALLBUT:
+		if (target != source)
+			result = 1;
+		break;
+	
+	default:
+		break;
+	}
+	
+	return result;
+}
+
+/*
+ * Add a pending IRQ into lapic.
+ * Return 1 if successfully added and 0 if discarded.
+ */
+static int apic_accept_irq(struct kvm_apic *apic,
+			   int delivery_mode,
+			   int vector,
+			   int level,
+			   int trig_mode)
+{
+	int result = 0;
+	
+	switch (delivery_mode) {
+	case APIC_DM_FIXED:
+	case APIC_DM_LOWEST:
+		/* FIXME add logic for vcpu on reset */
+		if (unlikely(apic == NULL || !apic_enabled(apic)))
+			break;
+
+		if (test_and_set_bit(vector, apic->regs + APIC_IRR) && trig_mode) {
+			pr_debug("level trig mode repeatedly for vector %d\n",
+			       vector);
+			break;
+		}
+
+		if (trig_mode) {
+			pr_debug("level trig mode for vector %d\n", vector);
+			set_bit(vector, apic->regs + APIC_TMR);
+		}
+		
+		
+		/*
+		 * FIXME(kvm) When we'll have smp support we will need to wakeup
+		 * a sleeping/running vcpu to inject the interrupt to it.
+		 */
+		result = 1;
+		break;
+
+	case APIC_DM_REMRD:
+	case APIC_DM_SMI:
+		printk(KERN_WARNING "%s: Ignore deliver mode %d\n", __FUNCTION__, delivery_mode);
+		break;
+	case APIC_DM_NMI:
+	case APIC_DM_EXTINT:
+		kvm_irq_inject(&apic->ext, vector, 
+			       (delivery_mode == APIC_DM_NMI) ? 
+			       KVM_IRQFLAGS_NMI : 0);
+		break;
+
+	case APIC_DM_INIT:
+		if (trig_mode && !(level & APIC_INT_ASSERT))     /* Deassert */
+			printk(KERN_INFO "This kvm_apic is for P4, no work for De-assert init\n");
+		else {
+			/* FIXME(xen) How to check the situation after vcpu reset? */
+			if (apic->vcpu->launched) {
+				printk(KERN_ALERT "Reset kvm vcpu not supported yet\n");
+				kvm_crash_guest(apic->vcpu->kvm);
+			}
+			apic->vcpu->init_sipi_sipi_state = 
+				KVM_VCPU_INIT_SIPI_SIPI_STATE_WAIT_SIPI;
+			result = 1;
+		}
+		break;
+
+	case APIC_DM_STARTUP: /* FIXME: currently no support for SMP */
+		printk(KERN_ALERT "%s:SMP not supported yet\n", __FUNCTION__);
+		kvm_crash_guest(apic->vcpu->kvm);
+		break;
+
+	default:
+		printk(KERN_ALERT "TODO: not support interrupt type %x\n", delivery_mode);
+		kvm_crash_guest(apic->vcpu->kvm);
+		break;
+	}
+
+	return result;
+}
+
+static int _apic_inject(struct kvm_irqdevice *this, int irq, int flags)
+{
+	struct kvm_apic *apic = (struct kvm_apic*)this->private;
+	int ret;
+	int apic_type = (flags & KVM_IRQFLAGS_NMI) ? APIC_DM_NMI : 
+		APIC_DM_EXTINT;
+
+	spin_lock_bh(&apic->lock);
+	ret = apic_accept_irq(apic, apic_type, irq, 0, 1);
+	spin_unlock_bh(&apic->lock);
+
+	return ret;
+}
+
+#if 0
+int kvm_apic_receive_msg(struct kvm_vcpu *vcpu, struct kvm_apic_msg *msg)
+{
+	pr_debug("%s: vcpu(%d), delivery_mode(%d), vector(%x), trig_mode(%d)\n",
+		__FUNCTION__, vcpu_slot(vcpu), msg->delivery_mode, msg->vector, msg->trig_mode);
+
+	spin_lock_bh(&vcpu->apic.lock);
+	apic_accept_irq(vcpu->apic, msg->delivery_mode, msg->vector, 1, msg->trig_mode);
+	spin_unlock_bh(&vcpu->apic.lock);
+
+	return 0;
+}
+#endif
+
+
+
+/*
+ * This function is used by both ioapic and local APIC
+ * The bitmap is for vcpu_id
+ * Should be called by the ioapic too.
+ * The implementation is null because there is no SMP support
+ */
+struct kvm_apic *kvm_apic_round_robin(struct kvm_vcpu *vcpu,
+				  u8 dest_mode,
+				  u8 vector,
+				  u32 bitmap)
+{
+	if (dest_mode == 0) {  /* Physical mode */
+		pr_debug("%s: lowest priority for physical mode\n", __FUNCTION__);
+		return NULL;
+	}
+	
+	if (!bitmap) {
+		pr_debug("%s no bit set in bitmap.\n", __FUNCTION__);
+		return NULL;
+	}
+	
+	/* FIXME: add SMP support */
+	return &vcpu->apic;
+}
+
+static void apic_EOI_set(struct kvm_apic *apic)
+{
+	int vector = apic_find_highest_isr(apic);
+	
+	/*
+	 * Not every write EOI will has corresponding ISR,
+	 * one example is when Kernel check timer on setup_IO_APIC
+	 */
+	if (!vector)
+		return;
+	
+	clear_bit(vector, apic->regs + APIC_ISR);
+	apic_update_ppr(apic);
+	
+	clear_bit(vector, apic->regs + APIC_TMR);
+}
+
+static int apic_check_vector(struct kvm_apic *apic,u32 dm, u32 vector)
+{
+	if (dm == APIC_DM_FIXED && vector < 16) {
+		apic->err_status |= 0x40;
+		apic_accept_irq(apic, APIC_DM_FIXED,
+				apic_lvt_vector(apic, APIC_LVTERR), 0, 0);
+		pr_debug("%s: check failed "
+		       " dm %x vector %x\n", __FUNCTION__, dm, vector);
+		return 0;
+	}
+	return 1;
+}
+
+static void apic_ipi(struct kvm_vcpu *vcpu)
+{
+	struct kvm_apic *apic = &vcpu->apic;
+	u32 icr_low = kvm_apic_get_reg(apic, APIC_ICR);
+	u32 icr_high = kvm_apic_get_reg(apic, APIC_ICR2);
+	
+	unsigned int dest =         GET_APIC_DEST_FIELD(icr_high);
+	unsigned int short_hand =   icr_low & APIC_SHORT_MASK;
+	unsigned int trig_mode =    icr_low & APIC_INT_LEVELTRIG;
+	unsigned int level =        icr_low & APIC_INT_ASSERT;
+	unsigned int dest_mode =    icr_low & APIC_DEST_MASK;
+	unsigned int delivery_mode =    icr_low & APIC_MODE_MASK;
+	unsigned int vector =       icr_low & APIC_VECTOR_MASK;
+	
+	struct kvm_apic *target;
+        u32 lpr_map = 0;
+	
+	pr_debug("icr_high 0x%x, icr_low 0x%x, "
+			  "short_hand 0x%x, dest 0x%x, trig_mode 0x%x, level 0x%x, "
+			  "dest_mode 0x%x, delivery_mode 0x%x, vector 0x%x\n",
+			  icr_high, icr_low, short_hand, dest,
+			  trig_mode, level, dest_mode, delivery_mode, vector);
+	
+	if (apic_match_dest(vcpu, apic, short_hand, dest, dest_mode, delivery_mode)) {
+		if (delivery_mode == APIC_DM_LOWEST)
+			set_bit(vcpu_slot(vcpu), &lpr_map);
+		else
+			apic_accept_irq(apic, delivery_mode,
+					vector, level, trig_mode);
+	}
+        
+	if (delivery_mode == APIC_DM_LOWEST) {
+		/* Currently only UP is supported so target == apic */
+		target = kvm_apic_round_robin(vcpu, dest_mode, vector, lpr_map);
+	
+		if (target) 
+			apic_accept_irq(target, delivery_mode,
+					vector, level, trig_mode);
+	}
+}
+
+static u32 apic_get_tmcct(struct kvm_apic *apic)
+{
+	u32 counter_passed;
+	ktime_t passed, now = apic->apic_timer.base->get_time();
+	u32 tmcct = kvm_apic_get_reg(apic, APIC_TMCCT);
+	
+	ASSERT(apic != NULL);
+	
+	if (unlikely(ktime_to_ns(now) <= ktime_to_ns(apic->timer_last_update))) {
+		/* Wrap around */
+		passed = ktime_add(
+			({ (ktime_t){ .tv64 = KTIME_MAX - (apic->timer_last_update).tv64 }; }),
+			now);
+		pr_debug("time elapsed\n");
+	} else
+		passed = ktime_sub(now, apic->timer_last_update);
+
+	counter_passed = ktime_to_ns(passed) / 
+			 (APIC_BUS_CYCLE_NS * apic->timer_divide_count);
+        tmcct -= counter_passed;
+	
+	if (tmcct <= 0) {
+		if (unlikely(!apic_lvtt_period(apic))) {
+			tmcct =  0;
+		} else {
+			do {
+				tmcct += kvm_apic_get_reg(apic, APIC_TMICT);
+			} while ( tmcct <= 0 );
+		}
+	}
+	
+	apic->timer_last_update = now;
+	kvm_apic_set_reg(apic, APIC_TMCCT, tmcct);
+	
+	return tmcct;
+}
+
+static void kvm_apic_read_aligned(struct kvm_apic *apic,
+				  unsigned int offset,
+				  unsigned int len,
+				  unsigned int *result)
+{
+	ASSERT(len == 4 && offset > 0 && offset <= APIC_TDCR);
+	*result = 0;
+	
+	switch (offset) {
+	case APIC_ARBPRI:
+		printk(KERN_WARNING "access local APIC ARBPRI register which is for P6\n");
+		break;
+	
+	case APIC_TMCCT:        /* Timer CCR */
+		*result = apic_get_tmcct(apic);
+		break;
+	
+	case APIC_ESR:
+		apic->err_write_count = 0;
+		*result = kvm_apic_get_reg(apic, offset);
+		break;
+	
+	default:
+		*result = kvm_apic_get_reg(apic, offset);
+		break;
+	}
+}
+
+static unsigned long __kvm_apic_read(struct kvm_vcpu *vcpu,
+				     unsigned long address,
+				     unsigned long len)
+{
+	unsigned int alignment;
+	unsigned int tmp;
+	unsigned long result;
+	struct kvm_apic *apic = &vcpu->apic;
+	unsigned int offset = address - apic->base_address;
+	
+	if (offset > APIC_TDCR)
+		return 0;
+	
+	/* some bugs on kernel cause read this with byte*/
+	if (len != 4)
+		pr_debug("read with len=0x%lx, should be 4 instead.\n", len);
+	
+	alignment = offset & 0x3;
+	
+	kvm_apic_read_aligned(apic, offset & ~0x3, 4, &tmp);
+	switch (len) {
+	case 1:
+		result = *((unsigned char *)&tmp + alignment);
+		break;
+	
+	case 2:
+		ASSERT(alignment != 3);
+		result = *(unsigned short *)((unsigned char *)&tmp + alignment);
+		break;
+	
+	case 4:
+		ASSERT(alignment == 0);
+		result = *(unsigned int *)((unsigned char *)&tmp + alignment);
+		break;
+	
+	default:
+		printk(KERN_ALERT "Local APIC read with len=0x%lx, should be 4 instead.\n", len);
+		kvm_crash_guest(vcpu->kvm);
+		result = 0; /* to make gcc happy */
+		break;
+	}
+	
+	pr_debug("%s: offset 0x%x with length 0x%lx, "
+			  "and the result is 0x%lx\n", __FUNCTION__, offset, len, result);
+	
+	return result;
+}
+
+unsigned long kvm_apic_read(struct kvm_vcpu *vcpu,
+			    unsigned long address,
+			    unsigned long len)
+{
+	unsigned long result;
+
+	spin_lock_bh(&vcpu->apic.lock);
+	result = __kvm_apic_read(vcpu, address, len);
+	spin_unlock_bh(&vcpu->apic.lock);
+
+	return result;
+}
+
+void kvm_apic_write(struct kvm_vcpu *vcpu,
+		    unsigned long address,
+		    unsigned long len,
+		    unsigned long val)
+{
+	struct kvm_apic *apic = &vcpu->apic;
+	unsigned int offset = address - apic->base_address;
+
+	spin_lock_bh(&apic->lock);
+
+	/* too common printing */
+        if (offset != APIC_EOI)
+		pr_debug("%s: offset 0x%x with length 0x%lx, and value is 0x%lx\n",
+		       __FUNCTION__, offset, len, val);
+
+	/*
+	 * According to IA 32 Manual, all registers should be accessed with
+	 * 32 bits alignment.
+	 */
+	if (len != 4) {
+		unsigned int tmp;
+		unsigned char alignment;
+		
+		/* Some kernels do will access with byte/word alignment */
+		pr_debug("Notice: Local APIC write with len = %lx\n",len);
+		alignment = offset & 0x3;
+		tmp = __kvm_apic_read(vcpu, offset & ~0x3, 4);
+		switch (len) {
+		case 1:
+			/*
+			 * XXX the saddr is a tmp variable from caller, so should be ok
+			 * But we should still change the following ref to val to
+			 * local variable later
+			 */
+			val = (tmp & ~(0xff << (8*alignment))) |
+			      ((val & 0xff) << (8*alignment));
+			break;
+	
+		case 2:
+			if (alignment != 0x0 && alignment != 0x2) {
+				printk(KERN_ALERT "alignment error for apic with len == 2\n");
+				kvm_crash_guest(vcpu->kvm);
+			}
+			
+			val = (tmp & ~(0xffff << (8*alignment))) |
+			      ((val & 0xffff) << (8*alignment));
+			break;
+	
+		case 3:
+			/* will it happen? */
+			printk(KERN_ALERT "apic_write with len = 3 !!!\n");
+			kvm_crash_guest(vcpu->kvm);
+			break;
+	
+		default:
+			printk(KERN_ALERT "Local APIC write with len = %lx, should be 4 instead\n", len);
+			kvm_crash_guest(vcpu->kvm);
+			break;
+		}
+	}
+
+	offset &= 0xff0;
+
+	switch (offset) {
+	case APIC_ID:   /* Local APIC ID */
+		kvm_apic_set_reg(apic, APIC_ID, val);
+		break;
+
+	case APIC_TASKPRI:
+		kvm_apic_set_reg(apic, APIC_TASKPRI, val & 0xff);
+		apic_update_ppr(apic);
+		break;
+
+	case APIC_EOI:
+		apic_EOI_set(apic);
+		break;
+
+	case APIC_LDR:
+		kvm_apic_set_reg(apic, APIC_LDR, val & APIC_LDR_MASK);
+		break;
+
+	case APIC_DFR:
+		kvm_apic_set_reg(apic, APIC_DFR, val | 0x0FFFFFFF);
+		break;
+
+	case APIC_SPIV:
+		kvm_apic_set_reg(apic, APIC_SPIV, val & 0x3ff);
+		if (!(val & APIC_SPIV_APIC_ENABLED)) {
+			int i;
+			u32 lvt_val;
+
+			apic->status |= APIC_SOFTWARE_DISABLE_MASK;
+			for (i = 0; i < APIC_LVT_NUM; i++) {
+				lvt_val = kvm_apic_get_reg(apic, APIC_LVTT + 0x10 * i);
+				kvm_apic_set_reg(apic, APIC_LVTT + 0x10 * i,
+						 lvt_val | APIC_LVT_MASKED);
+			}
+
+			if ((kvm_apic_get_reg(apic, APIC_LVT0) & APIC_MODE_MASK)
+			    == APIC_DM_EXTINT)
+				clear_bit(_APIC_BSP_ACCEPT_PIC, &apic->status);
+		} else {
+			apic->status &= ~APIC_SOFTWARE_DISABLE_MASK;
+			if ((kvm_apic_get_reg(apic, APIC_LVT0) & APIC_MODE_MASK)
+			    == APIC_DM_EXTINT)
+				set_bit(_APIC_BSP_ACCEPT_PIC, &apic->status);
+		}
+		break;
+
+	case APIC_ESR:
+		apic->err_write_count = !apic->err_write_count;
+		if (!apic->err_write_count)
+			apic->err_status = 0;
+		break;
+
+	case APIC_ICR:
+		/* No delay here, so we always clear the pending bit*/
+		kvm_apic_set_reg(apic, APIC_ICR, val & ~(1 << 12));
+		apic_ipi(vcpu);
+		break;
+
+	case APIC_ICR2:
+		kvm_apic_set_reg(apic, APIC_ICR2, val & 0xff000000);
+		break;
+
+	case APIC_LVTT:
+	case APIC_LVTTHMR:
+	case APIC_LVTPC:
+	case APIC_LVT0:
+	case APIC_LVT1:
+	case APIC_LVTERR:
+	{
+		if (apic->status & APIC_SOFTWARE_DISABLE_MASK)
+			val |= APIC_LVT_MASKED;
+
+		val &= apic_lvt_mask[(offset - APIC_LVTT) >> 4];
+                kvm_apic_set_reg(apic, offset, val);
+
+		/* On hardware, when write vector less than 0x20 will error */
+		if (!(val & APIC_LVT_MASKED))
+			apic_check_vector(apic, apic_lvt_dm(apic, offset),
+					  apic_lvt_vector(apic, offset));
+		if (!vcpu_slot(vcpu) && (offset == APIC_LVT0)) {
+			if ((val & APIC_MODE_MASK) == APIC_DM_EXTINT)
+				if (val & APIC_LVT_MASKED)
+					clear_bit(_APIC_BSP_ACCEPT_PIC, &apic->status);
+				else
+					set_bit(_APIC_BSP_ACCEPT_PIC, &apic->status);
+			else
+				clear_bit(_APIC_BSP_ACCEPT_PIC, &apic->status);
+		}
+	}
+		break;
+
+	case APIC_TMICT:
+	{
+		ktime_t now = apic->apic_timer.base->get_time();
+		u32 offset;
+
+		kvm_apic_set_reg(apic, APIC_TMICT, val);
+		kvm_apic_set_reg(apic, APIC_TMCCT, val);
+		apic->timer_last_update = now;
+		offset = APIC_BUS_CYCLE_NS * apic->timer_divide_count * val;
+
+		/* Make sure the lock ordering is coherent */
+		spin_unlock_bh(&apic->lock);
+		hrtimer_cancel(&apic->apic_timer);
+		hrtimer_start(&apic->apic_timer, ktime_add_ns(now, offset), HRTIMER_ABS);
+
+		pr_debug("%s: bus cycle is %"PRId64"ns, now 0x%016"PRIx64", "
+				  "timer initial count 0x%x, offset 0x%x, "
+				  "expire @ 0x%016"PRIx64".\n", __FUNCTION__,
+				  APIC_BUS_CYCLE_NS, ktime_to_ns(now),
+				  kvm_apic_get_reg(apic, APIC_TMICT),
+				  offset, ktime_to_ns(ktime_add_ns(now, offset)));
+	}
+		return;
+
+	case APIC_TDCR:
+	{
+		unsigned int tmp1, tmp2;
+
+		tmp1 = val & 0xf;
+		tmp2 = ((tmp1 & 0x3) | ((tmp1 & 0x8) >> 1)) + 1;
+		apic->timer_divide_count = 0x1 << (tmp2 & 0x7);
+
+		kvm_apic_set_reg(apic, APIC_TDCR, val);
+
+		pr_debug("timer divide count is 0x%x\n",
+		       apic->timer_divide_count);
+	}
+		break;
+
+	default:
+		printk(KERN_WARNING "Local APIC Write to read-only register\n");
+		break;
+	}
+
+	spin_unlock_bh(&apic->lock);
+}
+
+int kvm_apic_range(struct kvm_vcpu *vcpu, unsigned long addr)
+{
+	struct kvm_apic *apic = &vcpu->apic;
+	
+	spin_lock_bh(&apic->lock);
+
+	if (apic_global_enabled(apic) &&
+	    (addr >= apic->base_address) &&
+	    (addr < apic->base_address + VLOCAL_APIC_MEM_LENGTH)) {
+		spin_unlock_bh(&apic->lock);
+		return 1;
+	}
+	spin_unlock_bh(&apic->lock);
+	
+	return 0;
+}
+
+void kvm_apic_msr_set(struct kvm_apic *apic, u64 value)
+{
+	/* When apic disabled */
+	if (apic == NULL)
+		return;
+
+	spin_lock_bh(&apic->lock);
+	if (apic->vcpu_id )
+		value &= ~MSR_IA32_APICBASE_BSP;
+	
+	apic->apic_base_msr = value;
+	apic->base_address = apic->apic_base_msr & MSR_IA32_APICBASE_BASE;
+	
+	/* with FSB delivery interrupt, we can restart APIC functionality */
+	if (!(value & MSR_IA32_APICBASE_ENABLE))
+		set_bit(_APIC_GLOB_DISABLE, &apic->status);
+	else
+		clear_bit(_APIC_GLOB_DISABLE, &apic->status);
+	
+	pr_debug("apic base msr is 0x%016"PRIx64", and base address is 0x%lx.\n",
+			  apic->apic_base_msr, apic->base_address);
+
+	spin_unlock_bh(&apic->lock);
+}
+
+static int __apic_timer_fn(struct kvm_apic *apic)
+{
+	struct kvm_vcpu *vcpu;
+	u32 timer_vector;
+	ktime_t now;
+	int result = HRTIMER_NORESTART;
+
+	if (unlikely(!apic_enabled(apic) || !apic_lvt_enabled(apic, APIC_LVTT))) {
+		pr_debug("%s: time interrupt although apic is down\n", __FUNCTION__);
+		return HRTIMER_NORESTART;
+	}
+	
+	vcpu = apic->vcpu;
+	timer_vector = apic_lvt_vector(apic, APIC_LVTT);
+	now = apic->apic_timer.base->get_time();
+	apic->timer_last_update = now;
+	
+	if (test_and_set_bit(timer_vector, apic->regs + APIC_IRR)) {
+		apic->intr_pending_count[timer_vector]++;
+		pr_debug("%s: increasing intr_pending_count to %d\n" , __FUNCTION__, apic->intr_pending_count[timer_vector]);
+	}
+
+	if (apic_lvtt_period(apic)) {
+		u32 offset;
+		u32 tmict = kvm_apic_get_reg(apic, APIC_TMICT);
+			   
+		kvm_apic_set_reg(apic, APIC_TMCCT, tmict);
+		offset = APIC_BUS_CYCLE_NS * apic->timer_divide_count * tmict;
+                
+		result = HRTIMER_RESTART;
+		apic->apic_timer.expires = ktime_add_ns(now, offset);
+		
+		pr_debug("%s: now 0x%016"PRIx64", expire @ 0x%016"PRIx64", "
+		       "timer initial count 0x%x, timer current count 0x%x.\n",
+		       __FUNCTION__,
+		       ktime_to_ns(now), ktime_to_ns(apic->apic_timer.expires),
+		       kvm_apic_get_reg(apic, APIC_TMICT),
+	               kvm_apic_get_reg(apic, APIC_TMCCT));
+	} else {
+		kvm_apic_set_reg(apic, APIC_TMCCT, 0);
+		pr_debug("%s: now 0x%016"PRIx64", "
+		       "timer initial count 0x%x, timer current count 0x%x.\n",
+		       __FUNCTION__,
+		       ktime_to_ns(now), kvm_apic_get_reg(apic, APIC_TMICT),
+		       kvm_apic_get_reg(apic, APIC_TMCCT));
+	}
+	
+	wake_up_interruptible(&vcpu->halt_wq);
+	return result;
+}
+
+void apic_check_pending_timer(struct kvm_apic *apic)
+{
+	int timer_restart;
+
+	if (!apic || !apic_enabled(apic))
+		return;
+
+	if (!atomic_read(&apic->timer_pending))
+		return;
+
+	atomic_dec(&apic->timer_pending);
+	timer_restart = __apic_timer_fn(apic);
+	if (timer_restart == HRTIMER_RESTART) {
+		pr_debug("%s restarting timer\n", __FUNCTION__);
+		hrtimer_start(&apic->apic_timer, apic->apic_timer.expires, HRTIMER_ABS);
+	}
+}
+EXPORT_SYMBOL_GPL(apic_check_pending_timer);
+
+static void receive_apic_ipi(void *arg)
+{
+	int cpu = smp_processor_id();
+
+	pr_debug("%s: cpu(%d)\n", __FUNCTION__, cpu);
+}
+
+static int apic_timer_fn(struct hrtimer* timer)
+{
+	struct kvm_apic *apic;
+	int restart_timer = HRTIMER_NORESTART;
+	u32 apic_lock_owner;
+
+	apic = container_of(timer, struct kvm_apic, apic_timer);
+
+	while (!spin_trylock_bh(&apic->lock)) {
+		/*
+		 * Send an IPI in order cause vmexit for the lock holder
+		 * The IPI receiver will handle the timer function
+		 */
+		if ((apic_lock_owner = apic->pcpu_lock_owner) != -1) {
+			BUG_ON(smp_processor_id() == apic_lock_owner);
+			pr_debug("%s: cpu(%d) send ipi to %d\n",
+					  __FUNCTION__,
+					  smp_processor_id(),
+					  apic_lock_owner);
+
+			atomic_inc(&apic->timer_pending);
+			smp_call_function_single(apic_lock_owner,
+						 receive_apic_ipi,
+						 apic, 0, 1);
+			return restart_timer;
+		}
+		cpu_relax();
+	}
+
+	restart_timer = __apic_timer_fn(apic);
+	spin_unlock_bh(&apic->lock);
+
+	return restart_timer;
+}
+
+static int apic_read_irr(struct kvm_apic *apic)
+{
+	if (apic && apic_enabled(apic)) {
+		int highest_irr = apic_find_highest_irr(apic);
+	
+		if ((highest_irr & 0xf0) > kvm_apic_get_reg(apic, APIC_PROCPRI)) {
+			if (highest_irr < 0x10) {
+				u32 err_vector;
+	
+				apic->err_status |= 0x20;
+				err_vector = apic_lvt_vector(apic, APIC_LVTERR);
+	
+				pr_debug("Sending an illegal vector 0x%x.\n", highest_irr);
+	
+				set_bit(err_vector, apic->regs + APIC_IRR);
+				highest_irr = err_vector;
+			}
+	
+			return highest_irr;
+		}
+	}
+
+	return 0;
+}
+
+static void apic_post_interrupt(struct kvm_apic *apic, int vector, int deliver_mode)
+{
+	if (unlikely(apic == NULL))
+		return;
+	
+	switch (deliver_mode) {
+	case APIC_DM_FIXED:
+	case APIC_DM_LOWEST:
+		set_bit(vector, apic->regs + APIC_ISR);
+		clear_bit(vector, apic->regs + APIC_IRR);
+		apic_update_ppr(apic);
+	
+		if (vector == apic_lvt_vector(apic, APIC_LVTT)) {
+			apic->intr_pending_count[vector]--;
+			if (apic->intr_pending_count[vector] > 0)
+				test_and_set_bit(vector, apic->regs + APIC_IRR);
+		}
+		break;
+	
+	/*XXX deal with these later */
+	case APIC_DM_REMRD:
+		pr_debug("%s: Ignore deliver mode %d \n", __FUNCTION__, deliver_mode);
+		break;
+	
+	case APIC_DM_SMI:
+	case APIC_DM_NMI:
+	case APIC_DM_INIT:
+	case APIC_DM_STARTUP:
+	case APIC_DM_EXTINT:
+		apic->direct_intr.deliver_mode &= (1 << (deliver_mode >> 8));
+		break;
+	
+	default:
+		pr_debug("%s: invalid deliver mode\n", __FUNCTION__);
+		break;
+	}
+}
+
+static int apic_pending_irr(struct kvm_apic *apic)
+{
+	if (apic && apic_enabled(apic)) {
+		int highest_irr = apic_find_highest_irr(apic);
+	
+		if ((highest_irr & 0xF0) > kvm_apic_get_reg(apic, APIC_PROCPRI))
+			return 1;
+	}
+	return 0;
+}
+
+static int _apic_pending(struct kvm_irqdevice *this, int flags)
+{
+	struct kvm_apic *apic = (struct kvm_apic*)this->private;
+
+	if(!(flags & KVM_IRQFLAGS_NMI) && apic_pending_irr(apic))
+		return 1;
+
+	if(kvm_irq_pending(&apic->ext, flags))
+		return 1;
+	
+	return 0;
+}
+
+static int _apic_read(struct kvm_irqdevice *this, int flags, 
+		     struct kvm_irqinfo *info)
+{
+	struct kvm_apic *apic = (struct kvm_apic*)this->private;
+
+	int state = 0;
+	int vector;
+
+	/* We consider the IRR if the APIC is present, enabled, and we 
+	   are not filtering out based on NMI
+	*/
+	if(!(flags & KVM_IRQFLAGS_NMI) && apic_pending_irr(apic))
+		state |= 1;
+	if(kvm_irq_pending(&apic->ext, flags))
+		state |= 2;
+
+	switch(state) {
+	case 0: { 
+		/* No interrupts are pending */
+		return 0;
+	}
+	case 1: {
+		/* Only APIC interrupts are pending */
+		vector = apic_read_irr(apic);
+		apic_post_interrupt(apic, vector, APIC_DM_FIXED);
+		if(info) {
+			info->vector = vector;
+			info->nmi    = 0;
+		}
+		return vector;
+	}
+	case 2: {
+		/* Only external/NMI interrupts are pending */
+		return kvm_irq_read(&apic->ext, flags, info);
+	}
+	case 3: {
+		/* We have a conflict.  Figure out which is higher pri */
+		int highest_apic_irq = apic_read_irr(apic);
+		int highest_ext_irq  = kvm_irq_read(&apic->ext, 
+						    KVM_IRQFLAGS_PEEK, 
+						    info);
+		if(highest_apic_irq > highest_ext_irq) {
+			apic_post_interrupt(apic, highest_apic_irq, 
+					    APIC_DM_FIXED);
+			if(info) {
+				info->vector = highest_apic_irq;
+				info->nmi    = 0;
+			}
+			return highest_apic_irq;
+		} else {
+			return kvm_irq_read(&apic->ext, flags, info);
+		}
+	}
+	}
+
+	return 0;
+}
+
+static void kvm_apic_dump_state(struct kvm_apic *apic)
+{
+	u64 *tmp;
+
+	printk(KERN_INFO "%s begin\n", __FUNCTION__);
+	
+	printk(KERN_INFO "status = 0x%08x\n", apic->status);
+	printk(KERN_INFO "apic_base_msr=0x%016llx, apicbase = 0x%08lx\n", apic->apic_base_msr, apic->base_address);
+	
+        tmp = (u64*)(apic->regs + APIC_IRR);
+	printk(KERN_INFO "IRR = 0x%016llx 0x%016llx 0x%016llx 0x%016llx\n", tmp[3], tmp[2], tmp[1], tmp[0]);
+	tmp = (u64*)(apic->regs + APIC_ISR);
+	printk(KERN_INFO "ISR = 0x%016llx 0x%016llx 0x%016llx 0x%016llx\n", tmp[3], tmp[2], tmp[1], tmp[0]);
+	tmp = (u64*)(apic->regs + APIC_TMR);
+	printk(KERN_INFO "TMR = 0x%016llx 0x%016llx 0x%016llx 0x%016llx\n", tmp[3], tmp[2], tmp[1], tmp[0]);
+
+	printk(KERN_INFO "APIC_ID=0x%08x\n", kvm_apic_get_reg(apic, APIC_ID));
+	printk(KERN_INFO "APIC_TASKPRI=0x%08x\n", kvm_apic_get_reg(apic, APIC_TASKPRI) & 0xff);
+	printk(KERN_INFO "APIC_PROCPRI=0x%08x\n", kvm_apic_get_reg(apic, APIC_PROCPRI));
+	
+	printk(KERN_INFO "APIC_DFR=0x%08x\n", kvm_apic_get_reg(apic, APIC_DFR) | 0x0FFFFFFF);
+	printk(KERN_INFO "APIC_LDR=0x%08x\n", kvm_apic_get_reg(apic, APIC_LDR) & APIC_LDR_MASK);
+	printk(KERN_INFO "APIC_SPIV=0x%08x\n", kvm_apic_get_reg(apic, APIC_SPIV) & 0x3ff);
+        printk(KERN_INFO "APIC_ESR=0x%08x\n", kvm_apic_get_reg(apic, APIC_ESR));
+	printk(KERN_INFO "APIC_ICR=0x%08x\n", kvm_apic_get_reg(apic, APIC_ICR) & ~(1 << 12));
+	printk(KERN_INFO "APIC_ICR2=0x%08x\n", kvm_apic_get_reg(apic, APIC_ICR2) & 0xff000000);
+
+	printk(KERN_INFO "APIC_LVTERR=0x%08x\n", kvm_apic_get_reg(apic, APIC_LVTERR));
+	printk(KERN_INFO "APIC_LVT1=0x%08x\n", kvm_apic_get_reg(apic, APIC_LVT1));
+	printk(KERN_INFO "APIC_LVT0=0x%08x\n", kvm_apic_get_reg(apic, APIC_LVT0));
+	printk(KERN_INFO "APIC_LVTPC=0x%08x\n", kvm_apic_get_reg(apic, APIC_LVTPC));
+	printk(KERN_INFO "APIC_LVTTHMR=0x%08x\n", kvm_apic_get_reg(apic, APIC_LVTTHMR));
+	printk(KERN_INFO "APIC_LVTT=0x%08x\n", kvm_apic_get_reg(apic, APIC_LVTT));
+	
+	printk(KERN_INFO "APIC_TMICT=0x%08x\n", kvm_apic_get_reg(apic, APIC_TMICT));
+	printk(KERN_INFO "APIC_TDCR=0x%08x\n", kvm_apic_get_reg(apic, APIC_TDCR));
+        
+	printk(KERN_INFO "%s end\n", __FUNCTION__);
+}
+
+#if 0
+/*
+ * kvm_apic_set not need to be locked with the apic->lock since it is called when the
+ * guest is stopped.
+ */
+int kvm_apic_set(struct kvm_apic *apic, struct kvm_apic_state* as)
+{
+	pr_debug("%s begin\n", __FUNCTION__);
+	
+	apic->vcpu_id = as->vcpu;
+	apic->status = as->status;
+	kvm_apic_reset(apic);
+	kvm_apic_msr_set(apic, as->apicbase);
+	memset(&apic->intr_pending_count, 0, sizeof(int) * MAX_APIC_INT_VECTOR);
+	memcpy((void*)(apic->regs + APIC_IRR), &as->irr, sizeof(as->irr));
+	memcpy((void*)(apic->regs + APIC_ISR), &as->isr, sizeof(as->isr));
+	memcpy((void*)(apic->regs + APIC_TMR), &as->tmr, sizeof(as->tmr));
+	kvm_apic_set_reg(apic, APIC_ID, as->id);
+	kvm_apic_set_reg(apic, APIC_TASKPRI, as->tpr & 0xff);
+	kvm_apic_set_reg(apic, APIC_PROCPRI, as->ppr);
+	
+	kvm_apic_set_reg(apic, APIC_DFR, as->dfr | 0x0FFFFFFF);
+	kvm_apic_set_reg(apic, APIC_LDR, as->ldr & APIC_LDR_MASK);
+	kvm_apic_set_reg(apic, APIC_SPIV, as->spurious_vec & 0x3ff);
+        kvm_apic_set_reg(apic, APIC_ESR, as->esr);
+	kvm_apic_set_reg(apic, APIC_ICR, as->icr[0] & ~(1 << 12));
+	kvm_apic_set_reg(apic, APIC_ICR2, as->icr[1] & 0xff000000);
+
+	kvm_apic_set_reg(apic, APIC_LVTERR, as->lvterr);
+	kvm_apic_set_reg(apic, APIC_LVT1, as->lvt1);
+	kvm_apic_set_reg(apic, APIC_LVT0, as->lvt0);
+	kvm_apic_set_reg(apic, APIC_LVTPC, as->lvtpc);
+	kvm_apic_set_reg(apic, APIC_LVTTHMR, as->lvtthmr);
+	kvm_apic_set_reg(apic, APIC_LVTT, as->lvtt);
+	
+	kvm_apic_write(apic->vcpu, (unsigned long)(apic->regs + APIC_TDCR), 4, as->divide_conf);
+	kvm_apic_write(apic->vcpu, (unsigned long)(apic->regs + APIC_TMICT), 4, as->initial_count);
+
+	pr_debug("%s end\n", __FUNCTION__);
+	kvm_apic_dump_state(apic);
+	return 0;
+}
+
+int kvm_apic_get(struct kvm_apic *apic, struct kvm_apic_state* as)
+{
+	pr_debug("%s begin apic=%p\n", __FUNCTION__, apic);
+
+	spin_lock_bh(&apic->lock);
+	as->status = apic->status;
+	as->apicbase = apic->apic_base_msr;
+        memcpy(as->irr, (void*)(apic->regs + APIC_IRR), sizeof(as->irr));
+	memcpy(as->isr, (void*)(apic->regs + APIC_ISR), sizeof(as->isr));
+	memcpy(as->tmr, (void*)(apic->regs + APIC_TMR), sizeof(as->tmr));
+
+        as->id = kvm_apic_get_reg(apic, APIC_ID);
+	as->tpr = kvm_apic_get_reg(apic, APIC_TASKPRI) & 0xff;
+	as->ppr = kvm_apic_get_reg(apic, APIC_PROCPRI);
+        
+	as->dfr = kvm_apic_get_reg(apic, APIC_DFR) | 0x0FFFFFFF;
+	as->ldr = kvm_apic_get_reg(apic, APIC_LDR) & APIC_LDR_MASK;
+	as->spurious_vec = kvm_apic_get_reg(apic, APIC_SPIV) & 0x3ff;
+        as->esr = kvm_apic_get_reg(apic, APIC_ESR);
+	as->icr[0] = kvm_apic_get_reg(apic, APIC_ICR) & ~(1 << 12);
+	as->icr[1] = kvm_apic_get_reg(apic, APIC_ICR2) & 0xff000000;
+
+	as->lvterr = kvm_apic_get_reg(apic, APIC_LVTERR);
+	as->lvt1 = kvm_apic_get_reg(apic, APIC_LVT1);
+	as->lvt0 = kvm_apic_get_reg(apic, APIC_LVT0);
+	as->lvtpc = kvm_apic_get_reg(apic, APIC_LVTPC);
+	as->lvtthmr = kvm_apic_get_reg(apic, APIC_LVTTHMR);
+	as->lvtt = kvm_apic_get_reg(apic, APIC_LVTT);
+	as->initial_count = kvm_apic_get_reg(apic, APIC_TMICT);
+	as->divide_conf = kvm_apic_get_reg(apic, APIC_TDCR);
+        
+	kvm_apic_dump_state(apic);
+
+	spin_unlock_bh(&apic->lock);
+	return 0;
+}
+#endif
+
+int kvm_apic_reset(struct kvm_apic *apic)
+{
+	struct kvm_vcpu *vcpu;
+	int i;
+
+	printk(KERN_INFO "%s\n", __FUNCTION__);
+	ASSERT(apic != NULL);
+	vcpu = apic->vcpu;
+	ASSERT(vcpu != NULL);
+	
+
+	/* Stop the timer in case it's a reset an active apic */
+	if (apic->apic_timer.function)
+		hrtimer_cancel(&apic->apic_timer);
+
+	spin_lock_bh(&apic->lock);
+
+	kvm_apic_set_reg(apic, APIC_ID, vcpu_slot(vcpu) << 24);
+	kvm_apic_set_reg(apic, APIC_LVR, APIC_VERSION);
+
+	for (i = 0; i < APIC_LVT_NUM; i++)
+		kvm_apic_set_reg(apic, APIC_LVTT + 0x10 * i, APIC_LVT_MASKED);
+
+	kvm_apic_set_reg(apic, APIC_DFR, 0xffffffffU);
+	kvm_apic_set_reg(apic, APIC_SPIV, 0xff); 
+	kvm_apic_set_reg(apic, APIC_TASKPRI, 0);
+	kvm_apic_set_reg(apic, APIC_LDR, 0);
+        kvm_apic_set_reg(apic, APIC_ESR, 0);
+	kvm_apic_set_reg(apic, APIC_ICR, 0);
+	kvm_apic_set_reg(apic, APIC_ICR2, 0);
+	kvm_apic_set_reg(apic, APIC_TDCR, 0);
+	kvm_apic_set_reg(apic, APIC_TMICT, 0);
+	memset((void*)(apic->regs + APIC_IRR), 0, KVM_IRQ_BITMAP_SIZE(u8));
+	memset((void*)(apic->regs + APIC_ISR), 0, KVM_IRQ_BITMAP_SIZE(u8));
+	memset((void*)(apic->regs + APIC_TMR), 0, KVM_IRQ_BITMAP_SIZE(u8));
+	
+	apic->apic_base_msr = MSR_IA32_APICBASE_ENABLE | APIC_DEFAULT_PHYS_BASE;
+        if (vcpu_slot(vcpu) == 0)
+		apic->apic_base_msr |= MSR_IA32_APICBASE_BSP;
+        apic->base_address = apic->apic_base_msr & MSR_IA32_APICBASE_BASE;
+
+	hrtimer_init(&apic->apic_timer, CLOCK_MONOTONIC, HRTIMER_ABS);
+	apic->apic_timer.function = apic_timer_fn;
+	apic->timer_divide_count = 0;
+	apic->status = 0;
+	apic->pcpu_lock_owner = -1;
+        memset(&apic->intr_pending_count, 0, sizeof(int) * MAX_APIC_INT_VECTOR);
+
+#ifdef APIC_NO_BIOS
+	/*
+	 * XXX According to mp specific, BIOS will enable LVT0/1,
+	 * remove it after BIOS enabled
+	 */
+	if (!vcpu_slot(vcpu)) {
+		kvm_apic_set_reg(apic, APIC_LVT0, APIC_MODE_EXTINT << 8);
+		kvm_apic_set_reg(apic, APIC_LVT1, APIC_MODE_NMI << 8);
+		set_bit(_APIC_BSP_ACCEPT_PIC, &apic->status);
+	}
+#endif
+
+	spin_unlock_bh(&apic->lock);
+
+	printk(KERN_INFO  "%s: vcpu=%p, id=%d, apic_apic_base_msr=0x%016"PRIx64", "
+			  "base_address=0x%0lx.\n",
+			  __FUNCTION__, vcpu,  GET_APIC_ID(kvm_apic_get_reg(apic, APIC_ID)),
+			  apic->apic_base_msr, apic->base_address);
+	
+	return 1;
+}
+
+static int _apic_summary(struct kvm_irqdevice *this, void *data)
+{
+	struct kvm_apic *apic = (struct kvm_apic*)this->private;
+	
+	/* FIXME */
+	return kvm_irq_summary(&apic->ext, data);
+}
+
+/*
+ * Should be called with the apic lock held
+ */
+static void _apic_destructor(struct kvm_irqdevice *this)
+{
+	struct kvm_apic *apic = (struct kvm_apic*)this->private;
+
+	if (apic->regs_page) {
+		if (apic->apic_timer.function)
+			hrtimer_cancel(&apic->apic_timer);
+		__free_page(apic->regs_page);
+		apic->regs_page = 0;
+	}
+}
+
+/*
+ * Should be called by vcpu_setup
+ */
+int kvm_apic_init(struct kvm_vcpu *vcpu)
+{
+	struct kvm_irqdevice *dev = &vcpu->irq_dev;
+
+	dev->pending    = _apic_pending;
+	dev->read       = _apic_read;
+	dev->inject     = _apic_inject;
+	dev->summary    = _apic_summary;
+	dev->destructor = _apic_destructor;
+
+	dev->private = &vcpu->apic;
+
+	struct kvm_apic *apic = &vcpu->apic;
+	
+	ASSERT(vcpu != NULL);
+	pr_debug("apic_init %d\n", vcpu_slot(vcpu));
+	
+	apic->regs_page = alloc_page(GFP_KERNEL);
+	if ( apic->regs_page == NULL ) {
+		printk(KERN_ALERT "malloc apic regs error for vcpu %x\n", vcpu_slot(vcpu));
+		return -ENOMEM;
+	}
+	apic->regs = page_address(apic->regs_page);
+	memset(apic->regs, 0, PAGE_SIZE);
+
+	apic->vcpu = vcpu;
+	spin_lock_init(&apic->lock);
+	
+	kvm_apic_reset(apic);
+	return 0;
+}
+
+
diff --git a/drivers/kvm/kvm_apic.h b/drivers/kvm/kvm_apic.h
new file mode 100644
index 0000000..b5e6d72
--- /dev/null
+++ b/drivers/kvm/kvm_apic.h
@@ -0,0 +1,142 @@
+/*
+ * kvm_apic.h: Local APIC virtualization 
+ *
+ * Copyright (C) 2006 Qumranet, Inc.
+ *
+ * Authors:
+ *   Dor Laor   <dor.laor-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
+ *
+ * Copyright (c) 2004, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ */
+
+#ifndef __KVM_APIC_H__
+#define __KVM_APIC_H__
+
+#include "kvm.h"
+#include <asm/msr.h>
+#include <asm/apicdef.h>
+
+static __inline__ int find_highest_bit(unsigned long *data, int nr_bits)
+{
+	int length = BITS_TO_LONGS(nr_bits);
+	while (length && !data[--length])
+		continue;
+	return __ffs(data[length]) + (length * BITS_PER_LONG);
+}
+
+#define APIC_LVT_NUM			6
+/* 14 is the version for Xeon and Pentium 8.4.8*/
+#define APIC_VERSION			(0x14UL | ((APIC_LVT_NUM - 1) << 16))
+#define VLOCAL_APIC_MEM_LENGTH		(1 << 12)
+/* followed define is not in apicdef.h */
+#define APIC_SHORT_MASK			0xc0000
+#define APIC_DEST_NOSHORT		0x0
+#define APIC_DEST_MASK			0x800
+#define _APIC_GLOB_DISABLE		0x0
+#define APIC_GLOB_DISABLE_MASK		0x1
+#define APIC_SOFTWARE_DISABLE_MASK	0x2
+#define _APIC_BSP_ACCEPT_PIC		0x3
+
+#define apic_enabled(apic)              \
+	(!((apic)->status &                   \
+	   (APIC_GLOB_DISABLE_MASK | APIC_SOFTWARE_DISABLE_MASK)))
+
+#define apic_global_enabled(apic)       \
+	(!(test_bit(_APIC_GLOB_DISABLE, &(apic)->status)))
+
+#define LVT_MASK \
+	APIC_LVT_MASKED | APIC_SEND_PENDING | APIC_VECTOR_MASK
+
+#define LINT_MASK   \
+	LVT_MASK | APIC_MODE_MASK | APIC_INPUT_POLARITY |\
+	APIC_LVT_REMOTE_IRR | APIC_LVT_LEVEL_TRIGGER
+
+#define KVM_APIC_ID(apic)   \
+	(GET_APIC_ID(kvm_apic_get_reg(apic, APIC_ID)))
+
+#define apic_lvt_enabled(apic, lvt_type)    \
+	(!(kvm_apic_get_reg(apic, lvt_type) & APIC_LVT_MASKED))
+
+#define apic_lvt_vector(apic, lvt_type)     \
+	(kvm_apic_get_reg(apic, lvt_type) & APIC_VECTOR_MASK)
+
+#define apic_lvt_dm(apic, lvt_type)           \
+	(kvm_apic_get_reg(apic, lvt_type) & APIC_MODE_MASK)
+
+#define apic_lvtt_period(apic)     \
+	(kvm_apic_get_reg(apic, APIC_LVTT) & APIC_LVT_TIMER_PERIODIC)
+
+
+static inline int kvm_apic_set_irq(struct kvm_apic *apic, u8 vec, u8 trigger)
+{
+	int ret;
+	
+	ret = test_and_set_bit(vec, apic->regs + APIC_IRR);
+	if (trigger)
+		set_bit(vec, apic->regs + APIC_TMR);
+	
+	/* We may need to wake up target vcpu, besides set pending bit here */
+	return ret;
+}
+
+static inline u32 kvm_apic_get_reg(struct kvm_apic *apic, u32 reg)
+{
+	return *((u32 *)(apic->regs + reg));
+}
+
+static inline void kvm_apic_set_reg(struct kvm_apic *apic, u32 reg, u32 val)
+{
+	*((u32 *)(apic->regs + reg)) = val;
+}
+
+
+void kvm_apic_post_injection(struct kvm_vcpu* vcpu, int vector, int deliver_mode);
+
+int kvm_cpu_get_apic_interrupt(struct kvm_vcpu* vcpu);
+int kvm_cpu_has_pending_irq(struct kvm_vcpu *vcpu);
+
+extern int kvm_apic_init(struct kvm_vcpu *vcpu);
+
+extern void kvm_apic_msr_set(struct kvm_apic *apic, u64 value);
+
+int kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu);
+
+struct kvm_apic* kvm_apic_round_robin(struct kvm_vcpu *vcpu,
+				      u8 dest_mode,
+				      u8 vector,
+				      u32 bitmap);
+
+u64 kvm_get_apictime_scheduled(struct kvm_vcpu *vcpu);
+
+int kvm_apic_range(struct kvm_vcpu *vcpu, unsigned long addr);
+void kvm_apic_write(struct kvm_vcpu *vcpu, unsigned long address,
+		    unsigned long len, unsigned long val);
+unsigned long kvm_apic_read(struct kvm_vcpu *vcpu, unsigned long address,
+			    unsigned long len);
+void kvm_free_apic(struct kvm_vcpu *vcpu);
+unsigned long kvm_apic_read_tpr(struct kvm_apic* apic);
+void kvm_apic_update_tpr(struct kvm_apic *apic, unsigned long cr8);
+
+#if 0
+int kvm_apic_receive_msg(struct kvm_vcpu *vcpu, struct kvm_apic_msg *msg);
+int kvm_apic_reset(struct kvm_apic *apic);
+
+int kvm_apic_get(struct kvm_apic *apic, struct kvm_apic_state* as);
+int kvm_apic_set(struct kvm_apic *apic, struct kvm_apic_state* as);
+#endif
+
+void apic_check_pending_timer(struct kvm_apic *apic);
+#endif /* __KVM_APIC_H__ */
diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index 4473174..0df9070 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -16,6 +16,7 @@
  */
 
 #include "kvm.h"
+#include "kvm_apic.h"
 
 #include <linux/kvm.h>
 #include <linux/module.h>
@@ -381,6 +382,22 @@ static void kvm_free_vcpus(struct kvm *kvm)
 		kvm_free_vcpu(&kvm->vcpus[i]);
 }
 
+/*
+ * The function kills a guest while there still is a user space processes
+ * with a descriptor to it
+ */
+void kvm_crash_guest(struct kvm *kvm)
+{
+	unsigned int i;
+
+	for (i = 0; i < KVM_MAX_VCPUS; ++i) {
+		/* FIXME: in the future it should send IPI to gracefully stop the other
+		 * vCPUs
+		 */
+		kvm_free_vcpu(&kvm->vcpus[i]);
+	}
+}
+
 static int kvm_dev_release(struct inode *inode, struct file *filp)
 {
 	return 0;
@@ -597,7 +614,7 @@ void set_cr8(struct kvm_vcpu *vcpu, unsigned long cr8)
 		inject_gp(vcpu);
 		return;
 	}
-	vcpu->cr8 = cr8;
+	kvm_apic_update_tpr(&vcpu->apic, cr8);
 }
 EXPORT_SYMBOL_GPL(set_cr8);
 
@@ -1015,12 +1032,37 @@ static int emulator_write_std(unsigned long addr,
 	return X86EMUL_UNHANDLEABLE;
 }
 
+struct kvm_mmio_handler {
+	unsigned long (*read)(struct kvm_vcpu *v,
+			      unsigned long addr,
+			      unsigned long length);
+	void (*write)(struct kvm_vcpu *v,
+			   unsigned long addr,
+			   unsigned long length,
+			   unsigned long val);
+	int (*in_range)(struct kvm_vcpu *v, unsigned long addr);
+};
+
+struct kvm_mmio_handler apic_mmio_handler = {
+	kvm_apic_read,
+	kvm_apic_write,
+	kvm_apic_range,
+};
+
+#define KVM_MMIO_HANDLER_NR_ARRAY_SIZE 1
+static struct kvm_mmio_handler *kvm_mmio_handlers[KVM_MMIO_HANDLER_NR_ARRAY_SIZE] =
+{
+    &apic_mmio_handler,
+};
+
 static int emulator_read_emulated(unsigned long addr,
 				  unsigned long *val,
 				  unsigned int bytes,
 				  struct x86_emulate_ctxt *ctxt)
 {
 	struct kvm_vcpu *vcpu = ctxt->vcpu;
+	gpa_t gpa;
+	int i;
 
 	if (vcpu->mmio_read_completed) {
 		memcpy(val, vcpu->mmio_data, bytes);
@@ -1029,18 +1071,23 @@ static int emulator_read_emulated(unsigned long addr,
 	} else if (emulator_read_std(addr, val, bytes, ctxt)
 		   == X86EMUL_CONTINUE)
 		return X86EMUL_CONTINUE;
-	else {
-		gpa_t gpa = vcpu->mmu.gva_to_gpa(vcpu, addr);
+        
+	gpa = vcpu->mmu.gva_to_gpa(vcpu, addr);
+	if (gpa == UNMAPPED_GVA)
+		return vcpu_printf(vcpu, "not present\n"), X86EMUL_PROPAGATE_FAULT;
 
-		if (gpa == UNMAPPED_GVA)
-			return X86EMUL_PROPAGATE_FAULT;
-		vcpu->mmio_needed = 1;
-		vcpu->mmio_phys_addr = gpa;
-		vcpu->mmio_size = bytes;
-		vcpu->mmio_is_write = 0;
+	for (i = 0; i < KVM_MMIO_HANDLER_NR_ARRAY_SIZE; i++)
+		if (kvm_mmio_handlers[i]->in_range(vcpu, gpa)) {
+			*val = kvm_mmio_handlers[i]->read(vcpu, gpa, bytes);
+			return X86EMUL_CONTINUE;
+		}
 
-		return X86EMUL_UNHANDLEABLE;
-	}
+	vcpu->mmio_needed = 1;
+	vcpu->mmio_phys_addr = gpa;
+	vcpu->mmio_size = bytes;
+	vcpu->mmio_is_write = 0;
+
+	return X86EMUL_UNHANDLEABLE;
 }
 
 static int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
@@ -1070,6 +1117,7 @@ static int emulator_write_emulated(unsigned long addr,
 {
 	struct kvm_vcpu *vcpu = ctxt->vcpu;
 	gpa_t gpa = vcpu->mmu.gva_to_gpa(vcpu, addr);
+	int i;
 
 	if (gpa == UNMAPPED_GVA)
 		return X86EMUL_PROPAGATE_FAULT;
@@ -1077,6 +1125,12 @@ static int emulator_write_emulated(unsigned long addr,
 	if (emulator_write_phys(vcpu, gpa, val, bytes))
 		return X86EMUL_CONTINUE;
 
+	for (i = 0; i < KVM_MMIO_HANDLER_NR_ARRAY_SIZE; i++)
+		if (kvm_mmio_handlers[i]->in_range(vcpu, gpa)) {
+			kvm_mmio_handlers[i]->write(vcpu, gpa, bytes, val);
+			return X86EMUL_CONTINUE;
+		}
+
 	vcpu->mmio_needed = 1;
 	vcpu->mmio_phys_addr = gpa;
 	vcpu->mmio_size = bytes;
@@ -1479,7 +1533,11 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
 		data = 3;
 		break;
 	case MSR_IA32_APICBASE:
-		data = vcpu->apic_base;
+		data = vcpu->apic.apic_base_msr;
+		break;
+	case MSR_IA32_TIME_STAMP_COUNTER:
+		// FIXME
+                //data = guest_read_tsc();
 		break;
 	case MSR_IA32_MISC_ENABLE:
 		data = vcpu->ia32_misc_enable_msr;
@@ -1557,7 +1615,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 	case 0x200 ... 0x2ff: /* MTRRs */
 		break;
 	case MSR_IA32_APICBASE:
-		vcpu->apic_base = data;
+		kvm_apic_msr_set(&vcpu->apic, data);
 		break;
 	case MSR_IA32_MISC_ENABLE:
 		vcpu->ia32_misc_enable_msr = data;
@@ -1812,9 +1870,6 @@ static int kvm_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 	if (vcpu->sigset_active)
 		sigprocmask(SIG_SETMASK, &vcpu->sigset, &sigsaved);
 
-	/* re-sync apic's tpr */
-	vcpu->cr8 = kvm_run->cr8;
-
 	if (kvm_run->io_completed) {
 		if (vcpu->pio.count) {
 			r = complete_pio(vcpu);
@@ -1953,12 +2008,11 @@ static int kvm_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu,
 	sregs->cr2 = vcpu->cr2;
 	sregs->cr3 = vcpu->cr3;
 	sregs->cr4 = vcpu->cr4;
-	sregs->cr8 = vcpu->cr8;
+	sregs->cr8 = kvm_apic_read_tpr(&vcpu->apic);
 	sregs->efer = vcpu->shadow_efer;
-	sregs->apic_base = vcpu->apic_base;
+	sregs->apic_base = vcpu->apic.apic_base_msr;
 
-	memcpy(sregs->interrupt_bitmap, vcpu->irq_pending,
-	       sizeof sregs->interrupt_bitmap);
+	kvm_vcpu_irq_summary(vcpu, &sregs->interrupt_bitmap);
 
 	vcpu_put(vcpu);
 
@@ -1991,13 +2045,13 @@ static int kvm_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
 	mmu_reset_needed |= vcpu->cr3 != sregs->cr3;
 	vcpu->cr3 = sregs->cr3;
 
-	vcpu->cr8 = sregs->cr8;
+	kvm_apic_update_tpr(&vcpu->apic, sregs->cr8);
 
 	mmu_reset_needed |= vcpu->shadow_efer != sregs->efer;
 #ifdef CONFIG_X86_64
 	kvm_arch_ops->set_efer(vcpu, sregs->efer);
 #endif
-	vcpu->apic_base = sregs->apic_base;
+	kvm_apic_msr_set(&vcpu->apic, sregs->apic_base);
 
 	kvm_arch_ops->decache_cr0_cr4_guest_bits(vcpu);
 
@@ -2012,12 +2066,18 @@ static int kvm_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
 	if (mmu_reset_needed)
 		kvm_mmu_reset_context(vcpu);
 
-	memcpy(vcpu->irq_pending, sregs->interrupt_bitmap,
-	       sizeof vcpu->irq_pending);
-	vcpu->irq_summary = 0;
-	for (i = 0; i < NR_IRQ_WORDS; ++i)
-		if (vcpu->irq_pending[i])
-			__set_bit(i, &vcpu->irq_summary);
+	/* walk the interrupt-bitmap and inject an IRQ for each bit found */
+	for (i = 0; i < NR_IRQ_WORDS; ++i) {
+		unsigned long word = sregs->interrupt_bitmap[i];
+		while(word) {
+			int bit_index = __ffs(word);
+			int irq = i * BITS_PER_LONG + bit_index;
+			
+			kvm_vcpu_irq_inject(vcpu, irq, 0);
+
+			clear_bit(bit_index, &word);
+		}
+	}
 
 	set_segment(vcpu, &sregs->cs, VCPU_SREG_CS);
 	set_segment(vcpu, &sregs->ds, VCPU_SREG_DS);
@@ -2178,14 +2238,8 @@ static int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu,
 {
 	if (irq->irq < 0 || irq->irq >= 256)
 		return -EINVAL;
-	vcpu_load(vcpu);
-
-	set_bit(irq->irq, vcpu->irq_pending);
-	set_bit(irq->irq / BITS_PER_LONG, &vcpu->irq_summary);
 
-	vcpu_put(vcpu);
-
-	return 0;
+	return kvm_vcpu_irq_inject(vcpu, irq->irq, 0);
 }
 
 static int kvm_vcpu_ioctl_debug_guest(struct kvm_vcpu *vcpu,
@@ -2332,10 +2386,16 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, int n)
 	if (r < 0)
 		goto out_free_vcpus;
 
+	if(kvm->options.apic_enabled)
+	    kvm_apic_init(vcpu);
+	else
+	    kvm_userint_init(vcpu);
+
 	kvm_arch_ops->vcpu_load(vcpu);
 	r = kvm_mmu_setup(vcpu);
-	if (r >= 0)
+	if (r >= 0) {
 		r = kvm_arch_ops->vcpu_setup(vcpu);
+	}
 	vcpu_put(vcpu);
 
 	if (r < 0)
@@ -2672,6 +2732,20 @@ static long kvm_vm_ioctl(struct file *filp,
 			goto out;
 		break;
 	}
+	case KVM_GET_OPTIONS: {
+		r = -EFAULT;
+		if (copy_to_user(&kvm->options, argp, sizeof(kvm->options)))
+		    goto out;
+		r = 0;
+		break;
+	}
+	case KVM_SET_OPTIONS: {
+		r = -EFAULT;
+		if (copy_from_user(&kvm->options, argp, sizeof(kvm->options)))
+		    goto out;
+		r = 0;
+		break;
+	}
 	default:
 		;
 	}
diff --git a/drivers/kvm/kvm_userint.c b/drivers/kvm/kvm_userint.c
new file mode 100644
index 0000000..218ae81
--- /dev/null
+++ b/drivers/kvm/kvm_userint.c
@@ -0,0 +1,165 @@
+/*
+ * kvm_userint.c: User Interrupts IRQ device - This acts as an extention
+ *                of an interrupt controller that exists elsewhere (typically
+ *                in userspace/QEMU)
+ *
+ * Copyright (C) 2007 Qumranet
+ * Copyright (C) 2007 Novell
+ *
+ * Authors:
+ *   Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ *
+ */
+#include "kvm.h"
+
+/*----------------------------------------------------------------------
+ * optimized bitarray object - works like bitarrays in bitops, but uses 
+ * a summary field to accelerate lookups 
+ *---------------------------------------------------------------------*/
+
+struct bitarray {
+    unsigned long summary; /* 1 per word in pending */
+    unsigned long pending[NR_IRQ_WORDS];
+};
+
+static int bitarray_pending(struct bitarray *this)
+{
+	return this->summary ? 1 : 0;	
+}
+
+static int bitarray_findhighest(struct bitarray *this)
+{
+	if(!this->summary)
+		return 0;
+
+	int word_index = __ffs(this->summary);
+	int bit_index = __ffs(this->pending[word_index]);
+
+	return word_index * BITS_PER_LONG + bit_index;	
+}
+
+static void bitarray_set(struct bitarray *this, int nr)
+{
+	set_bit(nr, &this->pending);
+	set_bit(nr / BITS_PER_LONG, &this->summary); 
+} 
+
+static void bitarray_clear(struct bitarray *this, int nr)
+{
+	int word = nr / BITS_PER_LONG;
+
+	clear_bit(nr, &this->pending);
+	if(!this->pending[word])
+		clear_bit(word, &this->summary);
+}
+
+static int bitarray_test(struct bitarray *this, int nr)
+{
+	return test_bit(nr, &this->pending);
+}
+
+/*----------------------------------------------------------------------
+ * userint interface - provides the actual kvm_irqdevice implementation
+ *---------------------------------------------------------------------*/
+
+typedef struct {
+	struct bitarray irq_pending;
+	struct bitarray nmi_pending;
+}kvm_userint;
+
+static int userint_pending(struct kvm_irqdevice *this, int flags)
+{
+	kvm_userint *s = (kvm_userint*)this->private;
+
+	if(flags & KVM_IRQFLAGS_NMI)
+		return bitarray_pending(&s->nmi_pending);
+	else
+		return bitarray_pending(&s->irq_pending);
+}
+
+static int userint_read(struct kvm_irqdevice *this, int flags, 
+			struct kvm_irqinfo *info)
+{
+	kvm_userint *s = (kvm_userint*)this->private;
+	int          irq;
+
+	if(flags & KVM_IRQFLAGS_NMI)
+		irq = bitarray_findhighest(&s->nmi_pending);
+	else
+		irq = bitarray_findhighest(&s->irq_pending);
+
+	if(!irq)
+		return 0;
+
+	if(info) {
+		info->vector = irq;
+		if(bitarray_test(&s->nmi_pending, irq))
+			info->nmi = 1;
+		else
+			info->nmi = 0;
+	}
+
+	if(!(flags & KVM_IRQFLAGS_PEEK)) {
+		/* If the "peek" flag is not set, automatically clear the 
+		   interrupt as the EOI mechanism (if any) will take place 
+		   in userspace */
+		bitarray_clear(&s->irq_pending, irq);
+		bitarray_clear(&s->nmi_pending, irq);
+	}
+
+	return irq;
+}
+
+static int userint_inject(struct kvm_irqdevice* this, int irq, int flags)
+{
+	kvm_userint *s = (kvm_userint*)this->private;
+
+	bitarray_set(&s->irq_pending, irq);
+	if(flags & KVM_IRQFLAGS_NMI)
+		bitarray_set(&s->nmi_pending, irq);
+
+	return 0;
+}
+
+static int userint_summary(struct kvm_irqdevice* this, void *data)
+{	
+	kvm_userint *s = (kvm_userint*)this->private;
+
+	memcpy(data, s->irq_pending.pending, sizeof s->irq_pending.pending);
+
+	return 0;
+}
+
+static void userint_destructor(struct kvm_irqdevice *this)
+{
+	kfree(this->private);
+}
+
+int kvm_userint_init(struct kvm_vcpu *vcpu)
+{
+	struct kvm_irqdevice *dev = &vcpu->irq_dev;
+
+	dev->pending    = userint_pending;
+	dev->read       = userint_read;
+	dev->inject     = userint_inject;
+	dev->summary    = userint_summary;
+	dev->destructor = userint_destructor;
+
+	dev->private = kzalloc(sizeof(kvm_userint), GFP_KERNEL);
+
+	return 0;
+}
+
diff --git a/drivers/kvm/svm.c b/drivers/kvm/svm.c
index b7e1410..0e0a291 100644
--- a/drivers/kvm/svm.c
+++ b/drivers/kvm/svm.c
@@ -22,6 +22,7 @@
 #include <asm/desc.h>
 
 #include "kvm_svm.h"
+#include "kvm_apic.h"
 #include "x86_emulate.h"
 
 MODULE_AUTHOR("Qumranet");
@@ -108,20 +109,12 @@ static unsigned get_addr_size(struct kvm_vcpu *vcpu)
 
 static inline u8 pop_irq(struct kvm_vcpu *vcpu)
 {
-	int word_index = __ffs(vcpu->irq_summary);
-	int bit_index = __ffs(vcpu->irq_pending[word_index]);
-	int irq = word_index * BITS_PER_LONG + bit_index;
-
-	clear_bit(bit_index, &vcpu->irq_pending[word_index]);
-	if (!vcpu->irq_pending[word_index])
-		clear_bit(word_index, &vcpu->irq_summary);
-	return irq;
+	return kvm_vcpu_irq_read(vcpu, 0, NULL);
 }
 
 static inline void push_irq(struct kvm_vcpu *vcpu, u8 irq)
 {
-	set_bit(irq, vcpu->irq_pending);
-	set_bit(irq / BITS_PER_LONG, &vcpu->irq_summary);
+	kvm_vcpu_irq_inject(vcpu, irq, 0);
 }
 
 static inline void clgi(void)
@@ -587,9 +580,6 @@ static int svm_create_vcpu(struct kvm_vcpu *vcpu)
 	init_vmcb(vcpu->svm->vmcb);
 
 	fx_init(vcpu);
-	vcpu->apic_base = 0xfee00000 |
-			/*for vcpu 0*/ MSR_IA32_APICBASE_BSP |
-			MSR_IA32_APICBASE_ENABLE;
 
 	return 0;
 
@@ -1092,7 +1082,7 @@ static int halt_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 {
 	vcpu->svm->next_rip = vcpu->svm->vmcb->save.rip + 1;
 	skip_emulated_instruction(vcpu);
-	if (vcpu->irq_summary)
+	if (kvm_vcpu_irq_pending(vcpu, 0))
 		return 1;
 
 	kvm_run->exit_reason = KVM_EXIT_HLT;
@@ -1263,7 +1253,7 @@ static int interrupt_window_interception(struct kvm_vcpu *vcpu,
 	 * possible
 	 */
 	if (kvm_run->request_interrupt_window &&
-	    !vcpu->irq_summary) {
+	    !kvm_vcpu_irq_pending(vcpu, 0)) {
 		++kvm_stat.irq_window_exits;
 		kvm_run->exit_reason = KVM_EXIT_IRQ_WINDOW_OPEN;
 		return 0;
@@ -1399,7 +1389,7 @@ static void do_interrupt_requests(struct kvm_vcpu *vcpu,
 		(!(control->int_state & SVM_INTERRUPT_SHADOW_MASK) &&
 		 (vcpu->svm->vmcb->save.rflags & X86_EFLAGS_IF));
 
-	if (vcpu->interrupt_window_open && vcpu->irq_summary)
+	if (vcpu->interrupt_window_open && kvm_vcpu_irq_pending(vcpu, 0))
 		/*
 		 * If interrupts enabled, and not blocked by sti or mov ss. Good.
 		 */
@@ -1409,7 +1399,8 @@ static void do_interrupt_requests(struct kvm_vcpu *vcpu,
 	 * Interrupts blocked.  Wait for unblock.
 	 */
 	if (!vcpu->interrupt_window_open &&
-	    (vcpu->irq_summary || kvm_run->request_interrupt_window)) {
+	    (kvm_vcpu_irq_pending(vcpu, 0) || 
+	     kvm_run->request_interrupt_window)) {
 		control->intercept |= 1ULL << INTERCEPT_VINTR;
 	} else
 		control->intercept &= ~(1ULL << INTERCEPT_VINTR);
@@ -1418,11 +1409,13 @@ static void do_interrupt_requests(struct kvm_vcpu *vcpu,
 static void post_kvm_run_save(struct kvm_vcpu *vcpu,
 			      struct kvm_run *kvm_run)
 {
-	kvm_run->ready_for_interrupt_injection = (vcpu->interrupt_window_open &&
-						  vcpu->irq_summary == 0);
+	kvm_run->ready_for_interrupt_injection = 
+		(vcpu->interrupt_window_open && 
+		 !kvm_vcpu_irq_pending(vcpu, 0));
+
 	kvm_run->if_flag = (vcpu->svm->vmcb->save.rflags & X86_EFLAGS_IF) != 0;
-	kvm_run->cr8 = vcpu->cr8;
-	kvm_run->apic_base = vcpu->apic_base;
+	kvm_run->cr8 = kvm_apic_read_tpr(&vcpu->apic);
+	kvm_run->apic_base = vcpu->apic.apic_base_msr;
 }
 
 /*
@@ -1434,7 +1427,7 @@ static void post_kvm_run_save(struct kvm_vcpu *vcpu,
 static int dm_request_for_irq_injection(struct kvm_vcpu *vcpu,
 					  struct kvm_run *kvm_run)
 {
-	return (!vcpu->irq_summary &&
+	return (!kvm_vcpu_irq_pending(vcpu, 0) &&
 		kvm_run->request_interrupt_window &&
 		vcpu->interrupt_window_open &&
 		(vcpu->svm->vmcb->save.rflags & X86_EFLAGS_IF));
@@ -1489,6 +1482,8 @@ again:
 	fx_save(vcpu->host_fx_image);
 	fx_restore(vcpu->guest_fx_image);
 
+	apic_check_pending_timer(&vcpu->apic);
+
 	asm volatile (
 #ifdef CONFIG_X86_64
 		"push %%rbx; push %%rcx; push %%rdx;"
@@ -1601,6 +1596,8 @@ again:
 	fx_save(vcpu->guest_fx_image);
 	fx_restore(vcpu->host_fx_image);
 
+	apic_check_pending_timer(&vcpu->apic);
+
 	if ((vcpu->svm->vmcb->save.dr7 & 0xff))
 		load_db_regs(vcpu->svm->host_db_regs);
 
diff --git a/drivers/kvm/vmx.c b/drivers/kvm/vmx.c
index 61a6116..c250e33 100644
--- a/drivers/kvm/vmx.c
+++ b/drivers/kvm/vmx.c
@@ -18,6 +18,7 @@
 #include "kvm.h"
 #include "vmx.h"
 #include "kvm_vmx.h"
+#include "kvm_apic.h"
 #include <linux/module.h>
 #include <linux/kernel.h>
 #include <linux/mm.h>
@@ -994,10 +995,6 @@ static int vmx_vcpu_setup(struct kvm_vcpu *vcpu)
 
 	memset(vcpu->regs, 0, sizeof(vcpu->regs));
 	vcpu->regs[VCPU_REGS_RDX] = get_rdx_init_val();
-	vcpu->cr8 = 0;
-	vcpu->apic_base = 0xfee00000 |
-			/*for vcpu 0*/ MSR_IA32_APICBASE_BSP |
-			MSR_IA32_APICBASE_ENABLE;
 
 	fx_init(vcpu);
 
@@ -1219,13 +1216,8 @@ static void inject_rmode_irq(struct kvm_vcpu *vcpu, int irq)
 
 static void kvm_do_inject_irq(struct kvm_vcpu *vcpu)
 {
-	int word_index = __ffs(vcpu->irq_summary);
-	int bit_index = __ffs(vcpu->irq_pending[word_index]);
-	int irq = word_index * BITS_PER_LONG + bit_index;
-
-	clear_bit(bit_index, &vcpu->irq_pending[word_index]);
-	if (!vcpu->irq_pending[word_index])
-		clear_bit(word_index, &vcpu->irq_summary);
+	int irq = kvm_vcpu_irq_read(vcpu, 0, NULL);
+	BUG_ON(!irq);
 
 	if (vcpu->rmode.active) {
 		inject_rmode_irq(vcpu, irq);
@@ -1246,7 +1238,7 @@ static void do_interrupt_requests(struct kvm_vcpu *vcpu,
 		 (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & 3) == 0);
 
 	if (vcpu->interrupt_window_open &&
-	    vcpu->irq_summary &&
+	    kvm_vcpu_irq_pending(vcpu, 0) &&
 	    !(vmcs_read32(VM_ENTRY_INTR_INFO_FIELD) & INTR_INFO_VALID_MASK))
 		/*
 		 * If interrupts enabled, and not blocked by sti or mov ss. Good.
@@ -1255,7 +1247,7 @@ static void do_interrupt_requests(struct kvm_vcpu *vcpu,
 
 	cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
 	if (!vcpu->interrupt_window_open &&
-	    (vcpu->irq_summary || kvm_run->request_interrupt_window))
+	    (kvm_vcpu_irq_pending(vcpu, 0) || kvm_run->request_interrupt_window))
 		/*
 		 * Interrupts blocked.  Wait for unblock.
 		 */
@@ -1314,8 +1306,8 @@ static int handle_exception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 
 	if (is_external_interrupt(vect_info)) {
 		int irq = vect_info & VECTORING_INFO_VECTOR_MASK;
-		set_bit(irq, vcpu->irq_pending);
-		set_bit(irq / BITS_PER_LONG, &vcpu->irq_summary);
+		/* FIXME: Is this right? */
+		kvm_vcpu_irq_inject(vcpu, irq, 0); 
 	}
 
 	if ((intr_info & INTR_INFO_INTR_TYPE_MASK) == 0x200) { /* nmi */
@@ -1520,7 +1512,7 @@ static int handle_cr(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 			printk(KERN_DEBUG "handle_cr: read CR8 "
 			       "cpu erratum AA15\n");
 			vcpu_load_rsp_rip(vcpu);
-			vcpu->regs[reg] = vcpu->cr8;
+			vcpu->regs[reg] = kvm_apic_read_tpr(&vcpu->apic);
 			vcpu_put_rsp_rip(vcpu);
 			skip_emulated_instruction(vcpu);
 			return 1;
@@ -1617,10 +1609,10 @@ static void post_kvm_run_save(struct kvm_vcpu *vcpu,
 			      struct kvm_run *kvm_run)
 {
 	kvm_run->if_flag = (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) != 0;
-	kvm_run->cr8 = vcpu->cr8;
-	kvm_run->apic_base = vcpu->apic_base;
-	kvm_run->ready_for_interrupt_injection = (vcpu->interrupt_window_open &&
-						  vcpu->irq_summary == 0);
+	kvm_run->cr8 = kvm_apic_read_tpr(&vcpu->apic);
+	kvm_run->apic_base = vcpu->apic.apic_base_msr;
+	kvm_run->ready_for_interrupt_injection = 
+	    (vcpu->interrupt_window_open && !kvm_vcpu_irq_pending(vcpu, 0));
 }
 
 static int handle_interrupt_window(struct kvm_vcpu *vcpu,
@@ -1631,7 +1623,7 @@ static int handle_interrupt_window(struct kvm_vcpu *vcpu,
 	 * possible
 	 */
 	if (kvm_run->request_interrupt_window &&
-	    !vcpu->irq_summary) {
+	    !kvm_vcpu_irq_pending(vcpu, 0)) {
 		kvm_run->exit_reason = KVM_EXIT_IRQ_WINDOW_OPEN;
 		++kvm_stat.irq_window_exits;
 		return 0;
@@ -1642,7 +1634,7 @@ static int handle_interrupt_window(struct kvm_vcpu *vcpu,
 static int handle_halt(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 {
 	skip_emulated_instruction(vcpu);
-	if (vcpu->irq_summary)
+	if (kvm_vcpu_irq_pending(vcpu, 0))
 		return 1;
 
 	kvm_run->exit_reason = KVM_EXIT_HLT;
@@ -1713,7 +1705,7 @@ static int kvm_handle_exit(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
 static int dm_request_for_irq_injection(struct kvm_vcpu *vcpu,
 					  struct kvm_run *kvm_run)
 {
-	return (!vcpu->irq_summary &&
+	return (!kvm_vcpu_irq_pending(vcpu, 0) &&
 		kvm_run->request_interrupt_window &&
 		vcpu->interrupt_window_open &&
 		(vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF));
@@ -1763,6 +1755,8 @@ again:
 	save_msrs(vcpu->host_msrs, vcpu->nmsrs);
 	load_msrs(vcpu->guest_msrs, NR_BAD_MSRS);
 
+	apic_check_pending_timer(&vcpu->apic);
+
 	asm (
 		/* Store host registers */
 		"pushf \n\t"
@@ -1905,6 +1899,8 @@ again:
 	}
 	++kvm_stat.exits;
 
+	apic_check_pending_timer(&vcpu->apic);
+
 	save_msrs(vcpu->guest_msrs, NR_BAD_MSRS);
 	load_msrs(vcpu->host_msrs, NR_BAD_MSRS);
 
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 07bf353..6d0a4de 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -233,6 +233,10 @@ struct kvm_dirty_log {
 	};
 };
 
+struct kvm_options {
+	__u32 apic_enabled;
+};
+
 struct kvm_cpuid_entry {
 	__u32 function;
 	__u32 eax;
@@ -284,6 +288,8 @@ struct kvm_signal_mask {
 #define KVM_CREATE_VCPU           _IO(KVMIO,  0x41)
 #define KVM_GET_DIRTY_LOG         _IOW(KVMIO, 0x42, struct kvm_dirty_log)
 #define KVM_SET_MEMORY_ALIAS      _IOW(KVMIO, 0x43, struct kvm_memory_alias)
+#define KVM_GET_OPTIONS           _IOW(KVMIO, 0x44, struct kvm_options)
+#define KVM_SET_OPTIONS           _IOW(KVMIO, 0x45, struct kvm_options)
 
 /*
  * ioctls for vcpu fds

[-- Attachment #3: Type: text/plain, Size: 345 bytes --]

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV

[-- Attachment #4: Type: text/plain, Size: 186 bytes --]

_______________________________________________
kvm-devel mailing list
kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org
https://lists.sourceforge.net/lists/listinfo/kvm-devel

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: KVM in-kernel APIC update
       [not found] ` <46128F80.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
@ 2007-04-04  7:40   ` Avi Kivity
       [not found]     ` <46135686.4090905-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  2007-04-04 20:32   ` Ingo Molnar
  1 sibling, 1 reply; 22+ messages in thread
From: Avi Kivity @ 2007-04-04  7:40 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Gregory Haskins wrote:
> My current thoughts are that we at least move the IOAPIC into the kernel as well.  That will give sufficient control to generate ISA bus interrupts for guests that understand APICs.  If we want to be able to generate ISA interrupts for legacy guests which talk to the 8259s that will prove to be insufficient.  The good news is that moving the 8259s down as well is probably not a huge deal either, especially since I have already prepped the usermode side.  Thoughts?
>   

I would avoid moving down anything that's not strictly necessary.  If we 
want to keep the PIC in qemu, for example, just export the APIC-PIC 
interface to qemu.

I still don't have an opinion as to whether it is necessary; I'll need 
to study the details.  Xen pushes most of the platform into the 
hypervisor, but then Xen's communication path to qemu is much more 
expensive (involving the scheduler and a potential cpu switch) than 
kvm.  We'd need to balance possible performance improvements (I'd expect 
negligible) and interface simplification (possibly non-negligible) 
against further diverging from qemu.


> So heres a question for you guys out there.  What is the expected use of the in-kernel APIC?  My interests lie in the ability to send IPIs for SMP, as well as being able to inject asynchronous hypercall interrupts.  I assume there are other reasons too, such as PV device interrupts, etc and I would like to make sure I am seeing the big picture before making any bad design decisions.  My question is, how do we expect the PV devices to look from a bus perspective?  
>
> The current Bochs/QEMU system model paints a fairly simple ISA architecture utilizing a single IOAPIC + dual 8259 setup.  Do we expect in-kernel injected IRQs to follow the ISA model (e.g. either legacy or PCI interrupts only limited to IRQ0-15) or do we want to expand on this?  The PCI hypercall device introduced a while back would be an example of something ISA based.  Alternatives would be to utilize unused "pins" (such as IRQ16-23) on IOAPIC #0, or introducing new an entirely new bus/IOAPICs just for KVM, etc.  
>   

There are two extreme models, which I think are both needed.  On one 
end, support for closed OSes (e.g. Windows) requires fairly strict 
conformance to the PCI model, which means going through the IOAPIC or 
PIC or however the interrupt lines are wired in qemu. This seems to 
indicate that an in-kernel IOAPIC is needed.  On the other end (Linux), 
a legacy-free and emulation-free device can just inject interrupts 
directly and use shared memory to ack interrupts and indicate their source.

> If the latter, we also need to decide what the resource conveyance model and vector allocation policy should be.  For instance, do we publish said resources formally in the MP/ACPI tables in Bochs?  Doing so would allow MP/ACPI compliant OSs like linux to naturally route the IRQ.  Conversely, do we do something more direct just like we do for KVM discovery via wrmsr?
>   

I think we can go the direct route for cooperative guests.

I also suggest doing the work in stages and measuring; that is, first 
push the local apic, determine what the remaining bottlenecks are and 
tackle them.  I'm pretty sure that Linux guests would only require the 
local apic but Windows (and older Linux kernels) might require more.

>  struct kvm_vcpu;
>  
> +struct kvm_irqinfo {
> +	int         vector;
> +	int         nmi;
> +};
> +
> +#define KVM_IRQFLAGS_NMI  (1 << 0)
> +#define KVM_IRQFLAGS_PEEK (1 << 1)
> +
> +struct kvm_irqdevice {
> +	int  (*pending)(struct kvm_irqdevice *this, int flags);
> +	int  (*read)(struct kvm_irqdevice *this, int flags, 
> +		     struct kvm_irqinfo *info);
>   

Aren't pending() and read() + PEEK overlapping?

> +	int  (*inject)(struct kvm_irqdevice *this, int irq, int flags);
> +	int  (*summary)(struct kvm_irqdevice *this, void *data);
> +	void (*destructor)(struct kvm_irqdevice *this);
> +
> +	void *private;
> +};
>   

Consider using container_of() to simulate C++ inheritance.  Messier but 
less indirections.   Also consider a kvm_irqdevice_operations structure.

> +
> +#define MAX_APIC_INT_VECTOR      256
> +
> +struct kvm_apic {
> +	u32	           	status;
> +	u32	           	vcpu_id;
> +	spinlock_t 		lock;
> +	u32			pcpu_lock_owner;
>   

Isn't this vcpu->cpu?

> +	atomic_t		timer_pending;
> +	u64	           	apic_base_msr;
> +	unsigned long      	base_address;
> +	u32	           	timer_divide_count;
> +	struct hrtimer       	apic_timer;
> +	int                	intr_pending_count[MAX_APIC_INT_VECTOR];
> +	ktime_t           	timer_last_update;
> +	struct {
> +		int deliver_mode;
> +		int source[6];
> +	} direct_intr;
> +	u32	           	err_status;
> +	u32	           	err_write_count;
> +	struct kvm_vcpu        	*vcpu;
> +	struct page	   	*regs_page;
> +	void               	*regs;
> +	struct kvm_irqdevice    ext; /* Used for external/NMI interrupts */
> +};
> +
>  /*
>   * x86 supports 3 paging modes (4-level 64-bit, 3-level 64-bit, and 2-level
>   * 32-bit).  The kvm_mmu structure abstracts the details of the current mmu
> @@ -236,6 +281,11 @@ struct kvm_pio_request {
>  	int rep;
>  };
>  
> +#define KVM_VCPU_INIT_SIPI_SIPI_STATE_NORM          0
> +#define KVM_VCPU_INIT_SIPI_SIPI_STATE_WAIT_SIPI     1
> +
> +#define NR_IRQ_WORDS KVM_IRQ_BITMAP_SIZE(unsigned long)
> +
>  struct kvm_vcpu {
>  	struct kvm *kvm;
>  	union {
> @@ -248,12 +298,9 @@ struct kvm_vcpu {
>  	u64 host_tsc;
>  	struct kvm_run *run;
>  	int interrupt_window_open;
> -	unsigned long irq_summary; /* bit vector: 1 per word in irq_pending */
> -#define NR_IRQ_WORDS KVM_IRQ_BITMAP_SIZE(unsigned long)
> -	unsigned long irq_pending[NR_IRQ_WORDS];
>  	unsigned long regs[NR_VCPU_REGS]; /* for rsp: vcpu_load_rsp_rip() */
>  	unsigned long rip;      /* needs vcpu_load_rsp_rip() */
> -
> +	struct kvm_irqdevice irq_dev;
>  	unsigned long cr0;
>  	unsigned long cr2;
>  	unsigned long cr3;
> @@ -261,10 +308,8 @@ struct kvm_vcpu {
>  	struct page *para_state_page;
>  	gpa_t hypercall_gpa;
>  	unsigned long cr4;
> -	unsigned long cr8;
>   

I think we should keep cr8 here and accept the duplication.

>  	u64 pdptrs[4]; /* pae */
>  	u64 shadow_efer;
> -	u64 apic_base;
>  	u64 ia32_misc_enable_msr;
>  	int nmsrs;
>  	struct vmx_msr_entry *guest_msrs;
> @@ -298,6 +343,11 @@ struct kvm_vcpu {
>  	int sigset_active;
>  	sigset_t sigset;
>  
> +	struct kvm_apic apic;
> +	wait_queue_head_t halt_wq;
> +	/* For AP startup */
> +	unsigned long init_sipi_sipi_state;
> +
>  	struct {
>  		int active;
>  		u8 save_iopl;
> @@ -319,6 +369,23 @@ struct kvm_mem_alias {
>  	gfn_t target_gfn;
>  };
>  
> +#define kvm_irq_pending(dev, flags)     (dev)->pending(dev, flags)
> +#define kvm_irq_read(dev, flags, info)  (dev)->read(dev, flags, info)
> +#define kvm_irq_inject(dev, irq, flags) (dev)->inject(dev, irq, flags)
> +#define kvm_irq_summary(dev, data)      (dev)->summary(dev, data)
>   

Static inlines please, no #defines.  In this case, I'd just call the 
functions directly.

> +
> +#define kvm_vcpu_irq_pending(vcpu, flags) \
> +   kvm_irq_pending(&vcpu->irq_dev, flags)
> +#define kvm_vcpu_irq_read(vcpu, flags, info) \
> +   kvm_irq_read(&vcpu->irq_dev, flags, info)
> +#define kvm_vcpu_irq_inject(vcpu, irq, flags) \
> +   kvm_irq_inject(&vcpu->irq_dev, irq, flags)
> +#define kvm_vcpu_irq_summary(vcpu, data)  \
> +   kvm_irq_summary(&vcpu->irq_dev, data)
> +
>   

Ditto.

>  
> +struct kvm_mmio_handler {
> +	unsigned long (*read)(struct kvm_vcpu *v,
> +			      unsigned long addr,
> +			      unsigned long length);
> +	void (*write)(struct kvm_vcpu *v,
> +			   unsigned long addr,
> +			   unsigned long length,
> +			   unsigned long val);
> +	int (*in_range)(struct kvm_vcpu *v, unsigned long addr);
> +};
>   

This would need to be split out into a separate patch.  Addresses are gpa_t.

> +
> +#define KVM_MMIO_HANDLER_NR_ARRAY_SIZE 1
> +static struct kvm_mmio_handler *kvm_mmio_handlers[KVM_MMIO_HANDLER_NR_ARRAY_SIZE] =
> +{
> +    &apic_mmio_handler,
> +};
> +
>   

We'll need dynamic registration if we want kernel apic support to be 
optional.  Also, this needs to be per-vcpu as each vcpu's apic can be 
placed at a different address.


> @@ -1479,7 +1533,11 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
>  		data = 3;
>  		break;
>  	case MSR_IA32_APICBASE:
> -		data = vcpu->apic_base;
> +		data = vcpu->apic.apic_base_msr;
> +		break;
> +	case MSR_IA32_TIME_STAMP_COUNTER:
> +		// FIXME
> +                //data = guest_read_tsc();
>  		break;
>   

The tsc stuff looks spurious.

> +	case KVM_GET_OPTIONS: {
> +		r = -EFAULT;
> +		if (copy_to_user(&kvm->options, argp, sizeof(kvm->options)))
> +		    goto out;
> +		r = 0;
> +		break;
> +	}
> +	case KVM_SET_OPTIONS: {
> +		r = -EFAULT;
> +		if (copy_from_user(&kvm->options, argp, sizeof(kvm->options)))
> +		    goto out;
> +		r = 0;
> +		break;
> +	}
>   

This is difficult to expand over time.  I prefer a separate ioctl() for 
each option.

> +
> +struct bitarray {
> +    unsigned long summary; /* 1 per word in pending */
> +    unsigned long pending[NR_IRQ_WORDS];
> +};
> +
> +static int bitarray_pending(struct bitarray *this)
> +{
> +	return this->summary ? 1 : 0;	
> +}
> +
> +static int bitarray_findhighest(struct bitarray *this)
> +{
> +	if(!this->summary)
>   

Space after if, here and elsewhere.

> +		return 0;
> +
> +	int word_index = __ffs(this->summary);
> +	int bit_index = __ffs(this->pending[word_index]);
> +
> +	return word_index * BITS_PER_LONG + bit_index;	
> +}
> +
> +static void bitarray_set(struct bitarray *this, int nr)
> +{
> +	set_bit(nr, &this->pending);
> +	set_bit(nr / BITS_PER_LONG, &this->summary); 
>   

Since you need locking anyway, best to use the unlocked versions 
(__set_bit()).

> +typedef struct {
> +	struct bitarray irq_pending;
> +	struct bitarray nmi_pending;
>   

Why is nmi_pending a bitarray?

> @@ -108,20 +109,12 @@ static unsigned get_addr_size(struct kvm_vcpu *vcpu)
>  
>  static inline u8 pop_irq(struct kvm_vcpu *vcpu)
>  {
> -	int word_index = __ffs(vcpu->irq_summary);
> -	int bit_index = __ffs(vcpu->irq_pending[word_index]);
> -	int irq = word_index * BITS_PER_LONG + bit_index;
> -
> -	clear_bit(bit_index, &vcpu->irq_pending[word_index]);
> -	if (!vcpu->irq_pending[word_index])
> -		clear_bit(word_index, &vcpu->irq_summary);
> -	return irq;
> +	return kvm_vcpu_irq_read(vcpu, 0, NULL);
>  }
>  
>  static inline void push_irq(struct kvm_vcpu *vcpu, u8 irq)
>  {
> -	set_bit(irq, vcpu->irq_pending);
> -	set_bit(irq / BITS_PER_LONG, &vcpu->irq_summary);
> +	kvm_vcpu_irq_inject(vcpu, irq, 0);
>  }
>   


It would be helpful to unify the vmx and svm irq code first (I can merge 
something like that immediately), so this patch becomes simpler.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: KVM in-kernel APIC update
       [not found]     ` <46135686.4090905-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-04-04 16:23       ` Gregory Haskins
       [not found]         ` <46138A98.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
  2007-04-04 20:12       ` Ingo Molnar
  1 sibling, 1 reply; 22+ messages in thread
From: Gregory Haskins @ 2007-04-04 16:23 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

>>> On Wed, Apr 4, 2007 at  3:40 AM, in message <46135686.4090905-atKUWr5tajBWk0Htik3J/w@public.gmane.org>,
Avi Kivity <avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org> wrote: 
>
> I would avoid moving down anything that's not strictly necessary.  

Agreed.

> 
> I still don't have an opinion as to whether it is necessary; I'll need 
> to study the details.  Xen pushes most of the platform into the 
> hypervisor, but then Xen's communication path to qemu is much more 
> expensive (involving the scheduler and a potential cpu switch) than 
> kvm.  We'd need to balance possible performance improvements (I'd expect 
> negligible) and interface simplification (possibly non-negligible) 
> against further diverging from qemu.

I am not going to advocate one way or the other here, but I will just layout how I saw this "full kernel emulation" happening so we are on the same page.  What I found was that the entire QEMU interrupt system converges on "pic_send_irq()" regardless of whether its a PCI or legacy ISA device.  This function also acts as a pin-forwarding mechanism, sending the IRQ event to both the 8259s and IOAPIC #0.  The irq that is dispatched by pic_send_irq() is still a raw "pin IRQ" reference, and must be translated into an actual x86 vector by the 8259/IOAPIC (which is governed by how the BIOS/OS programs them).  

The action of raising the interrupt ends up synchronously calling cpu_interrupt() which currently seems to be a no-op for KVM.  Later, in the pre_run stage the system retrieves a pending vector (this is when the irq/vector translation occurs) and injects it to the kernel with the INTERRUPT ioctl.

What I was planning on doing was using that QEMU patch I provided to intercept all pic_send_irq() calls and forward them directly to the kernel via a new ioctl().  This ioctl would be directed at the VM fd, not the VCPU, since its a pure ISA global pin reference and wont know the targeted vcpu until the 8259/IOAPIC perform their translation.

So that being said, I think the interface between userspace and kernel would be no more complex than it is today.  I.e. we would just be passing a single int via an ioctl.  The danger, as you have pointed out, is accepting the QEMU patch that I submitted that potentially diverges the pic code from QEMU upstream.  What this buys us, however, is is that any code (kernel or userspace) would be able to inject an ISA interrupt.  Using an ISA interrupt has the advantage of already being represented in the ACPI/MP tables presented by Bochs, and thus any compliant OS will automatically route the IRQ.

Where things do get potentially complicated with the interface is if we go with a hybrid solution.  Leaving the LAPIC in kernel, but the IOAPIC/8259 in userspace requires a much wider interface so the IOAPIC can deliver APIC style messages to the kernel.  Also, IOAPIC EOI for level sensitive interrupts become more difficult and complex.  Putting LAPIC + IOAPIC#0 in the kernel and leaving 8259 outside might actually work fairly well, but in-kernel ISA interrupts will only work with OSs which enable the APIC.  This may or may not be an issue and may be an acceptable tradeoff.  

> 
> There are two extreme models, which I think are both needed.  On one 
> end, support for closed OSes (e.g. Windows) requires fairly strict 
> conformance to the PCI model, which means going through the IOAPIC or 
> PIC or however the interrupt lines are wired in qemu.  This seems to indicate 
> that an in-kernel IOAPIC is needed.  On the other end (Linux), 
> a legacy-free and emulation-free device can just inject interrupts 
> directly and use shared memory to ack interrupts and indicate their source.

Well, sort of.  The problem as I see it in both cases is still IRQ routing.   If you knew of a free vector to take (regardless of OS) you could forgo the ACPI/MP declarations all together and register the vector with your "device" (e.g. via a hypercall, etc) and have the device issue direct LAPIC messages with the proper vector.  I think this would work assuming the device used edge triggered interrupts (which don't require IOAPIC IRR registration or EOI).

> 
>> If the latter, we also need to decide what the resource conveyance model and 
> vector allocation policy should be.  For instance, do we publish said 
> resources formally in the MP/ACPI tables in Bochs?  Doing so would allow 
> MP/ACPI compliant OSs like linux to naturally route the IRQ.  Conversely, do 
> we do something more direct just like we do for KVM discovery via wrmsr?
>>   
> 
> I think we can go the direct route for cooperative guests.

As long as there is a way for the guest code to know about a free vector to use, I think this should work just fine.

> 
> I also suggest doing the work in stages and measuring; that is, first 
> push the local apic, determine what the remaining bottlenecks are and 
> tackle them.  I'm pretty sure that Linux guests would only require the 
> local apic but Windows (and older Linux kernels) might require more.
> 
>>  struct kvm_vcpu;
>>  
>> +struct kvm_irqinfo {
>> +	int         vector;
>> +	int         nmi;
>> +};
>> +
>> +#define KVM_IRQFLAGS_NMI  (1 << 0)
>> +#define KVM_IRQFLAGS_PEEK (1 << 1)
>> +
>> +struct kvm_irqdevice {
>> +	int  (*pending)(struct kvm_irqdevice *this, int flags);
>> +	int  (*read)(struct kvm_irqdevice *this, int flags, 
>> +		     struct kvm_irqinfo *info);
>>   
> 
> Aren't pending() and read() + PEEK overlapping?

Excellent point.  I didnt introduce the PEEK/NMI support until after this interface was already defined.  I will consolidate.

> 
>> +	int  (*inject)(struct kvm_irqdevice *this, int irq, int flags);
>> +	int  (*summary)(struct kvm_irqdevice *this, void *data);
>> +	void (*destructor)(struct kvm_irqdevice *this);
>> +
>> +	void *private;
>> +};
>>   
> 
> Consider using container_of() to simulate C++ inheritance.  Messier but 
> less indirections.   Also consider a kvm_irqdevice_operations structure.

Ack.

> 
>> +
>> +#define MAX_APIC_INT_VECTOR      256
>> +
>> +struct kvm_apic {
>> +	u32	           	status;
>> +	u32	           	vcpu_id;
>> +	spinlock_t 		lock;
>> +	u32			pcpu_lock_owner;
>>   
> 
> Isn't this vcpu->cpu?

With the exception of the kvm_irqdevice at the end, I inherited this struct from Dor's branch, so I am not sure.  I will have to look into it deeper.

> 
>> +	atomic_t		timer_pending;
>> +	u64	           	apic_base_msr;
>> +	unsigned long      	base_address;
>> +	u32	           	timer_divide_count;
>> +	struct hrtimer       	apic_timer;
>> +	int                	intr_pending_count[MAX_APIC_INT_VECTOR];
>> +	ktime_t           	timer_last_update;
>> +	struct {
>> +		int deliver_mode;
>> +		int source[6];
>> +	} direct_intr;
>> +	u32	           	err_status;
>> +	u32	           	err_write_count;
>> +	struct kvm_vcpu        	*vcpu;
>> +	struct page	   	*regs_page;
>> +	void               	*regs;
>> +	struct kvm_irqdevice    ext; /* Used for external/NMI interrupts */
>> +};
>> +
>>  /*
>>   * x86 supports 3 paging modes (4-level 64-bit, 3-level 64-bit, and 2-level
>>   * 32-bit).  The kvm_mmu structure abstracts the details of the current mmu
>> @@ -236,6 +281,11 @@ struct kvm_pio_request {
>>  	int rep;
>>  };
>>  
>> +#define KVM_VCPU_INIT_SIPI_SIPI_STATE_NORM          0
>> +#define KVM_VCPU_INIT_SIPI_SIPI_STATE_WAIT_SIPI     1
>> +
>> +#define NR_IRQ_WORDS KVM_IRQ_BITMAP_SIZE(unsigned long)
>> +
>>  struct kvm_vcpu {
>>  	struct kvm *kvm;
>>  	union {
>> @@ -248,12 +298,9 @@ struct kvm_vcpu {
>>  	u64 host_tsc;
>>  	struct kvm_run *run;
>>  	int interrupt_window_open;
>> -	unsigned long irq_summary; /* bit vector: 1 per word in irq_pending */
>> -#define NR_IRQ_WORDS KVM_IRQ_BITMAP_SIZE(unsigned long)
>> -	unsigned long irq_pending[NR_IRQ_WORDS];
>>  	unsigned long regs[NR_VCPU_REGS]; /* for rsp: vcpu_load_rsp_rip() */
>>  	unsigned long rip;      /* needs vcpu_load_rsp_rip() */
>> -
>> +	struct kvm_irqdevice irq_dev;
>>  	unsigned long cr0;
>>  	unsigned long cr2;
>>  	unsigned long cr3;
>> @@ -261,10 +308,8 @@ struct kvm_vcpu {
>>  	struct page *para_state_page;
>>  	gpa_t hypercall_gpa;
>>  	unsigned long cr4;
>> -	unsigned long cr8;
>>   
> 
> I think we should keep cr8 here and accept the duplication.

Agreed.  The original branch had deleted this, but I really need to leave it in if we expect to be able to support the ABI compatible mode of "apic = disabled".  I will restore these.

> 
>>  	u64 pdptrs[4]; /* pae */
>>  	u64 shadow_efer;
>> -	u64 apic_base;
>>  	u64 ia32_misc_enable_msr;
>>  	int nmsrs;
>>  	struct vmx_msr_entry *guest_msrs;
>> @@ -298,6 +343,11 @@ struct kvm_vcpu {
>>  	int sigset_active;
>>  	sigset_t sigset;
>>  
>> +	struct kvm_apic apic;
>> +	wait_queue_head_t halt_wq;
>> +	/* For AP startup */
>> +	unsigned long init_sipi_sipi_state;
>> +
>>  	struct {
>>  		int active;
>>  		u8 save_iopl;
>> @@ -319,6 +369,23 @@ struct kvm_mem_alias {
>>  	gfn_t target_gfn;
>>  };
>>  
>> +#define kvm_irq_pending(dev, flags)     (dev)->pending(dev, flags)
>> +#define kvm_irq_read(dev, flags, info)  (dev)->read(dev, flags, info)
>> +#define kvm_irq_inject(dev, irq, flags) (dev)->inject(dev, irq, flags)
>> +#define kvm_irq_summary(dev, data)      (dev)->summary(dev, data)
>>   
> 
> Static inlines please, no #defines.  In this case, I'd just call the 
> functions directly.

ack.

> 
>> +
>> +#define kvm_vcpu_irq_pending(vcpu, flags) \
>> +   kvm_irq_pending(&vcpu->irq_dev, flags)
>> +#define kvm_vcpu_irq_read(vcpu, flags, info) \
>> +   kvm_irq_read(&vcpu->irq_dev, flags, info)
>> +#define kvm_vcpu_irq_inject(vcpu, irq, flags) \
>> +   kvm_irq_inject(&vcpu->irq_dev, irq, flags)
>> +#define kvm_vcpu_irq_summary(vcpu, data)  \
>> +   kvm_irq_summary(&vcpu->irq_dev, data)
>> +
>>   
> 
> Ditto.

ack

> 
>>  
>> +struct kvm_mmio_handler {
>> +	unsigned long (*read)(struct kvm_vcpu *v,
>> +			      unsigned long addr,
>> +			      unsigned long length);
>> +	void (*write)(struct kvm_vcpu *v,
>> +			   unsigned long addr,
>> +			   unsigned long length,
>> +			   unsigned long val);
>> +	int (*in_range)(struct kvm_vcpu *v, unsigned long addr);
>> +};
>>   
> 
> This would need to be split out into a separate patch.  Addresses are gpa_t.

ack

> 
>> +
>> +#define KVM_MMIO_HANDLER_NR_ARRAY_SIZE 1
>> +static struct kvm_mmio_handler 
> *kvm_mmio_handlers[KVM_MMIO_HANDLER_NR_ARRAY_SIZE] =
>> +{
>> +    &apic_mmio_handler,
>> +};
>> +
>>   
> 
> We'll need dynamic registration if we want kernel apic support to be 
> optional.  

Agreed.  This is another example of something I inherited so it probably wasn't written with the dynamic model in mind.  I will clean this up

>Also, this needs to be per-vcpu as each vcpu's apic can be  placed at a different address.

Note that this interface does actually work on a per-cpu basis today as the "in-range' mechanism considers the apicbase MSR from its vcpu structure.

But that aside, I fully agree that this should be more explicit.  Actually, i think it would be good to come up with a general MMIO/PIO system that can work on a per VM and per VCPU basis so we can get coverage for both global and per-cpu resources.  For instance, if we move the IOAPIC and i8259s, they should be associated with a VM global address.

> 
> 
>> @@ -1479,7 +1533,11 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 
> msr, u64 *pdata)
>>  		data = 3;
>>  		break;
>>  	case MSR_IA32_APICBASE:
>> -		data = vcpu->apic_base;
>> +		data = vcpu->apic.apic_base_msr;
>> +		break;
>> +	case MSR_IA32_TIME_STAMP_COUNTER:
>> +		// FIXME
>> +                //data = guest_read_tsc();
>>  		break;
>>   
> 
> The tsc stuff looks spurious.

More inherited stuff, and Im not sure what it was trying to do.  I will investigate further.

> 
>> +	case KVM_GET_OPTIONS: {
>> +		r = -EFAULT;
>> +		if (copy_to_user(&kvm->options, argp, sizeof(kvm->options)))
>> +		    goto out;
>> +		r = 0;
>> +		break;
>> +	}
>> +	case KVM_SET_OPTIONS: {
>> +		r = -EFAULT;
>> +		if (copy_from_user(&kvm->options, argp, sizeof(kvm->options)))
>> +		    goto out;
>> +		r = 0;
>> +		break;
>> +	}
>>   
> 
> This is difficult to expand over time.  I prefer a separate ioctl() for 
> each option.

Ack.

> 
>> +
>> +struct bitarray {
>> +    unsigned long summary; /* 1 per word in pending */
>> +    unsigned long pending[NR_IRQ_WORDS];
>> +};
>> +
>> +static int bitarray_pending(struct bitarray *this)
>> +{
>> +	return this->summary ? 1 : 0;	
>> +}
>> +
>> +static int bitarray_findhighest(struct bitarray *this)
>> +{
>> +	if(!this->summary)
>>   
> 
> Space after if, here and elsewhere.

Ack.

> 
>> +		return 0;
>> +
>> +	int word_index = __ffs(this->summary);
>> +	int bit_index = __ffs(this->pending[word_index]);
>> +
>> +	return word_index * BITS_PER_LONG + bit_index;	
>> +}
>> +
>> +static void bitarray_set(struct bitarray *this, int nr)
>> +{
>> +	set_bit(nr, &this->pending);
>> +	set_bit(nr / BITS_PER_LONG, &this->summary); 
>>   
> 
> Since you need locking anyway, best to use the unlocked versions 
> (__set_bit()).

Ack.  Yes, locking needs to be added.  Votes for appropriate mechanism? (e.g. spinlock, mutex, etc?)

> 
>> +typedef struct {
>> +	struct bitarray irq_pending;
>> +	struct bitarray nmi_pending;
>>   
> 
> Why is nmi_pending a bitarray?

When I wrote this I thought any vector could be declared as an NMI.  I just restudied the doc and I see this is wrong.  You are right, the nmi_pending doesnt need to be an array.  I will fix this.

> 
>> @@ -108,20 +109,12 @@ static unsigned get_addr_size(struct kvm_vcpu *vcpu)
>>  
>>  static inline u8 pop_irq(struct kvm_vcpu *vcpu)
>>  {
>> -	int word_index = __ffs(vcpu->irq_summary);
>> -	int bit_index = __ffs(vcpu->irq_pending[word_index]);
>> -	int irq = word_index * BITS_PER_LONG + bit_index;
>> -
>> -	clear_bit(bit_index, &vcpu->irq_pending[word_index]);
>> -	if (!vcpu->irq_pending[word_index])
>> -		clear_bit(word_index, &vcpu->irq_summary);
>> -	return irq;
>> +	return kvm_vcpu_irq_read(vcpu, 0, NULL);
>>  }
>>  
>>  static inline void push_irq(struct kvm_vcpu *vcpu, u8 irq)
>>  {
>> -	set_bit(irq, vcpu->irq_pending);
>> -	set_bit(irq / BITS_PER_LONG, &vcpu->irq_summary);
>> +	kvm_vcpu_irq_inject(vcpu, irq, 0);
>>  }
>>   
> 
> 
> It would be helpful to unify the vmx and svm irq code first (I can merge 
> something like that immediately), so this patch becomes simpler.

Well, I could just change any occurrence of "pop_irq" with kvm_vcpu_irq_read() and any occurrence of push_irq() to kvm_vcpu_irq_inject(), but I don't get the impression that this is what you were referring to.  Could you elaborate?

> 
> -- 
> Do not meddle in the internals of kernels, for they are subtle and quick to 
> panic.

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: KVM in-kernel APIC update
       [not found]         ` <46138A98.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
@ 2007-04-04 16:49           ` Avi Kivity
       [not found]             ` <4613D736.1080207-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 22+ messages in thread
From: Avi Kivity @ 2007-04-04 16:49 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Gregory Haskins wrote:
> What I was planning on doing was using that QEMU patch I provided to intercept all pic_send_irq() calls and forward them directly to the kernel via a new ioctl().  This ioctl would be directed at the VM fd, not the VCPU, since its a pure ISA global pin reference and wont know the targeted vcpu until the 8259/IOAPIC perform their translation.
>   

Hmm.  If the ioapic is in the kernel, then it's a platform-wide resource 
and you would need a vm ioctl.  If ioapic emulation is in userspace, 
then the ioapic logic will have decided which cpu is targeted and you 
would issue a vcpu ioctl.

> So that being said, I think the interface between userspace and kernel would be no more complex than it is today.  I.e. we would just be passing a single int via an ioctl.  

I think the interface should mirror the hardware interface at the point 
of the "cut".  For example, if we keep the ioapic in userspace, the 
interface is ioapic/apic bus messages.  If we push the ioapic into the 
kernel, the interface describes the various ioapic pins and how the 
ioapics are connected to each other and to the processors (i.e. the 
topology).


> The danger, as you have pointed out, is accepting the QEMU patch that I submitted that potentially diverges the pic code from QEMU upstream.  What this buys us, however, is is that any code (kernel or userspace) would be able to inject an ISA interrupt.  Using an ISA interrupt has the advantage of already being represented in the ACPI/MP tables presented by Bochs, and thus any compliant OS will automatically route the IRQ.
>
> Where things do get potentially complicated with the interface is if we go with a hybrid solution.  Leaving the LAPIC in kernel, but the IOAPIC/8259 in userspace requires a much wider interface so the IOAPIC can deliver APIC style messages to the kernel.  Also, IOAPIC EOI for level sensitive interrupts become more difficult and complex.  Putting LAPIC + IOAPIC#0 in the kernel and leaving 8259 outside might actually work fairly well, but in-kernel ISA interrupts will only work with OSs which enable the APIC.  This may or may not be an issue and may be an acceptable tradeoff.  
>   

Everything should keep working, that is a must.  We just need the 
interfaces to follow the hardware faithfully.  The issue with the ioapic 
eoi is worrying me performance wise, though; it looks like we need to 
push the ioapic too if we are to have no-compromise performance on 
unmodified OSes.


>> There are two extreme models, which I think are both needed.  On one 
>> end, support for closed OSes (e.g. Windows) requires fairly strict 
>> conformance to the PCI model, which means going through the IOAPIC or 
>> PIC or however the interrupt lines are wired in qemu.  This seems to indicate 
>> that an in-kernel IOAPIC is needed.  On the other end (Linux), 
>> a legacy-free and emulation-free device can just inject interrupts 
>> directly and use shared memory to ack interrupts and indicate their source.
>>     
>
> Well, sort of.  The problem as I see it in both cases is still IRQ routing.   If you knew of a free vector to take (regardless of OS) you could forgo the ACPI/MP declarations all together and register the vector with your "device" (e.g. via a hypercall, etc) and have the device issue direct LAPIC messages with the proper vector.  I think this would work assuming the device used edge triggered interrupts (which don't require IOAPIC IRR registration or EOI).
>
>   

For unmodified guests, use the existing pci irq routing.  I certainly 
wouldn't want to debug anything else.  For modified guests, there's no 
real problem.

>>>   
>>>       
>> Since you need locking anyway, best to use the unlocked versions 
>> (__set_bit()).
>>     
>
> Ack.  Yes, locking needs to be added.  Votes for appropriate mechanism? (e.g. spinlock, mutex, etc?)
>
>   

spin_lock_irq(), as this lock will frequently have to be taken in host 
irq handlers.  Need to be extra careful with the locking in 
{vmx,svm}_vcpu_run().

>   
>>> @@ -108,20 +109,12 @@ static unsigned get_addr_size(struct kvm_vcpu *vcpu)
>>>  
>>>  static inline u8 pop_irq(struct kvm_vcpu *vcpu)
>>>  {
>>> -	int word_index = __ffs(vcpu->irq_summary);
>>> -	int bit_index = __ffs(vcpu->irq_pending[word_index]);
>>> -	int irq = word_index * BITS_PER_LONG + bit_index;
>>> -
>>> -	clear_bit(bit_index, &vcpu->irq_pending[word_index]);
>>> -	if (!vcpu->irq_pending[word_index])
>>> -		clear_bit(word_index, &vcpu->irq_summary);
>>> -	return irq;
>>> +	return kvm_vcpu_irq_read(vcpu, 0, NULL);
>>>  }
>>>  
>>>  static inline void push_irq(struct kvm_vcpu *vcpu, u8 irq)
>>>  {
>>> -	set_bit(irq, vcpu->irq_pending);
>>> -	set_bit(irq / BITS_PER_LONG, &vcpu->irq_summary);
>>> +	kvm_vcpu_irq_inject(vcpu, irq, 0);
>>>  }
>>>   
>>>       
>> It would be helpful to unify the vmx and svm irq code first (I can merge 
>> something like that immediately), so this patch becomes simpler.
>>     
>
> Well, I could just change any occurrence of "pop_irq" with kvm_vcpu_irq_read() and any occurrence of push_irq() to kvm_vcpu_irq_inject(), but I don't get the impression that this is what you were referring to.  Could you elaborate?
>   

I meant having a cleanup patch before that pushes irq handling into 
common code, then the apic patch could modify that to call the kernel 
apic code if necessary.


-- 
error compiling committee.c: too many arguments to function


-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: KVM in-kernel APIC update
       [not found]             ` <4613D736.1080207-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-04-04 17:10               ` Gregory Haskins
       [not found]                 ` <4613959F.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 22+ messages in thread
From: Gregory Haskins @ 2007-04-04 17:10 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

>>> On Wed, Apr 4, 2007 at 12:49 PM, in message <4613D736.1080207-atKUWr5tajBWk0Htik3J/w@public.gmane.org>,
Avi Kivity <avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org> wrote: 
> Gregory Haskins wrote:
>
> 
> Hmm.  If the ioapic is in the kernel, then it's a platform- wide resource 
> and you would need a vm ioctl.  If ioapic emulation is in userspace, 
> then the ioapic logic will have decided which cpu is targeted and you 
> would issue a vcpu ioctl.
>

Thats exactly in line with my thinking.
 
>> So that being said, I think the interface between userspace and kernel would 
> be no more complex than it is today.  I.e. we would just be passing a single 
> int via an ioctl.  
> 
> I think the interface should mirror the hardware interface at the point 
> of the "cut".  For example, if we keep the ioapic in userspace, the 
> interface is ioapic/apic bus messages.  If we push the ioapic into the 
> kernel, the interface describes the various ioapic pins and how the 
> ioapics are connected to each other and to the processors (i.e. the 
> topology).

Agreed.  I was thinking that the interface for the "IOAPIC in kernel" model would look something like the way the pic_send_irq() function looks, except it would also convey BUS/IOAPIC id.

so: kvm_inject_interrupt(int bus, int pin, int value);

and the "kvmpic" driver would currently translate as bus = 0 (giving us IRQ0-23).  E.g. 

kvmpic_send_irq(int irq, int value)
{
   kvm_inject_interrupt(0, irq, value);
}

In the future, if we have more than one IOAPIC the system can map to the right unit according to the topology.


> 
> Everything should keep working, that is a must.  We just need the 
> interfaces to follow the hardware faithfully.  The issue with the ioapic 
> eoi is worrying me performance wise, though; it looks like we need to 
> push the ioapic too if we are to have no- compromise performance on 
> unmodified OSes.

Not sure if this will make you feel better, but it appears as though both the QEMU and the in-kernel model that I inherited don't accurately support IOAPIC EOI already.  From that you can infer that there must not be any level-sensitive interrupts in use today.  Since the system seems to only have the legacy ISA model (which dictates edge triggers, IIRC), this makes sense.  However, if we want to support level triggers in the future, we will have to address this.  I would be uncomfortable designing something that doesn't take the IOAPIC EOI into account, even if its not immediately used.

> 
> For unmodified guests, use the existing pci irq routing.  I certainly 
> wouldn't want to debug anything else.  For modified guests, there's no 
> real problem.

Ill take your word for it ;)  I spent a few minutes looking through the linux kernel trying to figure out how to get a hint about a free vector without assigning one statically in the header files and came up empty handed.  Im sure there is a way (or maybe static is ok?).

> 
>>>>   
>>>>       
>>> Since you need locking anyway, best to use the unlocked versions 
>>> (__set_bit()).
>>>     
>>
>> Ack.  Yes, locking needs to be added.  Votes for appropriate mechanism? 
> (e.g. spinlock, mutex, etc?)
>>
>>   
> 
> spin_lock_irq(), as this lock will frequently have to be taken in host 
> irq handlers.  Need to be extra careful with the locking in 
> {vmx,svm}_vcpu_run().

Ack


>> Well, I could just change any occurrence of "pop_irq" with 
> kvm_vcpu_irq_read() and any occurrence of push_irq() to 
> kvm_vcpu_irq_inject(), but I don't get the impression that this is what you 
> were referring to.  Could you elaborate?
>>   
> 
> I meant having a cleanup patch before that pushes irq handling into 
> common code, then the apic patch could modify that to call the kernel 
> apic code if necessary.
> 

Ack

-Greg




-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: KVM in-kernel APIC update
       [not found]                 ` <4613959F.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
@ 2007-04-04 17:43                   ` Avi Kivity
       [not found]                     ` <4613E3CB.6000904-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  2007-04-04 22:00                   ` Dor Laor
  1 sibling, 1 reply; 22+ messages in thread
From: Avi Kivity @ 2007-04-04 17:43 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Gregory Haskins wrote:
> Agreed.  I was thinking that the interface for the "IOAPIC in kernel" model would look something like the way the pic_send_irq() function looks, except it would also convey BUS/IOAPIC id.
>
> so: kvm_inject_interrupt(int bus, int pin, int value);
>
> and the "kvmpic" driver would currently translate as bus = 0 (giving us IRQ0-23).  E.g. 
>
> kvmpic_send_irq(int irq, int value)
> {
>    kvm_inject_interrupt(0, irq, value);
> }
>   

With appropriate modeling of edge vs level triggered, and translation of 
pin to vector, yes.

>
>   
>> Everything should keep working, that is a must.  We just need the 
>> interfaces to follow the hardware faithfully.  The issue with the ioapic 
>> eoi is worrying me performance wise, though; it looks like we need to 
>> push the ioapic too if we are to have no- compromise performance on 
>> unmodified OSes.
>>     
>
> Not sure if this will make you feel better, but it appears as though both the QEMU and the in-kernel model that I inherited don't accurately support IOAPIC EOI already.  From that you can infer that there must not be any level-sensitive interrupts in use today.  Since the system seems to only have the legacy ISA model (which dictates edge triggers, IIRC), this makes sense.  However, if we want to support level triggers in the future, we will have to address this.  I would be uncomfortable designing something that doesn't take the IOAPIC EOI into account, even if its not immediately used.
>
>   

pci is level triggered, so maybe the guests just handle the inaccuracy.

>> For unmodified guests, use the existing pci irq routing.  I certainly 
>> wouldn't want to debug anything else.  For modified guests, there's no 
>> real problem.
>>     
>
> Ill take your word for it ;)  I spent a few minutes looking through the linux kernel trying to figure out how to get a hint about a free vector without assigning one statically in the header files and came up empty handed.  Im sure there is a way (or maybe static is ok?).
>
>   

No idea really, but the paravirt_ops stuff probably provides the right 
hooks.

-- 
error compiling committee.c: too many arguments to function


-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: KVM in-kernel APIC update
       [not found]                     ` <4613E3CB.6000904-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-04-04 17:56                       ` Gregory Haskins
       [not found]                         ` <4613A090.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 22+ messages in thread
From: Gregory Haskins @ 2007-04-04 17:56 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

>>> On Wed, Apr 4, 2007 at  1:43 PM, in message <4613E3CB.6000904-atKUWr5tajBWk0Htik3J/w@public.gmane.org>,
Avi Kivity <avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org> wrote: 
> Gregory Haskins wrote:
>> Agreed.  I was thinking that the interface for the "IOAPIC in kernel" model 
> would look something like the way the pic_send_irq() function looks, except 
> it would also convey BUS/IOAPIC id.
>>
>> so: kvm_inject_interrupt(int bus, int pin, int value);
>>
>> and the "kvmpic" driver would currently translate as bus = 0 (giving us 
> IRQ0- 23).  E.g. 
>>
>> kvmpic_send_irq(int irq, int value)
>> {
>>    kvm_inject_interrupt(0, irq, value);
>> }
>>   
> 
> With appropriate modeling of edge vs level triggered, and translation of 
> pin to vector, yes.

I believe we would be ok here.  The current code uses a similar model for edge/level, where edge simply ignores *value*, and level uses non-zero value as "assert" and 0 as "deassert".  As far as the pin/vector mapping, that would be taken care of by the programming of the IOAPIC by the BIOS/OS.


> pci is level triggered, so maybe the guests just handle the inaccuracy.
> 

Good point.  I'm not sure how this works today.  Perhaps we just get lucky that nothing checks the IRR in the IOAPIC coupled with a bug in the IOAPIC model that an APIC message is sent out even if the interrupt is not acknowledged.  It would explain why it works today, anyway.  Either way, I would like to get this modeled "right" this go-round, so the point is moot.

-Greg

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: KVM in-kernel APIC update
       [not found]     ` <46135686.4090905-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  2007-04-04 16:23       ` Gregory Haskins
@ 2007-04-04 20:12       ` Ingo Molnar
       [not found]         ` <20070404201205.GC6070-X9Un+BFzKDI@public.gmane.org>
  1 sibling, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2007-04-04 20:12 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f


* Avi Kivity <avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org> wrote:

> > My current thoughts are that we at least move the IOAPIC into the 
> > kernel as well.  That will give sufficient control to generate ISA 
> > bus interrupts for guests that understand APICs.  If we want to be 
> > able to generate ISA interrupts for legacy guests which talk to the 
> > 8259s that will prove to be insufficient.  The good news is that 
> > moving the 8259s down as well is probably not a huge deal either, 
> > especially since I have already prepped the usermode side.  
> > Thoughts?
> 
> I would avoid moving down anything that's not strictly necessary.  If 
> we want to keep the PIC in qemu, for example, just export the APIC-PIC 
> interface to qemu.

we should move all the PICs into KVM proper - and that includes the 
i8259A PIC too. Qemu-space drivers are then wired to pins on these PICs, 
but nothing in Qemu does vector generation or vector prioritization - 
that task is purely up to KVM. There are mixed i8259A+lapic models 
possible too and the simplest model is to have all vector handling in 
KVM.

any 'cut' of the interface to allow both qemu and KVM generate vectors 
is unnecessary (and harmful) complexity. The interface cut should be at 
the 'pin' level, with Qemu raising a signal on a pin and lowering a 
signal on a pin, but otherwise not dealing with IRQ routing and IRQ 
vectors.

	Ingo

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: KVM in-kernel APIC update
       [not found]                         ` <4613A090.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
@ 2007-04-04 20:20                           ` Ingo Molnar
       [not found]                             ` <20070404202046.GD6070-X9Un+BFzKDI@public.gmane.org>
  0 siblings, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2007-04-04 20:20 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

* Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org> wrote:

> > pci is level triggered, so maybe the guests just handle the 
> > inaccuracy.
> > 
> 
> Good point.  I'm not sure how this works today.  Perhaps we just get 
> lucky that nothing checks the IRR in the IOAPIC coupled with a bug in 
> the IOAPIC model that an APIC message is sent out even if the 
> interrupt is not acknowledged.  It would explain why it works today, 
> anyway.  Either way, I would like to get this modeled "right" this 
> go-round, so the point is moot.

on real hardware, some devices produce edges, some devices produce level 
signals that need to be deasserted. The basic model of a PIC is that it 
has 'pins', which convert the signal arriving on those pins into 
interrupts. Qemu itself should only know about the pin enumeration, and 
should only be able to raise/lower the (virtual) 'signal' on such a pin.

PCI can be level and edge triggered too, and IO-APICs can be programmed 
on a per-pin basis to detect edge-high, edge-low, level-high, level-low 
signals.

there is a remote possibility that some OSs depend on certain devices 
being level-triggered: for example if you get an IRQ from a 
level-triggered device and _dont_ deassert that signal from the IRQ 
handler (intentionally so), then the semantics of current hardware will 
cause a second interrupt to be sent by the PIC, after the APIC message 
has been EOI-ed in the local APIC. While such "repeat interrupts" would 
be pure madness to rely on i think, i'm not sure it's not being done. 
Note that if the same IO-APIC pin is set up to detect edges then not 
deasserting the signal would not cause a 'repeat interrupt'. Whether 
such accurate emulation of signalling is needed depends on the hardware 
semantics of the devices we emulate.

	Ingo

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: KVM in-kernel APIC update
       [not found] ` <46128F80.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
  2007-04-04  7:40   ` Avi Kivity
@ 2007-04-04 20:32   ` Ingo Molnar
       [not found]     ` <20070404203235.GA11369-X9Un+BFzKDI@public.gmane.org>
  1 sibling, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2007-04-04 20:32 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

* Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org> wrote:

> Hi all,
> 
> Attached is a snapshot of my current efforts on the kernel side for 
> the in-kernel APIC work.  Feedback welcome.

good work and nice patch! :)

> My current thoughts are that we at least move the IOAPIC into the 
> kernel as well. [...]

yes. And then do the final 10% move of handling the i8529A in KVM too. 
That would mean that PV drivers could inject IRQ events without having 
to schedule back to qemu. Especially if the IRQ is masked this avoids a 
context-switch. (with any PIC component in qemu we have no choice but to 
always wake up qemu and let it handle the event - even if it results in 
a 'no vector generated' decision). This is a plus because most PV 
drivers will do IO completion from irq context and possibly on other 
CPUs.

> The current Bochs/QEMU system model paints a fairly simple ISA 
> architecture utilizing a single IOAPIC + dual 8259 setup.  Do we 
> expect in-kernel injected IRQs to follow the ISA model (e.g. either 
> legacy or PCI interrupts only limited to IRQ0-15) or do we want to 
> expand on this? [...]

yes, we should probably expand them - it's not hard. PV/accel drivers 
would most likely hook up to free pins to unshare the interrupts.

> If the latter, we also need to decide what the resource conveyance 
> model and vector allocation policy should be.  For instance, do we 
> publish said resources formally in the MP/ACPI tables in Bochs?  Doing 
> so would allow MP/ACPI compliant OSs like linux to naturally route the 
> IRQ.  Conversely, do we do something more direct just like we do for 
> KVM discovery via wrmsr?

for PV/accel drivers we dont need any extra ACPI enumeration - the 
hypercall API is good enough to connect to the hypervisor, and i suspect 
all guest OSs we care about allow drivers to allocate an IRQ vector for 
a new device, without having that device enumerated in ACPI.

	Ingo

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: KVM in-kernel APIC update
       [not found]     ` <20070404203235.GA11369-X9Un+BFzKDI@public.gmane.org>
@ 2007-04-04 21:22       ` Gregory Haskins
       [not found]         ` <4613D0CE.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 22+ messages in thread
From: Gregory Haskins @ 2007-04-04 21:22 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

>>> On Wed, Apr 4, 2007 at  4:32 PM, in message <20070404203235.GA11369-X9Un+BFzKDI@public.gmane.org>,
Ingo Molnar <mingo-X9Un+BFzKDI@public.gmane.org> wrote: 
>
>> My current thoughts are that we at least move the IOAPIC into the 
>> kernel as well. [...]
> 
> yes. And then do the final 10% move of handling the i8529A in KVM too. 

Hi Ingo,
  We are in full agreement on this point, and has been my preferred model from the beginning.  The only issue with this approach is that it requires a fairly disruptive patch to QEMUs "pic_set_irq()" feature which many people have drawn exception to so far.  (In case you weren't following from the beginning, its the "QEMU PIC indirection patch...." thread). 

If we dont care about supporting "--no-kvm" anymore, this problem becomes trivially easy.....we can just link in a different pic module into QEMU and be done with it.  The problem as I see it is that we really have a lot of value in being able to switch between kvm and pure qemu mode via --no-kvm, especially for debugging.  Therefore, IMHO we need to be able to dynamically switch between PIC emulation code.  

If we *do* want to go with this model, *and* we decide that the approach I have taken with QEMU is a reasonable way to do it, then I would suggest we go about it by getting the patch accepted in QEMU upstream.  I would gladly take on this duty if we all agree this is the right approach.

> for PV/accel drivers we dont need any extra ACPI enumeration -  the 
> hypercall API is good enough to connect to the hypervisor, and i suspect 
> all guest OSs we care about allow drivers to allocate an IRQ vector for 
> a new device, without having that device enumerated in ACPI.

If you know how to do this in Linux, please share!  I was looking for this earlier and came up empty handed.  All I could find was the places where the PCI/MP/ACPI type things assigned vectors to devices they new about.  It's probably "operator-ignorance" ;)

Regards,
-Greg

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: KVM in-kernel APIC update
       [not found]         ` <4613D0CE.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
@ 2007-04-04 21:40           ` Anthony Liguori
  2007-04-05  7:11           ` Avi Kivity
  1 sibling, 0 replies; 22+ messages in thread
From: Anthony Liguori @ 2007-04-04 21:40 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Gregory Haskins wrote:
>>>> On Wed, Apr 4, 2007 at  4:32 PM, in message <20070404203235.GA11369-X9Un+BFzKDI@public.gmane.org>,
>>>>         
> Ingo Molnar <mingo-X9Un+BFzKDI@public.gmane.org> wrote: 
>   
>>> My current thoughts are that we at least move the IOAPIC into the 
>>> kernel as well. [...]
>>>       
>> yes. And then do the final 10% move of handling the i8529A in KVM too. 
>>     
>
> Hi Ingo,
>   We are in full agreement on this point, and has been my preferred model from the beginning.  The only issue with this approach is that it requires a fairly disruptive patch to QEMUs "pic_set_irq()" feature which many people have drawn exception to so far.  (In case you weren't following from the beginning, its the "QEMU PIC indirection patch...." thread). 
>
> If we dont care about supporting "--no-kvm" anymore, this problem becomes trivially easy.....

No, this would be a big mistake.  We'll just end up with another qemu-dm.

> we can just link in a different pic module into QEMU and be done with it.  The problem as I see it is that we really have a lot of value in being able to switch between kvm and pure qemu mode via --no-kvm, especially for debugging.  Therefore, IMHO we need to be able to dynamically switch between PIC emulation code.  
>
> If we *do* want to go with this model, *and* we decide that the approach I have taken with QEMU is a reasonable way to do it, then I would suggest we go about it by getting the patch accepted in QEMU upstream.  I would gladly take on this duty if we all agree this is the right approach.
>   

I think you should post the patch on qemu-devel if for nothing else to 
begin the discussion.

Regards,

Anthony Liguori

>   
>> for PV/accel drivers we dont need any extra ACPI enumeration -  the 
>> hypercall API is good enough to connect to the hypervisor, and i suspect 
>> all guest OSs we care about allow drivers to allocate an IRQ vector for 
>> a new device, without having that device enumerated in ACPI.
>>     
>
> If you know how to do this in Linux, please share!  I was looking for this earlier and came up empty handed.  All I could find was the places where the PCI/MP/ACPI type things assigned vectors to devices they new about.  It's probably "operator-ignorance" ;)
>
> Regards,
> -Greg
>
>
>
> -------------------------------------------------------------------------
> Take Surveys. Earn Cash. Influence the Future of IT
> Join SourceForge.net's Techsay panel and you'll get the chance to share your
> opinions on IT & business topics through brief surveys-and earn cash
> http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> _______________________________________________
> kvm-devel mailing list
> kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org
> https://lists.sourceforge.net/lists/listinfo/kvm-devel
>
>   


-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: KVM in-kernel APIC update
       [not found]         ` <20070404201205.GC6070-X9Un+BFzKDI@public.gmane.org>
@ 2007-04-04 21:55           ` Dor Laor
  2007-04-05  6:47           ` Avi Kivity
  2007-04-05 14:37           ` Dong, Eddie
  2 siblings, 0 replies; 22+ messages in thread
From: Dor Laor @ 2007-04-04 21:55 UTC (permalink / raw)
  To: Ingo Molnar, Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

>we should move all the PICs into KVM proper - and that includes the
>i8259A PIC too. Qemu-space drivers are then wired to pins on these
PICs,
>but nothing in Qemu does vector generation or vector prioritization -
>that task is purely up to KVM. There are mixed i8259A+lapic models
>possible too and the simplest model is to have all vector handling in
>KVM.
>
>any 'cut' of the interface to allow both qemu and KVM generate vectors
>is unnecessary (and harmful) complexity. The interface cut should be at
>the 'pin' level, with Qemu raising a signal on a pin and lowering a
>signal on a pin, but otherwise not dealing with IRQ routing and IRQ
>vectors.
>
>	Ingo

Actually that the best cut, it's not a cut that was chosen because of
someone's implementation status. It is the best logical way doing
things. 
It's either all of the components are in user space or in kernel space
altogether. Since PV drivers and APIC acceleration (vt/svm) are strong
arguments in favor of kernel implementation we should aim towards that.

It is also a very clean cut and it will predicting the exact interface
is easy.

>
>-----------------------------------------------------------------------
--
>Take Surveys. Earn Cash. Influence the Future of IT
>Join SourceForge.net's Techsay panel and you'll get the chance to share
>your
>opinions on IT & business topics through brief surveys-and earn cash
>http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVD
EV
>_______________________________________________
>kvm-devel mailing list
>kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org
>https://lists.sourceforge.net/lists/listinfo/kvm-devel

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: KVM in-kernel APIC update
       [not found]                 ` <4613959F.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
  2007-04-04 17:43                   ` Avi Kivity
@ 2007-04-04 22:00                   ` Dor Laor
  1 sibling, 0 replies; 22+ messages in thread
From: Dor Laor @ 2007-04-04 22:00 UTC (permalink / raw)
  To: Gregory Haskins, Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

>> Gregory Haskins wrote:
>>
>>
>> Hmm.  If the ioapic is in the kernel, then it's a platform- wide
resource
>> and you would need a vm ioctl.  If ioapic emulation is in userspace,
>> then the ioapic logic will have decided which cpu is targeted and you
>> would issue a vcpu ioctl.
>>
>
>Thats exactly in line with my thinking.

This is one supports in-kernel apic since the physical ioapic is not
bound to specific cpu.

>
>>> So that being said, I think the interface between userspace and
kernel
>would
>> be no more complex than it is today.  I.e. we would just be passing a
>single
>> int via an ioctl.
>>
>> I think the interface should mirror the hardware interface at the
point
>> of the "cut".  For example, if we keep the ioapic in userspace, the
>> interface is ioapic/apic bus messages.  If we push the ioapic into
the
>> kernel, the interface describes the various ioapic pins and how the
>> ioapics are connected to each other and to the processors (i.e. the
>> topology).
>
>Agreed.  I was thinking that the interface for the "IOAPIC in kernel"
model
>would look something like the way the pic_send_irq() function looks,
except
>it would also convey BUS/IOAPIC id.
>

Keeping the ioapic in qemu was just a temporal solution, the full blown
architecture should implement everything within the kernel.


-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: KVM in-kernel APIC update
       [not found]         ` <20070404201205.GC6070-X9Un+BFzKDI@public.gmane.org>
  2007-04-04 21:55           ` Dor Laor
@ 2007-04-05  6:47           ` Avi Kivity
       [not found]             ` <46149B7C.5020004-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  2007-04-05 14:37           ` Dong, Eddie
  2 siblings, 1 reply; 22+ messages in thread
From: Avi Kivity @ 2007-04-05  6:47 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Ingo Molnar wrote:
> we should move all the PICs into KVM proper - and that includes the 
> i8259A PIC too. Qemu-space drivers are then wired to pins on these PICs, 
> but nothing in Qemu does vector generation or vector prioritization - 
> that task is purely up to KVM. There are mixed i8259A+lapic models 
> possible too and the simplest model is to have all vector handling in 
> KVM.
>
> any 'cut' of the interface to allow both qemu and KVM generate vectors 
> is unnecessary (and harmful) complexity. The interface cut should be at 
> the 'pin' level, with Qemu raising a signal on a pin and lowering a 
> signal on a pin, but otherwise not dealing with IRQ routing and IRQ 
> vectors.
>   

Following is my view of the possible cuts:

- everything in userspace: worst performance, but needed for comaptibility

- tpr in kernel: minimal effort, bad badly defined interface

- lapic in kernel: well defined (all on-chip stuff in kernel, off-chip 
in userspace), fixes main Windows problems, interrupts still need 
userspace.  Interface is the processor's LINT pins and APIC bus.

- *pic in kernel: most effort, easiest irq synchronization.  Interface 
is pic/ioapic pin level, plus a topology description that userspace uses 
to tell the hardware how the pins are connected (logically required, but 
practically not).

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: KVM in-kernel APIC update
       [not found]                             ` <20070404202046.GD6070-X9Un+BFzKDI@public.gmane.org>
@ 2007-04-05  6:48                               ` Avi Kivity
  0 siblings, 0 replies; 22+ messages in thread
From: Avi Kivity @ 2007-04-05  6:48 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Ingo Molnar wrote:
> there is a remote possibility that some OSs depend on certain devices 
> being level-triggered: for example if you get an IRQ from a 
> level-triggered device and _dont_ deassert that signal from the IRQ 
> handler (intentionally so), then the semantics of current hardware will 
> cause a second interrupt to be sent by the PIC, after the APIC message 
> has been EOI-ed in the local APIC. While such "repeat interrupts" would 
> be pure madness to rely on i think, i'm not sure it's not being done. 
> Note that if the same IO-APIC pin is set up to detect edges then not 
> deasserting the signal would not cause a 'repeat interrupt'. Whether 
> such accurate emulation of signalling is needed depends on the hardware 
> semantics of the devices we emulate.
>
>   

Don't all OSes heavily depend on level-triggered interrupts for irq line 
sharing?

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: KVM in-kernel APIC update
       [not found]         ` <4613D0CE.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
  2007-04-04 21:40           ` Anthony Liguori
@ 2007-04-05  7:11           ` Avi Kivity
  1 sibling, 0 replies; 22+ messages in thread
From: Avi Kivity @ 2007-04-05  7:11 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Gregory Haskins wrote:
>>>> On Wed, Apr 4, 2007 at  4:32 PM, in message <20070404203235.GA11369-X9Un+BFzKDI@public.gmane.org>,
>>>>         
> Ingo Molnar <mingo-X9Un+BFzKDI@public.gmane.org> wrote: 
>   
>>> My current thoughts are that we at least move the IOAPIC into the 
>>> kernel as well. [...]
>>>       
>> yes. And then do the final 10% move of handling the i8529A in KVM too. 
>>     
>
> Hi Ingo,
>   We are in full agreement on this point, and has been my preferred model from the beginning.  The only issue with this approach is that it requires a fairly disruptive patch to QEMUs "pic_set_irq()" feature which many people have drawn exception to so far.  (In case you weren't following from the beginning, its the "QEMU PIC indirection patch...." thread). 
>   

You can hack it with if (use_kvm) { ... } (or, more likely, 
use_kvm_platform_/


> If we dont care about supporting "--no-kvm" anymore, this problem becomes trivially easy.....we can just link in a different pic module into QEMU and be done with it.  The problem as I see it is that we really have a lot of value in being able to switch between kvm and pure qemu mode via --no-kvm, especially for debugging.  Therefore, IMHO we need to be able to dynamically switch between PIC emulation code.  
>
>   

-no-kvm is critical, it's an important debugging tool.  Also keeping 
emulation in qemu is important for <= 2.6.21 (or maybe 22) users, which 
will not have the in-kernel platform devices.

> If we *do* want to go with this model, *and* we decide that the approach I have taken with QEMU is a reasonable way to do it, then I would suggest we go about it by getting the patch accepted in QEMU upstream.  I would gladly take on this duty if we all agree this is the right approach.
>   

I think getting the patch in is beneficial.  Perhaps after re-review in 
light of all the discussion.



-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: KVM in-kernel APIC update
       [not found]             ` <46149B7C.5020004-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-04-05  7:37               ` Dor Laor
  2007-04-05 10:39               ` Ingo Molnar
  1 sibling, 0 replies; 22+ messages in thread
From: Dor Laor @ 2007-04-05  7:37 UTC (permalink / raw)
  To: Avi Kivity, Ingo Molnar; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

>Ingo Molnar wrote:
>> we should move all the PICs into KVM proper - and that includes the
>> i8259A PIC too. Qemu-space drivers are then wired to pins on these
PICs,
>> but nothing in Qemu does vector generation or vector prioritization -
>> that task is purely up to KVM. There are mixed i8259A+lapic models
>> possible too and the simplest model is to have all vector handling in
>> KVM.
>>
>> any 'cut' of the interface to allow both qemu and KVM generate
vectors
>> is unnecessary (and harmful) complexity. The interface cut should be
at
>> the 'pin' level, with Qemu raising a signal on a pin and lowering a
>> signal on a pin, but otherwise not dealing with IRQ routing and IRQ
>> vectors.
>>
>
>Following is my view of the possible cuts:
>
>- everything in userspace: worst performance, but needed for
comaptibility
>
>- tpr in kernel: minimal effort, bad badly defined interface
>
>- lapic in kernel: well defined (all on-chip stuff in kernel, off-chip
>in userspace), fixes main Windows problems, interrupts still need
>userspace.  Interface is the processor's LINT pins and APIC bus.
>
>- *pic in kernel: most effort, easiest irq synchronization.  Interface
>is pic/ioapic pin level, plus a topology description that userspace
uses
>to tell the hardware how the pins are connected (logically required,
but
>practically not).


I vote for the *pic in kernel:
- Clean 'cut', kills all the future arguments against moving things
around
- Easier to define a stable interface
- Easy support for PV drivers
- Easy hw assist acceleration
- Provides the best performance

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: KVM in-kernel APIC update
       [not found]             ` <46149B7C.5020004-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  2007-04-05  7:37               ` Dor Laor
@ 2007-04-05 10:39               ` Ingo Molnar
       [not found]                 ` <20070405103933.GA8936-X9Un+BFzKDI@public.gmane.org>
  1 sibling, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2007-04-05 10:39 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f


* Avi Kivity <avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org> wrote:

> - *pic in kernel: most effort, easiest irq synchronization.  Interface 
> is pic/ioapic pin level, plus a topology description that userspace 
> uses to tell the hardware how the pins are connected (logically 
> required, but practically not).

yeah, that variant is the right way to do it. Note that this also paves 
the way for possible future PIC hardware virtualization features like 
the CPU directly injecting an external IRQ vector to the guest [without 
trapping out to the hypervisor]. Interrupt handling fundamentally 
belongs into the kernel, be that the host or the guest. (User-space 
abstractions in the PIC space make little sense and i know of none.)

	Ingo

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: KVM in-kernel APIC update
       [not found]                 ` <20070405103933.GA8936-X9Un+BFzKDI@public.gmane.org>
@ 2007-04-05 10:50                   ` Ingo Molnar
       [not found]                     ` <20070405105042.GA11779-X9Un+BFzKDI@public.gmane.org>
  0 siblings, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2007-04-05 10:50 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f


there's one thing i havent seen mentioned (maybe i missed it): the 
ability to save/restore the in-kernel *PIC state. That would be required 
for migration/snapshotting/etc.

	Ingo

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: KVM in-kernel APIC update
       [not found]                     ` <20070405105042.GA11779-X9Un+BFzKDI@public.gmane.org>
@ 2007-04-05 11:29                       ` Avi Kivity
  0 siblings, 0 replies; 22+ messages in thread
From: Avi Kivity @ 2007-04-05 11:29 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Ingo Molnar wrote:
> there's one thing i havent seen mentioned (maybe i missed it): the 
> ability to save/restore the in-kernel *PIC state. That would be required 
> for migration/snapshotting/etc.
>   

That is a common requirement for all variants.  Actually this is best 
implemented (assuming the in-kernel model) as moving the in-kernel *pic 
state to the qemu *pic implementation, and letting the qemu code 
serialize it normally.  This is what we do for cpu registers.

-- 
error compiling committee.c: too many arguments to function


-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: KVM in-kernel APIC update
       [not found]         ` <20070404201205.GC6070-X9Un+BFzKDI@public.gmane.org>
  2007-04-04 21:55           ` Dor Laor
  2007-04-05  6:47           ` Avi Kivity
@ 2007-04-05 14:37           ` Dong, Eddie
  2 siblings, 0 replies; 22+ messages in thread
From: Dong, Eddie @ 2007-04-05 14:37 UTC (permalink / raw)
  To: Ingo Molnar, Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Ingo Molnar wrote:
> * Avi Kivity <avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org> wrote:
> 
> any 'cut' of the interface to allow both qemu and KVM generate vectors
> is unnecessary (and harmful) complexity. The interface cut should be
> at 
> the 'pin' level, with Qemu raising a signal on a pin and lowering a
> signal on a pin, but otherwise not dealing with IRQ routing and IRQ
> vectors.
> 
Actually, current Qemu device is not well designed to support this
behavior,
it will call pic_set_irq even if the 'pin' state doesn't change for some
reason.
In Xen, we ever tracked the 'pin' state in Qemu too so that only
'raising' 
and 'lowering' a signal will inform VMM like you mentioned. But this
logic
is removed later on at cleanup.

Thx, Eddie

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2007-04-05 14:37 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-04-03 22:31 KVM in-kernel APIC update Gregory Haskins
     [not found] ` <46128F80.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
2007-04-04  7:40   ` Avi Kivity
     [not found]     ` <46135686.4090905-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-04-04 16:23       ` Gregory Haskins
     [not found]         ` <46138A98.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
2007-04-04 16:49           ` Avi Kivity
     [not found]             ` <4613D736.1080207-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-04-04 17:10               ` Gregory Haskins
     [not found]                 ` <4613959F.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
2007-04-04 17:43                   ` Avi Kivity
     [not found]                     ` <4613E3CB.6000904-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-04-04 17:56                       ` Gregory Haskins
     [not found]                         ` <4613A090.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
2007-04-04 20:20                           ` Ingo Molnar
     [not found]                             ` <20070404202046.GD6070-X9Un+BFzKDI@public.gmane.org>
2007-04-05  6:48                               ` Avi Kivity
2007-04-04 22:00                   ` Dor Laor
2007-04-04 20:12       ` Ingo Molnar
     [not found]         ` <20070404201205.GC6070-X9Un+BFzKDI@public.gmane.org>
2007-04-04 21:55           ` Dor Laor
2007-04-05  6:47           ` Avi Kivity
     [not found]             ` <46149B7C.5020004-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-04-05  7:37               ` Dor Laor
2007-04-05 10:39               ` Ingo Molnar
     [not found]                 ` <20070405103933.GA8936-X9Un+BFzKDI@public.gmane.org>
2007-04-05 10:50                   ` Ingo Molnar
     [not found]                     ` <20070405105042.GA11779-X9Un+BFzKDI@public.gmane.org>
2007-04-05 11:29                       ` Avi Kivity
2007-04-05 14:37           ` Dong, Eddie
2007-04-04 20:32   ` Ingo Molnar
     [not found]     ` <20070404203235.GA11369-X9Un+BFzKDI@public.gmane.org>
2007-04-04 21:22       ` Gregory Haskins
     [not found]         ` <4613D0CE.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
2007-04-04 21:40           ` Anthony Liguori
2007-04-05  7:11           ` Avi Kivity

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox