PCI passthrough with VT-d - native performance

public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed

* PCI passthrough with VT-d - native performance
@ 2008-07-16 13:16 Ben-Ami Yassour
  2008-07-16 13:16 ` [PATCH 1/8] KVM: Introduce a callback routine for IOAPIC ack handling Ben-Ami Yassour
  2008-07-16 14:36 ` PCI passthrough with VT-d - native performance Avi Kivity
  0 siblings, 2 replies; 42+ messages in thread
From: Ben-Ami Yassour @ 2008-07-16 13:16 UTC (permalink / raw)
  To: amit.shah; +Cc: kvm, muli, benami, weidong.han, anthony

In last few tests that we made with PCI-passthrough and VT-d using
iperf, we were able to get the same throughput as on native OS with a 1G
NIC (with higher CPU utilization).

The following patches are the PCI-passthrough patches that Amit sent
(re-based on the last kvm tree), followed by a few improvements and the
VT-d extension.
I am also sending the userspace patches: the patch that Amit sent for
PCI passthrough and the direct-mmio extension for userspace (note that
without the direct mmio extension we get less then half the throughput).

Comments are welcome.

Regards,
Ben

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 1/8] KVM: Introduce a callback routine for IOAPIC ack handling
  2008-07-16 13:16 PCI passthrough with VT-d - native performance Ben-Ami Yassour
@ 2008-07-16 13:16 ` Ben-Ami Yassour
  2008-07-16 13:16   ` [PATCH 2/8] KVM: Introduce a callback routine for PIC " Ben-Ami Yassour
  2008-07-16 14:36 ` PCI passthrough with VT-d - native performance Avi Kivity
  1 sibling, 1 reply; 42+ messages in thread
From: Ben-Ami Yassour @ 2008-07-16 13:16 UTC (permalink / raw)
  To: amit.shah; +Cc: kvm, muli, benami, weidong.han, anthony

From: Amit Shah <amit.shah@qumranet.com>

This will be useful for acking irqs of assigned devices

Signed-off-by: Amit Shah <amit.shah@qumranet.com>
---
 virt/kvm/ioapic.c |    3 +++
 virt/kvm/ioapic.h |    1 +
 2 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index c0d2287..8ce93c7 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -295,6 +295,9 @@ static void __kvm_ioapic_update_eoi(struct kvm_ioapic *ioapic, int gsi)
 	ent->fields.remote_irr = 0;
 	if (!ent->fields.mask && (ioapic->irr & (1 << gsi)))
 		ioapic_service(ioapic, gsi);
+
+	if (ioapic->ack_notifier)
+		ioapic->ack_notifier(ioapic->kvm, gsi);
 }
 
 void kvm_ioapic_update_eoi(struct kvm *kvm, int vector)
diff --git a/virt/kvm/ioapic.h b/virt/kvm/ioapic.h
index 7f16675..a42743f 100644
--- a/virt/kvm/ioapic.h
+++ b/virt/kvm/ioapic.h
@@ -58,6 +58,7 @@ struct kvm_ioapic {
 	} redirtbl[IOAPIC_NUM_PINS];
 	struct kvm_io_device dev;
 	struct kvm *kvm;
+	void (*ack_notifier)(void *opaque, int irq);
 };
 
 #ifdef DEBUG
-- 
1.5.6


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 2/8] KVM: Introduce a callback routine for PIC ack handling
  2008-07-16 13:16 ` [PATCH 1/8] KVM: Introduce a callback routine for IOAPIC ack handling Ben-Ami Yassour
@ 2008-07-16 13:16   ` Ben-Ami Yassour
  2008-07-16 13:17     ` [PATCH 3/8] KVM: Handle device assignment to guests Ben-Ami Yassour
  0 siblings, 1 reply; 42+ messages in thread
From: Ben-Ami Yassour @ 2008-07-16 13:16 UTC (permalink / raw)
  To: amit.shah; +Cc: kvm, muli, benami, weidong.han, anthony

From: Amit Shah <amit.shah@qumranet.com>

This is useful for acking irqs of assigned devices

Signed-off-by: Amit Shah <amit.shah@qumranet.com>
---
 arch/x86/kvm/i8259.c |    6 +++++-
 arch/x86/kvm/irq.c   |    2 +-
 arch/x86/kvm/irq.h   |    3 ++-
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
index 55e179a..c1a4110 100644
--- a/arch/x86/kvm/i8259.c
+++ b/arch/x86/kvm/i8259.c
@@ -159,9 +159,10 @@ static inline void pic_intack(struct kvm_kpic_state *s, int irq)
 		s->irr &= ~(1 << irq);
 }
 
-int kvm_pic_read_irq(struct kvm_pic *s)
+int kvm_pic_read_irq(struct kvm *kvm)
 {
 	int irq, irq2, intno;
+	struct kvm_pic *s = pic_irqchip(kvm);
 
 	irq = pic_get_irq(&s->pics[0]);
 	if (irq >= 0) {
@@ -186,6 +187,9 @@ int kvm_pic_read_irq(struct kvm_pic *s)
 		irq = 7;
 		intno = s->pics[0].irq_base + irq;
 	}
+	if (kvm->arch.vpic->ack_notifier)
+		kvm->arch.vpic->ack_notifier(kvm, irq);
+
 	pic_update_irq(s);
 
 	return intno;
diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index 0d9e552..3529620 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -72,7 +72,7 @@ int kvm_cpu_get_interrupt(struct kvm_vcpu *v)
 		if (kvm_apic_accept_pic_intr(v)) {
 			s = pic_irqchip(v->kvm);
 			s->output = 0;		/* PIC */
-			vector = kvm_pic_read_irq(s);
+			vector = kvm_pic_read_irq(v->kvm);
 		}
 	}
 	return vector;
diff --git a/arch/x86/kvm/irq.h b/arch/x86/kvm/irq.h
index 07ff2ae..ac71a04 100644
--- a/arch/x86/kvm/irq.h
+++ b/arch/x86/kvm/irq.h
@@ -63,11 +63,12 @@ struct kvm_pic {
 	void *irq_request_opaque;
 	int output;		/* intr from master PIC */
 	struct kvm_io_device dev;
+	void (*ack_notifier)(void *opaque, int irq);
 };
 
 struct kvm_pic *kvm_create_pic(struct kvm *kvm);
 void kvm_pic_set_irq(void *opaque, int irq, int level);
-int kvm_pic_read_irq(struct kvm_pic *s);
+int kvm_pic_read_irq(struct kvm *kvm);
 void kvm_pic_update_irq(struct kvm_pic *s);
 
 static inline struct kvm_pic *pic_irqchip(struct kvm *kvm)
-- 
1.5.6


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 3/8] KVM: Handle device assignment to guests
  2008-07-16 13:16   ` [PATCH 2/8] KVM: Introduce a callback routine for PIC " Ben-Ami Yassour
@ 2008-07-16 13:17     ` Ben-Ami Yassour
  2008-07-16 13:17       ` [PATCH 4/8] KVM: PCIPT: fix interrupt handling Ben-Ami Yassour
  2008-07-16 15:04       ` [PATCH 3/8] KVM: Handle device assignment to guests Avi Kivity
  0 siblings, 2 replies; 42+ messages in thread
From: Ben-Ami Yassour @ 2008-07-16 13:17 UTC (permalink / raw)
  To: amit.shah; +Cc: kvm, muli, benami, weidong.han, anthony

From: Han, Weidong <weidong.han@intel.com>

This patch adds support for handling PCI devices that are assigned to
the guest ("PCI passthrough").

The device to be assigned to the guest is registered in the host kernel
and interrupt delivery is handled. If a device is already assigned, or
the
device driver for it is still loaded on the host, the device assignment
is
failed by conveying a -EBUSY reply to the userspace.

Devices that share their interrupt line are not supported at the moment.

By itself, this patch will not make devices work within the guest. There
has to be some mechanism of translating guest DMA addresses into machine
addresses.
This support comes from one of three approaches:
1. If you have recent Intel hardware with VT-d support, you can use the
patches in

    git.kernel.org/pub/scm/linux/kernel/git/amit/kvm.git vtd
    git.kernel.org/pub/scm/linux/kernel/git/amit/kvm-userspace.git vtd

    These patches are for the host kernel.
2. For paravirtualised Linux guests, you can use the patches in

    git.kernel.org/pub/scm/linux/kernel/git/amit/kvm.git pvdma
    git.kernel.org/pub/scm/linux/kernel/git/amit/kvm-userspace.git pvdma

    This kernel tree has patches for host as well as guest kernels.

3. 1-1 mapping of guest in host address space

    The patches to do this are available on the kvm / lkml list archive:
    http://thread.gmane.org/gmane.comp.emulators.kvm.devel/18722/focus=18753

Signed-off-by: Amit Shah <amit.shah@qumranet.com>
Signed-off-by: Ben-Ami Yassour <benami@il.ibm.com>
Signed-off-by: Han, Weidong <weidong.han@intel.com>
---
 arch/x86/kvm/x86.c         |  293 ++++++++++++++++++++++++++++++++++++++++++++
 include/asm-x86/kvm_host.h |   38 ++++++
 include/asm-x86/kvm_para.h |   16 +++-
 include/linux/kvm.h        |    3 +
 virt/kvm/ioapic.c          |   13 ++-
 5 files changed, 360 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3167006..c07ca2b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4,10 +4,12 @@
  * derived from drivers/kvm/kvm_main.c
  *
  * Copyright (C) 2006 Qumranet, Inc.
+ * Copyright (C) 2008 Qumranet, Inc.
  *
  * Authors:
  *   Avi Kivity   <avi@qumranet.com>
  *   Yaniv Kamay  <yaniv@qumranet.com>
+ *   Amit Shah    <amit.shah@qumranet.com>
  *
  * This work is licensed under the terms of the GNU GPL, version 2.  See
  * the COPYING file in the top-level directory.
@@ -23,8 +25,10 @@
 #include "x86.h"
 
 #include <linux/clocksource.h>
+#include <linux/interrupt.h>
 #include <linux/kvm.h>
 #include <linux/fs.h>
+#include <linux/pci.h>
 #include <linux/vmalloc.h>
 #include <linux/module.h>
 #include <linux/mman.h>
@@ -98,6 +102,282 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
 	{ NULL }
 };
 
+DEFINE_RWLOCK(kvm_pci_pt_lock);
+
+/*
+ * Used to find a registered host PCI device (a "passthrough" device)
+ * during ioctls, interrupts or EOI
+ */
+struct kvm_pci_pt_dev_list *
+kvm_find_pci_pt_dev(struct list_head *head,
+		struct kvm_pci_pt_info *pt_pci_info, int irq, int source)
+{
+	struct list_head *ptr;
+	struct kvm_pci_pt_dev_list *match;
+
+	list_for_each(ptr, head) {
+		match = list_entry(ptr, struct kvm_pci_pt_dev_list, list);
+
+		switch (source) {
+		case KVM_PT_SOURCE_IRQ:
+			/*
+			 * Used to find a registered host device
+			 * during interrupt context on host
+			 */
+			if (match->pt_dev.host.irq == irq)
+				return match;
+			break;
+		case KVM_PT_SOURCE_IRQ_ACK:
+			/*
+			 * Used to find a registered host device when
+			 * the guest acks an interrupt
+			 */
+			if (match->pt_dev.guest.irq == irq)
+				return match;
+			break;
+		case KVM_PT_SOURCE_UPDATE:
+			if ((match->pt_dev.host.busnr == pt_pci_info->busnr) &&
+			    (match->pt_dev.host.devfn == pt_pci_info->devfn))
+				return match;
+			break;
+		}
+	}
+	return NULL;
+}
+
+static DECLARE_BITMAP(pt_irq_handled, NR_IRQS);
+
+static void kvm_pci_pt_work_fn(struct work_struct *work)
+{
+	struct kvm_pci_pt_dev_list *match;
+	struct kvm_pci_pt_work *int_work;
+	int source;
+	unsigned long flags;
+	int guest_irq;
+	int host_irq;
+
+	int_work = container_of(work, struct kvm_pci_pt_work, work);
+
+	source = int_work->source ? KVM_PT_SOURCE_IRQ_ACK : KVM_PT_SOURCE_IRQ;
+
+	/* This is taken to safely inject irq inside the guest. When
+	 * the interrupt injection (or the ioapic code) uses a
+	 * finer-grained lock, update this
+	 */
+	mutex_lock(&int_work->kvm->lock);
+	read_lock_irqsave(&kvm_pci_pt_lock, flags);
+	match = kvm_find_pci_pt_dev(&int_work->kvm->arch.pci_pt_dev_head, NULL,
+				    int_work->irq, source);
+	if (!match) {
+		printk(KERN_ERR "%s: no matching device assigned to guest "
+		       "found for irq %d, source = %d!\n",
+		       __func__, int_work->irq, int_work->source);
+		read_unlock_irqrestore(&kvm_pci_pt_lock, flags);
+		goto out;
+	}
+	guest_irq = match->pt_dev.guest.irq;
+	host_irq = match->pt_dev.host.irq;
+	read_unlock_irqrestore(&kvm_pci_pt_lock, flags);
+
+	if (source == KVM_PT_SOURCE_IRQ)
+		kvm_set_irq(int_work->kvm, guest_irq, 1);
+	else {
+		kvm_set_irq(int_work->kvm, int_work->irq, 0);
+		enable_irq(host_irq);
+	}
+out:
+	mutex_unlock(&int_work->kvm->lock);
+	kvm_put_kvm(int_work->kvm);
+}
+
+/* FIXME: Implement the OR logic needed to make shared interrupts on
+ * this line behave properly
+ */
+static irqreturn_t kvm_pci_pt_dev_intr(int irq, void *dev_id)
+{
+	struct kvm *kvm = (struct kvm *) dev_id;
+	struct kvm_pci_pt_dev_list *pci_pt_dev;
+
+	if (!test_bit(irq, pt_irq_handled))
+		return IRQ_NONE;
+
+	read_lock(&kvm_pci_pt_lock);
+	pci_pt_dev = kvm_find_pci_pt_dev(&kvm->arch.pci_pt_dev_head, NULL,
+					 irq, KVM_PT_SOURCE_IRQ);
+	if (!pci_pt_dev) {
+		read_unlock(&kvm_pci_pt_lock);
+		return IRQ_NONE;
+	}
+
+	pci_pt_dev->pt_dev.int_work.irq = irq;
+	pci_pt_dev->pt_dev.int_work.kvm = kvm;
+	pci_pt_dev->pt_dev.int_work.source = 0;
+
+	kvm_get_kvm(kvm);
+	schedule_work(&pci_pt_dev->pt_dev.int_work.work);
+	read_unlock(&kvm_pci_pt_lock);
+
+	disable_irq_nosync(irq);
+	return IRQ_HANDLED;
+}
+
+/* Ack the irq line for a passthrough device */
+static void kvm_pci_pt_ack_irq(void *opaque, int irq)
+{
+	struct kvm *kvm = opaque;
+	struct kvm_pci_pt_dev_list *pci_pt_dev;
+	unsigned long flags;
+
+	if (irq == -1)
+		return;
+
+	read_lock_irqsave(&kvm_pci_pt_lock, flags);
+	pci_pt_dev = kvm_find_pci_pt_dev(&kvm->arch.pci_pt_dev_head, NULL, irq,
+					 KVM_PT_SOURCE_IRQ_ACK);
+	if (!pci_pt_dev) {
+		read_unlock_irqrestore(&kvm_pci_pt_lock, flags);
+		return;
+	}
+	pci_pt_dev->pt_dev.ack_work.irq = irq;
+	pci_pt_dev->pt_dev.ack_work.kvm = kvm;
+	pci_pt_dev->pt_dev.ack_work.source = 1;
+
+	kvm_get_kvm(kvm);
+	schedule_work(&pci_pt_dev->pt_dev.ack_work.work);
+	read_unlock_irqrestore(&kvm_pci_pt_lock, flags);
+}
+
+static int kvm_vm_ioctl_pci_pt_dev(struct kvm *kvm,
+				   struct kvm_pci_passthrough_dev *pci_pt_dev)
+{
+	int r = 0;
+	struct kvm_pci_pt_dev_list *match;
+	unsigned long flags;
+	struct pci_dev *dev;
+
+	write_lock_irqsave(&kvm_pci_pt_lock, flags);
+
+	/* Check if this is a request to update the irq of the device
+	 * in the guest (BIOS/ kernels can dynamically reprogram irq
+	 * numbers).  This also protects us from adding the same
+	 * device twice.
+	 */
+	match = kvm_find_pci_pt_dev(&kvm->arch.pci_pt_dev_head,
+				    &pci_pt_dev->host, 0, KVM_PT_SOURCE_UPDATE);
+	if (match) {
+		match->pt_dev.guest.irq = pci_pt_dev->guest.irq;
+		write_unlock_irqrestore(&kvm_pci_pt_lock, flags);
+		goto out;
+	}
+	write_unlock_irqrestore(&kvm_pci_pt_lock, flags);
+
+	match = kzalloc(sizeof(struct kvm_pci_pt_dev_list), GFP_KERNEL);
+	if (match == NULL) {
+		printk(KERN_INFO "%s: Couldn't allocate memory\n",
+		       __func__);
+		r = -ENOMEM;
+		goto out;
+	}
+	dev = pci_get_bus_and_slot(pci_pt_dev->host.busnr,
+				   pci_pt_dev->host.devfn);
+	if (!dev) {
+		printk(KERN_INFO "%s: host device not found\n", __func__);
+		r = -EINVAL;
+		goto out_free;
+	}
+	if (pci_enable_device(dev)) {
+		printk(KERN_INFO "%s: Could not enable PCI device\n", __func__);
+		r = -EBUSY;
+		goto out_put;
+	}
+	r = pci_request_regions(dev, "kvm_pt_device");
+	if (r) {
+		printk(KERN_INFO "%s: Could not get access to device regions\n",
+		       __func__);
+		goto out_put;
+	}
+	match->pt_dev.guest.busnr = pci_pt_dev->guest.busnr;
+	match->pt_dev.guest.devfn = pci_pt_dev->guest.devfn;
+	match->pt_dev.host.busnr  = pci_pt_dev->host.busnr;
+	match->pt_dev.host.devfn  = pci_pt_dev->host.devfn;
+	match->pt_dev.dev         = dev;
+
+	if (irqchip_in_kernel(kvm)) {
+		/* Even though this is PCI, we don't want to use shared
+		 * interrupts. Sharing host devices with guest-assigned devices
+		 * on the same interrupt line is not a happy situation: there
+		 * are going to be long delays in accepting, acking, etc.
+		 */
+		if (request_irq(dev->irq, kvm_pci_pt_dev_intr, 0,
+				"kvm_pt_device", (void *)kvm)) {
+			printk(KERN_INFO "%s: couldn't allocate irq for pv "
+			       "device\n", __func__);
+			r = -EIO;
+			goto out_put;
+		}
+		match->pt_dev.guest.irq = pci_pt_dev->guest.irq;
+		match->pt_dev.host.irq  = dev->irq;
+		if (kvm->arch.vioapic)
+			kvm->arch.vioapic->ack_notifier = kvm_pci_pt_ack_irq;
+		if (kvm->arch.vpic)
+			kvm->arch.vpic->ack_notifier = kvm_pci_pt_ack_irq;
+	}
+	write_lock_irqsave(&kvm_pci_pt_lock, flags);
+
+	INIT_WORK(&match->pt_dev.int_work.work, kvm_pci_pt_work_fn);
+	INIT_WORK(&match->pt_dev.ack_work.work, kvm_pci_pt_work_fn);
+
+	list_add(&match->list, &kvm->arch.pci_pt_dev_head);
+
+	if (irqchip_in_kernel(kvm))
+		set_bit(dev->irq, pt_irq_handled);
+	write_unlock_irqrestore(&kvm_pci_pt_lock, flags);
+out:
+	return r;
+out_put:
+	pci_dev_put(dev);
+out_free:
+	kfree(match);
+	goto out;
+}
+
+static void kvm_free_pci_passthrough(struct kvm *kvm)
+{
+	struct list_head *ptr, *ptr2;
+	struct kvm_pci_pt_dev_list *pci_pt_dev;
+	unsigned long flags;
+
+	write_lock_irqsave(&kvm_pci_pt_lock, flags);
+	list_for_each_safe(ptr, ptr2, &kvm->arch.pci_pt_dev_head) {
+		pci_pt_dev = list_entry(ptr, struct kvm_pci_pt_dev_list, list);
+		if (cancel_work_sync(&pci_pt_dev->pt_dev.int_work.work))
+			/* We had pending work. That means we will have to take
+			 * care of kvm_put_kvm.
+			 */
+			kvm_put_kvm(kvm);
+
+		if (cancel_work_sync(&pci_pt_dev->pt_dev.ack_work.work))
+			/* We had pending work. That means we will have to take
+			 * care of kvm_put_kvm.
+			 */
+			kvm_put_kvm(kvm);
+	}
+
+	list_for_each_safe(ptr, ptr2, &kvm->arch.pci_pt_dev_head) {
+		pci_pt_dev = list_entry(ptr, struct kvm_pci_pt_dev_list, list);
+
+		if (irqchip_in_kernel(kvm) && pci_pt_dev->pt_dev.host.irq)
+			free_irq(pci_pt_dev->pt_dev.host.irq, kvm);
+		/* Search for this device got us a refcount */
+		pci_dev_put(pci_pt_dev->pt_dev.dev);
+		pci_release_regions(pci_pt_dev->pt_dev.dev);
+		pci_disable_device(pci_pt_dev->pt_dev.dev);
+
+		list_del(&pci_pt_dev->list);
+		kfree(pci_pt_dev);
+	}
+	write_unlock_irqrestore(&kvm_pci_pt_lock, flags);
+}
 
 unsigned long segment_base(u16 selector)
 {
@@ -1746,6 +2026,17 @@ long kvm_arch_vm_ioctl(struct file *filp,
 		r = 0;
 		break;
 	}
+	case KVM_UPDATE_PCI_PT_DEV: {
+		struct kvm_pci_passthrough_dev pci_pt_dev;
+
+		r = -EFAULT;
+		if (copy_from_user(&pci_pt_dev, argp, sizeof pci_pt_dev))
+			goto out;
+		r = kvm_vm_ioctl_pci_pt_dev(kvm, &pci_pt_dev);
+		if (r)
+			goto out;
+		break;
+	}
 	case KVM_GET_PIT: {
 		struct kvm_pit_state ps;
 		r = -EFAULT;
@@ -3948,6 +4239,7 @@ struct  kvm *kvm_arch_create_vm(void)
 		return ERR_PTR(-ENOMEM);
 
 	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
+	INIT_LIST_HEAD(&kvm->arch.pci_pt_dev_head);
 
 	return kvm;
 }
@@ -3980,6 +4272,7 @@ static void kvm_free_vcpus(struct kvm *kvm)
 
 void kvm_arch_destroy_vm(struct kvm *kvm)
 {
+	kvm_free_pci_passthrough(kvm);
 	kvm_free_pit(kvm);
 	kfree(kvm->arch.vpic);
 	kfree(kvm->arch.vioapic);
diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h
index 4a47859..2346c9f 100644
--- a/include/asm-x86/kvm_host.h
+++ b/include/asm-x86/kvm_host.h
@@ -88,6 +88,7 @@
 #define KVM_NR_VAR_MTRR 8
 
 extern spinlock_t kvm_lock;
+extern rwlock_t kvm_pci_pt_lock;
 extern struct list_head vm_list;
 
 struct kvm_vcpu;
@@ -319,6 +320,38 @@ struct kvm_mem_alias {
 	gfn_t target_gfn;
 };
 
+/* Some definitions for devices assigned to the guest by the host */
+#define KVM_PT_SOURCE_IRQ	1
+#define KVM_PT_SOURCE_IRQ_ACK	2
+#define KVM_PT_SOURCE_UPDATE	3
+
+/* For assigned devices, we schedule work in the system workqueue to
+ * inject interrupts into the guest when an interrupt occurs on the
+ * physical device and also when the guest acks the interrupt.
+ */
+struct kvm_pci_pt_work {
+	struct work_struct work;
+	struct kvm *kvm;
+	int irq;
+	bool source;
+};
+
+struct kvm_pci_passthrough_dev_kernel {
+	struct kvm_pci_pt_info guest;
+	struct kvm_pci_pt_info host;
+	struct kvm_pci_pt_work int_work;
+	struct kvm_pci_pt_work ack_work;
+	struct pci_dev *dev;
+};
+
+/* This list is to store the guest bus:device:function-irq and host
+ * bus:device:function-irq mapping for assigned devices.
+ */
+struct kvm_pci_pt_dev_list {
+	struct list_head list;
+	struct kvm_pci_passthrough_dev_kernel pt_dev;
+};
+
 struct kvm_arch{
 	int naliases;
 	struct kvm_mem_alias aliases[KVM_ALIAS_SLOTS];
@@ -331,6 +364,7 @@ struct kvm_arch{
 	 * Hash table of struct kvm_mmu_page.
 	 */
 	struct list_head active_mmu_pages;
+	struct list_head pci_pt_dev_head;
 	struct kvm_pic *vpic;
 	struct kvm_ioapic *vioapic;
 	struct kvm_pit *vpit;
@@ -577,6 +611,10 @@ void kvm_disable_tdp(void);
 int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3);
 int complete_pio(struct kvm_vcpu *vcpu);
 
+struct kvm_pci_pt_dev_list *
+kvm_find_pci_pt_dev(struct list_head *head,
+		    struct kvm_pci_pt_info *pt_pci_info, int irq, int source);
+
 static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
 {
 	struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
diff --git a/include/asm-x86/kvm_para.h b/include/asm-x86/kvm_para.h
index 76f3921..b6c5d00 100644
--- a/include/asm-x86/kvm_para.h
+++ b/include/asm-x86/kvm_para.h
@@ -142,6 +142,20 @@ static inline unsigned int kvm_arch_para_features(void)
 	return cpuid_eax(KVM_CPUID_FEATURES);
 }
 
-#endif
+#endif /* KERNEL */
 
+/* Stores information for identifying host PCI devices assigned to the
+ * guest: this is used in the host kernel and in the userspace.
+ */
+struct kvm_pci_pt_info {
+	unsigned char busnr;
+	unsigned int devfn;
+	__u32 irq;
+};
+
+/* Mapping between host and guest PCI device */
+struct kvm_pci_passthrough_dev {
+	struct kvm_pci_pt_info guest;
+	struct kvm_pci_pt_info host;
+};
 #endif
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 6edba45..3370e80 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -382,6 +382,7 @@ struct kvm_trace_rec {
 #define KVM_CAP_PV_MMU 13
 #define KVM_CAP_MP_STATE 14
 #define KVM_CAP_COALESCED_MMIO 15
+#define KVM_CAP_PCI_PASSTHROUGH 16
 
 /*
  * ioctls for VM fds
@@ -411,6 +412,8 @@ struct kvm_trace_rec {
 			_IOW(KVMIO,  0x67, struct kvm_coalesced_mmio_zone)
 #define KVM_UNREGISTER_COALESCED_MMIO \
 			_IOW(KVMIO,  0x68, struct kvm_coalesced_mmio_zone)
+#define KVM_UPDATE_PCI_PT_DEV	  _IOR(KVMIO, 0x69, \
+				       struct kvm_pci_passthrough_dev)
 
 /*
  * ioctls for vcpu fds
diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index 8ce93c7..0b5bc40 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -288,13 +288,22 @@ void kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level)
 static void __kvm_ioapic_update_eoi(struct kvm_ioapic *ioapic, int gsi)
 {
 	union ioapic_redir_entry *ent;
+	struct kvm_pci_pt_dev_list *match;
+	unsigned long flags;
 
 	ent = &ioapic->redirtbl[gsi];
 	ASSERT(ent->fields.trig_mode == IOAPIC_LEVEL_TRIG);
 
 	ent->fields.remote_irr = 0;
-	if (!ent->fields.mask && (ioapic->irr & (1 << gsi)))
-		ioapic_service(ioapic, gsi);
+
+	read_lock_irqsave(&kvm_pci_pt_lock, flags);
+	match = kvm_find_pci_pt_dev(&ioapic->kvm->arch.pci_pt_dev_head, NULL,
+				    gsi, KVM_PT_SOURCE_IRQ_ACK);
+	read_unlock_irqrestore(&kvm_pci_pt_lock, flags);
+	if (!match) {
+		if (!ent->fields.mask && (ioapic->irr & (1 << gsi)))
+			ioapic_service(ioapic, gsi);
+	}
 
 	if (ioapic->ack_notifier)
 		ioapic->ack_notifier(ioapic->kvm, gsi);
-- 
1.5.6


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 4/8] KVM: PCIPT: fix interrupt handling
  2008-07-16 13:17     ` [PATCH 3/8] KVM: Handle device assignment to guests Ben-Ami Yassour
@ 2008-07-16 13:17       ` Ben-Ami Yassour
  2008-07-16 13:17         ` [PATCH 5/8] KVM: PCIPT: change order of device release Ben-Ami Yassour
                           ` (2 more replies)
  2008-07-16 15:04       ` [PATCH 3/8] KVM: Handle device assignment to guests Avi Kivity
  1 sibling, 3 replies; 42+ messages in thread
From: Ben-Ami Yassour @ 2008-07-16 13:17 UTC (permalink / raw)
  To: amit.shah; +Cc: kvm, muli, benami, weidong.han, anthony

This patch fixes a few problems with the interrupt handling for
passthrough devices.

1. Pass the interrupt handler the pointer to the device, so we do not
need to lock the pcipt lock in the interrupt handler.

2. Remove the pt_irq_handled bitmap - it is no longer needed.

3. Split kvm_pci_pt_work_fn into two functions, one for interrupt
injection and another for the ack - is much simpler code this way.

4. Change the passthrough initialization order - add the device
structure to the list, before registering the interrupt handler.

5. On passthrough destruction path, free the interrupt handler before
cleaning queued work.

Signed-off-by: Ben-Ami Yassour <benami@il.ibm.com>
---
 arch/x86/kvm/x86.c         |  156 ++++++++++++++++++-------------------------
 include/asm-x86/kvm_host.h |    5 +-
 virt/kvm/ioapic.c          |    5 +-
 3 files changed, 69 insertions(+), 97 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c07ca2b..8d25b4a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -145,49 +145,37 @@ kvm_find_pci_pt_dev(struct list_head *head,
 	return NULL;
 }
 
-static DECLARE_BITMAP(pt_irq_handled, NR_IRQS);
-
-static void kvm_pci_pt_work_fn(struct work_struct *work)
+static void kvm_pci_pt_int_work_fn(struct work_struct *work)
 {
-	struct kvm_pci_pt_dev_list *match;
 	struct kvm_pci_pt_work *int_work;
-	int source;
-	unsigned long flags;
-	int guest_irq;
-	int host_irq;
 
 	int_work = container_of(work, struct kvm_pci_pt_work, work);
 
-	source = int_work->source ? KVM_PT_SOURCE_IRQ_ACK : KVM_PT_SOURCE_IRQ;
-
 	/* This is taken to safely inject irq inside the guest. When
 	 * the interrupt injection (or the ioapic code) uses a
 	 * finer-grained lock, update this
 	 */
-	mutex_lock(&int_work->kvm->lock);
-	read_lock_irqsave(&kvm_pci_pt_lock, flags);
-	match = kvm_find_pci_pt_dev(&int_work->kvm->arch.pci_pt_dev_head, NULL,
-				    int_work->irq, source);
-	if (!match) {
-		printk(KERN_ERR "%s: no matching device assigned to guest "
-		       "found for irq %d, source = %d!\n",
-		       __func__, int_work->irq, int_work->source);
-		read_unlock_irqrestore(&kvm_pci_pt_lock, flags);
-		goto out;
-	}
-	guest_irq = match->pt_dev.guest.irq;
-	host_irq = match->pt_dev.host.irq;
-	read_unlock_irqrestore(&kvm_pci_pt_lock, flags);
+	mutex_lock(&int_work->pt_dev->kvm->lock);
+	kvm_set_irq(int_work->pt_dev->kvm, int_work->pt_dev->guest.irq, 1);
+	mutex_unlock(&int_work->pt_dev->kvm->lock);
+	kvm_put_kvm(int_work->pt_dev->kvm);
+}
 
-	if (source == KVM_PT_SOURCE_IRQ)
-		kvm_set_irq(int_work->kvm, guest_irq, 1);
-	else {
-		kvm_set_irq(int_work->kvm, int_work->irq, 0);
-		enable_irq(host_irq);
-	}
-out:
-	mutex_unlock(&int_work->kvm->lock);
-	kvm_put_kvm(int_work->kvm);
+static void kvm_pci_pt_ack_work_fn(struct work_struct *work)
+{
+	struct kvm_pci_pt_work *ack_work;
+
+	ack_work = container_of(work, struct kvm_pci_pt_work, work);
+
+	/* This is taken to safely inject irq inside the guest. When
+	 * the interrupt injection (or the ioapic code) uses a
+	 * finer-grained lock, update this
+	 */
+	mutex_lock(&ack_work->pt_dev->kvm->lock);
+	kvm_set_irq(ack_work->pt_dev->kvm, ack_work->pt_dev->guest.irq, 0);
+	enable_irq(ack_work->pt_dev->host.irq);
+	mutex_unlock(&ack_work->pt_dev->kvm->lock);
+	kvm_put_kvm(ack_work->pt_dev->kvm);
 }
 
 /* FIXME: Implement the OR logic needed to make shared interrupts on
@@ -195,28 +183,11 @@ out:
  */
 static irqreturn_t kvm_pci_pt_dev_intr(int irq, void *dev_id)
 {
-	struct kvm *kvm = (struct kvm *) dev_id;
-	struct kvm_pci_pt_dev_list *pci_pt_dev;
-
-	if (!test_bit(irq, pt_irq_handled))
-		return IRQ_NONE;
-
-	read_lock(&kvm_pci_pt_lock);
-	pci_pt_dev = kvm_find_pci_pt_dev(&kvm->arch.pci_pt_dev_head, NULL,
-					 irq, KVM_PT_SOURCE_IRQ);
-	if (!pci_pt_dev) {
-		read_unlock(&kvm_pci_pt_lock);
-		return IRQ_NONE;
-	}
-
-	pci_pt_dev->pt_dev.int_work.irq = irq;
-	pci_pt_dev->pt_dev.int_work.kvm = kvm;
-	pci_pt_dev->pt_dev.int_work.source = 0;
-
-	kvm_get_kvm(kvm);
-	schedule_work(&pci_pt_dev->pt_dev.int_work.work);
-	read_unlock(&kvm_pci_pt_lock);
+	struct kvm_pci_passthrough_dev_kernel *pt_dev =
+		(struct kvm_pci_passthrough_dev_kernel *) dev_id;
 
+	kvm_get_kvm(pt_dev->kvm);
+	schedule_work(&pt_dev->int_work.work);
 	disable_irq_nosync(irq);
 	return IRQ_HANDLED;
 }
@@ -226,25 +197,20 @@ static void kvm_pci_pt_ack_irq(void *opaque, int irq)
 {
 	struct kvm *kvm = opaque;
 	struct kvm_pci_pt_dev_list *pci_pt_dev;
-	unsigned long flags;
 
 	if (irq == -1)
 		return;
 
-	read_lock_irqsave(&kvm_pci_pt_lock, flags);
+	read_lock(&kvm_pci_pt_lock);
 	pci_pt_dev = kvm_find_pci_pt_dev(&kvm->arch.pci_pt_dev_head, NULL, irq,
 					 KVM_PT_SOURCE_IRQ_ACK);
 	if (!pci_pt_dev) {
-		read_unlock_irqrestore(&kvm_pci_pt_lock, flags);
+		read_unlock(&kvm_pci_pt_lock);
 		return;
 	}
-	pci_pt_dev->pt_dev.ack_work.irq = irq;
-	pci_pt_dev->pt_dev.ack_work.kvm = kvm;
-	pci_pt_dev->pt_dev.ack_work.source = 1;
-
 	kvm_get_kvm(kvm);
+	read_unlock(&kvm_pci_pt_lock);
 	schedule_work(&pci_pt_dev->pt_dev.ack_work.work);
-	read_unlock_irqrestore(&kvm_pci_pt_lock, flags);
 }
 
 static int kvm_vm_ioctl_pci_pt_dev(struct kvm *kvm,
@@ -252,10 +218,9 @@ static int kvm_vm_ioctl_pci_pt_dev(struct kvm *kvm,
 {
 	int r = 0;
 	struct kvm_pci_pt_dev_list *match;
-	unsigned long flags;
 	struct pci_dev *dev;
 
-	write_lock_irqsave(&kvm_pci_pt_lock, flags);
+	write_lock(&kvm_pci_pt_lock);
 
 	/* Check if this is a request to update the irq of the device
 	 * in the guest (BIOS/ kernels can dynamically reprogram irq
@@ -266,10 +231,10 @@ static int kvm_vm_ioctl_pci_pt_dev(struct kvm *kvm,
 				    &pci_pt_dev->host, 0, KVM_PT_SOURCE_UPDATE);
 	if (match) {
 		match->pt_dev.guest.irq = pci_pt_dev->guest.irq;
-		write_unlock_irqrestore(&kvm_pci_pt_lock, flags);
+		write_unlock(&kvm_pci_pt_lock);
 		goto out;
 	}
-	write_unlock_irqrestore(&kvm_pci_pt_lock, flags);
+	write_unlock(&kvm_pci_pt_lock);
 
 	match = kzalloc(sizeof(struct kvm_pci_pt_dev_list), GFP_KERNEL);
 	if (match == NULL) {
@@ -298,42 +263,49 @@ static int kvm_vm_ioctl_pci_pt_dev(struct kvm *kvm,
 	}
 	match->pt_dev.guest.busnr = pci_pt_dev->guest.busnr;
 	match->pt_dev.guest.devfn = pci_pt_dev->guest.devfn;
-	match->pt_dev.host.busnr  = pci_pt_dev->host.busnr;
-	match->pt_dev.host.devfn  = pci_pt_dev->host.devfn;
-	match->pt_dev.dev         = dev;
+	match->pt_dev.host.busnr = pci_pt_dev->host.busnr;
+	match->pt_dev.host.devfn = pci_pt_dev->host.devfn;
+	match->pt_dev.dev = dev;
+
+	write_lock(&kvm_pci_pt_lock);
+
+	INIT_WORK(&match->pt_dev.int_work.work, kvm_pci_pt_int_work_fn);
+	INIT_WORK(&match->pt_dev.ack_work.work, kvm_pci_pt_ack_work_fn);
+
+	match->pt_dev.kvm = kvm;
+	match->pt_dev.int_work.pt_dev = &match->pt_dev;
+	match->pt_dev.ack_work.pt_dev = &match->pt_dev;
+
+	list_add(&match->list, &kvm->arch.pci_pt_dev_head);
+
+	write_unlock(&kvm_pci_pt_lock);
 
 	if (irqchip_in_kernel(kvm)) {
+		match->pt_dev.guest.irq = pci_pt_dev->guest.irq;
+		match->pt_dev.host.irq = dev->irq;
+		if (kvm->arch.vioapic)
+			kvm->arch.vioapic->ack_notifier = kvm_pci_pt_ack_irq;
+		if (kvm->arch.vpic)
+			kvm->arch.vpic->ack_notifier = kvm_pci_pt_ack_irq;
+
 		/* Even though this is PCI, we don't want to use shared
 		 * interrupts. Sharing host devices with guest-assigned devices
 		 * on the same interrupt line is not a happy situation: there
 		 * are going to be long delays in accepting, acking, etc.
 		 */
 		if (request_irq(dev->irq, kvm_pci_pt_dev_intr, 0,
-				"kvm_pt_device", (void *)kvm)) {
+				"kvm_pt_device", (void *)&match->pt_dev)) {
 			printk(KERN_INFO "%s: couldn't allocate irq for pv "
 			       "device\n", __func__);
 			r = -EIO;
-			goto out_put;
+			goto out_list_del;
 		}
-		match->pt_dev.guest.irq = pci_pt_dev->guest.irq;
-		match->pt_dev.host.irq  = dev->irq;
-		if (kvm->arch.vioapic)
-			kvm->arch.vioapic->ack_notifier = kvm_pci_pt_ack_irq;
-		if (kvm->arch.vpic)
-			kvm->arch.vpic->ack_notifier = kvm_pci_pt_ack_irq;
 	}
-	write_lock_irqsave(&kvm_pci_pt_lock, flags);
-
-	INIT_WORK(&match->pt_dev.int_work.work, kvm_pci_pt_work_fn);
-	INIT_WORK(&match->pt_dev.ack_work.work, kvm_pci_pt_work_fn);
 
-	list_add(&match->list, &kvm->arch.pci_pt_dev_head);
-
-	if (irqchip_in_kernel(kvm))
-		set_bit(dev->irq, pt_irq_handled);
-	write_unlock_irqrestore(&kvm_pci_pt_lock, flags);
 out:
 	return r;
+out_list_del:
+	list_del(&match->list);
 out_put:
 	pci_dev_put(dev);
 out_free:
@@ -345,11 +317,15 @@ static void kvm_free_pci_passthrough(struct kvm *kvm)
 {
 	struct list_head *ptr, *ptr2;
 	struct kvm_pci_pt_dev_list *pci_pt_dev;
-	unsigned long flags;
 
-	write_lock_irqsave(&kvm_pci_pt_lock, flags);
+	write_lock(&kvm_pci_pt_lock);
 	list_for_each_safe(ptr, ptr2, &kvm->arch.pci_pt_dev_head) {
 		pci_pt_dev = list_entry(ptr, struct kvm_pci_pt_dev_list, list);
+
+		if (irqchip_in_kernel(kvm) && pci_pt_dev->pt_dev.host.irq)
+			free_irq(pci_pt_dev->pt_dev.host.irq,
+				 (void *)&pci_pt_dev->pt_dev);
+
 		if (cancel_work_sync(&pci_pt_dev->pt_dev.int_work.work))
 			/* We had pending work. That means we will have to take
 			 * care of kvm_put_kvm.
@@ -366,8 +342,6 @@ static void kvm_free_pci_passthrough(struct kvm *kvm)
 	list_for_each_safe(ptr, ptr2, &kvm->arch.pci_pt_dev_head) {
 		pci_pt_dev = list_entry(ptr, struct kvm_pci_pt_dev_list, list);
 
-		if (irqchip_in_kernel(kvm) && pci_pt_dev->pt_dev.host.irq)
-			free_irq(pci_pt_dev->pt_dev.host.irq, kvm);
 		/* Search for this device got us a refcount */
 		pci_dev_put(pci_pt_dev->pt_dev.dev);
 		pci_release_regions(pci_pt_dev->pt_dev.dev);
@@ -376,7 +350,7 @@ static void kvm_free_pci_passthrough(struct kvm *kvm)
 		list_del(&pci_pt_dev->list);
 		kfree(pci_pt_dev);
 	}
-	write_unlock_irqrestore(&kvm_pci_pt_lock, flags);
+	write_unlock(&kvm_pci_pt_lock);
 }
 
 unsigned long segment_base(u16 selector)
diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h
index 2346c9f..f6973e0 100644
--- a/include/asm-x86/kvm_host.h
+++ b/include/asm-x86/kvm_host.h
@@ -331,9 +331,7 @@ struct kvm_mem_alias {
  */
 struct kvm_pci_pt_work {
 	struct work_struct work;
-	struct kvm *kvm;
-	int irq;
-	bool source;
+	struct kvm_pci_passthrough_dev_kernel *pt_dev;
 };
 
 struct kvm_pci_passthrough_dev_kernel {
@@ -342,6 +340,7 @@ struct kvm_pci_passthrough_dev_kernel {
 	struct kvm_pci_pt_work int_work;
 	struct kvm_pci_pt_work ack_work;
 	struct pci_dev *dev;
+	struct kvm *kvm;
 };
 
 /* This list is to store the guest bus:device:function-irq and host
diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index 0b5bc40..6ec99fd 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -289,17 +289,16 @@ static void __kvm_ioapic_update_eoi(struct kvm_ioapic *ioapic, int gsi)
 {
 	union ioapic_redir_entry *ent;
 	struct kvm_pci_pt_dev_list *match;
-	unsigned long flags;
 
 	ent = &ioapic->redirtbl[gsi];
 	ASSERT(ent->fields.trig_mode == IOAPIC_LEVEL_TRIG);
 
 	ent->fields.remote_irr = 0;
 
-	read_lock_irqsave(&kvm_pci_pt_lock, flags);
+	read_lock(&kvm_pci_pt_lock);
 	match = kvm_find_pci_pt_dev(&ioapic->kvm->arch.pci_pt_dev_head, NULL,
 				    gsi, KVM_PT_SOURCE_IRQ_ACK);
-	read_unlock_irqrestore(&kvm_pci_pt_lock, flags);
+	read_unlock(&kvm_pci_pt_lock);
 	if (!match) {
 		if (!ent->fields.mask && (ioapic->irr & (1 << gsi)))
 			ioapic_service(ioapic, gsi);
-- 
1.5.6


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 5/8] KVM: PCIPT: change order of device release
  2008-07-16 13:17       ` [PATCH 4/8] KVM: PCIPT: fix interrupt handling Ben-Ami Yassour
@ 2008-07-16 13:17         ` Ben-Ami Yassour
  2008-07-16 13:17           ` [PATCH 6/8] VT-d: changes to support KVM Ben-Ami Yassour
  2008-07-16 15:06           ` [PATCH 5/8] KVM: PCIPT: change order of device release Avi Kivity
  2008-07-16 15:06         ` [PATCH 4/8] KVM: PCIPT: fix interrupt handling Avi Kivity
  2008-07-23 13:37         ` Amit Shah
  2 siblings, 2 replies; 42+ messages in thread
From: Ben-Ami Yassour @ 2008-07-16 13:17 UTC (permalink / raw)
  To: amit.shah; +Cc: kvm, muli, benami, weidong.han, anthony

Signed-off-by: Ben-Ami Yassour <benami@il.ibm.com>
---
 arch/x86/kvm/x86.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8d25b4a..65b307d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -343,9 +343,9 @@ static void kvm_free_pci_passthrough(struct kvm *kvm)
 		pci_pt_dev = list_entry(ptr, struct kvm_pci_pt_dev_list, list);
 
 		/* Search for this device got us a refcount */
-		pci_dev_put(pci_pt_dev->pt_dev.dev);
 		pci_release_regions(pci_pt_dev->pt_dev.dev);
 		pci_disable_device(pci_pt_dev->pt_dev.dev);
+		pci_dev_put(pci_pt_dev->pt_dev.dev);
 
 		list_del(&pci_pt_dev->list);
 		kfree(pci_pt_dev);
-- 
1.5.6


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 6/8] VT-d: changes to support KVM
  2008-07-16 13:17         ` [PATCH 5/8] KVM: PCIPT: change order of device release Ben-Ami Yassour
@ 2008-07-16 13:17           ` Ben-Ami Yassour
  2008-07-16 13:17             ` [PATCH 7/8] KVM: PCIPT: VT-d support Ben-Ami Yassour
  2008-07-16 15:06           ` [PATCH 5/8] KVM: PCIPT: change order of device release Avi Kivity
  1 sibling, 1 reply; 42+ messages in thread
From: Ben-Ami Yassour @ 2008-07-16 13:17 UTC (permalink / raw)
  To: amit.shah; +Cc: kvm, muli, benami, weidong.han, anthony, Kay, Allen M

From: Kay, Allen M <allen.m.kay@intel.com>

This patch extends the VT-d driver to support KVM

[Ben: fixed memory pinning]

Signed-off-by: Kay, Allen M <allen.m.kay@intel.com>
Signed-off-by: Weidong Han <weidong.han@intel.com>
Signed-off-by: Ben-Ami Yassour <benami@il.ibm.com>
---
 drivers/pci/dmar.c                           |    4 +-
 drivers/pci/intel-iommu.c                    |  117 +++++++++++++++++++++++++-
 drivers/pci/iova.c                           |    2 +-
 {drivers/pci => include/linux}/intel-iommu.h |   11 +++
 {drivers/pci => include/linux}/iova.h        |    0 
 5 files changed, 127 insertions(+), 7 deletions(-)
 rename {drivers/pci => include/linux}/intel-iommu.h (94%)
 rename {drivers/pci => include/linux}/iova.h (100%)

diff --git a/drivers/pci/dmar.c b/drivers/pci/dmar.c
index f941f60..a58a5b0 100644
--- a/drivers/pci/dmar.c
+++ b/drivers/pci/dmar.c
@@ -26,8 +26,8 @@
 
 #include <linux/pci.h>
 #include <linux/dmar.h>
-#include "iova.h"
-#include "intel-iommu.h"
+#include <linux/iova.h>
+#include <linux/intel-iommu.h>
 
 #undef PREFIX
 #define PREFIX "DMAR:"
diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index bb06423..a566406 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -20,6 +20,7 @@
  * Author: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
  */
 
+#undef DEBUG
 #include <linux/init.h>
 #include <linux/bitmap.h>
 #include <linux/debugfs.h>
@@ -33,8 +34,8 @@
 #include <linux/dma-mapping.h>
 #include <linux/mempool.h>
 #include <linux/timer.h>
-#include "iova.h"
-#include "intel-iommu.h"
+#include <linux/iova.h>
+#include <linux/intel-iommu.h>
 #include <asm/proto.h> /* force_iommu in this header in x86-64*/
 #include <asm/cacheflush.h>
 #include <asm/gart.h>
@@ -160,7 +161,7 @@ static inline void *alloc_domain_mem(void)
 	return iommu_kmem_cache_alloc(iommu_domain_cache);
 }
 
-static inline void free_domain_mem(void *vaddr)
+static void free_domain_mem(void *vaddr)
 {
 	kmem_cache_free(iommu_domain_cache, vaddr);
 }
@@ -1414,7 +1415,7 @@ static void domain_remove_dev_info(struct dmar_domain *domain)
  * find_domain
  * Note: we use struct pci_dev->dev.archdata.iommu stores the info
  */
-struct dmar_domain *
+static struct dmar_domain *
 find_domain(struct pci_dev *pdev)
 {
 	struct device_domain_info *info;
@@ -2431,3 +2432,111 @@ int __init intel_iommu_init(void)
 	return 0;
 }
 
+void intel_iommu_domain_exit(struct dmar_domain *domain)
+{
+	u64 end;
+
+	/* Domain 0 is reserved, so dont process it */
+	if (!domain)
+		return;
+
+	end = DOMAIN_MAX_ADDR(domain->gaw);
+	end = end & (~PAGE_MASK_4K);
+
+	/* clear ptes */
+	dma_pte_clear_range(domain, 0, end);
+
+	/* free page tables */
+	dma_pte_free_pagetable(domain, 0, end);
+
+	iommu_free_domain(domain);
+	free_domain_mem(domain);
+}
+EXPORT_SYMBOL_GPL(intel_iommu_domain_exit);
+
+struct dmar_domain *intel_iommu_domain_alloc(struct pci_dev *pdev)
+{
+	struct dmar_drhd_unit *drhd;
+	struct dmar_domain *domain;
+	struct intel_iommu *iommu;
+
+	drhd = dmar_find_matched_drhd_unit(pdev);
+	if (!drhd) {
+		printk(KERN_ERR "intel_iommu_domain_alloc: drhd == NULL\n");
+		return NULL;
+	}
+
+	iommu = drhd->iommu;
+	if (!iommu) {
+		printk(KERN_ERR
+			"intel_iommu_domain_alloc: iommu == NULL\n");
+		return NULL;
+	}
+	domain = iommu_alloc_domain(iommu);
+	if (!domain) {
+		printk(KERN_ERR
+			"intel_iommu_domain_alloc: domain == NULL\n");
+		return NULL;
+	}
+	if (domain_init(domain, DEFAULT_DOMAIN_ADDRESS_WIDTH)) {
+		printk(KERN_ERR
+			"intel_iommu_domain_alloc: domain_init() failed\n");
+		intel_iommu_domain_exit(domain);
+		return NULL;
+	}
+	return domain;
+}
+EXPORT_SYMBOL_GPL(intel_iommu_domain_alloc);
+
+int intel_iommu_context_mapping(
+	struct dmar_domain *domain, struct pci_dev *pdev)
+{
+	int rc;
+	rc = domain_context_mapping(domain, pdev);
+	return rc;
+}
+EXPORT_SYMBOL_GPL(intel_iommu_context_mapping);
+
+int intel_iommu_page_mapping(
+	struct dmar_domain *domain, dma_addr_t iova,
+	u64 hpa, size_t size, int prot)
+{
+	int rc;
+	rc = domain_page_mapping(domain, iova, hpa, size, prot);
+	return rc;
+}
+EXPORT_SYMBOL_GPL(intel_iommu_page_mapping);
+
+void intel_iommu_detach_dev(struct dmar_domain *domain, u8 bus, u8 devfn)
+{
+	detach_domain_for_dev(domain, bus, devfn);
+}
+EXPORT_SYMBOL_GPL(intel_iommu_detach_dev);
+
+struct dmar_domain *
+intel_iommu_find_domain(struct pci_dev *pdev)
+{
+	return find_domain(pdev);
+}
+EXPORT_SYMBOL_GPL(intel_iommu_find_domain);
+
+int intel_iommu_found(void)
+{
+	return g_num_of_iommus;
+}
+EXPORT_SYMBOL_GPL(intel_iommu_found);
+
+u64 intel_iommu_iova_to_pfn(struct dmar_domain *domain, u64 iova)
+{
+	struct dma_pte *pte;
+	u64 pfn;
+
+	pfn = 0;
+	pte = addr_to_dma_pte(domain, iova);
+
+	if (pte)
+		pfn = dma_pte_addr(*pte);
+
+	return pfn >> PAGE_SHIFT_4K;
+}
+EXPORT_SYMBOL_GPL(intel_iommu_iova_to_pfn);
diff --git a/drivers/pci/iova.c b/drivers/pci/iova.c
index 3ef4ac0..2287116 100644
--- a/drivers/pci/iova.c
+++ b/drivers/pci/iova.c
@@ -7,7 +7,7 @@
  * Author: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
  */
 
-#include "iova.h"
+#include <linux/iova.h>
 
 void
 init_iova_domain(struct iova_domain *iovad, unsigned long pfn_32bit)
diff --git a/drivers/pci/intel-iommu.h b/include/linux/intel-iommu.h
similarity index 94%
rename from drivers/pci/intel-iommu.h
rename to include/linux/intel-iommu.h
index afc0ad9..1490fc0 100644
--- a/drivers/pci/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -341,4 +341,15 @@ static inline void iommu_prepare_gfx_mapping(void)
 }
 #endif /* !CONFIG_DMAR_GFX_WA */
 
+void intel_iommu_domain_exit(struct dmar_domain *domain);
+struct dmar_domain *intel_iommu_domain_alloc(struct pci_dev *pdev);
+int intel_iommu_context_mapping(struct dmar_domain *domain,
+				struct pci_dev *pdev);
+int intel_iommu_page_mapping(struct dmar_domain *domain, dma_addr_t iova,
+			     u64 hpa, size_t size, int prot);
+void intel_iommu_detach_dev(struct dmar_domain *domain, u8 bus, u8 devfn);
+struct dmar_domain *intel_iommu_find_domain(struct pci_dev *pdev);
+int intel_iommu_found(void);
+u64 intel_iommu_iova_to_pfn(struct dmar_domain *domain, u64 iova);
+
 #endif
diff --git a/drivers/pci/iova.h b/include/linux/iova.h
similarity index 100%
rename from drivers/pci/iova.h
rename to include/linux/iova.h
-- 
1.5.6


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 7/8] KVM: PCIPT: VT-d support
  2008-07-16 13:17           ` [PATCH 6/8] VT-d: changes to support KVM Ben-Ami Yassour
@ 2008-07-16 13:17             ` Ben-Ami Yassour
  2008-07-16 13:17               ` [PATCH 8/8] KVM: PCIPT: VT-d: dont map mmio memory slots Ben-Ami Yassour
  0 siblings, 1 reply; 42+ messages in thread
From: Ben-Ami Yassour @ 2008-07-16 13:17 UTC (permalink / raw)
  To: amit.shah; +Cc: kvm, muli, benami, weidong.han, anthony, Kay, Allen M

This patch includes the functions to support VT-d for passthrough
devices.

[Ben: fixed memory pinning, cleanup]

Signed-off-by: Kay, Allen M <allen.m.kay@intel.com>
Signed-off-by: Weidong Han <weidong.han@intel.com>
Signed-off-by: Ben-Ami Yassour <benami@il.ibm.com>
---
 arch/x86/kvm/Makefile      |    2 +-
 arch/x86/kvm/vtd.c         |  176 ++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c         |   11 +++
 include/asm-x86/kvm_host.h |    1 +
 include/linux/kvm_host.h   |    6 ++
 virt/kvm/kvm_main.c        |    6 ++
 6 files changed, 201 insertions(+), 1 deletions(-)
 create mode 100644 arch/x86/kvm/vtd.c

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index d0e940b..5d9d079 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -11,7 +11,7 @@ endif
 EXTRA_CFLAGS += -Ivirt/kvm -Iarch/x86/kvm
 
 kvm-objs := $(common-objs) x86.o mmu.o x86_emulate.o i8259.o irq.o lapic.o \
-	i8254.o
+	i8254.o vtd.o
 obj-$(CONFIG_KVM) += kvm.o
 kvm-intel-objs = vmx.o
 obj-$(CONFIG_KVM_INTEL) += kvm-intel.o
diff --git a/arch/x86/kvm/vtd.c b/arch/x86/kvm/vtd.c
new file mode 100644
index 0000000..83efb8a
--- /dev/null
+++ b/arch/x86/kvm/vtd.c
@@ -0,0 +1,176 @@
+/*
+ * Copyright (c) 2006, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ *
+ * Copyright (C) 2006-2008 Intel Corporation
+ * Author: Allen M. Kay <allen.m.kay@intel.com>
+ * Author: Weidong Han <weidong.han@intel.com>
+ */
+
+#include <linux/list.h>
+#include <linux/kvm_host.h>
+#include <linux/pci.h>
+#include <linux/dmar.h>
+#include <linux/intel-iommu.h>
+
+static int kvm_iommu_unmap_memslots(struct kvm *kvm);
+
+int kvm_iommu_map_pages(struct kvm *kvm,
+			gfn_t base_gfn, unsigned long npages)
+{
+	gfn_t gfn = base_gfn;
+	pfn_t pfn;
+	int i, rc;
+	struct dmar_domain *domain = kvm->arch.intel_iommu_domain;
+
+	if (!domain)
+		return -EFAULT;
+
+	for (i = 0; i < npages; i++) {
+		pfn = gfn_to_pfn(kvm, gfn);
+		rc = intel_iommu_page_mapping(domain,
+					      gfn << PAGE_SHIFT,
+						      pfn << PAGE_SHIFT,
+					      PAGE_SIZE,
+					      DMA_PTE_READ |
+					      DMA_PTE_WRITE);
+		if (rc)
+			kvm_release_pfn_clean(pfn);
+
+		gfn++;
+	}
+	return 0;
+}
+
+static int kvm_iommu_map_memslots(struct kvm *kvm)
+{
+	int i, rc;
+	for (i = 0; i < kvm->nmemslots; i++) {
+		rc = kvm_iommu_map_pages(kvm, kvm->memslots[i].base_gfn,
+					 kvm->memslots[i].npages);
+		if (rc)
+			return rc;
+	}
+	return 0;
+}
+
+int kvm_iommu_map_guest(struct kvm *kvm,
+			struct kvm_pci_passthrough_dev *pci_pt_dev)
+{
+	struct pci_dev *pdev = NULL;
+
+	printk(KERN_DEBUG "VT-d direct map: host bdf = %x:%x:%x\n",
+	       pci_pt_dev->host.busnr,
+	       PCI_SLOT(pci_pt_dev->host.devfn),
+	       PCI_FUNC(pci_pt_dev->host.devfn));
+
+	for_each_pci_dev(pdev) {
+		if ((pdev->bus->number == pci_pt_dev->host.busnr) &&
+		    (pdev->devfn == pci_pt_dev->host.devfn)) {
+			break;
+		}
+	}
+
+	if (pdev == NULL) {
+		if (kvm->arch.intel_iommu_domain) {
+			intel_iommu_domain_exit(kvm->arch.intel_iommu_domain);
+			kvm->arch.intel_iommu_domain = NULL;
+		}
+		return -ENODEV;
+	}
+
+	kvm->arch.intel_iommu_domain = intel_iommu_domain_alloc(pdev);
+
+	if (kvm_iommu_map_memslots(kvm)) {
+		kvm_iommu_unmap_memslots(kvm);
+		return -EFAULT;
+	}
+
+	intel_iommu_detach_dev(kvm->arch.intel_iommu_domain,
+			       pdev->bus->number, pdev->devfn);
+
+	if (intel_iommu_context_mapping(kvm->arch.intel_iommu_domain,
+					pdev)) {
+		printk(KERN_ERR "Domain context map for %s failed",
+		       pci_name(pdev));
+		return -EFAULT;
+	}
+	return 0;
+}
+
+static int kvm_iommu_put_pages(struct kvm *kvm,
+			       gfn_t base_gfn, unsigned long npages)
+{
+	gfn_t gfn = base_gfn;
+	pfn_t pfn;
+	struct dmar_domain *domain = kvm->arch.intel_iommu_domain;
+	int i;
+
+	if (!domain)
+		return -EFAULT;
+
+	for (i = 0; i < npages; i++) {
+		pfn = (pfn_t)intel_iommu_iova_to_pfn(domain,
+						     gfn << PAGE_SHIFT);
+		kvm_release_pfn_clean(pfn);
+		gfn++;
+	}
+	return 0;
+}
+
+static int kvm_iommu_unmap_memslots(struct kvm *kvm)
+{
+	int i, rc;
+	for (i = 0; i < kvm->nmemslots; i++) {
+		rc = kvm_iommu_put_pages(kvm, kvm->memslots[i].base_gfn,
+					 kvm->memslots[i].npages);
+		if (rc)
+			return rc;
+	}
+	return 0;
+}
+
+int kvm_iommu_unmap_guest(struct kvm *kvm)
+{
+	struct kvm_pci_pt_dev_list *entry;
+	struct pci_dev *pdev = NULL;
+	struct dmar_domain *domain = kvm->arch.intel_iommu_domain;
+
+	if (!domain)
+		return 0;
+
+	list_for_each_entry(entry, &kvm->arch.pci_pt_dev_head, list) {
+		printk(KERN_DEBUG "VT-d unmap: host bdf = %x:%x:%x\n",
+		       entry->pt_dev.host.busnr,
+		       PCI_SLOT(entry->pt_dev.host.devfn),
+		       PCI_FUNC(entry->pt_dev.host.devfn));
+
+		for_each_pci_dev(pdev) {
+			if ((pdev->bus->number == entry->pt_dev.host.busnr) &&
+			    (pdev->devfn == entry->pt_dev.host.devfn))
+				break;
+		}
+
+		if (pdev == NULL)
+			return -ENODEV;
+
+		/* detach kvm dmar domain */
+		intel_iommu_detach_dev(domain,
+				       pdev->bus->number, pdev->devfn);
+	}
+	kvm_iommu_unmap_memslots(kvm);
+	intel_iommu_domain_exit(domain);
+	return 0;
+}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 65b307d..9a1caf0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -33,6 +33,7 @@
 #include <linux/module.h>
 #include <linux/mman.h>
 #include <linux/highmem.h>
+#include <linux/intel-iommu.h>
 
 #include <asm/uaccess.h>
 #include <asm/msr.h>
@@ -302,8 +303,16 @@ static int kvm_vm_ioctl_pci_pt_dev(struct kvm *kvm,
 		}
 	}
 
+	if (intel_iommu_found()) {
+		r = kvm_iommu_map_guest(kvm, pci_pt_dev);
+		if (r)
+			goto out_intr;
+	}
+
 out:
 	return r;
+out_intr:
+	free_irq(dev->irq, &match->pt_dev);
 out_list_del:
 	list_del(&match->list);
 out_put:
@@ -4246,6 +4255,8 @@ static void kvm_free_vcpus(struct kvm *kvm)
 
 void kvm_arch_destroy_vm(struct kvm *kvm)
 {
+	if (intel_iommu_found())
+		kvm_iommu_unmap_guest(kvm);
 	kvm_free_pci_passthrough(kvm);
 	kvm_free_pit(kvm);
 	kfree(kvm->arch.vpic);
diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h
index f6973e0..6185ed7 100644
--- a/include/asm-x86/kvm_host.h
+++ b/include/asm-x86/kvm_host.h
@@ -364,6 +364,7 @@ struct kvm_arch{
 	 */
 	struct list_head active_mmu_pages;
 	struct list_head pci_pt_dev_head;
+	struct dmar_domain *intel_iommu_domain;
 	struct kvm_pic *vpic;
 	struct kvm_ioapic *vioapic;
 	struct kvm_pit *vpit;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3798097..e9eae80 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -279,6 +279,12 @@ int kvm_cpu_has_interrupt(struct kvm_vcpu *v);
 int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu);
 void kvm_vcpu_kick(struct kvm_vcpu *vcpu);
 
+int kvm_iommu_map_pages(struct kvm *kvm, gfn_t base_gfn,
+			unsigned long npages);
+int kvm_iommu_map_guest(struct kvm *kvm,
+			struct kvm_pci_passthrough_dev *pci_pt_dev);
+int kvm_iommu_unmap_guest(struct kvm *kvm);
+
 static inline void kvm_guest_enter(void)
 {
 	account_system_vtime(current);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 825fbd3..77d7001 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -41,6 +41,7 @@
 #include <linux/pagemap.h>
 #include <linux/mman.h>
 #include <linux/swap.h>
+#include <linux/intel-iommu.h>
 
 #include <asm/processor.h>
 #include <asm/io.h>
@@ -425,6 +426,11 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	}
 
 	kvm_free_physmem_slot(&old, &new);
+
+	/* map the pages in iommu page table */
+	if (intel_iommu_found())
+		kvm_iommu_map_pages(kvm, base_gfn, npages);
+
 	return 0;
 
 out_free:
-- 
1.5.6


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 8/8] KVM: PCIPT: VT-d: dont map mmio memory slots
  2008-07-16 13:17             ` [PATCH 7/8] KVM: PCIPT: VT-d support Ben-Ami Yassour
@ 2008-07-16 13:17               ` Ben-Ami Yassour
  2008-07-16 13:21                 ` PCI passthrough with VT-d - native performance Ben-Ami Yassour
  0 siblings, 1 reply; 42+ messages in thread
From: Ben-Ami Yassour @ 2008-07-16 13:17 UTC (permalink / raw)
  To: amit.shah; +Cc: kvm, muli, benami, weidong.han, anthony

Avoid mapping mmio memory slots.

Signed-off-by: Ben-Ami Yassour <benami@il.ibm.com>
---
 arch/x86/kvm/vtd.c         |   20 +++++++++++++-------
 include/asm-x86/kvm_host.h |    2 ++
 virt/kvm/kvm_main.c        |    2 +-
 3 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/vtd.c b/arch/x86/kvm/vtd.c
index 83efb8a..77044fb 100644
--- a/arch/x86/kvm/vtd.c
+++ b/arch/x86/kvm/vtd.c
@@ -40,14 +40,20 @@ int kvm_iommu_map_pages(struct kvm *kvm,
 
 	for (i = 0; i < npages; i++) {
 		pfn = gfn_to_pfn(kvm, gfn);
-		rc = intel_iommu_page_mapping(domain,
-					      gfn << PAGE_SHIFT,
+		if (!is_mmio_pfn(pfn)) {
+			rc = intel_iommu_page_mapping(domain,
+						      gfn << PAGE_SHIFT,
 						      pfn << PAGE_SHIFT,
-					      PAGE_SIZE,
-					      DMA_PTE_READ |
-					      DMA_PTE_WRITE);
-		if (rc)
-			kvm_release_pfn_clean(pfn);
+						      PAGE_SIZE,
+						      DMA_PTE_READ |
+						      DMA_PTE_WRITE);
+			if (rc)
+				kvm_release_pfn_clean(pfn);
+		} else {
+			printk(KERN_DEBUG "kvm_iommu_map_page:"
+			       "invalid pfn=%lx\n", pfn);
+			return 0;
+		}
 
 		gfn++;
 	}
diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h
index 6185ed7..ee4685c 100644
--- a/include/asm-x86/kvm_host.h
+++ b/include/asm-x86/kvm_host.h
@@ -513,6 +513,8 @@ int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
 int kvm_pv_mmu_op(struct kvm_vcpu *vcpu, unsigned long bytes,
 		  gpa_t addr, unsigned long *ret);
 
+int is_mmio_pfn(pfn_t pfn);
+
 extern bool tdp_enabled;
 
 enum emulation_result {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 77d7001..0653ec1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -77,7 +77,7 @@ static inline int valid_vcpu(int n)
 	return likely(n >= 0 && n < KVM_MAX_VCPUS);
 }
 
-static inline int is_mmio_pfn(pfn_t pfn)
+inline int is_mmio_pfn(pfn_t pfn)
 {
 	if (pfn_valid(pfn))
 		return PageReserved(pfn_to_page(pfn));
-- 
1.5.6


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: PCI passthrough with VT-d - native performance
  2008-07-16 13:17               ` [PATCH 8/8] KVM: PCIPT: VT-d: dont map mmio memory slots Ben-Ami Yassour
@ 2008-07-16 13:21                 ` Ben-Ami Yassour
  2008-07-16 13:21                   ` [PATCH 1/2] KVM/userspace: Support for assigning PCI devices to guest Ben-Ami Yassour
  0 siblings, 1 reply; 42+ messages in thread
From: Ben-Ami Yassour @ 2008-07-16 13:21 UTC (permalink / raw)
  To: amit.shah; +Cc: kvm, muli, benami, weidong.han, anthony

The following are the userspace patches



^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 1/2] KVM/userspace: Support for assigning PCI devices to guest
  2008-07-16 13:21                 ` PCI passthrough with VT-d - native performance Ben-Ami Yassour
@ 2008-07-16 13:21                   ` Ben-Ami Yassour
  2008-07-16 13:21                     ` [PATCH 2/2] PCIPT: direct mmio Ben-Ami Yassour
  2008-07-17  7:52                     ` [PATCH 1/2] KVM/userspace: Support for assigning PCI devices to guest Han, Weidong
  0 siblings, 2 replies; 42+ messages in thread
From: Ben-Ami Yassour @ 2008-07-16 13:21 UTC (permalink / raw)
  To: amit.shah; +Cc: kvm, muli, benami, weidong.han, anthony, Or Sagi

From: Or Sagi <ors@tutis.com>

From: Nir Peleg <nir@tutis.com>
From: Amit Shah <amit.shah@qumranet.com>
From: Glauber de Oliveira Costa <gcosta@redhat.com>

We can assign a device from the host machine to a guest. The
original code comes from Neocleus.

A new command-line option, -pcidevice is added.
For example, to invoke it for an Ethernet device sitting at
PCI bus:dev.fn 04:08.0 with host IRQ 18, use this:

        -pcidevice Ethernet/04:08.0

The host ethernet driver is to be removed before doing the
passthrough. If not, the device assignment fails but the
guest continues without the assignment.

If kvm uses the in-kernel irqchip, interrupts are routed to the
guest via the kvm module (accompanied kernel changes are necessary).

If -no-kvm-irqchip is used, the 'irqhook' module, available
separately, is to be used for interrupt injection into the guest. In
this case, an extra parameter, -<intr-number> is to be appended to
the above-mentioned pcidevice parameter.

Signed-off-by: Amit Shah <amit.shah@qumranet.com>
---
 libkvm/libkvm-x86.c       |    9 +-
 libkvm/libkvm.h           |   16 ++
 qemu/Makefile.target      |    1 +
 qemu/hw/isa.h             |    2 +
 qemu/hw/pc.c              |    9 +
 qemu/hw/pci-passthrough.c |  594 +++++++++++++++++++++++++++++++++++++++++++++
 qemu/hw/pci-passthrough.h |   93 +++++++
 qemu/hw/pci.c             |   12 +
 qemu/hw/pci.h             |    1 +
 qemu/hw/piix_pci.c        |   19 ++
 qemu/vl.c                 |   17 ++
 11 files changed, 772 insertions(+), 1 deletions(-)
 create mode 100644 qemu/hw/pci-passthrough.c
 create mode 100644 qemu/hw/pci-passthrough.h

diff --git a/libkvm/libkvm-x86.c b/libkvm/libkvm-x86.c
index ea97bdd..0c4cdbe 100644
--- a/libkvm/libkvm-x86.c
+++ b/libkvm/libkvm-x86.c
@@ -126,6 +126,14 @@ static int kvm_init_tss(kvm_context_t kvm)
 	return 0;
 }
 
+#ifdef KVM_CAP_PCI_PASSTHROUGH
+int kvm_update_pci_pt_device(kvm_context_t kvm,
+			     struct kvm_pci_passthrough_dev *pci_pt_dev)
+{
+	return ioctl(kvm->vm_fd, KVM_UPDATE_PCI_PT_DEV, pci_pt_dev);
+}
+#endif
+
 int kvm_arch_create_default_phys_mem(kvm_context_t kvm,
 				       unsigned long phys_mem_bytes,
 				       void **vm_mem)
@@ -435,7 +443,6 @@ void kvm_show_code(kvm_context_t kvm, int vcpu)
 	fprintf(stderr, "code:%s\n", code_str);
 }
 
-
 /*
  * Returns available msr list.  User must free.
  */
diff --git a/libkvm/libkvm.h b/libkvm/libkvm.h
index ad6e26a..ccb086f 100644
--- a/libkvm/libkvm.h
+++ b/libkvm/libkvm.h
@@ -12,6 +12,7 @@
 #endif
 
 #include <linux/kvm.h>
+#include <linux/kvm_para.h>
 
 #include <signal.h>
 
@@ -639,4 +640,19 @@ int kvm_enable_vapic(kvm_context_t kvm, int vcpu, uint64_t vapic);
 
 #endif
 
+#ifdef KVM_CAP_PCI_PASSTHROUGH
+/*!
+ * \brief Notifies host kernel about changes to a PCI device assigned to guest
+ *
+ * Used for PCI device assignment, this function notifies the host
+ * kernel about the assigning of the physical PCI device and the guest
+ * PCI parameters or updates to the PCI config space from the guest
+ * (mainly the device irq)
+ *
+ * \param kvm Pointer to the current kvm_context
+ * \param pci_pt_dev Parameters like irq, PCI bus, devfn number, etc
+ */
+int kvm_update_pci_pt_device(kvm_context_t kvm,
+			     struct kvm_pci_passthrough_dev *pci_pt_dev);
+#endif
 #endif
diff --git a/qemu/Makefile.target b/qemu/Makefile.target
index 7a51cd0..4ad858b 100644
--- a/qemu/Makefile.target
+++ b/qemu/Makefile.target
@@ -605,6 +605,7 @@ OBJS+= ide.o pckbd.o ps2.o vga.o $(SOUND_HW) dma.o
 OBJS+= fdc.o mc146818rtc.o serial.o i8259.o i8254.o pcspk.o pc.o
 OBJS+= cirrus_vga.o apic.o parallel.o acpi.o piix_pci.o
 OBJS+= usb-uhci.o vmmouse.o vmport.o vmware_vga.o extboot.o
+OBJS+= pci-passthrough.o
 ifeq ($(USE_KVM_PIT), 1)
 OBJS+= i8254-kvm.o
 endif
diff --git a/qemu/hw/isa.h b/qemu/hw/isa.h
index 89b3004..c720f5e 100644
--- a/qemu/hw/isa.h
+++ b/qemu/hw/isa.h
@@ -1,5 +1,7 @@
 /* ISA bus */
 
+#include "hw.h"
+
 extern target_phys_addr_t isa_mem_base;
 
 int register_ioport_read(int start, int length, int size,
diff --git a/qemu/hw/pc.c b/qemu/hw/pc.c
index d578e68..090d92c 100644
--- a/qemu/hw/pc.c
+++ b/qemu/hw/pc.c
@@ -32,6 +32,7 @@
 #include "smbus.h"
 #include "boards.h"
 #include "console.h"
+#include "pci-passthrough.h"
 
 #include "qemu-kvm.h"
 
@@ -994,6 +995,14 @@ static void pc_init1(ram_addr_t ram_size, int vga_ram_size,
         }
     }
 
+    /* Initialize pass-through */
+    if (pci_enabled) {
+        int r = -1;
+        do {
+            pt_init_device(pci_bus, &r);
+	} while (r >= 0);
+    }
+
     rtc_state = rtc_init(0x70, i8259[8]);
 
     qemu_register_boot_set(pc_boot_set, rtc_state);
diff --git a/qemu/hw/pci-passthrough.c b/qemu/hw/pci-passthrough.c
new file mode 100644
index 0000000..250d7ef
--- /dev/null
+++ b/qemu/hw/pci-passthrough.c
@@ -0,0 +1,594 @@
+/*
+ * Copyright (c) 2007, Neocleus Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ *
+ *
+ *  Pass a PCI device from the host to a guest VM.
+ *
+ *  Adapted for KVM by Qumranet.
+ *
+ *  Copyright (c) 2007, Neocleus, Alex Novik (alex@neocleus.com)
+ *  Copyright (c) 2007, Neocleus, Guy Zana (guy@neocleus.com)
+ *  Copyright (C) 2008, Qumranet, Amit Shah (amit.shah@qumranet.com)
+ */
+#include <stdio.h>
+#include <pthread.h>
+#include <sys/io.h>
+#include <sys/ioctl.h>
+#include <linux/types.h>
+
+/* From linux/ioport.h */
+#define IORESOURCE_IO		0x00000100	/* Resource type */
+#define IORESOURCE_MEM		0x00000200
+#define IORESOURCE_IRQ		0x00000400
+#define IORESOURCE_DMA		0x00000800
+#define IORESOURCE_PREFETCH	0x00001000	/* No side effects */
+
+#include "pci-passthrough.h"
+#include "irq.h"
+
+#include "qemu-kvm.h"
+#include <linux/kvm_para.h>
+extern FILE *logfile;
+
+//#define PT_DEBUG
+
+#ifdef PT_DEBUG
+#define DEBUG(fmt, args...) fprintf(stderr, "%s: " fmt, __func__ , ## args)
+#else
+#define DEBUG(fmt, args...)
+#endif
+
+#define pt_mmio_write(suffix, type)					\
+static void pt_mmio_write##suffix(void *opaque, target_phys_addr_t e_phys, \
+				uint32_t value)				\
+{									\
+	pt_region_t *r_access = (pt_region_t *)opaque;			\
+	void *r_virt = (uint8_t *)r_access->r_virtbase +		\
+			(e_phys - r_access->e_physbase);		\
+	if (r_access->debug & PT_DEBUG_MMIO) {				\
+		fprintf(logfile, "pt_mmio_write" #suffix		\
+			": e_physbase=%p e_phys=%p r_virt=%p value=%08x\n", \
+			(void *)r_access->e_physbase, (void *)e_phys,	\
+			r_virt, value);					\
+	}								\
+	*(type *)r_virt = (type)value;					\
+}
+
+pt_mmio_write(b, uint8_t)
+pt_mmio_write(w, uint16_t)
+pt_mmio_write(l, uint32_t)
+
+#define pt_mmio_read(suffix, type)					\
+static uint32_t pt_mmio_read##suffix(void *opaque, target_phys_addr_t e_phys) \
+{									\
+	pt_region_t *r_access = (pt_region_t *)opaque;			\
+	void *r_virt = (uint8_t *)r_access->r_virtbase +		\
+			(e_phys - r_access->e_physbase);		\
+	uint32_t value = (uint32_t) (*(type *) r_virt);			\
+	if (r_access->debug & PT_DEBUG_MMIO) {				\
+		fprintf(logfile,					\
+			"pt_mmio_read" #suffix ": e_physbase=%p "	\
+			"e_phys=%p r_virt=%p value=%08x\n",		\
+			(void *)r_access->e_physbase,			\
+			(void *)e_phys, r_virt, value);			\
+	}								\
+	return value;							\
+}
+
+pt_mmio_read(b, uint8_t)
+pt_mmio_read(w, uint16_t)
+pt_mmio_read(l, uint32_t)
+
+CPUReadMemoryFunc *pt_mmio_read_cb[3] = {
+	pt_mmio_readb,
+	pt_mmio_readw,
+	pt_mmio_readl
+};
+
+CPUWriteMemoryFunc *pt_mmio_write_cb[3] = {
+	pt_mmio_writeb,
+	pt_mmio_writew,
+	pt_mmio_writel
+};
+
+#define pt_ioport_write(suffix)						\
+static void pt_ioport_write##suffix(void *opaque, uint32_t addr, uint32_t value) \
+{									\
+	pt_region_t *r_access = (pt_region_t *)opaque;			\
+	uint32_t r_pio = (unsigned long)r_access->r_virtbase		\
+			 + (addr - r_access->e_physbase);		\
+	if (r_access->debug & PT_DEBUG_PIO) {				\
+		fprintf(logfile, "pt_ioport_write" #suffix 		\
+			": r_pio=%08x e_physbase=%08x"			\
+			" r_virtbase=%08lx value=%08x\n", 		\
+			r_pio, (int)r_access->e_physbase,		\
+			(unsigned long)r_access->r_virtbase, value);	\
+	}								\
+	out##suffix(value, r_pio);					\
+}
+
+pt_ioport_write(b)
+pt_ioport_write(w)
+pt_ioport_write(l)
+
+#define pt_ioport_read(suffix)						\
+static uint32_t pt_ioport_read##suffix(void *opaque, uint32_t addr)	\
+{									\
+	pt_region_t *r_access = (pt_region_t *)opaque;			\
+	uint32_t r_pio = (addr - r_access->e_physbase)			\
+			+ (unsigned long)r_access->r_virtbase;		\
+	uint32_t value = in##suffix(r_pio);				\
+	if (r_access->debug & PT_DEBUG_PIO) {				\
+		fprintf(logfile, "pt_ioport_read" #suffix 		\
+			": r_pio=%08x e_physbase=%08x r_virtbase=%08lx "\
+			"value=%08x\n", 				\
+			r_pio, (int)r_access->e_physbase, 		\
+			(unsigned long)r_access->r_virtbase, value);	\
+	}								\
+	return value;							\
+}
+
+pt_ioport_read(b)
+pt_ioport_read(w)
+pt_ioport_read(l)
+
+static void pt_iomem_map(PCIDevice *d, int region_num,
+			 uint32_t e_phys, uint32_t e_size, int type)
+{
+	pt_dev_t *r_dev = (pt_dev_t *) d;
+
+	r_dev->v_addrs[region_num].e_physbase = e_phys;
+
+	DEBUG("e_phys=%08x r_virt=%p type=%d len=%08x region_num=%d \n",
+	      e_phys, r_dev->v_addrs[region_num].r_virtbase, type, e_size,
+	      region_num);
+
+	cpu_register_physical_memory(e_phys,
+				     r_dev->dev.io_regions[region_num].size,
+				     r_dev->v_addrs[region_num].memory_index);
+}
+
+static void pt_ioport_map(PCIDevice *pci_dev, int region_num,
+			  uint32_t addr, uint32_t size, int type)
+{
+	pt_dev_t *r_dev = (pt_dev_t *) pci_dev;
+	int i;
+	uint32_t ((*rf[])(void *, uint32_t)) =  { pt_ioport_readb,
+						  pt_ioport_readw,
+						  pt_ioport_readl
+						};
+	void ((*wf[])(void *, uint32_t, uint32_t)) = { pt_ioport_writeb,
+						       pt_ioport_writew,
+						       pt_ioport_writel
+						     };
+
+	r_dev->v_addrs[region_num].e_physbase = addr;
+	DEBUG("pt_ioport_map: address=0x%x type=0x%x len=%d"
+	      "region_num=%d \n", addr, type, size, region_num);
+
+	for (i = 0; i < 3; i++) {
+		register_ioport_write(addr, size, 1<<i, wf[i],
+				      (void *) (r_dev->v_addrs + region_num));
+		register_ioport_read(addr, size, 1<<i, rf[i],
+				     (void *) (r_dev->v_addrs + region_num));
+	}
+}
+
+static void pt_pci_write_config(PCIDevice *d, uint32_t address, uint32_t val,
+				int len)
+{
+	int fd, r;
+
+	DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n",
+	      ((d->devfn >> 3) & 0x1F), (d->devfn & 0x7), (uint16_t) address,
+	      val, len);
+
+	if (address == 0x4)
+		pci_default_write_config(d, address, val, len);
+
+	if ((address >= 0x10 && address <= 0x24) || address == 0x34 ||
+	    address == 0x3c || address == 0x3d) {
+		/* used for update-mappings (BAR emulation) */
+		pci_default_write_config(d, address, val, len);
+		return;
+	}
+
+	DEBUG("NON BAR (%x.%x): address=%04x val=0x%08x len=%d\n",
+	      ((d->devfn >> 3) & 0x1F), (d->devfn & 0x7), (uint16_t) address,
+	      val, len);
+	fd = ((pt_dev_t *)d)->real_device.config_fd;
+	lseek(fd, address, SEEK_SET);
+again:
+	r = write(fd, &val, len);
+	if (r < 0) {
+		if (errno == EINTR || errno == EAGAIN)
+			goto again;
+		fprintf(stderr, "%s: write failed, errno = %d\n", __func__,
+			errno);
+	}
+}
+
+static uint32_t pt_pci_read_config(PCIDevice *d, uint32_t address, int len)
+{
+	uint32_t val = 0;
+	int fd, r;
+
+	if ((address >= 0x10 && address <= 0x24) || address == 0x34 ||
+	    address == 0x3c || address == 0x3d) {
+		val = pci_default_read_config(d, address, len);
+		DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n",
+		      (d->devfn >> 3) & 0x1F, (d->devfn & 0x7), address, val,
+		      len);
+		return val;
+	}
+
+	/* vga specific, remove later */
+	if (address == 0xFC)
+		goto do_log;
+
+	fd = ((pt_dev_t *)d)->real_device.config_fd;
+	lseek(fd, address, SEEK_SET);
+again:
+	r = read(fd, &val, len);
+	if (r < 0) {
+		if (errno == EINTR || errno == EAGAIN)
+			goto again;
+		fprintf(stderr, "%s: read failed, errno = %d\n", __func__,
+			errno);
+	}
+
+do_log:
+	DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n",
+	      (d->devfn >> 3) & 0x1F, (d->devfn & 0x7), address, val, len);
+
+	/* kill the special capabilities */
+	if (address == 4 && len == 4)
+		val &= ~0x100000;
+	else if (address == 6)
+		val &= ~0x10;
+
+	return val;
+}
+
+static int pt_register_regions(pci_region_t *io_regions,
+			       unsigned long regions_num, pt_dev_t *pci_dev)
+{
+	uint32_t i;
+	pci_region_t *cur_region = io_regions;
+
+	for (i = 0; i < regions_num; i++, cur_region++) {
+		if (!cur_region->valid)
+			continue;
+#ifdef PT_DEBUG
+		pci_dev->v_addrs[i].debug |= PT_DEBUG_MMIO | PT_DEBUG_PIO;
+#endif
+		pci_dev->v_addrs[i].num = i;
+
+		/* handle memory io regions */
+		if (cur_region->type & IORESOURCE_MEM) {
+			int t = cur_region->type & IORESOURCE_PREFETCH
+			  ? PCI_ADDRESS_SPACE_MEM_PREFETCH
+			  : PCI_ADDRESS_SPACE_MEM;
+
+			/* map physical memory */
+			pci_dev->v_addrs[i].e_physbase = cur_region->base_addr;
+			pci_dev->v_addrs[i].r_virtbase =
+			  mmap(NULL, (cur_region->size + 0xFFF) & 0xFFFFF000,
+			       PROT_WRITE | PROT_READ, MAP_SHARED,
+			       cur_region->resource_fd, (off_t) 0);
+
+			if ((void *) -1 == pci_dev->v_addrs[i].r_virtbase) {
+				fprintf(stderr, "Error: Couldn't mmap 0x%x!\n",
+					(uint32_t) (cur_region->base_addr));
+				return -1;
+			}
+
+			/* add offset */
+			pci_dev->v_addrs[i].r_virtbase +=
+			  (cur_region->base_addr & 0xFFF);
+
+			pci_register_io_region((PCIDevice *) pci_dev, i,
+					       cur_region->size, t,
+					       pt_iomem_map);
+
+			pci_dev->v_addrs[i].memory_index =
+			    cpu_register_io_memory(0, pt_mmio_read_cb,
+						   pt_mmio_write_cb,
+						   (void *) &(pci_dev->v_addrs[i]));
+
+			continue;
+		}
+		/* handle port io regions */
+
+		pci_register_io_region((PCIDevice *) pci_dev, i,
+				       cur_region->size, PCI_ADDRESS_SPACE_IO,
+				       pt_ioport_map);
+
+		pci_dev->v_addrs[i].e_physbase = cur_region->base_addr;
+		pci_dev->v_addrs[i].r_virtbase = (void *)(long)cur_region->base_addr;
+		/* not relevant for port io */
+		pci_dev->v_addrs[i].memory_index = 0;
+	}
+
+	/* success */
+	return 0;
+
+}
+
+static int pt_get_real_device(pt_dev_t *pci_dev, uint8_t r_bus, uint8_t r_dev,
+			      uint8_t r_func)
+{
+	char dir[128], name[128], comp[16];
+	int fd, r = 0;
+	FILE *f;
+	unsigned long long start, end, size, flags;
+	pci_region_t *rp;
+	pci_dev_t *dev = &pci_dev->real_device;
+
+	dev->region_number = 0;
+
+	sprintf(dir, "/sys/bus/pci/devices/0000:%02x:%02x.%x/",
+		r_bus, r_dev, r_func);
+	strcpy(name, dir);
+	strcat(name, "config");
+	if ((fd = open(name, O_RDWR)) == -1) {
+		fprintf(stderr, "%s: %m\n", name);
+		return 1;
+	}
+	dev->config_fd = fd;
+again:
+	r = read(fd, pci_dev->dev.config, sizeof pci_dev->dev.config);
+	if (r < 0) {
+		if (errno == EINTR || errno == EAGAIN)
+			goto again;
+		fprintf(stderr, "%s: read failed, errno = %d\n", __func__,
+			errno);
+	}
+
+	strcpy(name, dir);
+	strcat(name, "resource");
+	if ((f = fopen(name, "r")) == NULL) {
+		fprintf(stderr, "%s: %m\n", name);
+		return 1;
+	}
+
+	for (r = 0; fscanf(f, "%lli %lli %lli\n", &start, &end, &flags) == 3;
+	     r++) {
+		rp = dev->regions + r;
+		rp->valid = 0;
+		size = end - start + 1;
+		flags &= IORESOURCE_IO | IORESOURCE_MEM | IORESOURCE_PREFETCH;
+		if (size == 0 || (flags & ~IORESOURCE_PREFETCH) == 0)
+			continue;
+		if (flags & IORESOURCE_MEM) {
+			flags &= ~IORESOURCE_IO;
+			sprintf(comp, "resource%d", r);
+			strcpy(name, dir);
+			strcat(name, comp);
+			if ((fd = open(name, O_RDWR)) == -1)
+				continue;		/* probably ROM */
+			rp->resource_fd = fd;
+		} else
+			flags &= ~IORESOURCE_PREFETCH;
+
+		rp->type = flags;
+		rp->valid = 1;
+		rp->base_addr = start;
+		rp->size = size;
+		DEBUG("region %d size %d start 0x%x type %d "
+		      "resource_fd %d\n", r, rp->size, start, rp->type,
+		      rp->resource_fd);
+	}
+	fclose(f);
+
+	dev->region_number = r;
+	return 0;
+}
+
+static pt_dev_t *register_real_device(PCIBus *e_bus, const char *e_dev_name,
+				      int e_devfn, uint8_t r_bus, uint8_t r_dev,
+				      uint8_t r_func)
+{
+	int rc;
+	pt_dev_t *pci_dev;
+	uint8_t e_device, e_intx;
+
+	DEBUG("register_real_device: Registering real physical "
+	      "device %s (devfn=0x%x)\n", e_dev_name, e_devfn);
+
+	pci_dev = (pt_dev_t *) pci_register_device(e_bus, e_dev_name,
+						   sizeof(pt_dev_t), e_devfn,
+						   pt_pci_read_config,
+						   pt_pci_write_config);
+
+	if (NULL == pci_dev) {
+		fprintf(stderr, "register_real_device: Error: Couldn't "
+			"register real device %s\n", e_dev_name);
+		return NULL;
+	}
+	if (pt_get_real_device(pci_dev, r_bus, r_dev, r_func)) {
+		fprintf(stderr, "register_real_device: Error: Couldn't get "
+			"real device (%s)!\n", e_dev_name);
+		return NULL;
+	}
+
+	/* handle real device's MMIO/PIO BARs */
+	if (pt_register_regions(pci_dev->real_device.regions,
+				pci_dev->real_device.region_number, pci_dev))
+		return NULL;
+
+	/* handle interrupt routing */
+	e_device = (pci_dev->dev.devfn >> 3) & 0x1f;
+	e_intx = pci_dev->dev.config[0x3d] - 1;
+	pci_dev->intpin = e_intx;
+	pci_dev->run = 0;
+	pci_dev->girq = 0;
+	pci_dev->h_busnr = r_bus;
+	pci_dev->h_devfn = PCI_DEVFN(r_dev, r_func);
+
+#ifdef KVM_CAP_PCI_PASSTHROUGH
+	if (kvm_enabled()) {
+		struct kvm_pci_passthrough_dev pci_pt_dev;
+
+		memset(&pci_pt_dev, 0, sizeof(pci_pt_dev));
+		pci_pt_dev.guest.busnr = pci_bus_num(e_bus);
+		pci_pt_dev.guest.devfn = PCI_DEVFN(e_device, r_func);
+		pci_pt_dev.host.busnr  = pci_dev->h_busnr;
+		pci_pt_dev.host.devfn  = pci_dev->h_devfn;
+
+		/* We'll set the value of the guest irq as and when
+		 * the piix config gets updated. See pci_pt_update_irq.
+		 * The host irq field never gets used anyway
+		 */
+
+		rc = kvm_update_pci_pt_device(kvm_context, &pci_pt_dev);
+		if (rc < 0) {
+			fprintf(stderr, "Could not notify kernel about "
+				"passthrough device\n");
+			perror("pt-ioctl");
+			return NULL;
+		}
+	}
+#endif
+
+	fprintf(logfile, "Registered host PCI device %02x:%02x.%1x "
+		"as guest device %02x:%02x.%1x\n",
+		r_bus, r_dev, r_func,
+		pci_bus_num(e_bus), e_device, r_func);
+
+	return pci_dev;
+}
+
+#define	MAX_PTDEVS 4
+struct {
+	char name[128];
+	int bus;
+	int dev;
+	int func;
+	pt_dev_t *ptdev;
+} ptdevs[MAX_PTDEVS];
+
+int nptdevs;
+extern int piix_get_irq(int);
+
+#ifdef KVM_CAP_PCI_PASSTHROUGH
+/* The pci config space got updated. Check if irq numbers have changed
+ * for our devices
+ */
+void pci_pt_update_irq(PCIDevice *d)
+{
+	int i, irq, r;
+	pt_dev_t *pt_dev;
+
+	for (i = 0; i < nptdevs; i++) {
+		pt_dev = ptdevs[i].ptdev;
+		if (pt_dev == NULL)
+			continue;
+
+		irq = pci_map_irq(&pt_dev->dev, pt_dev->intpin);
+		irq = piix_get_irq(irq);
+		if (irq != pt_dev->girq) {
+			struct kvm_pci_passthrough_dev pci_pt_dev;
+
+			memset(&pci_pt_dev, 0, sizeof(pci_pt_dev));
+			pci_pt_dev.guest.irq = irq;
+			pci_pt_dev.host.busnr = pt_dev->h_busnr;
+			pci_pt_dev.host.devfn = pt_dev->h_devfn;
+			r = kvm_update_pci_pt_device(kvm_context, &pci_pt_dev);
+			if (r < 0) {
+				perror("pci_pt_update_irq");
+				continue;
+			}
+			pt_dev->girq = irq;
+		}
+	}
+}
+#endif
+
+int pt_init_system(void)
+{
+	/* Do we have any devices to be assigned? */
+	if (nptdevs == 0)
+		return -1;
+
+	iopl(3);
+
+	return 0;
+}
+
+int pt_init_device(PCIBus *bus, int *index)
+{
+	pt_dev_t *dev = NULL;
+	int i, ret = 0;
+
+	if (*index == -1) {
+		if (pt_init_system() < 0)
+			return -1;
+
+		*index = nptdevs - 1;
+	}
+	i = *index;
+
+	dev = register_real_device(bus, ptdevs[i].name, -1,
+				   ptdevs[i].bus, ptdevs[i].dev,
+				   ptdevs[i].func);
+	if (dev == NULL) {
+		fprintf(stderr, "Error: Couldn't register device %s\n",
+			ptdevs[i].name);
+		ret = -1;
+	}
+	ptdevs[i].ptdev = dev;
+
+	--*index;
+	return ret;
+}
+
+void add_pci_passthrough_device(const char *arg)
+{
+	/* name/bus:dev.func */
+	char *cp, *cp1;
+
+	if (nptdevs >= MAX_PTDEVS) {
+		fprintf(stderr, "Too many passthrough devices (max %d)\n",
+			MAX_PTDEVS);
+		return;
+	}
+	strcpy(ptdevs[nptdevs].name, arg);
+	cp = strchr(ptdevs[nptdevs].name, '/');
+	if (cp == NULL)
+		goto bad;
+	*cp++ = 0;
+
+	ptdevs[nptdevs].bus = strtoul(cp, &cp1, 16);
+	if (*cp1 != ':')
+		goto bad;
+	cp = cp1 + 1;
+
+	ptdevs[nptdevs].dev = strtoul(cp, &cp1, 16);
+	if (*cp1 != '.')
+		goto bad;
+	cp = cp1 + 1;
+
+	ptdevs[nptdevs].func = strtoul(cp, &cp1, 16);
+	if (*cp1 != 0)
+		goto bad;
+
+	nptdevs++;
+	return;
+bad:
+	fprintf(stderr, "passthrough arg (%s) not in the form of "
+		"name/bus:dev.func\n", arg);
+}
diff --git a/qemu/hw/pci-passthrough.h b/qemu/hw/pci-passthrough.h
new file mode 100644
index 0000000..60df017
--- /dev/null
+++ b/qemu/hw/pci-passthrough.h
@@ -0,0 +1,93 @@
+/*
+ * Copyright (c) 2007, Neocleus Corporation.
+ * Copyright (c) 2007, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ *
+ *  Data structures for storing PCI state
+ *
+ *  Adapted to kvm by Qumranet
+ *
+ *  Copyright (c) 2007, Neocleus, Alex Novik (alex@neocleus.com)
+ *  Copyright (c) 2007, Neocleus, Guy Zana (guy@neocleus.com)
+ *  Copyright (C) 2008, Qumranet, Amit Shah (amit.shah@qumranet.com)
+ */
+
+#ifndef __PCI_PASSTHROUGH_H__
+#define __PCI_PASSTHROUGH_H__
+
+#include <sys/mman.h>
+#include "qemu-common.h"
+#include "pci.h"
+#include <linux/types.h>
+
+#define PT_DEBUG_PIO	(0x01)
+#define PT_DEBUG_MMIO	(0x02)
+
+/* From include/linux/pci.h in the kernel sources */
+#define PCI_DEVFN(slot,func)	((((slot) & 0x1f) << 3) | ((func) & 0x07))
+
+typedef uint32_t pciaddr_t;
+
+#define MAX_IO_REGIONS			(6)
+
+typedef struct pci_region_s {
+	int type;	/* Memory or port I/O */
+	int valid;
+	pciaddr_t base_addr;
+	pciaddr_t size;		/* size of the region */
+	int resource_fd;
+} pci_region_t;
+
+typedef struct pci_dev_s {
+	uint8_t bus, dev, func;	/* Bus inside domain, device and function */
+	int irq;		/* IRQ number */
+	uint16_t region_number;	/* number of active regions */
+
+	/* Port I/O or MMIO Regions */
+	pci_region_t regions[MAX_IO_REGIONS];
+	int config_fd;
+} pci_dev_t;
+
+typedef struct pt_region_s {
+	target_phys_addr_t e_physbase;
+	uint32_t memory_index;
+	void *r_virtbase;	/* mmapped access address */
+	int num;		/* our index within v_addrs[] */
+	uint32_t debug;
+} pt_region_t;
+
+typedef struct pt_dev_s {
+	PCIDevice dev;
+	int intpin;
+	uint8_t debug_flags;
+	pt_region_t v_addrs[PCI_NUM_REGIONS];
+	pci_dev_t real_device;
+	int run;
+	int girq;
+	char sirq[4];
+	unsigned char h_busnr;
+	unsigned int h_devfn;
+	int bound;
+} pt_dev_t;
+
+/* Initialization functions */
+int pt_init_device(PCIBus *bus, int *index);
+void add_pci_passthrough_device(const char *arg);
+void pt_set_vector(int irq, int vector);
+void pt_ack_mirq(int vector);
+
+#define logfile stderr
+
+#endif				/* __PCI_PASSTHROUGH_H__ */
diff --git a/qemu/hw/pci.c b/qemu/hw/pci.c
index 92683d1..ff21b83 100644
--- a/qemu/hw/pci.c
+++ b/qemu/hw/pci.c
@@ -50,6 +50,7 @@ struct PCIBus {
 
 static void pci_update_mappings(PCIDevice *d);
 static void pci_set_irq(void *opaque, int irq_num, int level);
+void pci_pt_update_irq(PCIDevice *d);
 
 target_phys_addr_t pci_mem_base;
 static int pci_irq_index;
@@ -453,6 +454,12 @@ void pci_default_write_config(PCIDevice *d,
         val >>= 8;
     }
 
+#ifdef KVM_CAP_PCI_PASSTHROUGH
+    if (kvm_enabled() && qemu_kvm_irqchip_in_kernel() &&
+	address >= 0x60 && address <= 0x63)
+	pci_pt_update_irq(d);
+#endif
+
     end = address + len;
     if (end > PCI_COMMAND && address < (PCI_COMMAND + 2)) {
         /* if the command register is modified, we must modify the mappings */
@@ -555,6 +562,11 @@ static void pci_set_irq(void *opaque, int irq_num, int level)
     bus->set_irq(bus->irq_opaque, irq_num, bus->irq_count[irq_num] != 0);
 }
 
+int pci_map_irq(PCIDevice *pci_dev, int pin)
+{
+	return pci_dev->bus->map_irq(pci_dev, pin);
+}
+
 /***********************************************************/
 /* monitor info on PCI */
 
diff --git a/qemu/hw/pci.h b/qemu/hw/pci.h
index 60e4094..e11fbbf 100644
--- a/qemu/hw/pci.h
+++ b/qemu/hw/pci.h
@@ -81,6 +81,7 @@ void pci_register_io_region(PCIDevice *pci_dev, int region_num,
                             uint32_t size, int type,
                             PCIMapIORegionFunc *map_func);
 
+int pci_map_irq(PCIDevice *pci_dev, int pin);
 uint32_t pci_default_read_config(PCIDevice *d,
                                  uint32_t address, int len);
 void pci_default_write_config(PCIDevice *d,
diff --git a/qemu/hw/piix_pci.c b/qemu/hw/piix_pci.c
index 90cb3a6..112381a 100644
--- a/qemu/hw/piix_pci.c
+++ b/qemu/hw/piix_pci.c
@@ -237,6 +237,25 @@ static void piix3_set_irq(qemu_irq *pic, int irq_num, int level)
     }
 }
 
+int piix3_get_pin(int pic_irq)
+{
+    int i;
+    for (i = 0; i < 4; i++)
+        if (piix3_dev->config[0x60+i] == pic_irq)
+            return i;
+    return -1;
+}
+
+int piix_get_irq(int pin)
+{
+    if (piix3_dev)
+        return piix3_dev->config[0x60+pin];
+    if (piix4_dev)
+        return piix4_dev->config[0x60+pin];
+
+    return 0;
+}
+
 static void piix3_reset(PCIDevice *d)
 {
     uint8_t *pci_conf = d->config;
diff --git a/qemu/vl.c b/qemu/vl.c
index db7095f..ca63b2e 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -37,6 +37,7 @@
 #include "qemu-char.h"
 #include "block.h"
 #include "audio/audio.h"
+#include "hw/pci-passthrough.h"
 #include "migration.h"
 #include "qemu-kvm.h"
 
@@ -7968,6 +7969,11 @@ static void help(int exitcode)
 #endif
 	   "-no-kvm-irqchip disable KVM kernel mode PIC/IOAPIC/LAPIC\n"
 	   "-no-kvm-pit	    disable KVM kernel mode PIT\n"
+#if defined(TARGET_I386) || defined(TARGET_X86_64)
+	   "-pcidevice name/bus:dev.func\n"
+	   "                expose a PCI device to the guest OS.\n"
+	   "                'name' is just used for debug logs.\n"
+#endif
 #endif
 #ifdef TARGET_I386
            "-std-vga        simulate a standard VGA card with VESA Bochs Extensions\n"
@@ -8093,6 +8099,9 @@ enum {
     QEMU_OPTION_no_kvm,
     QEMU_OPTION_no_kvm_irqchip,
     QEMU_OPTION_no_kvm_pit,
+#if defined(TARGET_I386) || defined(TARGET_X86_64)
+    QEMU_OPTION_pcidevice,
+#endif
     QEMU_OPTION_no_reboot,
     QEMU_OPTION_no_shutdown,
     QEMU_OPTION_show_cursor,
@@ -8182,6 +8191,9 @@ const QEMUOption qemu_options[] = {
 #endif
     { "no-kvm-irqchip", 0, QEMU_OPTION_no_kvm_irqchip },
     { "no-kvm-pit", 0, QEMU_OPTION_no_kvm_pit },
+#if defined(TARGET_I386) || defined(TARGET_X86_64)
+    { "pcidevice", HAS_ARG, QEMU_OPTION_pcidevice },
+#endif
 #endif
 #if defined(TARGET_PPC) || defined(TARGET_SPARC)
     { "g", 1, QEMU_OPTION_g },
@@ -9095,6 +9107,11 @@ int main(int argc, char **argv)
 		kvm_pit = 0;
 		break;
 	    }
+#if defined(TARGET_I386) || defined(TARGET_X86_64)
+	    case QEMU_OPTION_pcidevice:
+		add_pci_passthrough_device(optarg);
+		break;
+#endif
 #endif
             case QEMU_OPTION_usb:
                 usb_enabled = 1;
-- 
1.5.6


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 2/2] PCIPT: direct mmio
  2008-07-16 13:21                   ` [PATCH 1/2] KVM/userspace: Support for assigning PCI devices to guest Ben-Ami Yassour
@ 2008-07-16 13:21                     ` Ben-Ami Yassour
  2008-07-17  7:52                     ` [PATCH 1/2] KVM/userspace: Support for assigning PCI devices to guest Han, Weidong
  1 sibling, 0 replies; 42+ messages in thread
From: Ben-Ami Yassour @ 2008-07-16 13:21 UTC (permalink / raw)
  To: amit.shah; +Cc: kvm, muli, benami, weidong.han, anthony

Enable a guest to access a device's memory mapped I/O regions directly.
When the guest changes the I/O memory mappings for a passthrough device,
send the updated mmio region to the kernel.

Signed-off-by: Ben-Ami Yassour <benami@il.ibm.com>
Signed-off-by: Muli Ben-Yehuda <muli@il.ibm.com>
---
 libkvm/libkvm.c           |   24 ++++++++----
 qemu/hw/pci-passthrough.c |   88 +++++++++++---------------------------------
 qemu/hw/pci-passthrough.h |    2 +
 3 files changed, 40 insertions(+), 74 deletions(-)

diff --git a/libkvm/libkvm.c b/libkvm/libkvm.c
index a49cbdc..35bc411 100644
--- a/libkvm/libkvm.c
+++ b/libkvm/libkvm.c
@@ -400,7 +400,7 @@ void *kvm_create_userspace_phys_mem(kvm_context_t kvm, unsigned long phys_start,
 {
 	int r;
 	int prot = PROT_READ;
-	void *ptr;
+	void *ptr = NULL;
 	struct kvm_userspace_memory_region memory = {
 		.memory_size = len,
 		.guest_phys_addr = phys_start,
@@ -410,16 +410,24 @@ void *kvm_create_userspace_phys_mem(kvm_context_t kvm, unsigned long phys_start,
 	if (writable)
 		prot |= PROT_WRITE;
 
-	ptr = mmap(NULL, len, prot, MAP_ANONYMOUS | MAP_SHARED, -1, 0);
-	if (ptr == MAP_FAILED) {
-		fprintf(stderr, "create_userspace_phys_mem: %s", strerror(errno));
-		return 0;
-	}
+	if (len > 0) {
+		ptr = mmap(NULL, len, prot, MAP_ANONYMOUS | MAP_SHARED, -1, 0);
+		if (ptr == MAP_FAILED) {
+			fprintf(stderr, "create_userspace_phys_mem: %s",
+				strerror(errno));
+			return 0;
+		}
 
-	memset(ptr, 0, len);
+		memset(ptr, 0, len);
+	}
 
 	memory.userspace_addr = (unsigned long)ptr;
-	memory.slot = get_free_slot(kvm);
+
+	if (len > 0)
+		memory.slot = get_free_slot(kvm);
+	else
+		memory.slot = get_slot(phys_start);
+
 	r = ioctl(kvm->vm_fd, KVM_SET_USER_MEMORY_REGION, &memory);
 	if (r == -1) {
 		fprintf(stderr, "create_userspace_phys_mem: %s", strerror(errno));
diff --git a/qemu/hw/pci-passthrough.c b/qemu/hw/pci-passthrough.c
index 250d7ef..08e339d 100644
--- a/qemu/hw/pci-passthrough.c
+++ b/qemu/hw/pci-passthrough.c
@@ -51,59 +51,6 @@ extern FILE *logfile;
 #define DEBUG(fmt, args...)
 #endif
 
-#define pt_mmio_write(suffix, type)					\
-static void pt_mmio_write##suffix(void *opaque, target_phys_addr_t e_phys, \
-				uint32_t value)				\
-{									\
-	pt_region_t *r_access = (pt_region_t *)opaque;			\
-	void *r_virt = (uint8_t *)r_access->r_virtbase +		\
-			(e_phys - r_access->e_physbase);		\
-	if (r_access->debug & PT_DEBUG_MMIO) {				\
-		fprintf(logfile, "pt_mmio_write" #suffix		\
-			": e_physbase=%p e_phys=%p r_virt=%p value=%08x\n", \
-			(void *)r_access->e_physbase, (void *)e_phys,	\
-			r_virt, value);					\
-	}								\
-	*(type *)r_virt = (type)value;					\
-}
-
-pt_mmio_write(b, uint8_t)
-pt_mmio_write(w, uint16_t)
-pt_mmio_write(l, uint32_t)
-
-#define pt_mmio_read(suffix, type)					\
-static uint32_t pt_mmio_read##suffix(void *opaque, target_phys_addr_t e_phys) \
-{									\
-	pt_region_t *r_access = (pt_region_t *)opaque;			\
-	void *r_virt = (uint8_t *)r_access->r_virtbase +		\
-			(e_phys - r_access->e_physbase);		\
-	uint32_t value = (uint32_t) (*(type *) r_virt);			\
-	if (r_access->debug & PT_DEBUG_MMIO) {				\
-		fprintf(logfile,					\
-			"pt_mmio_read" #suffix ": e_physbase=%p "	\
-			"e_phys=%p r_virt=%p value=%08x\n",		\
-			(void *)r_access->e_physbase,			\
-			(void *)e_phys, r_virt, value);			\
-	}								\
-	return value;							\
-}
-
-pt_mmio_read(b, uint8_t)
-pt_mmio_read(w, uint16_t)
-pt_mmio_read(l, uint32_t)
-
-CPUReadMemoryFunc *pt_mmio_read_cb[3] = {
-	pt_mmio_readb,
-	pt_mmio_readw,
-	pt_mmio_readl
-};
-
-CPUWriteMemoryFunc *pt_mmio_write_cb[3] = {
-	pt_mmio_writeb,
-	pt_mmio_writew,
-	pt_mmio_writel
-};
-
 #define pt_ioport_write(suffix)						\
 static void pt_ioport_write##suffix(void *opaque, uint32_t addr, uint32_t value) \
 {									\
@@ -145,20 +92,32 @@ pt_ioport_read(b)
 pt_ioport_read(w)
 pt_ioport_read(l)
 
-static void pt_iomem_map(PCIDevice *d, int region_num,
-			 uint32_t e_phys, uint32_t e_size, int type)
+void pt_iomem_map(PCIDevice * pci_dev, int region_num, uint32_t e_phys,
+		  uint32_t e_size, int type)
 {
-	pt_dev_t *r_dev = (pt_dev_t *) d;
-
-	r_dev->v_addrs[region_num].e_physbase = e_phys;
+	pt_dev_t *r_dev = (pt_dev_t *) pci_dev;
+	pt_region_t *region = &r_dev->v_addrs[region_num];
+	int first_map = (region->e_size == 0);
+	int ret = 0;
 
 	DEBUG("e_phys=%08x r_virt=%p type=%d len=%08x region_num=%d \n",
 	      e_phys, r_dev->v_addrs[region_num].r_virtbase, type, e_size,
 	      region_num);
 
-	cpu_register_physical_memory(e_phys,
-				     r_dev->dev.io_regions[region_num].size,
-				     r_dev->v_addrs[region_num].memory_index);
+	region->e_physbase = e_phys;
+	region->e_size = e_size;
+
+	if (!first_map)
+		kvm_destroy_phys_mem(kvm_context, e_phys, e_size);
+
+	if (e_size > 0)
+		ret = kvm_register_userspace_phys_mem(kvm_context,
+						      e_phys,
+						      region->r_virtbase,
+						      e_size,
+						      0);
+	if (ret != 0)
+		fprintf(logfile, "Error: create new mapping failed\n");
 }
 
 static void pt_ioport_map(PCIDevice *pci_dev, int region_num,
@@ -295,6 +254,8 @@ static int pt_register_regions(pci_region_t *io_regions,
 					(uint32_t) (cur_region->base_addr));
 				return -1;
 			}
+			pci_dev->v_addrs[i].r_size = cur_region->size;
+			pci_dev->v_addrs[i].e_size = 0;
 
 			/* add offset */
 			pci_dev->v_addrs[i].r_virtbase +=
@@ -304,11 +265,6 @@ static int pt_register_regions(pci_region_t *io_regions,
 					       cur_region->size, t,
 					       pt_iomem_map);
 
-			pci_dev->v_addrs[i].memory_index =
-			    cpu_register_io_memory(0, pt_mmio_read_cb,
-						   pt_mmio_write_cb,
-						   (void *) &(pci_dev->v_addrs[i]));
-
 			continue;
 		}
 		/* handle port io regions */
diff --git a/qemu/hw/pci-passthrough.h b/qemu/hw/pci-passthrough.h
index 60df017..7a98b0e 100644
--- a/qemu/hw/pci-passthrough.h
+++ b/qemu/hw/pci-passthrough.h
@@ -65,6 +65,8 @@ typedef struct pt_region_s {
 	uint32_t memory_index;
 	void *r_virtbase;	/* mmapped access address */
 	int num;		/* our index within v_addrs[] */
+	uint32_t e_size;        /* emulated size of region in bytes */
+	uint32_t r_size;        /* real size of region in bytes */
 	uint32_t debug;
 } pt_region_t;
 
-- 
1.5.6


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* RE: [PATCH 1/2] KVM/userspace: Support for assigning PCI devices to guest
  2008-07-16 13:21                   ` [PATCH 1/2] KVM/userspace: Support for assigning PCI devices to guest Ben-Ami Yassour
  2008-07-16 13:21                     ` [PATCH 2/2] PCIPT: direct mmio Ben-Ami Yassour
@ 2008-07-17  7:52                     ` Han, Weidong
  2008-07-22 12:28                       ` Ben-Ami Yassour
  1 sibling, 1 reply; 42+ messages in thread
From: Han, Weidong @ 2008-07-17  7:52 UTC (permalink / raw)
  To: Ben-Ami Yassour, amit.shah; +Cc: kvm, muli, anthony, Or Sagi

Ben-Ami Yassour wrote:
> From: Or Sagi <ors@tutis.com>
> 
> From: Nir Peleg <nir@tutis.com>
> From: Amit Shah <amit.shah@qumranet.com>
> From: Glauber de Oliveira Costa <gcosta@redhat.com>
> 
> We can assign a device from the host machine to a guest. The
> original code comes from Neocleus.
> 
> A new command-line option, -pcidevice is added.
> For example, to invoke it for an Ethernet device sitting at
> PCI bus:dev.fn 04:08.0 with host IRQ 18, use this:
> 
>         -pcidevice Ethernet/04:08.0
> 
> The host ethernet driver is to be removed before doing the
> passthrough. If not, the device assignment fails but the
> guest continues without the assignment.
> 
> If kvm uses the in-kernel irqchip, interrupts are routed to the
> guest via the kvm module (accompanied kernel changes are necessary).
> 
> If -no-kvm-irqchip is used, the 'irqhook' module, available
> separately, is to be used for interrupt injection into the guest. In
> this case, an extra parameter, -<intr-number> is to be appended to
> the above-mentioned pcidevice parameter.
> 
> Signed-off-by: Amit Shah <amit.shah@qumranet.com>
> ---
> +static pt_dev_t *register_real_device(PCIBus *e_bus, const char
> *e_dev_name, +				      int e_devfn,
uint8_t r_bus, uint8_t r_dev,
> +				      uint8_t r_func)
> +{
> +	int rc;
> +	pt_dev_t *pci_dev;
> +	uint8_t e_device, e_intx;
> +
> +	DEBUG("register_real_device: Registering real physical "
> +	      "device %s (devfn=0x%x)\n", e_dev_name, e_devfn);
> +
> +	pci_dev = (pt_dev_t *) pci_register_device(e_bus, e_dev_name,
> +						   sizeof(pt_dev_t),
e_devfn,
> +						   pt_pci_read_config,
> +						   pt_pci_write_config);
> +
> +	if (NULL == pci_dev) {
> +		fprintf(stderr, "register_real_device: Error: Couldn't "
> +			"register real device %s\n", e_dev_name);
> +		return NULL;
> +	}
> +	if (pt_get_real_device(pci_dev, r_bus, r_dev, r_func)) {
> +		fprintf(stderr, "register_real_device: Error: Couldn't
get "
> +			"real device (%s)!\n", e_dev_name);
> +		return NULL;
> +	}
> +
> +	/* handle real device's MMIO/PIO BARs */
> +	if (pt_register_regions(pci_dev->real_device.regions,
> +				pci_dev->real_device.region_number,
pci_dev))
> +		return NULL;
> +
> +	/* handle interrupt routing */
> +	e_device = (pci_dev->dev.devfn >> 3) & 0x1f;
> +	e_intx = pci_dev->dev.config[0x3d] - 1;
> +	pci_dev->intpin = e_intx;
> +	pci_dev->run = 0;
> +	pci_dev->girq = 0;
> +	pci_dev->h_busnr = r_bus;
> +	pci_dev->h_devfn = PCI_DEVFN(r_dev, r_func);
> +
> +#ifdef KVM_CAP_PCI_PASSTHROUGH
> +	if (kvm_enabled()) {
> +		struct kvm_pci_passthrough_dev pci_pt_dev;
> +
> +		memset(&pci_pt_dev, 0, sizeof(pci_pt_dev));
> +		pci_pt_dev.guest.busnr = pci_bus_num(e_bus);
> +		pci_pt_dev.guest.devfn = PCI_DEVFN(e_device, r_func);

Why combine e_device and r_func? The guest devfn is e_devfn.

> +		pci_pt_dev.host.busnr  = pci_dev->h_busnr;
> +		pci_pt_dev.host.devfn  = pci_dev->h_devfn;
> +
> +		/* We'll set the value of the guest irq as and when
> +		 * the piix config gets updated. See pci_pt_update_irq.
> +		 * The host irq field never gets used anyway
> +		 */
> +
> +		rc = kvm_update_pci_pt_device(kvm_context, &pci_pt_dev);
> +		if (rc < 0) {
> +			fprintf(stderr, "Could not notify kernel about "
> +				"passthrough device\n");
> +			perror("pt-ioctl");
> +			return NULL;
> +		}
> +	}
> +#endif
> +
> +	fprintf(logfile, "Registered host PCI device %02x:%02x.%1x "
> +		"as guest device %02x:%02x.%1x\n",
> +		r_bus, r_dev, r_func,
> +		pci_bus_num(e_bus), e_device, r_func);

Second "r_func" should replaced by e_devfn & 0x7.

Randy (Weidong)


^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [PATCH 1/2] KVM/userspace: Support for assigning PCI devices to guest
  2008-07-17  7:52                     ` [PATCH 1/2] KVM/userspace: Support for assigning PCI devices to guest Han, Weidong
@ 2008-07-22 12:28                       ` Ben-Ami Yassour
  2008-07-22 12:44                         ` [PATCH 1/2] KVM/userspace: Support for assigning PCI devicesto guest Han, Weidong
  0 siblings, 1 reply; 42+ messages in thread
From: Ben-Ami Yassour @ 2008-07-22 12:28 UTC (permalink / raw)
  To: Han, Weidong; +Cc: amit.shah, kvm, Muli Ben-Yehuda, anthony, Or Sagi

On Thu, 2008-07-17 at 15:52 +0800, Han, Weidong wrote:
> Ben-Ami Yassour wrote:
> > From: Or Sagi <ors@tutis.com>
> > 
> > From: Nir Peleg <nir@tutis.com>
> > From: Amit Shah <amit.shah@qumranet.com>
> > From: Glauber de Oliveira Costa <gcosta@redhat.com>
> > 
> > We can assign a device from the host machine to a guest. The
> > original code comes from Neocleus.
> > 
> > A new command-line option, -pcidevice is added.
> > For example, to invoke it for an Ethernet device sitting at
> > PCI bus:dev.fn 04:08.0 with host IRQ 18, use this:
> > 
> >         -pcidevice Ethernet/04:08.0
> > 
> > The host ethernet driver is to be removed before doing the
> > passthrough. If not, the device assignment fails but the
> > guest continues without the assignment.
> > 
> > If kvm uses the in-kernel irqchip, interrupts are routed to the
> > guest via the kvm module (accompanied kernel changes are necessary).
> > 
> > If -no-kvm-irqchip is used, the 'irqhook' module, available
> > separately, is to be used for interrupt injection into the guest. In
> > this case, an extra parameter, -<intr-number> is to be appended to
> > the above-mentioned pcidevice parameter.
> > 
> > Signed-off-by: Amit Shah <amit.shah@qumranet.com>
> > ---
> > +static pt_dev_t *register_real_device(PCIBus *e_bus, const char
> > *e_dev_name, +				      int e_devfn,
> uint8_t r_bus, uint8_t r_dev,
> > +				      uint8_t r_func)
> > +{
> > +	int rc;
> > +	pt_dev_t *pci_dev;
> > +	uint8_t e_device, e_intx;
> > +
> > +	DEBUG("register_real_device: Registering real physical "
> > +	      "device %s (devfn=0x%x)\n", e_dev_name, e_devfn);
> > +
> > +	pci_dev = (pt_dev_t *) pci_register_device(e_bus, e_dev_name,
> > +						   sizeof(pt_dev_t),
> e_devfn,
> > +						   pt_pci_read_config,
> > +						   pt_pci_write_config);
> > +
> > +	if (NULL == pci_dev) {
> > +		fprintf(stderr, "register_real_device: Error: Couldn't "
> > +			"register real device %s\n", e_dev_name);
> > +		return NULL;
> > +	}
> > +	if (pt_get_real_device(pci_dev, r_bus, r_dev, r_func)) {
> > +		fprintf(stderr, "register_real_device: Error: Couldn't
> get "
> > +			"real device (%s)!\n", e_dev_name);
> > +		return NULL;
> > +	}
> > +
> > +	/* handle real device's MMIO/PIO BARs */
> > +	if (pt_register_regions(pci_dev->real_device.regions,
> > +				pci_dev->real_device.region_number,
> pci_dev))
> > +		return NULL;
> > +
> > +	/* handle interrupt routing */
> > +	e_device = (pci_dev->dev.devfn >> 3) & 0x1f;
> > +	e_intx = pci_dev->dev.config[0x3d] - 1;
> > +	pci_dev->intpin = e_intx;
> > +	pci_dev->run = 0;
> > +	pci_dev->girq = 0;
> > +	pci_dev->h_busnr = r_bus;
> > +	pci_dev->h_devfn = PCI_DEVFN(r_dev, r_func);
> > +
> > +#ifdef KVM_CAP_PCI_PASSTHROUGH
> > +	if (kvm_enabled()) {
> > +		struct kvm_pci_passthrough_dev pci_pt_dev;
> > +
> > +		memset(&pci_pt_dev, 0, sizeof(pci_pt_dev));
> > +		pci_pt_dev.guest.busnr = pci_bus_num(e_bus);
> > +		pci_pt_dev.guest.devfn = PCI_DEVFN(e_device, r_func);
> 
> Why combine e_device and r_func? The guest devfn is e_devfn.
the current assumption is that e_func and r_func are identical.

> 
> > +		pci_pt_dev.host.busnr  = pci_dev->h_busnr;
> > +		pci_pt_dev.host.devfn  = pci_dev->h_devfn;
> > +
> > +		/* We'll set the value of the guest irq as and when
> > +		 * the piix config gets updated. See pci_pt_update_irq.
> > +		 * The host irq field never gets used anyway
> > +		 */
> > +
> > +		rc = kvm_update_pci_pt_device(kvm_context, &pci_pt_dev);
> > +		if (rc < 0) {
> > +			fprintf(stderr, "Could not notify kernel about "
> > +				"passthrough device\n");
> > +			perror("pt-ioctl");
> > +			return NULL;
> > +		}
> > +	}
> > +#endif
> > +
> > +	fprintf(logfile, "Registered host PCI device %02x:%02x.%1x "
> > +		"as guest device %02x:%02x.%1x\n",
> > +		r_bus, r_dev, r_func,
> > +		pci_bus_num(e_bus), e_device, r_func);
> 
> Second "r_func" should replaced by e_devfn & 0x7.
same as above.

> 
> Randy (Weidong)
> 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [PATCH 1/2] KVM/userspace: Support for assigning PCI devicesto guest
  2008-07-22 12:28                       ` Ben-Ami Yassour
@ 2008-07-22 12:44                         ` Han, Weidong
  0 siblings, 0 replies; 42+ messages in thread
From: Han, Weidong @ 2008-07-22 12:44 UTC (permalink / raw)
  To: Ben-Ami Yassour; +Cc: amit.shah, kvm, Muli Ben-Yehuda, anthony, Or Sagi

Ben-Ami Yassour wrote:
> On Thu, 2008-07-17 at 15:52 +0800, Han, Weidong wrote:
>> Ben-Ami Yassour wrote:
>>> From: Or Sagi <ors@tutis.com>
>>> 
>>> From: Nir Peleg <nir@tutis.com>
>>> From: Amit Shah <amit.shah@qumranet.com>
>>> From: Glauber de Oliveira Costa <gcosta@redhat.com>
>>> 
>>> We can assign a device from the host machine to a guest. The
>>> original code comes from Neocleus.
>>> 
>>> A new command-line option, -pcidevice is added.
>>> For example, to invoke it for an Ethernet device sitting at
>>> PCI bus:dev.fn 04:08.0 with host IRQ 18, use this:
>>> 
>>>         -pcidevice Ethernet/04:08.0
>>> 
>>> The host ethernet driver is to be removed before doing the
>>> passthrough. If not, the device assignment fails but the
>>> guest continues without the assignment.
>>> 
>>> If kvm uses the in-kernel irqchip, interrupts are routed to the
>>> guest via the kvm module (accompanied kernel changes are necessary).
>>> 
>>> If -no-kvm-irqchip is used, the 'irqhook' module, available
>>> separately, is to be used for interrupt injection into the guest. In
>>> this case, an extra parameter, -<intr-number> is to be appended to
>>> the above-mentioned pcidevice parameter.
>>> 
>>> Signed-off-by: Amit Shah <amit.shah@qumranet.com>
>>> ---
>>> +static pt_dev_t *register_real_device(PCIBus *e_bus, const char
>>> *e_dev_name, +				      int e_devfn,
>> uint8_t r_bus, uint8_t r_dev,
>>> +				      uint8_t r_func)
>>> +{
>>> +	int rc;
>>> +	pt_dev_t *pci_dev;
>>> +	uint8_t e_device, e_intx;
>>> +
>>> +	DEBUG("register_real_device: Registering real physical "
>>> +	      "device %s (devfn=0x%x)\n", e_dev_name, e_devfn); +
>>> +	pci_dev = (pt_dev_t *) pci_register_device(e_bus, e_dev_name,
>>> +						   sizeof(pt_dev_t),
>> e_devfn,
>>> +						   pt_pci_read_config,
>>> +						   pt_pci_write_config);
>>> +
>>> +	if (NULL == pci_dev) {
>>> +		fprintf(stderr, "register_real_device: Error: Couldn't "
>>> +			"register real device %s\n", e_dev_name);
>>> +		return NULL;
>>> +	}
>>> +	if (pt_get_real_device(pci_dev, r_bus, r_dev, r_func)) {
>>> +		fprintf(stderr, "register_real_device: Error: Couldn't
get "
>>> +			"real device (%s)!\n", e_dev_name);
>>> +		return NULL;
>>> +	}
>>> +
>>> +	/* handle real device's MMIO/PIO BARs */
>>> +	if (pt_register_regions(pci_dev->real_device.regions,
>>> +				pci_dev->real_device.region_number,
pci_dev))
>>> +		return NULL;
>>> +
>>> +	/* handle interrupt routing */
>>> +	e_device = (pci_dev->dev.devfn >> 3) & 0x1f;
>>> +	e_intx = pci_dev->dev.config[0x3d] - 1;
>>> +	pci_dev->intpin = e_intx;
>>> +	pci_dev->run = 0;
>>> +	pci_dev->girq = 0;
>>> +	pci_dev->h_busnr = r_bus;
>>> +	pci_dev->h_devfn = PCI_DEVFN(r_dev, r_func);
>>> +
>>> +#ifdef KVM_CAP_PCI_PASSTHROUGH
>>> +	if (kvm_enabled()) {
>>> +		struct kvm_pci_passthrough_dev pci_pt_dev;
>>> +
>>> +		memset(&pci_pt_dev, 0, sizeof(pci_pt_dev));
>>> +		pci_pt_dev.guest.busnr = pci_bus_num(e_bus);
>>> +		pci_pt_dev.guest.devfn = PCI_DEVFN(e_device, r_func);
>> 
>> Why combine e_device and r_func? The guest devfn is e_devfn.
> the current assumption is that e_func and r_func are identical.

But you can find they are not the same in guest. Why do we need such
assumption?

Randy (Weidong)

> 
>> 
>>> +		pci_pt_dev.host.busnr  = pci_dev->h_busnr;
>>> +		pci_pt_dev.host.devfn  = pci_dev->h_devfn;
>>> +
>>> +		/* We'll set the value of the guest irq as and when
>>> +		 * the piix config gets updated. See pci_pt_update_irq.
>>> +		 * The host irq field never gets used anyway
>>> +		 */
>>> +
>>> +		rc = kvm_update_pci_pt_device(kvm_context, &pci_pt_dev);
+		if
>>> (rc < 0) { +			fprintf(stderr, "Could not
notify kernel about "
>>> +				"passthrough device\n"); +
perror("pt-ioctl");
>>> +			return NULL;
>>> +		}
>>> +	}
>>> +#endif
>>> +
>>> +	fprintf(logfile, "Registered host PCI device %02x:%02x.%1x "
>>> +		"as guest device %02x:%02x.%1x\n",
>>> +		r_bus, r_dev, r_func,
>>> +		pci_bus_num(e_bus), e_device, r_func);
>> 
>> Second "r_func" should replaced by e_devfn & 0x7. same as above.
> 
>> 
>> Randy (Weidong)


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 5/8] KVM: PCIPT: change order of device release
  2008-07-16 13:17         ` [PATCH 5/8] KVM: PCIPT: change order of device release Ben-Ami Yassour
  2008-07-16 13:17           ` [PATCH 6/8] VT-d: changes to support KVM Ben-Ami Yassour
@ 2008-07-16 15:06           ` Avi Kivity
  1 sibling, 0 replies; 42+ messages in thread
From: Avi Kivity @ 2008-07-16 15:06 UTC (permalink / raw)
  To: Ben-Ami Yassour; +Cc: amit.shah, kvm, muli, weidong.han, anthony

Ben-Ami Yassour wrote:


Fold into parent patch please.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/8] KVM: PCIPT: fix interrupt handling
  2008-07-16 13:17       ` [PATCH 4/8] KVM: PCIPT: fix interrupt handling Ben-Ami Yassour
  2008-07-16 13:17         ` [PATCH 5/8] KVM: PCIPT: change order of device release Ben-Ami Yassour
@ 2008-07-16 15:06         ` Avi Kivity
  2008-07-23 13:37         ` Amit Shah
  2 siblings, 0 replies; 42+ messages in thread
From: Avi Kivity @ 2008-07-16 15:06 UTC (permalink / raw)
  To: Ben-Ami Yassour; +Cc: amit.shah, kvm, muli, weidong.han, anthony

Ben-Ami Yassour wrote:
> This patch fixes a few problems with the interrupt handling for
> passthrough devices.
>
>   

Well, fold it into the patch it fixes.  There is no point in sending a 
buggy patch and a fix in the same patchset.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/8] KVM: PCIPT: fix interrupt handling
  2008-07-16 13:17       ` [PATCH 4/8] KVM: PCIPT: fix interrupt handling Ben-Ami Yassour
  2008-07-16 13:17         ` [PATCH 5/8] KVM: PCIPT: change order of device release Ben-Ami Yassour
  2008-07-16 15:06         ` [PATCH 4/8] KVM: PCIPT: fix interrupt handling Avi Kivity
@ 2008-07-23 13:37         ` Amit Shah
  2008-07-24 11:28           ` Ben-Ami Yassour
  2 siblings, 1 reply; 42+ messages in thread
From: Amit Shah @ 2008-07-23 13:37 UTC (permalink / raw)
  To: Ben-Ami Yassour; +Cc: kvm, muli, weidong.han, anthony

* On Wednesday 16 Jul 2008 18:47:01 Ben-Ami Yassour wrote:
> This patch fixes a few problems with the interrupt handling for
> passthrough devices.
>
> 1. Pass the interrupt handler the pointer to the device, so we do not
> need to lock the pcipt lock in the interrupt handler.
>
> 2. Remove the pt_irq_handled bitmap - it is no longer needed.
>
> 3. Split kvm_pci_pt_work_fn into two functions, one for interrupt
> injection and another for the ack - is much simpler code this way.
>
> 4. Change the passthrough initialization order - add the device
> structure to the list, before registering the interrupt handler.
>
> 5. On passthrough destruction path, free the interrupt handler before
> cleaning queued work.
>
> Signed-off-by: Ben-Ami Yassour <benami@il.ibm.com>
> ---

>  	if (irqchip_in_kernel(kvm)) {
> +		match->pt_dev.guest.irq = pci_pt_dev->guest.irq;
> +		match->pt_dev.host.irq = dev->irq;
> +		if (kvm->arch.vioapic)
> +			kvm->arch.vioapic->ack_notifier = kvm_pci_pt_ack_irq;
> +		if (kvm->arch.vpic)
> +			kvm->arch.vpic->ack_notifier = kvm_pci_pt_ack_irq;
> +

We shouldn't register this notifier unless we get the irq below to avoid 
unneeded function calls and checks.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/8] KVM: PCIPT: fix interrupt handling
  2008-07-23 13:37         ` Amit Shah
@ 2008-07-24 11:28           ` Ben-Ami Yassour
  2008-07-24 13:31             ` Amit Shah
  0 siblings, 1 reply; 42+ messages in thread
From: Ben-Ami Yassour @ 2008-07-24 11:28 UTC (permalink / raw)
  To: Amit Shah; +Cc: kvm, Muli Ben-Yehuda, weidong.han, anthony

On Wed, 2008-07-23 at 19:07 +0530, Amit Shah wrote:
> * On Wednesday 16 Jul 2008 18:47:01 Ben-Ami Yassour wrote:
> > This patch fixes a few problems with the interrupt handling for
> > passthrough devices.
> >
> > 1. Pass the interrupt handler the pointer to the device, so we do not
> > need to lock the pcipt lock in the interrupt handler.
> >
> > 2. Remove the pt_irq_handled bitmap - it is no longer needed.
> >
> > 3. Split kvm_pci_pt_work_fn into two functions, one for interrupt
> > injection and another for the ack - is much simpler code this way.
> >
> > 4. Change the passthrough initialization order - add the device
> > structure to the list, before registering the interrupt handler.
> >
> > 5. On passthrough destruction path, free the interrupt handler before
> > cleaning queued work.
> >
> > Signed-off-by: Ben-Ami Yassour <benami@il.ibm.com>
> > ---
> 
> >  	if (irqchip_in_kernel(kvm)) {
> > +		match->pt_dev.guest.irq = pci_pt_dev->guest.irq;
> > +		match->pt_dev.host.irq = dev->irq;
> > +		if (kvm->arch.vioapic)
> > +			kvm->arch.vioapic->ack_notifier = kvm_pci_pt_ack_irq;
> > +		if (kvm->arch.vpic)
> > +			kvm->arch.vpic->ack_notifier = kvm_pci_pt_ack_irq;
> > +
> 
> We shouldn't register this notifier unless we get the irq below to avoid 
> unneeded function calls and checks.

Note: This code was changed in the last version of the code but the
comment is still relevant.

Do you mean that we need to postpone registering the notification?
In the case of an assigned device this is means the we postpone it for a
few seconds, and implementing it like above it cleaner. So I don't see
the real value in postponing it.

Thanks,
Ben



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/8] KVM: PCIPT: fix interrupt handling
  2008-07-24 11:28           ` Ben-Ami Yassour
@ 2008-07-24 13:31             ` Amit Shah
  2008-07-24 14:24               ` Ben-Ami Yassour
  0 siblings, 1 reply; 42+ messages in thread
From: Amit Shah @ 2008-07-24 13:31 UTC (permalink / raw)
  To: Ben-Ami Yassour; +Cc: kvm, Muli Ben-Yehuda, weidong.han, anthony

* On Thursday 24 Jul 2008 16:58:57 Ben-Ami Yassour wrote:
> On Wed, 2008-07-23 at 19:07 +0530, Amit Shah wrote:
> > * On Wednesday 16 Jul 2008 18:47:01 Ben-Ami Yassour wrote:
> > > This patch fixes a few problems with the interrupt handling for
> > > passthrough devices.
> > >
> > > 1. Pass the interrupt handler the pointer to the device, so we do not
> > > need to lock the pcipt lock in the interrupt handler.
> > >
> > > 2. Remove the pt_irq_handled bitmap - it is no longer needed.
> > >
> > > 3. Split kvm_pci_pt_work_fn into two functions, one for interrupt
> > > injection and another for the ack - is much simpler code this way.
> > >
> > > 4. Change the passthrough initialization order - add the device
> > > structure to the list, before registering the interrupt handler.
> > >
> > > 5. On passthrough destruction path, free the interrupt handler before
> > > cleaning queued work.
> > >
> > > Signed-off-by: Ben-Ami Yassour <benami@il.ibm.com>
> > > ---
> > >
> > >  	if (irqchip_in_kernel(kvm)) {
> > > +		match->pt_dev.guest.irq = pci_pt_dev->guest.irq;
> > > +		match->pt_dev.host.irq = dev->irq;
> > > +		if (kvm->arch.vioapic)
> > > +			kvm->arch.vioapic->ack_notifier = kvm_pci_pt_ack_irq;
> > > +		if (kvm->arch.vpic)
> > > +			kvm->arch.vpic->ack_notifier = kvm_pci_pt_ack_irq;
> > > +
> >
> > We shouldn't register this notifier unless we get the irq below to avoid
> > unneeded function calls and checks.
>
> Note: This code was changed in the last version of the code but the
> comment is still relevant.
>
> Do you mean that we need to postpone registering the notification?

I mean we can register these function pointers after the request_irq succeeds.

> In the case of an assigned device this is means the we postpone it for a
> few seconds, and implementing it like above it cleaner. So I don't see
> the real value in postponing it.

Sorry, don't get what you mean here.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/8] KVM: PCIPT: fix interrupt handling
  2008-07-24 13:31             ` Amit Shah
@ 2008-07-24 14:24               ` Ben-Ami Yassour
  0 siblings, 0 replies; 42+ messages in thread
From: Ben-Ami Yassour @ 2008-07-24 14:24 UTC (permalink / raw)
  To: Amit Shah; +Cc: kvm, Muli Ben-Yehuda, weidong.han, anthony

On Thu, 2008-07-24 at 19:01 +0530, Amit Shah wrote:
> * On Thursday 24 Jul 2008 16:58:57 Ben-Ami Yassour wrote:
> > On Wed, 2008-07-23 at 19:07 +0530, Amit Shah wrote:
> > > * On Wednesday 16 Jul 2008 18:47:01 Ben-Ami Yassour wrote:
> > > >
> > > >  	if (irqchip_in_kernel(kvm)) {
> > > > +		match->pt_dev.guest.irq = pci_pt_dev->guest.irq;
> > > > +		match->pt_dev.host.irq = dev->irq;
> > > > +		if (kvm->arch.vioapic)
> > > > +			kvm->arch.vioapic->ack_notifier = kvm_pci_pt_ack_irq;
> > > > +		if (kvm->arch.vpic)
> > > > +			kvm->arch.vpic->ack_notifier = kvm_pci_pt_ack_irq;
> > > > +
> > >
> > > We shouldn't register this notifier unless we get the irq below to avoid
> > > unneeded function calls and checks.
> >
> > Note: This code was changed in the last version of the code but the
> > comment is still relevant.
> >
> > Do you mean that we need to postpone registering the notification?
> 
> I mean we can register these function pointers after the request_irq succeeds.

request_irq should be the last initialization operation, since every
thing should be ready when in case an interrupt is received

> 
> > In the case of an assigned device this is means the we postpone it for a
> > few seconds, and implementing it like above it cleaner. So I don't see
> > the real value in postponing it.
> 
> Sorry, don't get what you mean here.
never mind... see the answer is above.



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 3/8] KVM: Handle device assignment to guests
  2008-07-16 13:17     ` [PATCH 3/8] KVM: Handle device assignment to guests Ben-Ami Yassour
  2008-07-16 13:17       ` [PATCH 4/8] KVM: PCIPT: fix interrupt handling Ben-Ami Yassour
@ 2008-07-16 15:04       ` Avi Kivity
  2008-07-17  2:09         ` Han, Weidong
  2008-07-17  9:34         ` Ben-Ami Yassour
  1 sibling, 2 replies; 42+ messages in thread
From: Avi Kivity @ 2008-07-16 15:04 UTC (permalink / raw)
  To: Ben-Ami Yassour; +Cc: amit.shah, kvm, muli, weidong.han, anthony

Ben-Ami Yassour wrote:
> From: Han, Weidong <weidong.han@intel.com>
>
> This patch adds support for handling PCI devices that are assigned to
> the guest ("PCI passthrough").
> +
> +/*
> + * Used to find a registered host PCI device (a "passthrough" device)
> + * during ioctls, interrupts or EOI
> + */
> +struct kvm_pci_pt_dev_list *
> +kvm_find_pci_pt_dev(struct list_head *head,
> +		struct kvm_pci_pt_info *pt_pci_info, int irq, int source)
> +{
> +	struct list_head *ptr;
> +	struct kvm_pci_pt_dev_list *match;
> +
> +	list_for_each(ptr, head) {
> +		match = list_entry(ptr, struct kvm_pci_pt_dev_list, list);
> +
> +		switch (source) {
> +		case KVM_PT_SOURCE_IRQ:
> +			/*
> +			 * Used to find a registered host device
> +			 * during interrupt context on host
> +			 */
> +			if (match->pt_dev.host.irq == irq)
> +				return match;
> +			break;
> +		case KVM_PT_SOURCE_IRQ_ACK:
> +			/*
> +			 * Used to find a registered host device when
> +			 * the guest acks an interrupt
> +			 */
> +			if (match->pt_dev.guest.irq == irq)
> +				return match;
> +			break;
> +		case KVM_PT_SOURCE_UPDATE:
> +			if ((match->pt_dev.host.busnr == pt_pci_info->busnr) &&
> +			    (match->pt_dev.host.devfn == pt_pci_info->devfn))
> +				return match;
> +			break;
> +		}
> +	}
> +	return NULL;
> +}
>   

This monster is best split into three functions each handling a separate 
case, without the 'source' argument.

> +static void kvm_pci_pt_work_fn(struct work_struct *work)
> +{
> +	struct kvm_pci_pt_dev_list *match;
> +	struct kvm_pci_pt_work *int_work;
> +	int source;
> +	unsigned long flags;
> +	int guest_irq;
> +	int host_irq;
> +
> +	int_work = container_of(work, struct kvm_pci_pt_work, work);
> +
> +	source = int_work->source ? KVM_PT_SOURCE_IRQ_ACK : KVM_PT_SOURCE_IRQ;
> +
> +	/* This is taken to safely inject irq inside the guest. When
> +	 * the interrupt injection (or the ioapic code) uses a
> +	 * finer-grained lock, update this
> +	 */
> +	mutex_lock(&int_work->kvm->lock);
> +	read_lock_irqsave(&kvm_pci_pt_lock, flags);
> +	match = kvm_find_pci_pt_dev(&int_work->kvm->arch.pci_pt_dev_head, NULL,
> +				    int_work->irq, source);
> +	if (!match) {
> +		printk(KERN_ERR "%s: no matching device assigned to guest "
> +		       "found for irq %d, source = %d!\n",
> +		       __func__, int_work->irq, int_work->source);
> +		read_unlock_irqrestore(&kvm_pci_pt_lock, flags);
> +		goto out;
> +	}
> +	guest_irq = match->pt_dev.guest.irq;
> +	host_irq = match->pt_dev.host.irq;
> +	read_unlock_irqrestore(&kvm_pci_pt_lock, flags);
> +
> +	if (source == KVM_PT_SOURCE_IRQ)
> +		kvm_set_irq(int_work->kvm, guest_irq, 1);
> +	else {
> +		kvm_set_irq(int_work->kvm, int_work->irq, 0);
> +		enable_irq(host_irq);
> +	}
> +out:
> +	mutex_unlock(&int_work->kvm->lock);
> +	kvm_put_kvm(int_work->kvm);
> +}
>   
> +
> +/* FIXME: Implement the OR logic needed to make shared interrupts on
> + * this line behave properly
> + */
>   

Isn't this a showstopper?  There is no easy way for a user to avoid 
sharing, especially as we have only three pci irqs at present.

> +static irqreturn_t kvm_pci_pt_dev_intr(int irq, void *dev_id)
> +{
> +	struct kvm *kvm = (struct kvm *) dev_id;
> +	struct kvm_pci_pt_dev_list *pci_pt_dev;
> +
> +	if (!test_bit(irq, pt_irq_handled))
> +		return IRQ_NONE;
> +
> +	read_lock(&kvm_pci_pt_lock);
> +	pci_pt_dev = kvm_find_pci_pt_dev(&kvm->arch.pci_pt_dev_head, NULL,
> +					 irq, KVM_PT_SOURCE_IRQ);
> +	if (!pci_pt_dev) {
> +		read_unlock(&kvm_pci_pt_lock);
> +		return IRQ_NONE;
> +	}
>   


I see we don't reuse the result of the search.  I guess we can't, since 
the list may change between the interrupt and the execution of the work 
function.

> +
> +	pci_pt_dev->pt_dev.int_work.irq = irq;
> +	pci_pt_dev->pt_dev.int_work.kvm = kvm;
> +	pci_pt_dev->pt_dev.int_work.source = 0;
> +
>   

For a bool, use false, not 0.  But 'source' isn't really a good name for 
a boolean.  Perhaps 'is_ack'?

> +
> +/* Ack the irq line for a passthrough device */
> +static void kvm_pci_pt_ack_irq(void *opaque, int irq)
> +{
> +	struct kvm *kvm = opaque;
> +	struct kvm_pci_pt_dev_list *pci_pt_dev;
> +	unsigned long flags;
> +
> +	if (irq == -1)
> +		return;
> +
> +	read_lock_irqsave(&kvm_pci_pt_lock, flags);
> +	pci_pt_dev = kvm_find_pci_pt_dev(&kvm->arch.pci_pt_dev_head, NULL, irq,
> +					 KVM_PT_SOURCE_IRQ_ACK);
> +	if (!pci_pt_dev) {
> +		read_unlock_irqrestore(&kvm_pci_pt_lock, flags);
> +		return;
> +	}
> +	pci_pt_dev->pt_dev.ack_work.irq = irq;
> +	pci_pt_dev->pt_dev.ack_work.kvm = kvm;
> +	pci_pt_dev->pt_dev.ack_work.source = 1;
> +
> +	kvm_get_kvm(kvm);
> +	schedule_work(&pci_pt_dev->pt_dev.ack_work.work);
> +	read_unlock_irqrestore(&kvm_pci_pt_lock, flags);
> +}
>   

Why do we have to schedule this?  We're already in process context, no?

Is this to avoid recursion into the pic/ioapic code? If so, we can use a 
queue to avoid this.  That's not a merge blocker, however, so it can be 
done later.

> diff --git a/include/asm-x86/kvm_para.h b/include/asm-x86/kvm_para.h
> index 76f3921..b6c5d00 100644
> --- a/include/asm-x86/kvm_para.h
> +++ b/include/asm-x86/kvm_para.h
> @@ -142,6 +142,20 @@ static inline unsigned int kvm_arch_para_features(void)
>  	return cpuid_eax(KVM_CPUID_FEATURES);
>  }
>  
> -#endif
> +#endif /* KERNEL */
>   

Move the following to linux/kvm.h.  This used to be oriented at 
paravirt, but is really a host->kernel communication API.

>  
> +/* Stores information for identifying host PCI devices assigned to the
> + * guest: this is used in the host kernel and in the userspace.
> + */
> +struct kvm_pci_pt_info {
> +	unsigned char busnr;
> +	unsigned int devfn;
> +	__u32 irq;
> +};
>   

Badly aligned.  Please use __u32 for all fields, and add a __u32 padding 
field at the end so we have the same ABI for i386 and x86_64.

Some pci devices support more than one irq.  Should we make irq an 
array?  Alternatively, decouple irq assignment from pci information and 
let userspace handle everything.  This lets us assign non-pci devices 
(like serial ports, etc.).

Also, I don't like the term "passthrough", and very much dislike "pt", 
especially in public interfaces.  It's simply too generic.  Can we 
switch to "device assignment"?

>   * ioctls for vcpu fds
> diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
> index 8ce93c7..0b5bc40 100644
> --- a/virt/kvm/ioapic.c
> +++ b/virt/kvm/ioapic.c
> @@ -288,13 +288,22 @@ void kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level)
>  static void __kvm_ioapic_update_eoi(struct kvm_ioapic *ioapic, int gsi)
>  {
>  	union ioapic_redir_entry *ent;
> +	struct kvm_pci_pt_dev_list *match;
> +	unsigned long flags;
>  
>  	ent = &ioapic->redirtbl[gsi];
>  	ASSERT(ent->fields.trig_mode == IOAPIC_LEVEL_TRIG);
>  
>  	ent->fields.remote_irr = 0;
> -	if (!ent->fields.mask && (ioapic->irr & (1 << gsi)))
> -		ioapic_service(ioapic, gsi);
> +
> +	read_lock_irqsave(&kvm_pci_pt_lock, flags);
> +	match = kvm_find_pci_pt_dev(&ioapic->kvm->arch.pci_pt_dev_head, NULL,
> +				    gsi, KVM_PT_SOURCE_IRQ_ACK);
> +	read_unlock_irqrestore(&kvm_pci_pt_lock, flags);
> +	if (!match) {
> +		if (!ent->fields.mask && (ioapic->irr & (1 << gsi)))
> +			ioapic_service(ioapic, gsi);
> +	}
>   

What's device assignment code doing in the ioapic?  This should be done 
through the notifier (if it needs to return a value, great).

>  
>  	if (ioapic->ack_notifier)
>  		ioapic->ack_notifier(ioapic->kvm, gsi);
>   


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [PATCH 3/8] KVM: Handle device assignment to guests
  2008-07-16 15:04       ` [PATCH 3/8] KVM: Handle device assignment to guests Avi Kivity
@ 2008-07-17  2:09         ` Han, Weidong
  2008-07-17  2:29           ` Yang, Sheng
  2008-07-17  6:02           ` Avi Kivity
  2008-07-17  9:34         ` Ben-Ami Yassour
  1 sibling, 2 replies; 42+ messages in thread
From: Han, Weidong @ 2008-07-17  2:09 UTC (permalink / raw)
  To: Avi Kivity, Ben-Ami Yassour; +Cc: amit.shah, kvm, muli, anthony

Avi Kivity wrote:
>> +static void kvm_pci_pt_work_fn(struct work_struct *work) +{
>> +	struct kvm_pci_pt_dev_list *match;
>> +	struct kvm_pci_pt_work *int_work;
>> +	int source;
>> +	unsigned long flags;
>> +	int guest_irq;
>> +	int host_irq;
>> +
>> +	int_work = container_of(work, struct kvm_pci_pt_work, work); +
>> +	source = int_work->source ? KVM_PT_SOURCE_IRQ_ACK :
>> KVM_PT_SOURCE_IRQ; + +	/* This is taken to safely inject irq
inside
>> the guest. When +	 * the interrupt injection (or the ioapic code)
>> uses a +	 * finer-grained lock, update this
>> +	 */
>> +	mutex_lock(&int_work->kvm->lock);
>> +	read_lock_irqsave(&kvm_pci_pt_lock, flags);
>> +	match =
kvm_find_pci_pt_dev(&int_work->kvm->arch.pci_pt_dev_head,
>> NULL, +				    int_work->irq, source);
>> +	if (!match) {
>> +		printk(KERN_ERR "%s: no matching device assigned to
guest "
>> +		       "found for irq %d, source = %d!\n",
>> +		       __func__, int_work->irq, int_work->source);
>> +		read_unlock_irqrestore(&kvm_pci_pt_lock, flags); +
goto out;
>> +	}
>> +	guest_irq = match->pt_dev.guest.irq;
>> +	host_irq = match->pt_dev.host.irq;
>> +	read_unlock_irqrestore(&kvm_pci_pt_lock, flags);
>> +
>> +	if (source == KVM_PT_SOURCE_IRQ)
>> +		kvm_set_irq(int_work->kvm, guest_irq, 1);
>> +	else {
>> +		kvm_set_irq(int_work->kvm, int_work->irq, 0);
>> +		enable_irq(host_irq);
>> +	}
>> +out:
>> +	mutex_unlock(&int_work->kvm->lock);
>> +	kvm_put_kvm(int_work->kvm);
>> +}
>> 
>> +
>> +/* FIXME: Implement the OR logic needed to make shared interrupts
>> on + * this line behave properly + */
>> 
> 
> Isn't this a showstopper?  There is no easy way for a user to avoid
> sharing, especially as we have only three pci irqs at present.
> 

Currently it's not easy to avoid sharing. I think we can support MSI for
assgined device to solve sharing problem. 

Randy (Weidong)


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 3/8] KVM: Handle device assignment to guests
  2008-07-17  2:09         ` Han, Weidong
@ 2008-07-17  2:29           ` Yang, Sheng
  2008-07-17  6:02           ` Avi Kivity
  1 sibling, 0 replies; 42+ messages in thread
From: Yang, Sheng @ 2008-07-17  2:29 UTC (permalink / raw)
  To: kvm; +Cc: Han, Weidong, Avi Kivity, Ben-Ami Yassour, amit.shah, muli,
	anthony

On Thursday 17 July 2008 10:09:57 Han, Weidong wrote:
> Avi Kivity wrote:
> >> +static void kvm_pci_pt_work_fn(struct work_struct *work) +{
> >> +	struct kvm_pci_pt_dev_list *match;
> >> +	struct kvm_pci_pt_work *int_work;
> >> +	int source;
> >> +	unsigned long flags;
> >> +	int guest_irq;
> >> +	int host_irq;
> >> +
> >> +	int_work = container_of(work, struct kvm_pci_pt_work, work); +
> >> +	source = int_work->source ? KVM_PT_SOURCE_IRQ_ACK :
> >> KVM_PT_SOURCE_IRQ; + +	/* This is taken to safely inject irq
>
> inside
>
> >> the guest. When +	 * the interrupt injection (or the ioapic
> >> code) uses a +	 * finer-grained lock, update this
> >> +	 */
> >> +	mutex_lock(&int_work->kvm->lock);
> >> +	read_lock_irqsave(&kvm_pci_pt_lock, flags);
> >> +	match =
>
> kvm_find_pci_pt_dev(&int_work->kvm->arch.pci_pt_dev_head,
>
> >> NULL, +				    int_work->irq, source);
> >> +	if (!match) {
> >> +		printk(KERN_ERR "%s: no matching device assigned to
>
> guest "
>
> >> +		       "found for irq %d, source = %d!\n",
> >> +		       __func__, int_work->irq, int_work->source);
> >> +		read_unlock_irqrestore(&kvm_pci_pt_lock, flags); +
>
> goto out;
>
> >> +	}
> >> +	guest_irq = match->pt_dev.guest.irq;
> >> +	host_irq = match->pt_dev.host.irq;
> >> +	read_unlock_irqrestore(&kvm_pci_pt_lock, flags);
> >> +
> >> +	if (source == KVM_PT_SOURCE_IRQ)
> >> +		kvm_set_irq(int_work->kvm, guest_irq, 1);
> >> +	else {
> >> +		kvm_set_irq(int_work->kvm, int_work->irq, 0);
> >> +		enable_irq(host_irq);
> >> +	}
> >> +out:
> >> +	mutex_unlock(&int_work->kvm->lock);
> >> +	kvm_put_kvm(int_work->kvm);
> >> +}
> >>
> >> +
> >> +/* FIXME: Implement the OR logic needed to make shared
> >> interrupts on + * this line behave properly + */
> >
> > Isn't this a showstopper?  There is no easy way for a user to
> > avoid sharing, especially as we have only three pci irqs at
> > present.
>
> Currently it's not easy to avoid sharing. I think we can support
> MSI for assgined device to solve sharing problem.
>

I am working on the MSI support now.

-- 
regards
Yang, Sheng



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 3/8] KVM: Handle device assignment to guests
  2008-07-17  2:09         ` Han, Weidong
  2008-07-17  2:29           ` Yang, Sheng
@ 2008-07-17  6:02           ` Avi Kivity
  2008-07-17  8:23             ` Ben-Ami Yassour
  1 sibling, 1 reply; 42+ messages in thread
From: Avi Kivity @ 2008-07-17  6:02 UTC (permalink / raw)
  To: Han, Weidong; +Cc: Ben-Ami Yassour, amit.shah, kvm, muli, anthony

Han, Weidong wrote:
> Avi Kivity wrote:
>   
>>> +static void kvm_pci_pt_work_fn(struct work_struct *work) +{
>>> +	struct kvm_pci_pt_dev_list *match;
>>> +	struct kvm_pci_pt_work *int_work;
>>> +	int source;
>>> +	unsigned long flags;
>>> +	int guest_irq;
>>> +	int host_irq;
>>> +
>>> +	int_work = container_of(work, struct kvm_pci_pt_work, work); +
>>> +	source = int_work->source ? KVM_PT_SOURCE_IRQ_ACK :
>>> KVM_PT_SOURCE_IRQ; + +	/* This is taken to safely inject irq
>>>       
> inside
>   
>>> the guest. When +	 * the interrupt injection (or the ioapic code)
>>> uses a +	 * finer-grained lock, update this
>>> +	 */
>>> +	mutex_lock(&int_work->kvm->lock);
>>> +	read_lock_irqsave(&kvm_pci_pt_lock, flags);
>>> +	match =
>>>       
> kvm_find_pci_pt_dev(&int_work->kvm->arch.pci_pt_dev_head,
>   
>>> NULL, +				    int_work->irq, source);
>>> +	if (!match) {
>>> +		printk(KERN_ERR "%s: no matching device assigned to
>>>       
> guest "
>   
>>> +		       "found for irq %d, source = %d!\n",
>>> +		       __func__, int_work->irq, int_work->source);
>>> +		read_unlock_irqrestore(&kvm_pci_pt_lock, flags); +
>>>       
> goto out;
>   
>>> +	}
>>> +	guest_irq = match->pt_dev.guest.irq;
>>> +	host_irq = match->pt_dev.host.irq;
>>> +	read_unlock_irqrestore(&kvm_pci_pt_lock, flags);
>>> +
>>> +	if (source == KVM_PT_SOURCE_IRQ)
>>> +		kvm_set_irq(int_work->kvm, guest_irq, 1);
>>> +	else {
>>> +		kvm_set_irq(int_work->kvm, int_work->irq, 0);
>>> +		enable_irq(host_irq);
>>> +	}
>>> +out:
>>> +	mutex_unlock(&int_work->kvm->lock);
>>> +	kvm_put_kvm(int_work->kvm);
>>> +}
>>>
>>> +
>>> +/* FIXME: Implement the OR logic needed to make shared interrupts
>>> on + * this line behave properly + */
>>>
>>>       
>> Isn't this a showstopper?  There is no easy way for a user to avoid
>> sharing, especially as we have only three pci irqs at present.
>>
>>     
>
> Currently it's not easy to avoid sharing. I think we can support MSI for
> assgined device to solve sharing problem. 
>   

MSI is definitely the right direction, but we also need to support the 
OR logic for guests that do not support MSI.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 3/8] KVM: Handle device assignment to guests
  2008-07-17  6:02           ` Avi Kivity
@ 2008-07-17  8:23             ` Ben-Ami Yassour
  2008-07-17  8:31               ` Avi Kivity
  0 siblings, 1 reply; 42+ messages in thread
From: Ben-Ami Yassour @ 2008-07-17  8:23 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Han, Weidong, amit.shah, kvm, Muli Ben-Yehuda, anthony

On Thu, 2008-07-17 at 09:02 +0300, Avi Kivity wrote:
> Han, Weidong wrote:
> > Avi Kivity wrote:
> >   
> >>> +static void kvm_pci_pt_work_fn(struct work_struct *work) +{
> >>> +	struct kvm_pci_pt_dev_list *match;
> >>> +	struct kvm_pci_pt_work *int_work;
> >>> +	int source;
> >>> +	unsigned long flags;
> >>> +	int guest_irq;
> >>> +	int host_irq;
> >>> +
> >>> +	int_work = container_of(work, struct kvm_pci_pt_work, work); +
> >>> +	source = int_work->source ? KVM_PT_SOURCE_IRQ_ACK :
> >>> KVM_PT_SOURCE_IRQ; + +	/* This is taken to safely inject irq
> >>>       
> > inside
> >   
> >>> the guest. When +	 * the interrupt injection (or the ioapic code)
> >>> uses a +	 * finer-grained lock, update this
> >>> +	 */
> >>> +	mutex_lock(&int_work->kvm->lock);
> >>> +	read_lock_irqsave(&kvm_pci_pt_lock, flags);
> >>> +	match =
> >>>       
> > kvm_find_pci_pt_dev(&int_work->kvm->arch.pci_pt_dev_head,
> >   
> >>> NULL, +				    int_work->irq, source);
> >>> +	if (!match) {
> >>> +		printk(KERN_ERR "%s: no matching device assigned to
> >>>       
> > guest "
> >   
> >>> +		       "found for irq %d, source = %d!\n",
> >>> +		       __func__, int_work->irq, int_work->source);
> >>> +		read_unlock_irqrestore(&kvm_pci_pt_lock, flags); +
> >>>       
> > goto out;
> >   
> >>> +	}
> >>> +	guest_irq = match->pt_dev.guest.irq;
> >>> +	host_irq = match->pt_dev.host.irq;
> >>> +	read_unlock_irqrestore(&kvm_pci_pt_lock, flags);
> >>> +
> >>> +	if (source == KVM_PT_SOURCE_IRQ)
> >>> +		kvm_set_irq(int_work->kvm, guest_irq, 1);
> >>> +	else {
> >>> +		kvm_set_irq(int_work->kvm, int_work->irq, 0);
> >>> +		enable_irq(host_irq);
> >>> +	}
> >>> +out:
> >>> +	mutex_unlock(&int_work->kvm->lock);
> >>> +	kvm_put_kvm(int_work->kvm);
> >>> +}
> >>>
> >>> +
> >>> +/* FIXME: Implement the OR logic needed to make shared interrupts
> >>> on + * this line behave properly + */
> >>>
> >>>       
> >> Isn't this a showstopper?  There is no easy way for a user to avoid
> >> sharing, especially as we have only three pci irqs at present.
What do you mean by only 3 (for passthrough we use whatever interrupt
the host is using for that device)?

> >>
> >>     
> >
> > Currently it's not easy to avoid sharing. I think we can support MSI for
> > assgined device to solve sharing problem. 
> >   
> 
> MSI is definitely the right direction, but we also need to support
> the 
> OR logic for guests that do not support MSI.
> 

The main problems that I am ware of with shared interrupt are as
follows:

1. Protection
A bad guest can keep the interrupt line up, and if its shared then it
interferes with other devices that might not be assigned to this guest.

2. Performance
The current implementation disables the irq when the interrupt is
received on the host and re-enables it when the guest acks the interrupt
on the virtual ioapic.

Clearly its not nice to do that when other devices uses the same irq...

When removing the enable-disable of the irq, it causes serious
performance degradation.
The throughput goes down from 700 to 400 Mb/sec.
The number of interrupts per second on the host (from the NIC) is 130000
(compared to 22200 before).

Regardless, I think that the right way to go would be to merge the
current code (with the non-shared interrupt limitation) and then
continue to work on additional features like shared interrupts support. 
Do you agree?

Regards,
Ben



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 3/8] KVM: Handle device assignment to guests
  2008-07-17  8:23             ` Ben-Ami Yassour
@ 2008-07-17  8:31               ` Avi Kivity
  2008-07-17 18:01                 ` Ben-Ami Yassour
  0 siblings, 1 reply; 42+ messages in thread
From: Avi Kivity @ 2008-07-17  8:31 UTC (permalink / raw)
  To: Ben-Ami Yassour; +Cc: Han, Weidong, amit.shah, kvm, Muli Ben-Yehuda, anthony

Ben-Ami Yassour wrote:
>>>>> +
>>>>> +/* FIXME: Implement the OR logic needed to make shared interrupts
>>>>> on + * this line behave properly + */
>>>>>
>>>>>       
>>>>>           
>>>> Isn't this a showstopper?  There is no easy way for a user to avoid
>>>> sharing, especially as we have only three pci irqs at present.
>>>>         
> What do you mean by only 3 (for passthrough we use whatever interrupt
> the host is using for that device)?
>
>   

There are two issue:
- shared host interrupts
- shared guest interrupts

For shared host interrupts, I don't think there's a cost-effective 
solution, especially as hosts are transitioning to MSI.  But the problem 
the comment describes is shared guest interrupts, where the assigned 
device's interrupt is shared with another device (assigned or virtual).

And currently we only have three shareable interrupts: 5, 10, and 11.

See qemu's pci_set_irq() for the logic used to share interrupts.
>>>>     
>>>>         
>>> Currently it's not easy to avoid sharing. I think we can support MSI for
>>> assgined device to solve sharing problem. 
>>>   
>>>       
>> MSI is definitely the right direction, but we also need to support
>> the 
>> OR logic for guests that do not support MSI.
>>
>>     
>
> The main problems that I am ware of with shared interrupt are as
> follows:
>
> 1. Protection
> A bad guest can keep the interrupt line up, and if its shared then it
> interferes with other devices that might not be assigned to this guest.
>
> 2. Performance
> The current implementation disables the irq when the interrupt is
> received on the host and re-enables it when the guest acks the interrupt
> on the virtual ioapic.
>
> Clearly its not nice to do that when other devices uses the same irq...
>
> When removing the enable-disable of the irq, it causes serious
> performance degradation.
> The throughput goes down from 700 to 400 Mb/sec.
> The number of interrupts per second on the host (from the NIC) is 130000
> (compared to 22200 before).
>
>   

These are all shared host interrupt problems.

> Regardless, I think that the right way to go would be to merge the
> current code (with the non-shared interrupt limitation) and then
> continue to work on additional features like shared interrupts support. 
> Do you agree?
>   

The current code will be incredibly easy to break; all you need is more 
than three pci devices.  I think the sharing code can be easily added 
(as a separate patch); all that's needed is to make sure the API is good.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 3/8] KVM: Handle device assignment to guests
  2008-07-17  8:31               ` Avi Kivity
@ 2008-07-17 18:01                 ` Ben-Ami Yassour
  2008-07-17 18:07                   ` Avi Kivity
  0 siblings, 1 reply; 42+ messages in thread
From: Ben-Ami Yassour @ 2008-07-17 18:01 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Han, Weidong, amit.shah, kvm, Muli Ben-Yehuda, anthony

On Thu, 2008-07-17 at 11:31 +0300, Avi Kivity wrote:
> Ben-Ami Yassour wrote:
> >>>>> +
> >>>>> +/* FIXME: Implement the OR logic needed to make shared interrupts
> >>>>> on + * this line behave properly + */
> >>>>>
> >>>>>       
> >>>>>           
> >>>> Isn't this a showstopper?  There is no easy way for a user to avoid
> >>>> sharing, especially as we have only three pci irqs at present.
> >>>>         
> > What do you mean by only 3 (for passthrough we use whatever interrupt
> > the host is using for that device)?
> >
> >   
> 
> There are two issue:
> - shared host interrupts
> - shared guest interrupts
> 
> For shared host interrupts, I don't think there's a cost-effective 
> solution, especially as hosts are transitioning to MSI.  But the problem 
> the comment describes is shared guest interrupts, where the assigned 
> device's interrupt is shared with another device (assigned or virtual).
> 
> And currently we only have three shareable interrupts: 5, 10, and 11.
> 
> See qemu's pci_set_irq() for the logic used to share interrupts.

I think we'll leave this issue until we fix the more basic comments. In
any case, how can we make qemu configure the assigned device to share an
interrupt with another device (assigned or virtual)?

Thanks,
Ben


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 3/8] KVM: Handle device assignment to guests
  2008-07-17 18:01                 ` Ben-Ami Yassour
@ 2008-07-17 18:07                   ` Avi Kivity
  0 siblings, 0 replies; 42+ messages in thread
From: Avi Kivity @ 2008-07-17 18:07 UTC (permalink / raw)
  To: Ben-Ami Yassour; +Cc: Han, Weidong, amit.shah, kvm, Muli Ben-Yehuda, anthony

Ben-Ami Yassour wrote:
> On Thu, 2008-07-17 at 11:31 +0300, Avi Kivity wrote:
>   
>> Ben-Ami Yassour wrote:
>>     
>>>>>>> +
>>>>>>> +/* FIXME: Implement the OR logic needed to make shared interrupts
>>>>>>> on + * this line behave properly + */
>>>>>>>
>>>>>>>       
>>>>>>>           
>>>>>>>               
>>>>>> Isn't this a showstopper?  There is no easy way for a user to avoid
>>>>>> sharing, especially as we have only three pci irqs at present.
>>>>>>         
>>>>>>             
>>> What do you mean by only 3 (for passthrough we use whatever interrupt
>>> the host is using for that device)?
>>>
>>>   
>>>       
>> There are two issue:
>> - shared host interrupts
>> - shared guest interrupts
>>
>> For shared host interrupts, I don't think there's a cost-effective 
>> solution, especially as hosts are transitioning to MSI.  But the problem 
>> the comment describes is shared guest interrupts, where the assigned 
>> device's interrupt is shared with another device (assigned or virtual).
>>
>> And currently we only have three shareable interrupts: 5, 10, and 11.
>>
>> See qemu's pci_set_irq() for the logic used to share interrupts.
>>     
>
> I think we'll leave this issue until we fix the more basic comments. In
> any case, how can we make qemu configure the assigned device to share an
> interrupt with another device (assigned or virtual)?
>
>   

It's the guest's responsibility to assign irq lines, when running in 
acpi mode.  In no-acpi mode, the bios assigns lines according to an 
algorithm in bios/rombios32.c:pci_bios_init_device().

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 3/8] KVM: Handle device assignment to guests
  2008-07-16 15:04       ` [PATCH 3/8] KVM: Handle device assignment to guests Avi Kivity
  2008-07-17  2:09         ` Han, Weidong
@ 2008-07-17  9:34         ` Ben-Ami Yassour
  2008-07-17  9:50           ` Avi Kivity
  1 sibling, 1 reply; 42+ messages in thread
From: Ben-Ami Yassour @ 2008-07-17  9:34 UTC (permalink / raw)
  To: Avi Kivity; +Cc: amit.shah, kvm, Muli Ben-Yehuda, weidong.han, anthony

On Wed, 2008-07-16 at 18:04 +0300, Avi Kivity wrote:
> Ben-Ami Yassour wrote:
> > +/* Stores information for identifying host PCI devices assigned to the
> > + * guest: this is used in the host kernel and in the userspace.
> > + */
> > +struct kvm_pci_pt_info {
> > +	unsigned char busnr;
> > +	unsigned int devfn;
> > +	__u32 irq;
> > +};
> >   
> 
> Badly aligned.  Please use __u32 for all fields, and add a __u32 padding 
> field at the end so we have the same ABI for i386 and x86_64.
> 
> Some pci devices support more than one irq.  Should we make irq an 
> array?  Alternatively, decouple irq assignment from pci information
> and 
> let userspace handle everything.  This lets us assign non-pci devices 
> (like serial ports, etc.).
Can we limit the first version to single irq devices, and handle this
after the merge?

> >   * ioctls for vcpu fds
> > diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
> > index 8ce93c7..0b5bc40 100644
> > --- a/virt/kvm/ioapic.c
> > +++ b/virt/kvm/ioapic.c
> > @@ -288,13 +288,22 @@ void kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level)
> >  static void __kvm_ioapic_update_eoi(struct kvm_ioapic *ioapic, int gsi)
> >  {
> >  	union ioapic_redir_entry *ent;
> > +	struct kvm_pci_pt_dev_list *match;
> > +	unsigned long flags;
> >  
> >  	ent = &ioapic->redirtbl[gsi];
> >  	ASSERT(ent->fields.trig_mode == IOAPIC_LEVEL_TRIG);
> >  
> >  	ent->fields.remote_irr = 0;
> > -	if (!ent->fields.mask && (ioapic->irr & (1 << gsi)))
> > -		ioapic_service(ioapic, gsi);
> > +
> > +	read_lock_irqsave(&kvm_pci_pt_lock, flags);
> > +	match = kvm_find_pci_pt_dev(&ioapic->kvm->arch.pci_pt_dev_head, NULL,
> > +				    gsi, KVM_PT_SOURCE_IRQ_ACK);
> > +	read_unlock_irqrestore(&kvm_pci_pt_lock, flags);
> > +	if (!match) {
> > +		if (!ent->fields.mask && (ioapic->irr & (1 << gsi)))
> > +			ioapic_service(ioapic, gsi);
> > +	}
> >   
> 
> What's device assignment code doing in the ioapic?  This should be done 
> through the notifier (if it needs to return a value, great).
> 

This purpose of this code is to avoid the call to ioapic_service for
assigned devices, because it is not required and causes a double
injection of the interrupt.

Can you please explain why as part of eoi we inject a new interrupt?

Thanks,
Ben


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 3/8] KVM: Handle device assignment to guests
  2008-07-17  9:34         ` Ben-Ami Yassour
@ 2008-07-17  9:50           ` Avi Kivity
  2008-07-17 17:40             ` Ben-Ami Yassour
  0 siblings, 1 reply; 42+ messages in thread
From: Avi Kivity @ 2008-07-17  9:50 UTC (permalink / raw)
  To: Ben-Ami Yassour; +Cc: amit.shah, kvm, Muli Ben-Yehuda, weidong.han, anthony

Ben-Ami Yassour wrote:
> On Wed, 2008-07-16 at 18:04 +0300, Avi Kivity wrote:
>   
>> Ben-Ami Yassour wrote:
>>     
>>> +/* Stores information for identifying host PCI devices assigned to the
>>> + * guest: this is used in the host kernel and in the userspace.
>>> + */
>>> +struct kvm_pci_pt_info {
>>> +	unsigned char busnr;
>>> +	unsigned int devfn;
>>> +	__u32 irq;
>>> +};
>>>   
>>>       
>> Badly aligned.  Please use __u32 for all fields, and add a __u32 padding 
>> field at the end so we have the same ABI for i386 and x86_64.
>>
>> Some pci devices support more than one irq.  Should we make irq an 
>> array?  Alternatively, decouple irq assignment from pci information
>> and 
>> let userspace handle everything.  This lets us assign non-pci devices 
>> (like serial ports, etc.).
>>     
> Can we limit the first version to single irq devices, and handle this
> after the merge?
>
>   

Yes.  But we need to get at least the userspace interface right (and 
perhaps only partially implement it).  Breaking the userspace interface 
later is possible, but I'd rather avoid it.

>>>   * ioctls for vcpu fds
>>> diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
>>> index 8ce93c7..0b5bc40 100644
>>> --- a/virt/kvm/ioapic.c
>>> +++ b/virt/kvm/ioapic.c
>>> @@ -288,13 +288,22 @@ void kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level)
>>>  static void __kvm_ioapic_update_eoi(struct kvm_ioapic *ioapic, int gsi)
>>>  {
>>>  	union ioapic_redir_entry *ent;
>>> +	struct kvm_pci_pt_dev_list *match;
>>> +	unsigned long flags;
>>>  
>>>  	ent = &ioapic->redirtbl[gsi];
>>>  	ASSERT(ent->fields.trig_mode == IOAPIC_LEVEL_TRIG);
>>>  
>>>  	ent->fields.remote_irr = 0;
>>> -	if (!ent->fields.mask && (ioapic->irr & (1 << gsi)))
>>> -		ioapic_service(ioapic, gsi);
>>> +
>>> +	read_lock_irqsave(&kvm_pci_pt_lock, flags);
>>> +	match = kvm_find_pci_pt_dev(&ioapic->kvm->arch.pci_pt_dev_head, NULL,
>>> +				    gsi, KVM_PT_SOURCE_IRQ_ACK);
>>> +	read_unlock_irqrestore(&kvm_pci_pt_lock, flags);
>>> +	if (!match) {
>>> +		if (!ent->fields.mask && (ioapic->irr & (1 << gsi)))
>>> +			ioapic_service(ioapic, gsi);
>>> +	}
>>>   
>>>       
>> What's device assignment code doing in the ioapic?  This should be done 
>> through the notifier (if it needs to return a value, great).
>>
>>     
>
> This purpose of this code is to avoid the call to ioapic_service for
> assigned devices, because it is not required and causes a double
> injection of the interrupt.
>
> Can you please explain why as part of eoi we inject a new interrupt?
>   

If a level triggered interrupt remains active after the eoi, the ioapic 
has to inject it.  This is used to support shared interrupts, or when 
the device has re-raised the line by the time the ack arrives.

I don't see why it should behave differently for assigned devices.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 3/8] KVM: Handle device assignment to guests
  2008-07-17  9:50           ` Avi Kivity
@ 2008-07-17 17:40             ` Ben-Ami Yassour
  2008-07-17 18:04               ` Avi Kivity
  0 siblings, 1 reply; 42+ messages in thread
From: Ben-Ami Yassour @ 2008-07-17 17:40 UTC (permalink / raw)
  To: Avi Kivity; +Cc: amit.shah, kvm, Muli Ben-Yehuda, weidong.han, anthony

On Thu, 2008-07-17 at 12:50 +0300, Avi Kivity wrote:
> Ben-Ami Yassour wrote:
> > On Wed, 2008-07-16 at 18:04 +0300, Avi Kivity wrote:
> >   
> 
> If a level triggered interrupt remains active after the eoi, the ioapic 
> has to inject it.  This is used to support shared interrupts, or when 
> the device has re-raised the line by the time the ack arrives.
> 
> I don't see why it should behave differently for assigned devices.
> 

The difference is that for emulated devices, qemu is resetting the irr
bit. For assigned devices it does not, and that's the difference.
The first chance that we can clear the irr bit for real devices is the
eoi function, and actually this is what the ack notify handler is doing
(by calling pci_set_irq(kvm,irq,0) ).
I was able to remove the code in ioapic by calling the ack notify
handler before the irr check, and it seems to work fine.
(to make it work, I also had to remove the queuing of the ack handler
which was not necessary, as you mentioned in earlier comment)

The eoi function now looks like this:
static void __kvm_ioapic_update_eoi(struct kvm_ioapic *ioapic, int gsi)
{
	union ioapic_redir_entry *ent;

	ent = &ioapic->redirtbl[gsi];
	ASSERT(ent->fields.trig_mode == IOAPIC_LEVEL_TRIG);

	ent->fields.remote_irr = 0;

	if (ioapic->ack_notifier)
		ioapic->ack_notifier(ioapic->kvm, gsi);

	if (!ent->fields.mask && (ioapic->irr & (1 << gsi)))
		ioapic_service(ioapic, gsi);
}

Any comments on such an approach?

Thanks,
Ben

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 3/8] KVM: Handle device assignment to guests
  2008-07-17 17:40             ` Ben-Ami Yassour
@ 2008-07-17 18:04               ` Avi Kivity
  0 siblings, 0 replies; 42+ messages in thread
From: Avi Kivity @ 2008-07-17 18:04 UTC (permalink / raw)
  To: Ben-Ami Yassour; +Cc: amit.shah, kvm, Muli Ben-Yehuda, weidong.han, anthony

Ben-Ami Yassour wrote:
> On Thu, 2008-07-17 at 12:50 +0300, Avi Kivity wrote:
>   
>> Ben-Ami Yassour wrote:
>>     
>>> On Wed, 2008-07-16 at 18:04 +0300, Avi Kivity wrote:
>>>   
>>>       
>> If a level triggered interrupt remains active after the eoi, the ioapic 
>> has to inject it.  This is used to support shared interrupts, or when 
>> the device has re-raised the line by the time the ack arrives.
>>
>> I don't see why it should behave differently for assigned devices.
>>
>>     
>
> The difference is that for emulated devices, qemu is resetting the irr
> bit. For assigned devices it does not, and that's the difference.
> The first chance that we can clear the irr bit for real devices is the
> eoi function, and actually this is what the ack notify handler is doing
> (by calling pci_set_irq(kvm,irq,0) ).
> I was able to remove the code in ioapic by calling the ack notify
> handler before the irr check, and it seems to work fine.
>   

Okay.

> (to make it work, I also had to remove the queuing of the ack handler
> which was not necessary, as you mentioned in earlier comment)
>
> The eoi function now looks like this:
> static void __kvm_ioapic_update_eoi(struct kvm_ioapic *ioapic, int gsi)
> {
> 	union ioapic_redir_entry *ent;
>
> 	ent = &ioapic->redirtbl[gsi];
> 	ASSERT(ent->fields.trig_mode == IOAPIC_LEVEL_TRIG);
>
> 	ent->fields.remote_irr = 0;
>
> 	if (ioapic->ack_notifier)
> 		ioapic->ack_notifier(ioapic->kvm, gsi);
>
> 	if (!ent->fields.mask && (ioapic->irr & (1 << gsi)))
> 		ioapic_service(ioapic, gsi);
> }
>
> Any comments on such an approach?
>
>   

I think it's fine.  The point where the ack notifier is called is 
between the end of service of the old interrupt, and the beginning of 
service of a potential new interrupt (from the same device or some other 
device on the same guest line).


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PCI passthrough with VT-d - native performance
  2008-07-16 13:16 PCI passthrough with VT-d - native performance Ben-Ami Yassour
  2008-07-16 13:16 ` [PATCH 1/8] KVM: Introduce a callback routine for IOAPIC ack handling Ben-Ami Yassour
@ 2008-07-16 14:36 ` Avi Kivity
  2008-07-16 15:18   ` Ben-Ami Yassour
  1 sibling, 1 reply; 42+ messages in thread
From: Avi Kivity @ 2008-07-16 14:36 UTC (permalink / raw)
  To: Ben-Ami Yassour; +Cc: amit.shah, kvm, muli, weidong.han, anthony

Ben-Ami Yassour wrote:
> In last few tests that we made with PCI-passthrough and VT-d using
> iperf, we were able to get the same throughput as on native OS with a 1G
> NIC

Excellent!

>  (with higher CPU utilization).
>   

How much higher?

> The following patches are the PCI-passthrough patches that Amit sent
> (re-based on the last kvm tree), followed by a few improvements and the
> VT-d extension.
> I am also sending the userspace patches: the patch that Amit sent for
> PCI passthrough and the direct-mmio extension for userspace (note that
> without the direct mmio extension we get less then half the throughput).
>   

Is mmio passthrough the reason for the performance improvement?  If not, 
what was the problem?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PCI passthrough with VT-d - native performance
  2008-07-16 14:36 ` PCI passthrough with VT-d - native performance Avi Kivity
@ 2008-07-16 15:18   ` Ben-Ami Yassour
  2008-07-16 15:22     ` Avi Kivity
  2008-07-16 15:23     ` Anthony Liguori
  0 siblings, 2 replies; 42+ messages in thread
From: Ben-Ami Yassour @ 2008-07-16 15:18 UTC (permalink / raw)
  To: Avi Kivity; +Cc: amit.shah, kvm, Muli Ben-Yehuda, weidong.han, anthony

On Wed, 2008-07-16 at 17:36 +0300, Avi Kivity wrote:
> Ben-Ami Yassour wrote:
> > In last few tests that we made with PCI-passthrough and VT-d using
> > iperf, we were able to get the same throughput as on native OS with a 1G
> > NIC
> 
> Excellent!
> 
> >  (with higher CPU utilization).
> >   
> 
> How much higher?

Here are some numbers for running iperf -l 1M:

e1000 NIC (behind a PCI bridge)
                       Bandwidth (Mbit/sec)    CPU utilization
Native OS                   771                      18%
Native OS with VT-d         760                      18% 
KVM VT-d                    390                      95% 
KVM VT-d with direct mmio   770                      84%
KVM emulated                 57                     100%  

Comment: its not clear to me why the native linux can not get closer to 1G for this NIC,
(I verified that its not external network issues). But clearly we shouldn't hope to 
get more then the host does with a KVM guest (especially if the guest and host are the 
same OS as in this case...).

e1000e NIC (onboard)
                       Bandwidth (Mbit/sec)    CPU utilization
Native OS                   915                      18%
Native OS with VT-d         915                      18%
KVM VT-d with direct mmio   914                      98%

Clearly we need to try and improve the CPU utilization, but I think that this is good enough 
for the first phase.

> 
> > The following patches are the PCI-passthrough patches that Amit sent
> > (re-based on the last kvm tree), followed by a few improvements and the
> > VT-d extension.
> > I am also sending the userspace patches: the patch that Amit sent for
> > PCI passthrough and the direct-mmio extension for userspace (note that
> > without the direct mmio extension we get less then half the throughput).
> >   
> 
> Is mmio passthrough the reason for the performance improvement?  If not, 
> what was the problem?
> 
Direct mmio was definitely a major improvement, without it we got half the throughput,
as you can see above.
In addition patch 4/8 improves the interrupt handling and removes unnecessary locks,
and I assume that it also fixed performance issues (I did not investigate exactly in what way).

Regards,
Ben



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PCI passthrough with VT-d - native performance
  2008-07-16 15:18   ` Ben-Ami Yassour
@ 2008-07-16 15:22     ` Avi Kivity
  2008-07-16 15:23     ` Anthony Liguori
  1 sibling, 0 replies; 42+ messages in thread
From: Avi Kivity @ 2008-07-16 15:22 UTC (permalink / raw)
  To: Ben-Ami Yassour; +Cc: amit.shah, kvm, Muli Ben-Yehuda, weidong.han, anthony

Ben-Ami Yassour wrote:
>>>  (with higher CPU utilization).
>>>   
>>>       
>> How much higher?
>>     
>
> Here are some numbers for running iperf -l 1M:
>
> e1000 NIC (behind a PCI bridge)
>                        Bandwidth (Mbit/sec)    CPU utilization
> Native OS                   771                      18%
> Native OS with VT-d         760                      18% 
> KVM VT-d                    390                      95% 
> KVM VT-d with direct mmio   770                      84%
> KVM emulated                 57                     100%  
>
> Comment: its not clear to me why the native linux can not get closer to 1G for this NIC,
> (I verified that its not external network issues). But clearly we shouldn't hope to 
> get more then the host does with a KVM guest (especially if the guest and host are the 
> same OS as in this case...).
>
> e1000e NIC (onboard)
>                        Bandwidth (Mbit/sec)    CPU utilization
> Native OS                   915                      18%
> Native OS with VT-d         915                      18%
> KVM VT-d with direct mmio   914                      98%
>
> Clearly we need to try and improve the CPU utilization, but I think that this is good enough 
> for the first phase.
>
>   

Agree;  part of the higher utilization is of course not the fault of the 
device assignment code, rather it is ordinary virtualization overhead.  
We'll have to tune this.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PCI passthrough with VT-d - native performance
  2008-07-16 15:18   ` Ben-Ami Yassour
  2008-07-16 15:22     ` Avi Kivity
@ 2008-07-16 15:23     ` Anthony Liguori
  2008-07-16 16:13       ` Ben-Ami Yassour
  2008-07-17  3:20       ` Han, Weidong
  1 sibling, 2 replies; 42+ messages in thread
From: Anthony Liguori @ 2008-07-16 15:23 UTC (permalink / raw)
  To: Ben-Ami Yassour; +Cc: Avi Kivity, amit.shah, kvm, Muli Ben-Yehuda, weidong.han

Ben-Ami Yassour wrote:
> On Wed, 2008-07-16 at 17:36 +0300, Avi Kivity wrote:
>   
>> Ben-Ami Yassour wrote:
>>     
>>> In last few tests that we made with PCI-passthrough and VT-d using
>>> iperf, we were able to get the same throughput as on native OS with a 1G
>>> NIC
>>>       
>> Excellent!
>>
>>     
>>>  (with higher CPU utilization).
>>>   
>>>       
>> How much higher?
>>     
>
> Here are some numbers for running iperf -l 1M:
>
> e1000 NIC (behind a PCI bridge)
>                        Bandwidth (Mbit/sec)    CPU utilization
> Native OS                   771                      18%
> Native OS with VT-d         760                      18% 
> KVM VT-d                    390                      95% 
> KVM VT-d with direct mmio   770                      84%
> KVM emulated                 57                     100%  
>   

What about virtio?  Also, which emulated is this?

That CPU utilization is extremely high and somewhat illogical if native 
w/vt-d has almost no CPU impact.  Have you run oprofile yet or have any 
insight into where CPU is being burnt?

What does kvm_stat look like?  I wonder if there are a large number of 
PIO exits.  What does the interrupt count look like on native vs. KVM 
with VT-d?

Regards,

Anthony Liguori

> Comment: its not clear to me why the native linux can not get closer to 1G for this NIC,
> (I verified that its not external network issues). But clearly we shouldn't hope to 
> get more then the host does with a KVM guest (especially if the guest and host are the 
> same OS as in this case...).
>
> e1000e NIC (onboard)
>                        Bandwidth (Mbit/sec)    CPU utilization
> Native OS                   915                      18%
> Native OS with VT-d         915                      18%
> KVM VT-d with direct mmio   914                      98%
>
> Clearly we need to try and improve the CPU utilization, but I think that this is good enough 
> for the first phase.
>
>   
>>> The following patches are the PCI-passthrough patches that Amit sent
>>> (re-based on the last kvm tree), followed by a few improvements and the
>>> VT-d extension.
>>> I am also sending the userspace patches: the patch that Amit sent for
>>> PCI passthrough and the direct-mmio extension for userspace (note that
>>> without the direct mmio extension we get less then half the throughput).
>>>   
>>>       
>> Is mmio passthrough the reason for the performance improvement?  If not, 
>> what was the problem?
>>
>>     
> Direct mmio was definitely a major improvement, without it we got half the throughput,
> as you can see above.
> In addition patch 4/8 improves the interrupt handling and removes unnecessary locks,
> and I assume that it also fixed performance issues (I did not investigate exactly in what way).
>
> Regards,
> Ben
>
>
>   


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PCI passthrough with VT-d - native performance
  2008-07-16 15:23     ` Anthony Liguori
@ 2008-07-16 16:13       ` Ben-Ami Yassour
  2008-07-16 16:57         ` Avi Kivity
  2008-07-17  3:20       ` Han, Weidong
  1 sibling, 1 reply; 42+ messages in thread
From: Ben-Ami Yassour @ 2008-07-16 16:13 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, amit.shah, kvm, Muli Ben-Yehuda, weidong.han

On Wed, 2008-07-16 at 10:23 -0500, Anthony Liguori wrote:

> 
> What about virtio?  
didn't try it yet.

> Also, which emulated is this?
the default

> 
> That CPU utilization is extremely high and somewhat illogical if native 
> w/vt-d has almost no CPU impact.  Have you run oprofile yet or have any 
> insight into where CPU is being burnt?
> 
> What does kvm_stat look like?  I wonder if there are a large number of 
> PIO exits.  What does the interrupt count look like on native vs. KVM 
> with VT-d?
> 
> Regards,
> 
> Anthony Liguori
> 

These are all good points and questions, I agree that we need to take a deeper 
look into the performance issues, but I think that we need to merge with 
the main KVM tree first.

Regards,
Ben


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PCI passthrough with VT-d - native performance
  2008-07-16 16:13       ` Ben-Ami Yassour
@ 2008-07-16 16:57         ` Avi Kivity
  2008-07-17  6:24           ` Ben-Ami Yassour
  0 siblings, 1 reply; 42+ messages in thread
From: Avi Kivity @ 2008-07-16 16:57 UTC (permalink / raw)
  To: Ben-Ami Yassour
  Cc: Anthony Liguori, amit.shah, kvm, Muli Ben-Yehuda, weidong.han

Ben-Ami Yassour wrote:
>   
>> That CPU utilization is extremely high and somewhat illogical if native 
>> w/vt-d has almost no CPU impact.  Have you run oprofile yet or have any 
>> insight into where CPU is being burnt?
>>
>> What does kvm_stat look like?  I wonder if there are a large number of 
>> PIO exits.  What does the interrupt count look like on native vs. KVM 
>> with VT-d?
>>
>> Regards,
>>
>> Anthony Liguori
>>
>>     
>
> These are all good points and questions, I agree that we need to take a deeper 
> look into the performance issues, but I think that we need to merge with 
> the main KVM tree first.
>   

It would be good to get the host interrupt rate, to confirm that the 
host isn't flooded with interrupts.  A deeper analysis can wait.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: PCI passthrough with VT-d - native performance
  2008-07-16 16:57         ` Avi Kivity
@ 2008-07-17  6:24           ` Ben-Ami Yassour
  0 siblings, 0 replies; 42+ messages in thread
From: Ben-Ami Yassour @ 2008-07-17  6:24 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Anthony Liguori, amit.shah, kvm, Muli Ben-Yehuda, weidong.han

On Wed, 2008-07-16 at 19:57 +0300, Avi Kivity wrote:
> Ben-Ami Yassour wrote:
> >   
> >> That CPU utilization is extremely high and somewhat illogical if native 
> >> w/vt-d has almost no CPU impact.  Have you run oprofile yet or have any 
> >> insight into where CPU is being burnt?
> >>
> >> What does kvm_stat look like?  I wonder if there are a large number of 
> >> PIO exits.  What does the interrupt count look like on native vs. KVM 
> >> with VT-d?
> >>
> >> Regards,
> >>
> >> Anthony Liguori
> >>
> >>     
> >
> > These are all good points and questions, I agree that we need to take a deeper 
> > look into the performance issues, but I think that we need to merge with 
> > the main KVM tree first.
> >   
> 
> It would be good to get the host interrupt rate, to confirm that the 
> host isn't flooded with interrupts.  A deeper analysis can wait.
> 
> 

Here is an example of a kvm_stat while running iperf:
efer_reload                   13         0
exits                   11372838     58214
fpu_reload                709524         0
halt_exits                139306         0
halt_wakeup                 3580         0
host_state_reload        5070783     29031
hypercalls                635636         1
insn_emulation           4306311     22289
insn_emulation_fail            0         0
invlpg                         0         0
io_exits                  978192       646
irq_exits                2795848     20879
irq_window                     0         0
largepages                     0         0
mmio_exits                 15437         0
mmu_cache_miss             60299         0
mmu_flooded                43213         0
mmu_pde_zapped             49610         0
mmu_pte_updated           576608         1
mmu_pte_write            1203336         1
mmu_recycled                   0         0
mmu_shadow_zapped          59348         0
nmi_window                     0         0
pf_fixed                  571970         0
pf_guest                  103889         0
remote_tlb_flush               0         0
request_irq                    0         0
signal_exits                   1         0
tlb_flush                 725366         7

The average host interrupt rate is around 22,200 per second.
Note: I would assume that this is bigger then irq_exits, because some
of the interrupts occur while we are in host mode so it does not cause
an exit.
BTW, the interrupt counter on the host and guest is very close.

Regards,
Ben




^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: PCI passthrough with VT-d - native performance
  2008-07-16 15:23     ` Anthony Liguori
  2008-07-16 16:13       ` Ben-Ami Yassour
@ 2008-07-17  3:20       ` Han, Weidong
  1 sibling, 0 replies; 42+ messages in thread
From: Han, Weidong @ 2008-07-17  3:20 UTC (permalink / raw)
  To: Anthony Liguori, Ben-Ami Yassour
  Cc: Avi Kivity, amit.shah, kvm, Muli Ben-Yehuda

Anthony Liguori wrote:
> Ben-Ami Yassour wrote:
>> On Wed, 2008-07-16 at 17:36 +0300, Avi Kivity wrote:
>> 
>>> Ben-Ami Yassour wrote:
>>> 
>>>> In last few tests that we made with PCI-passthrough and VT-d using
>>>> iperf, we were able to get the same throughput as on native OS
>>>> with a 1G NIC 
>>>> 
>>> Excellent!
>>> 
>>> 
>>>>  (with higher CPU utilization).
>>>> 
>>>> 
>>> How much higher?
>>> 
>> 
>> Here are some numbers for running iperf -l 1M:
>> 
>> e1000 NIC (behind a PCI bridge)
>>                        Bandwidth (Mbit/sec)    CPU utilization
>> Native OS                   771                      18%
>> Native OS with VT-d         760                      18%
>> KVM VT-d                    390                      95%
>> KVM VT-d with direct mmio   770                      84%
>> KVM emulated                 57                     100%
>> 
> 
> What about virtio?  Also, which emulated is this?
> 
> That CPU utilization is extremely high and somewhat illogical if
> native w/vt-d has almost no CPU impact.  Have you run oprofile yet or
> have any insight into where CPU is being burnt?
> 
> What does kvm_stat look like?  I wonder if there are a large number of
> PIO exits.  What does the interrupt count look like on native vs. KVM
> with VT-d?
> 

e1000 NIC doesn't use PIO. 

Randy (Weidong)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* PCI passthrough with VT-d - native performance
@ 2008-07-16 15:56 Ben-Ami Yassour
  0 siblings, 0 replies; 42+ messages in thread
From: Ben-Ami Yassour @ 2008-07-16 15:56 UTC (permalink / raw)
  To: amit.shah; +Cc: kvm, muli, benami, weidong.han, anthony

In last few tests that we made with PCI-passthrough and VT-d using
iperf, we were able to get the same throughput as on native OS with a 1G
NIC (with higher CPU utilization).

The following patches are the PCI-passthrough patches that Amit sent
(re-based on the last kvm tree), followed by a few improvements and the
VT-d extension.
I am also sending the userspace patches: the patch that Amit sent for
PCI passthrough and the direct-mmio extension for userspace (note that
without the direct mmio extension we get less then half the throughput).

Per Avi's request, I am resending the kernel patches, folding patches 3/8, 4/8, 5/8.

Regards,
Ben

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2008-07-24 14:25 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-07-16 13:16 PCI passthrough with VT-d - native performance Ben-Ami Yassour
2008-07-16 13:16 ` [PATCH 1/8] KVM: Introduce a callback routine for IOAPIC ack handling Ben-Ami Yassour
2008-07-16 13:16   ` [PATCH 2/8] KVM: Introduce a callback routine for PIC " Ben-Ami Yassour
2008-07-16 13:17     ` [PATCH 3/8] KVM: Handle device assignment to guests Ben-Ami Yassour
2008-07-16 13:17       ` [PATCH 4/8] KVM: PCIPT: fix interrupt handling Ben-Ami Yassour
2008-07-16 13:17         ` [PATCH 5/8] KVM: PCIPT: change order of device release Ben-Ami Yassour
2008-07-16 13:17           ` [PATCH 6/8] VT-d: changes to support KVM Ben-Ami Yassour
2008-07-16 13:17             ` [PATCH 7/8] KVM: PCIPT: VT-d support Ben-Ami Yassour
2008-07-16 13:17               ` [PATCH 8/8] KVM: PCIPT: VT-d: dont map mmio memory slots Ben-Ami Yassour
2008-07-16 13:21                 ` PCI passthrough with VT-d - native performance Ben-Ami Yassour
2008-07-16 13:21                   ` [PATCH 1/2] KVM/userspace: Support for assigning PCI devices to guest Ben-Ami Yassour
2008-07-16 13:21                     ` [PATCH 2/2] PCIPT: direct mmio Ben-Ami Yassour
2008-07-17  7:52                     ` [PATCH 1/2] KVM/userspace: Support for assigning PCI devices to guest Han, Weidong
2008-07-22 12:28                       ` Ben-Ami Yassour
2008-07-22 12:44                         ` [PATCH 1/2] KVM/userspace: Support for assigning PCI devicesto guest Han, Weidong
2008-07-16 15:06           ` [PATCH 5/8] KVM: PCIPT: change order of device release Avi Kivity
2008-07-16 15:06         ` [PATCH 4/8] KVM: PCIPT: fix interrupt handling Avi Kivity
2008-07-23 13:37         ` Amit Shah
2008-07-24 11:28           ` Ben-Ami Yassour
2008-07-24 13:31             ` Amit Shah
2008-07-24 14:24               ` Ben-Ami Yassour
2008-07-16 15:04       ` [PATCH 3/8] KVM: Handle device assignment to guests Avi Kivity
2008-07-17  2:09         ` Han, Weidong
2008-07-17  2:29           ` Yang, Sheng
2008-07-17  6:02           ` Avi Kivity
2008-07-17  8:23             ` Ben-Ami Yassour
2008-07-17  8:31               ` Avi Kivity
2008-07-17 18:01                 ` Ben-Ami Yassour
2008-07-17 18:07                   ` Avi Kivity
2008-07-17  9:34         ` Ben-Ami Yassour
2008-07-17  9:50           ` Avi Kivity
2008-07-17 17:40             ` Ben-Ami Yassour
2008-07-17 18:04               ` Avi Kivity
2008-07-16 14:36 ` PCI passthrough with VT-d - native performance Avi Kivity
2008-07-16 15:18   ` Ben-Ami Yassour
2008-07-16 15:22     ` Avi Kivity
2008-07-16 15:23     ` Anthony Liguori
2008-07-16 16:13       ` Ben-Ami Yassour
2008-07-16 16:57         ` Avi Kivity
2008-07-17  6:24           ` Ben-Ami Yassour
2008-07-17  3:20       ` Han, Weidong
  -- strict thread matches above, loose matches on Subject: below --
2008-07-16 15:56 Ben-Ami Yassour

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox