* [v7] Userspace patches for PCI device assignment
@ 2008-10-24 15:26 Amit Shah
  2008-10-24 15:26 ` [PATCH 1/6] KVM/userspace: Device Assignment: Add ioctl wrappers needed for assigning devices Amit Shah
  2008-10-24 15:59 ` [v7] Userspace patches for PCI device assignment Anthony Liguori
  0 siblings, 2 replies; 23+ messages in thread
From: Amit Shah @ 2008-10-24 15:26 UTC (permalink / raw)
  To: avi; +Cc: kvm, anthony, weidong.han, allen.m.kay, muli, benami

This patchset enables device assignment for KVM hosts for PCI devices. It uses
the Intel IOMMU by default if available.

Major changes since the last send in no particular order:
- formatting changes: adhere to qemu style
- use strncmp, strncpy etc. instead of the insecure ones
- move from array to linked list
- change iopl() to ioperm() (Weidong Han)

Plus a lot of other small changes as suggested during the review of v6.

^ permalink raw reply	[flat|nested] 23+ messages in thread
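The changelog above mentions moving the assigned-device bookkeeping from an array to a linked list. A minimal sketch of that pattern follows, assuming the BSD list macros from glibc's <sys/queue.h> (the patch itself uses QEMU's sys-queue.h). The names struct dev_info, dev_head, add_dev and count_devs are hypothetical and exist only for illustration.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/queue.h>

/* Hypothetical per-device record; loosely modelled on AssignedDevInfo. */
struct dev_info {
    char name[15];
    LIST_ENTRY(dev_info) next;      /* linkage used by the LIST_* macros */
};

/* Anonymous list head, statically initialised to the empty list. */
static LIST_HEAD(, dev_info) dev_head = LIST_HEAD_INITIALIZER(dev_head);

/* Allocate a record, copy the name with a bound, link it at the head. */
static struct dev_info *add_dev(const char *name)
{
    struct dev_info *d = calloc(1, sizeof(*d));

    if (d == NULL)
        return NULL;
    /* calloc zeroed the buffer, so copying size - 1 bytes keeps the NUL. */
    strncpy(d->name, name, sizeof(d->name) - 1);
    LIST_INSERT_HEAD(&dev_head, d, next);
    return d;
}

/* Walk the list; no fixed upper bound on the number of devices. */
static int count_devs(void)
{
    struct dev_info *d;
    int n = 0;

    LIST_FOREACH(d, &dev_head, next)
        n++;
    return n;
}
```

Compared with a fixed-size array, the list has no compile-time device limit, and inserting at the head is constant time, which is probably why it suits both the command-line and the hot-plug paths.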
* [PATCH 1/6] KVM/userspace: Device Assignment: Add ioctl wrappers needed for assigning devices 2008-10-24 15:26 [v7] Userspace patches for PCI device assignment Amit Shah @ 2008-10-24 15:26 ` Amit Shah 2008-10-24 15:26 ` [PATCH 2/6] qemu: Introduce pci_map_irq to get irq nr from pin number for a PCI device Amit Shah 2008-10-26 13:29 ` [PATCH 1/6] KVM/userspace: Device Assignment: Add ioctl wrappers needed for assigning devices Avi Kivity 2008-10-24 15:59 ` [v7] Userspace patches for PCI device assignment Anthony Liguori 1 sibling, 2 replies; 23+ messages in thread From: Amit Shah @ 2008-10-24 15:26 UTC (permalink / raw) To: avi; +Cc: kvm, anthony, weidong.han, allen.m.kay, muli, benami, Amit Shah Signed-off-by: Amit Shah <amit.shah@redhat.com> --- libkvm/libkvm.c | 13 +++++++++++++ libkvm/libkvm.h | 27 +++++++++++++++++++++++++++ 2 files changed, 40 insertions(+), 0 deletions(-) diff --git a/libkvm/libkvm.c b/libkvm/libkvm.c index 444b97f..c4814e0 100644 --- a/libkvm/libkvm.c +++ b/libkvm/libkvm.c @@ -1112,3 +1112,16 @@ int kvm_unregister_coalesced_mmio(kvm_context_t kvm, uint64_t addr, uint32_t siz return -ENOSYS; } +#ifdef KVM_CAP_DEVICE_ASSIGNMENT +int kvm_assign_pci_device(kvm_context_t kvm, + struct kvm_assigned_pci_dev *assigned_dev) +{ + return ioctl(kvm->vm_fd, KVM_ASSIGN_PCI_DEVICE, assigned_dev); +} + +int kvm_assign_irq(kvm_context_t kvm, + struct kvm_assigned_irq *assigned_irq) +{ + return ioctl(kvm->vm_fd, KVM_ASSIGN_IRQ, assigned_irq); +} +#endif diff --git a/libkvm/libkvm.h b/libkvm/libkvm.h index 423ce31..53d67f2 100644 --- a/libkvm/libkvm.h +++ b/libkvm/libkvm.h @@ -686,4 +686,31 @@ int kvm_s390_interrupt(kvm_context_t kvm, int slot, int kvm_s390_set_initial_psw(kvm_context_t kvm, int slot, psw_t psw); int kvm_s390_store_status(kvm_context_t kvm, int slot, unsigned long addr); #endif + +#ifdef KVM_CAP_DEVICE_ASSIGNMENT +/*! 
+ * \brief Notifies host kernel about a PCI device to be assigned to a guest + * + * Used for PCI device assignment, this function notifies the host + * kernel about the assigning of the physical PCI device to a guest. + * + * \param kvm Pointer to the current kvm_context + * \param assigned_dev Parameters, like bus, devfn number, etc + */ +int kvm_assign_pci_device(kvm_context_t kvm, + struct kvm_assigned_pci_dev *assigned_dev); + +/*! + * \brief Notifies host kernel about changes to IRQ for an assigned device + * + * Used for PCI device assignment, this function notifies the host + * kernel about the changes in IRQ number for an assigned physical + * PCI device. + * + * \param kvm Pointer to the current kvm_context + * \param assigned_irq Parameters, like dev id, host irq, guest irq, etc + */ +int kvm_assign_irq(kvm_context_t kvm, + struct kvm_assigned_irq *assigned_irq); +#endif #endif -- 1.6.0.2 ^ permalink raw reply related [flat|nested] 23+ messages in thread
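The two wrappers above return the raw ioctl() result, so a failure shows up as -1 with the cause left in errno. A caller that prefers a negative error code could wrap the call as in the sketch below. This is only an illustration under that assumption: ioctl_neg_errno is a hypothetical helper, not part of libkvm or of this patch.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/ioctl.h>

/*
 * Hypothetical helper: issue an ioctl and fold the errno of a failed
 * call into a negative return value, so the caller does not have to
 * inspect errno separately.
 */
static int ioctl_neg_errno(int fd, unsigned long request, void *arg)
{
    int r = ioctl(fd, request, arg);

    if (r == -1)
        return -errno;      /* e.g. -EBADF for an invalid descriptor */
    return r;
}
```

With the wrappers as posted, the callers in patch 5/6 instead test for r < 0 and report the error through perror().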
* [PATCH 2/6] qemu: Introduce pci_map_irq to get irq nr from pin number for a PCI device
  2008-10-24 15:26 ` [PATCH 1/6] KVM/userspace: Device Assignment: Add ioctl wrappers needed for assigning devices Amit Shah
@ 2008-10-24 15:26   ` Amit Shah
  2008-10-24 15:26     ` [PATCH 3/6] qemu: piix: Introduce functions to get pin number from irq and vice versa Amit Shah
  2008-10-26 13:29   ` [PATCH 1/6] KVM/userspace: Device Assignment: Add ioctl wrappers needed for assigning devices Avi Kivity
  1 sibling, 1 reply; 23+ messages in thread
From: Amit Shah @ 2008-10-24 15:26 UTC (permalink / raw)
  To: avi; +Cc: kvm, anthony, weidong.han, allen.m.kay, muli, benami, Amit Shah

Signed-off-by: Amit Shah <amit.shah@redhat.com>
---
 qemu/hw/pci.c |    5 +++++
 qemu/hw/pci.h |    1 +
 2 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/qemu/hw/pci.c b/qemu/hw/pci.c
index 512dbea..c82cd20 100644
--- a/qemu/hw/pci.c
+++ b/qemu/hw/pci.c
@@ -560,6 +560,11 @@ static void pci_set_irq(void *opaque, int irq_num, int level)
     bus->set_irq(bus->irq_opaque, irq_num, bus->irq_count[irq_num] != 0);
 }
 
+int pci_map_irq(PCIDevice *pci_dev, int pin)
+{
+    return pci_dev->bus->map_irq(pci_dev, pin);
+}
+
 /***********************************************************/
 /* monitor info on PCI */
 
diff --git a/qemu/hw/pci.h b/qemu/hw/pci.h
index 60e4094..e11fbbf 100644
--- a/qemu/hw/pci.h
+++ b/qemu/hw/pci.h
@@ -81,6 +81,7 @@ void pci_register_io_region(PCIDevice *pci_dev, int region_num,
                             uint32_t size, int type,
                             PCIMapIORegionFunc *map_func);
 
+int pci_map_irq(PCIDevice *pci_dev, int pin);
 uint32_t pci_default_read_config(PCIDevice *d,
                                  uint32_t address, int len);
 void pci_default_write_config(PCIDevice *d,
-- 
1.6.0.2

^ permalink raw reply related	[flat|nested] 23+ messages in thread
* [PATCH 3/6] qemu: piix: Introduce functions to get pin number from irq and vice versa
  2008-10-24 15:26   ` [PATCH 2/6] qemu: Introduce pci_map_irq to get irq nr from pin number for a PCI device Amit Shah
@ 2008-10-24 15:26     ` Amit Shah
  2008-10-24 15:26       ` [PATCH 4/6] KVM/userspace: Build vtd.c for Intel IOMMU support Amit Shah
  2008-10-26 13:31     ` [PATCH 3/6] qemu: piix: Introduce functions to get pin number from irq and vice versa Avi Kivity
  0 siblings, 2 replies; 23+ messages in thread
From: Amit Shah @ 2008-10-24 15:26 UTC (permalink / raw)
  To: avi; +Cc: kvm, anthony, weidong.han, allen.m.kay, muli, benami, Amit Shah

Signed-off-by: Amit Shah <amit.shah@redhat.com>
---
 qemu/hw/pc.h       |    3 +++
 qemu/hw/piix_pci.c |   19 +++++++++++++++++++
 2 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/qemu/hw/pc.h b/qemu/hw/pc.h
index 1f63678..3edf62f 100644
--- a/qemu/hw/pc.h
+++ b/qemu/hw/pc.h
@@ -112,6 +112,9 @@ void i440fx_init_memory_mappings(PCIDevice *d);
 
 int piix4_init(PCIBus *bus, int devfn);
 
+int piix3_get_pin(int pic_irq);
+int piix_get_irq(int pin);
+
 /* vga.c */
 enum vga_retrace_method {
     VGA_RETRACE_DUMB,
diff --git a/qemu/hw/piix_pci.c b/qemu/hw/piix_pci.c
index 6fbf47b..dc12c8a 100644
--- a/qemu/hw/piix_pci.c
+++ b/qemu/hw/piix_pci.c
@@ -243,6 +243,25 @@ static void piix3_set_irq(qemu_irq *pic, int irq_num, int level)
     }
 }
 
+int piix3_get_pin(int pic_irq)
+{
+    int i;
+    for (i = 0; i < 4; i++)
+        if (piix3_dev->config[0x60+i] == pic_irq)
+            return i;
+    return -1;
+}
+
+int piix_get_irq(int pin)
+{
+    if (piix3_dev)
+        return piix3_dev->config[0x60+pin];
+    if (piix4_dev)
+        return piix4_dev->config[0x60+pin];
+
+    return 0;
+}
+
 static void piix3_reset(PCIDevice *d)
 {
     uint8_t *pci_conf = d->config;
-- 
1.6.0.2

^ permalink raw reply related	[flat|nested] 23+ messages in thread
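Both helpers in this patch read the four route bytes at config offsets 0x60..0x63 of the PIIX bridge. The sketch below restates that lookup against a stand-in structure so it can be compiled on its own. It is an illustration, not the patch's code: struct fake_bridge, sketch_get_pin and sketch_get_irq are hypothetical names, and the NULL and range checks (and the -1 returned on a missing bridge, where piix_get_irq returns 0) are assumptions added for the sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* First route byte in config space; the patch indexes config[0x60 + pin]. */
#define PIRQ_ROUTE_BASE 0x60
#define PIRQ_PINS       4

/* Hypothetical stand-in for the PCIDevice config array used in the patch. */
struct fake_bridge {
    uint8_t config[256];
};

/* Return the pin (0..3) whose route byte equals pic_irq, or -1 if no pin
 * matches or there is no bridge. */
static int sketch_get_pin(const struct fake_bridge *bridge, int pic_irq)
{
    int i;

    if (bridge == NULL)
        return -1;
    for (i = 0; i < PIRQ_PINS; i++)
        if (bridge->config[PIRQ_ROUTE_BASE + i] == pic_irq)
            return i;
    return -1;
}

/* Return the route byte for pin, or -1 for a bad pin or a missing bridge. */
static int sketch_get_irq(const struct fake_bridge *bridge, int pin)
{
    if (bridge == NULL || pin < 0 || pin >= PIRQ_PINS)
        return -1;
    return bridge->config[PIRQ_ROUTE_BASE + pin];
}
```

Passing the bridge as a parameter, rather than reading a file-scope pointer, makes the missing-device case explicit at the call site.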
* [PATCH 4/6] KVM/userspace: Build vtd.c for Intel IOMMU support 2008-10-24 15:26 ` [PATCH 3/6] qemu: piix: Introduce functions to get pin number from irq and vice versa Amit Shah @ 2008-10-24 15:26 ` Amit Shah 2008-10-24 15:26 ` [PATCH 5/6] KVM/userspace: Device Assignment: Support for assigning PCI devices to guests Amit Shah 2008-10-26 13:31 ` [PATCH 3/6] qemu: piix: Introduce functions to get pin number from irq and vice versa Avi Kivity 1 sibling, 1 reply; 23+ messages in thread From: Amit Shah @ 2008-10-24 15:26 UTC (permalink / raw) To: avi; +Cc: kvm, anthony, weidong.han, allen.m.kay, muli, benami, Amit Shah Signed-off-by: Amit Shah <amit.shah@redhat.com> --- kernel/x86/Kbuild | 3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/kernel/x86/Kbuild b/kernel/x86/Kbuild index 2369d00..c4723b1 100644 --- a/kernel/x86/Kbuild +++ b/kernel/x86/Kbuild @@ -9,6 +9,9 @@ kvm-objs := kvm_main.o x86.o mmu.o x86_emulate.o ../anon_inodes.o irq.o i8259.o ifeq ($(EXT_CONFIG_KVM_TRACE),y) kvm-objs += kvm_trace.o endif +ifeq ($(CONFIG_DMAR),y) +kvm-objs += vtd.o +endif kvm-intel-objs := vmx.o vmx-debug.o ../external-module-compat.o kvm-amd-objs := svm.o ../external-module-compat.o -- 1.6.0.2 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH 5/6] KVM/userspace: Device Assignment: Support for assigning PCI devices to guests
  2008-10-24 15:26       ` [PATCH 4/6] KVM/userspace: Build vtd.c for Intel IOMMU support Amit Shah
@ 2008-10-24 15:26         ` Amit Shah
  2008-10-24 15:26           ` [PATCH 6/6] KVM/userspace: Device Assignment: Support for hot plugging PCI devices Amit Shah
                            ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Amit Shah @ 2008-10-24 15:26 UTC (permalink / raw)
  To: avi; +Cc: kvm, anthony, weidong.han, allen.m.kay, muli, benami, Amit Shah

This patch has been contributed to by the following people:

From: Or Sagi <ors@tutis.com>
From: Nir Peleg <nir@tutis.com>
From: Amit Shah <amit.shah@redhat.com>
From: Ben-Ami Yassour <benami@il.ibm.com>
From: Weidong Han <weidong.han@intel.com>
From: Glauber de Oliveira Costa <gcosta@redhat.com>

With this patch, we can assign a device on the host machine to a guest.

A new command-line option, -pcidevice is added. To invoke it for a device
sitting at PCI bus:dev.fn 04:08.0, use this:

        -pcidevice host=04:08.0

* The host driver for the device, if any, is to be removed before assigning
  the device (else device assignment will fail).

* A device that shares IRQ with another host device cannot currently be
  assigned.

* The RAW_IO capability is needed for this to work

This works only with the in-kernel irqchip method; to use the userspace
irqchip, a kernel module (irqhook) and some extra changes are needed.

Signed-off-by: Amit Shah <amit.shah@redhat.com> --- qemu/Makefile.target | 1 + qemu/hw/device-assignment.c | 619 +++++++++++++++++++++++++++++++++++++++++++ qemu/hw/device-assignment.h | 98 +++++++ qemu/hw/pc.c | 6 + qemu/hw/pci.c | 7 + qemu/vl.c | 18 ++ 6 files changed, 749 insertions(+), 0 deletions(-) create mode 100644 qemu/hw/device-assignment.c create mode 100644 qemu/hw/device-assignment.h diff --git a/qemu/Makefile.target b/qemu/Makefile.target index d9bdeca..05a1d84 100644 --- a/qemu/Makefile.target +++ b/qemu/Makefile.target @@ -621,6 +621,7 @@ OBJS+= ide.o pckbd.o ps2.o vga.o $(SOUND_HW) dma.o OBJS+= fdc.o mc146818rtc.o serial.o i8259.o i8254.o pcspk.o pc.o OBJS+= cirrus_vga.o apic.o parallel.o acpi.o piix_pci.o OBJS+= usb-uhci.o vmmouse.o vmport.o vmware_vga.o extboot.o +OBJS+= device-assignment.o ifeq ($(USE_KVM_PIT), 1) OBJS+= i8254-kvm.o endif diff --git a/qemu/hw/device-assignment.c b/qemu/hw/device-assignment.c new file mode 100644 index 0000000..5ba21a0 --- /dev/null +++ b/qemu/hw/device-assignment.c @@ -0,0 +1,619 @@ +/* + * Copyright (c) 2007, Neocleus Corporation. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + * You should have received a copy of the GNU General Public License along with + * this program; if not, write to the Free Software Foundation, Inc., 59 Temple + * Place - Suite 330, Boston, MA 02111-1307 USA. + * + * + * Assign a PCI device from the host to a guest VM. + * + * Adapted for KVM by Qumranet. 
+ * + * Copyright (c) 2007, Neocleus, Alex Novik (alex@neocleus.com) + * Copyright (c) 2007, Neocleus, Guy Zana (guy@neocleus.com) + * Copyright (C) 2008, Qumranet, Amit Shah (amit.shah@qumranet.com) + * Copyright (C) 2008, Red Hat, Amit Shah (amit.shah@redhat.com) + */ +#include <stdio.h> +#include <sys/io.h> +#include "qemu-kvm.h" +#include "hw.h" +#include "pc.h" +#include "sysemu.h" +#include "console.h" +#include <linux/kvm_para.h> +#include "device-assignment.h" + +/* From linux/ioport.h */ +#define IORESOURCE_IO 0x00000100 /* Resource type */ +#define IORESOURCE_MEM 0x00000200 +#define IORESOURCE_IRQ 0x00000400 +#define IORESOURCE_DMA 0x00000800 +#define IORESOURCE_PREFETCH 0x00001000 /* No side effects */ + +/* #define DEVICE_ASSIGNMENT_DEBUG 1 */ + +#ifdef DEVICE_ASSIGNMENT_DEBUG +#define DEBUG(fmt, args...) \ + do { \ + fprintf(stderr, "%s: " fmt, __func__ , ## args); \ + } while (0) +#else +#define DEBUG(fmt, args...) do { } while(0) +#endif + +static void assigned_dev_ioport_writeb(void *opaque, uint32_t addr, + uint32_t value) +{ + AssignedDevRegion *r_access = (AssignedDevRegion *)opaque; + uint32_t r_pio = (unsigned long)r_access->r_virtbase + + (addr - r_access->e_physbase); + + DEBUG(stderr, "%s: r_pio=%08x e_physbase=%08x" + " r_virtbase=%08lx value=%08x\n", + __func__, r_pio, (int)r_access->e_physbase, + (unsigned long)r_access->r_virtbase, value); + outb(value, r_pio); +} + +static void assigned_dev_ioport_writew(void *opaque, uint32_t addr, + uint32_t value) +{ + AssignedDevRegion *r_access = (AssignedDevRegion *)opaque; + uint32_t r_pio = (unsigned long)r_access->r_virtbase + + (addr - r_access->e_physbase); + + DEBUG(stderr, "%s: r_pio=%08x e_physbase=%08x" + " r_virtbase=%08lx value=%08x\n", + __func__, r_pio, (int)r_access->e_physbase, + (unsigned long)r_access->r_virtbase, value); + outw(value, r_pio); +} + +static void assigned_dev_ioport_writel(void *opaque, uint32_t addr, + uint32_t value) +{ + AssignedDevRegion *r_access = 
(AssignedDevRegion *)opaque; + uint32_t r_pio = (unsigned long)r_access->r_virtbase + + (addr - r_access->e_physbase); + + DEBUG(stderr, "%s: r_pio=%08x e_physbase=%08x" + " r_virtbase=%08lx value=%08x\n", + __func__, r_pio, (int)r_access->e_physbase, + (unsigned long)r_access->r_virtbase, value); + outl(value, r_pio); +} + +static uint32_t assigned_dev_ioport_readb(void *opaque, uint32_t addr) +{ + AssignedDevRegion *r_access = (AssignedDevRegion *)opaque; + uint32_t r_pio = (addr - r_access->e_physbase) + + (unsigned long)r_access->r_virtbase; + uint32_t value; + + value = inb(r_pio); + DEBUG(stderr, "%s: r_pio=%08x e_physbase=%08x " + "r_virtbase=%08lx value=%08x\n", + __func__, r_pio, (int)r_access->e_physbase, + (unsigned long)r_access->r_virtbase, value); + return value; +} + +static uint32_t assigned_dev_ioport_readw(void *opaque, uint32_t addr) +{ + AssignedDevRegion *r_access = (AssignedDevRegion *)opaque; + uint32_t r_pio = (addr - r_access->e_physbase) + + (unsigned long)r_access->r_virtbase; + uint32_t value; + + value = inw(r_pio); + DEBUG(stderr, "%s: r_pio=%08x e_physbase=%08x " + "r_virtbase=%08lx value=%08x\n", + __func__, r_pio, (int)r_access->e_physbase, + (unsigned long)r_access->r_virtbase, value); + return value; +} + +static uint32_t assigned_dev_ioport_readl(void *opaque, uint32_t addr) +{ + AssignedDevRegion *r_access = (AssignedDevRegion *)opaque; + uint32_t r_pio = (addr - r_access->e_physbase) + + (unsigned long)r_access->r_virtbase; + uint32_t value; + + value = inl(r_pio); + DEBUG(stderr, "%s: r_pio=%08x e_physbase=%08x " + "r_virtbase=%08lx value=%08x\n", + __func__, r_pio, (int)r_access->e_physbase, + (unsigned long)r_access->r_virtbase, value); + return value; +} + +static void assigned_dev_iomem_map(PCIDevice *pci_dev, int region_num, + uint32_t e_phys, uint32_t e_size, int type) +{ + AssignedDevice *r_dev = (AssignedDevice *) pci_dev; + AssignedDevRegion *region = &r_dev->v_addrs[region_num]; + int first_map = (region->e_size == 
0); + int ret = 0; + + DEBUG("%s: e_phys=%08x r_virt=%x type=%d len=%08x region_num=%d \n", + __func__, e_phys, (uint32_t)region->r_virtbase, type, e_size, + region_num); + + region->e_physbase = e_phys; + region->e_size = e_size; + + /* FIXME: Add support for emulated MMIO for non-kvm guests */ + if (kvm_enabled()) { + if (!first_map) + kvm_destroy_phys_mem(kvm_context, e_phys, e_size); + if (e_size > 0) + ret = kvm_register_phys_mem(kvm_context, e_phys, + region->r_virtbase, e_size, 0); + if (ret != 0) + fprintf(stderr, "%s: Error: create new mapping failed\n", __func__); + } +} + +static void assigned_dev_ioport_map(PCIDevice *pci_dev, int region_num, + uint32_t addr, uint32_t size, int type) +{ + AssignedDevice *r_dev = (AssignedDevice *) pci_dev; + AssignedDevRegion *region = &r_dev->v_addrs[region_num]; + int r; + + region->e_physbase = addr; + region->e_size = size; + + DEBUG("%s: e_phys=0x%x r_virt=%x type=0x%x len=%d region_num=%d \n", + __func__, addr, (uint32_t)region->r_virtbase, type, size, region_num); + + r = ioperm((uint32_t)region->r_virtbase, size, 1); + if (r < 0) { + perror("assigned_dev_ioport_map: ioperm"); + return; + } + + register_ioport_read(addr, size, 1, assigned_dev_ioport_readb, + (void *) (r_dev->v_addrs + region_num)); + register_ioport_read(addr, size, 2, assigned_dev_ioport_readw, + (void *) (r_dev->v_addrs + region_num)); + register_ioport_read(addr, size, 4, assigned_dev_ioport_readl, + (void *) (r_dev->v_addrs + region_num)); + register_ioport_write(addr, size, 1, assigned_dev_ioport_writeb, + (void *) (r_dev->v_addrs + region_num)); + register_ioport_write(addr, size, 2, assigned_dev_ioport_writew, + (void *) (r_dev->v_addrs + region_num)); + register_ioport_write(addr, size, 4, assigned_dev_ioport_writel, + (void *) (r_dev->v_addrs + region_num)); +} + +static void assigned_dev_pci_write_config(PCIDevice *d, uint32_t address, + uint32_t val, int len) +{ + int fd, r; + + DEBUG("%s: (%x.%x): address=%04x val=0x%08x len=%d\n", + 
__func__, ((d->devfn >> 3) & 0x1F), (d->devfn & 0x7), + (uint16_t) address, val, len); + + if (address == 0x4) { + pci_default_write_config(d, address, val, len); + /* Continue to program the card */ + } + + if ((address >= 0x10 && address <= 0x24) || address == 0x34 || + address == 0x3c || address == 0x3d) { + /* used for update-mappings (BAR emulation) */ + pci_default_write_config(d, address, val, len); + return; + } + DEBUG("%s: NON BAR (%x.%x): address=%04x val=0x%08x len=%d\n", + __func__, ((d->devfn >> 3) & 0x1F), (d->devfn & 0x7), + (uint16_t) address, val, len); + fd = ((AssignedDevice *)d)->real_device.config_fd; + r = lseek(fd, address, SEEK_SET); + if (r < 0) { + fprintf(stderr, "%s: bad seek, errno = %d\n", __func__, errno); + return; + } +again: + r = write(fd, &val, len); + if (r < 0) { + if (errno == EINTR || errno == EAGAIN) + goto again; + fprintf(stderr, "%s: write failed, errno = %d\n", __func__, errno); + } +} + +static uint32_t assigned_dev_pci_read_config(PCIDevice *d, uint32_t address, + int len) +{ + uint32_t val = 0; + int fd, r; + + if ((address >= 0x10 && address <= 0x24) || address == 0x34 || + address == 0x3c || address == 0x3d) { + val = pci_default_read_config(d, address, len); + DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n", + (d->devfn >> 3) & 0x1F, (d->devfn & 0x7), address, val, len); + return val; + } + + /* vga specific, remove later */ + if (address == 0xFC) + goto do_log; + + fd = ((AssignedDevice *)d)->real_device.config_fd; + r = lseek(fd, address, SEEK_SET); + if (r < 0) { + fprintf(stderr, "%s: bad seek, errno = %d\n", __func__, errno); + return val; + } +again: + r = read(fd, &val, len); + if (r < 0) { + if (errno == EINTR || errno == EAGAIN) + goto again; + fprintf(stderr, "%s: read failed, errno = %d\n", + __func__, errno); + } +do_log: + DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n", + (d->devfn >> 3) & 0x1F, (d->devfn & 0x7), address, val, len); + + /* kill the special capabilities */ + if (address == 4 && 
len == 4) + val &= ~0x100000; + else if (address == 6) + val &= ~0x10; + + return val; +} + +static int assigned_dev_register_regions(PCIRegion *io_regions, + unsigned long regions_num, + AssignedDevice *pci_dev) +{ + uint32_t i; + PCIRegion *cur_region = io_regions; + + for (i = 0; i < regions_num; i++, cur_region++) { + if (!cur_region->valid) + continue; + pci_dev->v_addrs[i].num = i; + + /* handle memory io regions */ + if (cur_region->type & IORESOURCE_MEM) { + int t = cur_region->type & IORESOURCE_PREFETCH + ? PCI_ADDRESS_SPACE_MEM_PREFETCH + : PCI_ADDRESS_SPACE_MEM; + + /* map physical memory */ + pci_dev->v_addrs[i].e_physbase = cur_region->base_addr; + pci_dev->v_addrs[i].r_virtbase = + mmap(NULL, + (cur_region->size + 0xFFF) & 0xFFFFF000, + PROT_WRITE | PROT_READ, MAP_SHARED, + cur_region->resource_fd, (off_t) 0); + + if ((void *) -1 == pci_dev->v_addrs[i].r_virtbase) { + fprintf(stderr, "%s: Error: Couldn't mmap 0x%x!" + "\n", __func__, + (uint32_t) (cur_region->base_addr)); + return -1; + } + pci_dev->v_addrs[i].r_size = cur_region->size; + pci_dev->v_addrs[i].e_size = 0; + + /* add offset */ + pci_dev->v_addrs[i].r_virtbase += + (cur_region->base_addr & 0xFFF); + + pci_register_io_region((PCIDevice *) pci_dev, i, + cur_region->size, t, + assigned_dev_iomem_map); + continue; + } + /* handle port io regions */ + pci_register_io_region((PCIDevice *) pci_dev, i, + cur_region->size, PCI_ADDRESS_SPACE_IO, + assigned_dev_ioport_map); + + pci_dev->v_addrs[i].e_physbase = cur_region->base_addr; + pci_dev->v_addrs[i].r_virtbase = + (void *)(long)cur_region->base_addr; + /* not relevant for port io */ + pci_dev->v_addrs[i].memory_index = 0; + } + + /* success */ + return 0; +} + +static int get_real_device(AssignedDevice *pci_dev, uint8_t r_bus, + uint8_t r_dev, uint8_t r_func) +{ + char dir[128], name[128], comp[16]; + int fd, r = 0; + FILE *f; + unsigned long long start, end, size, flags; + PCIRegion *rp; + PCIDevRegions *dev = &pci_dev->real_device; + + 
dev->region_number = 0; + + snprintf(dir, 128, "/sys/bus/pci/devices/0000:%02x:%02x.%x/", + r_bus, r_dev, r_func); + strncpy(name, dir, 128); + strncat(name, "config", 6); + fd = open(name, O_RDWR); + if (fd == -1) { + fprintf(stderr, "%s: %s: %m\n", __func__, name); + return 1; + } + dev->config_fd = fd; +again: + r = read(fd, pci_dev->dev.config, sizeof(pci_dev->dev.config)); + if (r < 0) { + if (errno == EINTR || errno == EAGAIN) + goto again; + fprintf(stderr, "%s: read failed, errno = %d\n", __func__, errno); + } + strncpy(name, dir, 128); + strncat(name, "resource", 8); + + f = fopen(name, "r"); + if (f == NULL) { + fprintf(stderr, "%s: %s: %m\n", __func__, name); + return 1; + } + r = -1; + while (fscanf(f, "%lli %lli %lli\n", &start, &end, &flags) == 3) { + r++; + rp = dev->regions + r; + rp->valid = 0; + size = end - start + 1; + flags &= IORESOURCE_IO | IORESOURCE_MEM | IORESOURCE_PREFETCH; + if (size == 0 || (flags & ~IORESOURCE_PREFETCH) == 0) + continue; + if (flags & IORESOURCE_MEM) { + flags &= ~IORESOURCE_IO; + snprintf(comp, 16, "resource%d", r); + strncpy(name, dir, 128); + strncat(name, comp, 16); + fd = open(name, O_RDWR); + if (fd == -1) + continue; /* probably ROM */ + rp->resource_fd = fd; + } else + flags &= ~IORESOURCE_PREFETCH; + + rp->type = flags; + rp->valid = 1; + rp->base_addr = start; + rp->size = size; + DEBUG("%s: region %d size %d start 0x%x type %d resource_fd %d\n", + __func__, r, rp->size, start, rp->type, rp->resource_fd); + } + fclose(f); + + dev->region_number = r; + return 0; +} + +static int disable_iommu; +int nr_assigned_devices; +static LIST_HEAD(, AssignedDevInfo) adev_head; + +static uint32_t calc_assigned_dev_id(uint8_t bus, uint8_t devfn) +{ + return (uint32_t)bus << 8 | (uint32_t)devfn; +} + +static AssignedDevice *register_real_device(PCIBus *e_bus, + const char *e_dev_name, + int e_devfn, uint8_t r_bus, + uint8_t r_dev, uint8_t r_func) +{ + int r; + AssignedDevice *pci_dev; + uint8_t e_device, e_intx; + + 
DEBUG("%s: Registering real physical device %s (devfn=0x%x)\n", + __func__, e_dev_name, e_devfn); + + pci_dev = (AssignedDevice *) + pci_register_device(e_bus, e_dev_name, sizeof(AssignedDevice), + e_devfn, assigned_dev_pci_read_config, + assigned_dev_pci_write_config); + if (NULL == pci_dev) { + fprintf(stderr, "%s: Error: Couldn't register real device %s\n", + __func__, e_dev_name); + return NULL; + } + if (get_real_device(pci_dev, r_bus, r_dev, r_func)) { + fprintf(stderr, "%s: Error: Couldn't get real device (%s)!\n", + __func__, e_dev_name); + goto out; + } + + /* handle real device's MMIO/PIO BARs */ + if (assigned_dev_register_regions(pci_dev->real_device.regions, + pci_dev->real_device.region_number, + pci_dev)) + goto out; + + /* handle interrupt routing */ + e_device = (pci_dev->dev.devfn >> 3) & 0x1f; + e_intx = pci_dev->dev.config[0x3d] - 1; + pci_dev->intpin = e_intx; + pci_dev->run = 0; + pci_dev->girq = 0; + pci_dev->h_busnr = r_bus; + pci_dev->h_devfn = PCI_DEVFN(r_dev, r_func); + +#ifdef KVM_CAP_DEVICE_ASSIGNMENT + if (kvm_enabled()) { + struct kvm_assigned_pci_dev assigned_dev_data; + + memset(&assigned_dev_data, 0, sizeof(assigned_dev_data)); + assigned_dev_data.assigned_dev_id = + calc_assigned_dev_id(pci_dev->h_busnr, + (uint32_t)pci_dev->h_devfn); + assigned_dev_data.busnr = pci_dev->h_busnr; + assigned_dev_data.devfn = pci_dev->h_devfn; + +#ifdef KVM_CAP_IOMMU + /* We always enable the IOMMU if present + * (or when not disabled on the command line) + */ + r = kvm_check_extension(kvm_context, KVM_CAP_IOMMU); + if (r && !disable_iommu) + assigned_dev_data.flags |= KVM_DEV_ASSIGN_ENABLE_IOMMU; +#endif + r = kvm_assign_pci_device(kvm_context, &assigned_dev_data); + if (r < 0) { + fprintf(stderr, "Could not notify kernel about " + "assigned device \"%s\"\n", e_dev_name); + perror("register_real_device"); + goto out; + } + } +#endif + term_printf("Registered host PCI device %02x:%02x.%1x " + "(\"%s\") as guest device %02x:%02x.%1x\n", + r_bus, 
r_dev, r_func, e_dev_name, + pci_bus_num(e_bus), e_device, r_func); + + return pci_dev; +out: +/* pci_unregister_device(&pci_dev->dev); */ + return NULL; +} + +#ifdef KVM_CAP_DEVICE_ASSIGNMENT +/* The pci config space got updated. Check if irq numbers have changed + * for our devices + */ +void assigned_dev_update_irq(PCIDevice *d) +{ + int irq, r; + AssignedDevice *assigned_dev; + AssignedDevInfo *adev; + + LIST_FOREACH(adev, &adev_head, next) { + assigned_dev = adev->assigned_dev; + irq = pci_map_irq(&assigned_dev->dev, assigned_dev->intpin); + irq = piix_get_irq(irq); + + if (irq != assigned_dev->girq) { + struct kvm_assigned_irq assigned_irq_data; + + memset(&assigned_irq_data, 0, sizeof(assigned_irq_data)); + assigned_irq_data.assigned_dev_id = + calc_assigned_dev_id(assigned_dev->h_busnr, + (uint8_t) assigned_dev->h_devfn); + assigned_irq_data.guest_irq = irq; + assigned_irq_data.host_irq = assigned_dev->real_device.irq; + r = kvm_assign_irq(kvm_context, &assigned_irq_data); + if (r < 0) { + perror("assigned_dev_update_irq"); + fprintf(stderr, "Are you assigning a device " + "that shares IRQ with some other device?\n"); + pci_unregister_device(&assigned_dev->dev); + /* FIXME: Delete node from list */ + continue; + } + assigned_dev->girq = irq; + } + } +} +#endif + +struct PCIDevice *init_assigned_device(AssignedDevInfo *adev, PCIBus *bus) +{ + adev->assigned_dev = register_real_device(bus, + adev->name, -1, + adev->bus, + adev->dev, + adev->func); + return &adev->assigned_dev->dev; +} + +int init_all_assigned_devices(PCIBus *bus) +{ + struct AssignedDevInfo *adev; + + LIST_FOREACH(adev, &adev_head, next) + if (init_assigned_device(adev, bus) == NULL) + return -1; + return 0; +} + +/* + * Syntax to assign device: + * + * -pcidevice dev=bus:dev.func,dma=dma + * + * Example: + * -pcidevice host=00:13.0,dma=pvdma + * + * dma can currently only be 'none' to disable iommu support. 
+ */ +AssignedDevInfo *add_assigned_device(const char *arg) +{ + char *cp, *cp1; + char device[8]; + char dma[6]; + int r; + AssignedDevInfo *adev; + + adev = qemu_mallocz(sizeof(AssignedDevInfo)); + if (adev == NULL) { + fprintf(stderr, "%s: Out of memory\n", __func__); + return NULL; + } + r = get_param_value(device, sizeof(device), "host", arg); + r = get_param_value(adev->name, sizeof(adev->name), "name", arg); + if (!r) + strncpy(adev->name, device, 8); + +#ifdef KVM_CAP_IOMMU + r = get_param_value(dma, sizeof(dma), "dma", arg); + if (r && !strncmp(dma, "none", 4)) + disable_iommu = 1; +#endif + cp = device; + adev->bus = strtoul(cp, &cp1, 16); + if (*cp1 != ':') + goto bad; + cp = cp1 + 1; + + adev->dev = strtoul(cp, &cp1, 16); + if (*cp1 != '.') + goto bad; + cp = cp1 + 1; + + adev->func = strtoul(cp, &cp1, 16); + + nr_assigned_devices++; + LIST_INSERT_HEAD(&adev_head, adev, next); + return adev; +bad: + fprintf(stderr, "pcidevice argument parse error; " + "please check the help text for usage\n"); + qemu_free(adev); + return NULL; +} diff --git a/qemu/hw/device-assignment.h b/qemu/hw/device-assignment.h new file mode 100644 index 0000000..e4148df --- /dev/null +++ b/qemu/hw/device-assignment.h @@ -0,0 +1,98 @@ +/* + * Copyright (c) 2007, Neocleus Corporation. + * Copyright (c) 2007, Intel Corporation. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. 
+ * + * You should have received a copy of the GNU General Public License along with + * this program; if not, write to the Free Software Foundation, Inc., 59 Temple + * Place - Suite 330, Boston, MA 02111-1307 USA. + * + * Data structures for storing PCI state + * + * Adapted to kvm by Qumranet + * + * Copyright (c) 2007, Neocleus, Alex Novik (alex@neocleus.com) + * Copyright (c) 2007, Neocleus, Guy Zana (guy@neocleus.com) + * Copyright (C) 2008, Qumranet, Amit Shah (amit.shah@qumranet.com) + * Copyright (C) 2008, Red Hat, Amit Shah (amit.shah@redhat.com) + */ + +#ifndef __DEVICE_ASSIGNMENT_H__ +#define __DEVICE_ASSIGNMENT_H__ + +#include <sys/mman.h> +#include "qemu-common.h" +#include "sys-queue.h" +#include "pci.h" + +/* From include/linux/pci.h in the kernel sources */ +#define PCI_DEVFN(slot, func) ((((slot) & 0x1f) << 3) | ((func) & 0x07)) + +#define MAX_IO_REGIONS (6) + +typedef struct { + int type; /* Memory or port I/O */ + int valid; + uint32_t base_addr; + uint32_t size; /* size of the region */ + int resource_fd; +} PCIRegion; + +typedef struct { + uint8_t bus, dev, func; /* Bus inside domain, device and function */ + int irq; /* IRQ number */ + uint16_t region_number; /* number of active regions */ + + /* Port I/O or MMIO Regions */ + PCIRegion regions[MAX_IO_REGIONS]; + int config_fd; +} PCIDevRegions; + +typedef struct { + target_phys_addr_t e_physbase; + uint32_t memory_index; + void *r_virtbase; /* mmapped access address */ + int num; /* our index within v_addrs[] */ + uint32_t e_size; /* emulated size of region in bytes */ + uint32_t r_size; /* real size of region in bytes */ +} AssignedDevRegion; + +typedef struct { + PCIDevice dev; + int intpin; + uint8_t debug_flags; + AssignedDevRegion v_addrs[PCI_NUM_REGIONS]; + PCIDevRegions real_device; + int run; + int girq; + unsigned char h_busnr; + unsigned int h_devfn; + int bound; +} AssignedDevice; + +typedef struct AssignedDevInfo AssignedDevInfo; + +struct AssignedDevInfo { + char name[15]; + int 
bus; + int dev; + int func; + AssignedDevice *assigned_dev; + LIST_ENTRY(AssignedDevInfo) next; +}; + +PCIDevice *init_assigned_device(AssignedDevInfo *adev, PCIBus *bus); +int init_all_assigned_devices(PCIBus *bus); +AssignedDevInfo *add_assigned_device(const char *arg); +void assigned_dev_set_vector(int irq, int vector); +void assigned_dev_ack_mirq(int vector); + +#endif /* __DEVICE_ASSIGNMENT_H__ */ diff --git a/qemu/hw/pc.c b/qemu/hw/pc.c index d559f0c..e0438ed 100644 --- a/qemu/hw/pc.c +++ b/qemu/hw/pc.c @@ -33,6 +33,7 @@ #include "boards.h" #include "console.h" #include "fw_cfg.h" +#include "device-assignment.h" #include "qemu-kvm.h" @@ -993,6 +994,11 @@ static void pc_init1(ram_addr_t ram_size, int vga_ram_size, } } + /* Initialize assigned devices */ + if (pci_enabled) + if(init_all_assigned_devices(pci_bus)) + exit(1); + rtc_state = rtc_init(0x70, i8259[8]); qemu_register_boot_set(pc_boot_set, rtc_state); diff --git a/qemu/hw/pci.c b/qemu/hw/pci.c index c82cd20..f86a8a7 100644 --- a/qemu/hw/pci.c +++ b/qemu/hw/pci.c @@ -50,6 +50,7 @@ struct PCIBus { static void pci_update_mappings(PCIDevice *d); static void pci_set_irq(void *opaque, int irq_num, int level); +void assigned_dev_update_irq(PCIDevice *d); target_phys_addr_t pci_mem_base; static int pci_irq_index; @@ -453,6 +454,12 @@ void pci_default_write_config(PCIDevice *d, val >>= 8; } +#ifdef KVM_CAP_DEVICE_ASSIGNMENT + if (kvm_enabled() && qemu_kvm_irqchip_in_kernel() && + address >= 0x60 && address <= 0x63) + assigned_dev_update_irq(d); +#endif + end = address + len; if (end > PCI_COMMAND && address < (PCI_COMMAND + 2)) { /* if the command register is modified, we must modify the mappings */ diff --git a/qemu/vl.c b/qemu/vl.c index 388e79d..5a39d12 100644 --- a/qemu/vl.c +++ b/qemu/vl.c @@ -38,6 +38,7 @@ #include "qemu-char.h" #include "block.h" #include "audio/audio.h" +#include "hw/device-assignment.h" #include "migration.h" #include "balloon.h" #include "qemu-kvm.h" @@ -8692,6 +8693,12 @@ static void 
help(int exitcode) #endif "-no-kvm-irqchip disable KVM kernel mode PIC/IOAPIC/LAPIC\n" "-no-kvm-pit disable KVM kernel mode PIT\n" +#if defined(TARGET_I386) || defined(TARGET_X86_64) || defined(__linux__) + "-pcidevice host=bus:dev.func[,dma=none][,name=\"string\"]\n" + " expose a PCI device to the guest OS.\n" + " dma=none: don't perform any dma translations (default is to use an iommu)\n" + " 'string' is used in log output.\n" +#endif #endif #ifdef TARGET_I386 "-no-acpi disable ACPI\n" @@ -8811,6 +8818,9 @@ enum { QEMU_OPTION_no_kvm, QEMU_OPTION_no_kvm_irqchip, QEMU_OPTION_no_kvm_pit, +#if defined(TARGET_I386) || defined(TARGET_X86_64) || defined(__linux__) + QEMU_OPTION_pcidevice, +#endif QEMU_OPTION_no_reboot, QEMU_OPTION_no_shutdown, QEMU_OPTION_show_cursor, @@ -8900,6 +8910,9 @@ static const QEMUOption qemu_options[] = { #endif { "no-kvm-irqchip", 0, QEMU_OPTION_no_kvm_irqchip }, { "no-kvm-pit", 0, QEMU_OPTION_no_kvm_pit }, +#if defined(TARGET_I386) || defined(TARGET_X86_64) || defined(__linux__) + { "pcidevice", HAS_ARG, QEMU_OPTION_pcidevice }, +#endif #endif #if defined(TARGET_PPC) || defined(TARGET_SPARC) { "g", 1, QEMU_OPTION_g }, @@ -9844,6 +9857,11 @@ int main(int argc, char **argv) kvm_pit = 0; break; } +#if defined(TARGET_I386) || defined(TARGET_X86_64) || defined(__linux__) + case QEMU_OPTION_pcidevice: + add_assigned_device(optarg); + break; +#endif #endif case QEMU_OPTION_usb: usb_enabled = 1; -- 1.6.0.2 ^ permalink raw reply related [flat|nested] 23+ messages in thread
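For readers following the -pcidevice syntax above, a minimal sketch of parsing the bus:dev.func part of host= might look like the following. This is not the patch's code: `struct bdf`, `parse_hex_field` and `parse_bdf` are hypothetical names, and the range limits (bus up to 0xff, device up to 0x1f, function up to 0x7) are assumed from conventional PCI addressing rather than taken from this thread. Unlike the parser in the patch, this sketch also rejects trailing characters after the function number.

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical container for a parsed PCI address (not from the patch). */
struct bdf {
    unsigned bus;
    unsigned dev;
    unsigned func;
};

/*
 * Parse one hexadecimal field that must be followed by the byte `term`.
 * On success, store the value, advance *s past the terminator (or leave
 * it on the NUL when term is '\0') and return 0. Return -1 on any error.
 */
static int parse_hex_field(const char **s, char term, unsigned long max,
                           unsigned *out)
{
    char *end;
    unsigned long v;

    /* strtoul() would accept leading whitespace and signs; reject them. */
    if (!isxdigit((unsigned char)**s))
        return -1;

    v = strtoul(*s, &end, 16);
    if (end == *s || *end != term || v > max)
        return -1;

    *out = (unsigned)v;
    *s = (term == '\0') ? end : end + 1;
    return 0;
}

/* Parse "bus:dev.func" (all hex). Returns 0 on success, -1 on error. */
int parse_bdf(const char *s, struct bdf *out)
{
    if (parse_hex_field(&s, ':', 0xff, &out->bus) ||
        parse_hex_field(&s, '.', 0x1f, &out->dev) ||
        parse_hex_field(&s, '\0', 0x7, &out->func))
        return -1;
    return 0;
}
```

With this sketch, parse_bdf("04:08.0", &addr) would fill in bus 4, device 8, function 0, while a string such as "04:08" (no function) is reported as an error instead of being silently accepted.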
* [PATCH 6/6] KVM/userspace: Device Assignment: Support for hot plugging PCI devices 2008-10-24 15:26 ` [PATCH 5/6] KVM/userspace: Device Assignment: Support for assigning PCI devices to guests Amit Shah @ 2008-10-24 15:26 ` Amit Shah 2008-10-24 16:22 ` [PATCH 5/6] KVM/userspace: Device Assignment: Support for assigning PCI devices to guests Anthony Liguori 2008-10-27 1:28 ` Su, Disheng 2 siblings, 0 replies; 23+ messages in thread From: Amit Shah @ 2008-10-24 15:26 UTC (permalink / raw) To: avi; +Cc: kvm, anthony, weidong.han, allen.m.kay, muli, benami, Amit Shah This patch adds support for hot-plugging host PCI devices into guests Signed-off-by: Amit Shah <amit.shah@redhat.com> --- qemu/hw/device-hotplug.c | 21 +++++++++++++++++++++ qemu/monitor.c | 2 +- 2 files changed, 22 insertions(+), 1 deletions(-) diff --git a/qemu/hw/device-hotplug.c b/qemu/hw/device-hotplug.c index 8e2bc35..817e708 100644 --- a/qemu/hw/device-hotplug.c +++ b/qemu/hw/device-hotplug.c @@ -6,6 +6,7 @@ #include "pc.h" #include "console.h" #include "block_int.h" +#include "device-assignment.h" #define PCI_BASE_CLASS_STORAGE 0x01 #define PCI_BASE_CLASS_NETWORK 0x02 @@ -27,6 +28,24 @@ static PCIDevice *qemu_system_hot_add_nic(const char *opts, int bus_nr) return pci_nic_init (pci_bus, &nd_table[ret], -1); } +static PCIDevice *qemu_system_hot_assign_device(const char *opts, int bus_nr) +{ + PCIBus *pci_bus; + AssignedDevInfo *adev; + + pci_bus = pci_find_bus(bus_nr); + if (!pci_bus) { + term_printf ("Can't find pci_bus %d\n", bus_nr); + return NULL; + } + adev = add_assigned_device(opts); + if (adev == NULL) { + term_printf ("Error adding device; check syntax\n"); + return NULL; + } + return init_assigned_device(adev, pci_bus); +} + static int add_init_drive(const char *opts) { int drive_opt_idx, drive_idx; @@ -143,6 +162,8 @@ void device_hot_add(int pcibus, const char *type, const char *opts) dev = qemu_system_hot_add_nic(opts, pcibus); else if (strcmp(type, "storage") == 0) dev = 
qemu_system_hot_add_storage(opts, pcibus); + else if (strcmp(type, "host") == 0) + dev = qemu_system_hot_assign_device(opts, pcibus); else term_printf("invalid type: %s\n", type); diff --git a/qemu/monitor.c b/qemu/monitor.c index 79b6b4c..d1043b1 100644 --- a/qemu/monitor.c +++ b/qemu/monitor.c @@ -1529,7 +1529,7 @@ static const term_cmd_t term_cmds[] = { "[,cyls=c,heads=h,secs=s[,trans=t]]\n" "[snapshot=on|off][,cache=on|off]", "add drive to PCI storage controller" }, - { "pci_add", "iss", device_hot_add, "bus nic|storage [[vlan=n][,macaddr=addr][,model=type]] [file=file][,if=type][,bus=nr]...", "hot-add PCI device" }, + { "pci_add", "iss", device_hot_add, "bus nic|storage|host [[vlan=n][,macaddr=addr][,model=type]] [file=file][,if=type][,bus=nr]... [host=02:00.0[,name=string][,dma=none]]", "hot-add PCI device" }, { "pci_del", "ii", device_hot_remove, "bus slot-number", "hot remove PCI device" }, #endif { "balloon", "i", do_balloon, -- 1.6.0.2 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH 5/6] KVM/userspace: Device Assignment: Support for assigning PCI devices to guests 2008-10-24 15:26 ` [PATCH 5/6] KVM/userspace: Device Assignment: Support for assigning PCI devices to guests Amit Shah 2008-10-24 15:26 ` [PATCH 6/6] KVM/userspace: Device Assignment: Support for hot plugging PCI devices Amit Shah @ 2008-10-24 16:22 ` Anthony Liguori 2008-10-26 12:54 ` Avi Kivity 2008-10-28 10:11 ` Muli Ben-Yehuda 2008-10-27 1:28 ` Su, Disheng 2 siblings, 2 replies; 23+ messages in thread From: Anthony Liguori @ 2008-10-24 16:22 UTC (permalink / raw) To: Amit Shah; +Cc: avi, kvm, weidong.han, allen.m.kay, muli, benami Amit Shah wrote: > This patch has been contributed to by the following people: > > From: Or Sagi <ors@tutis.com> > From: Nir Peleg <nir@tutis.com> > From: Amit Shah <amit.shah@redhat.com> > From: Ben-Ami Yassour <benami@il.ibm.com> > From: Weidong Han <weidong.han@intel.com> > From: Glauber de Oliveira Costa <gcosta@redhat.com> > > With this patch, we can assign a device on the host machine to a > guest. > > A new command-line option, -pcidevice is added. > To invoke it for a device sitting at PCI bus:dev.fn 04:08.0, use this: > > -pcidevice host=04:08.0 > > * The host driver for the device, if any, is to be removed before > assigning the device (else device assignment will fail). > > * A device that shares IRQ with another host device cannot currently > be assigned. > > * The RAW_IO capability is needed for this to work > > This works only with the in-kernel irqchip method; to use the > userspace irqchip, a kernel module (irqhook) and some extra changes > are needed. 
> > Signed-off-by: Amit Shah <amit.shah@redhat.com> > --- > qemu/Makefile.target | 1 + > qemu/hw/device-assignment.c | 619 +++++++++++++++++++++++++++++++++++++++++++ > qemu/hw/device-assignment.h | 98 +++++++ > qemu/hw/pc.c | 6 + > qemu/hw/pci.c | 7 + > qemu/vl.c | 18 ++ > 6 files changed, 749 insertions(+), 0 deletions(-) > create mode 100644 qemu/hw/device-assignment.c > create mode 100644 qemu/hw/device-assignment.h > > diff --git a/qemu/Makefile.target b/qemu/Makefile.target > index d9bdeca..05a1d84 100644 > --- a/qemu/Makefile.target > +++ b/qemu/Makefile.target > @@ -621,6 +621,7 @@ OBJS+= ide.o pckbd.o ps2.o vga.o $(SOUND_HW) dma.o > OBJS+= fdc.o mc146818rtc.o serial.o i8259.o i8254.o pcspk.o pc.o > OBJS+= cirrus_vga.o apic.o parallel.o acpi.o piix_pci.o > OBJS+= usb-uhci.o vmmouse.o vmport.o vmware_vga.o extboot.o > +OBJS+= device-assignment.o > ifeq ($(USE_KVM_PIT), 1) > OBJS+= i8254-kvm.o > endif > diff --git a/qemu/hw/device-assignment.c b/qemu/hw/device-assignment.c > new file mode 100644 > index 0000000..5ba21a0 > --- /dev/null > +++ b/qemu/hw/device-assignment.c > @@ -0,0 +1,619 @@ > +/* > + * Copyright (c) 2007, Neocleus Corporation. > + * > + * This program is free software; you can redistribute it and/or modify it > + * under the terms and conditions of the GNU General Public License, > + * version 2, as published by the Free Software Foundation. > + * > + * This program is distributed in the hope it will be useful, but WITHOUT > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or > + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for > + * more details. > + * > + * You should have received a copy of the GNU General Public License along with > + * this program; if not, write to the Free Software Foundation, Inc., 59 Temple > + * Place - Suite 330, Boston, MA 02111-1307 USA. > + * > + * > + * Assign a PCI device from the host to a guest VM. > + * > + * Adapted for KVM by Qumranet. 
> + * > + * Copyright (c) 2007, Neocleus, Alex Novik (alex@neocleus.com) > + * Copyright (c) 2007, Neocleus, Guy Zana (guy@neocleus.com) > + * Copyright (C) 2008, Qumranet, Amit Shah (amit.shah@qumranet.com) > + * Copyright (C) 2008, Red Hat, Amit Shah (amit.shah@redhat.com) > + */ > +#include <stdio.h> > +#include <sys/io.h> > +#include "qemu-kvm.h" > +#include "hw.h" > +#include "pc.h" > +#include "sysemu.h" > +#include "console.h" > +#include <linux/kvm_para.h> > Is this header really necessary? > +#include "device-assignment.h" > + > +/* From linux/ioport.h */ > +#define IORESOURCE_IO 0x00000100 /* Resource type */ > +#define IORESOURCE_MEM 0x00000200 > +#define IORESOURCE_IRQ 0x00000400 > +#define IORESOURCE_DMA 0x00000800 > +#define IORESOURCE_PREFETCH 0x00001000 /* No side effects */ > + > +/* #define DEVICE_ASSIGNMENT_DEBUG 1 */ > + > +#ifdef DEVICE_ASSIGNMENT_DEBUG > +#define DEBUG(fmt, args...) \ > Please use C99 style variadics. > + do { \ > + fprintf(stderr, "%s: " fmt, __func__ , ## args); \ > + } while (0) > +#else > +#define DEBUG(fmt, args...) do { } while(0) > +#endif > + > +static void assigned_dev_ioport_writeb(void *opaque, uint32_t addr, > + uint32_t value) > +{ > + AssignedDevRegion *r_access = (AssignedDevRegion *)opaque; > Cast is unnecessary. > + uint32_t r_pio = (unsigned long)r_access->r_virtbase > + + (addr - r_access->e_physbase); > It would be nice to make this a function to make it more obvious that you are translating from guest to host regions. The cast to unsigned long should probably be target_ulong too. > + DEBUG(stderr, "%s: r_pio=%08x e_physbase=%08x" > + " r_virtbase=%08lx value=%08x\n", > + __func__, r_pio, (int)r_access->e_physbase, > + (unsigned long)r_access->r_virtbase, value); > This debug statement looks wrong to me. You're passing stderr. The same is true for all of these functions.
> +static void assigned_dev_iomem_map(PCIDevice *pci_dev, int region_num, > + uint32_t e_phys, uint32_t e_size, int type) > +{ > + AssignedDevice *r_dev = (AssignedDevice *) pci_dev; > + AssignedDevRegion *region = &r_dev->v_addrs[region_num]; > + int first_map = (region->e_size == 0); > + int ret = 0; > + > + DEBUG("%s: e_phys=%08x r_virt=%x type=%d len=%08x region_num=%d \n", > + __func__, e_phys, (uint32_t)region->r_virtbase, type, e_size, > + region_num); > You already have __func__ in your debug printf(). > + region->e_physbase = e_phys; > + region->e_size = e_size; > + > + /* FIXME: Add support for emulated MMIO for non-kvm guests */ > + if (kvm_enabled()) { > I don't think having a kvm_enabled() check here is very useful. I think device-assignment.c should be conditional on USE_KVM, and the only kvm_enabled() check should be when creating the initial device assignment. Practically speaking, QEMU is never going to support device assignment outside of the context of KVM because I strongly doubt anything like irqhook will make it upstream. > + if (!first_map) > + kvm_destroy_phys_mem(kvm_context, e_phys, e_size); > + if (e_size > 0) > + ret = kvm_register_phys_mem(kvm_context, e_phys, > + region->r_virtbase, e_size, 0); > + if (ret != 0) > + fprintf(stderr, "%s: Error: create new mapping failed\n", __func__); > If we do get an error here, we shouldn't keep going. This error is probably going to happen in practice if a guest tries to pass through too many devices and we run out of slots. 
> + } > +} > + > +static void assigned_dev_ioport_map(PCIDevice *pci_dev, int region_num, > + uint32_t addr, uint32_t size, int type) > +{ > + AssignedDevice *r_dev = (AssignedDevice *) pci_dev; > + AssignedDevRegion *region = &r_dev->v_addrs[region_num]; > + int r; > + > + region->e_physbase = addr; > + region->e_size = size; > + > + DEBUG("%s: e_phys=0x%x r_virt=%x type=0x%x len=%d region_num=%d \n", > + __func__, addr, (uint32_t)region->r_virtbase, type, size, region_num); > Need to fix this DEBUG(). > + r = ioperm((uint32_t)region->r_virtbase, size, 1); > I don't think this is enough for KVM. This will only do the ioperm in the thread that triggered the IO. If you have an SMP guest, ioperm needs to be done on each VCPU's thread. > + if (r < 0) { > + perror("assigned_dev_ioport_map: ioperm"); > + return; > + } > Again, if we fail, we have to exit QEMU gracefully. > + register_ioport_read(addr, size, 1, assigned_dev_ioport_readb, > + (void *) (r_dev->v_addrs + region_num)); > + register_ioport_read(addr, size, 2, assigned_dev_ioport_readw, > + (void *) (r_dev->v_addrs + region_num)); > + register_ioport_read(addr, size, 4, assigned_dev_ioport_readl, > + (void *) (r_dev->v_addrs + region_num)); > + register_ioport_write(addr, size, 1, assigned_dev_ioport_writeb, > + (void *) (r_dev->v_addrs + region_num)); > + register_ioport_write(addr, size, 2, assigned_dev_ioport_writew, > + (void *) (r_dev->v_addrs + region_num)); > + register_ioport_write(addr, size, 4, assigned_dev_ioport_writel, > + (void *) (r_dev->v_addrs + region_num)); > +} > You never need to explicitly cast a pointer to void *. 
> +static void assigned_dev_pci_write_config(PCIDevice *d, uint32_t address, > + uint32_t val, int len) > +{ > + int fd, r; > + > + DEBUG("%s: (%x.%x): address=%04x val=0x%08x len=%d\n", > + __func__, ((d->devfn >> 3) & 0x1F), (d->devfn & 0x7), > + (uint16_t) address, val, len); > bad DEBUG() > + if (address == 0x4) { > + pci_default_write_config(d, address, val, len); > + /* Continue to program the card */ > + } > + > + if ((address >= 0x10 && address <= 0x24) || address == 0x34 || > + address == 0x3c || address == 0x3d) { > + /* used for update-mappings (BAR emulation) */ > + pci_default_write_config(d, address, val, len); > + return; > + } > + DEBUG("%s: NON BAR (%x.%x): address=%04x val=0x%08x len=%d\n", > + __func__, ((d->devfn >> 3) & 0x1F), (d->devfn & 0x7), > + (uint16_t) address, val, len); > + fd = ((AssignedDevice *)d)->real_device.config_fd; > + r = lseek(fd, address, SEEK_SET); > + if (r < 0) { > + fprintf(stderr, "%s: bad seek, errno = %d\n", __func__, errno); > + return; > + } > +again: > + r = write(fd, &val, len); > Can you just do a pwrite()? That'll make things simpler. > + if (r < 0) { > + if (errno == EINTR || errno == EAGAIN) > + goto again; > + fprintf(stderr, "%s: write failed, errno = %d\n", __func__, errno); > + } > +} > + > +static uint32_t assigned_dev_pci_read_config(PCIDevice *d, uint32_t address, > + int len) > +{ > + uint32_t val = 0; > + int fd, r; > + > + if ((address >= 0x10 && address <= 0x24) || address == 0x34 || > + address == 0x3c || address == 0x3d) { > + val = pci_default_read_config(d, address, len); > + DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n", > + (d->devfn >> 3) & 0x1F, (d->devfn & 0x7), address, val, len); > + return val; > + } > + > + /* vga specific, remove later */ > + if (address == 0xFC) > + goto do_log; > Can you explain the point of this? 
> + fd = ((AssignedDevice *)d)->real_device.config_fd; > + r = lseek(fd, address, SEEK_SET); > + if (r < 0) { > + fprintf(stderr, "%s: bad seek, errno = %d\n", __func__, errno); > + return val; > + } > +again: > + r = read(fd, &val, len); > pread(). > + if (r < 0) { > + if (errno == EINTR || errno == EAGAIN) > + goto again; > + fprintf(stderr, "%s: read failed, errno = %d\n", > + __func__, errno); > Should bail out gracefully. > +static int assigned_dev_register_regions(PCIRegion *io_regions, > + unsigned long regions_num, > + AssignedDevice *pci_dev) > +{ > + uint32_t i; > + PCIRegion *cur_region = io_regions; > + > + for (i = 0; i < regions_num; i++, cur_region++) { > + if (!cur_region->valid) > + continue; > + pci_dev->v_addrs[i].num = i; > + > + /* handle memory io regions */ > + if (cur_region->type & IORESOURCE_MEM) { > + int t = cur_region->type & IORESOURCE_PREFETCH > + ? PCI_ADDRESS_SPACE_MEM_PREFETCH > + : PCI_ADDRESS_SPACE_MEM; > + > + /* map physical memory */ > + pci_dev->v_addrs[i].e_physbase = cur_region->base_addr; > + pci_dev->v_addrs[i].r_virtbase = > + mmap(NULL, > + (cur_region->size + 0xFFF) & 0xFFFFF000, > + PROT_WRITE | PROT_READ, MAP_SHARED, > + cur_region->resource_fd, (off_t) 0); > + > + if ((void *) -1 == pci_dev->v_addrs[i].r_virtbase) { > Please use MAP_FAILED and don't use a defensive if. > + fprintf(stderr, "%s: Error: Couldn't mmap 0x%x!" 
> + "\n", __func__, > + (uint32_t) (cur_region->base_addr)); > + return -1; > + } > + pci_dev->v_addrs[i].r_size = cur_region->size; > + pci_dev->v_addrs[i].e_size = 0; > + > + /* add offset */ > + pci_dev->v_addrs[i].r_virtbase += > + (cur_region->base_addr & 0xFFF); > + > + pci_register_io_region((PCIDevice *) pci_dev, i, > + cur_region->size, t, > + assigned_dev_iomem_map); > + continue; > + } > + /* handle port io regions */ > + pci_register_io_region((PCIDevice *) pci_dev, i, > + cur_region->size, PCI_ADDRESS_SPACE_IO, > + assigned_dev_ioport_map); > + > + pci_dev->v_addrs[i].e_physbase = cur_region->base_addr; > + pci_dev->v_addrs[i].r_virtbase = > + (void *)(long)cur_region->base_addr; > I think virtbase would make more sense as a target_ulong. > + /* not relevant for port io */ > + pci_dev->v_addrs[i].memory_index = 0; > + } > + > + /* success */ > + return 0; > +} > + > +static int get_real_device(AssignedDevice *pci_dev, uint8_t r_bus, > + uint8_t r_dev, uint8_t r_func) > +{ > + char dir[128], name[128], comp[16]; > + int fd, r = 0; > + FILE *f; > + unsigned long long start, end, size, flags; > + PCIRegion *rp; > + PCIDevRegions *dev = &pci_dev->real_device; > + > + dev->region_number = 0; > + > + snprintf(dir, 128, "/sys/bus/pci/devices/0000:%02x:%02x.%x/", > + r_bus, r_dev, r_func); > just use sizeof(). > + strncpy(name, dir, 128); > + strncat(name, "config", 6); > strncpy() doesn't do what you think it does. Why not just snprintf(name, sizeof(name), "%sconfig", dir)? > + fd = open(name, O_RDWR); > + if (fd == -1) { > + fprintf(stderr, "%s: %s: %m\n", __func__, name); > + return 1; > + } > + dev->config_fd = fd; > +again: > + r = read(fd, pci_dev->dev.config, sizeof(pci_dev->dev.config)); > + if (r < 0) { > + if (errno == EINTR || errno == EAGAIN) > + goto again; > + fprintf(stderr, "%s: read failed, errno = %d\n", __func__, errno); > + } > + strncpy(name, dir, 128); > + strncat(name, "resource", 8); > Just use snprintf(). 
> + f = fopen(name, "r"); > + if (f == NULL) { > + fprintf(stderr, "%s: %s: %m\n", __func__, name); > + return 1; > + } > + r = -1; > + while (fscanf(f, "%lli %lli %lli\n", &start, &end, &flags) == 3) { > + r++; > + rp = dev->regions + r; > + rp->valid = 0; > + size = end - start + 1; > + flags &= IORESOURCE_IO | IORESOURCE_MEM | IORESOURCE_PREFETCH; > + if (size == 0 || (flags & ~IORESOURCE_PREFETCH) == 0) > + continue; > + if (flags & IORESOURCE_MEM) { > + flags &= ~IORESOURCE_IO; > + snprintf(comp, 16, "resource%d", r); > + strncpy(name, dir, 128); > + strncat(name, comp, 16); > snprintf(name, sizeof(name), "%sresource%d", dir, r). > +/* > + * Syntax to assign device: > + * > + * -pcidevice dev=bus:dev.func,dma=dma > + * > + * Example: > + * -pcidevice host=00:13.0,dma=pvdma > + * > + * dma can currently only be 'none' to disable iommu support. > + */ > +AssignedDevInfo *add_assigned_device(const char *arg) > +{ > + char *cp, *cp1; > + char device[8]; > + char dma[6]; > + int r; > + AssignedDevInfo *adev; > + > + adev = qemu_mallocz(sizeof(AssignedDevInfo)); > + if (adev == NULL) { > + fprintf(stderr, "%s: Out of memory\n", __func__); > + return NULL; > + } > + r = get_param_value(device, sizeof(device), "host", arg); > + r = get_param_value(adev->name, sizeof(adev->name), "name", arg); > + if (!r) > + strncpy(adev->name, device, 8); > + > +#ifdef KVM_CAP_IOMMU > + r = get_param_value(dma, sizeof(dma), "dma", arg); > + if (r && !strncmp(dma, "none", 4)) > + disable_iommu = 1; > +#endif > + cp = device; > + adev->bus = strtoul(cp, &cp1, 16); > + if (*cp1 != ':') > + goto bad; > + cp = cp1 + 1; > + > + adev->dev = strtoul(cp, &cp1, 16); > + if (*cp1 != '.') > + goto bad; > + cp = cp1 + 1; > + > + adev->func = strtoul(cp, &cp1, 16); > + > + nr_assigned_devices++; > + LIST_INSERT_HEAD(&adev_head, adev, next); > + return adev; > +bad: > + fprintf(stderr, "pcidevice argument parse error; " > + "please check the help text for usage\n"); > + qemu_free(adev); > + return 
NULL; > +} > > diff --git a/qemu/vl.c b/qemu/vl.c > index 388e79d..5a39d12 100644 > --- a/qemu/vl.c > +++ b/qemu/vl.c > @@ -38,6 +38,7 @@ > #include "qemu-char.h" > #include "block.h" > #include "audio/audio.h" > +#include "hw/device-assignment.h" > #include "migration.h" > #include "balloon.h" > #include "qemu-kvm.h" > @@ -8692,6 +8693,12 @@ static void help(int exitcode) > #endif > "-no-kvm-irqchip disable KVM kernel mode PIC/IOAPIC/LAPIC\n" > "-no-kvm-pit disable KVM kernel mode PIT\n" > +#if defined(TARGET_I386) || defined(TARGET_X86_64) || defined(__linux__) > + "-pcidevice host=bus:dev.func[,dma=none][,name=\"string\"]\n" > + " expose a PCI device to the guest OS.\n" > + " dma=none: don't perform any dma translations (default is to use an iommu)\n" > + " 'string' is used in log output.\n" > +#endif > #endif > #ifdef TARGET_I386 > "-no-acpi disable ACPI\n" > @@ -8811,6 +8818,9 @@ enum { > QEMU_OPTION_no_kvm, > QEMU_OPTION_no_kvm_irqchip, > QEMU_OPTION_no_kvm_pit, > +#if defined(TARGET_I386) || defined(TARGET_X86_64) || defined(__linux__) > + QEMU_OPTION_pcidevice, > +#endif > QEMU_OPTION_no_reboot, > QEMU_OPTION_no_shutdown, > QEMU_OPTION_show_cursor, > @@ -8900,6 +8910,9 @@ static const QEMUOption qemu_options[] = { > #endif > { "no-kvm-irqchip", 0, QEMU_OPTION_no_kvm_irqchip }, > { "no-kvm-pit", 0, QEMU_OPTION_no_kvm_pit }, > +#if defined(TARGET_I386) || defined(TARGET_X86_64) || defined(__linux__) > + { "pcidevice", HAS_ARG, QEMU_OPTION_pcidevice }, > +#endif > #endif > #if defined(TARGET_PPC) || defined(TARGET_SPARC) > { "g", 1, QEMU_OPTION_g }, > @@ -9844,6 +9857,11 @@ int main(int argc, char **argv) > kvm_pit = 0; > break; > } > +#if defined(TARGET_I386) || defined(TARGET_X86_64) || defined(__linux__) > + case QEMU_OPTION_pcidevice: > + add_assigned_device(optarg); > + break; > +#endif > #endif > case QEMU_OPTION_usb: > usb_enabled = 1; > This is the wrong general model for doing this. 
The way the rest of QEMU works is to maintain an array of strings representing the assigned devices. The option handling just saves the name of the option. Then in pc.c, you iterate through the list of assigned devices, and then add them. Other architectures may have a completely different implementation of device assignment so it's better to let the individual architectures decide what to do with the assigned devices. Regards, Anthony Liguori ^ permalink raw reply [flat|nested] 23+ messages in thread
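The model described in the review above (option parsing only records the argument string; the machine init code later walks the recorded list and decides what to do with each entry) could be sketched roughly as follows. All names here (`MAX_ASSIGNED_DEVICES`, `assigned_devices`, `record_assigned_device`, `init_recorded_devices`, `assign_fn`, `count_device`) are hypothetical and chosen for illustration; they are not QEMU's actual identifiers, and the fixed-size array is an assumption made to keep the sketch short.

```c
#include <stddef.h>

#define MAX_ASSIGNED_DEVICES 32   /* hypothetical limit for this sketch */

/* Argument strings saved verbatim by the option parser. */
static const char *assigned_devices[MAX_ASSIGNED_DEVICES];
static int assigned_devices_index;

/*
 * Called from command-line handling: only remember the string.
 * No parsing and no device setup happens at this point.
 */
int record_assigned_device(const char *arg)
{
    if (assigned_devices_index >= MAX_ASSIGNED_DEVICES)
        return -1;
    assigned_devices[assigned_devices_index++] = arg;
    return 0;
}

/* Architecture-specific handler for one recorded argument string. */
typedef int (*assign_fn)(const char *arg, void *opaque);

/*
 * Called from machine init (pc.c in the review's example): walk the
 * recorded strings and hand each one to the architecture's handler.
 * Stops and returns -1 on the first handler failure.
 */
int init_recorded_devices(assign_fn fn, void *opaque)
{
    int i;

    for (i = 0; i < assigned_devices_index; i++) {
        if (fn(assigned_devices[i], opaque) != 0)
            return -1;
    }
    return 0;
}

/* Stand-in handler for demonstration: counts the devices it is given. */
int count_device(const char *arg, void *opaque)
{
    (void)arg;
    ++*(int *)opaque;
    return 0;
}
```

The point of the split is that the option parser stays architecture-neutral, while each machine type supplies its own handler (or ignores the list entirely).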
* Re: [PATCH 5/6] KVM/userspace: Device Assignment: Support for assigning PCI devices to guests 2008-10-24 16:22 ` [PATCH 5/6] KVM/userspace: Device Assignment: Support for assigning PCI devices to guests Anthony Liguori @ 2008-10-26 12:54 ` Avi Kivity 2008-10-28 10:11 ` Muli Ben-Yehuda 1 sibling, 0 replies; 23+ messages in thread From: Avi Kivity @ 2008-10-26 12:54 UTC (permalink / raw) To: Anthony Liguori; +Cc: Amit Shah, kvm, weidong.han, allen.m.kay, muli, benami Anthony Liguori wrote: > > I don't think having a kvm_enabled() check here is very useful. I > think device-assignment.c should be conditional on USE_KVM, and the > only kvm_enabled() check should be when creating the initial device > assignment. Practically speaking, QEMU is never going to support > device assignment outside of the context of KVM because I strongly > doubt anything like irqhook will make it upstream. Userspace interrupt handlers are actually possible with MSI; we should see if uio is open to adding support for that. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 5/6] KVM/userspace: Device Assignment: Support for assigning PCI devices to guests 2008-10-24 16:22 ` [PATCH 5/6] KVM/userspace: Device Assignment: Support for assigning PCI devices to guests Anthony Liguori 2008-10-26 12:54 ` Avi Kivity @ 2008-10-28 10:11 ` Muli Ben-Yehuda 1 sibling, 0 replies; 23+ messages in thread From: Muli Ben-Yehuda @ 2008-10-28 10:11 UTC (permalink / raw) To: Anthony Liguori Cc: Amit Shah, avi, kvm, weidong.han, allen.m.kay, Ben-Ami Yassour1 On Fri, Oct 24, 2008 at 11:22:48AM -0500, Anthony Liguori wrote: > Amit Shah wrote: >> +#include <linux/kvm_para.h> >> > > Is this header really necessary? No, removed. > >> +#include "device-assignment.h" >> + >> +/* From linux/ioport.h */ >> +#define IORESOURCE_IO 0x00000100 /* Resource type */ >> +#define IORESOURCE_MEM 0x00000200 >> +#define IORESOURCE_IRQ 0x00000400 >> +#define IORESOURCE_DMA 0x00000800 >> +#define IORESOURCE_PREFETCH 0x00001000 /* No side effects */ >> + >> +/* #define DEVICE_ASSIGNMENT_DEBUG 1 */ >> + >> +#ifdef DEVICE_ASSIGNMENT_DEBUG >> +#define DEBUG(fmt, args...) \ >> > > Please use C99 style variadics. Done. > >> + do { \ >> + fprintf(stderr, "%s: " fmt, __func__ , ## args); \ >> + } while (0) >> +#else >> +#define DEBUG(fmt, args...) do { } while(0) >> +#endif >> + >> +static void assigned_dev_ioport_writeb(void *opaque, uint32_t addr, >> + uint32_t value) >> +{ >> + AssignedDevRegion *r_access = (AssignedDevRegion *)opaque; >> > > Cast is unnecessary. Removed. > >> + uint32_t r_pio = (unsigned long)r_access->r_virtbase >> + + (addr - r_access->e_physbase); >> > > It would be nice to make this a function to make it more obvious that you > are translating from guest to host regions. The cast to unsigned long > should probably be target_ulong too. Done.
> >> + DEBUG(stderr, "%s: r_pio=%08x e_physbase=%08x" >> + " r_virtbase=%08lx value=%08x\n", >> + __func__, r_pio, (int)r_access->e_physbase, >> + (unsigned long)r_access->r_virtbase, value); >> > > This debug statement looks wrong to me. You're passing stderr. > It's true for all of these functions. Fixed. > >> +static void assigned_dev_iomem_map(PCIDevice *pci_dev, int region_num, >> + uint32_t e_phys, uint32_t e_size, int >> type) >> +{ >> + AssignedDevice *r_dev = (AssignedDevice *) pci_dev; >> + AssignedDevRegion *region = &r_dev->v_addrs[region_num]; >> + int first_map = (region->e_size == 0); >> + int ret = 0; >> + >> + DEBUG("%s: e_phys=%08x r_virt=%x type=%d len=%08x region_num=%d \n", >> + __func__, e_phys, (uint32_t)region->r_virtbase, type, e_size, >> + region_num); >> > > You already have __func__ in your debug printf(). Fixed. > >> + region->e_physbase = e_phys; >> + region->e_size = e_size; >> + >> + /* FIXME: Add support for emulated MMIO for non-kvm guests */ >> + if (kvm_enabled()) { >> > > I don't think having a kvm_enabled() check here is very useful. I > think device-assignment.c should be conditional on USE_KVM, and the > only kvm_enabled() check should be when creating the initial device > assignment. Practically speaking, QEMU is never going to support > device assignment outside of the context of KVM because I strongly > doubt anything like irqhook will make it upstream. Reworked along your suggestions, please let me know if you have further comments. >> + if (!first_map) >> + kvm_destroy_phys_mem(kvm_context, e_phys, e_size); >> + if (e_size > 0) >> + ret = kvm_register_phys_mem(kvm_context, e_phys, >> + region->r_virtbase, e_size, 0); >> + if (ret != 0) >> + fprintf(stderr, "%s: Error: create new mapping failed\n", >> __func__); >> > > If we do get an error here, we shouldn't keep going. This error is > probably going to happen in practice if a guest tries to pass > through too many devices and we run out of slots. 
Fixed, we exit(1) now (is there a more graceful way to bail out?). >> + } >> +} >> + >> +static void assigned_dev_ioport_map(PCIDevice *pci_dev, int region_num, >> + uint32_t addr, uint32_t size, int >> type) >> +{ >> + AssignedDevice *r_dev = (AssignedDevice *) pci_dev; >> + AssignedDevRegion *region = &r_dev->v_addrs[region_num]; >> + int r; >> + >> + region->e_physbase = addr; >> + region->e_size = size; >> + >> + DEBUG("%s: e_phys=0x%x r_virt=%x type=0x%x len=%d region_num=%d \n", >> + __func__, addr, (uint32_t)region->r_virtbase, type, size, >> region_num); >> > > Need to fix this DEBUG(). Fixed. > >> + r = ioperm((uint32_t)region->r_virtbase, size, 1); >> > > I don't think this is enough for KVM. This will only do the ioperm > in the thread that triggered the IO. If you have an SMP guest, > ioperm needs to be done on each VCPU's thread. Fixed. >> + if (r < 0) { >> + perror("assigned_dev_ioport_map: ioperm"); >> + return; >> + } >> > > Again, if we fail, we have to exit QEMU gracefully. Fixed. > >> + register_ioport_read(addr, size, 1, assigned_dev_ioport_readb, >> + (void *) (r_dev->v_addrs + region_num)); >> + register_ioport_read(addr, size, 2, assigned_dev_ioport_readw, >> + (void *) (r_dev->v_addrs + region_num)); >> + register_ioport_read(addr, size, 4, assigned_dev_ioport_readl, >> + (void *) (r_dev->v_addrs + region_num)); >> + register_ioport_write(addr, size, 1, assigned_dev_ioport_writeb, >> + (void *) (r_dev->v_addrs + region_num)); >> + register_ioport_write(addr, size, 2, assigned_dev_ioport_writew, >> + (void *) (r_dev->v_addrs + region_num)); >> + register_ioport_write(addr, size, 4, assigned_dev_ioport_writel, >> + (void *) (r_dev->v_addrs + region_num)); >> +} >> > > You never need to explicitly cast a pointer to void *. Fixed.
> >> +static void assigned_dev_pci_write_config(PCIDevice *d, uint32_t address, >> + uint32_t val, int len) >> +{ >> + int fd, r; >> + >> + DEBUG("%s: (%x.%x): address=%04x val=0x%08x len=%d\n", >> + __func__, ((d->devfn >> 3) & 0x1F), (d->devfn & 0x7), >> + (uint16_t) address, val, len); >> > > bad DEBUG() Fixed. > >> + if (address == 0x4) { >> + pci_default_write_config(d, address, val, len); >> + /* Continue to program the card */ >> + } >> + >> + if ((address >= 0x10 && address <= 0x24) || address == 0x34 || >> + address == 0x3c || address == 0x3d) { >> + /* used for update-mappings (BAR emulation) */ >> + pci_default_write_config(d, address, val, len); >> + return; >> + } >> + DEBUG("%s: NON BAR (%x.%x): address=%04x val=0x%08x len=%d\n", >> + __func__, ((d->devfn >> 3) & 0x1F), (d->devfn & 0x7), >> + (uint16_t) address, val, len); >> + fd = ((AssignedDevice *)d)->real_device.config_fd; >> + r = lseek(fd, address, SEEK_SET); >> + if (r < 0) { >> + fprintf(stderr, "%s: bad seek, errno = %d\n", __func__, errno); >> + return; >> + } >> +again: >> + r = write(fd, &val, len); >> > > Can you just do a pwrite()? That'll make things simpler. Fixed. > >> + if (r < 0) { >> + if (errno == EINTR || errno == EAGAIN) >> + goto again; >> + fprintf(stderr, "%s: write failed, errno = %d\n", __func__, >> errno); >> + } >> +} >> + >> +static uint32_t assigned_dev_pci_read_config(PCIDevice *d, uint32_t >> address, >> + int len) >> +{ >> + uint32_t val = 0; >> + int fd, r; >> + >> + if ((address >= 0x10 && address <= 0x24) || address == 0x34 || >> + address == 0x3c || address == 0x3d) { >> + val = pci_default_read_config(d, address, len); >> + DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n", >> + (d->devfn >> 3) & 0x1F, (d->devfn & 0x7), address, val, >> len); >> + return val; >> + } >> + >> + /* vga specific, remove later */ >> + if (address == 0xFC) >> + goto do_log; >> > > Can you explain the point of this? No. It appears to exist since the earliest versions of the patch. 
Since removing it does modify the behavior, I kept it in for now pending further investigation. >> + fd = ((AssignedDevice *)d)->real_device.config_fd; >> + r = lseek(fd, address, SEEK_SET); >> + if (r < 0) { >> + fprintf(stderr, "%s: bad seek, errno = %d\n", __func__, errno); >> + return val; >> + } >> +again: >> + r = read(fd, &val, len); >> > > pread(). Fixed. > >> + if (r < 0) { >> + if (errno == EINTR || errno == EAGAIN) >> + goto again; >> + fprintf(stderr, "%s: read failed, errno = %d\n", >> + __func__, errno); >> > > Should bail out gracefully. Done. > >> +static int assigned_dev_register_regions(PCIRegion *io_regions, >> + unsigned long regions_num, >> + AssignedDevice *pci_dev) >> +{ >> + uint32_t i; >> + PCIRegion *cur_region = io_regions; >> + >> + for (i = 0; i < regions_num; i++, cur_region++) { >> + if (!cur_region->valid) >> + continue; >> + pci_dev->v_addrs[i].num = i; >> + >> + /* handle memory io regions */ >> + if (cur_region->type & IORESOURCE_MEM) { >> + int t = cur_region->type & IORESOURCE_PREFETCH >> + ? PCI_ADDRESS_SPACE_MEM_PREFETCH >> + : PCI_ADDRESS_SPACE_MEM; >> + >> + /* map physical memory */ >> + pci_dev->v_addrs[i].e_physbase = cur_region->base_addr; >> + pci_dev->v_addrs[i].r_virtbase = >> + mmap(NULL, >> + (cur_region->size + 0xFFF) & 0xFFFFF000, >> + PROT_WRITE | PROT_READ, MAP_SHARED, >> + cur_region->resource_fd, (off_t) 0); >> + >> + if ((void *) -1 == pci_dev->v_addrs[i].r_virtbase) { >> > > Please use MAP_FAILED and don't use a defensive if. Fixed. > >> + fprintf(stderr, "%s: Error: Couldn't mmap 0x%x!" 
>> + "\n", __func__, >> + (uint32_t) (cur_region->base_addr)); >> + return -1; >> + } >> + pci_dev->v_addrs[i].r_size = cur_region->size; >> + pci_dev->v_addrs[i].e_size = 0; >> + >> + /* add offset */ >> + pci_dev->v_addrs[i].r_virtbase += >> + (cur_region->base_addr & 0xFFF); >> + >> + pci_register_io_region((PCIDevice *) pci_dev, i, >> + cur_region->size, t, >> + assigned_dev_iomem_map); >> + continue; >> + } >> + /* handle port io regions */ >> + pci_register_io_region((PCIDevice *) pci_dev, i, >> + cur_region->size, PCI_ADDRESS_SPACE_IO, >> + assigned_dev_ioport_map); >> + >> + pci_dev->v_addrs[i].e_physbase = cur_region->base_addr; >> + pci_dev->v_addrs[i].r_virtbase = >> + (void *)(long)cur_region->base_addr; >> > > I think virtbase would make more sense as a target_ulong. I split r_virtbase into a union of void* for memory regions and a ulong32_t for port numbers. > >> + /* not relevant for port io */ >> + pci_dev->v_addrs[i].memory_index = 0; >> + } >> + >> + /* success */ >> + return 0; >> +} >> + >> +static int get_real_device(AssignedDevice *pci_dev, uint8_t r_bus, >> + uint8_t r_dev, uint8_t r_func) >> +{ >> + char dir[128], name[128], comp[16]; >> + int fd, r = 0; >> + FILE *f; >> + unsigned long long start, end, size, flags; >> + PCIRegion *rp; >> + PCIDevRegions *dev = &pci_dev->real_device; >> + >> + dev->region_number = 0; >> + >> + snprintf(dir, 128, "/sys/bus/pci/devices/0000:%02x:%02x.%x/", >> + r_bus, r_dev, r_func); >> > > just use sizeof(). Done. > >> + strncpy(name, dir, 128); >> + strncat(name, "config", 6); >> > > strncpy() doesn't do what you think it does. Why not just snprintf(name, > sizeof(name), "%sconfig", dir)? Fixed to use snprintf. 
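For readers following along, the snprintf() approach suggested above could look roughly like the sketch below. The helper name build_sysfs_path is made up for illustration and is not part of the patch. It builds the full sysfs path in one call and also reports truncation, which the strncpy()/strncat() sequence in the quoted code did not check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not in the patch): format
 * "/sys/bus/pci/devices/0000:BB:DD.F/<leaf>" into buf.
 * Returns 0 on success, -1 if the result did not fit in len bytes.
 */
static int build_sysfs_path(char *buf, size_t len, uint8_t bus, uint8_t dev,
                            uint8_t func, const char *leaf)
{
    int n;

    assert(buf != NULL && leaf != NULL);

    n = snprintf(buf, len, "/sys/bus/pci/devices/0000:%02x:%02x.%x/%s",
                 (unsigned)bus, (unsigned)dev, (unsigned)func, leaf);

    /* snprintf() returns the length it wanted to write; >= len means
     * the output was truncated. */
    if (n < 0 || (size_t)n >= len)
        return -1;
    return 0;
}
```

A caller would pass sizeof(name) as len, so the buffer size is stated in exactly one place.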
> >> + fd = open(name, O_RDWR); >> + if (fd == -1) { >> + fprintf(stderr, "%s: %s: %m\n", __func__, name); >> + return 1; >> + } >> + dev->config_fd = fd; >> +again: >> + r = read(fd, pci_dev->dev.config, sizeof(pci_dev->dev.config)); >> + if (r < 0) { >> + if (errno == EINTR || errno == EAGAIN) >> + goto again; >> + fprintf(stderr, "%s: read failed, errno = %d\n", __func__, >> errno); >> + } >> + strncpy(name, dir, 128); >> + strncat(name, "resource", 8); >> > > Just use snprintf(). Done. > >> + f = fopen(name, "r"); >> + if (f == NULL) { >> + fprintf(stderr, "%s: %s: %m\n", __func__, name); >> + return 1; >> + } >> + r = -1; >> + while (fscanf(f, "%lli %lli %lli\n", &start, &end, &flags) == 3) { >> + r++; >> + rp = dev->regions + r; >> + rp->valid = 0; >> + size = end - start + 1; >> + flags &= IORESOURCE_IO | IORESOURCE_MEM | IORESOURCE_PREFETCH; >> + if (size == 0 || (flags & ~IORESOURCE_PREFETCH) == 0) >> + continue; >> + if (flags & IORESOURCE_MEM) { >> + flags &= ~IORESOURCE_IO; >> + snprintf(comp, 16, "resource%d", r); >> + strncpy(name, dir, 128); >> + strncat(name, comp, 16); >> > snprintf(name, sizeof(name), "%sresource%d", dir, r). Done. > This is the wrong general model for doing this. The way the rest of > QEMU works is to maintain an array of strings representing the > assigned devices. The option handling just saves the name of the > option. Then in pc.c, you iterate through the list of assigned > devices, and then add them. Other architectures may have a > completely different implementation of device assignment so it's > better to let the individual architectures decide what to do with > the assigned devices. Split option parsing and initialization in two parts, as you suggested. Thanks for the detailed review comments! 
Cheers, Muli -- The First Workshop on I/O Virtualization (WIOV '08) Dec 2008, San Diego, CA, http://www.usenix.org/wiov08/ <-> SYSTOR 2009---The Israeli Experimental Systems Conference http://www.haifa.il.ibm.com/conferences/systor2009/
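The pread()/pwrite() change requested in the review above can be sketched as follows. This is an illustration only, not the code from any later revision of the patch; the helper names config_pread and config_pwrite are assumptions. Each helper replaces the lseek() plus read()/write() pair and the "again:" label with a single positioned call in a retry loop.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read len bytes at offset off, retrying when the
 * call is interrupted. Returns the byte count from pread(), or -1 with
 * errno set. A short count is passed back to the caller unchanged.
 */
static ssize_t config_pread(int fd, void *buf, size_t len, off_t off)
{
    ssize_t r;

    assert(buf != NULL);
    do {
        r = pread(fd, buf, len, off);
    } while (r < 0 && (errno == EINTR || errno == EAGAIN));
    return r;
}

/* Hypothetical helper: same retry policy for positioned writes. */
static ssize_t config_pwrite(int fd, const void *buf, size_t len, off_t off)
{
    ssize_t r;

    assert(buf != NULL);
    do {
        r = pwrite(fd, buf, len, off);
    } while (r < 0 && (errno == EINTR || errno == EAGAIN));
    return r;
}
```

Because the offset is an argument, the file position of config_fd is never changed, so a read and a write on the same descriptor cannot disturb each other's seek position.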
* RE: [PATCH 5/6] KVM/userspace: Device Assignment: Support for assigning PCI devices to guests 2008-10-24 15:26 ` [PATCH 5/6] KVM/userspace: Device Assignment: Support for assigning PCI devices to guests Amit Shah 2008-10-24 15:26 ` [PATCH 6/6] KVM/userspace: Device Assignment: Support for hot plugging PCI devices Amit Shah 2008-10-24 16:22 ` [PATCH 5/6] KVM/userspace: Device Assignment: Support for assigning PCI devices to guests Anthony Liguori @ 2008-10-27 1:28 ` Su, Disheng 2008-10-27 6:32 ` Han, Weidong 2 siblings, 1 reply; 23+ messages in thread From: Su, Disheng @ 2008-10-27 1:28 UTC (permalink / raw) To: Amit Shah, avi@redhat.com Cc: kvm@vger.kernel.org, anthony@codemonkey.ws, Han, Weidong, Kay, Allen M, muli@il.ibm.com, benami@il.ibm.com, Su, Disheng Amit Shah wrote: > This patch has been contributed to by the following people: > > From: Or Sagi <ors@tutis.com> > From: Nir Peleg <nir@tutis.com> > From: Amit Shah <amit.shah@redhat.com> > From: Ben-Ami Yassour <benami@il.ibm.com> > From: Weidong Han <weidong.han@intel.com> > From: Glauber de Oliveira Costa <gcosta@redhat.com> > > With this patch, we can assign a device on the host machine to a > guest. > > A new command-line option, -pcidevice is added. > To invoke it for a device sitting at PCI bus:dev.fn 04:08.0, use this: > > -pcidevice host=04:08.0 > > * The host driver for the device, if any, is to be removed before > assigning the device (else device assignment will fail). > > * A device that shares IRQ with another host device cannot currently > be assigned. > > * The RAW_IO capability is needed for this to work > > This works only with the in-kernel irqchip method; to use the > userspace irqchip, a kernel module (irqhook) and some extra changes > are needed. 
> > Signed-off-by: Amit Shah <amit.shah@redhat.com> > --- > qemu/Makefile.target | 1 + > qemu/hw/device-assignment.c | 619 > +++++++++++++++++++++++++++++++++++++++++++ > qemu/hw/device-assignment.h | 98 +++++++ qemu/hw/pc.c > | 6 + qemu/hw/pci.c | 7 + > qemu/vl.c | 18 ++ > 6 files changed, 749 insertions(+), 0 deletions(-) > create mode 100644 qemu/hw/device-assignment.c > create mode 100644 qemu/hw/device-assignment.h > > diff --git a/qemu/Makefile.target b/qemu/Makefile.target > index d9bdeca..05a1d84 100644 > --- a/qemu/Makefile.target > +++ b/qemu/Makefile.target > @@ -621,6 +621,7 @@ OBJS+= ide.o pckbd.o ps2.o vga.o $(SOUND_HW) dma.o > OBJS+= fdc.o mc146818rtc.o serial.o i8259.o i8254.o pcspk.o pc.o > OBJS+= cirrus_vga.o apic.o parallel.o acpi.o piix_pci.o > OBJS+= usb-uhci.o vmmouse.o vmport.o vmware_vga.o extboot.o > +OBJS+= device-assignment.o > ifeq ($(USE_KVM_PIT), 1) > OBJS+= i8254-kvm.o > endif > diff --git a/qemu/hw/device-assignment.c b/qemu/hw/device-assignment.c > new file mode 100644 > index 0000000..5ba21a0 > --- /dev/null > +++ b/qemu/hw/device-assignment.c > @@ -0,0 +1,619 @@ > +/* > + * Copyright (c) 2007, Neocleus Corporation. > + * > + * This program is free software; you can redistribute it and/or > modify it + * under the terms and conditions of the GNU General > Public License, + * version 2, as published by the Free Software > Foundation. + * > + * This program is distributed in the hope it will be useful, but > WITHOUT + * ANY WARRANTY; without even the implied warranty of > MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU > General Public License for + * more details. > + * > + * You should have received a copy of the GNU General Public License > along with + * this program; if not, write to the Free Software > Foundation, Inc., 59 Temple + * Place - Suite 330, Boston, MA > 02111-1307 USA. + * > + * > + * Assign a PCI device from the host to a guest VM. > + * > + * Adapted for KVM by Qumranet. 
> + * > + * Copyright (c) 2007, Neocleus, Alex Novik (alex@neocleus.com) > + * Copyright (c) 2007, Neocleus, Guy Zana (guy@neocleus.com) > + * Copyright (C) 2008, Qumranet, Amit Shah (amit.shah@qumranet.com) > + * Copyright (C) 2008, Red Hat, Amit Shah (amit.shah@redhat.com) > + */ > +#include <stdio.h> > +#include <sys/io.h> > +#include "qemu-kvm.h" > +#include "hw.h" > +#include "pc.h" > +#include "sysemu.h" > +#include "console.h" > +#include <linux/kvm_para.h> > +#include "device-assignment.h" > + > +/* From linux/ioport.h */ > +#define IORESOURCE_IO 0x00000100 /* Resource type */ > +#define IORESOURCE_MEM 0x00000200 > +#define IORESOURCE_IRQ 0x00000400 > +#define IORESOURCE_DMA 0x00000800 > +#define IORESOURCE_PREFETCH 0x00001000 /* No side effects */ > + > +/* #define DEVICE_ASSIGNMENT_DEBUG 1 */ > + > +#ifdef DEVICE_ASSIGNMENT_DEBUG > +#define DEBUG(fmt, args...) \ > + do { \ > + fprintf(stderr, "%s: " fmt, __func__ , ## args); \ > + } while (0) > +#else > +#define DEBUG(fmt, args...) 
do { } while(0) > +#endif > + > +static void assigned_dev_ioport_writeb(void *opaque, uint32_t addr, > + uint32_t value) > +{ > + AssignedDevRegion *r_access = (AssignedDevRegion *)opaque; > + uint32_t r_pio = (unsigned long)r_access->r_virtbase > + + (addr - r_access->e_physbase); > + > + DEBUG(stderr, "%s: r_pio=%08x e_physbase=%08x" > + " r_virtbase=%08lx value=%08x\n", > + __func__, r_pio, (int)r_access->e_physbase, > + (unsigned long)r_access->r_virtbase, value); > + outb(value, r_pio); > +} > + > +static void assigned_dev_ioport_writew(void *opaque, uint32_t addr, > + uint32_t value) > +{ > + AssignedDevRegion *r_access = (AssignedDevRegion *)opaque; > + uint32_t r_pio = (unsigned long)r_access->r_virtbase > + + (addr - r_access->e_physbase); > + > + DEBUG(stderr, "%s: r_pio=%08x e_physbase=%08x" > + " r_virtbase=%08lx value=%08x\n", > + __func__, r_pio, (int)r_access->e_physbase, > + (unsigned long)r_access->r_virtbase, value); > + outw(value, r_pio); > +} > + > +static void assigned_dev_ioport_writel(void *opaque, uint32_t addr, > + uint32_t value) > +{ > + AssignedDevRegion *r_access = (AssignedDevRegion *)opaque; > + uint32_t r_pio = (unsigned long)r_access->r_virtbase > + + (addr - r_access->e_physbase); > + > + DEBUG(stderr, "%s: r_pio=%08x e_physbase=%08x" > + " r_virtbase=%08lx value=%08x\n", > + __func__, r_pio, (int)r_access->e_physbase, > + (unsigned long)r_access->r_virtbase, value); > + outl(value, r_pio); > +} > + > +static uint32_t assigned_dev_ioport_readb(void *opaque, uint32_t > addr) +{ > + AssignedDevRegion *r_access = (AssignedDevRegion *)opaque; > + uint32_t r_pio = (addr - r_access->e_physbase) > + + (unsigned long)r_access->r_virtbase; > + uint32_t value; > + > + value = inb(r_pio); > + DEBUG(stderr, "%s: r_pio=%08x e_physbase=%08x " > + "r_virtbase=%08lx value=%08x\n", > + __func__, r_pio, (int)r_access->e_physbase, > + (unsigned long)r_access->r_virtbase, value); > + return value; > +} > + > +static uint32_t 
assigned_dev_ioport_readw(void *opaque, uint32_t > addr) +{ > + AssignedDevRegion *r_access = (AssignedDevRegion *)opaque; > + uint32_t r_pio = (addr - r_access->e_physbase) > + + (unsigned long)r_access->r_virtbase; > + uint32_t value; > + > + value = inw(r_pio); > + DEBUG(stderr, "%s: r_pio=%08x e_physbase=%08x " > + "r_virtbase=%08lx value=%08x\n", > + __func__, r_pio, (int)r_access->e_physbase, > + (unsigned long)r_access->r_virtbase, value); > + return value; > +} > + > +static uint32_t assigned_dev_ioport_readl(void *opaque, uint32_t > addr) +{ > + AssignedDevRegion *r_access = (AssignedDevRegion *)opaque; > + uint32_t r_pio = (addr - r_access->e_physbase) > + + (unsigned long)r_access->r_virtbase; > + uint32_t value; > + > + value = inl(r_pio); > + DEBUG(stderr, "%s: r_pio=%08x e_physbase=%08x " > + "r_virtbase=%08lx value=%08x\n", > + __func__, r_pio, (int)r_access->e_physbase, > + (unsigned long)r_access->r_virtbase, value); > + return value; > +} > + > +static void assigned_dev_iomem_map(PCIDevice *pci_dev, int > region_num, + uint32_t e_phys, > uint32_t e_size, int type) +{ > + AssignedDevice *r_dev = (AssignedDevice *) pci_dev; > + AssignedDevRegion *region = &r_dev->v_addrs[region_num]; > + int first_map = (region->e_size == 0); > + int ret = 0; > + > + DEBUG("%s: e_phys=%08x r_virt=%x type=%d len=%08x region_num=%d > \n", + __func__, e_phys, (uint32_t)region->r_virtbase, type, > e_size, + region_num); > + > + region->e_physbase = e_phys; > + region->e_size = e_size; > + > + /* FIXME: Add support for emulated MMIO for non-kvm guests */ > + if (kvm_enabled()) { > + if (!first_map) > + kvm_destroy_phys_mem(kvm_context, e_phys, e_size);

A typo? Doesn't this need to destroy the original registered address?
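To make the question concrete: in the quoted function, region->e_physbase and region->e_size are overwritten before the destroy call, so the previously registered range is no longer available at that point. One possible ordering is sketched below. It is not a tested fix; Region is a cut-down stand-in for AssignedDevRegion, and fake_destroy/fake_register are stubs standing in for kvm_destroy_phys_mem() and kvm_register_phys_mem() that only record their arguments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Cut-down stand-in for AssignedDevRegion from the patch. */
typedef struct {
    uint32_t e_physbase;  /* guest-physical base currently mapped */
    uint32_t e_size;      /* 0 until the first mapping is made */
    void *r_virtbase;     /* host virtual address backing the region */
} Region;

/* Stubs (assumptions): they only remember the last range they saw. */
static uint32_t last_destroyed_base, last_destroyed_size;
static uint32_t last_registered_base, last_registered_size;

static void fake_destroy(uint32_t base, uint32_t size)
{
    last_destroyed_base = base;
    last_destroyed_size = size;
}

static int fake_register(uint32_t base, void *virt, uint32_t size)
{
    (void)virt;
    last_registered_base = base;
    last_registered_size = size;
    return 0;
}

/* Tear down the OLD range first, then record and map the new one. */
static int region_remap(Region *r, uint32_t new_phys, uint32_t new_size)
{
    uint32_t old_phys, old_size;
    int ret = 0;

    assert(r != NULL);
    old_phys = r->e_physbase;
    old_size = r->e_size;

    if (old_size != 0)            /* not the first mapping */
        fake_destroy(old_phys, old_size);

    r->e_physbase = new_phys;
    r->e_size = new_size;

    if (new_size > 0)
        ret = fake_register(new_phys, r->r_virtbase, new_size);
    return ret;
}
```

The only point of the sketch is the ordering: the old base and size are saved before the region fields are updated, so the destroy call sees the range that was actually registered.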
> + if (e_size > 0) > + ret = kvm_register_phys_mem(kvm_context, e_phys, > + region->r_virtbase, e_size, > 0); + if (ret != 0) > + fprintf(stderr, "%s: Error: create new mapping > failed\n", __func__); + } > +} > + > +static void assigned_dev_ioport_map(PCIDevice *pci_dev, int > region_num, + uint32_t addr, > uint32_t size, int type) +{ > + AssignedDevice *r_dev = (AssignedDevice *) pci_dev; > + AssignedDevRegion *region = &r_dev->v_addrs[region_num]; > + int r; > + > + region->e_physbase = addr; > + region->e_size = size; > + > + DEBUG("%s: e_phys=0x%x r_virt=%x type=0x%x len=%d region_num=%d > \n", + __func__, addr, (uint32_t)region->r_virtbase, type, > size, region_num); + > + r = ioperm((uint32_t)region->r_virtbase, size, 1); > + if (r < 0) { > + perror("assigned_dev_ioport_map: ioperm"); > + return; > + } > + > + register_ioport_read(addr, size, 1, assigned_dev_ioport_readb, > + (void *) (r_dev->v_addrs + region_num)); > + register_ioport_read(addr, size, 2, assigned_dev_ioport_readw, > + (void *) (r_dev->v_addrs + region_num)); > + register_ioport_read(addr, size, 4, assigned_dev_ioport_readl, > + (void *) (r_dev->v_addrs + region_num)); > + register_ioport_write(addr, size, 1, assigned_dev_ioport_writeb, > + (void *) (r_dev->v_addrs + region_num)); > + register_ioport_write(addr, size, 2, assigned_dev_ioport_writew, > + (void *) (r_dev->v_addrs + region_num)); > + register_ioport_write(addr, size, 4, assigned_dev_ioport_writel, > + (void *) (r_dev->v_addrs + region_num)); > +} > + > +static void assigned_dev_pci_write_config(PCIDevice *d, uint32_t > address, + uint32_t val, int > len) +{ > + int fd, r; > + > + DEBUG("%s: (%x.%x): address=%04x val=0x%08x len=%d\n", > + __func__, ((d->devfn >> 3) & 0x1F), (d->devfn & 0x7), > + (uint16_t) address, val, len); > + > + if (address == 0x4) { > + pci_default_write_config(d, address, val, len); > + /* Continue to program the card */ > + } > + > + if ((address >= 0x10 && address <= 0x24) || address == 0x34 || > + 
address == 0x3c || address == 0x3d) { > + /* used for update-mappings (BAR emulation) */ > + pci_default_write_config(d, address, val, len); > + return; > + } > + DEBUG("%s: NON BAR (%x.%x): address=%04x val=0x%08x len=%d\n", > + __func__, ((d->devfn >> 3) & 0x1F), (d->devfn & 0x7), > + (uint16_t) address, val, len); > + fd = ((AssignedDevice *)d)->real_device.config_fd; > + r = lseek(fd, address, SEEK_SET); > + if (r < 0) { > + fprintf(stderr, "%s: bad seek, errno = %d\n", __func__, > errno); + return; > + } > +again: > + r = write(fd, &val, len); > + if (r < 0) { > + if (errno == EINTR || errno == EAGAIN) > + goto again; > + fprintf(stderr, "%s: write failed, errno = %d\n", __func__, > errno); + } > +} > + > +static uint32_t assigned_dev_pci_read_config(PCIDevice *d, uint32_t > address, + int len) > +{ > + uint32_t val = 0; > + int fd, r; > + > + if ((address >= 0x10 && address <= 0x24) || address == 0x34 || > + address == 0x3c || address == 0x3d) { > + val = pci_default_read_config(d, address, len); > + DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n", > + (d->devfn >> 3) & 0x1F, (d->devfn & 0x7), address, > val, len); + return val; > + } > + > + /* vga specific, remove later */ > + if (address == 0xFC) > + goto do_log; > + > + fd = ((AssignedDevice *)d)->real_device.config_fd; > + r = lseek(fd, address, SEEK_SET); > + if (r < 0) { > + fprintf(stderr, "%s: bad seek, errno = %d\n", __func__, > errno); + return val; > + } > +again: > + r = read(fd, &val, len); > + if (r < 0) { > + if (errno == EINTR || errno == EAGAIN) > + goto again; > + fprintf(stderr, "%s: read failed, errno = %d\n", > + __func__, errno); > + } > +do_log: > + DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n", > + (d->devfn >> 3) & 0x1F, (d->devfn & 0x7), address, val, > len); + > + /* kill the special capabilities */ > + if (address == 4 && len == 4) > + val &= ~0x100000; > + else if (address == 6) > + val &= ~0x10; > + > + return val; > +} > + > +static int 
assigned_dev_register_regions(PCIRegion *io_regions, > + unsigned long regions_num, > + AssignedDevice *pci_dev) > +{ > + uint32_t i; > + PCIRegion *cur_region = io_regions; > + > + for (i = 0; i < regions_num; i++, cur_region++) { > + if (!cur_region->valid) > + continue; > + pci_dev->v_addrs[i].num = i; > + > + /* handle memory io regions */ > + if (cur_region->type & IORESOURCE_MEM) { > + int t = cur_region->type & IORESOURCE_PREFETCH > + ? PCI_ADDRESS_SPACE_MEM_PREFETCH > + : PCI_ADDRESS_SPACE_MEM; > + > + /* map physical memory */ > + pci_dev->v_addrs[i].e_physbase = cur_region->base_addr; > + pci_dev->v_addrs[i].r_virtbase = > + mmap(NULL, > + (cur_region->size + 0xFFF) & 0xFFFFF000, > + PROT_WRITE | PROT_READ, MAP_SHARED, > + cur_region->resource_fd, (off_t) 0); > + > + if ((void *) -1 == pci_dev->v_addrs[i].r_virtbase) { > + fprintf(stderr, "%s: Error: Couldn't mmap 0x%x!" > + "\n", __func__, > + (uint32_t) (cur_region->base_addr)); > + return -1; > + } > + pci_dev->v_addrs[i].r_size = cur_region->size; > + pci_dev->v_addrs[i].e_size = 0; > + > + /* add offset */ > + pci_dev->v_addrs[i].r_virtbase += > + (cur_region->base_addr & 0xFFF); > + > + pci_register_io_region((PCIDevice *) pci_dev, i, > + cur_region->size, t, > + assigned_dev_iomem_map); > + continue; > + } > + /* handle port io regions */ > + pci_register_io_region((PCIDevice *) pci_dev, i, > + cur_region->size, > PCI_ADDRESS_SPACE_IO, + > assigned_dev_ioport_map); + > + pci_dev->v_addrs[i].e_physbase = cur_region->base_addr; > + pci_dev->v_addrs[i].r_virtbase = > + (void *)(long)cur_region->base_addr; > + /* not relevant for port io */ > + pci_dev->v_addrs[i].memory_index = 0; > + } > + > + /* success */ > + return 0; > +} > + > +static int get_real_device(AssignedDevice *pci_dev, uint8_t r_bus, > + uint8_t r_dev, uint8_t r_func) > +{ > + char dir[128], name[128], comp[16]; > + int fd, r = 0; > + FILE *f; > + unsigned long long start, end, size, flags; > + PCIRegion *rp; > + PCIDevRegions *dev = 
&pci_dev->real_device; > + > + dev->region_number = 0; > + > + snprintf(dir, 128, "/sys/bus/pci/devices/0000:%02x:%02x.%x/", > + r_bus, r_dev, r_func); > + strncpy(name, dir, 128); > + strncat(name, "config", 6); > + fd = open(name, O_RDWR); > + if (fd == -1) { > + fprintf(stderr, "%s: %s: %m\n", __func__, name); > + return 1; > + } > + dev->config_fd = fd; > +again: > + r = read(fd, pci_dev->dev.config, sizeof(pci_dev->dev.config)); > + if (r < 0) { > + if (errno == EINTR || errno == EAGAIN) > + goto again; > + fprintf(stderr, "%s: read failed, errno = %d\n", __func__, > errno); + } > + strncpy(name, dir, 128); > + strncat(name, "resource", 8); > + > + f = fopen(name, "r"); > + if (f == NULL) { > + fprintf(stderr, "%s: %s: %m\n", __func__, name); > + return 1; > + } > + r = -1; > + while (fscanf(f, "%lli %lli %lli\n", &start, &end, &flags) == 3) > { + r++; > + rp = dev->regions + r; > + rp->valid = 0; > + size = end - start + 1; > + flags &= IORESOURCE_IO | IORESOURCE_MEM | > IORESOURCE_PREFETCH; + if (size == 0 || (flags & > ~IORESOURCE_PREFETCH) == 0) + continue; > + if (flags & IORESOURCE_MEM) { > + flags &= ~IORESOURCE_IO; > + snprintf(comp, 16, "resource%d", r); > + strncpy(name, dir, 128); > + strncat(name, comp, 16); > + fd = open(name, O_RDWR); > + if (fd == -1) > + continue; /* probably ROM */ > + rp->resource_fd = fd; > + } else > + flags &= ~IORESOURCE_PREFETCH; > + > + rp->type = flags; > + rp->valid = 1; > + rp->base_addr = start; > + rp->size = size; > + DEBUG("%s: region %d size %d start 0x%x type %d resource_fd > %d\n", + __func__, r, rp->size, start, rp->type, > rp->resource_fd); + } > + fclose(f); > + > + dev->region_number = r; > + return 0; > +} > + > +static int disable_iommu; > +int nr_assigned_devices; > +static LIST_HEAD(, AssignedDevInfo) adev_head; > + > +static uint32_t calc_assigned_dev_id(uint8_t bus, uint8_t devfn) > +{ > + return (uint32_t)bus << 8 | (uint32_t)devfn; > +} > + > +static AssignedDevice *register_real_device(PCIBus 
*e_bus, > + const char *e_dev_name, > + int e_devfn, uint8_t > r_bus, + uint8_t r_dev, > uint8_t r_func) +{ > + int r; > + AssignedDevice *pci_dev; > + uint8_t e_device, e_intx; > + > + DEBUG("%s: Registering real physical device %s (devfn=0x%x)\n", > + __func__, e_dev_name, e_devfn); > + > + pci_dev = (AssignedDevice *) > + pci_register_device(e_bus, e_dev_name, > sizeof(AssignedDevice), + e_devfn, > assigned_dev_pci_read_config, + > assigned_dev_pci_write_config); + if (NULL == pci_dev) { > + fprintf(stderr, "%s: Error: Couldn't register real device > %s\n", + __func__, e_dev_name); > + return NULL; > + } > + if (get_real_device(pci_dev, r_bus, r_dev, r_func)) { > + fprintf(stderr, "%s: Error: Couldn't get real device > (%s)!\n", + __func__, e_dev_name); > + goto out; > + } > + > + /* handle real device's MMIO/PIO BARs */ > + if (assigned_dev_register_regions(pci_dev->real_device.regions, > + > pci_dev->real_device.region_number, + > pci_dev)) + goto out; > + > + /* handle interrupt routing */ > + e_device = (pci_dev->dev.devfn >> 3) & 0x1f; > + e_intx = pci_dev->dev.config[0x3d] - 1; > + pci_dev->intpin = e_intx; > + pci_dev->run = 0; > + pci_dev->girq = 0; > + pci_dev->h_busnr = r_bus; > + pci_dev->h_devfn = PCI_DEVFN(r_dev, r_func); > + > +#ifdef KVM_CAP_DEVICE_ASSIGNMENT > + if (kvm_enabled()) { > + struct kvm_assigned_pci_dev assigned_dev_data; > + > + memset(&assigned_dev_data, 0, sizeof(assigned_dev_data)); > + assigned_dev_data.assigned_dev_id = > + calc_assigned_dev_id(pci_dev->h_busnr, > + (uint32_t)pci_dev->h_devfn); > + assigned_dev_data.busnr = pci_dev->h_busnr; > + assigned_dev_data.devfn = pci_dev->h_devfn; > + > +#ifdef KVM_CAP_IOMMU > + /* We always enable the IOMMU if present > + * (or when not disabled on the command line) > + */ > + r = kvm_check_extension(kvm_context, KVM_CAP_IOMMU); > + if (r && !disable_iommu) > + assigned_dev_data.flags |= KVM_DEV_ASSIGN_ENABLE_IOMMU; > +#endif > + r = kvm_assign_pci_device(kvm_context, 
&assigned_dev_data); > + if (r < 0) { > + fprintf(stderr, "Could not notify kernel about " > + "assigned device \"%s\"\n", e_dev_name); > + perror("register_real_device"); > + goto out; > + } > + } > +#endif > + term_printf("Registered host PCI device %02x:%02x.%1x " > + "(\"%s\") as guest device %02x:%02x.%1x\n", > + r_bus, r_dev, r_func, e_dev_name, > + pci_bus_num(e_bus), e_device, r_func); > + > + return pci_dev; > +out: > +/* pci_unregister_device(&pci_dev->dev); */ > + return NULL; > +} > + > +#ifdef KVM_CAP_DEVICE_ASSIGNMENT > +/* The pci config space got updated. Check if irq numbers have > changed + * for our devices > + */ > +void assigned_dev_update_irq(PCIDevice *d) > +{ > + int irq, r; > + AssignedDevice *assigned_dev; > + AssignedDevInfo *adev; > + > + LIST_FOREACH(adev, &adev_head, next) { > + assigned_dev = adev->assigned_dev; > + irq = pci_map_irq(&assigned_dev->dev, assigned_dev->intpin); > + irq = piix_get_irq(irq); > + > + if (irq != assigned_dev->girq) { > + struct kvm_assigned_irq assigned_irq_data; > + > + memset(&assigned_irq_data, 0, sizeof(assigned_irq_data)); > + assigned_irq_data.assigned_dev_id = > + calc_assigned_dev_id(assigned_dev->h_busnr, > + (uint8_t) > assigned_dev->h_devfn); + assigned_irq_data.guest_irq = > irq; + assigned_irq_data.host_irq = > assigned_dev->real_device.irq; + r = > kvm_assign_irq(kvm_context, &assigned_irq_data); + if (r < > 0) { + perror("assigned_dev_update_irq"); > + fprintf(stderr, "Are you assigning a device " > + "that shares IRQ with some other device?\n"); > + pci_unregister_device(&assigned_dev->dev); > + /* FIXME: Delete node from list */ > + continue; > + } > + assigned_dev->girq = irq; > + } > + } > +} > +#endif > + > +struct PCIDevice *init_assigned_device(AssignedDevInfo *adev, PCIBus > *bus) +{ > + adev->assigned_dev = register_real_device(bus, > + adev->name, -1, > + adev->bus, > + adev->dev, > + adev->func); > + return &adev->assigned_dev->dev; > +} > + > +int init_all_assigned_devices(PCIBus 
*bus) > +{ > + struct AssignedDevInfo *adev; > + > + LIST_FOREACH(adev, &adev_head, next) > + if (init_assigned_device(adev, bus) == NULL) > + return -1; > + return 0; > +} > + > +/* > + * Syntax to assign device: > + * > + * -pcidevice dev=bus:dev.func,dma=dma > + * > + * Example: > + * -pcidevice host=00:13.0,dma=pvdma > + * > + * dma can currently only be 'none' to disable iommu support. > + */ > +AssignedDevInfo *add_assigned_device(const char *arg) > +{ > + char *cp, *cp1; > + char device[8]; > + char dma[6]; > + int r; > + AssignedDevInfo *adev; > + > + adev = qemu_mallocz(sizeof(AssignedDevInfo)); > + if (adev == NULL) { > + fprintf(stderr, "%s: Out of memory\n", __func__); > + return NULL; > + } > + r = get_param_value(device, sizeof(device), "host", arg); > + r = get_param_value(adev->name, sizeof(adev->name), "name", arg); > + if (!r) > + strncpy(adev->name, device, 8); > + > +#ifdef KVM_CAP_IOMMU > + r = get_param_value(dma, sizeof(dma), "dma", arg); > + if (r && !strncmp(dma, "none", 4)) > + disable_iommu = 1; > +#endif > + cp = device; > + adev->bus = strtoul(cp, &cp1, 16); > + if (*cp1 != ':') > + goto bad; > + cp = cp1 + 1; > + > + adev->dev = strtoul(cp, &cp1, 16); > + if (*cp1 != '.') > + goto bad; > + cp = cp1 + 1; > + > + adev->func = strtoul(cp, &cp1, 16); > + > + nr_assigned_devices++; > + LIST_INSERT_HEAD(&adev_head, adev, next); > + return adev; > +bad: > + fprintf(stderr, "pcidevice argument parse error; " > + "please check the help text for usage\n"); > + qemu_free(adev); > + return NULL; > +} > diff --git a/qemu/hw/device-assignment.h b/qemu/hw/device-assignment.h > new file mode 100644 > index 0000000..e4148df > --- /dev/null > +++ b/qemu/hw/device-assignment.h > @@ -0,0 +1,98 @@ > +/* > + * Copyright (c) 2007, Neocleus Corporation. > + * Copyright (c) 2007, Intel Corporation. 
> + * > + * This program is free software; you can redistribute it and/or > modify it + * under the terms and conditions of the GNU General > Public License, + * version 2, as published by the Free Software > Foundation. + * > + * This program is distributed in the hope it will be useful, but > WITHOUT + * ANY WARRANTY; without even the implied warranty of > MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU > General Public License for + * more details. > + * > + * You should have received a copy of the GNU General Public License > along with + * this program; if not, write to the Free Software > Foundation, Inc., 59 Temple + * Place - Suite 330, Boston, MA > 02111-1307 USA. + * > + * Data structures for storing PCI state > + * > + * Adapted to kvm by Qumranet > + * > + * Copyright (c) 2007, Neocleus, Alex Novik (alex@neocleus.com) > + * Copyright (c) 2007, Neocleus, Guy Zana (guy@neocleus.com) > + * Copyright (C) 2008, Qumranet, Amit Shah (amit.shah@qumranet.com) > + * Copyright (C) 2008, Red Hat, Amit Shah (amit.shah@redhat.com) > + */ > + > +#ifndef __DEVICE_ASSIGNMENT_H__ > +#define __DEVICE_ASSIGNMENT_H__ > + > +#include <sys/mman.h> > +#include "qemu-common.h" > +#include "sys-queue.h" > +#include "pci.h" > + > +/* From include/linux/pci.h in the kernel sources */ > +#define PCI_DEVFN(slot, func) ((((slot) & 0x1f) << 3) | ((func) & > 0x07)) + > +#define MAX_IO_REGIONS (6) > + > +typedef struct { > + int type; /* Memory or port I/O */ > + int valid; > + uint32_t base_addr; > + uint32_t size; /* size of the region */ > + int resource_fd; > +} PCIRegion; > + > +typedef struct { > + uint8_t bus, dev, func; /* Bus inside domain, device and > function */ + int irq; /* IRQ number */ > + uint16_t region_number; /* number of active regions */ > + > + /* Port I/O or MMIO Regions */ > + PCIRegion regions[MAX_IO_REGIONS]; > + int config_fd; > +} PCIDevRegions; > + > +typedef struct { > + target_phys_addr_t e_physbase; > + uint32_t memory_index; > + 
void *r_virtbase; /* mmapped access address */ > + int num; /* our index within v_addrs[] */ > + uint32_t e_size; /* emulated size of region in bytes */ > + uint32_t r_size; /* real size of region in bytes */ > +} AssignedDevRegion; > + > +typedef struct { > + PCIDevice dev; > + int intpin; > + uint8_t debug_flags; > + AssignedDevRegion v_addrs[PCI_NUM_REGIONS]; > + PCIDevRegions real_device; > + int run; > + int girq; > + unsigned char h_busnr; > + unsigned int h_devfn; > + int bound; > +} AssignedDevice; > + > +typedef struct AssignedDevInfo AssignedDevInfo; > + > +struct AssignedDevInfo { > + char name[15]; > + int bus; > + int dev; > + int func; > + AssignedDevice *assigned_dev; > + LIST_ENTRY(AssignedDevInfo) next; > +}; > + > +PCIDevice *init_assigned_device(AssignedDevInfo *adev, PCIBus *bus); > +int init_all_assigned_devices(PCIBus *bus); > +AssignedDevInfo *add_assigned_device(const char *arg); > +void assigned_dev_set_vector(int irq, int vector); > +void assigned_dev_ack_mirq(int vector); > + > +#endif /* __DEVICE_ASSIGNMENT_H__ */ > diff --git a/qemu/hw/pc.c b/qemu/hw/pc.c > index d559f0c..e0438ed 100644 > --- a/qemu/hw/pc.c > +++ b/qemu/hw/pc.c > @@ -33,6 +33,7 @@ > #include "boards.h" > #include "console.h" > #include "fw_cfg.h" > +#include "device-assignment.h" > > #include "qemu-kvm.h" > > @@ -993,6 +994,11 @@ static void pc_init1(ram_addr_t ram_size, int > vga_ram_size, } > } > > + /* Initialize assigned devices */ > + if (pci_enabled) > + if(init_all_assigned_devices(pci_bus)) > + exit(1); > + > rtc_state = rtc_init(0x70, i8259[8]); > > qemu_register_boot_set(pc_boot_set, rtc_state); > diff --git a/qemu/hw/pci.c b/qemu/hw/pci.c > index c82cd20..f86a8a7 100644 > --- a/qemu/hw/pci.c > +++ b/qemu/hw/pci.c > @@ -50,6 +50,7 @@ struct PCIBus { > > static void pci_update_mappings(PCIDevice *d); > static void pci_set_irq(void *opaque, int irq_num, int level); > +void assigned_dev_update_irq(PCIDevice *d); > > target_phys_addr_t pci_mem_base; > static int 
pci_irq_index; > @@ -453,6 +454,12 @@ void pci_default_write_config(PCIDevice *d, > val >>= 8; > } > > +#ifdef KVM_CAP_DEVICE_ASSIGNMENT > + if (kvm_enabled() && qemu_kvm_irqchip_in_kernel() && > + address >= 0x60 && address <= 0x63) > + assigned_dev_update_irq(d); > +#endif > + > end = address + len; > if (end > PCI_COMMAND && address < (PCI_COMMAND + 2)) { > /* if the command register is modified, we must modify the > mappings */ > diff --git a/qemu/vl.c b/qemu/vl.c > index 388e79d..5a39d12 100644 > --- a/qemu/vl.c > +++ b/qemu/vl.c > @@ -38,6 +38,7 @@ > #include "qemu-char.h" > #include "block.h" > #include "audio/audio.h" > +#include "hw/device-assignment.h" > #include "migration.h" > #include "balloon.h" > #include "qemu-kvm.h" > @@ -8692,6 +8693,12 @@ static void help(int exitcode) > #endif > "-no-kvm-irqchip disable KVM kernel mode PIC/IOAPIC/LAPIC\n" > "-no-kvm-pit disable KVM kernel mode PIT\n" > +#if defined(TARGET_I386) || defined(TARGET_X86_64) || > defined(__linux__) + "-pcidevice > host=bus:dev.func[,dma=none][,name=\"string\"]\n" + " > expose a PCI device to the guest OS.\n" + " > dma=none: don't perform any dma translations (default is to use an > iommu)\n" + " 'string' is used in log > output.\n" +#endif #endif > #ifdef TARGET_I386 > "-no-acpi disable ACPI\n" > @@ -8811,6 +8818,9 @@ enum { > QEMU_OPTION_no_kvm, > QEMU_OPTION_no_kvm_irqchip, > QEMU_OPTION_no_kvm_pit, > +#if defined(TARGET_I386) || defined(TARGET_X86_64) || > defined(__linux__) + QEMU_OPTION_pcidevice, > +#endif > QEMU_OPTION_no_reboot, > QEMU_OPTION_no_shutdown, > QEMU_OPTION_show_cursor, > @@ -8900,6 +8910,9 @@ static const QEMUOption qemu_options[] = { > #endif > { "no-kvm-irqchip", 0, QEMU_OPTION_no_kvm_irqchip }, > { "no-kvm-pit", 0, QEMU_OPTION_no_kvm_pit }, > +#if defined(TARGET_I386) || defined(TARGET_X86_64) || > defined(__linux__) + { "pcidevice", HAS_ARG, QEMU_OPTION_pcidevice > }, +#endif > #endif > #if defined(TARGET_PPC) || defined(TARGET_SPARC) > { "g", 1, 
QEMU_OPTION_g }, > @@ -9844,6 +9857,11 @@ int main(int argc, char **argv) > kvm_pit = 0; > break; > } > +#if defined(TARGET_I386) || defined(TARGET_X86_64) || > defined(__linux__) + case QEMU_OPTION_pcidevice: > + add_assigned_device(optarg); > + break; > +#endif > #endif > case QEMU_OPTION_usb: > usb_enabled = 1; > -- > 1.6.0.2 Best Regards, Disheng, Su
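As a side note on the "-pcidevice host=bus:dev.func" syntax quoted above, the address part could be parsed along the lines of the sketch below. It follows the same strtoul()-based steps as add_assigned_device() in the quoted patch, but the function name parse_bdf, the range checks, and the rejection of trailing characters are additions made here for illustration, not behaviour of the patch.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse "bus:dev.func" (hexadecimal fields).
 * Returns 0 and fills the outputs on success, -1 on malformed input.
 * Limits assumed here: bus <= 0xff, dev <= 0x1f, func <= 0x7.
 */
static int parse_bdf(const char *s, int *bus, int *dev, int *func)
{
    char *end;
    unsigned long b, d, f;

    assert(s != NULL && bus != NULL && dev != NULL && func != NULL);

    b = strtoul(s, &end, 16);
    if (end == s || *end != ':' || b > 0xff)
        return -1;
    s = end + 1;

    d = strtoul(s, &end, 16);
    if (end == s || *end != '.' || d > 0x1f)
        return -1;
    s = end + 1;

    f = strtoul(s, &end, 16);
    if (end == s || *end != '\0' || f > 0x7)
        return -1;              /* empty field, trailing junk, or out of range */

    *bus = (int)b;
    *dev = (int)d;
    *func = (int)f;
    return 0;
}
```

With this shape, "04:08.0" yields bus 4, device 8, function 0, while inputs such as "04:08" or "04:20.0" are rejected instead of being silently accepted.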
* RE: [PATCH 5/6] KVM/userspace: Device Assignment: Support for assigning PCI devices to guests 2008-10-27 1:28 ` Su, Disheng @ 2008-10-27 6:32 ` Han, Weidong 2008-10-28 10:12 ` Muli Ben-Yehuda 0 siblings, 1 reply; 23+ messages in thread From: Han, Weidong @ 2008-10-27 6:32 UTC (permalink / raw) To: Su, Disheng, Amit Shah, avi@redhat.com Cc: kvm@vger.kernel.org, anthony@codemonkey.ws, Kay, Allen M, muli@il.ibm.com, benami@il.ibm.com Su, Disheng wrote: > Amit Shah wrote: >> This patch has been contributed to by the following people: >> >> From: Or Sagi <ors@tutis.com> >> From: Nir Peleg <nir@tutis.com> >> From: Amit Shah <amit.shah@redhat.com> >> From: Ben-Ami Yassour <benami@il.ibm.com> >> From: Weidong Han <weidong.han@intel.com> >> From: Glauber de Oliveira Costa <gcosta@redhat.com> >> >> With this patch, we can assign a device on the host machine to a >> guest. >> >> A new command-line option, -pcidevice is added. >> To invoke it for a device sitting at PCI bus:dev.fn 04:08.0, use >> this: >> >> -pcidevice host=04:08.0 >> >> * The host driver for the device, if any, is to be removed before >> assigning the device (else device assignment will fail). >> >> * A device that shares IRQ with another host device cannot currently >> be assigned. >> >> * The RAW_IO capability is needed for this to work >> >> This works only with the in-kernel irqchip method; to use the >> userspace irqchip, a kernel module (irqhook) and some extra changes >> are needed. 
>> >> Signed-off-by: Amit Shah <amit.shah@redhat.com> >> --- >> qemu/Makefile.target | 1 + >> qemu/hw/device-assignment.c | 619 >> +++++++++++++++++++++++++++++++++++++++++++ >> qemu/hw/device-assignment.h | 98 +++++++ qemu/hw/pc.c >> | 6 + qemu/hw/pci.c | 7 + >> qemu/vl.c | 18 ++ >> 6 files changed, 749 insertions(+), 0 deletions(-) >> create mode 100644 qemu/hw/device-assignment.c >> create mode 100644 qemu/hw/device-assignment.h >> >> diff --git a/qemu/Makefile.target b/qemu/Makefile.target >> index d9bdeca..05a1d84 100644 >> --- a/qemu/Makefile.target >> +++ b/qemu/Makefile.target >> @@ -621,6 +621,7 @@ OBJS+= ide.o pckbd.o ps2.o vga.o $(SOUND_HW) >> dma.o OBJS+= fdc.o mc146818rtc.o serial.o i8259.o i8254.o pcspk.o >> pc.o OBJS+= cirrus_vga.o apic.o parallel.o acpi.o piix_pci.o >> OBJS+= usb-uhci.o vmmouse.o vmport.o vmware_vga.o extboot.o +OBJS+= >> device-assignment.o ifeq ($(USE_KVM_PIT), 1) >> OBJS+= i8254-kvm.o >> endif >> diff --git a/qemu/hw/device-assignment.c >> b/qemu/hw/device-assignment.c new file mode 100644 index >> 0000000..5ba21a0 --- /dev/null >> +++ b/qemu/hw/device-assignment.c >> @@ -0,0 +1,619 @@ >> +/* >> + * Copyright (c) 2007, Neocleus Corporation. >> + * >> + * This program is free software; you can redistribute it and/or >> modify it + * under the terms and conditions of the GNU General >> Public License, + * version 2, as published by the Free Software >> Foundation. + * + * This program is distributed in the hope it will >> be useful, but WITHOUT + * ANY WARRANTY; without even the implied >> warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. >> See the GNU General Public License for + * more details. >> + * >> + * You should have received a copy of the GNU General Public License >> along with + * this program; if not, write to the Free Software >> Foundation, Inc., 59 Temple + * Place - Suite 330, Boston, MA >> 02111-1307 USA. + * + * >> + * Assign a PCI device from the host to a guest VM. 
+ * >> + * Adapted for KVM by Qumranet. >> + * >> + * Copyright (c) 2007, Neocleus, Alex Novik (alex@neocleus.com) >> + * Copyright (c) 2007, Neocleus, Guy Zana (guy@neocleus.com) >> + * Copyright (C) 2008, Qumranet, Amit Shah (amit.shah@qumranet.com) >> + * Copyright (C) 2008, Red Hat, Amit Shah (amit.shah@redhat.com) + >> */ +#include <stdio.h> >> +#include <sys/io.h> >> +#include "qemu-kvm.h" >> +#include "hw.h" >> +#include "pc.h" >> +#include "sysemu.h" >> +#include "console.h" >> +#include <linux/kvm_para.h> >> +#include "device-assignment.h" >> + >> +/* From linux/ioport.h */ >> +#define IORESOURCE_IO 0x00000100 /* Resource type */ >> +#define IORESOURCE_MEM 0x00000200 >> +#define IORESOURCE_IRQ 0x00000400 >> +#define IORESOURCE_DMA 0x00000800 >> +#define IORESOURCE_PREFETCH 0x00001000 /* No side effects */ + >> +/* #define DEVICE_ASSIGNMENT_DEBUG 1 */ >> + >> +#ifdef DEVICE_ASSIGNMENT_DEBUG >> +#define DEBUG(fmt, args...) \ >> + do { \ >> + fprintf(stderr, "%s: " fmt, __func__ , ## args); \ + } >> while (0) +#else >> +#define DEBUG(fmt, args...) 
do { } while(0) >> +#endif >> + >> +static void assigned_dev_ioport_writeb(void *opaque, uint32_t addr, >> + uint32_t value) +{ >> + AssignedDevRegion *r_access = (AssignedDevRegion *)opaque; >> + uint32_t r_pio = (unsigned long)r_access->r_virtbase >> + + (addr - r_access->e_physbase); >> + >> + DEBUG(stderr, "%s: r_pio=%08x e_physbase=%08x" >> + " r_virtbase=%08lx value=%08x\n", >> + __func__, r_pio, (int)r_access->e_physbase, >> + (unsigned long)r_access->r_virtbase, value); + >> outb(value, r_pio); +} >> + >> +static void assigned_dev_ioport_writew(void *opaque, uint32_t addr, >> + uint32_t value) +{ >> + AssignedDevRegion *r_access = (AssignedDevRegion *)opaque; >> + uint32_t r_pio = (unsigned long)r_access->r_virtbase >> + + (addr - r_access->e_physbase); >> + >> + DEBUG(stderr, "%s: r_pio=%08x e_physbase=%08x" >> + " r_virtbase=%08lx value=%08x\n", >> + __func__, r_pio, (int)r_access->e_physbase, >> + (unsigned long)r_access->r_virtbase, value); + >> outw(value, r_pio); +} >> + >> +static void assigned_dev_ioport_writel(void *opaque, uint32_t addr, >> + uint32_t value) >> +{ >> + AssignedDevRegion *r_access = (AssignedDevRegion *)opaque; >> + uint32_t r_pio = (unsigned long)r_access->r_virtbase >> + + (addr - r_access->e_physbase); >> + >> + DEBUG(stderr, "%s: r_pio=%08x e_physbase=%08x" >> + " r_virtbase=%08lx value=%08x\n", >> + __func__, r_pio, (int)r_access->e_physbase, >> + (unsigned long)r_access->r_virtbase, value); + >> outl(value, r_pio); +} >> + >> +static uint32_t assigned_dev_ioport_readb(void *opaque, uint32_t >> addr) +{ + AssignedDevRegion *r_access = (AssignedDevRegion >> *)opaque; + uint32_t r_pio = (addr - r_access->e_physbase) >> + + (unsigned long)r_access->r_virtbase; >> + uint32_t value; >> + >> + value = inb(r_pio); >> + DEBUG(stderr, "%s: r_pio=%08x e_physbase=%08x " >> + "r_virtbase=%08lx value=%08x\n", >> + __func__, r_pio, (int)r_access->e_physbase, >> + (unsigned long)r_access->r_virtbase, value); + return >> value; +} >> + >> 
+static uint32_t assigned_dev_ioport_readw(void *opaque, uint32_t >> addr) +{ + AssignedDevRegion *r_access = (AssignedDevRegion >> *)opaque; + uint32_t r_pio = (addr - r_access->e_physbase) >> + + (unsigned long)r_access->r_virtbase; >> + uint32_t value; >> + >> + value = inw(r_pio); >> + DEBUG(stderr, "%s: r_pio=%08x e_physbase=%08x " >> + "r_virtbase=%08lx value=%08x\n", >> + __func__, r_pio, (int)r_access->e_physbase, >> + (unsigned long)r_access->r_virtbase, value); + return >> value; +} >> + >> +static uint32_t assigned_dev_ioport_readl(void *opaque, uint32_t >> addr) +{ + AssignedDevRegion *r_access = (AssignedDevRegion >> *)opaque; + uint32_t r_pio = (addr - r_access->e_physbase) >> + + (unsigned long)r_access->r_virtbase; >> + uint32_t value; >> + >> + value = inl(r_pio); >> + DEBUG(stderr, "%s: r_pio=%08x e_physbase=%08x " >> + "r_virtbase=%08lx value=%08x\n", >> + __func__, r_pio, (int)r_access->e_physbase, >> + (unsigned long)r_access->r_virtbase, value); + return >> value; +} >> + >> +static void assigned_dev_iomem_map(PCIDevice *pci_dev, int >> region_num, + uint32_t e_phys, >> uint32_t e_size, int type) +{ + AssignedDevice *r_dev = >> (AssignedDevice *) pci_dev; + AssignedDevRegion *region = >> &r_dev->v_addrs[region_num]; + int first_map = (region->e_size == >> 0); + int ret = 0; >> + >> + DEBUG("%s: e_phys=%08x r_virt=%x type=%d len=%08x region_num=%d >> \n", + __func__, e_phys, (uint32_t)region->r_virtbase, type, >> e_size, + region_num); >> + >> + region->e_physbase = e_phys; >> + region->e_size = e_size; >> + >> + /* FIXME: Add support for emulated MMIO for non-kvm guests */ + >> if (kvm_enabled()) { + if (!first_map) >> + kvm_destroy_phys_mem(kvm_context, e_phys, e_size);

> A typo? Need to destroy the original registered address?

Yes, it's buggy. It should look like:

    uint32_t old_ephys = region->e_physbase;
    uint32_t old_esize = region->e_size;
    ...
    kvm_destroy_phys_mem(kvm_context, old_ephys, old_esize);
    ...
Regards,
Weidong
* Re: [PATCH 5/6] KVM/userspace: Device Assignment: Support for assigning PCI devices to guests
From: Muli Ben-Yehuda @ 2008-10-28 10:12 UTC
To: Han, Weidong
Cc: Su, Disheng, Amit Shah, avi@redhat.com, kvm@vger.kernel.org, anthony@codemonkey.ws, Kay, Allen M, Ben-Ami Yassour1

On Mon, Oct 27, 2008 at 02:32:48PM +0800, Han, Weidong wrote:

> Yes, it's buggy. It should look like:
>
> uint32_t old_ephys = region->e_physbase;
> uint32_t old_esize = region->e_size;
>
> ...
>
> kvm_destroy_phys_mem(kvm_context, old_ephys, old_esize);

Fixed in v8. Thanks!

Cheers,
Muli
--
The First Workshop on I/O Virtualization (WIOV '08)
Dec 2008, San Diego, CA, http://www.usenix.org/wiov08/
<->
SYSTOR 2009---The Israeli Experimental Systems Conference
http://www.haifa.il.ibm.com/conferences/systor2009/
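A minimal sketch of the ordering problem discussed in this sub-thread, under stated assumptions: `Region`, `stub_destroy_phys_mem` and `region_remap` are hypothetical stand-ins for `AssignedDevRegion`, `kvm_destroy_phys_mem()` and the body of `assigned_dev_iomem_map()`, and the stub only records its arguments. It is not the v8 code. The point it illustrates is that the old base and size have to be captured before the region fields are overwritten, so that the teardown targets the previously registered range.

```c
#include <stdint.h>

/* Hypothetical stand-in for AssignedDevRegion; only the two fields
 * discussed in the thread are modelled. */
typedef struct {
    uint32_t e_physbase;  /* guest-physical base of the current mapping */
    uint32_t e_size;      /* size of the current mapping, 0 if unmapped */
} Region;

/* Bookkeeping for the stub below, so its calls can be inspected. */
static uint32_t last_destroyed_base;
static uint32_t last_destroyed_size;
static int destroy_calls;

/* Stub standing in for kvm_destroy_phys_mem(); it records its arguments. */
static void stub_destroy_phys_mem(uint32_t base, uint32_t size)
{
    last_destroyed_base = base;
    last_destroyed_size = size;
    destroy_calls++;
}

/* Remap a region: tear down the OLD range, not the new one. */
static void region_remap(Region *region, uint32_t e_phys, uint32_t e_size)
{
    /* Capture the old mapping before the fields are overwritten. */
    uint32_t old_ephys = region->e_physbase;
    uint32_t old_esize = region->e_size;
    int first_map = (old_esize == 0);

    region->e_physbase = e_phys;
    region->e_size = e_size;

    if (!first_map)
        stub_destroy_phys_mem(old_ephys, old_esize);
}
```

Mapping a fresh region at one address and then remapping it elsewhere results in exactly one teardown call, and that call receives the first address and size rather than the new ones.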
* Re: [PATCH 3/6] qemu: piix: Introduce functions to get pin number from irq and vice versa
From: Avi Kivity @ 2008-10-26 13:31 UTC
To: Amit Shah
Cc: kvm, anthony, weidong.han, allen.m.kay, muli, benami

Amit Shah wrote:
>
> +int piix3_get_pin(int pic_irq)
> +{
> +    int i;
> +    for (i = 0; i < 4; i++)
> +        if (piix3_dev->config[0x60+i] == pic_irq)
> +            return i;
> +    return -1;
> +}
>

What happens if two pci interrupts are routed to one irq line?

--
error compiling committee.c: too many arguments to function
* Re: [PATCH 3/6] qemu: piix: Introduce functions to get pin number from irq and vice versa
From: Muli Ben-Yehuda @ 2008-10-28 10:12 UTC
To: Avi Kivity
Cc: Amit Shah, kvm, anthony, weidong.han, allen.m.kay, Ben-Ami Yassour1

On Sun, Oct 26, 2008 at 03:31:24PM +0200, Avi Kivity wrote:
> Amit Shah wrote:
>> +int piix3_get_pin(int pic_irq)
>> +{
>> +    int i;
>> +    for (i = 0; i < 4; i++)
>> +        if (piix3_dev->config[0x60+i] == pic_irq)
>> +            return i;
>> +    return -1;
>> +}
>>
>
> What happens if two pci interrupts are routed to one irq line?

This one I'm still thinking about.

Cheers,
Muli
* Re: [PATCH 3/6] qemu: piix: Introduce functions to get pin number from irq and vice versa
From: Avi Kivity @ 2008-10-28 10:46 UTC
To: Muli Ben-Yehuda
Cc: Amit Shah, kvm, anthony, weidong.han, allen.m.kay, Ben-Ami Yassour1

Muli Ben-Yehuda wrote:
> On Sun, Oct 26, 2008 at 03:31:24PM +0200, Avi Kivity wrote:
>
>> Amit Shah wrote:
>>
>>> +int piix3_get_pin(int pic_irq)
>>> +{
>>> +    int i;
>>> +    for (i = 0; i < 4; i++)
>>> +        if (piix3_dev->config[0x60+i] == pic_irq)
>>> +            return i;
>>> +    return -1;
>>> +}
>>>
>> What happens if two pci interrupts are routed to one irq line?
>>
>
> This one I'm still thinking about.
>

Well, what is this needed for in the first place?
* Re: [PATCH 3/6] qemu: piix: Introduce functions to get pin number from irq and vice versa
From: Muli Ben-Yehuda @ 2008-10-28 15:44 UTC
To: Avi Kivity
Cc: Amit Shah, kvm, anthony, weidong.han, allen.m.kay, Ben-Ami Yassour1

On Tue, Oct 28, 2008 at 12:46:39PM +0200, Avi Kivity wrote:
> Muli Ben-Yehuda wrote:
>> On Sun, Oct 26, 2008 at 03:31:24PM +0200, Avi Kivity wrote:
>>
>>> Amit Shah wrote:
>>>
>>>> +int piix3_get_pin(int pic_irq)
>>>> +{
>>>> +    int i;
>>>> +    for (i = 0; i < 4; i++)
>>>> +        if (piix3_dev->config[0x60+i] == pic_irq)
>>>> +            return i;
>>>> +    return -1;
>>>> +}
>>>>
>>> What happens if two pci interrupts are routed to one irq line?
>>>
>>
>> This one I'm still thinking about.
>>
>
> Well, what is this needed for in the first place?

This specific function is not used. I assume Amit added it for
completeness with piix_get_irq. piix_get_irq, as far as I can tell, is
used in only one place (when the guest updates a device's
configuration space interrupt register) to go from interrupt pin
(intx) to guest IRQ line.

Cheers,
Muli
* Re: [PATCH 3/6] qemu: piix: Introduce functions to get pin number from irq and vice versa
From: Avi Kivity @ 2008-10-28 16:21 UTC
To: Muli Ben-Yehuda
Cc: Amit Shah, kvm, anthony, weidong.han, allen.m.kay, Ben-Ami Yassour1

Muli Ben-Yehuda wrote:
>> Well, what is this needed for in the first place?
>>
>
> This specific function is not used. I assume Amit added it for
> completeness with piix_get_irq. piix_get_irq, as far as I can tell, is
> used in only one place (when the guest updates a device's
> configuration space interrupt register) to go from interrupt pin
> (intx) to guest IRQ line.
>

In that case, a solution suggests itself...
* Re: [PATCH 3/6] qemu: piix: Introduce functions to get pin number from irq and vice versa
From: Muli Ben-Yehuda @ 2008-10-28 16:45 UTC
To: Avi Kivity
Cc: Amit Shah, kvm, anthony, weidong.han, allen.m.kay, Ben-Ami Yassour1

On Tue, Oct 28, 2008 at 06:21:35PM +0200, Avi Kivity wrote:
> Muli Ben-Yehuda wrote:
>
>>> Well, what is this needed for in the first place?
>>>
>>
>> This specific function is not used. I assume Amit added it for
>> completeness with piix_get_irq. piix_get_irq, as far as I can tell, is
>> used in only one place (when the guest updates a device's
>> configuration space interrupt register) to go from interrupt pin
>> (intx) to guest IRQ line.
>>
>
> In that case, a solution suggests itself...

Yes, of course! I don't know how I missed it! Err... What is it?

Seriously, I removed piix3_get_pin as soon as I noticed it wasn't
actually used, but I am not convinced that there are no aliasing
issues remaining with piix_get_irq, most likely because I do not
understand PCI interrupt routing to any sufficient degree. Do you see
problems remaining with piix_get_irq?

Cheers,
Muli
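A simplified model of the aliasing question raised in this sub-thread, offered as a sketch rather than as the PIIX3 implementation: `RouteTable` and the three `route_*` functions are hypothetical names, and the table treats each of the four bytes at `config[0x60..0x63]` as a plain IRQ number, which is how the quoted `piix3_get_pin()` compares them. In this model the pin-to-IRQ direction has one answer per pin, while the IRQ-to-pin direction can have several, so a first-match scan silently hides the others.

```c
#include <stdint.h>

#define NUM_PINS 4  /* same bound as the loop in the quoted patch */

/* Hypothetical stand-in for the four routing bytes at config[0x60..0x63]. */
typedef struct {
    uint8_t route[NUM_PINS];
} RouteTable;

/* Forward lookup (pin -> irq): exactly one table entry per pin. */
static int route_get_irq(const RouteTable *t, int pin)
{
    if (pin < 0 || pin >= NUM_PINS)
        return -1;
    return t->route[pin];
}

/* First-match reverse lookup, same shape as the quoted piix3_get_pin(). */
static int route_first_pin(const RouteTable *t, int irq)
{
    int i;

    for (i = 0; i < NUM_PINS; i++)
        if (t->route[i] == irq)
            return i;
    return -1;
}

/* Reverse lookup that reports every match as a bitmask (bit i = pin i). */
static unsigned route_pin_mask(const RouteTable *t, int irq)
{
    unsigned mask = 0;
    int i;

    for (i = 0; i < NUM_PINS; i++)
        if (t->route[i] == irq)
            mask |= 1u << i;
    return mask;
}
```

With a table of `{10, 11, 10, 5}`, the first-match lookup for IRQ 10 returns pin 0 only, whereas the mask variant returns 0x5 (pins 0 and 2). Whether any caller actually needs the reverse direction is the question the thread leaves open.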
* Re: [PATCH 1/6] KVM/userspace: Device Assignment: Add ioctl wrappers needed for assigning devices
From: Avi Kivity @ 2008-10-26 13:29 UTC
To: Amit Shah
Cc: kvm, anthony, weidong.han, allen.m.kay, muli, benami

Amit Shah wrote:
>
> +#ifdef KVM_CAP_DEVICE_ASSIGNMENT
> +int kvm_assign_pci_device(kvm_context_t kvm,
> +                          struct kvm_assigned_pci_dev *assigned_dev)
> +{
> +    return ioctl(kvm->vm_fd, KVM_ASSIGN_PCI_DEVICE, assigned_dev);
>

Convert -1s to -errno, to avoid problems with errno being overwritten later.

> +}
> +
> +int kvm_assign_irq(kvm_context_t kvm,
> +                   struct kvm_assigned_irq *assigned_irq)
> +{
> +    return ioctl(kvm->vm_fd, KVM_ASSIGN_IRQ, assigned_irq);
> +}
> +#endif
>
* Re: [PATCH 1/6] KVM/userspace: Device Assignment: Add ioctl wrappers needed for assigning devices
From: Muli Ben-Yehuda @ 2008-10-28 10:13 UTC
To: Avi Kivity
Cc: Amit Shah, kvm, anthony, weidong.han, allen.m.kay, Ben-Ami Yassour1

On Sun, Oct 26, 2008 at 03:29:19PM +0200, Avi Kivity wrote:
> Amit Shah wrote:
>> +#ifdef KVM_CAP_DEVICE_ASSIGNMENT
>> +int kvm_assign_pci_device(kvm_context_t kvm,
>> +                          struct kvm_assigned_pci_dev *assigned_dev)
>> +{
>> +    return ioctl(kvm->vm_fd, KVM_ASSIGN_PCI_DEVICE, assigned_dev);
>>
>
> Convert -1s to -errno, to avoid problems with errno being
> overwritten later.

Done.

Cheers,
Muli
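A minimal sketch of the convention requested in this review, assuming a hypothetical helper named `ioctl_neg_errno` (it is not a libkvm function, and the actual v8 change is not shown in this thread). `ioctl()` reports failure as -1 with the reason in `errno`; folding the two into one negative return value at the call site means a later library call cannot clobber the error code before the caller reads it.

```c
#include <errno.h>
#include <sys/ioctl.h>

/*
 * Hypothetical helper, not part of libkvm: issue an ioctl and fold the
 * "-1 plus errno" convention into a single negative-errno return value.
 */
static int ioctl_neg_errno(int fd, unsigned long request, void *arg)
{
    int r = ioctl(fd, request, arg);

    if (r == -1)
        return -errno;  /* read errno before anything can overwrite it */
    return r;
}
```

A wrapper such as `kvm_assign_pci_device()` could then return the helper's result directly, and callers would check `r < 0` and pass `-r` to `strerror()` for diagnostics.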
* Re: [v7] Userspace patches for PCI device assignment
From: Anthony Liguori @ 2008-10-24 15:59 UTC
To: Amit Shah
Cc: avi, kvm, weidong.han, allen.m.kay, muli, benami

Amit Shah wrote:
> This patchset enables device assignment for KVM hosts for PCI devices. It uses the Intel IOMMU by default if available.
>
> Major changes since the last send in no particular order:
> - formatting changes: adhere to qemu style
> - use strncmp, strncpy etc. instead of the insecure ones
>

FWIW, strncpy almost never does what you expect it to. snprintf() is much nicer.

Regards,

Anthony Liguori

> - move from array to linked list
> - change iopl() to ioperm() (Weidong Han)
>
> Plus a lot of other small changes as suggested during the review of v6.
* Re: [v7] Userspace patches for PCI device assignment
From: Muli Ben-Yehuda @ 2008-10-28 10:13 UTC
To: Anthony Liguori
Cc: Amit Shah, avi, kvm, weidong.han, allen.m.kay, Ben-Ami Yassour1

On Fri, Oct 24, 2008 at 10:59:58AM -0500, Anthony Liguori wrote:
> Amit Shah wrote:
>> This patchset enables device assignment for KVM hosts for PCI devices. It
>> uses the Intel IOMMU by default if available.
>>
>> Major changes since the last send in no particular order:
>> - formatting changes: adhere to qemu style
>> - use strncmp, strncpy etc. instead of the insecure ones
>>
>
> FWIW, strncpy almost never does what you expect it to. snprintf()
> is much nicer.

Fixed all over. If you find a stray strncpy, shoot it.

Cheers,
Muli
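A minimal sketch of the snprintf()-based copy suggested in this sub-thread. The helper name `copy_name` is hypothetical and does not come from the patchset. The background is standard C behaviour: `strncpy(dst, src, n)` writes no terminating NUL when `src` is `n` bytes or longer, whereas `snprintf()` always terminates the destination (for a non-zero size) and returns the length the full string would have had, which makes truncation detectable.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: copy src into dst (capacity `size` bytes),
 * always leaving dst NUL-terminated when size > 0.
 * Returns 1 if src fit completely, 0 if it was truncated.
 */
static int copy_name(char *dst, size_t size, const char *src)
{
    /* snprintf returns the length src would need, excluding the NUL. */
    int n = snprintf(dst, size, "%s", src);

    return n >= 0 && (size_t)n < size;
}
```

Copying "04:08.0" into an 8-byte buffer fits exactly and returns 1; copying "0000:04:08.0" into the same buffer returns 0 and leaves the terminated prefix "0000:04" behind.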