public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
From: Anthony Liguori <anthony@codemonkey.ws>
To: Amit Shah <amit.shah@qumranet.com>
Cc: kvm@vger.kernel.org, muli@il.ibm.com, benami@il.ibm.com,
	allen.m.kay@intel.com, chrisw@redhat.com
Subject: Re: [PATCH 1/1] KVM/userspace: Support for assigning PCI devices to guest
Date: Mon, 02 Jun 2008 08:57:31 -0500	[thread overview]
Message-ID: <4843FC4B.3060706@codemonkey.ws> (raw)
In-Reply-To: <1212389189-26115-1-git-send-email-amit.shah@qumranet.com>

Amit Shah wrote:
> From: Or Sagi <ors@tutis.com>
> From: Nir Peleg <nir@tutis.com>
> From: Amit Shah <amit.shah@qumranet.com>
> From: Glauber de Oliveira Costa <gcosta@redhat.com>
>
> We can assign a device from the host machine to a guest. The original code comes
> from Neocleus.
>
> A new command-line option, -pcidevice is added.
> For example, to invoke it for an Ethernet device sitting at
> PCI bus:dev.fn 04:08.0 with host IRQ 18, use this:
>
> -pcidevice Ethernet/04:08.0-18
>   

Why is it necessary to specify the device type and the IRQ?  Shouldn't 
that all be discoverable?

> The host ethernet driver is to be removed before doing the passthrough.
>
> If kvm uses the in-kernel irqchip, interrupts are routed to
> the guest via the kvm module (accompanied kernel changes are necessary).
> If -no-kvm-irqchip is used, the 'irqhook' module, also included here,
> is to be used.
>   

We should drop the irqhook module entirely.  It seems unlikely that it 
will be received well on LKML.

> diff --git a/libkvm/libkvm.h b/libkvm/libkvm.h
> index 63183b8..0dd4334 100644
> --- a/libkvm/libkvm.h
> +++ b/libkvm/libkvm.h
> @@ -12,6 +12,7 @@
>  #endif
>  
>  #include <linux/kvm.h>
> +#include <linux/kvm_para.h>
>  
>  #include <signal.h>
>  
> @@ -626,4 +627,19 @@ int kvm_enable_vapic(kvm_context_t kvm, int vcpu, uint64_t vapic);
>  
>  #endif
>  
> +#ifdef KVM_CAP_PCI_PASSTHROUGH
> +/*!
> + * \brief Notifies host kernel about assigning a PCI device to the guest
> + *
> + * Used for PCI passthrough, this function notifies the host kernel
> + * about the assigning of the physical PCI device and the guest PCI
> + * parameters. Also called when the guest updates the irq number for a
> + * passthrough device
> + *
> + * \param kvm Pointer to the current kvm_context
> + * \param pci_pt_dev Parameters like irq, PCI bus, devfn number, etc
> + */
> +int kvm_assign_pci_pt_device(kvm_context_t kvm,
> +			     struct kvm_pci_passthrough_dev *pci_pt_dev);
>   

Is such a high level interface really required?  We need to forward one 
(or potentially more) IRQs to the guest and we need to setup VT-d.  
Shouldn't these be separate operations?

> +#endif
>  #endif
> diff --git a/qemu/Makefile.target b/qemu/Makefile.target
> index 66c1fb1..1c9d445 100644
> --- a/qemu/Makefile.target
> +++ b/qemu/Makefile.target
> @@ -615,6 +615,7 @@ OBJS+= ide.o pckbd.o ps2.o vga.o $(SOUND_HW) dma.o
>  OBJS+= fdc.o mc146818rtc.o serial.o i8259.o i8254.o pcspk.o pc.o
>  OBJS+= cirrus_vga.o apic.o parallel.o acpi.o piix_pci.o
>  OBJS+= usb-uhci.o vmmouse.o vmport.o vmware_vga.o extboot.o
> +OBJS+= pci-passthrough.o
>  ifeq ($(USE_KVM_PIT), 1)
>  OBJS+= i8254-kvm.o
>  endif
> diff --git a/qemu/hw/apic.c b/qemu/hw/apic.c
> index a14cab2..3f83e64 100644
> --- a/qemu/hw/apic.c
> +++ b/qemu/hw/apic.c
> @@ -23,6 +23,8 @@
>  
>  #include "qemu-kvm.h"
>  
> +#include "pci-passthrough.h"
> +
>  //#define DEBUG_APIC
>  //#define DEBUG_IOAPIC
>  
> @@ -389,6 +391,7 @@ static void apic_eoi(APICState *s)
>      /* XXX: send the EOI packet to the APIC bus to allow the I/O APIC to
>              set the remote IRR bit for level triggered interrupts. */
>      apic_update_irq(s);
> +    pt_ack_mirq(isrv);
>  }
>  
>  static void apic_get_delivery_bitmask(uint32_t *deliver_bitmask,
> @@ -1144,6 +1147,7 @@ static void ioapic_mem_writel(void *opaque, target_phys_addr_t addr, uint32_t va
>                      } else {
>                          s->ioredtbl[index] &= ~0xffffffffULL;
>                          s->ioredtbl[index] |= val;
> +                        pt_set_vector(index, (val << 24) >> 24);
>   

Is this an attempt to sign extend a 24-bit value?  It's very odd looking.

>                      }
>                      ioapic_service(s);
>                  }
> diff --git a/qemu/hw/isa.h b/qemu/hw/isa.h
> index 89b3004..c720f5e 100644
> --- a/qemu/hw/isa.h
> +++ b/qemu/hw/isa.h
> @@ -1,5 +1,7 @@
>  /* ISA bus */
>  
> +#include "hw.h"
> +
>  extern target_phys_addr_t isa_mem_base;
>  
>  int register_ioport_read(int start, int length, int size,
> diff --git a/qemu/hw/pc.c b/qemu/hw/pc.c
> index da4a608..a9b7e60 100644
> --- a/qemu/hw/pc.c
> +++ b/qemu/hw/pc.c
> @@ -32,6 +32,7 @@
>  #include "smbus.h"
>  #include "boards.h"
>  #include "console.h"
> +#include "pci-passthrough.h"
>  
>  #include "qemu-kvm.h"
>  
> @@ -997,6 +998,9 @@ static void pc_init1(ram_addr_t ram_size, int vga_ram_size,
>          }
>      }
>  
> +    /* Initialize pass-through */
> +    pt_init(pci_bus);
> +
>   

Each pass-through device should be treated as a special PCI device.  
This implies that we should be looping in pc_init1() and creating each 
pass through device based on user-provided parameters.  A side effect of 
this is that it will make pci_add much easier to get working which is 
something that's missing from this series.

>      rtc_state = rtc_init(0x70, i8259[8]);
>  
>      register_ioport_read(0x92, 1, 1, ioport92_read, NULL);
> diff --git a/qemu/hw/pci-passthrough.c b/qemu/hw/pci-passthrough.c
> new file mode 100644
> index 0000000..b1413af
> --- /dev/null
> +++ b/qemu/hw/pci-passthrough.c
> @@ -0,0 +1,704 @@
> +/*
> + * Copyright (c) 2007, Neocleus Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + *
> + * You should have received a copy of the GNU General Public License along with
> + * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
> + * Place - Suite 330, Boston, MA 02111-1307 USA.
> + *
> + *
> + *  Pass a PCI device from the host to a guest VM.
> + *
> + *  Adapted for KVM by Qumranet.
> + *
> + *  Copyright (c) 2007, Neocleus, Alex Novik (alex@neocleus.com)
> + *  Copyright (c) 2007, Neocleus, Guy Zana (guy@neocleus.com)
> + *  Copyright (C) 2008, Qumranet, Amit Shah (amit.shah@qumranet.com)
> + */
> +#include <stdio.h>
> +#include <pthread.h>
> +#include <sys/io.h>
> +#include <sys/ioctl.h>
> +#include <linux/types.h>
> +
> +typedef __u64 resource_size_t;
> +#define __deprecated 
> +#include <linux/ioport.h>
>   

Neither of these look very good.  Why are we including ioport.h?

> +#include "pci-passthrough.h"
> +#include "irq.h"
> +
> +#include "qemu-kvm.h"
> +#include <linux/kvm_para.h>
> +extern kvm_context_t kvm_context;
> +extern FILE *logfile;
>   

Please avoid open usage of kvm_context.

> +CPUReadMemoryFunc *pt_mmio_read_cb[3] = {
> +	pt_mmio_readb,
> +	pt_mmio_readw,
> +	pt_mmio_readl
> +};
> +
> +CPUWriteMemoryFunc *pt_mmio_write_cb[3] = {
> +	pt_mmio_writeb,
> +	pt_mmio_writew,
> +	pt_mmio_writel
> +};
> +
> +//#define PT_DEBUG
> +
> +#ifdef PT_DEBUG
> +#define DEBUG(fmt, args...) fprintf(stderr, "%s: " fmt, __FUNCTION__ , ## args)
> +#else
> +#define DEBUG(fmt, args...)
> +#endif
> +
> +#define pt_mmio_write(suffix, type)					\
> +void pt_mmio_write##suffix(void *opaque, target_phys_addr_t e_phys,	\
> +				uint32_t value)				\
> +{									\
> +	pt_region_t *r_access = (pt_region_t *)opaque;			\
> +	void *r_virt = (u8 *)r_access->r_virtbase + 			\
> +			(e_phys - r_access->e_physbase);		\
> +	if (r_access->debug & PT_DEBUG_MMIO) {				\
> +		fprintf(logfile, "pt_mmio_write" #suffix		\
> +			": e_physbase=%p e_phys=%p r_virt=%p value=%08x\n", \
> +			(void *)r_access->e_physbase, (void *)e_phys,	\
> +			r_virt, value);					\
> +	}								\
> +	*(type *)r_virt = (type)value;					\
> +}
> +
> +pt_mmio_write(b, u8)
> +pt_mmio_write(w, u16)
> +pt_mmio_write(l, u32)
> +
> +#define pt_mmio_read(suffix, type)					\
> +uint32_t pt_mmio_read##suffix(void *opaque, target_phys_addr_t e_phys)	\
> +{									\
> +	pt_region_t *r_access = (pt_region_t *)opaque;			\
> +	void *r_virt = (u8 *)r_access->r_virtbase + 			\
> +			(e_phys - r_access->e_physbase);		\
> +	uint32_t value = (u32) (*(type *) r_virt);			\
> +	if (r_access->debug & PT_DEBUG_MMIO) {				\
> +		fprintf(logfile,					\
> +			"pt_mmio_read" #suffix ": e_physbase=%p "	\
> +			"e_phys=%p r_virt=%p value=%08x\n",		\
> +			(void *)r_access->e_physbase,			\
> +			(void *)e_phys, r_virt, value);			\
> +	}								\
> +	return value;							\
> +}
> +
> +pt_mmio_read(b, u8)
> +pt_mmio_read(w, u16)
> +pt_mmio_read(l, u32)
> +
> +#define pt_ioport_write(suffix)						\
> +void pt_ioport_write##suffix(void *opaque, uint32_t addr, uint32_t value) \
> +{									\
> +	pt_region_t *r_access = (pt_region_t *)opaque;			\
> +	uint32_t r_pio = (unsigned long)r_access->r_virtbase		\
> +			 + (addr - r_access->e_physbase);		\
> +	if (r_access->debug & PT_DEBUG_PIO) {				\
> +		fprintf(logfile, "pt_ioport_write" #suffix 		\
> +			": r_pio=%08x e_physbase=%08x"			\
> +			" r_virtbase=%08lx value=%08x\n", 		\
> +			r_pio, (int)r_access->e_physbase,		\
> +			(unsigned long)r_access->r_virtbase, value);	\
> +	}								\
> +	out##suffix(value, r_pio);					\
> +}
> +
> +pt_ioport_write(b)
> +pt_ioport_write(w)
> +pt_ioport_write(l)
> +
> +#define pt_ioport_read(suffix)						\
> +uint32_t pt_ioport_read##suffix(void *opaque, uint32_t addr)		\
> +{									\
> +	pt_region_t *r_access = (pt_region_t *)opaque;			\
> +	uint32_t r_pio = (addr - r_access->e_physbase)			\
> +			+ (unsigned long)r_access->r_virtbase;		\
> +	uint32_t value = in##suffix(r_pio);				\
>   

FYI, you can get inl et al from sys/io.h

> +	if (r_access->debug & PT_DEBUG_PIO) {				\
> +		fprintf(logfile, "pt_ioport_read" #suffix 		\
> +			": r_pio=%08x e_physbase=%08x r_virtbase=%08lx "\
> +			"value=%08x\n", 				\
> +			r_pio, (int)r_access->e_physbase, 		\
> +			(unsigned long)r_access->r_virtbase, value);	\
> +	}								\
> +	return (value);							\
> +}
> +
> +pt_ioport_read(b)
> +pt_ioport_read(w)
> +pt_ioport_read(l)
> +
> +static void pt_iomem_map(PCIDevice * d, int region_num,
> +			 uint32_t e_phys, uint32_t e_size, int type)
> +{
> +	pt_dev_t *r_dev = (pt_dev_t *) d;
>   

Please try to make this look more QEMU like by having 
PassthroughDeviceState or something like that.

> +	r_dev->v_addrs[region_num].e_physbase = e_phys;
> +
> +	DEBUG("e_phys=%08x r_virt=%p type=%d len=%08x region_num=%d \n",
> +	      e_phys, r_dev->v_addrs[region_num].r_virtbase, type, e_size,
> +	      region_num);
> +
> +	cpu_register_physical_memory(e_phys,
> +				     r_dev->dev.io_regions[region_num].size,
> +				     r_dev->v_addrs[region_num].memory_index);
> +}
> +
> +
> +static void pt_ioport_map(PCIDevice * pci_dev, int region_num,
> +			  uint32_t addr, uint32_t size, int type)
> +{
> +	pt_dev_t *r_dev = (pt_dev_t *) pci_dev;
> +	int i;
> +	uint32_t ((*rf[])(void *, uint32_t)) =  { pt_ioport_readb,
> +						  pt_ioport_readw,
> +						  pt_ioport_readl
> +						};
> +	void ((*wf[])(void *, uint32_t, uint32_t)) = { pt_ioport_writeb,
> +						       pt_ioport_writew,
> +						       pt_ioport_writel
> +						     };
>   

Please make this a static array like the memory operations or at least 
introduce a proper typedef.

> +	r_dev->v_addrs[region_num].e_physbase = addr;
> +	DEBUG("pt_ioport_map: address=0x%x type=0x%x len=%d"
> +	      "region_num=%d \n", addr, type, size, region_num);
> +
> +	for (i = 0; i < 3; i++) {
> +		register_ioport_write(addr, size, 1<<i, wf[i],
> +				      (void *) (r_dev->v_addrs + region_num));
> +		register_ioport_read(addr, size, 1<<i, rf[i],
> +				     (void *) (r_dev->v_addrs + region_num));
> +	}
> +}
> +
> +static void pt_pci_write_config(PCIDevice * d, uint32_t address, uint32_t val,
> +				int len)
> +{
> +	int fd;
> +
> +	DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n",
> +	      ((d->devfn >> 3) & 0x1F), (d->devfn & 0x7), (uint16_t) address,
> +	      val, len);
> +
> +	if (address == 0x4)
> +		pci_default_write_config(d, address, val, len);
> +
> +	if ((address >= 0x10 && address <= 0x24) || address == 0x34 ||
> +	    address == 0x3c || address == 0x3d) {
> +	  /* used for update-mappings (BAR emulation) */
> +		pci_default_write_config(d, address, val, len);
> +		return;
> +	}
> +
> +	DEBUG("NON BAR (%x.%x): address=%04x val=0x%08x len=%d\n",
> +	      ((d->devfn >> 3) & 0x1F), (d->devfn & 0x7), (uint16_t) address,
> +	      val, len);
> +	fd = ((pt_dev_t *)d)->real_device.config_fd;
> +	lseek(fd, address, SEEK_SET);
> +	write(fd, &val, len);
>   

Is it possible to mmap() the config space?  This would look better.  
Right now, you don't deal with EINTR which is a Bad Thing.

> +	/* vga specific, remove later */
> +	if (address == 0xFC)
> +		goto do_log;
>   

Can you explain this?

> +	fd = ((pt_dev_t *)d)->real_device.config_fd;
> +	lseek(fd, address, SEEK_SET);
> +	read(fd, &val, len);
> +
> +      do_log:
> +	DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n",
> +	      (d->devfn >> 3) & 0x1F, (d->devfn & 0x7), address, val, len);
> +
> +	/* kill the special capabilities */
> +	if (address == 4 && len == 4)
> +		val &= ~0x100000;
> +	else if (address == 6)
> +		val &= ~0x10;
> +
> +	return (val);
> +}
> +
> +
> +static int pt_register_regions(pci_region_t * io_regions,
> +			       unsigned long regions_num, pt_dev_t * pci_dev)
> +{
> +	uint32_t i;
> +	pci_region_t *cur_region = io_regions;
> +
> +	for (i = 0; i < regions_num; i++, cur_region++) {
> +		if (!cur_region->valid)
> +			continue;
> +#ifdef PT_DEBUG
> +		pci_dev->v_addrs[i].debug |= PT_DEBUG_MMIO | PT_DEBUG_PIO;
> +#endif
> +		pci_dev->v_addrs[i].num = i;
> +
> +		/* handle memory io regions */
> +		if (cur_region->type & IORESOURCE_MEM) {
> +			int t = cur_region->type & IORESOURCE_PREFETCH
> +			  ? PCI_ADDRESS_SPACE_MEM_PREFETCH 
> +			  : PCI_ADDRESS_SPACE_MEM;
> +
> +			/* map physical memory */
> +			pci_dev->v_addrs[i].e_physbase = cur_region->base_addr;
> +			pci_dev->v_addrs[i].r_virtbase =
> +			  mmap(NULL, (cur_region->size + 0xFFF) & 0xFFFFF000,
> +			       PROT_WRITE | PROT_READ, MAP_SHARED,
> +			       cur_region->resource_fd, (off_t) 0);
> +
> +			if ((void *) -1 == pci_dev->v_addrs[i].r_virtbase) {
> +				fprintf(logfile, "Error: Couldn't mmap 0x%x!\n",
> +					(uint32_t) (cur_region->base_addr));
> +				return (-1);
> +			}
> +
> +			/* add offset */
> +			pci_dev->v_addrs[i].r_virtbase +=
> +			  (cur_region->base_addr & 0xFFF);
> +
> +			pci_register_io_region((PCIDevice *) pci_dev, i,
> +					       cur_region->size, t,
> +					       pt_iomem_map);
> +
> +			pci_dev->v_addrs[i].memory_index =
> +			    cpu_register_io_memory(0, pt_mmio_read_cb,
> +						   pt_mmio_write_cb,
> +						   (void *) &(pci_dev->v_addrs[i]));
> +
> +			continue;
> +		}
> +		/* handle port io regions */
> +
> +		pci_register_io_region((PCIDevice *) pci_dev, i,
> +				       cur_region->size, PCI_ADDRESS_SPACE_IO,
> +				       pt_ioport_map);
> +
> +		pci_dev->v_addrs[i].e_physbase = cur_region->base_addr;
> +		pci_dev->v_addrs[i].r_virtbase = (void *)(long)cur_region->base_addr;
> +		pci_dev->v_addrs[i].memory_index = 0;	// not relevant for port io
>   

I see no reason why we couldn't allow PIO pass-through too.  Marcelo's 
already added a userspace interface for it.

> +	}
> +
> +	/* success */
> +	return (0);
> +
> +}
> +
> +static int
> +pt_get_real_device(pt_dev_t *pci_dev, uint8_t r_bus, uint8_t r_dev,
> +		   uint8_t r_func)
> +{
> +	char dir[128], name[128], comp[16];
> +	int fd, r = 0;
> +	FILE *f;
> +	unsigned long long start, end, size, flags;
> +	pci_region_t *rp;
> +	pci_dev_t *dev = &pci_dev->real_device;
> +
> +	dev->region_number = 0;
> +
> +	sprintf(dir, "/sys/bus/pci/devices/0000:%02x:%02x.%x/",
> +		r_bus, r_dev, r_func);
> +	strcpy(name, dir);
> +	strcat(name, "config");
>   

snprintf please, also remove all uses of strcpy or stract.

> +	if ((fd = open(name, O_RDWR)) == -1) {
> +		fprintf(logfile, "%s: %m\n", name);
> +		return 1;
> +	}
> +	dev->config_fd = fd;
> +	read(fd, pci_dev->dev.config, sizeof pci_dev->dev.config);
>   

Can't assume read() will always succeed.  Especially in QEMU.  It's very 
likely to fail with EINTR.

> +	strcpy(name, dir);
> +	strcat(name, "resource");
> +	if ((f = fopen(name, "r")) == NULL) {
> +		fprintf(logfile, "%s: %m\n", name);
> +		return 1;
> +	}
>   

Please log to stderr for this sort of error.

> +	for (r = 0; fscanf(f, "%lli %lli %lli\n", &start, &end, &flags) == 3;
> +	     r++) {
> +		rp = dev->regions + r;
> +		rp->valid = 0;
> +		size = end - start + 1;
> +		flags &= IORESOURCE_IO | IORESOURCE_MEM | IORESOURCE_PREFETCH;
> +		if (size == 0 || (flags & ~IORESOURCE_PREFETCH) == 0)
> +			continue;
> +		if (flags & IORESOURCE_MEM) {
> +			flags &= ~IORESOURCE_IO;
> +			sprintf(comp, "resource%d", r);
> +			strcpy(name, dir);
> +			strcat(name, comp);
> +			if ((fd = open(name, O_RDWR)) == -1)
> +				continue;		// probably ROM
> +			rp->resource_fd = fd;
> +		} else
> +			flags &= ~IORESOURCE_PREFETCH;
> +
> +		rp->type = flags;
> +		rp->valid = 1;
> +		rp->base_addr = start;
> +		rp->size = size;
> +		DEBUG("region %d size %d start 0x%x type %d "
> +		      "resource_fd %d\n", r, rp->size, start, rp->type,
> +		      rp->resource_fd);
> +	}
> +	fclose(f);
> +	
> +	dev->region_number = r;
> +	return 0;
> +}
> +
> +static int pt_bind_mirq(int bus, int dev, int fn);
> +
> +static pt_dev_t *register_real_device(PCIBus * e_bus, const char *e_dev_name,
> +				      int e_devfn, uint8_t r_bus, uint8_t r_dev,
> +				      uint8_t r_func, uint32_t machine_irq)
> +{
> +	int rc;
> +	pt_dev_t *pci_dev;
> +	uint8_t e_device, e_intx;
> +
> +	DEBUG("register_real_device: Registering real physical "
> +	      "device %s (devfn=0x%x)\n", e_dev_name, e_devfn);
> +
> +	pci_dev = (pt_dev_t *) pci_register_device(e_bus, e_dev_name,
> +						   sizeof(pt_dev_t), e_devfn,
> +						   pt_pci_read_config,
> +						   pt_pci_write_config);
> +
> +	if (NULL == pci_dev) {
> +		fprintf(logfile, "register_real_device: Error: Couldn't "
> +			"register real device %s\n", e_dev_name);
> +		return (NULL);
> +	}
> +	if (pt_get_real_device(pci_dev, r_bus, r_dev, r_func)) {
> +		fprintf(logfile, "register_real_device: Error: Couldn't get "
> +			"real device (%s)!\n", e_dev_name);
> +		return NULL;
> +	}
> +
> +	/* handle real device's MMIO/PIO BARs */
> +	if (pt_register_regions(pci_dev->real_device.regions,
> +				pci_dev->real_device.region_number, pci_dev))
> +		return (NULL);
> +
> +	/* handle interrupt routing */
> +	e_device = (pci_dev->dev.devfn >> 3) & 0x1f;
> +	e_intx = pci_dev->dev.config[0x3d] - 1;
> +	pci_dev->intpin = e_intx;
> +	pci_dev->run = 0;
> +	pci_dev->mirq = machine_irq;
> +	pci_dev->girq = 0;
> +
> +	/* bind machine_irq to device */
> +	if (machine_irq && ((kvm_enabled() && !qemu_kvm_irqchip_in_kernel())
> +			    || !kvm_enabled())) {
> +		DEBUG(logfile, "Binding mirq %u to device=0x%x intpin=0x%x\n",
> +			machine_irq, e_device, pci_dev->intpin);
> +		rc = pt_bind_mirq(r_bus, r_dev, r_func);
> +		if (rc) {
> +			fprintf(logfile, "pt_bind %d failed rc=%d\n",
> +				pci_dev->mirq, rc);
> +			return NULL;
> +		}
> +		sprintf(pci_dev->sirq, "%d", pci_dev->mirq);
> +	}
> +
> +#ifdef KVM_CAP_PCI_PASSTHROUGH
> +	if (kvm_enabled()) {
> +		struct kvm_pci_passthrough_dev pci_pt_dev;
> +
> +		pci_pt_dev.guest.busnr  = pci_bus_num(e_bus);
> +		pci_pt_dev.guest.devfn  = PCI_DEVFN(e_device, r_func);
> +		pci_pt_dev.host.busnr   = r_bus;
> +		pci_pt_dev.host.devfn   = PCI_DEVFN(r_dev, r_func);
> +
> +		if (qemu_kvm_irqchip_in_kernel()) {
> +			/* We'll set the value of the guest irq as and when
> +			 * the piix config gets updated. See pci_pt_update_irq
> +			 */
> +			pci_pt_dev.guest.irq = 0;
> +			pci_pt_dev.host.irq = machine_irq;
> +		}
> +		rc = kvm_assign_pci_pt_device(kvm_context, &pci_pt_dev);
> +		if (rc < 0) {
> +			fprintf(stderr, "Could not notify kernel about "
> +				"passthrough device\n");
> +			perror("pt-ioctl");
> +			return NULL;
> +		}
> +	}
> +#endif
>   

This whole file is useless without KVM_CAP_PCI_PASSTHROUGH so you should 
make it only get compiled into QEMU if that capability is present.  Take 
a look at the in-kernel PIT for an example of this in QEMU.

> +	fprintf(logfile, "Registered host PCI device %02x:%02x.%1x-%u "
> +		"as guest device %02x:%02x.%1x\n",
> +		r_bus, r_dev, r_func, machine_irq,
> +		pci_bus_num(e_bus), e_device, r_func);
> +
> +	return (pci_dev);
> +}
> +
> +#define	MAX_PTDEVS 4
> +struct {
> +	char name[128];
> +	int bus;
> +	int dev;
> +	int func;
> +	int irq;
> +	pt_dev_t *ptdev;
> +} ptdevs[MAX_PTDEVS];
> +
> +int nptdevs;
> +extern int piix_get_irq(int);
> +int pt_init(PCIBus * bus)
> +{
> +	pt_dev_t *dev = NULL;
> +	int i, ret = 0;
> +
> +	iopl(3);
>   

An advantage of using PIO passthrough is we could avoid doing iopl(3).  
This gives us more privileges than we actually need.  In fact, doing PIO 
means we can't even write to the PIO ports in userspace which seems like 
a nice thing to me from a security perspective.

> diff --git a/qemu/hw/pci-passthrough.h b/qemu/hw/pci-passthrough.h
> new file mode 100644
> index 0000000..564b989
> --- /dev/null
> +++ b/qemu/hw/pci-passthrough.h
> @@ -0,0 +1,119 @@
> +/*
> + * Copyright (c) 2007, Neocleus Corporation.
> + * Copyright (c) 2007, Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + *
> + * You should have received a copy of the GNU General Public License along with
> + * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
> + * Place - Suite 330, Boston, MA 02111-1307 USA.
> + *
> + *  Data structures for storing PCI state
> + *
> + *  Adapted to kvm by Qumranet
> + *
> + *  Copyright (c) 2007, Neocleus, Alex Novik (alex@neocleus.com)
> + *  Copyright (c) 2007, Neocleus, Guy Zana (guy@neocleus.com)
> + *  Copyright (C) 2008, Qumranet, Amit Shah (amit.shah@qumranet.com)
> + */
> +
> +#ifndef __PCI_PASSTHROUGH_H__
> +#define __PCI_PASSTHROUGH_H__
> +
> +#include <sys/mman.h>
> +#include "qemu-common.h"
> +#include "pci.h"
> +#include <linux/types.h>
> +
> +typedef __u8 u8;
> +typedef __u16 u16;
> +typedef __u32 u32;
> +typedef __u64 u64;
>   

This should go away.

> +#define PT_DEBUG_PIO	(0x01)
> +#define PT_DEBUG_MMIO	(0x02)
> +
> +/* From include/linux/pci.h in the kernel sources */
> +#define PCI_DEVFN(slot,func)	((((slot) & 0x1f) << 3) | ((func) & 0x07))
> +
> +typedef u32 pciaddr_t;
> +
> +#define MAX_IO_REGIONS			(6)
> +
> +typedef struct pci_region_s {
> +	int type;	/* Memory or port I/O */
> +	int valid;
> +	pciaddr_t base_addr;
> +	pciaddr_t size;		/* size of the region */
> +	int resource_fd;
> +} pci_region_t;
> +
> +typedef struct pci_dev_s {
> +	u8 bus, dev, func;	/* Bus inside domain, device and function */
> +	int irq;		/* IRQ number */
> +	u16 region_number;	/* number of active regions */
> +
> +	/* Port I/O or MMIO Regions */
> +	pci_region_t regions[MAX_IO_REGIONS];
> +	int config_fd;
> +} pci_dev_t;
> +
> +typedef struct pt_region_s {
> +	target_phys_addr_t e_physbase;
> +	uint32_t memory_index;
> +	void *r_virtbase;	/* mmapped access address */
> +	int num;		/* our index within v_addrs[] */
> +	uint32_t debug;
> +} pt_region_t;
> +
> +typedef struct pt_dev_s {
> +	PCIDevice dev;
> +	int intpin;
> +	uint8_t debug_flags;
> +	pt_region_t v_addrs[PCI_NUM_REGIONS];
> +	pci_dev_t real_device;
> +	int run;
> +	int mirq;
> +	int girq;
> +	char sirq[4];
> +	int bound;
> +} pt_dev_t;
> +
> +/* MMIO access functions */
> +uint32_t pt_mmio_readb(void *opaque, target_phys_addr_t e_phys);
> +uint32_t pt_mmio_readw(void *opaque, target_phys_addr_t e_phys);
> +uint32_t pt_mmio_readl(void *opaque, target_phys_addr_t e_phys);
> +void pt_mmio_writeb(void *opaque, target_phys_addr_t e_phys, uint32_t value);
> +void pt_mmio_writew(void *opaque, target_phys_addr_t e_phys, uint32_t value);
> +void pt_mmio_writel(void *opaque, target_phys_addr_t e_phys, uint32_t value);
> +
> +/* PIO access functions */
> +uint32_t pt_ioport_readb(void *opaque, uint32_t addr);
> +uint32_t pt_ioport_readw(void *opaque, uint32_t addr);
> +uint32_t pt_ioport_readl(void *opaque, uint32_t addr);
> +void pt_ioport_writeb(void *opaque, uint32_t addr, uint32_t value);
> +void pt_ioport_writew(void *opaque, uint32_t addr, uint32_t value);
> +void pt_ioport_writel(void *opaque, uint32_t addr, uint32_t value);
> +
> +/* Registration functions */
> +int register_pt_pio_region(uint32_t pio_start, uint32_t length,
> +			   uint8_t do_logging);
> +int register_pt_mmio_region(uint32_t mmio_addr, uint32_t length,
> +			    uint8_t do_logging);
> +
> +/* Initialization functions */
> +int pt_init(PCIBus * bus);
> +void add_pci_passthrough_device(const char *arg);
> +void pt_set_vector(int irq, int vector);
> +void pt_ack_mirq(int vector);
>   

Is there a reason these aren't static?


> +#define logfile stderr
>   

Oh this is a very bad thing :-)

> +#endif				/* __PCI_PASSTHROUGH_H__ */
> diff --git a/qemu/hw/pci.c b/qemu/hw/pci.c
> index a23a466..1a2b1fa 100644
> --- a/qemu/hw/pci.c
> +++ b/qemu/hw/pci.c
> @@ -26,6 +26,7 @@
>  #include "console.h"
>  #include "net.h"
>  #include "pc.h"
> +#include "qemu-kvm.h"
>  
>  //#define DEBUG_PCI
>  
> @@ -49,6 +50,7 @@ struct PCIBus {
>  
>  static void pci_update_mappings(PCIDevice *d);
>  static void pci_set_irq(void *opaque, int irq_num, int level);
> +void pci_pt_update_irq(PCIDevice *d);
>  
>  target_phys_addr_t pci_mem_base;
>  static int pci_irq_index;
> @@ -449,6 +451,12 @@ void pci_default_write_config(PCIDevice *d,
>          val >>= 8;
>      }
>  
> +#ifdef KVM_CAP_PCI_PASSTHROUGH
> +    if (kvm_enabled() && qemu_kvm_irqchip_in_kernel() &&
> +	address >= 0x60 && address <= 0x63)
> +	pci_pt_update_irq(d);
> +#endif
> +
>      end = address + len;
>      if (end > PCI_COMMAND && address < (PCI_COMMAND + 2)) {
>          /* if the command register is modified, we must modify the mappings */
> @@ -551,6 +559,11 @@ static void pci_set_irq(void *opaque, int irq_num, int level)
>      bus->set_irq(bus->irq_opaque, irq_num, bus->irq_count[irq_num] != 0);
>  }
>  
> +int pci_map_irq(PCIDevice *pci_dev, int pin)
> +{
> +	return pci_dev->bus->map_irq(pci_dev, pin);
> +}
> +
>  /***********************************************************/
>  /* monitor info on PCI */
>  
> diff --git a/qemu/hw/pci.h b/qemu/hw/pci.h
> index 60e4094..e11fbbf 100644
> --- a/qemu/hw/pci.h
> +++ b/qemu/hw/pci.h
> @@ -81,6 +81,7 @@ void pci_register_io_region(PCIDevice *pci_dev, int region_num,
>                              uint32_t size, int type,
>                              PCIMapIORegionFunc *map_func);
>  
> +int pci_map_irq(PCIDevice *pci_dev, int pin);
>  uint32_t pci_default_read_config(PCIDevice *d,
>                                   uint32_t address, int len);
>  void pci_default_write_config(PCIDevice *d,
> diff --git a/qemu/hw/piix_pci.c b/qemu/hw/piix_pci.c
> index 90cb3a6..112381a 100644
> --- a/qemu/hw/piix_pci.c
> +++ b/qemu/hw/piix_pci.c
> @@ -237,6 +237,25 @@ static void piix3_set_irq(qemu_irq *pic, int irq_num, int level)
>      }
>  }
>  
> +int piix3_get_pin(int pic_irq)
> +{
> +    int i;
> +    for (i = 0; i < 4; i++)
> +        if (piix3_dev->config[0x60+i] == pic_irq)
> +            return i;
> +    return -1;
> +}
> +
> +int piix_get_irq(int pin)
> +{
> +    if (piix3_dev)
> +        return piix3_dev->config[0x60+pin];
> +    if (piix4_dev)
> +        return piix4_dev->config[0x60+pin];
> +
> +    return 0;
> +}
> +
>  static void piix3_reset(PCIDevice *d)
>  {
>      uint8_t *pci_conf = d->config;
> diff --git a/qemu/vl.c b/qemu/vl.c
> index 7e4dce1..136119e 100644
> --- a/qemu/vl.c
> +++ b/qemu/vl.c
> @@ -37,6 +37,7 @@
>  #include "qemu-char.h"
>  #include "block.h"
>  #include "audio/audio.h"
> +#include "hw/pci-passthrough.h"
>  #include "migration.h"
>  #include "qemu-kvm.h"
>  
> @@ -7790,6 +7791,10 @@ static void help(int exitcode)
>  #endif
>  	   "-no-kvm-irqchip disable KVM kernel mode PIC/IOAPIC/LAPIC\n"
>  	   "-no-kvm-pit	    disable KVM kernel mode PIT\n"
> +#if defined(TARGET_I386) || defined(TARGET_X86_64)
> +	   "-pcidevice name/bus:dev.func-intr \n"
> +	   "                expose a PCI device to the guest OS\n"
> +#endif
>  #endif
>  #ifdef TARGET_I386
>             "-std-vga        simulate a standard VGA card with VESA Bochs Extensions\n"
> @@ -7914,6 +7919,9 @@ enum {
>      QEMU_OPTION_no_kvm,
>      QEMU_OPTION_no_kvm_irqchip,
>      QEMU_OPTION_no_kvm_pit,
> +#if defined(TARGET_I386) || defined(TARGET_X86_64)
> +    QEMU_OPTION_pcidevice,
> +#endif
>      QEMU_OPTION_no_reboot,
>      QEMU_OPTION_no_shutdown,
>      QEMU_OPTION_show_cursor,
> @@ -8002,6 +8010,9 @@ const QEMUOption qemu_options[] = {
>  #endif
>      { "no-kvm-irqchip", 0, QEMU_OPTION_no_kvm_irqchip },
>      { "no-kvm-pit", 0, QEMU_OPTION_no_kvm_pit },
> +#if defined(TARGET_I386) || defined(TARGET_X86_64)
> +    { "pcidevice", HAS_ARG, QEMU_OPTION_pcidevice },
> +#endif
>  #endif
>  #if defined(TARGET_PPC) || defined(TARGET_SPARC)
>      { "g", 1, QEMU_OPTION_g },
> @@ -8904,6 +8915,11 @@ int main(int argc, char **argv)
>  		kvm_pit = 0;
>  		break;
>  	    }
> +#if defined(TARGET_I386) || defined(TARGET_X86_64)
> +	    case QEMU_OPTION_pcidevice:
> +		add_pci_passthrough_device(optarg);
> +		break;
> +#endif		
>  #endif
>              case QEMU_OPTION_usb:
>                  usb_enabled = 1;
> diff --git a/tools/pci_barsize.c b/tools/pci_barsize.c
>   

While these may be useful for your debugging, I don't know if we want to 
include them in KVM by default.

Regards,

Anthony Liguori

> new file mode 100644
> index 0000000..dd230c9
> --- /dev/null
> +++ b/tools/pci_barsize.c
> @@ -0,0 +1,53 @@
> +#include <stdio.h>
> +#include <sys/types.h>
> +#include <unistd.h>
> +#include <sys/stat.h>
> +#include <fcntl.h>
> +#include <stdlib.h>
> +
> +int
> +panic(char *msg)
> +{
> +	perror(msg);
> +	exit(1);
> +}
> +
> +int
> +main(int argc, char **argv)
> +{
> +	unsigned l, b, sz;
> +	int fd, ismem, bar = 0, offs;
> +
> +	if (argc < 2)
> +		panic("usage: pci_barsize <file> [bar no]");
> +	
> +	if ((fd = open(argv[1], O_RDWR)) < 0)
> +		panic("open");
> +
> +	if (argc > 2)
> +		bar = strtoul(argv[2], 0, 0);
> +	if (bar < 0 || bar > 5)
> +		panic("bar range 0-5");
> +
> +	offs = 0x10 + bar * 4;
> +	lseek(fd, offs, 0);
> +	read(fd, &l, sizeof(l));
> +	printf("bar %d (offs 0x%x) == %x\n", bar, offs, l);
> +
> +	ismem = !(l & 0x01);
> +	
> +	b = ~0;
> +	lseek(fd, offs, 0);
> +	write(fd, &b, sizeof(b));
> +
> +	lseek(fd, offs, 0);
> +	read(fd, &b, sizeof(b));
> +	sz = ~(b & (ismem ? ~0x15 : ~0x1)) + 1;
> +	printf("bar %d %s size 0x%x == %ldKB (%x)\n",
> +		bar, ismem ? "memory" : "IO", sz, sz / 1024, b);
> +
> +	lseek(fd, offs, 0);
> +	write(fd, &l, sizeof(l));
> +
> +	return 0;
> +}
> diff --git a/tools/pci_mmio.c b/tools/pci_mmio.c
> new file mode 100644
> index 0000000..6e91571
> --- /dev/null
> +++ b/tools/pci_mmio.c
> @@ -0,0 +1,82 @@
> +#include <stdio.h>
> +#include <sys/types.h>
> +#include <unistd.h>
> +#include <sys/stat.h>
> +#include <fcntl.h>
> +#include <stdlib.h>
> +#include <sys/mman.h>
> +
> +int
> +panic(char *msg)
> +{
> +	perror(msg);
> +	exit(1);
> +}
> +
> +int
> +main(int argc, char **argv)
> +{
> +	unsigned sz;
> +	int fd, cnt, rsz, offs = 0;
> +	void *map;
> +	struct stat st;
> +
> +	if (argc < 2)
> +		panic("usage: pci_mmio <resouce-file> [offset [count]]");
> +	
> +	if ((fd = open(argv[1], O_RDWR)) < 0)
> +		panic("open");
> +
> +	if (fstat(fd, &st) < 0)
> +		panic("fstat");
> +	cnt = sz = st.st_size;
> +
> +	if (argc > 2)
> +		offs = strtoul(argv[2], 0, 0);
> +	if (argc > 3)
> +		cnt = strtoul(argv[3], 0, 0);
> +
> +	if (cnt < 0 || cnt > sz)
> +		panic("bad count");
> +	if (offs < 0 || offs > sz)
> +		panic("bad offset");
> +	if (offs + cnt > sz) {
> +		cnt = sz - offs;
> +		fprintf(stderr, "count truncated to %d", cnt);
> +	}
> +	if (cnt > 4 && offs % 4)
> +		panic("read bigger than 4 must be 4 bytes aligned");
> +	if (cnt == 2 && offs % 2)
> +		panic("2 bytes read must be 2 bytes aligned");
> +	if (cnt != 1 && cnt != 2 && cnt != 4 && cnt % 4)
> +		panic("counts must be 1, 2, 4 or 4*n");
> +
> +	fprintf(stderr, "reading %s [%d:%d]\n", argv[1], offs, offs + cnt);
> +	map = mmap(NULL, sz, PROT_WRITE | PROT_READ, MAP_SHARED, fd, 0);
> +
> +	if (!map)
> +		panic("mmap");
> +
> +	rsz = cnt > 4 ? 4 : cnt;
> +	fprintf(stderr, "rsz: %d cnt %d\n", rsz, cnt);
> +	while (cnt > 0) {
> +		char buf[8];
> +		switch (rsz) {
> +		case 1:
> +			*(char *)buf = *(char *)map + offs;
> +			break;
> +		case 2:
> +			*(short *)buf = *(short *)map + offs/sizeof(short);
> +			break;
> +		case 4:
> +			*(int *)buf = *(int *)map + offs/4;
> +			break;
> +		}
> +		write(1, buf, rsz);
> +
> +		offs += rsz;
> +		cnt -= rsz;
> +	}
> +	fprintf(stderr, "done\n");
> +	return 0;
> +}
>   


  parent reply	other threads:[~2008-06-02 13:57 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-06-02  6:46 [PATCH 1/1] KVM/userspace: Support for assigning PCI devices to guest Amit Shah
2008-06-02  7:18 ` Han, Weidong
2008-06-02  8:11   ` Amit Shah
2008-06-04 14:19     ` Avi Kivity
2008-06-04 14:43       ` Amit Shah
2008-06-04 14:53         ` Avi Kivity
2008-06-04 15:53           ` Amit Shah
2008-06-04 16:11           ` Muli Ben-Yehuda
2008-06-02 13:57 ` Anthony Liguori [this message]
2008-06-02 17:07   ` Amit Shah
2008-06-02 13:58 ` Anthony Liguori
  -- strict thread matches above, loose matches on Subject: below --
2008-07-22 12:13 [PATCH 4/4] KVM: Device assignemnt with VT-d Ben-Ami Yassour
2008-07-22 12:18 ` device assignment - userspace part Ben-Ami Yassour
2008-07-22 12:18   ` [PATCH 1/1] KVM/userspace: Support for assigning PCI devices to guest Ben-Ami Yassour
2008-07-28 16:26 [PATCH 5/5] This patch extends the VT-d driver to support KVM Ben-Ami Yassour
2008-07-28 16:32 ` Device assignment - userspace part Ben-Ami Yassour
2008-07-28 16:32   ` [PATCH 1/1] KVM/userspace: Support for assigning PCI devices to guest Ben-Ami Yassour
2008-08-01  3:09     ` Han, Weidong
2008-08-05  9:41       ` Ben-Ami Yassour

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4843FC4B.3060706@codemonkey.ws \
    --to=anthony@codemonkey.ws \
    --cc=allen.m.kay@intel.com \
    --cc=amit.shah@qumranet.com \
    --cc=benami@il.ibm.com \
    --cc=chrisw@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=muli@il.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox