* [Qemu-devel] [PATCH v3 0/2] Inter-VM shared memory PCI device @ 2010-03-25 6:08 Cam Macdonell 2010-03-25 6:08 ` [Qemu-devel] [PATCH v3 1/2] Support adding a file to qemu's ram allocation Cam Macdonell 2010-03-25 9:04 ` [Qemu-devel] Re: [PATCH v3 0/2] " Avi Kivity 0 siblings, 2 replies; 18+ messages in thread From: Cam Macdonell @ 2010-03-25 6:08 UTC (permalink / raw) To: kvm; +Cc: Cam Macdonell, qemu-devel Support an inter-vm shared memory device that maps a shared-memory object as a PCI device in the guest. This patch also supports interrupts between guests by communicating over a unix domain socket. This patch applies to the qemu-kvm repository. Changes in this version include using the qdev format and optional use of MSI and ioeventfd/irqfd. The non-interrupt version is supported by passing the shm parameter -device ivshmem,size=<size in MB>[,shm=<shm_name>] which will simply map the shm object into a BAR. Interrupts are supported between multiple VMs by using a shared memory server that is connected to via a socket character device -device ivshmem,size=<size in MB>[,chardev=<chardev name>][,irqfd=on] [,msi=on][,nvectors=n] -chardev socket,path=<path>,id=<chardev name> The server passes file descriptors for the shared memory object and eventfds (our interrupt mechanism) to the respective qemu instances. When using interrupts, VMs communicate with a shared memory server that passes the shared memory object file descriptor using SCM_RIGHTS. The server assigns each VM an ID number and sends this ID number to the Qemu process along with a series of eventfd file descriptors, one per guest using the shared memory server. These eventfds will be used to send interrupts between guests. Each guest listens on the eventfd corresponding to its ID and may use the others for sending interrupts to other guests. enum ivshmem_registers { IntrMask = 0, IntrStatus = 4, IVPosition = 8, Doorbell = 12 }; The first two registers are the interrupt mask and status registers. Mask and status are only used with pin-based interrupts. They are unused with MSI interrupts. The IVPosition register is read-only and reports the guest's ID number. Interrupts are triggered when a message is received on the guest's eventfd from another VM. To trigger an event, a guest must write to another guest's Doorbell. The "Doorbells" begin at offset 12. A particular guest's doorbell offset in the MMIO region is equal to guest_id * 32 + Doorbell The doorbell register for each guest is 32 bits. The doorbell-per-guest design was motivated by use with ioeventfd. The semantics of the value written to the doorbell depend on whether the device is using MSI or a regular pin-based interrupt. Regular Interrupts ------------------ If regular interrupts are used (due to either a guest not supporting MSI or the user specifying not to use them on the command-line) then the value written to a guest's doorbell is what the guest's status register will be set to. A status of (2^32 - 1) indicates that a new guest has joined. Guests should not send a message of this value for any other reason. Message Signalled Interrupts ---------------------------- The important thing to remember with MSI is that it is only a signal; no status is set (since MSI interrupts are not shared). All information other than the interrupt itself should be communicated via the shared memory region. MSI is on by default. It can be turned off by passing msi=off to the device parameter. If the device uses MSI then the value written to the doorbell is the MSI vector that will be raised. 
Vector 0 is used to notify that a new guest has joined. Vector 0 cannot be triggered by another guest since a value of 0 does not trigger an eventfd. ioeventfd/irqfd --------------- ioeventfd/irqfd is turned on by irqfd=on passed to the device parameter (it is off by default). When using ioeventfd/irqfd the only interrupt value that can be passed to another guest is 1, regardless of what value is written to a guest's Doorbell. Sample programs, init scripts and the shared memory server are available in a git repo here: www.gitorious.org/nahanni Cam Macdonell (2): Support adding a file to qemu's ram allocation Inter-VM shared memory PCI device Makefile.target | 3 + cpu-common.h | 1 + exec.c | 33 +++ hw/ivshmem.c | 622 +++++++++++++++++++++++++++++++++++++++++++++++++++++++ qemu-char.c | 6 + qemu-char.h | 3 + 6 files changed, 668 insertions(+), 0 deletions(-) create mode 100644 hw/ivshmem.c ^ permalink raw reply [flat|nested] 18+ messages in thread
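A minimal guest-side sketch of the register interface the cover letter describes (illustrative only, not code from the patches): it assumes the register BAR is already mmap()ed into a guest process as regs, e.g. through a UIO driver, and uses the 4-byte doorbell stride that the MMIO write handler in patch 2/2 actually implements (dest = (addr - Doorbell)/sizeof(int)); the guest_id * 32 formula above appears to count bits rather than bytes.

#include <stdint.h>

enum ivshmem_registers { IntrMask = 0, IntrStatus = 4, IVPosition = 8, Doorbell = 12 };

/* Ring guest dest_id's doorbell; regs points at the mmap()ed register BAR.
 * With MSI, val is the vector to raise in the destination guest; with
 * pin-based interrupts, val becomes the destination's status register. */
static void ivshmem_ring(volatile uint8_t *regs, int dest_id, uint32_t val)
{
    volatile uint32_t *db =
        (volatile uint32_t *)(regs + Doorbell + dest_id * sizeof(uint32_t));
    *db = val;
}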
* [Qemu-devel] [PATCH v3 1/2] Support adding a file to qemu's ram allocation 2010-03-25 6:08 [Qemu-devel] [PATCH v3 0/2] Inter-VM shared memory PCI device Cam Macdonell @ 2010-03-25 6:08 ` Cam Macdonell 2010-03-25 6:08 ` [Qemu-devel] [PATCH v3 2/2] Inter-VM shared memory PCI device Cam Macdonell 2010-03-25 9:04 ` [Qemu-devel] Re: [PATCH v3 0/2] " Avi Kivity 1 sibling, 1 reply; 18+ messages in thread From: Cam Macdonell @ 2010-03-25 6:08 UTC (permalink / raw) To: kvm; +Cc: Cam Macdonell, qemu-devel This avoids the need to use qemu_ram_alloc and mmap with MAP_FIXED to map a host file into guest RAM. This function mmaps the opened file anywhere and adds the memory to the ram blocks. Usage is qemu_ram_mmap(fd, size, MAP_SHARED, offset); --- cpu-common.h | 1 + exec.c | 33 +++++++++++++++++++++++++++++++++ 2 files changed, 34 insertions(+), 0 deletions(-) diff --git a/cpu-common.h b/cpu-common.h index 6cae15b..dffe4e7 100644 --- a/cpu-common.h +++ b/cpu-common.h @@ -32,6 +32,7 @@ static inline void cpu_register_physical_memory(target_phys_addr_t start_addr, } ram_addr_t cpu_get_physical_page_desc(target_phys_addr_t addr); +ram_addr_t qemu_ram_mmap(int, ram_addr_t, int, int); ram_addr_t qemu_ram_alloc(ram_addr_t); void qemu_ram_free(ram_addr_t addr); /* This should only be used for ram local to a device. */ diff --git a/exec.c b/exec.c index 3b4426e..6f4e747 100644 --- a/exec.c +++ b/exec.c @@ -2727,6 +2727,39 @@ static void *file_ram_alloc(ram_addr_t memory, const char *path) } #endif +ram_addr_t qemu_ram_mmap(int fd, ram_addr_t size, int flags, int offset) +{ + RAMBlock *new_block; + + size = TARGET_PAGE_ALIGN(size); + new_block = qemu_malloc(sizeof(*new_block)); + + /* map the file passed as a parameter to be this part of memory */ + new_block->host = mmap(0, size, PROT_READ|PROT_WRITE, flags, fd, offset); + +#ifdef MADV_MERGEABLE + madvise(new_block->host, size, MADV_MERGEABLE); +#endif + + new_block->offset = last_ram_offset; + new_block->length = size; + + new_block->next = ram_blocks; + ram_blocks = new_block; + + phys_ram_dirty = qemu_realloc(phys_ram_dirty, + (last_ram_offset + size) >> TARGET_PAGE_BITS); + memset(phys_ram_dirty + (last_ram_offset >> TARGET_PAGE_BITS), + 0xff, size >> TARGET_PAGE_BITS); + + last_ram_offset += size; + + if (kvm_enabled()) + kvm_setup_guest_memory(new_block->host, size); + + return new_block->offset; +} + ram_addr_t qemu_ram_alloc(ram_addr_t size) { RAMBlock *new_block; -- 1.6.6.1 ^ permalink raw reply related [flat|nested] 18+ messages in thread
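A sketch of how a device model is expected to call the new helper, mirroring what patch 2/2 does in create_shared_memory_BAR(); the shm object name is illustrative and error handling is omitted:

int fd = shm_open("/ivshmem", O_CREAT|O_RDWR, S_IRWXU);
ftruncate(fd, size);                                        /* size the backing object */
ram_addr_t offset = qemu_ram_mmap(fd, size, MAP_SHARED, 0); /* register a ram block */
void *ptr = qemu_get_ram_ptr(offset);                       /* host pointer into it */

The returned ram_addr_t can then be handed to cpu_register_physical_memory() when the PCI BAR is mapped, just as callers of qemu_ram_alloc() do.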
* [Qemu-devel] [PATCH v3 2/2] Inter-VM shared memory PCI device 2010-03-25 6:08 ` [Qemu-devel] [PATCH v3 1/2] Support adding a file to qemu's ram allocation Cam Macdonell @ 2010-03-25 6:08 ` Cam Macdonell 0 siblings, 0 replies; 18+ messages in thread From: Cam Macdonell @ 2010-03-25 6:08 UTC (permalink / raw) To: kvm; +Cc: Cam Macdonell, qemu-devel Support an inter-vm shared memory device that maps a shared-memory object as a PCI device in the guest. This patch also supports interrupts between guests by communicating over a unix domain socket. This patch applies to the qemu-kvm repository. -device ivshmem,size=<size in MB>[,shm=<shm name>] Interrupts are supported between multiple VMs by using a shared memory server that is connected to via a chardev socket. -device ivshmem,size=<size in MB>[,shm=<shm name>][,chardev=<id>][,msi=on] [,irqfd=on][,nvectors=n] -chardev socket,path=<path>,id=<id> Sample programs, init scripts and the shared memory server are available in a git repo here: www.gitorious.org/nahanni --- Makefile.target | 3 + hw/ivshmem.c | 622 +++++++++++++++++++++++++++++++++++++++++++++++++++++++ qemu-char.c | 6 + qemu-char.h | 3 + 4 files changed, 634 insertions(+), 0 deletions(-) create mode 100644 hw/ivshmem.c diff --git a/Makefile.target b/Makefile.target index 4d88543..15edf19 100644 --- a/Makefile.target +++ b/Makefile.target @@ -219,6 +219,9 @@ obj-y += pcnet.o obj-y += rtl8139.o obj-y += e1000.o +# Inter-VM PCI shared memory +obj-y += ivshmem.o + # Hardware support obj-i386-y = ide/core.o ide/qdev.o ide/isa.o ide/pci.o ide/piix.o obj-i386-y += pckbd.o $(sound-obj-y) dma.o diff --git a/hw/ivshmem.c b/hw/ivshmem.c new file mode 100644 index 0000000..c76aae3 --- /dev/null +++ b/hw/ivshmem.c @@ -0,0 +1,622 @@ +/* + * Inter-VM Shared Memory PCI device. + * + * Author: + * Cam Macdonell <cam@cs.ualberta.ca> + * + * Based On: cirrus_vga.c and rtl8139.c + * + * This code is licensed under the GNU GPL v2. + */ +#include <sys/mman.h> +#include <sys/types.h> +#include <sys/socket.h> +#include <sys/io.h> +#include <sys/ioctl.h> +#include <sys/eventfd.h> +#include "hw.h" +#include "console.h" +#include "pc.h" +#include "pci.h" +#include "sysemu.h" + +#include "msix.h" +#include "qemu-kvm.h" +#include "libkvm.h" + +#include <sys/eventfd.h> +#include <sys/mman.h> +#include <sys/socket.h> +#include <sys/ioctl.h> + +#define PCI_COMMAND_IOACCESS 0x0001 +#define PCI_COMMAND_MEMACCESS 0x0002 + +#define DEBUG_IVSHMEM + +#define IVSHMEM_IRQFD 0 +#define IVSHMEM_MSI 1 + +#ifdef DEBUG_IVSHMEM +#define IVSHMEM_DPRINTF(fmt, args...) \ + do {printf("IVSHMEM: " fmt, ##args); } while (0) +#else +#define IVSHMEM_DPRINTF(fmt, args...) 
+#endif + +#define NEW_GUEST_VAL UINT_MAX + +typedef struct IVShmemState { + PCIDevice dev; + uint32_t intrmask; + uint32_t intrstatus; + uint32_t doorbell; + + CharDriverState * chr; + CharDriverState * eventfd_chr; + int ivshmem_mmio_io_addr; + + pcibus_t mmio_addr; + uint8_t *ivshmem_ptr; + unsigned long ivshmem_offset; + unsigned int ivshmem_size; + int shm_fd; /* shared memory file descriptor */ + + struct kvm_ioeventfd ioeventfds[16]; + int eventfds[16]; /* for now we have a limit of 16 inter-connected guests */ + int eventfd_posn; + int num_eventfds; + uint32_t vectors; + uint32_t features; + + char * shmobj; + uint32_t size; /*size of shared memory in MB*/ +} IVShmemState; + +/* registers for the Inter-VM shared memory device */ +enum ivshmem_registers { + IntrMask = 0, + IntrStatus = 4, + IVPosition = 8, + Doorbell = 12, +}; + +static inline uint32_t ivshmem_has_feature(IVShmemState *ivs, int feature) { + return (ivs->features & (1 << feature)); +} + +static inline int is_power_of_two(int x) {return (x & (x-1)) == 0;} + +static void ivshmem_map(PCIDevice *pci_dev, int region_num, + pcibus_t addr, pcibus_t size, int type) +{ + IVShmemState *s = DO_UPCAST(IVShmemState, dev, pci_dev); + + IVSHMEM_DPRINTF("addr = %u size = %u\n", (uint32_t)addr, (uint32_t)size); + cpu_register_physical_memory(addr, s->ivshmem_size, s->ivshmem_offset); + +} + +/* accessing registers - based on rtl8139 */ +static void ivshmem_update_irq(IVShmemState *s, int val) +{ + int isr; + isr = (s->intrstatus & s->intrmask) & 0xffffffff; + + /* don't print ISR resets */ + if (isr) { + IVSHMEM_DPRINTF("Set IRQ to %d (%04x %04x)\n", + isr ? 1 : 0, s->intrstatus, s->intrmask); + } + + qemu_set_irq(s->dev.irq[0], (isr != 0)); +} + +static void ivshmem_IntrMask_write(IVShmemState *s, uint32_t val) +{ + IVSHMEM_DPRINTF("IntrMask write(w) val = 0x%04x\n", val); + + s->intrmask = val; + + ivshmem_update_irq(s, val); +} + +static uint32_t ivshmem_IntrMask_read(IVShmemState *s) +{ + uint32_t ret = s->intrmask; + + IVSHMEM_DPRINTF("intrmask read(w) val = 0x%04x\n", ret); + + return ret; +} + +static void ivshmem_IntrStatus_write(IVShmemState *s, uint32_t val) +{ + IVSHMEM_DPRINTF("IntrStatus write(w) val = 0x%04x\n", val); + + s->intrstatus = val; + + ivshmem_update_irq(s, val); + return; +} + +static uint32_t ivshmem_IntrStatus_read(IVShmemState *s) +{ + uint32_t ret = s->intrstatus; + + /* reading ISR clears all interrupts */ + s->intrstatus = 0; + + ivshmem_update_irq(s, 0); + + return ret; +} + +static void ivshmem_io_writew(void *opaque, uint8_t addr, uint32_t val) +{ + + IVSHMEM_DPRINTF("We shouldn't be writing words\n"); +} + +static void ivshmem_io_writel(void *opaque, uint8_t addr, uint32_t val) +{ + IVShmemState *s = opaque; + u_int64_t writelong = val; + int dest = (addr - Doorbell)/sizeof(int); + + addr &= 0xfe; + + switch (addr) + { + case IntrMask: + ivshmem_IntrMask_write(s, val); + break; + + case IntrStatus: + ivshmem_IntrStatus_write(s, val); + break; + + default: + /* check doorbell range */ + if ((addr >= Doorbell) && + (addr <= Doorbell + sizeof(int) * s->num_eventfds)) { + IVSHMEM_DPRINTF("Writing %ld to VM %d\n", writelong, dest); + if (write(s->eventfds[dest], &(writelong), 8) != 8) + IVSHMEM_DPRINTF("error writing to eventfd\n"); + } else { + IVSHMEM_DPRINTF("Invalid VM Doorbell VM %d\n", dest); + } + } +} + +static void ivshmem_io_writeb(void *opaque, uint8_t addr, uint32_t val) +{ + IVSHMEM_DPRINTF("We shouldn't be writing bytes\n"); +} + +static uint32_t ivshmem_io_readw(void *opaque, uint8_t 
addr) +{ + + IVSHMEM_DPRINTF("We shouldn't be reading words\n"); + return 0; +} + +static uint32_t ivshmem_io_readl(void *opaque, uint8_t addr) +{ + + IVShmemState *s = opaque; + uint32_t ret; + + switch (addr) + { + case IntrMask: + ret = ivshmem_IntrMask_read(s); + break; + + case IntrStatus: + ret = ivshmem_IntrStatus_read(s); + break; + + case IVPosition: + /* return my id in the ivshmem list */ + ret = s->eventfd_posn; + break; + + default: + IVSHMEM_DPRINTF("why are we reading 0x%x\n", addr); + ret = 0; + } + + return ret; + +} + +static uint32_t ivshmem_io_readb(void *opaque, uint8_t addr) +{ + IVSHMEM_DPRINTF("We shouldn't be reading bytes\n"); + + return 0; +} + +static void ivshmem_mmio_writeb(void *opaque, + target_phys_addr_t addr, uint32_t val) +{ + ivshmem_io_writeb(opaque, addr & 0xFF, val); +} + +static void ivshmem_mmio_writew(void *opaque, + target_phys_addr_t addr, uint32_t val) +{ + ivshmem_io_writew(opaque, addr & 0xFF, val); +} + +static void ivshmem_mmio_writel(void *opaque, + target_phys_addr_t addr, uint32_t val) +{ + ivshmem_io_writel(opaque, addr & 0xFF, val); +} + +static uint32_t ivshmem_mmio_readb(void *opaque, target_phys_addr_t addr) +{ + return ivshmem_io_readb(opaque, addr & 0xFF); +} + +static uint32_t ivshmem_mmio_readw(void *opaque, target_phys_addr_t addr) +{ + uint32_t val = ivshmem_io_readw(opaque, addr & 0xFF); + return val; +} + +static uint32_t ivshmem_mmio_readl(void *opaque, target_phys_addr_t addr) +{ + uint32_t val = ivshmem_io_readl(opaque, addr & 0xFF); + return val; +} + +static CPUReadMemoryFunc *ivshmem_mmio_read[3] = { + ivshmem_mmio_readb, + ivshmem_mmio_readw, + ivshmem_mmio_readl, +}; + +static CPUWriteMemoryFunc *ivshmem_mmio_write[3] = { + ivshmem_mmio_writeb, + ivshmem_mmio_writew, + ivshmem_mmio_writel, +}; + +static void ivshmem_receive(void *opaque, const uint8_t *buf, int size) +{ + IVShmemState *s = opaque; + uint32_t val; + + val = *buf; + if (msix_enabled(&s->dev) && (val < s->vectors)) + msix_notify(&s->dev, val); + else + ivshmem_IntrStatus_write(s, val); + + IVSHMEM_DPRINTF("ivshmem_receive 0x%02x\n", *buf); +} + +static int ivshmem_can_receive(void * opaque) +{ + return 8; +} + +static void ivshmem_event(void *opaque, int event) +{ +// IVShmemState *s = opaque; + IVSHMEM_DPRINTF("ivshmem_event %d\n", event); +} + + +static CharDriverState* create_eventfd_chr_device(void * opaque, int eventfd) +{ + // create a event character device based on the passed eventfd + IVShmemState *s = opaque; + CharDriverState * chr; + + chr = qemu_chr_open_eventfd(eventfd); + + if (chr == NULL) { + IVSHMEM_DPRINTF("creating eventfd for eventfd %d failed\n", eventfd); + exit(-1); + } + + qemu_chr_add_handlers(chr, ivshmem_can_receive, ivshmem_receive, + ivshmem_event, s); + + return chr; + +} + +static int check_shm_size(IVShmemState *s, int shmemfd) { + /* check that the guest isn't going to try and map more memory than the + * card server allocated return -1 to indicate error */ + + struct stat buf; + + fstat(shmemfd, &buf); + + if (s->ivshmem_size > buf.st_size) { + fprintf(stderr, "IVSHMEM ERROR: Requested memory size greater"); + fprintf(stderr, " than shared object size (%d > %ld)\n", + s->ivshmem_size, buf.st_size); + return -1; + } else { + return 0; + } +} + +static void create_shared_memory_BAR(IVShmemState *s, int fd) { + + s->shm_fd = fd; + + s->ivshmem_offset = qemu_ram_mmap(s->shm_fd, s->ivshmem_size, + MAP_SHARED, 0); + + s->ivshmem_ptr = qemu_get_ram_ptr(s->ivshmem_offset); + + /* region for shared memory */ + 
pci_register_bar(&s->dev, 2, s->ivshmem_size, + PCI_BASE_ADDRESS_SPACE_MEMORY, ivshmem_map); +} + +static int ivshmem_irqfd(PCIDevice* pdev, uint16_t vector, int fd) +{ + struct kvm_irqfd call = { }; + int r; + + IVSHMEM_DPRINTF("inside irqfd\n"); + if (vector >= pdev->msix_entries_nr) + return -EINVAL; + call.fd = fd; + call.gsi = pdev->msix_irq_entries[vector].gsi; + r = kvm_vm_ioctl(kvm_state, KVM_IRQFD, &call); + if (r < 0) + return r; + return 0; +} + +static int ivshmem_ioeventfd(IVShmemState* s, int posn, int fd) +{ + + int ret, offset; + struct kvm_ioeventfd iofd; + + offset = Doorbell + 4 * posn; + iofd.addr = s->mmio_addr + offset; + iofd.len = 4; + iofd.flags = 0; + iofd.fd = fd; + + ret = kvm_vm_ioctl(kvm_state, KVM_IOEVENTFD, &iofd); + + if (ret < 0) { + fprintf(stderr, "error assigning ioeventfd (%d) - disabling ioeventfds\n", ret); + perror(strerror(ret)); + } else IVSHMEM_DPRINTF("success assigning ioeventfd (%d)\n", ret); + + return ret; +} +/* notify that a new guest has joined */ +static void new_guest_interrupt(IVShmemState *s) +{ + if (msix_enabled(&s->dev)) { + msix_notify(&s->dev, 0); + } else { + ivshmem_IntrStatus_write(s, NEW_GUEST_VAL); + } +} + +static void ivshmem_read(void *opaque, const uint8_t * buf, int flags) +{ + IVShmemState *s = opaque; + int incoming_fd, tmp_fd; + long incoming_posn; + + memcpy(&incoming_posn, buf, sizeof(long)); + /* pick off s->chr->msgfd and store it, posn should accompany msg */ + tmp_fd = qemu_chr_get_msgfd(s->chr); + IVSHMEM_DPRINTF("posn is %ld, fd is %d\n", incoming_posn, tmp_fd); + + if (tmp_fd == -1) { + s->eventfd_posn = incoming_posn; + return; + } + + /* because of the implementation of get_msgfd, we need a dup */ + incoming_fd = dup(tmp_fd); + + /* if the position is -1, then it's shared memory region fd */ + if (incoming_posn == -1) { + + s->num_eventfds = 0; + + if (check_shm_size(s, incoming_fd) == -1) { + exit(-1); + } + + /* creating a BAR in qemu_chr callback may be crazy */ + create_shared_memory_BAR(s, incoming_fd); + + return; + } + + if (ivshmem_has_feature(s, IVSHMEM_IRQFD)) { + /* allocate ioeventfd for the new guest */ + ivshmem_ioeventfd(s, incoming_posn, s->eventfds[incoming_posn]); + } + + /* this is an eventfd for a particular guest VM */ + IVSHMEM_DPRINTF("eventfds[%ld] = %d\n", incoming_posn, incoming_fd); + s->eventfds[incoming_posn] = incoming_fd; + + /* keep track of the maximum VM ID */ + if (incoming_posn > s->num_eventfds) { + s->num_eventfds = incoming_posn; + } + + if (incoming_posn == s->eventfd_posn) { + if (ivshmem_has_feature(s, IVSHMEM_IRQFD)) { + /* setup irqfd for this VM's eventfd */ + /* vector is set to 1 */ + ivshmem_irqfd(&s->dev, 1, s->eventfds[s->eventfd_posn]); + } else { + /* initialize char device for callback on my eventfd */ + s->eventfd_chr = create_eventfd_chr_device(s, s->eventfds[s->eventfd_posn]); + } + } else { + new_guest_interrupt(s); + } + + return; +} + +static void ivshmem_reset(DeviceState *d) +{ + return; +} + +static void ivshmem_mmio_map(PCIDevice *pci_dev, int region_num, + pcibus_t addr, pcibus_t size, int type) +{ + IVShmemState *s = DO_UPCAST(IVShmemState, dev, pci_dev); + + s->mmio_addr = addr; + cpu_register_physical_memory(addr + 0, 0x400, s->ivshmem_mmio_io_addr); + + /* now that our mmio region has been allocated, we can receive + * the file descriptors */ + qemu_chr_add_handlers(s->chr, ivshmem_can_receive, ivshmem_read, + ivshmem_event, s); + +} + +static int pci_ivshmem_init(PCIDevice *dev) +{ + IVShmemState *s = DO_UPCAST(IVShmemState, dev, dev); + 
uint8_t *pci_conf; + int i; + + /* BARs must be a power of 2 */ + if (is_power_of_two(s->size)) + s->ivshmem_size = s->size * 1024* 1024; + else { + fprintf(stderr, "ivshmem: size must be power of 2\n"); + exit(1); + } + + /* IRQFD requires MSI */ + if (ivshmem_has_feature(s, IVSHMEM_IRQFD) && + !ivshmem_has_feature(s, IVSHMEM_MSI)) { + fprintf(stderr, "ivshmem: ioeventfd/irqfd requires MSI\n"); + exit(1); + } + + pci_conf = s->dev.config; + pci_conf[0x00] = 0xf4; // Qumranet vendor ID 0x5002 + pci_conf[0x01] = 0x1a; + pci_conf[0x02] = 0x10; + pci_conf[0x03] = 0x11; + pci_conf[0x04] = PCI_COMMAND_IOACCESS | PCI_COMMAND_MEMACCESS; + pci_conf[0x0a] = 0x00; // RAM controller + pci_conf[0x0b] = 0x05; + pci_conf[0x0e] = 0x00; // header_type + + s->ivshmem_mmio_io_addr = cpu_register_io_memory(ivshmem_mmio_read, + ivshmem_mmio_write, s); + /* region for registers*/ + pci_register_bar(&s->dev, 0, 0x400, + PCI_BASE_ADDRESS_SPACE_MEMORY, ivshmem_mmio_map); + + if ((s->chr != NULL) && (strncmp(s->chr->filename, "unix:", 5) == 0)) { + /* if we get a UNIX socket as the parameter we will talk + * to the ivshmem server later once the MMIO BAR is actually + * allocated (see ivshmem_mmio_map) */ + + IVSHMEM_DPRINTF("using shared memory server (socket = %s)\n", + s->chr->filename); + + s->eventfd_posn = -1; + + } else { + /* just map the file immediately, we're not using a server */ + int fd; + + IVSHMEM_DPRINTF("using shm_open (shm object = %s)\n", s->shmobj); + + if ((fd = shm_open(s->shmobj, O_CREAT|O_RDWR, + S_IRWXU|S_IRWXG|S_IRWXO)) < 0) { + fprintf(stderr, "kvm_ivshmem: could not open shared file\n"); + exit(-1); + } + + /* mmap onto PCI device's memory */ + if (ftruncate(fd, s->ivshmem_size) != 0) { + fprintf(stderr, "kvm_ivshmem: could not truncate shared file\n"); + } + + create_shared_memory_BAR(s, fd); + + } + + IVSHMEM_DPRINTF("shared memory size is = %d\n", s->size); + + pci_conf[PCI_INTERRUPT_PIN] = 1; // we are going to support interrupts + + /* allocate the MSI-X vectors */ + if (ivshmem_has_feature(s, IVSHMEM_MSI)) { + + if (!msix_init(&s->dev, s->vectors, 1, 0)) { + pci_register_bar(&s->dev, 1, + msix_bar_size(&s->dev), + PCI_BASE_ADDRESS_SPACE_MEMORY, + msix_mmio_map); + IVSHMEM_DPRINTF("msix initialized (%d vectors)\n", s->vectors); + } else { + IVSHMEM_DPRINTF("msix initialization failed\n"); + } + + /* 'activate' the vectors */ + for (i = 0; i < s->vectors; i++) { + msix_vector_use(&s->dev, i); + } + } + + return 0; +} + +static int pci_ivshmem_uninit(PCIDevice *dev) +{ + IVShmemState *s = DO_UPCAST(IVShmemState, dev, dev); + + cpu_unregister_io_memory(s->ivshmem_mmio_io_addr); + + return 0; +} + +static PCIDeviceInfo ivshmem_info = { + .qdev.name = "ivshmem", + .qdev.size = sizeof(IVShmemState), + .qdev.reset = ivshmem_reset, + .init = pci_ivshmem_init, + .exit = pci_ivshmem_uninit, + .qdev.props = (Property[]) { + DEFINE_PROP_CHR("chardev", IVShmemState, chr), + DEFINE_PROP_UINT32("size", IVShmemState, size, 0), + DEFINE_PROP_UINT32("vectors", IVShmemState, vectors, 2), + DEFINE_PROP_BIT("irqfd", IVShmemState, features, IVSHMEM_IRQFD, false), + DEFINE_PROP_BIT("msi", IVShmemState, features, IVSHMEM_MSI, true), + DEFINE_PROP_STRING("shm", IVShmemState, shmobj), + DEFINE_PROP_END_OF_LIST(), + } +}; + +static void ivshmem_register_devices(void) +{ + pci_qdev_register(&ivshmem_info); +} + +device_init(ivshmem_register_devices) diff --git a/qemu-char.c b/qemu-char.c index 40cfefa..6533395 100644 --- a/qemu-char.c +++ b/qemu-char.c @@ -2074,6 +2074,12 @@ static void tcp_chr_read(void 
*opaque) } } +CharDriverState *qemu_chr_open_eventfd(int eventfd){ + + return qemu_chr_open_fd(eventfd, eventfd); + +} + static void tcp_chr_connect(void *opaque) { CharDriverState *chr = opaque; diff --git a/qemu-char.h b/qemu-char.h index bcc0766..9a0d2c0 100644 --- a/qemu-char.h +++ b/qemu-char.h @@ -93,6 +93,9 @@ void qemu_chr_info_print(Monitor *mon, const QObject *ret_data); void qemu_chr_info(Monitor *mon, QObject **ret_data); CharDriverState *qemu_chr_find(const char *name); +/* add an eventfd to the qemu devices that are polled */ +CharDriverState *qemu_chr_open_eventfd(int eventfd); + extern int term_escape_char; /* async I/O support */ -- 1.6.6.1 ^ permalink raw reply related [flat|nested] 18+ messages in thread
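For reference, a guest userspace test in the spirit of the sample programs mentioned above can be as small as mapping the device's shared-memory BAR through sysfs; this is a sketch, not code from the nahanni repo, and the PCI address is illustrative:

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    int fd = open("/sys/bus/pci/devices/0000:00:04.0/resource2", O_RDWR);
    char *shm = mmap(NULL, 1024 * 1024, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    strcpy(shm, "hello from guest A");   /* visible to all peers mapping the object */
    return 0;
}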
* [Qemu-devel] Re: [PATCH v3 0/2] Inter-VM shared memory PCI device 2010-03-25 6:08 [Qemu-devel] [PATCH v3 0/2] Inter-VM shared memory PCI device Cam Macdonell 2010-03-25 6:08 ` [Qemu-devel] [PATCH v3 1/2] Support adding a file to qemu's ram allocation Cam Macdonell @ 2010-03-25 9:04 ` Avi Kivity 2010-03-25 9:21 ` Michael S. Tsirkin ` (2 more replies) 1 sibling, 3 replies; 18+ messages in thread From: Avi Kivity @ 2010-03-25 9:04 UTC (permalink / raw) To: Cam Macdonell; +Cc: qemu-devel, kvm On 03/25/2010 08:08 AM, Cam Macdonell wrote: > Support an inter-vm shared memory device that maps a shared-memory object > as a PCI device in the guest. This patch also supports interrupts between > guest by communicating over a unix domain socket. This patch applies to the > qemu-kvm repository. > > Changes in this version are using the qdev format and optional use of MSI and > ioeventfd/irqfd. > > The non-interrupt version is supported by passing the shm parameter > > -device ivshmem,size=<size in MB>,[shm=<shm_name>] > > which will simply map the shm object into a BAR. > > Interrupts are supported between multiple VMs by using a shared memory server > that is connected to with a socket character device > > -device ivshmem,size=<size in MB>[,chardev=<chardev name>][,irqfd=on] > [,msi=on][,nvectors=n] > -chardev socket,path=<path>,id=<chardev name> > > The server passes file descriptors for the shared memory object and eventfds (our > interrupt mechanism) to the respective qemu instances. > > When using interrupts, VMs communicate with a shared memory server that passes > the shared memory object file descriptor using SCM_RIGHTS. The server assigns > each VM an ID number and sends this ID number to the Qemu process along with a > series of eventfd file descriptors, one per guest using the shared memory > server. These eventfds will be used to send interrupts between guests. Each > guest listens on the eventfd corresponding to their ID and may use the others > for sending interrupts to other guests. > Please put the spec somewhere publicly accessible with a permanent URL. I suggest a new qemu.git directory specs/. It's more important than the code IMO. > enum ivshmem_registers { > IntrMask = 0, > IntrStatus = 4, > IVPosition = 8, > Doorbell = 12 > }; > > The first two registers are the interrupt mask and status registers. Mask and > status are only used with pin-based interrupts. They are unused with MSI > interrupts. The IVPosition register is read-only and reports the guest's ID > number. Interrupts are triggered when a message is received on the guest's > eventfd from another VM. To trigger an event, a guest must write to another > guest's Doorbell. The "Doorbells" begin at offset 12. A particular guest's > doorbell offset in the MMIO region is equal to > > guest_id * 32 + Doorbell > > The doorbell register for each guest is 32-bits. The doorbell-per-guest > design was motivated for use with ioeventfd. > You can also use a single doorbell register with ioeventfd, as it can match against the data written. If you go this route, you'd have two doorbells, one where you write a guest ID to send an interrupt to that guest, and one where any write generates a multicast. Possible later extensions: - multiple doorbells that trigger different vectors - multicast doorbells > The semantics of the value written to the doorbell depends on whether the > device is using MSI or a regular pin-based interrupt. > I recommend against making the semantics interrupt-style dependent. 
It means the application needs to know whether MSI is in use or not, while it is generally the OS that is in control of that. > Regular Interrupts > ------------------ > > If regular interrupts are used (due to either a guest not supporting MSI or the > user specifying not to use them on the command-line) then the value written to > a guest's doorbell is what the guest's status register will be set to. > > An status of (2^32 - 1) indicates that a new guest has joined. Guests > should not send a message of this value for any other reason. > > Message Signalled Interrupts > ---------------------------- > > The important thing to remember with MSI is that it is only a signal, no > status is set (since MSI interrupts are not shared). All information other > than the interrupt itself should be communicated via the shared memory region. > MSI is on by default. It can be turned off with the msi=off to the parameter. > > If the device uses MSI then the value written to the doorbell is the MSI vector > that will be raised. Vector 0 is used to notify that a new guest has joined. > Vector 0 cannot be triggered by another guest since a value of 0 does not > trigger an eventfd. > Ah, looks like we approached the vector/guest matrix from different directions. > ioeventfd/irqfd > --------------- > > ioeventfd/irqfd is turned on by irqfd=on passed to the device parameter (it is > off by default). When using ioeventfd/irqfd the only interrupt value that can > be passed to another guest is 1 despite what value is written to a guest's > Doorbell. > ioeventfd/irqfd are an implementation detail. The spec should not depend on it. It needs to be written as if qemu and kvm do not exist. Again, I recommend Rusty's virtio-pci for inspiration. Applications should see exactly the same thing whether ioeventfd is enabled or not. > Sample programs, init scripts and the shared memory server are available in a > git repo here: > > www.gitorious.org/nahanni > > Cam Macdonell (2): > Support adding a file to qemu's ram allocation > Inter-VM shared memory PCI device > Do you plan to maintain the server indefinitely in that repository? If not, we can put it in qemu.git, perhaps under contrib/. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 18+ messages in thread
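Avi's datamatch suggestion maps directly onto the existing KVM_IOEVENTFD ABI: one ioeventfd per destination is registered at a single doorbell address, distinguished by the value written. A sketch, assuming the doorbell's guest-physical address and the per-guest eventfds are already known (not code from the patch):

struct kvm_ioeventfd iofd = {
    .addr      = doorbell_gpa,              /* the one shared doorbell register */
    .len       = 4,
    .datamatch = dest_id,                   /* fire only when dest_id is written */
    .fd        = eventfds[dest_id],
    .flags     = KVM_IOEVENTFD_FLAG_DATAMATCH,
};
kvm_vm_ioctl(kvm_state, KVM_IOEVENTFD, &iofd);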
* [Qemu-devel] Re: [PATCH v3 0/2] Inter-VM shared memory PCI device 2010-03-25 9:04 ` [Qemu-devel] Re: [PATCH v3 0/2] " Avi Kivity @ 2010-03-25 9:21 ` Michael S. Tsirkin 2010-03-25 16:11 ` Cam Macdonell 2010-03-25 9:26 ` Markus Armbruster 2010-03-25 16:50 ` Cam Macdonell 2 siblings, 1 reply; 18+ messages in thread From: Michael S. Tsirkin @ 2010-03-25 9:21 UTC (permalink / raw) To: Avi Kivity; +Cc: Cam Macdonell, qemu-devel, kvm On Thu, Mar 25, 2010 at 11:04:54AM +0200, Avi Kivity wrote: > Again, I recommend Rusty's virtio-pci for inspiration. Not just inspiration, how about building on virtio-pci? -- MST ^ permalink raw reply [flat|nested] 18+ messages in thread
* [Qemu-devel] Re: [PATCH v3 0/2] Inter-VM shared memory PCI device 2010-03-25 9:21 ` Michael S. Tsirkin @ 2010-03-25 16:11 ` Cam Macdonell 0 siblings, 0 replies; 18+ messages in thread From: Cam Macdonell @ 2010-03-25 16:11 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: Avi Kivity, kvm, qemu-devel On Thu, Mar 25, 2010 at 3:21 AM, Michael S. Tsirkin <mst@redhat.com> wrote: > On Thu, Mar 25, 2010 at 11:04:54AM +0200, Avi Kivity wrote: >> Again, I recommend Rusty's virtio-pci for inspiration. > > Not just inspiration, how about building on virtio-pci? Virtio was discussed at length last year on a previous version. I did implement a virtio version that extended virtio to use memory regions for importing host memory into a guest. http://www.mail-archive.com/qemu-devel@nongnu.org/msg25784.html However, Anthony thought the memory regions broke the DMA engine model of virtio too much and so I went back to PCI. Cam > > -- > MST > > ^ permalink raw reply [flat|nested] 18+ messages in thread
* [Qemu-devel] Re: [PATCH v3 0/2] Inter-VM shared memory PCI device 2010-03-25 9:04 ` [Qemu-devel] Re: [PATCH v3 0/2] " Avi Kivity 2010-03-25 9:21 ` Michael S. Tsirkin @ 2010-03-25 9:26 ` Markus Armbruster 2010-03-25 9:37 ` Avi Kivity 2010-03-25 16:50 ` Cam Macdonell 2 siblings, 1 reply; 18+ messages in thread From: Markus Armbruster @ 2010-03-25 9:26 UTC (permalink / raw) To: Avi Kivity; +Cc: Cam Macdonell, qemu-devel, kvm Avi Kivity <avi@redhat.com> writes: > Please put the spec somewhere publicly accessible with a permanent > URL. I suggest a new qemu.git directory specs/. It's more important > than the code IMO. What about docs/? It already exists. [...] ^ permalink raw reply [flat|nested] 18+ messages in thread
* [Qemu-devel] Re: [PATCH v3 0/2] Inter-VM shared memory PCI device 2010-03-25 9:26 ` Markus Armbruster @ 2010-03-25 9:37 ` Avi Kivity 0 siblings, 0 replies; 18+ messages in thread From: Avi Kivity @ 2010-03-25 9:37 UTC (permalink / raw) To: Markus Armbruster; +Cc: Cam Macdonell, qemu-devel, kvm On 03/25/2010 11:26 AM, Markus Armbruster wrote: > Avi Kivity<avi@redhat.com> writes: > > >> Please put the spec somewhere publicly accessible with a permanent >> URL. I suggest a new qemu.git directory specs/. It's more important >> than the code IMO. >> > What about docs/? It already exists. > docs/ would be internal qemu documentation. I want it to be clear this is external documentation. docs/specs/ would work for that. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 18+ messages in thread
* [Qemu-devel] Re: [PATCH v3 0/2] Inter-VM shared memory PCI device 2010-03-25 9:04 ` [Qemu-devel] Re: [PATCH v3 0/2] " Avi Kivity 2010-03-25 9:21 ` Michael S. Tsirkin 2010-03-25 9:26 ` Markus Armbruster @ 2010-03-25 16:50 ` Cam Macdonell 2010-03-25 17:02 ` Avi Kivity 2 siblings, 1 reply; 18+ messages in thread From: Cam Macdonell @ 2010-03-25 16:50 UTC (permalink / raw) To: Avi Kivity; +Cc: qemu-devel, kvm On Thu, Mar 25, 2010 at 3:04 AM, Avi Kivity <avi@redhat.com> wrote: > On 03/25/2010 08:08 AM, Cam Macdonell wrote: >> >> Support an inter-vm shared memory device that maps a shared-memory object >> as a PCI device in the guest. This patch also supports interrupts between >> guest by communicating over a unix domain socket. This patch applies to >> the >> qemu-kvm repository. >> >> Changes in this version are using the qdev format and optional use of MSI >> and >> ioeventfd/irqfd. >> >> The non-interrupt version is supported by passing the shm parameter >> >> -device ivshmem,size=<size in MB>,[shm=<shm_name>] >> >> which will simply map the shm object into a BAR. >> >> Interrupts are supported between multiple VMs by using a shared memory >> server >> that is connected to with a socket character device >> >> -device ivshmem,size=<size in MB>[,chardev=<chardev name>][,irqfd=on] >> [,msi=on][,nvectors=n] >> -chardev socket,path=<path>,id=<chardev name> >> >> The server passes file descriptors for the shared memory object and >> eventfds (our >> interrupt mechanism) to the respective qemu instances. >> >> When using interrupts, VMs communicate with a shared memory server that >> passes >> the shared memory object file descriptor using SCM_RIGHTS. The server >> assigns >> each VM an ID number and sends this ID number to the Qemu process along >> with a >> series of eventfd file descriptors, one per guest using the shared memory >> server. These eventfds will be used to send interrupts between guests. >> Each >> guest listens on the eventfd corresponding to their ID and may use the >> others >> for sending interrupts to other guests. >> > > Please put the spec somewhere publicly accessible with a permanent URL. I > suggest a new qemu.git directory specs/. It's more important than the code > IMO. Sorry to be pedantic, do you want a URL or the spec as part of a patch that adds it as a file in qemu.git/docs/specs/ > >> enum ivshmem_registers { >> IntrMask = 0, >> IntrStatus = 4, >> IVPosition = 8, >> Doorbell = 12 >> }; >> >> The first two registers are the interrupt mask and status registers. Mask >> and >> status are only used with pin-based interrupts. They are unused with MSI >> interrupts. The IVPosition register is read-only and reports the guest's >> ID >> number. Interrupts are triggered when a message is received on the >> guest's >> eventfd from another VM. To trigger an event, a guest must write to >> another >> guest's Doorbell. The "Doorbells" begin at offset 12. A particular >> guest's >> doorbell offset in the MMIO region is equal to >> >> guest_id * 32 + Doorbell >> >> The doorbell register for each guest is 32-bits. The doorbell-per-guest >> design was motivated for use with ioeventfd. >> > > You can also use a single doorbell register with ioeventfd, as it can match > against the data written. If you go this route, you'd have two doorbells, > one where you write a guest ID to send an interrupt to that guest, and one > where any write generates a multicast. I thought of using the datamatch. 
> > Possible later extensions: > - multiple doorbells that trigger different vectors > - multicast doorbells Since the doorbells are exposed, the multicast could be done by the driver. If multicast is handled by qemu, then we have different behaviour when using ioeventfd/irqfd since only one eventfd can be triggered by a write. > >> The semantics of the value written to the doorbell depends on whether the >> device is using MSI or a regular pin-based interrupt. >> > > I recommend against making the semantics interrupt-style dependent. It > means the application needs to know whether MSI is in use or not, while it > is generally the OS that is in control of that. It is basically the use of the status register that is the difference. The application view of what is happening doesn't need to change, especially with UIO: write to doorbells, block on read until interrupt arrives. In the MSI case I could set the status register to the vector that is received and then they would be equivalent from the view of the application. But, if future MSI support in UIO allows MSI information (such as vector number) to be accessible in userspace, then applications would become MSI dependent anyway. > >> Regular Interrupts >> ------------------ >> >> If regular interrupts are used (due to either a guest not supporting MSI >> or the >> user specifying not to use them on the command-line) then the value >> written to >> a guest's doorbell is what the guest's status register will be set to. >> >> An status of (2^32 - 1) indicates that a new guest has joined. Guests >> should not send a message of this value for any other reason. >> >> Message Signalled Interrupts >> ---------------------------- >> >> The important thing to remember with MSI is that it is only a signal, no >> status is set (since MSI interrupts are not shared). All information >> other >> than the interrupt itself should be communicated via the shared memory >> region. >> MSI is on by default. It can be turned off with the msi=off to the >> parameter. >> > >> If the device uses MSI then the value written to the doorbell is the MSI >> vector >> that will be raised. Vector 0 is used to notify that a new guest has >> joined. >> Vector 0 cannot be triggered by another guest since a value of 0 does not >> trigger an eventfd. >> > > Ah, looks like we approached the vector/guest matrix from different > directions. > >> ioeventfd/irqfd >> --------------- >> >> ioeventfd/irqfd is turned on by irqfd=on passed to the device parameter >> (it is >> off by default). When using ioeventfd/irqfd the only interrupt value that >> can >> be passed to another guest is 1 despite what value is written to a guest's >> Doorbell. >> > > ioeventfd/irqfd are an implementation detail. The spec should not depend on > it. It needs to be written as if qemu and kvm do not exist. Again, I > recommend Rusty's virtio-pci for inspiration. > > Applications should see exactly the same thing whether ioeventfd is enabled > or not. The challenge I recently encountered with this is one line in the eventfd implementation from kvm/virt/kvm/eventfd.c /* MMIO/PIO writes trigger an event if the addr/val match */ static int ioeventfd_write(struct kvm_io_device *this, gpa_t addr, int len, const void *val) { struct _ioeventfd *p = to_ioeventfd(this); if (!ioeventfd_in_range(p, addr, len, val)) return -EOPNOTSUPP; eventfd_signal(p->eventfd, 1); return 0; } IIUC, no matter what value is written to an ioeventfd by a guest, a value of 1 is written. So ioeventfds work differently than eventfds.
Can we add a "multivalue" flag to ioeventfds so that the value that the guest writes is written to the eventfd? > >> Sample programs, init scripts and the shared memory server are available >> in a >> git repo here: >> >> www.gitorious.org/nahanni >> >> Cam Macdonell (2): >> Support adding a file to qemu's ram allocation >> Inter-VM shared memory PCI device >> > > Do you plan do maintain the server indefinitely in that repository? If not, > we can put it in qemu.git, perhaps under contrib/. In qemu.git is fine with me. > > -- > error compiling committee.c: too many arguments to function > > ^ permalink raw reply [flat|nested] 18+ messages in thread
* [Qemu-devel] Re: [PATCH v3 0/2] Inter-VM shared memory PCI device 2010-03-25 16:50 ` Cam Macdonell @ 2010-03-25 17:02 ` Avi Kivity 2010-03-25 17:35 ` Cam Macdonell 0 siblings, 1 reply; 18+ messages in thread From: Avi Kivity @ 2010-03-25 17:02 UTC (permalink / raw) To: Cam Macdonell; +Cc: qemu-devel, kvm On 03/25/2010 06:50 PM, Cam Macdonell wrote: > >> Please put the spec somewhere publicly accessible with a permanent URL. I >> suggest a new qemu.git directory specs/. It's more important than the code >> IMO. >> > Sorry to be pedantic, do you want a URL or the spec as part of a patch > that adds it as a file in qemu.git/docs/specs/ > I leave it up to you. If you are up to hosting it independently, then just post a URL as part of the patch. Otherwise, I'm sure qemu.git will be more than happy to be the official repository for the memory sharing device specification. In that case, make the spec the first patch in the series. >> Possible later extensions: >> - multiple doorbells that trigger different vectors >> - multicast doorbells >> > Since the doorbells are exposed the multicast could be done by the > driver. If multicast is handled by qemu, then we have different > behaviour when using ioeventfd/irqfd since only one eventfd can be > triggered by a write. > Multicast by the driver would require one exit per guest signalled. Multicast by the shared memory server needs one exit to signal an eventfd, then the shared memory server signals the irqfds of all members of the multicast group. >>> The semantics of the value written to the doorbell depends on whether the >>> device is using MSI or a regular pin-based interrupt. >>> >>> >> I recommend against making the semantics interrupt-style dependent. It >> means the application needs to know whether MSI is in use or not, while it >> is generally the OS that is in control of that. >> > It is basically the use of the status register that is the difference. > The application view of what is happening doesn't need to change, > especially with UIO: write to doorbells, block on read until interrupt > arrives. In the MSI case I could set the status register to the > vector that is received and then the would be equivalent from the view > of the application. But, if future MSI support in UIO allows MSI > information (such as vector number) to be accessible in userspace, > then applications would become MSI dependent anyway. > Ah, I see. You adjusted for the different behaviours in the driver. Still I recommend dropping the status register: this allows single-msi and PIRQ to behave the same way. Also it is racy: if two guests signal a third, they will overwrite each other's status. >> ioeventfd/irqfd are an implementation detail. The spec should not depend on >> it. It needs to be written as if qemu and kvm do not exist. Again, I >> recommend Rusty's virtio-pci for inspiration. >> >> Applications should see exactly the same thing whether ioeventfd is enabled >> or not. >> > The challenge I recently encountered with this is one line in the > eventfd implementation > > from kvm/virt/kvm/eventfd.c > > /* MMIO/PIO writes trigger an event if the addr/val match */ > static int > ioeventfd_write(struct kvm_io_device *this, gpa_t addr, int len, > const void *val) > { > struct _ioeventfd *p = to_ioeventfd(this); > > if (!ioeventfd_in_range(p, addr, len, val)) > return -EOPNOTSUPP; > > eventfd_signal(p->eventfd, 1); > return 0; > } > > IIUC, no matter what value is written to an ioeventfd by a guest, a > value of 1 is written. 
So ioeventfds work differently than eventfds. > Can we add a "multivalue" flag to ioeventfds so that the value that > the guest writes is written to eventfd? > Eventfd values are a counter, not a register. A read() on the other side returns the sum of all write()s (or eventfd_signal()s). In the context of irqfd it just means the number of interrupts we coalesced. Multivalue was considered at one time for a different need and rejected. Really, to solve the race you need a queue, and that can only be done in the shared memory segment using locked instructions. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 18+ messages in thread
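The counter semantics are easy to demonstrate from host userspace with a self-contained example (not from the patch):

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(void)
{
    int efd = eventfd(0, 0);
    uint64_t v;

    v = 3; write(efd, &v, sizeof(v));
    v = 4; write(efd, &v, sizeof(v));
    read(efd, &v, sizeof(v));   /* yields 7: the sum of the writes, not the last value */
    printf("%llu\n", (unsigned long long)v);
    return 0;
}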
* [Qemu-devel] Re: [PATCH v3 0/2] Inter-VM shared memory PCI device 2010-03-25 17:02 ` Avi Kivity @ 2010-03-25 17:35 ` Cam Macdonell 2010-03-25 17:48 ` Avi Kivity 0 siblings, 1 reply; 18+ messages in thread From: Cam Macdonell @ 2010-03-25 17:35 UTC (permalink / raw) To: Avi Kivity; +Cc: qemu-devel, kvm On Thu, Mar 25, 2010 at 11:02 AM, Avi Kivity <avi@redhat.com> wrote: > On 03/25/2010 06:50 PM, Cam Macdonell wrote: >> >>> Please put the spec somewhere publicly accessible with a permanent URL. >>> I >>> suggest a new qemu.git directory specs/. It's more important than the >>> code >>> IMO. >>> >> >> Sorry to be pedantic, do you want a URL or the spec as part of a patch >> that adds it as a file in qemu.git/docs/specs/ >> > > I leave it up to you. If you are up to hosting it independently, than just > post a URL as part of the patch. Otherwise, I'm sure qemu.git will be more > than happy to be the official repository for the memory sharing device > specification. In that case, make the the spec the first patch in the > series. Ok, I'll send it as part of the series; that way people can comment inline easily. > >>> Possible later extensions: >>> - multiple doorbells that trigger different vectors >>> - multicast doorbells >>> >> >> Since the doorbells are exposed the multicast could be done by the >> driver. If multicast is handled by qemu, then we have different >> behaviour when using ioeventfd/irqfd since only one eventfd can be >> triggered by a write. >> > > Multicast by the driver would require one exit per guest signalled. > Multicast by the shared memory server needs one exit to signal an eventfd, > then the shared memory server signals the irqfds of all members of the > multicast group. > >>>> The semantics of the value written to the doorbell depends on whether >>>> the >>>> device is using MSI or a regular pin-based interrupt. >>>> >>>> >>> >>> I recommend against making the semantics interrupt-style dependent. It >>> means the application needs to know whether MSI is in use or not, while >>> it >>> is generally the OS that is in control of that. >>> >> >> It is basically the use of the status register that is the difference. >> The application view of what is happening doesn't need to change, >> especially with UIO: write to doorbells, block on read until interrupt >> arrives. In the MSI case I could set the status register to the >> vector that is received and then the would be equivalent from the view >> of the application. But, if future MSI support in UIO allows MSI >> information (such as vector number) to be accessible in userspace, >> then applications would become MSI dependent anyway. >> > > Ah, I see. You adjusted for the different behaviours in the driver. > > Still I recommend dropping the status register: this allows single-msi and > PIRQ to behave the same way. Also it is racy, if two guests signal a third, > they will overwrite each other's status. With shared PIRQ interrupts and no status register, how does a device know it generated the interrupt? > >>> ioeventfd/irqfd are an implementation detail. The spec should not depend >>> on >>> it. It needs to be written as if qemu and kvm do not exist. Again, I >>> recommend Rusty's virtio-pci for inspiration. >>> >>> Applications should see exactly the same thing whether ioeventfd is >>> enabled >>> or not. 
>>> >> >> The challenge I recently encountered with this is one line in the >> eventfd implementation >> >> from kvm/virt/kvm/eventfd.c >> >> /* MMIO/PIO writes trigger an event if the addr/val match */ >> static int >> ioeventfd_write(struct kvm_io_device *this, gpa_t addr, int len, >> const void *val) >> { >> struct _ioeventfd *p = to_ioeventfd(this); >> >> if (!ioeventfd_in_range(p, addr, len, val)) >> return -EOPNOTSUPP; >> >> eventfd_signal(p->eventfd, 1); >> return 0; >> } >> >> IIUC, no matter what value is written to an ioeventfd by a guest, a >> value of 1 is written. So ioeventfds work differently than eventfds. >> Can we add a "multivalue" flag to ioeventfds so that the value that >> the guest writes is written to eventfd? >> > > Eventfd values are a counter, not a register. A read() on the other side > returns the sum of all write()s (or eventfd_signal()s). In the context of > irqfd it just means the number of interrupts we coalesced. > > Multivalue was considered at one time for a different need and rejected. > Really, to solve the race you need a queue, and that can only be done in > the shared memory segment using locked instructions. I had a hunch it was probably considered. That explains why irqfd doesn't have a datamatch field. I guess supporting multiple MSI vectors with one doorbell per guest isn't possible if only 1 bit of information can be communicated. So, ioeventfd/irqfd restricts MSI to 1 vector between guests. Should multi-MSI even be supported then in the non-ioeventfd/irqfd case? Otherwise ioeventfd/irqfd become more than an implementation detail. > >> Sample programs, init scripts and the shared memory server are available >> in a >> git repo here: >> >> www.gitorious.org/nahanni >> >> Cam Macdonell (2): >> Support adding a file to qemu's ram allocation >> Inter-VM shared memory PCI device >> > > Do you plan do maintain the server indefinitely in that repository? If not, > we can put it in qemu.git, perhaps under contrib/. In qemu.git is fine with me. > > -- > error compiling committee.c: too many arguments to function > > ^ permalink raw reply [flat|nested] 18+ messages in thread
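Supporting n vectors per guest through irqfd would then mean one eventfd wired to each vector's GSI, repeated per guest, i.e. the (# of guests) x (# of vectors) matrix discussed in the follow-up below. A sketch reusing the patch's ivshmem_irqfd() helper; the per-vector fd array is hypothetical:

int vector;
for (vector = 0; vector < s->vectors; vector++) {
    /* one eventfd per vector, handed out by the shared memory server */
    ivshmem_irqfd(&s->dev, vector, s->vector_fds[vector]);
}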
* [Qemu-devel] Re: [PATCH v3 0/2] Inter-VM shared memory PCI device 2010-03-25 17:35 ` Cam Macdonell @ 2010-03-25 17:48 ` Avi Kivity 2010-03-25 18:17 ` Cam Macdonell 0 siblings, 1 reply; 18+ messages in thread From: Avi Kivity @ 2010-03-25 17:48 UTC (permalink / raw) To: Cam Macdonell; +Cc: qemu-devel, kvm On 03/25/2010 07:35 PM, Cam Macdonell wrote: > >> Ah, I see. You adjusted for the different behaviours in the driver. >> >> Still I recommend dropping the status register: this allows single-msi and >> PIRQ to behave the same way. Also it is racy, if two guests signal a third, >> they will overwrite each other's status. >> > With shared interrupts with PIRQ without a status register how does a > device know it generated the interrupt? > Right, you need a status register. Just don't add any more information, since MSI cannot carry any data. >> Eventfd values are a counter, not a register. A read() on the other side >> returns the sum of all write()s (or eventfd_signal()s). In the context of >> irqfd it just means the number of interrupts we coalesced. >> >> Multivalue was considered at one time for a different need and rejected. >> Really, to solve the race you need a queue, and that can only be done in >> the shared memory segment using locked instructions. >> > I had a hunch it was probably considered. That explains why irqfd > doesn't have a datamatch field. I guess supporting multiple MSI > vectors with one doorbell per guest isn't possible if one 1 bit of > information can be communicated. > Actually you can have one doorbell supporting multiple vectors and guests, simply divide the data value into two bit fields, one for the vector and one for the guest. A single write gets both values into the host, which can then use datamatch to trigger the correct eventfd (which is wired to an irqfd in another guest). > So, ioeventfd/irqfd restricts MSI to 1 vector between guests. Should > multi-MSI even be supported then in the non-ioeventfd/irq case? > Otherwise ioeventfd/irqfd become more than an implementation detail. > I lost you. Please re-explain. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 18+ messages in thread
* [Qemu-devel] Re: [PATCH v3 0/2] Inter-VM shared memory PCI device 2010-03-25 17:48 ` Avi Kivity @ 2010-03-25 18:17 ` Cam Macdonell 2010-03-25 21:10 ` Avi Kivity 2010-03-26 1:32 ` Jamie Lokier 0 siblings, 2 replies; 18+ messages in thread From: Cam Macdonell @ 2010-03-25 18:17 UTC (permalink / raw) To: Avi Kivity; +Cc: qemu-devel, kvm On Thu, Mar 25, 2010 at 11:48 AM, Avi Kivity <avi@redhat.com> wrote: > On 03/25/2010 07:35 PM, Cam Macdonell wrote: >> >>> Ah, I see. You adjusted for the different behaviours in the driver. >>> >>> Still I recommend dropping the status register: this allows single-msi >>> and >>> PIRQ to behave the same way. Also it is racy, if two guests signal a >>> third, >>> they will overwrite each other's status. >>> >> >> With shared interrupts with PIRQ without a status register how does a >> device know it generated the interrupt? >> > > Right, you need a status register. Just don't add any more information, > since MSI cannot carry any data. Right. > >>> Eventfd values are a counter, not a register. A read() on the other side >>> returns the sum of all write()s (or eventfd_signal()s). In the context >>> of >>> irqfd it just means the number of interrupts we coalesced. >>> >>> Multivalue was considered at one time for a different need and rejected. >>> Really, to solve the race you need a queue, and that can only be done in >>> the shared memory segment using locked instructions. >>> >> >> I had a hunch it was probably considered. That explains why irqfd >> doesn't have a datamatch field. I guess supporting multiple MSI >> vectors with one doorbell per guest isn't possible if one 1 bit of >> information can be communicated. >> > > Actually you can have one doorbell supporting multiple vectors and guests, > simply divide the data value into two bit fields, one for the vector and one > for the guest. A single write gets both values into the host, which can > then use datamatch to trigger the correct eventfd (which is wired to an > irqfd in another guest). At 4-bits per guest, a single write is then limited to 8 guests (with 32-bit registers), we could go to 64-bit. > >> So, ioeventfd/irqfd restricts MSI to 1 vector between guests. Should >> multi-MSI even be supported then in the non-ioeventfd/irq case? >> Otherwise ioeventfd/irqfd become more than an implementation detail. >> > > I lost you. Please re-explain. An irqfd can only trigger a single vector in a guest. Right now I only have one eventfd per guest. So ioeventfd/irqfd restricts the current implementation to a single vector that a guest can trigger. Without irqfd, eventfds can be used like registers: a guest writes the number of the vector it wants to trigger, but as you point out it is racy. So, supporting multiple vectors via irqfd requires multiple eventfds for each guest (one per vector). A total of (# of guests) X (# of vectors) is required. If we're limited to 8 or 16 guests that's not too bad, but since the server opens them all we're restricted to 1024; still, that's a pretty high ceiling for this purpose. > > > > -- > Do not meddle in the internals of kernels, for they are subtle and quick to > panic. > > ^ permalink raw reply [flat|nested] 18+ messages in thread
* [Qemu-devel] Re: [PATCH v3 0/2] Inter-VM shared memory PCI device 2010-03-25 18:17 ` Cam Macdonell @ 2010-03-25 21:10 ` Avi Kivity 2010-03-25 23:05 ` Cam Macdonell 2010-03-26 1:32 ` Jamie Lokier 1 sibling, 1 reply; 18+ messages in thread From: Avi Kivity @ 2010-03-25 21:10 UTC (permalink / raw) To: Cam Macdonell; +Cc: qemu-devel, kvm On 03/25/2010 08:17 PM, Cam Macdonell wrote: > >>> I had a hunch it was probably considered. That explains why irqfd >>> doesn't have a datamatch field. I guess supporting multiple MSI >>> vectors with one doorbell per guest isn't possible if one 1 bit of >>> information can be communicated. >>> >>> >> Actually you can have one doorbell supporting multiple vectors and guests, >> simply divide the data value into two bit fields, one for the vector and one >> for the guest. A single write gets both values into the host, which can >> then use datamatch to trigger the correct eventfd (which is wired to an >> irqfd in another guest). >> > At 4-bits per guest, a single write is then limited to 8 guests (with > 32-bit registers), we could got to 64-bit. > I meant a unicast doorbell: 16 bits for guest ID, 16 bits for vector number. >> >>> So, ioeventfd/irqfd restricts MSI to 1 vector between guests. Should >>> multi-MSI even be supported then in the non-ioeventfd/irq case? >>> Otherwise ioeventfd/irqfd become more than an implementation detail. >>> >>> >> I lost you. Please re-explain. >> > An irqfd can only trigger a single vector in a guest. Right now I > only have one eventfd per guest. So ioeventfd/irqfd restricts the > current implementation to a single vector that a guest can trigger. > Without irqfd, eventfds can be used like registers a write the number > of the vector they want to trigger, but as you point out it is racy. > You can't use eventfds as registers. The next write will add to the current value. > So, supporting multiple vectors via irqfd requires multiple eventfds > for each guest (one per vector). a total of (# of guests) X (# of > vectors) are required. If we're limited to 8 or 16 guests that's not > too bad, but since the server opens them all we're restricted to 1024, > but that's a pretty high ceiling for this purpose. > I'm sure we can raise the fd ulimit for this. Note, I think qemus need the ulimit raised as well, since an fd passed via SCM_RIGHTS probably counts as an open file. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 18+ messages in thread
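The 16/16 split is simple to pin down with a pair of illustrative macros (not from any posted patch):

#define DOORBELL_DATA(guest, vector)  (((uint32_t)(guest) << 16) | ((vector) & 0xffff))
#define DOORBELL_GUEST(data)          ((uint32_t)(data) >> 16)
#define DOORBELL_VECTOR(data)         ((data) & 0xffff)

On the host side, each (guest, vector) pair then gets its own ioeventfd registered with datamatch on the full 32-bit value, so a single guest write selects both the destination and the vector.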
* [Qemu-devel] Re: [PATCH v3 0/2] Inter-VM shared memory PCI device
  2010-03-25 21:10 ` Avi Kivity
@ 2010-03-25 23:05 ` Cam Macdonell
  2010-03-26 15:56 ` Avi Kivity
  0 siblings, 1 reply; 18+ messages in thread
From: Cam Macdonell @ 2010-03-25 23:05 UTC (permalink / raw)
To: Avi Kivity; +Cc: qemu-devel, kvm

On Thu, Mar 25, 2010 at 3:10 PM, Avi Kivity <avi@redhat.com> wrote:
> On 03/25/2010 08:17 PM, Cam Macdonell wrote:
>>
>>>> I had a hunch it was probably considered.  That explains why irqfd
>>>> doesn't have a datamatch field.  I guess supporting multiple MSI
>>>> vectors with one doorbell per guest isn't possible if only 1 bit of
>>>> information can be communicated.
>>>>
>>> Actually, you can have one doorbell supporting multiple vectors and
>>> guests: simply divide the data value into two bit fields, one for the
>>> vector and one for the guest.  A single write gets both values into
>>> the host, which can then use datamatch to trigger the correct eventfd
>>> (which is wired to an irqfd in another guest).
>>>
>> At 4 bits per guest, a single write is then limited to 8 guests (with
>> 32-bit registers); we could go to 64-bit.
>>
> I meant a unicast doorbell: 16 bits for guest ID, 16 bits for vector
> number.

Ah, yes.  Who knew "two bit registers" was an ambiguous term?  Do you
strongly prefer the one-doorbell design?

^ permalink raw reply	[flat|nested] 18+ messages in thread
* [Qemu-devel] Re: [PATCH v3 0/2] Inter-VM shared memory PCI device
  2010-03-25 23:05 ` Cam Macdonell
@ 2010-03-26 15:56 ` Avi Kivity
  0 siblings, 0 replies; 18+ messages in thread
From: Avi Kivity @ 2010-03-26 15:56 UTC (permalink / raw)
To: Cam Macdonell; +Cc: qemu-devel, kvm

On 03/26/2010 01:05 AM, Cam Macdonell wrote:
>
>> I meant a unicast doorbell: 16 bits for guest ID, 16 bits for vector
>> number.
>>
> Ah, yes.  Who knew "two bit registers" was an ambiguous term?  Do you
> strongly prefer the one-doorbell design?
>

Just floating out ideas.  An advantage is that it conserves register
space; this matters if we use PIO, but for MMIO it isn't so important.

--
Do not meddle in the internals of kernels, for they are subtle and
quick to panic.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [Qemu-devel] Re: [PATCH v3 0/2] Inter-VM shared memory PCI device
  2010-03-25 18:17 ` Cam Macdonell
  2010-03-25 21:10 ` Avi Kivity
@ 2010-03-26  1:32 ` Jamie Lokier
  2010-03-26 15:52 ` Cam Macdonell
  1 sibling, 1 reply; 18+ messages in thread
From: Jamie Lokier @ 2010-03-26 1:32 UTC (permalink / raw)
To: Cam Macdonell; +Cc: Avi Kivity, kvm, qemu-devel

Cam Macdonell wrote:
> An irqfd can only trigger a single vector in a guest.  Right now I
> only have one eventfd per guest, so ioeventfd/irqfd restricts the
> current implementation to a single vector that a guest can trigger.
> Without irqfd, eventfds can be used like registers: a guest writes the
> number of the vector it wants to trigger, but as you point out that is
> racy.

It's not racy if you use a pipe instead of an eventfd. :-)

Actually, why not?  A byte pipe between guests would be more versatile.

Could it even integrate with virtio-serial, somehow?

-- Jamie

^ permalink raw reply	[flat|nested] 18+ messages in thread
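A toy C sketch of Jamie's observation -- bytes written to a pipe queue
up in order instead of summing, so each sender's vector number survives.
This is an illustration only; a real design would still need to handle
partial reads, blocking, and flow control:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    unsigned char v;

    pipe(fds);               /* fds[0] = read end, fds[1] = write end */

    v = 3;
    write(fds[1], &v, 1);    /* "guest A" sends vector 3 */
    v = 5;
    write(fds[1], &v, 1);    /* "guest B" sends vector 5 */

    /* Unlike an eventfd, both values come back out, in order. */
    read(fds[0], &v, 1);
    printf("got %u\n", v);   /* 3 */
    read(fds[0], &v, 1);
    printf("got %u\n", v);   /* 5 */

    close(fds[0]);
    close(fds[1]);
    return 0;
}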
* Re: [Qemu-devel] Re: [PATCH v3 0/2] Inter-VM shared memory PCI device
  2010-03-26  1:32 ` Jamie Lokier
@ 2010-03-26 15:52 ` Cam Macdonell
  0 siblings, 0 replies; 18+ messages in thread
From: Cam Macdonell @ 2010-03-26 15:52 UTC (permalink / raw)
To: Jamie Lokier; +Cc: Avi Kivity, kvm, qemu-devel

On Thu, Mar 25, 2010 at 7:32 PM, Jamie Lokier <jamie@shareable.org> wrote:
> Cam Macdonell wrote:
>> An irqfd can only trigger a single vector in a guest.  Right now I
>> only have one eventfd per guest, so ioeventfd/irqfd restricts the
>> current implementation to a single vector that a guest can trigger.
>> Without irqfd, eventfds can be used like registers: a guest writes the
>> number of the vector it wants to trigger, but as you point out that is
>> racy.
>
> It's not racy if you use a pipe instead of an eventfd. :-)
>
> Actually, why not?  A byte pipe between guests would be more versatile.

A pipe between guests would be quite versatile, but it's a design
orthogonal to shared memory.  With this device, the shared memory
region is how data should be shared and communicated; the interrupts
are there to help with synchronization.

>
> Could it even integrate with virtio-serial, somehow?
>

Could virtio-serial be used as-is to support a pipe between guests?  If
a user wanted shared memory and a pipe, they could set up two devices
and use them together.

Cam

^ permalink raw reply	[flat|nested] 18+ messages in thread
end of thread, other threads:[~2010-03-26 15:56 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
2010-03-25  6:08 [Qemu-devel] [PATCH v3 0/2] Inter-VM shared memory PCI device Cam Macdonell
2010-03-25  6:08 ` [Qemu-devel] [PATCH v3 1/2] Support adding a file to qemu's ram allocation Cam Macdonell
2010-03-25  6:08 ` [Qemu-devel] [PATCH v3 2/2] Inter-VM shared memory PCI device Cam Macdonell
2010-03-25  9:04 ` [Qemu-devel] Re: [PATCH v3 0/2] " Avi Kivity
2010-03-25  9:21 ` Michael S. Tsirkin
2010-03-25 16:11 ` Cam Macdonell
2010-03-25  9:26 ` Markus Armbruster
2010-03-25  9:37 ` Avi Kivity
2010-03-25 16:50 ` Cam Macdonell
2010-03-25 17:02 ` Avi Kivity
2010-03-25 17:35 ` Cam Macdonell
2010-03-25 17:48 ` Avi Kivity
2010-03-25 18:17 ` Cam Macdonell
2010-03-25 21:10 ` Avi Kivity
2010-03-25 23:05 ` Cam Macdonell
2010-03-26 15:56 ` Avi Kivity
2010-03-26  1:32 ` Jamie Lokier
2010-03-26 15:52 ` Cam Macdonell