qemu-devel.nongnu.org archive mirror
* [Qemu-devel] [RFC 1/2] pci-dma-api-v1
@ 2008-11-27 12:35 Andrea Arcangeli
  2008-11-27 12:43 ` [Qemu-devel] [RFC 2/2] bdrv_aio_readv/writev_em Andrea Arcangeli
  2008-11-27 19:14 ` [Qemu-devel] [RFC 1/2] pci-dma-api-v1 Blue Swirl
  0 siblings, 2 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2008-11-27 12:35 UTC (permalink / raw)
  To: qemu-devel

Hello everyone,

One major limitation for KVM today is the lack of a proper way to
write drivers that lets the host OS use direct DMA to guest physical
memory, avoiding any intermediate copy. The only API provided to
drivers is cpu_physical_memory_rw, which forces all drivers to bounce,
trashing CPU caches and leaving them memory bound. This new DMA API
instead lets drivers use a pci_dma_sg method for SG I/O: it translates
the guest physical addresses to host virtual addresses and invokes two
operations, a submit method and a complete method. pci_dma_sg may have
to bounce buffer internally, and to limit the max bounce size it may
have to submit the I/O in pieces with multiple submit calls. The patch
adapts the ide.c HD driver to use this; once the cdrom code is
converted too, dma_buf_rw can be eliminated. As you can see, the new
ide_dma_submit and ide_dma_complete code is much more readable than
the previous rearming callback.

This is only tested with KVM so far, but qemu builds; in general
there's nothing kvm-specific here (with the exception of a single
kvm_enabled() check), so it should work well for both.

All we care about is the performance of the direct path, so I tried to
avoid dynamic allocations there to avoid entering glibc. The current
logic doesn't satisfy me yet, but it should at least be faster than
calling malloc (I'm still working on it, to avoid memory waste by
detecting when more than one iovec should be cached). In case of
instabilities, the first thing I recommend is to set MAX_IOVEC_IOVCNT
to 0 to disable that logic ;). I also recommend testing with
DEBUG_BOUNCE and with a 512-byte max bounce buffer. It's running
stable in all modes so far. However, if ide.c ends up calling
aio_cancel things will likely fall apart, but that is all because of
bdrv_aio_readv/writev, and the astonishing lack of aio_readv/writev in
glibc!

Once we finish fixing storage performance with a real
bdrv_aio_readv/writev (now a blocker issue), a pci_dma_single can be
added for zero-copy networking (one NIC per VM, or VMDq, IOV,
etc.). The DMA API should allow for that too.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

diff --git a/qemu/hw/pci_dma.c b/qemu/hw/pci_dma.c
new file mode 100644
index 0000000..2e5f100
--- /dev/null
+++ b/qemu/hw/pci_dma.c
@@ -0,0 +1,346 @@
+/*
+ * QEMU PCI DMA operations
+ *
+ * Copyright (c) 2008 Red Hat, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#include "pci_dma.h"
+
+#define MAX_DMA_BOUNCE_BUFFER (1024*1024)
+//#define DEBUG_BOUNCE
+//#define MAX_DMA_BOUNCE_BUFFER (512)
+
+/*
+ * Too many entries will run slower and waste memory; this is really
+ * only about the fast path, so it must be small. The slow path is
+ * fine to allocate dynamically. Max memory used is around
+ * MAX_IOVEC_ENTRIES * MAX_IOVEC_IOVCNT * sizeof(struct iovec).
+ */
+#define MAX_IOVEC_ENTRIES 5
+
+/*
+ * Don't cache exceptionally large iovcnt values used for huge DMA
+ * transfers: such a transfer takes much longer than malloc, and huge
+ * amounts of memory could be wasted if it happens only once in a while.
+ */
+#define MAX_IOVEC_IOVCNT 2048
+
+struct iovec_entry {
+    int iovcnt;
+    struct iovec_entry *next;
+    struct iovec iov;
+};
+
+struct iovec_entry *iovec_list;
+
+static struct iovec *qemu_get_iovec(int iovcnt)
+{
+    struct iovec_entry *last = NULL, *entry = iovec_list;
+
+    while (entry) {
+	if (entry->iovcnt >= iovcnt) {
+	    /* unlink it so qemu_release_iovec can't insert it twice */
+	    if (last)
+		last->next = entry->next;
+	    else
+		iovec_list = entry->next;
+	    return &entry->iov;
+	}
+	last = entry;
+	entry = entry->next;
+    }
+
+    entry = qemu_malloc(sizeof(struct iovec_entry) +
+			sizeof(struct iovec) * (iovcnt-1));
+    if (!entry)
+	return NULL;
+    entry->iovcnt = iovcnt;
+
+    return &entry->iov;
+}
+
+static void qemu_release_iovec(struct iovec *iov)
+{
+    struct iovec_entry *this, *min_entry = NULL, *entry = iovec_list;
+    struct iovec_entry *min_last = NULL, *last = NULL;
+    unsigned int min_iovcnt = -1, nr = 0;
+
+    if (!iov)
+	return;
+
+    this = (struct iovec_entry *)
+	(((char*)iov)-(unsigned long)(&((struct iovec_entry *)0)->iov));
+
+    if (this->iovcnt > MAX_IOVEC_IOVCNT) {
+	qemu_free(this);
+	return;
+    }
+
+    while (entry) {
+	nr += 1;
+	if ((unsigned int)entry->iovcnt < min_iovcnt) {
+	    min_entry = entry;
+	    min_last = last;
+	    min_iovcnt = entry->iovcnt;
+	}
+	last = entry;
+	entry = entry->next;
+    }
+
+    if (nr > MAX_IOVEC_ENTRIES) {
+	/* detail: replace even if it's equal as it's cache hot */
+	if (this->iovcnt < min_iovcnt)
+	    qemu_free(this);
+	else {
+	    if (min_entry == iovec_list) {
+		this->next = iovec_list->next;
+		iovec_list = this;
+	    } else {
+		min_last->next = min_entry->next;
+		this->next = iovec_list;
+		iovec_list = this;
+	    }
+	    qemu_free(min_entry);
+	}
+    } else {
+	if (!iovec_list) {
+	    this->next = NULL;
+	    iovec_list = this;
+	} else {
+	    this->next = iovec_list;
+	    iovec_list = this;
+	}
+    }
+}
+
+static struct iovec *pci_dma_sg_map_direct(QEMUPciDmaSg *sg,
+					   int iovcnt,
+					   int dma_to_memory,
+					   int alignment,
+					   size_t *len)
+{
+    int idx = 0;
+    struct iovec *dma_iov;
+    size_t _len = 0;
+
+#ifdef DEBUG_BOUNCE
+    return NULL;
+#endif
+
+    /* fixme: must not call malloc here; cache iovecs in some faster queue */
+    dma_iov = qemu_get_iovec(iovcnt);
+    if (!dma_iov)
+	goto out;
+
+    for (idx = 0; idx < iovcnt; idx++) {
+	void * addr;
+
+	if (_len + sg[idx].len <= _len)
+		goto err;
+	_len += sg[idx].len;
+
+	addr = cpu_physical_memory_can_dma(sg[idx].addr,
+					   sg[idx].len,
+					   dma_to_memory,
+					   alignment);
+	if (!addr)
+	    goto err;
+
+	dma_iov[idx].iov_base = addr;
+	dma_iov[idx].iov_len = sg[idx].len;
+    }
+
+    *len = _len;
+ out:
+    return dma_iov;
+err:
+    qemu_release_iovec(dma_iov);
+    dma_iov = NULL;
+    goto out;
+}
+
+static struct iovec *pci_dma_sg_map_bounce(QEMUPciDmaSgParam *param)
+{
+    int idx;
+    size_t len = 0;
+
+    param->curr_restart_iovcnt = param->restart_iovcnt;
+    param->curr_restart_offset = param->restart_offset;
+
+    for (idx = param->restart_iovcnt; idx < param->iovcnt; idx++) {
+	if (len & (param->alignment-1))
+	    return NULL;
+	if (len + param->sg[idx].len <= len)
+	    return NULL;
+	len += param->sg[idx].len - param->restart_offset;
+	param->restart_offset = 0;
+	if (len > MAX_DMA_BOUNCE_BUFFER) {
+	    size_t leftover = len - MAX_DMA_BOUNCE_BUFFER;
+	    param->restart_offset = param->sg[idx].len - leftover;
+	    len = MAX_DMA_BOUNCE_BUFFER;
+	    break;
+	}
+    }
+    param->restart_iovcnt = idx;
+    param->curr_len = len;
+
+    param->linearized.iov_len = len;
+    if (!param->bounce) {
+	param->bounce = qemu_memalign(param->alignment, len);
+	if (!param->bounce)
+	    return NULL;
+	param->linearized.iov_base = param->bounce;
+    }
+
+    if (!param->dma_to_memory) {
+	int idx;
+	size_t offset = 0;
+	for (idx = param->curr_restart_iovcnt;
+	     idx < param->iovcnt && offset < len; idx++) {
+	    size_t copy_len = param->sg[idx].len - param->curr_restart_offset;
+	    if (offset+copy_len > len)
+		copy_len = len - offset;
+	    cpu_physical_memory_read(param->sg[idx].addr + 
+				     param->curr_restart_offset,
+				     param->bounce + offset,
+				     copy_len);
+	    param->curr_restart_offset = 0;
+	    offset += copy_len;
+	}
+    }
+
+    return &param->linearized;
+}
+
+static void pci_dma_sg_unmap_direct(QEMUPciDmaSgParam *param, int ret)
+{
+    if (!ret && param->dma_to_memory) {
+	int idx;
+	QEMUPciDmaSg *sg = param->sg;
+	for (idx = 0; idx < param->iovcnt; idx++)
+	    cpu_physical_memory_write_post_dma(sg[idx].addr,
+					       sg[idx].len);
+    }
+    qemu_release_iovec(param->dma_iov);
+}
+
+int pci_dma_sg_unmap_bounce(QEMUPciDmaSgParam *param, int ret)
+{
+    if (!ret && param->dma_to_memory) {
+	int idx;
+	size_t offset = 0;
+	for (idx = param->curr_restart_iovcnt;
+	     idx < param->iovcnt && offset < param->curr_len; idx++) {
+	    size_t copy_len = param->sg[idx].len - param->curr_restart_offset;
+	    if (offset+copy_len > param->curr_len)
+		copy_len = param->curr_len - offset;
+	    cpu_physical_memory_write(param->sg[idx].addr +
+				      param->curr_restart_offset,
+				      param->bounce + offset,
+				      copy_len);
+	    param->curr_restart_offset = 0;
+	    offset += copy_len;
+	}
+    }
+    if (param->restart_iovcnt == param->iovcnt || ret) {
+	qemu_free(param->bounce);
+	return 0;
+    }
+    return 1;
+}
+
+static void pci_dma_sg_cb(void *opaque, int ret)
+{
+    QEMUPciDmaSgParam *param = opaque;
+    int restart = 0;
+
+    if (!param->bounce)
+	pci_dma_sg_unmap_direct(param, ret);
+    else
+	restart = pci_dma_sg_unmap_bounce(param, ret);
+
+    if (restart) {
+	ret = -1;
+	param->dma_iov = pci_dma_sg_map_bounce(param);
+	if (!param->dma_iov)
+	    goto out_free;
+	ret = param->pci_dma_sg_submit(param->pci_dma_sg_opaque,
+				       param->dma_iov, 1,
+				       param->curr_len,
+				       pci_dma_sg_cb,
+				       param);
+    }
+    if (ret || !restart) {
+    out_free:
+	param->pci_dma_sg_complete(param->pci_dma_sg_opaque, ret);
+	qemu_free(param);
+    }
+}
+
+/* PCIDevice is there in case we want to emulate an iommu later */
+void pci_dma_sg(PCIDevice *pci_dev,
+		QEMUPciDmaSg *sg, int iovcnt,
+		QEMUPciDmaSgSubmit pci_dma_sg_submit,
+		QEMUPciDmaSgComplete pci_dma_sg_complete,
+		void *pci_dma_sg_opaque,
+		int dma_to_memory, int alignment)
+{
+    int ret = -1;
+    QEMUPciDmaSgParam *param;
+
+    if ((unsigned int) dma_to_memory > 1)
+	goto err;
+    if (alignment < 0)
+	goto err;
+    if (iovcnt < 1)
+	goto err;
+
+    param = qemu_malloc(sizeof(QEMUPciDmaSgParam));
+    if (!param)
+	goto err;
+
+    param->pci_dma_sg_submit = pci_dma_sg_submit;
+    param->pci_dma_sg_complete = pci_dma_sg_complete;
+    param->pci_dma_sg_opaque = pci_dma_sg_opaque;
+    param->dma_to_memory = dma_to_memory;
+    param->alignment = alignment;
+    param->bounce = NULL;
+    param->sg = sg;
+    param->iovcnt = iovcnt;
+    param->restart_offset = param->restart_iovcnt = 0;
+
+    /* map the sg */
+    param->dma_iov = pci_dma_sg_map_direct(sg, iovcnt,
+					   dma_to_memory, alignment,
+					   &param->curr_len);
+    if (!param->dma_iov) {
+	param->dma_iov = pci_dma_sg_map_bounce(param);
+	if (!param->dma_iov)
+	    goto out_free;
+	iovcnt = 1;
+    }
+
+    /* run the I/O */
+    ret = pci_dma_sg_submit(pci_dma_sg_opaque,
+			    param->dma_iov, iovcnt, param->curr_len,
+			    pci_dma_sg_cb,
+			    param);
+    if (ret)
+    out_free:
+	pci_dma_sg_cb(param, ret);
+    return;
+
+ err:
+    pci_dma_sg_complete(pci_dma_sg_opaque, ret);
+    return;
+}
diff --git a/qemu/hw/pci_dma.h b/qemu/hw/pci_dma.h
new file mode 100644
index 0000000..9ea606d
--- /dev/null
+++ b/qemu/hw/pci_dma.h
@@ -0,0 +1,47 @@
+#ifndef QEMU_PCI_DMA_H
+#define QEMU_PCI_DMA_H
+
+#include "qemu-common.h"
+#include "block.h"
+#include <sys/uio.h> /* struct iovec */
+
+typedef int QEMUPciDmaSgSubmit(void *pci_dma_sg_opaque,
+			       struct iovec *iov, int iovcnt,
+			       size_t len,
+			       BlockDriverCompletionFunc dma_cb,
+			       void *dma_cb_param);
+
+typedef void QEMUPciDmaSgComplete(void *pci_dma_sg_opaque, int ret);
+
+typedef struct QEMUPciDmaSg {
+    target_phys_addr_t addr;
+    size_t len;
+} QEMUPciDmaSg;
+
+typedef struct QEMUPciDmaSgParam {
+    QEMUPciDmaSgSubmit *pci_dma_sg_submit;
+    QEMUPciDmaSgComplete *pci_dma_sg_complete;
+    void *pci_dma_sg_opaque;
+    int dma_to_memory;
+    int alignment;
+    uint8_t *bounce;
+    QEMUPciDmaSg *sg;
+    int iovcnt;
+    int restart_iovcnt;
+    size_t restart_offset;
+    int curr_restart_iovcnt;
+    size_t curr_restart_offset;
+    size_t curr_len;
+    struct iovec *dma_iov;
+    struct iovec linearized;
+} QEMUPciDmaSgParam;
+
+/* pci_dma.c */
+void pci_dma_sg(PCIDevice *pci_dev,
+		QEMUPciDmaSg *sg, int iovcnt,
+		QEMUPciDmaSgSubmit *pci_dma_sg_submit,
+		QEMUPciDmaSgComplete *pci_dma_sg_complete,
+		void *pci_dma_sg_opaque,
+		int dma_to_memory, int alignment);
+
+#endif
Index: block_int.h
===================================================================
--- block_int.h	(revision 5799)
+++ block_int.h	(working copy)
@@ -55,6 +55,8 @@
         int64_t sector_num, const uint8_t *buf, int nb_sectors,
         BlockDriverCompletionFunc *cb, void *opaque);
     void (*bdrv_aio_cancel)(BlockDriverAIOCB *acb);
+    BlockDriverAIOIOV *bdrv_aio_readv;
+    BlockDriverAIOIOV *bdrv_aio_writev;
     int aiocb_size;
 
     const char *protocol_name;
Index: Makefile.target
===================================================================
--- Makefile.target	(revision 5799)
+++ Makefile.target	(working copy)
@@ -659,7 +659,7 @@
 
 ifeq ($(TARGET_BASE_ARCH), i386)
 # Hardware support
-OBJS+= ide.o pckbd.o ps2.o vga.o $(SOUND_HW) dma.o
+OBJS+= ide.o pckbd.o ps2.o vga.o $(SOUND_HW) dma.o pci_dma.o
 OBJS+= fdc.o mc146818rtc.o serial.o i8259.o i8254.o pcspk.o pc.o
 OBJS+= cirrus_vga.o apic.o parallel.o acpi.o piix_pci.o
 OBJS+= usb-uhci.o vmmouse.o vmport.o vmware_vga.o
@@ -668,7 +668,7 @@
 ifeq ($(TARGET_BASE_ARCH), ppc)
 CPPFLAGS += -DHAS_AUDIO -DHAS_AUDIO_CHOICE
 # shared objects
-OBJS+= ppc.o ide.o vga.o $(SOUND_HW) dma.o openpic.o
+OBJS+= ppc.o ide.o vga.o $(SOUND_HW) dma.o openpic.o pci_dma.o
 # PREP target
 OBJS+= pckbd.o ps2.o serial.o i8259.o i8254.o fdc.o m48t59.o mc146818rtc.o
 OBJS+= prep_pci.o ppc_prep.o
@@ -686,7 +686,7 @@
 OBJS+= mips_timer.o mips_int.o dma.o vga.o serial.o i8254.o i8259.o rc4030.o
 OBJS+= g364fb.o jazz_led.o
 OBJS+= ide.o gt64xxx.o pckbd.o ps2.o fdc.o mc146818rtc.o usb-uhci.o acpi.o ds1225y.o
-OBJS+= piix_pci.o parallel.o cirrus_vga.o pcspk.o $(SOUND_HW)
+OBJS+= piix_pci.o parallel.o cirrus_vga.o pcspk.o pci_dma.o $(SOUND_HW)
 OBJS+= mipsnet.o
 OBJS+= pflash_cfi01.o
 CPPFLAGS += -DHAS_AUDIO -DHAS_AUDIO_CHOICE
@@ -710,7 +710,7 @@
 else
 OBJS+= sun4m.o tcx.o pcnet.o iommu.o m48t59.o slavio_intctl.o
 OBJS+= slavio_timer.o slavio_serial.o slavio_misc.o fdc.o sparc32_dma.o
-OBJS+= cs4231.o ptimer.o eccmemctl.o sbi.o sun4c_intctl.o
+OBJS+= cs4231.o ptimer.o eccmemctl.o sbi.o sun4c_intctl.o pci_dma.o
 endif
 endif
 ifeq ($(TARGET_BASE_ARCH), arm)
@@ -731,7 +731,7 @@
 OBJS+= nseries.o blizzard.o onenand.o vga.o cbus.o tusb6010.o usb-musb.o
 OBJS+= tsc2005.o bt-hci-csr.o
 OBJS+= mst_fpga.o mainstone.o
-OBJS+= musicpal.o pflash_cfi02.o
+OBJS+= musicpal.o pflash_cfi02.o pci_dma.o
 CPPFLAGS += -DHAS_AUDIO
 endif
 ifeq ($(TARGET_BASE_ARCH), sh4)
Index: exec.c
===================================================================
--- exec.c	(revision 5799)
+++ exec.c	(working copy)
@@ -2807,7 +2807,7 @@
 /* physical memory access (slow version, mainly for debug) */
 #if defined(CONFIG_USER_ONLY)
 void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
-                            int len, int is_write)
+                            size_t len, int is_write)
 {
     int l, flags;
     target_ulong page;
@@ -2848,7 +2848,7 @@
 
 #else
 void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
-                            int len, int is_write)
+                            size_t len, int is_write)
 {
     int l, io_index;
     uint8_t *ptr;
@@ -2938,6 +2938,111 @@
     }
 }
 
+uint8_t *cpu_physical_memory_can_dma(target_phys_addr_t addr,
+				     size_t len, int is_write,
+				     int alignment)
+{
+    int l, first = 1;
+    uint8_t *ptr = NULL;
+    target_phys_addr_t page;
+    unsigned long pd, pd_first = 0;
+    PhysPageDesc *p;
+
+    if (len & (alignment-1))
+	goto out;
+
+    while (len > 0) {
+	page = addr & TARGET_PAGE_MASK;
+	p = phys_page_find(page >> TARGET_PAGE_BITS);
+
+	l = (page + TARGET_PAGE_SIZE) - addr;
+        if (l > len)
+            l = len;
+
+        if (!p)
+            pd = IO_MEM_UNASSIGNED;
+        else
+            pd = p->phys_offset;
+
+        if (is_write) {
+            if ((pd & ~TARGET_PAGE_MASK) != IO_MEM_RAM)
+		return NULL;
+        } else {
+            if ((pd & ~TARGET_PAGE_MASK) > IO_MEM_ROM &&
+                !(pd & IO_MEM_ROMD))
+		return NULL;
+        }
+
+	if (first) {
+	    first = 0;
+	    ptr = phys_ram_base + (pd & TARGET_PAGE_MASK) +
+		(addr & ~TARGET_PAGE_MASK);
+	    if ((unsigned long)ptr & (alignment-1))
+		return NULL;
+	    pd_first = pd;
+	}
+
+	/* nonlinear range */
+	if (pd_first != pd)
+	    return NULL;
+	pd_first += TARGET_PAGE_SIZE;
+
+        len -= l;
+        addr += l;
+    }
+
+out:
+    return ptr;
+}
+
+void cpu_physical_memory_write_post_dma(target_phys_addr_t addr,
+					size_t len)
+{
+    int l;
+    uint8_t *ptr;
+    target_phys_addr_t page;
+    unsigned long pd;
+    PhysPageDesc *p;
+
+    while (len > 0) {
+        page = addr & TARGET_PAGE_MASK;
+
+        l = (page + TARGET_PAGE_SIZE) - addr;
+        if (l > len)
+            l = len;
+
+        p = phys_page_find(page >> TARGET_PAGE_BITS);
+        if (!p)
+            pd = IO_MEM_UNASSIGNED;
+        else
+            pd = p->phys_offset;
+
+	if ((pd & ~TARGET_PAGE_MASK) != IO_MEM_RAM) {
+	    printf("ERROR cpu_physical_memory_write_post_dma: memory layout changed\n");
+	    /* no "continue" here: len and addr must still advance below */
+	} else {
+	    unsigned long addr1;
+	    addr1 = (pd & TARGET_PAGE_MASK) + (addr & ~TARGET_PAGE_MASK);
+	    /* RAM case */
+	    ptr = phys_ram_base + addr1;
+	    if (!cpu_physical_memory_is_dirty(addr1)) {
+		/* invalidate code */
+		tb_invalidate_phys_page_range(addr1, addr1 + l, 0);
+		/* set dirty bit */
+		phys_ram_dirty[addr1 >> TARGET_PAGE_BITS] |=
+		    (0xff & ~CODE_DIRTY_FLAG);
+	    }
+	    /* qemu doesn't execute guest code directly, but kvm does,
+	       therefore flush instruction caches */
+	    if (kvm_enabled())
+		flush_icache_range((unsigned long)ptr,
+				   ((unsigned long)ptr)+l);
+        }
+        len -= l;
+        addr += l;
+    }
+}
+
 /* used for ROM loading : can write in RAM and ROM */
 void cpu_physical_memory_write_rom(target_phys_addr_t addr,
                                    const uint8_t *buf, int len)
Index: block.c
===================================================================
--- block.c	(revision 5799)
+++ block.c	(working copy)
@@ -1291,7 +1307,51 @@
     drv->bdrv_aio_cancel(acb);
 }
 
+BlockDriverAIOCB *bdrv_aio_readv(BlockDriverState *bs, int64_t sector_num,
+				 struct iovec *iov, int iovcnt, size_t len,
+				 BlockDriverCompletionFunc *cb, void *opaque)
+{
+    BlockDriver *drv = bs->drv;
+    BlockDriverAIOCB *ret;
+
+    if (!drv)
+        return NULL;
+
+    ret = drv->bdrv_aio_readv(bs, sector_num, iov, iovcnt, len, cb, opaque);
+
+    if (ret) {
+	/* Update stats even though technically transfer has not happened. */
+	bs->rd_bytes += (unsigned) len;
+	bs->rd_ops ++;
+    }
+
+    return ret;
+}
+
+BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
+				  struct iovec *iov, int iovcnt, size_t len,
+				  BlockDriverCompletionFunc *cb, void *opaque)
+{
+    BlockDriver *drv = bs->drv;
+    BlockDriverAIOCB *ret;
+
+    if (!drv)
+        return NULL;
+
+    ret = drv->bdrv_aio_writev(bs, sector_num, iov, iovcnt, len, cb, opaque);
+
+    if (ret) {
+	/* Update stats even though technically the transfer has not happened. */
+	bs->wr_bytes += (unsigned) len;
+	bs->wr_ops ++;
+    }
+
+    return ret;
+}
+
 /**************************************************************/
 /* async block device emulation */
 
Index: block.h
===================================================================
--- block.h	(revision 5799)
+++ block.h	(working copy)
@@ -83,6 +83,13 @@
 /* async block I/O */
 typedef struct BlockDriverAIOCB BlockDriverAIOCB;
 typedef void BlockDriverCompletionFunc(void *opaque, int ret);
+typedef BlockDriverAIOCB *BlockDriverAIOIOV(BlockDriverState *bs,
+					    int64_t sector_num,
+					    struct iovec *iov,
+					    int iovnct,
+					    size_t len,
+					    BlockDriverCompletionFunc *cb,
+					    void *opaque);
 
 BlockDriverAIOCB *bdrv_aio_read(BlockDriverState *bs, int64_t sector_num,
                                 uint8_t *buf, int nb_sectors,
@@ -91,6 +98,12 @@
                                  const uint8_t *buf, int nb_sectors,
                                  BlockDriverCompletionFunc *cb, void *opaque);
 void bdrv_aio_cancel(BlockDriverAIOCB *acb);
+BlockDriverAIOCB *bdrv_aio_readv(BlockDriverState *bs, int64_t sector_num,
+				 struct iovec *iov, int iovnct, size_t len,
+				 BlockDriverCompletionFunc *cb, void *opaque);
+BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
+				  struct iovec *iov, int iovnct, size_t len,
+				  BlockDriverCompletionFunc *cb, void *opaque);
 
 int qemu_key_check(BlockDriverState *bs, const char *name);
 
Index: hw/ide.c
===================================================================
--- hw/ide.c	(revision 5799)
+++ hw/ide.c	(working copy)
@@ -31,6 +31,7 @@
 #include "qemu-timer.h"
 #include "sysemu.h"
 #include "ppc_mac.h"
+#include "pci_dma.h"
 
 /* debug IDE devices */
 //#define DEBUG_IDE
@@ -480,12 +481,14 @@
     struct PCIIDEState *pci_dev;
     /* current transfer state */
     uint32_t cur_addr;
-    uint32_t cur_prd_last;
-    uint32_t cur_prd_addr;
-    uint32_t cur_prd_len;
+    uint32_t cur_prd_last; /* fixme delete */
+    uint32_t cur_prd_addr; /* fixme delete */
+    uint32_t cur_prd_len; /* fixme delete */
     IDEState *ide_if;
     BlockDriverCompletionFunc *dma_cb;
     BlockDriverAIOCB *aiocb;
+    QEMUPciDmaSg sg[IDE_DMA_BUF_SECTORS];
+    BlockDriverAIOIOV *bdrv_aio_iov;
 } BMDMAState;
 
 typedef struct PCIIDEState {
@@ -870,7 +873,7 @@
 }
 
 /* return 0 if buffer completed */
-static int dma_buf_rw(BMDMAState *bm, int is_write)
+static int dma_buf_rw(BMDMAState *bm, int is_write) /* fixme delete */
 {
     IDEState *s = bm->ide_if;
     struct {
@@ -917,61 +920,80 @@
     return 1;
 }
 
-static void ide_read_dma_cb(void *opaque, int ret)
+static int build_dma_sg(BMDMAState *bm)
 {
-    BMDMAState *bm = opaque;
-    IDEState *s = bm->ide_if;
-    int n;
-    int64_t sector_num;
+    struct {
+        uint32_t addr;
+        uint32_t size;
+    } prd;
+    int len;
+    int idx;
 
-    if (ret < 0) {
-	ide_dma_error(s);
-	return;
+    for (idx = 1; idx <= IDE_DMA_BUF_SECTORS; idx++) {
+	cpu_physical_memory_read(bm->cur_addr, (uint8_t *)&prd, 8);
+	bm->cur_addr += 8;
+	bm->sg[idx-1].addr = le32_to_cpu(prd.addr);
+	prd.size = le32_to_cpu(prd.size);
+	len = prd.size & 0xfffe;
+	if (len == 0)
+	    len = 0x10000;
+	bm->sg[idx-1].len = len;
+	/* end of table (with a fail safe of one page) */
+	if ((prd.size & 0x80000000) ||
+	    (bm->cur_addr - bm->addr) >= 4096)
+	    break;
     }
+    return idx;
+}
 
-    n = s->io_buffer_size >> 9;
-    sector_num = ide_get_sector(s);
-    if (n > 0) {
-        sector_num += n;
-        ide_set_sector(s, sector_num);
-        s->nsector -= n;
-        if (dma_buf_rw(bm, 1) == 0)
-            goto eot;
-    }
+static void ide_dma_complete(void *opaque, int ret)
+{
+    BMDMAState *bm = opaque;
+    IDEState *s = bm->ide_if;
 
+    bm->bdrv_aio_iov = NULL;
+    bm->dma_cb = NULL;
+    bm->ide_if = NULL;
+    bm->aiocb = NULL;
     /* end of transfer ? */
-    if (s->nsector == 0) {
+    if (s->nsector == 0 && !ret) {
         s->status = READY_STAT | SEEK_STAT;
         ide_set_irq(s);
-    eot:
         bm->status &= ~BM_STATUS_DMAING;
         bm->status |= BM_STATUS_INT;
-        bm->dma_cb = NULL;
-        bm->ide_if = NULL;
-        bm->aiocb = NULL;
-        return;
+    } else {
+	ide_dma_error(s);
+	printf("ide_dma_complete error: nsector %d err %d\n", s->nsector, ret);
     }
+}
 
-    /* launch next transfer */
-    n = s->nsector;
-    if (n > IDE_DMA_BUF_SECTORS)
-        n = IDE_DMA_BUF_SECTORS;
-    s->io_buffer_index = 0;
-    s->io_buffer_size = n * 512;
+static int ide_dma_submit(void *opaque, struct iovec *dma_iov,
+			  int iovcnt, size_t len,
+			  BlockDriverCompletionFunc dma_cb,
+			  void *dma_cb_param)
+{
+    BMDMAState *bm = opaque;
+    IDEState *s = bm->ide_if;
+    uint32_t sectors;
+    int64_t sector_num;
+
+    sectors = len >> 9;
+    if (s->nsector < sectors || !s->nsector)
+	return -1;
+
+    sector_num = ide_get_sector(s);
+    ide_set_sector(s, sector_num  + sectors);
+    s->nsector -= sectors;
+
 #ifdef DEBUG_AIO
-    printf("aio_read: sector_num=%lld n=%d\n", sector_num, n);
+    printf("ide_dma_submit: sector_num=%lld n=%d\n", sector_num, sectors);
 #endif
-    bm->aiocb = bdrv_aio_read(s->bs, sector_num, s->io_buffer, n,
-                              ide_read_dma_cb, bm);
-    ide_dma_submit_check(s, ide_read_dma_cb, bm);
-}
+    bm->aiocb = bm->bdrv_aio_iov(s->bs, sector_num, dma_iov, iovcnt, len,
+				 dma_cb, dma_cb_param);
+    if (!bm->aiocb)
+	return -1;
 
-static void ide_sector_read_dma(IDEState *s)
-{
-    s->status = READY_STAT | SEEK_STAT | DRQ_STAT | BUSY_STAT;
-    s->io_buffer_index = 0;
-    s->io_buffer_size = 0;
-    ide_dma_start(s, ide_read_dma_cb);
+    return 0;
 }
 
 static void ide_sector_write_timer_cb(void *opaque)
@@ -1028,64 +1050,29 @@
     }
 }
 
-static void ide_write_dma_cb(void *opaque, int ret)
+static void ide_sector_dma(IDEState *s, BlockDriverAIOIOV *bdrv_aio_iov,
+			   int dma_to_memory)
 {
-    BMDMAState *bm = opaque;
-    IDEState *s = bm->ide_if;
-    int n;
-    int64_t sector_num;
+    int iovcnt;
+    BMDMAState *bm = s->bmdma;
+    if (!bm)
+	goto err;
 
-    if (ret < 0) {
-	ide_dma_error(s);
-	return;
-    }
+    s->status = READY_STAT | SEEK_STAT | DRQ_STAT | BUSY_STAT;
+    bm->ide_if = s;
+    iovcnt = build_dma_sg(s->bmdma);
+    bm->bdrv_aio_iov = bdrv_aio_iov;
+    if (iovcnt > IDE_DMA_BUF_SECTORS)
+	goto err;
+    pci_dma_sg((PCIDevice *)bm->pci_dev, bm->sg, iovcnt,
+	       ide_dma_submit, ide_dma_complete, bm,
+	       dma_to_memory, 512);
+    return;
 
-    n = s->io_buffer_size >> 9;
-    sector_num = ide_get_sector(s);
-    if (n > 0) {
-        sector_num += n;
-        ide_set_sector(s, sector_num);
-        s->nsector -= n;
-    }
-
-    /* end of transfer ? */
-    if (s->nsector == 0) {
-        s->status = READY_STAT | SEEK_STAT;
-        ide_set_irq(s);
-    eot:
-        bm->status &= ~BM_STATUS_DMAING;
-        bm->status |= BM_STATUS_INT;
-        bm->dma_cb = NULL;
-        bm->ide_if = NULL;
-        bm->aiocb = NULL;
-        return;
-    }
-
-    /* launch next transfer */
-    n = s->nsector;
-    if (n > IDE_DMA_BUF_SECTORS)
-        n = IDE_DMA_BUF_SECTORS;
-    s->io_buffer_index = 0;
-    s->io_buffer_size = n * 512;
-
-    if (dma_buf_rw(bm, 0) == 0)
-        goto eot;
-#ifdef DEBUG_AIO
-    printf("aio_write: sector_num=%lld n=%d\n", sector_num, n);
-#endif
-    bm->aiocb = bdrv_aio_write(s->bs, sector_num, s->io_buffer, n,
-                               ide_write_dma_cb, bm);
-    ide_dma_submit_check(s, ide_write_dma_cb, bm);
+err:
+    ide_dma_error(s);
 }
 
-static void ide_sector_write_dma(IDEState *s)
-{
-    s->status = READY_STAT | SEEK_STAT | DRQ_STAT | BUSY_STAT;
-    s->io_buffer_index = 0;
-    s->io_buffer_size = 0;
-    ide_dma_start(s, ide_write_dma_cb);
-}
-
 static void ide_atapi_cmd_ok(IDEState *s)
 {
     s->error = 0;
@@ -2219,7 +2206,7 @@
             if (!s->bs)
                 goto abort_cmd;
 	    ide_cmd_lba48_transform(s, lba48);
-            ide_sector_read_dma(s);
+            ide_sector_dma(s, bdrv_aio_readv, 1);
             break;
 	case WIN_WRITEDMA_EXT:
 	    lba48 = 1;
@@ -2228,7 +2215,7 @@
             if (!s->bs)
                 goto abort_cmd;
 	    ide_cmd_lba48_transform(s, lba48);
-            ide_sector_write_dma(s);
+            ide_sector_dma(s, bdrv_aio_writev, 0);
             s->media_changed = 1;
             break;
         case WIN_READ_NATIVE_MAX_EXT:
Index: cpu-all.h
===================================================================
--- cpu-all.h	(revision 5799)
+++ cpu-all.h	(working copy)
@@ -891,14 +891,19 @@
 CPUReadMemoryFunc **cpu_get_io_memory_read(int io_index);
 
 void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
-                            int len, int is_write);
+                            size_t len, int is_write);
+void cpu_physical_memory_write_post_dma(target_phys_addr_t addr,
+					size_t len);
+uint8_t *cpu_physical_memory_can_dma(target_phys_addr_t addr,
+				     size_t len, int is_write,
+				     int alignment);
 static inline void cpu_physical_memory_read(target_phys_addr_t addr,
-                                            uint8_t *buf, int len)
+                                            uint8_t *buf, size_t len)
 {
     cpu_physical_memory_rw(addr, buf, len, 0);
 }
 static inline void cpu_physical_memory_write(target_phys_addr_t addr,
-                                             const uint8_t *buf, int len)
+                                             const uint8_t *buf, size_t len)
 {
     cpu_physical_memory_rw(addr, (uint8_t *)buf, len, 1);
 }


* [Qemu-devel] [RFC 2/2] bdrv_aio_readv/writev_em
  2008-11-27 12:35 [Qemu-devel] [RFC 1/2] pci-dma-api-v1 Andrea Arcangeli
@ 2008-11-27 12:43 ` Andrea Arcangeli
  2008-11-28 11:09   ` Jamie Lokier
  2008-11-27 19:14 ` [Qemu-devel] [RFC 1/2] pci-dma-api-v1 Blue Swirl
  1 sibling, 1 reply; 32+ messages in thread
From: Andrea Arcangeli @ 2008-11-27 12:43 UTC (permalink / raw)
  To: qemu-devel

Hello,

this is the emulated bdrv_aio_readv/writev, a pure hack to be able to
test the DMA API in the previous patch.

For the real thing there are two ways to go:

1) pthread_create() and do aio with pthreads by calling writev by
hand.

2) Use kernel-based linux aio (I think it's much better, as it won't
screw up contiguous I/O, and it handles O_DIRECT random writes and
random reads by keeping the low-level I/O pipeline full without
threads, simply by queuing the commands of every direct-io
aio_readv/writev asynchronously and _in_order_ [in order only from the
point of view of the I/O scheduler, of course] in the low-level
storage queue, without needing any scheduler or thread-synchronization
involvement).

So who's going to add a real bdrv_aio_readv/writev instead of the
aberration below, which breaks on backends not supporting aio, breaks
with bdrv_aio_cancel too, and besides is horribly slow, making the
direct path slower than the bounce path?

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Index: block.c
===================================================================
--- block.c	(revision 5799)
+++ block.c	(working copy)
@@ -53,6 +53,20 @@
                         uint8_t *buf, int nb_sectors);
 static int bdrv_write_em(BlockDriverState *bs, int64_t sector_num,
                          const uint8_t *buf, int nb_sectors);
+static BlockDriverAIOCB *bdrv_aio_readv_em(BlockDriverState *bs,
+					   int64_t sector_num,
+					   struct iovec *iov,
+					   int iovcnt,
+					   size_t len,
+					   BlockDriverCompletionFunc *cb,
+					   void *opaque);
+static BlockDriverAIOCB *bdrv_aio_writev_em(BlockDriverState *bs,
+					    int64_t sector_num,
+					    struct iovec *iov,
+					    int iovcnt,
+					    size_t len,
+					    BlockDriverCompletionFunc *cb,
+					    void *opaque);
 
 BlockDriverState *bdrv_first;
 
@@ -135,6 +149,8 @@
         /* add synchronous IO emulation layer */
         bdrv->bdrv_read = bdrv_read_em;
         bdrv->bdrv_write = bdrv_write_em;
+        bdrv->bdrv_aio_readv = bdrv_aio_readv_em;
+        bdrv->bdrv_aio_writev = bdrv_aio_writev_em;
     }
     bdrv->next = first_drv;
     first_drv = bdrv;
@@ -1341,6 +1401,74 @@
     qemu_aio_release(acb);
 }
 
+static void bdrv_aio_iov_bh_cb(void *opaque)
+{
+    BlockDriverAIOCBSync *acb = opaque;
+    acb->common.cb(acb->common.opaque, acb->ret);
+    qemu_bh_delete(acb->bh);
+    qemu_free(acb);
+}
+
+static BlockDriverAIOCB *bdrv_aio_readv_em(BlockDriverState *bs,
+					   int64_t sector_num,
+					   struct iovec *iov,
+					   int iovcnt,
+					   size_t len,
+					   BlockDriverCompletionFunc *cb,
+					   void *opaque)
+{
+    BlockDriverAIOCBSync *acb;
+    int ret = -1, idx;
+
+    for (idx = 0; idx < iovcnt; idx++) {
+	size_t sectors = iov[idx].iov_len >> SECTOR_BITS;
+	ret = bdrv_read(bs, sector_num, iov[idx].iov_base, sectors);
+	if (ret)
+	    break;
+	sector_num += sectors;
+    }
+    acb = qemu_mallocz(sizeof(BlockDriverAIOCBSync));
+    if (!acb)
+            return NULL;
+    acb->common.bs = bs;
+    acb->common.cb = cb;
+    acb->common.opaque = opaque;
+    acb->bh = qemu_bh_new(bdrv_aio_iov_bh_cb, acb);
+    acb->ret = ret;
+    qemu_bh_schedule(acb->bh);
+    return &acb->common;
+}
+
+static BlockDriverAIOCB *bdrv_aio_writev_em(BlockDriverState *bs,
+					    int64_t sector_num,
+					    struct iovec *iov,
+					    int iovcnt,
+					    size_t len,
+					    BlockDriverCompletionFunc *cb,
+					    void *opaque)
+{
+    BlockDriverAIOCBSync *acb;
+    int ret = -1, idx;
+
+    for (idx = 0; idx < iovcnt; idx++) {
+	size_t sectors = iov[idx].iov_len >> SECTOR_BITS;
+	ret = bdrv_write(bs, sector_num, iov[idx].iov_base, sectors);
+	if (ret)
+	    break;
+	sector_num += sectors;
+    }
+    acb = qemu_mallocz(sizeof(BlockDriverAIOCBSync));
+    if (!acb)
+            return NULL;
+    acb->common.bs = bs;
+    acb->common.cb = cb;
+    acb->common.opaque = opaque;
+    acb->bh = qemu_bh_new(bdrv_aio_iov_bh_cb, acb);
+    acb->ret = ret;
+    qemu_bh_schedule(acb->bh);
+    return &acb->common;
+}
+
 /**************************************************************/
 /* sync block device emulation */
 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [RFC 1/2] pci-dma-api-v1
  2008-11-27 12:35 [Qemu-devel] [RFC 1/2] pci-dma-api-v1 Andrea Arcangeli
  2008-11-27 12:43 ` [Qemu-devel] [RFC 2/2] bdrv_aio_readv/writev_em Andrea Arcangeli
@ 2008-11-27 19:14 ` Blue Swirl
  2008-11-28  1:56   ` Andrea Arcangeli
                     ` (2 more replies)
  1 sibling, 3 replies; 32+ messages in thread
From: Blue Swirl @ 2008-11-27 19:14 UTC (permalink / raw)
  To: qemu-devel

On 11/27/08, Andrea Arcangeli <aarcange@redhat.com> wrote:
> Hello everyone,
>
>  One major limitation for KVM today is the lack of a proper way to
>  write drivers in a way that allows the host OS to use direct DMA to
>  the guest physical memory to avoid any intermediate copy. The only API
>  provided to drivers seems to be the cpu_physical_memory_rw and that
>  enforces all drivers to bounce and trash cpu caches and be memory
>  bound. This new DMA API instead allows drivers to use a pci_dma_sg
>  method for SG I/O that will translate the guest physical addresses to
>  host virutal addresses and it will call two operation, one is a submit
>  method and one is the complete method. The pci_dma_sg may have to
>  bounce buffer internally and to limit the max bounce size it may have
>  to submit I/O in pieces with multiple submit calls. The patch adapts
>  the ide.c HD driver to use this. Once cdrom is converted too
>  dma_buf_rw can be eliminated. As you can see the new ide_dma_submit
>  and ide_dma_complete code is much more readable than the previous
>  rearming callback.
>
>  This is only tested with KVM so far but qemu builds, in general
>  there's nothing kvm specific here (with the exception of a single
>  kvm_enabled), so it should all work well for both.
>
>  All we care about is the performance of the direct path, so I tried to
>  avoid dynamic allocations there to avoid entering glibc, the current
>  logic doesn't satisfy me yet but it should be at least faster than
>  calling malloc (but I'm still working on it to avoid memory waste to
>  detect when more than one iov should be cached). But in case of
>  instabilities I recommend first thing to set MAX_IOVEC_IOVCNT 0 to
>  disable that logic ;). I recommend to test with DEBUG_BOUNCE and with
>  a 512 max bounce buffer too. It's running stable in all modes so
>  far. However if ide.c end up calling aio_cancel things will likely
>  fall apart but this is all because of bdrv_aio_readv/writev, and the
>  astonishing lack of aio_readv/writev in glibc!
>
>  Once we finish fixing storage performance with a real
>  bdrv_aio_readv/writev (now a blocker issue), a pci_dma_single can be
>  added for zero copy networking (one NIC per VM, or VMDq, IOV
>  etc..). The DMA API should allow for that too.

The previous similar attempt by Anthony for generic DMA using vectored
IO was abandoned because the malloc/free overhead was more than the
performance gain. Have you made any performance measurements? How does
this version compare to the previous ones?

I think the pci_ prefix can be removed; there is little that is PCI
specific.

For the Sparc32 IOMMU (and probably other IOMMUs), it should be
possible to register a function used in place of
cpu_physical_memory_rw, c_p_m_can_dma etc. The goal is that it should
be possible to stack the DMA resolvers (think of devices behind a
number of buses).

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [RFC 1/2] pci-dma-api-v1
  2008-11-27 19:14 ` [Qemu-devel] [RFC 1/2] pci-dma-api-v1 Blue Swirl
@ 2008-11-28  1:56   ` Andrea Arcangeli
  2008-11-28 17:59     ` Blue Swirl
  2008-11-29 19:48   ` Avi Kivity
  2008-11-30 22:33   ` Anthony Liguori
  2 siblings, 1 reply; 32+ messages in thread
From: Andrea Arcangeli @ 2008-11-28  1:56 UTC (permalink / raw)
  To: qemu-devel; +Cc: Blue Swirl

On Thu, Nov 27, 2008 at 09:14:45PM +0200, Blue Swirl wrote:
> The previous similar attempt by Anthony for generic DMA using vectored
> IO was abandoned because the malloc/free overhead was more than the

Even if there were dynamic allocations in the fast path, the overhead
of malloc/free is nothing compared to running and waiting for a host
kernel syscall to return for every 4k, not to mention with O_DIRECT
enabled, which is the whole point of having a direct-dma API that
truly doesn't pollute the cache. With O_DIRECT but without a real
readv/writev, I/O performance would be destroyed, dropping to
something like 10M/sec even on the fastest storage/CPU/ram
combinations.

So the question is how those benchmarks were run, with or without a
real readv/writev and with or without O_DIRECT to truly eliminate all
CPU cache pollution out of the memory copies?

About malloc, all we care about is the direct-io fast path, and with
my patch there is no allocation whatsoever in the fast path. About the
bounce layer, that is there for correctness only (purely to do DMA to
non-RAM or in non-linear RAM ranges with non-RAM holes in between) and
we don't care about it in performance terms.

> performance gain. Have you made any performance measurements? How does
> this version compare to the previous ones?

I ran some minor benchmarks, but it's basically futile to benchmark
with the emulated bdrv_aio_readv/writev.

> I think the pci_ prefix can be removed, there is little PCI specific.

Adding the pci_ prefix looked like a naming requirement from previous
threads on the topic. Before I learned about that, I didn't want a
pci_ prefix either, so I can certainly agree with you ;).

There's nothing PCI specific so far. Anyway this is just a naming
matter; it's up to you to decide what you like :).

> For Sparc32 IOMMU (and probably other IOMMUS), it should be possible
> to register a function used in place of  cpu_physical_memory_rw,
> c_p_m_can_dma etc. The goal is that it should be possible to stack the
> DMA resolvers (think of devices behind a number of buses).

The hardware being emulated wouldn't attach to both buses in the real
world, I think, so the driver can specify what kind of iommu it sits
behind (behind that you can emulate whatever hardware you want, but
the original device is still pci or not-pci). I personally don't see
much difference, as renaming later wouldn't be harder than a sed
script...

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [RFC 2/2] bdrv_aio_readv/writev_em
  2008-11-27 12:43 ` [Qemu-devel] [RFC 2/2] bdrv_aio_readv/writev_em Andrea Arcangeli
@ 2008-11-28 11:09   ` Jamie Lokier
  0 siblings, 0 replies; 32+ messages in thread
From: Jamie Lokier @ 2008-11-28 11:09 UTC (permalink / raw)
  To: qemu-devel

Andrea Arcangeli wrote:
> pthread_create() and do aio with pthreads by calling writev by hand.
> 
> Use kernel based linux aio (I think it's much better as it won't
> screwup with contiguous I/O, and it handles o_direct random writes and
> random reads by keeping the lowlevel I/O pipeline full without threads
> but by just queuing _in_order_ [in order only from the point of view
> of the I/O scheduler of course] and asynchronously the commands of
> every different direct-io aio_readv/writev in the lowlevel storage
> queue without needing any scheduler and thread synchronization
> involvement).

The good things about linux-aio are as you described.

The bad thing is that I'm told linux-aio can block sometimes.  Apart
from memory pressure (which can block the calling process anyway), an
aio implementation which can block is not so good.

Doing it in a pthread is less good for a lot of reasons.  On the other
hand, I/O in a pthread should never block the main thread.  We
discussed the "in order" part a while ago, and several people pointed
out that once there are a few I/Os in flight, the order should
stabilise automatically, but perhaps that's not so good in transient
situations.

(Ideally we'd use scheduler activations (syslets!) to avoid blocking
in any kernel I/O path no matter how complex, while keeping aio
ordering and avoiding unnecessary task switching.  I haven't heard
much about progress on this since earlier this year...)

-- Jamie

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [RFC 1/2] pci-dma-api-v1
  2008-11-28  1:56   ` Andrea Arcangeli
@ 2008-11-28 17:59     ` Blue Swirl
  2008-11-28 18:50       ` Andrea Arcangeli
  2008-11-30 22:34       ` Anthony Liguori
  0 siblings, 2 replies; 32+ messages in thread
From: Blue Swirl @ 2008-11-28 17:59 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: qemu-devel

On 11/28/08, Andrea Arcangeli <aarcange@redhat.com> wrote:
> On Thu, Nov 27, 2008 at 09:14:45PM +0200, Blue Swirl wrote:
>  > The previous similar attempt by Anthony for generic DMA using vectored
>  > IO was abandoned because the malloc/free overhead was more than the
>
>
> Even if there were dynamic allocations in the fast path, the overhead
>  of malloc/free is nothing if compared to running and waiting a host
>  kernel syscall to return every 4k, not to tell with O_DIRECT enabled
>  which is the whole point of having a direct-dma API that truly doesn't
>  pollute the cache. With O_DIRECT, without a real readv/writev I/O
>  performance would be destroyed going down to something like 10M/sec
>  even on the fastest storage/CPU/ram combinations.
>
>  So the question is how those benchmarks were run, with or without a
>  real readv/writev and with or without O_DIRECT to truly eliminate all
>  CPU cache pollution out of the memory copies?

I don't know, here's a pointer:
http://lists.gnu.org/archive/html/qemu-devel/2008-08/msg00092.html

>  About malloc, all we care about is the direct-io fast path, and with
>  my patch there is no allocation whatsoever in the fast path. About the
>  bounce layer, that is there for correctness only (purely to do DMA to
>  non-RAM or in non-linear RAM ranges with non-RAM holes in between) and
>  we don't care about it in performance terms.
>
>
>  > performance gain. Have you made any performance measurements? How does
>  > this version compare to the previous ones?
>
>
> I run some minor benchmark but it's basically futile to benchmark with
>  the bdrv_aio_readv/writev_em.
>
>
>  > I think the pci_ prefix can be removed, there is little PCI specific.
>
>
> Adding the pci_ prefix looked a requirement in naming terms from
>  previous threads on the topic. Before I learnt about it, I didn't want
>  to have a pci_ prefix too so I can certainly agree with you ;).
>
>  There's is nothing PCI specific so far. Anyway this is just a naming
>  matter, it's up to you to decide what you like :).
>
>
>  > For Sparc32 IOMMU (and probably other IOMMUS), it should be possible
>  > to register a function used in place of  cpu_physical_memory_rw,
>  > c_p_m_can_dma etc. The goal is that it should be possible to stack the
>  > DMA resolvers (think of devices behind a number of buses).
>
>
> The hardware thing being emulated in the real world wouldn't attach to
>  both buses I think, hence you can specify it in the driver what kind
>  of iommu it has (then behind it you can emulate whatever hardware you
>  want but still the original device was pci or not-pci). I personally
>  don't see much difference as renaming later wouldn't be harder than a
>  sed script...

Sorry, my description seems to have led you down a totally wrong
track. I meant this scenario: device (Lance Ethernet) -> DMA
controller (MACIO) -> IOMMU -> physical memory. (In this case
vectored DMA won't be useful since there is byte swapping involved,
but it serves as an example of generic DMA.) At each step the DMA
address is rewritten. It would be nice if the interfaces between
Lance and DMA, DMA and IOMMU, and IOMMU and memory were the same.

Here's some history, please have a look.

My first failed attempt:
http://lists.gnu.org/archive/html/qemu-devel/2007-08/msg00179.html

My second failed rough sketch:
http://lists.gnu.org/archive/html/qemu-devel/2007-10/msg00626.html

Anthony's version:
http://lists.gnu.org/archive/html/qemu-devel/2008-03/msg00474.html

Anthony's second version:
http://lists.gnu.org/archive/html/qemu-devel/2008-04/msg00077.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [RFC 1/2] pci-dma-api-v1
  2008-11-28 17:59     ` Blue Swirl
@ 2008-11-28 18:50       ` Andrea Arcangeli
  2008-11-28 19:03         ` Blue Swirl
                           ` (2 more replies)
  2008-11-30 22:34       ` Anthony Liguori
  1 sibling, 3 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2008-11-28 18:50 UTC (permalink / raw)
  To: Blue Swirl; +Cc: qemu-devel

On Fri, Nov 28, 2008 at 07:59:13PM +0200, Blue Swirl wrote:
> I don't know, here's a pointer:
> http://lists.gnu.org/archive/html/qemu-devel/2008-08/msg00092.html

I'm in total agreement with it. The missing "proper vectored AIO
operations" are bdrv_aio_readv/writev ;).

I wonder how aio_readv/writev can possibly be missing from posix aio?
Unbelievable. It'd be totally trivial to add those to glibc, much
easier in fact than pthread_create by hand, but how can we add a
dependency on a certain glibc version? Ironically it'd be more
user-friendly to add a dependency on the linux kernel-aio
implementation, which has been available for ages and is guaranteed
to run faster (or at least not slower).

> Sorry, my description seems to have lead you to a totally wrong track.
> I meant this scenario: device (Lance Ethernet) -> DMA controller
> (MACIO) -> IOMMU -> physical memory. (In this case vectored DMA won't
> be useful since there is byte swapping involved, but serves as an
> example about generic DMA). At each step the DMA address is rewritten.
> It would be nice if the interface between Lance and DMA, DMA and IOMMU
> and IOMMU and memory was the same.

No problem. So you think I should change it to qemu_dma_sg instead of
pci_dma_sg? We can decide it later, but surely we can think about it
in the meantime ;).

> Here's some history, please have a look.
> 
> My first failed attempt:
> http://lists.gnu.org/archive/html/qemu-devel/2007-08/msg00179.html
> 
> My second failed rough sketch:
> http://lists.gnu.org/archive/html/qemu-devel/2007-10/msg00626.html
> 
> Anthony's version:
> http://lists.gnu.org/archive/html/qemu-devel/2008-03/msg00474.html
> 
> Anthony's second version:
> http://lists.gnu.org/archive/html/qemu-devel/2008-04/msg00077.html

Thanks a lot for the pointers.

BTW, lots of the credit in the design of my current implementation
goes to Avi, I forgot to mention it in previous emails.

The little cache layer I added at the last minute was very buggy, so
don't look too closely at it; just assume it works when reading the
patch ;). I think I've fixed it now in my tree, so the next version
will be much better. I've also noticed some problems with windows (I
didn't test windows before posting); those aren't related to the
cache layer, as I added a #define to disable it and replace it with
malloc/free. As soon as windows runs completely flawlessly I'll post
an update.

The iov cache layer is now also improved so that it only caches at
most N elements, where N is the max number of simultaneous in-flight
dmas ever seen during the runtime, making it a bit smarter than a
generic slab cache.

Last but not least, there's still one malloc in the direct fast path,
but the plan is to eliminate that too by embedding the iov inside the
param (keeping it at the end of the struct, as if I were extending
the linear_iov); the cache layer will then handle it all and there
will be zero mallocs. The bounce path will be penalized, as it'll
have to allocate the direct-iov too, but we don't care.

Thanks!
Andrea

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [RFC 1/2] pci-dma-api-v1
  2008-11-28 18:50       ` Andrea Arcangeli
@ 2008-11-28 19:03         ` Blue Swirl
  2008-11-28 19:18           ` Jamie Lokier
  2008-11-30 18:04           ` Andrea Arcangeli
  2008-11-30 17:41         ` [Qemu-devel] [RFC 1/1] pci-dma-api-v2 Andrea Arcangeli
  2008-11-30 22:38         ` [Qemu-devel] [RFC 1/2] pci-dma-api-v1 Anthony Liguori
  2 siblings, 2 replies; 32+ messages in thread
From: Blue Swirl @ 2008-11-28 19:03 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: qemu-devel

On 11/28/08, Andrea Arcangeli <aarcange@redhat.com> wrote:
> On Fri, Nov 28, 2008 at 07:59:13PM +0200, Blue Swirl wrote:
>  > I don't know, here's a pointer:
>  > http://lists.gnu.org/archive/html/qemu-devel/2008-08/msg00092.html
>
>
> I'm in total agreement with it. The missing "proper vectored AIO
>  operations" are bdrv_aio_readv/writev ;).
>
>  I wonder how can possibly aio_readv/writev be missing in posix aio?
>  Unbelievable. It'd be totally trivial to add those to glibc, much
>  easier infact than to pthread_create by hand, but how can we add a
>  dependency on a certain glibc version? Ironically it'll be more
>  user-friendly to add dependency on linux kernel-aio implementation
>  that is already available for ages and it's guaranteed to run faster
>  (or at least not slower).

There's also lio_listio that provides for vectored AIO.

>  > Sorry, my description seems to have lead you to a totally wrong track.
>  > I meant this scenario: device (Lance Ethernet) -> DMA controller
>  > (MACIO) -> IOMMU -> physical memory. (In this case vectored DMA won't
>  > be useful since there is byte swapping involved, but serves as an
>  > example about generic DMA). At each step the DMA address is rewritten.
>  > It would be nice if the interface between Lance and DMA, DMA and IOMMU
>  > and IOMMU and memory was the same.
>
>
> No problem. So you think I should change it to qemu_dma_sg instead of
>  pci_dma_sg? We can decide it later, but surely we can think about it
>  in the meantime ;).

Yes.

>  > Here's some history, please have a look.
>  >
>  > My first failed attempt:
>  > http://lists.gnu.org/archive/html/qemu-devel/2007-08/msg00179.html
>  >
>  > My second failed rough sketch:
>  > http://lists.gnu.org/archive/html/qemu-devel/2007-10/msg00626.html
>  >
>  > Anthony's version:
>  > http://lists.gnu.org/archive/html/qemu-devel/2008-03/msg00474.html
>  >
>  > Anthony's second version:
>  > http://lists.gnu.org/archive/html/qemu-devel/2008-04/msg00077.html
>
>
> Thanks a lot for the pointers.

Perhaps you could point out why the previous attempts failed, but
yours won't? ;-)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [RFC 1/2] pci-dma-api-v1
  2008-11-28 19:03         ` Blue Swirl
@ 2008-11-28 19:18           ` Jamie Lokier
  2008-11-29 19:49             ` Avi Kivity
                               ` (2 more replies)
  2008-11-30 18:04           ` Andrea Arcangeli
  1 sibling, 3 replies; 32+ messages in thread
From: Jamie Lokier @ 2008-11-28 19:18 UTC (permalink / raw)
  To: qemu-devel; +Cc: Andrea Arcangeli

Blue Swirl wrote:
> >  I wonder how can possibly aio_readv/writev be missing in posix aio?
> >  Unbelievable. It'd be totally trivial to add those to glibc, much
> >  easier infact than to pthread_create by hand, but how can we add a
> >  dependency on a certain glibc version? Ironically it'll be more
> >  user-friendly to add dependency on linux kernel-aio implementation
> >  that is already available for ages and it's guaranteed to run faster
> >  (or at least not slower).
> 
> There's also lio_listio that provides for vectored AIO.

I think lio_listio is the missing aio_readv/writev.

It's more versatile, and that'll be why POSIX never bothered with
aio_readv/writev.

Doesn't explain why they didn't _start_ with aio_readv before
inventing lio_listio, but there you go.  Unix history.

-- Jamie

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [RFC 1/2] pci-dma-api-v1
  2008-11-27 19:14 ` [Qemu-devel] [RFC 1/2] pci-dma-api-v1 Blue Swirl
  2008-11-28  1:56   ` Andrea Arcangeli
@ 2008-11-29 19:48   ` Avi Kivity
  2008-11-30 17:29     ` Andrea Arcangeli
  2008-11-30 22:33   ` Anthony Liguori
  2 siblings, 1 reply; 32+ messages in thread
From: Avi Kivity @ 2008-11-29 19:48 UTC (permalink / raw)
  To: qemu-devel

Blue Swirl wrote:
> The previous similar attempt by Anthony for generic DMA using vectored
> IO was abandoned because the malloc/free overhead was more than the
> performance gain. Have you made any performance measurements? How does
> this version compare to the previous ones?
>
>   

The pointers you gave don't blame malloc/free but rather the lack of
aio readv/writev.

Since Andrea's patches contain emulation for aio readv/writev, the 
performance degradation will not occur (though we will not see the 
benefit either).

I doubt you could measure malloc overhead with anything less than a 
thousand disks (and even there, other overheads are likely to drown 
out that malloc).

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [RFC 1/2] pci-dma-api-v1
  2008-11-28 19:18           ` Jamie Lokier
@ 2008-11-29 19:49             ` Avi Kivity
  2008-11-30 17:20             ` Andrea Arcangeli
  2008-11-30 22:31             ` Anthony Liguori
  2 siblings, 0 replies; 32+ messages in thread
From: Avi Kivity @ 2008-11-29 19:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: Andrea Arcangeli

Jamie Lokier wrote:
> I think lio_listio is the missing aio_readv/writev.
>   

No, it isn't.  A 1MB read divided into 4k pages will have 256 completions.
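
Concretely (a sketch that only builds the control blocks and never
submits them; the helper name is mine): emulating one vectored read
with lio_listio needs one aiocb per discontiguous buffer, and each
aiocb is its own completion -- 256 of them for 1MB in 4k pages, where
a true aio_readv would be a single request with a single completion.

```c
#include <aio.h>
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Fill one aiocb per buffer for a contiguous range of file offsets,
 * as lio_listio(LIO_NOWAIT, ...) would require.  Returns the number
 * of control blocks, which is also the number of completions to reap. */
int build_listio_read(struct aiocb *cbs, int fd, char *base,
                      int nbufs, size_t bufsz, off_t offset)
{
    for (int i = 0; i < nbufs; i++) {
        memset(&cbs[i], 0, sizeof(cbs[i]));
        cbs[i].aio_fildes = fd;
        cbs[i].aio_buf = base ? base + (size_t)i * bufsz : NULL;
        cbs[i].aio_nbytes = bufsz;
        cbs[i].aio_offset = offset + (off_t)((size_t)i * bufsz);
        cbs[i].aio_lio_opcode = LIO_READ;
    }
    return nbufs;
}
```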

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [RFC 1/2] pci-dma-api-v1
  2008-11-28 19:18           ` Jamie Lokier
  2008-11-29 19:49             ` Avi Kivity
@ 2008-11-30 17:20             ` Andrea Arcangeli
  2008-11-30 22:31             ` Anthony Liguori
  2 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2008-11-30 17:20 UTC (permalink / raw)
  To: qemu-devel

On Fri, Nov 28, 2008 at 07:18:19PM +0000, Jamie Lokier wrote:
> Blue Swirl wrote:
> > >  I wonder how can possibly aio_readv/writev be missing in posix aio?
> > >  Unbelievable. It'd be totally trivial to add those to glibc, much
> > >  easier infact than to pthread_create by hand, but how can we add a
> > >  dependency on a certain glibc version? Ironically it'll be more
> > >  user-friendly to add dependency on linux kernel-aio implementation
> > >  that is already available for ages and it's guaranteed to run faster
> > >  (or at least not slower).
> > 
> > There's also lio_listio that provides for vectored AIO.
> 
> I think lio_listio is the missing aio_readv/writev.
> 
> It's more versatile, and that'll by why POSIX never bothered with
> aio_readv/writev.
> 
> Doesn't explain why they didn't _start_ with aio_readv before
> inventing lio_listio, but there you go.  Unix history.

Well, I grepped earlier for readv or writev syscalls inside
glibc-2.6.1/sysdeps/pthread and there was nothing there, so
lio_listio doesn't seem to be helpful at all. If it were a _kernel_
API, the kernel could see the whole queue immediately and coalesce
all outstanding contiguous I/O into a single DMA operation, but the
userland queue here is not visible to the kernel. So unless we can
execute the readv and writev syscalls, O_DIRECT performance with the
direct DMA API will be destroyed compared to bounce buffering,
because the guest OS will submit large DMA operations that get
executed as 4k DMA operations in the storage hardware; the memcpy
overhead we're eliminating is minor compared to such a major I/O
bottleneck with qemu cache=off.

The only way we could possibly use lio_listio would be to improve
glibc so the lio_listio op is smart enough to call readv/writev when
it finds contiguous I/O being queued, but overall this would still be
largely inefficient. If you check the dma api, I'm preparing a
struct iovec *iov ready to submit to the kernel either through the
nonexistent aio_readv/writev or with the kernel-API
IOCB_CMD_PREADV/WRITEV (they obviously both take the well-defined
struct iovec as a param, so there's zero overhead).

So even if we improve lio_listio, it would introduce artificial
splitting-and-recoalescing overhead just because of its weird API.
It would be entirely different if lio_listio resembled the kernel
io_submit API and had a PREADV/WRITEV type to submit iovecs, but it
only has LIO_READ/WRITE, no sign of LIO_READV/WRITEV unfortunately
:(. Admittedly it's not so common to have to use readv/writev on
contiguous I/O, but the emulated DMA with SG truly requires this.
Anything that can't handle a native iovec we can't use.
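
For comparison, here is a minimal sketch of submitting one vectored
read through the raw kernel AIO syscalls (assuming a kernel that
implements IOCB_CMD_PREADV; the wrapper and helper names are mine,
since glibc does not expose this ABI): the whole iovec goes down in
one command and comes back as one completion.

```c
#include <assert.h>
#include <fcntl.h>
#include <linux/aio_abi.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <time.h>
#include <unistd.h>

/* Raw-syscall wrappers: glibc does not expose the kernel AIO ABI. */
static long sys_io_setup(unsigned nr, aio_context_t *ctx)
{
    return syscall(__NR_io_setup, nr, ctx);
}
static long sys_io_submit(aio_context_t ctx, long n, struct iocb **cbs)
{
    return syscall(__NR_io_submit, ctx, n, cbs);
}
static long sys_io_getevents(aio_context_t ctx, long min_nr, long nr,
                             struct io_event *ev, struct timespec *ts)
{
    return syscall(__NR_io_getevents, ctx, min_nr, nr, ev, ts);
}

/* One vectored read == one kernel command == one completion. */
ssize_t aio_preadv_once(int fd, struct iovec *iov, int iovcnt, int64_t off)
{
    aio_context_t ctx = 0;
    struct iocb cb;
    struct iocb *cbs[1] = { &cb };
    struct io_event ev;
    ssize_t ret = -1;

    if (sys_io_setup(1, &ctx) < 0)
        return -1;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_lio_opcode = IOCB_CMD_PREADV;
    cb.aio_buf = (uint64_t)(uintptr_t)iov;  /* iovec array, not a flat buffer */
    cb.aio_nbytes = iovcnt;                 /* number of iovecs */
    cb.aio_offset = off;
    if (sys_io_submit(ctx, 1, cbs) == 1 &&
        sys_io_getevents(ctx, 1, 1, &ev, NULL) == 1)
        ret = ev.res;
    syscall(__NR_io_destroy, ctx);
    return ret;
}
```

A real driver would of course keep the context alive and reap events
asynchronously rather than setting up and tearing down per request.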

Likely we'll have to add our own pthread_create-based aio
implementation for non-linux and kernel AIO for linux, and get rid of
librt as a whole. It's pointless to mix our own userland aio (which
will support readv/writev too) with the posix one. If this were just
a linux project, kernel AIO would suffice. All the DBs I know of that
need readv/writev with AIO and O_DIRECT, for similar reasons to ours,
already use kernel AIO.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [RFC 1/2] pci-dma-api-v1
  2008-11-29 19:48   ` Avi Kivity
@ 2008-11-30 17:29     ` Andrea Arcangeli
  2008-11-30 20:27       ` Avi Kivity
  0 siblings, 1 reply; 32+ messages in thread
From: Andrea Arcangeli @ 2008-11-30 17:29 UTC (permalink / raw)
  To: qemu-devel

On Sat, Nov 29, 2008 at 09:48:22PM +0200, Avi Kivity wrote:
> Since Andrea's patches contain emulation for aio readv/writev, the 
> performance degradation will not occur (though we will not see the benefit 
> either).

It will not occur, especially if you #define DEBUG_BOUNCE. The way I
emulated bdrv_aio_readv/writev doesn't involve a bounce; it submits
the I/O serially, so it truly runs zero-copy when DEBUG_BOUNCE is not
defined ;).

If you enable DEBUG_BOUNCE, the bounce layer is forced on, the iovec
is linearized, and the DMA command executed on the hardware is
allowed to be as large as MAX_DMA_BOUNCE_BUFFER, as before. So until
we have a real bdrv_aio_readv/writev, #defining DEBUG_BOUNCE is a
must with cache=off.

> I doubt you can get measure malloc overhead with anything less a thousand 
> disks (even there, other overheads are likely to drown that malloc).

I also eliminated any sign of malloc in the direct path with a small
cache layer that caches as many pci dma sg params (each with the max
iovcnt seen so far embedded into it) as the max number of
simultaneous in-flight I/Os seen over the whole runtime, with a cap
of 10 params (so malloc only has to run if there are more than 10
simultaneous sg dma I/O operations in flight). Only params with iov
arrays with iovcnt < 2048 are cached. So worst-case ram waste is
limited, and it auto-tunes at runtime to remain near zero for
single-disk setups etc.
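
A free-list along those lines might look like this (a sketch with
hypothetical names, not the patch's actual code; the iov array is
embedded at the tail of the param as described):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/uio.h>

#define MAX_CACHED_PARAMS 10
#define MAX_CACHED_IOVCNT 2048

/* Keep at most as many cached params as the peak number of
 * simultaneously in-flight requests, capped at 10, and never cache
 * huge iov arrays, so worst-case ram waste stays bounded. */
struct dma_param {
    struct dma_param *next;      /* free-list link while cached */
    int iovcnt_alloc;            /* capacity of iov[] */
    struct iovec iov[];          /* embedded at the tail, no second malloc */
};

static struct dma_param *free_list;
static int in_flight, peak_in_flight, cached;

struct dma_param *dma_param_get(int iovcnt)
{
    struct dma_param *p = free_list;
    if (p && p->iovcnt_alloc >= iovcnt) {
        free_list = p->next;
        cached--;
    } else {
        p = malloc(sizeof(*p) + iovcnt * sizeof(struct iovec));
        if (!p)
            return NULL;
        p->iovcnt_alloc = iovcnt;
    }
    if (++in_flight > peak_in_flight)
        peak_in_flight = in_flight;   /* auto-tune the cache bound */
    return p;
}

void dma_param_put(struct dma_param *p)
{
    in_flight--;
    if (p->iovcnt_alloc < MAX_CACHED_IOVCNT &&
        cached < peak_in_flight && cached < MAX_CACHED_PARAMS) {
        p->next = free_list;
        free_list = p;
        cached++;
    } else {
        free(p);
    }
}
```

Single-disk workloads settle at a peak of one or two in-flight
requests, so the cache stays tiny there.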

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Qemu-devel] [RFC 1/1] pci-dma-api-v2
  2008-11-28 18:50       ` Andrea Arcangeli
  2008-11-28 19:03         ` Blue Swirl
@ 2008-11-30 17:41         ` Andrea Arcangeli
  2008-11-30 18:36           ` [Qemu-devel] " Blue Swirl
  2008-11-30 22:50           ` [Qemu-devel] " Anthony Liguori
  2008-11-30 22:38         ` [Qemu-devel] [RFC 1/2] pci-dma-api-v1 Anthony Liguori
  2 siblings, 2 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2008-11-30 17:41 UTC (permalink / raw)
  To: Blue Swirl; +Cc: qemu-devel

On Fri, Nov 28, 2008 at 07:50:01PM +0100, Andrea Arcangeli wrote:
> better. I've also noticed some problems with windows (I didn't test
> windows before posting), those aren't related to the cache layer as I
> added a #define to disable it and replace it with malloc/free. But
> that's not the cache layer, as soon as windows runs completely
> flawlessy I post an update.

As promised, here is an update. I fixed all outstanding issues and
it's running rock solid. Before, there were a few bad spots in ide.c,
plus the cache layer was very buggy with more than one dma in flight.

One of the issues was also that sg[].len may not be aligned; it's the
total DMA length that has to be aligned. So while linux never submits
memory regions with an sg[].len that isn't 512-byte aligned, other
OSes submitted 512-byte I/Os scattered over different memory regions,
each smaller than 512 bytes. So the alignment check is moved to the
whole DMA operation (param->curr_len), and the max alignment depends
on the max size of the buffer (currently hardcoded to 1M; it can't be
too big for security reasons). The way host direct DMA works, the
memory region length must be a multiple of the alignment, so bounce
buffering is a must for those few times the buffer is smaller than
the blocksize. I guess the hardware behaves similarly and dma ops
with sg entries shorter than 512 bytes run slower in real hardware
too.
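
The rule could be expressed like this (a hypothetical helper, not the
patch's code, which checks param->curr_len; here the sg list is taken
directly): the guest only guarantees the *total* length is
sector-aligned, while the direct path additionally needs every
element's base and length aligned, otherwise we fall back to bouncing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/uio.h>

#define SECTOR_SIZE 512

/* Return 1 if this sg list can use zero-copy direct DMA (O_DIRECT
 * style alignment on every element), 0 if it must be bounced. */
int dma_sg_can_go_direct(const struct iovec *sg, int sgcnt)
{
    size_t total = 0;
    for (int i = 0; i < sgcnt; i++) {
        if ((uintptr_t)sg[i].iov_base % SECTOR_SIZE ||
            sg[i].iov_len % SECTOR_SIZE)
            return 0;                    /* element misaligned: bounce */
        total += sg[i].iov_len;
    }
    return total % SECTOR_SIZE == 0;     /* whole op must still be aligned */
}
```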

I tested all combinations on various OSes: direct dma, bounce
buffering, and bounce buffering with a 512-byte max size (which runs
much slower, as expected, visibly even during boot, showing how much
overhead there is even with cache=off when reading in small chunks
and entering/exiting the host kernel so frequently). I didn't split
the _em part out of the patch this time, as it's so small anyway and
I doubt we're ready to merge into qemu svn; the _em bits aren't meant
to be merged. However, I think the below is close to good enough. The
reason I don't think we're ready for merging is that this should be
merged at the same time as a real bdrv_aio_readv/writev. So once we
agree the below is the way to go, the next thing to discuss is how to
add readv/writev ;).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Index: block_int.h
===================================================================
--- block_int.h	(revision 5818)
+++ block_int.h	(working copy)
@@ -55,6 +55,8 @@
         int64_t sector_num, const uint8_t *buf, int nb_sectors,
         BlockDriverCompletionFunc *cb, void *opaque);
     void (*bdrv_aio_cancel)(BlockDriverAIOCB *acb);
+    BlockDriverAIOIOV *bdrv_aio_readv;
+    BlockDriverAIOIOV *bdrv_aio_writev;
     int aiocb_size;
 
     const char *protocol_name;
Index: Makefile.target
===================================================================
--- Makefile.target	(revision 5818)
+++ Makefile.target	(working copy)
@@ -659,7 +659,7 @@
 
 ifeq ($(TARGET_BASE_ARCH), i386)
 # Hardware support
-OBJS+= ide.o pckbd.o ps2.o vga.o $(SOUND_HW) dma.o
+OBJS+= ide.o pckbd.o ps2.o vga.o $(SOUND_HW) dma.o pci_dma.o
 OBJS+= fdc.o mc146818rtc.o serial.o i8259.o i8254.o pcspk.o pc.o
 OBJS+= cirrus_vga.o apic.o parallel.o acpi.o piix_pci.o
 OBJS+= usb-uhci.o vmmouse.o vmport.o vmware_vga.o
@@ -668,7 +668,7 @@
 ifeq ($(TARGET_BASE_ARCH), ppc)
 CPPFLAGS += -DHAS_AUDIO -DHAS_AUDIO_CHOICE
 # shared objects
-OBJS+= ppc.o ide.o vga.o $(SOUND_HW) dma.o openpic.o
+OBJS+= ppc.o ide.o vga.o $(SOUND_HW) dma.o openpic.o pci_dma.o
 # PREP target
 OBJS+= pckbd.o ps2.o serial.o i8259.o i8254.o fdc.o m48t59.o mc146818rtc.o
 OBJS+= prep_pci.o ppc_prep.o
@@ -686,7 +686,7 @@
 OBJS+= mips_timer.o mips_int.o dma.o vga.o serial.o i8254.o i8259.o rc4030.o
 OBJS+= g364fb.o jazz_led.o
 OBJS+= ide.o gt64xxx.o pckbd.o ps2.o fdc.o mc146818rtc.o usb-uhci.o acpi.o ds1225y.o
-OBJS+= piix_pci.o parallel.o cirrus_vga.o pcspk.o $(SOUND_HW)
+OBJS+= piix_pci.o parallel.o cirrus_vga.o pcspk.o pci_dma.o $(SOUND_HW)
 OBJS+= mipsnet.o
 OBJS+= pflash_cfi01.o
 CPPFLAGS += -DHAS_AUDIO -DHAS_AUDIO_CHOICE
@@ -710,7 +710,7 @@
 else
 OBJS+= sun4m.o tcx.o pcnet.o iommu.o m48t59.o slavio_intctl.o
 OBJS+= slavio_timer.o slavio_serial.o slavio_misc.o fdc.o sparc32_dma.o
-OBJS+= cs4231.o ptimer.o eccmemctl.o sbi.o sun4c_intctl.o
+OBJS+= cs4231.o ptimer.o eccmemctl.o sbi.o sun4c_intctl.o pci_dma.o
 endif
 endif
 ifeq ($(TARGET_BASE_ARCH), arm)
@@ -731,7 +731,7 @@
 OBJS+= nseries.o blizzard.o onenand.o vga.o cbus.o tusb6010.o usb-musb.o
 OBJS+= tsc2005.o bt-hci-csr.o
 OBJS+= mst_fpga.o mainstone.o
-OBJS+= musicpal.o pflash_cfi02.o
+OBJS+= musicpal.o pflash_cfi02.o pci_dma.o
 CPPFLAGS += -DHAS_AUDIO
 endif
 ifeq ($(TARGET_BASE_ARCH), sh4)
Index: exec.c
===================================================================
--- exec.c	(revision 5818)
+++ exec.c	(working copy)
@@ -2807,7 +2807,7 @@
 /* physical memory access (slow version, mainly for debug) */
 #if defined(CONFIG_USER_ONLY)
 void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
-                            int len, int is_write)
+                            size_t len, int is_write)
 {
     int l, flags;
     target_ulong page;
@@ -2848,7 +2848,7 @@
 
 #else
 void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
-                            int len, int is_write)
+                            size_t len, int is_write)
 {
     int l, io_index;
     uint8_t *ptr;
@@ -2938,6 +2938,111 @@
     }
 }
 
+uint8_t *cpu_physical_memory_can_dma(target_phys_addr_t addr,
+				     size_t len, int is_write,
+				     int alignment)
+{
+    int l, first = 1;
+    uint8_t *ptr = NULL;
+    target_phys_addr_t page;
+    unsigned long pd, pd_first = 0;
+    PhysPageDesc *p;
+
+    if (len & (alignment-1))
+	goto out;
+
+    while (len > 0) {
+	page = addr & TARGET_PAGE_MASK;
+	p = phys_page_find(page >> TARGET_PAGE_BITS);
+
+	l = (page + TARGET_PAGE_SIZE) - addr;
+        if (l > len)
+            l = len;
+
+        if (!p)
+            pd = IO_MEM_UNASSIGNED;
+        else
+            pd = p->phys_offset;
+
+        if (is_write) {
+            if ((pd & ~TARGET_PAGE_MASK) != IO_MEM_RAM)
+		return NULL;
+        } else {
+            if ((pd & ~TARGET_PAGE_MASK) > IO_MEM_ROM &&
+                !(pd & IO_MEM_ROMD))
+		return NULL;
+        }
+
+	if (first) {
+	    first = 0;
+	    ptr = phys_ram_base + (pd & TARGET_PAGE_MASK) +
+		(addr & ~TARGET_PAGE_MASK);
+	    if ((unsigned long)ptr & (alignment-1))
+		return NULL;
+	    pd_first = pd;
+	}
+
+	/* nonlinear range */
+	if (pd_first != pd)
+	    return NULL;
+	pd_first += TARGET_PAGE_SIZE;
+
+        len -= l;
+        addr += l;
+    }
+
+out:
+    return ptr;
+}
+
+void cpu_physical_memory_write_post_dma(target_phys_addr_t addr,
+					size_t len)
+{
+    int l;
+    uint8_t *ptr;
+    target_phys_addr_t page;
+    unsigned long pd;
+    PhysPageDesc *p;
+
+    while (len > 0) {
+        page = addr & TARGET_PAGE_MASK;
+
+        l = (page + TARGET_PAGE_SIZE) - addr;
+        if (l > len)
+            l = len;
+
+        p = phys_page_find(page >> TARGET_PAGE_BITS);
+        if (!p)
+            pd = IO_MEM_UNASSIGNED;
+        else
+            pd = p->phys_offset;
+
+	if ((pd & ~TARGET_PAGE_MASK) != IO_MEM_RAM) {
+	    printf("ERROR cpu_physical_memory_write_post_dma: memory layout changed\n");
+	    /* skip this page but fall through so len/addr still
+	       advance, avoiding an infinite loop */
+	} else {
+	    unsigned long addr1;
+	    addr1 = (pd & TARGET_PAGE_MASK) + (addr & ~TARGET_PAGE_MASK);
+	    /* RAM case */
+	    ptr = phys_ram_base + addr1;
+	    if (!cpu_physical_memory_is_dirty(addr1)) {
+		/* invalidate code */
+		tb_invalidate_phys_page_range(addr1, addr1 + l, 0);
+		/* set dirty bit */
+		phys_ram_dirty[addr1 >> TARGET_PAGE_BITS] |=
+		    (0xff & ~CODE_DIRTY_FLAG);
+	    }
+	    /* qemu doesn't execute guest code directly, but kvm does,
+	       therefore flush the instruction caches */
+	    if (kvm_enabled())
+		flush_icache_range((unsigned long)ptr,
+				   ((unsigned long)ptr)+l);
+        }
+        len -= l;
+        addr += l;
+    }
+}
+
 /* used for ROM loading : can write in RAM and ROM */
 void cpu_physical_memory_write_rom(target_phys_addr_t addr,
                                    const uint8_t *buf, int len)
Index: block.c
===================================================================
--- block.c	(revision 5818)
+++ block.c	(working copy)
@@ -53,6 +53,20 @@
                         uint8_t *buf, int nb_sectors);
 static int bdrv_write_em(BlockDriverState *bs, int64_t sector_num,
                          const uint8_t *buf, int nb_sectors);
+static BlockDriverAIOCB *bdrv_aio_readv_em(BlockDriverState *bs,
+					   int64_t sector_num,
+					   struct iovec *iov,
+					   int iovcnt,
+					   size_t len,
+					   BlockDriverCompletionFunc *cb,
+					   void *opaque);
+static BlockDriverAIOCB *bdrv_aio_writev_em(BlockDriverState *bs,
+					    int64_t sector_num,
+					    struct iovec *iov,
+					    int iovcnt,
+					    size_t len,
+					    BlockDriverCompletionFunc *cb,
+					    void *opaque);
 
 BlockDriverState *bdrv_first;
 
@@ -135,6 +149,9 @@
         /* add synchronous IO emulation layer */
         bdrv->bdrv_read = bdrv_read_em;
         bdrv->bdrv_write = bdrv_write_em;
+        bdrv->bdrv_aio_readv = bdrv_aio_readv_em; /* FIXME */
+        bdrv->bdrv_aio_writev = bdrv_aio_writev_em; /* FIXME */
+        bdrv->bdrv_aio_cancel = bdrv_aio_cancel_em; /* FIXME */
     }
     bdrv->next = first_drv;
     first_drv = bdrv;
@@ -1291,7 +1308,51 @@
     drv->bdrv_aio_cancel(acb);
 }
 
+BlockDriverAIOCB *bdrv_aio_readv(BlockDriverState *bs, int64_t sector_num,
+				 struct iovec *iov, int iovcnt, size_t len,
+				 BlockDriverCompletionFunc *cb, void *opaque)
+{
+    BlockDriver *drv = bs->drv;
+    BlockDriverAIOCB *ret;
+    
 
+    if (!drv)
+        return NULL;
+
+    ret = drv->bdrv_aio_readv(bs, sector_num, iov, iovcnt, len, cb, opaque);
+
+    if (ret) {
+	/* Update stats even though technically transfer has not happened. */
+	bs->rd_bytes += (unsigned) len;
+	bs->rd_ops ++;
+    }
+
+    return ret;
+}
+
+BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
+				  struct iovec *iov, int iovcnt, size_t len,
+				  BlockDriverCompletionFunc *cb, void *opaque)
+{
+    BlockDriver *drv = bs->drv;
+    BlockDriverAIOCB *ret;
+    
+
+    if (!drv)
+        return NULL;
+
+    ret = drv->bdrv_aio_writev(bs, sector_num, iov, iovcnt, len, cb, opaque);
+
+    if (ret) {
+	/* Update stats even though technically transfer has not happened. */
+	bs->wr_bytes += (unsigned) len;
+	bs->wr_ops ++;
+    }
+
+    return ret;
+}
+
+
 /**************************************************************/
 /* async block device emulation */
 
@@ -1341,6 +1402,74 @@
     qemu_aio_release(acb);
 }
 
+static void bdrv_aio_iov_bh_cb(void *opaque)
+{
+    BlockDriverAIOCBSync *acb = opaque;
+    acb->common.cb(acb->common.opaque, acb->ret);
+    qemu_bh_delete(acb->bh);
+    qemu_free(acb);
+}
+
+static BlockDriverAIOCB *bdrv_aio_readv_em(BlockDriverState *bs,
+					   int64_t sector_num,
+					   struct iovec *iov,
+					   int iovcnt,
+					   size_t len,
+					   BlockDriverCompletionFunc *cb,
+					   void *opaque)
+{
+    BlockDriverAIOCBSync *acb;
+    int ret = -1, idx;
+
+    for (idx = 0; idx < iovcnt; idx++) {
+	size_t sectors = iov[idx].iov_len >> SECTOR_BITS;
+	ret = bdrv_read(bs, sector_num, iov[idx].iov_base, sectors);
+	if (ret)
+	    break;
+	sector_num += sectors;
+    }
+    acb = qemu_mallocz(sizeof(BlockDriverAIOCBSync));
+    if (!acb)
+            return NULL;
+    acb->common.bs = bs;
+    acb->common.cb = cb;
+    acb->common.opaque = opaque;
+    acb->bh = qemu_bh_new(bdrv_aio_iov_bh_cb, acb);
+    acb->ret = ret;
+    qemu_bh_schedule(acb->bh);
+    return &acb->common;
+}
+
+static BlockDriverAIOCB *bdrv_aio_writev_em(BlockDriverState *bs,
+					    int64_t sector_num,
+					    struct iovec *iov,
+					    int iovcnt,
+					    size_t len,
+					    BlockDriverCompletionFunc *cb,
+					    void *opaque)
+{
+    BlockDriverAIOCBSync *acb;
+    int ret = -1, idx;
+
+    for (idx = 0; idx < iovcnt; idx++) {
+	size_t sectors = iov[idx].iov_len >> SECTOR_BITS;
+	ret = bdrv_write(bs, sector_num, iov[idx].iov_base, sectors);
+	if (ret)
+	    break;
+	sector_num += sectors;
+    }
+    acb = qemu_mallocz(sizeof(BlockDriverAIOCBSync));
+    if (!acb)
+            return NULL;
+    acb->common.bs = bs;
+    acb->common.cb = cb;
+    acb->common.opaque = opaque;
+    acb->bh = qemu_bh_new(bdrv_aio_iov_bh_cb, acb);
+    acb->ret = ret;
+    qemu_bh_schedule(acb->bh);
+    return &acb->common;
+}
+
 /**************************************************************/
 /* sync block device emulation */
 
Index: block.h
===================================================================
--- block.h	(revision 5818)
+++ block.h	(working copy)
@@ -83,6 +83,13 @@
 /* async block I/O */
 typedef struct BlockDriverAIOCB BlockDriverAIOCB;
 typedef void BlockDriverCompletionFunc(void *opaque, int ret);
+typedef BlockDriverAIOCB *BlockDriverAIOIOV(BlockDriverState *bs,
+					    int64_t sector_num,
+					    struct iovec *iov,
+					    int iovcnt,
+					    size_t len,
+					    BlockDriverCompletionFunc *cb,
+					    void *opaque);
 
 BlockDriverAIOCB *bdrv_aio_read(BlockDriverState *bs, int64_t sector_num,
                                 uint8_t *buf, int nb_sectors,
@@ -91,6 +98,12 @@
                                  const uint8_t *buf, int nb_sectors,
                                  BlockDriverCompletionFunc *cb, void *opaque);
 void bdrv_aio_cancel(BlockDriverAIOCB *acb);
+BlockDriverAIOCB *bdrv_aio_readv(BlockDriverState *bs, int64_t sector_num,
+				 struct iovec *iov, int iovcnt, size_t len,
+				 BlockDriverCompletionFunc *cb, void *opaque);
+BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
+				  struct iovec *iov, int iovcnt, size_t len,
+				  BlockDriverCompletionFunc *cb, void *opaque);
 
 int qemu_key_check(BlockDriverState *bs, const char *name);
 
Index: hw/ide.c
===================================================================
--- hw/ide.c	(revision 5818)
+++ hw/ide.c	(working copy)
@@ -31,6 +31,7 @@
 #include "qemu-timer.h"
 #include "sysemu.h"
 #include "ppc_mac.h"
+#include "pci_dma.h"
 
 /* debug IDE devices */
 //#define DEBUG_IDE
@@ -480,12 +481,15 @@
     struct PCIIDEState *pci_dev;
     /* current transfer state */
     uint32_t cur_addr;
-    uint32_t cur_prd_last;
-    uint32_t cur_prd_addr;
-    uint32_t cur_prd_len;
+    uint32_t cur_prd_last; /* fixme delete */
+    uint32_t cur_prd_addr; /* fixme delete */
+    uint32_t cur_prd_len; /* fixme delete */
     IDEState *ide_if;
     BlockDriverCompletionFunc *dma_cb;
     BlockDriverAIOCB *aiocb;
+    QEMUPciDmaSg sg[IDE_DMA_BUF_SECTORS];
+    BlockDriverAIOIOV *bdrv_aio_iov;
+    int dma_to_memory;
 } BMDMAState;
 
 typedef struct PCIIDEState {
@@ -870,7 +874,7 @@
 }
 
 /* return 0 if buffer completed */
-static int dma_buf_rw(BMDMAState *bm, int is_write)
+static int dma_buf_rw(BMDMAState *bm, int is_write) /* fixme delete */
 {
     IDEState *s = bm->ide_if;
     struct {
@@ -917,61 +921,80 @@
     return 1;
 }
 
-static void ide_read_dma_cb(void *opaque, int ret)
+static int build_dma_sg(BMDMAState *bm)
 {
-    BMDMAState *bm = opaque;
-    IDEState *s = bm->ide_if;
-    int n;
-    int64_t sector_num;
+    struct {
+        uint32_t addr;
+        uint32_t size;
+    } prd;
+    int len;
+    int idx;
 
-    if (ret < 0) {
-	ide_dma_error(s);
-	return;
+    for (idx = 1; idx <= IDE_DMA_BUF_SECTORS; idx++) {
+	cpu_physical_memory_read(bm->cur_addr, (uint8_t *)&prd, 8);
+	bm->cur_addr += 8;
+	bm->sg[idx-1].addr = le32_to_cpu(prd.addr);
+	prd.size = le32_to_cpu(prd.size);
+	len = prd.size & 0xfffe;
+	if (len == 0)
+	    len = 0x10000;
+	bm->sg[idx-1].len = len;
+	/* end of table (with a fail safe of one page) */
+	if ((prd.size & 0x80000000) ||
+	    (bm->cur_addr - bm->addr) >= 4096)
+	    break;
     }
+    if (idx > IDE_DMA_BUF_SECTORS) {
+	printf("build_dma_sg: too many sg entries\n");
+	/* clamp so the caller never walks past the end of bm->sg[] */
+	idx = IDE_DMA_BUF_SECTORS;
+    }
+    return idx;
+}
 
-    n = s->io_buffer_size >> 9;
-    sector_num = ide_get_sector(s);
-    if (n > 0) {
-        sector_num += n;
-        ide_set_sector(s, sector_num);
-        s->nsector -= n;
-        if (dma_buf_rw(bm, 1) == 0)
-            goto eot;
-    }
+static void ide_dma_complete(void *opaque, int ret)
+{
+    BMDMAState *bm = opaque;
+    IDEState *s = bm->ide_if;
 
+    bm->bdrv_aio_iov = NULL;
+    bm->ide_if = NULL;
+    bm->aiocb = NULL;
     /* end of transfer ? */
-    if (s->nsector == 0) {
+    if (s->nsector == 0 && !ret) {
         s->status = READY_STAT | SEEK_STAT;
         ide_set_irq(s);
-    eot:
         bm->status &= ~BM_STATUS_DMAING;
         bm->status |= BM_STATUS_INT;
-        bm->dma_cb = NULL;
-        bm->ide_if = NULL;
-        bm->aiocb = NULL;
-        return;
+    } else {
+	ide_dma_error(s);
+	printf("ide_dma_complete error: nsector %d err %d\n", s->nsector, ret);
     }
+}
 
-    /* launch next transfer */
-    n = s->nsector;
-    if (n > IDE_DMA_BUF_SECTORS)
-        n = IDE_DMA_BUF_SECTORS;
-    s->io_buffer_index = 0;
-    s->io_buffer_size = n * 512;
+static int ide_dma_submit(void *opaque, struct iovec *dma_iov,
+			  int iovcnt, size_t len,
+			  BlockDriverCompletionFunc dma_cb,
+			  void *dma_cb_param)
+{
+    BMDMAState *bm = opaque;
+    IDEState *s = bm->ide_if;
+    size_t sectors;
+    int64_t sector_num;
+
+    sectors = len >> 9;
+    if (s->nsector < sectors)
+	return -3000;
+    sector_num = ide_get_sector(s);
+    ide_set_sector(s, sector_num  + sectors);
+    s->nsector -= sectors;
+
 #ifdef DEBUG_AIO
-    printf("aio_read: sector_num=%lld n=%d\n", sector_num, n);
+    printf("ide_dma_submit: sector_num=%lld n=%zu\n", sector_num, sectors);
 #endif
-    bm->aiocb = bdrv_aio_read(s->bs, sector_num, s->io_buffer, n,
-                              ide_read_dma_cb, bm);
-    ide_dma_submit_check(s, ide_read_dma_cb, bm);
-}
+    bm->aiocb = bm->bdrv_aio_iov(s->bs, sector_num, dma_iov, iovcnt, len,
+				 dma_cb, dma_cb_param);
+    if (!bm->aiocb)
+	return -3001;
 
-static void ide_sector_read_dma(IDEState *s)
-{
-    s->status = READY_STAT | SEEK_STAT | DRQ_STAT | BUSY_STAT;
-    s->io_buffer_index = 0;
-    s->io_buffer_size = 0;
-    ide_dma_start(s, ide_read_dma_cb);
+    return 0;
 }
 
 static void ide_sector_write_timer_cb(void *opaque)
@@ -1028,62 +1051,31 @@
     }
 }
 
-static void ide_write_dma_cb(void *opaque, int ret)
+static void ide_sector_dma_start(BMDMAState *bm)
 {
-    BMDMAState *bm = opaque;
-    IDEState *s = bm->ide_if;
-    int n;
-    int64_t sector_num;
-
-    if (ret < 0) {
-	ide_dma_error(s);
-	return;
-    }
-
-    n = s->io_buffer_size >> 9;
-    sector_num = ide_get_sector(s);
-    if (n > 0) {
-        sector_num += n;
-        ide_set_sector(s, sector_num);
-        s->nsector -= n;
-    }
-
-    /* end of transfer ? */
-    if (s->nsector == 0) {
-        s->status = READY_STAT | SEEK_STAT;
-        ide_set_irq(s);
-    eot:
-        bm->status &= ~BM_STATUS_DMAING;
-        bm->status |= BM_STATUS_INT;
-        bm->dma_cb = NULL;
-        bm->ide_if = NULL;
-        bm->aiocb = NULL;
-        return;
-    }
-
-    /* launch next transfer */
-    n = s->nsector;
-    if (n > IDE_DMA_BUF_SECTORS)
-        n = IDE_DMA_BUF_SECTORS;
-    s->io_buffer_index = 0;
-    s->io_buffer_size = n * 512;
-
-    if (dma_buf_rw(bm, 0) == 0)
-        goto eot;
-#ifdef DEBUG_AIO
-    printf("aio_write: sector_num=%lld n=%d\n", sector_num, n);
-#endif
-    bm->aiocb = bdrv_aio_write(s->bs, sector_num, s->io_buffer, n,
-                               ide_write_dma_cb, bm);
-    ide_dma_submit_check(s, ide_write_dma_cb, bm);
+    int iovcnt = build_dma_sg(bm);
+    pci_dma_sg((PCIDevice *)bm->pci_dev, bm->sg, iovcnt,
+	       ide_dma_submit, ide_dma_complete, bm,
+	       bm->dma_to_memory, 512);
 }
 
-static void ide_sector_write_dma(IDEState *s)
+static void ide_sector_dma(IDEState *s, BlockDriverAIOIOV *bdrv_aio_iov,
+			   int dma_to_memory)
 {
+    BMDMAState *bm = s->bmdma;
+    if(!bm)
+	goto err;
+
     s->status = READY_STAT | SEEK_STAT | DRQ_STAT | BUSY_STAT;
-    s->io_buffer_index = 0;
-    s->io_buffer_size = 0;
-    ide_dma_start(s, ide_write_dma_cb);
+    bm->ide_if = s;
+    bm->bdrv_aio_iov = bdrv_aio_iov;
+    bm->dma_to_memory = dma_to_memory;
+    if (bm->status & BM_STATUS_DMAING)
+	ide_sector_dma_start(bm);
+    return;
+
+err:
+    ide_dma_error(s);
 }
 
 static void ide_atapi_cmd_ok(IDEState *s)
@@ -2219,7 +2211,7 @@
             if (!s->bs)
                 goto abort_cmd;
 	    ide_cmd_lba48_transform(s, lba48);
-            ide_sector_read_dma(s);
+            ide_sector_dma(s, bdrv_aio_readv, 1);
             break;
 	case WIN_WRITEDMA_EXT:
 	    lba48 = 1;
@@ -2228,7 +2220,7 @@
             if (!s->bs)
                 goto abort_cmd;
 	    ide_cmd_lba48_transform(s, lba48);
-            ide_sector_write_dma(s);
+            ide_sector_dma(s, bdrv_aio_writev, 0);
             s->media_changed = 1;
             break;
         case WIN_READ_NATIVE_MAX_EXT:
@@ -2852,6 +2844,7 @@
         /* cancel DMA request */
         bm->ide_if = NULL;
         bm->dma_cb = NULL;
+	bm->bdrv_aio_iov = NULL;
         if (bm->aiocb) {
 #ifdef DEBUG_AIO
             printf("aio_cancel\n");
@@ -2876,7 +2869,9 @@
         if (!(bm->status & BM_STATUS_DMAING)) {
             bm->status |= BM_STATUS_DMAING;
             /* start dma transfer if possible */
-            if (bm->dma_cb)
+	    if (bm->bdrv_aio_iov)
+		ide_sector_dma_start(bm);
+            else if (bm->dma_cb)
                 bm->dma_cb(bm, 0);
         }
         bm->cmd = val & 0x09;
Index: cpu-all.h
===================================================================
--- cpu-all.h	(revision 5818)
+++ cpu-all.h	(working copy)
@@ -891,14 +891,19 @@
 CPUReadMemoryFunc **cpu_get_io_memory_read(int io_index);
 
 void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
-                            int len, int is_write);
+                            size_t len, int is_write);
+void cpu_physical_memory_write_post_dma(target_phys_addr_t addr,
+					size_t len);
+uint8_t *cpu_physical_memory_can_dma(target_phys_addr_t addr,
+				     size_t len, int is_write,
+				     int alignment);
 static inline void cpu_physical_memory_read(target_phys_addr_t addr,
-                                            uint8_t *buf, int len)
+                                            uint8_t *buf, size_t len)
 {
     cpu_physical_memory_rw(addr, buf, len, 0);
 }
 static inline void cpu_physical_memory_write(target_phys_addr_t addr,
-                                             const uint8_t *buf, int len)
+                                             const uint8_t *buf, size_t len)
 {
     cpu_physical_memory_rw(addr, (uint8_t *)buf, len, 1);
 }
diff --git a/qemu/hw/pci_dma.c b/qemu/hw/pci_dma.c
new file mode 100644
index 0000000..48762a8
--- /dev/null
+++ b/qemu/hw/pci_dma.c
@@ -0,0 +1,366 @@
+/*
+ * QEMU PCI DMA operations
+ *
+ * Copyright (c) 2008 Red Hat, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#include <assert.h>
+#include "pci_dma.h"
+
+#define MAX_DMA_BOUNCE_BUFFER (1024*1024)
+//#define DEBUG_BOUNCE
+//#define MAX_DMA_BOUNCE_BUFFER (512)
+//#define DISABLE_IOVEC_CACHE
+
+typedef struct QEMUPciDmaSgParam {
+    QEMUPciDmaSgSubmit *pci_dma_sg_submit;
+    QEMUPciDmaSgComplete *pci_dma_sg_complete;
+    void *pci_dma_sg_opaque;
+    int dma_to_memory;
+    int alignment;
+    uint8_t *bounce;
+    QEMUPciDmaSg *sg;
+    int iovcnt;
+    int restart_iovcnt;
+    size_t restart_offset;
+    int curr_restart_iovcnt;
+    size_t curr_restart_offset;
+    size_t curr_len;
+#ifndef DISABLE_IOVEC_CACHE
+    struct QEMUPciDmaSgParam *next;
+#endif
+    struct iovec iov;
+} QEMUPciDmaSgParam;
+
+#ifndef DISABLE_IOVEC_CACHE
+/*
+ * Too many entries will run slower and waste memory, this is really
+ * only about the fast path so it must be small, slow path is fine to
+ * allocate dynamically. Max memory used is slightly more than
+ * MAX_IOVEC_ENTRIES * MAX_IOVEC_IOVCNT * sizeof(struct iovec).
+ */
+#define MAX_IOVEC_ENTRIES 10
+
+/*
+ * Don't cache exceptionally large iovcnt used for huge DMA transfers
+ * as the DMA transfer may take much longer than malloc and huge
+ * memory could be wasted if it happens only once in a while.
+ */
+#define MAX_IOVEC_IOVCNT 2048
+
+static QEMUPciDmaSgParam *sg_list;
+static long max_sg_in_flight, sg_in_flight;
+
+static QEMUPciDmaSgParam *qemu_get_pci_dma_sg(int iovcnt)
+{
+    QEMUPciDmaSgParam *entry = sg_list, *last;
+
+    while (entry) {
+	if (entry->iovcnt >= iovcnt) {
+	    if (entry == sg_list)
+		sg_list = sg_list->next;
+	    else
+		last->next = entry->next;
+	    goto found;
+	}
+	last = entry;
+	entry = entry->next;
+    }
+
+    entry = qemu_malloc(sizeof(QEMUPciDmaSgParam) +
+			sizeof(struct iovec) * (iovcnt-1));
+    if (!entry)
+	return NULL;
+
+    if (iovcnt <= MAX_IOVEC_IOVCNT) {
+    found:
+	sg_in_flight += 1;
+	if (sg_in_flight > max_sg_in_flight)
+	    max_sg_in_flight = sg_in_flight;
+    }
+    return entry;
+}
+
+static void qemu_release_pci_dma_sg(QEMUPciDmaSgParam *this)
+{
+    QEMUPciDmaSgParam *min_entry = NULL, *entry = sg_list;
+    QEMUPciDmaSgParam *min_last = NULL, *last = NULL;
+    unsigned int min_iovcnt = -1;
+    int nr = 0, tot;
+
+    if (this->iovcnt > MAX_IOVEC_IOVCNT) {
+	qemu_free(this);
+	return;
+    }
+
+    while (entry) {
+	nr += 1;
+	if ((unsigned int)entry->iovcnt <= min_iovcnt) {
+	    min_entry = entry;
+	    min_last = last;
+	    min_iovcnt = entry->iovcnt;
+	}
+	last = entry;
+	entry = entry->next;
+    }
+
+    assert(max_sg_in_flight > 0);
+    assert(sg_in_flight > 0);
+    tot = nr+sg_in_flight; 
+    if (tot > max_sg_in_flight || tot > MAX_IOVEC_ENTRIES) {
+	/* detail: replace even if it's equal as it's cache hot */
+	if ((unsigned int)this->iovcnt < min_iovcnt)
+	    qemu_free(this);
+	else {
+	    assert(nr > 0);
+	    if (min_entry == sg_list) {
+		this->next = sg_list->next;
+	    } else {
+		min_last->next = min_entry->next;
+		this->next = sg_list;
+	    }
+	    sg_list = this;
+	    qemu_free(min_entry);
+	}
+    } else {
+	this->next = sg_list;
+	sg_list = this;
+    }
+    sg_in_flight -= 1;
+    assert(sg_in_flight >= 0);
+}
+#else /* DISABLE_IOVEC_CACHE */
+#define qemu_get_pci_dma_sg(iovcnt) qemu_malloc(sizeof(QEMUPciDmaSgParam)+(sizeof(struct iovec)*((iovcnt)-1))) 
+#define qemu_release_pci_dma_sg(param) qemu_free(param)
+#endif /* DISABLE_IOVEC_CACHE */
+
+static int pci_dma_sg_map_direct(QEMUPciDmaSg *sg,
+				 int iovcnt,
+				 int dma_to_memory,
+				 int alignment,
+				 size_t *len,
+				 struct iovec *dma_iov)
+{
+    int idx = 0;
+    size_t _len = 0;
+
+#ifdef DEBUG_BOUNCE
+    return 0;
+#endif
+
+    for (idx = 0; idx < iovcnt; idx++) {
+	void * addr;
+
+	if (_len + sg[idx].len <= _len)
+	    return 0;
+	_len += sg[idx].len;
+
+	addr = cpu_physical_memory_can_dma(sg[idx].addr,
+					   sg[idx].len,
+					   dma_to_memory,
+					   alignment);
+	if (!addr)
+	    return 0;
+
+	dma_iov[idx].iov_base = addr;
+	dma_iov[idx].iov_len = sg[idx].len;
+    }
+
+    *len = _len;
+    return 1;
+}
+
+static int pci_dma_sg_map_bounce(QEMUPciDmaSgParam *param)
+{
+    int idx;
+    size_t len = 0;
+
+    param->curr_restart_iovcnt = param->restart_iovcnt;
+    param->curr_restart_offset = param->restart_offset;
+
+    for (idx = param->restart_iovcnt; idx < param->iovcnt; idx++) {
+	if (len + param->sg[idx].len <= len)
+	    return 0;
+	len += param->sg[idx].len - param->restart_offset;
+	param->restart_offset = 0;
+	if (len > MAX_DMA_BOUNCE_BUFFER) {
+	    size_t leftover = len - MAX_DMA_BOUNCE_BUFFER;
+	    param->restart_offset = param->sg[idx].len - leftover;
+	    len = MAX_DMA_BOUNCE_BUFFER;
+	    break;
+	}
+    }
+    param->restart_iovcnt = idx;
+    param->curr_len = len;
+
+    if (len & (param->alignment-1))
+	return 0;
+
+    param->iov.iov_len = len;
+    if (!param->bounce) {
+	param->bounce = qemu_memalign(param->alignment, len);
+	if (!param->bounce)
+	    return 0;
+	param->iov.iov_base = param->bounce;
+    }
+
+    if (!param->dma_to_memory) {
+	int idx;
+	size_t offset = 0;
+	for (idx = param->curr_restart_iovcnt;
+	     idx < param->iovcnt && offset < len; idx++) {
+	    size_t copy_len = param->sg[idx].len - param->curr_restart_offset;
+	    if (offset+copy_len > len)
+		copy_len = len - offset;
+	    cpu_physical_memory_read(param->sg[idx].addr + 
+				     param->curr_restart_offset,
+				     param->bounce + offset,
+				     copy_len);
+	    param->curr_restart_offset = 0;
+	    offset += copy_len;
+	}
+    }
+
+    return 1;
+}
+
+static void pci_dma_sg_unmap_direct(QEMUPciDmaSgParam *param, int ret)
+{
+    if (!ret && param->dma_to_memory) {
+	int idx;
+	QEMUPciDmaSg *sg = param->sg;
+	for (idx = 0; idx < param->iovcnt; idx++)
+	    cpu_physical_memory_write_post_dma(sg[idx].addr,
+					       sg[idx].len);
+    }
+}
+
+static int pci_dma_sg_unmap_bounce(QEMUPciDmaSgParam *param, int ret)
+{
+    if (!ret && param->dma_to_memory) {
+	int idx;
+	size_t offset = 0;
+	for (idx = param->curr_restart_iovcnt;
+	     idx < param->iovcnt && offset < param->curr_len; idx++) {
+	    size_t copy_len = param->sg[idx].len - param->curr_restart_offset;
+	    if (offset+copy_len > param->curr_len)
+		copy_len = param->curr_len - offset;
+	    cpu_physical_memory_write(param->sg[idx].addr +
+				      param->curr_restart_offset,
+				      param->bounce + offset,
+				      copy_len);
+	    param->curr_restart_offset = 0;
+	    offset += copy_len;
+	}
+    }
+    if (param->restart_iovcnt == param->iovcnt || ret) {
+	qemu_free(param->bounce);
+	return 0;
+    }
+    return 1;
+}
+
+static void pci_dma_sg_cb(void *opaque, int ret)
+{
+    QEMUPciDmaSgParam *param = opaque;
+    int restart = 0;
+
+    if (!param->bounce)
+	pci_dma_sg_unmap_direct(param, ret);
+    else
+	restart = pci_dma_sg_unmap_bounce(param, ret);
+
+    if (restart) {
+	ret = -1000;
+	if (!pci_dma_sg_map_bounce(param)) {
+	    qemu_free(param->bounce);
+	    goto out_free;
+	}
+	ret = param->pci_dma_sg_submit(param->pci_dma_sg_opaque,
+				       &param->iov, 1,
+				       param->curr_len,
+				       pci_dma_sg_cb,
+				       param);
+    }
+    if (ret || !restart) {
+    out_free:
+	param->pci_dma_sg_complete(param->pci_dma_sg_opaque, ret);
+	qemu_release_pci_dma_sg(param);
+    }
+}
+
+/* PCIDevice is there in case we want to emulate an iommu later */
+void pci_dma_sg(PCIDevice *pci_dev,
+		QEMUPciDmaSg *sg, int iovcnt,
+		QEMUPciDmaSgSubmit pci_dma_sg_submit,
+		QEMUPciDmaSgComplete pci_dma_sg_complete,
+		void *pci_dma_sg_opaque,
+		int dma_to_memory, int alignment)
+{
+    int ret;
+    QEMUPciDmaSgParam *param;
+
+    ret = -2000;
+    if ((unsigned int) dma_to_memory > 1)
+	goto err;
+    if ((unsigned int) alignment > MAX_DMA_BOUNCE_BUFFER)
+	goto err;
+    if (iovcnt < 1)
+	goto err;
+
+    param = qemu_get_pci_dma_sg(iovcnt);
+    if (!param)
+	goto err;
+
+    param->pci_dma_sg_submit = pci_dma_sg_submit;
+    param->pci_dma_sg_complete = pci_dma_sg_complete;
+    param->pci_dma_sg_opaque = pci_dma_sg_opaque;
+    param->dma_to_memory = dma_to_memory;
+    param->alignment = alignment;
+    param->bounce = NULL;
+    param->sg = sg;
+    param->iovcnt = iovcnt;
+    param->restart_offset = param->restart_iovcnt = 0;
+
+    /* map the sg */
+    if (!pci_dma_sg_map_direct(sg, iovcnt,
+			       dma_to_memory, alignment,
+			       &param->curr_len, &param->iov)) {
+	ret = -2004;
+	if (!pci_dma_sg_map_bounce(param))
+	    goto out_free;
+	iovcnt = 1;
+    }
+
+    /* run the I/O */
+    ret = pci_dma_sg_submit(pci_dma_sg_opaque,
+			    &param->iov, iovcnt, param->curr_len,
+			    pci_dma_sg_cb,
+			    param);
+    if (ret)
+    out_free:
+	pci_dma_sg_cb(param, ret);
+    return;
+
+ err:
+    pci_dma_sg_complete(pci_dma_sg_opaque, ret);
+    return;
+}
diff --git a/qemu/hw/pci_dma.h b/qemu/hw/pci_dma.h
new file mode 100644
index 0000000..5cc8413
--- /dev/null
+++ b/qemu/hw/pci_dma.h
@@ -0,0 +1,29 @@
+#ifndef QEMU_PCI_DMA_H
+#define QEMU_PCI_DMA_H
+
+#include "qemu-common.h"
+#include "block.h"
+#include <sys/uio.h> /* struct iovec */
+
+typedef int QEMUPciDmaSgSubmit(void *pci_dma_sg_opaque,
+			       struct iovec *iov, int iovcnt,
+			       size_t len,
+			       BlockDriverCompletionFunc dma_cb,
+			       void *dma_cb_param);
+
+typedef void QEMUPciDmaSgComplete(void *pci_dma_sg_opaque, int ret);
+
+typedef struct QEMUPciDmaSg {
+    target_phys_addr_t addr;
+    size_t len;
+} QEMUPciDmaSg;
+
+/* pci_dma.c */
+void pci_dma_sg(PCIDevice *pci_dev,
+		QEMUPciDmaSg *sg, int iovcnt,
+		QEMUPciDmaSgSubmit *pci_dma_sg_submit,
+		QEMUPciDmaSgComplete *pci_dma_sg_complete,
+		void *pci_dma_sg_opaque,
+		int dma_to_memory, int alignment);
+
+#endif

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [RFC 1/2] pci-dma-api-v1
  2008-11-28 19:03         ` Blue Swirl
  2008-11-28 19:18           ` Jamie Lokier
@ 2008-11-30 18:04           ` Andrea Arcangeli
  1 sibling, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2008-11-30 18:04 UTC (permalink / raw)
  To: Blue Swirl; +Cc: qemu-devel

On Fri, Nov 28, 2008 at 09:03:06PM +0200, Blue Swirl wrote:
> There's also lio_listio that provides for vectored AIO.

I discussed this in my answer to Jamie: there is still no
LIO_READV/WRITEV, i.e. no way to submit a 'struct iovec' to the kernel
with it, which is a must for performance with cache=off.

> >  > Anthony's second version:
> >  > http://lists.gnu.org/archive/html/qemu-devel/2008-04/msg00077.html

Actually this version of the emulated bdrv_writev/readv should run
faster, thanks to malloc+memcpy, than having no memcpy but running
more syscalls. I opted for an emulated bdrv_aio_readv/writev that does
true zerocopy instead. But it doesn't make a whole lot of difference,
as neither one should run on any host kernel supporting the
readv/writev syscalls; this is just so we can test the rest of the
zerocopy dma api. The bdrv_aio_readv/writev support has to be kept
separate from the pci dma api anyway, and I surely expect my version
of bdrv_aio_readv/writev to be rejected, as I think all qemu targets
support at least the pthread posix API and the readv/writev syscalls,
allowing us to avoid hacks like my current _em.

> Perhaps you could point out why the previous attempts failed, but
> yours won't? ;-)

One can always hope to be luckier? ;)

Seriously, just try to apply my last patch to your qemu tree (there
are rejects in the Makefile.target ppc section on kvm, but it'll work
on kvm too for x86* targets) and try to test it. As an example, also
look at the IDE code below: quite an improvement over the current code
IMHO, and it leaves all aiocb knowledge outside of the dma API itself,
as it has to be.

static int build_dma_sg(BMDMAState *bm)
{
    struct {
        uint32_t addr;
        uint32_t size;
    } prd;
    int len;
    int idx;

    for (idx = 1; idx <= IDE_DMA_BUF_SECTORS; idx++) {
	cpu_physical_memory_read(bm->cur_addr, (uint8_t *)&prd, 8);
	bm->cur_addr += 8;
	bm->sg[idx-1].addr = le32_to_cpu(prd.addr);
	prd.size = le32_to_cpu(prd.size);
	len = prd.size & 0xfffe;
	if (len == 0)
	    len = 0x10000;
	bm->sg[idx-1].len = len;
	/* end of table (with a fail safe of one page) */
	if ((prd.size & 0x80000000) ||
	    (bm->cur_addr - bm->addr) >= 4096)
	    break;
    }
    if (idx > IDE_DMA_BUF_SECTORS)
	printf("build_dma_sg: too many sg entries\n");
    return idx;
}

static void ide_dma_complete(void *opaque, int ret)
{
    BMDMAState *bm = opaque;
    IDEState *s = bm->ide_if;

    bm->bdrv_aio_iov = NULL;
    bm->ide_if = NULL;
    bm->aiocb = NULL;
    /* end of transfer ? */
    if (s->nsector == 0 && !ret) {
        s->status = READY_STAT | SEEK_STAT;
        ide_set_irq(s);
        bm->status &= ~BM_STATUS_DMAING;
        bm->status |= BM_STATUS_INT;
    } else {
	ide_dma_error(s);
	printf("ide_dma_complete error: nsector %d err %d\n", s->nsector, ret);
    }
}

static int ide_dma_submit(void *opaque, struct iovec *dma_iov,
			  int iovcnt, size_t len,
			  BlockDriverCompletionFunc dma_cb,
			  void *dma_cb_param)
{
    BMDMAState *bm = opaque;
    IDEState *s = bm->ide_if;
    size_t sectors;
    int64_t sector_num;

    sectors = len >> 9;
    if (s->nsector < sectors)
	return -3000;
    sector_num = ide_get_sector(s);
    ide_set_sector(s, sector_num  + sectors);
    s->nsector -= sectors;

#ifdef DEBUG_AIO
    printf("ide_dma_submit: sector_num=%lld n=%zu\n", (long long)sector_num, sectors);
#endif
    bm->aiocb = bm->bdrv_aio_iov(s->bs, sector_num, dma_iov, iovcnt, len,
				 dma_cb, dma_cb_param);
    if (!bm->aiocb)
	return -3001;

    return 0;
}


* [Qemu-devel] Re: [RFC 1/1] pci-dma-api-v2
  2008-11-30 17:41         ` [Qemu-devel] [RFC 1/1] pci-dma-api-v2 Andrea Arcangeli
@ 2008-11-30 18:36           ` Blue Swirl
  2008-11-30 19:04             ` Andrea Arcangeli
  2008-11-30 22:50           ` [Qemu-devel] " Anthony Liguori
  1 sibling, 1 reply; 32+ messages in thread
From: Blue Swirl @ 2008-11-30 18:36 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: qemu-devel

On 11/30/08, Andrea Arcangeli <aarcange@redhat.com> wrote:
> On Fri, Nov 28, 2008 at 07:50:01PM +0100, Andrea Arcangeli wrote:
>  > better. I've also noticed some problems with windows (I didn't test
>  > windows before posting), those aren't related to the cache layer as I
>  > added a #define to disable it and replace it with malloc/free. But
>  > that's not the cache layer, as soon as windows runs completely
>  > flawlessy I post an update.
>
>  As promised here an updated. I fixed all outstanding issues, it's
>  running rock solid. Before there were a few spots on ide.c plus the
>  cache layer was very buggy with more than one dma in-flight.

The patch does not apply as is:
>  --- hw/ide.c    (revision 5818)
>  +++ hw/ide.c    (working copy)
>  diff --git a/qemu/hw/pci_dma.c b/qemu/hw/pci_dma.c
>  new file mode 100644
>  index 0000000..48762a8
>  --- /dev/null
>  +++ b/qemu/hw/pci_dma.c
>  diff --git a/qemu/hw/pci_dma.h b/qemu/hw/pci_dma.h
>  new file mode 100644
>  index 0000000..5cc8413
>  --- /dev/null
>  +++ b/qemu/hw/pci_dma.h

Even as I fixed the patch, it still does not compile, for example:
/src/qemu/block.c:1335: warning: 'struct iovec' declared inside parameter list
/src/qemu/block.c:1336: error: conflicting types for 'bdrv_aio_writev'
/src/qemu/block.h:106: error: previous declaration of 'bdrv_aio_writev' was here
/src/qemu/block.c:1425: error: invalid use of undefined type 'struct iovec'


* [Qemu-devel] Re: [RFC 1/1] pci-dma-api-v2
  2008-11-30 18:36           ` [Qemu-devel] " Blue Swirl
@ 2008-11-30 19:04             ` Andrea Arcangeli
  2008-11-30 19:11               ` Blue Swirl
  0 siblings, 1 reply; 32+ messages in thread
From: Andrea Arcangeli @ 2008-11-30 19:04 UTC (permalink / raw)
  To: Blue Swirl; +Cc: qemu-devel

On Sun, Nov 30, 2008 at 08:36:56PM +0200, Blue Swirl wrote:
> The patch does not apply as is:
> >  --- hw/ide.c    (revision 5818)
> >  +++ hw/ide.c    (working copy)
> >  diff --git a/qemu/hw/pci_dma.c b/qemu/hw/pci_dma.c
> >  new file mode 100644
> >  index 0000000..48762a8
> >  --- /dev/null
> >  +++ b/qemu/hw/pci_dma.c
> >  diff --git a/qemu/hw/pci_dma.h b/qemu/hw/pci_dma.h
> >  new file mode 100644
> >  index 0000000..5cc8413
> >  --- /dev/null
> >  +++ b/qemu/hw/pci_dma.h

Yes, sorry, it's because I generate the patches with 'git diff master'
and I cut-and-pasted them in... hand-editing the patch to remove
'b/qemu/' will fix it (then apply it with patch -p0 of course; not
standard, but the default with svn).

> Even as I fixed the patch, it still does not compile, for example:
> /src/qemu/block.c:1335: warning: 'struct iovec' declared inside parameter list
> /src/qemu/block.c:1336: error: conflicting types for 'bdrv_aio_writev'
> /src/qemu/block.h:106: error: previous declaration of 'bdrv_aio_writev' was here
> /src/qemu/block.c:1425: error: invalid use of undefined type 'struct iovec'

Weird... I don't get that error, is that an old compiler? Does it go
away if you move the #include <sys/uio.h> from pci_dma.h to block.h?

Thanks!


* [Qemu-devel] Re: [RFC 1/1] pci-dma-api-v2
  2008-11-30 19:04             ` Andrea Arcangeli
@ 2008-11-30 19:11               ` Blue Swirl
  2008-11-30 19:20                 ` Andrea Arcangeli
  0 siblings, 1 reply; 32+ messages in thread
From: Blue Swirl @ 2008-11-30 19:11 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: qemu-devel

On 11/30/08, Andrea Arcangeli <aarcange@redhat.com> wrote:
> On Sun, Nov 30, 2008 at 08:36:56PM +0200, Blue Swirl wrote:
>  > The patch does not apply as is:
>  > >  --- hw/ide.c    (revision 5818)
>  > >  +++ hw/ide.c    (working copy)
>  > >  diff --git a/qemu/hw/pci_dma.c b/qemu/hw/pci_dma.c
>  > >  new file mode 100644
>  > >  index 0000000..48762a8
>  > >  --- /dev/null
>  > >  +++ b/qemu/hw/pci_dma.c
>  > >  diff --git a/qemu/hw/pci_dma.h b/qemu/hw/pci_dma.h
>  > >  new file mode 100644
>  > >  index 0000000..5cc8413
>  > >  --- /dev/null
>  > >  +++ b/qemu/hw/pci_dma.h
>
>
> Yes sorry it's because I generate the patches with 'git diff master'
>  and I cut and pasted them in... hand editing the patch to remove
>  'b/qemu/' will fix it (to apply later with patch -p0 of course, not
>  standard but the default with svn).
>
>
>  > Even as I fixed the patch, it still does not compile, for example:
>  > /src/qemu/block.c:1335: warning: 'struct iovec' declared inside parameter list
>  > /src/qemu/block.c:1336: error: conflicting types for 'bdrv_aio_writev'
>  > /src/qemu/block.h:106: error: previous declaration of 'bdrv_aio_writev' was here
>  > /src/qemu/block.c:1425: error: invalid use of undefined type 'struct iovec'
>
>
> Weird... I don't get that error, is that an old compiler? Does it go
>  away if you move the #include <sys/uio.h> from pci_dma.h to block.h?

Yes, this version (attached) works.

[-- Attachment #2: pci-dma-api-v2.diff --]
[-- Type: plain/text, Size: 32314 bytes --]


* [Qemu-devel] Re: [RFC 1/1] pci-dma-api-v2
  2008-11-30 19:11               ` Blue Swirl
@ 2008-11-30 19:20                 ` Andrea Arcangeli
  2008-11-30 21:36                   ` Blue Swirl
  0 siblings, 1 reply; 32+ messages in thread
From: Andrea Arcangeli @ 2008-11-30 19:20 UTC (permalink / raw)
  To: Blue Swirl; +Cc: qemu-devel

On Sun, Nov 30, 2008 at 09:11:28PM +0200, Blue Swirl wrote:
> Yes, this version (attached) works.

Ok great! ;)


* Re: [Qemu-devel] [RFC 1/2] pci-dma-api-v1
  2008-11-30 17:29     ` Andrea Arcangeli
@ 2008-11-30 20:27       ` Avi Kivity
  2008-11-30 22:33         ` Andrea Arcangeli
  0 siblings, 1 reply; 32+ messages in thread
From: Avi Kivity @ 2008-11-30 20:27 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: qemu-devel

Andrea Arcangeli wrote:
> On Sat, Nov 29, 2008 at 09:48:22PM +0200, Avi Kivity wrote:
>   
>> Since Andrea's patches contain emulation for aio readv/writev, the 
>> performance degradation will not occur (though we will not see the benefit 
>> either).
>>     
>
> It will not occur especially if you #define DEBUG_BOUNCE. The way I
> emulated bdrv_aio_readv/writev didn't involve a bounce, but it's
> serially submitting the I/O so that it truly runs zerocopy when
> DEBUG_BOUNCE is not defined ;).
>
> IF you enable DEBUG_BOUNCE then the bounce layer will be forced on,
> and the iovec will be linearized and the DMA command executed on the
> hardware will be allowed to be as large as MAX_DMA_BOUNCE_BUFFER like
> before. So until we have a real bdrv_aio_readv/writev #defining
> DEBUG_BOUNCE is a must with cache=off.
>
>   

Oh okay.  In that case it should be committed with DEBUG_BOUNCE enabled, 
and that removed when we have proper aio *v.

>> I doubt you can get measure malloc overhead with anything less a thousand 
>> disks (even there, other overheads are likely to drown that malloc).
>>     
>
> I also eliminated any sign of malloc in the direct path with a small
> cache layer that caches as many pci dma sg params (with the max iovcnt
> seen so far embedded into it) as the max number of simultaneous
> in-flight I/O seen for the whole runtime. With a max param of 10 (so
> only if there are more than 10 simultaneous sg dma I/O operations in
> flight, malloc will have to run). Only params with iov arrays with a
> iovcnt < 2048 are cached. So worst case ram waste is limited, and it's
> auto-tuning at runtime to remain near zero for single disk setups etc..
>   

Overkill IMO (glibc likely caches mallocs), but can't hurt.
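The auto-tuning cache Andrea describes in the quoted paragraph amounts to a small free list of request structures keyed on iovec capacity. A minimal sketch of that idea follows; all names and limits here are made up for illustration (the real QEMUPciDmaSgParam and its bookkeeping differ in detail):

```c
#include <assert.h>
#include <stdlib.h>

#define CACHE_MAX_ENTRIES 10      /* max simultaneous in-flight I/Os cached */
#define CACHE_MAX_IOVCNT  2048    /* don't cache huge iovec arrays */

typedef struct SgParam {
    int iovcnt_capacity;          /* size of the embedded iov array */
    struct SgParam *next;         /* free-list link */
    /* ... submit/complete callbacks, bounce pointer, iov[] would follow */
} SgParam;

static SgParam *free_list;
static int free_list_len;

/* Fast path: reuse a cached entry whose iov array is big enough,
 * falling back to malloc only when the cache can't satisfy us. */
static SgParam *sg_param_get(int iovcnt)
{
    SgParam **p, *param;

    for (p = &free_list; *p; p = &(*p)->next) {
        if ((*p)->iovcnt_capacity >= iovcnt) {
            param = *p;
            *p = param->next;
            free_list_len--;
            return param;
        }
    }
    param = malloc(sizeof(*param));
    if (param)
        param->iovcnt_capacity = iovcnt;
    return param;
}

/* Return an entry to the cache, discarding it when the cache is full
 * or the iov array is too large to be worth keeping around. */
static void sg_param_put(SgParam *param)
{
    if (free_list_len >= CACHE_MAX_ENTRIES ||
        param->iovcnt_capacity >= CACHE_MAX_IOVCNT) {
        free(param);
        return;
    }
    param->next = free_list;
    free_list = param;
    free_list_len++;
}
```

After the first few requests the free list holds as many entries as the peak number of concurrent DMAs, so steady-state I/O never enters the allocator, which is the behavior described above.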

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* [Qemu-devel] Re: [RFC 1/1] pci-dma-api-v2
  2008-11-30 19:20                 ` Andrea Arcangeli
@ 2008-11-30 21:36                   ` Blue Swirl
  2008-11-30 22:54                     ` Anthony Liguori
  0 siblings, 1 reply; 32+ messages in thread
From: Blue Swirl @ 2008-11-30 21:36 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: qemu-devel

On 11/30/08, Andrea Arcangeli <aarcange@redhat.com> wrote:
> On Sun, Nov 30, 2008 at 09:11:28PM +0200, Blue Swirl wrote:
>  > Yes, this version (attached) works.
>
>
> Ok great! ;)

This looks promising. Maybe the previous attempts tried to solve
several problems at once, like the complex address translation schemes
I have in mind, in addition to the basic zero-copy DMA transfer. This
version leaves the DMA address translation to the caller and operates
on physical memory addresses. Another way would be to use host
pointers directly.

Regarding the code, I'd drop the "PCI", this will fit SBus, ISA or
whatever bus we are using.

Sparc64 does not compile, unless pci_dma.o is added to the
Makefile.target list of Sparc64 objects.

The first function called is pci_dma_sg, which also performs some
initialization. I'd move the function pointer setup to a separate init
function.

Quilt complains about trailing whitespace, please trim.


* Re: [Qemu-devel] [RFC 1/2] pci-dma-api-v1
  2008-11-28 19:18           ` Jamie Lokier
  2008-11-29 19:49             ` Avi Kivity
  2008-11-30 17:20             ` Andrea Arcangeli
@ 2008-11-30 22:31             ` Anthony Liguori
  2 siblings, 0 replies; 32+ messages in thread
From: Anthony Liguori @ 2008-11-30 22:31 UTC (permalink / raw)
  To: qemu-devel; +Cc: Andrea Arcangeli

Jamie Lokier wrote:
> Blue Swirl wrote:
>   
>>>  I wonder how can possibly aio_readv/writev be missing in posix aio?
>>>  Unbelievable. It'd be totally trivial to add those to glibc, much
>>>  easier infact than to pthread_create by hand, but how can we add a
>>>  dependency on a certain glibc version? Ironically it'll be more
>>>  user-friendly to add dependency on linux kernel-aio implementation
>>>  that is already available for ages and it's guaranteed to run faster
>>>  (or at least not slower).
>>>       
>> There's also lio_listio that provides for vectored AIO.
>>     
>
> I think lio_listio is the missing aio_readv/writev.
>
> It's more versatile, and that'll be why POSIX never bothered with
> aio_readv/writev.
>   

No.  lio_listio is a batch submission mechanism, not an operation.

Each request maintains the same behavior as if they were submitted 
independently.  This means if you're doing a write operation and you 
submit a vectored request via lio_listio, each portion of the vector has 
to complete before the other one.

But most importantly, pthreads doesn't use readv/writev to implement 
lio_listio as has been mentioned elsewhere.
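Anthony's point can be seen in the API itself: lio_listio() takes an array of independent aiocbs, each with its own buffer and file offset, so a guest scatter list becomes N separate requests rather than one vectored syscall. A minimal, self-contained demonstration (file name and function are invented for the example; older glibc needs -lrt for the POSIX AIO symbols):

```c
#include <aio.h>
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Batch two LIO_READ requests with lio_listio(); note there is no
 * LIO_READV opcode, so each aiocb carries exactly one buffer.
 * Returns 0 when both halves of the file land in the right buffers. */
static int lio_batch_demo(void)
{
    char tmpl[] = "/tmp/lio_demo_XXXXXX";
    int fd = mkstemp(tmpl);
    if (fd < 0 || write(fd, "abcdefgh", 8) != 8)
        return -1;

    char buf0[4], buf1[4];
    struct aiocb cb0, cb1, *list[2] = { &cb0, &cb1 };
    memset(&cb0, 0, sizeof cb0);
    memset(&cb1, 0, sizeof cb1);
    cb0.aio_fildes = fd; cb0.aio_buf = buf0; cb0.aio_nbytes = 4;
    cb0.aio_offset = 0;  cb0.aio_lio_opcode = LIO_READ;
    cb1.aio_fildes = fd; cb1.aio_buf = buf1; cb1.aio_nbytes = 4;
    cb1.aio_offset = 4;  cb1.aio_lio_opcode = LIO_READ;

    /* one batch submission, but still two independent operations */
    int ret = lio_listio(LIO_WAIT, list, 2, NULL);
    int ok = ret == 0 && !memcmp(buf0, "abcd", 4)
                      && !memcmp(buf1, "efgh", 4);

    close(fd);
    unlink(tmpl);
    return ok ? 0 : -1;
}
```

Each aiocb is completed and cancelled individually, which is exactly why the thread concludes lio_listio is batching, not vectoring.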

Regards,

Anthony Liguori

> Doesn't explain why they didn't _start_ with aio_readv before
> inventing lio_listio, but there you go.  Unix history.
>
> -- Jamie
>
>
>   


* Re: [Qemu-devel] [RFC 1/2] pci-dma-api-v1
  2008-11-27 19:14 ` [Qemu-devel] [RFC 1/2] pci-dma-api-v1 Blue Swirl
  2008-11-28  1:56   ` Andrea Arcangeli
  2008-11-29 19:48   ` Avi Kivity
@ 2008-11-30 22:33   ` Anthony Liguori
  2 siblings, 0 replies; 32+ messages in thread
From: Anthony Liguori @ 2008-11-30 22:33 UTC (permalink / raw)
  To: qemu-devel

Blue Swirl wrote:
> On 11/27/08, Andrea Arcangeli <aarcange@redhat.com> wrote:
>   
>> Hello everyone,
>>
>>  Once we finish fixing storage performance with a real
>>  bdrv_aio_readv/writev (now a blocker issue), a pci_dma_single can be
>>  added for zero copy networking (one NIC per VM, or VMDq, IOV
>>  etc..). The DMA API should allow for that too.
>>     
>
> The previous similar attempt by Anthony for generic DMA using vectored
> IO was abandoned because the malloc/free overhead was more than the
> performance gain. Have you made any performance measurements? How does
> this version compare to the previous ones?
>   

No, I never measured any malloc/free overhead.  I think people are more 
concerned with that than warranted.  A good malloc implementation will 
usually have O(1) complexity for similar sized allocs so malloc/free 
should not be the bottleneck.

My attempt was abandoned because the API is complex and at the time, 
zero copy wasn't the bottleneck we needed to overcome.  Fortunately, 
enough has progressed since then that it is now the bottleneck :-).

Regards,

Anthony Liguori

> I think the pci_ prefix can be removed, there is little PCI specific.
>
> For Sparc32 IOMMU (and probably other IOMMUS), it should be possible
> to register a function used in place of  cpu_physical_memory_rw,
> c_p_m_can_dma etc. The goal is that it should be possible to stack the
> DMA resolvers (think of devices behind a number of buses).
>
>
>   


* Re: [Qemu-devel] [RFC 1/2] pci-dma-api-v1
  2008-11-30 20:27       ` Avi Kivity
@ 2008-11-30 22:33         ` Andrea Arcangeli
  0 siblings, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2008-11-30 22:33 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel

On Sun, Nov 30, 2008 at 10:27:22PM +0200, Avi Kivity wrote:
> Oh okay.  In that case it should be committed with DEBUG_BOUNCE enabled, 
> and that removed when we have proper aio *v.

Exactly.

But before this can be merged the emulation must be solved. Currently
bdrv_aio_cancel can't work safely (with the last patch it works safely
only for drivers using the dma api). I guess one way is to move the
emulation into the block-raw-posix.c layer. But in general I'm not
really sure how best to handle this bdrv readv/writev in short-term
merging terms. Perhaps we should just get rid of posix aio and add the
real thing that handles readv/writev as well, even if that will surely
take a bit more work than some form of emulation in block-raw-posix.c
that could handle safe cancellation for all of read/write/readv/writev.

> Overkill IMO (glibc likely caches mallocs), but can't hurt.

Yes.


* Re: [Qemu-devel] [RFC 1/2] pci-dma-api-v1
  2008-11-28 17:59     ` Blue Swirl
  2008-11-28 18:50       ` Andrea Arcangeli
@ 2008-11-30 22:34       ` Anthony Liguori
  1 sibling, 0 replies; 32+ messages in thread
From: Anthony Liguori @ 2008-11-30 22:34 UTC (permalink / raw)
  To: qemu-devel; +Cc: Andrea Arcangeli

Blue Swirl wrote:
> On 11/28/08, Andrea Arcangeli <aarcange@redhat.com> wrote:
>   
>> On Thu, Nov 27, 2008 at 09:14:45PM +0200, Blue Swirl wrote:
>>  > The previous similar attempt by Anthony for generic DMA using vectored
>>  > IO was abandoned because the malloc/free overhead was more than the
>>
>>
>> Even if there were dynamic allocations in the fast path, the overhead
>>  of malloc/free is nothing if compared to running and waiting a host
>>  kernel syscall to return every 4k, not to tell with O_DIRECT enabled
>>  which is the whole point of having a direct-dma API that truly doesn't
>>  pollute the cache. With O_DIRECT, without a real readv/writev I/O
>>  performance would be destroyed going down to something like 10M/sec
>>  even on the fastest storage/CPU/ram combinations.
>>
>>  So the question is how those benchmarks were run, with or without a
>>  real readv/writev and with or without O_DIRECT to truly eliminate all
>>  CPU cache pollution out of the memory copies?
>>     
>
> I don't know, here's a pointer:
> http://lists.gnu.org/archive/html/qemu-devel/2008-08/msg00092.html
>   

Yes.  A vector IO API is not terribly useful in QEMU without an AIO 
implementation that can make use of it.  posix-aio cannot.
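For context, the synchronous half has long existed in the host kernel: writev() hands an entire scatter list to the kernel in one syscall, which is what O_DIRECT needs to avoid a syscall per chunk. What the thread says is missing is an asynchronous counterpart. A small sketch of the syscall the emulation ultimately wants to reach (names invented for the example):

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Gather two separate buffers into a file with a single writev()
 * call, then read the result back to confirm it.  Returns 0 on
 * success. */
static int writev_demo(void)
{
    char tmpl[] = "/tmp/writev_demo_XXXXXX";
    int fd = mkstemp(tmpl);
    if (fd < 0)
        return -1;

    char a[] = "scatter-", b[] = "gather";
    struct iovec iov[2] = {
        { .iov_base = a, .iov_len = 8 },
        { .iov_base = b, .iov_len = 6 },
    };

    /* both buffers, one syscall */
    int ok = writev(fd, iov, 2) == 14;

    char back[15] = { 0 };
    ok = ok && pread(fd, back, 14, 0) == 14
            && strcmp(back, "scatter-gather") == 0;

    close(fd);
    unlink(tmpl);
    return ok ? 0 : -1;
}
```

An AIO implementation that "can make use of it" would issue exactly this kind of call from a worker thread, per iovec list rather than per 4k buffer.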

Regards,

Anthony Liguori


* Re: [Qemu-devel] [RFC 1/2] pci-dma-api-v1
  2008-11-28 18:50       ` Andrea Arcangeli
  2008-11-28 19:03         ` Blue Swirl
  2008-11-30 17:41         ` [Qemu-devel] [RFC 1/1] pci-dma-api-v2 Andrea Arcangeli
@ 2008-11-30 22:38         ` Anthony Liguori
  2008-11-30 22:51           ` Jamie Lokier
  2 siblings, 1 reply; 32+ messages in thread
From: Anthony Liguori @ 2008-11-30 22:38 UTC (permalink / raw)
  To: qemu-devel; +Cc: Blue Swirl

Andrea Arcangeli wrote:
> On Fri, Nov 28, 2008 at 07:59:13PM +0200, Blue Swirl wrote:
>   
>> I don't know, here's a pointer:
>> http://lists.gnu.org/archive/html/qemu-devel/2008-08/msg00092.html
>>     
>
> I'm in total agreement with it. The missing "proper vectored AIO
> operations" are bdrv_aio_readv/writev ;).
>
> I wonder how can possibly aio_readv/writev be missing in posix aio?
>   

Because it sucks.

> Unbelievable. It'd be totally trivial to add those to glibc, much
> easier infact than to pthread_create by hand, but how can we add a
> dependency on a certain glibc version?

I've already asked, Ulrich will not likely add this unless it first goes 
into the Posix standard.  It won't go into the Posix standard without an 
existing implementation.  Either way, we're talking about a multi-year 
process.

We need our own thread-pool based AIO implementation.

>  Ironically it'll be more
> user-friendly to add dependency on linux kernel-aio implementation
> that is already available for ages and it's guaranteed to run faster
> (or at least not slower).
>   

No, kernel-aio is more brain dead.  io_submit may block if the 
underlying file descriptor does not support asynchronous IO.  There is 
no way to detect this.  This means the only safe way to use io_submit is 
to call it from a thread, and to have a thread pool that can do multiple 
io_submits in parallel.

Yes, that's ridiculous.  There's a thread right now on 
linux-aio@vger.kernel.org discussing the future of linux-aio.  The 
prevailing wisdom seems to be that linux-aio does not have a future.

Regards,

Anthony Liguori


* Re: [Qemu-devel] [RFC 1/1] pci-dma-api-v2
  2008-11-30 17:41         ` [Qemu-devel] [RFC 1/1] pci-dma-api-v2 Andrea Arcangeli
  2008-11-30 18:36           ` [Qemu-devel] " Blue Swirl
@ 2008-11-30 22:50           ` Anthony Liguori
  2008-12-01  9:41             ` Avi Kivity
  1 sibling, 1 reply; 32+ messages in thread
From: Anthony Liguori @ 2008-11-30 22:50 UTC (permalink / raw)
  To: qemu-devel; +Cc: Blue Swirl, Andrea Arcangeli

Andrea Arcangeli wrote:
> On Fri, Nov 28, 2008 at 07:50:01PM +0100, Andrea Arcangeli wrote:

This patch is too large.  It should be split up.

> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
>
> Index: block_int.h
> ===================================================================
> --- block_int.h	(revision 5818)
> +++ block_int.h	(working copy)
> @@ -55,6 +55,8 @@
>          int64_t sector_num, const uint8_t *buf, int nb_sectors,
>          BlockDriverCompletionFunc *cb, void *opaque);
>      void (*bdrv_aio_cancel)(BlockDriverAIOCB *acb);
> +    BlockDriverAIOIOV *bdrv_aio_readv;
> +    BlockDriverAIOIOV *bdrv_aio_writev;
>      int aiocb_size;
>   

Please maintain consistency with the rest of the struct by not 
introducing typedefs for these function pointers.

> Index: exec.c
> ===================================================================
> --- exec.c	(revision 5818)
> +++ exec.c	(working copy)
> @@ -2807,7 +2807,7 @@
>  /* physical memory access (slow version, mainly for debug) */
>  #if defined(CONFIG_USER_ONLY)
>  void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
> -                            int len, int is_write)
> +                            size_t len, int is_write)
>  {
>      int l, flags;
>      target_ulong page;
> @@ -2848,7 +2848,7 @@
>  
>  #else
>  void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
> -                            int len, int is_write)
> +                            size_t len, int is_write)
>  {
>      int l, io_index;
>      uint8_t *ptr;
> @@ -2938,6 +2938,111 @@
>      }
>  }
>   

This API change should be a separate patch.  It's orthogonal to this one.

> +uint8_t *cpu_physical_memory_can_dma(target_phys_addr_t addr,
> +				     size_t len, int is_write,
> +				     int alignment)
>   

Your patch is tab-damaged.  This API seems flawed.  It's duplicating 
most of what is in cpu_physical_memory_rw.

I think you need something more sophisticated that can split up a 
scatter/gather list into a set of partial bounce buffers and partial 
direct copies.  Just checking the vector once is a bit hackish.
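A sketch of the splitting suggested here: classify each sg entry individually, so a transfer can be partly mapped directly and partly bounced, instead of bouncing everything when one segment fails the check. The SgEntry type and the is_direct() policy below are stand-ins invented for this example, not QEMU code:

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    unsigned long addr;
    size_t len;
} SgEntry;

/* toy policy: pretend addresses below 64k are RAM, the rest MMIO */
static int is_direct(unsigned long addr)
{
    return addr < 0x10000;
}

/* Mark each entry: 0 = map directly, 1 = go through a bounce buffer.
 * Returns how many entries need bouncing, so the caller can fall
 * back to the all-direct fast path when the result is zero. */
static int classify_sg(const SgEntry *sg, int cnt, int *need_bounce)
{
    int i, bounced = 0;

    for (i = 0; i < cnt; i++) {
        need_bounce[i] = !is_direct(sg[i].addr);
        bounced += need_bounce[i];
    }
    return bounced;
}
```

The per-entry flags would then drive a submit loop that emits runs of direct iovecs interleaved with bounce-buffer copies, rather than the single up-front check the patch performs.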

> +	    /* qemu doesn't execute guest code directly, but kvm does
> +	       therefore flush instruction caches */
> +	    if (kvm_enabled())
> +		flush_icache_range((unsigned long)ptr,
> +				   ((unsigned long)ptr)+l);
>   

This is incorrect for upstream QEMU and just as wrong for 
kvm-userspace.  This is a nasty hack for ia64.  flush_icache_range() is 
dyngen generated.   ia64 support is not in upstream QEMU.

>  
> @@ -135,6 +149,9 @@
>          /* add synchronous IO emulation layer */
>          bdrv->bdrv_read = bdrv_read_em;
>          bdrv->bdrv_write = bdrv_write_em;
> +        bdrv->bdrv_aio_readv = bdrv_aio_readv_em; /* FIXME */
> +        bdrv->bdrv_aio_writev = bdrv_aio_writev_em; /* FIXME */
> +        bdrv->bdrv_aio_cancel = bdrv_aio_cancel_em; /* FIXME */
>   

What should be fixed?  Are these emulated functions wrong?

There's a lack of symmetry here.  We should have a bdrv_readv and 
bdrv_aio_readv.  bdrv_read and bdrv_aio_read should disappear.  We can 
maintain wrappers that create a compatible interface for older code but 
just adding a new API is wrong.

> +BlockDriverAIOCB *bdrv_aio_readv(BlockDriverState *bs, int64_t sector_num,
> +				 struct iovec *iov, int iovcnt, size_t len,
> +				 BlockDriverCompletionFunc *cb, void *opaque)
> +{
>   

struct iovec does not exist on Windows.  I also don't think you've got 
the abstraction right.  Reading and writing may require actions to 
happen.  I don't think you can just introduce something as simple as an 
iovec here.  I think we need something more active.

> Index: hw/ide.c
> ===================================================================
> --- hw/ide.c	(revision 5818)
> +++ hw/ide.c	(working copy)
> @@ -31,6 +31,7 @@
>  #include "qemu-timer.h"
>  #include "sysemu.h"
>  #include "ppc_mac.h"
> +#include "pci_dma.h"

All of the changes to IDE/SCSI should be separate patches.  That will 
help isolate failures when bisecting.

> diff --git a/qemu/hw/pci_dma.c b/qemu/hw/pci_dma.c
> new file mode 100644
> index 0000000..48762a8
> --- /dev/null
> +++ b/qemu/hw/pci_dma.c
> @@ -0,0 +1,366 @@
> +/*
> + * QEMU PCI DMA operations
> + *
> + * Copyright (c) 2008 Red Hat, Inc.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a copy
> + * of this software and associated documentation files (the "Software"), to deal
> + * in the Software without restriction, including without limitation the rights
> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> + * copies of the Software, and to permit persons to whom the Software is
> + * furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> + * THE SOFTWARE.
> + */
> +
> +#include <assert.h>
> +#include "pci_dma.h"
> +
> +#define MAX_DMA_BOUNCE_BUFFER (1024*1024)
> +//#define DEBUG_BOUNCE
> +//#define MAX_DMA_BOUNCE_BUFFER (512)
> +//#define DISABLE_IOVEC_CACHE
> +
> +typedef struct QEMUPciDmaSgParam {
> +    QEMUPciDmaSgSubmit *pci_dma_sg_submit;
> +    QEMUPciDmaSgComplete *pci_dma_sg_complete;
> +    void *pci_dma_sg_opaque;
> +    int dma_to_memory;
> +    int alignment;
> +    uint8_t *bounce;
> +    QEMUPciDmaSg *sg;
> +    int iovcnt;
> +    int restart_iovcnt;
> +    size_t restart_offset;
> +    int curr_restart_iovcnt;
> +    size_t curr_restart_offset;
> +    size_t curr_len;
> +#ifndef DISABLE_IOVEC_CACHE
> +    struct QEMUPciDmaSgParam *next;
> +#endif
> +    struct iovec iov;
> +} QEMUPciDmaSgParam;
>
>   
What is the IOVEC_CACHE and why do we need it?  Either way, it should be 
using sys-queue.

I think you missed the mark here.  This API needs to go through the 
PCIBus and be pluggable at that level.  There can be a default 
implementation but it may differ for each PCI bus.

Please read the previous threads about a DMA API.  All of this has been 
discussed at length.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [RFC 1/2] pci-dma-api-v1
  2008-11-30 22:38         ` [Qemu-devel] [RFC 1/2] pci-dma-api-v1 Anthony Liguori
@ 2008-11-30 22:51           ` Jamie Lokier
  0 siblings, 0 replies; 32+ messages in thread
From: Jamie Lokier @ 2008-11-30 22:51 UTC (permalink / raw)
  To: qemu-devel

Anthony Liguori wrote:
> Yes, that's ridiculous.  There's a thread right now on 
> linux-aio@vger.kernel.org discussing the future of linux-aio.  The 
> prevailing wisdom seems to be that linux-aio does not have a future.

I think you meant linux-aio@kvack.org, btw.

-- Jamie


* Re: [Qemu-devel] Re: [RFC 1/1] pci-dma-api-v2
  2008-11-30 21:36                   ` Blue Swirl
@ 2008-11-30 22:54                     ` Anthony Liguori
  0 siblings, 0 replies; 32+ messages in thread
From: Anthony Liguori @ 2008-11-30 22:54 UTC (permalink / raw)
  To: qemu-devel; +Cc: Andrea Arcangeli

Blue Swirl wrote:
> On 11/30/08, Andrea Arcangeli <aarcange@redhat.com> wrote:
>   
>> On Sun, Nov 30, 2008 at 09:11:28PM +0200, Blue Swirl wrote:
>>  > Yes, this version (attached) works.
>>
>>
>> Ok great! ;)
>>     
>
> This looks promising. Maybe the previous attempts tried to solve
> several problems at once, like the complex address translation schemes
> I have in mind, in addition to the basic zero-copy DMA transfer. This
> version leaves the DMA address translation to the caller and operates
> using physical memory addresses. Another way would be to use directly
> host pointers.
>   

I'm happy to separate address translation from just zero-copy IO.  I 
agree that a lot of the complexity is introduced trying to solve everything.

For me, just extending the BlockDriver API is a problem.  I'd like to 
see the API properly structured around a single vector API.  This is 
independent of zero copy.

I also think that in the very least, a DMA must be encapsulated in a 
structure, that has functions to pin()/unpin() that generate bounce 
buffers as necessary.

I think it's probably a good idea to separate a basic DMA api from a PCI 
DMA api.
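A hypothetical shape for the encapsulated DMA object described above: pin() yields a host pointer, allocating a bounce buffer when the region cannot be mapped directly; unpin() would copy the bounce buffer back to guest memory and release it. None of these names is a real QEMU interface:

```c
#include <assert.h>
#include <stdlib.h>

typedef struct DmaRegion {
    void *direct;   /* non-NULL when guest memory is directly mappable */
    void *bounce;   /* allocated lazily by dma_pin() otherwise */
    size_t len;
} DmaRegion;

/* Return a host pointer for the region, bouncing only when needed. */
static void *dma_pin(DmaRegion *r)
{
    if (r->direct)
        return r->direct;
    if (!r->bounce)
        r->bounce = malloc(r->len);
    return r->bounce;
}

/* Release the region; the zero-copy path has nothing to do. */
static void dma_unpin(DmaRegion *r)
{
    if (r->bounce) {
        /* a real implementation would write the bounce buffer back to
         * guest memory here, before freeing it */
        free(r->bounce);
        r->bounce = NULL;
    }
}
```

The appeal of this structure is that device emulation code calls pin()/unpin() uniformly and never needs to know whether a given transfer bounced.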

Regards,

Anthony Liguori


* Re: [Qemu-devel] [RFC 1/1] pci-dma-api-v2
  2008-11-30 22:50           ` [Qemu-devel] " Anthony Liguori
@ 2008-12-01  9:41             ` Avi Kivity
  2008-12-01 16:37               ` Anthony Liguori
  0 siblings, 1 reply; 32+ messages in thread
From: Avi Kivity @ 2008-12-01  9:41 UTC (permalink / raw)
  To: qemu-devel; +Cc: Blue Swirl, Andrea Arcangeli

Anthony Liguori wrote:
>
> I think you need something more sophisticated that can split up a 
> scatter/gather list into a set of partial bounce buffers and partial 
> direct copies.  Just checking the vector once is a bit hackish.
>

That code path would never be used or tested.  As it is, bouncing will 
only be invoked by dma to or from mmio, and I expect most guests never 
to invoke it.  Why would we complicate the code even more?

>
> What should be fixed?  Are these emulated functions wrong?
>
> There's a lack of symmetry here.  We should have a bdrv_readv and 
> bdrv_aio_readv.  bdrv_read and bdrv_aio_read should disappear.  We can 
> maintain wrappers that create a compatible interface for older code 
> but just adding a new API is wrong.
>

It was understood a real aio readv/writev was being worked on, so the 
emulation could be a temporary step.

>> +BlockDriverAIOCB *bdrv_aio_readv(BlockDriverState *bs, int64_t 
>> sector_num,
>> +                 struct iovec *iov, int iovcnt, size_t len,
>> +                 BlockDriverCompletionFunc *cb, void *opaque)
>> +{
>>   
>
> struct iovec does not exist on Windows.  I also don't think you've got 
> the abstraction right.  Reading and writing may require actions to 
> happen.  I don't think you can just introduce something as simple as 
> an iovec here.  I think we need something more active.
>

Can you elaborate?  Actions happen in the completion callback.  This is 
just a straightforward extension of the block API, I don't think a dma 
api should materially change the block api.

>
> I think you missed the mark here.  This API needs to go through the 
> PCIBus and be pluggable at that level.  There can be a default 
> implementation but it may differ for each PCI bus.
>

I think this can serve as the default implementation.  Perhaps when a 
concrete user of pluggable dma emerges we can understand the 
requirements better.


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [RFC 1/1] pci-dma-api-v2
  2008-12-01  9:41             ` Avi Kivity
@ 2008-12-01 16:37               ` Anthony Liguori
  2008-12-02  9:45                 ` Avi Kivity
  0 siblings, 1 reply; 32+ messages in thread
From: Anthony Liguori @ 2008-12-01 16:37 UTC (permalink / raw)
  To: qemu-devel; +Cc: Blue Swirl, Andrea Arcangeli

Avi Kivity wrote:
> Anthony Liguori wrote:
>>
>> I think you need something more sophisticated that can split up a 
>> scatter/gather list into a set of partial bounce buffers and partial 
>> direct copies.  Just checking the vector once is a bit hackish.
>>
>
> That code path would never be used or tested.  As it is, bouncing will 
> only be invoked by dma to or from mmio, and I expect most guests never 
> to invoke it.  Why would we complicate the code even more?

I'm not sure it's strictly required but I think it would be a 
simplification.

>>
>> What should be fixed?  Are these emulated functions wrong?
>>
>> There's a lack of symmetry here.  We should have a bdrv_readv and 
>> bdrv_aio_readv.  bdrv_read and bdrv_aio_read should disappear.  We 
>> can maintain wrappers that create a compatible interface for older 
>> code but just adding a new API is wrong.
>>
>
> It was understood a real aio readv/writev was being worked on, so the 
> emulation could be a temporary step.

Yes, but that's orthogonal to what I'm saying here.  I'm saying that 
instead of adding an optional ->aio_readv() member, we should eliminate 
the ->aio_read() member and replace it with ->aio_readv().  There are 
only three or four places in the code that implement aio so it's not a 
big change.  It avoids introducing a new API too.

I also think we should have complementary synchronous vector functions 
although I'd like to see the synchronous API disappear completely.

>>> +BlockDriverAIOCB *bdrv_aio_readv(BlockDriverState *bs, int64_t 
>>> sector_num,
>>> +                 struct iovec *iov, int iovcnt, size_t len,
>>> +                 BlockDriverCompletionFunc *cb, void *opaque)
>>> +{
>>>   
>>
>> struct iovec does not exist on Windows.  I also don't think you've 
>> got the abstraction right.  Reading and writing may require actions 
>> to happen.  I don't think you can just introduce something as simple 
>> as an iovec here.  I think we need something more active.
>>
>
> Can you elaborate?  Actions happen in the completion callback.  This 
> is just a straightforward extension of the block API, I don't think a 
> dma api should materially change the block api.

If we're not going to try and fold in the IOMMU/PCI BUS API at this 
pass, then this is less important.  But to implement a proper PCI bus 
API, I think there has to be function pointers associated with each 
element of the scatter/gather list that control how data is copied in 
and out of each element.

>>
>> I think you missed the mark here.  This API needs to go through the 
>> PCIBus and be pluggable at that level.  There can be a default 
>> implementation but it may differ for each PCI bus.
>>
>
> I think this can serve as the default implementation.  Perhaps when a 
> concrete user of pluggable dma emerges we can understand the 
> requirements better.

I think if we drop the pci_ prefixes and just treat it as a basic DMA 
API as Blue Swirl suggested, then this is closer to what we need.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [RFC 1/1] pci-dma-api-v2
  2008-12-01 16:37               ` Anthony Liguori
@ 2008-12-02  9:45                 ` Avi Kivity
  0 siblings, 0 replies; 32+ messages in thread
From: Avi Kivity @ 2008-12-02  9:45 UTC (permalink / raw)
  To: qemu-devel; +Cc: Blue Swirl, Andrea Arcangeli

Anthony Liguori wrote:

>>>
>>> What should be fixed?  Are these emulated functions wrong?
>>>
>>> There's a lack of symmetry here.  We should have a bdrv_readv and 
>>> bdrv_aio_readv.  bdrv_read and bdrv_aio_read should disappear.  We 
>>> can maintain wrappers that create a compatible interface for older 
>>> code but just adding a new API is wrong.
>>>
>>
>> It was understood a real aio readv/writev was being worked on, so the 
>> emulation could be a temporary step.
>
> Yes, but that's orthogonal to what I'm saying here.  I'm saying that 
> instead of adding an optional ->aio_readv() member, we should 
> eliminate the ->aio_read() member and replace it with ->aio_readv().  
> There are only three or four places in the code that implement aio so 
> it's not a big change.  It avoids introducing a new API too.

The api replacement would make a lot more sense when there's an 
implementation backing it up.  I have no concrete objection to this 
though, maybe Andrea can whip up a follow-on patch to kill aio_read.

> I also think we should have complementary synchronous vector functions 
> although I'd like to see the synchronous API disappear completely.

I don't see a user for the synchronous ops.  I think the sync ops are 
used only by the implementation of block formats, not by disk controller 
emulation, and these only use them to access metadata.  Haven't checked 
this though.

>>>
>>> struct iovec does not exist on Windows.  I also don't think you've 
>>> got the abstraction right.  Reading and writing may require actions 
>>> to happen.  I don't think you can just introduce something as simple 
>>> as an iovec here.  I think we need something more active.
>>>
>>
>> Can you elaborate?  Actions happen in the completion callback.  This 
>> is just a straightforward extension of the block API, I don't think 
>> a dma api should materially change the block api.
>
> If we're not going to try and fold in the IOMMU/PCI BUS API at this 
> pass, then this is less important.  But to implement a proper PCI bus 
> API, I think there has to be function pointers associated with each 
> element of the scatter/gather list that control how data is copied in 
> and out of each element.

Certainly not with each element, that would increase overhead for the 
common zero copy case.  The way to do transformation is before/after 
bouncing, in the slow path.

The way I see it is:
- transformations that only manipulate addresses, maintaining alignment 
requirements, can be done at the sg->iovec transform phase without any 
need for further manipulation
- if there are alignment problems, we bounce, but again no need for 
further manipulation
- transformations that manipulate data are bounced and manipulated on 
the bounce buffer

It can all be done within the dma api, no need to involve the block 
layer with callbacks.  That's not its job.


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2008-12-02  9:45 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-11-27 12:35 [Qemu-devel] [RFC 1/2] pci-dma-api-v1 Andrea Arcangeli
2008-11-27 12:43 ` [Qemu-devel] [RFC 2/2] bdrv_aio_readv/writev_em Andrea Arcangeli
2008-11-28 11:09   ` Jamie Lokier
2008-11-27 19:14 ` [Qemu-devel] [RFC 1/2] pci-dma-api-v1 Blue Swirl
2008-11-28  1:56   ` Andrea Arcangeli
2008-11-28 17:59     ` Blue Swirl
2008-11-28 18:50       ` Andrea Arcangeli
2008-11-28 19:03         ` Blue Swirl
2008-11-28 19:18           ` Jamie Lokier
2008-11-29 19:49             ` Avi Kivity
2008-11-30 17:20             ` Andrea Arcangeli
2008-11-30 22:31             ` Anthony Liguori
2008-11-30 18:04           ` Andrea Arcangeli
2008-11-30 17:41         ` [Qemu-devel] [RFC 1/1] pci-dma-api-v2 Andrea Arcangeli
2008-11-30 18:36           ` [Qemu-devel] " Blue Swirl
2008-11-30 19:04             ` Andrea Arcangeli
2008-11-30 19:11               ` Blue Swirl
2008-11-30 19:20                 ` Andrea Arcangeli
2008-11-30 21:36                   ` Blue Swirl
2008-11-30 22:54                     ` Anthony Liguori
2008-11-30 22:50           ` [Qemu-devel] " Anthony Liguori
2008-12-01  9:41             ` Avi Kivity
2008-12-01 16:37               ` Anthony Liguori
2008-12-02  9:45                 ` Avi Kivity
2008-11-30 22:38         ` [Qemu-devel] [RFC 1/2] pci-dma-api-v1 Anthony Liguori
2008-11-30 22:51           ` Jamie Lokier
2008-11-30 22:34       ` Anthony Liguori
2008-11-29 19:48   ` Avi Kivity
2008-11-30 17:29     ` Andrea Arcangeli
2008-11-30 20:27       ` Avi Kivity
2008-11-30 22:33         ` Andrea Arcangeli
2008-11-30 22:33   ` Anthony Liguori

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).