qemu-devel.nongnu.org archive mirror
* [Qemu-devel] [PATCH 0/5] Direct memory access for devices
@ 2009-01-18 19:53 Avi Kivity
  2009-01-18 19:53 ` [Qemu-devel] [PATCH 1/5] Add target memory mapping API Avi Kivity
                   ` (5 more replies)
  0 siblings, 6 replies; 64+ messages in thread
From: Avi Kivity @ 2009-01-18 19:53 UTC (permalink / raw)
  To: qemu-devel, Anthony Liguori

One of the deficiencies of the current device layer is that it can only access
guest RAM via cpu_physical_memory_rw().  This means that the device emulation
code must copy the memory to or from a temporary buffer, even though the host
offers APIs which allow direct access to memory.  This reduces efficiency on
DMA-capable devices, especially disks.

This patchset introduces a complement to the read/write API,
cpu_physical_memory_map(), which allows device emulation code to map
guest memory directly.  The API bounces memory regions which cannot be
mapped (such as mmio regions) using an internal buffer.

As an example, IDE emulation is converted to use the new API.  This exposes
another deficiency: lack of scatter/gather support in the block layer.  To
work around this, a vectored block API is introduced, currently emulated
by bouncing.  Additional work is needed to convert all block format drivers
to use the vectored API.
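To illustrate the intended calling convention, here is a self-contained toy
model of the map/unmap pair (not QEMU code: the RAM array, the MMIO_BASE
address split, and the toy_* names are all invented for the example; only one
bounce buffer exists, mirroring the single-page limit in the patchset):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy model: guest "RAM" is a host array; addresses at or above
 * MMIO_BASE stand in for regions that cannot be mapped directly
 * and must be bounced through an internal buffer. */

#define RAM_SIZE  4096u
#define MMIO_BASE 0x10000u

static uint8_t ram[RAM_SIZE];
static struct {
    uint8_t *buf;
    uint32_t addr;
    uint32_t len;
    int in_use;
} bounce;

static void *toy_map(uint32_t addr, uint32_t *plen, int is_write)
{
    if (addr < MMIO_BASE) {
        /* Direct mapping: hand out a pointer into host memory,
         * truncating the length at the end of the region. */
        if (addr + *plen > RAM_SIZE)
            *plen = RAM_SIZE - addr;
        return ram + addr;
    }
    if (bounce.in_use)
        return NULL;            /* bounce resources exhausted */
    bounce.buf = malloc(*plen);
    bounce.addr = addr;
    bounce.len = *plen;
    bounce.in_use = 1;
    if (!is_write)
        memset(bounce.buf, 0xaa, *plen);  /* stand-in for an MMIO read */
    return bounce.buf;
}

static void toy_unmap(void *buffer, uint32_t len, int is_write)
{
    (void)len; (void)is_write;
    if (buffer != bounce.buf)
        return;                 /* direct mapping: writes already landed */
    /* For a bounced write, the buffer contents would be written back
     * to the device here. */
    free(bounce.buf);
    bounce.buf = NULL;
    bounce.in_use = 0;
}
```

A device would map, DMA directly into/out of the returned pointer, then
unmap; a NULL return means bounce resources are busy and the operation must
be retried later (the notification mechanism in patch 2).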

Avi Kivity (5):
  Add target memory mapping API
  Add map client retry notification
  Vectored block device API
  I/O vector helpers
  Convert IDE to directly access guest memory

 block.c       |   92 ++++++++++++++++++++++++++++++++++++++++
 block.h       |    7 +++
 cpu-all.h     |    8 ++++
 exec.c        |  131 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/ide.c      |  127 +++++++++++++++++++++++++++++++++++++++++++++++++------
 qemu-common.h |   10 ++++
 vl.c          |   25 +++++++++++
 7 files changed, 386 insertions(+), 14 deletions(-)

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-18 19:53 [Qemu-devel] [PATCH 0/5] Direct memory access for devices Avi Kivity
@ 2009-01-18 19:53 ` Avi Kivity
  2009-01-19 13:49   ` Ian Jackson
                     ` (3 more replies)
  2009-01-18 19:53 ` [Qemu-devel] [PATCH 2/5] Add map client retry notification Avi Kivity
                   ` (4 subsequent siblings)
  5 siblings, 4 replies; 64+ messages in thread
From: Avi Kivity @ 2009-01-18 19:53 UTC (permalink / raw)
  To: qemu-devel, Anthony Liguori

Devices accessing large amounts of memory (as with DMA) will wish to obtain
a pointer to guest memory rather than access it indirectly via
cpu_physical_memory_rw().  Add a new API to convert target addresses to
host pointers.

In case the target address does not correspond to RAM, a bounce buffer is
allocated.  To prevent the guest from causing the host to allocate unbounded
amounts of bounce buffer, this memory is limited (currently to one page).

Signed-off-by: Avi Kivity <avi@redhat.com>
---
 cpu-all.h |    5 +++
 exec.c    |   93 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 98 insertions(+), 0 deletions(-)

diff --git a/cpu-all.h b/cpu-all.h
index ee0a6e3..3439999 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -923,6 +923,11 @@ static inline void cpu_physical_memory_write(target_phys_addr_t addr,
 {
     cpu_physical_memory_rw(addr, (uint8_t *)buf, len, 1);
 }
+void *cpu_physical_memory_map(target_phys_addr_t addr,
+                              target_phys_addr_t *plen,
+                              int is_write);
+void cpu_physical_memory_unmap(void *buffer, target_phys_addr_t len,
+                               int is_write);
 uint32_t ldub_phys(target_phys_addr_t addr);
 uint32_t lduw_phys(target_phys_addr_t addr);
 uint32_t ldl_phys(target_phys_addr_t addr);
diff --git a/exec.c b/exec.c
index faa6333..7162271 100644
--- a/exec.c
+++ b/exec.c
@@ -3045,6 +3045,99 @@ void cpu_physical_memory_write_rom(target_phys_addr_t addr,
     }
 }
 
+typedef struct {
+    void *buffer;
+    target_phys_addr_t addr;
+    target_phys_addr_t len;
+} BounceBuffer;
+
+static BounceBuffer bounce;
+
+void *cpu_physical_memory_map(target_phys_addr_t addr,
+                              target_phys_addr_t *plen,
+                              int is_write)
+{
+    target_phys_addr_t len = *plen;
+    target_phys_addr_t done = 0;
+    int l;
+    uint8_t *ret = NULL;
+    uint8_t *ptr;
+    target_phys_addr_t page;
+    unsigned long pd;
+    PhysPageDesc *p;
+    unsigned long addr1;
+
+    while (len > 0) {
+        page = addr & TARGET_PAGE_MASK;
+        l = (page + TARGET_PAGE_SIZE) - addr;
+        if (l > len)
+            l = len;
+        p = phys_page_find(page >> TARGET_PAGE_BITS);
+        if (!p) {
+            pd = IO_MEM_UNASSIGNED;
+        } else {
+            pd = p->phys_offset;
+        }
+
+        if ((pd & ~TARGET_PAGE_MASK) != IO_MEM_RAM) {
+            if (done || bounce.buffer) {
+                break;
+            }
+            bounce.buffer = qemu_memalign(TARGET_PAGE_SIZE, TARGET_PAGE_SIZE);
+            bounce.addr = addr;
+            bounce.len = l;
+            if (!is_write) {
+                cpu_physical_memory_rw(addr, bounce.buffer, l, 0);
+            }
+            ptr = bounce.buffer;
+        } else {
+            addr1 = (pd & TARGET_PAGE_MASK) + (addr & ~TARGET_PAGE_MASK);
+            ptr = phys_ram_base + addr1;
+        }
+        if (!done) {
+            ret = ptr;
+        } else if (ret + done != ptr) {
+            break;
+        }
+
+        len -= l;
+        addr += l;
+        done += l;
+    }
+    *plen = done;
+    return ret;
+}
+
+void cpu_physical_memory_unmap(void *buffer, target_phys_addr_t len,
+                               int is_write)
+{
+    if (buffer != bounce.buffer) {
+        if (is_write) {
+            unsigned long addr1 = (uint8_t *)buffer - phys_ram_base;
+            do {
+                unsigned l;
+                l = TARGET_PAGE_SIZE;
+                if (l > len)
+                    l = len;
+                if (!cpu_physical_memory_is_dirty(addr1)) {
+                    /* invalidate code */
+                    tb_invalidate_phys_page_range(addr1, addr1 + l, 0);
+                    /* set dirty bit */
+                    phys_ram_dirty[addr1 >> TARGET_PAGE_BITS] |=
+                        (0xff & ~CODE_DIRTY_FLAG);
+                }
+                addr1 += l;
+                len -= l;
+            } while (len);
+        }
+        return;
+    }
+    if (is_write) {
+        cpu_physical_memory_write(bounce.addr, bounce.buffer, bounce.len);
+    }
+    qemu_free(bounce.buffer);
+    bounce.buffer = NULL;
+}
 
 /* warning: addr must be aligned */
 uint32_t ldl_phys(target_phys_addr_t addr)
-- 
1.6.0.6

* [Qemu-devel] [PATCH 2/5] Add map client retry notification
  2009-01-18 19:53 [Qemu-devel] [PATCH 0/5] Direct memory access for devices Avi Kivity
  2009-01-18 19:53 ` [Qemu-devel] [PATCH 1/5] Add target memory mapping API Avi Kivity
@ 2009-01-18 19:53 ` Avi Kivity
  2009-01-19 14:58   ` [Qemu-devel] " Anthony Liguori
  2009-01-18 19:53 ` [Qemu-devel] [PATCH 3/5] Vectored block device API Avi Kivity
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 64+ messages in thread
From: Avi Kivity @ 2009-01-18 19:53 UTC (permalink / raw)
  To: qemu-devel, Anthony Liguori

The target memory mapping API may fail if the bounce buffer resources
are exhausted.  Add a notification mechanism to allow clients to retry
the mapping operation when resources become available again.

Signed-off-by: Avi Kivity <avi@redhat.com>
---
 cpu-all.h |    3 +++
 exec.c    |   38 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+), 0 deletions(-)

diff --git a/cpu-all.h b/cpu-all.h
index 3439999..67e795e 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -928,6 +928,9 @@ void *cpu_physical_memory_map(target_phys_addr_t addr,
                               int is_write);
 void cpu_physical_memory_unmap(void *buffer, target_phys_addr_t len,
                                int is_write);
+void *cpu_register_map_client(void *opaque, void (*callback)(void *opaque));
+void cpu_unregister_map_client(void *cookie);
+
 uint32_t ldub_phys(target_phys_addr_t addr);
 uint32_t lduw_phys(target_phys_addr_t addr);
 uint32_t ldl_phys(target_phys_addr_t addr);
diff --git a/exec.c b/exec.c
index 7162271..62bedc0 100644
--- a/exec.c
+++ b/exec.c
@@ -3053,6 +3053,43 @@ typedef struct {
 
 static BounceBuffer bounce;
 
+typedef struct MapClient {
+    void *opaque;
+    void (*callback)(void *opaque);
+    LIST_ENTRY(MapClient) link;
+} MapClient;
+
+static LIST_HEAD(map_client_list, MapClient) map_client_list
+    = LIST_HEAD_INITIALIZER(map_client_list);
+
+void *cpu_register_map_client(void *opaque, void (*callback)(void *opaque))
+{
+    MapClient *client = qemu_malloc(sizeof(*client));
+
+    client->opaque = opaque;
+    client->callback = callback;
+    LIST_INSERT_HEAD(&map_client_list, client, link);
+    return client;
+}
+
+void cpu_unregister_map_client(void *_client)
+{
+    MapClient *client = (MapClient *)_client;
+
+    LIST_REMOVE(client, link);
+}
+
+static void cpu_notify_map_clients(void)
+{
+    MapClient *client;
+
+    while (!LIST_EMPTY(&map_client_list)) {
+        client = LIST_FIRST(&map_client_list);
+        client->callback(client->opaque);
+        LIST_REMOVE(client, link);
+    }
+}
+
 void *cpu_physical_memory_map(target_phys_addr_t addr,
                               target_phys_addr_t *plen,
                               int is_write)
@@ -3137,6 +3174,7 @@ void cpu_physical_memory_unmap(void *buffer, target_phys_addr_t len,
     }
     qemu_free(bounce.buffer);
     bounce.buffer = NULL;
+    cpu_notify_map_clients();
 }
 
 /* warning: addr must be aligned */
-- 
1.6.0.6

* [Qemu-devel] [PATCH 3/5] Vectored block device API
  2009-01-18 19:53 [Qemu-devel] [PATCH 0/5] Direct memory access for devices Avi Kivity
  2009-01-18 19:53 ` [Qemu-devel] [PATCH 1/5] Add target memory mapping API Avi Kivity
  2009-01-18 19:53 ` [Qemu-devel] [PATCH 2/5] Add map client retry notification Avi Kivity
@ 2009-01-18 19:53 ` Avi Kivity
  2009-01-19 16:54   ` Blue Swirl
  2009-01-18 19:53 ` [Qemu-devel] [PATCH 4/5] I/O vector helpers Avi Kivity
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 64+ messages in thread
From: Avi Kivity @ 2009-01-18 19:53 UTC (permalink / raw)
  To: qemu-devel, Anthony Liguori

Most devices that are capable of DMA are also capable of scatter-gather.
With the memory mapping API, this means that the device code needs to be
able to access discontiguous host memory regions.

For block devices, this translates to vectored I/O.  This patch implements
an asynchronous vectored interface for the qemu block devices.  At the moment
all I/O is bounced and submitted through the non-vectored API; in the future
we will convert block devices to natively support vectored I/O wherever
possible.

Signed-off-by: Avi Kivity <avi@redhat.com>
---
 block.c |   92 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 block.h |    7 +++++
 2 files changed, 99 insertions(+), 0 deletions(-)

diff --git a/block.c b/block.c
index 3250327..4b2e34b 100644
--- a/block.c
+++ b/block.c
@@ -1246,6 +1246,93 @@ char *bdrv_snapshot_dump(char *buf, int buf_size, QEMUSnapshotInfo *sn)
 /**************************************************************/
 /* async I/Os */
 
+typedef struct VectorTranslationState {
+    struct iovec *iov;
+    int niov;
+    uint8_t *bounce;
+    int is_write;
+    BlockDriverAIOCB *aiocb;
+    BlockDriverAIOCB *this_aiocb;
+} VectorTranslationState;
+
+static void flatten_iovec(VectorTranslationState *s)
+{
+    uint8_t *p = s->bounce;
+    int i;
+
+    for (i = 0; i < s->niov; ++i) {
+        memcpy(p, s->iov[i].iov_base, s->iov[i].iov_len);
+        p += s->iov[i].iov_len;
+    }
+}
+
+static void unflatten_iovec(VectorTranslationState *s)
+{
+    uint8_t *p = s->bounce;
+    int i;
+
+    for (i = 0; i < s->niov; ++i) {
+        memcpy(s->iov[i].iov_base, p, s->iov[i].iov_len);
+        p += s->iov[i].iov_len;
+    }
+}
+
+static void bdrv_aio_rw_vector_cb(void *opaque, int ret)
+{
+    VectorTranslationState *s = opaque;
+
+    if (!s->is_write) {
+        unflatten_iovec(s);
+    }
+    qemu_free(s->bounce);
+    s->this_aiocb->cb(s->this_aiocb->opaque, ret);
+    qemu_aio_release(s->this_aiocb);
+}
+
+static BlockDriverAIOCB *bdrv_aio_rw_vector(BlockDriverState *bs,
+                                            int64_t sector_num,
+                                            struct iovec *iov, int niov,
+                                            int nb_sectors,
+                                            BlockDriverCompletionFunc *cb,
+                                            void *opaque,
+                                            int is_write)
+
+{
+    VectorTranslationState *s = qemu_mallocz(sizeof(*s));
+    BlockDriverAIOCB *aiocb = qemu_aio_get(bs, cb, opaque);
+
+    s->this_aiocb = aiocb;
+    s->iov = iov;
+    s->niov = niov;
+    s->bounce = qemu_memalign(512, nb_sectors * 512);
+    s->is_write = is_write;
+    if (is_write) {
+        flatten_iovec(s);
+        s->aiocb = bdrv_aio_write(bs, sector_num, s->bounce, nb_sectors,
+                                  bdrv_aio_rw_vector_cb, s);
+    } else {
+        s->aiocb = bdrv_aio_read(bs, sector_num, s->bounce, nb_sectors,
+                                 bdrv_aio_rw_vector_cb, s);
+    }
+    return aiocb;
+}
+
+BlockDriverAIOCB *bdrv_aio_readv(BlockDriverState *bs, int64_t sector_num,
+                                 struct iovec *iov, int niov, int nb_sectors,
+                                 BlockDriverCompletionFunc *cb, void *opaque)
+{
+    return bdrv_aio_rw_vector(bs, sector_num, iov, niov, nb_sectors,
+                              cb, opaque, 0);
+}
+
+BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
+                                  struct iovec *iov, int niov, int nb_sectors,
+                                  BlockDriverCompletionFunc *cb, void *opaque)
+{
+    return bdrv_aio_rw_vector(bs, sector_num, iov, niov, nb_sectors,
+                              cb, opaque, 1);
+}
+
 BlockDriverAIOCB *bdrv_aio_read(BlockDriverState *bs, int64_t sector_num,
                                 uint8_t *buf, int nb_sectors,
                                 BlockDriverCompletionFunc *cb, void *opaque)
@@ -1294,6 +1381,11 @@ void bdrv_aio_cancel(BlockDriverAIOCB *acb)
 {
     BlockDriver *drv = acb->bs->drv;
 
+    if (acb->cb == bdrv_aio_rw_vector_cb) {
+        VectorTranslationState *s = acb->opaque;
+        acb = s->aiocb;
+    }
+
     drv->bdrv_aio_cancel(acb);
 }
 
diff --git a/block.h b/block.h
index c3314a1..0391704 100644
--- a/block.h
+++ b/block.h
@@ -85,6 +85,13 @@ int bdrv_commit(BlockDriverState *bs);
 typedef struct BlockDriverAIOCB BlockDriverAIOCB;
 typedef void BlockDriverCompletionFunc(void *opaque, int ret);
 
+BlockDriverAIOCB *bdrv_aio_readv(BlockDriverState *bs, int64_t sector_num,
+                                 struct iovec *iov, int niov, int nb_sectors,
+                                 BlockDriverCompletionFunc *cb, void *opaque);
+BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
+                                  struct iovec *iov, int niov, int nb_sectors,
+                                  BlockDriverCompletionFunc *cb, void *opaque);
+
 BlockDriverAIOCB *bdrv_aio_read(BlockDriverState *bs, int64_t sector_num,
                                 uint8_t *buf, int nb_sectors,
                                 BlockDriverCompletionFunc *cb, void *opaque);
-- 
1.6.0.6

* [Qemu-devel] [PATCH 4/5] I/O vector helpers
  2009-01-18 19:53 [Qemu-devel] [PATCH 0/5] Direct memory access for devices Avi Kivity
                   ` (2 preceding siblings ...)
  2009-01-18 19:53 ` [Qemu-devel] [PATCH 3/5] Vectored block device API Avi Kivity
@ 2009-01-18 19:53 ` Avi Kivity
  2009-01-18 19:53 ` [Qemu-devel] [PATCH 5/5] Convert IDE to directly access guest memory Avi Kivity
  2009-01-19 16:50 ` [Qemu-devel] [PATCH 0/5] Direct memory access for devices Blue Swirl
  5 siblings, 0 replies; 64+ messages in thread
From: Avi Kivity @ 2009-01-18 19:53 UTC (permalink / raw)
  To: qemu-devel, Anthony Liguori

In general, it is not possible to predict the size of an I/O vector, since
a contiguous guest region may map to a discontiguous host region.  Add some
helpers to manage I/O vector growth.

Signed-off-by: Avi Kivity <avi@redhat.com>
---
 qemu-common.h |   10 ++++++++++
 vl.c          |   25 +++++++++++++++++++++++++
 2 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/qemu-common.h b/qemu-common.h
index d83e61b..1a746f9 100644
--- a/qemu-common.h
+++ b/qemu-common.h
@@ -191,6 +191,16 @@ int cpu_load(QEMUFile *f, void *opaque, int version_id);
 /* Force QEMU to stop what it's doing and service IO */
 void qemu_service_io(void);
 
+typedef struct QEMUIOVector {
+    struct iovec *iov;
+    int niov;
+    int nalloc;
+} QEMUIOVector;
+
+void qemu_iovec_init(QEMUIOVector *qiov, int alloc_hint);
+void qemu_iovec_add(QEMUIOVector *qiov, void *base, size_t len);
+void qemu_iovec_destroy(QEMUIOVector *qiov);
+
 #endif /* dyngen-exec.h hack */
 
 #endif
diff --git a/vl.c b/vl.c
index 34ddc07..7ceedd4 100644
--- a/vl.c
+++ b/vl.c
@@ -3344,6 +3344,31 @@ static void qemu_bh_update_timeout(int *timeout)
     }
 }
 
+/* io vectors */
+
+void qemu_iovec_init(QEMUIOVector *qiov, int alloc_hint)
+{
+    qiov->iov = qemu_malloc(alloc_hint * sizeof(struct iovec));
+    qiov->niov = 0;
+    qiov->nalloc = alloc_hint;
+}
+
+void qemu_iovec_add(QEMUIOVector *qiov, void *base, size_t len)
+{
+    if (qiov->niov == qiov->nalloc) {
+        qiov->nalloc = 2 * qiov->nalloc + 1;
+        qiov->iov = qemu_realloc(qiov->iov, qiov->nalloc * sizeof(struct iovec));
+    }
+    qiov->iov[qiov->niov].iov_base = base;
+    qiov->iov[qiov->niov].iov_len = len;
+    ++qiov->niov;
+}
+
+void qemu_iovec_destroy(QEMUIOVector *qiov)
+{
+    qemu_free(qiov->iov);
+}
+
 /***********************************************************/
 /* machine registration */
 
-- 
1.6.0.6

* [Qemu-devel] [PATCH 5/5] Convert IDE to directly access guest memory
  2009-01-18 19:53 [Qemu-devel] [PATCH 0/5] Direct memory access for devices Avi Kivity
                   ` (3 preceding siblings ...)
  2009-01-18 19:53 ` [Qemu-devel] [PATCH 4/5] I/O vector helpers Avi Kivity
@ 2009-01-18 19:53 ` Avi Kivity
  2009-01-19 16:50 ` [Qemu-devel] [PATCH 0/5] Direct memory access for devices Blue Swirl
  5 siblings, 0 replies; 64+ messages in thread
From: Avi Kivity @ 2009-01-18 19:53 UTC (permalink / raw)
  To: qemu-devel, Anthony Liguori

Instead of copying to a temporary buffer, map guest memory for IDE DMA
transactions.

Signed-off-by: Avi Kivity <avi@redhat.com>
---
 hw/ide.c |  127 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 files changed, 113 insertions(+), 14 deletions(-)

diff --git a/hw/ide.c b/hw/ide.c
index 7dd41f7..76adc59 100644
--- a/hw/ide.c
+++ b/hw/ide.c
@@ -422,6 +422,7 @@ typedef struct IDEState {
     int atapi_dma; /* true if dma is requested for the packet cmd */
     /* ATA DMA state */
     int io_buffer_size;
+    QEMUIOVector iovec;
     /* PIO transfer handling */
     int req_nb_sectors; /* number of sectors per interrupt */
     EndTransferFunc *end_transfer_func;
@@ -913,6 +914,97 @@ static int dma_buf_rw(BMDMAState *bm, int is_write)
     return 1;
 }
 
+/* return 0 if buffer completed */
+static int dma_buf_prepare(BMDMAState *bm, int is_write)
+{
+    IDEState *s = bm->ide_if;
+    struct {
+        uint32_t addr;
+        uint32_t size;
+    } prd;
+    int l, len;
+    void *mem;
+    target_phys_addr_t l1;
+
+    qemu_iovec_init(&s->iovec, s->nsector / (TARGET_PAGE_SIZE/512) + 1);
+    s->io_buffer_size = 0;
+    for(;;) {
+        if (bm->cur_prd_len == 0) {
+            /* end of table (with a fail safe of one page) */
+            if (bm->cur_prd_last ||
+                (bm->cur_addr - bm->addr) >= 4096)
+                return s->io_buffer_size != 0;
+            cpu_physical_memory_read(bm->cur_addr, (uint8_t *)&prd, 8);
+            bm->cur_addr += 8;
+            prd.addr = le32_to_cpu(prd.addr);
+            prd.size = le32_to_cpu(prd.size);
+            len = prd.size & 0xfffe;
+            if (len == 0)
+                len = 0x10000;
+            bm->cur_prd_len = len;
+            bm->cur_prd_addr = prd.addr;
+            bm->cur_prd_last = (prd.size & 0x80000000);
+        }
+        l = bm->cur_prd_len;
+        if (l > 0) {
+            l1 = l;
+            mem = cpu_physical_memory_map(bm->cur_prd_addr, &l1, is_write);
+            if (!mem) {
+                break;
+            }
+            qemu_iovec_add(&s->iovec, mem, l1);
+            bm->cur_prd_addr += l1;
+            bm->cur_prd_len -= l1;
+            s->io_buffer_size += l1;
+        }
+    }
+    return 1;
+}
+
+static void dma_buf_commit(IDEState *s, int is_write)
+{
+    int i;
+
+    for (i = 0; i < s->iovec.niov; ++i) {
+        cpu_physical_memory_unmap(s->iovec.iov[i].iov_base,
+                                  s->iovec.iov[i].iov_len, is_write);
+    }
+    qemu_iovec_destroy(&s->iovec);
+}
+
+typedef struct {
+    BMDMAState *bm;
+    void (*cb)(void *opaque, int ret);
+    QEMUBH *bh;
+} MapFailureContinuation;
+
+static void reschedule_dma(void *opaque)
+{
+    MapFailureContinuation *cont = opaque;
+
+    cont->cb(cont->bm, 0);
+    qemu_bh_delete(cont->bh);
+    qemu_free(cont);
+}
+
+static void continue_after_map_failure(void *opaque)
+{
+    MapFailureContinuation *cont = opaque;
+
+    cont->bh = qemu_bh_new(reschedule_dma, opaque);
+    qemu_bh_schedule(cont->bh);
+}
+
+static void wait_for_bounce_buffer(BMDMAState *bmdma,
+                                   void (*cb)(void *opaque, int ret))
+{
+    MapFailureContinuation *cont = qemu_malloc(sizeof(*cont));
+
+    cont->bm = bmdma;
+    cont->cb = cb;
+    cpu_register_map_client(cont, continue_after_map_failure);
+}
+
 static void ide_read_dma_cb(void *opaque, int ret)
 {
     BMDMAState *bm = opaque;
@@ -921,6 +1013,7 @@ static void ide_read_dma_cb(void *opaque, int ret)
     int64_t sector_num;
 
     if (ret < 0) {
+        dma_buf_commit(s, 1);
 	ide_dma_error(s);
 	return;
     }
@@ -928,11 +1021,10 @@ static void ide_read_dma_cb(void *opaque, int ret)
     n = s->io_buffer_size >> 9;
     sector_num = ide_get_sector(s);
     if (n > 0) {
+        dma_buf_commit(s, 1);
         sector_num += n;
         ide_set_sector(s, sector_num);
         s->nsector -= n;
-        if (dma_buf_rw(bm, 1) == 0)
-            goto eot;
     }
 
     /* end of transfer ? */
@@ -950,15 +1042,19 @@ static void ide_read_dma_cb(void *opaque, int ret)
 
     /* launch next transfer */
     n = s->nsector;
-    if (n > IDE_DMA_BUF_SECTORS)
-        n = IDE_DMA_BUF_SECTORS;
     s->io_buffer_index = 0;
     s->io_buffer_size = n * 512;
+    if (dma_buf_prepare(bm, 1) == 0)
+        goto eot;
+    if (!s->iovec.niov) {
+        wait_for_bounce_buffer(bm, ide_read_dma_cb);
+        return;
+    }
 #ifdef DEBUG_AIO
     printf("aio_read: sector_num=%" PRId64 " n=%d\n", sector_num, n);
 #endif
-    bm->aiocb = bdrv_aio_read(s->bs, sector_num, s->io_buffer, n,
-                              ide_read_dma_cb, bm);
+    bm->aiocb = bdrv_aio_readv(s->bs, sector_num, s->iovec.iov, s->iovec.niov,
+                               n, ide_read_dma_cb, bm);
     ide_dma_submit_check(s, ide_read_dma_cb, bm);
 }
 
@@ -1032,6 +1128,7 @@ static void ide_write_dma_cb(void *opaque, int ret)
     int64_t sector_num;
 
     if (ret < 0) {
+        dma_buf_commit(s, 0);
 	ide_dma_error(s);
 	return;
     }
@@ -1039,6 +1136,7 @@ static void ide_write_dma_cb(void *opaque, int ret)
     n = s->io_buffer_size >> 9;
     sector_num = ide_get_sector(s);
     if (n > 0) {
+        dma_buf_commit(s, 0);
         sector_num += n;
         ide_set_sector(s, sector_num);
         s->nsector -= n;
@@ -1057,20 +1155,21 @@ static void ide_write_dma_cb(void *opaque, int ret)
         return;
     }
 
-    /* launch next transfer */
     n = s->nsector;
-    if (n > IDE_DMA_BUF_SECTORS)
-        n = IDE_DMA_BUF_SECTORS;
-    s->io_buffer_index = 0;
     s->io_buffer_size = n * 512;
-
-    if (dma_buf_rw(bm, 0) == 0)
+    /* launch next transfer */
+    if (dma_buf_prepare(bm, 0) == 0)
         goto eot;
+    if (!s->iovec.niov) {
+        wait_for_bounce_buffer(bm, ide_write_dma_cb);
+        return;
+    }
 #ifdef DEBUG_AIO
     printf("aio_write: sector_num=%" PRId64 " n=%d\n", sector_num, n);
 #endif
-    bm->aiocb = bdrv_aio_write(s->bs, sector_num, s->io_buffer, n,
-                               ide_write_dma_cb, bm);
+    bm->aiocb = bdrv_aio_writev(s->bs, sector_num,
+                                s->iovec.iov, s->iovec.niov, n,
+                                ide_write_dma_cb, bm);
     ide_dma_submit_check(s, ide_write_dma_cb, bm);
 }
 
-- 
1.6.0.6

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-18 19:53 ` [Qemu-devel] [PATCH 1/5] Add target memory mapping API Avi Kivity
@ 2009-01-19 13:49   ` Ian Jackson
  2009-01-19 14:54     ` Avi Kivity
  2009-01-19 15:05     ` [Qemu-devel] [PATCH 1/5] " Gerd Hoffmann
  2009-01-19 14:56   ` [Qemu-devel] " Anthony Liguori
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 64+ messages in thread
From: Ian Jackson @ 2009-01-19 13:49 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity writes ("[Qemu-devel] [PATCH 1/5] Add target memory mapping API"):
> Devices accessing large amounts of memory (as with DMA) will wish to obtain
> a pointer to guest memory rather than access it indirectly via
> cpu_physical_memory_rw().  Add a new API to convert target addresses to
> host pointers.

In the Xen qemu tree we have alternative implementations of
cpu_physical_memory_{read,write}.  So we will need our own
implementation of this too.

So it is important to us that this interface is sufficient for its
callers (so that it doesn't need to change in the future), sufficient
for various different implementations (so that we can implement our
version), and correct in the sense that it is possible to write
correct code which uses it.

So as a result I have some comments about this API.  In summary I'm
afraid I think it needs a bit more work.  Perhaps we could have a
discussion about what this API should look like, by waving function
prototypes about (rather than complete patchsets) ?

Also there are some questions which your interface specification
leaves unanswered; these questions about the API semantics should
really be addressed in comments by the declaration.

> +void *cpu_physical_memory_map(target_phys_addr_t addr,
> +                              target_phys_addr_t *plen,
> +                              int is_write);

If the caller specifies is_write==1, are they entitled to read from
the buffer and expect it to contain something reasonable ?  In your
implementation, the answer appears to be `no'.

Is the caller allowed to assume that the accesses to the underlying
memory happen in any particular order - in particular, that they
happen in the same order as the caller's accesses to the buffer ?  Is
the caller allowed to assume that the same number of accesses to the
underlying memory happen as are made by the caller to the buffer ?
No, because if there is a bounce buffer the caller's accesses are
irrelevant.

What happens if the caller maps with is_write==1, writes part of the
buffer, and then unmaps ?  In your implementation undefined data is
written to guest memory.  I think that since not all callers can
guarantee to wish to write to the whole region, this means that the
interface is defective.  Callers need to be required to specify which
areas they have written.

The interface when cpu_physical_memory_map returns 0 is strange.
Normally everything in qemu is done with completion callbacks, but
here we have a kind of repeated polling.

This function should return a separate handle as well as the physical
memory pointer.  That will make it much easier to provide an
implementation which permits multiple bounce buffers or multiple
mappings simultaneously.


So I would prefer something like

  #define CPUPHYSMEMMAPFLAG_READABLE    001
  #define CPUPHYSMEMMAPFLAG_WRITABLE    002

  #define CPUPHYSMEMMAPFLAG_ALLOWFAIL   010
     /* If _ALLOWFAIL set then on callback, buffer may be NULL, in
      * which case caller should use cpu_physical_memory_{read,write} */

  typedef struct {
    /* to be filled in by caller before calling _map: */
    unsigned flags;
    target_phys_addr_t addr, len; /* len is updated by _map */
    /* filled in by _map: */
    void *buffer;
    /* private to _map et al: */
  } CpuPhysicalMemoryMapping;

  void cpu_physical_memory_map(CpuPhysicalMemoryMapping*,
      CpuPhysicalMemoryMapCallback *cb, void *cb_arg);
    /* There may be a limit on the number of concurrent maps
     * and the limit may be as low as one. */

  typedef void CpuPhysicalMemoryMapCallback(CpuPhysicalMemoryMapping*,
      void *cb_arg);

  void cpu_physical_memory_map_notify_read(CpuPhysicalMemoryMapping*,
      target_phys_addr_t addr, target_phys_addr_t *plen);
    /* must be called before read accesses to any buffer region */

  void cpu_physical_memory_map_notify_write(CpuPhysicalMemoryMapping*,
      target_phys_addr_t addr, target_phys_addr_t *plen);
    /* should be called after write accesses to any buffer region
     * unless it is OK for it to be indeterminate whether the
     * guest's memory actually gets written */

  void cpu_physical_memory_unmap(CpuPhysicalMemoryMapping*);
    /* Guest memory may be read and written out of order or even a
     * different number of times. */

Ian.

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 13:49   ` Ian Jackson
@ 2009-01-19 14:54     ` Avi Kivity
  2009-01-19 15:39       ` Anthony Liguori
  2009-01-19 16:40       ` Ian Jackson
  2009-01-19 15:05     ` [Qemu-devel] [PATCH 1/5] " Gerd Hoffmann
  1 sibling, 2 replies; 64+ messages in thread
From: Avi Kivity @ 2009-01-19 14:54 UTC (permalink / raw)
  To: qemu-devel; +Cc: Ian Jackson

Ian Jackson wrote:
> Avi Kivity writes ("[Qemu-devel] [PATCH 1/5] Add target memory mapping API"):
>   
>> Devices accessing large amounts of memory (as with DMA) will wish to obtain
>> a pointer to guest memory rather than access it indirectly via
>> cpu_physical_memory_rw().  Add a new API to convert target addresses to
>> host pointers.
>>     
>
> In the Xen qemu tree we have alternative implementations of
> cpu_physical_memory_{read,write}.  So we will need our own
> implementation of this too.
>
> So it is important to us that this interface is sufficient for its
> callers (so that it doesn't need to change in the future), sufficient
> for various different implementations (so that we can implement our
> version), and correct in the sense that it is possible to write
> correct code which uses it.
>   

Sure.

> So as a result I have some comments about this API.  In summary I'm
> afraid I think it needs a bit more work.  Perhaps we could have a
> discussion about what this API should look like, by waving function
> prototypes about (rather than complete patchsets) ?
>
> Also there are some questions which your interface specification
> leaves unanswered; these questions about the API semantics should
> really be addressed in comments by the declaration.
>
>   
>> +void *cpu_physical_memory_map(target_phys_addr_t addr,
>> +                              target_phys_addr_t *plen,
>> +                              int is_write);
>>     
>
> If the caller specifies is_write==1, are they entitled to read from
> the buffer and expect it to contain something reasonable ?  In your
> implementation, the answer appears to be `no'.
>   

Correct.  If you need to perform read-modify-write, you need to use 
cpu_physical_memory_rw(), twice.  If we ever want to support RMW, we'll 
need to add another value for is_write.  I don't think we have 
interesting devices at this point which require efficient RMW.
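
The double-rw() read-modify-write idiom Avi describes can be sketched with toy stand-ins: guest_ram and cpu_physical_memory_rw_stub below are illustrative only, not QEMU's real implementation (which walks physical page descriptors and may hit MMIO):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for guest RAM. */
static uint8_t guest_ram[4096];

/* Stand-in with the same shape as cpu_physical_memory_rw(). */
static void cpu_physical_memory_rw_stub(uint32_t addr, uint8_t *buf,
                                        int len, int is_write)
{
    if (is_write)
        memcpy(guest_ram + addr, buf, len);
    else
        memcpy(buf, guest_ram + addr, len);
}

/* Read-modify-write without a mapping API: one read pass into a
 * private copy, mutate the copy, one write pass back. */
static void rmw_or_byte(uint32_t addr, uint8_t bits)
{
    uint8_t b;
    cpu_physical_memory_rw_stub(addr, &b, 1, 0);  /* read */
    b |= bits;                                    /* modify */
    cpu_physical_memory_rw_stub(addr, &b, 1, 1);  /* write back */
}
```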

> Is the caller allowed to assume that the accesses to the underlying
> memory happen in any particular order - in particular, that they
> happen in the same order as the caller's accesses to the buffer ?  Is
> the caller allowed to assume that the same number of accesses to the
> underlying memory happen as are made by the caller to the buffer ?
> No, because if there is a bounce buffer the caller's accesses are
> irrelevant.
>   

Right.  Moreover, the access sizes will be swallowed by the API; devices 
are allowed to provide different implementations for different access sizes.

> What happens if the caller maps with is_write==1, writes part of the
> buffer, and then unmaps ?  In your implementation undefined data is
> written to guest memory.  I think that since not all callers can
> guarantee to wish to write to the whole region, this means that the
> interface is defective.  Callers need to be required to specify which
> areas they have written.
>   

Alternatively, only use this interface with devices where this doesn't 
matter.  Given that bouncing happens for mmio only, this would be all 
devices which you'd want to use this interface with anyway.

(I'm assuming that you'll implement the fastpath by directly mapping 
guest memory, not bouncing).

> The interface when cpu_physical_memory_map returns 0 is strange.
> Normally everything in qemu is done with completion callbacks, but
> here we have a kind of repeated polling.
>   

I agree.  This was done at Anthony's request so I'll defer the response 
to him.

A variant of this API (posted by Andrea) hid all of the scheduling away 
within the implementation.

> This function should return a separate handle as well as the physical
> memory pointer.  That will make it much easier to provide an
> implementation which permits multiple bounce buffers or multiple
> mappings simultaneously.
>   

The downside to a separate handle is that device emulation code will now 
need to maintain the handle in addition to the virtual address.  
Since the addresses will typically be maintained in an iovec, this means 
another array to be allocated and resized.

The design goals here were to keep things as simple as possible for the 
fast path.  Since the API fits all high-bandwidth devices that I know 
of, I don't think it's a good tradeoff to make the API more complex in 
order to be applicable to some corner cases.

Provided you implement the RAM case by mmap()ing, not bouncing, I think 
it will fit Xen quite well.  You might need to return NULL when the map 
cache is exhausted, but callers will not need to be modified.

>
> So I would prefer something like
>
>   #define CPUPHYSMEMMAPFLAG_READABLE    001
>   #define CPUPHYSMEMMAPFLAG_WRITABLE    002
>
>   #define CPUPHYSMEMMAPFLAG_ALLOWFAIL   010
>      /* If _ALLOWFAIL set then on callback, buffer may be NULL, in
>       * which case caller should use cpu_physical_memory_{read,write} */
>
>   typedef struct {
>     /* to be filled in by caller before calling _map: */
>     unsigned flags;
>     target_phys_addr_t addr, len; /* len is updated by _map */
>     /* filled in by _map: */
>     void *buffer;
>     /* private to _map et al: */
>   } CpuPhysicalMemoryMapping;
>
>   void cpu_physical_memory_map(CpuPhysicalMemoryMapping*;
>       CpuPhysicalMemoryMapCallback *cb, void *cb_arg);
>     /* There may be a limit on the number of concurrent maps
>      * and the limit may be as low as one. */
>
>   typedef void CpuPhysicalMemoryMapCallback(CpuPhysicalMemoryMapping*,
>       void *cb_arg);
>
>   void cpu_physical_memory_map_notify_read(CpuPhysicalMemoryMapping*,
>       target_phys_addr_t addr, target_phys_addr_t *plen);
>     /* must be called before read accesses to any buffer region */
>
>   void cpu_physical_memory_map_notify_write(CpuPhysicalMemoryMapping*,
>       target_phys_addr_t addr, target_phys_addr_t *plen);
>     /* should be called after write accesses to any buffer region
>      * unless it is OK for it to be indeterminate whether the
>      * guest's memory actually gets written */
>
>   void cpu_physical_memory_unmap(CpuPhysicalMemoryMapping*);
>     /* Guest memory may be read and written out of order or even a
>      * different number of times. */
>   

If we go in this direction the API should manage scatter/gather vectors, 
not single mappings.  This will remove the need for clients to store 
handles.
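
The scatter/gather direction could be sketched as a loop that chunks a guest-physical range into per-page mappings; everything here (sg_entry, map_stub, map_sg) is an illustrative stand-in invented for the sketch, not the proposed QEMU API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Illustrative iovec-like entry. */
struct sg_entry { void *base; uint32_t len; };

static uint8_t fake_ram[4 * PAGE_SIZE];

/* Toy map(): succeeds for addresses inside fake_ram and clamps the
 * length to the page boundary, mimicking the per-page chunking of
 * the proposed cpu_physical_memory_map(). */
static void *map_stub(uint32_t addr, uint32_t *plen)
{
    uint32_t in_page = PAGE_SIZE - (addr & (PAGE_SIZE - 1));
    if (addr >= sizeof(fake_ram))
        return NULL;
    if (*plen > in_page)
        *plen = in_page;
    return fake_ram + addr;
}

/* Build a scatter/gather vector for [addr, addr+size); returns the
 * number of entries, or -1 if some chunk could not be mapped. */
static int map_sg(uint32_t addr, uint32_t size,
                  struct sg_entry *sg, int max_sg)
{
    int n = 0;
    while (size > 0 && n < max_sg) {
        uint32_t len = size;
        void *p = map_stub(addr, &len);
        if (!p)
            return -1;
        sg[n].base = p;
        sg[n].len = len;
        n++;
        addr += len;
        size -= len;
    }
    return n;
}
```

With the API managing the vector like this, callers keep one array of (pointer, length) pairs and never see per-mapping handles.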

I suggest you look at Andrea's patchset for a very different approach.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Qemu-devel] Re: [PATCH 1/5] Add target memory mapping API
  2009-01-18 19:53 ` [Qemu-devel] [PATCH 1/5] Add target memory mapping API Avi Kivity
  2009-01-19 13:49   ` Ian Jackson
@ 2009-01-19 14:56   ` Anthony Liguori
  2009-01-19 15:03     ` Avi Kivity
  2009-01-20 18:43   ` Anthony Liguori
  2009-01-21 11:52   ` [Qemu-devel] " Mike Day
  3 siblings, 1 reply; 64+ messages in thread
From: Anthony Liguori @ 2009-01-19 14:56 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel

Avi Kivity wrote:
> Devices accessing large amounts of memory (as with DMA) will wish to obtain
> a pointer to guest memory rather than access it indirectly via
> cpu_physical_memory_rw().  Add a new API to convert target addresses to
> host pointers.
>
> In case the target address does not correspond to RAM, a bounce buffer is
> allocated.  To prevent the guest from causing the host to allocate unbounded
> amounts of bounce buffer, this memory is limited (currently to one page).
>
> Signed-off-by: Avi Kivity <avi@redhat.com>
> ---
>  cpu-all.h |    5 +++
>  exec.c    |   93 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 98 insertions(+), 0 deletions(-)
>
> diff --git a/cpu-all.h b/cpu-all.h
> index ee0a6e3..3439999 100644
> --- a/cpu-all.h
> +++ b/cpu-all.h
> @@ -923,6 +923,11 @@ static inline void cpu_physical_memory_write(target_phys_addr_t addr,
>  {
>      cpu_physical_memory_rw(addr, (uint8_t *)buf, len, 1);
>  }
> +void *cpu_physical_memory_map(target_phys_addr_t addr,
> +                              target_phys_addr_t *plen,
> +                              int is_write);
> +void cpu_physical_memory_unmap(void *buffer, target_phys_addr_t len,
> +                               int is_write);
>  uint32_t ldub_phys(target_phys_addr_t addr);
>  uint32_t lduw_phys(target_phys_addr_t addr);
>  uint32_t ldl_phys(target_phys_addr_t addr);
> diff --git a/exec.c b/exec.c
> index faa6333..7162271 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -3045,6 +3045,99 @@ void cpu_physical_memory_write_rom(target_phys_addr_t addr,
>      }
>  }
>  
> +typedef struct {
> +    void *buffer;
> +    target_phys_addr_t addr;
> +    target_phys_addr_t len;
> +} BounceBuffer;
> +
> +static BounceBuffer bounce;
> +
> +void *cpu_physical_memory_map(target_phys_addr_t addr,
> +                              target_phys_addr_t *plen,
> +                              int is_write)
> +{
> +    target_phys_addr_t len = *plen;
> +    target_phys_addr_t done = 0;
> +    int l;
> +    uint8_t *ret = NULL;
> +    uint8_t *ptr;
> +    target_phys_addr_t page;
> +    unsigned long pd;
> +    PhysPageDesc *p;
> +    unsigned long addr1;
> +
> +    while (len > 0) {
> +        page = addr & TARGET_PAGE_MASK;
> +        l = (page + TARGET_PAGE_SIZE) - addr;
> +        if (l > len)
> +            l = len;
> +        p = phys_page_find(page >> TARGET_PAGE_BITS);
> +        if (!p) {
> +            pd = IO_MEM_UNASSIGNED;
> +        } else {
> +            pd = p->phys_offset;
> +        }
> +
> +        if ((pd & ~TARGET_PAGE_MASK) != IO_MEM_RAM) {
> +            if (done || bounce.buffer) {
> +                break;
> +            }
> +            bounce.buffer = qemu_memalign(TARGET_PAGE_SIZE, TARGET_PAGE_SIZE);
> +            bounce.addr = addr;
> +            bounce.len = l;
>   

I like the bouncing approach.  Namely, that it never bounces more than a 
page at a time.  I think that's clever.

Could you add some documentation to this function?  Namely, making it 
clear that it can return NULL and that if it does, the caller must retry.

> +void cpu_physical_memory_unmap(void *buffer, target_phys_addr_t len,
> +                               int is_write)
> +{
> +    if (buffer != bounce.buffer) {
> +        if (is_write) {
> +            unsigned long addr1 = (uint8_t *)buffer - phys_ram_base;
> +            do {
> +                unsigned l;
> +                l = TARGET_PAGE_SIZE;
> +                if (l > len)
> +                    l = len;
> +                if (!cpu_physical_memory_is_dirty(addr1)) {
> +                    /* invalidate code */
> +                    tb_invalidate_phys_page_range(addr1, addr1 + len, 0);
> +                    /* set dirty bit */
> +                    phys_ram_dirty[addr1 >> TARGET_PAGE_BITS] |=
> +                        (0xff & ~CODE_DIRTY_FLAG);
> +                }
> +                addr1 += l;
> +                len -= l;
> +            } while (len);
> +        }
> +        return;
> +    }
> +    if (is_write) {
> +        cpu_physical_memory_write(bounce.addr, bounce.buffer, bounce.len);
> +    }
> +    qemu_free(bounce.buffer);
> +    bounce.buffer = NULL;
>   

If map() fails, how does the caller determine when to retry the mapping?

Regards,

Anthony Liguori

> +}
>  
>  /* warning: addr must be aligned */
>  uint32_t ldl_phys(target_phys_addr_t addr)
>   

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Qemu-devel] Re: [PATCH 2/5] Add map client retry notification
  2009-01-18 19:53 ` [Qemu-devel] [PATCH 2/5] Add map client retry notification Avi Kivity
@ 2009-01-19 14:58   ` Anthony Liguori
  0 siblings, 0 replies; 64+ messages in thread
From: Anthony Liguori @ 2009-01-19 14:58 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel

Avi Kivity wrote:
> The target memory mapping API may fail if the bounce buffer resources
> are exhausted.  Add a notification mechanism to allow clients to retry
> the mapping operation when resources become available again.
>
> Signed-off-by: Avi Kivity <avi@redhat.com>
>   

Fantastic!  Clearly, I should have read the next patch before responding 
to the first.

Regards,

Anthony Liguori

> ---
>  cpu-all.h |    3 +++
>  exec.c    |   38 ++++++++++++++++++++++++++++++++++++++
>  2 files changed, 41 insertions(+), 0 deletions(-)
>
> diff --git a/cpu-all.h b/cpu-all.h
> index 3439999..67e795e 100644
> --- a/cpu-all.h
> +++ b/cpu-all.h
> @@ -928,6 +928,9 @@ void *cpu_physical_memory_map(target_phys_addr_t addr,
>                                int is_write);
>  void cpu_physical_memory_unmap(void *buffer, target_phys_addr_t len,
>                                 int is_write);
> +void *cpu_register_map_client(void *opaque, void (*callback)(void *opaque));
> +void cpu_unregister_map_client(void *cookie);
> +
>  uint32_t ldub_phys(target_phys_addr_t addr);
>  uint32_t lduw_phys(target_phys_addr_t addr);
>  uint32_t ldl_phys(target_phys_addr_t addr);
> diff --git a/exec.c b/exec.c
> index 7162271..62bedc0 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -3053,6 +3053,43 @@ typedef struct {
>  
>  static BounceBuffer bounce;
>  
> +typedef struct MapClient {
> +    void *opaque;
> +    void (*callback)(void *opaque);
> +    LIST_ENTRY(MapClient) link;
> +} MapClient;
> +
> +static LIST_HEAD(map_client_list, MapClient) map_client_list
> +    = LIST_HEAD_INITIALIZER(map_client_list);
> +
> +void *cpu_register_map_client(void *opaque, void (*callback)(void *opaque))
> +{
> +    MapClient *client = qemu_malloc(sizeof(*client));
> +
> +    client->opaque = opaque;
> +    client->callback = callback;
> +    LIST_INSERT_HEAD(&map_client_list, client, link);
> +    return client;
> +}
> +
> +void cpu_unregister_map_client(void *_client)
> +{
> +    MapClient *client = (MapClient *)_client;
> +
> +    LIST_REMOVE(client, link);
> +}
> +
> +static void cpu_notify_map_clients(void)
> +{
> +    MapClient *client;
> +
> +    while (!LIST_EMPTY(&map_client_list)) {
> +        client = LIST_FIRST(&map_client_list);
> +        client->callback(client->opaque);
> +        LIST_REMOVE(client, link);
> +    }
> +}
> +
>  void *cpu_physical_memory_map(target_phys_addr_t addr,
>                                target_phys_addr_t *plen,
>                                int is_write)
> @@ -3137,6 +3174,7 @@ void cpu_physical_memory_unmap(void *buffer, target_phys_addr_t len,
>      }
>      qemu_free(bounce.buffer);
>      bounce.buffer = NULL;
> +    cpu_notify_map_clients();
>  }
>  
>  /* warning: addr must be aligned */
>   

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Qemu-devel] Re: [PATCH 1/5] Add target memory mapping API
  2009-01-19 14:56   ` [Qemu-devel] " Anthony Liguori
@ 2009-01-19 15:03     ` Avi Kivity
  2009-01-19 15:49       ` Anthony Liguori
  0 siblings, 1 reply; 64+ messages in thread
From: Avi Kivity @ 2009-01-19 15:03 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel

Anthony Liguori wrote:
>> +        if ((pd & ~TARGET_PAGE_MASK) != IO_MEM_RAM) {
>> +            if (done || bounce.buffer) {
>> +                break;
>> +            }
>> +            bounce.buffer = qemu_memalign(TARGET_PAGE_SIZE, 
>> TARGET_PAGE_SIZE);
>> +            bounce.addr = addr;
>> +            bounce.len = l;
>>   
>
> I like the bouncing approach.  Namely, that it never bounces more than 
> a page at a time.  I think that's clever.

I tried to make it as stupid as possible.  Looks like I failed.

>
> Could you add some documentation to this function?  Namely, making it 
> clear that it can return NULL and that if it does, the caller must retry.

Sure.

> If map() fails, how does the caller determine when to retry the mapping?
>

By reading the next patch :)

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 13:49   ` Ian Jackson
  2009-01-19 14:54     ` Avi Kivity
@ 2009-01-19 15:05     ` Gerd Hoffmann
  2009-01-19 15:23       ` Avi Kivity
  1 sibling, 1 reply; 64+ messages in thread
From: Gerd Hoffmann @ 2009-01-19 15:05 UTC (permalink / raw)
  To: qemu-devel

Ian Jackson wrote:
> The interface when cpu_physical_memory_map returns 0 is strange.
> Normally everything in qemu is done with completion callbacks, but
> here we have a kind of repeated polling.

Check patch #2 ;)

>   typedef struct {
>     /* to be filled in by caller before calling _map: */
>     unsigned flags;
>     target_phys_addr_t addr, len; /* len is updated by _map */
>     /* filled in by _map: */
>     void *buffer;
>     /* private to _map et al: */
>   } CpuPhysicalMemoryMapping;
> 
>   void cpu_physical_memory_map(CpuPhysicalMemoryMapping*;
>       CpuPhysicalMemoryMapCallback *cb, void *cb_arg);
>     /* There may be a limit on the number of concurrent maps
>      * and the limit may be as low as one. */

Hmm, I'd like to see grant tables fit into this picture too.
Using flags (instead of is_write) for map requests we can handle that
too I think.  We need a flag indicating we are passing a grant handle
not a guest address, then we can stick the handle into addr (or make
that a union).

cheers,
  Gerd

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 15:05     ` [Qemu-devel] [PATCH 1/5] " Gerd Hoffmann
@ 2009-01-19 15:23       ` Avi Kivity
  2009-01-19 15:29         ` Avi Kivity
  0 siblings, 1 reply; 64+ messages in thread
From: Avi Kivity @ 2009-01-19 15:23 UTC (permalink / raw)
  To: qemu-devel

Gerd Hoffmann wrote:
> Hmm, I'd like to see grant tables fit into this picture too.
> Using flags (instead of is_write) for map requests we can handle that
> too I think.  We need a flag indicating we are passing a grant handle
> not a guest address, then we can stick the handle into addr (or make
> that a union).
>   

Grant handles should be handled by a different API.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 15:23       ` Avi Kivity
@ 2009-01-19 15:29         ` Avi Kivity
  2009-01-19 15:57           ` Gerd Hoffmann
  0 siblings, 1 reply; 64+ messages in thread
From: Avi Kivity @ 2009-01-19 15:29 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity wrote:
> Gerd Hoffmann wrote:
>> Hmm, I'd like to see grant tables fit into this picture too.
>> Using flags (instead of is_write) for map requests we can handle that
>> too I think.  We need a flag indicating we are passing a grant handle
>> not a guest address, then we can stick the handle into addr (or make
>> that a union).
>>   
>
> Grant handles should be handled by a different API.
>

That was terse... here's how I think it should be done:

- step 1: map the grant handle into guest physical memory
- step 2: use cpu_physical_memory_{read,write,map}() to access the memory

This makes grant tables transparent to the actual memory access API.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 14:54     ` Avi Kivity
@ 2009-01-19 15:39       ` Anthony Liguori
  2009-01-19 16:18         ` Paul Brook
  2009-01-19 16:57         ` Ian Jackson
  2009-01-19 16:40       ` Ian Jackson
  1 sibling, 2 replies; 64+ messages in thread
From: Anthony Liguori @ 2009-01-19 15:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: Ian Jackson, Avi Kivity

Avi Kivity wrote:
>> The interface when cpu_physical_memory_map returns 0 is strange.
>> Normally everything in qemu is done with completion callbacks, but
>> here we have a kind of repeated polling.
>>   
>
> I agree.  This was done at Anthony's request so I'll defer the 
> response to him.

There are two distinct IO consumers of this API: streaming IO and packet IO.

For streaming IO, you have a model of:

process_data:
while (offset < size) {
  data = map(offset, &len)
  if (data) {
     do IO on (data, len)
     unmap(data)
     offset += len;
  } else
     break
}

if (offset < size)
   add callback for mappability to process_data

I agree that this model could be formalized into something that took a 
'do IO on (data, len)' actor.  In fact, since map() and unmap() are 
pretty generic, they too could be actors.  This would then work for CPU 
memory IO, PCI memory IO, etc.

The packet IO API is a bit different.  It looks like:

while (offset < size) {
  data = map(offset, &len)
  if (data == NULL)
     break;
  sg[n_sg].iov_base = data;
  sg[n_sg].iov_len = len;
  n_sg++;
  offset += len;
}

if (offset < size) {
  for (i = 0; i < n_sg; i++)
     unmap(sg[i].iov_base);
   sg[0].iov_base = alloc_buffer(size);
   sg[0].iov_len = size;
   cpu_physical_memory_rw(sg[0].iov_base, size);
}

do IO on (sg)

if (offset < size) {
  free(sg[0].iov_base);
}

In this case, it isn't useful to get a callback with some of the packet 
data.  You need to know up front whether you can map all of the packet 
data.  In fact, a callback API doesn't really work, because it implies 
that at the end of the callback you either release the data, or that the 
next callback cannot be invoked until you unmap previously mapped data.

So this is why I prefer the map() API, as it accommodates two distinct 
users in a way that the callback API wouldn't.  We can formalize these 
idioms into an API, of course.
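
The packet idiom above, including the bounce fallback, can be sketched end to end.  try_map, map_packet and the "odd pages are MMIO" rule are all toy stand-ins invented for the sketch; the memcpy in the fallback stands in for cpu_physical_memory_rw():

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE 4096u

/* Toy model: odd pages pretend to be unmappable MMIO, forcing the
 * bounce path. */
static uint8_t ram[8 * PAGE];

static void *try_map(uint32_t addr, uint32_t *plen)
{
    uint32_t in_page = PAGE - (addr % PAGE);
    if ((addr / PAGE) & 1)          /* odd page: pretend it's MMIO */
        return NULL;
    if (*plen > in_page)
        *plen = in_page;
    return ram + addr;
}

struct iovec_s { void *iov_base; uint32_t iov_len; };

/* Packet idiom: map the whole packet, or fall back to one bounce
 * buffer.  Returns the number of iovec entries and sets *bounced. */
static int map_packet(uint32_t addr, uint32_t size,
                      struct iovec_s *sg, int *bounced)
{
    uint32_t off = 0;
    int n = 0;
    while (off < size) {
        uint32_t len = size - off;
        void *p = try_map(addr + off, &len);
        if (!p)
            break;
        sg[n].iov_base = p;
        sg[n].iov_len = len;
        n++;
        off += len;
    }
    if (off < size) {               /* partial map: bounce the packet */
        /* real code would unmap sg[0..n-1] and use
         * cpu_physical_memory_rw() here */
        sg[0].iov_base = malloc(size);
        sg[0].iov_len = size;
        memcpy(sg[0].iov_base, ram + addr, size);
        *bounced = 1;
        return 1;
    }
    *bounced = 0;
    return n;
}
```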

BTW, to support this model, we have to reserve at least one bounce 
buffer for cpu_physical_memory_rw.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Qemu-devel] Re: [PATCH 1/5] Add target memory mapping API
  2009-01-19 15:03     ` Avi Kivity
@ 2009-01-19 15:49       ` Anthony Liguori
  2009-01-19 15:51         ` Avi Kivity
  0 siblings, 1 reply; 64+ messages in thread
From: Anthony Liguori @ 2009-01-19 15:49 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel

Avi Kivity wrote:
> Anthony Liguori wrote:
>>> +        if ((pd & ~TARGET_PAGE_MASK) != IO_MEM_RAM) {
>>> +            if (done || bounce.buffer) {
>>> +                break;
>>> +            }
>>> +            bounce.buffer = qemu_memalign(TARGET_PAGE_SIZE, 
>>> TARGET_PAGE_SIZE);
>>> +            bounce.addr = addr;
>>> +            bounce.len = l;
>>>   
>>
>> I like the bouncing approach.  Namely, that it never bounces more 
>> than a page at a time.  I think that's clever.
>
> I tried to make it as stupid as possible.  Looks like I failed.

Or your version of stupid is my version of clever :-)

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Qemu-devel] Re: [PATCH 1/5] Add target memory mapping API
  2009-01-19 15:49       ` Anthony Liguori
@ 2009-01-19 15:51         ` Avi Kivity
  0 siblings, 0 replies; 64+ messages in thread
From: Avi Kivity @ 2009-01-19 15:51 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel

Anthony Liguori wrote:
>>
>> I tried to make it as stupid as possible.  Looks like I failed.
>
> Or your version of stupid is my version of clever :-)
>

I'm not going there... :)

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 15:29         ` Avi Kivity
@ 2009-01-19 15:57           ` Gerd Hoffmann
  2009-01-19 16:25             ` Avi Kivity
  2009-01-19 17:08             ` Ian Jackson
  0 siblings, 2 replies; 64+ messages in thread
From: Gerd Hoffmann @ 2009-01-19 15:57 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity wrote:
>>
>> Grant handles should be handled by a different API.
> 
> That was terse... here's how I think it should be done:
> 
> - step 1: map the grant handle into guest physical memory

Haha.

Guess I should explain a bit more what grant handles are ...

A xen domain can give other domains permission to map specific pages.
This is called a grant.  This is used by xen paravirtual drivers.  The
domU frontend driver hands out grants to the backend driver (usually
dom0), so the backend driver can fill the guest buffers with the data.

I want qemu to be able to handle the backend side, so it must be able to
map and unmap grants.   I think fitting this into the memory mapping API
can be useful, especially if the mapping API starts to get used by
generic code.  Maybe the block layer some day accepts iovecs with guest
physical addresses and does map/unmap transparently?  Then I'd like to
be able to pass in an iovec with grants instead.

cheers,
  Gerd

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 15:39       ` Anthony Liguori
@ 2009-01-19 16:18         ` Paul Brook
  2009-01-19 16:33           ` Anthony Liguori
  2009-01-19 16:57         ` Ian Jackson
  1 sibling, 1 reply; 64+ messages in thread
From: Paul Brook @ 2009-01-19 16:18 UTC (permalink / raw)
  To: qemu-devel; +Cc: Ian Jackson, Avi Kivity

>[block iterator]
> I agree that this model could be formalized into something that took a
> 'do IO on (data, len)' actor.  In fact, since map() and unmap() are
> pretty generic, they too could be actors.  This would then work for CPU
> memory IO, PCI memory IO, etc.
>
> The packet IO API is a bit different.  It looks like:
>...
> if (offset < len) {
>    sg[0].iov_base = alloc_buffer(size);
>    cpu_physical_memory_rw(sg[0].iov_base, size);
>
> In this case, it isn't useful to get a callback with some of the packet
> data.  You need to know up front whether you can map all of the packet
> data.  In fact, a callback API doesn't really work because it implies
> that at the end of the callback, you either release the data or that the
> next callback could not be invoked until you unmap a previous data.

It looks like what you're actually doing is pushing the bounce buffer 
allocation into the individual packet consumers.

Maybe a solution to this is a 'do IO on IOVEC' actor, with an additional flag 
that says whether it is acceptable to split the allocation. That way both 
block and packet interfaces use the same API, and avoids proliferation of 
manual bounce buffers in packet devices.

Paul

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 15:57           ` Gerd Hoffmann
@ 2009-01-19 16:25             ` Avi Kivity
  2009-01-19 17:08             ` Ian Jackson
  1 sibling, 0 replies; 64+ messages in thread
From: Avi Kivity @ 2009-01-19 16:25 UTC (permalink / raw)
  To: qemu-devel

Gerd Hoffmann wrote:
> Avi Kivity wrote:
>   
>>> Grant handles should be handled by a different API.
>>>       
>> That was terse... here's how I think it should be done:
>>
>> - step 1: map the grant handle into guest physical memory
>>     
>
> Haha.
>
>   
Right :)

> Guess I should explain a bit more what grant handles are ...
>
> A xen domain can give other domains permission to map specific pages.
> This is called a grant.  This is used by xen paravirtual drivers.  The
> domU frontend driver hands out grants to the backend driver (usually
> dom0), so the backend driver can fill the guest buffers with the data.
>
> I want qemu be able to handle the backend side, so it must be able to
> map and unmap grants.   I think fitting this into the memory mapping API
> can be useful, especially if the mapping API starts to get used by
> generic code.  Maybe the block layer some day accepts iovecs with guest
> physical addresses and does map/unmap transparently?  Then I'd like to
> be able to pass in a iovec with grants instead.
>   

Well, I still think it should be done by a separate API.  Multiplexing 
target addresses and grant handles on the same API is messy.

The block driver should keep its host virtual address interface; maybe 
we'll add a layer that accepts sglists on top of it; in that case we can 
add another API that accepts sglists that contain grant handles instead 
of addresses.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 16:18         ` Paul Brook
@ 2009-01-19 16:33           ` Anthony Liguori
  2009-01-19 16:39             ` Avi Kivity
  0 siblings, 1 reply; 64+ messages in thread
From: Anthony Liguori @ 2009-01-19 16:33 UTC (permalink / raw)
  To: Paul Brook; +Cc: Ian Jackson, qemu-devel, Avi Kivity

Paul Brook wrote:
> It looks like what you're actually doing is pushing the bounce buffer 
> allocation into the individual packet consumers.
>
> Maybe a solution to this is a 'do IO on IOVEC' actor, with an additional flag 
> that says whether it is acceptable to split the allocation. That way both 
> block and packet interfaces use the same API, and avoids proliferation of 
> manual bounce buffers in packet devices.
>   

I think there may be utility in having packet devices provide the bounce 
buffers, in which case, you could probably unify both into a single 
function with a flag.  But why not just have two separate functions?

Those two functions can live in exec.c too.  The nice thing about using 
map() is that it's easily overridden and chained.  So here's what I'm proposing:

cpu_physical_memory_map()
cpu_physical_memory_unmap()
do_streaming_IO(map, unmap, ioworker, opaque);
do_packet_IO(map, unmap, buffer, size, ioworker, opaque);

Additional helpers that pass c_p_m_map/unmap would also be acceptable.

Regards,

Anthony Liguori

> Paul
>   

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 16:33           ` Anthony Liguori
@ 2009-01-19 16:39             ` Avi Kivity
  2009-01-19 19:15               ` Anthony Liguori
  0 siblings, 1 reply; 64+ messages in thread
From: Avi Kivity @ 2009-01-19 16:39 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Ian Jackson, Paul Brook, qemu-devel

Anthony Liguori wrote:
> Paul Brook wrote:
>> It looks like what you're actually doing is pushing the bounce buffer 
>> allocation into the individual packet consumers.
>>
>> Maybe a solution to this is a 'do IO on IOVEC' actor, with an 
>> additional flag that says whether it is acceptable to split the 
>> allocation. That way both block and packet interfaces use the same 
>> API, and avoids proliferation of manual bounce buffers in packet 
>> devices.
>>   
>
> I think there may be utility in having packet devices provide the 
> bounce buffers, in which case, you could probably unique both into a 
> single function with a flag.  But why not just have two separate 
> functions?
>
> Those two functions can live in exec.c too.  The nice thing about 
> using map() is that it's easily overriden and chained.  So what I'm 
> proposing.
>
> cpu_physical_memory_map()
> cpu_physical_memory_unmap()

This should be the baseline API with the rest using it.

> do_streaming_IO(map, unmap, ioworker, opaque);

Why pass map and unmap?

Grant-based devices needn't go through this at all, since you never mix 
grants and physical addresses, and since grants never need bouncing.

> do_packet_IO(map, unmap, buffer, size, ioworker, opaque);

If you pass the buffer then the device needs to allocate large amounts 
of bounce memory.


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 14:54     ` Avi Kivity
  2009-01-19 15:39       ` Anthony Liguori
@ 2009-01-19 16:40       ` Ian Jackson
  2009-01-19 17:28         ` Avi Kivity
  1 sibling, 1 reply; 64+ messages in thread
From: Ian Jackson @ 2009-01-19 16:40 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity writes ("Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API"):
> Ian Jackson wrote:
> > So it is important to us that this interface is [nice]
>
> Sure.

Thanks :-).

> Correct.  If you need to perform read-modify-write, you need to use 
> cpu_physical_memory_rw(), twice.  If we ever want to support RMW, we'll 
> need to add another value for is_write.  I don't think we have 
> interesting devices at this point which require efficient RMW.

Efficient read-modify-write may be very hard for some setups to
achieve.  It can't be done with the bounce buffer implementation.
I think one good rule of thumb would be to make sure that the interface
as specified can be implemented in terms of cpu_physical_memory_rw.

> Right.  Moreover, the access sizes will be swallowed by the API; devices 
> are allowed to provide different implementations for different access sizes.

Yes, that's another important fact about the API.

> > What happens if the caller maps with is_write==1, writes part of the
> > buffer, and then unmaps ?  In your implementation undefined data is
> > written to guest memory.  I think that since not all callers can
> > guarantee to wish to write to the whole region, this means that the
> > interface is defective.  Callers need to be required to specify which
> > areas they have written.
...
> Alternatively, only use this interface with devices where this doesn't 
> matter.  Given that bouncing happens for mmio only, this would be all 
> devices which you'd want to use this interface with anyway.

That would be one alternative but isn't it the case that (for example)
with a partial DMA completion, the guest can assume that the
supposedly-untouched parts of the DMA target memory actually remain
untouched rather than (say) zeroed ?

In a system where we're trying to do zero copy, we may issue the map
request for a large transfer, before we know how much the host kernel
will actually provide.

> (I'm assuming that you'll implement the fastpath by directly mapping 
> guest memory, not bouncing).

Yes.  We can do that in Xen too but it's less of a priority for us
given that we expect people who really care about performance to
install PV drivers in the guest.

> A variant of this API (posted by Andrea) hid all of the scheduling away 
> within the implementation.

I remember seeing this before but I don't think your previous one
provided a callback for map completion ?  I thought it just blocked
the caller until the map could complete.  That's obviously not ideal.

> > This function should return a separate handle as well as the physical
> > memory pointer.  That will make it much easier to provide an
> > implementation which permits multiple bounce buffers or multiple
> > mappings simultaneously.
> 
> The downside to a separate handle is that device emulation code will now 
> need to maintain the handle in addition to the virtual address.  
> Since the addresses will typically be maintained in an iovec, this means 
> another array to be allocated and resized.

Err, no, I don't really see that.  In my proposal the `handle' is
actually allocated by the caller.  The implementation provides the
private data and that can be empty.  There is no additional memory
allocation.

> The design goals here were to keep things as simple as possible for the 
> fast path.  Since the API fits all high-bandwidth devices that I know 
> of, I don't think it's a good tradeoff to make the API more complex in 
> order to be applicable to some corner cases.

I think my question about partial DMA writes is very relevant.  If we
don't care about that, nor about the corresponding notification for
reads, then the API can be a lot simpler.

Ian.


* Re: [Qemu-devel] [PATCH 0/5] Direct memory access for devices
  2009-01-18 19:53 [Qemu-devel] [PATCH 0/5] Direct memory access for devices Avi Kivity
                   ` (4 preceding siblings ...)
  2009-01-18 19:53 ` [Qemu-devel] [PATCH 5/5] Convert IDE to directly access guest memory Avi Kivity
@ 2009-01-19 16:50 ` Blue Swirl
  5 siblings, 0 replies; 64+ messages in thread
From: Blue Swirl @ 2009-01-19 16:50 UTC (permalink / raw)
  To: qemu-devel

On 1/18/09, Avi Kivity <avi@redhat.com> wrote:
> One of the deficiencies of the current device layer is that it can only access
>  guest RAM via cpu_physical_memory_rw().  This means that the device emulation
>  code must copy the memory to or from a temporary buffer, even though the host
>  offers APIs which allow direct access to memory.  This reduces efficiency on
>  DMA capable devices, especially disks.
>
>  This patchset introduces a complement to the read/write API,
>  cpu_physical_memory_map() which allows device emulation code to map
>  guest memory directly.  The API bounces memory regions which cannot be
>  mapped (such as mmio regions) using an internal buffer.
>
>  As an example, IDE emulation is converted to use the new API.  This exposes
>  another deficiency: lack of scatter/gather support in the block layer.  To
>  work around this, a vectored block API is introduced, currently emulated
>  by bouncing.  Additional work is needed to convert all block format drivers
>  to use the vectored API.

Great!

I'll try to use the API for ESP and Lance on Sparc32, that may reveal
if there are any problems in the design.


* Re: [Qemu-devel] [PATCH 3/5] Vectored block device API
  2009-01-18 19:53 ` [Qemu-devel] [PATCH 3/5] Vectored block device API Avi Kivity
@ 2009-01-19 16:54   ` Blue Swirl
  2009-01-19 17:19     ` Avi Kivity
  0 siblings, 1 reply; 64+ messages in thread
From: Blue Swirl @ 2009-01-19 16:54 UTC (permalink / raw)
  To: qemu-devel

On 1/18/09, Avi Kivity <avi@redhat.com> wrote:
> Most devices that are capable of DMA are also capable of scatter-gather.
>  With the memory mapping API, this means that the device code needs to be
>  able to access discontiguous host memory regions.
>
>  For block devices, this translates to vectored I/O.  This patch implements
>  an asynchronous vectored interface for the qemu block devices.  At the moment
>  all I/O is bounced and submitted through the non-vectored API; in the future
>  we will convert block devices to natively support vectored I/O wherever
>  possible.
>
>  Signed-off-by: Avi Kivity <avi@redhat.com>

>  +static void flatten_iovec(VectorTranslationState *s)
>  +{
>  +    uint8_t *p = s->bounce;
>  +    int i;
>  +
>  +    for (i = 0; i < s->niov; ++i) {
>  +        memcpy(p, s->iov[i].iov_base, s->iov[i].iov_len);
>  +        p += s->iov[i].iov_len;
>  +    }
>  +}
>  +
>  +static void unflatten_iovec(VectorTranslationState *s)
>  +{
>  +    uint8_t *p = s->bounce;
>  +    int i;
>  +
>  +    for (i = 0; i < s->niov; ++i) {
>  +        memcpy(s->iov[i].iov_base, p, s->iov[i].iov_len);
>  +        p += s->iov[i].iov_len;
>  +    }
>  +}

I think these should be moved to vl.c together with patch 4 stuff,
they are not block device specific.


* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 15:39       ` Anthony Liguori
  2009-01-19 16:18         ` Paul Brook
@ 2009-01-19 16:57         ` Ian Jackson
  2009-01-19 19:23           ` Anthony Liguori
  1 sibling, 1 reply; 64+ messages in thread
From: Ian Jackson @ 2009-01-19 16:57 UTC (permalink / raw)
  To: qemu-devel

Anthony Liguori writes ("Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API"):
> The packet IO API is a bit different.  It looks like:

The purpose here is to be able to make only one system call to the
host kernel in order to do an operation which involves a scatter
gather list in guest physical memory (as provided by the guest to eg
an emulated DMA controller) ?

And the idea is to try to map as much of that contiguously as
possible so that only one system call need be made ?

> while (offset < size) {
>   data = map(offset, &len)
>   if (data == NULL)
>      break;
>   sg[n_sg].iov_base = data;
>   sg[n_sg].iov_len = len;
>   n_sg++;
>   offset += len;
> }
...
> if (offset < size) {
>   for (i = 0; i < n_sg; i++)
>      unmap(sg[i].iov_base);
>    sg[0].iov_base = alloc_buffer(size);
>    sg[0].iov_len = size;
>    cpu_physical_memory_rw(addr, sg[0].iov_base, size, is_write);
> }
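[Editorial sketch: the idiom Anthony quotes above, in compilable form. A flat
array stands in for guest RAM, a NULL return from the map models an unmappable
(mmio) region, and all names here are invented for illustration; real code
would call cpu_physical_memory_map().]

```c
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <sys/uio.h>   /* struct iovec */

#define RAM_SIZE   65536
#define MMIO_START 32768   /* addresses from here on refuse to map */
#define CHUNK      4096    /* mapping granularity, e.g. one page */

static uint8_t fake_ram[RAM_SIZE];

/* Returns a direct pointer for RAM, NULL for the mmio region; may
 * shorten *len, as the real map API is allowed to do. */
static void *toy_map(uint64_t addr, size_t *len)
{
    if (addr >= MMIO_START) {
        return NULL;
    }
    if (*len > CHUNK) {
        *len = CHUNK;
    }
    if (addr + *len > MMIO_START) {
        *len = MMIO_START - addr;
    }
    return fake_ram + addr;
}

/* Build an sg list by mapping as much as possible; on any failure,
 * collapse to a single bounce entry the caller fills via the copying API. */
static int build_sg(uint64_t addr, size_t size, struct iovec *sg, int max_sg)
{
    size_t offset = 0;
    int n_sg = 0;

    while (offset < size && n_sg < max_sg) {
        size_t len = size - offset;
        void *p = toy_map(addr + offset, &len);
        if (!p) {
            break;
        }
        sg[n_sg].iov_base = p;
        sg[n_sg].iov_len = len;
        n_sg++;
        offset += len;
    }
    if (offset < size) {            /* partial map: fall back to bouncing */
        sg[0].iov_base = malloc(size);
        sg[0].iov_len = size;
        n_sg = 1;
    }
    return n_sg;
}
```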

Why is it necessary for there only to be one of these

> do IO on (sg)

calls ?

Are there supposed to be fast-path devices where we absolutely must
make the host system call for the whole transfer in one go, in one
contiguous memory region ?

Obviously for the fast path to be actually fast the whole mapping must
succeed as requested, but this could be achieved by an interface where
the mapping caller provided a scatter gather list in guest memory:

  typedef struct {
    target_phys_addr_t addr, len;
  } CpuPhysicalMemoryMappingEntry;

  typedef struct {
    /* to be filled in by caller before calling _map: */
    unsigned flags;
    const CpuPhysicalMemoryMappingEntry *sg_list;
    target_phys_addr_t total_len; /* gives sg_list length; updated by _map */
    /* filled in by _map: */
    void *buffer;
    /* private to _map et al: */
  } CpuPhysicalMemoryMapping;

  void cpu_physical_memory_map(CpuPhysicalMemoryMapping *mapping,
      CpuPhysicalMemoryMapCallback *cb, void *cb_arg);
    /* There may be a limit on the number of concurrent maps
     * and the limit may be as low as one. */

If a caller doesn't want to deal with a complete sg_list then they do
  typedef struct {
    ...
    CpuPhysicalMemoryMapping dma_mapping;
    CpuPhysicalMemoryMappingEntry dma_request;
    ...
  } SomePCIDevice;
and do
  ourself->dma_mapping.sg_list = &ourself->dma_request;
  ourself->dma_mapping.total_len = ourself->dma_request.len;

> So this is why I prefer the map() API, as it accommodates two distinct 
> users in a way that the callback API wouldn't.  We can formalize these 
> idioms into an API, of course.

I don't think there is any fundamental difference between a callback
API and a polling API; you can implement whatever semantics you like
with either.

But callbacks are needed in at least some cases because of the way
that the bounce buffer may need to be reserved/released.  That means
all of the callers have to deal with callbacks anyway.

So it makes sense to make that the only code path.  That way callers
only need to be written once.

> BTW, to support this model, we have to reserve at least one bounce 
> buffer for cpu_physical_memory_rw.

Yes.

Ian.


* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 15:57           ` Gerd Hoffmann
  2009-01-19 16:25             ` Avi Kivity
@ 2009-01-19 17:08             ` Ian Jackson
  2009-01-19 17:16               ` Avi Kivity
  1 sibling, 1 reply; 64+ messages in thread
From: Ian Jackson @ 2009-01-19 17:08 UTC (permalink / raw)
  To: qemu-devel

Gerd Hoffmann writes ("Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API"):
> Avi Kivity wrote:
> [Avi:]
>>> Grant handles should be handled by a different API.
>> 
>> That was terse... here's how I think it should be done:
>> - step 1: map the grant handle into guest physical memory

I agree that grant handles should be handled by a different API.  But
I don't see why they need to be mapped into guest physical memory, or
referred to via a target_phys_addr_t at all.

The only consumers of grant references in qemu are Xen-specific
backend drivers, and they can call the Xen-specific grant reference
APIs directly.  There is no need for anyone but those backend drivers
to access those pages of guest memory.

So I don't see why grant references are relevant to the
cpu_physical_memory_map API at all.

Ian.


* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 17:08             ` Ian Jackson
@ 2009-01-19 17:16               ` Avi Kivity
  0 siblings, 0 replies; 64+ messages in thread
From: Avi Kivity @ 2009-01-19 17:16 UTC (permalink / raw)
  To: qemu-devel

Ian Jackson wrote:
> Gerd Hoffmann writes ("Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API"):
>   
>> Avi Kivity wrote:
>> [Avi:]
>>     
>>>> Grant handles should be handled by a different API.
>>>>         
>>> That was terse... here's how I think it should be done:
>>> - step 1: map the grant handle into guest physical memory
>>>       
>
> I agree that grant handles should be handled by a different API.  But
> I don't see why they need to be mapped into guest physical memory, or
> referred to via a target_phys_addr_t at all.
>
> The only consumers of grant references in qemu are Xen-specific
> backend drivers, and they can call the Xen-specific grant reference
> APIs directly.  There is no need for anyone but those backend drivers
> to access those pages of guest memory.
>
> So I don't see why grant references are relevant to the
> cpu_physical_memory_map API at all.
>   

Thinko on my part.  Wasn't a good suggestion.  Side-by-side APIs seem best.

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [PATCH 3/5] Vectored block device API
  2009-01-19 16:54   ` Blue Swirl
@ 2009-01-19 17:19     ` Avi Kivity
  0 siblings, 0 replies; 64+ messages in thread
From: Avi Kivity @ 2009-01-19 17:19 UTC (permalink / raw)
  To: qemu-devel

Blue Swirl wrote:
>>  +static void flatten_iovec(VectorTranslationState *s)
>>  +static void unflatten_iovec(VectorTranslationState *s)
>>     
> I think these should be moved to vl.c together with patch 4 stuff,
> they are not block device specific.
>   

Agree.  Further, they should accept a QEMUIOVector (as should the bdrvv 
API).
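[Editorial sketch: the helpers taking a QEMUIOVector, as suggested, might
look like this. The struct layout below is a minimal assumed stand-in, not
the real definition, and the function names are invented.]

```c
#include <string.h>
#include <stdint.h>
#include <stddef.h>
#include <sys/uio.h>   /* struct iovec */

/* Assumed minimal stand-in for QEMUIOVector; field names illustrative. */
typedef struct QEMUIOVector {
    struct iovec *iov;
    int niov;
    size_t size;       /* sum of iov_len over all elements */
} QEMUIOVector;

/* Gather the scattered segments into one contiguous bounce buffer. */
static void qemu_iovec_to_buffer(const QEMUIOVector *qiov, uint8_t *buf)
{
    int i;

    for (i = 0; i < qiov->niov; i++) {
        memcpy(buf, qiov->iov[i].iov_base, qiov->iov[i].iov_len);
        buf += qiov->iov[i].iov_len;
    }
}

/* Scatter a contiguous bounce buffer back into the segments. */
static void qemu_iovec_from_buffer(QEMUIOVector *qiov, const uint8_t *buf)
{
    int i;

    for (i = 0; i < qiov->niov; i++) {
        memcpy(qiov->iov[i].iov_base, buf, qiov->iov[i].iov_len);
        buf += qiov->iov[i].iov_len;
    }
}
```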

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 16:40       ` Ian Jackson
@ 2009-01-19 17:28         ` Avi Kivity
  2009-01-19 17:53           ` Ian Jackson
                             ` (2 more replies)
  0 siblings, 3 replies; 64+ messages in thread
From: Avi Kivity @ 2009-01-19 17:28 UTC (permalink / raw)
  To: qemu-devel

Ian Jackson wrote:
>> Correct.  If you need to perform read-modify-write, you need to use 
>> cpu_physical_memory_rw(), twice.  If we ever want to support RMW, we'll 
>> need to add another value for is_write.  I don't think we have 
>> interesting devices at this point which require efficient RMW.
>>     
>
> Efficient read-modify-write may be very hard for some setups to
> achieve.  It can't be done with the bounce buffer implementation.
> I think one good rule of thumb would be to make sure that the interface
> as specified can be implemented in terms of cpu_physical_memory_rw.
>   

What is the motivation for efficient rmw?

>> Alternatively, only use this interface with devices where this doesn't 
>> matter.  Given that bouncing happens for mmio only, this would be all 
>> devices which you'd want to use this interface with anyway.
>>     
>
> That would be one alternative but isn't it the case that (for example)
> with a partial DMA completion, the guest can assume that the
> supposedly-untouched parts of the DMA target memory actually remain
> untouched rather than (say) zeroed ?
>   

For block devices, I don't think it can.  In any case, this will only 
occur with mmio.  I don't think the guest can assume much in such cases.

In fact, we could even say that the virtual hardware doesn't support 
dma-to-mmio at all and MCE the guest.  I'm sure no x86 guest would even 
notice.  Don't know about non-x86.

> In a system where we're trying to do zero copy, we may issue the map
> request for a large transfer, before we know how much the host kernel
> will actually provide.
>
>   

Won't it be at least 1GB?  Partition your requests to that size.

>> (I'm assuming that you'll implement the fastpath by directly mapping 
>> guest memory, not bouncing).
>>     
>
> Yes.  We can do that in Xen too but it's less of a priority for us
> given that we expect people who really care about performance to
> install PV drivers in the guest.
>   

I'm all in favor of accommodating Xen, but as long as you're out-of-tree 
you need to conform to qemu, not the other way around.

>> A variant of this API (posted by Andrea) hid all of the scheduling away 
>> within the implementation.
>>     
>
> I remember seeing this before but I don't think your previous one
> provided a callback for map completion ?  I thought it just blocked
> the caller until the map could complete.  That's obviously not ideal.
>   

It didn't block, it scheduled.

>>> This function should return a separate handle as well as the physical
>>> memory pointer.  That will make it much easier to provide an
>>> implementation which permits multiple bounce buffers or multiple
>>> mappings simultaneously.
>>>       
>> The downside to a separate handle is that device emulation code will now 
>> need to maintain the handle in addition to the virtual address.  
>> Since the addresses will typically be maintained in an iovec, this means 
>> another array to be allocated and resized.
>>     
>
> Err, no, I don't really see that.  In my proposal the `handle' is
> actually allocated by the caller.  The implementation provides the
> private data and that can be empty.  There is no additional memory
> allocation.
>   

You need to store multiple handles (one per sg element), so you need to 
allocate a variable size vector for it.  Preallocation may be possible 
but perhaps wasteful.

>   
>> The design goals here were to keep things as simple as possible for the 
>> fast path.  Since the API fits all high-bandwidth devices that I know 
>> of, I don't think it's a good tradeoff to make the API more complex in 
>> order to be applicable to some corner cases.
>>     
>
> I think my question about partial DMA writes is very relevant.  If we
> don't care about that, nor about the corresponding notification for
> reads, then the API can be a lot simpler.

I don't see a concrete reason to care about it.

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 17:28         ` Avi Kivity
@ 2009-01-19 17:53           ` Ian Jackson
  2009-01-19 18:29             ` Avi Kivity
  2009-01-19 18:25           ` Jamie Lokier
  2009-01-21 12:06           ` [Qemu-devel] " Mike Day
  2 siblings, 1 reply; 64+ messages in thread
From: Ian Jackson @ 2009-01-19 17:53 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity writes ("Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API"):
> Ian Jackson wrote:
> > Efficient read-modify-write may be very hard for some setups to
> > achieve.  It can't be done with the bounce buffer implementation.
> > I think ond good rule of thumb would be to make sure that the interface
> > as specified can be implemented in terms of cpu_physical_memory_rw.
> 
> What is the motivation for efficient rmw?

I think you've misunderstood me.  I don't think there is such a
motivation.  I was saying it was so difficult to implement that we
might as well exclude it.

But whatever we do, the map interface we specify should be
implementable in terms of cpu_physical_memory_rw.  Ie, it should be
possible to implement the map interface with a bounce buffer.  Which
your proposal did except for the partial write case.


> > That would be one alternative but isn't it the case that (for example)
> > with a partial DMA completion, the guest can assume that the
> > supposedly-untouched parts of the DMA target memory actually remain
> > untouched rather than (say) zeroed ?
> 
> For block devices, I don't think it can.

`Block devices' ?  We're talking about (say) IDE controllers here.  I
would be very surprised if an IDE controller used DMA to overwrite RAM
beyond the amount of successful transfer.

If a Unix variant does zero copy IO using DMA direct into process
memory space, then it must even rely on the IDE controller not doing
DMA beyond the end of the successful transfer, as the read(2) API
promises to the calling process that data beyond the successful read
is left untouched.

And even if the IDE spec happily says that the (IDE) host (ie our
guest) is not allowed to assume that that memory (ie the memory beyond
the extent of the successful part of a partially successful transfer)
is unchanged, there will almost certainly be some other IO device on
some some platform that will make that promise.

So we need a call into the DMA API from the device model to say which
regions have actually been touched.

> > In a system where we're trying to do zero copy, we may issue the map
> > request for a large transfer, before we know how much the host kernel
> > will actually provide.
> 
> Won't it be at least 1GB?  Partition your requests to that size.

No, I mean, before we know how much data qemu's read(2) will transfer.


>  In any case, this will only occur with mmio.  I don't think the
> guest can assume much in such cases.

No, it won't only occur with mmio.

In the initial implementation in Xen, we will almost certainly simply
emulate everything with cpu_physical_memory_rw.  So it will happen all
the time.


> > Err, no, I don't really see that.  In my proposal the `handle' is
> > actually allocated by the caller.  The implementation provides the
> > private data and that can be empty.  There is no additional memory
> > allocation.
> 
> You need to store multiple handles (one per sg element), so you need to 
> allocate a variable size vector for it.  Preallocation may be possible 
> but perhaps wasteful.

See my reply to Anthony Liguori, which shows how this can be avoided.
Since you hope for a single call to map everything, you can do an sg
list with a single handle.


Ian.


* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 17:28         ` Avi Kivity
  2009-01-19 17:53           ` Ian Jackson
@ 2009-01-19 18:25           ` Jamie Lokier
  2009-01-19 18:43             ` Avi Kivity
  2009-01-20 14:44             ` Ian Jackson
  2009-01-21 12:06           ` [Qemu-devel] " Mike Day
  2 siblings, 2 replies; 64+ messages in thread
From: Jamie Lokier @ 2009-01-19 18:25 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity wrote:
> In fact, we could even say that the virtual hardware doesn't support 
> dma-to-mmio at all and MCE the guest.  I'm sure no x86 guest would even 
> notice.  Don't know about non-x86.

Guest userspace does:

   1. mmap() framebuffer device.
   2. read() from file opened with O_DIRECT.

Both are allowed by non-root processes on Linux.

(I imagine this might be more common in some obscure DOS programs though).

Think also variation with reading from a video capture device into
video memory.  I've seen that done on x86, never seen it (yet)
on non-x86 :-)

However, that is known to break on some PCI bridges.

I'm not sure if it's reasonable to abort emulation with an MCE in this
case.

> >I think my question about partial DMA writes is very relevant.  If we
> >don't care about that, nor about the corresponding notification for
> >reads, then the API can be a lot simpler.
> 
> I don't see a concrete reason to care about it.

Writing zeros or junk after a partial DMA is quite different to real
hardware behaviour.  Virtually all devices with a "DMA count"
register are certain to have not written to a later address when DMA stops.

QEMU tries to do a fairly good job at emulating devices with many of
their quirks.  It would be odd if the high-performance API got in the
way of high-quality device emulation, when that's wanted.

Potential example: If a graphics card or video capture card, or USB
webcam etc. (more likely!) is doing a large streaming DMA into a
guests's userspace process when that process calls read() (in the
guest OS), and the DMA is stopped for any reason, such as triggered by
a guest OS SIGINT or simply the data having ended, the guest's
userspace can reasonably assume data after the count returned by
read() is untouched.

Just as importantly, the guest OS in that example can assume that the
later pages are not dirtied, therefore not swap them, or return them
to its pre-zero pool or whatever.  This is a legitimate guest OS
optimisation for streaming-DMA-with-unknown-length devices.  This can
happen without a userspace process too.

I'm guessing truncated DMAs using this API are always going to dirty
only an initial part of the buffer, not arbitrary regions.  (In rare
cases where this isn't true, don't use the API).

So wouldn't it be trivial to pass "amount written" to the unmap
function - to be used in the bounce buffer case?
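[Editorial sketch: Jamie's suggestion, as a toy model. Guest RAM is a plain
byte array, every mapping is bounced, and the bounce_* names are invented.
The point: unmap takes the number of bytes actually written and copies back
only those, so the tail of a partial DMA stays untouched in guest memory.]

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define GUEST_RAM_SIZE 4096
static uint8_t guest_ram[GUEST_RAM_SIZE];

typedef struct BounceMap {
    uint64_t addr;      /* guest-physical address being mapped */
    uint8_t *buffer;    /* bounce buffer handed to the device */
    int is_write;       /* 1 if the device will write to guest memory */
} BounceMap;

/* Map a guest region; in this toy model every mapping is bounced. */
static uint8_t *bounce_map(BounceMap *m, uint64_t addr, size_t len,
                           int is_write)
{
    m->addr = addr;
    m->is_write = is_write;
    m->buffer = malloc(len);
    if (!is_write) {
        /* Device will read guest memory, so fill the bounce buffer now. */
        memcpy(m->buffer, guest_ram + addr, len);
    }
    return m->buffer;
}

/* Unmap, copying back only the bytes the device actually wrote. */
static void bounce_unmap(BounceMap *m, size_t access_len)
{
    if (m->is_write) {
        memcpy(guest_ram + m->addr, m->buffer, access_len);
    }
    free(m->buffer);
    m->buffer = NULL;
}
```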

-- Jamie


* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 17:53           ` Ian Jackson
@ 2009-01-19 18:29             ` Avi Kivity
  2009-01-20 14:32               ` Ian Jackson
  0 siblings, 1 reply; 64+ messages in thread
From: Avi Kivity @ 2009-01-19 18:29 UTC (permalink / raw)
  To: qemu-devel

Ian Jackson wrote:
>>> Efficient read-modify-write may be very hard for some setups to
>>> achieve.  It can't be done with the bounce buffer implementation.
>>> I think one good rule of thumb would be to make sure that the interface
>>> as specified can be implemented in terms of cpu_physical_memory_rw.
>>>       
>> What is the motivation for efficient rmw?
>>     
>
> I think you've misunderstood me.  I don't think there is such a
> motivation.  I was saying it was so difficult to implement that we
> might as well exclude it.
>   

Then we agree. The map API is for read OR write operations, not both at 
the same time.

>   
>>> That would be one alternative but isn't it the case that (for example)
>>> with a partial DMA completion, the guest can assume that the
>>> supposedly-untouched parts of the DMA target memory actually remain
>>> untouched rather than (say) zeroed ?
>>>       
>> For block devices, I don't think it can.
>>     
>
> `Block devices' ?  We're talking about (say) IDE controllers here.  I
> would be very surprised if an IDE controller used DMA to overwrite RAM
> beyond the amount of successful transfer.
>
> If a Unix variant does zero copy IO using DMA direct into process
> memory space, then it must even rely on the IDE controller not doing
> DMA beyond the end of the successful transfer, as the read(2) API
> promises to the calling process that data beyond the successful read
> is left untouched.
>
> And even if the IDE spec happily says that the (IDE) host (ie our
> guest) is not allowed to assume that that memory (ie the memory beyond
> the extent of the successful part of a partially successful transfer)
> is unchanged, there will almost certainly be some other IO device on
> some platform that will make that promise.
>
> So we need a call into the DMA API from the device model to say which
> regions have actually been touched.
>
>   

It's not possible to implement this efficiently. The qemu block layer 
will submit the results of the map operation to the kernel in an async 
zero copy operation. The kernel may break up this operation into several 
parts (if the underlying backing store is fragmented) and submit in 
parallel to the underlying device(s). Those requests will complete 
out-of-order, so you can't guarantee that if an error occurs all memory 
before will have been written and none after.

I really doubt that any guest will be affected by this. It's a tradeoff 
between decent performance and needlessly accurate emulation. I don't 
see how we can choose the latter.

>>> In a system where we're trying to do zero copy, we may issue the map
>>> request for a large transfer, before we know how much the host kernel
>>> will actually provide.
>>>       
>> Won't it be at least 1GB?  Partition your requests to that size.
>>     
>
> No, I mean, before we know how much data qemu's read(2) will transfer.
>   

You don't know afterwards either.  Maybe read() is specced as you say, 
but a practical implementation will return a minimum for the bytes read, not 
an exact count.

Think software RAID.

>>  In any case, this will only occur with mmio.  I don't think the
>> guest can assume much in such cases.
>>     
>
> No, it won't only occur with mmio.
>
> In the initial implementation in Xen, we will almost certainly simply
> emulate everything with cpu_physical_memory_rw.  So it will happen all
> the time.
>   

Try it out. I'm sure it will work just fine (if incredibly slowly, 
unless you provide multiple bounce buffers).

>>> Err, no, I don't really see that.  In my proposal the `handle' is
>>> actually allocated by the caller.  The implementation provides the
>>> private data and that can be empty.  There is no additional memory
>>> allocation.
>>>       
>> You need to store multiple handles (one per sg element), so you need to 
>> allocate a variable size vector for it.  Preallocation may be possible 
>> but perhaps wasteful.
>>     
>
> See my reply to Anthony Liguori, which shows how this can be avoided.
> Since you hope for a single call to map everything, you can do an sg
> list with a single handle.
>   

That's a very different API.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 18:25           ` Jamie Lokier
@ 2009-01-19 18:43             ` Avi Kivity
  2009-01-20 14:49               ` Ian Jackson
  2009-01-20 14:44             ` Ian Jackson
  1 sibling, 1 reply; 64+ messages in thread
From: Avi Kivity @ 2009-01-19 18:43 UTC (permalink / raw)
  To: qemu-devel

Jamie Lokier wrote:
> Avi Kivity wrote:
>   
>> In fact, we could even say that the virtual hardware doesn't support 
>> dma-to-mmio at all and MCE the guest.  I'm sure no x86 guest would even 
>> notice.  Don't know about non-x86.
>>     
>
> Guest userspace does:
>
>    1. mmap() framebuffer device.
>    2. read() from file opened with O_DIRECT.
>
> Both are allowed by non-root processes on Linux.
>
> (I imagine this might be more common in some obscure DOS programs though).
>
> Think also variation with reading from a video capture device into
> video memory.  I've seen that done on x86, never seen it (yet)
> on non-x86 :-)
>
> However, that is known to break on some PCI bridges.
>
> I'm not sure if it's reasonable to abort emulation with an MCE in this
> case.
>
>   

Framebuffers are mapped as RAM, so we won't bounce this case. Try harder :)

>>> I think my question about partial DMA writes is very relevant.  If we
>>> don't care about that, nor about the corresponding notification for
>>> reads, then the API can be a lot simpler.
>>>       
>> I don't see a concrete reason to care about it.
>>     
>
> Writing zeros or junk after a partial DMA is quite different to real
> hardware behaviour.  Virtually all devices with a "DMA count"
> register are certain to have not written to a later address when DMA stops.
>
>   

The devices we're talking about here don't have a DMA count register. 
They are passed scatter-gather lists, and I don't think they make 
guarantees about the order in which they're accessed.

> QEMU tries to do a fairly good job at emulating devices with many of
> their quirks.  It would be odd if the high-performance API got in the
> way of high-quality device emulation, when that's wanted.
>
> Potential example: If a graphics card or video capture card, or USB
> webcam etc. (more likely!) is doing a large streaming DMA into a
> guest's userspace process when that process calls read() (in the
> guest OS), and the DMA is stopped for any reason, such as triggered by
> a guest OS SIGINT or simply the data having ended, the guest's
> userspace can reasonably assume data after the count returned by
> read() is untouched.
>   

This DMA will be into RAM, not mmio.

> Just as importantly, the guest OS in that example can assume that the
> later pages are not dirtied, therefore not swap them, or return them
> to its pre-zero pool or whatever.  This is a legitimate guest OS
> optimisation for streaming-DMA-with-unknown-length devices.  This can
> happen without a userspace process too.
>
> I'm guessing truncated DMAs using this API are always going to dirty
> only an initial part of the buffer, not arbitrary regions.  (In rare
> cases where this isn't true, don't use the API).
>
> So wouldn't it be trivial to pass "amount written" to the unmap
> function - to be used in the bounce buffer case?
>   

We don't have a reliable amount to pass.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 16:39             ` Avi Kivity
@ 2009-01-19 19:15               ` Anthony Liguori
  2009-01-20 10:09                 ` Avi Kivity
  0 siblings, 1 reply; 64+ messages in thread
From: Anthony Liguori @ 2009-01-19 19:15 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Ian Jackson, Paul Brook, qemu-devel

Avi Kivity wrote:
> Anthony Liguori wrote:
>> Paul Brook wrote:
>>> It looks like what you're actually doing is pushing the bounce 
>>> buffer allocation into the individual packet consumers.
>>>
>>> Maybe a solution to this is a 'do IO on IOVEC' actor, with an 
>>> additional flag that says whether it is acceptable to split the 
>>> allocation. That way both block and packet interfaces use the same 
>>> API, and avoids proliferation of manual bounce buffers in packet 
>>> devices.
>>>   
>>
>> I think there may be utility in having packet devices provide the 
>> bounce buffers, in which case, you could probably unify both into a 
>> single function with a flag.  But why not just have two separate 
>> functions?
>>
>> Those two functions can live in exec.c too.  The nice thing about 
>> using map() is that it's easily overridden and chained.  So here's what I'm 
>> proposing:
>>
>> cpu_physical_memory_map()
>> cpu_physical_memory_unmap()
>
> This should be the baseline API with the rest using it.

Yup.

>> do_streaming_IO(map, unmap, ioworker, opaque);
>
> Why pass map and unmap?

Because we'll eventually have:

pci_device_memory_map()
pci_device_memory_unmap()

In the simplest case, pci_device_memory_map() just calls 
cpu_physical_memory_map().  But it may do other things.

> grant based devices needn't go through this at all, since you never 
> mix grants and physical addresses, and since grants never need bouncing.

So the grant map/unmap function doesn't need to deal with calling 
cpu_physical_memory_map/unmap.  You could still use the above API or 
not.  It's hard to say.

>> do_packet_IO(map, unmap, buffer, size, ioworker, opaque);
>
> If you pass the buffer then the device needs to allocate large amounts 
> of bounce memory.

If do_packet_IO took a buffer, instead of calling alloc_buffer(size) 
when map fails (you run out of bounce memory), you simply use buffer.  
Otherwise, alloc_buffer() must be able to allocate enough memory to 
satisfy any request.

Since each packet device knows its maximum size up front, it makes 
sense for the device to allocate it.  You could also not care and just 
trust that callers do the right thing.
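
The idiom above could be sketched roughly as follows. All names here are illustrative, not the proposed QEMU API; guest RAM and map() are faked so the sketch is self-contained, with map() refusing requests over 16 bytes to simulate bounce-memory exhaustion:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy guest RAM; toy_map() refuses large requests to simulate
 * running out of bounce memory. */
static unsigned char guest_ram[64];

static void *toy_map(size_t addr, size_t *len)
{
    if (*len > 16)
        return NULL;               /* "out of bounce memory" */
    return guest_ram + addr;
}

static void toy_unmap(void *p, size_t len)
{
    (void)p; (void)len;            /* nothing to write back in this toy */
}

typedef void (*io_worker)(const void *buf, size_t len, void *opaque);

/* Demo worker: just record how many bytes it was handed. */
static size_t last_len;
static void count_worker(const void *buf, size_t len, void *opaque)
{
    (void)buf; (void)opaque;
    last_len = len;
}

static void do_packet_io(void *fallback_buf, size_t guest_addr, size_t size,
                         io_worker worker, void *opaque)
{
    size_t mapped = size;
    void *p = toy_map(guest_addr, &mapped);

    if (p) {
        worker(p, size, opaque);   /* zero-copy path */
        toy_unmap(p, size);
    } else {
        /* Fall back to the buffer the device allocated up front (it
         * knows its maximum packet size), so exhaustion means a copy,
         * never a wait for bounce memory that may not be freed. */
        memcpy(fallback_buf, guest_ram + guest_addr, size);
        worker(fallback_buf, size, opaque);
    }
}
```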

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 16:57         ` Ian Jackson
@ 2009-01-19 19:23           ` Anthony Liguori
  2009-01-20 10:17             ` Avi Kivity
  2009-01-20 14:18             ` Ian Jackson
  0 siblings, 2 replies; 64+ messages in thread
From: Anthony Liguori @ 2009-01-19 19:23 UTC (permalink / raw)
  To: qemu-devel

Ian Jackson wrote:
> Anthony Liguori writes ("Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API"):
>   
>> The packet IO API is a bit different.  It looks like:
>>     
>
> The purpose here is to be able to make only one system call to the
> host kernel in order to do an operation which involves a scatter
> gather list in guest physical memory (as provided by the guest to eg
> an emulated DMA controller) ?
>
> And the idea is to try to map as much of that contiguously as
> possible so that only one system call need be made ?
>   

No, it's not an issue of performance, it's an issue of correctness.  A 
single readv/writev system call to a tap file descriptor corresponds to 
a single packet.  You cannot split a packet into multiple read/write 
operations to tap IIUC.

> Are there supposed to be fast-path devices where we absolutely must
> make the host system call for the whole transfer in one go, in one
> contiguous memory region ?
>   

It's a matter of correctness.  Packet protocols (datagram protocols if 
you will) must preserve message boundaries.
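
The boundary requirement can be sketched like this: the guest's scatter/gather fragments are collected into one iovec and emitted with a single writev(), since one writev() on a tap fd is one packet. This is only an illustration (a pipe stands in for the tap fd in the test):

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one packet built from two guest fragments with a single writev(),
 * so the message boundary is preserved end to end. */
static ssize_t send_packet(int fd, const void *hdr, size_t hlen,
                           const void *payload, size_t plen)
{
    struct iovec iov[2] = {
        { (void *)hdr,     hlen },
        { (void *)payload, plen },
    };
    return writev(fd, iov, 2);     /* exactly one packet on a tap fd */
}
```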

>> So this is why I prefer the map() API, as it accommodates two distinct 
>> users in a way that the callback API wouldn't.  We can formalize these 
>> idioms into an API, of course.
>>     
>
> I don't think there is any fundamental difference between a callback
> API and a polling API; you can implement whatever semantics you like
> with either.
>
> But callbacks are needed in at least some cases because of the way
> that the bounce buffer may need to be reserved/released.  That means
> all of the callers have to deal with callbacks anyway.
>
> So it makes sense to make that the only code path.  That way callers
> only need to be written once.
>   

If you use callbacks, consider the following:

Scenario 1 (no callbacks)

1) Attempt to map all data in packet
2) Failure occurs b/c bounce buffer is too small to contain packet
3) We can't do zero-copy IO here, so fall back to a copy to a device 
supplied buffer
4) Transmit full packet

Scenario 2 (callbacks)

1) Register callback and begin data mapping
2) Succeed in mapping first part of packet and 1 page of bounced buffer
3) Bounce buffer is exhausted, callback will not be invoked until the 
next unmap
4) Deadlock

You could add some sort of call to determine if the bounce memory is 
exhausted and try to take some corrective action in #2 but it gets ugly.

So as I said earlier, I think the callback model makes sense when you 
have streaming data (where there are no packet boundaries).  I think 
it's pretty difficult to avoid deadlocking though with callbacks in a 
packet model (where boundaries must be preserved).

Regards,

Anthony Liguori

>> BTW, to support this model, we have to reserve at least one bounce 
>> buffer for cpu_physical_memory_rw.
>>     
>
> Yes.
>
> Ian.
>
>
>   

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 19:15               ` Anthony Liguori
@ 2009-01-20 10:09                 ` Avi Kivity
  0 siblings, 0 replies; 64+ messages in thread
From: Avi Kivity @ 2009-01-20 10:09 UTC (permalink / raw)
  To: qemu-devel; +Cc: Ian Jackson, Paul Brook

Anthony Liguori wrote:
>>> Those two functions can live in exec.c too.  The nice thing about 
>>> using map() is that it's easily overridden and chained.  So here's what I'm 
>>> proposing:
>>>
>>> cpu_physical_memory_map()
>>> cpu_physical_memory_unmap()
>>
>> This should be the baseline API with the rest using it.
>
> Yup.
>
>>> do_streaming_IO(map, unmap, ioworker, opaque);
>>
>> Why pass map and unmap?
>
> Because we'll eventually have:
>
> pci_device_memory_map()
> pci_device_memory_unmap()
>
> In the simplest case, pci_device_memory_map() just calls 
> cpu_physical_memory_map().  But it may do other things.

Let's start without those callbacks, until those "other things" reveal 
themselves.

>>> do_packet_IO(map, unmap, buffer, size, ioworker, opaque);
>>
>> If you pass the buffer then the device needs to allocate large 
>> amounts of bounce memory.
>
> If do_packet_IO took a buffer, instead of calling alloc_buffer(size) 
> when map fails (you run out of bounce memory), you simply use buffer.  
> Otherwise, alloc_buffer() must be able to allocate enough memory to 
> satisfy any request.
>
> Since each packet device knows its maximum size up front, it makes 
> sense for the device to allocate it.  You could also not care and just 
> trust that callers do the right thing.
>

Pushing things to the caller complicates all callers.  This isn't the 
ideal solution.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 19:23           ` Anthony Liguori
@ 2009-01-20 10:17             ` Avi Kivity
  2009-01-20 14:18             ` Ian Jackson
  1 sibling, 0 replies; 64+ messages in thread
From: Avi Kivity @ 2009-01-20 10:17 UTC (permalink / raw)
  To: qemu-devel

Anthony Liguori wrote:
>
> Scenario 2 (callbacks)
>
0) Register minimum required bounce memory at init time
> 1) Register callback and begin data mapping
> 2) Succeed in mapping first part of packet and 1 page of bounced buffer
> 3) Bounce buffer is exhausted, callback will not be invoked until the 
> next unmap
> 4) Deadlock
>
> You could add some sort of call to determine if the bounce memory is 
> exhausted and try to take some corrective action in #2 but it gets ugly.
>
> So as I said earlier, I think the callback model makes sense when you 
> have streaming data (where there are no packet boundaries).  I think 
> it's pretty difficult to avoid deadlocking though with callbacks in a 
> packet model (where boundaries must be preserved).
>

If you specify the boundaries (an additional input array, or maybe a 
function that says whether a particular point is a "good enough" 
boundary) then I think it can be workable.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 19:23           ` Anthony Liguori
  2009-01-20 10:17             ` Avi Kivity
@ 2009-01-20 14:18             ` Ian Jackson
  1 sibling, 0 replies; 64+ messages in thread
From: Ian Jackson @ 2009-01-20 14:18 UTC (permalink / raw)
  To: qemu-devel

Anthony Liguori writes ("Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API"):
> No, it's not an issue of performance, it's an issue of correctness.  A 
> single readv/writev system call to a tap file descriptor corresponds to 
> a single packet.  You cannot split a packet into multiple read/write 
> operations to tap IIUC.

Ah, right.  I still think that can be accommodated in a callback style.

As I said:
> > But callbacks are needed in at least some cases because of the way
> > that the bounce buffer may need to be reserved/released.  That means
> > all of the callers have to deal with callbacks anyway.
> >
> > So it makes sense to make that the only code path.  That way callers
> > only need to be written once.

You countered with:

> If you use callbacks, consider the following:
...
> Scenario 2 (callbacks)
> [problem description]

Yes, I see this problem but you are assuming _multiple_ callbacks.

Instead, we could just have a single callback for the entire mapping
attempt.  Then we're back to the sequence of events in your scenario
1; the only difference is the function call graph.

Ian.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 18:29             ` Avi Kivity
@ 2009-01-20 14:32               ` Ian Jackson
  2009-01-20 17:23                 ` Avi Kivity
  0 siblings, 1 reply; 64+ messages in thread
From: Ian Jackson @ 2009-01-20 14:32 UTC (permalink / raw)
  To: qemu-devel

I think the key points in Avi's message are this:

Avi Kivity writes:
> You don't know afterwards either. Maybe read() is specced as you
> say, but practical implementations will return the minimum bytes
> read, not exact.

And this:

> I really doubt that any guest will be affected by this. It's a tradeoff 
> between decent performance and needlessly accurate emulation. I don't 
> see how we can choose the latter.

I don't think this is the right way to analyse this situation.  We are
trying to define a general-purpose DMA API for _all_ emulated devices,
not just the IDE emulation and block devices that you seem to be
considering.

If there is ever any hardware which behaves `properly' with partial
DMA, and any host kernel device which can tell us what succeeded and
what failed, then it is necessary for the DMA API we are now inventing
to allow that device to be properly emulated.

Even if we can't come up with an example right now of such a device
then I would suggest that it's very likely that we will encounter one
eventually.  But actually I can think of one straight away: a SCSI
tapestreamer.  Tapestreamers often give partial transfers at the end
of tapefiles; hosts (ie, qemu guests) talking to the SCSI controller
do not expect the controller to DMA beyond the successful SCSI
transfer length; and the (qemu host's) kernel's read() call will not
overwrite beyond the successful transfer length either.

If it is difficult for a block device to provide the faithful
behaviour then it might be acceptable for the block device to always
indicate to the DMA API that the entire transfer had taken place, even
though actually some of it had failed.

But personally I think you're mistaken about the behaviour of the
(qemu host's) kernel's {aio_,p,}read(2).

> > In the initial implementation in Xen, we will almost certainly simply
> > emulate everything with cpu_physical_memory_rw.  So it will happen all
> > the time.
> 
> Try it out. I'm sure it will work just fine (if incredibly slowly, 
> unless you provide multiple bounce buffers).

It will certainly work except (a) there are partial (interrupted)
transfers and (b) the host relies on the partial DMA not overwriting
more data than it successfully transferred.  So what that means is
that if this introduces bugs they will be very difficult to find in
testing.  I don't think testing is the answer here.

Ian.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 18:25           ` Jamie Lokier
  2009-01-19 18:43             ` Avi Kivity
@ 2009-01-20 14:44             ` Ian Jackson
  1 sibling, 0 replies; 64+ messages in thread
From: Ian Jackson @ 2009-01-20 14:44 UTC (permalink / raw)
  To: qemu-devel

Jamie Lokier writes ("Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API"):
> So wouldn't it be trivial to pass "amount written" to the unmap
> function - to be used in the bounce buffer case?

I think that's probably the best plan.  It's certainly a lot simpler
than my most recent proposal.

It seems to me that:
 - there has to be a callback mechanism to schedule bounce buffers
 - there should be only one control flow
 - therefore the interface should always be callback-based
and:
 - as Anthony points out, we need to be able to set up for a single-
   packet scatter-gather IO
 - thus the interface should involve a single request and callback
   for a whole DMA scatter gather

This would seem to yield an API that looks roughly like this:

  #define CPUPHYSMEMMAPFLAG_READABLE    001
  #define CPUPHYSMEMMAPFLAG_WRITABLE    002

  #define CPUPHYSMEMMAP_MAX 131072
     /* Callers of the mapping API must only request mappings
      * totalling this amount or less.  Larger requests will
      * fail an assertion. */

  typedef struct {
    target_phys_addr_t addr, len;
  } CpuPhysicalMemoryMappingEntry;
     /* If requested addresses and lengths are not page-aligned,
      * performance will definitely be poor. */

  typedef struct {
   /* to be filled in by caller before calling _map: */
    unsigned flags;
    int sg_list_entries;
    CpuPhysicalMemoryMappingEntry *sg_list;
   /* filled in by _map: */
    void *buffer;
   /* private to _map et al: */
    ... some stuff here to actually implement ...
  } CpuPhysicalMemoryMapping;

  void cpu_physical_memory_map(CpuPhysicalMemoryMapping*,
      CpuPhysicalMemoryMapCallback *cb, void *cb_arg);
    /* There may be a limit on the number of concurrent maps
     * and the limit may be as low as one. */

  void cpu_physical_memory_unmap(CpuPhysicalMemoryMapping*,
      target_phys_addr_t transferred);
    /* For read mappings, transferred is ignored so 0 is OK. */
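
For illustration, a caller of this proposed API might look like the sketch below. The implementation here is a toy, synchronous stand-in (the "map" is a flat RAM array and the callback fires immediately); a real implementation might defer the callback until bounce memory becomes available:

```c
#include <assert.h>
#include <string.h>

typedef struct {
    unsigned long addr, len;
} CpuPhysicalMemoryMappingEntry;

typedef struct CpuPhysicalMemoryMapping CpuPhysicalMemoryMapping;
typedef void CpuPhysicalMemoryMapCallback(CpuPhysicalMemoryMapping *, void *);

struct CpuPhysicalMemoryMapping {
    unsigned flags;
    int sg_list_entries;
    CpuPhysicalMemoryMappingEntry *sg_list;
    void *buffer;
};

static unsigned char guest_ram[4096];

/* Toy map: the s/g list here is contiguous guest RAM, so "mapping" is a
 * pointer computation and the callback runs at once. */
static void cpu_physical_memory_map(CpuPhysicalMemoryMapping *m,
                                    CpuPhysicalMemoryMapCallback *cb,
                                    void *cb_arg)
{
    m->buffer = guest_ram + m->sg_list[0].addr;
    cb(m, cb_arg);
}

static void cpu_physical_memory_unmap(CpuPhysicalMemoryMapping *m,
                                      unsigned long transferred)
{
    (void)m; (void)transferred;    /* RAM case: nothing to copy back */
}

/* Device side: when the mapping is ready, do the "DMA" and report how
 * much was actually transferred on unmap. */
static void dma_ready(CpuPhysicalMemoryMapping *m, void *arg)
{
    memcpy(m->buffer, arg, 4);
    cpu_physical_memory_unmap(m, 4);
}
```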

Ian.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 18:43             ` Avi Kivity
@ 2009-01-20 14:49               ` Ian Jackson
  2009-01-20 17:42                 ` Avi Kivity
  0 siblings, 1 reply; 64+ messages in thread
From: Ian Jackson @ 2009-01-20 14:49 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity writes ("Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API"):
> The devices we're talking about here don't have a DMA count register. 

Which devices ?  All devices ever that want to do zero-copy DMA into
the guest ?

> They are passed scatter-gather lists, and I don't think they make 
> guarantees about the order in which they're accessed.

Many devices will be able to make such promises.

> This DMA will be into RAM, not mmio.

As previously discussed, we might be using bounce buffers even for
RAM, depending on the execution model.  You said earlier:

  Try it out. I'm sure it will work just fine (if incredibly slowly, 
  unless you provide multiple bounce buffers).

but here is an example from Jamie of a situation where it won't work
right.

> We don't have a reliable amount to pass.

A device which _really_ doesn't have a reliable amount to pass, and
which is entitled to scribble all over the RAM it was to DMA to even
if it does only a partial transfer, can simply pass the total transfer
length.  That would be no different to your proposal.

Ian.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-20 14:32               ` Ian Jackson
@ 2009-01-20 17:23                 ` Avi Kivity
  0 siblings, 0 replies; 64+ messages in thread
From: Avi Kivity @ 2009-01-20 17:23 UTC (permalink / raw)
  To: qemu-devel

Ian Jackson wrote:
> I think the key points in Avi's message are this:
>
> Avi Kivity writes:
>   
>> You don't know afterwards either. Maybe read() is specced as you
>> say, but practical implementations will return the minimum bytes
>> read, not exact.
>>     
>
> And this:
>
>   
>> I really doubt that any guest will be affected by this. It's a tradeoff 
>> between decent performance and needlessly accurate emulation. I don't 
>> see how we can choose the latter.
>>     
>
> I don't think this is the right way to analyse this situation.  We are
> trying to define a general-purpose DMA API for _all_ emulated devices,
> not just the IDE emulation and block devices that you seem to be
> considering.
>   

No.  There already exists a general API: cpu_physical_memory_rw().  We 
are trying to define an API which will allow the high-throughput devices 
(IDE, scsi, virtio-blk, virtio-net) to be implemented efficiently.

If device X does not work well with the API, then, unless it's important 
for some reason, it shouldn't use it.  If it is important, we can adapt 
the API then.

> If there is ever any hardware which behaves `properly' with partial
> DMA, and any host kernel device which can tell us what succeeded and
> what failed, then it is necessary for the DMA API we are now inventing
> to allow that device to be properly emulated.
>
> Even if we can't come up with an example right now of such a device
> then I would suggest that it's very likely that we will encounter one
> eventually.  But actually I can think of one straight away: a SCSI
> tapestreamer.  Tapestreamers often give partial transfers at the end
> of tapefiles; hosts (ie, qemu guests) talking to the SCSI controller
> do not expect the controller to DMA beyond the successful SCSI
> transfer length; and the (qemu host's) kernel's read() call will not
> overwrite beyond the successful transfer length either.
>   

That will work out fine as the DMA will be to kernel memory, and read() 
will copy just the interesting parts.

> If it is difficult for a block device to provide the faithful
> behaviour then it might be acceptable for the block device to always
> indicate to the DMA API that the entire transfer had taken place, even
> though actually some of it had failed.
>
> But personally I think you're mistaken about the behaviour of the
> (qemu host's) kernel's {aio_,p,}read(2).
>   

I'm pretty sure reads to software RAIDs will be submitted in parallel.  
If those reads are O_DIRECT, then it's impossible to maintain DMA ordering.

>>> In the initial implementation in Xen, we will almost certainly simply
>>> emulate everything with cpu_physical_memory_rw.  So it will happen all
>>> the time.
>>>       
>> Try it out. I'm sure it will work just fine (if incredibly slowly, 
>> unless you provide multiple bounce buffers).
>>     
>
> It will certainly work except (a) there are partial (interrupted)
> transfers and (b) the host relies on the partial DMA not overwriting
> more data than it successfully transferred.  So what that means is
> that if this introduces bugs they will be very difficult to find in
> testing.  I don't think testing is the answer here.
>   

The only workaround I can think of is not to DMA.  But that will be 
horribly slow.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-20 14:49               ` Ian Jackson
@ 2009-01-20 17:42                 ` Avi Kivity
  2009-01-20 18:08                   ` Jamie Lokier
  2009-01-21 16:50                   ` Ian Jackson
  0 siblings, 2 replies; 64+ messages in thread
From: Avi Kivity @ 2009-01-20 17:42 UTC (permalink / raw)
  To: qemu-devel

Ian Jackson wrote:
> Avi Kivity writes ("Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API"):
>   
>> The devices we're talking about here don't have a DMA count register. 
>>     
>
> Which devices ?  All devices ever that want to do zero-copy DMA into
> the guest ?
>   

IDE, scsi, virtio-blk, virtio-net, e1000, maybe a few more.

>> They are passed scatter-gather lists, and I don't think they make 
>> guarantees about the order in which they're accessed.
>>     
>
> Many devices will be able to make such promises.
>   

If they do, and if guests actually depend on these promises, then we 
will not use the new API until someone is sufficiently motivated to send 
a patch to enable it.

>> This DMA will be into RAM, not mmio.
>>     
>
> As previously discussed, we might be using bounce buffers even for
> RAM, depending on the execution model.  You said earlier:
>
>   Try it out. I'm sure it will work just fine (if incredibly slowly, 
>   unless you provide multiple bounce buffers).
>
> but here is an example from Jamie of a situation where it won't work
> right.
>   

Framebuffers?  Those are RAM.  USB webcams?  These can't be interrupted 
by SIGINT.  Are you saying a guest depends on an O_DIRECT USB transfer 
not affecting memory when a USB cable is pulled out?

>> We don't have a reliable amount to pass.
>>     
>
> A device which _really_ doesn't have a reliable amount to pass, and
> which is entitled to scribble all over the RAM it was to DMA to even
> if it does only a partial transfer, can simply pass the total transfer
> length.  That would be no different to your proposal.
>   

I'm suggesting we do that unconditionally (as my patch does) and only 
add that complexity when we know it's needed for certain.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-20 17:42                 ` Avi Kivity
@ 2009-01-20 18:08                   ` Jamie Lokier
  2009-01-20 20:27                     ` Avi Kivity
  2009-01-21 16:50                   ` Ian Jackson
  1 sibling, 1 reply; 64+ messages in thread
From: Jamie Lokier @ 2009-01-20 18:08 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity wrote:
> Framebuffers?  Those are RAM.  USB webcams?  These can't be interrupted 
> by SIGINT.  Are you saying a guest depends on an O_DIRECT USB transfer 
> not affecting memory when a USB cable is pulled out?

The USB thing is probably more about emulated-UHCI/EHCI pass-through
to real host USB devices which aren't emulated in QEMU.

> >>We don't have a reliable amount to pass.
> >>    
> >A device which _really_ doesn't have a reliable amount to pass, and
> >which is entitled to scribble all over the RAM it was to DMA to even
> >if it does only a partial transfer, can simply pass the total transfer
> >length.  That would be no different to your proposal.
> >  
> 
> I'm suggesting we do that unconditionally (as my patch does) and only 
> add that complexity when we know it's needed for certain.

Fair enough.  Things can be added if needed.  But please make it real
clear in the DMA API comments that the whole buffer may be
overwritten.  If a guest actually does depend on this, it won't show
up in testing because the whole buffer won't be overwritten except
when a bounce copy is used.

Linux itself had some issues with _its_ DMA API recently: people have
been writing drivers using the Linux DMA API making broken assumptions
that happen to work on x86, and work some of the time on other
architectures.  These things don't show up during driver tests, and
are very difficult to track down later.  I suspect "overwrites the
remaining buffer with randomness/zeros but only under bounce-buffer
conditions" will be similarly unlikely to trigger, and very difficult
to track down if it causes a problem anywhere.

The recent solution in Linux is to add some debugging options which
check it's used correctly, even on x86.

Here's a final thought, to do with performance:

e1000, configured for jumbo frames, receives lots of small packets,
and the DMA subsystem has to bounce-copy the data for some reason (Ian
suggested maybe always doing that with Xen for DMA to guest RAM?)

Without a length passed to unmap, won't it copy 65536 bytes per packet
(most from cold cache) because that's the amount set up for DMA to
receive from /dev/tap, instead of 256 or 1514 bytes per packet which
is their actual size?
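
The saving an unmap-with-length gives on the bounce path can be sketched as below (hypothetical names, not part of the patch's API): copy back min(transferred, mapped_len) bytes rather than the full mapping:

```c
#include <assert.h>
#include <string.h>

static unsigned char bounce_buf[65536];   /* one mapped jumbo-frame buffer */
static unsigned char guest_ram[65536];
static size_t bounced_bytes;              /* instrumentation for the sketch */

/* Bounce-path unmap that knows how much the device actually wrote:
 * copy back only that many bytes, not the whole mapping. */
static void unmap_written(unsigned long guest_addr, size_t mapped_len,
                          size_t transferred)
{
    size_t n = transferred < mapped_len ? transferred : mapped_len;
    memcpy(guest_ram + guest_addr, bounce_buf, n);
    bounced_bytes = n;
}
```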

-- Jamie

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Qemu-devel] Re: [PATCH 1/5] Add target memory mapping API
  2009-01-18 19:53 ` [Qemu-devel] [PATCH 1/5] Add target memory mapping API Avi Kivity
  2009-01-19 13:49   ` Ian Jackson
  2009-01-19 14:56   ` [Qemu-devel] " Anthony Liguori
@ 2009-01-20 18:43   ` Anthony Liguori
  2009-01-21 17:09     ` Ian Jackson
  2009-01-21 11:52   ` [Qemu-devel] " Mike Day
  3 siblings, 1 reply; 64+ messages in thread
From: Anthony Liguori @ 2009-01-20 18:43 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel

Avi Kivity wrote:
> Devices accessing large amounts of memory (as with DMA) will wish to obtain
> a pointer to guest memory rather than access it indirectly via
> cpu_physical_memory_rw().  Add a new API to convert target addresses to
> host pointers.
>
> In case the target address does not correspond to RAM, a bounce buffer is
> allocated.  To prevent the guest from causing the host to allocate unbounded
> amounts of bounce buffer, this memory is limited (currently to one page).
>
> Signed-off-by: Avi Kivity <avi@redhat.com>
>   

I've gone though the thread again and I think this patch series is a 
pretty good start.  I think the API can be refined a bit more down the 
road to better support Xen but I think this is a good start.

Does anyone have major blockers with this API?  If not, I'd like to 
commit these patches.  The consumers of it should be small provided that 
we make sure to write helpers for the various types of IO pipelines.

Regards,

Anthony Liguori

> ---
>  cpu-all.h |    5 +++
>  exec.c    |   93 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 98 insertions(+), 0 deletions(-)
>
> diff --git a/cpu-all.h b/cpu-all.h
> index ee0a6e3..3439999 100644
> --- a/cpu-all.h
> +++ b/cpu-all.h
> @@ -923,6 +923,11 @@ static inline void cpu_physical_memory_write(target_phys_addr_t addr,
>  {
>      cpu_physical_memory_rw(addr, (uint8_t *)buf, len, 1);
>  }
> +void *cpu_physical_memory_map(target_phys_addr_t addr,
> +                              target_phys_addr_t *plen,
> +                              int is_write);
> +void cpu_physical_memory_unmap(void *buffer, target_phys_addr_t len,
> +                               int is_write);
>  uint32_t ldub_phys(target_phys_addr_t addr);
>  uint32_t lduw_phys(target_phys_addr_t addr);
>  uint32_t ldl_phys(target_phys_addr_t addr);
> diff --git a/exec.c b/exec.c
> index faa6333..7162271 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -3045,6 +3045,99 @@ void cpu_physical_memory_write_rom(target_phys_addr_t addr,
>      }
>  }
>  
> +typedef struct {
> +    void *buffer;
> +    target_phys_addr_t addr;
> +    target_phys_addr_t len;
> +} BounceBuffer;
> +
> +static BounceBuffer bounce;
> +
> +void *cpu_physical_memory_map(target_phys_addr_t addr,
> +                              target_phys_addr_t *plen,
> +                              int is_write)
> +{
> +    target_phys_addr_t len = *plen;
> +    target_phys_addr_t done = 0;
> +    int l;
> +    uint8_t *ret = NULL;
> +    uint8_t *ptr;
> +    target_phys_addr_t page;
> +    unsigned long pd;
> +    PhysPageDesc *p;
> +    unsigned long addr1;
> +
> +    while (len > 0) {
> +        page = addr & TARGET_PAGE_MASK;
> +        l = (page + TARGET_PAGE_SIZE) - addr;
> +        if (l > len)
> +            l = len;
> +        p = phys_page_find(page >> TARGET_PAGE_BITS);
> +        if (!p) {
> +            pd = IO_MEM_UNASSIGNED;
> +        } else {
> +            pd = p->phys_offset;
> +        }
> +
> +        if ((pd & ~TARGET_PAGE_MASK) != IO_MEM_RAM) {
> +            if (done || bounce.buffer) {
> +                break;
> +            }
> +            bounce.buffer = qemu_memalign(TARGET_PAGE_SIZE, TARGET_PAGE_SIZE);
> +            bounce.addr = addr;
> +            bounce.len = l;
> +            if (!is_write) {
> +                cpu_physical_memory_rw(addr, bounce.buffer, l, 0);
> +            }
> +            ptr = bounce.buffer;
> +        } else {
> +            addr1 = (pd & TARGET_PAGE_MASK) + (addr & ~TARGET_PAGE_MASK);
> +            ptr = phys_ram_base + addr1;
> +        }
> +        if (!done) {
> +            ret = ptr;
> +        } else if (ret + done != ptr) {
> +            break;
> +        }
> +
> +        len -= l;
> +        addr += l;
> +        done += l;
> +    }
> +    *plen = done;
> +    return ret;
> +}
> +
> +void cpu_physical_memory_unmap(void *buffer, target_phys_addr_t len,
> +                               int is_write)
> +{
> +    if (buffer != bounce.buffer) {
> +        if (is_write) {
> +            unsigned long addr1 = (uint8_t *)buffer - phys_ram_base;
> +            do {
> +                unsigned l;
> +                l = TARGET_PAGE_SIZE;
> +                if (l > len)
> +                    l = len;
> +                if (!cpu_physical_memory_is_dirty(addr1)) {
> +                    /* invalidate code */
> +                    tb_invalidate_phys_page_range(addr1, addr1 + l, 0);
> +                    /* set dirty bit */
> +                    phys_ram_dirty[addr1 >> TARGET_PAGE_BITS] |=
> +                        (0xff & ~CODE_DIRTY_FLAG);
> +                }
> +                addr1 += l;
> +                len -= l;
> +            } while (len);
> +        }
> +        return;
> +    }
> +    if (is_write) {
> +        cpu_physical_memory_write(bounce.addr, bounce.buffer, bounce.len);
> +    }
> +    qemu_free(bounce.buffer);
> +    bounce.buffer = NULL;
> +}
>  
>  /* warning: addr must be aligned */
>  uint32_t ldl_phys(target_phys_addr_t addr)
>   
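
For reference, the device-side shape of this API is a loop: map as much as possible, copy, unmap, advance. The sketch below stubs map/unmap over a flat RAM array (the stub maps at most 4 KB at a time) so it is self-contained; in QEMU the real functions from the patch would be used instead:

```c
#include <assert.h>
#include <string.h>

typedef unsigned long target_phys_addr_t;

/* Stub "guest RAM" and stubbed map/unmap with the patch's signatures. */
static unsigned char fake_ram[8192];

static void *cpu_physical_memory_map(target_phys_addr_t addr,
                                     target_phys_addr_t *plen, int is_write)
{
    (void)is_write;
    if (*plen > 4096)
        *plen = 4096;              /* partial mapping, as mmio/bounce can cause */
    return fake_ram + addr;
}

static void cpu_physical_memory_unmap(void *buffer, target_phys_addr_t len,
                                      int is_write)
{
    (void)buffer; (void)len; (void)is_write;   /* RAM case: nothing to do */
}

/* Device-side DMA write: map as much as possible, copy, unmap, repeat. */
static void dma_write(target_phys_addr_t addr, const void *dev_buf,
                      target_phys_addr_t len)
{
    const unsigned char *src = dev_buf;

    while (len) {
        target_phys_addr_t l = len;
        void *p = cpu_physical_memory_map(addr, &l, 1);
        if (!p || l == 0)
            break;                 /* real code would retry or bounce here */
        memcpy(p, src, l);
        cpu_physical_memory_unmap(p, l, 1);
        addr += l;
        src += l;
        len -= l;
    }
}
```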

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-20 18:08                   ` Jamie Lokier
@ 2009-01-20 20:27                     ` Avi Kivity
  2009-01-21 16:53                       ` Ian Jackson
  0 siblings, 1 reply; 64+ messages in thread
From: Avi Kivity @ 2009-01-20 20:27 UTC (permalink / raw)
  To: qemu-devel

Jamie Lokier wrote:
> Linux itself had some issues with _its_ DMA API recently: people have
> been writing drivers using the Linux DMA API making broken assumptions
> that happen to work on x86, and work some of the time on other
> architectures.  These things don't show up during driver tests, and
> are very difficult to track down later.  I suspect "overwrites the
> remaining buffer with randomness/zeros but only under bounce-buffer
> conditions" will be similarly unlikely to trigger, and very difficult
> to track down if it causes a problem anywhere.
>   

This reminds me -- Gleb pointed out that Linux itself bounces some DMA 
operations, and will copy the entire range.  So Linux itself doesn't 
satisfy these requirements.

> The recent solution in Linux is to add some debugging options which
> check it's used correctly, even on x86.
>
> Here's a final thought, to do with performance:
>
> e1000, configured for jumbo frames, receives lots of small packets,
> and the DMA subsystem has to bounce-copy the data for some reason (Ian
> suggested maybe always doing that with Xen for DMA to guest RAM?)
>
> Without a length passed to unmap, won't it copy 65536 bytes per packet
> (most from cold cache) because that's the amount set up for DMA to
> receive from /dev/tap, instead of 256 or 1514 bytes per packet which
> is their actual size?
>   

It will.  If we bounce.  We'll never bounce in real life.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


* [Qemu-devel] Re: Add target memory mapping API
  2009-01-18 19:53 ` [Qemu-devel] [PATCH 1/5] Add target memory mapping API Avi Kivity
                     ` (2 preceding siblings ...)
  2009-01-20 18:43   ` Anthony Liguori
@ 2009-01-21 11:52   ` Mike Day
  2009-01-21 12:17     ` Avi Kivity
  2009-01-21 17:37     ` Paul Brook
  3 siblings, 2 replies; 64+ messages in thread
From: Mike Day @ 2009-01-21 11:52 UTC (permalink / raw)
  To: qemu-devel

On 18/01/09 21:53 +0200, Avi Kivity wrote:
> Devices accessing large amounts of memory (as with DMA) will wish to obtain
> a pointer to guest memory rather than access it indirectly via
> cpu_physical_memory_rw().  Add a new API to convert target addresses to
> host pointers.
> 
> In case the target address does not correspond to RAM, a bounce buffer is
> allocated.  To prevent the guest from causing the host to allocate unbounded
> amounts of bounce buffer, this memory is limited (currently to one page).
> 
> Signed-off-by: Avi Kivity <avi@redhat.com>
> ---
>  cpu-all.h |    5 +++
>  exec.c    |   93 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 98 insertions(+), 0 deletions(-)
> 
> diff --git a/cpu-all.h b/cpu-all.h
> index ee0a6e3..3439999 100644
> --- a/cpu-all.h
> +++ b/cpu-all.h
> @@ -923,6 +923,11 @@ static inline void cpu_physical_memory_write(target_phys_addr_t addr,
>  {
>      cpu_physical_memory_rw(addr, (uint8_t *)buf, len, 1);
>  }
> +void *cpu_physical_memory_map(target_phys_addr_t addr,
> +                              target_phys_addr_t *plen,
> +                              int is_write);
> +void cpu_physical_memory_unmap(void *buffer, target_phys_addr_t len,
> +                               int is_write);
>  uint32_t ldub_phys(target_phys_addr_t addr);
>  uint32_t lduw_phys(target_phys_addr_t addr);
>  uint32_t ldl_phys(target_phys_addr_t addr);
> diff --git a/exec.c b/exec.c
> index faa6333..7162271 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -3045,6 +3045,99 @@ void cpu_physical_memory_write_rom(target_phys_addr_t addr,
>      }
>  }
>  
> +typedef struct {
> +    void *buffer;
> +    target_phys_addr_t addr;
> +    target_phys_addr_t len;
> +} BounceBuffer;
> +
> +static BounceBuffer bounce;


Should there be one bounce buffer for each vcpu? Eventually the device
model lock will be refactored to allow more parallelism, right? 


Mike

-- 
Mike Day
http://www.ncultra.org
AIM: ncmikeday |  Yahoo IM: ultra.runner
PGP key: http://www.ncultra.org/ncmike/pubkey.asc
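The shape of the API being debated here can be modelled in a few lines. This is a toy sketch, not the patch: the names, sizes, and address split below are invented, but it shows why a single static BounceBuffer serialises bounced mappings, which is the limitation Mike is asking about.

```c
#include <stdint.h>
#include <stddef.h>

/* Toy model: "RAM" maps directly; anything else goes through the one
 * static bounce buffer, so only one bounced mapping can be live. */
#define RAM_SIZE   0x1000
#define BOUNCE_LEN 512

static uint8_t toy_ram[RAM_SIZE];
static struct { uint8_t buf[BOUNCE_LEN]; int in_use; } toy_bounce;

static void *toy_map(uint64_t addr, uint64_t *plen)
{
    if (addr < RAM_SIZE) {                  /* direct pointer into RAM */
        if (addr + *plen > RAM_SIZE)
            *plen = RAM_SIZE - addr;        /* clamp to the region */
        return toy_ram + addr;
    }
    if (toy_bounce.in_use)                  /* one buffer, globally */
        return NULL;                        /* caller must retry later */
    toy_bounce.in_use = 1;
    if (*plen > BOUNCE_LEN)
        *plen = BOUNCE_LEN;
    return toy_bounce.buf;
}

static void toy_unmap(void *buffer)
{
    if (buffer == toy_bounce.buf)
        toy_bounce.in_use = 0;              /* free it for the next DMA */
}
```

A per-vcpu or per-request pool of bounce buffers would relax the one-at-a-time restriction, at the cost of the bookkeeping the posted patch deliberately avoids.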


* [Qemu-devel] Re: Add target memory mapping API
  2009-01-19 17:28         ` Avi Kivity
  2009-01-19 17:53           ` Ian Jackson
  2009-01-19 18:25           ` Jamie Lokier
@ 2009-01-21 12:06           ` Mike Day
  2009-01-21 12:18             ` Avi Kivity
  2 siblings, 1 reply; 64+ messages in thread
From: Mike Day @ 2009-01-21 12:06 UTC (permalink / raw)
  To: qemu-devel

On 19/01/09 19:28 +0200, Avi Kivity wrote:
> Ian Jackson wrote:
>>> Correct.  If you need to perform read-modify-write, you need to use  
>>> cpu_physical_memory_rw(), twice.  If we ever want to support RMW, 
>>> we'll need to add another value for is_write.  I don't think we have  
>>> interesting devices at this point which require efficient RMW.
>>>     
>>
>> Efficient read-modify-write may be very hard for some setups to
>> achieve.  It can't be done with the bounce buffer implementation.
>> I think one good rule of thumb would be to make sure that the interface
>> as specified can be implemented in terms of cpu_physical_memory_rw.
>>   
>
> What is the motivation for efficient rmw?

rmw would be good for changing bytes in a device config space, or
programming an IOMMU. Right now these are not on the critical path but
they may be soon in some cases. For example imagine an assigned I/O
device that wants to dynamically map io translations for each DMA.


Mike

-- 
Mike Day
http://www.ncultra.org
AIM: ncmikeday |  Yahoo IM: ultra.runner
PGP key: http://www.ncultra.org/ncmike/pubkey.asc


* Re: [Qemu-devel] Re: Add target memory mapping API
  2009-01-21 11:52   ` [Qemu-devel] " Mike Day
@ 2009-01-21 12:17     ` Avi Kivity
  2009-01-21 17:37     ` Paul Brook
  1 sibling, 0 replies; 64+ messages in thread
From: Avi Kivity @ 2009-01-21 12:17 UTC (permalink / raw)
  To: ncmike, qemu-devel

Mike Day wrote:
>>  
>> +typedef struct {
>> +    void *buffer;
>> +    target_phys_addr_t addr;
>> +    target_phys_addr_t len;
>> +} BounceBuffer;
>> +
>> +static BounceBuffer bounce;
>>     
>
>
> Should there be one bounce buffer for each vcpu? Eventually the device
> model lock will be refactored to allow more parallelism, right? 
>   

I don't see why.  Bouncing only happens when you DMA into mmio space 
instead of RAM.  This never happens.

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] Re: Add target memory mapping API
  2009-01-21 12:06           ` [Qemu-devel] " Mike Day
@ 2009-01-21 12:18             ` Avi Kivity
  0 siblings, 0 replies; 64+ messages in thread
From: Avi Kivity @ 2009-01-21 12:18 UTC (permalink / raw)
  To: ncmike, qemu-devel

Mike Day wrote:
>> What is the motivation for efficient rmw?
>>     
>
> rmw would be good for changing bytes in a device config space, or
> programming an IOMMU. Right now these are not on the critical path but
> they may be soon in some cases. For example imagine an assigned I/O
> device that wants to dynamically map io translations for each DMA.
>   

You don't do these with dma, instead you use cpu instructions which go 
through cpu_physical_memory_rw().

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-20 17:42                 ` Avi Kivity
  2009-01-20 18:08                   ` Jamie Lokier
@ 2009-01-21 16:50                   ` Ian Jackson
  2009-01-21 17:18                     ` Avi Kivity
  1 sibling, 1 reply; 64+ messages in thread
From: Ian Jackson @ 2009-01-21 16:50 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity writes ("Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API"):
> Ian Jackson wrote:
> > Which devices ?  All devices ever that want to do zero-copy DMA into
> > the guest ?
> 
> IDE, scsi, virtio-blk, virtio-net, e1000, maybe a few more.

Yesterday I produced the example of a SCSI tape drive, which is
> vanishingly unlikely to result in writes past the actual transfer
length since the drive definitely produces all of the data in order.

> >> They are passed scatter-gather lists, and I don't think they make 
> >> guarantees about the order in which they're accessed.
> >
> > Many devices will be able to make such promises.
> 
> If they do, and if guests actually depend on these promises, then we 
> will not use the new API until someone is sufficiently motivated to send 
> a patch to enable it.

As I have already pointed out, we won't discover that any guest
depends on those promises in testing, because it's the kind of thing
that will only happen in practice with reasonably obscure situations
including some error conditions.

So "let's only do this if we discover we need it" is not good enough.
We won't know that we need it.  What will probably happen is that some
user somewhere who is already suffering from some kind of problem will
experience additional apparently-random corruption.  Naturally that's
not going to result in a good bug report.

Even from our point of view as the programmers this isn't a good
approach because the proposed fix is an API and API change.  What
you're suggesting is that we introduce a bug, and wait and see if it
bites anyone, in the full knowledge that by then fixing the bug will
involve either widespread changes to all of the DMA API users or
changing a particular driver to be much slower.

> >> This DMA will be into RAM, not mmio.
> >
> > As previously discussed, we might be using bounce buffers even for
> > RAM, depending on the execution model.  You said earlier:
> >
> >   Try it out. I'm sure it will work just fine (if incredibly slowly, 
> >   unless you provide multiple bounce buffers).
> >
> > but here is an example from Jamie of a situation where it won't work
> > right.
> 
> Framebuffers?  Those are RAM.  USB webcams?  These can't be interrupted 
> by SIGINT.  Are you saying a guest depends on an O_DIRECT USB transfer 
> not affecting memory when a USB cable is pulled out?

No, as I said earlier, and as you appeared to accept, it is quite
possible that in some uses of the qemu code - including some uses of
Xen - _all_ DMA will go through bounce buffers.

> >> We don't have a reliable amount to pass.
> >
> > A device which _really_ doesn't have a reliable amount to pass, and
> > which is entitled to scribble all over the RAM it was to DMA to even
> > if it does only a partial transfer, can simply pass the total transfer
> > length.  That would be no different to your proposal.
> 
> I'm suggesting we do that unconditionally (as my patch does) and only 
> add that complexity when we know it's needed for certain.

At the moment there are no such devices (your claims about ide
notwithstanding) but I think it will be easier to argue about the
specific case after we have agreed on a non-deficient API.

Ian.


* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-20 20:27                     ` Avi Kivity
@ 2009-01-21 16:53                       ` Ian Jackson
  0 siblings, 0 replies; 64+ messages in thread
From: Ian Jackson @ 2009-01-21 16:53 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity writes ("Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API"):
> This reminds me -- Gleb pointed out that Linux itself bounces some DMA 
> operations, and will copy the entire range.  So Linux itself doesn't 
> satisfy these requirements.

What you mean is that Linux drivers are not supposed to rely on this
behaviour if they might execute in Linux systems with more than 4G of
RAM, and that any Linux drivers which do are buggy.

That's not the same as saying that there are no Linux systems which
rely on partial DMA transfers not overwriting beyond the transfer
length.  For example, there might easily be <4G systems with buggy
drivers where bounce buffers are never used.

> > Without a length passed to unmap, won't it copy 65536 bytes per packet
> > (most from cold cache) because that's the amount set up for DMA to
> > receive from /dev/tap, instead of 256 or 1514 bytes per packet which
> > is their actual size?
> 
> It will.  If we bounce.  We'll never bounce in real life.

I have explained why this is not necessarily true.

Ian.


* Re: [Qemu-devel] Re: [PATCH 1/5] Add target memory mapping API
  2009-01-20 18:43   ` Anthony Liguori
@ 2009-01-21 17:09     ` Ian Jackson
  2009-01-21 18:56       ` [Qemu-devel] " Mike Day
  2009-01-21 19:36       ` [Qemu-devel] Re: [PATCH 1/5] " Anthony Liguori
  0 siblings, 2 replies; 64+ messages in thread
From: Ian Jackson @ 2009-01-21 17:09 UTC (permalink / raw)
  To: qemu-devel

Anthony Liguori writes ("[Qemu-devel] Re: [PATCH 1/5] Add target memory mapping API"):
> I've gone through the thread again and I think this patch series is a 
> pretty good start.  I think the API can be refined a bit more down the 
> road to better support Xen but I think this is a good start.

I'm afraid I disagree.

> Does anyone have major blockers with this API?  If not, I'd like to 
> commit these patches.  The consumers of it should be small provided that 
> we make sure to write helpers for the various types of IO pipelines.

I have three criticisms, the first of which is in my view a major
blocker:

 * The unmap call should take a transfer length, for the reasons
   I have explained.

   This is absolutely critical for correctness if bounce buffers are
   used.

   I think it is important that the API permits implementations where
   the memory cannot be just mapped.  At the moment the APIs used by
   qemu device emulations do not assume that they can access RAM
   willy-nilly; they are all expected to go through
   cpu_physical_memory_rw.  I think it is important to preserve this
   for the future flexibility of the qemu codebase.

   Note that Xen is not such an implementation, although in the Xen
   case the additional performance of mapping is not so important so
   we might not choose to implement the mapping right away.  Obviously
   it would be much easier for us just to implement this mapping than
   to argue at length on this list.  It's not hard for Xen.  So I'm
   not trying to special-case Xen here.  You shouldn't read my
   criticisms as some kind of Xen fanatic trying to defend some crazy
   Xen behaviour.

   I'm trying to preserve an important and useful property of the
   internal qemu API.  My suggestion will mean the device emulation
   parts of qemu continue to be reasonably easily useable separately
   from the rest of qemu, and possibly very separately from any
   emulation of CPU or memory.  The benefit is difficult to evaluate
   exactly but the cost is very small.

Of my criticisms, I think this is by far the most important.  It
relates to the correctness of the API and in my view it is essential
that this be properly addressed.

Avi will say again (as if repetition made things true) that `bounce
buffers are never used' but of course they will be sometimes.  If they
were never used his patch wouldn't need to have that code in it.
It's true that they will be _rarely_ used but a lack of correctness in
a rare case is not acceptable either.

The second criticism is a matter of taste, perhaps, and the third can
be addressed later:

 * The mixed call/callback API: DMAs which can be mapped immediately
   are treated as synchronous calls; ones which can't involve a
   separate callback.

   This is no different semantically from a pure-callback API.
   However, it makes life more clumsy for the caller - the caller now
   needs two code paths: one for immediate success, and one for
   deferred allocation.

   Callbacks are increasingly dominating the innards of qemu for good
   reasons.  I think I have explained how a callback style can provide
   the necessary functionality.

 * The documentation, in the form of comments describing the
   semantics of and restrictions on the use of the DMA mapping,
   is inadequate.

Ian.


* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-21 16:50                   ` Ian Jackson
@ 2009-01-21 17:18                     ` Avi Kivity
  2009-01-21 21:54                       ` Anthony Liguori
  0 siblings, 1 reply; 64+ messages in thread
From: Avi Kivity @ 2009-01-21 17:18 UTC (permalink / raw)
  To: qemu-devel

Ian Jackson wrote:
> Avi Kivity writes ("Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API"):
>   
>> Ian Jackson wrote:
>>     
>>> Which devices ?  All devices ever that want to do zero-copy DMA into
>>> the guest ?
>>>       
>> IDE, scsi, virtio-blk, virtio-net, e1000, maybe a few more.
>>     
>
> Yesterday I produced the example of a SCSI tape drive, which is
> vanishingly unlikely to result in writes past the actual transfer
> length since the drive definitely produces all of the data in order.
>   

And I explained that it's very unlikely to ever be noticed by a guest, 
since the DMA will happen into kernel memory (which will be clobbered), 
but the subsequent copying into userspace will use the correct size.  I 
also pointed out that the holy kernel itself might use bounce buffers 
and disregard the actually copied size.

If you're into accurate emulation but not into performance, use 
cpu_physical_memory_rw().  This API is optional, for performance minded 
implementations.

> As I have already pointed out, we won't discover that any guest
> depends on those promises in testing, because it's the kind of thing
> that will only happen in practice with reasonably obscure situations
> including some error conditions.
>
> So "let's only do this if we discover we need it" is not good enough.
> We won't know that we need it.  What will probably happen is that some
> user somewhere who is already suffering from some kind of problem will
> experience additional apparently-random corruption.  Naturally that's
> not going to result in a good bug report.
>
> Even from our point of view as the programmers this isn't a good
> approach because the proposed fix is an API and API change.  What
> you're suggesting is that we introduce a bug, and wait and see if it
> bites anyone, in the full knowledge that by then fixing the bug will
> involve either widespread changes to all of the DMA API users or
> changing a particular driver to be much slower.
>   

That's because I estimate the probability of change being required as zero.

>> Framebuffers?  Those are RAM.  USB webcams?  These can't be interrupted 
>> by SIGINT.  Are you saying a guest depends on an O_DIRECT USB transfer 
>> not affecting memory when a USB cable is pulled out?
>>     
>
> No, as I said earlier, and as you appeared to accept, it is quite
> possible that in some uses of the qemu code - including some uses of
> Xen - _all_ DMA will go through bounce buffers.
>   

Right now Xen doesn't bounce DMAs, it uses the map cache.  I am not 
coding for all possible uses of qemu.  I am coding for what's in 
upstream qemu, and allowing for reasonable implementations in Xen.


>> I'm suggesting we do that unconditionally (as my patch does) and only 
>> add that complexity when we know it's needed for certain.
>>     
>
> At the moment there are no such devices (your claims about ide
> notwithstanding) but I think it will be easier to argue about the
> specific case after we have agreed on a non-deficient API.
>   

I don't think we'll ever reach agreement.

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] Re: Add target memory mapping API
  2009-01-21 11:52   ` [Qemu-devel] " Mike Day
  2009-01-21 12:17     ` Avi Kivity
@ 2009-01-21 17:37     ` Paul Brook
  1 sibling, 0 replies; 64+ messages in thread
From: Paul Brook @ 2009-01-21 17:37 UTC (permalink / raw)
  To: qemu-devel, ncmike

> Should there be one bounce buffer for each vcpu? Eventually the device
> model lock will be refactored to allow more parallelism, right?

Doubtful. Making the device models threadsafe would be a lot of work, 
introduce lots of subtle and hard to reproduce bugs, and probably give very 
little practical benefit.

Paul


* [Qemu-devel] Re: Add target memory mapping API
  2009-01-21 17:09     ` Ian Jackson
@ 2009-01-21 18:56       ` Mike Day
  2009-01-21 19:35         ` Avi Kivity
  2009-01-21 19:36       ` [Qemu-devel] Re: [PATCH 1/5] " Anthony Liguori
  1 sibling, 1 reply; 64+ messages in thread
From: Mike Day @ 2009-01-21 18:56 UTC (permalink / raw)
  To: qemu-devel

On 21/01/09 17:09 +0000, Ian Jackson wrote:
> Anthony Liguori writes ("[Qemu-devel] Re: [PATCH 1/5] Add target memory mapping API"):
> > I've gone through the thread again and I think this patch series is a 
> > pretty good start.  I think the API can be refined a bit more down the 
> > road to better support Xen but I think this is a good start.
> 
> I'm afraid I disagree.
> 
> > Does anyone have major blockers with this API?  If not, I'd like to 
> > commit these patches.  The consumers of it should be small provided that 
> > we make sure to write helpers for the various types of IO pipelines.
> 
> I have three criticisms, the first of which is in my view a major
> blocker:
> 
>  * The unmap call should take a transfer length, for the reasons
>    I have explained.
> 
>    This is absolutely critical for correctness if bounce buffers are
>    used.
> 
>    I think it is important that the API permits implementations where
>    the memory cannot be just mapped.  At the moment the APIs used by
>    qemu device emulations do not assume that they can access RAM
>    willy-nilly; they are all expected to go through
>    cpu_physical_memory_rw.  I think it is important to preserve this
>    for the future flexibility of the qemu codebase.

cpu_physical_memory_rw will still be available and used by most
devices even with this proposed API, right? Certain devices will be
able to now use the new API.
 
>    Note that Xen is not such an implementation, although in the Xen
>    case the additional performance of mapping is not so important so
>    we might not choose to implement the mapping right away.  Obviously
>    it would be much easier for us just to implement this mapping than
>    to argue at length on this list.  It's not hard for Xen.  So I'm
>    not trying to special-case Xen here.  You shouldn't read my
>    criticisms as some kind of Xen fanatic trying to defend some crazy
>    Xen behaviour.
> 
>    I'm trying to preserve an important and useful property of the
>    internal qemu API.  My suggestion will mean the device emulation
>    parts of qemu continue to be reasonably easily useable separately
>    from the rest of qemu, and possibly very separately from any
>    emulation of CPU or memory.  The benefit is difficult to evaluate
>    exactly but the cost is very small.
> 
> Of my criticisms, I think this is by far the most important.  It
> relates to the correctness of the API and in my view it is essential
> that this be properly addressed.
> 
> Avi will say again (as if repetition made things true) that `bounce
> buffers are never used' but of course they will be sometimes.  If they
> were never used his patch wouldn't need to have that code in them.
> It's true that they will be _rarely_ used but a lack of correctness in
> a rare case is not acceptable either.

I think a DMA into a device MMIO range is a special case.


Mike
-- 
Mike Day
http://www.ncultra.org
AIM: ncmikeday |  Yahoo IM: ultra.runner
PGP key: http://www.ncultra.org/ncmike/pubkey.asc


* Re: [Qemu-devel] Re: Add target memory mapping API
  2009-01-21 18:56       ` [Qemu-devel] " Mike Day
@ 2009-01-21 19:35         ` Avi Kivity
  0 siblings, 0 replies; 64+ messages in thread
From: Avi Kivity @ 2009-01-21 19:35 UTC (permalink / raw)
  To: ncmike, qemu-devel

Mike Day wrote:
> cpu_physical_memory_rw will still be available and used by most
> devices even with this proposed API, right? Certain devices will be
> able to now use the new API.
>   

Yes.  When I converted IDE, for example, access to the scatter/gather 
list remains using cpu_p_m_rw(), while accessing the buffers pointed to 
by the lists uses the new API.
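That split can be sketched as follows. The PRD layout and every name here are made up for illustration, not taken from hw/ide.c: the point is only that the small, in-order control structure is read with a copying rw-style call, while the data buffers it points at are what the mapping API would be used for.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scatter/gather entry: guest-physical address + length. */
struct toy_prd { uint32_t addr; uint32_t len; };

static uint8_t sg_ram[4096];            /* stand-in for guest memory */

static void sg_read(uint32_t addr, void *buf, size_t len)
{
    memcpy(buf, sg_ram + addr, len);    /* stands in for cpu_p_m_rw() */
}

static uint32_t toy_walk_sg(uint32_t table, int entries)
{
    uint32_t total = 0;
    for (int i = 0; i < entries; i++) {
        struct toy_prd e;
        /* copying is fine for the table itself: it is tiny */
        sg_read(table + i * sizeof(e), &e, sizeof(e));
        /* a real device would now map e.addr/e.len for zero-copy I/O */
        total += e.len;
    }
    return total;
}
```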


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: [Qemu-devel] Re: [PATCH 1/5] Add target memory mapping API
  2009-01-21 17:09     ` Ian Jackson
  2009-01-21 18:56       ` [Qemu-devel] " Mike Day
@ 2009-01-21 19:36       ` Anthony Liguori
  2009-01-22 12:18         ` Ian Jackson
  1 sibling, 1 reply; 64+ messages in thread
From: Anthony Liguori @ 2009-01-21 19:36 UTC (permalink / raw)
  To: qemu-devel; +Cc: Avi Kivity

Ian Jackson wrote:
> Anthony Liguori writes ("[Qemu-devel] Re: [PATCH 1/5] Add target memory mapping API"):
>   
>> I've gone through the thread again and I think this patch series is a 
>> pretty good start.  I think the API can be refined a bit more down the 
>> road to better support Xen but I think this is a good start.
>>     
>
> I'm afraid I disagree.
>   

That's why I sent a note first instead of just committing :-)

>> Does anyone have major blockers with this API?  If not, I'd like to 
>> commit these patches.  The consumers of it should be small provided that 
>> we make sure to write helpers for the various types of IO pipelines.
>>     
>
> I have three criticisms, the first of which is in my view a major
> blocker:
>   

Okay, then let's ignore the last two for now.  A DMA API is really 
important for performance and I'd like to get something in the tree we 
can start working with.

>  * The unmap call should take a transfer length, for the reasons
>    I have explained.
>   

I have a hard time disagreeing that it should take a transfer length, if 
for nothing else, to update the right amount of the dirty 
bitmap.  Avi, do you object to having unmap take a transfer length?

FWIW, I don't think we need to support RMW operations right now, but I 
think the transfer length is still required since you may get a partial 
IO result.  It may not be an important issue of correctness but it's 
still an issue of correctness.
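The kind of unmap the thread is converging on, one that takes the actual transfer length and dirties only that much, can be sketched against a toy dirty bitmap. Page size, bitmap, and names below are invented, not QEMU's; the sketch shows why a 1514-byte packet received into a 64 KiB mapping should dirty one page rather than sixteen.

```c
#include <stdint.h>

#define TOY_PAGE_BITS 12                        /* 4 KiB toy pages */
#define TOY_PAGES     16

static uint8_t toy_dirty[TOY_PAGES];

/* Hypothetical unmap: addr is where the mapping started, access_len is
 * how much was actually transferred (which may be much shorter). */
static void toy_unmap_len(uint64_t addr, uint64_t access_len)
{
    if (access_len == 0)
        return;                                 /* nothing transferred */
    uint64_t page = addr >> TOY_PAGE_BITS;
    uint64_t last = (addr + access_len - 1) >> TOY_PAGE_BITS;
    for (; page <= last && page < TOY_PAGES; page++)
        toy_dirty[page] = 1;                    /* only what was written */
}
```

Without the length parameter, the only safe choice is to dirty (and, when bouncing, copy back) the full mapped length, which is the 65536-versus-1514 cost Jamie points out.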


>    I think it is important that the API permits implementations where
>    the memory cannot be just mapped.  At the moment the APIs used by
>    qemu device emulations do not assume that they can access RAM
>    willy-nilly; they are all expected to go through
>    cpu_physical_memory_rw.  I think it is important to preserve this
>    for the future flexibility of the qemu codebase.
>   

map() can, and will, fail under certain conditions and the code has to 
accommodate that.

>    I'm trying to preserve an important and useful property of the
>    internal qemu API.  My suggestion will mean the device emulation
>    parts of qemu continue to be reasonably easily useable separately
>    from the rest of qemu, and possibly very separately from any
>    emulation of CPU or memory.  The benefit is difficult to evaluate
>    exactly but the cost is very small.
>   

Are you arguing for "callback" based mapping?  The vast majority of 
devices will end up using a callback based approach so I'm not sure what 
you're unhappy about.

> The second criticism is a matter of taste, perhaps, and the third can
> be addressed later:
>
>  * The mixed call/callback API: DMAs which can be mapped immediately
>    are treated as synchronous calls; ones which can't involve a
>    separate callback.
>
>    This is no different semantically from a pure-callback API.
>    However, it makes life more clumsy for the caller - the caller now
>    needs two code paths: one for immediate success, and one for
>    deferred allocation.
>
>    Callbacks are increasingly dominating the innards of qemu for good
>    reasons.  I think I have explained how a callback style can provide
>    the necessary functionality.
>   

I understand the point you make here and I think we'll end up having a 
callback API.  But for reasons I've already mentioned, I don't think it 
makes sense for it to be the core API since it introduces very difficult 
to eliminate deadlocks.

>  * The documentation, in the form of comments describing the
>    semantics of and restrictions on the use of the DMA mapping,
>    is inadequate.
>   

It is.  Avi has promised to fix that :-)

Regards,

Anthony Liguori

> Ian.
>
>
>   


* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-21 17:18                     ` Avi Kivity
@ 2009-01-21 21:54                       ` Anthony Liguori
  0 siblings, 0 replies; 64+ messages in thread
From: Anthony Liguori @ 2009-01-21 21:54 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity wrote:
> Ian Jackson wrote:
>
>>> I'm suggesting we do that unconditionally (as my patch does) and 
>>> only add that complexity when we know it's needed for certain.
>>>     
>>
>> At the moment there are no such devices (your claims about ide
>> notwithstanding) but I think it will be easier to argue about the
>> specific case after we have agreed on a non-deficient API.
>>   
>
> I don't think we'll ever reach agreement.

APIs don't have to be fixed for any period of time.  They can be 
changed as needed.  The only thing we have to agree on is that 1) we're 
not causing regressions and 2) we're not digging ourselves into a hole.

So far, I don't see anything in the API that would cause either issue.

Regards,

Anthony Liguori


* Re: [Qemu-devel] Re: [PATCH 1/5] Add target memory mapping API
  2009-01-21 19:36       ` [Qemu-devel] Re: [PATCH 1/5] " Anthony Liguori
@ 2009-01-22 12:18         ` Ian Jackson
  2009-01-22 18:46           ` Anthony Liguori
  0 siblings, 1 reply; 64+ messages in thread
From: Ian Jackson @ 2009-01-22 12:18 UTC (permalink / raw)
  To: qemu-devel; +Cc: Avi Kivity

Anthony Liguori writes ("Re: [Qemu-devel] Re: [PATCH 1/5] Add target memory mapping API"):
> Ian Jackson wrote:
> > I'm afraid I disagree.
> 
> That's why I sent a note first instead of just committing :-)

Thanks.

> > I have three criticisms, the first of which is in my view a major
> > blocker:
> 
> Okay, then let's ignore the last two for now.  A DMA API is really 
> important for performance and I'd like to get something in the tree we 
> can start working with.

Right.  I agree that this is a good and useful thing to have.

> >  * The unmap call should take a transfer length, for the reasons
> >    I have explained.
> 
> I have a hard time disagreeing that it should take a transfer length, if 
> for nothing else, to update the right amount of the dirty 
> bitmap.  Avi, do you object to having unmap take a transfer length?

I hadn't thought about the dirty bitmap but of course you are right
there too.

> FWIW, I don't think we need to support RMW operations right now, but I 
> think the transfer length is still required since you may get a partial 
> IO result.  It may not be an important issue of correctness but it's 
> still an issue of correctness.

I think r-m-w operations are much less practical, really.  If map
isn't available for some reason then a direct read-modify-write
interface will be almost impossible to emulate with
cpu_physical_memory_rw.  So I think we shouldn't provide one.
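Avi's earlier suggestion upthread ("use cpu_physical_memory_rw(), twice") amounts to the following. The memory and names are invented for the sketch; note the sequence is not atomic, which is exactly why a true read-modify-write mode is hard to layer on a copying interface.

```c
#include <stdint.h>
#include <string.h>

static uint8_t rmw_ram[256];            /* toy stand-in for guest RAM */

/* Copying accessor shaped like cpu_physical_memory_rw(). */
static void rmw_rw(uint32_t addr, uint8_t *buf, size_t len, int is_write)
{
    if (is_write)
        memcpy(rmw_ram + addr, buf, len);
    else
        memcpy(buf, rmw_ram + addr, len);
}

/* Hypothetical RMW helper: OR a mask into one guest byte. */
static void rmw_or_byte(uint32_t addr, uint8_t mask)
{
    uint8_t b;
    rmw_rw(addr, &b, 1, 0);     /* 1: read the current value */
    b |= mask;                  /* 2: modify it locally */
    rmw_rw(addr, &b, 1, 1);     /* 3: write it back */
}
```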

> >    I think it is important that the API permits implementations where
> >    the memory cannot be just mapped.  At the moment the APIs used by
> >    qemu device emulations do not assume that they can access RAM
> >    willy-nilly; they are all expected to go through
> >    cpu_physical_memory_rw.  I think it is important to preserve this
> >    for the future flexibility of the qemu codebase.
> 
> map() can, and will, fail under certain conditions and the code has to 
> accommodate that.

Yes.

> >    I'm trying to preserve an important and useful property of the
> >    internal qemu API.  My suggestion will mean the device emulation
> >    parts of qemu continue to be reasonably easily useable separately
> >    from the rest of qemu, and possibly very separately from any
> >    emulation of CPU or memory.  The benefit is difficult to evaluate
> >    exactly but the cost is very small.
> 
> Are you arguing for "callback" based mapping?  The vast majority of 
> devices will end up using a callback based approach so I'm not sure what 
> you're unhappy about.

No, at this point I'm just arguing in favour of a transfer length at
unmap.  That lack is the only thing about Avi's patches that I think
is a major blocker.


The question of callbacks is my next point:

> > The second criticism is a matter of taste, perhaps, and the third can
> > be addressed later:
> >
> >  * The mixed call/callback API: [...]
> >
> >    Callbacks are increasingly dominating the innards of qemu for good
> >    reasons.  I think I have explained how a callback style can provide
> >    the necessary functionality.
> 
> I understand the point you make here and I think we'll end up having a 
> callback API.  But for reasons I've already mentioned, I don't think it 
> makes sense for it to be the core API since it introduces very difficult 
> to eliminate deadlocks.

I think my most recently sketched-out proposal (where the caller
passes a DMA sg list to the mapping request function) doesn't have
these deadlocks ?

It seems to me that the way to avoid the deadlock is to allow the
caller to make their whole request at once rather than having two
callers both racing to grab resources.

Regards,
Ian.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] Re: [PATCH 1/5] Add target memory mapping API
  2009-01-22 12:18         ` Ian Jackson
@ 2009-01-22 18:46           ` Anthony Liguori
  2009-01-26 12:23             ` Ian Jackson
  0 siblings, 1 reply; 64+ messages in thread
From: Anthony Liguori @ 2009-01-22 18:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: Avi Kivity

Ian Jackson wrote:
> Anthony Liguori writes ("Re: [Qemu-devel] Re: [PATCH 1/5] Add target memory mapping API"):
>   
>
>>> The second criticism is a matter of taste, perhaps, and the third can
>>> be addressed later:
>>>
>>>  * The mixed call/callback API: [...]
>>>
>>>    Callbacks are increasingly dominating the innards of qemu for good
>>>    reasons.  I think I have explained how a callback style can provide
>>>    the necessary functionality.
>>>       
>> I understand the point you make here and I think we'll end up having a 
>> callback API.  But for reasons I've already mentioned, I don't think it 
>> makes sense for it to be the core API since it introduces very difficult 
>> to eliminate deadlocks.
>>     
>
> I think my most recently sketched-out proposal (where the caller
> passes a DMA sg list to the mapping request function) doesn't have
> these deadlocks ?
>   

But what do you do with a DMA request that cannot possibly be bounced 
but that can be split?  In other words, a DMA request to 1G of MMIO 
memory that is a disk access.  We can absolutely split that into 
multiple requests.

> It seems to me that the way to avoid the deadlock is to allow the
> caller to make their whole request at once rather than having two
> callers both racing to grab resources.
>   

In practice, you can't rely on being able to allocate enough memory in 
the host to complete a full request.  This is only true for very special 
devices.  You definitely need to support two distinct modes.

Regards,

Anthony Liguori

> Regards,
> Ian.


* Re: [Qemu-devel] Re: [PATCH 1/5] Add target memory mapping API
  2009-01-22 18:46           ` Anthony Liguori
@ 2009-01-26 12:23             ` Ian Jackson
  2009-01-26 18:03               ` Anthony Liguori
  0 siblings, 1 reply; 64+ messages in thread
From: Ian Jackson @ 2009-01-26 12:23 UTC (permalink / raw)
  To: qemu-devel

Anthony Liguori writes ("Re: [Qemu-devel] Re: [PATCH 1/5] Add target memory mapping API"):
> Ian Jackson wrote:
> > I think my most recently sketched-out proposal (where the caller
> > passes a DMA sg list to the mapping request function) doesn't have
> > these deadlocks ?
> 
> But what do you do with a DMA request that cannot possibly be bounced 
> but that can be split?  In other words, a DMA request to 1G of MMIO 
> memory that is disk access.  We can absolutely split that into multiple 
> requests.

You impose a maximum size for any DMA request; that is, a maximum size
that callers of the DMA API are allowed to request.  Any larger
request is a bug.  Callers who might want to try to map bigger blocks
need to be able to split.

Alternatively you can have the API return an `amount mapped' on the
callback, but state that this number will always be at least
DMA_MAP_SG_GUARANTEE_MIN if at least that amount was requested.

Ian.


* Re: [Qemu-devel] Re: [PATCH 1/5] Add target memory mapping API
  2009-01-26 12:23             ` Ian Jackson
@ 2009-01-26 18:03               ` Anthony Liguori
  0 siblings, 0 replies; 64+ messages in thread
From: Anthony Liguori @ 2009-01-26 18:03 UTC (permalink / raw)
  To: qemu-devel

Ian Jackson wrote:
> Anthony Liguori writes ("Re: [Qemu-devel] Re: [PATCH 1/5] Add target memory mapping API"):
>   
>> Ian Jackson wrote:
>>     
>>> I think my most recently sketched-out proposal (where the caller
>>> passes a DMA sg list to the mapping request function) doesn't have
>>> these deadlocks ?
>>>       
>> But what do you do with a DMA request that cannot possibly be bounced 
>> but that can be split?  In other words, a DMA request to 1G of MMIO 
>> memory that is disk access.  We can absolutely split that into multiple 
>> requests.
>>     
>
> You impose a maximum size for any DMA request; that is, a maximum size
> that callers of the DMA API are allowed to request.  Any larger
> request is a bug.  Callers who might want to try to map bigger blocks
> need to be able to split.
>
> Alternatively you can have the API return an `amount mapped' on the
> callback, but state that this number will always be at least
> DMA_MAP_SG_GUARANTEE_MIN if at least that amount was requested.
>   

For that to be useful for network traffic, it has to be at least 64k.  
If it isn't at least 64k, then the NIC code must deal with two cases: 
one where it was able to map and one where it wasn't.

In the general case, we can't know ahead of time how large MIN must be 
to make all devices happy.

Regards,

Anthony Liguori

> Ian.


end of thread, other threads:[~2009-01-26 18:04 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-01-18 19:53 [Qemu-devel] [PATCH 0/5] Direct memory access for devices Avi Kivity
2009-01-18 19:53 ` [Qemu-devel] [PATCH 1/5] Add target memory mapping API Avi Kivity
2009-01-19 13:49   ` Ian Jackson
2009-01-19 14:54     ` Avi Kivity
2009-01-19 15:39       ` Anthony Liguori
2009-01-19 16:18         ` Paul Brook
2009-01-19 16:33           ` Anthony Liguori
2009-01-19 16:39             ` Avi Kivity
2009-01-19 19:15               ` Anthony Liguori
2009-01-20 10:09                 ` Avi Kivity
2009-01-19 16:57         ` Ian Jackson
2009-01-19 19:23           ` Anthony Liguori
2009-01-20 10:17             ` Avi Kivity
2009-01-20 14:18             ` Ian Jackson
2009-01-19 16:40       ` Ian Jackson
2009-01-19 17:28         ` Avi Kivity
2009-01-19 17:53           ` Ian Jackson
2009-01-19 18:29             ` Avi Kivity
2009-01-20 14:32               ` Ian Jackson
2009-01-20 17:23                 ` Avi Kivity
2009-01-19 18:25           ` Jamie Lokier
2009-01-19 18:43             ` Avi Kivity
2009-01-20 14:49               ` Ian Jackson
2009-01-20 17:42                 ` Avi Kivity
2009-01-20 18:08                   ` Jamie Lokier
2009-01-20 20:27                     ` Avi Kivity
2009-01-21 16:53                       ` Ian Jackson
2009-01-21 16:50                   ` Ian Jackson
2009-01-21 17:18                     ` Avi Kivity
2009-01-21 21:54                       ` Anthony Liguori
2009-01-20 14:44             ` Ian Jackson
2009-01-21 12:06           ` [Qemu-devel] " Mike Day
2009-01-21 12:18             ` Avi Kivity
2009-01-19 15:05     ` [Qemu-devel] [PATCH 1/5] " Gerd Hoffmann
2009-01-19 15:23       ` Avi Kivity
2009-01-19 15:29         ` Avi Kivity
2009-01-19 15:57           ` Gerd Hoffmann
2009-01-19 16:25             ` Avi Kivity
2009-01-19 17:08             ` Ian Jackson
2009-01-19 17:16               ` Avi Kivity
2009-01-19 14:56   ` [Qemu-devel] " Anthony Liguori
2009-01-19 15:03     ` Avi Kivity
2009-01-19 15:49       ` Anthony Liguori
2009-01-19 15:51         ` Avi Kivity
2009-01-20 18:43   ` Anthony Liguori
2009-01-21 17:09     ` Ian Jackson
2009-01-21 18:56       ` [Qemu-devel] " Mike Day
2009-01-21 19:35         ` Avi Kivity
2009-01-21 19:36       ` [Qemu-devel] Re: [PATCH 1/5] " Anthony Liguori
2009-01-22 12:18         ` Ian Jackson
2009-01-22 18:46           ` Anthony Liguori
2009-01-26 12:23             ` Ian Jackson
2009-01-26 18:03               ` Anthony Liguori
2009-01-21 11:52   ` [Qemu-devel] " Mike Day
2009-01-21 12:17     ` Avi Kivity
2009-01-21 17:37     ` Paul Brook
2009-01-18 19:53 ` [Qemu-devel] [PATCH 2/5] Add map client retry notification Avi Kivity
2009-01-19 14:58   ` [Qemu-devel] " Anthony Liguori
2009-01-18 19:53 ` [Qemu-devel] [PATCH 3/5] Vectored block device API Avi Kivity
2009-01-19 16:54   ` Blue Swirl
2009-01-19 17:19     ` Avi Kivity
2009-01-18 19:53 ` [Qemu-devel] [PATCH 4/5] I/O vector helpers Avi Kivity
2009-01-18 19:53 ` [Qemu-devel] [PATCH 5/5] Convert IDE to directly access guest memory Avi Kivity
2009-01-19 16:50 ` [Qemu-devel] [PATCH 0/5] Direct memory access for devices Blue Swirl
