[Qemu-devel] [PATCH 0/5] Direct memory access for devices (v2)

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [Qemu-devel] [PATCH 0/5] Direct memory access for devices (v2)
@ 2009-01-22 10:36 Avi Kivity
  2009-01-22 10:36 ` [Qemu-devel] [PATCH 1/5] Add target memory mapping API Avi Kivity
                   ` (5 more replies)
  0 siblings, 6 replies; 47+ messages in thread
From: Avi Kivity @ 2009-01-22 10:36 UTC (permalink / raw)
  To: Anthony Liguori, qemu-devel

One of the deficiencies of the current device layer is that it can only access
guest RAM via cpu_physical_memory_rw().  This means that the device emulation
code must copy the memory to or from a temporary buffer, even though the host
offers APIs which allow direct access to memory.  This reduces efficiency on
DMA capable devices, especially disks.

This patchset introduces a complement to the read/write API,
cpu_physical_memory_map() which allows device emulation code to map
guest memory directly.  The API bounces memory regions which cannot be
mapped (such as mmio regions) using an internal buffer.

As an example, IDE emulation is converted to use the new API.  This exposes
another deficiency: lack of scatter/gather support in the block layer.  To
work around this, a vectored block API is introduced, currently emulated
by bouncing.  Additional work is needed to convert all block format drivers
to use the vectored API.

Changes from v1:
 - documented memory mapping API
 - added access_len parameter to unmap operation, to indicate how much
   memory was actually accessed
 - move QEMUIOVector to cutils.c, and add flatten/unflatten operations
 - change block format driver API to accept a QEMUIOVector rather than a
   bare struct iovec

Avi Kivity (5):
  Add target memory mapping API
  Add map client retry notification
  I/O vector helpers
  Vectored block device API
  Convert IDE to directly access guest memory

 block.c       |   68 +++++++++++++++++++++++++++
 block.h       |    8 +++
 cpu-all.h     |    8 +++
 cutils.c      |   47 +++++++++++++++++++
 exec.c        |  142 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/ide.c      |  133 +++++++++++++++++++++++++++++++++++++++++++++++------
 qemu-common.h |   12 +++++
 7 files changed, 402 insertions(+), 16 deletions(-)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-22 10:36 [Qemu-devel] [PATCH 0/5] Direct memory access for devices (v2) Avi Kivity
@ 2009-01-22 10:36 ` Avi Kivity
  2009-01-22 12:24   ` Ian Jackson
  2009-01-22 10:36 ` [Qemu-devel] [PATCH 2/5] Add map client retry notification Avi Kivity
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 47+ messages in thread
From: Avi Kivity @ 2009-01-22 10:36 UTC (permalink / raw)
  To: Anthony Liguori, qemu-devel

Devices accessing large amounts of memory (as with DMA) will wish to obtain
a pointer to guest memory rather than access it indirectly via
cpu_physical_memory_rw().  Add a new API to convert target addresses to
host pointers.

In case the target address does not correspond to RAM, a bounce buffer is
allocated.  To prevent the guest from causing the host to allocate unbounded
amounts of bounce buffer, this memory is limited (currently to one page).

Signed-off-by: Avi Kivity <avi@redhat.com>
---
 cpu-all.h |    6 +++
 exec.c    |  102 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 108 insertions(+), 0 deletions(-)

diff --git a/cpu-all.h b/cpu-all.h
index ee0a6e3..22ffaa7 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -923,6 +923,12 @@ static inline void cpu_physical_memory_write(target_phys_addr_t addr,
 {
     cpu_physical_memory_rw(addr, (uint8_t *)buf, len, 1);
 }
+void *cpu_physical_memory_map(target_phys_addr_t addr,
+                              target_phys_addr_t *plen,
+                              int is_write);
+void cpu_physical_memory_unmap(void *buffer, target_phys_addr_t len,
+                               int is_write, target_phys_addr_t access_len);
+
 uint32_t ldub_phys(target_phys_addr_t addr);
 uint32_t lduw_phys(target_phys_addr_t addr);
 uint32_t ldl_phys(target_phys_addr_t addr);
diff --git a/exec.c b/exec.c
index faa6333..6dd88fc 100644
--- a/exec.c
+++ b/exec.c
@@ -3045,6 +3045,108 @@ void cpu_physical_memory_write_rom(target_phys_addr_t addr,
     }
 }
 
+typedef struct {
+    void *buffer;
+    target_phys_addr_t addr;
+    target_phys_addr_t len;
+} BounceBuffer;
+
+static BounceBuffer bounce;
+
+/* Map a physical memory region into a host virtual address.
+ * May map a subset of the requested range, given by and returned in *plen.
+ * May return NULL if resources needed to perform the mapping are exhausted.
+ * Use only for reads OR writes - not for read-modify-write operations.
+ */
+void *cpu_physical_memory_map(target_phys_addr_t addr,
+                              target_phys_addr_t *plen,
+                              int is_write)
+{
+    target_phys_addr_t len = *plen;
+    target_phys_addr_t done = 0;
+    int l;
+    uint8_t *ret = NULL;
+    uint8_t *ptr;
+    target_phys_addr_t page;
+    unsigned long pd;
+    PhysPageDesc *p;
+    unsigned long addr1;
+
+    while (len > 0) {
+        page = addr & TARGET_PAGE_MASK;
+        l = (page + TARGET_PAGE_SIZE) - addr;
+        if (l > len)
+            l = len;
+        p = phys_page_find(page >> TARGET_PAGE_BITS);
+        if (!p) {
+            pd = IO_MEM_UNASSIGNED;
+        } else {
+            pd = p->phys_offset;
+        }
+
+        if ((pd & ~TARGET_PAGE_MASK) != IO_MEM_RAM) {
+            if (done || bounce.buffer) {
+                break;
+            }
+            bounce.buffer = qemu_memalign(TARGET_PAGE_SIZE, TARGET_PAGE_SIZE);
+            bounce.addr = addr;
+            bounce.len = l;
+            if (!is_write) {
+                cpu_physical_memory_rw(addr, bounce.buffer, l, 0);
+            }
+            ptr = bounce.buffer;
+        } else {
+            addr1 = (pd & TARGET_PAGE_MASK) + (addr & ~TARGET_PAGE_MASK);
+            ptr = phys_ram_base + addr1;
+        }
+        if (!done) {
+            ret = ptr;
+        } else if (ret + done != ptr) {
+            break;
+        }
+
+        len -= l;
+        addr += l;
+        done += l;
+    }
+    *plen = done;
+    return ret;
+}
+
+/* Unmaps a memory region previously mapped by cpu_physical_memory_map().
+ * Will also mark the memory as dirty if is_write == 1.  access_len gives
+ * the amount of memory that was actually read or written by the caller.
+ */
+void cpu_physical_memory_unmap(void *buffer, target_phys_addr_t len,
+                               int is_write, target_phys_addr_t access_len)
+{
+    if (buffer != bounce.buffer) {
+        if (is_write) {
+            unsigned long addr1 = (uint8_t *)buffer - phys_ram_base;
+            while (access_len) {
+                unsigned l;
+                l = TARGET_PAGE_SIZE;
+                if (l > access_len)
+                    l = access_len;
+                if (!cpu_physical_memory_is_dirty(addr1)) {
+                    /* invalidate code */
+                    tb_invalidate_phys_page_range(addr1, addr1 + l, 0);
+                    /* set dirty bit */
+                    phys_ram_dirty[addr1 >> TARGET_PAGE_BITS] |=
+                        (0xff & ~CODE_DIRTY_FLAG);
+                }
+                addr1 += l;
+                access_len -= l;
+            }
+        }
+        return;
+    }
+    if (is_write) {
+        cpu_physical_memory_write(bounce.addr, bounce.buffer, access_len);
+    }
+    qemu_free(bounce.buffer);
+    bounce.buffer = NULL;
+}
 
 /* warning: addr must be aligned */
 uint32_t ldl_phys(target_phys_addr_t addr)
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-22 10:36 ` [Qemu-devel] [PATCH 1/5] Add target memory mapping API Avi Kivity
@ 2009-01-22 12:24   ` Ian Jackson
  0 siblings, 0 replies; 47+ messages in thread
From: Ian Jackson @ 2009-01-22 12:24 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity writes ("[Qemu-devel] [PATCH 1/5] Add target memory mapping API"):
> Devices accessing large amounts of memory (as with DMA) will wish to obtain
> a pointer to guest memory rather than access it indirectly via
> cpu_physical_memory_rw().  Add a new API to convert target addresses to
> host pointers.
> 
> In case the target address does not correspond to RAM, a bounce buffer is
> allocated.  To prevent the guest from causing the host to allocate unbounded
> amounts of bounce buffer, this memory is limited (currently to one page).
> 
> Signed-off-by: Avi Kivity <avi@redhat.com>
...
> +void cpu_physical_memory_unmap(void *buffer, target_phys_addr_t len,
> +                               int is_write, target_phys_addr_t access_len)

Great, thanks for adding the access_len.

Ian.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Qemu-devel] [PATCH 2/5] Add map client retry notification
  2009-01-22 10:36 [Qemu-devel] [PATCH 0/5] Direct memory access for devices (v2) Avi Kivity
  2009-01-22 10:36 ` [Qemu-devel] [PATCH 1/5] Add target memory mapping API Avi Kivity
@ 2009-01-22 10:36 ` Avi Kivity
  2009-01-22 12:30   ` Ian Jackson
  2009-01-22 10:36 ` [Qemu-devel] [PATCH 3/5] I/O vector helpers Avi Kivity
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 47+ messages in thread
From: Avi Kivity @ 2009-01-22 10:36 UTC (permalink / raw)
  To: Anthony Liguori, qemu-devel

The target memory mapping API may fail if the bounce buffer resources
are exhausted.  Add a notification mechanism to allow clients to retry
the mapping operation when resources become available again.

Signed-off-by: Avi Kivity <avi@redhat.com>
---
 cpu-all.h |    2 ++
 exec.c    |   40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+), 0 deletions(-)

diff --git a/cpu-all.h b/cpu-all.h
index 22ffaa7..e71bd06 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -928,6 +928,8 @@ void *cpu_physical_memory_map(target_phys_addr_t addr,
                               int is_write);
 void cpu_physical_memory_unmap(void *buffer, target_phys_addr_t len,
                                int is_write, target_phys_addr_t access_len);
+void *cpu_register_map_client(void *opaque, void (*callback)(void *opaque));
+void cpu_unregister_map_client(void *cookie);
 
 uint32_t ldub_phys(target_phys_addr_t addr);
 uint32_t lduw_phys(target_phys_addr_t addr);
diff --git a/exec.c b/exec.c
index 6dd88fc..56e5e48 100644
--- a/exec.c
+++ b/exec.c
@@ -3053,10 +3053,49 @@ typedef struct {
 
 static BounceBuffer bounce;
 
+typedef struct MapClient {
+    void *opaque;
+    void (*callback)(void *opaque);
+    LIST_ENTRY(MapClient) link;
+} MapClient;
+
+static LIST_HEAD(map_client_list, MapClient) map_client_list
+    = LIST_HEAD_INITIALIZER(map_client_list);
+
+void *cpu_register_map_client(void *opaque, void (*callback)(void *opaque))
+{
+    MapClient *client = qemu_malloc(sizeof(*client));
+
+    client->opaque = opaque;
+    client->callback = callback;
+    LIST_INSERT_HEAD(&map_client_list, client, link);
+    return client;
+}
+
+void cpu_unregister_map_client(void *_client)
+{
+    MapClient *client = (MapClient *)_client;
+
+    LIST_REMOVE(client, link);
+}
+
+static void cpu_notify_map_clients(void)
+{
+    MapClient *client;
+
+    while (!LIST_EMPTY(&map_client_list)) {
+        client = LIST_FIRST(&map_client_list);
+        client->callback(client->opaque);
+        LIST_REMOVE(client, link);
+    }
+}
+
 /* Map a physical memory region into a host virtual address.
  * May map a subset of the requested range, given by and returned in *plen.
  * May return NULL if resources needed to perform the mapping are exhausted.
  * Use only for reads OR writes - not for read-modify-write operations.
+ * Use cpu_register_map_client() to know when retrying the map operation is
+ * likely to succeed.
  */
 void *cpu_physical_memory_map(target_phys_addr_t addr,
                               target_phys_addr_t *plen,
@@ -3146,6 +3185,7 @@ void cpu_physical_memory_unmap(void *buffer, target_phys_addr_t len,
     }
     qemu_free(bounce.buffer);
     bounce.buffer = NULL;
+    cpu_notify_map_clients();
 }
 
 /* warning: addr must be aligned */
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 2/5] Add map client retry notification
  2009-01-22 10:36 ` [Qemu-devel] [PATCH 2/5] Add map client retry notification Avi Kivity
@ 2009-01-22 12:30   ` Ian Jackson
  2009-01-22 18:51     ` Anthony Liguori
  0 siblings, 1 reply; 47+ messages in thread
From: Ian Jackson @ 2009-01-22 12:30 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity writes ("[Qemu-devel] [PATCH 2/5] Add map client retry notification"):
> The target memory mapping API may fail if the bounce buffer resources
> are exhausted.  Add a notification mechanism to allow clients to retry
> the mapping operation when resources become available again.

Does this API not suffer from the potential deadlock described by
Anthony ?

Imagine that for some reason bounce buffers are in use.  If we have a
client which wants to do a single writev on a tap device it will even
deadlock by itself:

  map(<block 0>) succeeds
  map(<block 1>) fails, NULL
  register_map_client

but the callback will never happen because the client is effectively
waiting for itself to release its own mapping.

Since callers cannot assume that they can map more than one range at
once (since there's only one bounce buffer), any caller which needs to
do scatter-gather (like a tap device, as Anthony points out) needs to
invent its own bounce buffers.  That seems like a waste of effort.

There should be a single bounce buffer fallback mechanism, and it
should be sufficiently powerful that it can be used for tap devices,
which means that the calling device emulation must present a single
scatter-gather list to the API all in one go.

Ian.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 2/5] Add map client retry notification
  2009-01-22 12:30   ` Ian Jackson
@ 2009-01-22 18:51     ` Anthony Liguori
  0 siblings, 0 replies; 47+ messages in thread
From: Anthony Liguori @ 2009-01-22 18:51 UTC (permalink / raw)
  To: qemu-devel

Ian Jackson wrote:
> Avi Kivity writes ("[Qemu-devel] [PATCH 2/5] Add map client retry notification"):
>   
>> The target memory mapping API may fail if the bounce buffer resources
>> are exhausted.  Add a notification mechanism to allow clients to retry
>> the mapping operation when resources become available again.
>>     
>
> Does this API not suffer from the potential deadlock described by
> Anthony ?
>
> Imagine that for some reason bounce buffers are in use.  If we have a
> client which wants to do a single writev on a tap device it will even
> deadlock by itself:
>
>   map(<block 0>) succeeds
>   map(<block 1>) fails, NULL
>   register_map_client
>
> but the callback will never happen because the client is effectively
> waiting for itself to release its own mapping.
>   

Yes, a client is not allowed to do this.  To put it another way (and 
perhaps this needs to be documented), register_map_client can only be 
used safely if a client unmaps all of it's existing mappings.

> Since callers cannot assume that they can map more than one range at
> once (since there's only one bounce buffer), any caller which needs to
> do scatter-gather (like a tap device, as Anthony points out) needs to
> invent its own bounce buffers.  That seems like a waste of effort.
>   

It needs to be able to fall back to something like cpu_physical_memory_rw.

> There should be a single bounce buffer fallback mechanism, and it
> should be sufficiently powerful that it can be used for tap devices,
> which means that the calling device emulation must present a single
> scatter-gather list to the API all in one go.
>   

You could have an API like:

try_to_map_or_bounce(list-of-phys-iovecs, buffer-to-bounce-to, callback, 
opaque);

That would be a nice addition for packet IO devices.  Better yet, it 
should be:

try_to_map_or_bounce(map-func, unmap-func, iofunc, opaque, 
list-of-phys-iovecs, buffer-to-bounce-to)

If you go back and look at my previous mails about packet helpers and 
stream helpers, that's just about the signature of my proposed packet 
helper.  Like I mentioned earlier, I definitely think we should have 
such a thing.

Regards,

Anthony Liguori

> Ian.
>
>
>   

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Qemu-devel] [PATCH 3/5] I/O vector helpers
  2009-01-22 10:36 [Qemu-devel] [PATCH 0/5] Direct memory access for devices (v2) Avi Kivity
  2009-01-22 10:36 ` [Qemu-devel] [PATCH 1/5] Add target memory mapping API Avi Kivity
  2009-01-22 10:36 ` [Qemu-devel] [PATCH 2/5] Add map client retry notification Avi Kivity
@ 2009-01-22 10:36 ` Avi Kivity
  2009-01-22 10:36 ` [Qemu-devel] [PATCH 4/5] Vectored block device API Avi Kivity
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 47+ messages in thread
From: Avi Kivity @ 2009-01-22 10:36 UTC (permalink / raw)
  To: Anthony Liguori, qemu-devel

In general, it is not possible to predict the size of of an I/O vector since
a contiguous guest region may map to a disconiguous host region.  Add some
helpers to manage I/O vector growth.

Signed-off-by: Avi Kivity <avi@redhat.com>
---
 cutils.c      |   47 +++++++++++++++++++++++++++++++++++++++++++++++
 qemu-common.h |   12 ++++++++++++
 2 files changed, 59 insertions(+), 0 deletions(-)

diff --git a/cutils.c b/cutils.c
index 9617e08..80a7a1d 100644
--- a/cutils.c
+++ b/cutils.c
@@ -101,3 +101,50 @@ int qemu_fls(int i)
 {
     return 32 - clz32(i);
 }
+
+/* io vectors */
+
+void qemu_iovec_init(QEMUIOVector *qiov, int alloc_hint)
+{
+    qiov->iov = qemu_malloc(alloc_hint * sizeof(struct iovec));
+    qiov->niov = 0;
+    qiov->nalloc = alloc_hint;
+}
+
+void qemu_iovec_add(QEMUIOVector *qiov, void *base, size_t len)
+{
+    if (qiov->niov == qiov->nalloc) {
+        qiov->nalloc = 2 * qiov->nalloc + 1;
+        qiov->iov = qemu_realloc(qiov->iov, qiov->nalloc * sizeof(struct iovec));
+    }
+    qiov->iov[qiov->niov].iov_base = base;
+    qiov->iov[qiov->niov].iov_len = len;
+    ++qiov->niov;
+}
+
+void qemu_iovec_destroy(QEMUIOVector *qiov)
+{
+    qemu_free(qiov->iov);
+}
+
+void qemu_iovec_to_buffer(QEMUIOVector *qiov, void *buf)
+{
+    uint8_t *p = (uint8_t *)buf;
+    int i;
+
+    for (i = 0; i < qiov->niov; ++i) {
+        memcpy(p, qiov->iov[i].iov_base, qiov->iov[i].iov_len);
+        p += qiov->iov[i].iov_len;
+    }
+}
+
+void qemu_iovec_from_buffer(QEMUIOVector *qiov, const void *buf)
+{
+    const uint8_t *p = (const uint8_t *)buf;
+    int i;
+
+    for (i = 0; i < qiov->niov; ++i) {
+        memcpy(qiov->iov[i].iov_base, p, qiov->iov[i].iov_len);
+        p += qiov->iov[i].iov_len;
+    }
+}
diff --git a/qemu-common.h b/qemu-common.h
index d83e61b..ae773e0 100644
--- a/qemu-common.h
+++ b/qemu-common.h
@@ -191,6 +191,18 @@ int cpu_load(QEMUFile *f, void *opaque, int version_id);
 /* Force QEMU to stop what it's doing and service IO */
 void qemu_service_io(void);
 
+typedef struct QEMUIOVector {
+    struct iovec *iov;
+    int niov;
+    int nalloc;
+} QEMUIOVector;
+
+void qemu_iovec_init(QEMUIOVector *qiov, int alloc_hint);
+void qemu_iovec_add(QEMUIOVector *qiov, void *base, size_t len);
+void qemu_iovec_destroy(QEMUIOVector *qiov);
+void qemu_iovec_to_buffer(QEMUIOVector *qiov, void *buf);
+void qemu_iovec_from_buffer(QEMUIOVector *qiov, const void *buf);
+
 #endif /* dyngen-exec.h hack */
 
 #endif
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Qemu-devel] [PATCH 4/5] Vectored block device API
  2009-01-22 10:36 [Qemu-devel] [PATCH 0/5] Direct memory access for devices (v2) Avi Kivity
                   ` (2 preceding siblings ...)
  2009-01-22 10:36 ` [Qemu-devel] [PATCH 3/5] I/O vector helpers Avi Kivity
@ 2009-01-22 10:36 ` Avi Kivity
  2009-01-22 10:36 ` [Qemu-devel] [PATCH 5/5] Convert IDE to directly access guest memory Avi Kivity
  2009-01-22 16:59 ` [Qemu-devel] Re: [PATCH 0/5] Direct memory access for devices (v2) Anthony Liguori
  5 siblings, 0 replies; 47+ messages in thread
From: Avi Kivity @ 2009-01-22 10:36 UTC (permalink / raw)
  To: Anthony Liguori, qemu-devel

Most devices that are capable of DMA are also capable of scatter-gather.
With the memory mapping API, this means that the device code needs to be
able to access discontiguous host memory regions.

For block devices, this translates to vectored I/O.  This patch implements
an aynchronous vectored interface for the qemu block devices.  At the moment
all I/O is bounced and submitted through the non-vectored API; in the future
we will convert block devices to natively support vectored I/O wherever
possible.

Signed-off-by: Avi Kivity <avi@redhat.com>
---
 block.c |   68 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 block.h |    8 +++++++
 2 files changed, 76 insertions(+), 0 deletions(-)

diff --git a/block.c b/block.c
index 3250327..f570afc 100644
--- a/block.c
+++ b/block.c
@@ -1246,6 +1246,69 @@ char *bdrv_snapshot_dump(char *buf, int buf_size, QEMUSnapshotInfo *sn)
 /**************************************************************/
 /* async I/Os */
 
+typedef struct VectorTranslationState {
+    QEMUIOVector *iov;
+    uint8_t *bounce;
+    int is_write;
+    BlockDriverAIOCB *aiocb;
+    BlockDriverAIOCB *this_aiocb;
+} VectorTranslationState;
+
+static void bdrv_aio_rw_vector_cb(void *opaque, int ret)
+{
+    VectorTranslationState *s = opaque;
+
+    if (!s->is_write) {
+        qemu_iovec_from_buffer(s->iov, s->bounce);
+    }
+    qemu_free(s->bounce);
+    s->this_aiocb->cb(s->this_aiocb->opaque, ret);
+    qemu_aio_release(s->this_aiocb);
+}
+
+static BlockDriverAIOCB *bdrv_aio_rw_vector(BlockDriverState *bs,
+                                            int64_t sector_num,
+                                            QEMUIOVector *iov,
+                                            int nb_sectors,
+                                            BlockDriverCompletionFunc *cb,
+                                            void *opaque,
+                                            int is_write)
+
+{
+    VectorTranslationState *s = qemu_mallocz(sizeof(*s));
+    BlockDriverAIOCB *aiocb = qemu_aio_get(bs, cb, opaque);
+
+    s->this_aiocb = aiocb;
+    s->iov = iov;
+    s->bounce = qemu_memalign(512, nb_sectors * 512);
+    s->is_write = is_write;
+    if (is_write) {
+        qemu_iovec_to_buffer(s->iov, s->bounce);
+        s->aiocb = bdrv_aio_write(bs, sector_num, s->bounce, nb_sectors,
+                                  bdrv_aio_rw_vector_cb, s);
+    } else {
+        s->aiocb = bdrv_aio_read(bs, sector_num, s->bounce, nb_sectors,
+                                 bdrv_aio_rw_vector_cb, s);
+    }
+    return aiocb;
+}
+
+BlockDriverAIOCB *bdrv_aio_readv(BlockDriverState *bs, int64_t sector_num,
+                                 QEMUIOVector *iov, int nb_sectors,
+                                 BlockDriverCompletionFunc *cb, void *opaque)
+{
+    return bdrv_aio_rw_vector(bs, sector_num, iov, nb_sectors,
+                              cb, opaque, 0);
+}
+
+BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
+                                  QEMUIOVector *iov, int nb_sectors,
+                                  BlockDriverCompletionFunc *cb, void *opaque)
+{
+    return bdrv_aio_rw_vector(bs, sector_num, iov, nb_sectors,
+                              cb, opaque, 1);
+}
+
 BlockDriverAIOCB *bdrv_aio_read(BlockDriverState *bs, int64_t sector_num,
                                 uint8_t *buf, int nb_sectors,
                                 BlockDriverCompletionFunc *cb, void *opaque)
@@ -1294,6 +1357,11 @@ void bdrv_aio_cancel(BlockDriverAIOCB *acb)
 {
     BlockDriver *drv = acb->bs->drv;
 
+    if (acb->cb == bdrv_aio_rw_vector_cb) {
+        VectorTranslationState *s = acb->opaque;
+        acb = s->aiocb;
+    }
+
     drv->bdrv_aio_cancel(acb);
 }
 
diff --git a/block.h b/block.h
index c3314a1..9733409 100644
--- a/block.h
+++ b/block.h
@@ -2,6 +2,7 @@
 #define BLOCK_H
 
 #include "qemu-aio.h"
+#include "qemu-common.h"
 
 /* block.c */
 typedef struct BlockDriver BlockDriver;
@@ -85,6 +86,13 @@ int bdrv_commit(BlockDriverState *bs);
 typedef struct BlockDriverAIOCB BlockDriverAIOCB;
 typedef void BlockDriverCompletionFunc(void *opaque, int ret);
 
+BlockDriverAIOCB *bdrv_aio_readv(BlockDriverState *bs, int64_t sector_num,
+                                 QEMUIOVector *iov, int nb_sectors,
+                                 BlockDriverCompletionFunc *cb, void *opaque);
+BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
+                                  QEMUIOVector *iov, int nb_sectors,
+                                  BlockDriverCompletionFunc *cb, void *opaque);
+
 BlockDriverAIOCB *bdrv_aio_read(BlockDriverState *bs, int64_t sector_num,
                                 uint8_t *buf, int nb_sectors,
                                 BlockDriverCompletionFunc *cb, void *opaque);
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Qemu-devel] [PATCH 5/5] Convert IDE to directly access guest memory
  2009-01-22 10:36 [Qemu-devel] [PATCH 0/5] Direct memory access for devices (v2) Avi Kivity
                   ` (3 preceding siblings ...)
  2009-01-22 10:36 ` [Qemu-devel] [PATCH 4/5] Vectored block device API Avi Kivity
@ 2009-01-22 10:36 ` Avi Kivity
  2009-01-22 16:59 ` [Qemu-devel] Re: [PATCH 0/5] Direct memory access for devices (v2) Anthony Liguori
  5 siblings, 0 replies; 47+ messages in thread
From: Avi Kivity @ 2009-01-22 10:36 UTC (permalink / raw)
  To: Anthony Liguori, qemu-devel

Instead of copying to a temporary buffer, map guest memory for IDE DMA
transactions.

Signed-off-by: Avi Kivity <avi@redhat.com>
---
 hw/ide.c |  133 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 files changed, 117 insertions(+), 16 deletions(-)

diff --git a/hw/ide.c b/hw/ide.c
index 06ac6dc..49c785d 100644
--- a/hw/ide.c
+++ b/hw/ide.c
@@ -422,6 +422,7 @@ typedef struct IDEState {
     int atapi_dma; /* true if dma is requested for the packet cmd */
     /* ATA DMA state */
     int io_buffer_size;
+    QEMUIOVector iovec;
     /* PIO transfer handling */
     int req_nb_sectors; /* number of sectors per interrupt */
     EndTransferFunc *end_transfer_func;
@@ -862,6 +863,66 @@ static void ide_sector_read(IDEState *s)
     }
 }
 
+
+/* return 0 if buffer completed */
+static int dma_buf_prepare(BMDMAState *bm, int is_write)
+{
+    IDEState *s = bm->ide_if;
+    struct {
+        uint32_t addr;
+        uint32_t size;
+    } prd;
+    int l, len;
+    void *mem;
+    target_phys_addr_t l1;
+
+    qemu_iovec_init(&s->iovec, s->nsector / (TARGET_PAGE_SIZE/512) + 1);
+    s->io_buffer_size = 0;
+    for(;;) {
+        if (bm->cur_prd_len == 0) {
+            /* end of table (with a fail safe of one page) */
+            if (bm->cur_prd_last ||
+                (bm->cur_addr - bm->addr) >= 4096)
+                return s->io_buffer_size != 0;
+            cpu_physical_memory_read(bm->cur_addr, (uint8_t *)&prd, 8);
+            bm->cur_addr += 8;
+            prd.addr = le32_to_cpu(prd.addr);
+            prd.size = le32_to_cpu(prd.size);
+            len = prd.size & 0xfffe;
+            if (len == 0)
+                len = 0x10000;
+            bm->cur_prd_len = len;
+            bm->cur_prd_addr = prd.addr;
+            bm->cur_prd_last = (prd.size & 0x80000000);
+        }
+        l = bm->cur_prd_len;
+        if (l > 0) {
+            l1 = l;
+            mem = cpu_physical_memory_map(bm->cur_prd_addr, &l1, is_write);
+            if (!mem) {
+                break;
+            }
+            qemu_iovec_add(&s->iovec, mem, l1);
+            bm->cur_prd_addr += l1;
+            bm->cur_prd_len -= l1;
+            s->io_buffer_size += l1;
+        }
+    }
+    return 1;
+}
+
+static void dma_buf_commit(IDEState *s, int is_write)
+{
+    int i;
+
+    for (i = 0; i < s->iovec.niov; ++i) {
+        cpu_physical_memory_unmap(s->iovec.iov[i].iov_base,
+                                  s->iovec.iov[i].iov_len, is_write,
+                                  s->iovec.iov[i].iov_len);
+    }
+    qemu_iovec_destroy(&s->iovec);
+}
+
 static void ide_dma_error(IDEState *s)
 {
     ide_transfer_stop(s);
@@ -883,10 +944,12 @@ static int ide_handle_write_error(IDEState *s, int error, int op)
         s->bmdma->status |= op;
         vm_stop(0);
     } else {
-        if (op == BM_STATUS_DMA_RETRY)
+        if (op == BM_STATUS_DMA_RETRY) {
+            dma_buf_commit(s, 0);
             ide_dma_error(s);
-        else
+        } else {
             ide_rw_error(s);
+        }
     }
 
     return 1;
@@ -940,6 +1003,39 @@ static int dma_buf_rw(BMDMAState *bm, int is_write)
     return 1;
 }
 
+typedef struct {
+    BMDMAState *bm;
+    void (*cb)(void *opaque, int ret);
+    QEMUBH *bh;
+} MapFailureContinuation;
+
+static void reschedule_dma(void *opaque)
+{
+    MapFailureContinuation *cont = opaque;
+
+    cont->cb(cont->bm, 0);
+    qemu_bh_delete(cont->bh);
+    qemu_free(cont);
+}
+
+static void continue_after_map_failure(void *opaque)
+{
+    MapFailureContinuation *cont = opaque;
+
+    cont->bh = qemu_bh_new(reschedule_dma, opaque);
+    qemu_bh_schedule(cont->bh);
+}
+
+static void wait_for_bounce_buffer(BMDMAState *bmdma,
+                                   void (*cb)(void *opaque, int ret))
+{
+    MapFailureContinuation *cont = qemu_malloc(sizeof(*cont));
+
+    cont->bm = bmdma;
+    cont->cb = cb;
+    cpu_register_map_client(cont, continue_after_map_failure);
+}
+
 static void ide_read_dma_cb(void *opaque, int ret)
 {
     BMDMAState *bm = opaque;
@@ -948,6 +1044,7 @@ static void ide_read_dma_cb(void *opaque, int ret)
     int64_t sector_num;
 
     if (ret < 0) {
+        dma_buf_commit(s, 1);
 	ide_dma_error(s);
 	return;
     }
@@ -955,11 +1052,10 @@ static void ide_read_dma_cb(void *opaque, int ret)
     n = s->io_buffer_size >> 9;
     sector_num = ide_get_sector(s);
     if (n > 0) {
+        dma_buf_commit(s, 1);
         sector_num += n;
         ide_set_sector(s, sector_num);
         s->nsector -= n;
-        if (dma_buf_rw(bm, 1) == 0)
-            goto eot;
     }
 
     /* end of transfer ? */
@@ -977,15 +1073,19 @@ static void ide_read_dma_cb(void *opaque, int ret)
 
     /* launch next transfer */
     n = s->nsector;
-    if (n > IDE_DMA_BUF_SECTORS)
-        n = IDE_DMA_BUF_SECTORS;
     s->io_buffer_index = 0;
     s->io_buffer_size = n * 512;
+    if (dma_buf_prepare(bm, 1) == 0)
+        goto eot;
+    if (!s->iovec.niov) {
+        wait_for_bounce_buffer(bm, ide_read_dma_cb);
+        return;
+    }
 #ifdef DEBUG_AIO
     printf("aio_read: sector_num=%" PRId64 " n=%d\n", sector_num, n);
 #endif
-    bm->aiocb = bdrv_aio_read(s->bs, sector_num, s->io_buffer, n,
-                              ide_read_dma_cb, bm);
+    bm->aiocb = bdrv_aio_readv(s->bs, sector_num, &s->iovec, n,
+                               ide_read_dma_cb, bm);
     ide_dma_submit_check(s, ide_read_dma_cb, bm);
 }
 
@@ -1081,6 +1181,7 @@ static void ide_write_dma_cb(void *opaque, int ret)
     n = s->io_buffer_size >> 9;
     sector_num = ide_get_sector(s);
     if (n > 0) {
+        dma_buf_commit(s, 0);
         sector_num += n;
         ide_set_sector(s, sector_num);
         s->nsector -= n;
@@ -1099,20 +1200,20 @@ static void ide_write_dma_cb(void *opaque, int ret)
         return;
     }
 
-    /* launch next transfer */
     n = s->nsector;
-    if (n > IDE_DMA_BUF_SECTORS)
-        n = IDE_DMA_BUF_SECTORS;
-    s->io_buffer_index = 0;
     s->io_buffer_size = n * 512;
-
-    if (dma_buf_rw(bm, 0) == 0)
+    /* launch next transfer */
+    if (dma_buf_prepare(bm, 0) == 0)
         goto eot;
+    if (!s->iovec.niov) {
+        wait_for_bounce_buffer(bm, ide_write_dma_cb);
+        return;
+    }
 #ifdef DEBUG_AIO
     printf("aio_write: sector_num=%" PRId64 " n=%d\n", sector_num, n);
 #endif
-    bm->aiocb = bdrv_aio_write(s->bs, sector_num, s->io_buffer, n,
-                               ide_write_dma_cb, bm);
+    bm->aiocb = bdrv_aio_writev(s->bs, sector_num, &s->iovec, n,
+                                ide_write_dma_cb, bm);
     ide_dma_submit_check(s, ide_write_dma_cb, bm);
 }
 
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Qemu-devel] Re: [PATCH 0/5] Direct memory access for devices (v2)
  2009-01-22 10:36 [Qemu-devel] [PATCH 0/5] Direct memory access for devices (v2) Avi Kivity
                   ` (4 preceding siblings ...)
  2009-01-22 10:36 ` [Qemu-devel] [PATCH 5/5] Convert IDE to directly access guest memory Avi Kivity
@ 2009-01-22 16:59 ` Anthony Liguori
  5 siblings, 0 replies; 47+ messages in thread
From: Anthony Liguori @ 2009-01-22 16:59 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel

Avi Kivity wrote:
> One of the deficiencies of the current device layer is that it can only access
> guest RAM via cpu_physical_memory_rw().  This means that the device emulation
> code must copy the memory to or from a temporary buffer, even though the host
> offers APIs which allow direct access to memory.  This reduces efficiency on
> DMA capable devices, especially disks.
>
> This patchset introduces a complement to the read/write API,
> cpu_physical_memory_map() which allows device emulation code to map
> guest memory directly.  The API bounces memory regions which cannot be
> mapped (such as mmio regions) using an internal buffer.
>
> As an example, IDE emulation is converted to use the new API.  This exposes
> another deficiency: lack of scatter/gather support in the block layer.  To
> work around this, a vectored block API is introduced, currently emulated
> by bouncing.  Additional work is needed to convert all block format drivers
> to use the vectored API.
>   

Applied all.  Thanks.

Regards,

Anthony Liguori

> Changes from v1:
>  - documented memory mapping API
>  - added access_len parameter to unmap operation, to indicate how much
>    memory was actually accessed
>  - move QEMUIOVector to cutils.c, and add flatten/unflatten operations
>  - change block format driver API to accept a QEMUIOVector rather than a
>    bare struct iovec
>
> Avi Kivity (5):
>   Add target memory mapping API
>   Add map client retry notification
>   I/O vector helpers
>   Vectored block device API
>   Convert IDE to directly access guest memory
>
>  block.c       |   68 +++++++++++++++++++++++++++
>  block.h       |    8 +++
>  cpu-all.h     |    8 +++
>  cutils.c      |   47 +++++++++++++++++++
>  exec.c        |  142 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/ide.c      |  133 +++++++++++++++++++++++++++++++++++++++++++++++------
>  qemu-common.h |   12 +++++
>  7 files changed, 402 insertions(+), 16 deletions(-)
>
>
>   

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Qemu-devel] [PATCH 0/5] Direct memory access for devices
@ 2009-01-18 19:53 Avi Kivity
  2009-01-18 19:53 ` [Qemu-devel] [PATCH 1/5] Add target memory mapping API Avi Kivity
  0 siblings, 1 reply; 47+ messages in thread
From: Avi Kivity @ 2009-01-18 19:53 UTC (permalink / raw)
  To: qemu-devel, Anthony Liguori

One of the deficiencies of the current device layer is that it can only access
guest RAM via cpu_physical_memory_rw().  This means that the device emulation
code must copy the memory to or from a temporary buffer, even though the host
offers APIs which allow direct access to memory.  This reduces efficiency on
DMA capable devices, especially disks.

This patchset introduces a complement to the read/write API,
cpu_physical_memory_map() which allows device emulation code to map
guest memory directly.  The API bounces memory regions which cannot be
mapped (such as mmio regions) using an internal buffer.

As an example, IDE emulation is converted to use the new API.  This exposes
another deficiency: lack of scatter/gather support in the block layer.  To
work around this, a vectored block API is introduced, currently emulated
by bouncing.  Additional work is needed to convert all block format drivers
to use the vectored API.

Avi Kivity (5):
  Add target memory mapping API
  Add map client retry notification
  Vectored block device API
  I/O vector helpers
  Convert IDE to directly access guest memory

 block.c       |   92 ++++++++++++++++++++++++++++++++++++++++
 block.h       |    7 +++
 cpu-all.h     |    8 ++++
 exec.c        |  131 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/ide.c      |  127 +++++++++++++++++++++++++++++++++++++++++++++++++------
 qemu-common.h |   10 ++++
 vl.c          |   25 +++++++++++
 7 files changed, 386 insertions(+), 14 deletions(-)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-18 19:53 [Qemu-devel] [PATCH 0/5] Direct memory access for devices Avi Kivity
@ 2009-01-18 19:53 ` Avi Kivity
  2009-01-19 13:49   ` Ian Jackson
  0 siblings, 1 reply; 47+ messages in thread
From: Avi Kivity @ 2009-01-18 19:53 UTC (permalink / raw)
  To: qemu-devel, Anthony Liguori

Devices accessing large amounts of memory (as with DMA) will wish to obtain
a pointer to guest memory rather than access it indirectly via
cpu_physical_memory_rw().  Add a new API to convert target addresses to
host pointers.

In case the target address does not correspond to RAM, a bounce buffer is
allocated.  To prevent the guest from causing the host to allocate unbounded
amounts of bounce buffer, this memory is limited (currently to one page).

Signed-off-by: Avi Kivity <avi@redhat.com>
---
 cpu-all.h |    5 +++
 exec.c    |   93 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 98 insertions(+), 0 deletions(-)

diff --git a/cpu-all.h b/cpu-all.h
index ee0a6e3..3439999 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -923,6 +923,11 @@ static inline void cpu_physical_memory_write(target_phys_addr_t addr,
 {
     cpu_physical_memory_rw(addr, (uint8_t *)buf, len, 1);
 }
+void *cpu_physical_memory_map(target_phys_addr_t addr,
+                              target_phys_addr_t *plen,
+                              int is_write);
+void cpu_physical_memory_unmap(void *buffer, target_phys_addr_t len,
+                               int is_write);
 uint32_t ldub_phys(target_phys_addr_t addr);
 uint32_t lduw_phys(target_phys_addr_t addr);
 uint32_t ldl_phys(target_phys_addr_t addr);
diff --git a/exec.c b/exec.c
index faa6333..7162271 100644
--- a/exec.c
+++ b/exec.c
@@ -3045,6 +3045,99 @@ void cpu_physical_memory_write_rom(target_phys_addr_t addr,
     }
 }
 
+typedef struct {
+    void *buffer;
+    target_phys_addr_t addr;
+    target_phys_addr_t len;
+} BounceBuffer;
+
+static BounceBuffer bounce;
+
+void *cpu_physical_memory_map(target_phys_addr_t addr,
+                              target_phys_addr_t *plen,
+                              int is_write)
+{
+    target_phys_addr_t len = *plen;
+    target_phys_addr_t done = 0;
+    int l;
+    uint8_t *ret = NULL;
+    uint8_t *ptr;
+    target_phys_addr_t page;
+    unsigned long pd;
+    PhysPageDesc *p;
+    unsigned long addr1;
+
+    while (len > 0) {
+        page = addr & TARGET_PAGE_MASK;
+        l = (page + TARGET_PAGE_SIZE) - addr;
+        if (l > len)
+            l = len;
+        p = phys_page_find(page >> TARGET_PAGE_BITS);
+        if (!p) {
+            pd = IO_MEM_UNASSIGNED;
+        } else {
+            pd = p->phys_offset;
+        }
+
+        if ((pd & ~TARGET_PAGE_MASK) != IO_MEM_RAM) {
+            if (done || bounce.buffer) {
+                break;
+            }
+            bounce.buffer = qemu_memalign(TARGET_PAGE_SIZE, TARGET_PAGE_SIZE);
+            bounce.addr = addr;
+            bounce.len = l;
+            if (!is_write) {
+                cpu_physical_memory_rw(addr, bounce.buffer, l, 0);
+            }
+            ptr = bounce.buffer;
+        } else {
+            addr1 = (pd & TARGET_PAGE_MASK) + (addr & ~TARGET_PAGE_MASK);
+            ptr = phys_ram_base + addr1;
+        }
+        if (!done) {
+            ret = ptr;
+        } else if (ret + done != ptr) {
+            break;
+        }
+
+        len -= l;
+        addr += l;
+        done += l;
+    }
+    *plen = done;
+    return ret;
+}
+
+void cpu_physical_memory_unmap(void *buffer, target_phys_addr_t len,
+                               int is_write)
+{
+    if (buffer != bounce.buffer) {
+        if (is_write) {
+            unsigned long addr1 = (uint8_t *)buffer - phys_ram_base;
+            do {
+                unsigned l;
+                l = TARGET_PAGE_SIZE;
+                if (l > len)
+                    l = len;
+                if (!cpu_physical_memory_is_dirty(addr1)) {
+                    /* invalidate code */
+                    tb_invalidate_phys_page_range(addr1, addr1 + len, 0);
+                    /* set dirty bit */
+                    phys_ram_dirty[addr1 >> TARGET_PAGE_BITS] |=
+                        (0xff & ~CODE_DIRTY_FLAG);
+                }
+                addr1 += l;
+                len -= l;
+            } while (len);
+        }
+        return;
+    }
+    if (is_write) {
+        cpu_physical_memory_write(bounce.addr, bounce.buffer, bounce.len);
+    }
+    qemu_free(bounce.buffer);
+    bounce.buffer = NULL;
+}
 
 /* warning: addr must be aligned */
 uint32_t ldl_phys(target_phys_addr_t addr)
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-18 19:53 ` [Qemu-devel] [PATCH 1/5] Add target memory mapping API Avi Kivity
@ 2009-01-19 13:49   ` Ian Jackson
  2009-01-19 14:54     ` Avi Kivity
  2009-01-19 15:05     ` Gerd Hoffmann
  0 siblings, 2 replies; 47+ messages in thread
From: Ian Jackson @ 2009-01-19 13:49 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity writes ("[Qemu-devel] [PATCH 1/5] Add target memory mapping API"):
> Devices accessing large amounts of memory (as with DMA) will wish to obtain
> a pointer to guest memory rather than access it indirectly via
> cpu_physical_memory_rw().  Add a new API to convert target addresses to
> host pointers.

In the Xen qemu tree we have alternative implementations of
cpu_physical_memory_{read,write}.  So we will need our own
implementation of this too.

So it is important to us that this interface is sufficient for its
callers (so that it doesn't need to change in the future), sufficient
for various different implementations (so that we can implement our
version), and correct in the sense that it is possible to write
correct code which uses it.

So as a result I have some comments about this API.  In summary I'm
afraid I think it needs a bit more work.  Perhaps we could have a
discussion about what this API should look like, by waving function
prototypes about (rather than complete patchsets) ?

Also there are some questions which your interface specification
leaves unanswered; these questions about the API semantics should
really be addressed in comments by the declaration.

> +void *cpu_physical_memory_map(target_phys_addr_t addr,
> +                              target_phys_addr_t *plen,
> +                              int is_write);

If the caller specifies is_write==1, are they entitled to read from
the buffer and expect it to contain something reasonable ?  In your
implementation, the answer appears to be `no'.

Is the caller allowed to assume that the accesses to the underlying
memory happen in any particular order - in particular, that they
happen in the same order as the caller's accesses to the buffer ?  Is
the caller allowed to assume that the same number of accesses to the
underlying memory happen as are made by the caller to the buffer ?
No, because if there is a bounce buffer the caller's accesses are
irrelevant.

What happens if the caller maps with is_write==1, writes part of the
buffer, and then unmaps ?  In your implementation undefined data is
written to guest memory.  I think that since not all callers can
guarantee to wish to write to the whole region, this means that the
interface is defective.  Callers need to be required to specify which
areas they have written.

The interface when cpu_physical_memory_map returns 0 is strange.
Normally everything in qemu is done with completion callbacks, but
here we have a kind of repeated polling.

This function should return a separate handle as well as the physical
memory pointer.  That will make it much easier to provide an
implementation which permits multiple bounce buffers or multiple
mappings simultaneously.

So I would prefer something like

  #define CPUPHYSMEMMAPFLAG_READABLE    001
  #define CPUPHYSMEMMAPFLAG_WRITABLE    002

  #define CPUPHYSMEMMAPFLAG_ALLOWFAIL   010
     /* If _ALLOWFAIL set then on callback, buffer may be NULL, in
      * which case caller should use cpu_physical_memory_{read,write} */

  typedef struct {
    /* to be filled in by caller before calling _map: */
    unsigned flags;
    target_phys_addr_t addr, len; /* len is updated by _map */
    /* filled in by _map: */
    void *buffer;
    /* private to _map et al: */
  } CpuPhysicalMemoryMapping;

  void cpu_physical_memory_map(CpuPhysicalMemoryMapping*;
      CpuPhysicalMemoryMapCallback *cb, void *cb_arg);
    /* There may be a limit on the number of concurrent maps
     * and the limit may be as low as one. */

  typedef void CpuPhysicalMemoryMapCallback(CpuPhysicalMemoryMapping*,
      void *cb_arg);

  void cpu_physical_memory_map_notify_read(CpuPhysicalMemoryMapping*,
      target_phys_addr_t addr, target_phys_addr_t *plen);
    /* must be called before read accesses to any buffer region */

  void cpu_physical_memory_map_notify_write(CpuPhysicalMemoryMapping*,
      target_phys_addr_t addr, target_phys_addr_t *plen);
    /* should be called after write accesses to any buffer region
     * unless it is OK for it to be indeterminate whether the
     * guest's memory actually gets written */

  void cpu_physical_memory_unmap(CpuPhysicalMemoryMapping*);
    /* Guest memory may be read and written out of order or even a
     * different number of times. */

Ian.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 13:49   ` Ian Jackson
@ 2009-01-19 14:54     ` Avi Kivity
  2009-01-19 15:39       ` Anthony Liguori
  2009-01-19 16:40       ` Ian Jackson
  2009-01-19 15:05     ` Gerd Hoffmann
  1 sibling, 2 replies; 47+ messages in thread
From: Avi Kivity @ 2009-01-19 14:54 UTC (permalink / raw)
  To: qemu-devel; +Cc: Ian Jackson

Ian Jackson wrote:
> Avi Kivity writes ("[Qemu-devel] [PATCH 1/5] Add target memory mapping API"):
>   
>> Devices accessing large amounts of memory (as with DMA) will wish to obtain
>> a pointer to guest memory rather than access it indirectly via
>> cpu_physical_memory_rw().  Add a new API to convert target addresses to
>> host pointers.
>>     
>
> In the Xen qemu tree we have alternative implementations of
> cpu_physical_memory_{read,write}.  So we will need our own
> implementation of this too.
>
> So it is important to us that this interface is sufficient for its
> callers (so that it doesn't need to change in the future), sufficient
> for various different implementations (so that we can implement our
> version), and correct in the sense that it is possible to write
> correct code which uses it.
>   

Sure.

> So as a result I have some comments about this API.  In summary I'm
> afraid I think it needs a bit more work.  Perhaps we could have a
> discussion about what this API should look like, by waving function
> prototypes about (rather than complete patchsets) ?
>
> Also there are some questions which your interface specification
> leaves unanswered; these questions about the API semantics should
> really be addressed in comments by the declaration.
>
>   
>> +void *cpu_physical_memory_map(target_phys_addr_t addr,
>> +                              target_phys_addr_t *plen,
>> +                              int is_write);
>>     
>
> If the caller specifies is_write==1, are they entitled to read from
> the buffer and expect it to contain something reasonable ?  In your
> implementation, the answer appears to be `no'.
>   

Correct.  If you need to perform read-modify-write, you need to use 
cpu_physical_memory_rw(), twice.  If we ever want to support RMW, we'll 
need to add another value for is_write.  I don't think we have 
interesting devices at this point which require efficient RMW.

> Is the caller allowed to assume that the accesses to the underlying
> memory happen in any particular order - in particular, that they
> happen in the same order as the caller's accesses to the buffer ?  Is
> the caller allowed to assume that the same number of accesses to the
> underlying memory happen as are made by the caller to the buffer ?
> No, because if there is a bounce buffer the caller's accesses are
> irrelevant.
>   

Right.  Moreover, the access sizes will be swallowed by the API; devices 
are allowed to provide different implementations for different access sizes.

> What happens if the caller maps with is_write==1, writes part of the
> buffer, and then unmaps ?  In your implementation undefined data is
> written to guest memory.  I think that since not all callers can
> guarantee to wish to write to the whole region, this means that the
> interface is defective.  Callers need to be required to specify which
> areas they have written.
>   

Alternatively, only use this interface with devices where this doesn't 
matter.  Given that bouncing happens for mmio only, this would be all 
devices which you'd want to use this interface with anyway.

(I'm assuming that you'll implement the fastpath by directly mapping 
guest memory, not bouncing).

> The interface when cpu_physical_memory_map returns 0 is strange.
> Normally everything in qemu is done with completion callbacks, but
> here we have a kind of repeated polling.
>   

I agree.  This was done at Anthony's request so I'll defer the response 
to him.

A variant of this API (posted by Andrea) hid all of the scheduling away 
within the implementation.

> This function should return a separate handle as well as the physical
> memory pointer.  That will make it much easier to provide an
> implementation which permits multiple bounce buffers or multiple
> mappings simultaneously.
>   

The downside to a separate handle is that device emulation code will now 
need to maintain the handle in addition to the the virtual address.  
Since the addresses will typically be maintained in an iovec, this means 
another array to be allocated and resized.

The design goals here were to keep things as simple as possible for the 
fast path.  Since the API fits all high-bandwidth devices that I know 
of, I don't think it's a good tradeoff to make the API more complex in 
order to be applicable to some corner cases.

Provided you implement the RAM case by mmap()ing, not bouncing, I think 
it will fit Xen quite well.  You might need to return NULL when the map 
cache is exhausted, but callers will not need to be modified.

>
> So I would prefer something like
>
>   #define CPUPHYSMEMMAPFLAG_READABLE    001
>   #define CPUPHYSMEMMAPFLAG_WRITABLE    002
>
>   #define CPUPHYSMEMMAPFLAG_ALLOWFAIL   010
>      /* If _ALLOWFAIL set then on callback, buffer may be NULL, in
>       * which case caller should use cpu_physical_memory_{read,write} */
>
>   typedef struct {
>     /* to be filled in by caller before calling _map: */
>     unsigned flags;
>     target_phys_addr_t addr, len; /* len is updated by _map */
>     /* filled in by _map: */
>     void *buffer;
>     /* private to _map et al: */
>   } CpuPhysicalMemoryMapping;
>
>   void cpu_physical_memory_map(CpuPhysicalMemoryMapping*;
>       CpuPhysicalMemoryMapCallback *cb, void *cb_arg);
>     /* There may be a limit on the number of concurrent maps
>      * and the limit may be as low as one. */
>
>   typedef void CpuPhysicalMemoryMapCallback(CpuPhysicalMemoryMapping*,
>       void *cb_arg);
>
>   void cpu_physical_memory_map_notify_read(CpuPhysicalMemoryMapping*,
>       target_phys_addr_t addr, target_phys_addr_t *plen);
>     /* must be called before read accesses to any buffer region */
>
>   void cpu_physical_memory_map_notify_write(CpuPhysicalMemoryMapping*,
>       target_phys_addr_t addr, target_phys_addr_t *plen);
>     /* should be called after write accesses to any buffer region
>      * unless it is OK for it to be indeterminate whether the
>      * guest's memory actually gets written */
>
>   void cpu_physical_memory_unmap(CpuPhysicalMemoryMapping*);
>     /* Guest memory may be read and written out of order or even a
>      * different number of times. */
>   

If we go in this direction the API should manage scatter/gather vectors, 
not single mappings.  This will remove the need for clients to store 
handles.

I suggest you look at Andrea's patchset for a very different approach.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 14:54     ` Avi Kivity
@ 2009-01-19 15:39       ` Anthony Liguori
  2009-01-19 16:18         ` Paul Brook
  2009-01-19 16:57         ` Ian Jackson
  2009-01-19 16:40       ` Ian Jackson
  1 sibling, 2 replies; 47+ messages in thread
From: Anthony Liguori @ 2009-01-19 15:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: Ian Jackson, Avi Kivity

Avi Kivity wrote:
>> The interface when cpu_physical_memory_map returns 0 is strange.
>> Normally everything in qemu is done with completion callbacks, but
>> here we have a kind of repeated polling.
>>   
>
> I agree.  This was done at Anthony's request so I'll defer the 
> response to him.

There are two distinct IO consumers of this API: streaming IO and packet IO.

For streaming IO, you have a model of:

process_data:
while (offset < size) {
  data = map(offset, &len)
  if (data) {
     do IO on (data, len)
     unmap(data)
     offset += len;
  } else
     break
}

if (offset < size)
   add callback for mappability to process_data

I agree that this model could be formalized into something that took a 
'do IO on (data, len)' actor.  In fact, since map() and unmap() are 
pretty generic, they too could be actors.  This would then work for CPU 
memory IO, PCI memory IO, etc.

The packet IO API is a bit different.  It looks like:

while (offset < size) {
  data = map(offset, &len)
  if (data == NULL)
     break;
  sg[n_sg].iov_base = data;
  sg[n_sg].iov_len = len;
  n_sg++;
  offset += len;
}

if (offset < len) {
  for (i = 0; i < n_sg; i++)
     unmap(sg[i].iov_base);
   sg[0].iov_base = alloc_buffer(size);
   sg[0].iov_len = size;
   cpu_physical_memory_rw(sg[0].iov_base, size);
}

do IO on (sg)

if (offset < len) {
  free(sg[0].iov_base);
}

In this case, it isn't useful to get a callback with some of the packet 
data.  You need to know up front whether you can map all of the packet 
data.  In fact, a callback API doesn't really work because it implies 
that at the end of the callback, you either release the data or that the 
next callback could not be invoked until you unmap a previous data.

So this is why I prefer the map() API, as it accommodates two distinct 
users in a way that the callback API wouldn't.  We can formalize these 
idioms into an API, of course.

BTW, to support this model, we have to reserve at least one bounce 
buffer for cpu_physical_memory_rw.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 15:39       ` Anthony Liguori
@ 2009-01-19 16:18         ` Paul Brook
  2009-01-19 16:33           ` Anthony Liguori
  2009-01-19 16:57         ` Ian Jackson
  1 sibling, 1 reply; 47+ messages in thread
From: Paul Brook @ 2009-01-19 16:18 UTC (permalink / raw)
  To: qemu-devel; +Cc: Ian Jackson, Avi Kivity

>[block iterator]
> I agree that this model could be formalized into something that took a
> 'do IO on (data, len)' actor.  In fact, since map() and unmap() are
> pretty generic, they too could be actors.  This would then work for CPU
> memory IO, PCI memory IO, etc.
>
> The packet IO API is a bit different.  It looks like:
>...
> if (offset < len) {
>    sg[0].iov_base = alloc_buffer(size);
>    cpu_physical_memory_rw(sg[0].iov_base, size);
>
> In this case, it isn't useful to get a callback with some of the packet
> data.  You need to know up front whether you can map all of the packet
> data.  In fact, a callback API doesn't really work because it implies
> that at the end of the callback, you either release the data or that the
> next callback could not be invoked until you unmap a previous data.

It looks like what you're actually doing is pushing the bounce buffer 
allocation into the individual packet consumers.

Maybe a solution to this is a 'do IO on IOVEC' actor, with an additional flag 
that says whether it is acceptable to split the allocation. That way both 
block and packet interfaces use the same API, and avoids proliferation of 
manual bounce buffers in packet devices.

Paul

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 16:18         ` Paul Brook
@ 2009-01-19 16:33           ` Anthony Liguori
  2009-01-19 16:39             ` Avi Kivity
  0 siblings, 1 reply; 47+ messages in thread
From: Anthony Liguori @ 2009-01-19 16:33 UTC (permalink / raw)
  To: Paul Brook; +Cc: Ian Jackson, qemu-devel, Avi Kivity

Paul Brook wrote:
> It looks like what you're actually doing is pushing the bounce buffer 
> allocation into the individual packet consumers.
>
> Maybe a solution to this is a 'do IO on IOVEC' actor, with an additional flag 
> that says whether it is acceptable to split the allocation. That way both 
> block and packet interfaces use the same API, and avoids proliferation of 
> manual bounce buffers in packet devices.
>   

I think there may be utility in having packet devices provide the bounce 
buffers, in which case, you could probably unique both into a single 
function with a flag.  But why not just have two separate functions?

Those two functions can live in exec.c too.  The nice thing about using 
map() is that it's easily overriden and chained.  So what I'm proposing.

cpu_physical_memory_map()
cpu_physical_memory_unmap()
do_streaming_IO(map, unmap, ioworker, opaque);
do_packet_IO(map, unmap, buffer, size, ioworker, opaque);

Additional helpers that pass c_p_m_map/unmap would also be acceptable.

Regards,

Anthony Liguori

> Paul
>   

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 16:33           ` Anthony Liguori
@ 2009-01-19 16:39             ` Avi Kivity
  2009-01-19 19:15               ` Anthony Liguori
  0 siblings, 1 reply; 47+ messages in thread
From: Avi Kivity @ 2009-01-19 16:39 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Ian Jackson, Paul Brook, qemu-devel

Anthony Liguori wrote:
> Paul Brook wrote:
>> It looks like what you're actually doing is pushing the bounce buffer 
>> allocation into the individual packet consumers.
>>
>> Maybe a solution to this is a 'do IO on IOVEC' actor, with an 
>> additional flag that says whether it is acceptable to split the 
>> allocation. That way both block and packet interfaces use the same 
>> API, and avoids proliferation of manual bounce buffers in packet 
>> devices.
>>   
>
> I think there may be utility in having packet devices provide the 
> bounce buffers, in which case, you could probably unique both into a 
> single function with a flag.  But why not just have two separate 
> functions?
>
> Those two functions can live in exec.c too.  The nice thing about 
> using map() is that it's easily overriden and chained.  So what I'm 
> proposing.
>
> cpu_physical_memory_map()
> cpu_physical_memory_unmap()

This should be the baseline API with the rest using it.

> do_streaming_IO(map, unmap, ioworker, opaque);

Why pass map and unmap?

grant based devices needn't go through this at all, since you never mix 
grants and physical addresses, and since grants never need bouncing.

> do_packet_IO(map, unmap, buffer, size, ioworker, opaque);

If you pass the buffer then the device needs to allocate large amounts 
of bounce memory.


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 16:39             ` Avi Kivity
@ 2009-01-19 19:15               ` Anthony Liguori
  2009-01-20 10:09                 ` Avi Kivity
  0 siblings, 1 reply; 47+ messages in thread
From: Anthony Liguori @ 2009-01-19 19:15 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Ian Jackson, Paul Brook, qemu-devel

Avi Kivity wrote:
> Anthony Liguori wrote:
>> Paul Brook wrote:
>>> It looks like what you're actually doing is pushing the bounce 
>>> buffer allocation into the individual packet consumers.
>>>
>>> Maybe a solution to this is a 'do IO on IOVEC' actor, with an 
>>> additional flag that says whether it is acceptable to split the 
>>> allocation. That way both block and packet interfaces use the same 
>>> API, and avoids proliferation of manual bounce buffers in packet 
>>> devices.
>>>   
>>
>> I think there may be utility in having packet devices provide the 
>> bounce buffers, in which case, you could probably unique both into a 
>> single function with a flag.  But why not just have two separate 
>> functions?
>>
>> Those two functions can live in exec.c too.  The nice thing about 
>> using map() is that it's easily overriden and chained.  So what I'm 
>> proposing.
>>
>> cpu_physical_memory_map()
>> cpu_physical_memory_unmap()
>
> This should be the baseline API with the rest using it.

Yup.

>> do_streaming_IO(map, unmap, ioworker, opaque);
>
> Why pass map and unmap?

Because we'll eventually have:

pci_device_memory_map()
pci_device_memory_unmap()

In the simplest case, pci_device_memory_map() just calls 
cpu_physical_memory_map().  But it may do other things.

> grant based devices needn't go through this at all, since you never 
> mix grants and physical addresses, and since grants never need bouncing.

So the grant map/unmap function doesn't need to deal with calling 
cpu_physical_memory_map/unmap.  You could still use the above API or 
not.  It's hard to say.

>> do_packet_IO(map, unmap, buffer, size, ioworker, opaque);
>
> If you pass the buffer then the device needs to allocate large amounts 
> of bounce memory.

If do_packet_IO took a buffer, instead of calling alloc_buffer(size) 
when map fails (you run out of bounce memory), you simply use buffer.  
Otherwise, alloc_buffer() must be able to allocate enough memory to 
satisfy any request.

Since each packet device knows it's maximum size up front, it makes 
sense for the device to allocate it.  You could also not care and just 
trust that callers do the right thing.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 19:15               ` Anthony Liguori
@ 2009-01-20 10:09                 ` Avi Kivity
  0 siblings, 0 replies; 47+ messages in thread
From: Avi Kivity @ 2009-01-20 10:09 UTC (permalink / raw)
  To: qemu-devel; +Cc: Ian Jackson, Paul Brook

Anthony Liguori wrote:
>>> Those two functions can live in exec.c too.  The nice thing about 
>>> using map() is that it's easily overriden and chained.  So what I'm 
>>> proposing.
>>>
>>> cpu_physical_memory_map()
>>> cpu_physical_memory_unmap()
>>
>> This should be the baseline API with the rest using it.
>
> Yup.
>
>>> do_streaming_IO(map, unmap, ioworker, opaque);
>>
>> Why pass map and unmap?
>
> Because we'll eventually have:
>
> pci_device_memory_map()
> pci_device_memory_unmap()
>
> In the simplest case, pci_device_memory_map() just calls 
> cpu_physical_memory_map().  But it may do other things.

Let's start without those callbacks, until those "other things" reveal 
themselves.

>>> do_packet_IO(map, unmap, buffer, size, ioworker, opaque);
>>
>> If you pass the buffer then the device needs to allocate large 
>> amounts of bounce memory.
>
> If do_packet_IO took a buffer, instead of calling alloc_buffer(size) 
> when map fails (you run out of bounce memory), you simply use buffer.  
> Otherwise, alloc_buffer() must be able to allocate enough memory to 
> satisfy any request.
>
> Since each packet device knows it's maximum size up front, it makes 
> sense for the device to allocate it.  You could also not care and just 
> trust that callers do the right thing.
>

Pushing things to the caller complicates all callers.  This isn't the 
ideal solution.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 15:39       ` Anthony Liguori
  2009-01-19 16:18         ` Paul Brook
@ 2009-01-19 16:57         ` Ian Jackson
  2009-01-19 19:23           ` Anthony Liguori
  1 sibling, 1 reply; 47+ messages in thread
From: Ian Jackson @ 2009-01-19 16:57 UTC (permalink / raw)
  To: qemu-devel

Anthony Liguori writes ("Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API"):
> The packet IO API is a bit different.  It looks like:

The purpose here is to be able to make only one system call to the
host kernel in order to do an operation which involves a scatter
gather list in guest physical memory (as provided by the guest to eg
an emulated DMA controller) ?

And the idea is to try to map as much of that contiguously as
possible so that only one system call need be made ?

> while (offset < size) {
>   data = map(offset, &len)
>   if (data == NULL)
>      break;
>   sg[n_sg].iov_base = data;
>   sg[n_sg].iov_len = len;
>   n_sg++;
>   offset += len;
> }
...
> if (offset < len) {
>   for (i = 0; i < n_sg; i++)
>      unmap(sg[i].iov_base);
>    sg[0].iov_base = alloc_buffer(size);
>    sg[0].iov_len = size;
>    cpu_physical_memory_rw(sg[0].iov_base, size);
> }

Why is it necessary for there only to be one of these

> do IO on (sg)

calls ?

Are there supposed to be fast-path devices where we absolutely must
make the host system call for the whole transfer in one go, in one
contiguous memory region ?

Obviously for the fast path to be actually fast the whole mapping must
succeed as requested, but this could be achieved by an interface where
the mapping caller provided a scatter gather list in guest memory:

  typedef struct {
    target_phys_addr_t addr, len;
  } CpuPhysicalMemoryMappingEntry;

  typedef struct {
    /* to be filled in by caller before calling _map: */
    unsigned flags;
    const CpuPhysicalMemoryMappingEntry *sg_list;
    target_phys_addr_t total_len; /* gives sg_list length; updated by _map */
    /* filled in by _map: */
    void *buffer;
    /* private to _map et al: */
  } CpuPhysicalMemoryMapping;

  void cpu_physical_memory_map(CpuPhysicalMemoryMapping*;
      CpuPhysicalMemoryMapCallback *cb, void *cb_arg);
    /* There may be a limit on the number of concurrent maps
     * and the limit may be as low as one. */

If a caller doesn't want to deal with a complete sg_list then they do
  typedef struct {
    ...
    CpuPhysicalMemoryMapping dma_mapping;
    CpuPhysicalMemoryMappingRequest dma_request;
    ...
  } SomePCIDevice;
and do
  ourself->dma_mapping.sg_list = &ourself->dma_request;
  ourself->dma_mapping.total_len = ourself->dma_request.len;

> So this is why I prefer the map() API, as it accommodates two distinct 
> users in a way that the callback API wouldn't.  We can formalize these 
> idioms into an API, of course.

I don't think there is any fundamental difference between a callback
API and a polling API; you can implement whatever semantics you like
with either.

But callbacks are needed in at least some cases because of the way
that the bounce buffer may need to be reserved/released.  That means
all of the callers have to deal with callbacks anyway.

So it makes sense to make that the only code path.  That way callers
only need to be written once.

> BTW, to support this model, we have to reserve at least one bounce 
> buffer for cpu_physical_memory_rw.

Yes.

Ian.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 16:57         ` Ian Jackson
@ 2009-01-19 19:23           ` Anthony Liguori
  2009-01-20 10:17             ` Avi Kivity
  2009-01-20 14:18             ` Ian Jackson
  0 siblings, 2 replies; 47+ messages in thread
From: Anthony Liguori @ 2009-01-19 19:23 UTC (permalink / raw)
  To: qemu-devel

Ian Jackson wrote:
> Anthony Liguori writes ("Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API"):
>   
>> The packet IO API is a bit different.  It looks like:
>>     
>
> The purpose here is to be able to make only one system call to the
> host kernel in order to do an operation which involves a scatter
> gather list in guest physical memory (as provided by the guest to eg
> an emulated DMA controller) ?
>
> And the idea is to try to map as much of that contiguously as
> possible so that only one system call need be made ?
>   

No, it's not an issue of performance, it's an issue of correctness.  A 
single readv/writev system call to a tap file descriptor corresponds to 
a single packet.  You cannot split a packet into multiple read/write 
operations to tap IIUC.

> Are there supposed to be fast-path devices where we absolutely must
> make the host system call for the whole transfer in one go, in one
> contiguous memory region ?
>   

It's a matter of correctness.  Packet protocols (datagram protocols if 
you will) must preserve message boundaries.

>> So this is why I prefer the map() API, as it accommodates two distinct 
>> users in a way that the callback API wouldn't.  We can formalize these 
>> idioms into an API, of course.
>>     
>
> I don't think there is any fundamental difference between a callback
> API and a polling API; you can implement whatever semantics you like
> with either.
>
> But callbacks are needed in at least some cases because of the way
> that the bounce buffer may need to be reserved/released.  That means
> all of the callers have to deal with callbacks anyway.
>
> So it makes sense to make that the only code path.  That way callers
> only need to be written once.
>   

If you use callbacks, consider the following:

Scenario 1 (no callbacks)

1) Attempt to map all data in packet
2) Failure occurs b/c bounce buffer is too small to contain packet
3) We can't do zero-copy IO here, so fall back to a copy to a device 
supplied buffer
4) Transmit full packet

Scenario 2 (callbacks)

1) Register callback and begin data mapping
2) Succeed in mapping first part of packet and 1 page of bounced buffer
3) Bounce buffer is exhausted, callback will not be invoked until the 
next unmap
4) Deadlock

You could add some sort of call to determine if the bounce memory is 
exhausted and try to take some corrective action in #2 but it gets ugly.

So as I said earlier, I think the callback model makes sense when you 
have streaming data (where there are no packet boundaries).  I think 
it's pretty difficult to avoid deadlocking though with callbacks in a 
packet model (where boundaries must be preserved).

Regards,

Anthony Liguori

>> BTW, to support this model, we have to reserve at least one bounce 
>> buffer for cpu_physical_memory_rw.
>>     
>
> Yes.
>
> Ian.
>
>
>   

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 19:23           ` Anthony Liguori
@ 2009-01-20 10:17             ` Avi Kivity
  2009-01-20 14:18             ` Ian Jackson
  1 sibling, 0 replies; 47+ messages in thread
From: Avi Kivity @ 2009-01-20 10:17 UTC (permalink / raw)
  To: qemu-devel

Anthony Liguori wrote:
>
> Scenario 2 (callbacks)
>
0) Register minimum required bounce memory at init time
> 1) Register callback and begin data mapping
> 2) Succeed in mapping first part of packet and 1 page of bounced buffer
> 3) Bounce buffer is exhausted, callback will not be invoked until the 
> next unmap
> 4) Deadlock
>
> You could add some sort of call to determine if the bounce memory is 
> exhausted and try to take some corrective action in #2 but it gets ugly.
>
> So as I said earlier, I think the callback model makes sense when you 
> have streaming data (where there are no packet boundaries).  I think 
> it's pretty difficult to avoid deadlocking though with callbacks in a 
> packet model (where boundaries must be preserved).
>

If you specify the boundaries (an additional input array, or maybe a 
function that says whether a particular point is a "good enough" 
boundary) then I think it can be workable.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 19:23           ` Anthony Liguori
  2009-01-20 10:17             ` Avi Kivity
@ 2009-01-20 14:18             ` Ian Jackson
  1 sibling, 0 replies; 47+ messages in thread
From: Ian Jackson @ 2009-01-20 14:18 UTC (permalink / raw)
  To: qemu-devel

Anthony Liguori writes ("Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API"):
> No, it's not an issue of performance, it's an issue of correctness.  A 
> single readv/writev system call to a tap file descriptor corresponds to 
> a single packet.  You cannot split a packet into multiple read/write 
> operations to tap IIUC.

Ah, right.  I still think that can be accomodated in a callback style.

As I said:
> > But callbacks are needed in at least some cases because of the way
> > that the bounce buffer may need to be reserved/released.  That means
> > all of the callers have to deal with callbacks anyway.
> >
> > So it makes sense to make that the only code path.  That way callers
> > only need to be written once.

You countered with:

> If you use callbacks, consider the following:
...
> Scenario 2 (callbacks)
> [problem description]

Yes, I see this problem but you are assuming _multiple_ callbacks.

Instead, we could just have a single callback for the entire mapping
attempt.  Then we're back to the sequence of events in your scenario
1; the only difference is the function call graph.

Ian.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 14:54     ` Avi Kivity
  2009-01-19 15:39       ` Anthony Liguori
@ 2009-01-19 16:40       ` Ian Jackson
  2009-01-19 17:28         ` Avi Kivity
  1 sibling, 1 reply; 47+ messages in thread
From: Ian Jackson @ 2009-01-19 16:40 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity writes ("Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API"):
> Ian Jackson wrote:
> > So it is important to us that this interface is [nice]
>
> Sure.

Thanks :-).

> Correct.  If you need to perform read-modify-write, you need to use 
> cpu_physical_memory_rw(), twice.  If we ever want to support RMW, we'll 
> need to add another value for is_write.  I don't think we have 
> interesting devices at this point which require efficient RMW.

Efficient read-modify-write may be very hard for some setups to
achieve.  It can't be done with the bounce buffer implementation.
I think ond good rule of thumb would be to make sure that the interface
as specified can be implemented in terms of cpu_physical_memory_rw.

> Right.  Moreover, the access sizes will be swallowed by the API; devices 
> are allowed to provide different implementations for different access sizes.

Yes, that's another important fact about the API.

> > What happens if the caller maps with is_write==1, writes part of the
> > buffer, and then unmaps ?  In your implementation undefined data is
> > written to guest memory.  I think that since not all callers can
> > guarantee to wish to write to the whole region, this means that the
> > interface is defective.  Callers need to be required to specify which
> > areas they have written.
...
> Alternatively, only use this interface with devices where this doesn't 
> matter.  Given that bouncing happens for mmio only, this would be all 
> devices which you'd want to use this interface with anyway.

That would be one alternative but isn't it the case that (for example)
with a partial DMA completion, the guest can assume that the
supposedly-untouched parts of the DMA target memory actually remain
untouched rather than (say) zeroed ?

In a system where we're trying to do zero copy, we may issue the map
request for a large transfer, before we know how much the host kernel
will actually provide.

> (I'm assuming that you'll implement the fastpath by directly mapping 
> guest memory, not bouncing).

Yes.  We can do that in Xen too but it's less of a priority for us
given that we expect people who really care about performance to
install PV drivers in the guest.

> A variant of this API (posted by Andrea) hid all of the scheduling away 
> within the implementation.

I remember seeing this before but I don't think your previous one
provided a callback for map completion ?  I thought it just blocked
the caller until the map could complete.  That's obviously not ideal.

> > This function should return a separate handle as well as the physical
> > memory pointer.  That will make it much easier to provide an
> > implementation which permits multiple bounce buffers or multiple
> > mappings simultaneously.
> 
> The downside to a separate handle is that device emulation code will now 
> need to maintain the handle in addition to the the virtual address.  
> Since the addresses will typically be maintained in an iovec, this means 
> another array to be allocated and resized.

Err, no, I don't really see that.  In my proposal the `handle' is
actually allocated by the caller.  The implementation provides the
private data and that can be empty.  There is no additional memory
allocation.

> The design goals here were to keep things as simple as possible for the 
> fast path.  Since the API fits all high-bandwidth devices that I know 
> of, I don't think it's a good tradeoff to make the API more complex in 
> order to be applicable to some corner cases.

I think my question about partial DMA writes is very relevant.  If we
don't care about that, nor about the corresponding notification for
reads, then the API can be a lot simpler.

Ian.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 16:40       ` Ian Jackson
@ 2009-01-19 17:28         ` Avi Kivity
  2009-01-19 17:53           ` Ian Jackson
  2009-01-19 18:25           ` Jamie Lokier
  0 siblings, 2 replies; 47+ messages in thread
From: Avi Kivity @ 2009-01-19 17:28 UTC (permalink / raw)
  To: qemu-devel

Ian Jackson wrote:
>> Correct.  If you need to perform read-modify-write, you need to use 
>> cpu_physical_memory_rw(), twice.  If we ever want to support RMW, we'll 
>> need to add another value for is_write.  I don't think we have 
>> interesting devices at this point which require efficient RMW.
>>     
>
> Efficient read-modify-write may be very hard for some setups to
> achieve.  It can't be done with the bounce buffer implementation.
> I think ond good rule of thumb would be to make sure that the interface
> as specified can be implemented in terms of cpu_physical_memory_rw.
>   

What is the motivation for efficient rmw?

>> Alternatively, only use this interface with devices where this doesn't 
>> matter.  Given that bouncing happens for mmio only, this would be all 
>> devices which you'd want to use this interface with anyway.
>>     
>
> That would be one alternative but isn't it the case that (for example)
> with a partial DMA completion, the guest can assume that the
> supposedly-untouched parts of the DMA target memory actually remain
> untouched rather than (say) zeroed ?
>   

For block devices, I don't think it can.  In any case, this will only 
occur with mmio.  I don't think the guest can assume much in such cases.

In fact, we could even say that the virtual hardware doesn't support 
dma-to-mmio at all and MCE the guest.  I'm sure no x86 guest would even 
notice.  Don't know about non-x86.

> In a system where we're trying to do zero copy, we may issue the map
> request for a large transfer, before we know how much the host kernel
> will actually provide.
>
>   

Won't it be at least 1GB?  Partition you requests to that size.

>> (I'm assuming that you'll implement the fastpath by directly mapping 
>> guest memory, not bouncing).
>>     
>
> Yes.  We can do that in Xen too but it's less of a priority for us
> given that we expect people who really care about performance to
> install PV drivers in the guest.
>   

I'm all in favor of accommodating Xen, but as long as you're out-of-tree 
you need to conform to qemu, not the other way around.

>> A variant of this API (posted by Andrea) hid all of the scheduling away 
>> within the implementation.
>>     
>
> I remember seeing this before but I don't think your previous one
> provided a callback for map completion ?  I thought it just blocked
> the caller until the map could complete.  That's obviously not ideal.
>   

It didn't block, it scheduled.

>>> This function should return a separate handle as well as the physical
>>> memory pointer.  That will make it much easier to provide an
>>> implementation which permits multiple bounce buffers or multiple
>>> mappings simultaneously.
>>>       
>> The downside to a separate handle is that device emulation code will now 
>> need to maintain the handle in addition to the the virtual address.  
>> Since the addresses will typically be maintained in an iovec, this means 
>> another array to be allocated and resized.
>>     
>
> Err, no, I don't really see that.  In my proposal the `handle' is
> actually allocated by the caller.  The implementation provides the
> private data and that can be empty.  There is no additional memory
> allocation.
>   

You need to store multiple handles (one per sg element), so you need to 
allocate a variable size vector for it.  Preallocation may be possible 
but perhaps wasteful.

>   
>> The design goals here were to keep things as simple as possible for the 
>> fast path.  Since the API fits all high-bandwidth devices that I know 
>> of, I don't think it's a good tradeoff to make the API more complex in 
>> order to be applicable to some corner cases.
>>     
>
> I think my question about partial DMA writes is very relevant.  If we
> don't care about that, nor about the corresponding notification for
> reads, then the API can be a lot simpler.

I don't see a concrete reason to care about it.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 17:28         ` Avi Kivity
@ 2009-01-19 17:53           ` Ian Jackson
  2009-01-19 18:29             ` Avi Kivity
  2009-01-19 18:25           ` Jamie Lokier
  1 sibling, 1 reply; 47+ messages in thread
From: Ian Jackson @ 2009-01-19 17:53 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity writes ("Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API"):
> Ian Jackson wrote:
> > Efficient read-modify-write may be very hard for some setups to
> > achieve.  It can't be done with the bounce buffer implementation.
> > I think ond good rule of thumb would be to make sure that the interface
> > as specified can be implemented in terms of cpu_physical_memory_rw.
> 
> What is the motivation for efficient rmw?

I think you've misunderstood me.  I don't think there is such a
motivation.  I was saying it was so difficult to implement that we
might as well exclude it.

But whatever we do, the map interface we specify should be
implementable in terms of cpu_physical_memory_rw.  Ie, it should be
possible to implement the map interface with a bounce buffer.  Which
your proposal did except for the partial write case.

> > That would be one alternative but isn't it the case that (for example)
> > with a partial DMA completion, the guest can assume that the
> > supposedly-untouched parts of the DMA target memory actually remain
> > untouched rather than (say) zeroed ?
> 
> For block devices, I don't think it can.

`Block devices' ?  We're talking about (say) IDE controllers here.  I
would be very surprised if an IDE controller used DMA to overwrite RAM
beyond the amount of successful transfer.

If a Unix variant does zero copy IO using DMA direct into process
memory space, then it must even rely on the IDE controller not doing
DMA beyond the end of the successful transfer, as the read(2) API
promises to the calling process that data beyond the successful read
is left untouched.

And even if the IDE spec happily says that the (IDE) host (ie our
guest) is not allowed to assume that that memory (ie the memory beyond
the extent of the successful part of a partially successful transfer)
is unchanged, there will almost certainly be some other IO device on
some some platform that will make that promise.

So we need a call into the DMA API from the device model to say which
regions have actually been touched.

> > In a system where we're trying to do zero copy, we may issue the map
> > request for a large transfer, before we know how much the host kernel
> > will actually provide.
> 
> Won't it be at least 1GB?  Partition you requests to that size.

No, I mean, before we know how much data qemu's read(2) will transfer.

>  In any case, this will only occur with mmio.  I don't think the
> guest can assume much in such cases.

No, it won't only occur with mmio.

In the initial implementation in Xen, we will almost certainly simply
emulate everything with cpu_physical_memory_rw.  So it will happen all
the time.

> > Err, no, I don't really see that.  In my proposal the `handle' is
> > actually allocated by the caller.  The implementation provides the
> > private data and that can be empty.  There is no additional memory
> > allocation.
> 
> You need to store multiple handles (one per sg element), so you need to 
> allocate a variable size vector for it.  Preallocation may be possible 
> but perhaps wasteful.

See my reply to Anthony Ligouri, which shows how this can be avoided.
Since you hope for a single call to map everything, you can do an sg
list with a single handle.

Ian.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 17:53           ` Ian Jackson
@ 2009-01-19 18:29             ` Avi Kivity
  2009-01-20 14:32               ` Ian Jackson
  0 siblings, 1 reply; 47+ messages in thread
From: Avi Kivity @ 2009-01-19 18:29 UTC (permalink / raw)
  To: qemu-devel

Ian Jackson wrote:
>>> Efficient read-modify-write may be very hard for some setups to
>>> achieve.  It can't be done with the bounce buffer implementation.
>>> I think ond good rule of thumb would be to make sure that the interface
>>> as specified can be implemented in terms of cpu_physical_memory_rw.
>>>       
>> What is the motivation for efficient rmw?
>>     
>
> I think you've misunderstood me.  I don't think there is such a
> motivation.  I was saying it was so difficult to implement that we
> might as well exclude it.
>   

Then we agree. The map API is for read OR write operations, not both at 
the same time.

>   
>>> That would be one alternative but isn't it the case that (for example)
>>> with a partial DMA completion, the guest can assume that the
>>> supposedly-untouched parts of the DMA target memory actually remain
>>> untouched rather than (say) zeroed ?
>>>       
>> For block devices, I don't think it can.
>>     
>
> `Block devices' ?  We're talking about (say) IDE controllers here.  I
> would be very surprised if an IDE controller used DMA to overwrite RAM
> beyond the amount of successful transfer.
>
> If a Unix variant does zero copy IO using DMA direct into process
> memory space, then it must even rely on the IDE controller not doing
> DMA beyond the end of the successful transfer, as the read(2) API
> promises to the calling process that data beyond the successful read
> is left untouched.
>
> And even if the IDE spec happily says that the (IDE) host (ie our
> guest) is not allowed to assume that that memory (ie the memory beyond
> the extent of the successful part of a partially successful transfer)
> is unchanged, there will almost certainly be some other IO device on
> some some platform that will make that promise.
>
> So we need a call into the DMA API from the device model to say which
> regions have actually been touched.
>
>   

It's not possible to implement this efficiently. The qemu block layer 
will submit the results of the map operation to the kernel in an async 
zero copy operation. The kernel may break up this operation into several 
parts (if the underlying backing store is fragmented) and submit in 
parallel to the underlying device(s). Those requests will complete 
out-of-order, so you can't guarantee that if an error occurs all memory 
before will have been written and none after.

I really doubt that any guest will be affected by this. It's a tradeoff 
between decent performance and needlessly accurate emulation. I don't 
see how we can choose the latter.

>>> In a system where we're trying to do zero copy, we may issue the map
>>> request for a large transfer, before we know how much the host kernel
>>> will actually provide.
>>>       
>> Won't it be at least 1GB?  Partition you requests to that size.
>>     
>
> No, I mean, before we know how much data qemu's read(2) will transfer.
>   

You don't know afterwards either. Maybe read() is specced as you say, 
but practical implementations will return the minimum bytes read, not exact.

Think software RAID.

>>  In any case, this will only occur with mmio.  I don't think the
>> guest can assume much in such cases.
>>     
>
> No, it won't only occur with mmio.
>
> In the initial implementation in Xen, we will almost certainly simply
> emulate everything with cpu_physical_memory_rw.  So it will happen all
> the time.
>   

Try it out. I'm sure it will work just fine (if incredibly slowly, 
unless you provide multiple bounce buffers).

>>> Err, no, I don't really see that.  In my proposal the `handle' is
>>> actually allocated by the caller.  The implementation provides the
>>> private data and that can be empty.  There is no additional memory
>>> allocation.
>>>       
>> You need to store multiple handles (one per sg element), so you need to 
>> allocate a variable size vector for it.  Preallocation may be possible 
>> but perhaps wasteful.
>>     
>
> See my reply to Anthony Ligouri, which shows how this can be avoided.
> Since you hope for a single call to map everything, you can do an sg
> list with a single handle.
>   

That's a very different API.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 18:29             ` Avi Kivity
@ 2009-01-20 14:32               ` Ian Jackson
  2009-01-20 17:23                 ` Avi Kivity
  0 siblings, 1 reply; 47+ messages in thread
From: Ian Jackson @ 2009-01-20 14:32 UTC (permalink / raw)
  To: qemu-devel

I think the key points in Avi's message are this:

Avi Kivity writes:
> You don't know afterwards either. Maybe read() is specced as you
> say, but practical implementations will return the minimum bytes
> read, not exact.

And this:

> I really doubt that any guest will be affected by this. It's a tradeoff 
> between decent performance and needlessly accurate emulation. I don't 
> see how we can choose the latter.

I don't think this is the right way to analyse this situation.  We are
trying to define a general-purpose DMA API for _all_ emulated devices,
not just the IDE emulation and block devices that you seem to be
considering.

If there is ever any hardware which behaves `properly' with partial
DMA, and any host kernel device which can tell us what succeeded and
what failed, then it is necessary for the DMA API we are now inventing
to allow that device to be properly emulated.

Even if we can't come up with an example right now of such a device
then I would suggest that it's very likely that we will encounter one
eventually.  But actually I can think of one straight away: a SCSI
tapestreamer.  Tapestreamers often give partial transfers at the end
of tapefiles; hosts (ie, qemu guests) talking to the SCSI controller
do not expect the controller to DMA beyond the successful SCSI
transfer length; and the (qemu host's) kernel's read() call will not
overwrite beyond the successful transfer length either.

If it is difficult for a block device to provide the faithful
behaviour then it might be acceptable for the block device to always
indicate to the DMA API that the entire transfer had taken place, even
though actually some of it had failed.

But personally I think you're mistaken about the behaviour of the
(qemu host's) kernel's {aio_,p,}read(2).

> > In the initial implementation in Xen, we will almost certainly simply
> > emulate everything with cpu_physical_memory_rw.  So it will happen all
> > the time.
> 
> Try it out. I'm sure it will work just fine (if incredibly slowly, 
> unless you provide multiple bounce buffers).

It will certainly work except (a) there are partial (interrupted)
transfers and (b) the host relies on the partial DMA not overwriting
more data than it successfully transferred.  So what that means is
that if this introduces bugs they will be very difficult to find in
testing.  I don't think testing is the answer here.

Ian.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-20 14:32               ` Ian Jackson
@ 2009-01-20 17:23                 ` Avi Kivity
  0 siblings, 0 replies; 47+ messages in thread
From: Avi Kivity @ 2009-01-20 17:23 UTC (permalink / raw)
  To: qemu-devel

Ian Jackson wrote:
> I think the key points in Avi's message are this:
>
> Avi Kivity writes:
>   
>> You don't know afterwards either. Maybe read() is specced as you
>> say, but practical implementations will return the minimum bytes
>> read, not exact.
>>     
>
> And this:
>
>   
>> I really doubt that any guest will be affected by this. It's a tradeoff 
>> between decent performance and needlessly accurate emulation. I don't 
>> see how we can choose the latter.
>>     
>
> I don't think this is the right way to analyse this situation.  We are
> trying to define a general-purpose DMA API for _all_ emulated devices,
> not just the IDE emulation and block devices that you seem to be
> considering.
>   

No.  There already exists a general API: cpu_physical_memory_rw().  We 
are trying to define an API which will allow the high-throughput devices 
(IDE, scsi, virtio-blk, virtio-net) to be implemented efficiently.

If device X does not work well with the API, then, unless it's important 
for some reason, it shouldn't use it.  If it is important, we can adapt 
the API then.

> If there is ever any hardware which behaves `properly' with partial
> DMA, and any host kernel device which can tell us what succeeded and
> what failed, then it is necessary for the DMA API we are now inventing
> to allow that device to be properly emulated.
>
> Even if we can't come up with an example right now of such a device
> then I would suggest that it's very likely that we will encounter one
> eventually.  But actually I can think of one straight away: a SCSI
> tapestreamer.  Tapestreamers often give partial transfers at the end
> of tapefiles; hosts (ie, qemu guests) talking to the SCSI controller
> do not expect the controller to DMA beyond the successful SCSI
> transfer length; and the (qemu host's) kernel's read() call will not
> overwrite beyond the successful transfer length either.
>   

That will work out fine as the DMA will be to kernel memory, and read() 
will copy just the interesting parts.

> If it is difficult for a block device to provide the faithful
> behaviour then it might be acceptable for the block device to always
> indicate to the DMA API that the entire transfer had taken place, even
> though actually some of it had failed.
>
> But personally I think you're mistaken about the behaviour of the
> (qemu host's) kernel's {aio_,p,}read(2).
>   

I'm pretty sure reads to software RAIDs will be submitted in parallel.  
If those reads are O_DIRECT, then it's impossible to maintain DMA ordering.

>>> In the initial implementation in Xen, we will almost certainly simply
>>> emulate everything with cpu_physical_memory_rw.  So it will happen all
>>> the time.
>>>       
>> Try it out. I'm sure it will work just fine (if incredibly slowly, 
>> unless you provide multiple bounce buffers).
>>     
>
> It will certainly work except (a) there are partial (interrupted)
> transfers and (b) the host relies on the partial DMA not overwriting
> more data than it successfully transferred.  So what that means is
> that if this introduces bugs they will be very difficult to find in
> testing.  I don't think testing is the answer here.
>   

The only workaround I can think of is not to DMA.  But that will be 
horribly slow.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 17:28         ` Avi Kivity
  2009-01-19 17:53           ` Ian Jackson
@ 2009-01-19 18:25           ` Jamie Lokier
  2009-01-19 18:43             ` Avi Kivity
  2009-01-20 14:44             ` Ian Jackson
  1 sibling, 2 replies; 47+ messages in thread
From: Jamie Lokier @ 2009-01-19 18:25 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity wrote:
> In fact, we could even say that the virtual hardware doesn't support 
> dma-to-mmio at all and MCE the guest.  I'm sure no x86 guest would even 
> notice.  Don't know about non-x86.

Guest userspace does:

   1. mmap() framebuffer device.
   2. read() from file opened with O_DIRECT.

Both are allowed by non-root processes on Linux.

(I imagine this might be more common in some obscure DOS programs though).

Think also variation with reading from a video capture device into
video memory.  I've seen that done on x86, never seen it (yet)
on non-x86 :-)

However, that is known to break on some PCI bridges.

I'm not sure if it's reasonable to abort emulation with an MCE in this
case.

> >I think my question about partial DMA writes is very relevant.  If we
> >don't care about that, nor about the corresponding notification for
> >reads, then the API can be a lot simpler.
> 
> I don't see a concrete reason to care about it.

Writing zeros or junk after a partial DMA is quite different to real
hardware behaviour.  Virtually all devices with a "DMA count"
register are certain to have not written to a later address when DMA stops.

QEMU tries to do a fairly good job at emulating devices with many of
their quirks.  It would be odd if the high-performance API got in the
way of high-quality device emulation, when that's wanted.

Potential example: If a graphics card or video capture card, or USB
webcam etc. (more likely!) is doing a large streaming DMA into a
guests's userspace process when that process calls read() (in the
guest OS), and the DMA is stopped for any reason, such as triggered by
a guest OS SIGINT or simply the data having ended, the guest's
userspace can reasonably assume data after the count returned by
read() is untouched.

Just as importantly, the guest OS in that example can assume that the
later pages are not dirtied, therefore not swap them, or return them
to its pre-zero pool or whatever.  This is a legitimate guest OS
optimisation for streaming-DMA-with-unknown-length devices.  This can
happen without a userspace process too.

I'm guessing truncated DMAs using this API are always going to dirty
only an initial part of the buffer, not arbitrary regions.  (In rare
cases where this isn't true, don't use the API).

So wouldn't it be trivial to pass "amount written" to the unmap
function - to be used in the bounce buffer case?

-- Jamie

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 18:25           ` Jamie Lokier
@ 2009-01-19 18:43             ` Avi Kivity
  2009-01-20 14:49               ` Ian Jackson
  2009-01-20 14:44             ` Ian Jackson
  1 sibling, 1 reply; 47+ messages in thread
From: Avi Kivity @ 2009-01-19 18:43 UTC (permalink / raw)
  To: qemu-devel

Jamie Lokier wrote:
> Avi Kivity wrote:
>   
>> In fact, we could even say that the virtual hardware doesn't support 
>> dma-to-mmio at all and MCE the guest.  I'm sure no x86 guest would even 
>> notice.  Don't know about non-x86.
>>     
>
> Guest userspace does:
>
>    1. mmap() framebuffer device.
>    2. read() from file opened with O_DIRECT.
>
> Both are allowed by non-root processes on Linux.
>
> (I imagine this might be more common in some obscure DOS programs though).
>
> Think also variation with reading from a video capture device into
> video memory.  I've seen that done on x86, never seen it (yet)
> on non-x86 :-)
>
> However, that is known to break on some PCI bridges.
>
> I'm not sure if it's reasonable to abort emulation with an MCE in this
> case.
>
>   

Framebuffers are mapped as RAM, so we won't bounce this case. Try harder :)

>>> I think my question about partial DMA writes is very relevant.  If we
>>> don't care about that, nor about the corresponding notification for
>>> reads, then the API can be a lot simpler.
>>>       
>> I don't see a concrete reason to care about it.
>>     
>
> Writing zeros or junk after a partial DMA is quite different to real
> hardware behaviour.  Virtually all devices with a "DMA count"
> register are certain to have not written to a later address when DMA stops.
>
>   

The devices we're talking about here don't have a DMA count register. 
They are passed scatter-gather lists, and I don't think they make 
guarantees about the order in which they're accessed.

> QEMU tries to do a fairly good job at emulating devices with many of
> their quirks.  It would be odd if the high-performance API got in the
> way of high-quality device emulation, when that's wanted.
>
> Potential example: If a graphics card or video capture card, or USB
> webcam etc. (more likely!) is doing a large streaming DMA into a
> guests's userspace process when that process calls read() (in the
> guest OS), and the DMA is stopped for any reason, such as triggered by
> a guest OS SIGINT or simply the data having ended, the guest's
> userspace can reasonably assume data after the count returned by
> read() is untouched.
>   

This DMA will be into RAM, not mmio.

> Just as importantly, the guest OS in that example can assume that the
> later pages are not dirtied, therefore not swap them, or return them
> to its pre-zero pool or whatever.  This is a legitimate guest OS
> optimisation for streaming-DMA-with-unknown-length devices.  This can
> happen without a userspace process too.
>
> I'm guessing truncated DMAs using this API are always going to dirty
> only an initial part of the buffer, not arbitrary regions.  (In rare
> cases where this isn't true, don't use the API).
>
> So wouldn't it be trivial to pass "amount written" to the unmap
> function - to be used in the bounce buffer case?
>   

We don't have a reliable amount to pass.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 18:43             ` Avi Kivity
@ 2009-01-20 14:49               ` Ian Jackson
  2009-01-20 17:42                 ` Avi Kivity
  0 siblings, 1 reply; 47+ messages in thread
From: Ian Jackson @ 2009-01-20 14:49 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity writes ("Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API"):
> The devices we're talking about here don't have a DMA count register. 

Which devices ?  All devices ever that want to do zero-copy DMA into
the guest ?

> They are passed scatter-gather lists, and I don't think they make 
> guarantees about the order in which they're accessed.

Many devices will be able to make such promises.

> This DMA will be into RAM, not mmio.

As previously discussed, we might be using bounce buffers even for
RAM, depending on the execution model.  You said earlier:

  Try it out. I'm sure it will work just fine (if incredibly slowly, 
  unless you provide multiple bounce buffers).

but here is an example from Jamie of a situation where it won't work
right.

> We don't have a reliable amount to pass.

A device which _really_ doesn't have a reliable amount to pass, and
which is entitled to scribble all over the RAM it was to DMA to even
if it does only a partial transfer, can simply pass the total transfer
length.  That would be no different to your proposal.

Ian.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-20 14:49               ` Ian Jackson
@ 2009-01-20 17:42                 ` Avi Kivity
  2009-01-20 18:08                   ` Jamie Lokier
  2009-01-21 16:50                   ` Ian Jackson
  0 siblings, 2 replies; 47+ messages in thread
From: Avi Kivity @ 2009-01-20 17:42 UTC (permalink / raw)
  To: qemu-devel

Ian Jackson wrote:
> Avi Kivity writes ("Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API"):
>   
>> The devices we're talking about here don't have a DMA count register. 
>>     
>
> Which devices ?  All devices ever that want to do zero-copy DMA into
> the guest ?
>   

IDE, scsi, virtio-blk, virtio-net, e1000, maybe a few more.

>> They are passed scatter-gather lists, and I don't think they make 
>> guarantees about the order in which they're accessed.
>>     
>
> Many devices will be able to make such promises.
>   

If they do, and if guests actually depend on these promises, then we 
will not use the new API until someone is sufficiently motivated to send 
a patch to enable it.

>> This DMA will be into RAM, not mmio.
>>     
>
> As previously discussed, we might be using bounce buffers even for
> RAM, depending on the execution model.  You said earlier:
>
>   Try it out. I'm sure it will work just fine (if incredibly slowly, 
>   unless you provide multiple bounce buffers).
>
> but here is an example from Jamie of a situation where it won't work
> right.
>   

Framebuffers?  Those are RAM.  USB webcams?  These can't be interrupted 
by SIGINT.  Are you saying a guest depends on an O_DIRECT USB transfer 
not affecting memory when a USB cable is pulled out?

>> We don't have a reliable amount to pass.
>>     
>
> A device which _really_ doesn't have a reliable amount to pass, and
> which is entitled to scribble all over the RAM it was to DMA to even
> if it does only a partial transfer, can simply pass the total transfer
> length.  That would be no different to your proposal.
>   

I'm suggesting we do that unconditionally (as my patch does) and only 
add that complexity when we know it's needed for certain.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-20 17:42                 ` Avi Kivity
@ 2009-01-20 18:08                   ` Jamie Lokier
  2009-01-20 20:27                     ` Avi Kivity
  2009-01-21 16:50                   ` Ian Jackson
  1 sibling, 1 reply; 47+ messages in thread
From: Jamie Lokier @ 2009-01-20 18:08 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity wrote:
> Framebuffers?  Those are RAM.  USB webcams?  These can't be interrupted 
> by SIGINT.  Are you saying a guest depends on an O_DIRECT USB transfer 
> not affecting memory when a USB cable is pulled out.

The USB thing is probably more about emulated-UHCI/EHCI pass-through
to real host USB devices which aren't emulated in QEMU.

> >>We don't have a reliable amount to pass.
> >>    
> >A device which _really_ doesn't have a reliable amount to pass, and
> >which is entitled to scribble all over the RAM it was to DMA to even
> >if it does only a partial transfer, can simply pass the total transfer
> >length.  That would be no different to your proposal.
> >  
> 
> I'm suggesting we do that unconditionally (as my patch does) and only 
> add that complexity when we know it's needed for certain.

Fair enough.  Things can be added if needed.  But please make it real
clear in the DMA API comments that the whole buffer may be
overwritten.  If a guest actually does depend on this, it won't show
up in testing because the whole buffer won't be overwritten except
when a bounce copy is used.

Linux itself had some issues with _its_ DMA API recently: people have
been writing drivers using the Linux DMA API making broken assumptions
that happen to work on x86, and work some of the time on other
architectures.  These things don't show up during driver tests, and
are very difficult to track down later.  I suspect "overwrites the
remaining buffer with randomness/zeros but only under bounce-buffer
conditions" will be similarly unlikely to trigger, and very difficult
to track down if it causes a problem anywhere.

The recent solution in Linux is to add some debugging options which
check it's used correctly, even on x86.

Here's a final thought, to do with performance:

e1000, configured for jumbo frames, receives logs of small packets,
and the DMA subsystem has to bounce-copy the data for some reason (Ian
suggested maybe always doing that with Xen for DMA to guest RAM?)

Without a length passed to unmap, won't it copy 65536 bytes per packet
(most from cold cache) because that's the amount set up for DMA to
receive from /dev/tap, instead of 256 or 1514 bytes per packet which
is their actual size?

-- Jamie

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-20 18:08                   ` Jamie Lokier
@ 2009-01-20 20:27                     ` Avi Kivity
  2009-01-21 16:53                       ` Ian Jackson
  0 siblings, 1 reply; 47+ messages in thread
From: Avi Kivity @ 2009-01-20 20:27 UTC (permalink / raw)
  To: qemu-devel

Jamie Lokier wrote:
> Linux itself had some issues with _its_ DMA API recently: people have
> been writing drivers using the Linux DMA API making broken assumptions
> that happen to work on x86, and work some of the time on other
> architectures.  These things don't show up during driver tests, and
> are very difficult to track down later.  I suspect "overwrites the
> remaining buffer with randomness/zeros but only under bounce-buffer
> conditions" will be similarly unlikely to trigger, and very difficult
> to track down if it causes a problem anywhere.
>   

This reminds me -- Gleb pointed out that Linux itself bounces some DMA 
operations, and will copy the entire range.  So Linux itself doesn't 
satisfy these requirements.

> The recent solution in Linux is to add some debugging options which
> check it's used correctly, even on x86.
>
> Here's a final thought, to do with performance:
>
> e1000, configured for jumbo frames, receives logs of small packets,
> and the DMA subsystem has to bounce-copy the data for some reason (Ian
> suggested maybe always doing that with Xen for DMA to guest RAM?)
>
> Without a length passed to unmap, won't it copy 65536 bytes per packet
> (most from cold cache) because that's the amount set up for DMA to
> receive from /dev/tap, instead of 256 or 1514 bytes per packet which
> is their actual size?
>   

It will.  If we bounce.  We'll never bounce in real life.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-20 20:27                     ` Avi Kivity
@ 2009-01-21 16:53                       ` Ian Jackson
  0 siblings, 0 replies; 47+ messages in thread
From: Ian Jackson @ 2009-01-21 16:53 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity writes ("Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API"):
> This reminds me -- Gleb pointed out that Linux itself bounces some DMA 
> operations, and will copy the entire range.  So Linux itself doesn't 
> satisfy these requirements.

What you mean is that Linux drivers are not supposed to rely on this
behaviour if they might execute in Linux systems with more than 4G of
RAM, and that any Linux drivers which do are buggy.

That's not the same as saying that there are no Linux systems which do
not rely on partial DMA transfers not overwriting beyond the transfer
length.  For example, there might easily be <4G systems with buggy
drivers where bounce buffers are never used.

> > Without a length passed to unmap, won't it copy 65536 bytes per packet
> > (most from cold cache) because that's the amount set up for DMA to
> > receive from /dev/tap, instead of 256 or 1514 bytes per packet which
> > is their actual size?
> 
> It will.  If we bounce.  We'll never bounce in real life.

I have explained why this is not necessarily true.

Ian.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-20 17:42                 ` Avi Kivity
  2009-01-20 18:08                   ` Jamie Lokier
@ 2009-01-21 16:50                   ` Ian Jackson
  2009-01-21 17:18                     ` Avi Kivity
  1 sibling, 1 reply; 47+ messages in thread
From: Ian Jackson @ 2009-01-21 16:50 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity writes ("Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API"):
> Ian Jackson wrote:
> > Which devices ?  All devices ever that want to do zero-copy DMA into
> > the guest ?
> 
> IDE, scsi, virtio-blk, virtio-net, e1000, maybe a few more.

Yesterday I produced the example of a SCSI tape drive, which is
vanishingly unlikely to results in writes past the actual transfer
length since the drive definitely produces all of the data in order.

> >> They are passed scatter-gather lists, and I don't think they make 
> >> guarantees about the order in which they're accessed.
> >
> > Many devices will be able to make such promises.
> 
> If they do, and if guests actually depend on these promises, then we 
> will not use the new API until someone is sufficiently motivated to send 
> a patch to enable it.

As I have already pointed out, we won't discover that any guest
depends on those promises in testing, because it's the kind of thing
that will only happen in practice with reasonably obscure situations
including some error conditions.

So "let's only do this if we discover we need it" is not good enough.
We won't know that we need it.  What will probably happen is that some
user somewhere who is already suffering from some kind of problem will
experience additional apparently-random corruption.  Naturally that's
not going to result in a good bug report.

Even from our point of view as the programmers this isn't a good
approach because the proposed fix is an API and API change.  What
you're suggesting is that we introduce a bug, and wait and see if it
bites anyone, in the full knowledge that by then fixing the bug will
involve either widespread changes to all of the DMA API users or
changing a particular driver to be much slower.

> >> This DMA will be into RAM, not mmio.
> >
> > As previously discussed, we might be using bounce buffers even for
> > RAM, depending on the execution model.  You said earlier:
> >
> >   Try it out. I'm sure it will work just fine (if incredibly slowly, 
> >   unless you provide multiple bounce buffers).
> >
> > but here is an example from Jamie of a situation where it won't work
> > right.
> 
> Framebuffers?  Those are RAM.  USB webcams?  These can't be interrupted 
> by SIGINT.  Are you saying a guest depends on an O_DIRECT USB transfer 
> not affecting memory when a USB cable is pulled out?

No, as I said earlier, and as you appeared to accept, it is quite
possible that in some uses of the qemu code - including some uses of
Xen - _all_ DMA will go through bounce buffers.

> >> We don't have a reliable amount to pass.
> >
> > A device which _really_ doesn't have a reliable amount to pass, and
> > which is entitled to scribble all over the RAM it was to DMA to even
> > if it does only a partial transfer, can simply pass the total transfer
> > length.  That would be no different to your proposal.
> 
> I'm suggesting we do that unconditionally (as my patch does) and only 
> add that complexity when we know it's needed for certain.

At the moment there are no such devices (your claims about ide
notwithstanding) but I think it will be easier to argue about the
specific case after we have agreed on a non-deficient API.

Ian.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-21 16:50                   ` Ian Jackson
@ 2009-01-21 17:18                     ` Avi Kivity
  2009-01-21 21:54                       ` Anthony Liguori
  0 siblings, 1 reply; 47+ messages in thread
From: Avi Kivity @ 2009-01-21 17:18 UTC (permalink / raw)
  To: qemu-devel

Ian Jackson wrote:
> Avi Kivity writes ("Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API"):
>   
>> Ian Jackson wrote:
>>     
>>> Which devices ?  All devices ever that want to do zero-copy DMA into
>>> the guest ?
>>>       
>> IDE, scsi, virtio-blk, virtio-net, e1000, maybe a few more.
>>     
>
> Yesterday I produced the example of a SCSI tape drive, which is
> vanishingly unlikely to results in writes past the actual transfer
> length since the drive definitely produces all of the data in order.
>   

And I explained that it's very unlikely to ever be noticed by a guest, 
since the dma will happen into kernel memory (which will be clobbered), 
but the subsequent copying into userspace will use the correct size.  I 
also pointed out that the holy kernel itself might use bounce buffers 
and disregard the actually copied size.

If you're into accurate emulation but not into performance, use 
cpu_physical_memory_rw().  This API is optional, for performance minded 
implementations.

> As I have already pointed out, we won't discover that any guest
> depends on those promises in testing, because it's the kind of thing
> that will only happen in practice with reasonably obscure situations
> including some error conditions.
>
> So "let's only do this if we discover we need it" is not good enough.
> We won't know that we need it.  What will probably happen is that some
> user somewhere who is already suffering from some kind of problem will
> experience additional apparently-random corruption.  Naturally that's
> not going to result in a good bug report.
>
> Even from our point of view as the programmers this isn't a good
> approach because the proposed fix is an API and API change.  What
> you're suggesting is that we introduce a bug, and wait and see if it
> bites anyone, in the full knowledge that by then fixing the bug will
> involve either widespread changes to all of the DMA API users or
> changing a particular driver to be much slower.
>   

That's because I estimate the probability of change being required as zero.

>> Framebuffers?  Those are RAM.  USB webcams?  These can't be interrupted 
>> by SIGINT.  Are you saying a guest depends on an O_DIRECT USB transfer 
>> not affecting memory when a USB cable is pulled out?
>>     
>
> No, as I said earlier, and as you appeared to accept, it is quite
> possible that in some uses of the qemu code - including some uses of
> Xen - _all_ DMA will go through bounce buffers.
>   

Right now Xen doesn't bounce DMAs, it uses the map cache.  I am not 
coding for all possible uses of qemu.  I am coding for what's in 
upstream qemu, and allowing for reasonable implementations in Xen.


>> I'm suggesting we do that unconditionally (as my patch does) and only 
>> add that complexity when we know it's needed for certain.
>>     
>
> At the moment there are no such devices (your claims about ide
> notwithstanding) but I think it will be easier to argue about the
> specific case after we have agreed on a non-deficient API.
>   

I don't think we'll ever reach agreement.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-21 17:18                     ` Avi Kivity
@ 2009-01-21 21:54                       ` Anthony Liguori
  0 siblings, 0 replies; 47+ messages in thread
From: Anthony Liguori @ 2009-01-21 21:54 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity wrote:
> Ian Jackson wrote:
>
>>> I'm suggesting we do that unconditionally (as my patch does) and 
>>> only add that complexity when we know it's needed for certain.
>>>     
>>
>> At the moment there are no such devices (your claims about ide
>> notwithstanding) but I think it will be easier to argue about the
>> specific case after we have agreed on a non-deficient API.
>>   
>
> I don't think we'll ever reach agreement.

API's don't have to be fixed for any period of time.  They can be 
changed as needed.  The only thing we have to agree on is that 1) we're 
not causing regressions and 2) we're not digging ourselves into a hole.

So far, I don't see anything in the API that would cause either issue.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 18:25           ` Jamie Lokier
  2009-01-19 18:43             ` Avi Kivity
@ 2009-01-20 14:44             ` Ian Jackson
  1 sibling, 0 replies; 47+ messages in thread
From: Ian Jackson @ 2009-01-20 14:44 UTC (permalink / raw)
  To: qemu-devel

Jamie Lokier writes ("Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API"):
> So wouldn't it be trivial to pass "amount written" to the unmap
> function - to be used in the bounce buffer case?

I think that's probably the best plan.  It's certainly a lot simpler
than my most recent proposal.

It seems to me that:
 - there has to be a callback mechanism to schedule bounce buffers
 - there should be only one control flow
 - therefore the interface should always be callback-based
and:
 - as Anthony points out, we need to be able to set up for a single-
   packet scatter-gather IO
 - thus the interface should involve a single request and callback
   for a whole DMA scatter gather

This would seem to yield an API that looks roughly like this:

  #define CPUPHYSMEMMAPFLAG_READABLE    001
  #define CPUPHYSMEMMAPFLAG_WRITABLE    002

  #define CPUPHYSMEMMAP_MAX 131072
     /* Callers of the mapping API must only request mappings
      * totalling this amount or less.  Larger requests will
      * fail an assertion. */

  typedef struct {
    target_phys_addr_t addr, len;
  } CpuPhysicalMemoryMappingEntry;
     /* If requested addresses and lengths are not page-aligned,
      * performance will definitely be poor. */

  typedef struct {
   /* to be filled in by caller before calling _map: */
    unsigned flags;
    int sg_list_entries;
    CpuPhysicalMemoryMappingEntry *sg_list;
   /* filled in by _map: */
    void *buffer;
   /* private to _map et al: */
    ... some stuff here to actually implement ...
  } CpuPhysicalMemoryMapping;

  void cpu_physical_memory_map(CpuPhysicalMemoryMapping*,
      CpuPhysicalMemoryMapCallback *cb, void *cb_arg);
    /* There may be a limit on the number of concurrent maps
     * and the limit may be as low as one. */

  void cpu_physical_memory_unmap(CpuPhysicalMemoryMapping*,
      target_phys_addr_t transferred);
    /* For read mappings, transferred is ignored so 0 is OK. */

Ian.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 13:49   ` Ian Jackson
  2009-01-19 14:54     ` Avi Kivity
@ 2009-01-19 15:05     ` Gerd Hoffmann
  2009-01-19 15:23       ` Avi Kivity
  1 sibling, 1 reply; 47+ messages in thread
From: Gerd Hoffmann @ 2009-01-19 15:05 UTC (permalink / raw)
  To: qemu-devel

Ian Jackson wrote:
> The interface when cpu_physical_memory_map returns 0 is strange.
> Normally everything in qemu is done with completion callbacks, but
> here we have a kind of repeated polling.

Check patch #2 ;)

>   typedef struct {
>     /* to be filled in by caller before calling _map: */
>     unsigned flags;
>     target_phys_addr_t addr, len; /* len is updated by _map */
>     /* filled in by _map: */
>     void *buffer;
>     /* private to _map et al: */
>   } CpuPhysicalMemoryMapping;
> 
>   void cpu_physical_memory_map(CpuPhysicalMemoryMapping*;
>       CpuPhysicalMemoryMapCallback *cb, void *cb_arg);
>     /* There may be a limit on the number of concurrent maps
>      * and the limit may be as low as one. */

Hmm, I'd like to see grant tables fit into this picture too.
Using flags (instead of is_write) for map requests we can handle that
too I think.  We need a flag indicating we are passing a grant handle
not a guest address, then we can stick the handle into addr (or make
that a union).

cheers,
  Gerd

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 15:05     ` Gerd Hoffmann
@ 2009-01-19 15:23       ` Avi Kivity
  2009-01-19 15:29         ` Avi Kivity
  0 siblings, 1 reply; 47+ messages in thread
From: Avi Kivity @ 2009-01-19 15:23 UTC (permalink / raw)
  To: qemu-devel

Gerd Hoffmann wrote:
> Hmm, I'd like to see grant tables fit into this picture too.
> Using flags (instead of is_write) for map requests we can handle that
> too I think.  We need a flag indicating we are passing a grant handle
> not a guest address, then we can stick the handle into addr (or make
> that a union).
>   

Grant handles should be handled by a different API.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 15:23       ` Avi Kivity
@ 2009-01-19 15:29         ` Avi Kivity
  2009-01-19 15:57           ` Gerd Hoffmann
  0 siblings, 1 reply; 47+ messages in thread
From: Avi Kivity @ 2009-01-19 15:29 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity wrote:
> Gerd Hoffmann wrote:
>> Hmm, I'd like to see grant tables fit into this picture too.
>> Using flags (instead of is_write) for map requests we can handle that
>> too I think.  We need a flag indicating we are passing a grant handle
>> not a guest address, then we can stick the handle into addr (or make
>> that a union).
>>   
>
> Grant handles should be handled by a different API.
>

That was terse... here's how I think it should be done:

- step 1: map the grant handle into guest physical memory
- step 2: use cpu_physical_memory_{read,write,map}() to access the memory

This makes grant tables transparent to the actual memory access API.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 15:29         ` Avi Kivity
@ 2009-01-19 15:57           ` Gerd Hoffmann
  2009-01-19 16:25             ` Avi Kivity
  2009-01-19 17:08             ` Ian Jackson
  0 siblings, 2 replies; 47+ messages in thread
From: Gerd Hoffmann @ 2009-01-19 15:57 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity wrote:
>>
>> Grant handles should be handled by a different API.
> 
> That was terse... here's how I think it should be done:
> 
> - step 1: map the grant handle into guest physical memory

Haha.

Guess I should explain a bit more what grant handles are ...

A xen domain can give other domains permission to map specific pages.
This is called a grant.  This is used by xen paravirtual drivers.  The
domU frontend driver hands out grants to the backend driver (usually
dom0), so the backend driver can fill the guest buffers with the data.

I want qemu be able to handle the backend side, so it must be able to
map and unmap grants.   I think fitting this into the memory mapping API
can be useful, especially if the mapping API starts to get used by
generic code.  Maybe the block layer some day accepts iovecs with guest
physical addresses and does map/unmap transparently?  Then I'd like to
be able to pass in a iovec with grants instead.

cheers,
  Gerd

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 15:57           ` Gerd Hoffmann
@ 2009-01-19 16:25             ` Avi Kivity
  2009-01-19 17:08             ` Ian Jackson
  1 sibling, 0 replies; 47+ messages in thread
From: Avi Kivity @ 2009-01-19 16:25 UTC (permalink / raw)
  To: qemu-devel

Gerd Hoffmann wrote:
> Avi Kivity wrote:
>   
>>> Grant handles should be handled by a different API.
>>>       
>> That was terse... here's how I think it should be done:
>>
>> - step 1: map the grant handle into guest physical memory
>>     
>
> Haha.
>
>   
Right :)

> Guess I should explain a bit more what grant handles are ...
>
> A xen domain can give other domains permission to map specific pages.
> This is called a grant.  This is used by xen paravirtual drivers.  The
> domU frontend driver hands out grants to the backend driver (usually
> dom0), so the backend driver can fill the guest buffers with the data.
>
> I want qemu be able to handle the backend side, so it must be able to
> map and unmap grants.   I think fitting this into the memory mapping API
> can be useful, especially if the mapping API starts to get used by
> generic code.  Maybe the block layer some day accepts iovecs with guest
> physical addresses and does map/unmap transparently?  Then I'd like to
> be able to pass in a iovec with grants instead.
>   

Well, I still think it should be done by a separate API.  Multiplexing 
target addresses and grant handles on the same API is messy.

The block driver should keep its host virtual address interface; maybe 
we'll add a layer that accepts sglists on top of it; in that case we can 
add another API that accepts sglists that contain grant handles instead 
of addresses.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 15:57           ` Gerd Hoffmann
  2009-01-19 16:25             ` Avi Kivity
@ 2009-01-19 17:08             ` Ian Jackson
  2009-01-19 17:16               ` Avi Kivity
  1 sibling, 1 reply; 47+ messages in thread
From: Ian Jackson @ 2009-01-19 17:08 UTC (permalink / raw)
  To: qemu-devel

Gerd Hoffmann writes ("Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API"):
>Avi Kivity wrote:
>[Avi:]
>>> Grant handles should be handled by a different API.
>> 
>> That was terse... here's how I think it should be done:
>> - step 1: map the grant handle into guest physical memory

I agree that grant handles should be handled by a different API.  But
I don't see why they need to be mapped into guest physical memory, or
referred to via a target_phys_addr_t at all.

The only consumers of grant references in qemu are Xen-specific
backend drivers, and they can call the Xen-specific grant reference
APIs directly.  There is no need for anyone but those backend drivers
to access those pages of guest memory.

So I don't see why grant references are relevant to the
cpu_physical_memory_map API at all.

Ian.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
  2009-01-19 17:08             ` Ian Jackson
@ 2009-01-19 17:16               ` Avi Kivity
  0 siblings, 0 replies; 47+ messages in thread
From: Avi Kivity @ 2009-01-19 17:16 UTC (permalink / raw)
  To: qemu-devel

Ian Jackson wrote:
> Gerd Hoffmann writes ("Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API"):
>   
>> Avi Kivity wrote:
>> [Avi:]
>>     
>>>> Grant handles should be handled by a different API.
>>>>         
>>> That was terse... here's how I think it should be done:
>>> - step 1: map the grant handle into guest physical memory
>>>       
>
> I agree that grant handles should be handled by a different API.  But
> I don't see why they need to be mapped into guest physical memory, or
> referred to via a target_phys_addr_t at all.
>
> The only consumers of grant references in qemu are Xen-specific
> backend drivers, and they can call the Xen-specific grant reference
> APIs directly.  There is no need for anyone but those backend drivers
> to access those pages of guest memory.
>
> So I don't see why grant references are relevant to the
> cpu_physical_memory_map API at all.
>   

Thinko on my part.  Wasn't a good suggestion.  Side-by-side APIs seem best.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2009-01-22 18:51 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-01-22 10:36 [Qemu-devel] [PATCH 0/5] Direct memory access for devices (v2) Avi Kivity
2009-01-22 10:36 ` [Qemu-devel] [PATCH 1/5] Add target memory mapping API Avi Kivity
2009-01-22 12:24   ` Ian Jackson
2009-01-22 10:36 ` [Qemu-devel] [PATCH 2/5] Add map client retry notification Avi Kivity
2009-01-22 12:30   ` Ian Jackson
2009-01-22 18:51     ` Anthony Liguori
2009-01-22 10:36 ` [Qemu-devel] [PATCH 3/5] I/O vector helpers Avi Kivity
2009-01-22 10:36 ` [Qemu-devel] [PATCH 4/5] Vectored block device API Avi Kivity
2009-01-22 10:36 ` [Qemu-devel] [PATCH 5/5] Convert IDE to directly access guest memory Avi Kivity
2009-01-22 16:59 ` [Qemu-devel] Re: [PATCH 0/5] Direct memory access for devices (v2) Anthony Liguori
  -- strict thread matches above, loose matches on Subject: below --
2009-01-18 19:53 [Qemu-devel] [PATCH 0/5] Direct memory access for devices Avi Kivity
2009-01-18 19:53 ` [Qemu-devel] [PATCH 1/5] Add target memory mapping API Avi Kivity
2009-01-19 13:49   ` Ian Jackson
2009-01-19 14:54     ` Avi Kivity
2009-01-19 15:39       ` Anthony Liguori
2009-01-19 16:18         ` Paul Brook
2009-01-19 16:33           ` Anthony Liguori
2009-01-19 16:39             ` Avi Kivity
2009-01-19 19:15               ` Anthony Liguori
2009-01-20 10:09                 ` Avi Kivity
2009-01-19 16:57         ` Ian Jackson
2009-01-19 19:23           ` Anthony Liguori
2009-01-20 10:17             ` Avi Kivity
2009-01-20 14:18             ` Ian Jackson
2009-01-19 16:40       ` Ian Jackson
2009-01-19 17:28         ` Avi Kivity
2009-01-19 17:53           ` Ian Jackson
2009-01-19 18:29             ` Avi Kivity
2009-01-20 14:32               ` Ian Jackson
2009-01-20 17:23                 ` Avi Kivity
2009-01-19 18:25           ` Jamie Lokier
2009-01-19 18:43             ` Avi Kivity
2009-01-20 14:49               ` Ian Jackson
2009-01-20 17:42                 ` Avi Kivity
2009-01-20 18:08                   ` Jamie Lokier
2009-01-20 20:27                     ` Avi Kivity
2009-01-21 16:53                       ` Ian Jackson
2009-01-21 16:50                   ` Ian Jackson
2009-01-21 17:18                     ` Avi Kivity
2009-01-21 21:54                       ` Anthony Liguori
2009-01-20 14:44             ` Ian Jackson
2009-01-19 15:05     ` Gerd Hoffmann
2009-01-19 15:23       ` Avi Kivity
2009-01-19 15:29         ` Avi Kivity
2009-01-19 15:57           ` Gerd Hoffmann
2009-01-19 16:25             ` Avi Kivity
2009-01-19 17:08             ` Ian Jackson
2009-01-19 17:16               ` Avi Kivity

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).