qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH 0/3] Refactor AIO to allow multiple AIO implementations
@ 2008-09-22 23:17 Ryan Harper
  2008-09-22 23:17 ` [Qemu-devel] [PATCH 1/3] Only call aio flush handler if set Ryan Harper
                   ` (5 more replies)
  0 siblings, 6 replies; 26+ messages in thread
From: Ryan Harper @ 2008-09-22 23:17 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, kvm

The patchset adds additional AIO driver abstraction to the block raw driver to
support multiple aio implementations for each device.  The first patch pulls
the posix aio implementation out of the block-raw device using a generic call
to the newly created AIO Driver structure.  The posix aio implementation that
was contained in block-raw-posix.c has been refactored in to aio-posix.c.  The
next patch adds a linux aio implementation for raw devices being opened
O_DIRECT via cache=off drive option.  We only use linux aio when cache=off as
linux aio falls back to synchronous ops if not opened with O_DIRECT flag.

Addtional work has been done on top of QEMU for KVM and virtio-blk devices.
While virtio-blk is not yet upstream in QEMU, the AIO changes here provide a
tremendous performance improvement (from 7.6% of native, to 100% of randwrite,
and 3.9% of native, to 101.4% of native for seq write) for virtio
devices with cache=off.

Storage subsystem:
  IBM EXP300 - 14 Disk Fiber Expansion, 17G - 15K RPMS
  Host: AMD Barcelona, 2 socket, 8G RAM
  HBA: QLogic Corp. ISP2312-based 2Gb Fibre Channel to PCI-X HBA (rev 02)

Benchmark[1]:
           fio --name=guestrun --filename=/dev/mapper/volumes-fibre \
               --rw=randwrite --bs=16k --ioengine=libaio --direct=1 \
               --norandommap --runtime=120 --time_based --numjobs=1 \
               --group_reporting --thread --size=25g --write_lat_log \
               --write_bw_log --iodepth=74

Qemu parameters:
 -m 1024 \
 -drive file=/images/npt2-guest-virtio.qcow2,if=ide,boot=on,snapshot=off \
 -drive file=/dev/mapper/volumes-fibre,if=virtio,cache=(on|off) \
 -drive file=/dev/mapper/volumes-npt2--dom1,if=virtio,cache=off
 -net nic,macaddr=00:FF:FF:00:00:01,model=rtl8139 -net tap -vnc :123 \
 -monitor stdio

Guest io scheduler: noop

Results:
 These results are with the patch series applied to KVM (plus a small KVM only
 change -- KVM patches forthcoming).


 16k randwrite 1 thread, 74 iodepth | MB/s | avg sub lat (us) | avg comp lat (ms)
 ---------------------------------------+---------------------+------------------
 baremetal (O_DIRECT, aka cache=off)| 61.2 |   13.07          |  19.59
 kvm: cache=off posix-aio w/o patch |  4.7 | 3467.44          | 254.08
 kvm: cache=off linux-aio           | 61.1 |   75.35          |  19.57
 kvm: cache=on  posix-aio w/o patch |127.0 |  115.78          |   9.19
 kvm: cache=on  posix-aio w/ patch  |126.0 |   67.35          |   9.30


 16k write 1 thread, 74 iodepth     | MB/s | avg sub lat (us) | avg comp lat (ms)
 ---------------------------------------+---------------------+------------------
 baremetal (O_DIRECT, aka cache=off)|128.1 |   10.90          |   9.45
 kvm: cache=off posix-aio w/o patch |  5.1 | 3152.00          | 231.06 
 kvm: cache=off linux-aio           |130.0 |   83.83          |   8.99
 kvm: cache=on  posix-aio w/o patch |184.0 |   80.46          |   6.35
 kvm: cache=on  posix-aio w/ patch  |165.0 |   70.90          |   7.09


1. http://brick.kernel.dk/snaps/fio-1.21.tar.bz2

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Qemu-devel] [PATCH 1/3] Only call aio flush handler if set
  2008-09-22 23:17 [Qemu-devel] [PATCH 0/3] Refactor AIO to allow multiple AIO implementations Ryan Harper
@ 2008-09-22 23:17 ` Ryan Harper
  2008-09-23  2:38   ` [Qemu-devel] " Anthony Liguori
  2008-09-22 23:17 ` [Qemu-devel] [PATCH 2/3] Move aio implementation out of raw block driver Ryan Harper
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Ryan Harper @ 2008-09-22 23:17 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, Ryan Harper, kvm

If the aio handler doesn't register an io_flush handler, we'd SEGV; fix that by
only calling the flush handler if set.  BTW, aio handlers *should* register an
io_flush routine.

Signed-off-by: Ryan Harper <ryanh@us.ibm.com>

diff --git a/aio.c b/aio.c
index 687e4be..2bb3ed4 100644
--- a/aio.c
+++ b/aio.c
@@ -105,7 +105,8 @@ void qemu_aio_flush(void)
         ret = 0;
 
         LIST_FOREACH(node, &aio_handlers, node) {
-            ret |= node->io_flush(node->opaque);
+            if (node->io_flush)
+                ret |= node->io_flush(node->opaque);
         }
 
         qemu_aio_wait();

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [Qemu-devel] [PATCH 2/3] Move aio implementation out of raw block driver
  2008-09-22 23:17 [Qemu-devel] [PATCH 0/3] Refactor AIO to allow multiple AIO implementations Ryan Harper
  2008-09-22 23:17 ` [Qemu-devel] [PATCH 1/3] Only call aio flush handler if set Ryan Harper
@ 2008-09-22 23:17 ` Ryan Harper
  2008-09-23  1:16   ` [Qemu-devel] " Ryan Harper
  2008-09-23  2:45   ` Anthony Liguori
  2008-09-22 23:17 ` [Qemu-devel] " Ryan Harper
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 26+ messages in thread
From: Ryan Harper @ 2008-09-22 23:17 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, Ryan Harper, kvm

This patch moves the existing posix aio implementation out of block-raw-posix.c
into aio-posix.c.  Added in a per-block device  aio driver abstraction.
Block-raw-posix invokes the aio driver methods, .submit, .flush, and .cancel as
needed.  aio-posix.c contains the posix aio implementation.

The changes pave the way for other aio implementations, namely linux aio.  Each
block device will init the proper aio driver depending on how the device is
opened.

Signed-off-by: Ryan Harper <ryanh@us.ibm.com>

diff --git a/Makefile b/Makefile
index de6393e..18477ba 100644
--- a/Makefile
+++ b/Makefile
@@ -60,7 +60,7 @@ BLOCK_OBJS += block-raw-posix.o
 endif
 
 ifdef CONFIG_AIO
-BLOCK_OBJS += compatfd.o
+BLOCK_OBJS += compatfd.o aio-posix.o
 endif
 
 ######################################################################
diff --git a/Makefile.target b/Makefile.target
index 4a490f4..4c6b3d5 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -482,7 +482,7 @@ OBJS+=block-raw-posix.o
 endif
 
 ifdef CONFIG_AIO
-OBJS+=compatfd.o
+OBJS+=compatfd.o aio-posix.o
 endif
 
 LIBS+=-lz
diff --git a/block-aio.h b/block-aio.h
new file mode 100644
index 0000000..b8597d0
--- /dev/null
+++ b/block-aio.h
@@ -0,0 +1,78 @@
+/*
+ * QEMU Block AIO API
+ *
+ * Copyright IBM, Corp. 2008
+ *
+ * Authors:
+ * Anthony Liguori <aliguori@us.ibm.com>
+ * Ryan Harper <ryanh@us.ibm.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef QEMU_BLOCK_AIO_H
+#define QEMU_BLOCK_AIO_H
+
+#include "qemu-common.h"
+#include "block_int.h"
+#include "block.h"
+#include "qemu-aio.h"
+#ifdef CONFIG_AIO
+#include <aio.h>
+#endif
+
+//#define DEBUG_BLOCK_AIO
+#if defined(DEBUG_BLOCK_AIO)
+#define BLPRINTF(formatCstr, args...) do { fprintf(stderr, formatCstr, ##args); fflush(stderr); } while (0)
+#else
+#define BLPRINTF(formatCstr, args...)
+#endif
+
+typedef struct RawAIOCB {
+    BlockDriverAIOCB common;
+    struct aiocb posix_aiocb;
+    struct RawAIOCB *next;
+    int ret;
+} RawAIOCB;
+
+typedef struct AIODriver
+{
+    const char *name;
+    RawAIOCB *(*submit)(BlockDriverState *bs, int fd,
+                                    int64_t sector_num, uint8_t *buf,
+                                    int sectors, int write,
+                                    BlockDriverCompletionFunc *cb,
+                                    void *opaque);
+    void (*cancel)(BlockDriverAIOCB *aiocb);
+    int (*flush)(void *opaque);
+} AIODriver;
+
+typedef struct BDRVRawState {
+    int fd;
+    int type;
+    unsigned int lseek_err_cnt;
+#if defined(__linux__)
+    /* linux floppy specific */
+    int fd_open_flags;
+    int64_t fd_open_time;
+    int64_t fd_error_time;
+    int fd_got_error;
+    int fd_media_changed;
+#endif
+#if defined(O_DIRECT)
+    uint8_t* aligned_buf;
+#endif
+    AIODriver *aio_dvr;
+} BDRVRawState;
+
+typedef struct AIOState
+{
+    int fd;
+    RawAIOCB *first_aio;
+} AIOState;
+ 
+AIODriver* posix_aio_init(void);
+
+#endif /* QEMU_BLOCK_AIO_H */
diff --git a/block-raw-posix.c b/block-raw-posix.c
index 41f9976..cab7094 100644
--- a/block-raw-posix.c
+++ b/block-raw-posix.c
@@ -25,11 +25,8 @@
 #include "qemu-timer.h"
 #include "qemu-char.h"
 #include "block_int.h"
-#include "compatfd.h"
+#include "block-aio.h"
 #include <assert.h>
-#ifdef CONFIG_AIO
-#include <aio.h>
-#endif
 
 #ifdef CONFIG_COCOA
 #include <paths.h>
@@ -84,25 +81,6 @@
    reopen it to see if the disk has been changed */
 #define FD_OPEN_TIMEOUT 1000
 
-typedef struct BDRVRawState {
-    int fd;
-    int type;
-    unsigned int lseek_err_cnt;
-#if defined(__linux__)
-    /* linux floppy specific */
-    int fd_open_flags;
-    int64_t fd_open_time;
-    int64_t fd_error_time;
-    int fd_got_error;
-    int fd_media_changed;
-#endif
-#if defined(O_DIRECT)
-    uint8_t* aligned_buf;
-#endif
-} BDRVRawState;
-
-static int posix_aio_init(void);
-
 static int fd_open(BlockDriverState *bs);
 
 static int raw_open(BlockDriverState *bs, const char *filename, int flags)
@@ -110,8 +88,6 @@ static int raw_open(BlockDriverState *bs, const char *filename, int flags)
     BDRVRawState *s = bs->opaque;
     int fd, open_flags, ret;
 
-    posix_aio_init();
-
     s->lseek_err_cnt = 0;
 
     open_flags = O_BINARY;
@@ -149,6 +125,8 @@ static int raw_open(BlockDriverState *bs, const char *filename, int flags)
         }
     }
 #endif
+    /* init aio driver for this block device */
+    s->aio_dvr = posix_aio_init();
     return 0;
 }
 
@@ -429,166 +407,6 @@ static int raw_pwrite(BlockDriverState *bs, int64_t offset,
 #define raw_pwrite raw_pwrite_aligned
 #endif
 
-
-#ifdef CONFIG_AIO
-/***********************************************************/
-/* Unix AIO using POSIX AIO */
-
-typedef struct RawAIOCB {
-    BlockDriverAIOCB common;
-    struct aiocb aiocb;
-    struct RawAIOCB *next;
-    int ret;
-} RawAIOCB;
-
-typedef struct PosixAioState
-{
-    int fd;
-    RawAIOCB *first_aio;
-} PosixAioState;
-
-static void posix_aio_read(void *opaque)
-{
-    PosixAioState *s = opaque;
-    RawAIOCB *acb, **pacb;
-    int ret;
-    size_t offset;
-    union {
-        struct qemu_signalfd_siginfo siginfo;
-        char buf[128];
-    } sig;
-
-    /* try to read from signalfd, don't freak out if we can't read anything */
-    offset = 0;
-    while (offset < 128) {
-        ssize_t len;
-
-        len = read(s->fd, sig.buf + offset, 128 - offset);
-        if (len == -1 && errno == EINTR)
-            continue;
-        if (len == -1 && errno == EAGAIN) {
-            /* there is no natural reason for this to happen,
-             * so we'll spin hard until we get everything just
-             * to be on the safe side. */
-            if (offset > 0)
-                continue;
-        }
-
-        offset += len;
-    }
-
-    for(;;) {
-        pacb = &s->first_aio;
-        for(;;) {
-            acb = *pacb;
-            if (!acb)
-                goto the_end;
-            ret = aio_error(&acb->aiocb);
-            if (ret == ECANCELED) {
-                /* remove the request */
-                *pacb = acb->next;
-                qemu_aio_release(acb);
-            } else if (ret != EINPROGRESS) {
-                /* end of aio */
-                if (ret == 0) {
-                    ret = aio_return(&acb->aiocb);
-                    if (ret == acb->aiocb.aio_nbytes)
-                        ret = 0;
-                    else
-                        ret = -EINVAL;
-                } else {
-                    ret = -ret;
-                }
-                /* remove the request */
-                *pacb = acb->next;
-                /* call the callback */
-                acb->common.cb(acb->common.opaque, ret);
-                qemu_aio_release(acb);
-                break;
-            } else {
-                pacb = &acb->next;
-            }
-        }
-    }
- the_end: ;
-}
-
-static int posix_aio_flush(void *opaque)
-{
-    PosixAioState *s = opaque;
-    return !!s->first_aio;
-}
-
-static PosixAioState *posix_aio_state;
-
-static int posix_aio_init(void)
-{
-    sigset_t mask;
-    PosixAioState *s;
-  
-    if (posix_aio_state)
-        return 0;
-
-    s = qemu_malloc(sizeof(PosixAioState));
-    if (s == NULL)
-        return -ENOMEM;
-
-    /* Make sure to block AIO signal */
-    sigemptyset(&mask);
-    sigaddset(&mask, SIGUSR2);
-    sigprocmask(SIG_BLOCK, &mask, NULL);
-    
-    s->first_aio = NULL;
-    s->fd = qemu_signalfd(&mask);
-
-    fcntl(s->fd, F_SETFL, O_NONBLOCK);
-
-    qemu_aio_set_fd_handler(s->fd, posix_aio_read, NULL, posix_aio_flush, s);
-
-#if defined(__GLIBC__) && defined(__linux__)
-    {
-        /* XXX: aio thread exit seems to hang on RedHat 9 and this init
-           seems to fix the problem. */
-        struct aioinit ai;
-        memset(&ai, 0, sizeof(ai));
-        ai.aio_threads = 1;
-        ai.aio_num = 1;
-        ai.aio_idle_time = 365 * 100000;
-        aio_init(&ai);
-    }
-#endif
-    posix_aio_state = s;
-
-    return 0;
-}
-
-static RawAIOCB *raw_aio_setup(BlockDriverState *bs,
-        int64_t sector_num, uint8_t *buf, int nb_sectors,
-        BlockDriverCompletionFunc *cb, void *opaque)
-{
-    BDRVRawState *s = bs->opaque;
-    RawAIOCB *acb;
-
-    if (fd_open(bs) < 0)
-        return NULL;
-
-    acb = qemu_aio_get(bs, cb, opaque);
-    if (!acb)
-        return NULL;
-    acb->aiocb.aio_fildes = s->fd;
-    acb->aiocb.aio_sigevent.sigev_signo = SIGUSR2;
-    acb->aiocb.aio_sigevent.sigev_notify = SIGEV_SIGNAL;
-    acb->aiocb.aio_buf = buf;
-    if (nb_sectors < 0)
-        acb->aiocb.aio_nbytes = -nb_sectors;
-    else
-        acb->aiocb.aio_nbytes = nb_sectors * 512;
-    acb->aiocb.aio_offset = sector_num * 512;
-    acb->next = posix_aio_state->first_aio;
-    posix_aio_state->first_aio = acb;
-    return acb;
-}
-
 static void raw_aio_em_cb(void* opaque)
 {
     RawAIOCB *acb = opaque;
@@ -601,14 +419,13 @@ static BlockDriverAIOCB *raw_aio_read(BlockDriverState *bs,
         BlockDriverCompletionFunc *cb, void *opaque)
 {
     RawAIOCB *acb;
+    BDRVRawState *s = bs->opaque;
 
     /*
      * If O_DIRECT is used and the buffer is not aligned fall back
      * to synchronous IO.
      */
 #if defined(O_DIRECT)
-    BDRVRawState *s = bs->opaque;
-
     if (unlikely(s->aligned_buf != NULL && ((uintptr_t) buf % 512))) {
         QEMUBH *bh;
         acb = qemu_aio_get(bs, cb, opaque);
@@ -619,13 +436,14 @@ static BlockDriverAIOCB *raw_aio_read(BlockDriverState *bs,
     }
 #endif
 
-    acb = raw_aio_setup(bs, sector_num, buf, nb_sectors, cb, opaque);
-    if (!acb)
+    if (fd_open(bs) < 0)
         return NULL;
-    if (aio_read(&acb->aiocb) < 0) {
-        qemu_aio_release(acb);
+
+    /* submit read */
+    acb = s->aio_dvr->submit(bs, s->fd, sector_num, buf, nb_sectors, 0, cb,
+                             opaque);
+    if (!acb)
         return NULL;
-    }
     return &acb->common;
 }
 
@@ -634,13 +452,13 @@ static BlockDriverAIOCB *raw_aio_write(BlockDriverState *bs,
         BlockDriverCompletionFunc *cb, void *opaque)
 {
     RawAIOCB *acb;
+    BDRVRawState *s = bs->opaque;
 
     /*
      * If O_DIRECT is used and the buffer is not aligned fall back
      * to synchronous IO.
      */
 #if defined(O_DIRECT)
-    BDRVRawState *s = bs->opaque;
 
     if (unlikely(s->aligned_buf != NULL && ((uintptr_t) buf % 512))) {
         QEMUBH *bh;
@@ -652,48 +470,19 @@ static BlockDriverAIOCB *raw_aio_write(BlockDriverState *bs,
     }
 #endif
 
-    acb = raw_aio_setup(bs, sector_num, (uint8_t*)buf, nb_sectors, cb, opaque);
+    /* submit write */
+    acb = s->aio_dvr->submit(bs, s->fd, sector_num, buf, nb_sectors, 1, cb,
+                             opaque);
     if (!acb)
         return NULL;
-    if (aio_write(&acb->aiocb) < 0) {
-        qemu_aio_release(acb);
-        return NULL;
-    }
     return &acb->common;
 }
 
 static void raw_aio_cancel(BlockDriverAIOCB *blockacb)
 {
-    int ret;
-    RawAIOCB *acb = (RawAIOCB *)blockacb;
-    RawAIOCB **pacb;
-
-    ret = aio_cancel(acb->aiocb.aio_fildes, &acb->aiocb);
-    if (ret == AIO_NOTCANCELED) {
-        /* fail safe: if the aio could not be canceled, we wait for
-           it */
-        while (aio_error(&acb->aiocb) == EINPROGRESS);
-    }
-
-    /* remove the callback from the queue */
-    pacb = &posix_aio_state->first_aio;
-    for(;;) {
-        if (*pacb == NULL) {
-            break;
-        } else if (*pacb == acb) {
-            *pacb = acb->next;
-            qemu_aio_release(acb);
-            break;
-        }
-        pacb = &acb->next;
-    }
-}
-
-#else /* CONFIG_AIO */
-static int posix_aio_init(void)
-{
+    BDRVRawState *s = blockacb->bs->opaque;
+    s->aio_dvr->cancel(blockacb);
 }
-#endif /* CONFIG_AIO */
 
 static void raw_close(BlockDriverState *bs)
 {
@@ -898,8 +687,6 @@ static int hdev_open(BlockDriverState *bs, const char *filename, int flags)
     BDRVRawState *s = bs->opaque;
     int fd, open_flags, ret;
 
-    posix_aio_init();
-
 #ifdef CONFIG_COCOA
     if (strstart(filename, "/dev/cdrom", NULL)) {
         kern_return_t kernResult;
@@ -969,6 +756,8 @@ static int hdev_open(BlockDriverState *bs, const char *filename, int flags)
         s->fd_media_changed = 1;
     }
 #endif
+    /* init aio driver for this block device */
+    s->aio_dvr = posix_aio_init();
     return 0;
 }
 

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [Qemu-devel] [PATCH 2/3] Move aio implementation out of raw block driver
  2008-09-22 23:17 [Qemu-devel] [PATCH 0/3] Refactor AIO to allow multiple AIO implementations Ryan Harper
  2008-09-22 23:17 ` [Qemu-devel] [PATCH 1/3] Only call aio flush handler if set Ryan Harper
  2008-09-22 23:17 ` [Qemu-devel] [PATCH 2/3] Move aio implementation out of raw block driver Ryan Harper
@ 2008-09-22 23:17 ` Ryan Harper
  2008-09-23  1:22   ` [Qemu-devel] [PATCH 3/3] Add linux aio implementation for raw block devices Ryan Harper
  2008-09-23  3:32 ` [Qemu-devel] Re: [PATCH 0/3] Refactor AIO to allow multiple AIO implementations Anthony Liguori
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Ryan Harper @ 2008-09-22 23:17 UTC (permalink / raw)
  To: qemu-devel; +Cc: Ryan Harper

This patch adds a linux aio raw block driver implementation.  If a raw block
device is opened with cached=off (O_DIRECT) then we can utilize linux aio to
submit io to/from the block device.  Utilizing linux aio allows for multiple
outstanding requests to be in flight against the io device potentially providing
higher IO throughput.  This implementation uses eventfd for event completion
notification.

Block devices with cache enabled will utilize posix aio since linux aio will
fallback to synchronous IO when used without O_DIRECT[1].


Signed-off-by: Ryan Harper <ryanh@us.ibm.com>

1. http://lse.sourceforge.net/io/aio.html

diff --git a/Makefile b/Makefile
index 18477ba..92ca5d9 100644
--- a/Makefile
+++ b/Makefile
@@ -60,7 +60,7 @@ BLOCK_OBJS += block-raw-posix.o
 endif
 
 ifdef CONFIG_AIO
-BLOCK_OBJS += compatfd.o aio-posix.o
+BLOCK_OBJS += compatfd.o aio-posix.o aio-linux.o
 endif
 
 ######################################################################
diff --git a/Makefile.target b/Makefile.target
index 4c6b3d5..599fa8a 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -482,7 +482,7 @@ OBJS+=block-raw-posix.o
 endif
 
 ifdef CONFIG_AIO
-OBJS+=compatfd.o aio-posix.o
+OBJS+=compatfd.o aio-posix.o aio-linux.o
 endif
 
 LIBS+=-lz
diff --git a/aio-linux.c b/aio-linux.c
new file mode 100644
index 0000000..0043fd1
--- /dev/null
+++ b/aio-linux.c
@@ -0,0 +1,225 @@
+/*
+ * QEMU Linux AIO implementation for Block Raw devices
+ *
+ * Copyright IBM, Corp. 2008
+ *
+ * Authors:
+ * Anthony Liguori <aliguori@us.ibm.com>
+ * Ryan Harper <ryanh@us.ibm.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+/***********************************************************/
+/* Unix AIO using LINUX AIO */
+
+#include "qemu-common.h"
+#include "block_int.h"
+#include "block.h"
+#include "block-aio.h"
+#include "compatfd.h"
+#include <sys/types.h>
+#include <sys/syscall.h>
+#include <linux/aio_abi.h>
+
+#define MAX_LINUX_AIO_EVENTS 256
+
+int eventfd(unsigned int initval)
+{
+    return syscall(SYS_eventfd, initval);
+}
+
+int io_setup(unsigned nr_reqs, aio_context_t *ctx_id)
+{
+    return syscall(SYS_io_setup, nr_reqs, ctx_id);
+}
+
+int io_destroy(aio_context_t ctx_id)
+{
+    return syscall(SYS_io_destroy, ctx_id);
+}
+
+int io_getevents(aio_context_t ctx_id, long min_nr, long nr,
+                 struct io_event *events, struct timespec *timeout)
+{
+    return syscall(SYS_io_getevents, ctx_id, min_nr, nr, events, timeout);
+}
+
+int io_submit(aio_context_t ctx_id, long nr, struct iocb **iocb)
+{
+    return syscall(SYS_io_submit, ctx_id, nr, iocb);
+}
+
+int io_cancel(aio_context_t ctx_id, struct iocb *iocb, struct io_event *result)
+{
+    return syscall(SYS_io_cancel, ctx_id, iocb, result);
+}
+
+typedef AIOState LinuxAioState;
+
+static int aio_efd;
+static aio_context_t aio_ctxt_id;
+static int outstanding_requests;
+static LinuxAioState *linux_aio_state;
+
+static RawAIOCB *la_submit(BlockDriverState *bs, int fd,
+                                   int64_t sector_num, uint8_t *buf,
+                                   int nb_sectors, int write,
+                                   BlockDriverCompletionFunc *cb, void *opaque)
+{
+    RawAIOCB *acb;
+    struct iocb *iocbs[1];
+    int err;
+
+    acb = qemu_aio_get(bs, cb, opaque);
+    if (!acb) {
+        fprintf(stderr, "%s: qemu_aio_get returned NULL!?!\n", __FUNCTION__);
+        return NULL;
+    }
+
+    if (write)
+        acb->linux_aiocb.aio_lio_opcode = IOCB_CMD_PWRITE;
+    else
+        acb->linux_aiocb.aio_lio_opcode = IOCB_CMD_PREAD;
+
+    acb->linux_aiocb.aio_data = (unsigned long)acb;
+    acb->linux_aiocb.aio_fildes = fd;
+    acb->linux_aiocb.aio_flags = IOCB_FLAG_RESFD;
+    acb->linux_aiocb.aio_resfd = aio_efd;
+    acb->linux_aiocb.aio_buf = (unsigned long)buf;
+    acb->linux_aiocb.aio_nbytes = nb_sectors * 512;
+    acb->linux_aiocb.aio_offset = sector_num * 512;
+
+    acb->next = linux_aio_state->first_aio;
+    linux_aio_state->first_aio = acb;
+
+    iocbs[0] = &acb->linux_aiocb;
+
+    do {
+        err = io_submit(aio_ctxt_id, 1, iocbs);
+    } while (err == -1 && errno == EINTR);
+
+    if (err != 1) {
+        qemu_aio_release(acb);
+        return NULL;
+    }
+
+    outstanding_requests++;
+
+    return acb;
+}
+
+static int la_flush(void)
+{
+    return outstanding_requests;
+}
+
+static void la_cancel(BlockDriverAIOCB *baiocb)
+{
+    RawAIOCB *acb = (void *)baiocb;
+    struct io_event result;
+    int err;
+
+    do {
+        err = io_cancel(aio_ctxt_id, &acb->linux_aiocb, &result);
+    } while (err == -1 && errno == EINTR);
+
+    /* it may have happened...  we probably should check and complete */
+
+    outstanding_requests--;
+
+    qemu_aio_release(acb);
+}
+
+static void la_completion(void *opaque)
+{
+    struct io_event events[MAX_LINUX_AIO_EVENTS];
+    struct timespec ts = {0, 0};
+    uint64_t count;
+    int i, ret;
+
+    BLPRINTF("%s ->\n", __FUNCTION__);
+    do {
+        ret = read(aio_efd, &count, sizeof(count));
+        if (ret == -1 && errno == EAGAIN) {
+            BLPRINTF("linux: got EAGAIN\n");
+            return;
+        }
+    } while (ret == -1 && errno == EINTR);
+
+    if (ret != 8) {
+        BLPRINTF("bad read from eventfd (ret=%d errno=%d)\n", ret, errno);
+        exit(1);
+    }
+
+    BLPRINTF("%s: after fd read\n", __FUNCTION__);
+    BLPRINTF("%s: calling io_getevents, min=%lu events\n", __FUNCTION__, count);
+    do {
+        ret = io_getevents(aio_ctxt_id, count, ARRAY_SIZE(events),
+                           events, &ts);
+    } while (ret == -1 && errno == EINTR);
+
+    if (ret < 0) {
+        BLPRINTF("io_getevents failed: %d %m\n", ret);
+        exit(1);
+    }
+
+    for (i = 0; i < ret; i++) {
+        RawAIOCB *acb;
+        int res;
+
+        acb = (RawAIOCB *)(unsigned long)events[i].data;
+        res = events[i].res;
+
+        if (res > 0)
+            res = 0;
+
+        acb->common.cb(acb->common.opaque, res);
+        qemu_aio_release(acb);
+
+        outstanding_requests--;
+    }
+    BLPRINTF("%s <-\n", __FUNCTION__);
+}
+
+static int la_init(void)
+{
+    LinuxAioState *s;
+
+    if (linux_aio_state)
+        return 0;
+
+    s = qemu_malloc(sizeof(LinuxAioState));
+    if (s == NULL)
+        return -ENOMEM;
+
+    /* setup eventfd and init linux aio context, register fd handler */
+    aio_efd = eventfd(0);
+    io_setup(MAX_LINUX_AIO_EVENTS, &aio_ctxt_id);
+    s->first_aio = NULL;
+    s->fd = aio_efd;
+
+    /* switch to non-blocking eventfd mode */
+    fcntl(aio_efd, F_SETFL, O_NONBLOCK);
+
+    qemu_aio_set_fd_handler(aio_efd, la_completion, NULL, la_flush, NULL);
+
+    linux_aio_state = s;
+
+    return 0;
+}
+
+static AIODriver linux_aio_drv = {
+    .name = "linux",
+    .submit = la_submit,
+    .cancel = la_cancel,
+    .flush = la_flush,
+};
+
+AIODriver *linux_aio_init(void) {
+    if (la_init() != 0)
+        return NULL;
+    return &linux_aio_drv;
+}
diff --git a/block-aio.h b/block-aio.h
index b8597d0..b1492d9 100644
--- a/block-aio.h
+++ b/block-aio.h
@@ -21,6 +21,7 @@
 #include "qemu-aio.h"
 #ifdef CONFIG_AIO
 #include <aio.h>
+#include <linux/aio_abi.h>
 #endif
 
 //#define DEBUG_BLOCK_AIO
@@ -33,6 +34,7 @@
 typedef struct RawAIOCB {
     BlockDriverAIOCB common;
     struct aiocb posix_aiocb;
+    struct iocb linux_aiocb;
     struct RawAIOCB *next;
     int ret;
 } RawAIOCB;
@@ -75,4 +77,5 @@ typedef struct AIOState
  
 AIODriver* posix_aio_init(void);
 
+AIODriver* linux_aio_init(void);
 #endif /* QEMU_BLOCK_AIO_H */
diff --git a/block-raw-posix.c b/block-raw-posix.c
index cab7094..80034ac 100644
--- a/block-raw-posix.c
+++ b/block-raw-posix.c
@@ -125,8 +125,11 @@ static int raw_open(BlockDriverState *bs, const char *filename, int flags)
         }
     }
 #endif
-    /* init aio driver for this block device */
-    s->aio_dvr = posix_aio_init();
+    /* init aio driver for this block device, linux if O_DIRECT is enabled */
+    if (flags & BDRV_O_DIRECT)
+        s->aio_dvr = linux_aio_init();
+    else
+        s->aio_dvr = posix_aio_init();
     return 0;
 }
 
@@ -756,8 +759,11 @@ static int hdev_open(BlockDriverState *bs, const char *filename, int flags)
         s->fd_media_changed = 1;
     }
 #endif
-    /* init aio driver for this block device */
-    s->aio_dvr = posix_aio_init();
+    /* init aio driver for this block device, linux if O_DIRECT is enabled */
+    if (flags & BDRV_O_DIRECT)
+        s->aio_dvr = linux_aio_init();
+    else
+        s->aio_dvr = posix_aio_init();
     return 0;
 }
 

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [Qemu-devel] Re: [PATCH 2/3] Move aio implementation out of raw block driver
  2008-09-22 23:17 ` [Qemu-devel] [PATCH 2/3] Move aio implementation out of raw block driver Ryan Harper
@ 2008-09-23  1:16   ` Ryan Harper
  2008-09-23  2:45   ` Anthony Liguori
  1 sibling, 0 replies; 26+ messages in thread
From: Ryan Harper @ 2008-09-23  1:16 UTC (permalink / raw)
  To: qemu-devel, aliguori, kvm; +Cc: Ryan Harper

[-- Attachment #1: Type: text/plain, Size: 540 bytes --]

* Ryan Harper <ryanh@us.ibm.com> [2008-09-22 18:18]:
> This patch moves the existing posix aio implementation out of block-raw-posix.c
> into aio-posix.c.  Added in a per-block device  aio driver abstraction.
> Block-raw-posix invokes the aio driver methods, .submit, .flush, and .cancel as
> needed.  aio-posix.c contains the posix aio implementation.
> 

Ack, missing aio-posix.c.  Attached proper 2/3 patch.

--
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
ryanh@us.ibm.com

[-- Attachment #2: move_aio_from_block_raw.patch --]
[-- Type: text/plain, Size: 20215 bytes --]

Subject: [PATCH 2/3] Move aio implementation out of raw block driver
Cc: aliguori@us.ibm.com
Cc: kvm@vger.kernel.org

This patch moves the existing posix aio implementation out of block-raw-posix.c
into aio-posix.c.  Added in a per-block device  aio driver abstraction.
Block-raw-posix invokes the aio driver methods, .submit, .flush, and .cancel as
needed.  aio-posix.c contains the posix aio implementation.

The changes pave the way for other aio implementations, namely linux aio.  Each
block device will init the proper aio driver depending on how the device is
opened.

Signed-off-by: Ryan Harper <ryanh@us.ibm.com>

diff --git a/Makefile b/Makefile
index de6393e..18477ba 100644
--- a/Makefile
+++ b/Makefile
@@ -60,7 +60,7 @@ BLOCK_OBJS += block-raw-posix.o
 endif
 
 ifdef CONFIG_AIO
-BLOCK_OBJS += compatfd.o
+BLOCK_OBJS += compatfd.o aio-posix.o
 endif
 
 ######################################################################
diff --git a/Makefile.target b/Makefile.target
index 4a490f4..4c6b3d5 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -482,7 +482,7 @@ OBJS+=block-raw-posix.o
 endif
 
 ifdef CONFIG_AIO
-OBJS+=compatfd.o
+OBJS+=compatfd.o aio-posix.o
 endif
 
 LIBS+=-lz
diff --git a/aio-posix.c b/aio-posix.c
new file mode 100644
index 0000000..cd85420
--- /dev/null
+++ b/aio-posix.c
@@ -0,0 +1,256 @@
+/*
+ * Block driver for RAW files (posix)
+ *
+ * Copyright (c) 2006 Fabrice Bellard
+ * Copyright (c) 2008 IBM Corp.
+ *   Authors: Anthony Liguori <aliguori@us.ibm.com>
+ *          : Ryan Harper <ryanh@us.ibm.com>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#include "qemu-common.h"
+#include "block_int.h"
+#include "block.h"
+#include "block-aio.h"
+#include "compatfd.h"
+#ifdef CONFIG_AIO
+#include <aio.h>
+#endif
+
+/* Unix AIO using POSIX AIO */
+static AIOState *posix_aio_state;
+typedef AIOState PosixAioState;
+AIODriver posix_aio_dvr;
+
+static void pa_read(void *opaque)
+{
+    PosixAioState *s = opaque;
+    RawAIOCB *acb, **pacb;
+    int ret;
+    size_t offset;
+    union {
+        struct qemu_signalfd_siginfo siginfo;
+        char buf[128];
+    } sig;
+
+    BLPRINTF("%s ->\n", __FUNCTION__);
+    /* try to read from signalfd, don't freak out if we can't read anything */
+    offset = 0;
+    while (offset < 128) {
+        ssize_t len;
+
+        len = read(s->fd, sig.buf + offset, 128 - offset);
+        if (len == -1 && errno == EINTR)
+            continue;
+        if (len == -1 && errno == EAGAIN) {
+            /* there is no natural reason for this to happen,
+             * so we'll spin hard until we get everything just
+             * to be on the safe side. */
+            if (offset > 0)
+                continue;
+        }
+
+        offset += len;
+    }
+
+    for(;;) {
+        pacb = &s->first_aio;
+        for(;;) {
+            acb = *pacb;
+            if (!acb)
+                goto the_end;
+            ret = aio_error(&acb->posix_aiocb);
+            if (ret == ECANCELED) {
+                /* remove the request */
+                *pacb = acb->next;
+                qemu_aio_release(acb);
+            } else if (ret != EINPROGRESS) {
+                /* end of aio */
+                if (ret == 0) {
+                    ret = aio_return(&acb->posix_aiocb);
+                    if (ret == acb->posix_aiocb.aio_nbytes)
+                        ret = 0;
+                    else
+                        ret = -EINVAL;
+                } else {
+                    ret = -ret;
+                }
+                /* remove the request */
+                *pacb = acb->next;
+                /* call the callback */
+                acb->common.cb(acb->common.opaque, ret);
+                qemu_aio_release(acb);
+                break;
+            } else {
+                pacb = &acb->next;
+            }
+        }
+    }
+ the_end: ;
+    BLPRINTF("%s <-\n", __FUNCTION__);
+}
+
+static int pa_flush(void *opaque)
+{
+    PosixAioState *s = opaque;
+    return !!s->first_aio;
+}
+
+
+static int pa_init(void)
+{
+    sigset_t mask;
+    PosixAioState *s;
+
+    BLPRINTF("%s ->\n", __FUNCTION__);
+    if (posix_aio_state)
+        return 0;
+
+    s = qemu_malloc(sizeof(PosixAioState));
+    if (s == NULL)
+        return -ENOMEM;
+
+    /* Make sure to block AIO signal */
+    sigemptyset(&mask);
+    sigaddset(&mask, SIGUSR2);
+    sigprocmask(SIG_BLOCK, &mask, NULL);
+    
+    s->first_aio = NULL;
+    s->fd = qemu_signalfd(&mask);
+
+    fcntl(s->fd, F_SETFL, O_NONBLOCK);
+
+    qemu_aio_set_fd_handler(s->fd, pa_read, NULL, pa_flush, s);
+
+#if defined(__GLIBC__) && defined(__linux__)
+    {
+        /* XXX: aio thread exit seems to hang on RedHat 9 and this init
+           seems to fix the problem. */
+        struct aioinit ai;
+        memset(&ai, 0, sizeof(ai));
+        ai.aio_threads = 1;
+        ai.aio_num = 1;
+        ai.aio_idle_time = 365 * 100000;
+        aio_init(&ai);
+    }
+#endif
+    posix_aio_state = s;
+
+    BLPRINTF("%s <-\n", __FUNCTION__);
+    return 0;
+}
+
+static RawAIOCB *pa_setup(BlockDriverState *bs, int fd, int64_t sector_num,
+                          uint8_t *buf, int nb_sectors,
+                          BlockDriverCompletionFunc *cb, void *opaque)
+{
+    BDRVRawState *s = bs->opaque;
+    RawAIOCB *acb;
+
+    BLPRINTF("%s ->\n", __FUNCTION__);
+    acb = qemu_aio_get(bs, cb, opaque);
+    if (!acb)
+        return NULL;
+    acb->posix_aiocb.aio_fildes = s->fd;
+    acb->posix_aiocb.aio_sigevent.sigev_signo = SIGUSR2;
+    acb->posix_aiocb.aio_sigevent.sigev_notify = SIGEV_SIGNAL;
+    acb->posix_aiocb.aio_buf = buf;
+    if (nb_sectors < 0)
+        acb->posix_aiocb.aio_nbytes = -nb_sectors;
+    else
+        acb->posix_aiocb.aio_nbytes = nb_sectors * 512;
+    acb->posix_aiocb.aio_offset = sector_num * 512;
+
+    acb->next = posix_aio_state->first_aio;
+    posix_aio_state->first_aio = acb;
+
+    BLPRINTF("%s <-\n", __FUNCTION__);
+    return acb;
+}
+
+static void pa_cancel(BlockDriverAIOCB *blockacb)
+{
+    int ret;
+    RawAIOCB *acb = (RawAIOCB *)blockacb;
+    RawAIOCB **pacb;
+
+    BLPRINTF("%s ->\n", __FUNCTION__);
+    ret = aio_cancel(acb->posix_aiocb.aio_fildes, &acb->posix_aiocb);
+    if (ret == AIO_NOTCANCELED) {
+        /* fail safe: if the aio could not be canceled, we wait for
+           it */
+        while (aio_error(&acb->posix_aiocb) == EINPROGRESS);
+    }
+
+    /* remove the callback from the queue */
+    pacb = &posix_aio_state->first_aio;
+    for(;;) {
+        if (*pacb == NULL) {
+            break;
+        } else if (*pacb == acb) {
+            *pacb = acb->next;
+            qemu_aio_release(acb);
+            break;
+        }
+        pacb = &acb->next;
+    }
+    BLPRINTF("%s <-\n", __FUNCTION__);
+}
+
+static RawAIOCB *pa_submit(BlockDriverState *bs, int fd,
+                                   int64_t sector_num, uint8_t *buf,
+                                   int nb_sectors, int write,
+                                   BlockDriverCompletionFunc *cb, void *opaque)
+{
+    int ret;
+    RawAIOCB *acb = pa_setup(bs, fd, sector_num, buf, nb_sectors,
+                                       cb, opaque);
+    BLPRINTF("%s ->\n", __FUNCTION__);
+    if (!acb)
+        return NULL;
+
+    if (write)
+        ret = aio_write(&acb->posix_aiocb);
+    else
+        ret = aio_read(&acb->posix_aiocb);
+    
+    if (ret < 0) {
+        qemu_aio_release(acb);
+        return NULL;
+    }
+    BLPRINTF("%s <-\n", __FUNCTION__);
+    return acb;
+}
+
+static AIODriver posix_aio_drv = {
+    .name = "posix",
+    .submit = pa_submit,
+    .cancel = pa_cancel,
+    .flush = pa_flush,
+};
+
+AIODriver* posix_aio_init(void)
+{
+    BLPRINTF("%s ->\n", __FUNCTION__);
+    if (pa_init() != 0)
+        return NULL;
+    return &posix_aio_drv;
+    BLPRINTF("%s <-\n", __FUNCTION__);
+}
diff --git a/block-aio.h b/block-aio.h
new file mode 100644
index 0000000..b8597d0
--- /dev/null
+++ b/block-aio.h
@@ -0,0 +1,78 @@
+/*
+ * QEMU Block AIO API
+ *
+ * Copyright IBM, Corp. 2008
+ *
+ * Authors:
+ * Anthony Liguori <aliguori@us.ibm.com>
+ * Ryan Harper <ryanh@us.ibm.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef QEMU_BLOCK_AIO_H
+#define QEMU_BLOCK_AIO_H
+
+#include "qemu-common.h"
+#include "block_int.h"
+#include "block.h"
+#include "qemu-aio.h"
+#ifdef CONFIG_AIO
+#include <aio.h>
+#endif
+
+//#define DEBUG_BLOCK_AIO
+#if defined(DEBUG_BLOCK_AIO)
+#define BLPRINTF(formatCstr, args...) do { fprintf(stderr, formatCstr, ##args); fflush(stderr); } while (0)
+#else
+#define BLPRINTF(formatCstr, args...)
+#endif
+
+typedef struct RawAIOCB {
+    BlockDriverAIOCB common;
+    struct aiocb posix_aiocb;
+    struct RawAIOCB *next;
+    int ret;
+} RawAIOCB;
+
+typedef struct AIODriver
+{
+    const char *name;
+    RawAIOCB *(*submit)(BlockDriverState *bs, int fd,
+                                    int64_t sector_num, uint8_t *buf,
+                                    int sectors, int write,
+                                    BlockDriverCompletionFunc *cb,
+                                    void *opaque);
+    void (*cancel)(BlockDriverAIOCB *aiocb);
+    int (*flush)(void *opaque);
+} AIODriver;
+
+typedef struct BDRVRawState {
+    int fd;
+    int type;
+    unsigned int lseek_err_cnt;
+#if defined(__linux__)
+    /* linux floppy specific */
+    int fd_open_flags;
+    int64_t fd_open_time;
+    int64_t fd_error_time;
+    int fd_got_error;
+    int fd_media_changed;
+#endif
+#if defined(O_DIRECT)
+    uint8_t* aligned_buf;
+#endif
+    AIODriver *aio_dvr;
+} BDRVRawState;
+
+typedef struct AIOState
+{
+    int fd;
+    RawAIOCB *first_aio;
+} AIOState;
+ 
+AIODriver* posix_aio_init(void);
+
+#endif /* QEMU_BLOCK_AIO_H */
diff --git a/block-raw-posix.c b/block-raw-posix.c
index 41f9976..cab7094 100644
--- a/block-raw-posix.c
+++ b/block-raw-posix.c
@@ -25,11 +25,8 @@
 #include "qemu-timer.h"
 #include "qemu-char.h"
 #include "block_int.h"
-#include "compatfd.h"
+#include "block-aio.h"
 #include <assert.h>
-#ifdef CONFIG_AIO
-#include <aio.h>
-#endif
 
 #ifdef CONFIG_COCOA
 #include <paths.h>
@@ -84,25 +81,6 @@
    reopen it to see if the disk has been changed */
 #define FD_OPEN_TIMEOUT 1000
 
-typedef struct BDRVRawState {
-    int fd;
-    int type;
-    unsigned int lseek_err_cnt;
-#if defined(__linux__)
-    /* linux floppy specific */
-    int fd_open_flags;
-    int64_t fd_open_time;
-    int64_t fd_error_time;
-    int fd_got_error;
-    int fd_media_changed;
-#endif
-#if defined(O_DIRECT)
-    uint8_t* aligned_buf;
-#endif
-} BDRVRawState;
-
-static int posix_aio_init(void);
-
 static int fd_open(BlockDriverState *bs);
 
 static int raw_open(BlockDriverState *bs, const char *filename, int flags)
@@ -110,8 +88,6 @@ static int raw_open(BlockDriverState *bs, const char *filename, int flags)
     BDRVRawState *s = bs->opaque;
     int fd, open_flags, ret;
 
-    posix_aio_init();
-
     s->lseek_err_cnt = 0;
 
     open_flags = O_BINARY;
@@ -149,6 +125,8 @@ static int raw_open(BlockDriverState *bs, const char *filename, int flags)
         }
     }
 #endif
+    /* init aio driver for this block device */
+    s->aio_dvr = posix_aio_init();
     return 0;
 }
 
@@ -429,166 +407,6 @@ static int raw_pwrite(BlockDriverState *bs, int64_t offset,
 #define raw_pwrite raw_pwrite_aligned
 #endif
 
-
-#ifdef CONFIG_AIO
-/***********************************************************/
-/* Unix AIO using POSIX AIO */
-
-typedef struct RawAIOCB {
-    BlockDriverAIOCB common;
-    struct aiocb aiocb;
-    struct RawAIOCB *next;
-    int ret;
-} RawAIOCB;
-
-typedef struct PosixAioState
-{
-    int fd;
-    RawAIOCB *first_aio;
-} PosixAioState;
-
-static void posix_aio_read(void *opaque)
-{
-    PosixAioState *s = opaque;
-    RawAIOCB *acb, **pacb;
-    int ret;
-    size_t offset;
-    union {
-        struct qemu_signalfd_siginfo siginfo;
-        char buf[128];
-    } sig;
-
-    /* try to read from signalfd, don't freak out if we can't read anything */
-    offset = 0;
-    while (offset < 128) {
-        ssize_t len;
-
-        len = read(s->fd, sig.buf + offset, 128 - offset);
-        if (len == -1 && errno == EINTR)
-            continue;
-        if (len == -1 && errno == EAGAIN) {
-            /* there is no natural reason for this to happen,
-             * so we'll spin hard until we get everything just
-             * to be on the safe side. */
-            if (offset > 0)
-                continue;
-        }
-
-        offset += len;
-    }
-
-    for(;;) {
-        pacb = &s->first_aio;
-        for(;;) {
-            acb = *pacb;
-            if (!acb)
-                goto the_end;
-            ret = aio_error(&acb->aiocb);
-            if (ret == ECANCELED) {
-                /* remove the request */
-                *pacb = acb->next;
-                qemu_aio_release(acb);
-            } else if (ret != EINPROGRESS) {
-                /* end of aio */
-                if (ret == 0) {
-                    ret = aio_return(&acb->aiocb);
-                    if (ret == acb->aiocb.aio_nbytes)
-                        ret = 0;
-                    else
-                        ret = -EINVAL;
-                } else {
-                    ret = -ret;
-                }
-                /* remove the request */
-                *pacb = acb->next;
-                /* call the callback */
-                acb->common.cb(acb->common.opaque, ret);
-                qemu_aio_release(acb);
-                break;
-            } else {
-                pacb = &acb->next;
-            }
-        }
-    }
- the_end: ;
-}
-
-static int posix_aio_flush(void *opaque)
-{
-    PosixAioState *s = opaque;
-    return !!s->first_aio;
-}
-
-static PosixAioState *posix_aio_state;
-
-static int posix_aio_init(void)
-{
-    sigset_t mask;
-    PosixAioState *s;
-  
-    if (posix_aio_state)
-        return 0;
-
-    s = qemu_malloc(sizeof(PosixAioState));
-    if (s == NULL)
-        return -ENOMEM;
-
-    /* Make sure to block AIO signal */
-    sigemptyset(&mask);
-    sigaddset(&mask, SIGUSR2);
-    sigprocmask(SIG_BLOCK, &mask, NULL);
-    
-    s->first_aio = NULL;
-    s->fd = qemu_signalfd(&mask);
-
-    fcntl(s->fd, F_SETFL, O_NONBLOCK);
-
-    qemu_aio_set_fd_handler(s->fd, posix_aio_read, NULL, posix_aio_flush, s);
-
-#if defined(__GLIBC__) && defined(__linux__)
-    {
-        /* XXX: aio thread exit seems to hang on RedHat 9 and this init
-           seems to fix the problem. */
-        struct aioinit ai;
-        memset(&ai, 0, sizeof(ai));
-        ai.aio_threads = 1;
-        ai.aio_num = 1;
-        ai.aio_idle_time = 365 * 100000;
-        aio_init(&ai);
-    }
-#endif
-    posix_aio_state = s;
-
-    return 0;
-}
-
-static RawAIOCB *raw_aio_setup(BlockDriverState *bs,
-        int64_t sector_num, uint8_t *buf, int nb_sectors,
-        BlockDriverCompletionFunc *cb, void *opaque)
-{
-    BDRVRawState *s = bs->opaque;
-    RawAIOCB *acb;
-
-    if (fd_open(bs) < 0)
-        return NULL;
-
-    acb = qemu_aio_get(bs, cb, opaque);
-    if (!acb)
-        return NULL;
-    acb->aiocb.aio_fildes = s->fd;
-    acb->aiocb.aio_sigevent.sigev_signo = SIGUSR2;
-    acb->aiocb.aio_sigevent.sigev_notify = SIGEV_SIGNAL;
-    acb->aiocb.aio_buf = buf;
-    if (nb_sectors < 0)
-        acb->aiocb.aio_nbytes = -nb_sectors;
-    else
-        acb->aiocb.aio_nbytes = nb_sectors * 512;
-    acb->aiocb.aio_offset = sector_num * 512;
-    acb->next = posix_aio_state->first_aio;
-    posix_aio_state->first_aio = acb;
-    return acb;
-}
-
 static void raw_aio_em_cb(void* opaque)
 {
     RawAIOCB *acb = opaque;
@@ -601,14 +419,13 @@ static BlockDriverAIOCB *raw_aio_read(BlockDriverState *bs,
         BlockDriverCompletionFunc *cb, void *opaque)
 {
     RawAIOCB *acb;
+    BDRVRawState *s = bs->opaque;
 
     /*
      * If O_DIRECT is used and the buffer is not aligned fall back
      * to synchronous IO.
      */
 #if defined(O_DIRECT)
-    BDRVRawState *s = bs->opaque;
-
     if (unlikely(s->aligned_buf != NULL && ((uintptr_t) buf % 512))) {
         QEMUBH *bh;
         acb = qemu_aio_get(bs, cb, opaque);
@@ -619,13 +436,14 @@ static BlockDriverAIOCB *raw_aio_read(BlockDriverState *bs,
     }
 #endif
 
-    acb = raw_aio_setup(bs, sector_num, buf, nb_sectors, cb, opaque);
-    if (!acb)
+    if (fd_open(bs) < 0)
         return NULL;
-    if (aio_read(&acb->aiocb) < 0) {
-        qemu_aio_release(acb);
+
+    /* submit read */
+    acb = s->aio_dvr->submit(bs, s->fd, sector_num, buf, nb_sectors, 0, cb,
+                             opaque);
+    if (!acb)
         return NULL;
-    }
     return &acb->common;
 }
 
@@ -634,13 +452,13 @@ static BlockDriverAIOCB *raw_aio_write(BlockDriverState *bs,
         BlockDriverCompletionFunc *cb, void *opaque)
 {
     RawAIOCB *acb;
+    BDRVRawState *s = bs->opaque;
 
     /*
      * If O_DIRECT is used and the buffer is not aligned fall back
      * to synchronous IO.
      */
 #if defined(O_DIRECT)
-    BDRVRawState *s = bs->opaque;
 
     if (unlikely(s->aligned_buf != NULL && ((uintptr_t) buf % 512))) {
         QEMUBH *bh;
@@ -652,48 +470,19 @@ static BlockDriverAIOCB *raw_aio_write(BlockDriverState *bs,
     }
 #endif
 
-    acb = raw_aio_setup(bs, sector_num, (uint8_t*)buf, nb_sectors, cb, opaque);
+    /* submit write */
+    acb = s->aio_dvr->submit(bs, s->fd, sector_num, buf, nb_sectors, 1, cb,
+                             opaque);
     if (!acb)
         return NULL;
-    if (aio_write(&acb->aiocb) < 0) {
-        qemu_aio_release(acb);
-        return NULL;
-    }
     return &acb->common;
 }
 
 static void raw_aio_cancel(BlockDriverAIOCB *blockacb)
 {
-    int ret;
-    RawAIOCB *acb = (RawAIOCB *)blockacb;
-    RawAIOCB **pacb;
-
-    ret = aio_cancel(acb->aiocb.aio_fildes, &acb->aiocb);
-    if (ret == AIO_NOTCANCELED) {
-        /* fail safe: if the aio could not be canceled, we wait for
-           it */
-        while (aio_error(&acb->aiocb) == EINPROGRESS);
-    }
-
-    /* remove the callback from the queue */
-    pacb = &posix_aio_state->first_aio;
-    for(;;) {
-        if (*pacb == NULL) {
-            break;
-        } else if (*pacb == acb) {
-            *pacb = acb->next;
-            qemu_aio_release(acb);
-            break;
-        }
-        pacb = &acb->next;
-    }
-}
-
-#else /* CONFIG_AIO */
-static int posix_aio_init(void)
-{
+    BDRVRawState *s = blockacb->bs->opaque;
+    s->aio_dvr->cancel(blockacb);
 }
-#endif /* CONFIG_AIO */
 
 static void raw_close(BlockDriverState *bs)
 {
@@ -898,8 +687,6 @@ static int hdev_open(BlockDriverState *bs, const char *filename, int flags)
     BDRVRawState *s = bs->opaque;
     int fd, open_flags, ret;
 
-    posix_aio_init();
-
 #ifdef CONFIG_COCOA
     if (strstart(filename, "/dev/cdrom", NULL)) {
         kern_return_t kernResult;
@@ -969,6 +756,8 @@ static int hdev_open(BlockDriverState *bs, const char *filename, int flags)
         s->fd_media_changed = 1;
     }
 #endif
+    /* init aio driver for this block device */
+    s->aio_dvr = posix_aio_init();
     return 0;
 }
 

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [Qemu-devel] [PATCH 3/3] Add linux aio implementation for raw block devices
  2008-09-22 23:17 ` [Qemu-devel] " Ryan Harper
@ 2008-09-23  1:22   ` Ryan Harper
  0 siblings, 0 replies; 26+ messages in thread
From: Ryan Harper @ 2008-09-23  1:22 UTC (permalink / raw)
  To: qemu-devel; +Cc: Anthony Liguori, Ryan Harper, kvm

* Ryan Harper <ryanh@us.ibm.com> [2008-09-22 18:18]:
> This patch adds a linux aio raw block driver implementation.  If a raw block
> device is opened with cached=off (O_DIRECT) then we can utilize linux aio to
> submit io to/from the block device.  Utilizing linux aio allows for multiple
> outstanding requests to be in flight against the io device potentially providing
> higher IO throughput.  This implementation uses eventfd for event completion
> notification.

Hrm, git-send-email didn't seem to set the right subject for this email.
This is [PATCH 3/3], the linux aio implementation.

--
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
ryanh@us.ibm.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Qemu-devel] Re: [PATCH 1/3] Only call aio flush handler if set
  2008-09-22 23:17 ` [Qemu-devel] [PATCH 1/3] Only call aio flush handler if set Ryan Harper
@ 2008-09-23  2:38   ` Anthony Liguori
  2008-09-23 14:26     ` Ryan Harper
  0 siblings, 1 reply; 26+ messages in thread
From: Anthony Liguori @ 2008-09-23  2:38 UTC (permalink / raw)
  To: Ryan Harper; +Cc: qemu-devel, kvm

Ryan Harper wrote:
> If the aio handler doesn't register an io_flush handler, we'd SEGV; fix that by
> only calling the flush handler if set.  BTW, aio handlers *should* register an
> io_flush routine.
>
> Signed-off-by: Ryan Harper <ryanh@us.ibm.com>
>
> diff --git a/aio.c b/aio.c
> index 687e4be..2bb3ed4 100644
> --- a/aio.c
> +++ b/aio.c
> @@ -105,7 +105,8 @@ void qemu_aio_flush(void)
>          ret = 0;
>
>          LIST_FOREACH(node, &aio_handlers, node) {
> -            ret |= node->io_flush(node->opaque);
> +            if (node->io_flush)
> +                ret |= node->io_flush(node->opaque);
>          }
>   

Just not doing an io_flush is just hiding the real bug--that the user 
didn't register an io_flush handler.  If the inevitable SEGV is not your 
thing, I'd rather see a check added to qemu_aio_set_fd_handler() that 
threw an error and exited if called without an io_flush handler.

Regards,

Anthony Liguori

>          qemu_aio_wait();
>   

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Qemu-devel] Re: [PATCH 2/3] Move aio implementation out of raw block driver
  2008-09-22 23:17 ` [Qemu-devel] [PATCH 2/3] Move aio implementation out of raw block driver Ryan Harper
  2008-09-23  1:16   ` [Qemu-devel] " Ryan Harper
@ 2008-09-23  2:45   ` Anthony Liguori
  2008-09-23 14:39     ` Ryan Harper
  2008-09-24 22:31     ` Marcelo Tosatti
  1 sibling, 2 replies; 26+ messages in thread
From: Anthony Liguori @ 2008-09-23  2:45 UTC (permalink / raw)
  To: Ryan Harper; +Cc: Marcelo Tosatti, qemu-devel, kvm

Ryan Harper wrote:
> This patch moves the existing posix aio implementation out of block-raw-posix.c
> into aio-posix.c.  Added in a per-block device  aio driver abstraction.
> Block-raw-posix invokes the aio driver methods, .submit, .flush, and .cancel as
> needed.  aio-posix.c contains the posix aio implementation.
>
> The changes pave the way for other aio implementations, namely linux aio.  Each
> block device will init the proper aio driver depending on how the device is
> opened.
>
> Signed-off-by: Ryan Harper <ryanh@us.ibm.com>
>
> diff --git a/Makefile b/Makefile
> index de6393e..18477ba 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -60,7 +60,7 @@ BLOCK_OBJS += block-raw-posix.o
>  endif
>
>  ifdef CONFIG_AIO
> -BLOCK_OBJS += compatfd.o
> +BLOCK_OBJS += compatfd.o aio-posix.o
>  endif
>
>  ######################################################################
> diff --git a/Makefile.target b/Makefile.target
> index 4a490f4..4c6b3d5 100644
> --- a/Makefile.target
> +++ b/Makefile.target
> @@ -482,7 +482,7 @@ OBJS+=block-raw-posix.o
>  endif
>
>  ifdef CONFIG_AIO
> -OBJS+=compatfd.o
> +OBJS+=compatfd.o aio-posix.o
>  endif
>
>  LIBS+=-lz
> diff --git a/block-aio.h b/block-aio.h
> new file mode 100644
> index 0000000..b8597d0
> --- /dev/null
> +++ b/block-aio.h
> @@ -0,0 +1,78 @@
> +/*
> + * QEMU Block AIO API
> + *
> + * Copyright IBM, Corp. 2008
> + *
> + * Authors:
> + * Anthony Liguori <aliguori@us.ibm.com>
> + * Ryan Harper <ryanh@us.ibm.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2. See
> + * the COPYING file in the top-level directory.
> + *
> + */
> +
> +#ifndef QEMU_BLOCK_AIO_H
> +#define QEMU_BLOCK_AIO_H
> +
> +#include "qemu-common.h"
> +#include "block_int.h"
> +#include "block.h"
> +#include "qemu-aio.h"
> +#ifdef CONFIG_AIO
> +#include <aio.h>
> +#endif
> +
> +//#define DEBUG_BLOCK_AIO
> +#if defined(DEBUG_BLOCK_AIO)
> +#define BLPRINTF(formatCstr, args...) do { fprintf(stderr, formatCstr, ##args); fflush(stderr); } while (0)
>   

This is GCC syntax.  You should use C99 syntax.  That would be:

#define BLPRINTF(fmt, ...) do { fprintf(stderr, fmt, ## __VA_ARGS__); } 
while (0)

stderr is defined to not be buffered so an explicit fflush is 
unnecessary.  formatCstr is also a goofy name :-)

> +#else
> +#define BLPRINTF(formatCstr, args...)
>   

should be defined to do { } while (0) to maintain statement semantics.

> +#endif
> +
> +typedef struct RawAIOCB {
> +    BlockDriverAIOCB common;
> +    struct aiocb posix_aiocb;
> +    struct RawAIOCB *next;
> +    int ret;
> +} RawAIOCB;
>   

The whole small-object allocator for AIOCBs seems a bit of a premature 
optimization to me.  It makes this whole thing terribly awkward.  
Marcelo had a patch at one point to switch the small object allocate to 
just malloc/free.  Marcelo: any reason you didn't follow up with that patch?

> +typedef struct AIODriver
> +{
> +    const char *name;
> +    RawAIOCB *(*submit)(BlockDriverState *bs, int fd,
> +                                    int64_t sector_num, uint8_t *buf,
> +                                    int sectors, int write,
> +                                    BlockDriverCompletionFunc *cb,
> +                                    void *opaque);
> +    void (*cancel)(BlockDriverAIOCB *aiocb);
> +    int (*flush)(void *opaque);
> +} AIODriver;
>   

I think the AIODriver interface should just take an fd, a completion 
function, and an opaque pointer.  It should have to have knowledge of 
BDRVRawState or BlockDriverState.

> @@ -619,13 +436,14 @@ static BlockDriverAIOCB *raw_aio_read(BlockDriverState *bs,
>      }
>  #endif
>
> -    acb = raw_aio_setup(bs, sector_num, buf, nb_sectors, cb, opaque);
> -    if (!acb)
> +    if (fd_open(bs) < 0)
>          return NULL;
> -    if (aio_read(&acb->aiocb) < 0) {
> -        qemu_aio_release(acb);
> +
> +    /* submit read */
> +    acb = s->aio_dvr->submit(bs, s->fd, sector_num, buf, nb_sectors, 0, cb,
> +                             opaque);
> +    if (!acb)
>          return NULL;
> -    }
>      return &acb->common;
>  }
>   

So what happens if !defined(CONFIG_AIO)?  By my reading of the code, 
aio_drv will be NULL and this will SEGV.

> @@ -634,13 +452,13 @@ static BlockDriverAIOCB *raw_aio_write(BlockDriverState *bs,
>          BlockDriverCompletionFunc *cb, void *opaque)
>  {
>      RawAIOCB *acb;
> +    BDRVRawState *s = bs->opaque;
>
>      /*
>       * If O_DIRECT is used and the buffer is not aligned fall back
>       * to synchronous IO.
>       */
>  #if defined(O_DIRECT)
> -    BDRVRawState *s = bs->opaque;
>
>      if (unlikely(s->aligned_buf != NULL && ((uintptr_t) buf % 512))) {
>          QEMUBH *bh;
> @@ -652,48 +470,19 @@ static BlockDriverAIOCB *raw_aio_write(BlockDriverState *bs,
>      }
>  #endif
>
> -    acb = raw_aio_setup(bs, sector_num, (uint8_t*)buf, nb_sectors, cb, opaque);
> +    /* submit write */
> +    acb = s->aio_dvr->submit(bs, s->fd, sector_num, buf, nb_sectors, 1, cb,
> +                             opaque);
>      if (!acb)
>          return NULL;
> -    if (aio_write(&acb->aiocb) < 0) {
> -        qemu_aio_release(acb);
> -        return NULL;
> -    }
>      return &acb->common;
>  }
>
>  static void raw_aio_cancel(BlockDriverAIOCB *blockacb)
>  {
> -    int ret;
> -    RawAIOCB *acb = (RawAIOCB *)blockacb;
> -    RawAIOCB **pacb;
> -
> -    ret = aio_cancel(acb->aiocb.aio_fildes, &acb->aiocb);
> -    if (ret == AIO_NOTCANCELED) {
> -        /* fail safe: if the aio could not be canceled, we wait for
> -           it */
> -        while (aio_error(&acb->aiocb) == EINPROGRESS);
> -    }
> -
> -    /* remove the callback from the queue */
> -    pacb = &posix_aio_state->first_aio;
> -    for(;;) {
> -        if (*pacb == NULL) {
> -            break;
> -        } else if (*pacb == acb) {
> -            *pacb = acb->next;
> -            qemu_aio_release(acb);
> -            break;
> -        }
> -        pacb = &acb->next;
> -    }
> -}
> -
> -#else /* CONFIG_AIO */
> -static int posix_aio_init(void)
> -{
> +    BDRVRawState *s = blockacb->bs->opaque;
> +    s->aio_dvr->cancel(blockacb);
>  }
> -#endif /* CONFIG_AIO */
>
>  static void raw_close(BlockDriverState *bs)
>  {
> @@ -898,8 +687,6 @@ static int hdev_open(BlockDriverState *bs, const char *filename, int flags)
>      BDRVRawState *s = bs->opaque;
>      int fd, open_flags, ret;
>
> -    posix_aio_init();
> -
>  #ifdef CONFIG_COCOA
>      if (strstart(filename, "/dev/cdrom", NULL)) {
>          kern_return_t kernResult;
> @@ -969,6 +756,8 @@ static int hdev_open(BlockDriverState *bs, const char *filename, int flags)
>          s->fd_media_changed = 1;
>      }
>  #endif
> +    /* init aio driver for this block device */
> +    s->aio_dvr = posix_aio_init();
>   

Doesn't this need to be conditional on CONFIG_AIO?

Regards,

Anthony Liguori

>      return 0;
>  }
>
>   

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Qemu-devel] Re: [PATCH 0/3] Refactor AIO to allow multiple AIO implementations
  2008-09-22 23:17 [Qemu-devel] [PATCH 0/3] Refactor AIO to allow multiple AIO implementations Ryan Harper
                   ` (2 preceding siblings ...)
  2008-09-22 23:17 ` [Qemu-devel] " Ryan Harper
@ 2008-09-23  3:32 ` Anthony Liguori
  2008-09-23 14:43   ` Ryan Harper
  2008-09-23 10:27 ` [Qemu-devel] " Jamie Lokier
  2008-10-02 22:41 ` [Qemu-devel] " john cooper
  5 siblings, 1 reply; 26+ messages in thread
From: Anthony Liguori @ 2008-09-23  3:32 UTC (permalink / raw)
  To: Ryan Harper; +Cc: qemu-devel, kvm

[-- Attachment #1: Type: text/plain, Size: 3559 bytes --]

Can you run the same performance tests with the following patches (using 
sync=on instead of cache=off)?

You'll need my aio_init fix too.  I suspect this will give equally good 
performance to your patch set.  That's not saying your patch set isn't 
useful, but I would like to get performance to be better for the case 
that we're going through the page cache.

Regards,

Anthony Liguori

Ryan Harper wrote:
> The patchset adds additional AIO driver abstraction to the block raw driver to
> support multiple aio implementations for each device.  The first patch pulls
> the posix aio implementation out of the block-raw device using a generic call
> to the newly created AIO Driver structure.  The posix aio implementation that
> was contained in block-raw-posix.c has been refactored in to aio-posix.c.  The
> next patch adds a linux aio implementation for raw devices being opened
> O_DIRECT via cache=off drive option.  We only use linux aio when cache=off as
> linux aio falls back to synchronous ops if not opened with O_DIRECT flag.
>
> Addtional work has been done on top of QEMU for KVM and virtio-blk devices.
> While virtio-blk is not yet upstream in QEMU, the AIO changes here provide a
> tremendous performance improvement (from 7.6% of native, to 100% of randwrite,
> and 3.9% of native, to 101.4% of native for seq write) for virtio
> devices with cache=off.
>
> Storage subsystem:
>   IBM EXP300 - 14 Disk Fiber Expansion, 17G - 15K RPMS
>   Host: AMD Barcelona, 2 socket, 8G RAM
>   HBA: QLogic Corp. ISP2312-based 2Gb Fibre Channel to PCI-X HBA (rev 02)
>
> Benchmark[1]:
>            fio --name=guestrun --filename=/dev/mapper/volumes-fibre \
>                --rw=randwrite --bs=16k --ioengine=libaio --direct=1 \
>                --norandommap --runtime=120 --time_based --numjobs=1 \
>                --group_reporting --thread --size=25g --write_lat_log \
>                --write_bw_log --iodepth=74
>
> Qemu parameters:
>  -m 1024 \
>  -drive file=/images/npt2-guest-virtio.qcow2,if=ide,boot=on,snapshot=off \
>  -drive file=/dev/mapper/volumes-fibre,if=virtio,cache=(on|off) \
>  -drive file=/dev/mapper/volumes-npt2--dom1,if=virtio,cache=off
>  -net nic,macaddr=00:FF:FF:00:00:01,model=rtl8139 -net tap -vnc :123 \
>  -monitor stdio
>
> Guest io scheduler: noop
>
> Results:
>  These results are with the patch series applied to KVM (plus a small KVM only
>  change -- KVM patches forthcoming).
>
>
>  16k randwrite 1 thread, 74 iodepth | MB/s | avg sub lat (us) | avg comp lat (ms)
>  ---------------------------------------+---------------------+------------------
>  baremetal (O_DIRECT, aka cache=off)| 61.2 |   13.07          |  19.59
>  kvm: cache=off posix-aio w/o patch |  4.7 | 3467.44          | 254.08
>  kvm: cache=off linux-aio           | 61.1 |   75.35          |  19.57
>  kvm: cache=on  posix-aio w/o patch |127.0 |  115.78          |   9.19
>  kvm: cache=on  posix-aio w/ patch  |126.0 |   67.35          |   9.30
>
>
>  16k write 1 thread, 74 iodepth     | MB/s | avg sub lat (us) | avg comp lat (ms)
>  ---------------------------------------+---------------------+------------------
>  baremetal (O_DIRECT, aka cache=off)|128.1 |   10.90          |   9.45
>  kvm: cache=off posix-aio w/o patch |  5.1 | 3152.00          | 231.06 
>  kvm: cache=off linux-aio           |130.0 |   83.83          |   8.99
>  kvm: cache=on  posix-aio w/o patch |184.0 |   80.46          |   6.35
>  kvm: cache=on  posix-aio w/ patch  |165.0 |   70.90          |   7.09
>
>
> 1. http://brick.kernel.dk/snaps/fio-1.21.tar.bz2
>   


[-- Attachment #2: sync.patch --]
[-- Type: text/x-patch, Size: 2779 bytes --]

Index: vl.c
===================================================================
--- vl.c	(revision 5300)
+++ vl.c	(working copy)
@@ -5439,12 +5439,13 @@
     int max_devs;
     int index;
     int cache;
+    int sync;
     int bdrv_flags;
     char *str = arg->opt;
     static const char * const params[] = { "bus", "unit", "if", "index",
                                            "cyls", "heads", "secs", "trans",
                                            "media", "snapshot", "file",
-                                           "cache", "format", NULL };
+                                           "cache", "format", "sync", NULL };
 
     if (check_params(buf, sizeof(buf), params, str) < 0) {
          fprintf(stderr, "qemu: unknown parameter '%s' in '%s'\n",
@@ -5459,6 +5460,7 @@
     translation = BIOS_ATA_TRANSLATION_AUTO;
     index = -1;
     cache = 1;
+    sync = 0;
 
     if (!strcmp(machine->name, "realview") ||
         !strcmp(machine->name, "SS-5") ||
@@ -5612,6 +5614,17 @@
         }
     }
 
+    if (get_param_value(buf, sizeof(buf), "sync", str)) {
+        if (!strcmp(buf, "off"))
+            sync = 0;
+        else if (!strcmp(buf, "on"))
+            sync = 1;
+        else {
+           fprintf(stderr, "qemu: invalid sync option\n");
+           return -1;
+        }
+    }
+
     if (get_param_value(buf, sizeof(buf), "format", str)) {
        if (strcmp(buf, "?") == 0) {
             fprintf(stderr, "qemu: Supported formats:");
@@ -5728,6 +5741,8 @@
         bdrv_flags |= BDRV_O_SNAPSHOT;
     if (!cache)
         bdrv_flags |= BDRV_O_DIRECT;
+    if (sync)
+        bdrv_flags |= BDRV_O_SYNC;
     if (bdrv_open2(bdrv, file, bdrv_flags, drv) < 0 || qemu_key_check(bdrv, file)) {
         fprintf(stderr, "qemu: could not open disk image %s\n",
                         file);
Index: block-raw-posix.c
===================================================================
--- block-raw-posix.c	(revision 5304)
+++ block-raw-posix.c	(working copy)
@@ -127,6 +127,8 @@
     if (flags & BDRV_O_DIRECT)
         open_flags |= O_DIRECT;
 #endif
+    if (flags & BDRV_O_SYNC)
+        open_flags |= O_SYNC;
 
     s->type = FTYPE_FILE;
 
@@ -937,6 +939,8 @@
     if (flags & BDRV_O_DIRECT)
         open_flags |= O_DIRECT;
 #endif
+    if (flags & BDRV_O_SYNC)
+        open_flags |= O_SYNC;
 
     s->type = FTYPE_FILE;
 #if defined(__linux__)
Index: block.h
===================================================================
--- block.h	(revision 5300)
+++ block.h	(working copy)
@@ -48,6 +48,7 @@
                                      it (default for
                                      bdrv_file_open()) */
 #define BDRV_O_DIRECT      0x0020
+#define BDRV_O_SYNC        0x0040
 
 void bdrv_info(void);
 void bdrv_info_stats(void);

[-- Attachment #3: posix-aio-dup.patch --]
[-- Type: text/x-patch, Size: 5012 bytes --]

Index: block-raw-posix.c
===================================================================
--- block-raw-posix.c	(revision 5304)
+++ block-raw-posix.c	(working copy)
@@ -84,10 +84,20 @@
    reopen it to see if the disk has been changed */
 #define FD_OPEN_TIMEOUT 1000
 
+#define RAW_FD_POOL_SIZE	4
+
+typedef struct RawFdPoolEntry
+{
+    int fd;
+    int inuse;
+} RawFdPoolEntry;
+
 typedef struct BDRVRawState {
     int fd;
     int type;
     unsigned int lseek_err_cnt;
+    int fd0_inuse;
+    RawFdPoolEntry fd_pool[RAW_FD_POOL_SIZE];
 #if defined(__linux__)
     /* linux floppy specific */
     int fd_open_flags;
@@ -109,6 +119,7 @@
 {
     BDRVRawState *s = bs->opaque;
     int fd, open_flags, ret;
+    int i;
 
     posix_aio_init();
 
@@ -138,6 +149,11 @@
         return ret;
     }
     s->fd = fd;
+    for (i = 0; i < RAW_FD_POOL_SIZE; i++) {
+        s->fd_pool[i].fd = -1;
+        s->fd_pool[i].inuse = 0;
+    }
+    s->fd0_inuse = 0;
 #if defined(O_DIRECT)
     s->aligned_buf = NULL;
     if (flags & BDRV_O_DIRECT) {
@@ -436,6 +452,7 @@
 
 typedef struct RawAIOCB {
     BlockDriverAIOCB common;
+    int fd;
     struct aiocb aiocb;
     struct RawAIOCB *next;
     int ret;
@@ -447,6 +464,52 @@
     RawAIOCB *first_aio;
 } PosixAioState;
 
+static int raw_fd_pool_get(BDRVRawState *s)
+{
+    if (s->fd0_inuse) {
+        int i;
+
+        for (i = 0; i < RAW_FD_POOL_SIZE; i++) {
+            if (s->fd_pool[i].fd == -1) {
+                s->fd_pool[i].fd = dup(s->fd);
+                if (s->fd_pool[i].fd == -1)
+                    continue;
+                s->fd_pool[i].inuse = 0;
+            }
+
+            if (!s->fd_pool[i].inuse) {
+                s->fd_pool[i].inuse++;
+                return s->fd_pool[i].fd;
+            }
+        }
+    }
+    s->fd0_inuse++;
+
+    return s->fd;
+}
+
+static void raw_fd_pool_put(RawAIOCB *acb)
+{
+    BDRVRawState *s = acb->common.bs->opaque;
+    int fd = acb->fd;
+    int i;
+
+    if (s->fd == fd) {
+        s->fd0_inuse--;
+        return;
+    }
+
+    for (i = 0; i < RAW_FD_POOL_SIZE; i++) {
+        if (s->fd_pool[i].fd == fd) {
+            s->fd_pool[i].inuse--;
+            if (s->fd_pool[i].inuse == 0) {
+                close(s->fd_pool[i].fd);
+                s->fd_pool[i].fd = -1;
+            }
+        }
+    }
+}
+
 static void posix_aio_read(void *opaque)
 {
     PosixAioState *s = opaque;
@@ -487,6 +550,7 @@
             if (ret == ECANCELED) {
                 /* remove the request */
                 *pacb = acb->next;
+                raw_fd_pool_put(acb);
                 qemu_aio_release(acb);
             } else if (ret != EINPROGRESS) {
                 /* end of aio */
@@ -503,6 +567,7 @@
                 *pacb = acb->next;
                 /* call the callback */
                 acb->common.cb(acb->common.opaque, ret);
+                raw_fd_pool_put(acb);
                 qemu_aio_release(acb);
                 break;
             } else {
@@ -575,7 +640,8 @@
     acb = qemu_aio_get(bs, cb, opaque);
     if (!acb)
         return NULL;
-    acb->aiocb.aio_fildes = s->fd;
+    acb->fd = raw_fd_pool_get(s);
+    acb->aiocb.aio_fildes = acb->fd;
     acb->aiocb.aio_sigevent.sigev_signo = SIGUSR2;
     acb->aiocb.aio_sigevent.sigev_notify = SIGEV_SIGNAL;
     acb->aiocb.aio_buf = buf;
@@ -682,6 +748,7 @@
             break;
         } else if (*pacb == acb) {
             *pacb = acb->next;
+            raw_fd_pool_put(acb);
             qemu_aio_release(acb);
             break;
         }
@@ -698,6 +765,7 @@
 static void raw_close(BlockDriverState *bs)
 {
     BDRVRawState *s = bs->opaque;
+    int i;
     if (s->fd >= 0) {
         close(s->fd);
         s->fd = -1;
@@ -706,6 +774,10 @@
             qemu_free(s->aligned_buf);
 #endif
     }
+    for (i = 0; i < RAW_FD_POOL_SIZE; i++) {
+        if (s->fd_pool[i].fd != -1)
+            close(s->fd_pool[i].fd);
+    } 
 }
 
 static int raw_truncate(BlockDriverState *bs, int64_t offset)
@@ -973,6 +1045,18 @@
 }
 
 #if defined(__linux__)
+static void raw_invalidate_fd_pool(BDRVRawState *s)
+{
+    int i;
+    for (i = 0; i < RAW_FD_POOL_SIZE; i++) {
+        if (s->fd_pool[i].fd != -1) {
+            close(s->fd_pool[i].fd);
+            s->fd_pool[i].fd = -1;
+            s->fd_pool[i].inuse = 0;
+        }
+    }
+    s->fd0_inuse = 0;
+}
 
 /* Note: we do not have a reliable method to detect if the floppy is
    present. The current method is to try to open the floppy at every
@@ -989,6 +1073,7 @@
         (qemu_get_clock(rt_clock) - s->fd_open_time) >= FD_OPEN_TIMEOUT) {
         close(s->fd);
         s->fd = -1;
+        raw_invalidate_fd_pool(s);
 #ifdef DEBUG_FLOPPY
         printf("Floppy closed\n");
 #endif
@@ -1089,6 +1174,7 @@
             if (s->fd >= 0) {
                 close(s->fd);
                 s->fd = -1;
+                raw_invalidate_fd_pool(s);
             }
             fd = open(bs->filename, s->fd_open_flags | O_NONBLOCK);
             if (fd >= 0) {

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH 0/3] Refactor AIO to allow multiple AIO implementations
  2008-09-22 23:17 [Qemu-devel] [PATCH 0/3] Refactor AIO to allow multiple AIO implementations Ryan Harper
                   ` (3 preceding siblings ...)
  2008-09-23  3:32 ` [Qemu-devel] Re: [PATCH 0/3] Refactor AIO to allow multiple AIO implementations Anthony Liguori
@ 2008-09-23 10:27 ` Jamie Lokier
  2008-10-02 22:41 ` [Qemu-devel] " john cooper
  5 siblings, 0 replies; 26+ messages in thread
From: Jamie Lokier @ 2008-09-23 10:27 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, kvm

Ryan Harper wrote:
> Addtional work has been done on top of QEMU for KVM and virtio-blk
> devices.  While virtio-blk is not yet upstream in QEMU, the AIO
> changes here provide a tremendous performance improvement (from 7.6%
> of native, to 100% of randwrite, and 3.9% of native, to 101.4% of
> native for seq write) for virtio devices with cache=off.

Is the improvement due to using linux-aio instead of (Glibc's) posix-aio?

Thanks,
-- Jamie

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] Re: [PATCH 1/3] Only call aio flush handler if set
  2008-09-23  2:38   ` [Qemu-devel] " Anthony Liguori
@ 2008-09-23 14:26     ` Ryan Harper
  2008-09-23 14:34       ` Anthony Liguori
  0 siblings, 1 reply; 26+ messages in thread
From: Ryan Harper @ 2008-09-23 14:26 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Ryan Harper, qemu-devel, kvm

* Anthony Liguori <aliguori@us.ibm.com> [2008-09-22 21:49]:
> Ryan Harper wrote:
> >If the aio handler doesn't register an io_flush handler, we'd SEGV; fix 
> >that by
> >only calling the flush handler if set.  BTW, aio handlers *should* 
> >register an
> >io_flush routine.
> >
> >Signed-off-by: Ryan Harper <ryanh@us.ibm.com>
> >
> >diff --git a/aio.c b/aio.c
> >index 687e4be..2bb3ed4 100644
> >--- a/aio.c
> >+++ b/aio.c
> >@@ -105,7 +105,8 @@ void qemu_aio_flush(void)
> >         ret = 0;
> >
> >         LIST_FOREACH(node, &aio_handlers, node) {
> >-            ret |= node->io_flush(node->opaque);
> >+            if (node->io_flush)
> >+                ret |= node->io_flush(node->opaque);
> >         }
> >  
> 
> Just not doing an io_flush is just hiding the real bug--that the user 
> didn't register an io_flush handler.  If the inevitable SEGV is not your 

That may be true, but it it is no different than the check for read and
write handlers in qemu_aio_wait().

> thing, I'd rather see a check added to qemu_aio_set_fd_handler() that 
> threw an error and exited if called without an io_flush handler.

I'm not sure that is useful since it isn't always obvious whether or not
an aio handler needs to register all of the functions, read, write and
flush.

-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
ryanh@us.ibm.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] Re: [PATCH 1/3] Only call aio flush handler if set
  2008-09-23 14:26     ` Ryan Harper
@ 2008-09-23 14:34       ` Anthony Liguori
  2008-09-23 14:41         ` Ryan Harper
  0 siblings, 1 reply; 26+ messages in thread
From: Anthony Liguori @ 2008-09-23 14:34 UTC (permalink / raw)
  To: Ryan Harper; +Cc: qemu-devel, kvm

Ryan Harper wrote:
> * Anthony Liguori <aliguori@us.ibm.com> [2008-09-22 21:49]:
>   
>> Ryan Harper wrote:
>>     
>>> If the aio handler doesn't register an io_flush handler, we'd SEGV; fix 
>>> that by
>>> only calling the flush handler if set.  BTW, aio handlers *should* 
>>> register an
>>> io_flush routine.
>>>
>>> Signed-off-by: Ryan Harper <ryanh@us.ibm.com>
>>>
>>> diff --git a/aio.c b/aio.c
>>> index 687e4be..2bb3ed4 100644
>>> --- a/aio.c
>>> +++ b/aio.c
>>> @@ -105,7 +105,8 @@ void qemu_aio_flush(void)
>>>         ret = 0;
>>>
>>>         LIST_FOREACH(node, &aio_handlers, node) {
>>> -            ret |= node->io_flush(node->opaque);
>>> +            if (node->io_flush)
>>> +                ret |= node->io_flush(node->opaque);
>>>         }
>>>  
>>>       
>> Just not doing an io_flush is just hiding the real bug--that the user 
>> didn't register an io_flush handler.  If the inevitable SEGV is not your 
>>     
>
> That may be true, but it it is no different than the check for read and
> write handlers in qemu_aio_wait().
>   

Read and write handlers are optional.  I guess in practice one or the 
other should be set but neither one is individually required.  The 
problem with your patch is that it takes something that is a bug, and 
makes it more difficult to spot.  So it actually makes things worse.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] Re: [PATCH 2/3] Move aio implementation out of raw block driver
  2008-09-23  2:45   ` Anthony Liguori
@ 2008-09-23 14:39     ` Ryan Harper
  2008-09-23 14:40       ` Anthony Liguori
  2008-09-23 14:53       ` Gerd Hoffmann
  2008-09-24 22:31     ` Marcelo Tosatti
  1 sibling, 2 replies; 26+ messages in thread
From: Ryan Harper @ 2008-09-23 14:39 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Ryan Harper, qemu-devel, kvm, Marcelo Tosatti

* Anthony Liguori <aliguori@us.ibm.com> [2008-09-22 21:52]:
> Ryan Harper wrote:

> >+//#define DEBUG_BLOCK_AIO
> >+#if defined(DEBUG_BLOCK_AIO)
> >+#define BLPRINTF(formatCstr, args...) do { fprintf(stderr, formatCstr, 
> >##args); fflush(stderr); } while (0)
> >  
> 
> This is GCC syntax.  You should use C99 syntax.  That would be:
> 
> #define BLPRINTF(fmt, ...) do { fprintf(stderr, fmt, ## __VA_ARGS__); } 
> while (0)
> 
> stderr is defined to not be buffered so an explicit fflush is 
> unnecessary.  formatCstr is also a goofy name :-)
> 
> >+#else
> >+#define BLPRINTF(formatCstr, args...)
> >  
> 
> should be defined to do { } while (0) to maintain statement semantics.

This was taken from block-raw-posix.c, DEBUG_BLOCK.  I'll change up the
DEBUG_BLOCK_AIO, should I also submit a patch to rework DEBUG_BLOCK?

> >+typedef struct AIODriver
> >+{
> >+    const char *name;
> >+    RawAIOCB *(*submit)(BlockDriverState *bs, int fd,
> >+                                    int64_t sector_num, uint8_t *buf,
> >+                                    int sectors, int write,
> >+                                    BlockDriverCompletionFunc *cb,
> >+                                    void *opaque);
> >+    void (*cancel)(BlockDriverAIOCB *aiocb);
> >+    int (*flush)(void *opaque);
> >+} AIODriver;
> >  
> 
> I think the AIODriver interface should just take an fd, a completion 
> function, and an opaque pointer.  It should have to have knowledge of 
> BDRVRawState or BlockDriverState.

I assume you mean shouldn't have to have.  Looking at things like
pa_read() in aio-posix.c , I'm not sure I see how I avoid that
knowledge.

> >*raw_aio_read(BlockDriverState *bs,
> >     }
> > #endif
> >
> >-    acb = raw_aio_setup(bs, sector_num, buf, nb_sectors, cb, opaque);
> >-    if (!acb)
> >+    if (fd_open(bs) < 0)
> >         return NULL;
> >-    if (aio_read(&acb->aiocb) < 0) {
> >-        qemu_aio_release(acb);
> >+
> >+    /* submit read */
> >+    acb = s->aio_dvr->submit(bs, s->fd, sector_num, buf, nb_sectors, 0, 
> >cb,
> >+                             opaque);
> >+    if (!acb)
> >         return NULL;
> >-    }
> >     return &acb->common;
> > }
> >  
> 
> So what happens if !defined(CONFIG_AIO)?  By my reading of the code, 
> aio_drv will be NULL and this will SEGV.

raw_aio_read/write/cancel aren't included in the bdrv structure unless
CONFIG_AIO is defined.  Rather in bdrv_register, the aio emulation
functions are used instead.

> >+    /* init aio driver for this block device */
> >+    s->aio_dvr = posix_aio_init();
> >  
> 
> Doesn't this need to be conditional on CONFIG_AIO?

Yep, in both raw_open and hdev_open().  


Thanks for the review.

-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
ryanh@us.ibm.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] Re: [PATCH 2/3] Move aio implementation out of raw block driver
  2008-09-23 14:39     ` Ryan Harper
@ 2008-09-23 14:40       ` Anthony Liguori
  2008-09-23 14:53       ` Gerd Hoffmann
  1 sibling, 0 replies; 26+ messages in thread
From: Anthony Liguori @ 2008-09-23 14:40 UTC (permalink / raw)
  To: Ryan Harper; +Cc: Marcelo Tosatti, qemu-devel, kvm

Ryan Harper wrote:
> * Anthony Liguori <aliguori@us.ibm.com> [2008-09-22 21:52]:
>   
>> Ryan Harper wrote:
>>     
>
>   
> This was taken from block-raw-posix.c, DEBUG_BLOCK.  I'll change up the
> DEBUG_BLOCK_AIO, should I also submit a patch to rework DEBUG_BLOCK?
>   

Yes, please.

>>> +typedef struct AIODriver
>>> +{
>>> +    const char *name;
>>> +    RawAIOCB *(*submit)(BlockDriverState *bs, int fd,
>>> +                                    int64_t sector_num, uint8_t *buf,
>>> +                                    int sectors, int write,
>>> +                                    BlockDriverCompletionFunc *cb,
>>> +                                    void *opaque);
>>> +    void (*cancel)(BlockDriverAIOCB *aiocb);
>>> +    int (*flush)(void *opaque);
>>> +} AIODriver;
>>>  
>>>       
>> I think the AIODriver interface should just take an fd, a completion 
>> function, and an opaque pointer.  It should have to have knowledge of 
>> BDRVRawState or BlockDriverState.
>>     
>
> I assume you mean shouldn't have to have.  Looking at things like
> pa_read() in aio-posix.c , I'm not sure I see how I avoid that
> knowledge.
>   

I think the key is to change the way AIOCBs are allocated.

>>> *raw_aio_read(BlockDriverState *bs,
>>>     }
>>> #endif
>>>
>>> -    acb = raw_aio_setup(bs, sector_num, buf, nb_sectors, cb, opaque);
>>> -    if (!acb)
>>> +    if (fd_open(bs) < 0)
>>>         return NULL;
>>> -    if (aio_read(&acb->aiocb) < 0) {
>>> -        qemu_aio_release(acb);
>>> +
>>> +    /* submit read */
>>> +    acb = s->aio_dvr->submit(bs, s->fd, sector_num, buf, nb_sectors, 0, 
>>> cb,
>>> +                             opaque);
>>> +    if (!acb)
>>>         return NULL;
>>> -    }
>>>     return &acb->common;
>>> }
>>>  
>>>       
>> So what happens if !defined(CONFIG_AIO)?  By my reading of the code, 
>> aio_drv will be NULL and this will SEGV.
>>     
>
> raw_aio_read/write/cancel aren't included in the bdrv structure unless
> CONFIG_AIO is defined.  Rather in bdrv_register, the aio emulation
> functions are used instead.
>   

So these will give warnings then of unused statics?  Because they are no 
longer conditional on CONFIG_AIO?

Regards,

Anthony Liguori

>>> +    /* init aio driver for this block device */
>>> +    s->aio_dvr = posix_aio_init();
>>>  
>>>       
>> Doesn't this need to be conditional on CONFIG_AIO?
>>     
>
> Yep, in both raw_open and hdev_open().  
>
>
> Thanks for the review.
>
>   

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] Re: [PATCH 1/3] Only call aio flush handler if set
  2008-09-23 14:34       ` Anthony Liguori
@ 2008-09-23 14:41         ` Ryan Harper
  2008-09-23 14:50           ` Anthony Liguori
  0 siblings, 1 reply; 26+ messages in thread
From: Ryan Harper @ 2008-09-23 14:41 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Ryan Harper, qemu-devel, kvm

* Anthony Liguori <aliguori@us.ibm.com> [2008-09-23 09:36]:
> Ryan Harper wrote:
> >* Anthony Liguori <aliguori@us.ibm.com> [2008-09-22 21:49]:
> >  
> >>Ryan Harper wrote:
> >>    
> >>>If the aio handler doesn't register an io_flush handler, we'd SEGV; fix 
> >>>that by
> >>>only calling the flush handler if set.  BTW, aio handlers *should* 
> >>>register an
> >>>io_flush routine.
> >>>
> >>>Signed-off-by: Ryan Harper <ryanh@us.ibm.com>
> >>>
> >>>diff --git a/aio.c b/aio.c
> >>>index 687e4be..2bb3ed4 100644
> >>>--- a/aio.c
> >>>+++ b/aio.c
> >>>@@ -105,7 +105,8 @@ void qemu_aio_flush(void)
> >>>        ret = 0;
> >>>
> >>>        LIST_FOREACH(node, &aio_handlers, node) {
> >>>-            ret |= node->io_flush(node->opaque);
> >>>+            if (node->io_flush)
> >>>+                ret |= node->io_flush(node->opaque);
> >>>        }
> >>> 
> >>>      
> >>Just not doing an io_flush is just hiding the real bug--that the user 
> >>didn't register an io_flush handler.  If the inevitable SEGV is not your 
> >>    
> >
> >That may be true, but it it is no different than the check for read and
> >write handlers in qemu_aio_wait().
> >  
> 
> Read and write handlers are optional.  I guess in practice one or the 
> other should be set but neither one is individually required.  The 
> problem with your patch is that it takes something that is a bug, and 
> makes it more difficult to spot.  So it actually makes things worse.

I disagree that anything is worse off by not SEGV'ing, In any case, what
do you want here?

Read or Write must be set along with flush or we error in fd
registration?


-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
ryanh@us.ibm.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] Re: [PATCH 0/3] Refactor AIO to allow multiple AIO implementations
  2008-09-23  3:32 ` [Qemu-devel] Re: [PATCH 0/3] Refactor AIO to allow multiple AIO implementations Anthony Liguori
@ 2008-09-23 14:43   ` Ryan Harper
  2008-09-23 14:47     ` Anthony Liguori
  2008-09-23 16:09     ` Anthony Liguori
  0 siblings, 2 replies; 26+ messages in thread
From: Ryan Harper @ 2008-09-23 14:43 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Ryan Harper, qemu-devel, kvm

* Anthony Liguori <aliguori@us.ibm.com> [2008-09-22 22:44]:
> Can you run the same performance tests with the following patches (using 
> sync=on instead of cache=off)?
> 
> You'll need my aio_init fix too.  I suspect this will give equally good 
> performance to your patch set.  That's not saying your patch set isn't 
> useful, but I would like to get performance to be better for the case 
> that we're going through the page cache.

I can run the test, but it is orthogonal to the patchset which is
focused on using O_DIRECT and linux-aio.


-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
ryanh@us.ibm.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] Re: [PATCH 0/3] Refactor AIO to allow multiple AIO implementations
  2008-09-23 14:43   ` Ryan Harper
@ 2008-09-23 14:47     ` Anthony Liguori
  2008-09-23 16:09     ` Anthony Liguori
  1 sibling, 0 replies; 26+ messages in thread
From: Anthony Liguori @ 2008-09-23 14:47 UTC (permalink / raw)
  To: Ryan Harper; +Cc: qemu-devel, kvm

Ryan Harper wrote:
> * Anthony Liguori <aliguori@us.ibm.com> [2008-09-22 22:44]:
>   
>> Can you run the same performance tests with the following patches (using 
>> sync=on instead of cache=off)?
>>
>> You'll need my aio_init fix too.  I suspect this will give equally good 
>> performance to your patch set.  That's not saying your patch set isn't 
>> useful, but I would like to get performance to be better for the case 
>> that we're going through the page cache.
>>     
>
> I can run the test, but it is orthogonal to the patchset which is
> focused on using O_DIRECT and linux-aio.
>   

Yes, I'm trying to understand where the performance is coming from.  The 
hunch is that supporting multiple simultaneous requests along is what 
did it.  O_SYNC should give similar completion behavior to O_DIRECT.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] Re: [PATCH 1/3] Only call aio flush handler if set
  2008-09-23 14:41         ` Ryan Harper
@ 2008-09-23 14:50           ` Anthony Liguori
  0 siblings, 0 replies; 26+ messages in thread
From: Anthony Liguori @ 2008-09-23 14:50 UTC (permalink / raw)
  To: Ryan Harper; +Cc: qemu-devel, kvm

Ryan Harper wrote:
> * Anthony Liguori <aliguori@us.ibm.com> [2008-09-23 09:36]:
>   
>
> I disagree that anything is worse off by not SEGV'ing, In any case, what
> do you want here?
>
> Read or Write must be set along with flush or we error in fd
> registration?
>   

Yes.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] Re: [PATCH 2/3] Move aio implementation out of raw block driver
  2008-09-23 14:39     ` Ryan Harper
  2008-09-23 14:40       ` Anthony Liguori
@ 2008-09-23 14:53       ` Gerd Hoffmann
  2008-09-23 16:06         ` Anthony Liguori
  1 sibling, 1 reply; 26+ messages in thread
From: Gerd Hoffmann @ 2008-09-23 14:53 UTC (permalink / raw)
  To: Ryan Harper; +Cc: Anthony Liguori, Marcelo Tosatti, qemu-devel, kvm

Ryan Harper wrote:
>> So what happens if !defined(CONFIG_AIO)?  By my reading of the code, 
>> aio_drv will be NULL and this will SEGV.
> 
> raw_aio_read/write/cancel aren't included in the bdrv structure unless
> CONFIG_AIO is defined.  Rather in bdrv_register, the aio emulation
> functions are used instead.

How about providing a aio interface implementation which simply uses
read/write syscalls (thereby not being really async obviously)?  Then
use that as fallback instead of aio emulation?  And also drop CONFIG_AIO
then?

cheers,
  Gerd

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] Re: [PATCH 2/3] Move aio implementation out of raw block driver
  2008-09-23 14:53       ` Gerd Hoffmann
@ 2008-09-23 16:06         ` Anthony Liguori
  2008-09-23 18:04           ` Gerd Hoffmann
  0 siblings, 1 reply; 26+ messages in thread
From: Anthony Liguori @ 2008-09-23 16:06 UTC (permalink / raw)
  To: Gerd Hoffmann; +Cc: Ryan Harper, qemu-devel, kvm, Marcelo Tosatti

Gerd Hoffmann wrote:
> Ryan Harper wrote:
>   
>>> So what happens if !defined(CONFIG_AIO)?  By my reading of the code, 
>>> aio_drv will be NULL and this will SEGV.
>>>       
>> raw_aio_read/write/cancel aren't included in the bdrv structure unless
>> CONFIG_AIO is defined.  Rather in bdrv_register, the aio emulation
>> functions are used instead.
>>     
>
> How about providing a aio interface implementation which simply uses
> read/write syscalls (thereby not being really async obviously)?  Then
> use that as fallback instead of aio emulation?  And also drop CONFIG_AIO
> then?
>   

Yeah, this is basically what block-raw-posix does today.  I was thinking 
the same thing.  I was also thinking that you could do an aio 
implementation for win32 and possibly reunify block-raw-posix and 
block-raw-linux.

But before going down this route, I want to see if linux-aio is really 
the right tool for the job.

Regards,

Anthony Liguori

> cheers,
>   Gerd
>   

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] Re: [PATCH 0/3] Refactor AIO to allow multiple AIO implementations
  2008-09-23 14:43   ` Ryan Harper
  2008-09-23 14:47     ` Anthony Liguori
@ 2008-09-23 16:09     ` Anthony Liguori
  1 sibling, 0 replies; 26+ messages in thread
From: Anthony Liguori @ 2008-09-23 16:09 UTC (permalink / raw)
  To: Ryan Harper; +Cc: qemu-devel, kvm

Ryan Harper wrote:
> * Anthony Liguori <aliguori@us.ibm.com> [2008-09-22 22:44]:
>   
>> Can you run the same performance tests with the following patches (using 
>> sync=on instead of cache=off)?
>>
>> You'll need my aio_init fix too.  I suspect this will give equally good 
>> performance to your patch set.  That's not saying your patch set isn't 
>> useful, but I would like to get performance to be better for the case 
>> that we're going through the page cache.
>>     
>
> I can run the test, but it is orthogonal to the patchset which is
> focused on using O_DIRECT and linux-aio.
>   

Actually, I'm now much more interested in using the fd_pool patch with 
cache=off.  Using it with the sync=on patch is interesting but I'm 
curious how close fd_poll + cache=off gets to linux-aio + cache=off.

Supporting linux-aio is going to be a royal pain.  I don't know how we 
can do a runtime probe of whether we support resfd or not.  A build time 
probe is going to be lame because we'll be relying on the glibc 
headers.  Plus, I'm really, really interested in avoiding the 
association of cache=off == better performance.

In theory, the dup() + posix-aio should do okay compared to a custom 
thread pool.  It should have slightly higher latency, but completion 
time should be pretty close.  That would let us hold off supporting a 
thread pool until we're ready to do zero-copy IO (which is the only 
argument for a thread pool verses posix-aio).

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] Re: [PATCH 2/3] Move aio implementation out of raw block driver
  2008-09-23 16:06         ` Anthony Liguori
@ 2008-09-23 18:04           ` Gerd Hoffmann
  2008-09-23 18:28             ` Anthony Liguori
  0 siblings, 1 reply; 26+ messages in thread
From: Gerd Hoffmann @ 2008-09-23 18:04 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Ryan Harper, qemu-devel, kvm, Marcelo Tosatti

Anthony Liguori wrote:
> Gerd Hoffmann wrote:
>> How about providing a aio interface implementation which simply uses
>> read/write syscalls (thereby not being really async obviously)?  Then
>> use that as fallback instead of aio emulation?  And also drop CONFIG_AIO
>> then?
> 
> Yeah, this is basically what block-raw-posix does today.  I was thinking
> the same thing.  I was also thinking that you could do an aio
> implementation for win32 and possibly reunify block-raw-posix and
> block-raw-linux.

Sure, that the next logical steps.  Later we can also convert all
block-* drivers to the new aio interface and subsequently drop alot of
dead block layer code.

> But before going down this route, I want to see if linux-aio is really
> the right tool for the job.

IMHO this all makes sense even in case linux-aio turns out to not be
worth it.

cheers,
  Gerd

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] Re: [PATCH 2/3] Move aio implementation out of raw block driver
  2008-09-23 18:04           ` Gerd Hoffmann
@ 2008-09-23 18:28             ` Anthony Liguori
  0 siblings, 0 replies; 26+ messages in thread
From: Anthony Liguori @ 2008-09-23 18:28 UTC (permalink / raw)
  To: Gerd Hoffmann; +Cc: Ryan Harper, qemu-devel, kvm, Marcelo Tosatti

Gerd Hoffmann wrote:
> Anthony Liguori wrote:
>   
>> Gerd Hoffmann wrote:
>>     
>>> How about providing a aio interface implementation which simply uses
>>> read/write syscalls (thereby not being really async obviously)?  Then
>>> use that as fallback instead of aio emulation?  And also drop CONFIG_AIO
>>> then?
>>>       
>> Yeah, this is basically what block-raw-posix does today.  I was thinking
>> the same thing.  I was also thinking that you could do an aio
>> implementation for win32 and possibly reunify block-raw-posix and
>> block-raw-linux.
>>     
>
> Sure, that the next logical steps.  Later we can also convert all
> block-* drivers to the new aio interface and subsequently drop alot of
> dead block layer code.
>   

Yup.

>> But before going down this route, I want to see if linux-aio is really
>> the right tool for the job.
>>     
>
> IMHO this all makes sense even in case linux-aio turns out to not be
> worth it.
>   

Yes, I agree.  For a bit of a spoiler, initial results are that 
cache=off + my fd_pool patch is getting equivalent performance to 
linux-aio so it's looking like we can avoid linux-aio for now.

Regards,

Anthony Liguori

> cheers,
>   Gerd
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>   

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Qemu-devel] Re: [PATCH 2/3] Move aio implementation out of raw block driver
  2008-09-23  2:45   ` Anthony Liguori
  2008-09-23 14:39     ` Ryan Harper
@ 2008-09-24 22:31     ` Marcelo Tosatti
  1 sibling, 0 replies; 26+ messages in thread
From: Marcelo Tosatti @ 2008-09-24 22:31 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Ryan Harper, qemu-devel, kvm

On Mon, Sep 22, 2008 at 09:45:29PM -0500, Anthony Liguori wrote:
>> +#endif
>> +
>> +typedef struct RawAIOCB {
>> +    BlockDriverAIOCB common;
>> +    struct aiocb posix_aiocb;
>> +    struct RawAIOCB *next;
>> +    int ret;
>> +} RawAIOCB;
>>   
>
> The whole small-object allocator for AIOCBs seems a bit of a premature  
> optimization to me.  It makes this whole thing terribly awkward.   
> Marcelo had a patch at one point to switch the small object allocate to  
> just malloc/free.  Marcelo: any reason you didn't follow up with that 
> patch?

You mean the free object caching? Yeah can't see much point in it.

Will look for it.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Qemu-devel] Re: [PATCH 0/3] Refactor AIO to allow multiple AIO implementations
  2008-09-22 23:17 [Qemu-devel] [PATCH 0/3] Refactor AIO to allow multiple AIO implementations Ryan Harper
                   ` (4 preceding siblings ...)
  2008-09-23 10:27 ` [Qemu-devel] " Jamie Lokier
@ 2008-10-02 22:41 ` john cooper
  2008-10-03 13:33   ` Ryan Harper
  5 siblings, 1 reply; 26+ messages in thread
From: john cooper @ 2008-10-02 22:41 UTC (permalink / raw)
  To: Ryan Harper; +Cc: john cooper, aliguori, qemu-devel, kvm

Ryan Harper wrote:

> Results:
>  These results are with the patch series applied to KVM (plus a small KVM only
>  change -- KVM patches forthcoming).

Is the above "KVM only change" a kernel-side kvm patch?
If so could you provide a pointer?  Didn't see anything
related AFAICT in kvm.git since then.

Thanks,

-john

-- 
john.cooper@third-harmonic.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Qemu-devel] Re: [PATCH 0/3] Refactor AIO to allow multiple AIO implementations
  2008-10-02 22:41 ` [Qemu-devel] " john cooper
@ 2008-10-03 13:33   ` Ryan Harper
  0 siblings, 0 replies; 26+ messages in thread
From: Ryan Harper @ 2008-10-03 13:33 UTC (permalink / raw)
  To: john cooper; +Cc: aliguori, Ryan Harper, qemu-devel, kvm

* john cooper <john.cooper@third-harmonic.com> [2008-10-02 18:51]:
> Ryan Harper wrote:
> 
> >Results:
> > These results are with the patch series applied to KVM (plus a small KVM 
> > only
> > change -- KVM patches forthcoming).
> 
> Is the above "KVM only change" a kernel-side kvm patch?
> If so could you provide a pointer?  Didn't see anything
> related AFAICT in kvm.git since then.

The main change isn't in the kernel portion of virtio,  but rather the
virtio-blk backend in qemu.  I've not submitted the patch to tune the
parameter yet since I was waiting for the aio changes to be committed
first.  They have, and I'll have something out RSN.  

-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
ryanh@us.ibm.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2008-10-03 13:35 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-09-22 23:17 [Qemu-devel] [PATCH 0/3] Refactor AIO to allow multiple AIO implementations Ryan Harper
2008-09-22 23:17 ` [Qemu-devel] [PATCH 1/3] Only call aio flush handler if set Ryan Harper
2008-09-23  2:38   ` [Qemu-devel] " Anthony Liguori
2008-09-23 14:26     ` Ryan Harper
2008-09-23 14:34       ` Anthony Liguori
2008-09-23 14:41         ` Ryan Harper
2008-09-23 14:50           ` Anthony Liguori
2008-09-22 23:17 ` [Qemu-devel] [PATCH 2/3] Move aio implementation out of raw block driver Ryan Harper
2008-09-23  1:16   ` [Qemu-devel] " Ryan Harper
2008-09-23  2:45   ` Anthony Liguori
2008-09-23 14:39     ` Ryan Harper
2008-09-23 14:40       ` Anthony Liguori
2008-09-23 14:53       ` Gerd Hoffmann
2008-09-23 16:06         ` Anthony Liguori
2008-09-23 18:04           ` Gerd Hoffmann
2008-09-23 18:28             ` Anthony Liguori
2008-09-24 22:31     ` Marcelo Tosatti
2008-09-22 23:17 ` [Qemu-devel] " Ryan Harper
2008-09-23  1:22   ` [Qemu-devel] [PATCH 3/3] Add linux aio implementation for raw block devices Ryan Harper
2008-09-23  3:32 ` [Qemu-devel] Re: [PATCH 0/3] Refactor AIO to allow multiple AIO implementations Anthony Liguori
2008-09-23 14:43   ` Ryan Harper
2008-09-23 14:47     ` Anthony Liguori
2008-09-23 16:09     ` Anthony Liguori
2008-09-23 10:27 ` [Qemu-devel] " Jamie Lokier
2008-10-02 22:41 ` [Qemu-devel] " john cooper
2008-10-03 13:33   ` Ryan Harper

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).