qemu-devel.nongnu.org archive mirror
* [Qemu-devel] [RFC PATCH 0/3] linux-aio: Convert to coroutines
@ 2014-11-26 14:46 Kevin Wolf
  2014-11-26 14:46 ` [Qemu-devel] [RFC PATCH 1/3] qemu-img bench Kevin Wolf
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Kevin Wolf @ 2014-11-26 14:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, ming.lei, pl, stefanha

This is my current branch for converting the linux-aio interface to
coroutines. I started this as a performance optimisation, but I think it
also makes it much easier to avoid the recursive coroutine reentrance
for which Ming Lei has sent a relatively complex callback-based patch.
See patch 3 for a quick attempt at fixing the same problem in a
coroutine-based linux-aio implementation.

Not ready to be merged yet, mainly because testing and benchmarking are
still incomplete.

Kevin Wolf (3):
  qemu-img bench
  raw-posix: Convert Linux AIO submission to coroutines
  linux-aio: Don't reenter request coroutine recursively

 block/linux-aio.c |  89 +++++++++++-----------------
 block/raw-aio.h   |   5 +-
 block/raw-posix.c |  62 +++++++++----------
 qemu-img-cmds.hx  |   6 ++
 qemu-img.c        | 174 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 qemu-img.texi     |  10 ++++
 6 files changed, 256 insertions(+), 90 deletions(-)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Qemu-devel] [RFC PATCH 1/3] qemu-img bench
  2014-11-26 14:46 [Qemu-devel] [RFC PATCH 0/3] linux-aio: Convert to coroutines Kevin Wolf
@ 2014-11-26 14:46 ` Kevin Wolf
  2014-11-28 11:49   ` Stefan Hajnoczi
  2014-11-26 14:46 ` [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines Kevin Wolf
  2014-11-26 14:46 ` [Qemu-devel] [RFC PATCH 3/3] linux-aio: Don't reenter request coroutine recursively Kevin Wolf
  2 siblings, 1 reply; 21+ messages in thread
From: Kevin Wolf @ 2014-11-26 14:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, ming.lei, pl, stefanha

This adds a qemu-img command that allows running simple benchmarks on
the block layer without involving guest devices or a real VM.

For a start, only a test of sequential reads is implemented.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 qemu-img-cmds.hx |   6 ++
 qemu-img.c       | 174 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 qemu-img.texi    |  10 ++++
 3 files changed, 190 insertions(+)

diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index 9567774..2b23b57 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -9,6 +9,12 @@ STEXI
 @table @option
 ETEXI
 
+DEF("bench", img_bench,
+    "bench [-c count] [-d depth] [-f fmt] [-n] [-q] [-s buffer_size] [-t cache] filename")
+STEXI
+@item bench [-c @var{count}] [-d @var{depth}] [-f @var{fmt}] [-n] [-q] [-s @var{buffer_size}] [-t @var{cache}] @var{filename}
+ETEXI
+
 DEF("check", img_check,
     "check [-q] [-f fmt] [--output=ofmt] [-r [leaks | all]] [-T src_cache] filename")
 STEXI
diff --git a/qemu-img.c b/qemu-img.c
index a42335c..c2388ad 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -3003,6 +3003,180 @@ out:
     return 0;
 }
 
+typedef struct BenchData {
+    BlockDriverState *bs;
+    int bufsize;
+    int nrreq;
+    int n;
+    uint8_t *buf;
+    QEMUIOVector *qiov;
+
+    int in_flight;
+    uint64_t sector;
+} BenchData;
+
+static void bench_cb(void *opaque, int ret)
+{
+    BenchData *b = opaque;
+    BlockAIOCB *acb;
+
+    if (ret < 0) {
+        error_report("Failed request: %s", strerror(-ret));
+        exit(EXIT_FAILURE);
+    }
+    if (b->in_flight > 0) {
+        b->n--;
+        b->in_flight--;
+    }
+
+    while (b->n > b->in_flight && b->in_flight < b->nrreq) {
+        acb = bdrv_aio_readv(b->bs, b->sector, b->qiov,
+                             b->bufsize >> BDRV_SECTOR_BITS,
+                             bench_cb, b);
+        if (!acb) {
+            error_report("Failed to issue request");
+            exit(EXIT_FAILURE);
+        }
+        b->in_flight++;
+        b->sector += b->bufsize;
+        b->sector %= b->bs->total_sectors;
+    }
+}
+
+static int img_bench(int argc, char **argv)
+{
+    int c, ret = 0;
+    const char *fmt = NULL, *filename;
+    bool quiet = false;
+    int count = 75000;
+    int depth = 64;
+    size_t bufsize = 4096;
+    BlockBackend *blk = NULL;
+    BlockDriverState *bs;
+    BenchData data = {};
+    int flags = BDRV_O_FLAGS;
+    struct timeval t1, t2;
+    int i;
+
+    for (;;) {
+        c = getopt(argc, argv, "hc:d:f:nqs:t:");
+        if (c == -1) {
+            break;
+        }
+
+        switch (c) {
+        case 'h':
+        case '?':
+            help();
+            break;
+        case 'c':
+        {
+            char *end;
+            errno = 0;
+            count = strtoul(optarg, &end, 0);
+            if (errno || *end || count > INT_MAX) {
+                error_report("Invalid request count specified");
+                return 1;
+            }
+            break;
+        }
+        case 'd':
+        {
+            char *end;
+            errno = 0;
+            depth = strtoul(optarg, &end, 0);
+            if (errno || *end || depth > INT_MAX) {
+                error_report("Invalid queue depth specified");
+                return 1;
+            }
+            break;
+        }
+        case 'f':
+            fmt = optarg;
+            break;
+        case 'n':
+            flags |= BDRV_O_NATIVE_AIO;
+            break;
+        case 'q':
+            quiet = true;
+            break;
+        case 's':
+        {
+            int64_t sval;
+            char *end;
+
+            sval = strtosz_suffix(optarg, &end, STRTOSZ_DEFSUFFIX_B);
+            if (sval < 0 || sval > INT_MAX || *end) {
+                error_report("Invalid buffer size specified");
+                return 1;
+            }
+
+            bufsize = sval;
+            break;
+        }
+        case 't':
+            ret = bdrv_parse_cache_flags(optarg, &flags);
+            if (ret < 0) {
+                error_report("Invalid cache mode");
+                ret = -1;
+                goto out;
+            }
+            break;
+        }
+    }
+
+    if (optind != argc - 1) {
+        error_exit("Expecting one image file name");
+    }
+    filename = argv[argc - 1];
+
+    blk = img_open("image", filename, fmt, flags, true, quiet);
+    if (!blk) {
+        ret = -1;
+        goto out;
+    }
+    bs = blk_bs(blk);
+
+    data = (BenchData) {
+        .bs = bs,
+        .bufsize = bufsize,
+        .nrreq = depth,
+        .n = count,
+    };
+    printf("Sending %d requests, %d bytes each, %d in parallel\n",
+        data.n, data.bufsize, data.nrreq);
+
+    data.buf = qemu_blockalign(bs, data.nrreq * data.bufsize);
+    data.qiov = g_new(QEMUIOVector, data.nrreq);
+    for (i = 0; i < data.nrreq; i++) {
+        qemu_iovec_init(&data.qiov[i], 1);
+        qemu_iovec_add(&data.qiov[i],
+                       data.buf + i * data.bufsize, data.bufsize);
+    }
+
+    gettimeofday(&t1, NULL);
+    bench_cb(&data, 0);
+
+    while (data.n > 0) {
+        main_loop_wait(false);
+    }
+    gettimeofday(&t2, NULL);
+
+    printf("Run completed in %3.3f seconds.\n",
+           (t2.tv_sec - t1.tv_sec)
+           + ((double)(t2.tv_usec - t1.tv_usec) / 1000000));
+
+out:
+    qemu_vfree(data.buf);
+    blk_unref(blk);
+
+    if (ret) {
+        return 1;
+    }
+    return 0;
+}
+
+
 static const img_cmd_t img_cmds[] = {
 #define DEF(option, callback, arg_string)        \
     { option, callback },
diff --git a/qemu-img.texi b/qemu-img.texi
index 0a1ab35..fc499cb 100644
--- a/qemu-img.texi
+++ b/qemu-img.texi
@@ -117,6 +117,16 @@ Skip the creation of the target volume
 Command description:
 
 @table @option
+@item bench [-c @var{count}] [-d @var{depth}] [-f @var{fmt}] [-n] [-q] [-s @var{buffer_size}] [-t @var{cache}] @var{filename}
+
+Run a simple sequential read benchmark on the specified image. A total number
+of @var{count} I/O requests is performed, each @var{buffer_size} bytes in size,
+and with @var{depth} requests in parallel.
+
+If @code{-n} is specified, the native AIO backend is used if possible. On
+Linux, this option has no effect unless @code{-t none} or @code{-t directsync}
+is specified as well.
+
 @item check [-f @var{fmt}] [--output=@var{ofmt}] [-r [leaks | all]] [-T @var{src_cache}] @var{filename}
 
 Perform a consistency check on the disk image @var{filename}. The command can
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines
  2014-11-26 14:46 [Qemu-devel] [RFC PATCH 0/3] linux-aio: Convert to coroutines Kevin Wolf
  2014-11-26 14:46 ` [Qemu-devel] [RFC PATCH 1/3] qemu-img bench Kevin Wolf
@ 2014-11-26 14:46 ` Kevin Wolf
  2014-11-27  9:50   ` Peter Lieven
  2014-11-28  2:59   ` Ming Lei
  2014-11-26 14:46 ` [Qemu-devel] [RFC PATCH 3/3] linux-aio: Don't reenter request coroutine recursively Kevin Wolf
  2 siblings, 2 replies; 21+ messages in thread
From: Kevin Wolf @ 2014-11-26 14:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, ming.lei, pl, stefanha

This improves request performance because an ACB no longer needs to be
allocated on the heap. It also makes the code nicer and smaller.

As a side effect, the codepath taken by aio=threads is changed to use
paio_submit_co(). This doesn't change the performance at this point.

Results of qemu-img bench -t none -c 10000000 [-n] /dev/loop0:

      |      aio=native       |     aio=threads
      | before   | with patch | before   | with patch
------+----------+------------+----------+------------
run 1 | 29.921s  | 26.932s    | 35.286s  | 35.447s
run 2 | 29.793s  | 26.252s    | 35.276s  | 35.111s
run 3 | 30.186s  | 27.114s    | 35.042s  | 34.921s
run 4 | 30.425s  | 26.600s    | 35.169s  | 34.968s
run 5 | 30.041s  | 26.263s    | 35.224s  | 35.000s

TODO: Do some more serious benchmarking in VMs with less variance.
Results of a quick fio run are vaguely positive.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/linux-aio.c | 70 +++++++++++++++++--------------------------------------
 block/raw-aio.h   |  5 ++--
 block/raw-posix.c | 62 ++++++++++++++++++++++--------------------------
 3 files changed, 52 insertions(+), 85 deletions(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index d92513b..99b259d 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -12,6 +12,7 @@
 #include "qemu/queue.h"
 #include "block/raw-aio.h"
 #include "qemu/event_notifier.h"
+#include "block/coroutine.h"
 
 #include <libaio.h>
 
@@ -28,7 +29,7 @@
 #define MAX_QUEUED_IO  128
 
 struct qemu_laiocb {
-    BlockAIOCB common;
+    Coroutine *co;
     struct qemu_laio_state *ctx;
     struct iocb iocb;
     ssize_t ret;
@@ -86,9 +87,9 @@ static void qemu_laio_process_completion(struct qemu_laio_state *s,
             }
         }
     }
-    laiocb->common.cb(laiocb->common.opaque, ret);
 
-    qemu_aio_unref(laiocb);
+    laiocb->ret = ret;
+    qemu_coroutine_enter(laiocb->co, NULL);
 }
 
 /* The completion BH fetches completed I/O requests and invokes their
@@ -146,30 +147,6 @@ static void qemu_laio_completion_cb(EventNotifier *e)
     }
 }
 
-static void laio_cancel(BlockAIOCB *blockacb)
-{
-    struct qemu_laiocb *laiocb = (struct qemu_laiocb *)blockacb;
-    struct io_event event;
-    int ret;
-
-    if (laiocb->ret != -EINPROGRESS) {
-        return;
-    }
-    ret = io_cancel(laiocb->ctx->ctx, &laiocb->iocb, &event);
-    laiocb->ret = -ECANCELED;
-    if (ret != 0) {
-        /* iocb is not cancelled, cb will be called by the event loop later */
-        return;
-    }
-
-    laiocb->common.cb(laiocb->common.opaque, laiocb->ret);
-}
-
-static const AIOCBInfo laio_aiocb_info = {
-    .aiocb_size         = sizeof(struct qemu_laiocb),
-    .cancel_async       = laio_cancel,
-};
-
 static void ioq_init(LaioQueue *io_q)
 {
     io_q->size = MAX_QUEUED_IO;
@@ -243,23 +220,21 @@ int laio_io_unplug(BlockDriverState *bs, void *aio_ctx, bool unplug)
     return ret;
 }
 
-BlockAIOCB *laio_submit(BlockDriverState *bs, void *aio_ctx, int fd,
-        int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
-        BlockCompletionFunc *cb, void *opaque, int type)
+int laio_submit_co(BlockDriverState *bs, void *aio_ctx, int fd,
+        int64_t sector_num, QEMUIOVector *qiov, int nb_sectors, int type)
 {
     struct qemu_laio_state *s = aio_ctx;
-    struct qemu_laiocb *laiocb;
-    struct iocb *iocbs;
     off_t offset = sector_num * 512;
 
-    laiocb = qemu_aio_get(&laio_aiocb_info, bs, cb, opaque);
-    laiocb->nbytes = nb_sectors * 512;
-    laiocb->ctx = s;
-    laiocb->ret = -EINPROGRESS;
-    laiocb->is_read = (type == QEMU_AIO_READ);
-    laiocb->qiov = qiov;
-
-    iocbs = &laiocb->iocb;
+    struct qemu_laiocb laiocb = {
+        .co = qemu_coroutine_self(),
+        .nbytes = nb_sectors * 512,
+        .ctx = s,
+        .is_read = (type == QEMU_AIO_READ),
+        .qiov = qiov,
+    };
+    struct iocb *iocbs = &laiocb.iocb;
+    int ret;
 
     switch (type) {
     case QEMU_AIO_WRITE:
@@ -272,22 +247,21 @@ BlockAIOCB *laio_submit(BlockDriverState *bs, void *aio_ctx, int fd,
     default:
         fprintf(stderr, "%s: invalid AIO request type 0x%x.\n",
                         __func__, type);
-        goto out_free_aiocb;
+        return -EIO;
     }
-    io_set_eventfd(&laiocb->iocb, event_notifier_get_fd(&s->e));
+    io_set_eventfd(&laiocb.iocb, event_notifier_get_fd(&s->e));
 
     if (!s->io_q.plugged) {
-        if (io_submit(s->ctx, 1, &iocbs) < 0) {
-            goto out_free_aiocb;
+        ret = io_submit(s->ctx, 1, &iocbs);
+        if (ret < 0) {
+            return ret;
         }
     } else {
         ioq_enqueue(s, iocbs);
     }
-    return &laiocb->common;
 
-out_free_aiocb:
-    qemu_aio_unref(laiocb);
-    return NULL;
+    qemu_coroutine_yield();
+    return laiocb.ret;
 }
 
 void laio_detach_aio_context(void *s_, AioContext *old_context)
diff --git a/block/raw-aio.h b/block/raw-aio.h
index 80681ce..b83c8af 100644
--- a/block/raw-aio.h
+++ b/block/raw-aio.h
@@ -35,9 +35,8 @@
 #ifdef CONFIG_LINUX_AIO
 void *laio_init(void);
 void laio_cleanup(void *s);
-BlockAIOCB *laio_submit(BlockDriverState *bs, void *aio_ctx, int fd,
-        int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
-        BlockCompletionFunc *cb, void *opaque, int type);
+int laio_submit_co(BlockDriverState *bs, void *aio_ctx, int fd,
+        int64_t sector_num, QEMUIOVector *qiov, int nb_sectors, int type);
 void laio_detach_aio_context(void *s, AioContext *old_context);
 void laio_attach_aio_context(void *s, AioContext *new_context);
 void laio_io_plug(BlockDriverState *bs, void *aio_ctx);
diff --git a/block/raw-posix.c b/block/raw-posix.c
index b1af77e..aa16b53 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -1075,14 +1075,13 @@ static BlockAIOCB *paio_submit(BlockDriverState *bs, int fd,
     return thread_pool_submit_aio(pool, aio_worker, acb, cb, opaque);
 }
 
-static BlockAIOCB *raw_aio_submit(BlockDriverState *bs,
-        int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
-        BlockCompletionFunc *cb, void *opaque, int type)
+static int coroutine_fn raw_co_rw(BlockDriverState *bs, int64_t sector_num,
+                                  int nb_sectors, QEMUIOVector *qiov, int type)
 {
     BDRVRawState *s = bs->opaque;
 
     if (fd_open(bs) < 0)
-        return NULL;
+        return -EIO;
 
     /*
      * Check if the underlying device requires requests to be aligned,
@@ -1095,14 +1094,25 @@ static BlockAIOCB *raw_aio_submit(BlockDriverState *bs,
             type |= QEMU_AIO_MISALIGNED;
 #ifdef CONFIG_LINUX_AIO
         } else if (s->use_aio) {
-            return laio_submit(bs, s->aio_ctx, s->fd, sector_num, qiov,
-                               nb_sectors, cb, opaque, type);
+            return laio_submit_co(bs, s->aio_ctx, s->fd, sector_num, qiov,
+                               nb_sectors, type);
 #endif
         }
     }
 
-    return paio_submit(bs, s->fd, sector_num, qiov, nb_sectors,
-                       cb, opaque, type);
+    return paio_submit_co(bs, s->fd, sector_num, qiov, nb_sectors, type);
+}
+
+static int coroutine_fn raw_co_readv(BlockDriverState *bs, int64_t sector_num,
+                                     int nb_sectors, QEMUIOVector *qiov)
+{
+    return raw_co_rw(bs, sector_num, nb_sectors, qiov, QEMU_AIO_READ);
+}
+
+static int coroutine_fn raw_co_writev(BlockDriverState *bs, int64_t sector_num,
+                                      int nb_sectors, QEMUIOVector *qiov)
+{
+    return raw_co_rw(bs, sector_num, nb_sectors, qiov, QEMU_AIO_WRITE);
 }
 
 static void raw_aio_plug(BlockDriverState *bs)
@@ -1135,22 +1145,6 @@ static void raw_aio_flush_io_queue(BlockDriverState *bs)
 #endif
 }
 
-static BlockAIOCB *raw_aio_readv(BlockDriverState *bs,
-        int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
-        BlockCompletionFunc *cb, void *opaque)
-{
-    return raw_aio_submit(bs, sector_num, qiov, nb_sectors,
-                          cb, opaque, QEMU_AIO_READ);
-}
-
-static BlockAIOCB *raw_aio_writev(BlockDriverState *bs,
-        int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
-        BlockCompletionFunc *cb, void *opaque)
-{
-    return raw_aio_submit(bs, sector_num, qiov, nb_sectors,
-                          cb, opaque, QEMU_AIO_WRITE);
-}
-
 static BlockAIOCB *raw_aio_flush(BlockDriverState *bs,
         BlockCompletionFunc *cb, void *opaque)
 {
@@ -1701,8 +1695,8 @@ static BlockDriver bdrv_file = {
     .bdrv_co_get_block_status = raw_co_get_block_status,
     .bdrv_co_write_zeroes = raw_co_write_zeroes,
 
-    .bdrv_aio_readv = raw_aio_readv,
-    .bdrv_aio_writev = raw_aio_writev,
+    .bdrv_co_readv          = raw_co_readv,
+    .bdrv_co_writev         = raw_co_writev,
     .bdrv_aio_flush = raw_aio_flush,
     .bdrv_aio_discard = raw_aio_discard,
     .bdrv_refresh_limits = raw_refresh_limits,
@@ -2103,8 +2097,8 @@ static BlockDriver bdrv_host_device = {
     .create_opts         = &raw_create_opts,
     .bdrv_co_write_zeroes = hdev_co_write_zeroes,
 
-    .bdrv_aio_readv	= raw_aio_readv,
-    .bdrv_aio_writev	= raw_aio_writev,
+    .bdrv_co_readv          = raw_co_readv,
+    .bdrv_co_writev         = raw_co_writev,
     .bdrv_aio_flush	= raw_aio_flush,
     .bdrv_aio_discard   = hdev_aio_discard,
     .bdrv_refresh_limits = raw_refresh_limits,
@@ -2252,8 +2246,8 @@ static BlockDriver bdrv_host_floppy = {
     .bdrv_create         = hdev_create,
     .create_opts         = &raw_create_opts,
 
-    .bdrv_aio_readv     = raw_aio_readv,
-    .bdrv_aio_writev    = raw_aio_writev,
+    .bdrv_co_readv          = raw_co_readv,
+    .bdrv_co_writev         = raw_co_writev,
     .bdrv_aio_flush	= raw_aio_flush,
     .bdrv_refresh_limits = raw_refresh_limits,
     .bdrv_io_plug = raw_aio_plug,
@@ -2383,8 +2377,8 @@ static BlockDriver bdrv_host_cdrom = {
     .bdrv_create         = hdev_create,
     .create_opts         = &raw_create_opts,
 
-    .bdrv_aio_readv     = raw_aio_readv,
-    .bdrv_aio_writev    = raw_aio_writev,
+    .bdrv_co_readv          = raw_co_readv,
+    .bdrv_co_writev         = raw_co_writev,
     .bdrv_aio_flush	= raw_aio_flush,
     .bdrv_refresh_limits = raw_refresh_limits,
     .bdrv_io_plug = raw_aio_plug,
@@ -2520,8 +2514,8 @@ static BlockDriver bdrv_host_cdrom = {
     .bdrv_create        = hdev_create,
     .create_opts        = &raw_create_opts,
 
-    .bdrv_aio_readv     = raw_aio_readv,
-    .bdrv_aio_writev    = raw_aio_writev,
+    .bdrv_co_readv          = raw_co_readv,
+    .bdrv_co_writev         = raw_co_writev,
     .bdrv_aio_flush	= raw_aio_flush,
     .bdrv_refresh_limits = raw_refresh_limits,
     .bdrv_io_plug = raw_aio_plug,
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [Qemu-devel] [RFC PATCH 3/3] linux-aio: Don't reenter request coroutine recursively
  2014-11-26 14:46 [Qemu-devel] [RFC PATCH 0/3] linux-aio: Convert to coroutines Kevin Wolf
  2014-11-26 14:46 ` [Qemu-devel] [RFC PATCH 1/3] qemu-img bench Kevin Wolf
  2014-11-26 14:46 ` [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines Kevin Wolf
@ 2014-11-26 14:46 ` Kevin Wolf
  2014-12-04 14:37   ` Kevin Wolf
  2 siblings, 1 reply; 21+ messages in thread
From: Kevin Wolf @ 2014-11-26 14:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, ming.lei, pl, stefanha

When an error occurs while submitting requests, we must be careful to
wake up only inactive coroutines. The currently active coroutine must
therefore be special-cased, with its error communicated through the
ordinary return value of ioq_submit().

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/linux-aio.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index 99b259d..fd8f0e4 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -177,12 +177,19 @@ static int ioq_submit(struct qemu_laio_state *s)
             container_of(s->io_q.iocbs[i], struct qemu_laiocb, iocb);
 
         laiocb->ret = (ret < 0) ? ret : -EIO;
-        qemu_laio_process_completion(s, laiocb);
+        if (laiocb->co != qemu_coroutine_self()) {
+            qemu_coroutine_enter(laiocb->co, NULL);
+        } else {
+            /* The return value reports the error to the currently active
+             * coroutine. When we get here, we are in ioq_enqueue() called
+             * from that coroutine; ioq_submit() never runs from a request's
+             * coroutine directly. */
+            ret = laiocb->ret;
+        }
     }
     return ret;
 }
 
-static void ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
+static int ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
 {
     unsigned int idx = s->io_q.idx;
 
@@ -191,7 +198,9 @@ static void ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
 
     /* submit immediately if queue is full */
     if (idx == s->io_q.size) {
-        ioq_submit(s);
+        return ioq_submit(s);
+    } else {
+        return 0;
     }
 }
 
@@ -253,11 +262,11 @@ int laio_submit_co(BlockDriverState *bs, void *aio_ctx, int fd,
 
     if (!s->io_q.plugged) {
         ret = io_submit(s->ctx, 1, &iocbs);
-        if (ret < 0) {
-            return ret;
-        }
     } else {
-        ioq_enqueue(s, iocbs);
+        ret = ioq_enqueue(s, iocbs);
+    }
+    if (ret < 0) {
+        return ret;
     }
 
     qemu_coroutine_yield();
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines
  2014-11-26 14:46 ` [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines Kevin Wolf
@ 2014-11-27  9:50   ` Peter Lieven
  2014-11-28 12:57     ` Kevin Wolf
  2014-11-28  2:59   ` Ming Lei
  1 sibling, 1 reply; 21+ messages in thread
From: Peter Lieven @ 2014-11-27  9:50 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel; +Cc: pbonzini, ming.lei, stefanha

On 26.11.2014 15:46, Kevin Wolf wrote:
> This improves the performance of requests because an ACB doesn't need to
> be allocated on the heap any more. It also makes the code nicer and
> smaller.
>
> As a side effect, the codepath taken by aio=threads is changed to use
> paio_submit_co(). This doesn't change the performance at this point.
>
> Results of qemu-img bench -t none -c 10000000 [-n] /dev/loop0:
>
>        |      aio=native       |     aio=threads
>        | before   | with patch | before   | with patch
> ------+----------+------------+----------+------------
> run 1 | 29.921s  | 26.932s    | 35.286s  | 35.447s
> run 2 | 29.793s  | 26.252s    | 35.276s  | 35.111s
> run 3 | 30.186s  | 27.114s    | 35.042s  | 34.921s
> run 4 | 30.425s  | 26.600s    | 35.169s  | 34.968s
> run 5 | 30.041s  | 26.263s    | 35.224s  | 35.000s
>
> TODO: Do some more serious benchmarking in VMs with less variance.
> Results of a quick fio run are vaguely positive.

I still see the main-loop spun warnings with these patches applied to master.
They weren't there with the original patch from August.

~/git/qemu$ ./qemu-img bench -t none -c 10000000 -n /dev/ram1
Sending 10000000 requests, 4096 bytes each, 64 in parallel
main-loop: WARNING: I/O thread spun for 1000 iterations
Run completed in 31.947 seconds.

Peter

>       .bdrv_create         = hdev_create,
>       .create_opts         = &raw_create_opts,
>   
> -    .bdrv_aio_readv     = raw_aio_readv,
> -    .bdrv_aio_writev    = raw_aio_writev,
> +    .bdrv_co_readv          = raw_co_readv,
> +    .bdrv_co_writev         = raw_co_writev,
>       .bdrv_aio_flush	= raw_aio_flush,
>       .bdrv_refresh_limits = raw_refresh_limits,
>       .bdrv_io_plug = raw_aio_plug,
> @@ -2520,8 +2514,8 @@ static BlockDriver bdrv_host_cdrom = {
>       .bdrv_create        = hdev_create,
>       .create_opts        = &raw_create_opts,
>   
> -    .bdrv_aio_readv     = raw_aio_readv,
> -    .bdrv_aio_writev    = raw_aio_writev,
> +    .bdrv_co_readv          = raw_co_readv,
> +    .bdrv_co_writev         = raw_co_writev,
>       .bdrv_aio_flush	= raw_aio_flush,
>       .bdrv_refresh_limits = raw_refresh_limits,
>       .bdrv_io_plug = raw_aio_plug,




* Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines
  2014-11-26 14:46 ` [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines Kevin Wolf
  2014-11-27  9:50   ` Peter Lieven
@ 2014-11-28  2:59   ` Ming Lei
  2014-11-28  7:33     ` Markus Armbruster
                       ` (2 more replies)
  1 sibling, 3 replies; 21+ messages in thread
From: Ming Lei @ 2014-11-28  2:59 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Paolo Bonzini, pl, qemu-devel, Stefan Hajnoczi

Hi Kevin,

On Wed, Nov 26, 2014 at 10:46 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> This improves the performance of requests because an ACB doesn't need to
> be allocated on the heap any more. It also makes the code nicer and
> smaller.

I am not sure it is a good way for linux-aio optimization:

- for raw images with some constraints, coroutines can be avoided since
io_submit() won't sleep most of the time

- handling one coroutine takes much more time than handling malloc,
memset and free on a small buffer; see the following test data:

         --   241ns per coroutine
         --   61ns per (malloc, memset, free for 128 bytes)

I still think we should figure out a fast path to avoid coroutines
for linux-aio with raw images, otherwise it can't scale well for
high-IOPS devices.

Also, we can use a simple buffer pool to avoid the dynamic allocation
easily, can't we?

>
> As a side effect, the codepath taken by aio=threads is changed to use
> paio_submit_co(). This doesn't change the performance at this point.
>
> Results of qemu-img bench -t none -c 10000000 [-n] /dev/loop0:
>
>       |      aio=native       |     aio=threads
>       | before   | with patch | before   | with patch
> ------+----------+------------+----------+------------
> run 1 | 29.921s  | 26.932s    | 35.286s  | 35.447s
> run 2 | 29.793s  | 26.252s    | 35.276s  | 35.111s
> run 3 | 30.186s  | 27.114s    | 35.042s  | 34.921s
> run 4 | 30.425s  | 26.600s    | 35.169s  | 34.968s
> run 5 | 30.041s  | 26.263s    | 35.224s  | 35.000s
>
> TODO: Do some more serious benchmarking in VMs with less variance.
> Results of a quick fio run are vaguely positive.

I will test Paolo's fast path approach under a VM I/O
workload.

Thanks,
Ming Lei


* Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines
  2014-11-28  2:59   ` Ming Lei
@ 2014-11-28  7:33     ` Markus Armbruster
  2014-11-28  8:12       ` Ming Lei
  2014-11-28  9:44     ` Paolo Bonzini
  2014-11-28 10:06     ` Kevin Wolf
  2 siblings, 1 reply; 21+ messages in thread
From: Markus Armbruster @ 2014-11-28  7:33 UTC (permalink / raw)
  To: Ming Lei; +Cc: Kevin Wolf, Paolo Bonzini, pl, qemu-devel, Stefan Hajnoczi

Ming Lei <ming.lei@canonical.com> writes:

> Hi Kevin,
>
> On Wed, Nov 26, 2014 at 10:46 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> This improves the performance of requests because an ACB doesn't need to
>> be allocated on the heap any more. It also makes the code nicer and
>> smaller.
>
> I am not sure it is a good way for linux-aio optimization:
>
> - for raw images with some constraints, coroutines can be avoided since
> io_submit() won't sleep most of the time
>
> - handling one coroutine takes much more time than handling malloc,
> memset and free on a small buffer; see the following test data:
>
>          --   241ns per coroutine

What do you mean by "coroutine" here?  Create + destroy?  Yield?

>          --   61ns per (malloc, memset, free for 128 bytes)
>
> I still think we should figure out a fast path to avoid coroutines
> for linux-aio with raw images, otherwise it can't scale well for
> high-IOPS devices.
>
> Also, we can use a simple buffer pool to avoid the dynamic allocation
> easily, can't we?
[...]


* Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines
  2014-11-28  7:33     ` Markus Armbruster
@ 2014-11-28  8:12       ` Ming Lei
  2014-11-28  8:59         ` Markus Armbruster
  0 siblings, 1 reply; 21+ messages in thread
From: Ming Lei @ 2014-11-28  8:12 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Kevin Wolf, Paolo Bonzini, pl, qemu-devel, Stefan Hajnoczi

On 11/28/14, Markus Armbruster <armbru@redhat.com> wrote:
> Ming Lei <ming.lei@canonical.com> writes:
>
>> Hi Kevin,
>>
>> On Wed, Nov 26, 2014 at 10:46 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>> This improves the performance of requests because an ACB doesn't need to
>>> be allocated on the heap any more. It also makes the code nicer and
>>> smaller.
>>
>> I am not sure it is a good way for linux-aio optimization:
>>
>> - for raw images with some constraints, coroutines can be avoided since
>> io_submit() won't sleep most of the time
>>
>> - handling one coroutine takes much more time than handling malloc,
>> memset and free on a small buffer; see the following test data:
>>
>>          --   241ns per coroutine
>
> What do you mean by "coroutine" here?  Create + destroy?  Yield?

Please see perf_cost() in tests/test-coroutine.c

Thanks,
Ming Lei


* Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines
  2014-11-28  8:12       ` Ming Lei
@ 2014-11-28  8:59         ` Markus Armbruster
  2014-11-28  9:15           ` Ming Lei
  0 siblings, 1 reply; 21+ messages in thread
From: Markus Armbruster @ 2014-11-28  8:59 UTC (permalink / raw)
  To: Ming Lei; +Cc: Kevin Wolf, Paolo Bonzini, pl, qemu-devel, Stefan Hajnoczi

Ming Lei <ming.lei@canonical.com> writes:

> On 11/28/14, Markus Armbruster <armbru@redhat.com> wrote:
>> Ming Lei <ming.lei@canonical.com> writes:
>>
>>> Hi Kevin,
>>>
>>> On Wed, Nov 26, 2014 at 10:46 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>> This improves the performance of requests because an ACB doesn't need to
>>>> be allocated on the heap any more. It also makes the code nicer and
>>>> smaller.
>>>
>>> I am not sure it is a good way for linux-aio optimization:
>>>
>>> - for raw images with some constraints, coroutines can be avoided since
>>> io_submit() won't sleep most of the time
>>>
>>> - handling one coroutine takes much more time than handling malloc,
>>> memset and free on a small buffer; see the following test data:
>>>
>>>          --   241ns per coroutine
>>
>> What do you mean by "coroutine" here?  Create + destroy?  Yield?
>
> Please see perf_cost() in tests/test-coroutine.c

    static __attribute__((noinline)) void perf_cost_func(void *opaque)
    {
        qemu_coroutine_yield();
    }

    static void perf_cost(void)
    {
        const unsigned long maxcycles = 40000000;
        unsigned long i = 0;
        double duration;
        unsigned long ops;
        Coroutine *co;

        g_test_timer_start();
        while (i++ < maxcycles) {
            co = qemu_coroutine_create(perf_cost_func);
            qemu_coroutine_enter(co, &i);
            qemu_coroutine_enter(co, NULL);
        }
        duration = g_test_timer_elapsed();
        ops = (long)(maxcycles / (duration * 1000));

        g_test_message("Run operation %lu iterations %f s, %luK operations/s, "
                       "%luns per coroutine",
                       maxcycles,
                       duration, ops,
                       (unsigned long)(1000000000 * duration) / maxcycles);
    }

This tests create, enter, yield, reenter, terminate, destroy.  The cost
of create + destroy may well dominate.

If we create and destroy coroutines for each AIO request, we're doing it
wrong.  I doubt Kevin's doing it *that* wrong ;)

Anyway, let's benchmark the real code instead of putting undue trust in
tests/test-coroutine.c micro-benchmarks.


* Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines
  2014-11-28  8:59         ` Markus Armbruster
@ 2014-11-28  9:15           ` Ming Lei
  0 siblings, 0 replies; 21+ messages in thread
From: Ming Lei @ 2014-11-28  9:15 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Kevin Wolf, Paolo Bonzini, pl, qemu-devel, Stefan Hajnoczi

On 11/28/14, Markus Armbruster <armbru@redhat.com> wrote:
> Ming Lei <ming.lei@canonical.com> writes:
>
>> On 11/28/14, Markus Armbruster <armbru@redhat.com> wrote:
>>> Ming Lei <ming.lei@canonical.com> writes:
>>>
>>>> Hi Kevin,
>>>>
>>>> On Wed, Nov 26, 2014 at 10:46 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>>> This improves the performance of requests because an ACB doesn't need
>>>>> to
>>>>> be allocated on the heap any more. It also makes the code nicer and
>>>>> smaller.
>>>>
>>>> I am not sure it is a good way for linux-aio optimization:
>>>>
>>>> - for raw images with some constraints, coroutines can be avoided since
>>>> io_submit() won't sleep most of the time
>>>>
>>>> - handling one coroutine takes much more time than handling malloc,
>>>> memset and free on a small buffer; see the following test data:
>>>>
>>>>          --   241ns per coroutine
>>>
>>> What do you mean by "coroutine" here?  Create + destroy?  Yield?
>>
>> Please see perf_cost() in tests/test-coroutine.c
>
>     static __attribute__((noinline)) void perf_cost_func(void *opaque)
>     {
>         qemu_coroutine_yield();
>     }
>
>     static void perf_cost(void)
>     {
>         const unsigned long maxcycles = 40000000;
>         unsigned long i = 0;
>         double duration;
>         unsigned long ops;
>         Coroutine *co;
>
>         g_test_timer_start();
>         while (i++ < maxcycles) {
>             co = qemu_coroutine_create(perf_cost_func);
>             qemu_coroutine_enter(co, &i);
>             qemu_coroutine_enter(co, NULL);
>         }
>         duration = g_test_timer_elapsed();
>         ops = (long)(maxcycles / (duration * 1000));
>
>         g_test_message("Run operation %lu iterations %f s, %luK
> operations/s, "
>                        "%luns per coroutine",
>                        maxcycles,
>                        duration, ops,
>                        (unsigned long)(1000000000 * duration) / maxcycles);
>     }
>
> This tests create, enter, yield, reenter, terminate, destroy.  The cost
> of create + destroy may well dominate.

Actually, there shouldn't be much cost from create and destroy, thanks
to the coroutine pool.

>
> If we create and destroy coroutines for each AIO request, we're doing it
> wrong.  I doubt Kevin's doing it *that* wrong ;)
>
> Anyway, let's benchmark the real code instead of putting undue trust in
> tests/test-coroutine.c micro-benchmarks.

I don't think the trust in the micro-benchmark is undue.

That is the direct cost of the coroutine, and that cost can't be avoided
at all, not to mention the cost of switching stacks.

If you look at the test data I posted previously, it shows that
bypassing coroutines can increase throughput by ~50% for raw images
with linux-aio; that is a real test case, not a micro-benchmark.


Thanks,
Ming Lei


* Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines
  2014-11-28  2:59   ` Ming Lei
  2014-11-28  7:33     ` Markus Armbruster
@ 2014-11-28  9:44     ` Paolo Bonzini
  2014-11-28 10:06     ` Kevin Wolf
  2 siblings, 0 replies; 21+ messages in thread
From: Paolo Bonzini @ 2014-11-28  9:44 UTC (permalink / raw)
  To: Ming Lei, Kevin Wolf; +Cc: pl, qemu-devel, Stefan Hajnoczi



On 28/11/2014 03:59, Ming Lei wrote:
> Hi Kevin,
> 
> On Wed, Nov 26, 2014 at 10:46 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> This improves the performance of requests because an ACB doesn't need to
>> be allocated on the heap any more. It also makes the code nicer and
>> smaller.
> 
> I am not sure it is a good way for linux-aio optimization:
> 
> - for raw images with some constraints, coroutines can be avoided since
> io_submit() won't sleep most of the time
> 
> - handling one coroutine takes much more time than handling malloc,
> memset and free on a small buffer; see the following test data:
> 
>          --   241ns per coroutine
>          --   61ns per (malloc, memset, free for 128 bytes)
> 
> I still think we should figure out a fast path to avoid coroutines
> for linux-aio with raw images, otherwise it can't scale well for
> high-IOPS devices.

sigsetjmp/siglongjmp are just ~60 instructions; they cannot account for
180ns (600 clock cycles).  The cost of creating and destroying the
coroutine must come from somewhere else.

Let's just try something else.  Let's remove the pool mutex, as suggested
by Peter but in a way that works even with non-ioeventfd backends.

I still believe we will end up with some kind of coroutine bypass
scheme (even coroutines _do_ allocate an AIOCB, so calling
bdrv_aio_readv directly can help), but hey, it cannot hurt to optimize
hot code.

The patch below has a single pool where coroutines are placed on
destruction, and a per-thread allocation pool.  Whenever the destruction
pool is big enough, the next thread that runs out of coroutines will
steal from it instead of making a new coroutine.  If this works, it
would be beautiful in two ways:

1) the allocation does not have to do any atomic operation in the fast
path, it's entirely using thread-local storage.  Once every POOL_BATCH_SIZE
allocations it will do a single atomic_xchg.  Release does an atomic_cmpxchg
loop, that hopefully doesn't cause any starvation, and an atomic_inc.

2) in theory this should be completely adaptive.  The number of coroutines
around should be a little more than POOL_BATCH_SIZE * number of allocating
threads; so this also removes qemu_coroutine_adjust_pool_size.  (The previous
pool size was POOL_BATCH_SIZE * number of block backends, so it was a bit
more generous).

The patch below is very raw, and untested beyond tests/test-coroutine.
There may be some premature optimization (not using GPrivate, even though
I need it to run the per-thread destructor) but it was easy enough.  Ming,
Kevin, can you benchmark it?

Related to this, we can see if __thread beats GPrivate in coroutine-ucontext.c.
Every clock cycle counts (600 clock cycles are a 2% improvement at 3 GHz
and 100 kiops) so we can see what we get.

Paolo

diff --git a/include/qemu/queue.h b/include/qemu/queue.h
index d433b90..6a01e2f 100644
--- a/include/qemu/queue.h
+++ b/include/qemu/queue.h
@@ -191,6 +191,17 @@ struct {                                                                \
 #define QSLIST_INSERT_HEAD(head, elm, field) do {                        \
         (elm)->field.sle_next = (head)->slh_first;                      \
         (head)->slh_first = (elm);                                      \
+} while (/*CONSTCOND*/0)
+
+#define QSLIST_INSERT_HEAD_ATOMIC(head, elm, field) do {                   \
+	do {                                                               \
+	   (elm)->field.sle_next = (head)->slh_first;                      \
+	} while (atomic_cmpxchg(&(head)->slh_first, (elm)->field.sle_next, \
+                              (elm)) != (elm)->field.sle_next);            \
+} while (/*CONSTCOND*/0)
+
+#define QSLIST_MOVE_ATOMIC(dest, src) do {                               \
+	(dest)->slh_first = atomic_xchg(&(src)->slh_first, NULL);        \
 } while (/*CONSTCOND*/0)
 
 #define QSLIST_REMOVE_HEAD(head, field) do {                             \
diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index bd574aa..60d761f 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -15,35 +15,52 @@
 #include "trace.h"
 #include "qemu-common.h"
 #include "qemu/thread.h"
+#include "qemu/atomic.h"
 #include "block/coroutine.h"
 #include "block/coroutine_int.h"
 
 enum {
-    POOL_DEFAULT_SIZE = 64,
+    POOL_BATCH_SIZE = 64,
 };
 
 /** Free list to speed up creation */
-static QemuMutex pool_lock;
-static QSLIST_HEAD(, Coroutine) pool = QSLIST_HEAD_INITIALIZER(pool);
+static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
 static unsigned int pool_size;
-static unsigned int pool_max_size = POOL_DEFAULT_SIZE;
+static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
+
+/* The GPrivate is only used to invoke coroutine_pool_cleanup.  */
+static void coroutine_pool_cleanup(void *value);
+static GPrivate dummy_key = G_PRIVATE_INIT(coroutine_pool_cleanup);
 
 Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
 {
     Coroutine *co = NULL;
 
     if (CONFIG_COROUTINE_POOL) {
-        qemu_mutex_lock(&pool_lock);
-        co = QSLIST_FIRST(&pool);
+        co = QSLIST_FIRST(&alloc_pool);
+        if (!co) {
+            if (pool_size > POOL_BATCH_SIZE) {
+                /* This is not exact; there could be a little skew between pool_size
+                 * and the actual size of alloc_pool.  But it is just a heuristic,
+                 * it does not need to be perfect.
+                 */
+                pool_size = 0;
+                QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
+                co = QSLIST_FIRST(&alloc_pool);
+
> +                /* Use this slow path as an easy place to make the per-thread value
+                 * non-NULL and thus trigger the destructor.
+                 */
+                g_private_set(&dummy_key, &dummy_key);
+            }
+        }
         if (co) {
-            QSLIST_REMOVE_HEAD(&pool, pool_next);
-            pool_size--;
+            QSLIST_REMOVE_HEAD(&alloc_pool, pool_next);
         }
-        qemu_mutex_unlock(&pool_lock);
     }
 
     if (!co) {
         co = qemu_coroutine_new();
     }
 
     co->entry = entry;
@@ -54,36 +71,26 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
 static void coroutine_delete(Coroutine *co)
 {
     if (CONFIG_COROUTINE_POOL) {
-        qemu_mutex_lock(&pool_lock);
-        if (pool_size < pool_max_size) {
-            QSLIST_INSERT_HEAD(&pool, co, pool_next);
+        if (pool_size < POOL_BATCH_SIZE * 2) {
             co->caller = NULL;
-            pool_size++;
-            qemu_mutex_unlock(&pool_lock);
+            QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
+            atomic_inc(&pool_size);
             return;
         }
-        qemu_mutex_unlock(&pool_lock);
     }
 
     qemu_coroutine_delete(co);
 }
 
-static void __attribute__((constructor)) coroutine_pool_init(void)
-{
-    qemu_mutex_init(&pool_lock);
-}
-
-static void __attribute__((destructor)) coroutine_pool_cleanup(void)
+static void coroutine_pool_cleanup(void *value)
 {
     Coroutine *co;
     Coroutine *tmp;
 
-    QSLIST_FOREACH_SAFE(co, &pool, pool_next, tmp) {
-        QSLIST_REMOVE_HEAD(&pool, pool_next);
+    QSLIST_FOREACH_SAFE(co, &alloc_pool, pool_next, tmp) {
+        QSLIST_REMOVE_HEAD(&alloc_pool, pool_next);
         qemu_coroutine_delete(co);
     }
-
-    qemu_mutex_destroy(&pool_lock);
 }
 
 static void coroutine_swap(Coroutine *from, Coroutine *to)
@@ -140,20 +156,4 @@ void coroutine_fn qemu_coroutine_yield(void)
 
 void qemu_coroutine_adjust_pool_size(int n)
 {
-    qemu_mutex_lock(&pool_lock);
-
-    pool_max_size += n;
-
-    /* Callers should never take away more than they added */
-    assert(pool_max_size >= POOL_DEFAULT_SIZE);
-
-    /* Trim oversized pool down to new max */
-    while (pool_size > pool_max_size) {
-        Coroutine *co = QSLIST_FIRST(&pool);
-        QSLIST_REMOVE_HEAD(&pool, pool_next);
-        pool_size--;
-        qemu_coroutine_delete(co);
-    }
-
-    qemu_mutex_unlock(&pool_lock);
 }


* Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines
  2014-11-28  2:59   ` Ming Lei
  2014-11-28  7:33     ` Markus Armbruster
  2014-11-28  9:44     ` Paolo Bonzini
@ 2014-11-28 10:06     ` Kevin Wolf
  2 siblings, 0 replies; 21+ messages in thread
From: Kevin Wolf @ 2014-11-28 10:06 UTC (permalink / raw)
  To: Ming Lei; +Cc: Paolo Bonzini, pl, qemu-devel, Stefan Hajnoczi

Am 28.11.2014 um 03:59 hat Ming Lei geschrieben:
> Hi Kevin,
> 
> On Wed, Nov 26, 2014 at 10:46 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> > This improves the performance of requests because an ACB doesn't need to
> > be allocated on the heap any more. It also makes the code nicer and
> > smaller.
> 
> I am not sure it is a good way for linux-aio optimization:
> 
> - for raw images with some constraints, coroutines can be avoided since
> io_submit() won't sleep most of the time
> 
> - handling one coroutine takes much more time than handling malloc,
> memset and free on a small buffer; see the following test data:
> 
>          --   241ns per coroutine
>          --   61ns per (malloc, memset, free for 128 bytes)

Please finally stop making comparisons between completely unrelated
things and trying to make a case against coroutines out of it. It simply
doesn't make any sense.

The truth is that in the 'qemu-img bench' case, as well as in the
highest-performing VM setup for Peter and me, the practically existing
coroutine-based git branches perform better than the practically
existing bypass branches. If you think that theoretically the bypass
branches must be better, show us the patches and benchmarks.

If you can't, let's merge the coroutine improvements (which improve
more than just the case of raw images using no block layer features,
including cases that benefit the average user) and be done.

> I still think we should figure out a fast path to avoid coroutines
> for linux-aio with raw images, otherwise it can't scale well for
> high-IOPS devices.
> 
> Also, we can use a simple buffer pool to avoid the dynamic allocation
> easily, can't we?

Yes, the change to g_slice_alloc() was a bad move performance-wise.

> > As a side effect, the codepath taken by aio=threads is changed to use
> > paio_submit_co(). This doesn't change the performance at this point.
> >
> > Results of qemu-img bench -t none -c 10000000 [-n] /dev/loop0:
> >
> >       |      aio=native       |     aio=threads
> >       | before   | with patch | before   | with patch
> > ------+----------+------------+----------+------------
> > run 1 | 29.921s  | 26.932s    | 35.286s  | 35.447s
> > run 2 | 29.793s  | 26.252s    | 35.276s  | 35.111s
> > run 3 | 30.186s  | 27.114s    | 35.042s  | 34.921s
> > run 4 | 30.425s  | 26.600s    | 35.169s  | 34.968s
> > run 5 | 30.041s  | 26.263s    | 35.224s  | 35.000s
> >
> > TODO: Do some more serious benchmarking in VMs with less variance.
> > Results of a quick fio run are vaguely positive.
> 
> I will test Paolo's fast path approach under a VM I/O
> workload.

Currently, the best thing to compare it against is probably Peter's git
branch at https://github.com/plieven/qemu.git perf_master2. This patch
is only a first step in a whole series of possible optimisations.

Kevin

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 1/3] qemu-img bench
  2014-11-26 14:46 ` [Qemu-devel] [RFC PATCH 1/3] qemu-img bench Kevin Wolf
@ 2014-11-28 11:49   ` Stefan Hajnoczi
  2014-11-28 12:19     ` Kevin Wolf
  0 siblings, 1 reply; 21+ messages in thread
From: Stefan Hajnoczi @ 2014-11-28 11:49 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: pbonzini, ming.lei, pl, qemu-devel, stefanha


On Wed, Nov 26, 2014 at 03:46:42PM +0100, Kevin Wolf wrote:
> +    while (data.n > 0) {
> +        main_loop_wait(false);
> +    }

Why is this false (non-blocking)?  This is why you get the main loop
spun warning message.

Using true (blocking) seems like the right thing.  data.n changes as
part of the callback, which is invoked from the main loop.  There is no
need to be non-blocking.

Stefan



* Re: [Qemu-devel] [RFC PATCH 1/3] qemu-img bench
  2014-11-28 11:49   ` Stefan Hajnoczi
@ 2014-11-28 12:19     ` Kevin Wolf
  2014-12-01 11:15       ` Stefan Hajnoczi
  0 siblings, 1 reply; 21+ messages in thread
From: Kevin Wolf @ 2014-11-28 12:19 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: pbonzini, ming.lei, pl, qemu-devel, stefanha


Am 28.11.2014 um 12:49 hat Stefan Hajnoczi geschrieben:
> On Wed, Nov 26, 2014 at 03:46:42PM +0100, Kevin Wolf wrote:
> > +    while (data.n > 0) {
> > +        main_loop_wait(false);
> > +    }
> 
> Why is this false (non-blocking)?  This is why you get the main loop
> spun warning message.
> 
> Using true (blocking) seems like the right thing.  data.n changes as
> part of the callback, which is invoked from the main loop.  There is no
> need to be non-blocking.

I think the parameter has exactly the opposite meaning from what you
describe:

    int main_loop_wait(int nonblocking)

If it were true, you would get timeout = 0. qemu-io and qemu-nbd also
pass false here.

Kevin



* Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines
  2014-11-27  9:50   ` Peter Lieven
@ 2014-11-28 12:57     ` Kevin Wolf
  2014-11-28 13:44       ` Kevin Wolf
  0 siblings, 1 reply; 21+ messages in thread
From: Kevin Wolf @ 2014-11-28 12:57 UTC (permalink / raw)
  To: Peter Lieven; +Cc: pbonzini, ming.lei, qemu-devel, stefanha

Am 27.11.2014 um 10:50 hat Peter Lieven geschrieben:
> On 26.11.2014 15:46, Kevin Wolf wrote:
> >This improves the performance of requests because an ACB doesn't need to
> >be allocated on the heap any more. It also makes the code nicer and
> >smaller.
> >
> >As a side effect, the codepath taken by aio=threads is changed to use
> >paio_submit_co(). This doesn't change the performance at this point.
> >
> >Results of qemu-img bench -t none -c 10000000 [-n] /dev/loop0:
> >
> >       |      aio=native       |     aio=threads
> >       | before   | with patch | before   | with patch
> >------+----------+------------+----------+------------
> >run 1 | 29.921s  | 26.932s    | 35.286s  | 35.447s
> >run 2 | 29.793s  | 26.252s    | 35.276s  | 35.111s
> >run 3 | 30.186s  | 27.114s    | 35.042s  | 34.921s
> >run 4 | 30.425s  | 26.600s    | 35.169s  | 34.968s
> >run 5 | 30.041s  | 26.263s    | 35.224s  | 35.000s
> >
> >TODO: Do some more serious benchmarking in VMs with less variance.
> >Results of a quick fio run are vaguely positive.
> 
> I still see the main-loop spun warnings with these patches applied to
> master. They weren't there with the original patch from August.
> 
> ~/git/qemu$ ./qemu-img bench -t none -c 10000000 -n /dev/ram1
> Sending 10000000 requests, 4096 bytes each, 64 in parallel
> main-loop: WARNING: I/O thread spun for 1000 iterations
> Run completed in 31.947 seconds.

Yes, I still need to bisect that. The 'qemu-img bench' numbers above
are actually also from August; we have since regressed by about a
second, and I haven't found the reason for that yet either.

Kevin


* Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines
  2014-11-28 12:57     ` Kevin Wolf
@ 2014-11-28 13:44       ` Kevin Wolf
  0 siblings, 0 replies; 21+ messages in thread
From: Kevin Wolf @ 2014-11-28 13:44 UTC (permalink / raw)
  To: Peter Lieven; +Cc: pbonzini, ming.lei, qemu-devel, stefanha

Am 28.11.2014 um 13:57 hat Kevin Wolf geschrieben:
> Am 27.11.2014 um 10:50 hat Peter Lieven geschrieben:
> > On 26.11.2014 15:46, Kevin Wolf wrote:
> > >This improves the performance of requests because an ACB doesn't need to
> > >be allocated on the heap any more. It also makes the code nicer and
> > >smaller.
> > >
> > >As a side effect, the codepath taken by aio=threads is changed to use
> > >paio_submit_co(). This doesn't change the performance at this point.
> > >
> > >Results of qemu-img bench -t none -c 10000000 [-n] /dev/loop0:
> > >
> > >       |      aio=native       |     aio=threads
> > >       | before   | with patch | before   | with patch
> > >------+----------+------------+----------+------------
> > >run 1 | 29.921s  | 26.932s    | 35.286s  | 35.447s
> > >run 2 | 29.793s  | 26.252s    | 35.276s  | 35.111s
> > >run 3 | 30.186s  | 27.114s    | 35.042s  | 34.921s
> > >run 4 | 30.425s  | 26.600s    | 35.169s  | 34.968s
> > >run 5 | 30.041s  | 26.263s    | 35.224s  | 35.000s
> > >
> > >TODO: Do some more serious benchmarking in VMs with less variance.
> > >Results of a quick fio run are vaguely positive.
> > 
> > I still see the main-loop spun warnings with these patches applied to
> > master. They weren't there with the original patch from August.
> > 
> > ~/git/qemu$ ./qemu-img bench -t none -c 10000000 -n /dev/ram1
> > Sending 10000000 requests, 4096 bytes each, 64 in parallel
> > main-loop: WARNING: I/O thread spun for 1000 iterations
> > Run completed in 31.947 seconds.
> 
> Yes, I still need to bisect that. The 'qemu-img bench' numbers above are
> actually also from August, we have regressed meanwhile by about a
> second, and I also haven't found the reason for that yet.

Did the first part of this now. The commit that introduced the "spun"
message is 2cdff7f6 ('linux-aio: avoid deadlock in nested aio_poll()
calls').

The following patch doesn't make it go away completely, but now I only
see it during roughly every other run, instead of immediately after
starting qemu-img bench. It's probably a (very) minor performance
optimisation, too.

Kevin


diff --git a/block/linux-aio.c b/block/linux-aio.c
index fd8f0e4..1a0ec62 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -136,6 +136,8 @@ static void qemu_laio_completion_bh(void *opaque)
 
         qemu_laio_process_completion(s, laiocb);
     }
+
+    qemu_bh_cancel(s->completion_bh);
 }
 
 static void qemu_laio_completion_cb(EventNotifier *e)
@@ -143,7 +145,7 @@ static void qemu_laio_completion_cb(EventNotifier *e)
     struct qemu_laio_state *s = container_of(e, struct qemu_laio_state, e);
 
     if (event_notifier_test_and_clear(&s->e)) {
-        qemu_bh_schedule(s->completion_bh);
+        qemu_laio_completion_bh(s);
     }
 }


* Re: [Qemu-devel] [RFC PATCH 1/3] qemu-img bench
  2014-11-28 12:19     ` Kevin Wolf
@ 2014-12-01 11:15       ` Stefan Hajnoczi
  0 siblings, 0 replies; 21+ messages in thread
From: Stefan Hajnoczi @ 2014-12-01 11:15 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Stefan Hajnoczi, ming.lei, pl, qemu-devel, pbonzini


On Fri, Nov 28, 2014 at 01:19:59PM +0100, Kevin Wolf wrote:
> Am 28.11.2014 um 12:49 hat Stefan Hajnoczi geschrieben:
> > On Wed, Nov 26, 2014 at 03:46:42PM +0100, Kevin Wolf wrote:
> > > +    while (data.n > 0) {
> > > +        main_loop_wait(false);
> > > +    }
> > 
> > Why is this false (non-blocking)?  This is why you get the main loop
> > spun warning message.
> > 
> > Using true (blocking) seems like the right thing.  data.n changes as
> > part of the callback, which is invoked from the main loop.  There is no
> > need to be non-blocking.
> 
> I think the parameter has exactly the opposite meaning as what you
> describe:
> 
>     int main_loop_wait(int nonblocking)
> 
> If it were true, you would get timeout = 0. qemu-io and qemu-nbd also
> pass false here.

Oops, you are right!  Sorry, I was confused.

Stefan



* Re: [Qemu-devel] [RFC PATCH 3/3] linux-aio: Don't reenter request coroutine recursively
  2014-11-26 14:46 ` [Qemu-devel] [RFC PATCH 3/3] linux-aio: Don't reenter request coroutine recursively Kevin Wolf
@ 2014-12-04 14:37   ` Kevin Wolf
  2014-12-04 15:22     ` Ming Lei
  0 siblings, 1 reply; 21+ messages in thread
From: Kevin Wolf @ 2014-12-04 14:37 UTC (permalink / raw)
  To: qemu-devel; +Cc: pbonzini, ming.lei, pl, stefanha

Am 26.11.2014 um 15:46 hat Kevin Wolf geschrieben:
> When getting an error while submitting requests, we must be careful to
> wake up only inactive coroutines. Therefore we must special-case the
> currently active coroutine and communicate an error for that request
> using the ordinary return value of ioq_submit().
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  block/linux-aio.c | 23 ++++++++++++++++-------
>  1 file changed, 16 insertions(+), 7 deletions(-)

Ming, did you have a look at this patch specifically? Does it fix the
issue that you tried to address with a much more complex callback-based
patch?

I'd like to go forward with this as both Peter and I have measured
considerable performance improvements with our optimisation proposals,
and this series is an important part of it.

Kevin


> diff --git a/block/linux-aio.c b/block/linux-aio.c
> index 99b259d..fd8f0e4 100644
> --- a/block/linux-aio.c
> +++ b/block/linux-aio.c
> @@ -177,12 +177,19 @@ static int ioq_submit(struct qemu_laio_state *s)
>              container_of(s->io_q.iocbs[i], struct qemu_laiocb, iocb);
>  
>          laiocb->ret = (ret < 0) ? ret : -EIO;
> -        qemu_laio_process_completion(s, laiocb);
> +        if (laiocb->co != qemu_coroutine_self()) {
> +            qemu_coroutine_enter(laiocb->co, NULL);
> +        } else {
> +            /* The return value is used for the currently active coroutine.
> +             * We're always in ioq_enqueue() here, ioq_submit() never runs from
> +             * a request's coroutine.*/
> +            ret = laiocb->ret;
> +        }
>      }
>      return ret;
>  }
>  
> -static void ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
> +static int ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
>  {
>      unsigned int idx = s->io_q.idx;
>  
> @@ -191,7 +198,9 @@ static void ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
>  
>      /* submit immediately if queue is full */
>      if (idx == s->io_q.size) {
> -        ioq_submit(s);
> +        return ioq_submit(s);
> +    } else {
> +        return 0;
>      }
>  }
>  
> @@ -253,11 +262,11 @@ int laio_submit_co(BlockDriverState *bs, void *aio_ctx, int fd,
>  
>      if (!s->io_q.plugged) {
>          ret = io_submit(s->ctx, 1, &iocbs);
> -        if (ret < 0) {
> -            return ret;
> -        }
>      } else {
> -        ioq_enqueue(s, iocbs);
> +        ret = ioq_enqueue(s, iocbs);
> +    }
> +    if (ret < 0) {
> +        return ret;
>      }
>  
>      qemu_coroutine_yield();
> -- 
> 1.8.3.1
> 


* Re: [Qemu-devel] [RFC PATCH 3/3] linux-aio: Don't reenter request coroutine recursively
  2014-12-04 14:37   ` Kevin Wolf
@ 2014-12-04 15:22     ` Ming Lei
  2014-12-04 15:39       ` Kevin Wolf
  0 siblings, 1 reply; 21+ messages in thread
From: Ming Lei @ 2014-12-04 15:22 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Paolo Bonzini, pl, qemu-devel, Stefan Hajnoczi

On Thu, Dec 4, 2014 at 10:37 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 26.11.2014 um 15:46 hat Kevin Wolf geschrieben:
>> When getting an error while submitting requests, we must be careful to
>> wake up only inactive coroutines. Therefore we must special-case the
>> currently active coroutine and communicate an error for that request
>> using the ordinary return value of ioq_submit().
>>
>> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
>> ---
>>  block/linux-aio.c | 23 ++++++++++++++++-------
>>  1 file changed, 16 insertions(+), 7 deletions(-)
>
> Ming, did you have a look at this patch specifically? Does it fix the
> issue that you tried to address with a much more complex callback-based
> patch?

I think your patch can't fix my issue.

As I explained, we have to handle -EAGAIN and partial submission,
which can be triggered quite easily in case of multi-queue, and other
case like NVME.

Thanks,

>
> I'd like to go forward with this as both Peter and I have measured
> considerable performance improvements with our optimisation proposals,
> and this series is an important part of it.
>
> Kevin
>
>
>> diff --git a/block/linux-aio.c b/block/linux-aio.c
>> index 99b259d..fd8f0e4 100644
>> --- a/block/linux-aio.c
>> +++ b/block/linux-aio.c
>> @@ -177,12 +177,19 @@ static int ioq_submit(struct qemu_laio_state *s)
>>              container_of(s->io_q.iocbs[i], struct qemu_laiocb, iocb);
>>
>>          laiocb->ret = (ret < 0) ? ret : -EIO;
>> -        qemu_laio_process_completion(s, laiocb);
>> +        if (laiocb->co != qemu_coroutine_self()) {
>> +            qemu_coroutine_enter(laiocb->co, NULL);
>> +        } else {
>> +            /* The return value is used for the currently active coroutine.
>> +             * We're always in ioq_enqueue() here, ioq_submit() never runs from
>> +             * a request's coroutine.*/
>> +            ret = laiocb->ret;
>> +        }
>>      }
>>      return ret;
>>  }
>>
>> -static void ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
>> +static int ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
>>  {
>>      unsigned int idx = s->io_q.idx;
>>
>> @@ -191,7 +198,9 @@ static void ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb)
>>
>>      /* submit immediately if queue is full */
>>      if (idx == s->io_q.size) {
>> -        ioq_submit(s);
>> +        return ioq_submit(s);
>> +    } else {
>> +        return 0;
>>      }
>>  }
>>
>> @@ -253,11 +262,11 @@ int laio_submit_co(BlockDriverState *bs, void *aio_ctx, int fd,
>>
>>      if (!s->io_q.plugged) {
>>          ret = io_submit(s->ctx, 1, &iocbs);
>> -        if (ret < 0) {
>> -            return ret;
>> -        }
>>      } else {
>> -        ioq_enqueue(s, iocbs);
>> +        ret = ioq_enqueue(s, iocbs);
>> +    }
>> +    if (ret < 0) {
>> +        return ret;
>>      }
>>
>>      qemu_coroutine_yield();
>> --
>> 1.8.3.1
>>
>


* Re: [Qemu-devel] [RFC PATCH 3/3] linux-aio: Don't reenter request coroutine recursively
  2014-12-04 15:22     ` Ming Lei
@ 2014-12-04 15:39       ` Kevin Wolf
  2014-12-04 15:45         ` Ming Lei
  0 siblings, 1 reply; 21+ messages in thread
From: Kevin Wolf @ 2014-12-04 15:39 UTC (permalink / raw)
  To: Ming Lei; +Cc: Paolo Bonzini, pl, qemu-devel, Stefan Hajnoczi

Am 04.12.2014 um 16:22 hat Ming Lei geschrieben:
> On Thu, Dec 4, 2014 at 10:37 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> > Am 26.11.2014 um 15:46 hat Kevin Wolf geschrieben:
> >> When getting an error while submitting requests, we must be careful to
> >> wake up only inactive coroutines. Therefore we must special-case the
> >> currently active coroutine and communicate an error for that request
> >> using the ordinary return value of ioq_submit().
> >>
> >> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> >> ---
> >>  block/linux-aio.c | 23 ++++++++++++++++-------
> >>  1 file changed, 16 insertions(+), 7 deletions(-)
> >
> > Ming, did you have a look at this patch specifically? Does it fix the
> > issue that you tried to address with a much more complex callback-based
> > patch?
> 
> I think your patch can't fix my issue.
> 
> As I explained, we have to handle -EAGAIN and partial submission,
> which can be triggered quite easily in case of multi-queue, and other
> case like NVME.

Yes, this is an alternative only to the first patch of your series. If
we go that route, your second patch would still be needed as well, of
course. It should be relatively obvious how to change it to apply on top
of this one.

Kevin


* Re: [Qemu-devel] [RFC PATCH 3/3] linux-aio: Don't reenter request coroutine recursively
  2014-12-04 15:39       ` Kevin Wolf
@ 2014-12-04 15:45         ` Ming Lei
  0 siblings, 0 replies; 21+ messages in thread
From: Ming Lei @ 2014-12-04 15:45 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Paolo Bonzini, pl, qemu-devel, Stefan Hajnoczi

On Thu, Dec 4, 2014 at 11:39 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 04.12.2014 um 16:22 hat Ming Lei geschrieben:
>> On Thu, Dec 4, 2014 at 10:37 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> > Am 26.11.2014 um 15:46 hat Kevin Wolf geschrieben:
>> >> When getting an error while submitting requests, we must be careful to
>> >> wake up only inactive coroutines. Therefore we must special-case the
>> >> currently active coroutine and communicate an error for that request
>> >> using the ordinary return value of ioq_submit().
>> >>
>> >> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
>> >> ---
>> >>  block/linux-aio.c | 23 ++++++++++++++++-------
>> >>  1 file changed, 16 insertions(+), 7 deletions(-)
>> >
>> > Ming, did you have a look at this patch specifically? Does it fix the
>> > issue that you tried to address with a much more complex callback-based
>> > patch?
>>
>> I think your patch can't fix my issue.
>>
>> As I explained, we have to handle -EAGAIN and partial submission,
>> which can be triggered quite easily in case of multi-queue, and other
>> case like NVME.
>
> Yes, this is an alternative only to the first patch of your series. If

No, most of my first patch is for handling -EAGAIN and partial
submission, so both of my patches are needed for the issue.

> we go that route, your second patch would still be needed as well, of
> course. It should be relatively obvious how to change it to apply on top
> of this one.
>
> Kevin
>


end of thread, other threads:[~2014-12-04 15:45 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-11-26 14:46 [Qemu-devel] [RFC PATCH 0/3] linux-aio: Convert to coroutines Kevin Wolf
2014-11-26 14:46 ` [Qemu-devel] [RFC PATCH 1/3] qemu-img bench Kevin Wolf
2014-11-28 11:49   ` Stefan Hajnoczi
2014-11-28 12:19     ` Kevin Wolf
2014-12-01 11:15       ` Stefan Hajnoczi
2014-11-26 14:46 ` [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines Kevin Wolf
2014-11-27  9:50   ` Peter Lieven
2014-11-28 12:57     ` Kevin Wolf
2014-11-28 13:44       ` Kevin Wolf
2014-11-28  2:59   ` Ming Lei
2014-11-28  7:33     ` Markus Armbruster
2014-11-28  8:12       ` Ming Lei
2014-11-28  8:59         ` Markus Armbruster
2014-11-28  9:15           ` Ming Lei
2014-11-28  9:44     ` Paolo Bonzini
2014-11-28 10:06     ` Kevin Wolf
2014-11-26 14:46 ` [Qemu-devel] [RFC PATCH 3/3] linux-aio: Don't reenter request coroutine recursively Kevin Wolf
2014-12-04 14:37   ` Kevin Wolf
2014-12-04 15:22     ` Ming Lei
2014-12-04 15:39       ` Kevin Wolf
2014-12-04 15:45         ` Ming Lei
