* [Qemu-devel] [PATCH v2 0/3] linux-aio: reduce completion latency
From: Roman Pen @ 2016-07-19 12:27 UTC
Cc: Roman Pen, Stefan Hajnoczi, Paolo Bonzini, qemu-devel
v2:
o For the third patch, do not introduce an extra member in the LinuxAioState
  structure; reuse ret == -EINPROGRESS.
o Add an explicit comment which explains why we do not hang if requests
  are still pending.
This series is intended to reduce completion latency by two changes:
1. QEMU does not use any timeout value for harvesting completed AIO
   requests from the ring buffer, so io_getevents() can be implemented
   in userspace (first patch).
2. In order to reduce completion latency it makes sense to harvest completed
   requests as soon as possible. A very fast backend device can complete
   requests right after submission, so it is worth checking the ring buffer
   and peeking at completed requests directly after io_submit() has been
   called (third patch); a simplified sketch of this submit-then-peek idea
   follows below.
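For illustration, here is a minimal sketch of the submit-then-peek ordering in
plain libaio terms. The helper name submit_and_harvest() is hypothetical and
the error handling is simplified; the series itself goes further and reads the
completion ring directly in userspace (first patch) instead of calling
io_getevents():

    #include <libaio.h>
    #include <time.h>

    /* Hypothetical helper (not part of the patches): submit a batch and then
     * immediately harvest whatever the device has already completed, instead
     * of waiting for the eventfd notification to fire. */
    static int submit_and_harvest(io_context_t ctx, struct iocb **iocbs, long nr)
    {
        struct io_event evs[128];
        struct timespec ts = { 0, 0 };  /* zero timeout: never block in the kernel */
        int submitted, completed, i;

        submitted = io_submit(ctx, nr, iocbs);
        if (submitted < 0) {
            return submitted;           /* propagate the submission error */
        }

        /* A fast backend may have completed some requests already; with a zero
         * timeout io_getevents() only drains events already sitting in the ring. */
        completed = io_getevents(ctx, 0, 128, evs, &ts);
        for (i = 0; i < completed; i++) {
            /* run the per-request completion callback for evs[i] here */
        }
        return submitted;
    }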
Indeed, the series reduces completion latencies and increases the overall
throughput; for example, the following are the percentiles of the number of
requests completed at once:
        1st 10th 20th 30th 40th 50th 60th 70th 80th 90th 99.99th
Before    2    4   42  112  128  128  128  128  128  128     128
After     1    1    4   14   33   45   47   48   50   51     108
That means that before the third patch is applied, the ring buffer is
observed to be full (128 requests consumed at once) in 60% of calls.
After the third patch is applied, the distribution of the number of
completed requests is "smoother" and the queue (requests in flight) is
almost never full.
The fio read results are the following (write results are almost the
same and are not shown here):
Before
------
job: (groupid=0, jobs=8): err= 0: pid=2227: Tue Jul 19 11:29:50 2016
Description : [Emulation of Storage Server Access Pattern]
read : io=54681MB, bw=1822.7MB/s, iops=179779, runt= 30001msec
slat (usec): min=172, max=16883, avg=338.35, stdev=109.66
clat (usec): min=1, max=21977, avg=1051.45, stdev=299.29
lat (usec): min=317, max=22521, avg=1389.83, stdev=300.73
clat percentiles (usec):
| 1.00th=[ 346], 5.00th=[ 596], 10.00th=[ 708], 20.00th=[ 852],
| 30.00th=[ 932], 40.00th=[ 996], 50.00th=[ 1048], 60.00th=[ 1112],
| 70.00th=[ 1176], 80.00th=[ 1256], 90.00th=[ 1384], 95.00th=[ 1496],
| 99.00th=[ 1800], 99.50th=[ 1928], 99.90th=[ 2320], 99.95th=[ 2672],
| 99.99th=[ 4704]
bw (KB /s): min=205229, max=553181, per=12.50%, avg=233278.26, stdev=18383.51
After
------
job: (groupid=0, jobs=8): err= 0: pid=2220: Tue Jul 19 11:31:51 2016
Description : [Emulation of Storage Server Access Pattern]
read : io=57637MB, bw=1921.2MB/s, iops=189529, runt= 30002msec
slat (usec): min=169, max=20636, avg=329.61, stdev=124.18
clat (usec): min=2, max=19592, avg=988.78, stdev=251.04
lat (usec): min=381, max=21067, avg=1318.42, stdev=243.58
clat percentiles (usec):
| 1.00th=[ 310], 5.00th=[ 580], 10.00th=[ 748], 20.00th=[ 876],
| 30.00th=[ 908], 40.00th=[ 948], 50.00th=[ 1012], 60.00th=[ 1064],
| 70.00th=[ 1080], 80.00th=[ 1128], 90.00th=[ 1224], 95.00th=[ 1288],
| 99.00th=[ 1496], 99.50th=[ 1608], 99.90th=[ 1960], 99.95th=[ 2256],
| 99.99th=[ 5408]
bw (KB /s): min=212149, max=390160, per=12.49%, avg=245746.04, stdev=11606.75
Throughput increased from 1822MB/s to 1921MB/s, and the average completion
latency decreased from 1051us to 988us.
Roman Pen (3):
linux-aio: consume events in userspace instead of calling io_getevents
linux-aio: split processing events function
linux-aio: process completions from ioq_submit()
block/linux-aio.c | 178 ++++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 141 insertions(+), 37 deletions(-)
Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: qemu-devel@nongnu.org
--
2.8.2
* [Qemu-devel] [PATCH v2 1/3] linux-aio: consume events in userspace instead of calling io_getevents
From: Roman Pen @ 2016-07-19 12:27 UTC
Cc: Roman Pen, Stefan Hajnoczi, Paolo Bonzini, qemu-devel
The userspace AIO context is represented as a simple ring buffer that can be
consumed directly without entering the kernel, which can bring a noticeable
performance gain. QEMU does not use a timeout value when waiting for event
completions, so we can consume all events from userspace.
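To show how the helpers introduced below fit together, here is a hedged usage
sketch. Only io_getevents_peek() and io_getevents_commit() are names from this
patch; the drain_ring() wrapper and the loop body are illustrative only and
assume they live next to those helpers in block/linux-aio.c:

    static void drain_ring(io_context_t ctx)
    {
        struct io_event *events;
        unsigned int nr, i;

        /* Drain everything currently sitting in the completion ring without
         * entering the kernel. */
        while ((nr = io_getevents_peek(ctx, &events)) != 0) {
            for (i = 0; i < nr; i++) {
                /* look up the request behind events[i] and run its callback */
            }
            /* advance the ring head past the events we have just consumed */
            io_getevents_commit(ctx, nr);
        }
    }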
Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: qemu-devel@nongnu.org
---
block/linux-aio.c | 125 ++++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 99 insertions(+), 26 deletions(-)
diff --git a/block/linux-aio.c b/block/linux-aio.c
index 78f4524..3e29f1d 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -57,7 +57,6 @@ struct LinuxAioState {
/* I/O completion processing */
QEMUBH *completion_bh;
- struct io_event events[MAX_EVENTS];
int event_idx;
int event_max;
};
@@ -100,6 +99,85 @@ static void qemu_laio_process_completion(struct qemu_laiocb *laiocb)
}
}
+/**
+ * aio_ring buffer which is shared between userspace and kernel.
+ *
+ * This is copied from linux/fs/aio.c; a common header does not exist,
+ * but AIO has existed for ages so we assume the ABI is stable.
+ */
+struct aio_ring {
+ unsigned id; /* kernel internal index number */
+ unsigned nr; /* number of io_events */
+ unsigned head; /* Written to by userland or by kernel. */
+ unsigned tail;
+
+ unsigned magic;
+ unsigned compat_features;
+ unsigned incompat_features;
+ unsigned header_length; /* size of aio_ring */
+
+ struct io_event io_events[0];
+};
+
+/**
+ * io_getevents_peek:
+ * @ctx: AIO context
+ * @events: pointer to the events array (output value)
+ *
+ * Returns the number of completed events and sets a pointer
+ * to the events array. This function does not update the internal
+ * ring buffer, it only reads head and tail. When @events has been
+ * processed, io_getevents_commit() must be called.
+ */
+static inline unsigned int io_getevents_peek(io_context_t ctx,
+ struct io_event **events)
+{
+ struct aio_ring *ring = (struct aio_ring *)ctx;
+ unsigned int head = ring->head, tail = ring->tail;
+ unsigned int nr;
+
+ nr = tail >= head ? tail - head : ring->nr - head;
+ *events = ring->io_events + head;
+ /* To avoid speculative loads of s->events[i] before observing tail.
+ Paired with smp_wmb() inside linux/fs/aio.c: aio_complete(). */
+ smp_rmb();
+
+ return nr;
+}
+
+/**
+ * io_getevents_commit:
+ * @ctx: AIO context
+ * @nr: the number of events by which the head should be advanced
+ *
+ * Advances the head of the ring buffer.
+ */
+static inline void io_getevents_commit(io_context_t ctx, unsigned int nr)
+{
+ struct aio_ring *ring = (struct aio_ring *)ctx;
+
+ if (nr) {
+ ring->head = (ring->head + nr) % ring->nr;
+ }
+}
+
+/**
+ * io_getevents_advance_and_peek:
+ * @ctx: AIO context
+ * @events: pointer to the events array (output value)
+ * @nr: the number of events by which the head should be advanced
+ *
+ * Advances the ring buffer head and returns the number of elements left.
+ */
+static inline unsigned int
+io_getevents_advance_and_peek(io_context_t ctx,
+ struct io_event **events,
+ unsigned int nr)
+{
+ io_getevents_commit(ctx, nr);
+ return io_getevents_peek(ctx, events);
+}
+
/* The completion BH fetches completed I/O requests and invokes their
* callbacks.
*
@@ -114,43 +192,38 @@ static void qemu_laio_process_completion(struct qemu_laiocb *laiocb)
static void qemu_laio_completion_bh(void *opaque)
{
LinuxAioState *s = opaque;
-
- /* Fetch more completion events when empty */
- if (s->event_idx == s->event_max) {
- do {
- struct timespec ts = { 0 };
- s->event_max = io_getevents(s->ctx, MAX_EVENTS, MAX_EVENTS,
- s->events, &ts);
- } while (s->event_max == -EINTR);
-
- s->event_idx = 0;
- if (s->event_max <= 0) {
- s->event_max = 0;
- return; /* no more events */
- }
- s->io_q.in_flight -= s->event_max;
- }
+ struct io_event *events;
/* Reschedule so nested event loops see currently pending completions */
qemu_bh_schedule(s->completion_bh);
- /* Process completion events */
- while (s->event_idx < s->event_max) {
- struct iocb *iocb = s->events[s->event_idx].obj;
- struct qemu_laiocb *laiocb =
+ while ((s->event_max = io_getevents_advance_and_peek(s->ctx, &events,
+ s->event_idx))) {
+ for (s->event_idx = 0; s->event_idx < s->event_max; ) {
+ struct iocb *iocb = events[s->event_idx].obj;
+ struct qemu_laiocb *laiocb =
container_of(iocb, struct qemu_laiocb, iocb);
- laiocb->ret = io_event_ret(&s->events[s->event_idx]);
- s->event_idx++;
+ laiocb->ret = io_event_ret(&events[s->event_idx]);
- qemu_laio_process_completion(laiocb);
+ /* Change counters one-by-one because we can be nested. */
+ s->io_q.in_flight--;
+ s->event_idx++;
+ qemu_laio_process_completion(laiocb);
+ }
}
+ qemu_bh_cancel(s->completion_bh);
+
+ /* If we are nested we have to notify the level above that we are done
+ * by setting event_max to zero, the upper level will then jump out of its
+ * own `for` loop. If we are the last, all counters have dropped to zero. */
+ s->event_max = 0;
+ s->event_idx = 0;
+
if (!s->io_q.plugged && !QSIMPLEQ_EMPTY(&s->io_q.pending)) {
ioq_submit(s);
}
-
- qemu_bh_cancel(s->completion_bh);
}
static void qemu_laio_completion_cb(EventNotifier *e)
--
2.8.2
* [Qemu-devel] [PATCH v2 2/3] linux-aio: split processing events function
From: Roman Pen @ 2016-07-19 12:27 UTC
Cc: Roman Pen, Stefan Hajnoczi, Paolo Bonzini, qemu-devel
Prepare the event-processing function to be called from ioq_submit() by
splitting it into two parts: the first harvests completed I/O requests,
the second submits pending requests.
Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: qemu-devel@nongnu.org
---
block/linux-aio.c | 31 +++++++++++++++++++++----------
1 file changed, 21 insertions(+), 10 deletions(-)
diff --git a/block/linux-aio.c b/block/linux-aio.c
index 3e29f1d..fae5c82 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -178,20 +178,20 @@ io_getevents_advance_and_peek(io_context_t ctx,
return io_getevents_peek(ctx, events);
}
-/* The completion BH fetches completed I/O requests and invokes their
- * callbacks.
+/**
+ * qemu_laio_process_completions:
+ * @s: AIO state
+ *
+ * Fetches completed I/O requests and invokes their callbacks.
*
* The function is somewhat tricky because it supports nested event loops, for
* example when a request callback invokes aio_poll(). In order to do this,
- * the completion events array and index are kept in LinuxAioState. The BH
- * reschedules itself as long as there are completions pending so it will
- * either be called again in a nested event loop or will be called after all
- * events have been completed. When there are no events left to complete, the
- * BH returns without rescheduling.
+ * indices are kept in LinuxAioState. The function schedules the completion
+ * BH so it can be called again in a nested event loop. When there are no
+ * events left to complete, the BH is canceled.
*/
-static void qemu_laio_completion_bh(void *opaque)
+static void qemu_laio_process_completions(LinuxAioState *s)
{
- LinuxAioState *s = opaque;
struct io_event *events;
/* Reschedule so nested event loops see currently pending completions */
@@ -220,18 +220,29 @@ static void qemu_laio_completion_bh(void *opaque)
* own `for` loop. If we are the last, all counters have dropped to zero. */
s->event_max = 0;
s->event_idx = 0;
+}
+static void qemu_laio_process_completions_and_submit(LinuxAioState *s)
+{
+ qemu_laio_process_completions(s);
if (!s->io_q.plugged && !QSIMPLEQ_EMPTY(&s->io_q.pending)) {
ioq_submit(s);
}
}
+static void qemu_laio_completion_bh(void *opaque)
+{
+ LinuxAioState *s = opaque;
+
+ qemu_laio_process_completions_and_submit(s);
+}
+
static void qemu_laio_completion_cb(EventNotifier *e)
{
LinuxAioState *s = container_of(e, LinuxAioState, e);
if (event_notifier_test_and_clear(&s->e)) {
- qemu_laio_completion_bh(s);
+ qemu_laio_process_completions_and_submit(s);
}
}
--
2.8.2
* [Qemu-devel] [PATCH v2 3/3] linux-aio: process completions from ioq_submit()
From: Roman Pen @ 2016-07-19 12:27 UTC
Cc: Roman Pen, Stefan Hajnoczi, Paolo Bonzini, qemu-devel
In order to reduce completion latency it makes sense to harvest completed
requests as soon as possible. A very fast backend device can complete
requests right after submission, so it is worth checking the ring buffer
and peeking at completed requests directly after io_submit() has been called.
Indeed, this patch reduces completion latencies and increases the overall
throughput; for example, the following are the percentiles of the number of
requests completed at once:
        1st 10th 20th 30th 40th 50th 60th 70th 80th 90th 99.99th
Before    2    4   42  112  128  128  128  128  128  128     128
After     1    1    4   14   33   45   47   48   50   51     108
That means that before this patch is applied, the ring buffer is observed
to be full (128 requests consumed at once) in 60% of calls. After the patch
is applied, the distribution of the number of completed requests is
"smoother" and the queue (requests in flight) is almost never full.
The fio read results are the following (write results are almost the
same and are not shown here):
Before
------
job: (groupid=0, jobs=8): err= 0: pid=2227: Tue Jul 19 11:29:50 2016
Description : [Emulation of Storage Server Access Pattern]
read : io=54681MB, bw=1822.7MB/s, iops=179779, runt= 30001msec
slat (usec): min=172, max=16883, avg=338.35, stdev=109.66
clat (usec): min=1, max=21977, avg=1051.45, stdev=299.29
lat (usec): min=317, max=22521, avg=1389.83, stdev=300.73
clat percentiles (usec):
| 1.00th=[ 346], 5.00th=[ 596], 10.00th=[ 708], 20.00th=[ 852],
| 30.00th=[ 932], 40.00th=[ 996], 50.00th=[ 1048], 60.00th=[ 1112],
| 70.00th=[ 1176], 80.00th=[ 1256], 90.00th=[ 1384], 95.00th=[ 1496],
| 99.00th=[ 1800], 99.50th=[ 1928], 99.90th=[ 2320], 99.95th=[ 2672],
| 99.99th=[ 4704]
bw (KB /s): min=205229, max=553181, per=12.50%, avg=233278.26, stdev=18383.51
After
------
job: (groupid=0, jobs=8): err= 0: pid=2220: Tue Jul 19 11:31:51 2016
Description : [Emulation of Storage Server Access Pattern]
read : io=57637MB, bw=1921.2MB/s, iops=189529, runt= 30002msec
slat (usec): min=169, max=20636, avg=329.61, stdev=124.18
clat (usec): min=2, max=19592, avg=988.78, stdev=251.04
lat (usec): min=381, max=21067, avg=1318.42, stdev=243.58
clat percentiles (usec):
| 1.00th=[ 310], 5.00th=[ 580], 10.00th=[ 748], 20.00th=[ 876],
| 30.00th=[ 908], 40.00th=[ 948], 50.00th=[ 1012], 60.00th=[ 1064],
| 70.00th=[ 1080], 80.00th=[ 1128], 90.00th=[ 1224], 95.00th=[ 1288],
| 99.00th=[ 1496], 99.50th=[ 1608], 99.90th=[ 1960], 99.95th=[ 2256],
| 99.99th=[ 5408]
bw (KB /s): min=212149, max=390160, per=12.49%, avg=245746.04, stdev=11606.75
Throughput increased from 1822MB/s to 1921MB/s, and the average completion
latency decreased from 1051us to 988us.
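Because ioq_submit() may now reap completions itself, the submitting
coroutine's own request can already be finished by the time control returns
to laio_co_submit(). A condensed view of how the hunks below handle that
(simplified; the submission call itself is elided):

    struct qemu_laiocb laiocb = {
        .co  = qemu_coroutine_self(),
        .ret = -EINPROGRESS,        /* marker: not completed yet */
        /* ... other fields as in the hunk below ... */
    };

    /* ... submit the request; this may process completions, including our
     * own, in which case laiocb.ret has already been overwritten with the
     * real result. qemu_laio_process_completion() skips the coroutine
     * re-entry for the request's own coroutine, so we are not re-entered
     * while still running here. */

    if (laiocb.ret == -EINPROGRESS) {
        qemu_coroutine_yield();     /* wait for the completion callback */
    }
    return laiocb.ret;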
Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: qemu-devel@nongnu.org
---
block/linux-aio.c | 24 ++++++++++++++++++++++--
1 file changed, 22 insertions(+), 2 deletions(-)
diff --git a/block/linux-aio.c b/block/linux-aio.c
index fae5c82..53fd10f 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -92,7 +92,11 @@ static void qemu_laio_process_completion(struct qemu_laiocb *laiocb)
laiocb->ret = ret;
if (laiocb->co) {
- qemu_coroutine_enter(laiocb->co, NULL);
+ /* Jump and continue completion for foreign requests, don't do
+ * anything for current request, it will be completed shortly. */
+ if (laiocb->co != qemu_coroutine_self()) {
+ qemu_coroutine_enter(laiocb->co, NULL);
+ }
} else {
laiocb->common.cb(laiocb->common.opaque, ret);
qemu_aio_unref(laiocb);
@@ -312,6 +316,19 @@ static void ioq_submit(LinuxAioState *s)
QSIMPLEQ_SPLIT_AFTER(&s->io_q.pending, aiocb, next, &completed);
} while (ret == len && !QSIMPLEQ_EMPTY(&s->io_q.pending));
s->io_q.blocked = (s->io_q.in_queue > 0);
+
+ if (s->io_q.in_flight) {
+ /* We can try to complete something right away if there are
+ * still requests in flight. */
+ qemu_laio_process_completions(s);
+ /*
+ * Even if we have completed everything (in_flight == 0), the queue can
+ * still have pending requests (in_queue > 0). We do not attempt to
+ * repeat submission to avoid an IO hang. The reason is simple: s->e is
+ * still set and the completion callback will be called shortly, and all
+ * pending requests will be submitted from there.
+ */
+ }
}
void laio_io_plug(BlockDriverState *bs, LinuxAioState *s)
@@ -370,6 +387,7 @@ int coroutine_fn laio_co_submit(BlockDriverState *bs, LinuxAioState *s, int fd,
.co = qemu_coroutine_self(),
.nbytes = qiov->size,
.ctx = s,
+ .ret = -EINPROGRESS,
.is_read = (type == QEMU_AIO_READ),
.qiov = qiov,
};
@@ -379,7 +397,9 @@ int coroutine_fn laio_co_submit(BlockDriverState *bs, LinuxAioState *s, int fd,
return ret;
}
- qemu_coroutine_yield();
+ if (laiocb.ret == -EINPROGRESS) {
+ qemu_coroutine_yield();
+ }
return laiocb.ret;
}
--
2.8.2
* Re: [Qemu-devel] [PATCH v2 0/3] linux-aio: reduce completion latency
From: Stefan Hajnoczi @ 2016-07-20 9:08 UTC
To: Roman Pen; +Cc: Paolo Bonzini, Stefan Hajnoczi, qemu-devel
On Tue, Jul 19, 2016 at 02:27:40PM +0200, Roman Pen wrote:
> [...]
Thanks, applied to my block-next tree for QEMU 2.8:
https://github.com/stefanha/qemu/commits/block-next
Stefan