* [PATCH 0/5] block: Improve writethrough performance
@ 2025-03-07 22:16 Kevin Wolf
2025-03-07 22:16 ` [PATCH 1/5] file-posix: Support FUA writes Kevin Wolf
` (5 more replies)
0 siblings, 6 replies; 14+ messages in thread
From: Kevin Wolf @ 2025-03-07 22:16 UTC (permalink / raw)
To: qemu-block; +Cc: kwolf, stefanha, pbonzini, afaria, hreitz, qemu-devel
Write requests in writethrough mode mean that QEMU sends a separate
flush request (i.e. fdatasync) after each completed write request.
This is unnecessary overhead when we can just pass a flag for the write
request that gives us the desired FUA semantics.
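In block layer terms, this replaces a two-request pattern with a single
flagged write. A minimal editorial sketch against the QEMU coroutine API
(not code from this series; it assumes a BdrvChild *child inside a
coroutine_fn, error handling omitted):

    /* Before: writethrough emulated as write + separate flush */
    ret = bdrv_co_pwritev(child, offset, bytes, qiov, 0);
    if (ret == 0) {
        ret = bdrv_co_flush(child->bs);  /* second request per write */
    }

    /* After: one write request that carries FUA semantics */
    ret = bdrv_co_pwritev(child, offset, bytes, qiov, BDRV_REQ_FUA);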
Unfortunately, this exposed a problem in the adaptive polling algorithm
that would result in a performance regression in some cases with FUA
support in file-posix. Therefore, this series also changes adaptive
polling to avoid the regression.
Kevin Wolf (5):
file-posix: Support FUA writes
block/io: Ignore FUA with cache.no-flush=on
aio: Create AioPolledEvent
aio-posix: Factor out adjust_polling_time()
aio-posix: Separate AioPolledEvent per AioHandler
include/block/aio.h | 5 ++-
include/block/raw-aio.h | 8 +++-
util/aio-posix.h | 1 +
block/file-posix.c | 26 ++++++++----
block/io.c | 4 ++
block/io_uring.c | 13 +++---
block/linux-aio.c | 24 +++++++++--
util/aio-posix.c | 94 ++++++++++++++++++++++++++---------------
util/async.c | 1 -
meson.build | 4 ++
10 files changed, 125 insertions(+), 55 deletions(-)
--
2.48.1
* [PATCH 1/5] file-posix: Support FUA writes
2025-03-07 22:16 [PATCH 0/5] block: Improve writethrough performance Kevin Wolf
@ 2025-03-07 22:16 ` Kevin Wolf
2025-03-10 10:41 ` Stefan Hajnoczi
2025-03-07 22:16 ` [PATCH 2/5] block/io: Ignore FUA with cache.no-flush=on Kevin Wolf
` (4 subsequent siblings)
5 siblings, 1 reply; 14+ messages in thread
From: Kevin Wolf @ 2025-03-07 22:16 UTC (permalink / raw)
To: qemu-block; +Cc: kwolf, stefanha, pbonzini, afaria, hreitz, qemu-devel
Until now, FUA was always emulated with a separate flush after the write
for file-posix. The overhead of processing a second request can reduce
performance significantly for a guest disk that has disabled the write
cache, especially if the host disk is already write through, too, and
the flush isn't actually doing anything.
Advertise support for REQ_FUA in write requests and implement it for
Linux AIO and io_uring using the RWF_DSYNC flag for write requests. The
thread pool still performs a separate fdatasync() call. This can be
improved later by using the pwritev2() syscall if available.
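For reference, RWF_DSYNC gives a single write the same integrity
guarantee as a write followed by fdatasync(). A self-contained sketch of
what the later pwritev2() improvement for the thread pool could look
like (illustrative only; fua_pwritev() is a made-up name, not part of
this patch):

    #define _GNU_SOURCE
    #include <sys/uio.h>

    /* One syscall with per-write durability, no separate fdatasync() */
    static ssize_t fua_pwritev(int fd, const struct iovec *iov, int iovcnt,
                               off_t offset)
    {
        return pwritev2(fd, iov, iovcnt, offset, RWF_DSYNC);
    }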
As an example, this is how fio numbers can be improved in some scenarios
with this patch (all using virtio-blk with cache=directsync on an nvme
block device for the VM, fio with ioengine=libaio,direct=1,sync=1):
                              |      old      | with FUA support
------------------------------+---------------+-------------------
bs=4k, iodepth=1, numjobs=1   |  45.6k iops   |  56.1k iops
bs=4k, iodepth=1, numjobs=16  | 183.3k iops   | 236.0k iops
bs=4k, iodepth=16, numjobs=1  | 258.4k iops   | 311.1k iops
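For reproducibility, the parameters above correspond to a fio invocation
inside the guest along these lines (the device path, write pattern and
runtime are assumptions, they weren't stated with the measurements):

    fio --name=fua-test --filename=/dev/vdb --ioengine=libaio --direct=1 \
        --sync=1 --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
        --runtime=60 --time_based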
However, not all scenarios are clear wins. On another, slower disk I saw
little to no improvement. In fact, in two corner case scenarios, I even
observed a regression, which I nevertheless consider acceptable:
1. On slow host disks in a writethrough cache mode, when the guest is
   using virtio-blk in a separate iothread so that polling can be
   enabled, and each completion is quickly followed up with a new
   request (so that polling gets it), it can happen that enabling FUA
   makes things slower: the additional very fast no-op flush we used to
   have gave the adaptive polling algorithm a success, so it kept
   polling. Without it, we only have the slow write request, which
   disables polling. This is a problem in the polling algorithm that
   will be fixed later in this series.
2. With a high queue depth, it can be beneficial to have flush requests
   for another reason: The optimisation in bdrv_co_flush() that flushes
   only once per write generation (sketched below) acts as a
   synchronisation mechanism that lets all requests complete at the
   same time. This can result in better batching, and if the disk is
   very fast (I only saw this with a null_blk backend), this can make
   up for the overhead of the flush and improve throughput. In theory,
   we could optionally introduce a similar artificial latency in the
   normal completion path to achieve the same kind of completion
   batching. This is not implemented in this series.
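The shape of that optimisation, heavily simplified (an editorial sketch
of the logic in bdrv_co_flush(); bdrv_driver_flush() is a made-up name
standing in for the driver callback dispatch):

    int ret = 0;
    unsigned current_gen = qatomic_read(&bs->write_gen);

    /* Only one flush runs at a time; later callers queue up behind it */
    while (bs->active_flush_req) {
        qemu_co_queue_wait(&bs->flush_queue, NULL);
    }
    bs->active_flush_req = true;

    if (bs->flushed_gen != current_gen) {
        ret = bdrv_driver_flush(bs);
        bs->flushed_gen = current_gen;
    }
    /* else: no writes since the last flush, nothing new to make stable */

    bs->active_flush_req = false;
    qemu_co_queue_next(&bs->flush_queue);  /* wake the queued waiters */

All flushes that were queued behind the running one complete together at
this point, which is the batching effect described in point 2.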
Compatibility is not a concern for io_uring, which has supported
RWF_DSYNC from the start. Linux AIO started supporting it in Linux 4.13
and libaio 0.3.111. The kernel is not a problem for any supported build
platform, so runtime checks are unnecessary. However, openSUSE still
ships an older libaio version that would break the build, so the feature
must be detected at build time.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
include/block/raw-aio.h | 8 ++++++--
block/file-posix.c | 26 ++++++++++++++++++--------
block/io_uring.c | 13 ++++++++-----
block/linux-aio.c | 24 +++++++++++++++++++++---
meson.build | 4 ++++
5 files changed, 57 insertions(+), 18 deletions(-)
diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
index 626706827f..247bdbff13 100644
--- a/include/block/raw-aio.h
+++ b/include/block/raw-aio.h
@@ -17,6 +17,7 @@
#define QEMU_RAW_AIO_H
#include "block/aio.h"
+#include "block/block-common.h"
#include "qemu/iov.h"
/* AIO request types */
@@ -58,9 +59,11 @@ void laio_cleanup(LinuxAioState *s);
/* laio_co_submit: submit I/O requests in the thread's current AioContext. */
int coroutine_fn laio_co_submit(int fd, uint64_t offset, QEMUIOVector *qiov,
- int type, uint64_t dev_max_batch);
+ int type, BdrvRequestFlags flags,
+ uint64_t dev_max_batch);
bool laio_has_fdsync(int);
+bool laio_has_fua(void);
void laio_detach_aio_context(LinuxAioState *s, AioContext *old_context);
void laio_attach_aio_context(LinuxAioState *s, AioContext *new_context);
#endif
@@ -71,7 +74,8 @@ void luring_cleanup(LuringState *s);
/* luring_co_submit: submit I/O requests in the thread's current AioContext. */
int coroutine_fn luring_co_submit(BlockDriverState *bs, int fd, uint64_t offset,
- QEMUIOVector *qiov, int type);
+ QEMUIOVector *qiov, int type,
+ BdrvRequestFlags flags);
void luring_detach_aio_context(LuringState *s, AioContext *old_context);
void luring_attach_aio_context(LuringState *s, AioContext *new_context);
#endif
diff --git a/block/file-posix.c b/block/file-posix.c
index 44e16dda87..0f1c722804 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -194,6 +194,7 @@ static int fd_open(BlockDriverState *bs)
}
static int64_t raw_getlength(BlockDriverState *bs);
+static int coroutine_fn raw_co_flush_to_disk(BlockDriverState *bs);
typedef struct RawPosixAIOData {
BlockDriverState *bs;
@@ -804,6 +805,10 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
#endif
s->needs_alignment = raw_needs_alignment(bs);
+ if (!s->use_linux_aio || laio_has_fua()) {
+ bs->supported_write_flags = BDRV_REQ_FUA;
+ }
+
bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK;
if (S_ISREG(st.st_mode)) {
/* When extending regular files, we get zeros from the OS */
@@ -2477,7 +2482,8 @@ static inline bool raw_check_linux_aio(BDRVRawState *s)
#endif
static int coroutine_fn raw_co_prw(BlockDriverState *bs, int64_t *offset_ptr,
- uint64_t bytes, QEMUIOVector *qiov, int type)
+ uint64_t bytes, QEMUIOVector *qiov, int type,
+ int flags)
{
BDRVRawState *s = bs->opaque;
RawPosixAIOData acb;
@@ -2508,13 +2514,13 @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, int64_t *offset_ptr,
#ifdef CONFIG_LINUX_IO_URING
} else if (raw_check_linux_io_uring(s)) {
assert(qiov->size == bytes);
- ret = luring_co_submit(bs, s->fd, offset, qiov, type);
+ ret = luring_co_submit(bs, s->fd, offset, qiov, type, flags);
goto out;
#endif
#ifdef CONFIG_LINUX_AIO
} else if (raw_check_linux_aio(s)) {
assert(qiov->size == bytes);
- ret = laio_co_submit(s->fd, offset, qiov, type,
+ ret = laio_co_submit(s->fd, offset, qiov, type, flags,
s->aio_max_batch);
goto out;
#endif
@@ -2534,6 +2540,10 @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, int64_t *offset_ptr,
assert(qiov->size == bytes);
ret = raw_thread_pool_submit(handle_aiocb_rw, &acb);
+ if (ret == 0 && (flags & BDRV_REQ_FUA)) {
+ /* TODO Use pwritev2() instead if it's available */
+ ret = raw_co_flush_to_disk(bs);
+ }
goto out; /* Avoid the compiler err of unused label */
out:
@@ -2571,14 +2581,14 @@ static int coroutine_fn raw_co_preadv(BlockDriverState *bs, int64_t offset,
int64_t bytes, QEMUIOVector *qiov,
BdrvRequestFlags flags)
{
- return raw_co_prw(bs, &offset, bytes, qiov, QEMU_AIO_READ);
+ return raw_co_prw(bs, &offset, bytes, qiov, QEMU_AIO_READ, flags);
}
static int coroutine_fn raw_co_pwritev(BlockDriverState *bs, int64_t offset,
int64_t bytes, QEMUIOVector *qiov,
BdrvRequestFlags flags)
{
- return raw_co_prw(bs, &offset, bytes, qiov, QEMU_AIO_WRITE);
+ return raw_co_prw(bs, &offset, bytes, qiov, QEMU_AIO_WRITE, flags);
}
static int coroutine_fn raw_co_flush_to_disk(BlockDriverState *bs)
@@ -2600,12 +2610,12 @@ static int coroutine_fn raw_co_flush_to_disk(BlockDriverState *bs)
#ifdef CONFIG_LINUX_IO_URING
if (raw_check_linux_io_uring(s)) {
- return luring_co_submit(bs, s->fd, 0, NULL, QEMU_AIO_FLUSH);
+ return luring_co_submit(bs, s->fd, 0, NULL, QEMU_AIO_FLUSH, 0);
}
#endif
#ifdef CONFIG_LINUX_AIO
if (s->has_laio_fdsync && raw_check_linux_aio(s)) {
- return laio_co_submit(s->fd, 0, NULL, QEMU_AIO_FLUSH, 0);
+ return laio_co_submit(s->fd, 0, NULL, QEMU_AIO_FLUSH, 0, 0);
}
#endif
return raw_thread_pool_submit(handle_aiocb_flush, &acb);
@@ -3540,7 +3550,7 @@ static int coroutine_fn raw_co_zone_append(BlockDriverState *bs,
}
trace_zbd_zone_append(bs, *offset >> BDRV_SECTOR_BITS);
- return raw_co_prw(bs, offset, len, qiov, QEMU_AIO_ZONE_APPEND);
+ return raw_co_prw(bs, offset, len, qiov, QEMU_AIO_ZONE_APPEND, 0);
}
#endif
diff --git a/block/io_uring.c b/block/io_uring.c
index f52b66b340..dc967dbf91 100644
--- a/block/io_uring.c
+++ b/block/io_uring.c
@@ -335,15 +335,17 @@ static void luring_deferred_fn(void *opaque)
*
*/
static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
- uint64_t offset, int type)
+ uint64_t offset, int type, BdrvRequestFlags flags)
{
int ret;
struct io_uring_sqe *sqes = &luringcb->sqeq;
+ int luring_flags;
switch (type) {
case QEMU_AIO_WRITE:
- io_uring_prep_writev(sqes, fd, luringcb->qiov->iov,
- luringcb->qiov->niov, offset);
+ luring_flags = (flags & BDRV_REQ_FUA) ? RWF_DSYNC : 0;
+ io_uring_prep_writev2(sqes, fd, luringcb->qiov->iov,
+ luringcb->qiov->niov, offset, luring_flags);
break;
case QEMU_AIO_ZONE_APPEND:
io_uring_prep_writev(sqes, fd, luringcb->qiov->iov,
@@ -380,7 +382,8 @@ static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
}
int coroutine_fn luring_co_submit(BlockDriverState *bs, int fd, uint64_t offset,
- QEMUIOVector *qiov, int type)
+ QEMUIOVector *qiov, int type,
+ BdrvRequestFlags flags)
{
int ret;
AioContext *ctx = qemu_get_current_aio_context();
@@ -393,7 +396,7 @@ int coroutine_fn luring_co_submit(BlockDriverState *bs, int fd, uint64_t offset,
};
trace_luring_co_submit(bs, s, &luringcb, fd, offset, qiov ? qiov->size : 0,
type);
- ret = luring_do_submit(fd, &luringcb, s, offset, type);
+ ret = luring_do_submit(fd, &luringcb, s, offset, type, flags);
if (ret < 0) {
return ret;
diff --git a/block/linux-aio.c b/block/linux-aio.c
index 194c8f434f..1108ae361c 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -368,15 +368,23 @@ static void laio_deferred_fn(void *opaque)
}
static int laio_do_submit(int fd, struct qemu_laiocb *laiocb, off_t offset,
- int type, uint64_t dev_max_batch)
+ int type, BdrvRequestFlags flags,
+ uint64_t dev_max_batch)
{
LinuxAioState *s = laiocb->ctx;
struct iocb *iocbs = &laiocb->iocb;
QEMUIOVector *qiov = laiocb->qiov;
+ int laio_flags;
switch (type) {
case QEMU_AIO_WRITE:
+#ifdef HAVE_IO_PREP_PWRITEV2
+ laio_flags = (flags & BDRV_REQ_FUA) ? RWF_DSYNC : 0;
+ io_prep_pwritev2(iocbs, fd, qiov->iov, qiov->niov, offset, laio_flags);
+#else
+ assert(flags == 0);
io_prep_pwritev(iocbs, fd, qiov->iov, qiov->niov, offset);
+#endif
break;
case QEMU_AIO_ZONE_APPEND:
io_prep_pwritev(iocbs, fd, qiov->iov, qiov->niov, offset);
@@ -409,7 +417,8 @@ static int laio_do_submit(int fd, struct qemu_laiocb *laiocb, off_t offset,
}
int coroutine_fn laio_co_submit(int fd, uint64_t offset, QEMUIOVector *qiov,
- int type, uint64_t dev_max_batch)
+ int type, BdrvRequestFlags flags,
+ uint64_t dev_max_batch)
{
int ret;
AioContext *ctx = qemu_get_current_aio_context();
@@ -422,7 +431,7 @@ int coroutine_fn laio_co_submit(int fd, uint64_t offset, QEMUIOVector *qiov,
.qiov = qiov,
};
- ret = laio_do_submit(fd, &laiocb, offset, type, dev_max_batch);
+ ret = laio_do_submit(fd, &laiocb, offset, type, flags, dev_max_batch);
if (ret < 0) {
return ret;
}
@@ -505,3 +514,12 @@ bool laio_has_fdsync(int fd)
io_destroy(ctx);
return (ret == -EINVAL) ? false : true;
}
+
+bool laio_has_fua(void)
+{
+#ifdef HAVE_IO_PREP_PWRITEV2
+ return true;
+#else
+ return false;
+#endif
+}
diff --git a/meson.build b/meson.build
index 0ee79c664d..b2b2c9bb46 100644
--- a/meson.build
+++ b/meson.build
@@ -2724,6 +2724,10 @@ config_host_data.set('HAVE_OPTRESET',
cc.has_header_symbol('getopt.h', 'optreset'))
config_host_data.set('HAVE_IPPROTO_MPTCP',
cc.has_header_symbol('netinet/in.h', 'IPPROTO_MPTCP'))
+if libaio.found()
+ config_host_data.set('HAVE_IO_PREP_PWRITEV2',
+ cc.has_header_symbol('libaio.h', 'io_prep_pwritev2'))
+endif
# has_member
config_host_data.set('HAVE_SIGEV_NOTIFY_THREAD_ID',
--
2.48.1
* [PATCH 2/5] block/io: Ignore FUA with cache.no-flush=on
2025-03-07 22:16 [PATCH 0/5] block: Improve writethrough performance Kevin Wolf
2025-03-07 22:16 ` [PATCH 1/5] file-posix: Support FUA writes Kevin Wolf
@ 2025-03-07 22:16 ` Kevin Wolf
2025-03-10 10:42 ` Stefan Hajnoczi
2025-03-07 22:16 ` [PATCH 3/5] aio: Create AioPolledEvent Kevin Wolf
` (3 subsequent siblings)
5 siblings, 1 reply; 14+ messages in thread
From: Kevin Wolf @ 2025-03-07 22:16 UTC (permalink / raw)
To: qemu-block; +Cc: kwolf, stefanha, pbonzini, afaria, hreitz, qemu-devel
For block drivers that don't advertise FUA support, we already call
bdrv_co_flush(), which considers BDRV_O_NO_FLUSH. However, drivers that
do support FUA still see the FUA flag with BDRV_O_NO_FLUSH and get the
associated performance penalty that cache.no-flush=on was supposed to
avoid.
Clear FUA for write requests if BDRV_O_NO_FLUSH is set.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block/io.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/block/io.c b/block/io.c
index d369b994df..1ba8d1aeea 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1058,6 +1058,10 @@ bdrv_driver_pwritev(BlockDriverState *bs, int64_t offset, int64_t bytes,
return -ENOMEDIUM;
}
+ if (bs->open_flags & BDRV_O_NO_FLUSH) {
+ flags &= ~BDRV_REQ_FUA;
+ }
+
if ((flags & BDRV_REQ_FUA) &&
(~bs->supported_write_flags & BDRV_REQ_FUA)) {
flags &= ~BDRV_REQ_FUA;
--
2.48.1
* [PATCH 3/5] aio: Create AioPolledEvent
2025-03-07 22:16 [PATCH 0/5] block: Improve writethrough performance Kevin Wolf
2025-03-07 22:16 ` [PATCH 1/5] file-posix: Support FUA writes Kevin Wolf
2025-03-07 22:16 ` [PATCH 2/5] block/io: Ignore FUA with cache.no-flush=on Kevin Wolf
@ 2025-03-07 22:16 ` Kevin Wolf
2025-03-10 10:55 ` Stefan Hajnoczi
2025-03-07 22:16 ` [PATCH 4/5] aio-posix: Factor out adjust_polling_time() Kevin Wolf
` (2 subsequent siblings)
5 siblings, 1 reply; 14+ messages in thread
From: Kevin Wolf @ 2025-03-07 22:16 UTC (permalink / raw)
To: qemu-block; +Cc: kwolf, stefanha, pbonzini, afaria, hreitz, qemu-devel
As a preparation for having multiple adaptive polling states per
AioContext, move the 'ns' field into a separate struct.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
include/block/aio.h | 6 +++++-
util/aio-posix.c | 31 ++++++++++++++++---------------
util/async.c | 3 ++-
3 files changed, 23 insertions(+), 17 deletions(-)
diff --git a/include/block/aio.h b/include/block/aio.h
index 43883a8a33..49f46e01cb 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -123,6 +123,10 @@ struct BHListSlice {
typedef QSLIST_HEAD(, AioHandler) AioHandlerSList;
+typedef struct AioPolledEvent {
+ int64_t ns; /* current polling time in nanoseconds */
+} AioPolledEvent;
+
struct AioContext {
GSource source;
@@ -229,7 +233,7 @@ struct AioContext {
int poll_disable_cnt;
/* Polling mode parameters */
- int64_t poll_ns; /* current polling time in nanoseconds */
+ AioPolledEvent poll;
int64_t poll_max_ns; /* maximum polling time in nanoseconds */
int64_t poll_grow; /* polling time growth factor */
int64_t poll_shrink; /* polling time shrink factor */
diff --git a/util/aio-posix.c b/util/aio-posix.c
index 06bf9f456c..95bddb9e4b 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -585,7 +585,7 @@ static bool try_poll_mode(AioContext *ctx, AioHandlerList *ready_list,
return false;
}
- max_ns = qemu_soonest_timeout(*timeout, ctx->poll_ns);
+ max_ns = qemu_soonest_timeout(*timeout, ctx->poll.ns);
if (max_ns && !ctx->fdmon_ops->need_wait(ctx)) {
/*
* Enable poll mode. It pairs with the poll_set_started() in
@@ -683,40 +683,40 @@ bool aio_poll(AioContext *ctx, bool blocking)
if (ctx->poll_max_ns) {
int64_t block_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - start;
- if (block_ns <= ctx->poll_ns) {
+ if (block_ns <= ctx->poll.ns) {
/* This is the sweet spot, no adjustment needed */
} else if (block_ns > ctx->poll_max_ns) {
/* We'd have to poll for too long, poll less */
- int64_t old = ctx->poll_ns;
+ int64_t old = ctx->poll.ns;
if (ctx->poll_shrink) {
- ctx->poll_ns /= ctx->poll_shrink;
+ ctx->poll.ns /= ctx->poll_shrink;
} else {
- ctx->poll_ns = 0;
+ ctx->poll.ns = 0;
}
- trace_poll_shrink(ctx, old, ctx->poll_ns);
- } else if (ctx->poll_ns < ctx->poll_max_ns &&
+ trace_poll_shrink(ctx, old, ctx->poll.ns);
+ } else if (ctx->poll.ns < ctx->poll_max_ns &&
block_ns < ctx->poll_max_ns) {
/* There is room to grow, poll longer */
- int64_t old = ctx->poll_ns;
+ int64_t old = ctx->poll.ns;
int64_t grow = ctx->poll_grow;
if (grow == 0) {
grow = 2;
}
- if (ctx->poll_ns) {
- ctx->poll_ns *= grow;
+ if (ctx->poll.ns) {
+ ctx->poll.ns *= grow;
} else {
- ctx->poll_ns = 4000; /* start polling at 4 microseconds */
+ ctx->poll.ns = 4000; /* start polling at 4 microseconds */
}
- if (ctx->poll_ns > ctx->poll_max_ns) {
- ctx->poll_ns = ctx->poll_max_ns;
+ if (ctx->poll.ns > ctx->poll_max_ns) {
+ ctx->poll.ns = ctx->poll_max_ns;
}
- trace_poll_grow(ctx, old, ctx->poll_ns);
+ trace_poll_grow(ctx, old, ctx->poll.ns);
}
}
@@ -770,8 +770,9 @@ void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
/* No thread synchronization here, it doesn't matter if an incorrect value
* is used once.
*/
+ ctx->poll.ns = 0;
+
ctx->poll_max_ns = max_ns;
- ctx->poll_ns = 0;
ctx->poll_grow = grow;
ctx->poll_shrink = shrink;
diff --git a/util/async.c b/util/async.c
index 0fe2943609..38667ea091 100644
--- a/util/async.c
+++ b/util/async.c
@@ -609,7 +609,8 @@ AioContext *aio_context_new(Error **errp)
qemu_rec_mutex_init(&ctx->lock);
timerlistgroup_init(&ctx->tlg, aio_timerlist_notify, ctx);
- ctx->poll_ns = 0;
+ ctx->poll.ns = 0;
+
ctx->poll_max_ns = 0;
ctx->poll_grow = 0;
ctx->poll_shrink = 0;
--
2.48.1
* [PATCH 4/5] aio-posix: Factor out adjust_polling_time()
2025-03-07 22:16 [PATCH 0/5] block: Improve writethrough performance Kevin Wolf
` (2 preceding siblings ...)
2025-03-07 22:16 ` [PATCH 3/5] aio: Create AioPolledEvent Kevin Wolf
@ 2025-03-07 22:16 ` Kevin Wolf
2025-03-10 10:55 ` Stefan Hajnoczi
2025-03-07 22:16 ` [PATCH 5/5] aio-posix: Separate AioPolledEvent per AioHandler Kevin Wolf
2025-03-10 10:55 ` [PATCH 0/5] block: Improve writethrough performance Stefan Hajnoczi
5 siblings, 1 reply; 14+ messages in thread
From: Kevin Wolf @ 2025-03-07 22:16 UTC (permalink / raw)
To: qemu-block; +Cc: kwolf, stefanha, pbonzini, afaria, hreitz, qemu-devel
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
util/aio-posix.c | 77 ++++++++++++++++++++++++++----------------------
1 file changed, 41 insertions(+), 36 deletions(-)
diff --git a/util/aio-posix.c b/util/aio-posix.c
index 95bddb9e4b..259827c7ad 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -600,6 +600,46 @@ static bool try_poll_mode(AioContext *ctx, AioHandlerList *ready_list,
return false;
}
+static void adjust_polling_time(AioContext *ctx, AioPolledEvent *poll,
+ int64_t block_ns)
+{
+ if (block_ns <= poll->ns) {
+ /* This is the sweet spot, no adjustment needed */
+ } else if (block_ns > ctx->poll_max_ns) {
+ /* We'd have to poll for too long, poll less */
+ int64_t old = poll->ns;
+
+ if (ctx->poll_shrink) {
+ poll->ns /= ctx->poll_shrink;
+ } else {
+ poll->ns = 0;
+ }
+
+ trace_poll_shrink(ctx, old, poll->ns);
+ } else if (poll->ns < ctx->poll_max_ns &&
+ block_ns < ctx->poll_max_ns) {
+ /* There is room to grow, poll longer */
+ int64_t old = poll->ns;
+ int64_t grow = ctx->poll_grow;
+
+ if (grow == 0) {
+ grow = 2;
+ }
+
+ if (poll->ns) {
+ poll->ns *= grow;
+ } else {
+ poll->ns = 4000; /* start polling at 4 microseconds */
+ }
+
+ if (poll->ns > ctx->poll_max_ns) {
+ poll->ns = ctx->poll_max_ns;
+ }
+
+ trace_poll_grow(ctx, old, poll->ns);
+ }
+}
+
bool aio_poll(AioContext *ctx, bool blocking)
{
AioHandlerList ready_list = QLIST_HEAD_INITIALIZER(ready_list);
@@ -682,42 +722,7 @@ bool aio_poll(AioContext *ctx, bool blocking)
/* Adjust polling time */
if (ctx->poll_max_ns) {
int64_t block_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - start;
-
- if (block_ns <= ctx->poll.ns) {
- /* This is the sweet spot, no adjustment needed */
- } else if (block_ns > ctx->poll_max_ns) {
- /* We'd have to poll for too long, poll less */
- int64_t old = ctx->poll.ns;
-
- if (ctx->poll_shrink) {
- ctx->poll.ns /= ctx->poll_shrink;
- } else {
- ctx->poll.ns = 0;
- }
-
- trace_poll_shrink(ctx, old, ctx->poll.ns);
- } else if (ctx->poll.ns < ctx->poll_max_ns &&
- block_ns < ctx->poll_max_ns) {
- /* There is room to grow, poll longer */
- int64_t old = ctx->poll.ns;
- int64_t grow = ctx->poll_grow;
-
- if (grow == 0) {
- grow = 2;
- }
-
- if (ctx->poll.ns) {
- ctx->poll.ns *= grow;
- } else {
- ctx->poll.ns = 4000; /* start polling at 4 microseconds */
- }
-
- if (ctx->poll.ns > ctx->poll_max_ns) {
- ctx->poll.ns = ctx->poll_max_ns;
- }
-
- trace_poll_grow(ctx, old, ctx->poll.ns);
- }
+ adjust_polling_time(ctx, &ctx->poll, block_ns);
}
progress |= aio_bh_poll(ctx);
--
2.48.1
* [PATCH 5/5] aio-posix: Separate AioPolledEvent per AioHandler
2025-03-07 22:16 [PATCH 0/5] block: Improve writethrough performance Kevin Wolf
` (3 preceding siblings ...)
2025-03-07 22:16 ` [PATCH 4/5] aio-posix: Factor out adjust_polling_time() Kevin Wolf
@ 2025-03-07 22:16 ` Kevin Wolf
2025-03-10 10:55 ` Stefan Hajnoczi
2025-03-10 10:55 ` [PATCH 0/5] block: Improve writethrough performance Stefan Hajnoczi
5 siblings, 1 reply; 14+ messages in thread
From: Kevin Wolf @ 2025-03-07 22:16 UTC (permalink / raw)
To: qemu-block; +Cc: kwolf, stefanha, pbonzini, afaria, hreitz, qemu-devel
Adaptive polling has a big problem: It doesn't consider that an event
loop can wait for many different events that may have very different
typical latencies.
For example, think of a guest that tends to send a new I/O request soon
after the previous I/O request completes, but the storage on the host is
rather slow. In this case, getting the new request from the guest quickly
means that polling is enabled, but the next thing is performing the I/O
request on the backend, which is slow and disables polling again for the
next guest request. This means that in such a scenario, polling could
help for every other event, but is only ever enabled when it can't
succeed.
In order to fix this, keep a separate AioPolledEvent for each
AioHandler. We will then know that the backend file descriptor always
has a high latency and isn't worth polling for, but we also know that
the guest is always fast and we should poll for it. This solves at least
half of the problem: we can now keep polling for those cases where it
makes sense and get the improved performance from it.
Since the event loop doesn't know which event will be next, we still do
some unnecessary polling while we're waiting for the slow disk. I made
some attempts to be more clever than just randomly growing and shrinking
the polling time, and even to let callers be explicit about when they
expect a new event, but so far this hasn't resulted in improved
performance, and has even caused performance regressions. For now, let's
just fix the part that is easy enough to fix; we can revisit the rest
later.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
include/block/aio.h | 1 -
util/aio-posix.h | 1 +
util/aio-posix.c | 24 +++++++++++++++++++++---
util/async.c | 2 --
4 files changed, 22 insertions(+), 6 deletions(-)
diff --git a/include/block/aio.h b/include/block/aio.h
index 49f46e01cb..0ef7ce48e3 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -233,7 +233,6 @@ struct AioContext {
int poll_disable_cnt;
/* Polling mode parameters */
- AioPolledEvent poll;
int64_t poll_max_ns; /* maximum polling time in nanoseconds */
int64_t poll_grow; /* polling time growth factor */
int64_t poll_shrink; /* polling time shrink factor */
diff --git a/util/aio-posix.h b/util/aio-posix.h
index 4264c518be..82a0201ea4 100644
--- a/util/aio-posix.h
+++ b/util/aio-posix.h
@@ -38,6 +38,7 @@ struct AioHandler {
#endif
int64_t poll_idle_timeout; /* when to stop userspace polling */
bool poll_ready; /* has polling detected an event? */
+ AioPolledEvent poll;
};
/* Add a handler to a ready list */
diff --git a/util/aio-posix.c b/util/aio-posix.c
index 259827c7ad..2251871c61 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -579,13 +579,19 @@ static bool run_poll_handlers(AioContext *ctx, AioHandlerList *ready_list,
static bool try_poll_mode(AioContext *ctx, AioHandlerList *ready_list,
int64_t *timeout)
{
+ AioHandler *node;
int64_t max_ns;
if (QLIST_EMPTY_RCU(&ctx->poll_aio_handlers)) {
return false;
}
- max_ns = qemu_soonest_timeout(*timeout, ctx->poll.ns);
+ max_ns = 0;
+ QLIST_FOREACH(node, &ctx->poll_aio_handlers, node_poll) {
+ max_ns = MAX(max_ns, node->poll.ns);
+ }
+ max_ns = qemu_soonest_timeout(*timeout, max_ns);
+
if (max_ns && !ctx->fdmon_ops->need_wait(ctx)) {
/*
* Enable poll mode. It pairs with the poll_set_started() in
@@ -721,8 +727,14 @@ bool aio_poll(AioContext *ctx, bool blocking)
/* Adjust polling time */
if (ctx->poll_max_ns) {
+ AioHandler *node;
int64_t block_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - start;
- adjust_polling_time(ctx, &ctx->poll, block_ns);
+
+ QLIST_FOREACH(node, &ctx->poll_aio_handlers, node_poll) {
+ if (QLIST_IS_INSERTED(node, node_ready)) {
+ adjust_polling_time(ctx, &node->poll, block_ns);
+ }
+ }
}
progress |= aio_bh_poll(ctx);
@@ -772,10 +784,16 @@ void aio_context_use_g_source(AioContext *ctx)
void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
int64_t grow, int64_t shrink, Error **errp)
{
+ AioHandler *node;
+
/* No thread synchronization here, it doesn't matter if an incorrect value
* is used once.
*/
- ctx->poll.ns = 0;
+ qemu_lockcnt_inc(&ctx->list_lock);
+ QLIST_FOREACH(node, &ctx->aio_handlers, node) {
+ node->poll.ns = 0;
+ }
+ qemu_lockcnt_dec(&ctx->list_lock);
ctx->poll_max_ns = max_ns;
ctx->poll_grow = grow;
diff --git a/util/async.c b/util/async.c
index 38667ea091..4124a948fd 100644
--- a/util/async.c
+++ b/util/async.c
@@ -609,8 +609,6 @@ AioContext *aio_context_new(Error **errp)
qemu_rec_mutex_init(&ctx->lock);
timerlistgroup_init(&ctx->tlg, aio_timerlist_notify, ctx);
- ctx->poll.ns = 0;
-
ctx->poll_max_ns = 0;
ctx->poll_grow = 0;
ctx->poll_shrink = 0;
--
2.48.1
* Re: [PATCH 1/5] file-posix: Support FUA writes
2025-03-07 22:16 ` [PATCH 1/5] file-posix: Support FUA writes Kevin Wolf
@ 2025-03-10 10:41 ` Stefan Hajnoczi
0 siblings, 0 replies; 14+ messages in thread
From: Stefan Hajnoczi @ 2025-03-10 10:41 UTC (permalink / raw)
To: Kevin Wolf; +Cc: qemu-block, pbonzini, afaria, hreitz, qemu-devel
On Fri, Mar 07, 2025 at 11:16:30PM +0100, Kevin Wolf wrote:
> Until now, FUA was always emulated with a separate flush after the write
> for file-posix. The overhead of processing a second request can reduce
> performance significantly for a guest disk that has disabled the write
> cache, especially if the host disk is already write through, too, and
> the flush isn't actually doing anything.
>
> Advertise support for REQ_FUA in write requests and implement it for
> Linux AIO and io_uring using the RWF_DSYNC flag for write requests. The
> thread pool still performs a separate fdatasync() call. This can be
> improved later by using the pwritev2() syscall if available.
>
> As an example, this is how fio numbers can be improved in some scenarios
> with this patch (all using virtio-blk with cache=directsync on an nvme
> block device for the VM, fio with ioengine=libaio,direct=1,sync=1):
>
>                               |      old      | with FUA support
> ------------------------------+---------------+-------------------
> bs=4k, iodepth=1, numjobs=1   |  45.6k iops   |  56.1k iops
> bs=4k, iodepth=1, numjobs=16  | 183.3k iops   | 236.0k iops
> bs=4k, iodepth=16, numjobs=1  | 258.4k iops   | 311.1k iops
>
> However, not all scenarios are clear wins. On another slower disk I saw
> little to no improvement. In fact, in two corner case scenarios, I even
> observed a regression, which I nevertheless consider acceptable:
>
> 1. On slow host disks in a write through cache mode, when the guest is
> using virtio-blk in a separate iothread so that polling can be
> enabled, and each completion is quickly followed up with a new
> request (so that polling gets it), it can happen that enabling FUA
> makes things slower - the additional very fast no-op flush we used to
> have gave the adaptive polling algorithm a success so that it kept
> polling. Without it, we only have the slow write request, which
> disables polling. This is a problem in the polling algorithm that
> will be fixed later in this series.
>
> 2. With a high queue depth, it can be beneficial to have flush requests
> for another reason: The optimisation in bdrv_co_flush() that flushes
> only once per write generation acts as a synchronisation mechanism
> that lets all requests complete at the same time. This can result in
> better batching and if the disk is very fast (I only saw this with a
> null_blk backend), this can make up for the overhead of the flush and
> improve throughput. In theory, we could optionally introduce a
> similar artificial latency in the normal completion path to achieve
> the same kind of completion batching. This is not implemented in this
> series.
>
> Compatibility is not a concern for io_uring, which has supported RWF_DSYNC
> from the start. Linux AIO started supporting it in Linux 4.13 and libaio
> 0.3.111. The kernel is not a problem for any supported build platform,
> so it's not necessary to add runtime checks. However, openSUSE is still
> stuck with an older libaio version that would break the build. We must
> detect this at build time to avoid build failures.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
> include/block/raw-aio.h | 8 ++++++--
> block/file-posix.c | 26 ++++++++++++++++++--------
> block/io_uring.c | 13 ++++++++-----
> block/linux-aio.c | 24 +++++++++++++++++++++---
> meson.build | 4 ++++
> 5 files changed, 57 insertions(+), 18 deletions(-)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
* Re: [PATCH 2/5] block/io: Ignore FUA with cache.no-flush=on
2025-03-07 22:16 ` [PATCH 2/5] block/io: Ignore FUA with cache.no-flush=on Kevin Wolf
@ 2025-03-10 10:42 ` Stefan Hajnoczi
0 siblings, 0 replies; 14+ messages in thread
From: Stefan Hajnoczi @ 2025-03-10 10:42 UTC (permalink / raw)
To: Kevin Wolf; +Cc: qemu-block, pbonzini, afaria, hreitz, qemu-devel
On Fri, Mar 07, 2025 at 11:16:31PM +0100, Kevin Wolf wrote:
> For block drivers that don't advertise FUA support, we already call
> bdrv_co_flush(), which considers BDRV_O_NO_FLUSH. However, drivers that
> do support FUA still see the FUA flag with BDRV_O_NO_FLUSH and get the
> associated performance penalty that cache.no-flush=on was supposed to
> avoid.
>
> Clear FUA for write requests if BDRV_O_NO_FLUSH is set.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
> block/io.c | 4 ++++
> 1 file changed, 4 insertions(+)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
* Re: [PATCH 5/5] aio-posix: Separate AioPolledEvent per AioHandler
2025-03-07 22:16 ` [PATCH 5/5] aio-posix: Separate AioPolledEvent per AioHandler Kevin Wolf
@ 2025-03-10 10:55 ` Stefan Hajnoczi
2025-03-10 11:11 ` Kevin Wolf
0 siblings, 1 reply; 14+ messages in thread
From: Stefan Hajnoczi @ 2025-03-10 10:55 UTC (permalink / raw)
To: Kevin Wolf; +Cc: qemu-block, pbonzini, afaria, hreitz, qemu-devel
On Fri, Mar 07, 2025 at 11:16:34PM +0100, Kevin Wolf wrote:
> Adaptive polling has a big problem: It doesn't consider that an event
> loop can wait for many different events that may have very different
> typical latencies.
>
> For example, think of a guest that tends to send a new I/O request soon
> after the previous I/O request completes, but the storage on the host is
> rather slow. In this case, getting the new request from the guest quickly
> means that polling is enabled, but the next thing is performing the I/O
> request on the backend, which is slow and disables polling again for the
> next guest request. This means that in such a scenario, polling could
> help for every other event, but is only ever enabled when it can't
> succeed.
>
> In order to fix this, keep a separate AioPolledEvent for each
> AioHandler. We will then know that the backend file descriptor always
> has a high latency and isn't worth polling for, but we also know that
> the guest is always fast and we should poll for it. This solves at least
> half of the problem: we can now keep polling for those cases where it
> makes sense and get the improved performance from it.
>
> Since the event loop doesn't know which event will be next, we still do
> some unnecessary polling while we're waiting for the slow disk. I made
> some attempts to be more clever than just randomly growing and shrinking
> the polling time, and even to let callers be explicit about when they
> expect a new event, but so far this hasn't resulted in improved
> performance, and has even caused performance regressions. For now, let's
> just fix the part that is easy enough to fix; we can revisit the rest
> later.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
> include/block/aio.h | 1 -
> util/aio-posix.h | 1 +
> util/aio-posix.c | 24 +++++++++++++++++++++---
> util/async.c | 2 --
> 4 files changed, 22 insertions(+), 6 deletions(-)
>
> diff --git a/include/block/aio.h b/include/block/aio.h
> index 49f46e01cb..0ef7ce48e3 100644
> --- a/include/block/aio.h
> +++ b/include/block/aio.h
> @@ -233,7 +233,6 @@ struct AioContext {
> int poll_disable_cnt;
>
> /* Polling mode parameters */
> - AioPolledEvent poll;
> int64_t poll_max_ns; /* maximum polling time in nanoseconds */
> int64_t poll_grow; /* polling time growth factor */
> int64_t poll_shrink; /* polling time shrink factor */
> diff --git a/util/aio-posix.h b/util/aio-posix.h
> index 4264c518be..82a0201ea4 100644
> --- a/util/aio-posix.h
> +++ b/util/aio-posix.h
> @@ -38,6 +38,7 @@ struct AioHandler {
> #endif
> int64_t poll_idle_timeout; /* when to stop userspace polling */
> bool poll_ready; /* has polling detected an event? */
> + AioPolledEvent poll;
> };
>
> /* Add a handler to a ready list */
> diff --git a/util/aio-posix.c b/util/aio-posix.c
> index 259827c7ad..2251871c61 100644
> --- a/util/aio-posix.c
> +++ b/util/aio-posix.c
> @@ -579,13 +579,19 @@ static bool run_poll_handlers(AioContext *ctx, AioHandlerList *ready_list,
> static bool try_poll_mode(AioContext *ctx, AioHandlerList *ready_list,
> int64_t *timeout)
> {
> + AioHandler *node;
> int64_t max_ns;
>
> if (QLIST_EMPTY_RCU(&ctx->poll_aio_handlers)) {
> return false;
> }
>
> - max_ns = qemu_soonest_timeout(*timeout, ctx->poll.ns);
> + max_ns = 0;
> + QLIST_FOREACH(node, &ctx->poll_aio_handlers, node_poll) {
> + max_ns = MAX(max_ns, node->poll.ns);
> + }
> + max_ns = qemu_soonest_timeout(*timeout, max_ns);
> +
> if (max_ns && !ctx->fdmon_ops->need_wait(ctx)) {
> /*
> * Enable poll mode. It pairs with the poll_set_started() in
> @@ -721,8 +727,14 @@ bool aio_poll(AioContext *ctx, bool blocking)
>
> /* Adjust polling time */
> if (ctx->poll_max_ns) {
> + AioHandler *node;
> int64_t block_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - start;
> - adjust_polling_time(ctx, &ctx->poll, block_ns);
> +
> + QLIST_FOREACH(node, &ctx->poll_aio_handlers, node_poll) {
> + if (QLIST_IS_INSERTED(node, node_ready)) {
> + adjust_polling_time(ctx, &node->poll, block_ns);
> + }
> + }
> }
>
> progress |= aio_bh_poll(ctx);
> @@ -772,10 +784,16 @@ void aio_context_use_g_source(AioContext *ctx)
> void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
> int64_t grow, int64_t shrink, Error **errp)
> {
> + AioHandler *node;
> +
> /* No thread synchronization here, it doesn't matter if an incorrect value
> * is used once.
> */
If you respin this series:
This comment is confusing now that qemu_lockcnt_inc() is being used.
Lockcnt tells other threads in aio_set_fd_handler() not to remove nodes
from the aio_handlers list (because we're traversing the list).
The comment is about the poll state though, not about the aio_handlers
list. Moving it down to where poll_max_ns, etc are assigned would make
it clearer.
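For context, qemu_lockcnt_count() is what makes this work: a non-zero
reader count only defers node removal rather than excluding readers. The
removal side looks roughly like this (an editorial sketch modelled on
aio_remove_fd_handler() in util/aio-posix.c):

    /* If any reader is traversing aio_handlers, don't free the node now;
     * park it so it can be reaped once the reader count drops to zero */
    if (qemu_lockcnt_count(&ctx->list_lock)) {
        QLIST_INSERT_HEAD_RCU(&ctx->deleted_aio_handlers, node,
                              node_deleted);
    } else {
        QLIST_REMOVE(node, node);
        g_free(node);
    }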
> - ctx->poll.ns = 0;
> + qemu_lockcnt_inc(&ctx->list_lock);
> + QLIST_FOREACH(node, &ctx->aio_handlers, node) {
> + node->poll.ns = 0;
> + }
> + qemu_lockcnt_dec(&ctx->list_lock);
>
> ctx->poll_max_ns = max_ns;
> ctx->poll_grow = grow;
> diff --git a/util/async.c b/util/async.c
> index 38667ea091..4124a948fd 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -609,8 +609,6 @@ AioContext *aio_context_new(Error **errp)
> qemu_rec_mutex_init(&ctx->lock);
> timerlistgroup_init(&ctx->tlg, aio_timerlist_notify, ctx);
>
> - ctx->poll.ns = 0;
> -
> ctx->poll_max_ns = 0;
> ctx->poll_grow = 0;
> ctx->poll_shrink = 0;
> --
> 2.48.1
>
* Re: [PATCH 3/5] aio: Create AioPolledEvent
2025-03-07 22:16 ` [PATCH 3/5] aio: Create AioPolledEvent Kevin Wolf
@ 2025-03-10 10:55 ` Stefan Hajnoczi
0 siblings, 0 replies; 14+ messages in thread
From: Stefan Hajnoczi @ 2025-03-10 10:55 UTC (permalink / raw)
To: Kevin Wolf; +Cc: qemu-block, pbonzini, afaria, hreitz, qemu-devel
On Fri, Mar 07, 2025 at 11:16:32PM +0100, Kevin Wolf wrote:
> As a preparation for having multiple adaptive polling states per
> AioContext, move the 'ns' field into a separate struct.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
> include/block/aio.h | 6 +++++-
> util/aio-posix.c | 31 ++++++++++++++++---------------
> util/async.c | 3 ++-
> 3 files changed, 23 insertions(+), 17 deletions(-)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
* Re: [PATCH 4/5] aio-posix: Factor out adjust_polling_time()
2025-03-07 22:16 ` [PATCH 4/5] aio-posix: Factor out adjust_polling_time() Kevin Wolf
@ 2025-03-10 10:55 ` Stefan Hajnoczi
0 siblings, 0 replies; 14+ messages in thread
From: Stefan Hajnoczi @ 2025-03-10 10:55 UTC (permalink / raw)
To: Kevin Wolf; +Cc: qemu-block, pbonzini, afaria, hreitz, qemu-devel
On Fri, Mar 07, 2025 at 11:16:33PM +0100, Kevin Wolf wrote:
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
> util/aio-posix.c | 77 ++++++++++++++++++++++++++----------------------
> 1 file changed, 41 insertions(+), 36 deletions(-)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
* Re: [PATCH 0/5] block: Improve writethrough performance
2025-03-07 22:16 [PATCH 0/5] block: Improve writethrough performance Kevin Wolf
` (4 preceding siblings ...)
2025-03-07 22:16 ` [PATCH 5/5] aio-posix: Separate AioPolledEvent per AioHandler Kevin Wolf
@ 2025-03-10 10:55 ` Stefan Hajnoczi
5 siblings, 0 replies; 14+ messages in thread
From: Stefan Hajnoczi @ 2025-03-10 10:55 UTC (permalink / raw)
To: Kevin Wolf; +Cc: qemu-block, pbonzini, afaria, hreitz, qemu-devel
On Fri, Mar 07, 2025 at 11:16:29PM +0100, Kevin Wolf wrote:
> Write requests in writethrough mode mean that QEMU sends a separate
> flush request (i.e. fdatasync) after each completed write request.
> This is unnecessary overhead when we can just pass a flag for the write
> request that gives us the desired FUA semantics.
>
> Unfortunately, this made a problem in the adaptive polling algorithm
> apparent that would result in a performance regression in some cases
> with FUA support in file-posix. Therefore, adaptive polling is changed
> in this series, too, to avoid the regression.
Looks good!
Stefan
>
> Kevin Wolf (5):
> file-posix: Support FUA writes
> block/io: Ignore FUA with cache.no-flush=on
> aio: Create AioPolledEvent
> aio-posix: Factor out adjust_polling_time()
> aio-posix: Separate AioPolledEvent per AioHandler
>
> include/block/aio.h | 5 ++-
> include/block/raw-aio.h | 8 +++-
> util/aio-posix.h | 1 +
> block/file-posix.c | 26 ++++++++----
> block/io.c | 4 ++
> block/io_uring.c | 13 +++---
> block/linux-aio.c | 24 +++++++++--
> util/aio-posix.c | 94 ++++++++++++++++++++++++++---------------
> util/async.c | 1 -
> meson.build | 4 ++
> 10 files changed, 125 insertions(+), 55 deletions(-)
>
> --
> 2.48.1
>
* Re: [PATCH 5/5] aio-posix: Separate AioPolledEvent per AioHandler
2025-03-10 10:55 ` Stefan Hajnoczi
@ 2025-03-10 11:11 ` Kevin Wolf
2025-03-11 2:18 ` Stefan Hajnoczi
0 siblings, 1 reply; 14+ messages in thread
From: Kevin Wolf @ 2025-03-10 11:11 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: qemu-block, pbonzini, afaria, hreitz, qemu-devel
On 10.03.2025 at 11:55, Stefan Hajnoczi wrote:
> On Fri, Mar 07, 2025 at 11:16:34PM +0100, Kevin Wolf wrote:
> > Adaptive polling has a big problem: It doesn't consider that an event
> > loop can wait for many different events that may have very different
> > typical latencies.
> >
> > For example, think of a guest that tends to send a new I/O request soon
> > after the previous I/O request completes, but the storage on the host is
> > rather slow. In this case, getting the new request from the guest quickly
> > means that polling is enabled, but the next thing is performing the I/O
> > request on the backend, which is slow and disables polling again for the
> > next guest request. This means that in such a scenario, polling could
> > help for every other event, but is only ever enabled when it can't
> > succeed.
> >
> > In order to fix this, keep a separate AioPolledEvent for each
> > AioHandler. We will then know that the backend file descriptor always
> > has a high latency and isn't worth polling for, but we also know that
> > the guest is always fast and we should poll for it. This solves at least
> > half of the problem: we can now keep polling for those cases where it
> > makes sense and get the improved performance from it.
> >
> > Since the event loop doesn't know which event will be next, we still do
> > some unnecessary polling while we're waiting for the slow disk. I made
> > some attempts to be more clever than just randomly growing and shrinking
> > the polling time, and even to let callers be explicit about when they
> > expect a new event, but so far this hasn't resulted in improved
> > performance, and has even caused performance regressions. For now, let's
> > just fix the part that is easy enough to fix; we can revisit the rest
> > later.
> >
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > ---
> > include/block/aio.h | 1 -
> > util/aio-posix.h | 1 +
> > util/aio-posix.c | 24 +++++++++++++++++++++---
> > util/async.c | 2 --
> > 4 files changed, 22 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/block/aio.h b/include/block/aio.h
> > index 49f46e01cb..0ef7ce48e3 100644
> > --- a/include/block/aio.h
> > +++ b/include/block/aio.h
> > @@ -233,7 +233,6 @@ struct AioContext {
> > int poll_disable_cnt;
> >
> > /* Polling mode parameters */
> > - AioPolledEvent poll;
> > int64_t poll_max_ns; /* maximum polling time in nanoseconds */
> > int64_t poll_grow; /* polling time growth factor */
> > int64_t poll_shrink; /* polling time shrink factor */
> > diff --git a/util/aio-posix.h b/util/aio-posix.h
> > index 4264c518be..82a0201ea4 100644
> > --- a/util/aio-posix.h
> > +++ b/util/aio-posix.h
> > @@ -38,6 +38,7 @@ struct AioHandler {
> > #endif
> > int64_t poll_idle_timeout; /* when to stop userspace polling */
> > bool poll_ready; /* has polling detected an event? */
> > + AioPolledEvent poll;
> > };
> >
> > /* Add a handler to a ready list */
> > diff --git a/util/aio-posix.c b/util/aio-posix.c
> > index 259827c7ad..2251871c61 100644
> > --- a/util/aio-posix.c
> > +++ b/util/aio-posix.c
> > @@ -579,13 +579,19 @@ static bool run_poll_handlers(AioContext *ctx, AioHandlerList *ready_list,
> > static bool try_poll_mode(AioContext *ctx, AioHandlerList *ready_list,
> > int64_t *timeout)
> > {
> > + AioHandler *node;
> > int64_t max_ns;
> >
> > if (QLIST_EMPTY_RCU(&ctx->poll_aio_handlers)) {
> > return false;
> > }
> >
> > - max_ns = qemu_soonest_timeout(*timeout, ctx->poll.ns);
> > + max_ns = 0;
> > + QLIST_FOREACH(node, &ctx->poll_aio_handlers, node_poll) {
> > + max_ns = MAX(max_ns, node->poll.ns);
> > + }
> > + max_ns = qemu_soonest_timeout(*timeout, max_ns);
> > +
> > if (max_ns && !ctx->fdmon_ops->need_wait(ctx)) {
> > /*
> > * Enable poll mode. It pairs with the poll_set_started() in
> > @@ -721,8 +727,14 @@ bool aio_poll(AioContext *ctx, bool blocking)
> >
> > /* Adjust polling time */
> > if (ctx->poll_max_ns) {
> > + AioHandler *node;
> > int64_t block_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - start;
> > - adjust_polling_time(ctx, &ctx->poll, block_ns);
> > +
> > + QLIST_FOREACH(node, &ctx->poll_aio_handlers, node_poll) {
> > + if (QLIST_IS_INSERTED(node, node_ready)) {
> > + adjust_polling_time(ctx, &node->poll, block_ns);
> > + }
> > + }
> > }
> >
> > progress |= aio_bh_poll(ctx);
> > @@ -772,10 +784,16 @@ void aio_context_use_g_source(AioContext *ctx)
> > void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
> > int64_t grow, int64_t shrink, Error **errp)
> > {
> > + AioHandler *node;
> > +
> > /* No thread synchronization here, it doesn't matter if an incorrect value
> > * is used once.
> > */
>
> If you respin this series:
>
> This comment is confusing now that qemu_lockcnt_inc() is being used.
> Lockcnt tells other threads in aio_set_fd_handler() not to remove nodes
> from the aio_handlers list (because we're traversing the list).
>
> The comment is about the poll state though, not about the aio_handlers
> list. Moving it down to where poll_max_ns, etc are assigned would make
> it clearer.
Yes, I can do that while applying the series.
Should I add your R-b after making the change?
Kevin
* Re: [PATCH 5/5] aio-posix: Separate AioPolledEvent per AioHandler
2025-03-10 11:11 ` Kevin Wolf
@ 2025-03-11 2:18 ` Stefan Hajnoczi
0 siblings, 0 replies; 14+ messages in thread
From: Stefan Hajnoczi @ 2025-03-11 2:18 UTC (permalink / raw)
To: Kevin Wolf; +Cc: qemu-block, pbonzini, afaria, hreitz, qemu-devel
On Mon, Mar 10, 2025 at 12:11:44PM +0100, Kevin Wolf wrote:
> Am 10.03.2025 um 11:55 hat Stefan Hajnoczi geschrieben:
> > On Fri, Mar 07, 2025 at 11:16:34PM +0100, Kevin Wolf wrote:
> > > Adaptive polling has a big problem: It doesn't consider that an event
> > > loop can wait for many different events that may have very different
> > > typical latencies.
> > >
> > > For example, think of a guest that tends to send a new I/O request soon
> > > after the previous I/O request completes, but the storage on the host is
> > > rather slow. In this case, getting the new request from the guest quickly
> > > means that polling is enabled, but the next thing is performing the I/O
> > > request on the backend, which is slow and disables polling again for the
> > > next guest request. This means that in such a scenario, polling could
> > > help for every other event, but is only ever enabled when it can't
> > > succeed.
> > >
> > > In order to fix this, keep a separate AioPolledEvent for each
> > > AioHandler. We will then know that the backend file descriptor always
> > > has a high latency and isn't worth polling for, but we also know that
> > > the guest is always fast and we should poll for it. This solves at least
> > > half of the problem: we can now keep polling for those cases where it
> > > makes sense and get the improved performance from it.
> > >
> > > Since the event loop doesn't know which event will be next, we still do
> > > some unnecessary polling while we're waiting for the slow disk. I made
> > > some attempts to be more clever than just randomly growing and shrinking
> > > the polling time, and even to let callers be explicit about when they
> > > expect a new event, but so far this hasn't resulted in improved
> > > performance, and has even caused performance regressions. For now, let's
> > > just fix the part that is easy enough to fix; we can revisit the rest
> > > later.
> > >
> > > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > > ---
> > > include/block/aio.h | 1 -
> > > util/aio-posix.h | 1 +
> > > util/aio-posix.c | 24 +++++++++++++++++++++---
> > > util/async.c | 2 --
> > > 4 files changed, 22 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/include/block/aio.h b/include/block/aio.h
> > > index 49f46e01cb..0ef7ce48e3 100644
> > > --- a/include/block/aio.h
> > > +++ b/include/block/aio.h
> > > @@ -233,7 +233,6 @@ struct AioContext {
> > > int poll_disable_cnt;
> > >
> > > /* Polling mode parameters */
> > > - AioPolledEvent poll;
> > > int64_t poll_max_ns; /* maximum polling time in nanoseconds */
> > > int64_t poll_grow; /* polling time growth factor */
> > > int64_t poll_shrink; /* polling time shrink factor */
> > > diff --git a/util/aio-posix.h b/util/aio-posix.h
> > > index 4264c518be..82a0201ea4 100644
> > > --- a/util/aio-posix.h
> > > +++ b/util/aio-posix.h
> > > @@ -38,6 +38,7 @@ struct AioHandler {
> > > #endif
> > > int64_t poll_idle_timeout; /* when to stop userspace polling */
> > > bool poll_ready; /* has polling detected an event? */
> > > + AioPolledEvent poll;
> > > };
> > >
> > > /* Add a handler to a ready list */
> > > diff --git a/util/aio-posix.c b/util/aio-posix.c
> > > index 259827c7ad..2251871c61 100644
> > > --- a/util/aio-posix.c
> > > +++ b/util/aio-posix.c
> > > @@ -579,13 +579,19 @@ static bool run_poll_handlers(AioContext *ctx, AioHandlerList *ready_list,
> > > static bool try_poll_mode(AioContext *ctx, AioHandlerList *ready_list,
> > > int64_t *timeout)
> > > {
> > > + AioHandler *node;
> > > int64_t max_ns;
> > >
> > > if (QLIST_EMPTY_RCU(&ctx->poll_aio_handlers)) {
> > > return false;
> > > }
> > >
> > > - max_ns = qemu_soonest_timeout(*timeout, ctx->poll.ns);
> > > + max_ns = 0;
> > > + QLIST_FOREACH(node, &ctx->poll_aio_handlers, node_poll) {
> > > + max_ns = MAX(max_ns, node->poll.ns);
> > > + }
> > > + max_ns = qemu_soonest_timeout(*timeout, max_ns);
> > > +
> > > if (max_ns && !ctx->fdmon_ops->need_wait(ctx)) {
> > > /*
> > > * Enable poll mode. It pairs with the poll_set_started() in
> > > @@ -721,8 +727,14 @@ bool aio_poll(AioContext *ctx, bool blocking)
> > >
> > > /* Adjust polling time */
> > > if (ctx->poll_max_ns) {
> > > + AioHandler *node;
> > > int64_t block_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - start;
> > > - adjust_polling_time(ctx, &ctx->poll, block_ns);
> > > +
> > > + QLIST_FOREACH(node, &ctx->poll_aio_handlers, node_poll) {
> > > + if (QLIST_IS_INSERTED(node, node_ready)) {
> > > + adjust_polling_time(ctx, &node->poll, block_ns);
> > > + }
> > > + }
> > > }
> > >
> > > progress |= aio_bh_poll(ctx);
> > > @@ -772,10 +784,16 @@ void aio_context_use_g_source(AioContext *ctx)
> > > void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
> > > int64_t grow, int64_t shrink, Error **errp)
> > > {
> > > + AioHandler *node;
> > > +
> > > /* No thread synchronization here, it doesn't matter if an incorrect value
> > > * is used once.
> > > */
> >
> > If you respin this series:
> >
> > This comment is confusing now that qemu_lockcnt_inc() is being used.
> > Lockcnt tells other threads in aio_set_fd_handler() not to remove nodes
> > from the aio_handlers list (because we're traversing the list).
> >
> > The comment is about the poll state though, not about the aio_handlers
> > list. Moving it down to where poll_max_ns, etc are assigned would make
> > it clearer.
>
> Yes, I can do that while applying the series.
>
> Should I add your R-b after making the change?
Yes, please.
Stefan