qemu-devel.nongnu.org archive mirror
* [RFC 0/3] aio-posix: enable io_uring SINGLE_ISSUER and TASKRUN flags
@ 2025-07-24 20:46 Stefan Hajnoczi
  2025-07-24 20:47 ` [RFC 1/3] iothread: create AioContext in iothread_run() Stefan Hajnoczi
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Stefan Hajnoczi @ 2025-07-24 20:46 UTC (permalink / raw)
  To: qemu-devel
  Cc: Stefan Hajnoczi, Brian Song, qemu-block, Bernd Schubert,
	Kevin Wolf, h0lyalg0rithm, Fam Zheng

Do not merge this series: the performance effects are not significant. I am
sharing it mainly to archive the patches and in case someone has ideas on how
to improve it.

Bernd Schubert mentioned io_uring_setup(2) flags that may improve performance:
- IORING_SETUP_SINGLE_ISSUER: optimization when only 1 thread uses an io_uring context
- IORING_SETUP_COOP_TASKRUN: avoids IPIs
- IORING_SETUP_TASKRUN_FLAG: makes COOP_TASKRUN work with userspace CQ ring polling

Suraj Shirvankar already started work on SINGLE_ISSUER in the past:
https://lore.kernel.org/qemu-devel/174293621917.22751.11381319865102029969-0@git.sr.ht/

Where this differs from Suraj's previous work is that I have worked around the
need for the main loop AioContext to be shared by multiple threads (vCPU
threads and the migration thread).

Here are the performance numbers for fio bs=4k in a 4 vCPU guest with 1
IOThread using a virtio-blk disk backed by a local NVMe drive (percentages
are relative to a baseline without these flags):

                      IOPS               IOPS
Benchmark             SINGLE_ISSUER      SINGLE_ISSUER|COOP_TASKRUN|TASKRUN_FLAG
randread  iodepth=1   54,045 (+1.2%)     54,189 (+1.5%)
randread  iodepth=64  318,135 (+0.1%)    315,632 (-0.68%)
randwrite iodepth=1   141,918 (-0.44%)   143,337 (+0.55%)
randwrite iodepth=64  323,948 (-0.015%)  322,755 (-0.38%)

You can find detailed benchmarking results here including the fio
output, fio command-line, and guest libvirt domain XML:
https://gitlab.com/stefanha/virt-playbooks/-/tree/io_uring-flags/notebook/fio-output
https://gitlab.com/stefanha/virt-playbooks/-/blob/io_uring-flags/files/fio.sh
https://gitlab.com/stefanha/virt-playbooks/-/blob/io_uring-flags/files/test.xml.j2

Stefan Hajnoczi (3):
  iothread: create AioContext in iothread_run()
  aio-posix: enable IORING_SETUP_SINGLE_ISSUER
  aio-posix: enable IORING_SETUP_COOP_TASKRUN |
    IORING_SETUP_TASKRUN_FLAG

 include/system/iothread.h |   1 -
 iothread.c                | 140 +++++++++++++++++++++-----------------
 util/fdmon-io_uring.c     |  26 ++++++-
 3 files changed, 101 insertions(+), 66 deletions(-)

-- 
2.50.1




* [RFC 1/3] iothread: create AioContext in iothread_run()
  2025-07-24 20:46 [RFC 0/3] aio-posix: enable io_uring SINGLE_ISSUER and TASKRUN flags Stefan Hajnoczi
@ 2025-07-24 20:47 ` Stefan Hajnoczi
  2025-07-24 20:47 ` [RFC 2/3] aio-posix: enable IORING_SETUP_SINGLE_ISSUER Stefan Hajnoczi
  2025-07-24 20:47 ` [RFC 3/3] aio-posix: enable IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG Stefan Hajnoczi
  2 siblings, 0 replies; 4+ messages in thread
From: Stefan Hajnoczi @ 2025-07-24 20:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: Stefan Hajnoczi, Brian Song, qemu-block, Bernd Schubert,
	Kevin Wolf, h0lyalg0rithm, Fam Zheng

The IOThread's AioContext is currently created in iothread_init() where
it's easy to propagate errors before spawning the thread that runs
iothread_run(). However, this means that aio_context_new() is called
from the main loop thread rather than from the IOThread.

In order to use Linux io_uring's IORING_SETUP_SINGLE_ISSUER feature in
the next commit, only one thread can use the io_uring context and
therefore iothread.c must call aio_context_new() from iothread_run()
instead of iothread_init().

Extract the iothread_run() arguments into an IOThreadRunArgs struct
where an Error *error field can be used to report back initialization
errors. This works pretty well thanks to the init_done_sem semaphore
that is already used by iothread_init() to wait for iothread_run() to
initialize.

Move iothread_run() further down for proximity with iothread_init() and
to avoid adding a function prototype for
iothread_set_aio_context_params().

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/system/iothread.h |   1 -
 iothread.c                | 140 +++++++++++++++++++++-----------------
 2 files changed, 78 insertions(+), 63 deletions(-)

diff --git a/include/system/iothread.h b/include/system/iothread.h
index d95c17a645..ec4e798d5e 100644
--- a/include/system/iothread.h
+++ b/include/system/iothread.h
@@ -29,7 +29,6 @@ struct IOThread {
     bool run_gcontext;          /* whether we should run gcontext */
     GMainContext *worker_context;
     GMainLoop *main_loop;
-    QemuSemaphore init_done_sem; /* is thread init done? */
     bool stopping;              /* has iothread_stop() been called? */
     bool running;               /* should iothread_run() continue? */
     int thread_id;
diff --git a/iothread.c b/iothread.c
index 8810376dce..c6547779d0 100644
--- a/iothread.c
+++ b/iothread.c
@@ -36,46 +36,6 @@
 #define IOTHREAD_POLL_MAX_NS_DEFAULT 0ULL
 #endif
 
-static void *iothread_run(void *opaque)
-{
-    IOThread *iothread = opaque;
-
-    rcu_register_thread();
-    /*
-     * g_main_context_push_thread_default() must be called before anything
-     * in this new thread uses glib.
-     */
-    g_main_context_push_thread_default(iothread->worker_context);
-    qemu_set_current_aio_context(iothread->ctx);
-    iothread->thread_id = qemu_get_thread_id();
-    qemu_sem_post(&iothread->init_done_sem);
-
-    while (iothread->running) {
-        /*
-         * Note: from functional-wise the g_main_loop_run() below can
-         * already cover the aio_poll() events, but we can't run the
-         * main loop unconditionally because explicit aio_poll() here
-         * is faster than g_main_loop_run() when we do not need the
-         * gcontext at all (e.g., pure block layer iothreads).  In
-         * other words, when we want to run the gcontext with the
-         * iothread we need to pay some performance for functionality.
-         */
-        aio_poll(iothread->ctx, true);
-
-        /*
-         * We must check the running state again in case it was
-         * changed in previous aio_poll()
-         */
-        if (iothread->running && qatomic_read(&iothread->run_gcontext)) {
-            g_main_loop_run(iothread->main_loop);
-        }
-    }
-
-    g_main_context_pop_thread_default(iothread->worker_context);
-    rcu_unregister_thread();
-    return NULL;
-}
-
 /* Runs in iothread_run() thread */
 static void iothread_stop_bh(void *opaque)
 {
@@ -104,7 +64,6 @@ static void iothread_instance_init(Object *obj)
 
     iothread->poll_max_ns = IOTHREAD_POLL_MAX_NS_DEFAULT;
     iothread->thread_id = -1;
-    qemu_sem_init(&iothread->init_done_sem, 0);
     /* By default, we don't run gcontext */
     qatomic_set(&iothread->run_gcontext, 0);
 }
@@ -135,7 +94,6 @@ static void iothread_instance_finalize(Object *obj)
         g_main_loop_unref(iothread->main_loop);
         iothread->main_loop = NULL;
     }
-    qemu_sem_destroy(&iothread->init_done_sem);
 }
 
 static void iothread_init_gcontext(IOThread *iothread, const char *thread_name)
@@ -176,47 +134,105 @@ static void iothread_set_aio_context_params(EventLoopBase *base, Error **errp)
                                        base->thread_pool_max, errp);
 }
 
+typedef struct {
+    IOThread *iothread;
+    const char *thread_name;
+    QemuSemaphore init_done_sem; /* is thread init done? */
+    Error *error; /* filled in before init_done_sem is posted */
+} IOThreadRunArgs;
 
-static void iothread_init(EventLoopBase *base, Error **errp)
+static void *iothread_run(void *opaque)
 {
-    Error *local_error = NULL;
-    IOThread *iothread = IOTHREAD(base);
-    g_autofree char *thread_name = NULL;
+    IOThreadRunArgs *args = opaque;
+    IOThread *iothread = args->iothread;
 
-    iothread->stopping = false;
-    iothread->running = true;
-    iothread->ctx = aio_context_new(errp);
+    rcu_register_thread();
+
+    iothread->ctx = aio_context_new(&args->error);
     if (!iothread->ctx) {
-        return;
+        goto out;
     }
 
-    thread_name = g_strdup_printf("IO %s",
-                        object_get_canonical_path_component(OBJECT(base)));
+    iothread_set_aio_context_params(EVENT_LOOP_BASE(iothread), &args->error);
+    if (args->error) {
+        aio_context_unref(iothread->ctx);
+        iothread->ctx = NULL;
+        goto out;
+    }
 
     /*
      * Init one GMainContext for the iothread unconditionally, even if
      * it's not used
      */
-    iothread_init_gcontext(iothread, thread_name);
+    iothread_init_gcontext(iothread, args->thread_name);
 
-    iothread_set_aio_context_params(base, &local_error);
-    if (local_error) {
-        error_propagate(errp, local_error);
-        aio_context_unref(iothread->ctx);
-        iothread->ctx = NULL;
-        return;
+    /*
+     * g_main_context_push_thread_default() must be called before anything
+     * in this new thread uses glib.
+     */
+    g_main_context_push_thread_default(iothread->worker_context);
+    qemu_set_current_aio_context(iothread->ctx);
+
+    iothread->stopping = false;
+    iothread->running = true;
+
+    iothread->thread_id = qemu_get_thread_id();
+    qemu_sem_post(&args->init_done_sem);
+
+    while (iothread->running) {
+        /*
+         * Note: from functional-wise the g_main_loop_run() below can
+         * already cover the aio_poll() events, but we can't run the
+         * main loop unconditionally because explicit aio_poll() here
+         * is faster than g_main_loop_run() when we do not need the
+         * gcontext at all (e.g., pure block layer iothreads).  In
+         * other words, when we want to run the gcontext with the
+         * iothread we need to pay some performance for functionality.
+         */
+        aio_poll(iothread->ctx, true);
+
+        /*
+         * We must check the running state again in case it was
+         * changed in previous aio_poll()
+         */
+        if (iothread->running && qatomic_read(&iothread->run_gcontext)) {
+            g_main_loop_run(iothread->main_loop);
+        }
     }
 
+    g_main_context_pop_thread_default(iothread->worker_context);
+out:
+    rcu_unregister_thread();
+    return NULL;
+}
+
+static void iothread_init(EventLoopBase *base, Error **errp)
+{
+    IOThread *iothread = IOTHREAD(base);
+    g_autofree char *thread_name = NULL;
+    IOThreadRunArgs args = {
+        .iothread = iothread,
+    };
+
+    qemu_sem_init(&args.init_done_sem, 0);
+
+    thread_name = g_strdup_printf("IO %s",
+                        object_get_canonical_path_component(OBJECT(base)));
+    args.thread_name = thread_name;
+
     /* This assumes we are called from a thread with useful CPU affinity for us
      * to inherit.
      */
-    qemu_thread_create(&iothread->thread, thread_name, iothread_run,
-                       iothread, QEMU_THREAD_JOINABLE);
+    qemu_thread_create(&iothread->thread, thread_name, iothread_run, &args,
+                       QEMU_THREAD_JOINABLE);
 
     /* Wait for initialization to complete */
     while (iothread->thread_id == -1) {
-        qemu_sem_wait(&iothread->init_done_sem);
+        qemu_sem_wait(&args.init_done_sem);
     }
+
+    qemu_sem_destroy(&args.init_done_sem);
+    error_propagate(errp, args.error);
 }
 
 typedef struct {
-- 
2.50.1


* [RFC 2/3] aio-posix: enable IORING_SETUP_SINGLE_ISSUER
  2025-07-24 20:46 [RFC 0/3] aio-posix: enable io_uring SINGLE_ISSUER and TASKRUN flags Stefan Hajnoczi
  2025-07-24 20:47 ` [RFC 1/3] iothread: create AioContext in iothread_run() Stefan Hajnoczi
@ 2025-07-24 20:47 ` Stefan Hajnoczi
  2025-07-24 20:47 ` [RFC 3/3] aio-posix: enable IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG Stefan Hajnoczi
  2 siblings, 0 replies; 4+ messages in thread
From: Stefan Hajnoczi @ 2025-07-24 20:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: Stefan Hajnoczi, Brian Song, qemu-block, Bernd Schubert,
	Kevin Wolf, h0lyalg0rithm, Fam Zheng

IORING_SETUP_SINGLE_ISSUER enables optimizations in the host Linux
kernel's io_uring code when the io_uring context is only used from a
single thread. This is true in QEMU because io_uring SQEs are submitted
from the same thread that processes the CQEs.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 util/fdmon-io_uring.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
index 28b93c8ab9..4798439097 100644
--- a/util/fdmon-io_uring.c
+++ b/util/fdmon-io_uring.c
@@ -456,13 +456,30 @@ static const FDMonOps fdmon_io_uring_ops = {
     .add_sqe = fdmon_io_uring_add_sqe,
 };
 
+static inline bool is_creating_iothread(void)
+{
+    return qemu_get_thread_id() != getpid();
+}
+
 void fdmon_io_uring_setup(AioContext *ctx, Error **errp)
 {
     int ret;
+    /* TODO only enable these flags if they are available in the host's kernel headers */
+    unsigned flags = 0;
 
     ctx->io_uring_fd_tag = NULL;
 
-    ret = io_uring_queue_init(FDMON_IO_URING_ENTRIES, &ctx->fdmon_io_uring, 0);
+    /*
+     * The main thread's AioContexts are created from the main loop thread but
+     * may be accessed from multiple threads (e.g. vCPUs or the migration
+     * thread). IOThread AioContexts are only accessed from the IOThread
+     * itself.
+     */
+    if (is_creating_iothread()) {
+        flags = IORING_SETUP_SINGLE_ISSUER;
+    }
+
+    ret = io_uring_queue_init(FDMON_IO_URING_ENTRIES, &ctx->fdmon_io_uring, flags);
     if (ret != 0) {
         error_setg_errno(errp, -ret, "Failed to initialize io_uring");
         return;
-- 
2.50.1


* [RFC 3/3] aio-posix: enable IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG
  2025-07-24 20:46 [RFC 0/3] aio-posix: enable io_uring SINGLE_ISSUER and TASKRUN flags Stefan Hajnoczi
  2025-07-24 20:47 ` [RFC 1/3] iothread: create AioContext in iothread_run() Stefan Hajnoczi
  2025-07-24 20:47 ` [RFC 2/3] aio-posix: enable IORING_SETUP_SINGLE_ISSUER Stefan Hajnoczi
@ 2025-07-24 20:47 ` Stefan Hajnoczi
  2 siblings, 0 replies; 4+ messages in thread
From: Stefan Hajnoczi @ 2025-07-24 20:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: Stefan Hajnoczi, Brian Song, qemu-block, Bernd Schubert,
	Kevin Wolf, h0lyalg0rithm, Fam Zheng

The IORING_SETUP_COOP_TASKRUN flag reduces interprocessor interrupts
when an io_uring event occurs on a different CPU. The idea is that the
QEMU thread will wait for a CQE anyway, so there is no need to interrupt
the CPU that it is on.

The IORING_SETUP_TASKRUN_FLAG ensures that QEMU's io_uring CQ ring
polling still works with COOP_TASKRUN. The kernel will set a flag in the
SQ ring (this is not a typo, the flag is located in the SQ ring even
though it pertains to the CQ ring) that can be polled from userspace.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 util/fdmon-io_uring.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
index 4798439097..649dc18907 100644
--- a/util/fdmon-io_uring.c
+++ b/util/fdmon-io_uring.c
@@ -428,13 +428,16 @@ static int fdmon_io_uring_wait(AioContext *ctx, AioHandlerList *ready_list,
 
 static bool fdmon_io_uring_need_wait(AioContext *ctx)
 {
+    struct io_uring *ring = &ctx->fdmon_io_uring;
+
     /* Have io_uring events completed? */
-    if (io_uring_cq_ready(&ctx->fdmon_io_uring)) {
+    if (io_uring_cq_ready(ring) ||
+        IO_URING_READ_ONCE(*ring->sq.kflags) & IORING_SQ_TASKRUN) {
         return true;
     }
 
     /* Are there pending sqes to submit? */
-    if (io_uring_sq_ready(&ctx->fdmon_io_uring)) {
+    if (io_uring_sq_ready(ring)) {
         return true;
     }
 
@@ -465,7 +468,7 @@ void fdmon_io_uring_setup(AioContext *ctx, Error **errp)
 {
     int ret;
     /* TODO only enable these flags if they are available in the host's kernel headers */
-    unsigned flags = 0;
+    unsigned flags = IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG;
 
     ctx->io_uring_fd_tag = NULL;
 
-- 
2.50.1

