* [PATCH v2 0/4] aio-posix: enable io_uring SINGLE_ISSUER, TASKRUN, and NO_SQARRAY flags
@ 2026-02-25 8:13 Stefan Hajnoczi
2026-02-25 8:13 ` [PATCH v2 1/4] iothread: create AioContext in iothread_run() Stefan Hajnoczi
` (3 more replies)
0 siblings, 4 replies; 6+ messages in thread
From: Stefan Hajnoczi @ 2026-02-25 8:13 UTC (permalink / raw)
To: qemu-devel; +Cc: Stefan Hajnoczi, qemu-block, Kevin Wolf, Fam Zheng, Jens Axboe
These patches enable Linux io_uring flags that can improve performance.
Bernd Schubert mentioned the following io_uring_setup(2) flags:
- IORING_SETUP_SINGLE_ISSUER: optimization when only 1 thread uses an io_uring context
- IORING_SETUP_COOP_TASKRUN: avoids IPIs
- IORING_SETUP_TASKRUN_FLAG: makes COOP_TASKRUN work with userspace CQ ring polling
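For illustration only (not part of the series): a minimal sketch of how the three flags above might be combined for one ring. The helper name choose_setup_flags() and the SKETCH_* macros are hypothetical; the macros copy the bit positions from the kernel's <linux/io_uring.h> so that the sketch builds without that header.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Local copies of the kernel's bit values so the sketch builds without
 * <linux/io_uring.h>. Real code should use the header's IORING_SETUP_*
 * definitions instead.
 */
#define SKETCH_SETUP_COOP_TASKRUN   (1U << 8)
#define SKETCH_SETUP_TASKRUN_FLAG   (1U << 9)
#define SKETCH_SETUP_SINGLE_ISSUER  (1U << 12)

/* Hypothetical helper: choose io_uring_setup(2) flags for one ring. */
unsigned choose_setup_flags(bool ring_is_thread_private)
{
    unsigned flags = SKETCH_SETUP_COOP_TASKRUN | SKETCH_SETUP_TASKRUN_FLAG;

    /* SINGLE_ISSUER is only safe when exactly one thread uses the ring. */
    if (ring_is_thread_private) {
        flags |= SKETCH_SETUP_SINGLE_ISSUER;
    }

    /* TASKRUN_FLAG is only meaningful together with COOP_TASKRUN. */
    assert(!(flags & SKETCH_SETUP_TASKRUN_FLAG) ||
           (flags & SKETCH_SETUP_COOP_TASKRUN));
    return flags;
}
```

The result would be passed as the flags argument of io_uring_queue_init(). A ring shared by several threads gets the two TASKRUN flags only.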
Jens Axboe recently confirmed that SINGLE_ISSUER makes sense.
Suraj Shirvankar already started work on SINGLE_ISSUER in the past:
https://lore.kernel.org/qemu-devel/174293621917.22751.11381319865102029969-0@git.sr.ht/
Where this differs from Suraj's previous work is that I have worked around the
need for the main loop AioContext to be shared by multiple threads (vCPU
threads and the migration thread).
Here are the performance numbers for fio bs=4k in a 4 vCPU guest with 1
IOThread using a virtio-blk disk backed by a local NVMe drive:
                      IOPS               IOPS               IOPS
Benchmark             SINGLE_ISSUER      +TASKRUN           +NO_SQARRAY
randread  iodepth=1    99108 (+0.33%)    100816 (+2.1%)     104411 (+5.7%)
randread  iodepth=64  276314 (+0.12%)    275939 (-0.012%)   275899 (-0.026%)
randwrite iodepth=1    99997 (-0.11%)    102866 (+2.8%)     105588 (+5.5%)
randwrite iodepth=64  272205 (-0.2%)     271973 (-0.29%)    273257 (+0.18%)
You can find detailed benchmarking results here including the fio
output, fio command-line, and guest libvirt domain XML:
https://gitlab.com/stefanha/virt-playbooks/-/tree/io_uring-flags/notebook/fio-output
https://gitlab.com/stefanha/virt-playbooks/-/blob/io_uring-flags/files/fio.sh
https://gitlab.com/stefanha/virt-playbooks/-/blob/io_uring-flags/files/test.xml.j2
Stefan Hajnoczi (4):
iothread: create AioContext in iothread_run()
aio-posix: enable IORING_SETUP_SINGLE_ISSUER
aio-posix: enable IORING_SETUP_COOP_TASKRUN |
IORING_SETUP_TASKRUN_FLAG
aio-posix: enable IORING_SETUP_NO_SQARRAY
include/system/iothread.h | 1 -
iothread.c | 140 +++++++++++++++++++++-----------------
util/fdmon-io_uring.c | 38 ++++++++++-
3 files changed, 113 insertions(+), 66 deletions(-)
--
2.53.0
* [PATCH v2 1/4] iothread: create AioContext in iothread_run()
From: Stefan Hajnoczi @ 2026-02-25 8:13 UTC (permalink / raw)
To: qemu-devel; +Cc: Stefan Hajnoczi, qemu-block, Kevin Wolf, Fam Zheng, Jens Axboe
The IOThread's AioContext is currently created in iothread_init() where
it's easy to propagate errors before spawning the thread that runs
iothread_run(). However, this means that aio_context_new() is called
from the main loop thread rather than from the IOThread.
In order to use Linux io_uring's IORING_SETUP_SINGLE_ISSUER feature in
the next commit, only one thread can use the io_uring context and
therefore iothread.c must call aio_context_new() from iothread_run()
instead of iothread_init().
Extract the iothread_run() arguments into an IOThreadRunArgs struct
where an Error *error field can be used to report back initialization
errors. This works pretty well thanks to the init_done_sem semaphore
that is already used by iothread_init() to wait for iothread_run() to
initialize.
Move iothread_run() further down for proximity with iothread_init() and
to avoid adding a function prototype for
iothread_set_aio_context_params().
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
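For illustration only (not part of the patch): a minimal sketch of the start-up handshake described above, written with plain pthreads and POSIX semaphores instead of QEMU's wrappers. WorkerArgs, worker_run() and worker_start() are hypothetical names, and the failure is simulated. The sketch assumes the worker posts the semaphore on both the success and the failure path, so the creating thread always wakes up and can read the error.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdbool.h>

/* Hypothetical stand-in for IOThreadRunArgs; lives on the creator's stack. */
typedef struct {
    bool simulate_failure;  /* input: pretend per-thread setup failed */
    sem_t init_done;        /* posted once, on success and on failure */
    int error;              /* 0 or an errno value, written before the post */
} WorkerArgs;

static void *worker_run(void *opaque)
{
    WorkerArgs *args = opaque;

    /* Per-thread setup runs here, in the thread that will own the resource. */
    args->error = args->simulate_failure ? ENOMEM : 0;

    /*
     * Post on both paths so the creator never blocks forever. args lives
     * on the creator's stack and must not be touched after this call.
     */
    sem_post(&args->init_done);
    return NULL; /* a real worker would enter its event loop on success */
}

/* Returns 0 on success or an errno value; joins the thread on failure. */
int worker_start(pthread_t *thread, bool simulate_failure)
{
    WorkerArgs args = { .simulate_failure = simulate_failure };
    int ret;

    ret = sem_init(&args.init_done, 0, 0);
    assert(ret == 0);

    ret = pthread_create(thread, NULL, worker_run, &args);
    if (ret != 0) {
        sem_destroy(&args.init_done);
        return ret;
    }

    while (sem_wait(&args.init_done) == -1 && errno == EINTR) {
        /* retry if interrupted by a signal */
    }

    if (args.error != 0) {
        pthread_join(*thread, NULL);
    }
    sem_destroy(&args.init_done);
    return args.error;
}
```

On success the caller owns the running thread and joins it later. On failure the thread has already been joined and only the error code is returned.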
include/system/iothread.h | 1 -
iothread.c | 140 +++++++++++++++++++++-----------------
2 files changed, 78 insertions(+), 63 deletions(-)
diff --git a/include/system/iothread.h b/include/system/iothread.h
index e26d13c6c7..9c70c7fa0b 100644
--- a/include/system/iothread.h
+++ b/include/system/iothread.h
@@ -29,7 +29,6 @@ struct IOThread {
bool run_gcontext; /* whether we should run gcontext */
GMainContext *worker_context;
GMainLoop *main_loop;
- QemuSemaphore init_done_sem; /* is thread init done? */
bool stopping; /* has iothread_stop() been called? */
bool running; /* should iothread_run() continue? */
int thread_id;
diff --git a/iothread.c b/iothread.c
index caf68e0764..6e79c02d02 100644
--- a/iothread.c
+++ b/iothread.c
@@ -36,46 +36,6 @@
#define IOTHREAD_POLL_MAX_NS_DEFAULT 0ULL
#endif
-static void *iothread_run(void *opaque)
-{
- IOThread *iothread = opaque;
-
- rcu_register_thread();
- /*
- * g_main_context_push_thread_default() must be called before anything
- * in this new thread uses glib.
- */
- g_main_context_push_thread_default(iothread->worker_context);
- qemu_set_current_aio_context(iothread->ctx);
- iothread->thread_id = qemu_get_thread_id();
- qemu_sem_post(&iothread->init_done_sem);
-
- while (iothread->running) {
- /*
- * Note: from functional-wise the g_main_loop_run() below can
- * already cover the aio_poll() events, but we can't run the
- * main loop unconditionally because explicit aio_poll() here
- * is faster than g_main_loop_run() when we do not need the
- * gcontext at all (e.g., pure block layer iothreads). In
- * other words, when we want to run the gcontext with the
- * iothread we need to pay some performance for functionality.
- */
- aio_poll(iothread->ctx, true);
-
- /*
- * We must check the running state again in case it was
- * changed in previous aio_poll()
- */
- if (iothread->running && qatomic_read(&iothread->run_gcontext)) {
- g_main_loop_run(iothread->main_loop);
- }
- }
-
- g_main_context_pop_thread_default(iothread->worker_context);
- rcu_unregister_thread();
- return NULL;
-}
-
/* Runs in iothread_run() thread */
static void iothread_stop_bh(void *opaque)
{
@@ -104,7 +64,6 @@ static void iothread_instance_init(Object *obj)
iothread->poll_max_ns = IOTHREAD_POLL_MAX_NS_DEFAULT;
iothread->thread_id = -1;
- qemu_sem_init(&iothread->init_done_sem, 0);
/* By default, we don't run gcontext */
qatomic_set(&iothread->run_gcontext, 0);
}
@@ -135,7 +94,6 @@ static void iothread_instance_finalize(Object *obj)
g_main_loop_unref(iothread->main_loop);
iothread->main_loop = NULL;
}
- qemu_sem_destroy(&iothread->init_done_sem);
}
static void iothread_init_gcontext(IOThread *iothread, const char *thread_name)
@@ -176,47 +134,105 @@ static void iothread_set_aio_context_params(EventLoopBase *base, Error **errp)
base->thread_pool_max, errp);
}
+typedef struct {
+ IOThread *iothread;
+ const char *thread_name;
+ QemuSemaphore init_done_sem; /* is thread init done? */
+ Error *error; /* filled in before init_done_sem is posted */
+} IOThreadRunArgs;
-static void iothread_init(EventLoopBase *base, Error **errp)
+static void *iothread_run(void *opaque)
{
- Error *local_error = NULL;
- IOThread *iothread = IOTHREAD(base);
- g_autofree char *thread_name = NULL;
+ IOThreadRunArgs *args = opaque;
+ IOThread *iothread = args->iothread;
- iothread->stopping = false;
- iothread->running = true;
- iothread->ctx = aio_context_new(errp);
+ rcu_register_thread();
+
+ iothread->ctx = aio_context_new(&args->error);
if (!iothread->ctx) {
- return;
+ goto out;
}
- thread_name = g_strdup_printf("IO %s",
- object_get_canonical_path_component(OBJECT(base)));
+ iothread_set_aio_context_params(EVENT_LOOP_BASE(iothread), &args->error);
+ if (args->error) {
+ aio_context_unref(iothread->ctx);
+ iothread->ctx = NULL;
+ goto out;
+ }
/*
* Init one GMainContext for the iothread unconditionally, even if
* it's not used
*/
- iothread_init_gcontext(iothread, thread_name);
+ iothread_init_gcontext(iothread, args->thread_name);
- iothread_set_aio_context_params(base, &local_error);
- if (local_error) {
- error_propagate(errp, local_error);
- aio_context_unref(iothread->ctx);
- iothread->ctx = NULL;
- return;
+ /*
+ * g_main_context_push_thread_default() must be called before anything
+ * in this new thread uses glib.
+ */
+ g_main_context_push_thread_default(iothread->worker_context);
+ qemu_set_current_aio_context(iothread->ctx);
+
+ iothread->stopping = false;
+ iothread->running = true;
+
+ iothread->thread_id = qemu_get_thread_id();
+ qemu_sem_post(&args->init_done_sem);
+
+ while (iothread->running) {
+ /*
+ * Note: from functional-wise the g_main_loop_run() below can
+ * already cover the aio_poll() events, but we can't run the
+ * main loop unconditionally because explicit aio_poll() here
+ * is faster than g_main_loop_run() when we do not need the
+ * gcontext at all (e.g., pure block layer iothreads). In
+ * other words, when we want to run the gcontext with the
+ * iothread we need to pay some performance for functionality.
+ */
+ aio_poll(iothread->ctx, true);
+
+ /*
+ * We must check the running state again in case it was
+ * changed in previous aio_poll()
+ */
+ if (iothread->running && qatomic_read(&iothread->run_gcontext)) {
+ g_main_loop_run(iothread->main_loop);
+ }
}
+ g_main_context_pop_thread_default(iothread->worker_context);
+out:
+ rcu_unregister_thread();
+ return NULL;
+}
+
+static void iothread_init(EventLoopBase *base, Error **errp)
+{
+ IOThread *iothread = IOTHREAD(base);
+ g_autofree char *thread_name = NULL;
+ IOThreadRunArgs args = {
+ .iothread = iothread,
+ };
+
+ qemu_sem_init(&args.init_done_sem, 0);
+
+ thread_name = g_strdup_printf("IO %s",
+ object_get_canonical_path_component(OBJECT(base)));
+ args.thread_name = thread_name;
+
/* This assumes we are called from a thread with useful CPU affinity for us
* to inherit.
*/
- qemu_thread_create(&iothread->thread, thread_name, iothread_run,
- iothread, QEMU_THREAD_JOINABLE);
+ qemu_thread_create(&iothread->thread, thread_name, iothread_run, &args,
+ QEMU_THREAD_JOINABLE);
/* Wait for initialization to complete */
while (iothread->thread_id == -1) {
- qemu_sem_wait(&iothread->init_done_sem);
+ qemu_sem_wait(&args.init_done_sem);
}
+
+ qemu_sem_destroy(&args.init_done_sem);
+ error_propagate(errp, args.error);
}
typedef struct {
--
2.53.0
* [PATCH v2 2/4] aio-posix: enable IORING_SETUP_SINGLE_ISSUER
From: Stefan Hajnoczi @ 2026-02-25 8:13 UTC (permalink / raw)
To: qemu-devel; +Cc: Stefan Hajnoczi, qemu-block, Kevin Wolf, Fam Zheng, Jens Axboe
IORING_SETUP_SINGLE_ISSUER enables optimizations in the host Linux
kernel's io_uring code when the io_uring context is only used from a
single thread. This is true in QEMU because io_uring SQEs are submitted
from the same thread that processes the CQEs.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
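For illustration only (not part of the patch): a Linux-only sketch of the thread-ID comparison that is_creating_iothread() in the diff below relies on. On Linux the initial thread's kernel thread ID equals the process ID, so any other thread sees a different value. The helper names are hypothetical, and the sketch calls gettid through syscall(2) directly instead of using qemu_get_thread_id().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* True when the caller is the process's initial thread (Linux only). */
bool is_main_thread(void)
{
    /* On Linux the initial thread's kernel thread ID equals the PID. */
    return (pid_t)syscall(SYS_gettid) == getpid();
}

static void *probe(void *opaque)
{
    *(bool *)opaque = is_main_thread();
    return NULL;
}

/* Returns what is_main_thread() reports when called from a new thread. */
bool is_main_thread_from_new_thread(void)
{
    pthread_t thread;
    bool result = true;
    int ret = pthread_create(&thread, NULL, probe, &result);

    assert(ret == 0);
    pthread_join(thread, NULL);
    return result;
}
```

Called from the initial thread, is_main_thread() returns true, while is_main_thread_from_new_thread() returns false.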
util/fdmon-io_uring.c | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)
diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
index d0b56127c6..ec056b4818 100644
--- a/util/fdmon-io_uring.c
+++ b/util/fdmon-io_uring.c
@@ -452,13 +452,31 @@ static const FDMonOps fdmon_io_uring_ops = {
.add_sqe = fdmon_io_uring_add_sqe,
};
+static inline bool is_creating_iothread(void)
+{
+ return qemu_get_thread_id() != getpid();
+}
+
bool fdmon_io_uring_setup(AioContext *ctx, Error **errp)
{
+ unsigned flags = 0;
int ret;
+ /*
+ * The main thread's AioContexts are created from the main loop thread but
+ * may be accessed from multiple threads (e.g. vCPUs or the migration
+ * thread). IOThread AioContexts are only accessed from the IOThread
+ * itself.
+ */
+#ifdef IORING_SETUP_SINGLE_ISSUER
+ if (is_creating_iothread()) {
+ flags |= IORING_SETUP_SINGLE_ISSUER;
+ }
+#endif
+
ctx->io_uring_fd_tag = NULL;
- ret = io_uring_queue_init(FDMON_IO_URING_ENTRIES, &ctx->fdmon_io_uring, 0);
+ ret = io_uring_queue_init(FDMON_IO_URING_ENTRIES, &ctx->fdmon_io_uring, flags);
if (ret != 0) {
error_setg_errno(errp, -ret, "Failed to initialize io_uring");
return false;
--
2.53.0
* [PATCH v2 3/4] aio-posix: enable IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG
From: Stefan Hajnoczi @ 2026-02-25 8:13 UTC (permalink / raw)
To: qemu-devel; +Cc: Stefan Hajnoczi, qemu-block, Kevin Wolf, Fam Zheng, Jens Axboe
The IORING_SETUP_COOP_TASKRUN flag reduces interprocessor interrupts
when an io_uring event occurs on a different CPU. The idea is that the
QEMU thread will wait for a CQE anyway, so there is no need to interrupt
the CPU that it is on.
The IORING_SETUP_TASKRUN_FLAG ensures that QEMU's io_uring CQ ring
polling still works with COOP_TASKRUN. The kernel will set a flag in the
SQ ring (this is not a typo, the flag is located in the SQ ring even
though it pertains to the CQ ring) that can be polled from userspace.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
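For illustration only (not part of the patch): a sketch of the check described above, using a simplified mock of the shared ring fields rather than liburing's struct io_uring. struct mock_ring and ring_has_completion_work() are hypothetical names. SKETCH_SQ_TASKRUN copies the bit position of the kernel's IORING_SQ_TASKRUN.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Same bit as the kernel's IORING_SQ_TASKRUN; copied for the sketch. */
#define SKETCH_SQ_TASKRUN (1U << 2)

/* Hypothetical, simplified stand-in for the shared ring fields. */
struct mock_ring {
    _Atomic unsigned cq_head;   /* advanced by userspace */
    _Atomic unsigned cq_tail;   /* advanced by the kernel */
    _Atomic unsigned sq_flags;  /* kernel-written SQ ring flags */
};

/* True if CQEs are ready or the kernel has deferred task work pending. */
bool ring_has_completion_work(struct mock_ring *ring)
{
    unsigned tail, head, flags;

    assert(ring != NULL);

    tail = atomic_load_explicit(&ring->cq_tail, memory_order_acquire);
    head = atomic_load_explicit(&ring->cq_head, memory_order_relaxed);
    if (tail != head) {
        return true; /* completions are already visible in the CQ ring */
    }

    /*
     * With COOP_TASKRUN the CQ ring can look empty while completions are
     * waiting for the task to enter the kernel. The flag in the SQ ring
     * is the only userspace-visible hint in that case.
     */
    flags = atomic_load_explicit(&ring->sq_flags, memory_order_relaxed);
    return (flags & SKETCH_SQ_TASKRUN) != 0;
}
```

An empty ring with the flag clear reports no work. Setting the flag or advancing the tail makes the function return true.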
util/fdmon-io_uring.c | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)
diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
index ec056b4818..2e2c0e6785 100644
--- a/util/fdmon-io_uring.c
+++ b/util/fdmon-io_uring.c
@@ -423,13 +423,16 @@ static int fdmon_io_uring_wait(AioContext *ctx, AioHandlerList *ready_list,
static bool fdmon_io_uring_need_wait(AioContext *ctx)
{
+ struct io_uring *ring = &ctx->fdmon_io_uring;
+
/* Have io_uring events completed? */
- if (io_uring_cq_ready(&ctx->fdmon_io_uring)) {
+ if (io_uring_cq_ready(ring) ||
+ IO_URING_READ_ONCE(*ring->sq.kflags) & IORING_SQ_TASKRUN) {
return true;
}
/* Are there pending sqes to submit? */
- if (io_uring_sq_ready(&ctx->fdmon_io_uring)) {
+ if (io_uring_sq_ready(ring)) {
return true;
}
@@ -459,7 +462,15 @@ static inline bool is_creating_iothread(void)
bool fdmon_io_uring_setup(AioContext *ctx, Error **errp)
{
- unsigned flags = 0;
+ /* Enable modern flags supported by the host kernel */
+ unsigned flags =
+#ifdef IORING_SETUP_COOP_TASKRUN
+ IORING_SETUP_COOP_TASKRUN |
+#endif
+#ifdef IORING_SETUP_TASKRUN_FLAG
+ IORING_SETUP_TASKRUN_FLAG |
+#endif
+ 0;
int ret;
/*
--
2.53.0
* [PATCH v2 4/4] aio-posix: enable IORING_SETUP_NO_SQARRAY
From: Stefan Hajnoczi @ 2026-02-25 8:13 UTC (permalink / raw)
To: qemu-devel; +Cc: Stefan Hajnoczi, qemu-block, Kevin Wolf, Fam Zheng, Jens Axboe
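IORING_SETUP_NO_SQARRAY removes the SQ index array, so the kernel
consumes SQEs in ring order without an extra indirection. This
simplifies SQ ring processing and saves CPU cycles. A big improvement
is not expected, but it doesn't hurt to enable this flag.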
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
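For illustration only (not part of the patch): a sketch of the lookup that the SQ array adds and that NO_SQARRAY avoids. next_sqe_index() is a hypothetical helper, not a kernel or liburing function. It assumes the ring size is a power of two, and that a NULL sq_array stands for a ring created with NO_SQARRAY.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the index of the SQE to consume for a given SQ head position.
 * sq_array == NULL models a ring set up with NO_SQARRAY.
 */
unsigned next_sqe_index(const unsigned *sq_array, unsigned head,
                        unsigned ring_mask)
{
    unsigned slot = head & ring_mask;

    /* ring_mask is entries - 1, and entries is a power of two. */
    assert((ring_mask & (ring_mask + 1)) == 0);

    /* With the array: one extra load. Without it: the slot is the index. */
    return sq_array ? sq_array[slot] : slot;
}
```

With an index array of {2, 0, 3, 1}, head 5 and mask 3, the array lookup selects SQE 0, while the NO_SQARRAY case selects SQE 1 directly.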
util/fdmon-io_uring.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
index 2e2c0e6785..99c4932937 100644
--- a/util/fdmon-io_uring.c
+++ b/util/fdmon-io_uring.c
@@ -469,6 +469,9 @@ bool fdmon_io_uring_setup(AioContext *ctx, Error **errp)
#endif
#ifdef IORING_SETUP_TASKRUN_FLAG
IORING_SETUP_TASKRUN_FLAG |
+#endif
+#ifdef IORING_SETUP_NO_SQARRAY
+ IORING_SETUP_NO_SQARRAY |
#endif
0;
int ret;
--
2.53.0
* Re: [PATCH v2 3/4] aio-posix: enable IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG
From: Stefan Hajnoczi @ 2026-02-25 9:33 UTC (permalink / raw)
To: Jens Axboe; +Cc: qemu-block, Kevin Wolf, Fam Zheng, qemu-devel
On Wed, Feb 25, 2026 at 04:13:35PM +0800, Stefan Hajnoczi wrote:
> The IORING_SETUP_COOP_TASKRUN flag reduces interprocessor interrupts
> when an io_uring event occurs on a different CPU. The idea is that the
> QEMU thread will wait for a CQE anyway, so there is no need to interrupt
> the CPU that it is on.
>
> The IORING_SETUP_TASKRUN_FLAG ensures that QEMU's io_uring CQ ring
> polling still works with COOP_TASKRUN. The kernel will set a flag in the
> SQ ring (this is not a typo, the flag is located in the SQ ring even
> though it pertains to the CQ ring) that can be polled from userspace.
>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
> util/fdmon-io_uring.c | 17 ++++++++++++++---
> 1 file changed, 14 insertions(+), 3 deletions(-)
Hi Jens,
I noticed liburing's io_uring_cq_ready() does not check the
IORING_SQ_TASKRUN flag. Maybe QEMU's fdmon_io_uring_gsource_check()
needs to check it here so that io_uring_enter(2) will be called with
IORING_ENTER_GETEVENTS in the glib event loop?
(This is a similar idea to your recent patch but needed when
IORING_SETUP_TASKRUN_FLAG is enabled.)
I tried to benchmark this but couldn't observe a difference in IOPS:
diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
index 652d269e03..ef4257924b 100644
--- a/util/fdmon-io_uring.c
+++ b/util/fdmon-io_uring.c
@@ -356,7 +356,8 @@ static bool fdmon_io_uring_gsource_check(AioContext *ctx)
* the main loop can miss completions and sleep in ppoll() until the
* next timer fires.
*/
- return io_uring_cq_ready(&ctx->fdmon_io_uring);
+ return io_uring_cq_ready(&ctx->fdmon_io_uring) ||
+ (IO_URING_READ_ONCE(*ctx->fdmon_io_uring.sq.kflags) & IORING_SQ_TASKRUN);
}
/* Dispatch CQE handlers that are ready */