* [RFC 0/3] aio-posix: enable io_uring SINGLE_ISSUER and TASKRUN flags
@ 2025-07-24 20:46 Stefan Hajnoczi
2025-07-24 20:47 ` [RFC 1/3] iothread: create AioContext in iothread_run() Stefan Hajnoczi
` (2 more replies)
0 siblings, 3 replies; 4+ messages in thread
From: Stefan Hajnoczi @ 2025-07-24 20:46 UTC (permalink / raw)
To: qemu-devel
Cc: Stefan Hajnoczi, Brian Song, qemu-block, Bernd Schubert,
Kevin Wolf, h0lyalg0rithm, Fam Zheng
Do not merge this series. The performance effects are not significant. I am
sharing it mainly to archive the patches and in case someone has ideas on how
to improve it.
Bernd Schubert mentioned io_uring_setup(2) flags that may improve performance:
- IORING_SETUP_SINGLE_ISSUER: an optimization for when only one thread uses an io_uring context
- IORING_SETUP_COOP_TASKRUN: avoids interprocessor interrupts (IPIs)
- IORING_SETUP_TASKRUN_FLAG: makes COOP_TASKRUN work with userspace CQ ring polling
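A minimal sketch (not from this series) of combining these flags at ring setup
time with liburing; the setup_ring() helper is illustrative, and the fallback
relies on io_uring_setup(2) returning -EINVAL for setup flags the running
kernel does not know:

    #include <errno.h>
    #include <liburing.h>

    /* Request the optimization flags, but retry with a plain ring if the
     * kernel predates them (COOP_TASKRUN/TASKRUN_FLAG: 5.19,
     * SINGLE_ISSUER: 6.0). */
    static int setup_ring(struct io_uring *ring, unsigned entries)
    {
        unsigned flags = IORING_SETUP_SINGLE_ISSUER |
                         IORING_SETUP_COOP_TASKRUN |
                         IORING_SETUP_TASKRUN_FLAG;
        int ret = io_uring_queue_init(entries, ring, flags);

        if (ret == -EINVAL) {
            ret = io_uring_queue_init(entries, ring, 0);
        }
        return ret;
    }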
Suraj Shirvankar previously started work on SINGLE_ISSUER:
https://lore.kernel.org/qemu-devel/174293621917.22751.11381319865102029969-0@git.sr.ht/
This series differs from Suraj's earlier work in that it works around the main
loop AioContext being shared by multiple threads (vCPU threads and the
migration thread), so SINGLE_ISSUER is only enabled for IOThread AioContexts.
Here are the performance numbers for fio bs=4k in a 4 vCPU guest with 1
IOThread using a virtio-blk disk backed by a local NVMe drive:
                          IOPS                 IOPS
Benchmark                 SINGLE_ISSUER        SINGLE_ISSUER|COOP_TASKRUN|TASKRUN_FLAG
randread iodepth=1        54,045  (+1.2%)      54,189  (+1.5%)
randread iodepth=64       318,135 (+0.1%)      315,632 (-0.68%)
randwrite iodepth=1       141,918 (-0.44%)     143,337 (+0.55%)
randwrite iodepth=64      323,948 (-0.015%)    322,755 (-0.38%)
You can find detailed benchmarking results here, including the fio
output, the fio command line, and the guest libvirt domain XML:
https://gitlab.com/stefanha/virt-playbooks/-/tree/io_uring-flags/notebook/fio-output
https://gitlab.com/stefanha/virt-playbooks/-/blob/io_uring-flags/files/fio.sh
https://gitlab.com/stefanha/virt-playbooks/-/blob/io_uring-flags/files/test.xml.j2
Stefan Hajnoczi (3):
iothread: create AioContext in iothread_run()
aio-posix: enable IORING_SETUP_SINGLE_ISSUER
aio-posix: enable IORING_SETUP_COOP_TASKRUN |
IORING_SETUP_TASKRUN_FLAG
include/system/iothread.h | 1 -
iothread.c | 140 +++++++++++++++++++++-----------------
util/fdmon-io_uring.c | 26 ++++++-
3 files changed, 101 insertions(+), 66 deletions(-)
--
2.50.1
* [RFC 1/3] iothread: create AioContext in iothread_run()
2025-07-24 20:46 [RFC 0/3] aio-posix: enable io_uring SINGLE_ISSUER and TASKRUN flags Stefan Hajnoczi
@ 2025-07-24 20:47 ` Stefan Hajnoczi
2025-07-24 20:47 ` [RFC 2/3] aio-posix: enable IORING_SETUP_SINGLE_ISSUER Stefan Hajnoczi
2025-07-24 20:47 ` [RFC 3/3] aio-posix: enable IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG Stefan Hajnoczi
2 siblings, 0 replies; 4+ messages in thread
From: Stefan Hajnoczi @ 2025-07-24 20:47 UTC (permalink / raw)
To: qemu-devel
Cc: Stefan Hajnoczi, Brian Song, qemu-block, Bernd Schubert,
Kevin Wolf, h0lyalg0rithm, Fam Zheng
The IOThread's AioContext is currently created in iothread_init() where
it's easy to propagate errors before spawning the thread that runs
iothread_run(). However, this means that aio_context_new() is called
from the main loop thread rather than from the IOThread.
The next commit will use Linux io_uring's IORING_SETUP_SINGLE_ISSUER
feature, which requires that only one thread use the io_uring context.
Therefore iothread.c must call aio_context_new() from iothread_run()
instead of iothread_init().
Extract the iothread_run() arguments into an IOThreadRunArgs struct
where an Error *error field can be used to report back initialization
errors. This works pretty well thanks to the init_done_sem semaphore
that is already used by iothread_init() to wait for iothread_run() to
initialize.
Move iothread_run() further down for proximity with iothread_init() and
to avoid adding a function prototype for
iothread_set_aio_context_params().
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
include/system/iothread.h | 1 -
iothread.c | 140 +++++++++++++++++++++-----------------
2 files changed, 78 insertions(+), 63 deletions(-)
diff --git a/include/system/iothread.h b/include/system/iothread.h
index d95c17a645..ec4e798d5e 100644
--- a/include/system/iothread.h
+++ b/include/system/iothread.h
@@ -29,7 +29,6 @@ struct IOThread {
bool run_gcontext; /* whether we should run gcontext */
GMainContext *worker_context;
GMainLoop *main_loop;
- QemuSemaphore init_done_sem; /* is thread init done? */
bool stopping; /* has iothread_stop() been called? */
bool running; /* should iothread_run() continue? */
int thread_id;
diff --git a/iothread.c b/iothread.c
index 8810376dce..c6547779d0 100644
--- a/iothread.c
+++ b/iothread.c
@@ -36,46 +36,6 @@
#define IOTHREAD_POLL_MAX_NS_DEFAULT 0ULL
#endif
-static void *iothread_run(void *opaque)
-{
- IOThread *iothread = opaque;
-
- rcu_register_thread();
- /*
- * g_main_context_push_thread_default() must be called before anything
- * in this new thread uses glib.
- */
- g_main_context_push_thread_default(iothread->worker_context);
- qemu_set_current_aio_context(iothread->ctx);
- iothread->thread_id = qemu_get_thread_id();
- qemu_sem_post(&iothread->init_done_sem);
-
- while (iothread->running) {
- /*
- * Note: from functional-wise the g_main_loop_run() below can
- * already cover the aio_poll() events, but we can't run the
- * main loop unconditionally because explicit aio_poll() here
- * is faster than g_main_loop_run() when we do not need the
- * gcontext at all (e.g., pure block layer iothreads). In
- * other words, when we want to run the gcontext with the
- * iothread we need to pay some performance for functionality.
- */
- aio_poll(iothread->ctx, true);
-
- /*
- * We must check the running state again in case it was
- * changed in previous aio_poll()
- */
- if (iothread->running && qatomic_read(&iothread->run_gcontext)) {
- g_main_loop_run(iothread->main_loop);
- }
- }
-
- g_main_context_pop_thread_default(iothread->worker_context);
- rcu_unregister_thread();
- return NULL;
-}
-
/* Runs in iothread_run() thread */
static void iothread_stop_bh(void *opaque)
{
@@ -104,7 +64,6 @@ static void iothread_instance_init(Object *obj)
iothread->poll_max_ns = IOTHREAD_POLL_MAX_NS_DEFAULT;
iothread->thread_id = -1;
- qemu_sem_init(&iothread->init_done_sem, 0);
/* By default, we don't run gcontext */
qatomic_set(&iothread->run_gcontext, 0);
}
@@ -135,7 +94,6 @@ static void iothread_instance_finalize(Object *obj)
g_main_loop_unref(iothread->main_loop);
iothread->main_loop = NULL;
}
- qemu_sem_destroy(&iothread->init_done_sem);
}
static void iothread_init_gcontext(IOThread *iothread, const char *thread_name)
@@ -176,47 +134,105 @@ static void iothread_set_aio_context_params(EventLoopBase *base, Error **errp)
base->thread_pool_max, errp);
}
+typedef struct {
+ IOThread *iothread;
+ const char *thread_name;
+ QemuSemaphore init_done_sem; /* is thread init done? */
+ Error *error; /* filled in before init_done_sem is posted */
+} IOThreadRunArgs;
-static void iothread_init(EventLoopBase *base, Error **errp)
+static void *iothread_run(void *opaque)
{
- Error *local_error = NULL;
- IOThread *iothread = IOTHREAD(base);
- g_autofree char *thread_name = NULL;
+ IOThreadRunArgs *args = opaque;
+ IOThread *iothread = args->iothread;
- iothread->stopping = false;
- iothread->running = true;
- iothread->ctx = aio_context_new(errp);
+ rcu_register_thread();
+
+ iothread->ctx = aio_context_new(&args->error);
if (!iothread->ctx) {
- return;
+ goto out;
}
- thread_name = g_strdup_printf("IO %s",
- object_get_canonical_path_component(OBJECT(base)));
+ iothread_set_aio_context_params(EVENT_LOOP_BASE(iothread), &args->error);
+ if (args->error) {
+ aio_context_unref(iothread->ctx);
+ iothread->ctx = NULL;
+ goto out;
+ }
/*
* Init one GMainContext for the iothread unconditionally, even if
* it's not used
*/
- iothread_init_gcontext(iothread, thread_name);
+ iothread_init_gcontext(iothread, args->thread_name);
- iothread_set_aio_context_params(base, &local_error);
- if (local_error) {
- error_propagate(errp, local_error);
- aio_context_unref(iothread->ctx);
- iothread->ctx = NULL;
- return;
+ /*
+ * g_main_context_push_thread_default() must be called before anything
+ * in this new thread uses glib.
+ */
+ g_main_context_push_thread_default(iothread->worker_context);
+ qemu_set_current_aio_context(iothread->ctx);
+
+ iothread->stopping = false;
+ iothread->running = true;
+
+ iothread->thread_id = qemu_get_thread_id();
+ qemu_sem_post(&args->init_done_sem);
+
+ while (iothread->running) {
+ /*
+ * Note: from functional-wise the g_main_loop_run() below can
+ * already cover the aio_poll() events, but we can't run the
+ * main loop unconditionally because explicit aio_poll() here
+ * is faster than g_main_loop_run() when we do not need the
+ * gcontext at all (e.g., pure block layer iothreads). In
+ * other words, when we want to run the gcontext with the
+ * iothread we need to pay some performance for functionality.
+ */
+ aio_poll(iothread->ctx, true);
+
+ /*
+ * We must check the running state again in case it was
+ * changed in previous aio_poll()
+ */
+ if (iothread->running && qatomic_read(&iothread->run_gcontext)) {
+ g_main_loop_run(iothread->main_loop);
+ }
}
+ g_main_context_pop_thread_default(iothread->worker_context);
+out:
+ rcu_unregister_thread();
+ return NULL;
+}
+
+static void iothread_init(EventLoopBase *base, Error **errp)
+{
+ IOThread *iothread = IOTHREAD(base);
+ g_autofree char *thread_name = NULL;
+ IOThreadRunArgs args = {
+ .iothread = iothread,
+ };
+
+ qemu_sem_init(&args.init_done_sem, 0);
+
+ thread_name = g_strdup_printf("IO %s",
+ object_get_canonical_path_component(OBJECT(base)));
+ args.thread_name = thread_name;
+
/* This assumes we are called from a thread with useful CPU affinity for us
* to inherit.
*/
- qemu_thread_create(&iothread->thread, thread_name, iothread_run,
- iothread, QEMU_THREAD_JOINABLE);
+ qemu_thread_create(&iothread->thread, thread_name, iothread_run, &args,
+ QEMU_THREAD_JOINABLE);
/* Wait for initialization to complete */
while (iothread->thread_id == -1) {
- qemu_sem_wait(&iothread->init_done_sem);
+ qemu_sem_wait(&args.init_done_sem);
}
+
+ qemu_sem_destroy(&args.init_done_sem);
+ error_propagate(errp, args.error);
}
typedef struct {
--
2.50.1
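For readers unfamiliar with the handoff pattern used above, here is its general
shape stripped of QEMU specifics: a sketch using plain pthreads and POSIX
semaphores instead of QEMU's wrappers, with all names illustrative. As in the
patch, the args struct lives on the spawning thread's stack, which is safe
because the spawner waits on the semaphore before the struct goes out of scope.

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    typedef struct {
        sem_t init_done_sem;  /* posted once init has succeeded or failed */
        const char *error;    /* set to NULL on success before the post */
    } RunArgs;

    static void *worker_run(void *opaque)
    {
        RunArgs *args = opaque;

        /* per-thread setup that may fail would go here */
        args->error = NULL;
        sem_post(&args->init_done_sem);
        /*
         * args lives on the spawner's stack and is destroyed after the
         * spawner stops waiting, so do not touch it past this point.
         */

        /* the event loop would run here on success */
        return NULL;
    }

    static int spawn_worker(pthread_t *thread)
    {
        RunArgs args = { .error = "worker did not initialize" };

        sem_init(&args.init_done_sem, 0, 0);
        pthread_create(thread, NULL, worker_run, &args);
        sem_wait(&args.init_done_sem);
        sem_destroy(&args.init_done_sem);

        if (args.error) {
            fprintf(stderr, "worker init failed: %s\n", args.error);
            return -1;
        }
        return 0;
    }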
* [RFC 2/3] aio-posix: enable IORING_SETUP_SINGLE_ISSUER
2025-07-24 20:46 [RFC 0/3] aio-posix: enable io_uring SINGLE_ISSUER and TASKRUN flags Stefan Hajnoczi
2025-07-24 20:47 ` [RFC 1/3] iothread: create AioContext in iothread_run() Stefan Hajnoczi
@ 2025-07-24 20:47 ` Stefan Hajnoczi
2025-07-24 20:47 ` [RFC 3/3] aio-posix: enable IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG Stefan Hajnoczi
2 siblings, 0 replies; 4+ messages in thread
From: Stefan Hajnoczi @ 2025-07-24 20:47 UTC (permalink / raw)
To: qemu-devel
Cc: Stefan Hajnoczi, Brian Song, qemu-block, Bernd Schubert,
Kevin Wolf, h0lyalg0rithm, Fam Zheng
IORING_SETUP_SINGLE_ISSUER enables optimizations in the host Linux
kernel's io_uring code when the io_uring context is only used from a
single thread. This is true in QEMU because io_uring SQEs are submitted
from the same thread that processes the CQEs.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
util/fdmon-io_uring.c | 19 ++++++++++++++++++-
1 file changed, 18 insertions(+), 1 deletion(-)
diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
index 28b93c8ab9..4798439097 100644
--- a/util/fdmon-io_uring.c
+++ b/util/fdmon-io_uring.c
@@ -456,13 +456,30 @@ static const FDMonOps fdmon_io_uring_ops = {
.add_sqe = fdmon_io_uring_add_sqe,
};
+static inline bool is_creating_iothread(void)
+{
+ return qemu_get_thread_id() != getpid();
+}
+
void fdmon_io_uring_setup(AioContext *ctx, Error **errp)
{
int ret;
+ /* TODO only enable these flags if they are available in the host's kernel headers */
+ unsigned flags = 0;
ctx->io_uring_fd_tag = NULL;
- ret = io_uring_queue_init(FDMON_IO_URING_ENTRIES, &ctx->fdmon_io_uring, 0);
+ /*
+ * The main thread's AioContexts are created from the main loop thread but
+ * may be accessed from multiple threads (e.g. vCPUs or the migration
+ * thread). IOThread AioContexts are only accessed from the IOThread
+ * itself.
+ */
+ if (is_creating_iothread()) {
+ flags = IORING_SETUP_SINGLE_ISSUER;
+ }
+
+ ret = io_uring_queue_init(FDMON_IO_URING_ENTRIES, &ctx->fdmon_io_uring, flags);
if (ret != 0) {
error_setg_errno(errp, -ret, "Failed to initialize io_uring");
return;
--
2.50.1
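One detail worth calling out from the is_creating_iothread() helper above: on
Linux, the main thread's thread ID equals the process ID, so comparing the
calling thread's TID with getpid() identifies non-main threads. A standalone
sketch of the same check (assumes glibc 2.30+ for gettid()):

    #define _GNU_SOURCE
    #include <stdbool.h>
    #include <unistd.h>

    /* On Linux the first (main) thread's TID equals the PID; every other
     * thread gets its own TID, so this returns true off the main thread. */
    static bool off_main_thread(void)
    {
        return gettid() != getpid();
    }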
* [RFC 3/3] aio-posix: enable IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG
2025-07-24 20:46 [RFC 0/3] aio-posix: enable io_uring SINGLE_ISSUER and TASKRUN flags Stefan Hajnoczi
2025-07-24 20:47 ` [RFC 1/3] iothread: create AioContext in iothread_run() Stefan Hajnoczi
2025-07-24 20:47 ` [RFC 2/3] aio-posix: enable IORING_SETUP_SINGLE_ISSUER Stefan Hajnoczi
@ 2025-07-24 20:47 ` Stefan Hajnoczi
2 siblings, 0 replies; 4+ messages in thread
From: Stefan Hajnoczi @ 2025-07-24 20:47 UTC (permalink / raw)
To: qemu-devel
Cc: Stefan Hajnoczi, Brian Song, qemu-block, Bernd Schubert,
Kevin Wolf, h0lyalg0rithm, Fam Zheng
The IORING_SETUP_COOP_TASKRUN flag reduces interprocessor interrupts
when an io_uring event occurs on a different CPU. The idea is that the
QEMU thread will wait for a CQE anyway, so there is no need to interrupt
the CPU that it is on.
The IORING_SETUP_TASKRUN_FLAG ensures that QEMU's io_uring CQ ring
polling still works with COOP_TASKRUN. The kernel will set a flag in the
SQ ring (this is not a typo, the flag is located in the SQ ring even
though it pertains to the CQ ring) that can be polled from userspace.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
util/fdmon-io_uring.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
index 4798439097..649dc18907 100644
--- a/util/fdmon-io_uring.c
+++ b/util/fdmon-io_uring.c
@@ -428,13 +428,16 @@ static int fdmon_io_uring_wait(AioContext *ctx, AioHandlerList *ready_list,
static bool fdmon_io_uring_need_wait(AioContext *ctx)
{
+ struct io_uring *ring = &ctx->fdmon_io_uring;
+
/* Have io_uring events completed? */
- if (io_uring_cq_ready(&ctx->fdmon_io_uring)) {
+ if (io_uring_cq_ready(ring) ||
+ IO_URING_READ_ONCE(*ring->sq.kflags) & IORING_SQ_TASKRUN) {
return true;
}
/* Are there pending sqes to submit? */
- if (io_uring_sq_ready(&ctx->fdmon_io_uring)) {
+ if (io_uring_sq_ready(ring)) {
return true;
}
@@ -465,7 +468,7 @@ void fdmon_io_uring_setup(AioContext *ctx, Error **errp)
{
int ret;
/* TODO only enable these flags if they are available in the host's kernel headers */
- unsigned flags = 0;
+ unsigned flags = IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG;
ctx->io_uring_fd_tag = NULL;
--
2.50.1
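To make the polling side concrete, here is a sketch of the readiness check in
the same shape as the fdmon_io_uring_need_wait() change above
(IO_URING_READ_ONCE is liburing's read-once barrier helper):

    #include <liburing.h>
    #include <stdbool.h>

    /* With IORING_SETUP_TASKRUN_FLAG the kernel sets IORING_SQ_TASKRUN in
     * the SQ ring flags word when deferred task work is pending, so a
     * userspace poller must check it alongside io_uring_cq_ready(). */
    static bool cq_work_pending(struct io_uring *ring)
    {
        return io_uring_cq_ready(ring) > 0 ||
               (IO_URING_READ_ONCE(*ring->sq.kflags) & IORING_SQ_TASKRUN);
    }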