* [PATCH 0/3] [RESEND] block: unify block and fdmon io_uring
@ 2025-04-01 14:27 Stefan Hajnoczi
2025-04-01 14:27 ` [PATCH 1/3] aio-posix: treat io_uring setup failure as fatal Stefan Hajnoczi
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: Stefan Hajnoczi @ 2025-04-01 14:27 UTC (permalink / raw)
To: qemu-devel
Cc: Aarushi Mehta, Stefan Hajnoczi, Stefano Garzarella,
surajshirvankar, Hanna Reitz, qemu-block, Kevin Wolf,
Paolo Bonzini, Fam Zheng
For some reason the patch series didn't reach the mailing list. Resending.
Every AioContext has an io_uring context for file descriptor monitoring. In
addition, block/io_uring.c has its own io_uring context for disk I/O. This
patch series eliminates the extra io_uring context so that each AioContext has
a single io_uring context.
Patch 2 introduces the aio_add_sqe() API for submitting custom io_uring SQEs.
Patch 3 then replaces block/io_uring.c's own io_uring context with the
aio_add_sqe() API so that disk I/O requests share the AioContext's io_uring
context.
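Once this is in place, a block layer caller submits its own request on the
AioContext's ring roughly like this (illustrative sketch only; my_prep_sqe()
and my_cqe_handler() are made-up names, the real API and a fuller example are
in patch 2):

  req->cqe_handler.cb = my_cqe_handler;             /* hypothetical handler */
  aio_add_sqe(my_prep_sqe, req, &req->cqe_handler);

where my_prep_sqe() fills in the sqe and my_cqe_handler() runs in the same
AioContext once the cqe arrives.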
I will post performance results when they become available.
Based-on: 20250326183340.1010531-1-stefanha@redhat.com
Stefan Hajnoczi (3):
aio-posix: treat io_uring setup failure as fatal
aio-posix: add aio_add_sqe() API for user-defined io_uring requests
block/io_uring: use aio_add_sqe()
include/block/aio.h | 74 +++++-
include/block/raw-aio.h | 5 -
util/aio-posix.h | 1 +
block/file-posix.c | 25 +-
block/io_uring.c | 489 ++++++++++------------------------------
stubs/io_uring.c | 32 ---
util/aio-posix.c | 9 +
util/async.c | 35 ---
util/fdmon-io_uring.c | 155 +++++++++----
block/trace-events | 12 +-
stubs/meson.build | 3 -
util/trace-events | 4 +
12 files changed, 316 insertions(+), 528 deletions(-)
delete mode 100644 stubs/io_uring.c
--
2.49.0
* [PATCH 1/3] aio-posix: treat io_uring setup failure as fatal
2025-04-01 14:27 [PATCH 0/3] [RESEND] block: unify block and fdmon io_uring Stefan Hajnoczi
@ 2025-04-01 14:27 ` Stefan Hajnoczi
2025-04-25 15:51 ` Kevin Wolf
2025-04-01 14:27 ` [PATCH 2/3] aio-posix: add aio_add_sqe() API for user-defined io_uring requests Stefan Hajnoczi
2025-04-01 14:27 ` [PATCH 3/3] block/io_uring: use aio_add_sqe() Stefan Hajnoczi
2 siblings, 1 reply; 6+ messages in thread
From: Stefan Hajnoczi @ 2025-04-01 14:27 UTC (permalink / raw)
To: qemu-devel
Cc: Aarushi Mehta, Stefan Hajnoczi, Stefano Garzarella,
surajshirvankar, Hanna Reitz, qemu-block, Kevin Wolf,
Paolo Bonzini, Fam Zheng
In the early days of io_uring it was possible for io_uring_setup(2) to
fail due to exhausting RLIMIT_MEMLOCK. QEMU's solution was to fall back
to epoll(7) or ppoll(2) when io_uring could not be used in an
AioContext.
Nowadays io_uring memory is accounted differently so io_uring_setup(2)
won't fail. Treat failure as a fatal error. Keep it simple: io_uring is
available if and only if CONFIG_LINUX_IO_URING is defined.
Upcoming features that rely on io_uring won't need to handle the case
where a subset of AioContexts lacks io_uring. This will simplify the
aio_add_sqe() API introduced in the next commit.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
util/fdmon-io_uring.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
index 2092d08d24..18b33a0370 100644
--- a/util/fdmon-io_uring.c
+++ b/util/fdmon-io_uring.c
@@ -45,6 +45,7 @@
#include "qemu/osdep.h"
#include <poll.h>
+#include "qemu/error-report.h"
#include "qemu/rcu_queue.h"
#include "aio-posix.h"
@@ -369,7 +370,8 @@ bool fdmon_io_uring_setup(AioContext *ctx)
ret = io_uring_queue_init(FDMON_IO_URING_ENTRIES, &ctx->fdmon_io_uring, 0);
if (ret != 0) {
- return false;
+ error_report("failed to initialize io_uring: %s", strerror(-ret));
+ exit(EXIT_FAILURE);
}
QSLIST_INIT(&ctx->submit_list);
--
2.49.0
* [PATCH 2/3] aio-posix: add aio_add_sqe() API for user-defined io_uring requests
2025-04-01 14:27 [PATCH 0/3] [RESEND] block: unify block and fdmon io_uring Stefan Hajnoczi
2025-04-01 14:27 ` [PATCH 1/3] aio-posix: treat io_uring setup failure as fatal Stefan Hajnoczi
@ 2025-04-01 14:27 ` Stefan Hajnoczi
2025-04-01 14:27 ` [PATCH 3/3] block/io_uring: use aio_add_sqe() Stefan Hajnoczi
2 siblings, 0 replies; 6+ messages in thread
From: Stefan Hajnoczi @ 2025-04-01 14:27 UTC (permalink / raw)
To: qemu-devel
Cc: Aarushi Mehta, Stefan Hajnoczi, Stefano Garzarella,
surajshirvankar, Hanna Reitz, qemu-block, Kevin Wolf,
Paolo Bonzini, Fam Zheng
Introduce the aio_add_sqe() API for submitting io_uring requests in the
current AioContext. This allows other components in QEMU, like the block
layer, to take advantage of io_uring features without creating their own
io_uring context.
This API supports nested event loops just like file descriptor
monitoring and BHs do. This comes at a complexity cost: a BH is required
to dispatch CQE callbacks, and completed CqeHandlers are placed on a list so
that a nested event loop can invoke its parent's pending CQE callbacks. If
you're wondering why CqeHandler exists instead of just a callback function
pointer, this is why.
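For illustration, a caller embeds a CqeHandler in its per-request state and
submits from its current AioContext roughly as follows (sketch only; the
MyRequest/my_* names are made up for this example):

  typedef struct {
      CqeHandler cqe_handler; /* must stay alive until ->cb() has run */
      int fd;
  } MyRequest;

  static void my_prep_sqe(struct io_uring_sqe *sqe, void *opaque)
  {
      MyRequest *req = opaque;

      /* Fill in the sqe. Do not call io_uring_sqe_set_data() here. */
      io_uring_prep_fsync(sqe, req->fd, IORING_FSYNC_DATASYNC);
  }

  static void my_cqe_handler(CqeHandler *cqe_handler)
  {
      MyRequest *req = container_of(cqe_handler, MyRequest, cqe_handler);

      /* cqe_handler->cqe.res holds the fsync result (0 or -errno) */
      my_complete_request(req, cqe_handler->cqe.res); /* hypothetical */
  }

  ...
  req->cqe_handler.cb = my_cqe_handler;
  aio_add_sqe(my_prep_sqe, req, &req->cqe_handler);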
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
include/block/aio.h | 67 +++++++++++++++++++
util/aio-posix.h | 1 +
util/aio-posix.c | 9 +++
util/fdmon-io_uring.c | 145 +++++++++++++++++++++++++++++++-----------
4 files changed, 185 insertions(+), 37 deletions(-)
diff --git a/include/block/aio.h b/include/block/aio.h
index 1657740a0e..4dfb419a21 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -61,6 +61,27 @@ typedef struct LuringState LuringState;
/* Is polling disabled? */
bool aio_poll_disabled(AioContext *ctx);
+#ifdef CONFIG_LINUX_IO_URING
+/*
+ * Each io_uring request must have a unique CqeHandler that processes the cqe.
+ * The lifetime of a CqeHandler must be at least from aio_add_sqe() until
+ * ->cb() invocation.
+ */
+typedef struct CqeHandler CqeHandler;
+struct CqeHandler {
+ /* Called by the AioContext when the request has completed */
+ void (*cb)(CqeHandler *handler);
+
+ /* Used internally, do not access this */
+ QSIMPLEQ_ENTRY(CqeHandler) next;
+
+ /* This field is filled in before ->cb() is called */
+ struct io_uring_cqe cqe;
+};
+
+typedef QSIMPLEQ_HEAD(, CqeHandler) CqeHandlerSimpleQ;
+#endif /* CONFIG_LINUX_IO_URING */
+
/* Callbacks for file descriptor monitoring implementations */
typedef struct {
/*
@@ -138,6 +159,27 @@ typedef struct {
* Called with list_lock incremented.
*/
void (*gsource_dispatch)(AioContext *ctx, AioHandlerList *ready_list);
+
+#ifdef CONFIG_LINUX_IO_URING
+ /**
+ * aio_add_sqe: Add an io_uring sqe for submission.
+ * @prep_sqe: invoked with an sqe that should be prepared for submission
+ * @opaque: user-defined argument to @prep_sqe()
+ * @cqe_handler: the unique cqe handler associated with this request
+ *
+ * The caller's @prep_sqe() function is invoked to fill in the details of
+ * the sqe. Do not call io_uring_sqe_set_data() on this sqe.
+ *
+ * The kernel may see the sqe as soon as @prep_sqe() returns or it may take
+ * until the next event loop iteration.
+ *
+ * This function is called from the current AioContext and is not
+ * thread-safe.
+ */
+ void (*add_sqe)(AioContext *ctx,
+ void (*prep_sqe)(struct io_uring_sqe *sqe, void *opaque),
+ void *opaque, CqeHandler *cqe_handler);
+#endif /* CONFIG_LINUX_IO_URING */
} FDMonOps;
/*
@@ -255,6 +297,10 @@ struct AioContext {
struct io_uring fdmon_io_uring;
AioHandlerSList submit_list;
gpointer io_uring_fd_tag;
+
+ /* Pending callback state for cqe handlers */
+ CqeHandlerSimpleQ cqe_handler_ready_list;
+ QEMUBH *cqe_handler_bh;
#endif
/* TimerLists for calling timers - one per clock type. Has its own
@@ -370,6 +416,27 @@ QEMUBH *aio_bh_new_full(AioContext *ctx, QEMUBHFunc *cb, void *opaque,
#define aio_bh_new_guarded(ctx, cb, opaque, guard) \
aio_bh_new_full((ctx), (cb), (opaque), (stringify(cb)), guard)
+#ifdef CONFIG_LINUX_IO_URING
+/**
+ * aio_add_sqe: Add an io_uring sqe for submission.
+ * @prep_sqe: invoked with an sqe that should be prepared for submission
+ * @opaque: user-defined argument to @prep_sqe()
+ * @cqe_handler: the unique cqe handler associated with this request
+ *
+ * The caller's @prep_sqe() function is invoked to fill in the details of the
+ * sqe. Do not call io_uring_sqe_set_data() on this sqe.
+ *
+ * The sqe is submitted by the current AioContext. The kernel may see the sqe
+ * as soon as @prep_sqe() returns or it may take until the next event loop
+ * iteration.
+ *
+ * When the AioContext is destroyed, pending sqes are ignored and their
+ * CqeHandlers are not invoked.
+ */
+void aio_add_sqe(void (*prep_sqe)(struct io_uring_sqe *sqe, void *opaque),
+ void *opaque, CqeHandler *cqe_handler);
+#endif /* CONFIG_LINUX_IO_URING */
+
/**
* aio_notify: Force processing of pending events.
*
diff --git a/util/aio-posix.h b/util/aio-posix.h
index f9994ed79e..d3e2f66957 100644
--- a/util/aio-posix.h
+++ b/util/aio-posix.h
@@ -35,6 +35,7 @@ struct AioHandler {
#ifdef CONFIG_LINUX_IO_URING
QSLIST_ENTRY(AioHandler) node_submitted;
unsigned flags; /* see fdmon-io_uring.c */
+ CqeHandler cqe_handler;
#endif
int64_t poll_idle_timeout; /* when to stop userspace polling */
bool poll_ready; /* has polling detected an event? */
diff --git a/util/aio-posix.c b/util/aio-posix.c
index 6c2ee0b0b4..f2535dc868 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -767,3 +767,12 @@ void aio_context_set_aio_params(AioContext *ctx, int64_t max_batch)
aio_notify(ctx);
}
+
+#ifdef CONFIG_LINUX_IO_URING
+void aio_add_sqe(void (*prep_sqe)(struct io_uring_sqe *sqe, void *opaque),
+ void *opaque, CqeHandler *cqe_handler)
+{
+ AioContext *ctx = qemu_get_current_aio_context();
+ ctx->fdmon_ops->add_sqe(ctx, prep_sqe, opaque, cqe_handler);
+}
+#endif /* CONFIG_LINUX_IO_URING */
diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
index 18b33a0370..a4523e3dcc 100644
--- a/util/fdmon-io_uring.c
+++ b/util/fdmon-io_uring.c
@@ -75,8 +75,8 @@ static inline int pfd_events_from_poll(int poll_events)
}
/*
- * Returns an sqe for submitting a request. Only be called within
- * fdmon_io_uring_wait().
+ * Returns an sqe for submitting a request. Only called from the AioContext
+ * thread.
*/
static struct io_uring_sqe *get_sqe(AioContext *ctx)
{
@@ -166,23 +166,43 @@ static void fdmon_io_uring_update(AioContext *ctx,
}
}
+static void fdmon_io_uring_add_sqe(AioContext *ctx,
+ void (*prep_sqe)(struct io_uring_sqe *sqe, void *opaque),
+ void *opaque, CqeHandler *cqe_handler)
+{
+ struct io_uring_sqe *sqe = get_sqe(ctx);
+
+ prep_sqe(sqe, opaque);
+ io_uring_sqe_set_data(sqe, cqe_handler);
+}
+
+static void fdmon_special_cqe_handler(CqeHandler *cqe_handler)
+{
+ /*
+ * This is an empty function that is never called. It is used as a function
+ * pointer to distinguish it from ordinary cqe handlers.
+ */
+}
+
static void add_poll_multishot_sqe(AioContext *ctx, AioHandler *node)
{
struct io_uring_sqe *sqe = get_sqe(ctx);
int events = poll_events_from_pfd(node->pfd.events);
io_uring_prep_poll_multishot(sqe, node->pfd.fd, events);
- io_uring_sqe_set_data(sqe, node);
+ node->cqe_handler.cb = fdmon_special_cqe_handler;
+ io_uring_sqe_set_data(sqe, &node->cqe_handler);
}
static void add_poll_remove_sqe(AioContext *ctx, AioHandler *node)
{
struct io_uring_sqe *sqe = get_sqe(ctx);
+ CqeHandler *cqe_handler = &node->cqe_handler;
#ifdef LIBURING_HAVE_DATA64
- io_uring_prep_poll_remove(sqe, (uintptr_t)node);
+ io_uring_prep_poll_remove(sqe, (uintptr_t)cqe_handler);
#else
- io_uring_prep_poll_remove(sqe, node);
+ io_uring_prep_poll_remove(sqe, cqe_handler);
#endif
io_uring_sqe_set_data(sqe, NULL);
}
@@ -221,20 +241,12 @@ static void fill_sq_ring(AioContext *ctx)
}
}
-/* Returns true if a handler became ready */
-static bool process_cqe(AioContext *ctx,
- AioHandlerList *ready_list,
- struct io_uring_cqe *cqe)
+static bool process_cqe_aio_handler(AioContext *ctx,
+ AioHandlerList *ready_list,
+ AioHandler *node,
+ struct io_uring_cqe *cqe)
{
- AioHandler *node = io_uring_cqe_get_data(cqe);
- unsigned flags;
-
- /* poll_timeout and poll_remove have a zero user_data field */
- if (!node) {
- return false;
- }
-
- flags = qatomic_read(&node->flags);
+ unsigned flags = qatomic_read(&node->flags);
/*
* poll_multishot cancelled by poll_remove? Or completed early because fd
@@ -261,6 +273,56 @@ static bool process_cqe(AioContext *ctx,
return true;
}
+/* Process CqeHandlers from the ready list */
+static void cqe_handler_bh(void *opaque)
+{
+ AioContext *ctx = opaque;
+ CqeHandlerSimpleQ *ready_list = &ctx->cqe_handler_ready_list;
+
+ /*
+ * If cqe_handler->cb() calls aio_poll() it must continue processing
+ * ready_list. Schedule a BH so the inner event loop calls us again.
+ */
+ qemu_bh_schedule(ctx->cqe_handler_bh);
+
+ while (!QSIMPLEQ_EMPTY(ready_list)) {
+ CqeHandler *cqe_handler = QSIMPLEQ_FIRST(ready_list);
+
+ QSIMPLEQ_REMOVE_HEAD(ready_list, next);
+
+ cqe_handler->cb(cqe_handler);
+ }
+
+ qemu_bh_cancel(ctx->cqe_handler_bh);
+}
+
+/* Returns true if a handler became ready */
+static bool process_cqe(AioContext *ctx,
+ AioHandlerList *ready_list,
+ struct io_uring_cqe *cqe)
+{
+ CqeHandler *cqe_handler = io_uring_cqe_get_data(cqe);
+
+ /* poll_timeout and poll_remove have a zero user_data field */
+ if (!cqe_handler) {
+ return false;
+ }
+
+ /*
+ * Special handling for AioHandler cqes. They need ready_list and have a
+ * return value.
+ */
+ if (cqe_handler->cb == fdmon_special_cqe_handler) {
+ AioHandler *node = container_of(cqe_handler, AioHandler, cqe_handler);
+ return process_cqe_aio_handler(ctx, ready_list, node, cqe);
+ }
+
+ cqe_handler->cqe = *cqe;
+ QSIMPLEQ_INSERT_TAIL(&ctx->cqe_handler_ready_list, cqe_handler, next);
+ qemu_bh_schedule(ctx->cqe_handler_bh);
+ return false;
+}
+
static int process_cq_ring(AioContext *ctx, AioHandlerList *ready_list)
{
struct io_uring *ring = &ctx->fdmon_io_uring;
@@ -360,6 +422,7 @@ static const FDMonOps fdmon_io_uring_ops = {
.gsource_prepare = fdmon_io_uring_gsource_prepare,
.gsource_check = fdmon_io_uring_gsource_check,
.gsource_dispatch = fdmon_io_uring_gsource_dispatch,
+ .add_sqe = fdmon_io_uring_add_sqe,
};
bool fdmon_io_uring_setup(AioContext *ctx)
@@ -375,6 +438,8 @@ bool fdmon_io_uring_setup(AioContext *ctx)
}
QSLIST_INIT(&ctx->submit_list);
+ QSIMPLEQ_INIT(&ctx->cqe_handler_ready_list);
+ ctx->cqe_handler_bh = aio_bh_new(ctx, cqe_handler_bh, ctx);
ctx->fdmon_ops = &fdmon_io_uring_ops;
ctx->io_uring_fd_tag = g_source_add_unix_fd(&ctx->source,
ctx->fdmon_io_uring.ring_fd, G_IO_IN);
@@ -384,30 +449,36 @@ bool fdmon_io_uring_setup(AioContext *ctx)
void fdmon_io_uring_destroy(AioContext *ctx)
{
- if (ctx->fdmon_ops == &fdmon_io_uring_ops) {
- AioHandler *node;
+ AioHandler *node;
- io_uring_queue_exit(&ctx->fdmon_io_uring);
+ if (ctx->fdmon_ops != &fdmon_io_uring_ops) {
+ return;
+ }
- /* Move handlers due to be removed onto the deleted list */
- while ((node = QSLIST_FIRST_RCU(&ctx->submit_list))) {
- unsigned flags = qatomic_fetch_and(&node->flags,
- ~(FDMON_IO_URING_PENDING |
- FDMON_IO_URING_ADD |
- FDMON_IO_URING_REMOVE));
+ io_uring_queue_exit(&ctx->fdmon_io_uring);
- if (flags & FDMON_IO_URING_REMOVE) {
- QLIST_INSERT_HEAD_RCU(&ctx->deleted_aio_handlers, node, node_deleted);
- }
+ /* Move handlers due to be removed onto the deleted list */
+ while ((node = QSLIST_FIRST_RCU(&ctx->submit_list))) {
+ unsigned flags = qatomic_fetch_and(&node->flags,
+ ~(FDMON_IO_URING_PENDING |
+ FDMON_IO_URING_ADD |
+ FDMON_IO_URING_REMOVE));
- QSLIST_REMOVE_HEAD_RCU(&ctx->submit_list, node_submitted);
+ if (flags & FDMON_IO_URING_REMOVE) {
+ QLIST_INSERT_HEAD_RCU(&ctx->deleted_aio_handlers,
+ node, node_deleted);
}
- g_source_remove_unix_fd(&ctx->source, ctx->io_uring_fd_tag);
- ctx->io_uring_fd_tag = NULL;
-
- qemu_lockcnt_lock(&ctx->list_lock);
- fdmon_poll_downgrade(ctx);
- qemu_lockcnt_unlock(&ctx->list_lock);
+ QSLIST_REMOVE_HEAD_RCU(&ctx->submit_list, node_submitted);
}
+
+ g_source_remove_unix_fd(&ctx->source, ctx->io_uring_fd_tag);
+ ctx->io_uring_fd_tag = NULL;
+
+ assert(QSIMPLEQ_EMPTY(&ctx->cqe_handler_ready_list));
+ qemu_bh_delete(ctx->cqe_handler_bh);
+
+ qemu_lockcnt_lock(&ctx->list_lock);
+ fdmon_poll_downgrade(ctx);
+ qemu_lockcnt_unlock(&ctx->list_lock);
}
--
2.49.0
* [PATCH 3/3] block/io_uring: use aio_add_sqe()
2025-04-01 14:27 [PATCH 0/3] [RESEND] block: unify block and fdmon io_uring Stefan Hajnoczi
2025-04-01 14:27 ` [PATCH 1/3] aio-posix: treat io_uring setup failure as fatal Stefan Hajnoczi
2025-04-01 14:27 ` [PATCH 2/3] aio-posix: add aio_add_sqe() API for user-defined io_uring requests Stefan Hajnoczi
@ 2025-04-01 14:27 ` Stefan Hajnoczi
2 siblings, 0 replies; 6+ messages in thread
From: Stefan Hajnoczi @ 2025-04-01 14:27 UTC (permalink / raw)
To: qemu-devel
Cc: Aarushi Mehta, Stefan Hajnoczi, Stefano Garzarella,
surajshirvankar, Hanna Reitz, qemu-block, Kevin Wolf,
Paolo Bonzini, Fam Zheng
AioContext has its own io_uring instance for file descriptor monitoring.
The disk I/O io_uring code was developed separately. Originally I
thought the characteristics of file descriptor monitoring and disk I/O
were too different, requiring separate io_uring instances.
Now it has become clear to me that it's feasible to share a single
io_uring instance for file descriptor monitoring and disk I/O. We're not
using io_uring's IOPOLL feature or anything else that would require a
separate instance.
Unify block/io_uring.c and util/fdmon-io_uring.c using the new
aio_add_sqe() API that allows user-defined io_uring sqe submission. Now
block/io_uring.c just needs to submit readv/writev/fsync and most of the
io_uring-specific logic is handled by fdmon-io_uring.c.
There are two immediate advantages:
1. Fewer system calls. There is no need to monitor the disk I/O io_uring
ring fd from the file descriptor monitoring io_uring instance. Disk
I/O completions are now picked up directly. Also, sqes are
accumulated in the sq ring until the end of the event loop iteration
and there are fewer io_uring_enter(2) syscalls.
2. Less code duplication.
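Condensed from the patch below, the per-request flow now looks roughly like
this (simplified; short read resubmission and -EAGAIN handling omitted):

  /* luring_co_submit(), running in the current AioContext */
  LuringRequest req = {
      .co = qemu_coroutine_self(),
      .ret = -EINPROGRESS,
      /* .qiov, .fd, .offset, .type, .flags ... */
  };

  req.cqe_handler.cb = luring_cqe_handler;
  aio_add_sqe(luring_prep_sqe, &req, &req.cqe_handler);

  if (req.ret == -EINPROGRESS) {
      qemu_coroutine_yield(); /* woken by luring_cqe_handler() via aio_co_wake() */
  }
  return req.ret;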
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
include/block/aio.h | 7 -
include/block/raw-aio.h | 5 -
block/file-posix.c | 25 +-
block/io_uring.c | 489 ++++++++++------------------------------
stubs/io_uring.c | 32 ---
util/async.c | 35 ---
util/fdmon-io_uring.c | 6 +
block/trace-events | 12 +-
stubs/meson.build | 3 -
util/trace-events | 4 +
10 files changed, 128 insertions(+), 490 deletions(-)
delete mode 100644 stubs/io_uring.c
diff --git a/include/block/aio.h b/include/block/aio.h
index 4dfb419a21..b390b7bb60 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -291,8 +291,6 @@ struct AioContext {
struct LinuxAioState *linux_aio;
#endif
#ifdef CONFIG_LINUX_IO_URING
- LuringState *linux_io_uring;
-
/* State for file descriptor monitoring using Linux io_uring */
struct io_uring fdmon_io_uring;
AioHandlerSList submit_list;
@@ -615,11 +613,6 @@ struct LinuxAioState *aio_setup_linux_aio(AioContext *ctx, Error **errp);
/* Return the LinuxAioState bound to this AioContext */
struct LinuxAioState *aio_get_linux_aio(AioContext *ctx);
-/* Setup the LuringState bound to this AioContext */
-LuringState *aio_setup_linux_io_uring(AioContext *ctx, Error **errp);
-
-/* Return the LuringState bound to this AioContext */
-LuringState *aio_get_linux_io_uring(AioContext *ctx);
/**
* aio_timer_new_with_attrs:
* @ctx: the aio context
diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
index 6570244496..30e5fc9a9f 100644
--- a/include/block/raw-aio.h
+++ b/include/block/raw-aio.h
@@ -74,15 +74,10 @@ static inline bool laio_has_fua(void)
#endif
/* io_uring.c - Linux io_uring implementation */
#ifdef CONFIG_LINUX_IO_URING
-LuringState *luring_init(Error **errp);
-void luring_cleanup(LuringState *s);
-
/* luring_co_submit: submit I/O requests in the thread's current AioContext. */
int coroutine_fn luring_co_submit(BlockDriverState *bs, int fd, uint64_t offset,
QEMUIOVector *qiov, int type,
BdrvRequestFlags flags);
-void luring_detach_aio_context(LuringState *s, AioContext *old_context);
-void luring_attach_aio_context(LuringState *s, AioContext *new_context);
bool luring_has_fua(void);
#else
static inline bool luring_has_fua(void)
diff --git a/block/file-posix.c b/block/file-posix.c
index 56d1972d15..b1b1d7a5dc 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2442,27 +2442,6 @@ static bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov)
return true;
}
-#ifdef CONFIG_LINUX_IO_URING
-static inline bool raw_check_linux_io_uring(BDRVRawState *s)
-{
- Error *local_err = NULL;
- AioContext *ctx;
-
- if (!s->use_linux_io_uring) {
- return false;
- }
-
- ctx = qemu_get_current_aio_context();
- if (unlikely(!aio_setup_linux_io_uring(ctx, &local_err))) {
- error_reportf_err(local_err, "Unable to use linux io_uring, "
- "falling back to thread pool: ");
- s->use_linux_io_uring = false;
- return false;
- }
- return true;
-}
-#endif
-
#ifdef CONFIG_LINUX_AIO
static inline bool raw_check_linux_aio(BDRVRawState *s)
{
@@ -2515,7 +2494,7 @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, int64_t *offset_ptr,
if (s->needs_alignment && !bdrv_qiov_is_aligned(bs, qiov)) {
type |= QEMU_AIO_MISALIGNED;
#ifdef CONFIG_LINUX_IO_URING
- } else if (raw_check_linux_io_uring(s)) {
+ } else if (s->use_linux_io_uring) {
assert(qiov->size == bytes);
ret = luring_co_submit(bs, s->fd, offset, qiov, type, flags);
goto out;
@@ -2612,7 +2591,7 @@ static int coroutine_fn raw_co_flush_to_disk(BlockDriverState *bs)
};
#ifdef CONFIG_LINUX_IO_URING
- if (raw_check_linux_io_uring(s)) {
+ if (s->use_linux_io_uring) {
return luring_co_submit(bs, s->fd, 0, NULL, QEMU_AIO_FLUSH, 0);
}
#endif
diff --git a/block/io_uring.c b/block/io_uring.c
index dd4f304910..dd930ee57e 100644
--- a/block/io_uring.c
+++ b/block/io_uring.c
@@ -11,28 +11,20 @@
#include "qemu/osdep.h"
#include <liburing.h>
#include "block/aio.h"
-#include "qemu/queue.h"
#include "block/block.h"
#include "block/raw-aio.h"
#include "qemu/coroutine.h"
-#include "qemu/defer-call.h"
-#include "qapi/error.h"
#include "system/block-backend.h"
#include "trace.h"
-/* Only used for assertions. */
-#include "qemu/coroutine_int.h"
-
-/* io_uring ring size */
-#define MAX_ENTRIES 128
-
-typedef struct LuringAIOCB {
+typedef struct {
Coroutine *co;
- struct io_uring_sqe sqeq;
- ssize_t ret;
QEMUIOVector *qiov;
- bool is_read;
- QSIMPLEQ_ENTRY(LuringAIOCB) next;
+ uint64_t offset;
+ ssize_t ret;
+ int type;
+ int fd;
+ BdrvRequestFlags flags;
/*
* Buffered reads may require resubmission, see
@@ -40,36 +32,51 @@ typedef struct LuringAIOCB {
*/
int total_read;
QEMUIOVector resubmit_qiov;
-} LuringAIOCB;
-typedef struct LuringQueue {
- unsigned int in_queue;
- unsigned int in_flight;
- bool blocked;
- QSIMPLEQ_HEAD(, LuringAIOCB) submit_queue;
-} LuringQueue;
+ CqeHandler cqe_handler;
+} LuringRequest;
-struct LuringState {
- AioContext *aio_context;
-
- struct io_uring ring;
-
- /* No locking required, only accessed from AioContext home thread */
- LuringQueue io_q;
-
- QEMUBH *completion_bh;
-};
-
-/**
- * luring_resubmit:
- *
- * Resubmit a request by appending it to submit_queue. The caller must ensure
- * that ioq_submit() is called later so that submit_queue requests are started.
- */
-static void luring_resubmit(LuringState *s, LuringAIOCB *luringcb)
+static void luring_prep_sqe(struct io_uring_sqe *sqe, void *opaque)
{
- QSIMPLEQ_INSERT_TAIL(&s->io_q.submit_queue, luringcb, next);
- s->io_q.in_queue++;
+ LuringRequest *req = opaque;
+ QEMUIOVector *qiov = req->qiov;
+ uint64_t offset = req->offset;
+ int fd = req->fd;
+ BdrvRequestFlags flags = req->flags;
+
+ switch (req->type) {
+ case QEMU_AIO_WRITE:
+#ifdef HAVE_IO_URING_PREP_WRITEV2
+ {
+ int luring_flags = (flags & BDRV_REQ_FUA) ? RWF_DSYNC : 0;
+ io_uring_prep_writev2(sqe, fd, qiov->iov,
+ qiov->niov, offset, luring_flags);
+ }
+#else
+ assert(flags == 0);
+ io_uring_prep_writev(sqe, fd, qiov->iov, qiov->niov, offset);
+#endif
+ break;
+ case QEMU_AIO_ZONE_APPEND:
+ io_uring_prep_writev(sqe, fd, qiov->iov, qiov->niov, offset);
+ break;
+ case QEMU_AIO_READ:
+ {
+ if (req->resubmit_qiov.iov != NULL) {
+ qiov = &req->resubmit_qiov;
+ }
+ io_uring_prep_readv(sqe, fd, qiov->iov, qiov->niov,
+ offset + req->total_read);
+ break;
+ }
+ case QEMU_AIO_FLUSH:
+ io_uring_prep_fsync(sqe, fd, IORING_FSYNC_DATASYNC);
+ break;
+ default:
+ fprintf(stderr, "%s: invalid AIO request type, aborting 0x%x.\n",
+ __func__, req->type);
+ abort();
+ }
}
/**
@@ -78,385 +85,115 @@ static void luring_resubmit(LuringState *s, LuringAIOCB *luringcb)
* Short reads are rare but may occur. The remaining read request needs to be
* resubmitted.
*/
-static void luring_resubmit_short_read(LuringState *s, LuringAIOCB *luringcb,
- int nread)
+static void luring_resubmit_short_read(LuringRequest *req, int nread)
{
QEMUIOVector *resubmit_qiov;
size_t remaining;
- trace_luring_resubmit_short_read(s, luringcb, nread);
+ trace_luring_resubmit_short_read(req, nread);
/* Update read position */
- luringcb->total_read += nread;
- remaining = luringcb->qiov->size - luringcb->total_read;
+ req->total_read += nread;
+ remaining = req->qiov->size - req->total_read;
/* Shorten qiov */
- resubmit_qiov = &luringcb->resubmit_qiov;
+ resubmit_qiov = &req->resubmit_qiov;
if (resubmit_qiov->iov == NULL) {
- qemu_iovec_init(resubmit_qiov, luringcb->qiov->niov);
+ qemu_iovec_init(resubmit_qiov, req->qiov->niov);
} else {
qemu_iovec_reset(resubmit_qiov);
}
- qemu_iovec_concat(resubmit_qiov, luringcb->qiov, luringcb->total_read,
- remaining);
+ qemu_iovec_concat(resubmit_qiov, req->qiov, req->total_read, remaining);
- /* Update sqe */
- luringcb->sqeq.off += nread;
- luringcb->sqeq.addr = (uintptr_t)luringcb->resubmit_qiov.iov;
- luringcb->sqeq.len = luringcb->resubmit_qiov.niov;
-
- luring_resubmit(s, luringcb);
+ aio_add_sqe(luring_prep_sqe, req, &req->cqe_handler);
}
-/**
- * luring_process_completions:
- * @s: AIO state
- *
- * Fetches completed I/O requests, consumes cqes and invokes their callbacks
- * The function is somewhat tricky because it supports nested event loops, for
- * example when a request callback invokes aio_poll().
- *
- * Function schedules BH completion so it can be called again in a nested
- * event loop. When there are no events left to complete the BH is being
- * canceled.
- *
- */
-static void luring_process_completions(LuringState *s)
+static void luring_cqe_handler(CqeHandler *cqe_handler)
{
- struct io_uring_cqe *cqes;
- int total_bytes;
+ LuringRequest *req = container_of(cqe_handler, LuringRequest, cqe_handler);
+ int ret = cqe_handler->cqe.res;
- defer_call_begin();
+ trace_luring_cqe_handler(req, ret);
- /*
- * Request completion callbacks can run the nested event loop.
- * Schedule ourselves so the nested event loop will "see" remaining
- * completed requests and process them. Without this, completion
- * callbacks that wait for other requests using a nested event loop
- * would hang forever.
- *
- * This workaround is needed because io_uring uses poll_wait, which
- * is woken up when new events are added to the uring, thus polling on
- * the same uring fd will block unless more events are received.
- *
- * Other leaf block drivers (drivers that access the data themselves)
- * are networking based, so they poll sockets for data and run the
- * correct coroutine.
- */
- qemu_bh_schedule(s->completion_bh);
-
- while (io_uring_peek_cqe(&s->ring, &cqes) == 0) {
- LuringAIOCB *luringcb;
- int ret;
-
- if (!cqes) {
- break;
+ if (ret < 0) {
+ /*
+ * Only writev/readv/fsync requests on regular files or host block
+ * devices are submitted. Therefore -EAGAIN is not expected but it's
+ * known to happen sometimes with Linux SCSI. Submit again and hope
+ * the request completes successfully.
+ *
+ * For more information, see:
+ * https://lore.kernel.org/io-uring/20210727165811.284510-3-axboe@kernel.dk/T/#u
+ *
+ * If the code is changed to submit other types of requests in the
+ * future, then this workaround may need to be extended to deal with
+ * genuine -EAGAIN results that should not be resubmitted
+ * immediately.
+ */
+ if (ret == -EINTR || ret == -EAGAIN) {
+ aio_add_sqe(luring_prep_sqe, req, &req->cqe_handler);
+ return;
}
-
- luringcb = io_uring_cqe_get_data(cqes);
- ret = cqes->res;
- io_uring_cqe_seen(&s->ring, cqes);
- cqes = NULL;
-
- /* Change counters one-by-one because we can be nested. */
- s->io_q.in_flight--;
- trace_luring_process_completion(s, luringcb, ret);
-
+ } else if (req->qiov) {
/* total_read is non-zero only for resubmitted read requests */
- total_bytes = ret + luringcb->total_read;
+ int total_bytes = ret + req->total_read;
- if (ret < 0) {
- /*
- * Only writev/readv/fsync requests on regular files or host block
- * devices are submitted. Therefore -EAGAIN is not expected but it's
- * known to happen sometimes with Linux SCSI. Submit again and hope
- * the request completes successfully.
- *
- * For more information, see:
- * https://lore.kernel.org/io-uring/20210727165811.284510-3-axboe@kernel.dk/T/#u
- *
- * If the code is changed to submit other types of requests in the
- * future, then this workaround may need to be extended to deal with
- * genuine -EAGAIN results that should not be resubmitted
- * immediately.
- */
- if (ret == -EINTR || ret == -EAGAIN) {
- luring_resubmit(s, luringcb);
- continue;
- }
- } else if (!luringcb->qiov) {
- goto end;
- } else if (total_bytes == luringcb->qiov->size) {
+ if (total_bytes == req->qiov->size) {
ret = 0;
- /* Only read/write */
} else {
/* Short Read/Write */
- if (luringcb->is_read) {
+ if (req->type == QEMU_AIO_READ) {
if (ret > 0) {
- luring_resubmit_short_read(s, luringcb, ret);
- continue;
- } else {
- /* Pad with zeroes */
- qemu_iovec_memset(luringcb->qiov, total_bytes, 0,
- luringcb->qiov->size - total_bytes);
- ret = 0;
+ luring_resubmit_short_read(req, ret);
+ return;
}
+
+ /* Pad with zeroes */
+ qemu_iovec_memset(req->qiov, total_bytes, 0,
+ req->qiov->size - total_bytes);
+ ret = 0;
} else {
ret = -ENOSPC;
}
}
-end:
- luringcb->ret = ret;
- qemu_iovec_destroy(&luringcb->resubmit_qiov);
-
- /*
- * If the coroutine is already entered it must be in ioq_submit()
- * and will notice luringcb->ret has been filled in when it
- * eventually runs later. Coroutines cannot be entered recursively
- * so avoid doing that!
- */
- assert(luringcb->co->ctx == s->aio_context);
- if (!qemu_coroutine_entered(luringcb->co)) {
- aio_co_wake(luringcb->co);
- }
}
- qemu_bh_cancel(s->completion_bh);
+ req->ret = ret;
+ qemu_iovec_destroy(&req->resubmit_qiov);
- defer_call_end();
-}
-
-static int ioq_submit(LuringState *s)
-{
- int ret = 0;
- LuringAIOCB *luringcb, *luringcb_next;
-
- while (s->io_q.in_queue > 0) {
- /*
- * Try to fetch sqes from the ring for requests waiting in
- * the overflow queue
- */
- QSIMPLEQ_FOREACH_SAFE(luringcb, &s->io_q.submit_queue, next,
- luringcb_next) {
- struct io_uring_sqe *sqes = io_uring_get_sqe(&s->ring);
- if (!sqes) {
- break;
- }
- /* Prep sqe for submission */
- *sqes = luringcb->sqeq;
- QSIMPLEQ_REMOVE_HEAD(&s->io_q.submit_queue, next);
- }
- ret = io_uring_submit(&s->ring);
- trace_luring_io_uring_submit(s, ret);
- /* Prevent infinite loop if submission is refused */
- if (ret <= 0) {
- if (ret == -EAGAIN || ret == -EINTR) {
- continue;
- }
- break;
- }
- s->io_q.in_flight += ret;
- s->io_q.in_queue -= ret;
- }
- s->io_q.blocked = (s->io_q.in_queue > 0);
-
- if (s->io_q.in_flight) {
- /*
- * We can try to complete something just right away if there are
- * still requests in-flight.
- */
- luring_process_completions(s);
- }
- return ret;
-}
-
-static void luring_process_completions_and_submit(LuringState *s)
-{
- luring_process_completions(s);
-
- if (s->io_q.in_queue > 0) {
- ioq_submit(s);
+ /*
+ * If the coroutine is already entered it must be in luring_co_submit() and
+ * will notice req->ret has been filled in when it eventually runs later.
+ * Coroutines cannot be entered recursively so avoid doing that!
+ */
+ if (!qemu_coroutine_entered(req->co)) {
+ aio_co_wake(req->co);
}
}
-static void qemu_luring_completion_bh(void *opaque)
+int coroutine_fn luring_co_submit(BlockDriverState *bs, int fd,
+ uint64_t offset, QEMUIOVector *qiov,
+ int type, BdrvRequestFlags flags)
{
- LuringState *s = opaque;
- luring_process_completions_and_submit(s);
-}
-
-static void qemu_luring_completion_cb(void *opaque)
-{
- LuringState *s = opaque;
- luring_process_completions_and_submit(s);
-}
-
-static bool qemu_luring_poll_cb(void *opaque)
-{
- LuringState *s = opaque;
-
- return io_uring_cq_ready(&s->ring);
-}
-
-static void qemu_luring_poll_ready(void *opaque)
-{
- LuringState *s = opaque;
-
- luring_process_completions_and_submit(s);
-}
-
-static void ioq_init(LuringQueue *io_q)
-{
- QSIMPLEQ_INIT(&io_q->submit_queue);
- io_q->in_queue = 0;
- io_q->in_flight = 0;
- io_q->blocked = false;
-}
-
-static void luring_deferred_fn(void *opaque)
-{
- LuringState *s = opaque;
- trace_luring_unplug_fn(s, s->io_q.blocked, s->io_q.in_queue,
- s->io_q.in_flight);
- if (!s->io_q.blocked && s->io_q.in_queue > 0) {
- ioq_submit(s);
- }
-}
-
-/**
- * luring_do_submit:
- * @fd: file descriptor for I/O
- * @luringcb: AIO control block
- * @s: AIO state
- * @offset: offset for request
- * @type: type of request
- *
- * Fetches sqes from ring, adds to pending queue and preps them
- *
- */
-static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
- uint64_t offset, int type, BdrvRequestFlags flags)
-{
- int ret;
- struct io_uring_sqe *sqes = &luringcb->sqeq;
-
- switch (type) {
- case QEMU_AIO_WRITE:
-#ifdef HAVE_IO_URING_PREP_WRITEV2
- {
- int luring_flags = (flags & BDRV_REQ_FUA) ? RWF_DSYNC : 0;
- io_uring_prep_writev2(sqes, fd, luringcb->qiov->iov,
- luringcb->qiov->niov, offset, luring_flags);
- }
-#else
- assert(flags == 0);
- io_uring_prep_writev(sqes, fd, luringcb->qiov->iov,
- luringcb->qiov->niov, offset);
-#endif
- break;
- case QEMU_AIO_ZONE_APPEND:
- io_uring_prep_writev(sqes, fd, luringcb->qiov->iov,
- luringcb->qiov->niov, offset);
- break;
- case QEMU_AIO_READ:
- io_uring_prep_readv(sqes, fd, luringcb->qiov->iov,
- luringcb->qiov->niov, offset);
- break;
- case QEMU_AIO_FLUSH:
- io_uring_prep_fsync(sqes, fd, IORING_FSYNC_DATASYNC);
- break;
- default:
- fprintf(stderr, "%s: invalid AIO request type, aborting 0x%x.\n",
- __func__, type);
- abort();
- }
- io_uring_sqe_set_data(sqes, luringcb);
-
- QSIMPLEQ_INSERT_TAIL(&s->io_q.submit_queue, luringcb, next);
- s->io_q.in_queue++;
- trace_luring_do_submit(s, s->io_q.blocked, s->io_q.in_queue,
- s->io_q.in_flight);
- if (!s->io_q.blocked) {
- if (s->io_q.in_flight + s->io_q.in_queue >= MAX_ENTRIES) {
- ret = ioq_submit(s);
- trace_luring_do_submit_done(s, ret);
- return ret;
- }
-
- defer_call(luring_deferred_fn, s);
- }
- return 0;
-}
-
-int coroutine_fn luring_co_submit(BlockDriverState *bs, int fd, uint64_t offset,
- QEMUIOVector *qiov, int type,
- BdrvRequestFlags flags)
-{
- int ret;
- AioContext *ctx = qemu_get_current_aio_context();
- LuringState *s = aio_get_linux_io_uring(ctx);
- LuringAIOCB luringcb = {
+ LuringRequest req = {
.co = qemu_coroutine_self(),
- .ret = -EINPROGRESS,
.qiov = qiov,
- .is_read = (type == QEMU_AIO_READ),
+ .ret = -EINPROGRESS,
+ .type = type,
+ .fd = fd,
+ .offset = offset,
+ .flags = flags,
};
- trace_luring_co_submit(bs, s, &luringcb, fd, offset, qiov ? qiov->size : 0,
- type);
- ret = luring_do_submit(fd, &luringcb, s, offset, type, flags);
- if (ret < 0) {
- return ret;
- }
+ req.cqe_handler.cb = luring_cqe_handler;
- if (luringcb.ret == -EINPROGRESS) {
+ trace_luring_co_submit(bs, &req, fd, offset, qiov ? qiov->size : 0, type);
+ aio_add_sqe(luring_prep_sqe, &req, &req.cqe_handler);
+
+ if (req.ret == -EINPROGRESS) {
qemu_coroutine_yield();
}
- return luringcb.ret;
-}
-
-void luring_detach_aio_context(LuringState *s, AioContext *old_context)
-{
- aio_set_fd_handler(old_context, s->ring.ring_fd,
- NULL, NULL, NULL, NULL, s);
- qemu_bh_delete(s->completion_bh);
- s->aio_context = NULL;
-}
-
-void luring_attach_aio_context(LuringState *s, AioContext *new_context)
-{
- s->aio_context = new_context;
- s->completion_bh = aio_bh_new(new_context, qemu_luring_completion_bh, s);
- aio_set_fd_handler(s->aio_context, s->ring.ring_fd,
- qemu_luring_completion_cb, NULL,
- qemu_luring_poll_cb, qemu_luring_poll_ready, s);
-}
-
-LuringState *luring_init(Error **errp)
-{
- int rc;
- LuringState *s = g_new0(LuringState, 1);
- struct io_uring *ring = &s->ring;
-
- trace_luring_init_state(s, sizeof(*s));
-
- rc = io_uring_queue_init(MAX_ENTRIES, ring, 0);
- if (rc < 0) {
- error_setg_errno(errp, -rc, "failed to init linux io_uring ring");
- g_free(s);
- return NULL;
- }
-
- ioq_init(&s->io_q);
- return s;
-
-}
-
-void luring_cleanup(LuringState *s)
-{
- io_uring_queue_exit(&s->ring);
- trace_luring_cleanup_state(s);
- g_free(s);
+ return req.ret;
}
bool luring_has_fua(void)
diff --git a/stubs/io_uring.c b/stubs/io_uring.c
deleted file mode 100644
index 622d1e4648..0000000000
--- a/stubs/io_uring.c
+++ /dev/null
@@ -1,32 +0,0 @@
-/*
- * Linux io_uring support.
- *
- * Copyright (C) 2009 IBM, Corp.
- * Copyright (C) 2009 Red Hat, Inc.
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or later.
- * See the COPYING file in the top-level directory.
- */
-#include "qemu/osdep.h"
-#include "block/aio.h"
-#include "block/raw-aio.h"
-
-void luring_detach_aio_context(LuringState *s, AioContext *old_context)
-{
- abort();
-}
-
-void luring_attach_aio_context(LuringState *s, AioContext *new_context)
-{
- abort();
-}
-
-LuringState *luring_init(Error **errp)
-{
- abort();
-}
-
-void luring_cleanup(LuringState *s)
-{
- abort();
-}
diff --git a/util/async.c b/util/async.c
index 11954f8931..4f8465978f 100644
--- a/util/async.c
+++ b/util/async.c
@@ -379,14 +379,6 @@ aio_ctx_finalize(GSource *source)
}
#endif
-#ifdef CONFIG_LINUX_IO_URING
- if (ctx->linux_io_uring) {
- luring_detach_aio_context(ctx->linux_io_uring, ctx);
- luring_cleanup(ctx->linux_io_uring);
- ctx->linux_io_uring = NULL;
- }
-#endif
-
assert(QSLIST_EMPTY(&ctx->scheduled_coroutines));
qemu_bh_delete(ctx->co_schedule_bh);
@@ -461,29 +453,6 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx)
}
#endif
-#ifdef CONFIG_LINUX_IO_URING
-LuringState *aio_setup_linux_io_uring(AioContext *ctx, Error **errp)
-{
- if (ctx->linux_io_uring) {
- return ctx->linux_io_uring;
- }
-
- ctx->linux_io_uring = luring_init(errp);
- if (!ctx->linux_io_uring) {
- return NULL;
- }
-
- luring_attach_aio_context(ctx->linux_io_uring, ctx);
- return ctx->linux_io_uring;
-}
-
-LuringState *aio_get_linux_io_uring(AioContext *ctx)
-{
- assert(ctx->linux_io_uring);
- return ctx->linux_io_uring;
-}
-#endif
-
void aio_notify(AioContext *ctx)
{
/*
@@ -600,10 +569,6 @@ AioContext *aio_context_new(Error **errp)
ctx->linux_aio = NULL;
#endif
-#ifdef CONFIG_LINUX_IO_URING
- ctx->linux_io_uring = NULL;
-#endif
-
ctx->thread_pool = NULL;
qemu_rec_mutex_init(&ctx->lock);
timerlistgroup_init(&ctx->tlg, aio_timerlist_notify, ctx);
diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
index a4523e3dcc..a880243ec9 100644
--- a/util/fdmon-io_uring.c
+++ b/util/fdmon-io_uring.c
@@ -48,6 +48,7 @@
#include "qemu/error-report.h"
#include "qemu/rcu_queue.h"
#include "aio-posix.h"
+#include "trace.h"
enum {
FDMON_IO_URING_ENTRIES = 128, /* sq/cq ring size */
@@ -174,6 +175,9 @@ static void fdmon_io_uring_add_sqe(AioContext *ctx,
prep_sqe(sqe, opaque);
io_uring_sqe_set_data(sqe, cqe_handler);
+
+ trace_fdmon_io_uring_add_sqe(ctx, opaque, sqe->opcode, sqe->fd, sqe->off,
+ cqe_handler);
}
static void fdmon_special_cqe_handler(CqeHandler *cqe_handler)
@@ -290,6 +294,8 @@ static void cqe_handler_bh(void *opaque)
QSIMPLEQ_REMOVE_HEAD(ready_list, next);
+ trace_fdmon_io_uring_cqe_handler(ctx, cqe_handler,
+ cqe_handler->cqe.res);
cqe_handler->cb(cqe_handler);
}
diff --git a/block/trace-events b/block/trace-events
index 8e789e1f12..c9b4736ff8 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -62,15 +62,9 @@ qmp_block_stream(void *bs) "bs %p"
file_paio_submit(void *acb, void *opaque, int64_t offset, int count, int type) "acb %p opaque %p offset %"PRId64" count %d type %d"
# io_uring.c
-luring_init_state(void *s, size_t size) "s %p size %zu"
-luring_cleanup_state(void *s) "%p freed"
-luring_unplug_fn(void *s, int blocked, int queued, int inflight) "LuringState %p blocked %d queued %d inflight %d"
-luring_do_submit(void *s, int blocked, int queued, int inflight) "LuringState %p blocked %d queued %d inflight %d"
-luring_do_submit_done(void *s, int ret) "LuringState %p submitted to kernel %d"
-luring_co_submit(void *bs, void *s, void *luringcb, int fd, uint64_t offset, size_t nbytes, int type) "bs %p s %p luringcb %p fd %d offset %" PRId64 " nbytes %zd type %d"
-luring_process_completion(void *s, void *aiocb, int ret) "LuringState %p luringcb %p ret %d"
-luring_io_uring_submit(void *s, int ret) "LuringState %p ret %d"
-luring_resubmit_short_read(void *s, void *luringcb, int nread) "LuringState %p luringcb %p nread %d"
+luring_cqe_handler(void *req, int ret) "req %p ret %d"
+luring_co_submit(void *bs, void *req, int fd, uint64_t offset, size_t nbytes, int type) "bs %p req %p fd %d offset %" PRId64 " nbytes %zd type %d"
+luring_resubmit_short_read(void *req, int nread) "req %p nread %d"
# qcow2.c
qcow2_add_task(void *co, void *bs, void *pool, const char *action, int cluster_type, uint64_t host_offset, uint64_t offset, uint64_t bytes, void *qiov, size_t qiov_offset) "co %p bs %p pool %p: %s: cluster_type %d file_cluster_offset %" PRIu64 " offset %" PRIu64 " bytes %" PRIu64 " qiov %p qiov_offset %zu"
diff --git a/stubs/meson.build b/stubs/meson.build
index 63392f5e78..d157b06273 100644
--- a/stubs/meson.build
+++ b/stubs/meson.build
@@ -32,9 +32,6 @@ if have_block or have_ga
stub_ss.add(files('cpus-virtual-clock.c'))
stub_ss.add(files('icount.c'))
stub_ss.add(files('graph-lock.c'))
- if linux_io_uring.found()
- stub_ss.add(files('io_uring.c'))
- endif
if libaio.found()
stub_ss.add(files('linux-aio.c'))
endif
diff --git a/util/trace-events b/util/trace-events
index bd8f25fb59..540d662507 100644
--- a/util/trace-events
+++ b/util/trace-events
@@ -24,6 +24,10 @@ buffer_move_empty(const char *buf, size_t len, const char *from) "%s: %zd bytes
buffer_move(const char *buf, size_t len, const char *from) "%s: %zd bytes from %s"
buffer_free(const char *buf, size_t len) "%s: capacity %zd"
+# fdmon-io_uring.c
+fdmon_io_uring_add_sqe(void *ctx, void *opaque, int opcode, int fd, uint64_t off, void *cqe_handler) "ctx %p opaque %p opcode %d fd %d off %"PRId64" cqe_handler %p"
+fdmon_io_uring_cqe_handler(void *ctx, void *cqe_handler, int cqe_res) "ctx %p cqe_handler %p cqe_res %d"
+
# filemonitor-inotify.c
qemu_file_monitor_add_watch(void *mon, const char *dirpath, const char *filename, void *cb, void *opaque, int64_t id) "File monitor %p add watch dir='%s' file='%s' cb=%p opaque=%p id=%" PRId64
qemu_file_monitor_remove_watch(void *mon, const char *dirpath, int64_t id) "File monitor %p remove watch dir='%s' id=%" PRId64
--
2.49.0
* Re: [PATCH 1/3] aio-posix: treat io_uring setup failure as fatal
2025-04-01 14:27 ` [PATCH 1/3] aio-posix: treat io_uring setup failure as fatal Stefan Hajnoczi
@ 2025-04-25 15:51 ` Kevin Wolf
2025-04-28 18:35 ` Stefan Hajnoczi
0 siblings, 1 reply; 6+ messages in thread
From: Kevin Wolf @ 2025-04-25 15:51 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: qemu-devel, Aarushi Mehta, Stefano Garzarella, surajshirvankar,
Hanna Reitz, qemu-block, Paolo Bonzini, Fam Zheng
On 01.04.2025 at 16:27, Stefan Hajnoczi wrote:
> In the early days of io_uring it was possible for io_uring_setup(2) to
> fail due to exhausting RLIMIT_MEMLOCK. QEMU's solution was to fall back
> to epoll(7) or ppoll(2) when io_uring could not be used in an
> AioContext.
>
> Nowadays io_uring memory is accounted differently so io_uring_setup(2)
> won't fail. Treat failure as a fatal error. Keep it simple: io_uring is
> available if and only if CONFIG_LINUX_IO_URING is defined.
>
> Upcoming features that rely on io_uring won't need to handle the case
> where a subset of AioContexts lacks io_uring. This will simplify the
> aio_add_sqe() API introduced in the next commit.
>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
While making failure to set up io_uring for an AioContext a hard error
feels fine to me, I feel a bit uneasy about having fatal errors like
this in functions that aren't only called during startup. This function
is also called when adding an iothread object at runtime. The failure
mode that I would expect there is that object-add returns an error, not
that the whole QEMU process exits.
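Just to sketch what I mean (not a request for this exact code), the setup
function could take an Error **errp and leave the decision to the caller:

  bool fdmon_io_uring_setup(AioContext *ctx, Error **errp)
  {
      int ret = io_uring_queue_init(FDMON_IO_URING_ENTRIES,
                                    &ctx->fdmon_io_uring, 0);
      if (ret != 0) {
          error_setg_errno(errp, -ret, "failed to initialize io_uring");
          return false;
      }
      ...
      return true;
  }

Then aio_context_new() and object-add for an iothread could fail cleanly
instead of taking down the whole process.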
Kevin
* Re: [PATCH 1/3] aio-posix: treat io_uring setup failure as fatal
2025-04-25 15:51 ` Kevin Wolf
@ 2025-04-28 18:35 ` Stefan Hajnoczi
0 siblings, 0 replies; 6+ messages in thread
From: Stefan Hajnoczi @ 2025-04-28 18:35 UTC (permalink / raw)
To: Kevin Wolf
Cc: qemu-devel, Aarushi Mehta, Stefano Garzarella, surajshirvankar,
Hanna Reitz, qemu-block, Paolo Bonzini, Fam Zheng
On Fri, Apr 25, 2025 at 05:51:43PM +0200, Kevin Wolf wrote:
> On 01.04.2025 at 16:27, Stefan Hajnoczi wrote:
> > In the early days of io_uring it was possible for io_uring_setup(2) to
> > fail due to exhausting RLIMIT_MEMLOCK. QEMU's solution was to fall back
> > to epoll(7) or ppoll(2) when io_uring could not be used in an
> > AioContext.
> >
> > Nowadays io_uring memory is accounted differently so io_uring_setup(2)
> > won't fail. Treat failure as a fatal error. Keep it simple: io_uring is
> > available if and only if CONFIG_LINUX_IO_URING is defined.
> >
> > Upcoming features that rely on io_uring won't need to handle the case
> > where a subset of AioContexts lacks io_uring. This will simplify the
> > aio_add_sqe() API introduced in the next commit.
> >
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
>
> While making failure to set up io_uring for an AioContext a hard error
> feels fine to me, I feel a bit uneasy about having fatal errors like
> this in functions that aren't only called during startup. This function
> is also called when adding an iothread object at runtime. The failure
> mode that I would expect there is that object-add returns an error, not
> that the whole QEMU process exits.
Thanks for bringing this up. I'll fix it in the next revision.
Stefan