From: Stefan Hajnoczi <stefanha@redhat.com>
To: Eric Blake <eblake@redhat.com>
Cc: qemu-devel@nongnu.org, qemu-block@nongnu.org,
hibriansong@gmail.com, Kevin Wolf <kwolf@redhat.com>,
Hanna Czenczek <hreitz@redhat.com>
Subject: Re: [RFC 11/11] block/io_uring: use aio_add_sqe()
Date: Thu, 5 Jun 2025 14:40:09 -0400 [thread overview]
Message-ID: <20250605184009.GB481264@fedora> (raw)
In-Reply-To: <ofidabivimcy5qpuxqeni6e6lfwvfhutg5nxychdhgqs5cdkrr@nspmykro2ap5>
[-- Attachment #1: Type: text/plain, Size: 11791 bytes --]
On Thu, May 29, 2025 at 04:11:21PM -0500, Eric Blake wrote:
> On Wed, May 28, 2025 at 03:09:16PM -0400, Stefan Hajnoczi wrote:
> > AioContext has its own io_uring instance for file descriptor monitoring.
> > The disk I/O io_uring code was developed separately. Originally I
> > thought the characteristics of file descriptor monitoring and disk I/O
> > were too different, requiring separate io_uring instances.
> >
> > Now it has become clear to me that it's feasible to share a single
> > io_uring instance for file descriptor monitoring and disk I/O. We're not
> > using io_uring's IOPOLL feature or anything else that would require a
> > separate instance.
> >
> > Unify block/io_uring.c and util/fdmon-io_uring.c using the new
> > aio_add_sqe() API that allows user-defined io_uring sqe submission. Now
> > block/io_uring.c just needs to submit readv/writev/fsync and most of the
> > io_uring-specific logic is handled by fdmon-io_uring.c.
> >
> > There are two immediate advantages:
> > 1. Fewer system calls. There is no need to monitor the disk I/O io_uring
> > ring fd from the file descriptor monitoring io_uring instance. Disk
> > I/O completions are now picked up directly. Also, sqes are
> > accumulated in the sq ring until the end of the event loop iteration
> > and there are fewer io_uring_enter(2) syscalls.
> > 2. Less code duplication.
> >
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
>
> Comments below, but looks sane to me.
>
> Reviewed-by: Eric Blake <eblake@redhat.com>
>
> > include/block/aio.h | 7 -
> > include/block/raw-aio.h | 5 -
> > block/file-posix.c | 38 ++--
> > block/io_uring.c | 489 ++++++++++------------------------------
> > stubs/io_uring.c | 32 ---
> > util/async.c | 35 ---
> > util/fdmon-io_uring.c | 6 +
> > block/trace-events | 12 +-
> > stubs/meson.build | 3 -
> > util/trace-events | 4 +
> > 10 files changed, 139 insertions(+), 492 deletions(-)
> > delete mode 100644 stubs/io_uring.c
> >
> > diff --git a/include/block/aio.h b/include/block/aio.h
> > index 95beef28c3..fbb45cca74 100644
> > --- a/include/block/aio.h
> > +++ b/include/block/aio.h
> > @@ -291,8 +291,6 @@ struct AioContext {
> > struct LinuxAioState *linux_aio;
> > #endif
> > #ifdef CONFIG_LINUX_IO_URING
> > - LuringState *linux_io_uring;
> > -
> > /* State for file descriptor monitoring using Linux io_uring */
> > struct io_uring fdmon_io_uring;
> > AioHandlerSList submit_list;
> > @@ -597,11 +595,6 @@ struct LinuxAioState *aio_setup_linux_aio(AioContext *ctx, Error **errp);
> > /* Return the LinuxAioState bound to this AioContext */
> > struct LinuxAioState *aio_get_linux_aio(AioContext *ctx);
> >
> > -/* Setup the LuringState bound to this AioContext */
> > -LuringState *aio_setup_linux_io_uring(AioContext *ctx, Error **errp);
> > -
> > -/* Return the LuringState bound to this AioContext */
> > -LuringState *aio_get_linux_io_uring(AioContext *ctx);
> > /**
> > * aio_timer_new_with_attrs:
> > * @ctx: the aio context
> > diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
> > index 6570244496..30e5fc9a9f 100644
> > --- a/include/block/raw-aio.h
> > +++ b/include/block/raw-aio.h
> > @@ -74,15 +74,10 @@ static inline bool laio_has_fua(void)
> > #endif
> > /* io_uring.c - Linux io_uring implementation */
> > #ifdef CONFIG_LINUX_IO_URING
> > -LuringState *luring_init(Error **errp);
> > -void luring_cleanup(LuringState *s);
> > -
> > /* luring_co_submit: submit I/O requests in the thread's current AioContext. */
> > int coroutine_fn luring_co_submit(BlockDriverState *bs, int fd, uint64_t offset,
> > QEMUIOVector *qiov, int type,
> > BdrvRequestFlags flags);
> > -void luring_detach_aio_context(LuringState *s, AioContext *old_context);
> > -void luring_attach_aio_context(LuringState *s, AioContext *new_context);
> > bool luring_has_fua(void);
> > #else
> > static inline bool luring_has_fua(void)
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index 9b5f08ccb2..d1f1fc3a77 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -755,14 +755,23 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
> > }
> > #endif /* !defined(CONFIG_LINUX_AIO) */
> >
> > -#ifndef CONFIG_LINUX_IO_URING
> > if (s->use_linux_io_uring) {
> > +#ifdef CONFIG_LINUX_IO_URING
> > + if (!aio_has_io_uring()) {
>
> Compared to the old code... [1]
>
> > + error_setg(errp, "aio=io_uring was specified, but is not "
> > + "available (disabled via io_uring_disabled "
> > + "sysctl or blocked by container runtime "
> > + "seccomp policy?)");
> > + ret = -EINVAL;
> > + goto fail;
> > + }
> > +#else
> > error_setg(errp, "aio=io_uring was specified, but is not supported "
> > "in this build.");
>
> While here, let's get rid of the trailing '.' in the error_setg call.
Sounds good.
>
>
> > ret = -EINVAL;
> > goto fail;
> > - }
> > #endif /* !defined(CONFIG_LINUX_IO_URING) */
> > + }
> >
> > s->has_discard = true;
> > s->has_write_zeroes = true;
> > @@ -2522,27 +2531,6 @@ static bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov)
> > return true;
> > }
> >
> > -#ifdef CONFIG_LINUX_IO_URING
> > -static inline bool raw_check_linux_io_uring(BDRVRawState *s)
> > -{
> > - Error *local_err = NULL;
> > - AioContext *ctx;
> > -
> > - if (!s->use_linux_io_uring) {
> > - return false;
> > - }
> > -
> > - ctx = qemu_get_current_aio_context();
> > - if (unlikely(!aio_setup_linux_io_uring(ctx, &local_err))) {
>
> [1]... is there a reason you dropped the unlikely() wrapper?
raw_check_linux_io_uring() was called from the I/O code path where a
branch prediction hint might make sense for performance reasons. The
raw_open_common() you mentioned above is not in the I/O code path.
Also, the new code in raw_open_common() is not the same as
raw_check_linux_io_uring(). aio_setup_linux_io_uring() no longer exists
in the new code. The patch removes this code entirely, it doesn't just
drop unlikely().
>
> > - error_reportf_err(local_err, "Unable to use linux io_uring, "
> > - "falling back to thread pool: ");
> > - s->use_linux_io_uring = false;
> > - return false;
> > - }
> > - return true;
> > -}
> > -#endif
> > -
> > #ifdef CONFIG_LINUX_AIO
> > static inline bool raw_check_linux_aio(BDRVRawState *s)
> > {
> > @@ -2595,7 +2583,7 @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, int64_t *offset_ptr,
> > if (s->needs_alignment && !bdrv_qiov_is_aligned(bs, qiov)) {
> > type |= QEMU_AIO_MISALIGNED;
> > #ifdef CONFIG_LINUX_IO_URING
> > - } else if (raw_check_linux_io_uring(s)) {
> > + } else if (s->use_linux_io_uring) {
> > assert(qiov->size == bytes);
> > ret = luring_co_submit(bs, s->fd, offset, qiov, type, flags);
> > goto out;
> > @@ -2692,7 +2680,7 @@ static int coroutine_fn raw_co_flush_to_disk(BlockDriverState *bs)
> > };
> >
> > #ifdef CONFIG_LINUX_IO_URING
> > - if (raw_check_linux_io_uring(s)) {
> > + if (s->use_linux_io_uring) {
> > return luring_co_submit(bs, s->fd, 0, NULL, QEMU_AIO_FLUSH, 0);
> > }
> > #endif
> > diff --git a/block/io_uring.c b/block/io_uring.c
> > index dd4f304910..dd930ee57e 100644
> > --- a/block/io_uring.c
> > +++ b/block/io_uring.c
> > @@ -11,28 +11,20 @@
> > #include "qemu/osdep.h"
> > #include <liburing.h>
> > #include "block/aio.h"
> > -#include "qemu/queue.h"
> > #include "block/block.h"
> > #include "block/raw-aio.h"
> > #include "qemu/coroutine.h"
> > -#include "qemu/defer-call.h"
> > -#include "qapi/error.h"
> > #include "system/block-backend.h"
> > #include "trace.h"
> >
> > -/* Only used for assertions. */
> > -#include "qemu/coroutine_int.h"
> > -
> > -/* io_uring ring size */
> > -#define MAX_ENTRIES 128
> > -
> > -typedef struct LuringAIOCB {
> > +typedef struct {
> > Coroutine *co;
> > - struct io_uring_sqe sqeq;
> > - ssize_t ret;
> > QEMUIOVector *qiov;
> > - bool is_read;
> > - QSIMPLEQ_ENTRY(LuringAIOCB) next;
> > + uint64_t offset;
> > + ssize_t ret;
> > + int type;
> > + int fd;
> > + BdrvRequestFlags flags;
> >
> > /*
> > * Buffered reads may require resubmission, see
> > @@ -40,36 +32,51 @@ typedef struct LuringAIOCB {
> > */
> > int total_read;
> > QEMUIOVector resubmit_qiov;
> > -} LuringAIOCB;
> >
> > -typedef struct LuringQueue {
> > - unsigned int in_queue;
> > - unsigned int in_flight;
> > - bool blocked;
> > - QSIMPLEQ_HEAD(, LuringAIOCB) submit_queue;
> > -} LuringQueue;
> > + CqeHandler cqe_handler;
> > +} LuringRequest;
> >
> > -struct LuringState {
> > - AioContext *aio_context;
> > -
> > - struct io_uring ring;
> > -
> > - /* No locking required, only accessed from AioContext home thread */
> > - LuringQueue io_q;
> > -
> > - QEMUBH *completion_bh;
> > -};
> > -
> > -/**
> > - * luring_resubmit:
> > - *
> > - * Resubmit a request by appending it to submit_queue. The caller must ensure
> > - * that ioq_submit() is called later so that submit_queue requests are started.
> > - */
> > -static void luring_resubmit(LuringState *s, LuringAIOCB *luringcb)
> > +static void luring_prep_sqe(struct io_uring_sqe *sqe, void *opaque)
> > {
> > - QSIMPLEQ_INSERT_TAIL(&s->io_q.submit_queue, luringcb, next);
> > - s->io_q.in_queue++;
> > + LuringRequest *req = opaque;
> > + QEMUIOVector *qiov = req->qiov;
> > + uint64_t offset = req->offset;
> > + int fd = req->fd;
> > + BdrvRequestFlags flags = req->flags;
> > +
> > + switch (req->type) {
> > + case QEMU_AIO_WRITE:
> > +#ifdef HAVE_IO_URING_PREP_WRITEV2
> > + {
> > + int luring_flags = (flags & BDRV_REQ_FUA) ? RWF_DSYNC : 0;
> > + io_uring_prep_writev2(sqe, fd, qiov->iov,
> > + qiov->niov, offset, luring_flags);
> > + }
> > +#else
> > + assert(flags == 0);
> > + io_uring_prep_writev(sqe, fd, qiov->iov, qiov->niov, offset);
>
> Hmm. 'man io_uring_prep_writev2' states:
>
> Unless an application explicitly needs to pass in more than one iovec,
> it is more efficient to use io_uring_prep_write(3) rather than this
> function, as no state has to be maintained for a non-vectored IO re‐
> quest.
>
> Obviously, if we want luring_flags of RWF_DSYNC to be set, we have to
> use the newer interface; but if that flag is not present, should we be
> conditionally falling back to the simpler interface when qiov->niov ==
> 1?
Thanks for the suggestion. I will try it out and benchmark it for v2.
>
> In fact, even when qiov->niov > 1, can we unvector it ourselves into
> multiple io_uring_prep_write() calls, since the whole point of the
> uring is that we aren't making syscalls, so more ops on the uring
> should still be cheaper? But that's a question for a followup patch
> (still, given that your series is RFC for performance reasons, it may
> be worth investigating).
That's more complicated. I won't pursue that for the time being.
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
prev parent reply other threads:[~2025-06-05 18:40 UTC|newest]
Thread overview: 33+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-05-28 19:09 [RFC 00/11] aio: add the aio_add_sqe() io_uring API Stefan Hajnoczi
2025-05-28 19:09 ` [RFC 01/11] aio-posix: fix polling mode with fdmon-io_uring Stefan Hajnoczi
2025-05-28 20:29 ` Eric Blake
2025-05-28 19:09 ` [RFC 02/11] aio-posix: keep polling enabled with fdmon-io_uring.c Stefan Hajnoczi
2025-05-28 20:34 ` Eric Blake
2025-05-28 19:09 ` [RFC 03/11] tests/unit: skip test-nested-aio-poll with io_uring Stefan Hajnoczi
2025-05-28 20:40 ` Eric Blake
2025-05-28 19:09 ` [RFC 04/11] aio-posix: integrate fdmon into glib event loop Stefan Hajnoczi
2025-05-28 21:01 ` Eric Blake
2025-05-28 19:09 ` [RFC 05/11] aio: remove aio_context_use_g_source() Stefan Hajnoczi
2025-05-28 21:02 ` Eric Blake
2025-05-28 19:09 ` [RFC 06/11] aio: free AioContext when aio_context_new() fails Stefan Hajnoczi
2025-05-28 21:06 ` Eric Blake
2025-06-05 17:49 ` Stefan Hajnoczi
2025-05-28 19:09 ` [RFC 07/11] aio: add errp argument to aio_context_setup() Stefan Hajnoczi
2025-05-28 21:07 ` Eric Blake
2025-05-28 19:09 ` [RFC 08/11] aio-posix: gracefully handle io_uring_queue_init() failure Stefan Hajnoczi
2025-05-28 22:12 ` Eric Blake
2025-05-29 15:38 ` Stefan Hajnoczi
2025-06-03 6:05 ` Markus Armbruster
2025-06-03 18:48 ` Stefan Hajnoczi
2025-06-02 12:26 ` Brian
2025-06-02 20:20 ` Stefan Hajnoczi
2025-06-02 22:37 ` Brian
2025-05-28 19:09 ` [RFC 09/11] aio-posix: add aio_add_sqe() API for user-defined io_uring requests Stefan Hajnoczi
2025-05-28 22:15 ` Eric Blake
2025-05-29 20:02 ` Eric Blake
2025-06-05 17:52 ` Stefan Hajnoczi
2025-05-28 19:09 ` [RFC 10/11] aio-posix: avoid EventNotifier for cqe_handler_bh Stefan Hajnoczi
2025-05-29 20:09 ` Eric Blake
2025-05-28 19:09 ` [RFC 11/11] block/io_uring: use aio_add_sqe() Stefan Hajnoczi
2025-05-29 21:11 ` Eric Blake
2025-06-05 18:40 ` Stefan Hajnoczi [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250605184009.GB481264@fedora \
--to=stefanha@redhat.com \
--cc=eblake@redhat.com \
--cc=hibriansong@gmail.com \
--cc=hreitz@redhat.com \
--cc=kwolf@redhat.com \
--cc=qemu-block@nongnu.org \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.