All of lore.kernel.org
 help / color / mirror / Atom feed
From: Stefan Hajnoczi <stefanha@redhat.com>
To: Jens Axboe <axboe@kernel.dk>
Cc: Kevin Wolf <kwolf@redhat.com>,
	qemu-block@nongnu.org, qemu-devel@nongnu.org, fam@euphon.net,
	Fiona Ebner <f.ebner@proxmox.com>
Subject: Re: [PATCH 1/2] fdmon-io_uring: notify main loop when SQEs are queued
Date: Wed, 18 Feb 2026 15:02:06 -0500	[thread overview]
Message-ID: <20260218200206.GA605390@fedora> (raw)
In-Reply-To: <7811fc80-b2a1-4857-95f5-4bf01c806d07@kernel.dk>

[-- Attachment #1: Type: text/plain, Size: 7400 bytes --]

On Wed, Feb 18, 2026 at 09:17:57AM -0700, Jens Axboe wrote:
> On 2/18/26 9:06 AM, Stefan Hajnoczi wrote:
> > On Wed, Feb 18, 2026 at 10:57:02AM +0100, Fiona Ebner wrote:
> >> Am 13.02.26 um 5:05 PM schrieb Kevin Wolf:
> >>> Am 13.02.2026 um 15:26 hat Jens Axboe geschrieben:
> >>>> When a vCPU thread handles MMIO (holding BQL), aio_co_enter() runs the
> >>>> block I/O coroutine inline on the vCPU thread because
> >>>> qemu_get_current_aio_context() returns the main AioContext when BQL is
> >>>> held. The coroutine calls luring_co_submit() which queues an SQE via
> >>>> fdmon_io_uring_add_sqe(), but the actual io_uring_submit() only happens
> >>>> in gsource_prepare() on the main loop thread.
> >>>
> >>> Ouch! Yes, looks like we completely missed I/O submitted in vCPU threads
> >>> in the recent changes (or I guess worker threads in theory, but I don't
> >>> think there any that actually make use of aio_add_sqe()).
> >>>
> >>>> Since the coroutine ran inline (not via aio_co_schedule()), no BH is
> >>>> scheduled and aio_notify() is never called. The main loop remains asleep
> >>>> in ppoll() with up to a 499ms timeout, leaving the SQE unsubmitted until
> >>>> the next timer fires.
> >>>>
> >>>> Fix this by calling aio_notify() after queuing the SQE. This wakes the
> >>>> main loop via the eventfd so it can run gsource_prepare() and submit the
> >>>> pending SQE promptly.
> >>>>
> >>>> This is a generic fix that benefits all devices using aio=io_uring.
> >>>> Without it, AHCI/SATA devices see MUCH worse I/O latency since they use
> >>>> MMIO (not ioeventfd like virtio) and have no other mechanism to wake the
> >>>> main loop after queuing block I/O.
> >>>>
> >>>> This is usually a bit hard to detect, as it also relies on the ppoll
> >>>> loop not waking up for other activity, and micro benchmarks tend not to
> >>>> see it because they don't have any real processing time. With a
> >>>> synthetic test case that has a few usleep() to simulate processing of
> >>>> read data, it's very noticeable. The below example reads 128MB with
> >>>> O_DIRECT in 128KB chunks in batches of 16, and has a 1ms delay before
> >>>> each batch submit, and a 1ms delay after processing each completion.
> >>>> Running it on /dev/sda yields:
> >>>>
> >>>> time sudo ./iotest /dev/sda
> >>>>
> >>>> ________________________________________________________
> >>>> Executed in   25.76 secs      fish           external
> >>>>    usr time    6.19 millis  783.00 micros    5.41 millis
> >>>>    sys time   12.43 millis  642.00 micros   11.79 millis
> >>>>
> >>>> while on a virtio-blk or NVMe device we get:
> >>>>
> >>>> time sudo ./iotest /dev/vdb
> >>>>
> >>>> ________________________________________________________
> >>>> Executed in    1.25 secs      fish           external
> >>>>    usr time    1.40 millis    0.30 millis    1.10 millis
> >>>>    sys time   17.61 millis    1.43 millis   16.18 millis
> >>>>
> >>>> time sudo ./iotest /dev/nvme0n1
> >>>>
> >>>> ________________________________________________________
> >>>> Executed in    1.26 secs      fish           external
> >>>>    usr time    6.11 millis    0.52 millis    5.59 millis
> >>>>    sys time   13.94 millis    1.50 millis   12.43 millis
> >>>>
> >>>> where the latter are consistent. If we run the same test but keep the
> >>>> socket for the ssh connection active by having activity there, then
> >>>> the sda test looks as follows:
> >>>>
> >>>> time sudo ./iotest /dev/sda
> >>>>
> >>>> ________________________________________________________
> >>>> Executed in    1.23 secs      fish           external
> >>>>    usr time    2.70 millis   39.00 micros    2.66 millis
> >>>>    sys time    4.97 millis  977.00 micros    3.99 millis
> >>>>
> >>>> as now the ppoll loop is woken all the time anyway.
> >>>>
> >>>> After this fix, on an idle system:
> >>>>
> >>>> time sudo ./iotest /dev/sda
> >>>>
> >>>> ________________________________________________________
> >>>> Executed in    1.30 secs      fish           external
> >>>>    usr time    2.14 millis    0.14 millis    2.00 millis
> >>>>    sys time   16.93 millis    1.16 millis   15.76 millis
> >>>>
> >>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> >>>> ---
> >>>>  util/fdmon-io_uring.c | 8 ++++++++
> >>>>  1 file changed, 8 insertions(+)
> >>>>
> >>>> diff --git a/util/fdmon-io_uring.c b/util/fdmon-io_uring.c
> >>>> index d0b56127c670..96392876b490 100644
> >>>> --- a/util/fdmon-io_uring.c
> >>>> +++ b/util/fdmon-io_uring.c
> >>>> @@ -181,6 +181,14 @@ static void fdmon_io_uring_add_sqe(AioContext *ctx,
> >>>>  
> >>>>      trace_fdmon_io_uring_add_sqe(ctx, opaque, sqe->opcode, sqe->fd, sqe->off,
> >>>>                                   cqe_handler);
> >>>> +
> >>>> +    /*
> >>>> +     * Wake the main loop if it is sleeping in ppoll().  When a vCPU thread
> >>>> +     * runs a coroutine inline (holding BQL), it queues SQEs here but the
> >>>> +     * actual io_uring_submit() only happens in gsource_prepare().  Without
> >>>> +     * this notify, ppoll() can sleep up to 499ms before submitting.
> >>>> +     */
> >>>> +    aio_notify(ctx);
> >>>>  }
> >>>
> >>> Makes sense to me.
> >>>
> >>> At first I wondered if we should use defer_call() for the aio_notify()
> >>> to batch the submission, but of course holding the BQL will already take
> >>> care of that. And in iothreads where there is no BQL, the aio_notify()
> >>> shouldn't make a difference anyway because we're already in the right
> >>> thread.
> >>>
> >>> I suppose the other variation could be have another io_uring_enter()
> >>> call here (but then probably really through defer_call()) to avoid
> >>> waiting for another CPU to submit the request in its main loop. But I
> >>> don't really have an intuition if that would make things better or worse
> >>> in the common case.
> > 
> > It's possible to call io_uring_enter(). QEMU currently doesn't use
> > IORING_SETUP_SINGLE_ISSUER, so it's okay for multiple threads to call
> > io_uring_enter() on the same io_uring fd.
> 
> I would not recommend that, see below.
> 
> > I experimented with IORING_SETUP_SINGLE_ISSUER (as well as
> > IORING_SETUP_COOP_TASKRUN and IORING_SETUP_TASKRUN_FLAG) in the past and
> > didn't measure a performance improvement:
> > https://lore.kernel.org/qemu-devel/20250724204702.576637-1-stefanha@redhat.com/
> > 
> > Jens, any advice regarding these flags?
> 
> None other than "yes you should use them" - it's an expanding area of
> "let's make that faster", so if you tested something older, then that
> may be why as we didn't have a lot earlier. We're toying with getting
> rid of the uring_lock for SINGLE_ISSUER, for example.
> 
> Hence I think having multiple threads do enter is a design mistake, and
> one that might snowball down the line and make it harder to step back
> and make SINGLE_ISSUER work for you. Certain features also end up being
> gated behing DEFER_TASKRUN, which requires SINGLE_ISSUER as well.
> 
> tldr - don't have multiple threads do enter on the same ring, ever, if
> it can be avoided. It's a design mistake.

That's useful information, thanks. I will resurrect the patches to add
modern io_uring_setup() flags and we'll document the assumption that
only one thread invokes io_uring_enter().

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

  reply	other threads:[~2026-02-18 20:03 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-02-13 14:26 [PATCHSET 0/2] io_uring fixes Jens Axboe
2026-02-13 14:26 ` [PATCH 1/2] fdmon-io_uring: notify main loop when SQEs are queued Jens Axboe
2026-02-13 16:04   ` Kevin Wolf
2026-02-18  9:57     ` Fiona Ebner
2026-02-18 16:06       ` Stefan Hajnoczi
2026-02-18 16:17         ` Jens Axboe
2026-02-18 20:02           ` Stefan Hajnoczi [this message]
2026-02-18 16:11       ` Stefan Hajnoczi
2026-02-18 16:19         ` Jens Axboe
2026-02-18 16:41           ` [PATCH v2] aio-posix: " Jens Axboe
2026-02-18 20:57             ` Stefan Hajnoczi
2026-02-19 14:27               ` Jens Axboe
2026-02-19 15:49             ` Kevin Wolf
2026-02-23 13:53               ` Stefan Hajnoczi
2026-02-18 15:56     ` [PATCH 1/2] fdmon-io_uring: " Stefan Hajnoczi
2026-02-13 14:26 ` [PATCH 2/2] fdmon-io_uring: check CQ ring directly in gsource_check Jens Axboe
2026-02-13 16:22   ` Kevin Wolf
2026-02-18 16:24   ` Stefan Hajnoczi
2026-02-18 10:07 ` [PATCHSET 0/2] io_uring fixes Fiona Ebner
2026-03-03 11:52 ` Fiona Ebner
2026-03-03 16:51   ` Jens Axboe
2026-03-08 12:11   ` Michael Tokarev

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260218200206.GA605390@fedora \
    --to=stefanha@redhat.com \
    --cc=axboe@kernel.dk \
    --cc=f.ebner@proxmox.com \
    --cc=fam@euphon.net \
    --cc=kwolf@redhat.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.