From: Fiona Ebner <f.ebner@proxmox.com>
To: Jens Axboe <axboe@kernel.dk>, linux-kernel@vger.kernel.org
Cc: hannes@cmpxchg.org, surenb@google.com, peterz@infradead.org,
io-uring@vger.kernel.org,
Thomas Lamprecht <t.lamprecht@proxmox.com>
Subject: Re: io_uring_prep_timeout() leading to an IO pressure close to 100
Date: Fri, 24 Apr 2026 17:42:25 +0200
Message-ID: <db7e6abb-677b-4b63-a028-d8fe0bec0277@proxmox.com>
In-Reply-To: <563f9b5f-9649-4a98-9025-671af55f29d7@proxmox.com>
Hi Jens,
On 02.04.26 at 2:30 PM, Fiona Ebner wrote:
> On 02.04.26 at 11:12 AM, Fiona Ebner wrote:
>> On 01.04.26 at 5:02 PM, Jens Axboe wrote:
>>> On 4/1/26 8:59 AM, Fiona Ebner wrote:
>>>> I'm currently investigating an issue with QEMU causing an IO pressure
>>>> value of nearly 100 when io_uring is used for the event loop of a QEMU
>>>> iothread (which is the case since QEMU 10.2 if io_uring is enabled
>>>> during configuration and available).
>>>
>>> It's not "IO pressure", it's the useless iowait metric...
>>
>> But it is reported as IO pressure by the kernel, i.e. /proc/pressure/io
>> (and for a cgroup, /sys/fs/cgroup/foo.slice/bar.scope/io.pressure).
>>
>>>> The cause seems to be the io_uring_prep_timeout() call that is used for
>>>> blocking wait. I attached a minimal reproducer below, which exposes the
>>>> issue [0].
>>>>
>>>> This was observed on a kernel based on 7.0-rc6 as well as 6.17.13. I
>>>> haven't investigated what happens inside the kernel yet, so I don't know
>>>> if it is an accounting issue or within io_uring.
>>>>
>>>> Let me know if you need more information or if I should test something
>>>> specific.
>>>
>>> If you don't want it, just turn it off with io_uring_set_iowait().
>>
>> QEMU does submit actual IO request on the same ring and I suppose iowait
>> should still be used for those?
>>
>> Maybe setting the IORING_ENTER_NO_IOWAIT flag if only the timeout
>> request is being submitted and no actual IO requests is an option? But
>> even then, if a request is submitted later via another thread, iowait
>> for that new request won't be accounted for, right?
>>
>> Is there a way to say "I don't want IO wait for timeout submissions"?
>> Wouldn't that even make sense by default?
>
> It turns out that in my QEMU instances, the branch doing the
> io_uring_prep_timeout() call is not actually taken, so while the issue
> could arise like that too, it's different in this practical case.
>
> What I'm actually seeing is io_uring_submit_and_wait() being called with
> wait_nr=1 while there is nothing else going on. So a more accurate
> reproducer for the scenario is attached below [0]. Note that it does not
> happen without submitting+completing a single request first.

I have now started digging into the kernel and am wondering whether the
number of in-flight requests is tracked correctly. Does
current_pending_io() need to take tctx->cached_refs into account?

In __io_cqring_wait_schedule(), there is

> 	if (ext_arg->iowait && current_pending_io())
> 		current->in_iowait = 1;

and current_pending_io() is

> static bool current_pending_io(void)
> {
> 	struct io_uring_task *tctx = current->io_uring;
>
> 	if (!tctx)
> 		return false;
> 	return percpu_counter_read_positive(&tctx->inflight);
> }
so okay, we get iowait when tctx->inflight is positive. Looking at where
that variable is modified, I found
> void io_task_refs_refill(struct io_uring_task *tctx)
> {
> 	unsigned int refill = -tctx->cached_refs + IO_TCTX_REFS_CACHE_NR;
>
> 	percpu_counter_add(&tctx->inflight, refill);
> 	refcount_add(refill, &current->usage);
> 	tctx->cached_refs += refill;
> }
as well as io_put_task() and io_uring_drop_tctx_refs().

To be able to trace __io_cqring_wait_schedule() and io_put_task(), I made
them non-static and non-inline. I then wrote the bpftrace script below
[1] and ran the reproducer [0], which produced the following output:
> Attaching 6 probes...
> 12104: io_task_refs_refill: cached: -1 inflight: 0
> 12104: ret io_task_refs_refill: cached: 1024 inflight: 1025
> 12104: io_put_task: cached: 1024 inflight: 1025
> 12104: ret io_put_task: cached: 1025 inflight: 1025
> 12104: __io_cqring_wait_schedule: iowait: 1
> 12104: __io_cqring_wait_schedule: inflight: 1025
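And then it's stuck, as expected, but AFAICS with current->in_iowait
set, which seems surprising to me: at that point inflight (1025) equals
cached_refs (1025), so, if I read the counters correctly, no request is
actually pending.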
Best Regards,
Fiona
[1]:
> kfunc::io_task_refs_refill
> {
>     printf("%d: %s: cached: %d inflight: %d\n",
>         tid,
>         func,
>         ((struct io_uring_task*)args.tctx)->cached_refs,
>         ((struct io_uring_task*)args.tctx)->inflight.count
>     );
> }
>
> kretfunc::io_task_refs_refill
> {
>     printf("%d: ret %s: cached: %d inflight: %d\n",
>         tid,
>         func,
>         ((struct io_uring_task*)args.tctx)->cached_refs,
>         ((struct io_uring_task*)args.tctx)->inflight.count
>     );
> }
>
> kfunc:io_uring_drop_tctx_refs
> {
>     printf("%d: %s\n", tid, func);
> }
>
> kfunc:__io_cqring_wait_schedule
> {
>     printf("%d: %s: iowait: %d\n",
>         tid,
>         func,
>         ((struct ext_arg*)args.ext_arg)->iowait
>     );
>     if (curtask->io_uring) {
>         printf("%d: %s: inflight: %d\n",
>             tid,
>             func,
>             curtask->io_uring->inflight.count
>         );
>     } else {
>         printf("%d: %s: got no tctx!\n", tid, func);
>     }
> }
>
> kfunc:io_put_task
> {
>     printf("%d: %s: cached: %d inflight: %d\n",
>         tid,
>         func,
>         ((struct io_kiocb*)args.req)->tctx->cached_refs,
>         ((struct io_kiocb*)args.req)->tctx->inflight.count
>     );
> }
>
> kretfunc:io_put_task
> {
>     printf("%d: ret %s: cached: %d inflight: %d\n",
>         tid,
>         func,
>         ((struct io_kiocb*)args.req)->tctx->cached_refs,
>         ((struct io_kiocb*)args.req)->tctx->inflight.count
>     );
> }
>
> [0]:
>
> #include <errno.h>
> #include <stdio.h>
> #include <unistd.h>
> #include <liburing.h>
>
> int main(void) {
>     int fd;
>     int ret;
>     struct io_uring ring;
>     struct io_uring_sqe *sqe;
>
>     ret = io_uring_queue_init(128, &ring, 0);
>     if (ret != 0) {
>         printf("Failed to initialize io_uring\n");
>         return ret;
>     }
>
>     // before submitting+advancing the issue does not happen
>     // ret = io_uring_submit_and_wait(&ring, 1);
>     // printf("got ret %d\n", ret);
>
>     sqe = io_uring_get_sqe(&ring);
>     if (!sqe) {
>         printf("Full sq\n");
>         return -1;
>     }
>
>     io_uring_prep_nop(sqe);
>
>     do {
>         ret = io_uring_submit_and_wait(&ring, 1);
>     } while (ret == -EINTR);
>
>     if (ret != 1) {
>         printf("Expected to submit one\n");
>         return -1;
>     }
>
>     // using peek+seen has the same effect
>     // struct io_uring_cqe* cqe;
>     // io_uring_peek_cqe(&ring, &cqe);
>     // io_uring_cqe_seen(&ring, cqe);
>     io_uring_cq_advance(&ring, 1);
>
>     ret = io_uring_submit_and_wait(&ring, 1);
>     printf("got ret %d\n", ret);
>
>     io_uring_queue_exit(&ring);
>
>     return 0;
> }
>