Excessive IO PSI for iothread when using io

* Excessive IO PSI for iothread when using io_uring since QEMU 10.2
@ 2026-04-24 10:25 Fiona Ebner
  2026-04-27 19:13 ` Stefan Hajnoczi
  0 siblings, 1 reply; 10+ messages in thread
From: Fiona Ebner @ 2026-04-24 10:25 UTC (permalink / raw)
  To: open list:Network Block Dev...
  Cc: QEMU Developers, Stefan Hajnoczi, Fam Zheng, Hanna Czenczek,
	Kevin Wolf

Dear maintainers,

Since QEMU 10.2, if io_uring is enabled, it is used for the event loop
of iothreads, and this causes an IO pressure stall value of nearly 100
when idle.
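
For readers who want to watch the value themselves, here is a minimal sketch of reading the kernel's PSI interface. It assumes the figure above refers to the avg10/avg60/avg300 fields of /proc/pressure/io; the helper names psi_parse_line() and psi_dump() are made up for this illustration and are not part of QEMU or any library.

```c
#include <stdio.h>

/* One line of /proc/pressure/io, for example:
 *   some avg10=0.00 avg60=0.00 avg300=0.00 total=0
 */
struct psi_line {
    char kind[8];                 /* "some" or "full" */
    double avg10, avg60, avg300;  /* percentages over 10s/60s/300s windows */
    unsigned long long total_us;  /* cumulative stall time in microseconds */
};

/* Hypothetical helper: returns 0 on success, -1 if the line does not match. */
int psi_parse_line(const char *line, struct psi_line *out)
{
    int n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
                   out->kind, &out->avg10, &out->avg60, &out->avg300,
                   &out->total_us);
    return n == 5 ? 0 : -1;
}

/* Hypothetical helper: print every parsable line of a PSI file.
 * Returns 0 on success, -1 if the file cannot be opened. */
int psi_dump(const char *path)
{
    char buf[256];
    struct psi_line pl;
    FILE *f = fopen(path, "r");

    if (!f) {
        return -1;
    }
    while (fgets(buf, sizeof(buf), f)) {
        if (psi_parse_line(buf, &pl) == 0) {
            printf("%s: avg10=%.2f avg60=%.2f avg300=%.2f total=%llu us\n",
                   pl.kind, pl.avg10, pl.avg60, pl.avg300, pl.total_us);
        }
    }
    fclose(f);
    return 0;
}
```

Calling psi_dump("/proc/pressure/io") while an otherwise idle guest with an iothread is running should make the behaviour visible; a cgroup's io.pressure file uses the same line format.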

The issue was also reported on the kernel mailing list [0]. The
suggestion from Jens Axboe was to just turn off the iowait accounting
completely. But since there is actual IO submitted via the same ring
(for block/file-posix.c), I wasn't sure if that is the right approach.

So the idea was to keep track of whether the event loop is otherwise
idle and only use the IORING_ENTER_NO_IOWAIT flag in that case [1].

However, doing so would only help for block/file-posix.c, which submits
IO via luring_co_submit() -> fdmon_io_uring_add_sqe(). For example, for
block/rbd.c, only a poll SQE for the AioHandler node's fd is used. When
submitting that poll SQE in the iothread, we would need to be able to
know if IO for RBD is currently in-flight or not to be able to decide
whether to use the IORING_ENTER_NO_IOWAIT flag or not. Is there a good
way to do this (in a general way)?

Or should the flag really always be used (if supported by the kernel)?
Is there a way to tell io_uring/kernel that we are an event loop and our
waiting should only be accounted for when there is actual IO in-flight?
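
To make the "only when otherwise idle" idea concrete, here is a rough sketch of the bookkeeping it implies. This is not the patch from [1]: struct loop_state, the loop_*() helpers and the SKETCH_* bits are invented for this illustration, and real code would use IORING_ENTER_GETEVENTS and IORING_ENTER_NO_IOWAIT from <linux/io_uring.h>. As described above, such a counter can only see IO that is submitted through the ring itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bits for this sketch only. They stand in for
 * IORING_ENTER_GETEVENTS and IORING_ENTER_NO_IOWAIT and do not
 * claim to match the kernel's values. */
#define SKETCH_ENTER_GETEVENTS (1U << 0)
#define SKETCH_ENTER_NO_IOWAIT (1U << 1)

/* Hypothetical per-event-loop state. */
struct loop_state {
    unsigned io_in_flight;      /* real IO SQEs submitted, not yet completed */
    bool kernel_has_no_iowait;  /* whether the kernel supports the flag */
};

/* Call when a real IO request (not a poll or timeout SQE) is queued. */
void loop_io_submitted(struct loop_state *s)
{
    s->io_in_flight++;
}

/* Call when the completion for such a request has been processed. */
void loop_io_completed(struct loop_state *s)
{
    assert(s->io_in_flight > 0);
    s->io_in_flight--;
}

/* Flags for the next blocking wait: opt out of iowait accounting only
 * while no real IO is pending on this ring. */
unsigned loop_enter_flags(const struct loop_state *s)
{
    unsigned flags = SKETCH_ENTER_GETEVENTS;

    if (s->kernel_has_no_iowait && s->io_in_flight == 0) {
        flags |= SKETCH_ENTER_NO_IOWAIT;
    }
    return flags;
}
```

Poll and timeout SQEs deliberately never touch the counter here, which is exactly why a driver that only registers a poll SQE (as in the RBD example) looks idle to this scheme even when its IO is in flight elsewhere.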

Happy to hear your opinions and suggestions!

[0]:
https://lore.kernel.org/io-uring/14bc6266-5bc9-4454-9518-d1016bfe417b@proxmox.com/T/

[1]:
https://lore.proxmox.com/pve-devel/525c4dad-6d04-41f0-8a21-9302b0c6baa4@proxmox.com/T/

Best Regards,
Fiona

* Re: Excessive IO PSI for iothread when using io_uring since QEMU 10.2
  2026-04-24 10:25 Excessive IO PSI for iothread when using io_uring since QEMU 10.2 Fiona Ebner
@ 2026-04-27 19:13 ` Stefan Hajnoczi
  2026-04-28 12:10   ` Fiona Ebner
  0 siblings, 1 reply; 10+ messages in thread
From: Stefan Hajnoczi @ 2026-04-27 19:13 UTC (permalink / raw)
  To: Fiona Ebner
  Cc: open list:Network Block Dev..., QEMU Developers, Fam Zheng,
	Hanna Czenczek, Kevin Wolf

On Fri, Apr 24, 2026 at 12:25:41PM +0200, Fiona Ebner wrote:
> Dear maintainers,
> 
> since QEMU 10.2, if io_uring is enabled, it will be used for the event
> loop of iothreads and this causes an IO pressure stall value of nearly
> 100 when idle.
> 
> The issue was also reported on the kernel mailing list [0]. The
> suggestion from Jens Axboe was to just turn off the iowait accounting
> completely. But since (for block/file-posix.c), there is actual IO
> submitted via the same ring, I wasn't sure if that is the right approach.
> 
> So the idea was to keep track of whether the event loop is otherwise
> idle and only use the IORING_ENTER_NO_IOWAIT flag in that case [1].
> 
> However, doing so would only help for block/file-posix.c, which submits
> IO via luring_co_submit() -> fdmon_io_uring_add_sqe(). For example, for
> block/rbd.c, only a poll SQE for the AioHandler node's fd is used. When
> submitting that poll SQE in the iothread, we would need to be able to
> know if IO for RBD is currently in-flight or not to be able to decide
> whether to use the IORING_ENTER_NO_IOWAIT flag or not. Is there a good
> way to do this (in a general way)?
> 
> Or should the flag really always be used (if supported by the kernel)?
> Is there a way to tell io_uring/kernel that we are an event loop and our
> waiting should only be accounted for when there is actual IO in-flight?
> 
> Happy to hear your opinions and suggestions!
> 
> [0]:
> https://lore.kernel.org/io-uring/14bc6266-5bc9-4454-9518-d1016bfe417b@proxmox.com/T/

Hi Fiona,
Jens replied yesterday and confirmed your suspicion that the number of
inflight requests is not being tracked correctly.

Is there still a problem after fixing the kernel's inflight counting? If
not, then no QEMU change is necessary and that seems like the cleanest
solution anyway. The kernel should know whether there is I/O in flight
and so it doesn't seem right that userspace needs to hint this.

Stefan

> 
> [1]:
> https://lore.proxmox.com/pve-devel/525c4dad-6d04-41f0-8a21-9302b0c6baa4@proxmox.com/T/
> 
> Best Regards,
> Fiona
> 

* Re: Excessive IO PSI for iothread when using io_uring since QEMU 10.2
  2026-04-27 19:13 ` Stefan Hajnoczi
@ 2026-04-28 12:10   ` Fiona Ebner
  2026-04-28 13:31     ` Fiona Ebner
  2026-04-28 16:19     ` Stefan Hajnoczi
  0 siblings, 2 replies; 10+ messages in thread
From: Fiona Ebner @ 2026-04-28 12:10 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: open list:Network Block Dev..., QEMU Developers, Fam Zheng,
	Hanna Czenczek, Kevin Wolf, Thomas Lamprecht

Hi Stefan,

On 27.04.26 at 9:12 PM, Stefan Hajnoczi wrote:
> On Fri, Apr 24, 2026 at 12:25:41PM +0200, Fiona Ebner wrote:
>> Dear maintainers,
>>
>> since QEMU 10.2, if io_uring is enabled, it will be used for the event
>> loop of iothreads and this causes an IO pressure stall value of nearly
>> 100 when idle.
>>
>> The issue was also reported on the kernel mailing list [0]. The
>> suggestion from Jens Axboe was to just turn off the iowait accounting
>> completely. But since (for block/file-posix.c), there is actual IO
>> submitted via the same ring, I wasn't sure if that is the right approach.
>>
>> So the idea was to keep track of whether the event loop is otherwise
>> idle and only use the IORING_ENTER_NO_IOWAIT flag in that case [1].
>>
>> However, doing so would only help for block/file-posix.c, which submits
>> IO via luring_co_submit() -> fdmon_io_uring_add_sqe(). For example, for
>> block/rbd.c, only a poll SQE for the AioHandler node's fd is used. When
>> submitting that poll SQE in the iothread, we would need to be able to
>> know if IO for RBD is currently in-flight or not to be able to decide
>> whether to use the IORING_ENTER_NO_IOWAIT flag or not. Is there a good
>> way to do this (in a general way)?
>>
>> Or should the flag really always be used (if supported by the kernel)?
>> Is there a way to tell io_uring/kernel that we are an event loop and our
>> waiting should only be accounted for when there is actual IO in-flight?
>>
>> Happy to hear your opinions and suggestions!
>>
>> [0]:
>> https://lore.kernel.org/io-uring/14bc6266-5bc9-4454-9518-d1016bfe417b@proxmox.com/T/
> 
> Hi Fiona,
> Jens replied yesterday confirmed your suspicion that the number of
> inflight requests is not being tracked correctly.
> 
> Is there still a problem after fixing the kernel's inflight counting? If
> not, then no QEMU change is necessary and that seems like the cleanest
> solution anyway. The kernel should know whether there is I/O in flight
> and so it doesn't seem right that userspace needs to hint this.


Unfortunately, yes. Even with the kernel fix [2], the real problem with
poll SQEs described above remains. I'm still seeing high IO pressure
stall values when using QEMU. In add_poll_add_sqe(), QEMU submits poll
SQEs for the AioHandler node fd, and that does count as pending IO. A
small reproducer modeling this is included below [3].

So the question from above, how to deal with this for block drivers not
going through file-posix.c, remains.

Best Regards,
Fiona

[2]:
https://lore.kernel.org/io-uring/b4d2aa36-8301-4e58-be3e-1451267b8c43@proxmox.com/T/

[3]:

#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <liburing.h>
#include <sys/eventfd.h>

int main(void) {
    int fd;
    int ret;
    struct io_uring ring;
    struct io_uring_sqe *sqe;

    /* An eventfd that is never written to, so the poll never completes. */
    fd = eventfd(0, 0);
    assert(fd >= 0);

    ret = io_uring_queue_init(128, &ring, 0);
    assert(ret == 0);

    sqe = io_uring_get_sqe(&ring);
    assert(sqe);

    /* Poll mask 1 is POLLIN: wait for the eventfd to become readable. */
    io_uring_prep_poll_add(sqe, fd, 1);

    /* Blocks indefinitely; observe the IO pressure values while it waits. */
    ret = io_uring_submit_and_wait(&ring, 1);
    printf("got ret %d\n", ret);

    io_uring_queue_exit(&ring);

    return 0;
}




* Re: Excessive IO PSI for iothread when using io_uring since QEMU 10.2
  2026-04-28 12:10   ` Fiona Ebner
@ 2026-04-28 13:31     ` Fiona Ebner
  2026-04-28 16:19     ` Stefan Hajnoczi
  1 sibling, 0 replies; 10+ messages in thread
From: Fiona Ebner @ 2026-04-28 13:31 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: open list:Network Block Dev..., QEMU Developers, Fam Zheng,
	Hanna Czenczek, Kevin Wolf, Thomas Lamprecht

On 28.04.26 at 2:09 PM, Fiona Ebner wrote:
> Hi Stefan,
> 
> On 27.04.26 at 9:12 PM, Stefan Hajnoczi wrote:
>> On Fri, Apr 24, 2026 at 12:25:41PM +0200, Fiona Ebner wrote:
>>> Dear maintainers,
>>>
>>> since QEMU 10.2, if io_uring is enabled, it will be used for the event
>>> loop of iothreads and this causes an IO pressure stall value of nearly
>>> 100 when idle.
>>>
>>> The issue was also reported on the kernel mailing list [0]. The
>>> suggestion from Jens Axboe was to just turn off the iowait accounting
>>> completely. But since (for block/file-posix.c), there is actual IO
>>> submitted via the same ring, I wasn't sure if that is the right approach.
>>>
>>> So the idea was to keep track of whether the event loop is otherwise
>>> idle and only use the IORING_ENTER_NO_IOWAIT flag in that case [1].
>>>
>>> However, doing so would only help for block/file-posix.c, which submits
>>> IO via luring_co_submit() -> fdmon_io_uring_add_sqe(). For example, for
>>> block/rbd.c, only a poll SQE for the AioHandler node's fd is used. When
>>> submitting that poll SQE in the iothread, we would need to be able to
>>> know if IO for RBD is currently in-flight or not to be able to decide
>>> whether to use the IORING_ENTER_NO_IOWAIT flag or not. Is there a good
>>> way to do this (in a general way)?
>>>
>>> Or should the flag really always be used (if supported by the kernel)?
>>> Is there a way to tell io_uring/kernel that we are an event loop and our
>>> waiting should only be accounted for when there is actual IO in-flight?
>>>
>>> Happy to hear your opinions and suggestions!
>>>
>>> [0]:
>>> https://lore.kernel.org/io-uring/14bc6266-5bc9-4454-9518-d1016bfe417b@proxmox.com/T/
>>
>> Hi Fiona,
>> Jens replied yesterday confirmed your suspicion that the number of
>> inflight requests is not being tracked correctly.
>>
>> Is there still a problem after fixing the kernel's inflight counting? If
>> not, then no QEMU change is necessary and that seems like the cleanest
>> solution anyway. The kernel should know whether there is I/O in flight
>> and so it doesn't seem right that userspace needs to hint this.
> 
> 
> unfortunately, yes. Even with the kernel fix [2], the real problem with
> poll SQEs described above remains. I'm still seeing high IO pressure
> stall values when using QEMU. In add_poll_add_sqe(), QEMU submits poll
> SQEs for the AioHandler node fd, and that does count as pending IO. A
> small reproducer modeling this [3].
> 
> So the question from above, how to deal with this for block drivers not
> going through file-posix.c remains.

Or maybe there is no actual issue with such drivers. We always use the
IORING_ENTER_NO_IOWAIT flag when we only have poll or timeout SQEs. If
there is actual IO via io_uring, i.e. submitted via luring_co_submit(),
we don't set the IORING_ENTER_NO_IOWAIT flag.

IO submitted outside of io_uring will still be accounted for by the
kernel just like it was before QEMU switched the iothread event loop to
io_uring. Or am I missing something there?

> 
> Best Regards,
> Fiona
> 
> [2]:
> https://lore.kernel.org/io-uring/b4d2aa36-8301-4e58-be3e-1451267b8c43@proxmox.com/T/
> 
> [3]:
> 
> #include <assert.h>
> #include <errno.h>
> #include <stdio.h>
> #include <unistd.h>
> #include <liburing.h>
> #include <sys/eventfd.h>
> 
> int main(void) {
>     int fd;
>     int ret;
>     struct io_uring ring;
>     struct io_uring_sqe *sqe;
> 
>     fd = eventfd(0, 0);
>     assert(fd >= 0);
> 
>     ret = io_uring_queue_init(128, &ring, 0);
>     assert(ret == 0);
> 
>     sqe = io_uring_get_sqe(&ring);
>     assert(sqe);
> 
>     io_uring_prep_poll_add(sqe, fd, 1);
> 
>     ret = io_uring_submit_and_wait(&ring, 1);
>     printf("got ret %d\n", ret);
> 
>     io_uring_queue_exit(&ring);
> 
>     return 0;
> }
> 
> 
> 
> 




* Re: Excessive IO PSI for iothread when using io_uring since QEMU 10.2
  2026-04-28 12:10   ` Fiona Ebner
  2026-04-28 13:31     ` Fiona Ebner
@ 2026-04-28 16:19     ` Stefan Hajnoczi
  2026-04-29  8:00       ` Fiona Ebner
  1 sibling, 1 reply; 10+ messages in thread
From: Stefan Hajnoczi @ 2026-04-28 16:19 UTC (permalink / raw)
  To: Fiona Ebner
  Cc: open list:Network Block Dev..., QEMU Developers, Fam Zheng,
	Hanna Czenczek, Kevin Wolf, Thomas Lamprecht

On Tue, Apr 28, 2026 at 02:10:02PM +0200, Fiona Ebner wrote:
> Hi Stefan,
> 
> On 27.04.26 at 9:12 PM, Stefan Hajnoczi wrote:
> > On Fri, Apr 24, 2026 at 12:25:41PM +0200, Fiona Ebner wrote:
> >> Dear maintainers,
> >>
> >> since QEMU 10.2, if io_uring is enabled, it will be used for the event
> >> loop of iothreads and this causes an IO pressure stall value of nearly
> >> 100 when idle.
> >>
> >> The issue was also reported on the kernel mailing list [0]. The
> >> suggestion from Jens Axboe was to just turn off the iowait accounting
> >> completely. But since (for block/file-posix.c), there is actual IO
> >> submitted via the same ring, I wasn't sure if that is the right approach.
> >>
> >> So the idea was to keep track of whether the event loop is otherwise
> >> idle and only use the IORING_ENTER_NO_IOWAIT flag in that case [1].
> >>
> >> However, doing so would only help for block/file-posix.c, which submits
> >> IO via luring_co_submit() -> fdmon_io_uring_add_sqe(). For example, for
> >> block/rbd.c, only a poll SQE for the AioHandler node's fd is used. When
> >> submitting that poll SQE in the iothread, we would need to be able to
> >> know if IO for RBD is currently in-flight or not to be able to decide
> >> whether to use the IORING_ENTER_NO_IOWAIT flag or not. Is there a good
> >> way to do this (in a general way)?
> >>
> >> Or should the flag really always be used (if supported by the kernel)?
> >> Is there a way to tell io_uring/kernel that we are an event loop and our
> >> waiting should only be accounted for when there is actual IO in-flight?
> >>
> >> Happy to hear your opinions and suggestions!
> >>
> >> [0]:
> >> https://lore.kernel.org/io-uring/14bc6266-5bc9-4454-9518-d1016bfe417b@proxmox.com/T/
> > 
> > Hi Fiona,
> > Jens replied yesterday confirmed your suspicion that the number of
> > inflight requests is not being tracked correctly.
> > 
> > Is there still a problem after fixing the kernel's inflight counting? If
> > not, then no QEMU change is necessary and that seems like the cleanest
> > solution anyway. The kernel should know whether there is I/O in flight
> > and so it doesn't seem right that userspace needs to hint this.
> 
> 
> unfortunately, yes. Even with the kernel fix [2], the real problem with
> poll SQEs described above remains. I'm still seeing high IO pressure
> stall values when using QEMU. In add_poll_add_sqe(), QEMU submits poll
> SQEs for the AioHandler node fd, and that does count as pending IO. A
> small reproducer modeling this [3].

Does the kernel account POLL_ADD SQEs as blocking I/O activity?

That behavior is inconsistent if select(2)/poll(2)/epoll_wait(2)
syscalls do not count as blocking I/O activity. The kernel io_uring code
should account them correctly and not rely on a userspace hint.

Stefan

> 
> So the question from above, how to deal with this for block drivers not
> going through file-posix.c remains.
> 
> Best Regards,
> Fiona
> 
> [2]:
> https://lore.kernel.org/io-uring/b4d2aa36-8301-4e58-be3e-1451267b8c43@proxmox.com/T/
> 
> [3]:
> 
> #include <assert.h>
> #include <errno.h>
> #include <stdio.h>
> #include <unistd.h>
> #include <liburing.h>
> #include <sys/eventfd.h>
> 
> int main(void) {
>     int fd;
>     int ret;
>     struct io_uring ring;
>     struct io_uring_sqe *sqe;
> 
>     fd = eventfd(0, 0);
>     assert(fd >= 0);
> 
>     ret = io_uring_queue_init(128, &ring, 0);
>     assert(ret == 0);
> 
>     sqe = io_uring_get_sqe(&ring);
>     assert(sqe);
> 
>     io_uring_prep_poll_add(sqe, fd, 1);
> 
>     ret = io_uring_submit_and_wait(&ring, 1);
>     printf("got ret %d\n", ret);
> 
>     io_uring_queue_exit(&ring);
> 
>     return 0;
> }
> 
> 

* Re: Excessive IO PSI for iothread when using io_uring since QEMU 10.2
  2026-04-28 16:19     ` Stefan Hajnoczi
@ 2026-04-29  8:00       ` Fiona Ebner
  2026-04-29 12:20         ` Stefan Hajnoczi
  2026-06-01 17:20         ` Stefan Hajnoczi
  0 siblings, 2 replies; 10+ messages in thread
From: Fiona Ebner @ 2026-04-29  8:00 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: open list:Network Block Dev..., QEMU Developers, Fam Zheng,
	Hanna Czenczek, Kevin Wolf, Thomas Lamprecht, Jens Axboe

On 28.04.26 at 6:18 PM, Stefan Hajnoczi wrote:
> On Tue, Apr 28, 2026 at 02:10:02PM +0200, Fiona Ebner wrote:
>> Hi Stefan,
>>
>> On 27.04.26 at 9:12 PM, Stefan Hajnoczi wrote:
>>> On Fri, Apr 24, 2026 at 12:25:41PM +0200, Fiona Ebner wrote:
>>>> Dear maintainers,
>>>>
>>>> since QEMU 10.2, if io_uring is enabled, it will be used for the event
>>>> loop of iothreads and this causes an IO pressure stall value of nearly
>>>> 100 when idle.
>>>>
>>>> The issue was also reported on the kernel mailing list [0]. The
>>>> suggestion from Jens Axboe was to just turn off the iowait accounting
>>>> completely. But since (for block/file-posix.c), there is actual IO
>>>> submitted via the same ring, I wasn't sure if that is the right approach.
>>>>
>>>> So the idea was to keep track of whether the event loop is otherwise
>>>> idle and only use the IORING_ENTER_NO_IOWAIT flag in that case [1].
>>>>
>>>> However, doing so would only help for block/file-posix.c, which submits
>>>> IO via luring_co_submit() -> fdmon_io_uring_add_sqe(). For example, for
>>>> block/rbd.c, only a poll SQE for the AioHandler node's fd is used. When
>>>> submitting that poll SQE in the iothread, we would need to be able to
>>>> know if IO for RBD is currently in-flight or not to be able to decide
>>>> whether to use the IORING_ENTER_NO_IOWAIT flag or not. Is there a good
>>>> way to do this (in a general way)?
>>>>
>>>> Or should the flag really always be used (if supported by the kernel)?
>>>> Is there a way to tell io_uring/kernel that we are an event loop and our
>>>> waiting should only be accounted for when there is actual IO in-flight?
>>>>
>>>> Happy to hear your opinions and suggestions!
>>>>
>>>> [0]:
>>>> https://lore.kernel.org/io-uring/14bc6266-5bc9-4454-9518-d1016bfe417b@proxmox.com/T/
>>>
>>> Hi Fiona,
>>> Jens replied yesterday confirmed your suspicion that the number of
>>> inflight requests is not being tracked correctly.
>>>
>>> Is there still a problem after fixing the kernel's inflight counting? If
>>> not, then no QEMU change is necessary and that seems like the cleanest
>>> solution anyway. The kernel should know whether there is I/O in flight
>>> and so it doesn't seem right that userspace needs to hint this.
>>
>>
>> unfortunately, yes. Even with the kernel fix [2], the real problem with
>> poll SQEs described above remains. I'm still seeing high IO pressure
>> stall values when using QEMU. In add_poll_add_sqe(), QEMU submits poll
>> SQEs for the AioHandler node fd, and that does count as pending IO. A
>> small reproducer modeling this [3].
> 
> Does the kernel account POLL_ADD SQEs as blocking I/O activity?

Apparently yes. See the C program below [3].

> That behavior is inconsistent if select(2)/poll(2)/epoll_wait(2)
> syscalls do not count as blocking I/O activity. The kernel io_uring code
> should account them correctly and not rely on a userspace hint.

@Jens Axboe: should there be a separate internal counter for
poll/timeout SQEs and have them not count towards IO wait by default?

> 
> Stefan
> 
>>
>> So the question from above, how to deal with this for block drivers not
>> going through file-posix.c remains.
>>
>> Best Regards,
>> Fiona
>>
>> [2]:
>> https://lore.kernel.org/io-uring/b4d2aa36-8301-4e58-be3e-1451267b8c43@proxmox.com/T/
>>
>> [3]:
>>
>> #include <assert.h>
>> #include <errno.h>
>> #include <stdio.h>
>> #include <unistd.h>
>> #include <liburing.h>
>> #include <sys/eventfd.h>
>>
>> int main(void) {
>>     int fd;
>>     int ret;
>>     struct io_uring ring;
>>     struct io_uring_sqe *sqe;
>>
>>     fd = eventfd(0, 0);
>>     assert(fd >= 0);
>>
>>     ret = io_uring_queue_init(128, &ring, 0);
>>     assert(ret == 0);
>>
>>     sqe = io_uring_get_sqe(&ring);
>>     assert(sqe);
>>
>>     io_uring_prep_poll_add(sqe, fd, 1);
>>
>>     ret = io_uring_submit_and_wait(&ring, 1);
>>     printf("got ret %d\n", ret);
>>
>>     io_uring_queue_exit(&ring);
>>
>>     return 0;
>> }
>>
>>




* Re: Excessive IO PSI for iothread when using io_uring since QEMU 10.2
  2026-04-29  8:00       ` Fiona Ebner
@ 2026-04-29 12:20         ` Stefan Hajnoczi
  2026-06-01 17:20         ` Stefan Hajnoczi
  1 sibling, 0 replies; 10+ messages in thread
From: Stefan Hajnoczi @ 2026-04-29 12:20 UTC (permalink / raw)
  To: Fiona Ebner
  Cc: open list:Network Block Dev..., QEMU Developers, Fam Zheng,
	Hanna Czenczek, Kevin Wolf, Thomas Lamprecht, Jens Axboe

On Wed, Apr 29, 2026 at 10:00:34AM +0200, Fiona Ebner wrote:
> On 28.04.26 at 6:18 PM, Stefan Hajnoczi wrote:
> > On Tue, Apr 28, 2026 at 02:10:02PM +0200, Fiona Ebner wrote:
> >> Hi Stefan,
> >>
> >> On 27.04.26 at 9:12 PM, Stefan Hajnoczi wrote:
> >>> On Fri, Apr 24, 2026 at 12:25:41PM +0200, Fiona Ebner wrote:
> >>>> Dear maintainers,
> >>>>
> >>>> since QEMU 10.2, if io_uring is enabled, it will be used for the event
> >>>> loop of iothreads and this causes an IO pressure stall value of nearly
> >>>> 100 when idle.
> >>>>
> >>>> The issue was also reported on the kernel mailing list [0]. The
> >>>> suggestion from Jens Axboe was to just turn off the iowait accounting
> >>>> completely. But since (for block/file-posix.c), there is actual IO
> >>>> submitted via the same ring, I wasn't sure if that is the right approach.
> >>>>
> >>>> So the idea was to keep track of whether the event loop is otherwise
> >>>> idle and only use the IORING_ENTER_NO_IOWAIT flag in that case [1].
> >>>>
> >>>> However, doing so would only help for block/file-posix.c, which submits
> >>>> IO via luring_co_submit() -> fdmon_io_uring_add_sqe(). For example, for
> >>>> block/rbd.c, only a poll SQE for the AioHandler node's fd is used. When
> >>>> submitting that poll SQE in the iothread, we would need to be able to
> >>>> know if IO for RBD is currently in-flight or not to be able to decide
> >>>> whether to use the IORING_ENTER_NO_IOWAIT flag or not. Is there a good
> >>>> way to do this (in a general way)?
> >>>>
> >>>> Or should the flag really always be used (if supported by the kernel)?
> >>>> Is there a way to tell io_uring/kernel that we are an event loop and our
> >>>> waiting should only be accounted for when there is actual IO in-flight?
> >>>>
> >>>> Happy to hear your opinions and suggestions!
> >>>>
> >>>> [0]:
> >>>> https://lore.kernel.org/io-uring/14bc6266-5bc9-4454-9518-d1016bfe417b@proxmox.com/T/
> >>>
> >>> Hi Fiona,
> >>> Jens replied yesterday confirmed your suspicion that the number of
> >>> inflight requests is not being tracked correctly.
> >>>
> >>> Is there still a problem after fixing the kernel's inflight counting? If
> >>> not, then no QEMU change is necessary and that seems like the cleanest
> >>> solution anyway. The kernel should know whether there is I/O in flight
> >>> and so it doesn't seem right that userspace needs to hint this.
> >>
> >>
> >> unfortunately, yes. Even with the kernel fix [2], the real problem with
> >> poll SQEs described above remains. I'm still seeing high IO pressure
> >> stall values when using QEMU. In add_poll_add_sqe(), QEMU submits poll
> >> SQEs for the AioHandler node fd, and that does count as pending IO. A
> >> small reproducer modeling this [3].
> > 
> > Does the kernel account POLL_ADD SQEs as blocking I/O activity?
> 
> Apparently yes. See the C program below [3].
> 
> > That behavior is inconsistent if select(2)/poll(2)/epoll_wait(2)
> > syscalls do not count as blocking I/O activity. The kernel io_uring code
> > should account them correctly and not rely on a userspace hint.
> 
> @Jens Axboe: should there be a separate internal counter for
> poll/timeout SQEs and have them not count towards IO wait by default?

I wanted to add more nuance to what I wrote:

As a baseline, io_uring should account IO activity in the same way as
the traditional syscalls for those operations. However, it does seem
like userspace hints can be useful in some cases.

For example, if a server process is reading from a socket/eventfd/pipe
waiting for an incoming request then it is not stalled by IO. However,
if the same process makes a request to another process and is reading a
socket/eventfd/pipe waiting for the response, then it may indeed be
considered as waiting for IO. In other words, whether a read means the
process is stalled waiting for IO or not depends on the application and
the kernel doesn't know that. Userspace hints make sense in this case.

I just think that in this case io_uring isn't following the IO pressure
stall accounting of the equivalent traditional system calls and that
seems like a gap that should be fixed in the kernel rather than
userspace.
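
One way to check the comparison with the traditional system calls is to run the same wait through poll(2) and watch the IO pressure values while it blocks. The following is only a sketch of such a counterpart to the io_uring reproducer quoted below; poll_idle_eventfd() is a name made up here, and nothing is asserted about what the kernel will report for it.

```c
#include <poll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical helper: block in poll(2) on a fresh eventfd that nobody
 * writes to, for at most timeout_ms milliseconds (-1 waits forever).
 * Returns poll()'s result: 0 on timeout, 1 if readable, -1 on error. */
int poll_idle_eventfd(int timeout_ms)
{
    struct pollfd pfd;
    int ret;
    int fd = eventfd(0, 0);

    if (fd < 0) {
        return -1;
    }

    pfd.fd = fd;
    pfd.events = POLLIN;  /* same readiness condition as the POLL_ADD SQE */
    pfd.revents = 0;

    ret = poll(&pfd, 1, timeout_ms);

    close(fd);
    return ret;
}
```

Running poll_idle_eventfd(-1) in one process and the io_uring reproducer in another, each on an otherwise idle system, would show whether the two ways of waiting are accounted differently.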

Stefan

> > 
> > Stefan
> > 
> >>
> >> So the question from above, how to deal with this for block drivers not
> >> going through file-posix.c remains.
> >>
> >> Best Regards,
> >> Fiona
> >>
> >> [2]:
> >> https://lore.kernel.org/io-uring/b4d2aa36-8301-4e58-be3e-1451267b8c43@proxmox.com/T/
> >>
> >> [3]:
> >>
> >> #include <assert.h>
> >> #include <errno.h>
> >> #include <stdio.h>
> >> #include <unistd.h>
> >> #include <liburing.h>
> >> #include <sys/eventfd.h>
> >>
> >> int main(void) {
> >>     int fd;
> >>     int ret;
> >>     struct io_uring ring;
> >>     struct io_uring_sqe *sqe;
> >>
> >>     fd = eventfd(0, 0);
> >>     assert(fd >= 0);
> >>
> >>     ret = io_uring_queue_init(128, &ring, 0);
> >>     assert(ret == 0);
> >>
> >>     sqe = io_uring_get_sqe(&ring);
> >>     assert(sqe);
> >>
> >>     io_uring_prep_poll_add(sqe, fd, 1);
> >>
> >>     ret = io_uring_submit_and_wait(&ring, 1);
> >>     printf("got ret %d\n", ret);
> >>
> >>     io_uring_queue_exit(&ring);
> >>
> >>     return 0;
> >> }
> >>
> >>
> 
> 

* Re: Excessive IO PSI for iothread when using io_uring since QEMU 10.2
  2026-04-29  8:00       ` Fiona Ebner
  2026-04-29 12:20         ` Stefan Hajnoczi
@ 2026-06-01 17:20         ` Stefan Hajnoczi
  2026-06-02  8:41           ` Fiona Ebner
  1 sibling, 1 reply; 10+ messages in thread
From: Stefan Hajnoczi @ 2026-06-01 17:20 UTC (permalink / raw)
  To: Fiona Ebner
  Cc: open list:Network Block Dev..., QEMU Developers, Fam Zheng,
	Hanna Czenczek, Kevin Wolf, Thomas Lamprecht, Jens Axboe

On Wed, Apr 29, 2026 at 10:00:34AM +0200, Fiona Ebner wrote:
> On 28.04.26 at 6:18 PM, Stefan Hajnoczi wrote:
> > On Tue, Apr 28, 2026 at 02:10:02PM +0200, Fiona Ebner wrote:
> >> Hi Stefan,
> >>
> >> On 27.04.26 at 9:12 PM, Stefan Hajnoczi wrote:
> >>> On Fri, Apr 24, 2026 at 12:25:41PM +0200, Fiona Ebner wrote:
> >>>> Dear maintainers,
> >>>>
> >>>> since QEMU 10.2, if io_uring is enabled, it will be used for the event
> >>>> loop of iothreads and this causes an IO pressure stall value of nearly
> >>>> 100 when idle.
> >>>>
> >>>> The issue was also reported on the kernel mailing list [0]. The
> >>>> suggestion from Jens Axboe was to just turn off the iowait accounting
> >>>> completely. But since (for block/file-posix.c), there is actual IO
> >>>> submitted via the same ring, I wasn't sure if that is the right approach.
> >>>>
> >>>> So the idea was to keep track of whether the event loop is otherwise
> >>>> idle and only use the IORING_ENTER_NO_IOWAIT flag in that case [1].
> >>>>
> >>>> However, doing so would only help for block/file-posix.c, which submits
> >>>> IO via luring_co_submit() -> fdmon_io_uring_add_sqe(). For example, for
> >>>> block/rbd.c, only a poll SQE for the AioHandler node's fd is used. When
> >>>> submitting that poll SQE in the iothread, we would need to be able to
> >>>> know if IO for RBD is currently in-flight or not to be able to decide
> >>>> whether to use the IORING_ENTER_NO_IOWAIT flag or not. Is there a good
> >>>> way to do this (in a general way)?
> >>>>
> >>>> Or should the flag really always be used (if supported by the kernel)?
> >>>> Is there a way to tell io_uring/kernel that we are an event loop and our
> >>>> waiting should only be accounted for when there is actual IO in-flight?
> >>>>
> >>>> Happy to hear your opinions and suggestions!
> >>>>
> >>>> [0]:
> >>>> https://lore.kernel.org/io-uring/14bc6266-5bc9-4454-9518-d1016bfe417b@proxmox.com/T/
> >>>
> >>> Hi Fiona,
> >>> Jens replied yesterday confirmed your suspicion that the number of
> >>> inflight requests is not being tracked correctly.
> >>>
> >>> Is there still a problem after fixing the kernel's inflight counting? If
> >>> not, then no QEMU change is necessary and that seems like the cleanest
> >>> solution anyway. The kernel should know whether there is I/O in flight
> >>> and so it doesn't seem right that userspace needs to hint this.
> >>
> >>
> >> unfortunately, yes. Even with the kernel fix [2], the real problem with
> >> poll SQEs described above remains. I'm still seeing high IO pressure
> >> stall values when using QEMU. In add_poll_add_sqe(), QEMU submits poll
> >> SQEs for the AioHandler node fd, and that does count as pending IO. A
> >> small reproducer modeling this [3].
> > 
> > Does the kernel account POLL_ADD SQEs as blocking I/O activity?
> 
> Apparently yes. See the C program below [3].
> 
> > That behavior is inconsistent if select(2)/poll(2)/epoll_wait(2)
> > syscalls do not count as blocking I/O activity. The kernel io_uring code
> > should account them correctly and not rely on a userspace hint.
> 
> @Jens Axboe: should there be a separate internal counter for
> poll/timeout SQEs and have them not count towards IO wait by default?

Hi Fiona,
Any update on this issue? Was it resolved in io_uring or is a QEMU patch
still needed?

Thanks,
Stefan

* Re: Excessive IO PSI for iothread when using io_uring since QEMU 10.2
  2026-06-01 17:20         ` Stefan Hajnoczi
@ 2026-06-02  8:41           ` Fiona Ebner
  2026-06-02 12:08             ` Stefan Hajnoczi
  0 siblings, 1 reply; 10+ messages in thread
From: Fiona Ebner @ 2026-06-02  8:41 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: open list:Network Block Dev..., QEMU Developers, Fam Zheng,
	Hanna Czenczek, Kevin Wolf, Thomas Lamprecht, Jens Axboe

On 01.06.26 at 7:20 PM, Stefan Hajnoczi wrote:
> On Wed, Apr 29, 2026 at 10:00:34AM +0200, Fiona Ebner wrote:
>> On 28.04.26 at 6:18 PM, Stefan Hajnoczi wrote:
>>> On Tue, Apr 28, 2026 at 02:10:02PM +0200, Fiona Ebner wrote:
>>>> Hi Stefan,
>>>>
>>>> On 27.04.26 at 9:12 PM, Stefan Hajnoczi wrote:
>>>>> On Fri, Apr 24, 2026 at 12:25:41PM +0200, Fiona Ebner wrote:
>>>>>> Dear maintainers,
>>>>>>
>>>>>> since QEMU 10.2, if io_uring is enabled, it will be used for the event
>>>>>> loop of iothreads and this causes an IO pressure stall value of nearly
>>>>>> 100 when idle.
>>>>>>
>>>>>> The issue was also reported on the kernel mailing list [0]. The
>>>>>> suggestion from Jens Axboe was to just turn off the iowait accounting
>>>>>> completely. But since (for block/file-posix.c), there is actual IO
>>>>>> submitted via the same ring, I wasn't sure if that is the right approach.
>>>>>>
>>>>>> So the idea was to keep track of whether the event loop is otherwise
>>>>>> idle and only use the IORING_ENTER_NO_IOWAIT flag in that case [1].
>>>>>>
>>>>>> However, doing so would only help for block/file-posix.c, which submits
>>>>>> IO via luring_co_submit() -> fdmon_io_uring_add_sqe(). For example, for
>>>>>> block/rbd.c, only a poll SQE for the AioHandler node's fd is used. When
>>>>>> submitting that poll SQE in the iothread, we would need to be able to
>>>>>> know if IO for RBD is currently in-flight or not to be able to decide
>>>>>> whether to use the IORING_ENTER_NO_IOWAIT flag or not. Is there a good
>>>>>> way to do this (in a general way)?
>>>>>>
>>>>>> Or should the flag really always be used (if supported by the kernel)?
>>>>>> Is there a way to tell io_uring/kernel that we are an event loop and our
>>>>>> waiting should only be accounted for when there is actual IO in-flight?
>>>>>>
>>>>>> Happy to hear your opinions and suggestions!
>>>>>>
>>>>>> [0]:
>>>>>> https://lore.kernel.org/io-uring/14bc6266-5bc9-4454-9518-d1016bfe417b@proxmox.com/T/
>>>>>
>>>>> Hi Fiona,
>>>>> Jens replied yesterday and confirmed your suspicion that the number of
>>>>> inflight requests is not being tracked correctly.
>>>>>
>>>>> Is there still a problem after fixing the kernel's inflight counting? If
>>>>> not, then no QEMU change is necessary and that seems like the cleanest
>>>>> solution anyway. The kernel should know whether there is I/O in flight
>>>>> and so it doesn't seem right that userspace needs to hint this.
>>>>
>>>>
>>>> unfortunately, yes. Even with the kernel fix [2], the real problem with
>>>> poll SQEs described above remains. I'm still seeing high IO pressure
>>>> stall values when using QEMU. In add_poll_add_sqe(), QEMU submits poll
>>>> SQEs for the AioHandler node fd, and that does count as pending IO. A
>>>> small reproducer modeling this [3].
>>>
>>> Does the kernel account POLL_ADD SQEs as blocking I/O activity?
>>
>> Apparently yes. See the C program below [3].
>>
>>> That behavior is inconsistent if select(2)/poll(2)/epoll_wait(2)
>>> syscalls do not count as blocking I/O activity. The kernel io_uring code
>>> should account them correctly and not rely on a userspace hint.
>>
>> @Jens Axboe: should there be a separate internal counter for
>> poll/timeout SQEs and have them not count towards IO wait by default?
> 
> Hi Fiona,
> Any update on this issue? Was it resolved in io_uring or is a QEMU patch
> still needed?

Hi Stefan,

I did not proceed with the above, since I did not get an ack from Jens
regarding the suggested approach.

We needed to go ahead with a downstream release, so in the meantime we
applied a workaround by Thomas that sets the IORING_ENTER_NO_IOWAIT
flag when there is no actual IO in-flight [0]. Should it be submitted to
qemu-devel too?
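
To illustrate the idea only (this is a minimal sketch, not the actual
patch in [0]): the event loop keeps a count of real IO requests it has
submitted via the ring and passes IORING_ENTER_NO_IOWAIT only while that
count is zero. The struct ring_state type and the ring_*() helpers are
hypothetical names made up for this sketch, and the flag values are
assumed to match <linux/io_uring.h>; real code should take them from that
header. As discussed earlier in the thread, such a counter only sees IO
that goes through the ring itself, so it cannot know about in-flight RBD
requests.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to mirror <linux/io_uring.h>; use the real header in practice. */
#ifndef IORING_ENTER_GETEVENTS
#define IORING_ENTER_GETEVENTS (1U << 0)
#endif
#ifndef IORING_ENTER_NO_IOWAIT
#define IORING_ENTER_NO_IOWAIT (1U << 7)
#endif

/* Hypothetical per-ring bookkeeping, not QEMU's actual data structure. */
struct ring_state {
    unsigned inflight_io;      /* real IO SQEs submitted but not yet completed */
    bool no_iowait_supported;  /* whether the kernel accepts the flag */
};

/* Call when a real IO request (not a poll/timeout SQE) is queued. */
static void ring_io_submitted(struct ring_state *rs)
{
    rs->inflight_io++;
}

/* Call when the CQE for a real IO request has been reaped. */
static void ring_io_completed(struct ring_state *rs)
{
    assert(rs->inflight_io > 0);
    rs->inflight_io--;
}

/* Flags for the blocking io_uring_enter() call of the event loop. */
static unsigned ring_enter_flags(const struct ring_state *rs)
{
    unsigned flags = IORING_ENTER_GETEVENTS;

    /* Only opt out of iowait accounting while nothing is in flight. */
    if (rs->no_iowait_supported && rs->inflight_io == 0) {
        flags |= IORING_ENTER_NO_IOWAIT;
    }
    return flags;
}
```

With this, an idle iothread that only has poll SQEs pending would wait
with the flag set, while a thread with outstanding reads or writes on the
ring would still be accounted as waiting for IO.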

[0]:
https://git.proxmox.com/?p=pve-qemu.git;a=commitdiff;h=775e41b890a645db75119233fe2b21f139bf8e4f

Best Regards,
Fiona




* Re: Excessive IO PSI for iothread when using io_uring since QEMU 10.2
  2026-06-02  8:41           ` Fiona Ebner
@ 2026-06-02 12:08             ` Stefan Hajnoczi
  0 siblings, 0 replies; 10+ messages in thread
From: Stefan Hajnoczi @ 2026-06-02 12:08 UTC (permalink / raw)
  To: Jens Axboe
  Cc: open list:Network Block Dev..., QEMU Developers, Fam Zheng,
	Hanna Czenczek, Kevin Wolf, Thomas Lamprecht, Fiona Ebner

On Tue, Jun 02, 2026 at 10:41:11AM +0200, Fiona Ebner wrote:
> On 01.06.26 at 7:20 PM, Stefan Hajnoczi wrote:
> > On Wed, Apr 29, 2026 at 10:00:34AM +0200, Fiona Ebner wrote:
> >> On 28.04.26 at 6:18 PM, Stefan Hajnoczi wrote:
> >>> On Tue, Apr 28, 2026 at 02:10:02PM +0200, Fiona Ebner wrote:
> >>>> Hi Stefan,
> >>>>
> >>>> On 27.04.26 at 9:12 PM, Stefan Hajnoczi wrote:
> >>>>> On Fri, Apr 24, 2026 at 12:25:41PM +0200, Fiona Ebner wrote:
> >>>>>> Dear maintainers,
> >>>>>>
> >>>>>> since QEMU 10.2, if io_uring is enabled, it will be used for the event
> >>>>>> loop of iothreads and this causes an IO pressure stall value of nearly
> >>>>>> 100 when idle.
> >>>>>>
> >>>>>> The issue was also reported on the kernel mailing list [0]. The
> >>>>>> suggestion from Jens Axboe was to just turn off the iowait accounting
> >>>>>> completely. But since (for block/file-posix.c), there is actual IO
> >>>>>> submitted via the same ring, I wasn't sure if that is the right approach.
> >>>>>>
> >>>>>> So the idea was to keep track of whether the event loop is otherwise
> >>>>>> idle and only use the IORING_ENTER_NO_IOWAIT flag in that case [1].
> >>>>>>
> >>>>>> However, doing so would only help for block/file-posix.c, which submits
> >>>>>> IO via luring_co_submit() -> fdmon_io_uring_add_sqe(). For example, for
> >>>>>> block/rbd.c, only a poll SQE for the AioHandler node's fd is used. When
> >>>>>> submitting that poll SQE in the iothread, we would need to be able to
> >>>>>> know if IO for RBD is currently in-flight or not to be able to decide
> >>>>>> whether to use the IORING_ENTER_NO_IOWAIT flag or not. Is there a good
> >>>>>> way to do this (in a general way)?
> >>>>>>
> >>>>>> Or should the flag really always be used (if supported by the kernel)?
> >>>>>> Is there a way to tell io_uring/kernel that we are an event loop and our
> >>>>>> waiting should only be accounted for when there is actual IO in-flight?
> >>>>>>
> >>>>>> Happy to hear your opinions and suggestions!
> >>>>>>
> >>>>>> [0]:
> >>>>>> https://lore.kernel.org/io-uring/14bc6266-5bc9-4454-9518-d1016bfe417b@proxmox.com/T/
> >>>>>
> >>>>> Hi Fiona,
> >>>>> Jens replied yesterday and confirmed your suspicion that the number of
> >>>>> inflight requests is not being tracked correctly.
> >>>>>
> >>>>> Is there still a problem after fixing the kernel's inflight counting? If
> >>>>> not, then no QEMU change is necessary and that seems like the cleanest
> >>>>> solution anyway. The kernel should know whether there is I/O in flight
> >>>>> and so it doesn't seem right that userspace needs to hint this.
> >>>>
> >>>>
> >>>> unfortunately, yes. Even with the kernel fix [2], the real problem with
> >>>> poll SQEs described above remains. I'm still seeing high IO pressure
> >>>> stall values when using QEMU. In add_poll_add_sqe(), QEMU submits poll
> >>>> SQEs for the AioHandler node fd, and that does count as pending IO. A
> >>>> small reproducer modeling this [3].
> >>>
> >>> Does the kernel account POLL_ADD SQEs as blocking I/O activity?
> >>
> >> Apparently yes. See the C program below [3].
> >>
> >>> That behavior is inconsistent if select(2)/poll(2)/epoll_wait(2)
> >>> syscalls do not count as blocking I/O activity. The kernel io_uring code
> >>> should account them correctly and not rely on a userspace hint.
> >>
> >> @Jens Axboe: should there be a separate internal counter for
> >> poll/timeout SQEs and have them not count towards IO wait by default?
> > 
> > Hi Fiona,
> > Any update on this issue? Was it resolved in io_uring or is a QEMU patch
> > still needed?
> 
> Hi Stefan,
> 
> I did not proceed with the above, since I did not get an ack from Jens
> regarding the suggested approach.
> 
> We needed to go ahead with a downstream release, so in the meantime we
> applied a workaround by Thomas that sets the IORING_ENTER_NO_IOWAIT
> flag when there is no actual IO in-flight [0]. Should it be submitted to
> qemu-devel too?

Pinging Jens: io_uring accounts POLL_ADD SQEs as blocking I/O activity
whereas select(2)/poll(2)/epoll_wait(2) do not. Would it make sense to
follow the same accounting as the syscalls for this operation since that
is probably expected?
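
For comparing the two accounting behaviours on an idle system, the stall
figure discussed in this thread can be read from /proc/pressure/io, whose
lines look like "some avg10=0.00 avg60=0.00 avg300=0.00 total=0". A
minimal sketch of pulling the avg10 value out of one such line follows;
psi_parse_avg10() is a hypothetical helper written for this mail and is
not part of QEMU or the kernel.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one line of /proc/pressure/io and return its avg10 value.
 * 'kind' selects the line type to accept ("some" or "full").
 * Returns 0 on success, -1 if the line is malformed or of another kind.
 */
static int psi_parse_avg10(const char *line, const char *kind, double *avg10)
{
    char name[16];
    double value;

    assert(line && kind && avg10);

    /* Expected shape: "<kind> avg10=<float> avg60=... avg300=... total=..." */
    if (sscanf(line, "%15s avg10=%lf", name, &value) != 2) {
        return -1;
    }
    if (strcmp(name, kind) != 0) {
        return -1;
    }
    *avg10 = value;
    return 0;
}
```

Sampling the "some" line once with a process blocked in poll(2) and once
with a process blocked in io_uring_enter() on a POLL_ADD SQE should make
the difference in accounting visible.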

Thanks,
Stefan

> 
> [0]:
> https://git.proxmox.com/?p=pve-qemu.git;a=commitdiff;h=775e41b890a645db75119233fe2b21f139bf8e4f
> 
> Best Regards,
> Fiona
> 


end of thread, other threads:[~2026-06-02 12:09 UTC | newest]

Thread overview: 10+ messages
2026-04-24 10:25 Excessive IO PSI for iothread when using io_uring since QEMU 10.2 Fiona Ebner
2026-04-27 19:13 ` Stefan Hajnoczi
2026-04-28 12:10   ` Fiona Ebner
2026-04-28 13:31     ` Fiona Ebner
2026-04-28 16:19     ` Stefan Hajnoczi
2026-04-29  8:00       ` Fiona Ebner
2026-04-29 12:20         ` Stefan Hajnoczi
2026-06-01 17:20         ` Stefan Hajnoczi
2026-06-02  8:41           ` Fiona Ebner
2026-06-02 12:08             ` Stefan Hajnoczi
