* Excessive IO PSI for iothread when using io_uring since QEMU 10.2
@ 2026-04-24 10:25 Fiona Ebner
  2026-04-27 19:13 ` Stefan Hajnoczi
  0 siblings, 1 reply; 10+ messages in thread
From: Fiona Ebner @ 2026-04-24 10:25 UTC (permalink / raw)
  To: open list:Network Block Dev...
  Cc: QEMU Developers, Stefan Hajnoczi, Fam Zheng, Hanna Czenczek, Kevin Wolf

Dear maintainers,

since QEMU 10.2, if io_uring is enabled, it will be used for the event
loop of iothreads, and this causes an IO pressure stall value of nearly
100 when idle.

The issue was also reported on the kernel mailing list [0]. The
suggestion from Jens Axboe was to just turn off the iowait accounting
completely. But since, for block/file-posix.c, actual IO is submitted
via the same ring, I wasn't sure if that is the right approach.

So the idea was to keep track of whether the event loop is otherwise
idle and only use the IORING_ENTER_NO_IOWAIT flag in that case [1].

However, doing so would only help for block/file-posix.c, which submits
IO via luring_co_submit() -> fdmon_io_uring_add_sqe(). For example, for
block/rbd.c, only a poll SQE for the AioHandler node's fd is used. When
submitting that poll SQE in the iothread, we would need to know whether
IO for RBD is currently in flight to be able to decide whether to use
the IORING_ENTER_NO_IOWAIT flag. Is there a good way to do this (in a
general way)?

Or should the flag really always be used (if supported by the kernel)?
Is there a way to tell io_uring/kernel that we are an event loop and our
waiting should only be accounted for when there is actual IO in flight?

Happy to hear your opinions and suggestions!

[0]: https://lore.kernel.org/io-uring/14bc6266-5bc9-4454-9518-d1016bfe417b@proxmox.com/T/
[1]: https://lore.proxmox.com/pve-devel/525c4dad-6d04-41f0-8a21-9302b0c6baa4@proxmox.com/T/

Best Regards,
Fiona

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: Excessive IO PSI for iothread when using io_uring since QEMU 10.2
From: Stefan Hajnoczi @ 2026-04-27 19:13 UTC
  To: Fiona Ebner
  Cc: open list:Network Block Dev..., QEMU Developers, Fam Zheng, Hanna Czenczek, Kevin Wolf

On Fri, Apr 24, 2026 at 12:25:41PM +0200, Fiona Ebner wrote:
> [0]:
> https://lore.kernel.org/io-uring/14bc6266-5bc9-4454-9518-d1016bfe417b@proxmox.com/T/

Hi Fiona,
Jens replied yesterday and confirmed your suspicion that the number of
in-flight requests is not being tracked correctly.

Is there still a problem after fixing the kernel's in-flight counting? If
not, then no QEMU change is necessary, and that seems like the cleanest
solution anyway. The kernel should know whether there is I/O in flight,
so it doesn't seem right that userspace needs to hint this.

Stefan
* Re: Excessive IO PSI for iothread when using io_uring since QEMU 10.2
From: Fiona Ebner @ 2026-04-28 12:10 UTC
  To: Stefan Hajnoczi
  Cc: open list:Network Block Dev..., QEMU Developers, Fam Zheng, Hanna Czenczek, Kevin Wolf, Thomas Lamprecht

Hi Stefan,

On 27.04.26 at 9:12 PM, Stefan Hajnoczi wrote:
> Is there still a problem after fixing the kernel's inflight counting? If
> not, then no QEMU change is necessary and that seems like the cleanest
> solution anyway. The kernel should know whether there is I/O in flight
> and so it doesn't seem right that userspace needs to hint this.

Unfortunately, yes. Even with the kernel fix [2], the real problem with
poll SQEs described above remains: I'm still seeing high IO pressure
stall values when using QEMU. In add_poll_add_sqe(), QEMU submits poll
SQEs for the AioHandler node fd, and that does count as pending IO. A
small reproducer modeling this is included below [3].

So the question from above remains: how to deal with this for block
drivers that do not go through file-posix.c?

Best Regards,
Fiona

[2]: https://lore.kernel.org/io-uring/b4d2aa36-8301-4e58-be3e-1451267b8c43@proxmox.com/T/

[3]:

#include <assert.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>
#include <liburing.h>
#include <sys/eventfd.h>

int main(void) {
    int fd;
    int ret;
    struct io_uring ring;
    struct io_uring_sqe *sqe;

    /* Nothing ever writes to this eventfd, so it never becomes readable. */
    fd = eventfd(0, 0);
    assert(fd >= 0);

    ret = io_uring_queue_init(128, &ring, 0);
    assert(ret == 0);

    sqe = io_uring_get_sqe(&ring);
    assert(sqe);

    /* Models the poll SQE that QEMU submits for an AioHandler node fd. */
    io_uring_prep_poll_add(sqe, fd, POLLIN);

    /* Waits for a completion that never arrives; watch IO pressure meanwhile. */
    ret = io_uring_submit_and_wait(&ring, 1);
    printf("got ret %d\n", ret);

    io_uring_queue_exit(&ring);
    close(fd);

    return 0;
}
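While a reproducer like the one in this message is blocked, the IO pressure values can be read from a PSI file such as /proc/pressure/io (or the io.pressure file of a cgroup). The following is only a minimal sketch of how such a file could be parsed; the names psi_line, psi_parse_line and psi_read_some are hypothetical helpers made up for this illustration and are not part of QEMU, liburing, or the thread's patches.

```c
#include <stdio.h>
#include <string.h>

/* One parsed line of a PSI file (hypothetical helper type). */
struct psi_line {
    char kind[8];                 /* "some" or "full" */
    double avg10, avg60, avg300;  /* stall percentages over 10s/60s/300s */
    unsigned long long total_us;  /* cumulative stall time in microseconds */
};

/* Parses one PSI line; returns 0 on success, -1 if the format does not match. */
int psi_parse_line(const char *line, struct psi_line *out)
{
    int n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
                   out->kind, &out->avg10, &out->avg60, &out->avg300,
                   &out->total_us);
    return n == 5 ? 0 : -1;
}

/* Reads the "some" line from the PSI file at path; returns 0 on success. */
int psi_read_some(const char *path, struct psi_line *out)
{
    char buf[256];
    int ret = -1;
    FILE *f = fopen(path, "r");

    if (!f) {
        return -1;
    }
    while (fgets(buf, sizeof(buf), f)) {
        if (psi_parse_line(buf, out) == 0 && strcmp(out->kind, "some") == 0) {
            ret = 0;
            break;
        }
    }
    fclose(f);
    return ret;
}
```

Calling psi_read_some("/proc/pressure/io", &p) once before and once while the reproducer is blocked, and comparing p.avg10, would be one way to see the difference on a system with PSI enabled.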
* Re: Excessive IO PSI for iothread when using io_uring since QEMU 10.2
From: Fiona Ebner @ 2026-04-28 13:31 UTC
  To: Stefan Hajnoczi
  Cc: open list:Network Block Dev..., QEMU Developers, Fam Zheng, Hanna Czenczek, Kevin Wolf, Thomas Lamprecht

On 28.04.26 at 2:09 PM, Fiona Ebner wrote:
> So the question from above, how to deal with this for block drivers not
> going through file-posix.c remains.

Or maybe there is no actual issue with such drivers. We would always use
the IORING_ENTER_NO_IOWAIT flag when only poll or timeout SQEs are
pending. If there is actual IO via io_uring, i.e. submitted via
luring_co_submit(), we would not set the IORING_ENTER_NO_IOWAIT flag. IO
submitted outside of io_uring will still be accounted for by the kernel,
just like it was before QEMU switched the iothread event loop to
io_uring. Or am I missing something there?
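The rule described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the downstream patch and not QEMU code: struct example_loop and the example_* functions are hypothetical, and EXAMPLE_ENTER_NO_IOWAIT is a placeholder standing in for IORING_ENTER_NO_IOWAIT, whose real value comes from <linux/io_uring.h> on kernels that support it.

```c
#include <stdbool.h>

/* Placeholder for IORING_ENTER_NO_IOWAIT; not the real constant value. */
#define EXAMPLE_ENTER_NO_IOWAIT (1u << 0)

/* Hypothetical per-event-loop bookkeeping (not QEMU's actual structures). */
struct example_loop {
    unsigned io_in_flight;      /* real IO requests submitted via the ring */
    bool kernel_has_no_iowait;  /* whether the kernel supports the flag */
};

/* Called when real IO (as opposed to a poll or timeout SQE) is submitted. */
void example_io_submitted(struct example_loop *l)
{
    l->io_in_flight++;
}

/* Called when a real IO request completes. */
void example_io_completed(struct example_loop *l)
{
    if (l->io_in_flight > 0) {
        l->io_in_flight--;
    }
}

/*
 * Flags to pass when entering the kernel to wait: opt out of iowait
 * accounting only while nothing but poll/timeout SQEs is pending.
 */
unsigned example_enter_flags(const struct example_loop *l)
{
    if (l->kernel_has_no_iowait && l->io_in_flight == 0) {
        return EXAMPLE_ENTER_NO_IOWAIT;
    }
    return 0;
}
```

With this shape, a driver that never submits IO through the ring never changes the counter, so its waits always opt out, which matches the reasoning that such IO is accounted for elsewhere.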
* Re: Excessive IO PSI for iothread when using io_uring since QEMU 10.2
From: Stefan Hajnoczi @ 2026-04-28 16:19 UTC
  To: Fiona Ebner
  Cc: open list:Network Block Dev..., QEMU Developers, Fam Zheng, Hanna Czenczek, Kevin Wolf, Thomas Lamprecht

On Tue, Apr 28, 2026 at 02:10:02PM +0200, Fiona Ebner wrote:
> unfortunately, yes. Even with the kernel fix [2], the real problem with
> poll SQEs described above remains. I'm still seeing high IO pressure
> stall values when using QEMU. In add_poll_add_sqe(), QEMU submits poll
> SQEs for the AioHandler node fd, and that does count as pending IO. A
> small reproducer modeling this [3].

Does the kernel account POLL_ADD SQEs as blocking I/O activity?

That behavior would be inconsistent if the select(2)/poll(2)/epoll_wait(2)
syscalls do not count as blocking I/O activity. The kernel io_uring code
should account them correctly and not rely on a userspace hint.

Stefan
* Re: Excessive IO PSI for iothread when using io_uring since QEMU 10.2
From: Fiona Ebner @ 2026-04-29 8:00 UTC
  To: Stefan Hajnoczi
  Cc: open list:Network Block Dev..., QEMU Developers, Fam Zheng, Hanna Czenczek, Kevin Wolf, Thomas Lamprecht, Jens Axboe

On 28.04.26 at 6:18 PM, Stefan Hajnoczi wrote:
> Does the kernel account POLL_ADD SQEs as blocking I/O activity?

Apparently yes. See the C program [3] in my previous mail.

> That behavior is inconsistent if select(2)/poll(2)/epoll_wait(2)
> syscalls do not count as blocking I/O activity. The kernel io_uring code
> should account them correctly and not rely on a userspace hint.

@Jens Axboe: should there be a separate internal counter for poll/timeout
SQEs, so that they do not count towards IO wait by default?
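The idea of a separate counter for poll/timeout SQEs can be pictured with a small userspace toy model. This is an assumption-laden sketch only: it is not kernel code, it does not describe how io_uring tracks in-flight requests today, and every name in it (example_sqe_class, example_inflight, the example_account_* functions) is hypothetical.

```c
#include <stdbool.h>

/* Hypothetical classification of submitted requests. */
enum example_sqe_class {
    EXAMPLE_SQE_WAIT_ONLY,  /* poll or timeout: waiting for an event */
    EXAMPLE_SQE_IO          /* e.g. read/write: actual IO */
};

/* Two separate in-flight counters instead of a single one. */
struct example_inflight {
    unsigned wait_only;
    unsigned io;
};

/* Account a newly submitted request in the counter for its class. */
void example_account_submit(struct example_inflight *c,
                            enum example_sqe_class cls)
{
    if (cls == EXAMPLE_SQE_IO) {
        c->io++;
    } else {
        c->wait_only++;
    }
}

/* Account a completed request; counters never drop below zero. */
void example_account_complete(struct example_inflight *c,
                              enum example_sqe_class cls)
{
    unsigned *counter = (cls == EXAMPLE_SQE_IO) ? &c->io : &c->wait_only;

    if (*counter > 0) {
        (*counter)--;
    }
}

/* A wait would count as IO wait only while actual IO is in flight. */
bool example_counts_as_iowait(const struct example_inflight *c)
{
    return c->io > 0;
}
```

In this model an event loop that only has poll or timeout requests pending would not contribute to IO wait, without userspace having to pass a hint.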
* Re: Excessive IO PSI for iothread when using io_uring since QEMU 10.2
From: Stefan Hajnoczi @ 2026-04-29 12:20 UTC
  To: Fiona Ebner
  Cc: open list:Network Block Dev..., QEMU Developers, Fam Zheng, Hanna Czenczek, Kevin Wolf, Thomas Lamprecht, Jens Axboe

On Wed, Apr 29, 2026 at 10:00:34AM +0200, Fiona Ebner wrote:
> @Jens Axboe: should there be a separate internal counter for
> poll/timeout SQEs and have them not count towards IO wait by default?

I wanted to add more nuance to what I wrote:

As a baseline, io_uring should account IO activity in the same way as the
traditional syscalls for those operations.

However, it does seem like userspace hints can be useful in some cases.
For example, if a server process is reading from a socket/eventfd/pipe,
waiting for an incoming request, then it is not stalled by IO. However, if
the same process makes a request to another process and is reading from a
socket/eventfd/pipe, waiting for the response, then it may indeed be
considered as waiting for IO. In other words, whether a read means the
process is stalled waiting for IO depends on the application, and the
kernel doesn't know that. Userspace hints make sense in such cases.

I just think that in the case at hand, io_uring isn't following the IO
pressure stall accounting of the equivalent traditional system calls, and
that seems like a gap that should be fixed in the kernel rather than in
userspace.

Stefan
* Re: Excessive IO PSI for iothread when using io_uring since QEMU 10.2
From: Stefan Hajnoczi @ 2026-06-01 17:20 UTC
  To: Fiona Ebner
  Cc: open list:Network Block Dev..., QEMU Developers, Fam Zheng, Hanna Czenczek, Kevin Wolf, Thomas Lamprecht, Jens Axboe

On Wed, Apr 29, 2026 at 10:00:34AM +0200, Fiona Ebner wrote:
> @Jens Axboe: should there be a separate internal counter for
> poll/timeout SQEs and have them not count towards IO wait by default?

Hi Fiona,
Any update on this issue? Was it resolved in io_uring, or is a QEMU patch
still needed?

Thanks,
Stefan
* Re: Excessive IO PSI for iothread when using io_uring since QEMU 10.2
From: Fiona Ebner @ 2026-06-02 8:41 UTC
  To: Stefan Hajnoczi
  Cc: open list:Network Block Dev..., QEMU Developers, Fam Zheng, Hanna Czenczek, Kevin Wolf, Thomas Lamprecht, Jens Axboe

On 01.06.26 at 7:20 PM, Stefan Hajnoczi wrote:
>>>>>> >>>>>> Or should the flag really always be used (if supported by the kernel)? >>>>>> Is there a way to tell io_uring/kernel that we are an event loop and our >>>>>> waiting should only be accounted for when there is actual IO in-flight? >>>>>> >>>>>> Happy to hear your opinions and suggestions! >>>>>> >>>>>> [0]: >>>>>> https://lore.kernel.org/io-uring/14bc6266-5bc9-4454-9518-d1016bfe417b@proxmox.com/T/ >>>>> >>>>> Hi Fiona, >>>>> Jens replied yesterday confirmed your suspicion that the number of >>>>> inflight requests is not being tracked correctly. >>>>> >>>>> Is there still a problem after fixing the kernel's inflight counting? If >>>>> not, then no QEMU change is necessary and that seems like the cleanest >>>>> solution anyway. The kernel should know whether there is I/O in flight >>>>> and so it doesn't seem right that userspace needs to hint this. >>>> >>>> >>>> unfortunately, yes. Even with the kernel fix [2], the real problem with >>>> poll SQEs described above remains. I'm still seeing high IO pressure >>>> stall values when using QEMU. In add_poll_add_sqe(), QEMU submits poll >>>> SQEs for the AioHandler node fd, and that does count as pending IO. A >>>> small reproducer modeling this [3]. >>> >>> Does the kernel account POLL_ADD SQEs as blocking I/O activity? >> >> Apparently yes. See the C program below [3]. >> >>> That behavior is inconsistent if select(2)/poll(2)/epoll_wait(2) >>> syscalls do not count as blocking I/O activity. The kernel io_uring code >>> should account them correctly and not rely on a userspace hint. >> >> @Jens Axboe: should there be a separate internal counter for >> poll/timeout SQEs and have them not count towards IO wait by default? > > Hi Fiona, > Any update on this issue? Was it resolved in io_uring or is a QEMU patch > still needed? Hi Stefan, I did not proceed with the above, since I did not get an ack from Jens regarding the suggested approach. 
We needed to go ahead with a release downstream, so for the meantime, we applied a workaround by Thomas with setting the IORING_ENTER_NO_IOWAIT flag when there is no actual IO in-flight [0]. Should it be submitted to qemu-devel too? [0]: https://git.proxmox.com/?p=pve-qemu.git;a=commitdiff;h=775e41b890a645db75119233fe2b21f139bf8e4f Best Regards, Fiona ^ permalink raw reply [flat|nested] 10+ messages in thread
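The workaround idea mentioned in the message above (pass
IORING_ENTER_NO_IOWAIT only while no actual IO is in flight) might be
sketched roughly as follows. This is not the code from the linked pve-qemu
commit. The struct, the counter, the helper functions and the SKETCH_*
constants are all hypothetical names introduced for illustration; real code
would take the flag values from <linux/io_uring.h> and probe kernel support
when the ring is set up.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the real io_uring_enter() flags. Real code
 * would use IORING_ENTER_GETEVENTS and IORING_ENTER_NO_IOWAIT from
 * <linux/io_uring.h>; these placeholder values are NOT the kernel's. */
#define SKETCH_ENTER_GETEVENTS (1u << 0)
#define SKETCH_ENTER_NO_IOWAIT (1u << 1)

/* Hypothetical per-event-loop state, not an existing QEMU structure. */
struct sketch_loop {
    bool kernel_has_no_iowait; /* assumed to be probed once at ring setup */
    unsigned int inflight_io;  /* read/write style requests, not poll SQEs */
};

/* Call when a real IO request (not a poll SQE) is queued on the ring. */
static void sketch_io_submitted(struct sketch_loop *l)
{
    l->inflight_io++;
}

/* Call when the completion for such a request has been processed. */
static void sketch_io_completed(struct sketch_loop *l)
{
    if (l->inflight_io > 0) {
        l->inflight_io--;
    }
}

/* Choose the flags for the blocking wait: opt out of iowait accounting
 * only when the kernel supports it and nothing but poll SQEs is pending. */
static unsigned int sketch_enter_flags(const struct sketch_loop *l)
{
    unsigned int flags = SKETCH_ENTER_GETEVENTS;

    if (l->kernel_has_no_iowait && l->inflight_io == 0) {
        flags |= SKETCH_ENTER_NO_IOWAIT;
    }
    return flags;
}
```

As pointed out earlier in the thread, a counter like this only sees requests
that are submitted through the ring itself, as block/file-posix.c does. For
block/rbd.c, where only a poll SQE for the AioHandler node's fd is used, it
cannot tell whether RBD IO is in flight.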
* Re: Excessive IO PSI for iothread when using io_uring since QEMU 10.2
  2026-06-02 8:41 ` Fiona Ebner
@ 2026-06-02 12:08 ` Stefan Hajnoczi
  0 siblings, 0 replies; 10+ messages in thread
From: Stefan Hajnoczi @ 2026-06-02 12:08 UTC (permalink / raw)
  To: Jens Axboe
  Cc: open list:Network Block Dev..., QEMU Developers, Fam Zheng,
	Hanna Czenczek, Kevin Wolf, Thomas Lamprecht, Fiona Ebner

[-- Attachment #1: Type: text/plain, Size: 4437 bytes --]

On Tue, Jun 02, 2026 at 10:41:11AM +0200, Fiona Ebner wrote:
> On 01.06.26 at 7:20 PM, Stefan Hajnoczi wrote:
> > On Wed, Apr 29, 2026 at 10:00:34AM +0200, Fiona Ebner wrote:
> >> On 28.04.26 at 6:18 PM, Stefan Hajnoczi wrote:
> >>> On Tue, Apr 28, 2026 at 02:10:02PM +0200, Fiona Ebner wrote:
> >>>> Hi Stefan,
> >>>>
> >>>> On 27.04.26 at 9:12 PM, Stefan Hajnoczi wrote:
> >>>>> On Fri, Apr 24, 2026 at 12:25:41PM +0200, Fiona Ebner wrote:
> >>>>>> Dear maintainers,
> >>>>>>
> >>>>>> since QEMU 10.2, if io_uring is enabled, it will be used for the event
> >>>>>> loop of iothreads and this causes an IO pressure stall value of nearly
> >>>>>> 100 when idle.
> >>>>>>
> >>>>>> The issue was also reported on the kernel mailing list [0]. The
> >>>>>> suggestion from Jens Axboe was to just turn off the iowait accounting
> >>>>>> completely. But since (for block/file-posix.c), there is actual IO
> >>>>>> submitted via the same ring, I wasn't sure if that is the right approach.
> >>>>>>
> >>>>>> So the idea was to keep track of whether the event loop is otherwise
> >>>>>> idle and only use the IORING_ENTER_NO_IOWAIT flag in that case [1].
> >>>>>>
> >>>>>> However, doing so would only help for block/file-posix.c, which submits
> >>>>>> IO via luring_co_submit() -> fdmon_io_uring_add_sqe(). For example, for
> >>>>>> block/rbd.c, only a poll SQE for the AioHandler node's fd is used. When
> >>>>>> submitting that poll SQE in the iothread, we would need to be able to
> >>>>>> know if IO for RBD is currently in-flight or not to be able to decide
> >>>>>> whether to use the IORING_ENTER_NO_IOWAIT flag or not. Is there a good
> >>>>>> way to do this (in a general way)?
> >>>>>>
> >>>>>> Or should the flag really always be used (if supported by the kernel)?
> >>>>>> Is there a way to tell io_uring/kernel that we are an event loop and our
> >>>>>> waiting should only be accounted for when there is actual IO in-flight?
> >>>>>>
> >>>>>> Happy to hear your opinions and suggestions!
> >>>>>>
> >>>>>> [0]:
> >>>>>> https://lore.kernel.org/io-uring/14bc6266-5bc9-4454-9518-d1016bfe417b@proxmox.com/T/
> >>>>>
> >>>>> Hi Fiona,
> >>>>> Jens replied yesterday confirmed your suspicion that the number of
> >>>>> inflight requests is not being tracked correctly.
> >>>>>
> >>>>> Is there still a problem after fixing the kernel's inflight counting? If
> >>>>> not, then no QEMU change is necessary and that seems like the cleanest
> >>>>> solution anyway. The kernel should know whether there is I/O in flight
> >>>>> and so it doesn't seem right that userspace needs to hint this.
> >>>>
> >>>>
> >>>> unfortunately, yes. Even with the kernel fix [2], the real problem with
> >>>> poll SQEs described above remains. I'm still seeing high IO pressure
> >>>> stall values when using QEMU. In add_poll_add_sqe(), QEMU submits poll
> >>>> SQEs for the AioHandler node fd, and that does count as pending IO. A
> >>>> small reproducer modeling this [3].
> >>>
> >>> Does the kernel account POLL_ADD SQEs as blocking I/O activity?
> >>
> >> Apparently yes. See the C program below [3].
> >>
> >>> That behavior is inconsistent if select(2)/poll(2)/epoll_wait(2)
> >>> syscalls do not count as blocking I/O activity. The kernel io_uring code
> >>> should account them correctly and not rely on a userspace hint.
> >>
> >> @Jens Axboe: should there be a separate internal counter for
> >> poll/timeout SQEs and have them not count towards IO wait by default?
> >
> > Hi Fiona,
> > Any update on this issue? Was it resolved in io_uring or is a QEMU patch
> > still needed?
>
> Hi Stefan,
>
> I did not proceed with the above, since I did not get an ack from Jens
> regarding the suggested approach.
>
> We needed to go ahead with a release downstream, so for the meantime, we
> applied a workaround by Thomas with setting the IORING_ENTER_NO_IOWAIT
> flag when there is no actual IO in-flight [0]. Should it be submitted to
> qemu-devel too?

Pinging Jens: io_uring accounts POLL_ADD SQEs as blocking I/O activity
whereas select(2)/poll(2)/epoll_wait(2) do not. Would it make sense to
follow the same accounting as the syscalls for this operation since that
is probably expected?

Thanks,
Stefan

>
> [0]:
> https://git.proxmox.com/?p=pve-qemu.git;a=commitdiff;h=775e41b890a645db75119233fe2b21f139bf8e4f
>
> Best Regards,
> Fiona
>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2026-06-02 12:09 UTC | newest]

Thread overview: 10+ messages
2026-04-24 10:25 Excessive IO PSI for iothread when using io_uring since QEMU 10.2 Fiona Ebner
2026-04-27 19:13 ` Stefan Hajnoczi
2026-04-28 12:10 ` Fiona Ebner
2026-04-28 13:31 ` Fiona Ebner
2026-04-28 16:19 ` Stefan Hajnoczi
2026-04-29 8:00 ` Fiona Ebner
2026-04-29 12:20 ` Stefan Hajnoczi
2026-06-01 17:20 ` Stefan Hajnoczi
2026-06-02 8:41 ` Fiona Ebner
2026-06-02 12:08 ` Stefan Hajnoczi