Date: Wed, 29 Apr 2026 08:20:02 -0400
From: Stefan Hajnoczi
To: Fiona Ebner
Cc: "open list:Network Block Dev...", QEMU Developers, Fam Zheng,
    Hanna Czenczek, Kevin Wolf, Thomas Lamprecht, Jens Axboe
Subject: Re: Excessive IO PSI for iothread when using io_uring since QEMU 10.2
Message-ID: <20260429122002.GA317605@fedora>
In-Reply-To: <9b11dec1-f27c-4c69-8dfc-34ef0bdc7e8f@proxmox.com>
References: <017dc767-90e3-4983-8417-e541b3fb04f6@proxmox.com>
    <20260427191343.GD218226@fedora>
    <317ab384-f8c2-49af-89c4-407bb2f5617c@proxmox.com>
    <20260428161927.GB278591@fedora>
    <9b11dec1-f27c-4c69-8dfc-34ef0bdc7e8f@proxmox.com>
List-Id: qemu development

On Wed, Apr 29, 2026 at 10:00:34AM +0200, Fiona Ebner wrote:
> On 28.04.26 at 6:18 PM, Stefan Hajnoczi wrote:
> > On Tue, Apr 28, 2026 at 02:10:02PM +0200, Fiona Ebner wrote:
> >> Hi Stefan,
> >>
> >> On 27.04.26 at 9:12 PM, Stefan Hajnoczi wrote:
> >>> On Fri, Apr 24, 2026 at 12:25:41PM +0200, Fiona Ebner wrote:
> >>>> Dear maintainers,
> >>>>
> >>>> since QEMU 10.2, if io_uring is enabled, it will be used for the event
> >>>> loop of iothreads and this causes an IO pressure stall value of nearly
> >>>> 100 when idle.
> >>>>
> >>>> The issue was also reported on the kernel mailing list [0]. The
> >>>> suggestion from Jens Axboe was to just turn off the iowait accounting
> >>>> completely. But since (for block/file-posix.c), there is actual IO
> >>>> submitted via the same ring, I wasn't sure if that is the right approach.
> >>>>
> >>>> So the idea was to keep track of whether the event loop is otherwise
> >>>> idle and only use the IORING_ENTER_NO_IOWAIT flag in that case [1].
> >>>>
> >>>> However, doing so would only help for block/file-posix.c, which submits
> >>>> IO via luring_co_submit() -> fdmon_io_uring_add_sqe(). For example, for
> >>>> block/rbd.c, only a poll SQE for the AioHandler node's fd is used. When
> >>>> submitting that poll SQE in the iothread, we would need to be able to
> >>>> know if IO for RBD is currently in-flight or not to be able to decide
> >>>> whether to use the IORING_ENTER_NO_IOWAIT flag or not. Is there a good
> >>>> way to do this (in a general way)?
> >>>>
> >>>> Or should the flag really always be used (if supported by the kernel)?
> >>>> Is there a way to tell io_uring/kernel that we are an event loop and our
> >>>> waiting should only be accounted for when there is actual IO in-flight?
> >>>>
> >>>> Happy to hear your opinions and suggestions!
> >>>>
> >>>> [0]:
> >>>> https://lore.kernel.org/io-uring/14bc6266-5bc9-4454-9518-d1016bfe417b@proxmox.com/T/
> >>>
> >>> Hi Fiona,
> >>> Jens replied yesterday and confirmed your suspicion that the number of
> >>> inflight requests is not being tracked correctly.
> >>>
> >>> Is there still a problem after fixing the kernel's inflight counting? If
> >>> not, then no QEMU change is necessary and that seems like the cleanest
> >>> solution anyway. The kernel should know whether there is I/O in flight
> >>> and so it doesn't seem right that userspace needs to hint this.
> >>
> >>
> >> unfortunately, yes. Even with the kernel fix [2], the real problem with
> >> poll SQEs described above remains. I'm still seeing high IO pressure
> >> stall values when using QEMU. In add_poll_add_sqe(), QEMU submits poll
> >> SQEs for the AioHandler node fd, and that does count as pending IO. A
> >> small reproducer modeling this [3].
> >
> > Does the kernel account POLL_ADD SQEs as blocking I/O activity?
>
> Apparently yes. See the C program below [3].
>
> > That behavior is inconsistent if select(2)/poll(2)/epoll_wait(2)
> > syscalls do not count as blocking I/O activity. The kernel io_uring code
> > should account them correctly and not rely on a userspace hint.
>
> @Jens Axboe: should there be a separate internal counter for
> poll/timeout SQEs and have them not count towards IO wait by default?

I wanted to add more nuance to what I wrote:

As a baseline, io_uring should account IO activity in the same way as
the traditional syscalls for those operations. However, it does seem
like userspace hints can be useful in some cases.
For example, if a server process is reading from a socket/eventfd/pipe
waiting for an incoming request, then it is not stalled by IO. However,
if the same process makes a request to another process and is reading a
socket/eventfd/pipe waiting for the response, then it may indeed be
considered as waiting for IO. In other words, whether a read means the
process is stalled waiting for IO depends on the application, and the
kernel doesn't know that. Userspace hints make sense in that situation.

I just think that in the case at hand io_uring isn't following the IO
pressure stall accounting of the equivalent traditional system calls,
and that seems like a gap that should be fixed in the kernel rather
than userspace.

Stefan

> >
> > Stefan
> >
> >>
> >> So the question from above, how to deal with this for block drivers not
> >> going through file-posix.c remains.
> >>
> >> Best Regards,
> >> Fiona
> >>
> >> [2]:
> >> https://lore.kernel.org/io-uring/b4d2aa36-8301-4e58-be3e-1451267b8c43@proxmox.com/T/
> >>
> >> [3]:
> >>
> >> #include <assert.h>
> >> #include <liburing.h>
> >> #include <stdio.h>
> >> #include <sys/eventfd.h>
> >>
> >> int main(void) {
> >>     int fd;
> >>     int ret;
> >>     struct io_uring ring;
> >>     struct io_uring_sqe *sqe;
> >>
> >>     fd = eventfd(0, 0);
> >>     assert(fd >= 0);
> >>
> >>     ret = io_uring_queue_init(128, &ring, 0);
> >>     assert(ret == 0);
> >>
> >>     sqe = io_uring_get_sqe(&ring);
> >>     assert(sqe);
> >>
> >>     io_uring_prep_poll_add(sqe, fd, 1);
> >>
> >>     ret = io_uring_submit_and_wait(&ring, 1);
> >>     printf("got ret %d\n", ret);
> >>
> >>     io_uring_queue_exit(&ring);
> >>
> >>     return 0;
> >> }
> >>
> >>
> >
>
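
For comparison with the reproducer [3], here is a minimal sketch of the
traditional-syscall baseline. It waits on an idle eventfd with poll(2)
instead of a POLL_ADD SQE, and it reads the "some avg10" value from a PSI
file such as /proc/pressure/io, so both wait paths can be observed on the
same kernel. The helper names psi_some_avg10() and idle_poll_eventfd() are
hypothetical and not part of QEMU or liburing. The sketch assumes Linux
with PSI enabled and makes no claim about what a given kernel reports.

```c
#include <assert.h>
#include <poll.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Hypothetical helper: return the avg10 value from the "some" line of a
 * PSI file (for example /proc/pressure/io), or -1.0 if the file cannot
 * be opened or contains no such line.
 */
double psi_some_avg10(const char *path)
{
    char line[256];
    double avg10 = -1.0;
    FILE *f = fopen(path, "r");

    if (!f) {
        return -1.0;
    }
    while (fgets(line, sizeof(line), f)) {
        /* Only a line starting with "some avg10=" matches this format. */
        if (sscanf(line, "some avg10=%lf", &avg10) == 1) {
            break;
        }
    }
    fclose(f);
    return avg10;
}

/*
 * Hypothetical helper: wait on an eventfd that nobody signals, using
 * poll(2). Returns poll()'s result (0 on timeout) or -1 if eventfd()
 * fails. A non-negative timeout is required so the call cannot block
 * forever.
 */
int idle_poll_eventfd(int timeout_ms)
{
    struct pollfd pfd;
    int ret;
    int fd;

    assert(timeout_ms >= 0);

    fd = eventfd(0, 0);
    if (fd < 0) {
        return -1;
    }

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    ret = poll(&pfd, 1, timeout_ms);
    close(fd);
    return ret;
}
```

One way to use it: sample psi_some_avg10("/proc/pressure/io") before and
after a period of repeated idle_poll_eventfd() calls, then repeat the
measurement with the io_uring reproducer [3] running instead. Note that
/proc/pressure/io is system-wide, so other IO on the machine affects the
reading.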