From: Luis Henriques <luis@igalia.com>
To: Bernd Schubert via B4 Relay <devnull+bernd.bsbernd.com@kernel.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>,
bernd@bsbernd.com, Joanne Koong <joannelkoong@gmail.com>,
linux-fsdevel@vger.kernel.org, Gang He <dchg2000@gmail.com>,
Bernd Schubert <bschubert@ddn.com>
Subject: Re: [PATCH v4 6/8] fuse: {io-uring} Queue background requests on a different core
Date: Fri, 24 Apr 2026 16:26:22 +0100 [thread overview]
Message-ID: <87v7dgmlb5.fsf@igalia.com> (raw)
In-Reply-To: <20260413-reduced-nr-ring-queues_3-v4-6-982b6414b723@bsbernd.com> (Bernd Schubert via's message of "Mon, 13 Apr 2026 11:41:29 +0200")
On Mon, Apr 13 2026, Bernd Schubert via B4 Relay wrote:
> From: Bernd Schubert <bschubert@ddn.com>
>
> Running background IO on a different core makes quite a difference.
>
> fio --directory=/tmp/dest --name=iops.\$jobnum --rw=randread \
>     --bs=4k --size=1G --numjobs=1 --iodepth=4 --time_based \
>     --runtime=30s --group_reporting --ioengine=io_uring \
>     --direct=1
>
> unpatched
> READ: bw=272MiB/s (285MB/s) ...
> patched
> READ: bw=650MiB/s (682MB/s)
>
> The reason is easily visible: the fio process migrates between CPUs
> when requests are submitted to the queue for the same core.
>
> With --iodepth=8
>
> unpatched
> READ: bw=466MiB/s (489MB/s)
> patched
> READ: bw=641MiB/s (672MB/s)
>
> Without io-uring (--iodepth=8)
> READ: bw=729MiB/s (764MB/s)
>
> Without fuse (--iodepth=8)
> READ: bw=2199MiB/s (2306MB/s)
>
> (Test were done with
> <libfuse>/example/passthrough_hp -o allow_other --nopassthrough \
> [-o io_uring] /tmp/source /tmp/dest
> )
>
> Additional notes:
>
> With FURING_NEXT_QUEUE_RETRIES=0 (--iodepth=8)
> READ: bw=903MiB/s (946MB/s)
>
> With just a random qid (--iodepth=8)
> READ: bw=429MiB/s (450MB/s)
>
> With --iodepth=1
> unpatched
> READ: bw=195MiB/s (204MB/s)
> patched
> READ: bw=232MiB/s (243MB/s)
>
> With --iodepth=1 --numjobs=2
> unpatched
> READ: bw=366MiB/s (384MB/s)
> patched
> READ: bw=472MiB/s (495MB/s)
>
> With --iodepth=1 --numjobs=8
> unpatched
> READ: bw=1437MiB/s (1507MB/s)
> patched
> READ: bw=1529MiB/s (1603MB/s)
> fuse without io-uring
> READ: bw=1314MiB/s (1378MB/s), 1314MiB/s-1314MiB/s ...
> no-fuse
> READ: bw=2566MiB/s (2690MB/s), 2566MiB/s-2566MiB/s ...
>
> In summary, for async requests the core doing application IO is busy
> sending requests, so processing IOs should be done on a different core.
> Spreading the load across random cores is also not desirable, as those
> cores might be frequency scaled down and/or in C1 sleep states. Not
> shown here, but differences are much smaller when the system uses the
> performance governor instead of schedutil (the ubuntu default) -
> obviously at the cost of higher system power consumption with the
> performance governor, which is not desirable either.
>
> Results without io-uring (which uses a fixed number of libfuse threads
> per queue) heavily depend on the current number of active threads.
> Libfuse defaults to a maximum of 10 threads, but the actual maximum is
> a parameter. The fuse-without-io-uring results also depend heavily on
> whether another workload was already running, as libfuse starts these
> threads dynamically - i.e. the more threads are active, the worse the
> performance.
>
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> ---
> fs/fuse/dev_uring.c | 14 +++++++++++---
> 1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index e68089babaf89fb81741e4a5e605c6e36a137f9e..ed061e239b8ed70ff36deb51dd6957fe1704ec87 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -1306,13 +1306,21 @@ static void fuse_uring_send_in_task(struct io_tw_req tw_req, io_tw_token_t tw)
> fuse_uring_send(ent, cmd, err, issue_flags);
> }
>
> -static struct fuse_ring_queue *fuse_uring_select_queue(struct fuse_ring *ring)
> +static struct fuse_ring_queue *fuse_uring_select_queue(struct fuse_ring *ring,
> + bool background)
> {
> unsigned int qid;
> int node;
> unsigned int nr_queues;
> unsigned int cpu = task_cpu(current);
>
> + /*
> + * Background requests result in better performance on a different
> + * CPU, unless CPUs are already busy.
> + */
> + if (background)
> + cpu++;
> +
The performance numbers look great, but I was wondering whether you see
similar improvements for write operations.
Also, isn't 'cpu++' too arbitrary? I mean, isn't there some heuristic
that could be used? I understand the goal is just to push the request
somewhere else, but does it make sense to push it to the next CPU on the
same node? Or to the next CPU on a different core? I'm just thinking out
loud, and maybe this is nonsense ;-)
Finally, shouldn't this behaviour be behind some knob? Maybe that's
over-complicating things for no good reason, but it would allow users to:
1) enable/disable it, 2) enable it by pushing to the next CPU (this
patch's behaviour), 3) enable it by pushing to the next CPU on the
same/different node, etc.
Cheers,
--
Luís
> cpu = cpu % ring->max_nr_queues;
>
> /* numa local registered queue bitmap */
> @@ -1356,7 +1364,7 @@ void fuse_uring_queue_fuse_req(struct fuse_iqueue *fiq, struct fuse_req *req)
> int err;
>
> err = -EINVAL;
> - queue = fuse_uring_select_queue(ring);
> + queue = fuse_uring_select_queue(ring, false);
> if (!queue)
> goto err;
>
> @@ -1400,7 +1408,7 @@ bool fuse_uring_queue_bq_req(struct fuse_req *req)
> struct fuse_ring_queue *queue;
> struct fuse_ring_ent *ent = NULL;
>
> - queue = fuse_uring_select_queue(ring);
> + queue = fuse_uring_select_queue(ring, true);
> if (!queue)
> return false;
>
>
> --
> 2.43.0
>
>
2026-04-13 9:41 [PATCH v4 0/8] fuse: {io-uring} Allow to reduce the number of queues and request distribution Bernd Schubert via B4 Relay
2026-04-13 9:41 ` [PATCH v4 1/8] fuse: {io-uring} Add queue length counters Bernd Schubert via B4 Relay
2026-04-13 9:41 ` [PATCH v4 2/8] fuse: {io-uring} Rename ring->nr_queues to max_nr_queues Bernd Schubert via B4 Relay
2026-04-13 9:41 ` [PATCH v4 3/8] fuse: {io-uring} Use bitmaps to track registered queues Bernd Schubert via B4 Relay
2026-04-24 15:04 ` Luis Henriques
2026-04-24 15:33 ` Bernd Schubert
2026-04-13 9:41 ` [PATCH v4 4/8] fuse: Fetch a queued fuse request on command registration Bernd Schubert via B4 Relay
2026-04-13 9:41 ` [PATCH v4 5/8] fuse: {io-uring} Allow reduced number of ring queues Bernd Schubert via B4 Relay
2026-04-24 15:15 ` Luis Henriques
2026-04-24 18:28 ` Joanne Koong
2026-04-24 22:00 ` Bernd Schubert
2026-04-13 9:41 ` [PATCH v4 6/8] fuse: {io-uring} Queue background requests on a different core Bernd Schubert via B4 Relay
2026-04-24 15:26 ` Luis Henriques [this message]
2026-04-13 9:41 ` [PATCH v4 7/8] fuse: Add retry attempts for numa local queues for load distribution Bernd Schubert via B4 Relay
2026-04-24 15:28 ` Luis Henriques
2026-04-13 9:41 ` [PATCH v4 8/8] fuse: {io-uring} Prefer the current core over mapping Bernd Schubert via B4 Relay