Re: [PATCH v3 6/6] fuse: {io-uring} Queue background requests on a different core

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Luis Henriques <luis@igalia.com>
To: Bernd Schubert <bschubert@ddn.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>,
	 Joanne Koong <joannelkoong@gmail.com>,
	 "linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	 Gang He <dchg2000@gmail.com>
Subject: Re: [PATCH v3 6/6] fuse: {io-uring} Queue background requests on a different core
Date: Wed, 15 Oct 2025 12:05:10 +0100	[thread overview]
Message-ID: <87frbk8ly1.fsf@igalia.com> (raw)
In-Reply-To: <5e3b8848-2049-4321-82e7-2dec658d6936@ddn.com> (Bernd Schubert's message of "Wed, 15 Oct 2025 10:27:44 +0000")

On Wed, Oct 15 2025, Bernd Schubert wrote:

> On 10/15/25 11:50, Luis Henriques wrote:
>> On Mon, Oct 13 2025, Bernd Schubert wrote:
>> 
>>> Running background IO on a different core makes quite a difference.
>>>
>>> fio --directory=/tmp/dest --name=iops.\$jobnum --rw=randread \
>>> --bs=4k --size=1G --numjobs=1 --iodepth=4 --time_based\
>>> --runtime=30s --group_reporting --ioengine=io_uring\
>>>  --direct=1
>>>
>>> unpatched
>>>    READ: bw=272MiB/s (285MB/s), 272MiB/s-272MiB/s ...
>>> patched
>>>    READ: bw=760MiB/s (797MB/s), 760MiB/s-760MiB/s ...
>>>
>>> With --iodepth=8
>>>
>>> unpatched
>>>    READ: bw=466MiB/s (489MB/s), 466MiB/s-466MiB/s ...
>>> patched
>>>    READ: bw=966MiB/s (1013MB/s), 966MiB/s-966MiB/s ...
>>> 2nd run:
>>>    READ: bw=1014MiB/s (1064MB/s), 1014MiB/s-1014MiB/s ...
>>>
>>> Without io-uring (--iodepth=8)
>>>    READ: bw=729MiB/s (764MB/s), 729MiB/s-729MiB/s ...
>>>
>>> Without fuse (--iodepth=8)
>>>    READ: bw=2199MiB/s (2306MB/s), 2199MiB/s-2199MiB/s ...
>>>
>>> (Test were done with
>>> <libfuse>/example/passthrough_hp -o allow_other --nopassthrough  \
>>> [-o io_uring] /tmp/source /tmp/dest
>>> )
>>>
>>> Additional notes:
>>>
>>> With FURING_NEXT_QUEUE_RETRIES=0 (--iodepth=8)
>>>    READ: bw=903MiB/s (946MB/s), 903MiB/s-903MiB/s ...
>>>
>>> With just a random qid (--iodepth=8)
>>>    READ: bw=429MiB/s (450MB/s), 429MiB/s-429MiB/s ...
>>>
>>> With --iodepth=1
>>> unpatched
>>>    READ: bw=195MiB/s (204MB/s), 195MiB/s-195MiB/s ...
>>> patched
>>>    READ: bw=232MiB/s (243MB/s), 232MiB/s-232MiB/s ...
>>>
>>> With --iodepth=1 --numjobs=2
>>> unpatched
>>>    READ: bw=966MiB/s (1013MB/s), 966MiB/s-966MiB/s ...
>>> patched
>>>    READ: bw=1821MiB/s (1909MB/s), 1821MiB/s-1821MiB/s ...
>>>
>>> With --iodepth=1 --numjobs=8
>>> unpatched
>>>    READ: bw=1138MiB/s (1193MB/s), 1138MiB/s-1138MiB/s ...
>>> patched
>>>    READ: bw=1650MiB/s (1730MB/s), 1650MiB/s-1650MiB/s ...
>>> fuse without io-uring
>>>    READ: bw=1314MiB/s (1378MB/s), 1314MiB/s-1314MiB/s ...
>>> no-fuse
>>>    READ: bw=2566MiB/s (2690MB/s), 2566MiB/s-2566MiB/s ...
>>>
>>> In summary, for async requests the core doing application IO is busy
>>> sending requests and processing IOs should be done on a different core.
>>> Spreading the load on random cores is also not desirable, as the core
>>> might be frequency scaled down and/or in C1 sleep states. Not shown here,
>>> but differnces are much smaller when the system uses performance govenor
>>> instead of schedutil (ubuntu default). Obviously at the cost of higher
>>> system power consumption for performance govenor - not desirable either.
>>>
>>> Results without io-uring (which uses fixed libfuse threads per queue)
>>> heavily depend on the current number of active threads. Libfuse uses
>>> default of max 10 threads, but actual nr max threads is a parameter.
>>> Also, no-fuse-io-uring results heavily depend on, if there was already
>>> running another workload before, as libfuse starts these threads
>>> dynamically - i.e. the more threads are active, the worse the
>>> performance.
>>>
>>> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
>>> ---
>>>  fs/fuse/dev_uring.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++-------
>>>  1 file changed, 46 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
>>> index aca71ce5632efd1d80e3ac0ad4e81ac1536dbc47..f35dd98abfe6407849fec55847c6b3d186383803 100644
>>> --- a/fs/fuse/dev_uring.c
>>> +++ b/fs/fuse/dev_uring.c
>>> @@ -23,6 +23,7 @@ MODULE_PARM_DESC(enable_uring,
>>>  #define FURING_Q_LOCAL_THRESHOLD 2
>>>  #define FURING_Q_NUMA_THRESHOLD (FURING_Q_LOCAL_THRESHOLD + 1)
>>>  #define FURING_Q_GLOBAL_THRESHOLD (FURING_Q_LOCAL_THRESHOLD * 2)
>>> +#define FURING_NEXT_QUEUE_RETRIES 2
>> 
>> Some bikeshedding:
>> 
>> Maybe that's just me, but I associate the names above (FURING_*) to 'fur'
>> -- the action of making 'fur'.  I'm not sure that verb even exists, but my
>> brain makes me dislike those names :-)
>> 
>> But I'm not a native speaker, and I don't have any other objection to
>> those names rather than "I don't like fur!" so feel free to ignore me. :-)
>
> I had initially called it "FUSE_URING", but it seemed to be a bit long.
> I can change back to that or maybe you have a better short name in your
> mind (as usual the hardest part is to find good variable names)?

Yeah, the only alternative would be to either drop 'FUSE' or 'URING' from
the name: FUSE_Q_LOCAL_THRESHOLD or URING_Q_LOCAL_THRESHOLD.  But as I
said: this is plain bikeshedding.  If no one else complains about these
definitions, just leave them as-is.

>>>  
>>>  bool fuse_uring_enabled(void)
>>>  {
>>> @@ -1302,12 +1303,15 @@ static struct fuse_ring_queue *fuse_uring_best_queue(const struct cpumask *mask,
>>>  /*
>>>   * Get the best queue for the current CPU
>>>   */
>>> -static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
>>> +static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring,
>>> +						    bool background)
>>>  {
>>>  	unsigned int qid;
>>>  	struct fuse_ring_queue *local_queue, *best_numa, *best_global;
>>>  	int local_node;
>>>  	const struct cpumask *numa_mask, *global_mask;
>>> +	int retries = 0;
>>> +	int weight = -1;
>>>  
>>>  	qid = task_cpu(current);
>>>  	if (WARN_ONCE(qid >= ring->max_nr_queues,
>>> @@ -1315,16 +1319,50 @@ static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
>>>  		      ring->max_nr_queues))
>>>  		qid = 0;
>>>  
>>> -	local_queue = READ_ONCE(ring->queues[qid]);
>>>  	local_node = cpu_to_node(qid);
>>>  	if (WARN_ON_ONCE(local_node > ring->nr_numa_nodes))
>>>  		local_node = 0;
>>>  
>>> -	/* Fast path: if local queue exists and is not overloaded, use it */
>>> -	if (local_queue &&
>>> -	    READ_ONCE(local_queue->nr_reqs) <= FURING_Q_LOCAL_THRESHOLD)
>>> +	local_queue = READ_ONCE(ring->queues[qid]);
>>> +
>>> +retry:
>>> +	/*
>>> +	 * For background requests, try next CPU in same NUMA domain.
>>> +	 * I.e. cpu-0 creates async requests, cpu-1 io processes.
>>> +	 * Similar for foreground requests, when the local queue does not
>>> +	 * exist - still better to always wake the same cpu id.
>>> +	 */
>>> +	if (background || !local_queue) {
>>> +		numa_mask = ring->numa_registered_q_mask[local_node];
>>> +
>>> +		if (weight == -1)
>>> +			weight = cpumask_weight(numa_mask);
>>> +
>>> +		if (weight == 0)
>>> +			goto global;
>>> +
>>> +		if (weight > 1) {
>>> +			int idx = (qid + 1) % weight;
>>> +
>>> +			qid = cpumask_nth(idx, numa_mask);
>>> +		} else {
>>> +			qid = cpumask_first(numa_mask);
>>> +		}
>>> +
>>> +		local_queue = READ_ONCE(ring->queues[qid]);
>>> +		if (WARN_ON_ONCE(!local_queue))
>>> +			return NULL;
>>> +	}
>>> +
>>> +	if (READ_ONCE(local_queue->nr_reqs) <= FURING_Q_NUMA_THRESHOLD)
>>>  		return local_queue;
>>>  
>>> +	if (retries < FURING_NEXT_QUEUE_RETRIES && weight > retries + 1) {
>>> +		retries++;
>>> +		local_queue = NULL;
>>> +		goto retry;
>>> +	}
>>> +
>> 
>> I wonder if this retry loop is really useful.  If I understand this
>> correctly, we're doing a busy loop, hoping for a better queue to become
>> available.  But if the system is really busy doing IO this retry loop will
>> most of the times fail and will fall back to the next option -- only once
>> in a while we will get a better one.
>> 
>> Do you have evidence that this could be helpful?  Or am I misunderstanding
>> the purpose of this retry loop?
>
> Yeah, I got better results with the retry, because using random queues
> really doesn't work well - wakes up cores on random queues and wakes and
> doesn't accumulate on the same queue - doesn't make use of the ring
> advantage. So random should be avoided, if possible. Also, the more
> queues the system has, the worse the results with the plain random
> fallback. I can provide some numbers later on today.

From my side, there's no need to provide numbers -- I'm happy with your
answer.  I just thought that looping could be more expensive than just
picking whatever queue was selected, because in the end that would likely
be the end result anyway.  But I haven't tested it and I understand that
this design can definitely have impact in big systems.

Cheers,
-- 
Luís

next prev parent reply	other threads:[~2025-10-15 11:05 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-13 17:09 [PATCH v3 0/6] fuse: {io-uring} Allow to reduce the number of queues and request distribution Bernd Schubert
2025-10-13 17:09 ` [PATCH v3 1/6] fuse: {io-uring} Add queue length counters Bernd Schubert
2025-10-15  9:19   ` Luis Henriques
2025-10-13 17:09 ` [PATCH v3 2/6] fuse: {io-uring} Rename ring->nr_queues to max_nr_queues Bernd Schubert
2025-10-13 17:09 ` [PATCH v3 3/6] fuse: {io-uring} Use bitmaps to track registered queues Bernd Schubert
2025-10-15 23:49   ` Joanne Koong
2025-10-16 11:33     ` Bernd Schubert
2025-10-13 17:10 ` [PATCH v3 4/6] fuse: {io-uring} Distribute load among queues Bernd Schubert
2025-10-18  0:12   ` Joanne Koong
2025-10-20 19:00     ` Bernd Schubert
2025-10-20 22:59       ` Joanne Koong
2025-10-20 23:28         ` Bernd Schubert
2025-10-24 17:05         ` Joanne Koong
2025-10-24 17:52           ` Bernd Schubert
2025-10-24 17:58             ` Bernd Schubert
2025-10-13 17:10 ` [PATCH v3 5/6] fuse: {io-uring} Allow reduced number of ring queues Bernd Schubert
2025-10-15  9:25   ` Luis Henriques
2025-10-15  9:31     ` Bernd Schubert
2025-10-13 17:10 ` [PATCH v3 6/6] fuse: {io-uring} Queue background requests on a different core Bernd Schubert
2025-10-15  9:50   ` Luis Henriques
2025-10-15 10:27     ` Bernd Schubert
2025-10-15 11:05       ` Luis Henriques [this message]
2025-10-14  8:43 ` [PATCH v3 0/6] fuse: {io-uring} Allow to reduce the number of queues and request distribution Gang He
2025-10-14  9:14   ` Bernd Schubert
2025-10-16  6:15     ` Gang He
  -- strict thread matches above, loose matches on Subject: below --
2025-10-20  7:09 [PATCH v3 6/6] fuse: {io-uring} Queue background requests on a different core kernel test robot
2025-10-20  7:15 ` Dan Carpenter
2025-10-22 20:24 kernel test robot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87frbk8ly1.fsf@igalia.com \
    --to=luis@igalia.com \
    --cc=bschubert@ddn.com \
    --cc=dchg2000@gmail.com \
    --cc=joannelkoong@gmail.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=miklos@szeredi.hu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.