From: Luis Henriques <luis@igalia.com>
To: Bernd Schubert <bschubert@ddn.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>,
	 Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	 Juri Lelli <juri.lelli@redhat.com>,
	 Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	 Steven Rostedt <rostedt@goodmis.org>,
	 Ben Segall <bsegall@google.com>,  Mel Gorman <mgorman@suse.de>,
	 Valentin Schneider <vschneid@redhat.com>,
	 Joanne Koong <joannelkoong@gmail.com>,
	 linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH v2 6/7] fuse: {io-uring} Queue background requests on a different core
Date: Mon, 06 Oct 2025 10:53:58 +0100
Message-ID: <87frbwe4p5.fsf@wotan.olymp>
In-Reply-To: <20251003-reduced-nr-ring-queues_3-v2-6-742ff1a8fc58@ddn.com> (Bernd Schubert's message of "Fri, 03 Oct 2025 12:06:47 +0200")

On Fri, Oct 03 2025, Bernd Schubert wrote:

> Running background IO on a different core makes quite a difference.
>
> fio --directory=/tmp/dest --name=iops.\$jobnum --rw=randread \
> --bs=4k --size=1G --numjobs=1 --iodepth=4 --time_based \
> --runtime=30s --group_reporting --ioengine=io_uring \
> --direct=1
>
> unpatched
>    READ: bw=272MiB/s (285MB/s), 272MiB/s-272MiB/s ...
> patched
>    READ: bw=760MiB/s (797MB/s), 760MiB/s-760MiB/s ...
>
> With --iodepth=8
>
> unpatched
>    READ: bw=466MiB/s (489MB/s), 466MiB/s-466MiB/s ...
> patched
>    READ: bw=966MiB/s (1013MB/s), 966MiB/s-966MiB/s ...
> 2nd run:
>    READ: bw=1014MiB/s (1064MB/s), 1014MiB/s-1014MiB/s ...
>
> Without io-uring (--iodepth=8)
>    READ: bw=729MiB/s (764MB/s), 729MiB/s-729MiB/s ...
>
> Without fuse (--iodepth=8)
>    READ: bw=2199MiB/s (2306MB/s), 2199MiB/s-2199MiB/s ...
>
> (Tests were done with
> <libfuse>/example/passthrough_hp -o allow_other --nopassthrough  \
> [-o io_uring] /tmp/source /tmp/dest
> )
>
> Additional notes:
>
> With FURING_NEXT_QUEUE_RETRIES=0 (--iodepth=8)
>    READ: bw=903MiB/s (946MB/s), 903MiB/s-903MiB/s ...
>
> With just a random qid (--iodepth=8)
>    READ: bw=429MiB/s (450MB/s), 429MiB/s-429MiB/s ...
>
> With --iodepth=1
> unpatched
>    READ: bw=195MiB/s (204MB/s), 195MiB/s-195MiB/s ...
> patched
>    READ: bw=232MiB/s (243MB/s), 232MiB/s-232MiB/s ...
>
> With --iodepth=1 --numjobs=2
> unpatched
>    READ: bw=966MiB/s (1013MB/s), 966MiB/s-966MiB/s ...
> patched
>    READ: bw=1821MiB/s (1909MB/s), 1821MiB/s-1821MiB/s ...
>
> With --iodepth=1 --numjobs=8
> unpatched
>    READ: bw=1138MiB/s (1193MB/s), 1138MiB/s-1138MiB/s ...
> patched
>    READ: bw=1650MiB/s (1730MB/s), 1650MiB/s-1650MiB/s ...
> fuse without io-uring
>    READ: bw=1314MiB/s (1378MB/s), 1314MiB/s-1314MiB/s ...
> no-fuse
>    READ: bw=2566MiB/s (2690MB/s), 2566MiB/s-2566MiB/s ...
>
> In summary, for async requests the core doing application IO is busy
> sending requests, so processing IOs should be done on a different core.
> Spreading the load on random cores is also not desirable, as the core
> might be frequency scaled down and/or in C1 sleep states. Not shown here,
> but differences are much smaller when the system uses the performance
> governor instead of schedutil (ubuntu default). Obviously that comes at
> the cost of higher system power consumption for the performance governor,
> which is not desirable either.
>
> Results without io-uring (which uses fixed libfuse threads per queue)
> heavily depend on the current number of active threads. Libfuse uses a
> default of max 10 threads, but the actual max number of threads is a
> parameter. Also, no-fuse-io-uring results heavily depend on whether
> another workload was already running before, as libfuse starts these
> threads dynamically - i.e. the more threads are active, the worse the
> performance.
>
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> ---
>  fs/fuse/dev_uring.c | 61 +++++++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 50 insertions(+), 11 deletions(-)
>
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index f5946bb1bbea930522921d49c04e047c70d21ee2..296592fe3651926ab4982b8d80694b3dac8bbffa 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -22,6 +22,7 @@ MODULE_PARM_DESC(enable_uring,
>  #define FURING_Q_LOCAL_THRESHOLD 2
>  #define FURING_Q_NUMA_THRESHOLD (FURING_Q_LOCAL_THRESHOLD + 1)
>  #define FURING_Q_GLOBAL_THRESHOLD (FURING_Q_LOCAL_THRESHOLD * 2)
> +#define FURING_NEXT_QUEUE_RETRIES 2
>  
>  bool fuse_uring_enabled(void)
>  {
> @@ -1262,7 +1263,8 @@ static void fuse_uring_send_in_task(struct io_uring_cmd *cmd,
>   *  (Michael David Mitzenmacher, 1991)
>   */
>  static struct fuse_ring_queue *fuse_uring_best_queue(const struct cpumask *mask,
> -						     struct fuse_ring *ring)
> +						     struct fuse_ring *ring,
> +						     bool background)
>  {
>  	unsigned int qid1, qid2;
>  	struct fuse_ring_queue *queue1, *queue2;
> @@ -1277,9 +1279,14 @@ static struct fuse_ring_queue *fuse_uring_best_queue(const struct cpumask *mask,
>  	}
>  
>  	/* Get two different queues using optimized bounded random */
> -	qid1 = cpumask_nth(get_random_u32_below(weight), mask);
> +
> +	do {
> +		qid1 = cpumask_nth(get_random_u32_below(weight), mask);
> +	} while (background && qid1 == task_cpu(current));
>  	queue1 = READ_ONCE(ring->queues[qid1]);
>  
> +	return queue1;

Hmmm?  I guess this was left from some local testing, right?
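
To spell out why it looks like a leftover: with that early return, the
second pick and the queue comparison below it can never run, so the
"best of two" selection degrades to a single random pick. A minimal
userspace sketch of the difference, under my own assumptions: the
sketch_* names and the struct are hypothetical stand-ins, rand() replaces
get_random_u32_below(), and a plain array replaces the cpumask lookup.
It is not the code from the patch.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct fuse_ring_queue: only the load counter. */
struct sketch_queue {
	unsigned int nr_reqs;
};

/*
 * Presumably the intended behaviour: pick two distinct random queues and
 * return the index of the less loaded one. With a single queue there is
 * nothing to compare, so index 0 is returned.
 */
static unsigned int sketch_best_of_two(const struct sketch_queue *q,
				       unsigned int nr)
{
	unsigned int a, b;

	assert(nr > 0);
	if (nr == 1)
		return 0;

	a = (unsigned int)rand() % nr;
	do {
		b = (unsigned int)rand() % nr;
	} while (b == a);

	return q[a].nr_reqs <= q[b].nr_reqs ? a : b;
}

/*
 * Effect of the hunk as posted: the first random pick is returned
 * unconditionally and the queue load is never looked at.
 */
static unsigned int sketch_first_pick(const struct sketch_queue *q,
				      unsigned int nr)
{
	(void)q;	/* load is ignored on purpose, that is the point */
	assert(nr > 0);
	return (unsigned int)rand() % nr;
}
```

With two queues where one is idle and one is busy, sketch_best_of_two()
always lands on the idle one, while sketch_first_pick() can return either.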

Cheers,
-- 
Luís


> +
>  	do {
>  		qid2 = cpumask_nth(get_random_u32_below(weight), mask);
>  	} while (qid2 == qid1);
> @@ -1298,12 +1305,14 @@ static struct fuse_ring_queue *fuse_uring_best_queue(const struct cpumask *mask,
>  /*
>   * Get the best queue for the current CPU
>   */
> -static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
> +static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring,
> +						    bool background)
>  {
>  	unsigned int qid;
>  	struct fuse_ring_queue *local_queue, *best_numa, *best_global;
>  	int local_node;
>  	const struct cpumask *numa_mask, *global_mask;
> +	int retries = 0;
>  
>  	qid = task_cpu(current);
>  	if (WARN_ONCE(qid >= ring->max_nr_queues,
> @@ -1311,16 +1320,44 @@ static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
>  		      ring->max_nr_queues))
>  		qid = 0;
>  
> -	local_queue = READ_ONCE(ring->queues[qid]);
>  	local_node = cpu_to_node(qid);
>  
> -	/* Fast path: if local queue exists and is not overloaded, use it */
> -	if (local_queue && local_queue->nr_reqs <= FURING_Q_LOCAL_THRESHOLD)
> +	local_queue = READ_ONCE(ring->queues[qid]);
> +
> +retry:
> +	/*
> +	 * For background requests, try next CPU in same NUMA domain.
> +	 * I.e. cpu-0 creates async requests, cpu-1 io processes.
> +	 * Similar for foreground requests, when the local queue does not
> +	 * exist - still better to always wake the same cpu id.
> +	 */
> +	if (background || !local_queue) {
> +		numa_mask = ring->numa_registered_q_mask[local_node];
> +		int weight = cpumask_weight(numa_mask);
> +
> +		if (weight > 0) {
> +			int idx = (qid + 1) % weight;
> +
> +			qid = cpumask_nth(idx, numa_mask);
> +		} else {
> +			qid = cpumask_first(numa_mask);
> +		}
> +
> +		local_queue = READ_ONCE(ring->queues[qid]);
> +	}
> +
> +	if (local_queue && local_queue->nr_reqs <= FURING_Q_NUMA_THRESHOLD)
>  		return local_queue;
>  
> +	if (retries < FURING_NEXT_QUEUE_RETRIES) {
> +		retries++;
> +		local_queue = NULL;
> +		goto retry;
> +	}
> +
>  	/* Find best NUMA-local queue */
>  	numa_mask = ring->numa_registered_q_mask[local_node];
> -	best_numa = fuse_uring_best_queue(numa_mask, ring);
> +	best_numa = fuse_uring_best_queue(numa_mask, ring, background);
>  
>  	/* If NUMA queue is under threshold, use it */
>  	if (best_numa && best_numa->nr_reqs <= FURING_Q_NUMA_THRESHOLD)
> @@ -1328,7 +1365,7 @@ static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
>  
>  	/* NUMA queues above threshold, try global queues */
>  	global_mask = ring->registered_q_mask;
> -	best_global = fuse_uring_best_queue(global_mask, ring);
> +	best_global = fuse_uring_best_queue(global_mask, ring, background);
>  
>  	/* Might happen during tear down */
>  	if (!best_global)
> @@ -1338,8 +1375,10 @@ static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
>  	if (best_global->nr_reqs <= FURING_Q_GLOBAL_THRESHOLD)
>  		return best_global;
>  
> +	return best_global;
> +
>  	/* Fall back to best available queue */
> -	return best_numa ? best_numa : best_global;
> +	// return best_numa ? best_numa : best_global;
>  }
>  
>  static void fuse_uring_dispatch_ent(struct fuse_ring_ent *ent)
> @@ -1360,7 +1399,7 @@ void fuse_uring_queue_fuse_req(struct fuse_iqueue *fiq, struct fuse_req *req)
>  	int err;
>  
>  	err = -EINVAL;
> -	queue = fuse_uring_get_queue(ring);
> +	queue = fuse_uring_get_queue(ring, false);
>  	if (!queue)
>  		goto err;
>  
> @@ -1405,7 +1444,7 @@ bool fuse_uring_queue_bq_req(struct fuse_req *req)
>  	struct fuse_ring_queue *queue;
>  	struct fuse_ring_ent *ent = NULL;
>  
> -	queue = fuse_uring_get_queue(ring);
> +	queue = fuse_uring_get_queue(ring, true);
>  	if (!queue)
>  		return false;
>  
>
> -- 
> 2.43.0
>
>


