linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v3 0/6] fuse: {io-uring} Allow to reduce the number of queues and request distribution
@ 2025-10-13 17:09 Bernd Schubert
  2025-10-13 17:09 ` [PATCH v3 1/6] fuse: {io-uring} Add queue length counters Bernd Schubert
                   ` (6 more replies)
  0 siblings, 7 replies; 26+ messages in thread
From: Bernd Schubert @ 2025-10-13 17:09 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Joanne Koong, linux-fsdevel, Luis Henriques, Gang He,
	Bernd Schubert

This adds bitmaps that track which queues are registered, plus per-queue
counters of queued requests. These are then used to map from the
submitting core to a queue and to distribute load across queues. NUMA
affinity is taken into account, and the fuse client/server protocol
does not need any changes - everything is handled internally in the
fuse client.
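
As a rough illustration (a simplified sketch, not code from the
patches; the real selection in patch 4/6 additionally weighs
queue->nr_reqs), the masks are meant to be used like this when picking
a queue for the submitting core:

	static struct fuse_ring_queue *pick_queue_sketch(struct fuse_ring *ring)
	{
		unsigned int cpu = task_cpu(current);
		int node = cpu_to_node(cpu);
		struct fuse_ring_queue *q = READ_ONCE(ring->queues[cpu]);

		/* queue of the submitting core, if it is registered */
		if (q)
			return q;

		/* else any registered queue on the local NUMA node ... */
		cpu = cpumask_first(ring->numa_registered_q_mask[node]);
		if (cpu >= ring->max_nr_queues)
			/* ... or any registered queue at all */
			cpu = cpumask_first(ring->registered_q_mask);
		if (cpu >= ring->max_nr_queues)
			return NULL;	/* nothing registered (tear down) */

		return READ_ONCE(ring->queues[cpu]);
	}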

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
Changes in v3:
- removed FUSE_URING_QUEUE_THRESHOLD (Luis)
- Fixed accidental early return of queue1 in fuse_uring_best_queue()
- Fixed a similar early return of 'best_global'
- Added sanity checks for cpu_to_node()
- Removed retry loops in fuse_uring_best_queue() for code simplicity
- Reduced local numa retries in fuse_uring_get_queue
- Added 'FUSE_URING_REDUCED_Q' FUSE_INIT flag to inform userspace
  about the possibility of a reduced number of queues
- Link to v2: https://lore.kernel.org/r/20251003-reduced-nr-ring-queues_3-v2-0-742ff1a8fc58@ddn.com
- Removed wake-on-same-cpu patch from this series,
  it will be sent out independently
- Used READ_ONCE(queue->nr_reqs) as the value is updated (with a lock
  held) by other threads and possibly other cpus.

---
Bernd Schubert (6):
      fuse: {io-uring} Add queue length counters
      fuse: {io-uring} Rename ring->nr_queues to max_nr_queues
      fuse: {io-uring} Use bitmaps to track registered queues
      fuse: {io-uring} Distribute load among queues
      fuse: {io-uring} Allow reduced number of ring queues
      fuse: {io-uring} Queue background requests on a different core

 fs/fuse/dev_uring.c       | 260 ++++++++++++++++++++++++++++++++++++----------
 fs/fuse/dev_uring_i.h     |  14 ++-
 fs/fuse/inode.c           |   2 +-
 include/uapi/linux/fuse.h |   3 +
 4 files changed, 224 insertions(+), 55 deletions(-)
---
base-commit: ec714e371f22f716a04e6ecb2a24988c92b26911
change-id: 20250722-reduced-nr-ring-queues_3-6acb79dad978

Best regards,
-- 
Bernd Schubert <bschubert@ddn.com>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v3 1/6] fuse: {io-uring} Add queue length counters
  2025-10-13 17:09 [PATCH v3 0/6] fuse: {io-uring} Allow to reduce the number of queues and request distribution Bernd Schubert
@ 2025-10-13 17:09 ` Bernd Schubert
  2025-10-15  9:19   ` Luis Henriques
  2025-10-13 17:09 ` [PATCH v3 2/6] fuse: {io-uring} Rename ring->nr_queues to max_nr_queues Bernd Schubert
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 26+ messages in thread
From: Bernd Schubert @ 2025-10-13 17:09 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Joanne Koong, linux-fsdevel, Luis Henriques, Gang He,
	Bernd Schubert

This is another preparation patch; the counter will be used for
deciding which queue to add a request to.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
---
 fs/fuse/dev_uring.c   | 17 +++++++++++++++--
 fs/fuse/dev_uring_i.h |  3 +++
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index f6b12aebb8bbe7d255980593b75b5fb5af9c669e..872ae17ffaf49a30c46ef89c1668684a61a0cce4 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -86,13 +86,13 @@ static void fuse_uring_req_end(struct fuse_ring_ent *ent, struct fuse_req *req,
 	lockdep_assert_not_held(&queue->lock);
 	spin_lock(&queue->lock);
 	ent->fuse_req = NULL;
+	queue->nr_reqs--;
 	if (test_bit(FR_BACKGROUND, &req->flags)) {
 		queue->active_background--;
 		spin_lock(&fc->bg_lock);
 		fuse_uring_flush_bg(queue);
 		spin_unlock(&fc->bg_lock);
 	}
-
 	spin_unlock(&queue->lock);
 
 	if (error)
@@ -112,6 +112,7 @@ static void fuse_uring_abort_end_queue_requests(struct fuse_ring_queue *queue)
 	list_for_each_entry(req, &queue->fuse_req_queue, list)
 		clear_bit(FR_PENDING, &req->flags);
 	list_splice_init(&queue->fuse_req_queue, &req_list);
+	queue->nr_reqs = 0;
 	spin_unlock(&queue->lock);
 
 	/* must not hold queue lock to avoid order issues with fi->lock */
@@ -1280,10 +1281,13 @@ void fuse_uring_queue_fuse_req(struct fuse_iqueue *fiq, struct fuse_req *req)
 	req->ring_queue = queue;
 	ent = list_first_entry_or_null(&queue->ent_avail_queue,
 				       struct fuse_ring_ent, list);
+	queue->nr_reqs++;
+
 	if (ent)
 		fuse_uring_add_req_to_ring_ent(ent, req);
 	else
 		list_add_tail(&req->list, &queue->fuse_req_queue);
+
 	spin_unlock(&queue->lock);
 
 	if (ent)
@@ -1319,6 +1323,7 @@ bool fuse_uring_queue_bq_req(struct fuse_req *req)
 	set_bit(FR_URING, &req->flags);
 	req->ring_queue = queue;
 	list_add_tail(&req->list, &queue->fuse_req_bg_queue);
+	queue->nr_reqs++;
 
 	ent = list_first_entry_or_null(&queue->ent_avail_queue,
 				       struct fuse_ring_ent, list);
@@ -1351,8 +1356,16 @@ bool fuse_uring_queue_bq_req(struct fuse_req *req)
 bool fuse_uring_remove_pending_req(struct fuse_req *req)
 {
 	struct fuse_ring_queue *queue = req->ring_queue;
+	bool removed = fuse_remove_pending_req(req, &queue->lock);
 
-	return fuse_remove_pending_req(req, &queue->lock);
+	if (removed) {
+		/* Update counters after successful removal */
+		spin_lock(&queue->lock);
+		queue->nr_reqs--;
+		spin_unlock(&queue->lock);
+	}
+
+	return removed;
 }
 
 static const struct fuse_iqueue_ops fuse_io_uring_ops = {
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index 51a563922ce14158904a86c248c77767be4fe5ae..c63bed9f863d53d4ac2bed7bfbda61941cd99083 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -94,6 +94,9 @@ struct fuse_ring_queue {
 	/* background fuse requests */
 	struct list_head fuse_req_bg_queue;
 
+	/* number of requests queued or in userspace */
+	unsigned int nr_reqs;
+
 	struct fuse_pqueue fpq;
 
 	unsigned int active_background;

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v3 2/6] fuse: {io-uring} Rename ring->nr_queues to max_nr_queues
  2025-10-13 17:09 [PATCH v3 0/6] fuse: {io-uring} Allow to reduce the number of queues and request distribution Bernd Schubert
  2025-10-13 17:09 ` [PATCH v3 1/6] fuse: {io-uring} Add queue length counters Bernd Schubert
@ 2025-10-13 17:09 ` Bernd Schubert
  2025-10-13 17:09 ` [PATCH v3 3/6] fuse: {io-uring} Use bitmaps to track registered queues Bernd Schubert
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 26+ messages in thread
From: Bernd Schubert @ 2025-10-13 17:09 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Joanne Koong, linux-fsdevel, Luis Henriques, Gang He,
	Bernd Schubert

This is preparation for follow-up commits that allow running with a
reduced number of queues.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
 fs/fuse/dev_uring.c   | 24 ++++++++++++------------
 fs/fuse/dev_uring_i.h |  2 +-
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 872ae17ffaf49a30c46ef89c1668684a61a0cce4..70eed06571d7254f3a30b7b27bcedea221ec2dd1 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -125,7 +125,7 @@ void fuse_uring_abort_end_requests(struct fuse_ring *ring)
 	struct fuse_ring_queue *queue;
 	struct fuse_conn *fc = ring->fc;
 
-	for (qid = 0; qid < ring->nr_queues; qid++) {
+	for (qid = 0; qid < ring->max_nr_queues; qid++) {
 		queue = READ_ONCE(ring->queues[qid]);
 		if (!queue)
 			continue;
@@ -166,7 +166,7 @@ bool fuse_uring_request_expired(struct fuse_conn *fc)
 	if (!ring)
 		return false;
 
-	for (qid = 0; qid < ring->nr_queues; qid++) {
+	for (qid = 0; qid < ring->max_nr_queues; qid++) {
 		queue = READ_ONCE(ring->queues[qid]);
 		if (!queue)
 			continue;
@@ -193,7 +193,7 @@ void fuse_uring_destruct(struct fuse_conn *fc)
 	if (!ring)
 		return;
 
-	for (qid = 0; qid < ring->nr_queues; qid++) {
+	for (qid = 0; qid < ring->max_nr_queues; qid++) {
 		struct fuse_ring_queue *queue = ring->queues[qid];
 		struct fuse_ring_ent *ent, *next;
 
@@ -253,7 +253,7 @@ static struct fuse_ring *fuse_uring_create(struct fuse_conn *fc)
 
 	init_waitqueue_head(&ring->stop_waitq);
 
-	ring->nr_queues = nr_queues;
+	ring->max_nr_queues = nr_queues;
 	ring->fc = fc;
 	ring->max_payload_sz = max_payload_size;
 	smp_store_release(&fc->ring, ring);
@@ -405,7 +405,7 @@ static void fuse_uring_log_ent_state(struct fuse_ring *ring)
 	int qid;
 	struct fuse_ring_ent *ent;
 
-	for (qid = 0; qid < ring->nr_queues; qid++) {
+	for (qid = 0; qid < ring->max_nr_queues; qid++) {
 		struct fuse_ring_queue *queue = ring->queues[qid];
 
 		if (!queue)
@@ -436,7 +436,7 @@ static void fuse_uring_async_stop_queues(struct work_struct *work)
 		container_of(work, struct fuse_ring, async_teardown_work.work);
 
 	/* XXX code dup */
-	for (qid = 0; qid < ring->nr_queues; qid++) {
+	for (qid = 0; qid < ring->max_nr_queues; qid++) {
 		struct fuse_ring_queue *queue = READ_ONCE(ring->queues[qid]);
 
 		if (!queue)
@@ -471,7 +471,7 @@ void fuse_uring_stop_queues(struct fuse_ring *ring)
 {
 	int qid;
 
-	for (qid = 0; qid < ring->nr_queues; qid++) {
+	for (qid = 0; qid < ring->max_nr_queues; qid++) {
 		struct fuse_ring_queue *queue = READ_ONCE(ring->queues[qid]);
 
 		if (!queue)
@@ -890,7 +890,7 @@ static int fuse_uring_commit_fetch(struct io_uring_cmd *cmd, int issue_flags,
 	if (!ring)
 		return err;
 
-	if (qid >= ring->nr_queues)
+	if (qid >= ring->max_nr_queues)
 		return -EINVAL;
 
 	queue = ring->queues[qid];
@@ -953,7 +953,7 @@ static bool is_ring_ready(struct fuse_ring *ring, int current_qid)
 	struct fuse_ring_queue *queue;
 	bool ready = true;
 
-	for (qid = 0; qid < ring->nr_queues && ready; qid++) {
+	for (qid = 0; qid < ring->max_nr_queues && ready; qid++) {
 		if (current_qid == qid)
 			continue;
 
@@ -1094,7 +1094,7 @@ static int fuse_uring_register(struct io_uring_cmd *cmd,
 			return err;
 	}
 
-	if (qid >= ring->nr_queues) {
+	if (qid >= ring->max_nr_queues) {
 		pr_info_ratelimited("fuse: Invalid ring qid %u\n", qid);
 		return -EINVAL;
 	}
@@ -1237,9 +1237,9 @@ static struct fuse_ring_queue *fuse_uring_task_to_queue(struct fuse_ring *ring)
 
 	qid = task_cpu(current);
 
-	if (WARN_ONCE(qid >= ring->nr_queues,
+	if (WARN_ONCE(qid >= ring->max_nr_queues,
 		      "Core number (%u) exceeds nr queues (%zu)\n", qid,
-		      ring->nr_queues))
+		      ring->max_nr_queues))
 		qid = 0;
 
 	queue = ring->queues[qid];
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index c63bed9f863d53d4ac2bed7bfbda61941cd99083..708412294982566919122a1a0d7f741217c763ce 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -113,7 +113,7 @@ struct fuse_ring {
 	struct fuse_conn *fc;
 
 	/* number of ring queues */
-	size_t nr_queues;
+	size_t max_nr_queues;
 
 	/* maximum payload/arg size */
 	size_t max_payload_sz;

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v3 3/6] fuse: {io-uring} Use bitmaps to track registered queues
  2025-10-13 17:09 [PATCH v3 0/6] fuse: {io-uring} Allow to reduce the number of queues and request distribution Bernd Schubert
  2025-10-13 17:09 ` [PATCH v3 1/6] fuse: {io-uring} Add queue length counters Bernd Schubert
  2025-10-13 17:09 ` [PATCH v3 2/6] fuse: {io-uring} Rename ring->nr_queues to max_nr_queues Bernd Schubert
@ 2025-10-13 17:09 ` Bernd Schubert
  2025-10-15 23:49   ` Joanne Koong
  2025-10-13 17:10 ` [PATCH v3 4/6] fuse: {io-uring} Distribute load among queues Bernd Schubert
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 26+ messages in thread
From: Bernd Schubert @ 2025-10-13 17:09 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Joanne Koong, linux-fsdevel, Luis Henriques, Gang He,
	Bernd Schubert

Add per-CPU and per-NUMA node bitmasks to track which
io-uring queues are registered.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
 fs/fuse/dev_uring.c   | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/fuse/dev_uring_i.h |  9 +++++++++
 2 files changed, 64 insertions(+)

diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 70eed06571d7254f3a30b7b27bcedea221ec2dd1..02c4b40e739c7aa43dc1c581d4ff1f721617cc79 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -185,6 +185,18 @@ bool fuse_uring_request_expired(struct fuse_conn *fc)
 	return false;
 }
 
+static void fuse_ring_destruct_q_masks(struct fuse_ring *ring)
+{
+	int node;
+
+	free_cpumask_var(ring->registered_q_mask);
+	if (ring->numa_registered_q_mask) {
+		for (node = 0; node < ring->nr_numa_nodes; node++)
+			free_cpumask_var(ring->numa_registered_q_mask[node]);
+		kfree(ring->numa_registered_q_mask);
+	}
+}
+
 void fuse_uring_destruct(struct fuse_conn *fc)
 {
 	struct fuse_ring *ring = fc->ring;
@@ -216,11 +228,32 @@ void fuse_uring_destruct(struct fuse_conn *fc)
 		ring->queues[qid] = NULL;
 	}
 
+	fuse_ring_destruct_q_masks(ring);
 	kfree(ring->queues);
 	kfree(ring);
 	fc->ring = NULL;
 }
 
+static int fuse_ring_create_q_masks(struct fuse_ring *ring)
+{
+	int node;
+
+	if (!zalloc_cpumask_var(&ring->registered_q_mask, GFP_KERNEL_ACCOUNT))
+		return -ENOMEM;
+
+	ring->numa_registered_q_mask = kcalloc(
+		ring->nr_numa_nodes, sizeof(cpumask_var_t), GFP_KERNEL_ACCOUNT);
+	if (!ring->numa_registered_q_mask)
+		return -ENOMEM;
+	for (node = 0; node < ring->nr_numa_nodes; node++) {
+		if (!zalloc_cpumask_var(&ring->numa_registered_q_mask[node],
+					GFP_KERNEL_ACCOUNT))
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
 /*
  * Basic ring setup for this connection based on the provided configuration
  */
@@ -230,11 +263,14 @@ static struct fuse_ring *fuse_uring_create(struct fuse_conn *fc)
 	size_t nr_queues = num_possible_cpus();
 	struct fuse_ring *res = NULL;
 	size_t max_payload_size;
+	int err;
 
 	ring = kzalloc(sizeof(*fc->ring), GFP_KERNEL_ACCOUNT);
 	if (!ring)
 		return NULL;
 
+	ring->nr_numa_nodes = num_online_nodes();
+
 	ring->queues = kcalloc(nr_queues, sizeof(struct fuse_ring_queue *),
 			       GFP_KERNEL_ACCOUNT);
 	if (!ring->queues)
@@ -243,6 +279,10 @@ static struct fuse_ring *fuse_uring_create(struct fuse_conn *fc)
 	max_payload_size = max(FUSE_MIN_READ_BUFFER, fc->max_write);
 	max_payload_size = max(max_payload_size, fc->max_pages * PAGE_SIZE);
 
+	err = fuse_ring_create_q_masks(ring);
+	if (err)
+		goto out_err;
+
 	spin_lock(&fc->lock);
 	if (fc->ring) {
 		/* race, another thread created the ring in the meantime */
@@ -262,6 +302,7 @@ static struct fuse_ring *fuse_uring_create(struct fuse_conn *fc)
 	return ring;
 
 out_err:
+	fuse_ring_destruct_q_masks(ring);
 	kfree(ring->queues);
 	kfree(ring);
 	return res;
@@ -424,6 +465,7 @@ static void fuse_uring_log_ent_state(struct fuse_ring *ring)
 			pr_info(" ent-commit-queue ring=%p qid=%d ent=%p state=%d\n",
 				ring, qid, ent, ent->state);
 		}
+
 		spin_unlock(&queue->lock);
 	}
 	ring->stop_debug_log = 1;
@@ -470,6 +512,7 @@ static void fuse_uring_async_stop_queues(struct work_struct *work)
 void fuse_uring_stop_queues(struct fuse_ring *ring)
 {
 	int qid;
+	int node;
 
 	for (qid = 0; qid < ring->max_nr_queues; qid++) {
 		struct fuse_ring_queue *queue = READ_ONCE(ring->queues[qid]);
@@ -480,6 +523,11 @@ void fuse_uring_stop_queues(struct fuse_ring *ring)
 		fuse_uring_teardown_entries(queue);
 	}
 
+	/* Reset all queue masks, we won't process any more IO */
+	cpumask_clear(ring->registered_q_mask);
+	for (node = 0; node < ring->nr_numa_nodes; node++)
+		cpumask_clear(ring->numa_registered_q_mask[node]);
+
 	if (atomic_read(&ring->queue_refs) > 0) {
 		ring->teardown_time = jiffies;
 		INIT_DELAYED_WORK(&ring->async_teardown_work,
@@ -983,6 +1031,10 @@ static void fuse_uring_do_register(struct fuse_ring_ent *ent,
 	struct fuse_ring *ring = queue->ring;
 	struct fuse_conn *fc = ring->fc;
 	struct fuse_iqueue *fiq = &fc->iq;
+	int node = cpu_to_node(queue->qid);
+
+	if (WARN_ON_ONCE(node >= ring->nr_numa_nodes))
+		node = 0;
 
 	fuse_uring_prepare_cancel(cmd, issue_flags, ent);
 
@@ -991,6 +1043,9 @@ static void fuse_uring_do_register(struct fuse_ring_ent *ent,
 	fuse_uring_ent_avail(ent, queue);
 	spin_unlock(&queue->lock);
 
+	cpumask_set_cpu(queue->qid, ring->registered_q_mask);
+	cpumask_set_cpu(queue->qid, ring->numa_registered_q_mask[node]);
+
 	if (!ring->ready) {
 		bool ready = is_ring_ready(ring, queue->qid);
 
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index 708412294982566919122a1a0d7f741217c763ce..35e3b6808b60398848965afd3091b765444283ff 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -115,6 +115,9 @@ struct fuse_ring {
 	/* number of ring queues */
 	size_t max_nr_queues;
 
+	/* number of numa nodes */
+	int nr_numa_nodes;
+
 	/* maximum payload/arg size */
 	size_t max_payload_sz;
 
@@ -125,6 +128,12 @@ struct fuse_ring {
 	 */
 	unsigned int stop_debug_log : 1;
 
+	/* Tracks which queues are registered */
+	cpumask_var_t registered_q_mask;
+
+	/* Tracks which queues are registered per NUMA node */
+	cpumask_var_t *numa_registered_q_mask;
+
 	wait_queue_head_t stop_waitq;
 
 	/* async tear down */

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v3 4/6] fuse: {io-uring} Distribute load among queues
  2025-10-13 17:09 [PATCH v3 0/6] fuse: {io-uring} Allow to reduce the number of queues and request distribution Bernd Schubert
                   ` (2 preceding siblings ...)
  2025-10-13 17:09 ` [PATCH v3 3/6] fuse: {io-uring} Use bitmaps to track registered queues Bernd Schubert
@ 2025-10-13 17:10 ` Bernd Schubert
  2025-10-18  0:12   ` Joanne Koong
  2025-10-13 17:10 ` [PATCH v3 5/6] fuse: {io-uring} Allow reduced number of ring queues Bernd Schubert
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 26+ messages in thread
From: Bernd Schubert @ 2025-10-13 17:10 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Joanne Koong, linux-fsdevel, Luis Henriques, Gang He,
	Bernd Schubert

So far queue selection always picked the queue corresponding to the
current core. With the bitmaps of registered queues and the per-queue
counters of queued requests, distributing the load is now possible.

This is on purpose lockless and only approximate, under the assumption
that a lock shared between queues might become the limiting factor.
Approximate selection based on queue->nr_reqs should be good enough. If
queues get slightly more requests than suggested by that counter, it
should not be too bad, as the number of kernel/userspace transitions is
reduced with larger queue depths.
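
For reference, a minimal standalone sketch of the two-choices step used
by fuse_uring_best_queue() below; the load[] array and the function
name are made up for illustration only:

	/* Pick the less loaded of two randomly chosen candidates, nr > 0 */
	static unsigned int two_choices_pick(const unsigned int *load,
					     unsigned int nr)
	{
		unsigned int a = get_random_u32_below(nr);
		unsigned int b = get_random_u32_below(nr);

		return (load[a] <= load[b]) ? a : b;
	}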

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
 fs/fuse/dev_uring.c | 92 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 84 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 02c4b40e739c7aa43dc1c581d4ff1f721617cc79..92401adecf813b1c4570d925718be772c8f02975 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -19,6 +19,10 @@ MODULE_PARM_DESC(enable_uring,
 
 #define FUSE_URING_IOV_SEGS 2 /* header and payload */
 
+/* Number of queued fuse requests until a queue is considered full */
+#define FURING_Q_LOCAL_THRESHOLD 2
+#define FURING_Q_NUMA_THRESHOLD (FURING_Q_LOCAL_THRESHOLD + 1)
+#define FURING_Q_GLOBAL_THRESHOLD (FURING_Q_LOCAL_THRESHOLD * 2)
 
 bool fuse_uring_enabled(void)
 {
@@ -1285,22 +1289,94 @@ static void fuse_uring_send_in_task(struct io_uring_cmd *cmd,
 	fuse_uring_send(ent, cmd, err, issue_flags);
 }
 
-static struct fuse_ring_queue *fuse_uring_task_to_queue(struct fuse_ring *ring)
+/*
+ * Pick best queue from mask. Follows the algorithm described in
+ * "The Power of Two Choices in Randomized Load Balancing"
+ *  (Michael David Mitzenmacher, 1991)
+ */
+static struct fuse_ring_queue *fuse_uring_best_queue(const struct cpumask *mask,
+						     struct fuse_ring *ring)
+{
+	unsigned int qid1, qid2;
+	struct fuse_ring_queue *queue1, *queue2;
+	int weight = cpumask_weight(mask);
+
+	if (weight == 0)
+		return NULL;
+
+	if (weight == 1) {
+		qid1 = cpumask_first(mask);
+		return READ_ONCE(ring->queues[qid1]);
+	}
+
+	/* Get two different queues using optimized bounded random */
+	qid1 = cpumask_nth(get_random_u32_below(weight), mask);
+	queue1 = READ_ONCE(ring->queues[qid1]);
+
+	qid2 = cpumask_nth(get_random_u32_below(weight), mask);
+
+	/* Avoid retries and take this queue for code simplicity */
+	if (qid1 == qid2)
+		return queue1;
+
+	queue2 = READ_ONCE(ring->queues[qid2]);
+
+	if (WARN_ON_ONCE(!queue1 || !queue2))
+		return NULL;
+
+	return (READ_ONCE(queue1->nr_reqs) < READ_ONCE(queue2->nr_reqs)) ?
+		queue1 : queue2;
+}
+
+/*
+ * Get the best queue for the current CPU
+ */
+static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
 {
 	unsigned int qid;
-	struct fuse_ring_queue *queue;
+	struct fuse_ring_queue *local_queue, *best_numa, *best_global;
+	int local_node;
+	const struct cpumask *numa_mask, *global_mask;
 
 	qid = task_cpu(current);
-
 	if (WARN_ONCE(qid >= ring->max_nr_queues,
 		      "Core number (%u) exceeds nr queues (%zu)\n", qid,
 		      ring->max_nr_queues))
 		qid = 0;
 
-	queue = ring->queues[qid];
-	WARN_ONCE(!queue, "Missing queue for qid %d\n", qid);
+	local_queue = READ_ONCE(ring->queues[qid]);
+	local_node = cpu_to_node(qid);
+	if (WARN_ON_ONCE(local_node > ring->nr_numa_nodes))
+		local_node = 0;
 
-	return queue;
+	/* Fast path: if local queue exists and is not overloaded, use it */
+	if (local_queue &&
+	    READ_ONCE(local_queue->nr_reqs) <= FURING_Q_LOCAL_THRESHOLD)
+		return local_queue;
+
+	/* Find best NUMA-local queue */
+	numa_mask = ring->numa_registered_q_mask[local_node];
+	best_numa = fuse_uring_best_queue(numa_mask, ring);
+
+	/* If NUMA queue is under threshold, use it */
+	if (best_numa &&
+	    READ_ONCE(best_numa->nr_reqs) <= FURING_Q_NUMA_THRESHOLD)
+		return best_numa;
+
+	/* NUMA queues above threshold, try global queues */
+	global_mask = ring->registered_q_mask;
+	best_global = fuse_uring_best_queue(global_mask, ring);
+
+	/* Might happen during tear down */
+	if (!best_global)
+		return NULL;
+
+	/* If global queue is under double threshold, use it */
+	if (READ_ONCE(best_global->nr_reqs) <= FURING_Q_GLOBAL_THRESHOLD)
+		return best_global;
+
+	/* There is no ideal queue, stay numa_local if possible */
+	return best_numa ? best_numa : best_global;
 }
 
 static void fuse_uring_dispatch_ent(struct fuse_ring_ent *ent)
@@ -1321,7 +1397,7 @@ void fuse_uring_queue_fuse_req(struct fuse_iqueue *fiq, struct fuse_req *req)
 	int err;
 
 	err = -EINVAL;
-	queue = fuse_uring_task_to_queue(ring);
+	queue = fuse_uring_get_queue(ring);
 	if (!queue)
 		goto err;
 
@@ -1365,7 +1441,7 @@ bool fuse_uring_queue_bq_req(struct fuse_req *req)
 	struct fuse_ring_queue *queue;
 	struct fuse_ring_ent *ent = NULL;
 
-	queue = fuse_uring_task_to_queue(ring);
+	queue = fuse_uring_get_queue(ring);
 	if (!queue)
 		return false;
 

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v3 5/6] fuse: {io-uring} Allow reduced number of ring queues
  2025-10-13 17:09 [PATCH v3 0/6] fuse: {io-uring} Allow to reduce the number of queues and request distribution Bernd Schubert
                   ` (3 preceding siblings ...)
  2025-10-13 17:10 ` [PATCH v3 4/6] fuse: {io-uring} Distribute load among queues Bernd Schubert
@ 2025-10-13 17:10 ` Bernd Schubert
  2025-10-15  9:25   ` Luis Henriques
  2025-10-13 17:10 ` [PATCH v3 6/6] fuse: {io-uring} Queue background requests on a different core Bernd Schubert
  2025-10-14  8:43 ` [PATCH v3 0/6] fuse: {io-uring} Allow to reduce the number of queues and request distribution Gang He
  6 siblings, 1 reply; 26+ messages in thread
From: Bernd Schubert @ 2025-10-13 17:10 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Joanne Koong, linux-fsdevel, Luis Henriques, Gang He,
	Bernd Schubert

Queue selection (fuse_uring_get_queue) can handle a reduced number of
queues - using io-uring is now possible even with a single
queue and entry.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
 fs/fuse/dev_uring.c       | 35 +++--------------------------------
 fs/fuse/inode.c           |  2 +-
 include/uapi/linux/fuse.h |  3 +++
 3 files changed, 7 insertions(+), 33 deletions(-)

diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 92401adecf813b1c4570d925718be772c8f02975..aca71ce5632efd1d80e3ac0ad4e81ac1536dbc47 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -999,31 +999,6 @@ static int fuse_uring_commit_fetch(struct io_uring_cmd *cmd, int issue_flags,
 	return 0;
 }
 
-static bool is_ring_ready(struct fuse_ring *ring, int current_qid)
-{
-	int qid;
-	struct fuse_ring_queue *queue;
-	bool ready = true;
-
-	for (qid = 0; qid < ring->max_nr_queues && ready; qid++) {
-		if (current_qid == qid)
-			continue;
-
-		queue = ring->queues[qid];
-		if (!queue) {
-			ready = false;
-			break;
-		}
-
-		spin_lock(&queue->lock);
-		if (list_empty(&queue->ent_avail_queue))
-			ready = false;
-		spin_unlock(&queue->lock);
-	}
-
-	return ready;
-}
-
 /*
  * fuse_uring_req_fetch command handling
  */
@@ -1051,13 +1026,9 @@ static void fuse_uring_do_register(struct fuse_ring_ent *ent,
 	cpumask_set_cpu(queue->qid, ring->numa_registered_q_mask[node]);
 
 	if (!ring->ready) {
-		bool ready = is_ring_ready(ring, queue->qid);
-
-		if (ready) {
-			WRITE_ONCE(fiq->ops, &fuse_io_uring_ops);
-			WRITE_ONCE(ring->ready, true);
-			wake_up_all(&fc->blocked_waitq);
-		}
+		WRITE_ONCE(fiq->ops, &fuse_io_uring_ops);
+		WRITE_ONCE(ring->ready, true);
+		wake_up_all(&fc->blocked_waitq);
 	}
 }
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index d1babf56f25470fcc08fe400467b3450e8b7464a..3f97cc307b4d77e12334180731589c579b2eb7a2 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1503,7 +1503,7 @@ static struct fuse_init_args *fuse_new_init(struct fuse_mount *fm)
 		FUSE_SECURITY_CTX | FUSE_CREATE_SUPP_GROUP |
 		FUSE_HAS_EXPIRE_ONLY | FUSE_DIRECT_IO_ALLOW_MMAP |
 		FUSE_NO_EXPORT_SUPPORT | FUSE_HAS_RESEND | FUSE_ALLOW_IDMAP |
-		FUSE_REQUEST_TIMEOUT;
+		FUSE_REQUEST_TIMEOUT | FUSE_URING_REDUCED_Q;
 #ifdef CONFIG_FUSE_DAX
 	if (fm->fc->dax)
 		flags |= FUSE_MAP_ALIGNMENT;
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index c13e1f9a2f12bd39f535188cb5466688eba42263..3da20d9bba1cb6336734511d21da9f64cea0e720 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -448,6 +448,8 @@ struct fuse_file_lock {
  * FUSE_OVER_IO_URING: Indicate that client supports io-uring
  * FUSE_REQUEST_TIMEOUT: kernel supports timing out requests.
  *			 init_out.request_timeout contains the timeout (in secs)
+ * FUSE_URING_REDUCED_Q: Client (kernel) supports less queues - Server is free
+ *			 to register between 1 and nr-core io-uring queues
  */
 #define FUSE_ASYNC_READ		(1 << 0)
 #define FUSE_POSIX_LOCKS	(1 << 1)
@@ -495,6 +497,7 @@ struct fuse_file_lock {
 #define FUSE_ALLOW_IDMAP	(1ULL << 40)
 #define FUSE_OVER_IO_URING	(1ULL << 41)
 #define FUSE_REQUEST_TIMEOUT	(1ULL << 42)
+#define FUSE_URING_REDUCED_Q (1ULL << 43)
 
 /**
  * CUSE INIT request/reply flags

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v3 6/6] fuse: {io-uring} Queue background requests on a different core
  2025-10-13 17:09 [PATCH v3 0/6] fuse: {io-uring} Allow to reduce the number of queues and request distribution Bernd Schubert
                   ` (4 preceding siblings ...)
  2025-10-13 17:10 ` [PATCH v3 5/6] fuse: {io-uring} Allow reduced number of ring queues Bernd Schubert
@ 2025-10-13 17:10 ` Bernd Schubert
  2025-10-15  9:50   ` Luis Henriques
  2025-10-20  7:15   ` Dan Carpenter
  2025-10-14  8:43 ` [PATCH v3 0/6] fuse: {io-uring} Allow to reduce the number of queues and request distribution Gang He
  6 siblings, 2 replies; 26+ messages in thread
From: Bernd Schubert @ 2025-10-13 17:10 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Joanne Koong, linux-fsdevel, Luis Henriques, Gang He,
	Bernd Schubert

Running background IO on a different core makes quite a difference.

fio --directory=/tmp/dest --name=iops.\$jobnum --rw=randread \
--bs=4k --size=1G --numjobs=1 --iodepth=4 --time_based \
--runtime=30s --group_reporting --ioengine=io_uring \
--direct=1

unpatched
   READ: bw=272MiB/s (285MB/s), 272MiB/s-272MiB/s ...
patched
   READ: bw=760MiB/s (797MB/s), 760MiB/s-760MiB/s ...

With --iodepth=8

unpatched
   READ: bw=466MiB/s (489MB/s), 466MiB/s-466MiB/s ...
patched
   READ: bw=966MiB/s (1013MB/s), 966MiB/s-966MiB/s ...
2nd run:
   READ: bw=1014MiB/s (1064MB/s), 1014MiB/s-1014MiB/s ...

Without io-uring (--iodepth=8)
   READ: bw=729MiB/s (764MB/s), 729MiB/s-729MiB/s ...

Without fuse (--iodepth=8)
   READ: bw=2199MiB/s (2306MB/s), 2199MiB/s-2199MiB/s ...

(Tests were done with
<libfuse>/example/passthrough_hp -o allow_other --nopassthrough  \
[-o io_uring] /tmp/source /tmp/dest
)

Additional notes:

With FURING_NEXT_QUEUE_RETRIES=0 (--iodepth=8)
   READ: bw=903MiB/s (946MB/s), 903MiB/s-903MiB/s ...

With just a random qid (--iodepth=8)
   READ: bw=429MiB/s (450MB/s), 429MiB/s-429MiB/s ...

With --iodepth=1
unpatched
   READ: bw=195MiB/s (204MB/s), 195MiB/s-195MiB/s ...
patched
   READ: bw=232MiB/s (243MB/s), 232MiB/s-232MiB/s ...

With --iodepth=1 --numjobs=2
unpatched
   READ: bw=966MiB/s (1013MB/s), 966MiB/s-966MiB/s ...
patched
   READ: bw=1821MiB/s (1909MB/s), 1821MiB/s-1821MiB/s ...

With --iodepth=1 --numjobs=8
unpatched
   READ: bw=1138MiB/s (1193MB/s), 1138MiB/s-1138MiB/s ...
patched
   READ: bw=1650MiB/s (1730MB/s), 1650MiB/s-1650MiB/s ...
fuse without io-uring
   READ: bw=1314MiB/s (1378MB/s), 1314MiB/s-1314MiB/s ...
no-fuse
   READ: bw=2566MiB/s (2690MB/s), 2566MiB/s-2566MiB/s ...

In summary, for async requests the core doing application IO is busy
sending requests, and processing IO should be done on a different core.
Spreading the load onto random cores is not desirable either, as such a
core might be frequency scaled down and/or in C1 sleep states. Not shown
here, but differences are much smaller when the system uses the
performance governor instead of schedutil (ubuntu default) - obviously
at the cost of higher system power consumption, which is not desirable
either.

Results without io-uring (which uses fixed libfuse threads per queue)
heavily depend on the current number of active threads. Libfuse uses a
default of max 10 threads, but the actual max number of threads is a
parameter. Also, no-fuse-io-uring results heavily depend on whether
another workload was already running before, as libfuse starts these
threads dynamically - i.e. the more threads are active, the worse the
performance.
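
For clarity, the "queue background requests on the next CPU of the same
NUMA domain" step added below roughly reduces to the following (the
retry and threshold handling of the actual patch is omitted and the
function name is made up):

	/*
	 * qid: CPU of the submitting task,
	 * numa_mask: registered queues on its NUMA node.
	 * Returns >= nr_cpu_ids if the mask is empty, caller must check.
	 */
	static unsigned int next_queue_on_node(unsigned int qid,
					       const struct cpumask *numa_mask)
	{
		unsigned int weight = cpumask_weight(numa_mask);

		if (weight <= 1)
			return cpumask_first(numa_mask);

		/* neighbour queue, wrapping around within the node */
		return cpumask_nth((qid + 1) % weight, numa_mask);
	}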

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
 fs/fuse/dev_uring.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 46 insertions(+), 7 deletions(-)

diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index aca71ce5632efd1d80e3ac0ad4e81ac1536dbc47..f35dd98abfe6407849fec55847c6b3d186383803 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -23,6 +23,7 @@ MODULE_PARM_DESC(enable_uring,
 #define FURING_Q_LOCAL_THRESHOLD 2
 #define FURING_Q_NUMA_THRESHOLD (FURING_Q_LOCAL_THRESHOLD + 1)
 #define FURING_Q_GLOBAL_THRESHOLD (FURING_Q_LOCAL_THRESHOLD * 2)
+#define FURING_NEXT_QUEUE_RETRIES 2
 
 bool fuse_uring_enabled(void)
 {
@@ -1302,12 +1303,15 @@ static struct fuse_ring_queue *fuse_uring_best_queue(const struct cpumask *mask,
 /*
  * Get the best queue for the current CPU
  */
-static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
+static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring,
+						    bool background)
 {
 	unsigned int qid;
 	struct fuse_ring_queue *local_queue, *best_numa, *best_global;
 	int local_node;
 	const struct cpumask *numa_mask, *global_mask;
+	int retries = 0;
+	int weight = -1;
 
 	qid = task_cpu(current);
 	if (WARN_ONCE(qid >= ring->max_nr_queues,
@@ -1315,16 +1319,50 @@ static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
 		      ring->max_nr_queues))
 		qid = 0;
 
-	local_queue = READ_ONCE(ring->queues[qid]);
 	local_node = cpu_to_node(qid);
 	if (WARN_ON_ONCE(local_node > ring->nr_numa_nodes))
 		local_node = 0;
 
-	/* Fast path: if local queue exists and is not overloaded, use it */
-	if (local_queue &&
-	    READ_ONCE(local_queue->nr_reqs) <= FURING_Q_LOCAL_THRESHOLD)
+	local_queue = READ_ONCE(ring->queues[qid]);
+
+retry:
+	/*
+	 * For background requests, try next CPU in same NUMA domain.
+	 * I.e. cpu-0 creates async requests, cpu-1 io processes.
+	 * Similar for foreground requests, when the local queue does not
+	 * exist - still better to always wake the same cpu id.
+	 */
+	if (background || !local_queue) {
+		numa_mask = ring->numa_registered_q_mask[local_node];
+
+		if (weight == -1)
+			weight = cpumask_weight(numa_mask);
+
+		if (weight == 0)
+			goto global;
+
+		if (weight > 1) {
+			int idx = (qid + 1) % weight;
+
+			qid = cpumask_nth(idx, numa_mask);
+		} else {
+			qid = cpumask_first(numa_mask);
+		}
+
+		local_queue = READ_ONCE(ring->queues[qid]);
+		if (WARN_ON_ONCE(!local_queue))
+			return NULL;
+	}
+
+	if (READ_ONCE(local_queue->nr_reqs) <= FURING_Q_NUMA_THRESHOLD)
 		return local_queue;
 
+	if (retries < FURING_NEXT_QUEUE_RETRIES && weight > retries + 1) {
+		retries++;
+		local_queue = NULL;
+		goto retry;
+	}
+
 	/* Find best NUMA-local queue */
 	numa_mask = ring->numa_registered_q_mask[local_node];
 	best_numa = fuse_uring_best_queue(numa_mask, ring);
@@ -1334,6 +1372,7 @@ static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
 	    READ_ONCE(best_numa->nr_reqs) <= FURING_Q_NUMA_THRESHOLD)
 		return best_numa;
 
+global:
 	/* NUMA queues above threshold, try global queues */
 	global_mask = ring->registered_q_mask;
 	best_global = fuse_uring_best_queue(global_mask, ring);
@@ -1368,7 +1407,7 @@ void fuse_uring_queue_fuse_req(struct fuse_iqueue *fiq, struct fuse_req *req)
 	int err;
 
 	err = -EINVAL;
-	queue = fuse_uring_get_queue(ring);
+	queue = fuse_uring_get_queue(ring, false);
 	if (!queue)
 		goto err;
 
@@ -1412,7 +1451,7 @@ bool fuse_uring_queue_bq_req(struct fuse_req *req)
 	struct fuse_ring_queue *queue;
 	struct fuse_ring_ent *ent = NULL;
 
-	queue = fuse_uring_get_queue(ring);
+	queue = fuse_uring_get_queue(ring, true);
 	if (!queue)
 		return false;
 

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 0/6] fuse: {io-uring} Allow to reduce the number of queues and request distribution
  2025-10-13 17:09 [PATCH v3 0/6] fuse: {io-uring} Allow to reduce the number of queues and request distribution Bernd Schubert
                   ` (5 preceding siblings ...)
  2025-10-13 17:10 ` [PATCH v3 6/6] fuse: {io-uring} Queue background requests on a different core Bernd Schubert
@ 2025-10-14  8:43 ` Gang He
  2025-10-14  9:14   ` Bernd Schubert
  6 siblings, 1 reply; 26+ messages in thread
From: Gang He @ 2025-10-14  8:43 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, Joanne Koong, linux-fsdevel, Luis Henriques

Hi Bernd,

Thanks for your optimization patches.
I applied these patches; for asynchronous IO (iodepth > 1) it looks
like the patches improve performance as expected.
But for synchronous IO (iodepth = 1) there seems to be a regression
(performance drop).
Did you check the above regression issue?

Thanks
Gang

Bernd Schubert <bschubert@ddn.com> 于2025年10月14日周二 01:10写道:
>
> This adds bitmaps that track which queues are registered, plus per-queue
> counters of queued requests. These are then used to map from the
> submitting core to a queue and to distribute load across queues. NUMA
> affinity is taken into account, and the fuse client/server protocol
> does not need any changes - everything is handled internally in the
> fuse client.
>
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> ---
> Changes in v3:
> - removed FUSE_URING_QUEUE_THRESHOLD (Luis)
> - Fixed accidental early return of queue1 in fuse_uring_best_queue()
> - Fixed a similar early return of 'best_global'
> - Added sanity checks for cpu_to_node()
> - Removed retry loops in fuse_uring_best_queue() for code simplicity
> - Reduced local numa retries in fuse_uring_get_queue
> - Added 'FUSE_URING_REDUCED_Q' FUSE_INIT flag to inform userspace
>   about the possibility of a reduced number of queues
> - Link to v2: https://lore.kernel.org/r/20251003-reduced-nr-ring-queues_3-v2-0-742ff1a8fc58@ddn.com
> - Removed wake-on-same-cpu patch from this series,
>   it will be sent out independently
> - Used READ_ONCE(queue->nr_reqs) as the value is updated (with a lock
>   held) by other threads and possibly other cpus.
>
> ---
> Bernd Schubert (6):
>       fuse: {io-uring} Add queue length counters
>       fuse: {io-uring} Rename ring->nr_queues to max_nr_queues
>       fuse: {io-uring} Use bitmaps to track registered queues
>       fuse: {io-uring} Distribute load among queues
>       fuse: {io-uring} Allow reduced number of ring queues
>       fuse: {io-uring} Queue background requests on a different core
>
>  fs/fuse/dev_uring.c       | 260 ++++++++++++++++++++++++++++++++++++----------
>  fs/fuse/dev_uring_i.h     |  14 ++-
>  fs/fuse/inode.c           |   2 +-
>  include/uapi/linux/fuse.h |   3 +
>  4 files changed, 224 insertions(+), 55 deletions(-)
> ---
> base-commit: ec714e371f22f716a04e6ecb2a24988c92b26911
> change-id: 20250722-reduced-nr-ring-queues_3-6acb79dad978
>
> Best regards,
> --
> Bernd Schubert <bschubert@ddn.com>
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 0/6] fuse: {io-uring} Allow to reduce the number of queues and request distribution
  2025-10-14  8:43 ` [PATCH v3 0/6] fuse: {io-uring} Allow to reduce the number of queues and request distribution Gang He
@ 2025-10-14  9:14   ` Bernd Schubert
  2025-10-16  6:15     ` Gang He
  0 siblings, 1 reply; 26+ messages in thread
From: Bernd Schubert @ 2025-10-14  9:14 UTC (permalink / raw)
  To: Gang He
  Cc: Miklos Szeredi, Joanne Koong, linux-fsdevel@vger.kernel.org,
	Luis Henriques

On 10/14/25 10:43, Gang He wrote:
> 
> Hi Bernd,
> 
> Thanks for your optimization patches.
> I applied these patches; for asynchronous IO (iodepth > 1) it looks
> like the patches improve performance as expected.
> But for synchronous IO (iodepth = 1) there seems to be a regression
> (performance drop).
> Did you check the above regression issue?

Hi Gang,

please see patch 6/6, it has numbers with iodepth=1 - I don't see a
regression. Also, I don't think '--ioengine=io_uring --iodepth=4'
results in synchronous IO, but for synchronous IOs you need
'--ioengine=psync'.
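
For example, something like this (based on the fio command line from
patch 6/6, with only the io_uring specific options swapped out - adjust
as needed):

fio --directory=/tmp/dest --name=iops.\$jobnum --rw=randread \
--bs=4k --size=1G --numjobs=1 --ioengine=psync --direct=1 \
--time_based --runtime=30s --group_reporting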

For truly synchronous IOs you need
https://lore.kernel.org/all/20251013-wake-same-cpu-v1-1-45d8059adde7@ddn.com/


Could you please provide the exact fio command you are using, so that I
can give it a quick test?

Thanks,
Bernd

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 1/6] fuse: {io-uring} Add queue length counters
  2025-10-13 17:09 ` [PATCH v3 1/6] fuse: {io-uring} Add queue length counters Bernd Schubert
@ 2025-10-15  9:19   ` Luis Henriques
  0 siblings, 0 replies; 26+ messages in thread
From: Luis Henriques @ 2025-10-15  9:19 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: Miklos Szeredi, Joanne Koong, linux-fsdevel, Gang He

On Mon, Oct 13 2025, Bernd Schubert wrote:

> This is another preparation patch; the counter will be used for
> deciding which queue to add a request to.

Just a very small nit: "This is another preparation..." -- this is the
first patch in the series, so it makes more sense to have this sentence in
the second patch instead.  I guess the patches got reordered at some point
and the wording didn't get updated accordingly.

Cheers,
-- 
Luís

>
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
> ---
>  fs/fuse/dev_uring.c   | 17 +++++++++++++++--
>  fs/fuse/dev_uring_i.h |  3 +++
>  2 files changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index f6b12aebb8bbe7d255980593b75b5fb5af9c669e..872ae17ffaf49a30c46ef89c1668684a61a0cce4 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -86,13 +86,13 @@ static void fuse_uring_req_end(struct fuse_ring_ent *ent, struct fuse_req *req,
>  	lockdep_assert_not_held(&queue->lock);
>  	spin_lock(&queue->lock);
>  	ent->fuse_req = NULL;
> +	queue->nr_reqs--;
>  	if (test_bit(FR_BACKGROUND, &req->flags)) {
>  		queue->active_background--;
>  		spin_lock(&fc->bg_lock);
>  		fuse_uring_flush_bg(queue);
>  		spin_unlock(&fc->bg_lock);
>  	}
> -
>  	spin_unlock(&queue->lock);
>  
>  	if (error)
> @@ -112,6 +112,7 @@ static void fuse_uring_abort_end_queue_requests(struct fuse_ring_queue *queue)
>  	list_for_each_entry(req, &queue->fuse_req_queue, list)
>  		clear_bit(FR_PENDING, &req->flags);
>  	list_splice_init(&queue->fuse_req_queue, &req_list);
> +	queue->nr_reqs = 0;
>  	spin_unlock(&queue->lock);
>  
>  	/* must not hold queue lock to avoid order issues with fi->lock */
> @@ -1280,10 +1281,13 @@ void fuse_uring_queue_fuse_req(struct fuse_iqueue *fiq, struct fuse_req *req)
>  	req->ring_queue = queue;
>  	ent = list_first_entry_or_null(&queue->ent_avail_queue,
>  				       struct fuse_ring_ent, list);
> +	queue->nr_reqs++;
> +
>  	if (ent)
>  		fuse_uring_add_req_to_ring_ent(ent, req);
>  	else
>  		list_add_tail(&req->list, &queue->fuse_req_queue);
> +
>  	spin_unlock(&queue->lock);
>  
>  	if (ent)
> @@ -1319,6 +1323,7 @@ bool fuse_uring_queue_bq_req(struct fuse_req *req)
>  	set_bit(FR_URING, &req->flags);
>  	req->ring_queue = queue;
>  	list_add_tail(&req->list, &queue->fuse_req_bg_queue);
> +	queue->nr_reqs++;
>  
>  	ent = list_first_entry_or_null(&queue->ent_avail_queue,
>  				       struct fuse_ring_ent, list);
> @@ -1351,8 +1356,16 @@ bool fuse_uring_queue_bq_req(struct fuse_req *req)
>  bool fuse_uring_remove_pending_req(struct fuse_req *req)
>  {
>  	struct fuse_ring_queue *queue = req->ring_queue;
> +	bool removed = fuse_remove_pending_req(req, &queue->lock);
>  
> -	return fuse_remove_pending_req(req, &queue->lock);
> +	if (removed) {
> +		/* Update counters after successful removal */
> +		spin_lock(&queue->lock);
> +		queue->nr_reqs--;
> +		spin_unlock(&queue->lock);
> +	}
> +
> +	return removed;
>  }
>  
>  static const struct fuse_iqueue_ops fuse_io_uring_ops = {
> diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
> index 51a563922ce14158904a86c248c77767be4fe5ae..c63bed9f863d53d4ac2bed7bfbda61941cd99083 100644
> --- a/fs/fuse/dev_uring_i.h
> +++ b/fs/fuse/dev_uring_i.h
> @@ -94,6 +94,9 @@ struct fuse_ring_queue {
>  	/* background fuse requests */
>  	struct list_head fuse_req_bg_queue;
>  
> +	/* number of requests queued or in userspace */
> +	unsigned int nr_reqs;
> +
>  	struct fuse_pqueue fpq;
>  
>  	unsigned int active_background;
>
> -- 
> 2.43.0
>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 5/6] fuse: {io-uring} Allow reduced number of ring queues
  2025-10-13 17:10 ` [PATCH v3 5/6] fuse: {io-uring} Allow reduced number of ring queues Bernd Schubert
@ 2025-10-15  9:25   ` Luis Henriques
  2025-10-15  9:31     ` Bernd Schubert
  0 siblings, 1 reply; 26+ messages in thread
From: Luis Henriques @ 2025-10-15  9:25 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: Miklos Szeredi, Joanne Koong, linux-fsdevel, Gang He

On Mon, Oct 13 2025, Bernd Schubert wrote:

> Queue selection (fuse_uring_get_queue) can handle a reduced number of
> queues - using io-uring is now possible even with a single
> queue and entry.
>
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> ---
>  fs/fuse/dev_uring.c       | 35 +++--------------------------------
>  fs/fuse/inode.c           |  2 +-
>  include/uapi/linux/fuse.h |  3 +++
>  3 files changed, 7 insertions(+), 33 deletions(-)
>
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index 92401adecf813b1c4570d925718be772c8f02975..aca71ce5632efd1d80e3ac0ad4e81ac1536dbc47 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -999,31 +999,6 @@ static int fuse_uring_commit_fetch(struct io_uring_cmd *cmd, int issue_flags,
>  	return 0;
>  }
>  
> -static bool is_ring_ready(struct fuse_ring *ring, int current_qid)
> -{
> -	int qid;
> -	struct fuse_ring_queue *queue;
> -	bool ready = true;
> -
> -	for (qid = 0; qid < ring->max_nr_queues && ready; qid++) {
> -		if (current_qid == qid)
> -			continue;
> -
> -		queue = ring->queues[qid];
> -		if (!queue) {
> -			ready = false;
> -			break;
> -		}
> -
> -		spin_lock(&queue->lock);
> -		if (list_empty(&queue->ent_avail_queue))
> -			ready = false;
> -		spin_unlock(&queue->lock);
> -	}
> -
> -	return ready;
> -}
> -
>  /*
>   * fuse_uring_req_fetch command handling
>   */
> @@ -1051,13 +1026,9 @@ static void fuse_uring_do_register(struct fuse_ring_ent *ent,
>  	cpumask_set_cpu(queue->qid, ring->numa_registered_q_mask[node]);
>  
>  	if (!ring->ready) {
> -		bool ready = is_ring_ready(ring, queue->qid);
> -
> -		if (ready) {
> -			WRITE_ONCE(fiq->ops, &fuse_io_uring_ops);
> -			WRITE_ONCE(ring->ready, true);
> -			wake_up_all(&fc->blocked_waitq);
> -		}
> +		WRITE_ONCE(fiq->ops, &fuse_io_uring_ops);
> +		WRITE_ONCE(ring->ready, true);
> +		wake_up_all(&fc->blocked_waitq);
>  	}
>  }
>  
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index d1babf56f25470fcc08fe400467b3450e8b7464a..3f97cc307b4d77e12334180731589c579b2eb7a2 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -1503,7 +1503,7 @@ static struct fuse_init_args *fuse_new_init(struct fuse_mount *fm)
>  		FUSE_SECURITY_CTX | FUSE_CREATE_SUPP_GROUP |
>  		FUSE_HAS_EXPIRE_ONLY | FUSE_DIRECT_IO_ALLOW_MMAP |
>  		FUSE_NO_EXPORT_SUPPORT | FUSE_HAS_RESEND | FUSE_ALLOW_IDMAP |
> -		FUSE_REQUEST_TIMEOUT;
> +		FUSE_REQUEST_TIMEOUT | FUSE_URING_REDUCED_Q;
>  #ifdef CONFIG_FUSE_DAX
>  	if (fm->fc->dax)
>  		flags |= FUSE_MAP_ALIGNMENT;
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index c13e1f9a2f12bd39f535188cb5466688eba42263..3da20d9bba1cb6336734511d21da9f64cea0e720 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -448,6 +448,8 @@ struct fuse_file_lock {
>   * FUSE_OVER_IO_URING: Indicate that client supports io-uring
>   * FUSE_REQUEST_TIMEOUT: kernel supports timing out requests.
>   *			 init_out.request_timeout contains the timeout (in secs)
> + * FUSE_URING_REDUCED_Q: Client (kernel) supports less queues - Server is free
> + *			 to register between 1 and nr-core io-uring queues
>   */
>  #define FUSE_ASYNC_READ		(1 << 0)
>  #define FUSE_POSIX_LOCKS	(1 << 1)
> @@ -495,6 +497,7 @@ struct fuse_file_lock {
>  #define FUSE_ALLOW_IDMAP	(1ULL << 40)
>  #define FUSE_OVER_IO_URING	(1ULL << 41)
>  #define FUSE_REQUEST_TIMEOUT	(1ULL << 42)
> +#define FUSE_URING_REDUCED_Q (1ULL << 43)

This flag doesn't seem to be used anywhere.  Should it be removed, or will
there be any usage for it?

Cheers,
-- 
Luís


>  /**
>   * CUSE INIT request/reply flags
>
> -- 
> 2.43.0
>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 5/6] fuse: {io-uring} Allow reduced number of ring queues
  2025-10-15  9:25   ` Luis Henriques
@ 2025-10-15  9:31     ` Bernd Schubert
  0 siblings, 0 replies; 26+ messages in thread
From: Bernd Schubert @ 2025-10-15  9:31 UTC (permalink / raw)
  To: Luis Henriques; +Cc: Miklos Szeredi, Joanne Koong, linux-fsdevel, Gang He



On 10/15/25 11:25, Luis Henriques wrote:
> On Mon, Oct 13 2025, Bernd Schubert wrote:
> 
>> Queue selection (fuse_uring_get_queue) can handle a reduced number of
>> queues - using io-uring is now possible even with a single
>> queue and entry.
>>
>> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
>> ---
>>  fs/fuse/dev_uring.c       | 35 +++--------------------------------
>>  fs/fuse/inode.c           |  2 +-
>>  include/uapi/linux/fuse.h |  3 +++
>>  3 files changed, 7 insertions(+), 33 deletions(-)
>>
>> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
>> index 92401adecf813b1c4570d925718be772c8f02975..aca71ce5632efd1d80e3ac0ad4e81ac1536dbc47 100644
>> --- a/fs/fuse/dev_uring.c
>> +++ b/fs/fuse/dev_uring.c
>> @@ -999,31 +999,6 @@ static int fuse_uring_commit_fetch(struct io_uring_cmd *cmd, int issue_flags,
>>  	return 0;
>>  }
>>  
>> -static bool is_ring_ready(struct fuse_ring *ring, int current_qid)
>> -{
>> -	int qid;
>> -	struct fuse_ring_queue *queue;
>> -	bool ready = true;
>> -
>> -	for (qid = 0; qid < ring->max_nr_queues && ready; qid++) {
>> -		if (current_qid == qid)
>> -			continue;
>> -
>> -		queue = ring->queues[qid];
>> -		if (!queue) {
>> -			ready = false;
>> -			break;
>> -		}
>> -
>> -		spin_lock(&queue->lock);
>> -		if (list_empty(&queue->ent_avail_queue))
>> -			ready = false;
>> -		spin_unlock(&queue->lock);
>> -	}
>> -
>> -	return ready;
>> -}
>> -
>>  /*
>>   * fuse_uring_req_fetch command handling
>>   */
>> @@ -1051,13 +1026,9 @@ static void fuse_uring_do_register(struct fuse_ring_ent *ent,
>>  	cpumask_set_cpu(queue->qid, ring->numa_registered_q_mask[node]);
>>  
>>  	if (!ring->ready) {
>> -		bool ready = is_ring_ready(ring, queue->qid);
>> -
>> -		if (ready) {
>> -			WRITE_ONCE(fiq->ops, &fuse_io_uring_ops);
>> -			WRITE_ONCE(ring->ready, true);
>> -			wake_up_all(&fc->blocked_waitq);
>> -		}
>> +		WRITE_ONCE(fiq->ops, &fuse_io_uring_ops);
>> +		WRITE_ONCE(ring->ready, true);
>> +		wake_up_all(&fc->blocked_waitq);
>>  	}
>>  }
>>  
>> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
>> index d1babf56f25470fcc08fe400467b3450e8b7464a..3f97cc307b4d77e12334180731589c579b2eb7a2 100644
>> --- a/fs/fuse/inode.c
>> +++ b/fs/fuse/inode.c
>> @@ -1503,7 +1503,7 @@ static struct fuse_init_args *fuse_new_init(struct fuse_mount *fm)
>>  		FUSE_SECURITY_CTX | FUSE_CREATE_SUPP_GROUP |
>>  		FUSE_HAS_EXPIRE_ONLY | FUSE_DIRECT_IO_ALLOW_MMAP |
>>  		FUSE_NO_EXPORT_SUPPORT | FUSE_HAS_RESEND | FUSE_ALLOW_IDMAP |
>> -		FUSE_REQUEST_TIMEOUT;
>> +		FUSE_REQUEST_TIMEOUT | FUSE_URING_REDUCED_Q;
>>  #ifdef CONFIG_FUSE_DAX
>>  	if (fm->fc->dax)
>>  		flags |= FUSE_MAP_ALIGNMENT;
>> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
>> index c13e1f9a2f12bd39f535188cb5466688eba42263..3da20d9bba1cb6336734511d21da9f64cea0e720 100644
>> --- a/include/uapi/linux/fuse.h
>> +++ b/include/uapi/linux/fuse.h
>> @@ -448,6 +448,8 @@ struct fuse_file_lock {
>>   * FUSE_OVER_IO_URING: Indicate that client supports io-uring
>>   * FUSE_REQUEST_TIMEOUT: kernel supports timing out requests.
>>   *			 init_out.request_timeout contains the timeout (in secs)
>> + * FUSE_URING_REDUCED_Q: Client (kernel) supports less queues - Server is free
>> + *			 to register between 1 and nr-core io-uring queues
>>   */
>>  #define FUSE_ASYNC_READ		(1 << 0)
>>  #define FUSE_POSIX_LOCKS	(1 << 1)
>> @@ -495,6 +497,7 @@ struct fuse_file_lock {
>>  #define FUSE_ALLOW_IDMAP	(1ULL << 40)
>>  #define FUSE_OVER_IO_URING	(1ULL << 41)
>>  #define FUSE_REQUEST_TIMEOUT	(1ULL << 42)
>> +#define FUSE_URING_REDUCED_Q (1ULL << 43)
> 
> This flag doesn't seem to be used anywhere.  Should it be removed, or will
> there be any usage for it?

Hi Luis,

thanks for your reviews! I had forgotten to update the commit message;
the introduction patch [0/6] has it:

- Added 'FUSE_URING_REDUCED_Q' FUSE_INIT flag to inform userspace
  about the possibility of a reduced number of queues

Except for the corresponding libfuse branch
(https://github.com/bsbernd/libfuse/tree/uring-reduce-nr-queues),
there won't be any other user of it. That branch also needs to be
updated. Basically, a fuse server cannot unconditionally reduce the
number of queues - on a kernel that does not have the queue-reduction
feature, it would cause a blocking mount point, because the fuse
client (kernel) would wait for every core to get a queue registration.
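
Roughly (illustration only, not code from the libfuse branch - the
helper is hypothetical and assumes the 64-bit INIT flags were already
assembled from flags/flags2 of struct fuse_init_out):

	/* needs <stdint.h> and FUSE_URING_REDUCED_Q from <linux/fuse.h> */
	static unsigned int choose_nr_queues(uint64_t kernel_flags,
					     unsigned int nr_cpus,
					     unsigned int wanted)
	{
		if (kernel_flags & FUSE_URING_REDUCED_Q)
			return wanted;	/* any value in [1, nr_cpus] */

		/* older kernel: every core needs a registered queue */
		return nr_cpus;
	}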


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 6/6] fuse: {io-uring} Queue background requests on a different core
  2025-10-13 17:10 ` [PATCH v3 6/6] fuse: {io-uring} Queue background requests on a different core Bernd Schubert
@ 2025-10-15  9:50   ` Luis Henriques
  2025-10-15 10:27     ` Bernd Schubert
  2025-10-20  7:15   ` Dan Carpenter
  1 sibling, 1 reply; 26+ messages in thread
From: Luis Henriques @ 2025-10-15  9:50 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: Miklos Szeredi, Joanne Koong, linux-fsdevel, Gang He

On Mon, Oct 13 2025, Bernd Schubert wrote:

> Running background IO on a different core makes quite a difference.
>
> fio --directory=/tmp/dest --name=iops.\$jobnum --rw=randread \
> --bs=4k --size=1G --numjobs=1 --iodepth=4 --time_based \
> --runtime=30s --group_reporting --ioengine=io_uring \
> --direct=1
>
> unpatched
>    READ: bw=272MiB/s (285MB/s), 272MiB/s-272MiB/s ...
> patched
>    READ: bw=760MiB/s (797MB/s), 760MiB/s-760MiB/s ...
>
> With --iodepth=8
>
> unpatched
>    READ: bw=466MiB/s (489MB/s), 466MiB/s-466MiB/s ...
> patched
>    READ: bw=966MiB/s (1013MB/s), 966MiB/s-966MiB/s ...
> 2nd run:
>    READ: bw=1014MiB/s (1064MB/s), 1014MiB/s-1014MiB/s ...
>
> Without io-uring (--iodepth=8)
>    READ: bw=729MiB/s (764MB/s), 729MiB/s-729MiB/s ...
>
> Without fuse (--iodepth=8)
>    READ: bw=2199MiB/s (2306MB/s), 2199MiB/s-2199MiB/s ...
>
> (Tests were done with
> <libfuse>/example/passthrough_hp -o allow_other --nopassthrough  \
> [-o io_uring] /tmp/source /tmp/dest
> )
>
> Additional notes:
>
> With FURING_NEXT_QUEUE_RETRIES=0 (--iodepth=8)
>    READ: bw=903MiB/s (946MB/s), 903MiB/s-903MiB/s ...
>
> With just a random qid (--iodepth=8)
>    READ: bw=429MiB/s (450MB/s), 429MiB/s-429MiB/s ...
>
> With --iodepth=1
> unpatched
>    READ: bw=195MiB/s (204MB/s), 195MiB/s-195MiB/s ...
> patched
>    READ: bw=232MiB/s (243MB/s), 232MiB/s-232MiB/s ...
>
> With --iodepth=1 --numjobs=2
> unpatched
>    READ: bw=966MiB/s (1013MB/s), 966MiB/s-966MiB/s ...
> patched
>    READ: bw=1821MiB/s (1909MB/s), 1821MiB/s-1821MiB/s ...
>
> With --iodepth=1 --numjobs=8
> unpatched
>    READ: bw=1138MiB/s (1193MB/s), 1138MiB/s-1138MiB/s ...
> patched
>    READ: bw=1650MiB/s (1730MB/s), 1650MiB/s-1650MiB/s ...
> fuse without io-uring
>    READ: bw=1314MiB/s (1378MB/s), 1314MiB/s-1314MiB/s ...
> no-fuse
>    READ: bw=2566MiB/s (2690MB/s), 2566MiB/s-2566MiB/s ...
>
> In summary, for async requests the core doing application IO is busy
> sending requests, and processing IO should be done on a different core.
> Spreading the load onto random cores is not desirable either, as such a
> core might be frequency scaled down and/or in C1 sleep states. Not shown
> here, but differences are much smaller when the system uses the
> performance governor instead of schedutil (ubuntu default) - obviously
> at the cost of higher system power consumption, which is not desirable
> either.
>
> Results without io-uring (which uses fixed libfuse threads per queue)
> heavily depend on the current number of active threads. Libfuse uses a
> default of max 10 threads, but the actual max number of threads is a
> parameter. Also, no-fuse-io-uring results heavily depend on whether
> another workload was already running before, as libfuse starts these
> threads dynamically - i.e. the more threads are active, the worse the
> performance.
>
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> ---
>  fs/fuse/dev_uring.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 46 insertions(+), 7 deletions(-)
>
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index aca71ce5632efd1d80e3ac0ad4e81ac1536dbc47..f35dd98abfe6407849fec55847c6b3d186383803 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -23,6 +23,7 @@ MODULE_PARM_DESC(enable_uring,
>  #define FURING_Q_LOCAL_THRESHOLD 2
>  #define FURING_Q_NUMA_THRESHOLD (FURING_Q_LOCAL_THRESHOLD + 1)
>  #define FURING_Q_GLOBAL_THRESHOLD (FURING_Q_LOCAL_THRESHOLD * 2)
> +#define FURING_NEXT_QUEUE_RETRIES 2

Some bikeshedding:

Maybe that's just me, but I associate the names above (FURING_*) with
'fur' -- the act of making 'fur'.  I'm not sure that verb even exists,
but my brain makes me dislike those names :-)

But I'm not a native speaker, and I don't have any objection to those
names other than "I don't like fur!", so feel free to ignore me. :-)

>  
>  bool fuse_uring_enabled(void)
>  {
> @@ -1302,12 +1303,15 @@ static struct fuse_ring_queue *fuse_uring_best_queue(const struct cpumask *mask,
>  /*
>   * Get the best queue for the current CPU
>   */
> -static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
> +static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring,
> +						    bool background)
>  {
>  	unsigned int qid;
>  	struct fuse_ring_queue *local_queue, *best_numa, *best_global;
>  	int local_node;
>  	const struct cpumask *numa_mask, *global_mask;
> +	int retries = 0;
> +	int weight = -1;
>  
>  	qid = task_cpu(current);
>  	if (WARN_ONCE(qid >= ring->max_nr_queues,
> @@ -1315,16 +1319,50 @@ static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
>  		      ring->max_nr_queues))
>  		qid = 0;
>  
> -	local_queue = READ_ONCE(ring->queues[qid]);
>  	local_node = cpu_to_node(qid);
>  	if (WARN_ON_ONCE(local_node > ring->nr_numa_nodes))
>  		local_node = 0;
>  
> -	/* Fast path: if local queue exists and is not overloaded, use it */
> -	if (local_queue &&
> -	    READ_ONCE(local_queue->nr_reqs) <= FURING_Q_LOCAL_THRESHOLD)
> +	local_queue = READ_ONCE(ring->queues[qid]);
> +
> +retry:
> +	/*
> +	 * For background requests, try next CPU in same NUMA domain.
> +	 * I.e. cpu-0 creates async requests, cpu-1 io processes.
> +	 * Similar for foreground requests, when the local queue does not
> +	 * exist - still better to always wake the same cpu id.
> +	 */
> +	if (background || !local_queue) {
> +		numa_mask = ring->numa_registered_q_mask[local_node];
> +
> +		if (weight == -1)
> +			weight = cpumask_weight(numa_mask);
> +
> +		if (weight == 0)
> +			goto global;
> +
> +		if (weight > 1) {
> +			int idx = (qid + 1) % weight;
> +
> +			qid = cpumask_nth(idx, numa_mask);
> +		} else {
> +			qid = cpumask_first(numa_mask);
> +		}
> +
> +		local_queue = READ_ONCE(ring->queues[qid]);
> +		if (WARN_ON_ONCE(!local_queue))
> +			return NULL;
> +	}
> +
> +	if (READ_ONCE(local_queue->nr_reqs) <= FURING_Q_NUMA_THRESHOLD)
>  		return local_queue;
>  
> +	if (retries < FURING_NEXT_QUEUE_RETRIES && weight > retries + 1) {
> +		retries++;
> +		local_queue = NULL;
> +		goto retry;
> +	}
> +

I wonder if this retry loop is really useful.  If I understand this
correctly, we're doing a busy loop, hoping for a better queue to become
available.  But if the system is really busy doing IO this retry loop will
fail most of the time and will fall back to the next option -- only once
in a while we will get a better one.

Do you have evidence that this could be helpful?  Or am I misunderstanding
the purpose of this retry loop?

Cheers,
-- 
Luís

> 
>  	/* Find best NUMA-local queue */
>  	numa_mask = ring->numa_registered_q_mask[local_node];
>  	best_numa = fuse_uring_best_queue(numa_mask, ring);
> @@ -1334,6 +1372,7 @@ static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
>  	    READ_ONCE(best_numa->nr_reqs) <= FURING_Q_NUMA_THRESHOLD)
>  		return best_numa;
>  
> +global:
>  	/* NUMA queues above threshold, try global queues */
>  	global_mask = ring->registered_q_mask;
>  	best_global = fuse_uring_best_queue(global_mask, ring);
> @@ -1368,7 +1407,7 @@ void fuse_uring_queue_fuse_req(struct fuse_iqueue *fiq, struct fuse_req *req)
>  	int err;
>  
>  	err = -EINVAL;
> -	queue = fuse_uring_get_queue(ring);
> +	queue = fuse_uring_get_queue(ring, false);
>  	if (!queue)
>  		goto err;
>  
> @@ -1412,7 +1451,7 @@ bool fuse_uring_queue_bq_req(struct fuse_req *req)
>  	struct fuse_ring_queue *queue;
>  	struct fuse_ring_ent *ent = NULL;
>  
> -	queue = fuse_uring_get_queue(ring);
> +	queue = fuse_uring_get_queue(ring, true);
>  	if (!queue)
>  		return false;
>  
>
> -- 
> 2.43.0
>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 6/6] fuse: {io-uring} Queue background requests on a different core
  2025-10-15  9:50   ` Luis Henriques
@ 2025-10-15 10:27     ` Bernd Schubert
  2025-10-15 11:05       ` Luis Henriques
  0 siblings, 1 reply; 26+ messages in thread
From: Bernd Schubert @ 2025-10-15 10:27 UTC (permalink / raw)
  To: Luis Henriques
  Cc: Miklos Szeredi, Joanne Koong, linux-fsdevel@vger.kernel.org,
	Gang He

On 10/15/25 11:50, Luis Henriques wrote:
> On Mon, Oct 13 2025, Bernd Schubert wrote:
> 
>> Running background IO on a different core makes quite a difference.
>>
>> fio --directory=/tmp/dest --name=iops.\$jobnum --rw=randread \
>> --bs=4k --size=1G --numjobs=1 --iodepth=4 --time_based\
>> --runtime=30s --group_reporting --ioengine=io_uring\
>>  --direct=1
>>
>> unpatched
>>    READ: bw=272MiB/s (285MB/s), 272MiB/s-272MiB/s ...
>> patched
>>    READ: bw=760MiB/s (797MB/s), 760MiB/s-760MiB/s ...
>>
>> With --iodepth=8
>>
>> unpatched
>>    READ: bw=466MiB/s (489MB/s), 466MiB/s-466MiB/s ...
>> patched
>>    READ: bw=966MiB/s (1013MB/s), 966MiB/s-966MiB/s ...
>> 2nd run:
>>    READ: bw=1014MiB/s (1064MB/s), 1014MiB/s-1014MiB/s ...
>>
>> Without io-uring (--iodepth=8)
>>    READ: bw=729MiB/s (764MB/s), 729MiB/s-729MiB/s ...
>>
>> Without fuse (--iodepth=8)
>>    READ: bw=2199MiB/s (2306MB/s), 2199MiB/s-2199MiB/s ...
>>
>> (Tests were done with
>> <libfuse>/example/passthrough_hp -o allow_other --nopassthrough  \
>> [-o io_uring] /tmp/source /tmp/dest
>> )
>>
>> Additional notes:
>>
>> With FURING_NEXT_QUEUE_RETRIES=0 (--iodepth=8)
>>    READ: bw=903MiB/s (946MB/s), 903MiB/s-903MiB/s ...
>>
>> With just a random qid (--iodepth=8)
>>    READ: bw=429MiB/s (450MB/s), 429MiB/s-429MiB/s ...
>>
>> With --iodepth=1
>> unpatched
>>    READ: bw=195MiB/s (204MB/s), 195MiB/s-195MiB/s ...
>> patched
>>    READ: bw=232MiB/s (243MB/s), 232MiB/s-232MiB/s ...
>>
>> With --iodepth=1 --numjobs=2
>> unpatched
>>    READ: bw=966MiB/s (1013MB/s), 966MiB/s-966MiB/s ...
>> patched
>>    READ: bw=1821MiB/s (1909MB/s), 1821MiB/s-1821MiB/s ...
>>
>> With --iodepth=1 --numjobs=8
>> unpatched
>>    READ: bw=1138MiB/s (1193MB/s), 1138MiB/s-1138MiB/s ...
>> patched
>>    READ: bw=1650MiB/s (1730MB/s), 1650MiB/s-1650MiB/s ...
>> fuse without io-uring
>>    READ: bw=1314MiB/s (1378MB/s), 1314MiB/s-1314MiB/s ...
>> no-fuse
>>    READ: bw=2566MiB/s (2690MB/s), 2566MiB/s-2566MiB/s ...
>>
>> In summary, for async requests the core doing application IO is busy
>> sending requests, so processing the IOs should be done on a different core.
>> Spreading the load on random cores is also not desirable, as such a core
>> might be frequency scaled down and/or in C1 sleep states. Not shown here,
>> but differences are much smaller when the system uses the performance
>> governor instead of schedutil (ubuntu default) - obviously at the cost of
>> higher system power consumption, which is not desirable either.
>>
>> Results without io-uring (which uses fixed libfuse threads per queue)
>> heavily depend on the current number of active threads. Libfuse uses a
>> default of max 10 threads, but the actual max number of threads is a
>> parameter. Also, the fuse-without-io-uring results heavily depend on
>> whether another workload was already running before, as libfuse starts
>> these threads dynamically - i.e. the more threads are active, the worse
>> the performance.
>>
>> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
>> ---
>>  fs/fuse/dev_uring.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++-------
>>  1 file changed, 46 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
>> index aca71ce5632efd1d80e3ac0ad4e81ac1536dbc47..f35dd98abfe6407849fec55847c6b3d186383803 100644
>> --- a/fs/fuse/dev_uring.c
>> +++ b/fs/fuse/dev_uring.c
>> @@ -23,6 +23,7 @@ MODULE_PARM_DESC(enable_uring,
>>  #define FURING_Q_LOCAL_THRESHOLD 2
>>  #define FURING_Q_NUMA_THRESHOLD (FURING_Q_LOCAL_THRESHOLD + 1)
>>  #define FURING_Q_GLOBAL_THRESHOLD (FURING_Q_LOCAL_THRESHOLD * 2)
>> +#define FURING_NEXT_QUEUE_RETRIES 2
> 
> Some bikeshedding:
> 
> Maybe that's just me, but I associate the names above (FURING_*) to 'fur'
> -- the action of making 'fur'.  I'm not sure that verb even exists, but my
> brain makes me dislike those names :-)
> 
> But I'm not a native speaker, and I don't have any other objection to
> those names other than "I don't like fur!", so feel free to ignore me. :-)

I had initially called it "FUSE_URING", but it seemed to be a bit long.
I can change back to that or maybe you have a better short name in your
mind (as usual the hardest part is to find good variable names)?

> 
>>  
>>  bool fuse_uring_enabled(void)
>>  {
>> @@ -1302,12 +1303,15 @@ static struct fuse_ring_queue *fuse_uring_best_queue(const struct cpumask *mask,
>>  /*
>>   * Get the best queue for the current CPU
>>   */
>> -static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
>> +static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring,
>> +						    bool background)
>>  {
>>  	unsigned int qid;
>>  	struct fuse_ring_queue *local_queue, *best_numa, *best_global;
>>  	int local_node;
>>  	const struct cpumask *numa_mask, *global_mask;
>> +	int retries = 0;
>> +	int weight = -1;
>>  
>>  	qid = task_cpu(current);
>>  	if (WARN_ONCE(qid >= ring->max_nr_queues,
>> @@ -1315,16 +1319,50 @@ static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
>>  		      ring->max_nr_queues))
>>  		qid = 0;
>>  
>> -	local_queue = READ_ONCE(ring->queues[qid]);
>>  	local_node = cpu_to_node(qid);
>>  	if (WARN_ON_ONCE(local_node > ring->nr_numa_nodes))
>>  		local_node = 0;
>>  
>> -	/* Fast path: if local queue exists and is not overloaded, use it */
>> -	if (local_queue &&
>> -	    READ_ONCE(local_queue->nr_reqs) <= FURING_Q_LOCAL_THRESHOLD)
>> +	local_queue = READ_ONCE(ring->queues[qid]);
>> +
>> +retry:
>> +	/*
>> +	 * For background requests, try next CPU in same NUMA domain.
>> +	 * I.e. cpu-0 creates async requests, cpu-1 io processes.
>> +	 * Similar for foreground requests, when the local queue does not
>> +	 * exist - still better to always wake the same cpu id.
>> +	 */
>> +	if (background || !local_queue) {
>> +		numa_mask = ring->numa_registered_q_mask[local_node];
>> +
>> +		if (weight == -1)
>> +			weight = cpumask_weight(numa_mask);
>> +
>> +		if (weight == 0)
>> +			goto global;
>> +
>> +		if (weight > 1) {
>> +			int idx = (qid + 1) % weight;
>> +
>> +			qid = cpumask_nth(idx, numa_mask);
>> +		} else {
>> +			qid = cpumask_first(numa_mask);
>> +		}
>> +
>> +		local_queue = READ_ONCE(ring->queues[qid]);
>> +		if (WARN_ON_ONCE(!local_queue))
>> +			return NULL;
>> +	}
>> +
>> +	if (READ_ONCE(local_queue->nr_reqs) <= FURING_Q_NUMA_THRESHOLD)
>>  		return local_queue;
>>  
>> +	if (retries < FURING_NEXT_QUEUE_RETRIES && weight > retries + 1) {
>> +		retries++;
>> +		local_queue = NULL;
>> +		goto retry;
>> +	}
>> +
> 
> I wonder if this retry loop is really useful.  If I understand this
> correctly, we're doing a busy loop, hoping for a better queue to become
> available.  But if the system is really busy doing IO this retry loop will
> fail most of the time and will fall back to the next option -- only once
> in a while we will get a better one.
> 
> Do you have evidence that this could be helpful?  Or am I misunderstanding
> the purpose of this retry loop?

Yeah, I got better results with the retry, because using random queues
really doesn't work well - it wakes up cores behind random queues and
doesn't accumulate requests on the same queue, so it doesn't make use of
the ring advantage. Random selection should therefore be avoided, if
possible. Also, the more queues the system has, the worse the results
with the plain random fallback. I can provide some numbers later on today.


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 6/6] fuse: {io-uring} Queue background requests on a different core
  2025-10-15 10:27     ` Bernd Schubert
@ 2025-10-15 11:05       ` Luis Henriques
  0 siblings, 0 replies; 26+ messages in thread
From: Luis Henriques @ 2025-10-15 11:05 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, Joanne Koong, linux-fsdevel@vger.kernel.org,
	Gang He

On Wed, Oct 15 2025, Bernd Schubert wrote:

> On 10/15/25 11:50, Luis Henriques wrote:
>> On Mon, Oct 13 2025, Bernd Schubert wrote:
>> 
>>> Running background IO on a different core makes quite a difference.
>>>
>>> fio --directory=/tmp/dest --name=iops.\$jobnum --rw=randread \
>>> --bs=4k --size=1G --numjobs=1 --iodepth=4 --time_based\
>>> --runtime=30s --group_reporting --ioengine=io_uring\
>>>  --direct=1
>>>
>>> unpatched
>>>    READ: bw=272MiB/s (285MB/s), 272MiB/s-272MiB/s ...
>>> patched
>>>    READ: bw=760MiB/s (797MB/s), 760MiB/s-760MiB/s ...
>>>
>>> With --iodepth=8
>>>
>>> unpatched
>>>    READ: bw=466MiB/s (489MB/s), 466MiB/s-466MiB/s ...
>>> patched
>>>    READ: bw=966MiB/s (1013MB/s), 966MiB/s-966MiB/s ...
>>> 2nd run:
>>>    READ: bw=1014MiB/s (1064MB/s), 1014MiB/s-1014MiB/s ...
>>>
>>> Without io-uring (--iodepth=8)
>>>    READ: bw=729MiB/s (764MB/s), 729MiB/s-729MiB/s ...
>>>
>>> Without fuse (--iodepth=8)
>>>    READ: bw=2199MiB/s (2306MB/s), 2199MiB/s-2199MiB/s ...
>>>
>>> (Tests were done with
>>> <libfuse>/example/passthrough_hp -o allow_other --nopassthrough  \
>>> [-o io_uring] /tmp/source /tmp/dest
>>> )
>>>
>>> Additional notes:
>>>
>>> With FURING_NEXT_QUEUE_RETRIES=0 (--iodepth=8)
>>>    READ: bw=903MiB/s (946MB/s), 903MiB/s-903MiB/s ...
>>>
>>> With just a random qid (--iodepth=8)
>>>    READ: bw=429MiB/s (450MB/s), 429MiB/s-429MiB/s ...
>>>
>>> With --iodepth=1
>>> unpatched
>>>    READ: bw=195MiB/s (204MB/s), 195MiB/s-195MiB/s ...
>>> patched
>>>    READ: bw=232MiB/s (243MB/s), 232MiB/s-232MiB/s ...
>>>
>>> With --iodepth=1 --numjobs=2
>>> unpatched
>>>    READ: bw=966MiB/s (1013MB/s), 966MiB/s-966MiB/s ...
>>> patched
>>>    READ: bw=1821MiB/s (1909MB/s), 1821MiB/s-1821MiB/s ...
>>>
>>> With --iodepth=1 --numjobs=8
>>> unpatched
>>>    READ: bw=1138MiB/s (1193MB/s), 1138MiB/s-1138MiB/s ...
>>> patched
>>>    READ: bw=1650MiB/s (1730MB/s), 1650MiB/s-1650MiB/s ...
>>> fuse without io-uring
>>>    READ: bw=1314MiB/s (1378MB/s), 1314MiB/s-1314MiB/s ...
>>> no-fuse
>>>    READ: bw=2566MiB/s (2690MB/s), 2566MiB/s-2566MiB/s ...
>>>
>>> In summary, for async requests the core doing application IO is busy
>>> sending requests, so processing the IOs should be done on a different core.
>>> Spreading the load on random cores is also not desirable, as such a core
>>> might be frequency scaled down and/or in C1 sleep states. Not shown here,
>>> but differences are much smaller when the system uses the performance
>>> governor instead of schedutil (ubuntu default) - obviously at the cost of
>>> higher system power consumption, which is not desirable either.
>>>
>>> Results without io-uring (which uses fixed libfuse threads per queue)
>>> heavily depend on the current number of active threads. Libfuse uses a
>>> default of max 10 threads, but the actual max number of threads is a
>>> parameter. Also, the fuse-without-io-uring results heavily depend on
>>> whether another workload was already running before, as libfuse starts
>>> these threads dynamically - i.e. the more threads are active, the worse
>>> the performance.
>>>
>>> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
>>> ---
>>>  fs/fuse/dev_uring.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++-------
>>>  1 file changed, 46 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
>>> index aca71ce5632efd1d80e3ac0ad4e81ac1536dbc47..f35dd98abfe6407849fec55847c6b3d186383803 100644
>>> --- a/fs/fuse/dev_uring.c
>>> +++ b/fs/fuse/dev_uring.c
>>> @@ -23,6 +23,7 @@ MODULE_PARM_DESC(enable_uring,
>>>  #define FURING_Q_LOCAL_THRESHOLD 2
>>>  #define FURING_Q_NUMA_THRESHOLD (FURING_Q_LOCAL_THRESHOLD + 1)
>>>  #define FURING_Q_GLOBAL_THRESHOLD (FURING_Q_LOCAL_THRESHOLD * 2)
>>> +#define FURING_NEXT_QUEUE_RETRIES 2
>> 
>> Some bikeshedding:
>> 
>> Maybe that's just me, but I associate the names above (FURING_*) to 'fur'
>> -- the action of making 'fur'.  I'm not sure that verb even exists, but my
>> brain makes me dislike those names :-)
>> 
>> But I'm not a native speaker, and I don't have any other objection to
>> those names other than "I don't like fur!", so feel free to ignore me. :-)
>
> I had initially called it "FUSE_URING", but it seemed to be a bit long.
> I can change back to that or maybe you have a better short name in your
> mind (as usual the hardest part is to find good variable names)?

Yeah, the only alternative would be to either drop 'FUSE' or 'URING' from
the name: FUSE_Q_LOCAL_THRESHOLD or URING_Q_LOCAL_THRESHOLD.  But as I
said: this is plain bikeshedding.  If no one else complains about these
definitions, just leave them as-is.

>>>  
>>>  bool fuse_uring_enabled(void)
>>>  {
>>> @@ -1302,12 +1303,15 @@ static struct fuse_ring_queue *fuse_uring_best_queue(const struct cpumask *mask,
>>>  /*
>>>   * Get the best queue for the current CPU
>>>   */
>>> -static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
>>> +static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring,
>>> +						    bool background)
>>>  {
>>>  	unsigned int qid;
>>>  	struct fuse_ring_queue *local_queue, *best_numa, *best_global;
>>>  	int local_node;
>>>  	const struct cpumask *numa_mask, *global_mask;
>>> +	int retries = 0;
>>> +	int weight = -1;
>>>  
>>>  	qid = task_cpu(current);
>>>  	if (WARN_ONCE(qid >= ring->max_nr_queues,
>>> @@ -1315,16 +1319,50 @@ static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
>>>  		      ring->max_nr_queues))
>>>  		qid = 0;
>>>  
>>> -	local_queue = READ_ONCE(ring->queues[qid]);
>>>  	local_node = cpu_to_node(qid);
>>>  	if (WARN_ON_ONCE(local_node > ring->nr_numa_nodes))
>>>  		local_node = 0;
>>>  
>>> -	/* Fast path: if local queue exists and is not overloaded, use it */
>>> -	if (local_queue &&
>>> -	    READ_ONCE(local_queue->nr_reqs) <= FURING_Q_LOCAL_THRESHOLD)
>>> +	local_queue = READ_ONCE(ring->queues[qid]);
>>> +
>>> +retry:
>>> +	/*
>>> +	 * For background requests, try next CPU in same NUMA domain.
>>> +	 * I.e. cpu-0 creates async requests, cpu-1 io processes.
>>> +	 * Similar for foreground requests, when the local queue does not
>>> +	 * exist - still better to always wake the same cpu id.
>>> +	 */
>>> +	if (background || !local_queue) {
>>> +		numa_mask = ring->numa_registered_q_mask[local_node];
>>> +
>>> +		if (weight == -1)
>>> +			weight = cpumask_weight(numa_mask);
>>> +
>>> +		if (weight == 0)
>>> +			goto global;
>>> +
>>> +		if (weight > 1) {
>>> +			int idx = (qid + 1) % weight;
>>> +
>>> +			qid = cpumask_nth(idx, numa_mask);
>>> +		} else {
>>> +			qid = cpumask_first(numa_mask);
>>> +		}
>>> +
>>> +		local_queue = READ_ONCE(ring->queues[qid]);
>>> +		if (WARN_ON_ONCE(!local_queue))
>>> +			return NULL;
>>> +	}
>>> +
>>> +	if (READ_ONCE(local_queue->nr_reqs) <= FURING_Q_NUMA_THRESHOLD)
>>>  		return local_queue;
>>>  
>>> +	if (retries < FURING_NEXT_QUEUE_RETRIES && weight > retries + 1) {
>>> +		retries++;
>>> +		local_queue = NULL;
>>> +		goto retry;
>>> +	}
>>> +
>> 
>> I wonder if this retry loop is really useful.  If I understand this
>> correctly, we're doing a busy loop, hoping for a better queue to become
>> available.  But if the system is really busy doing IO this retry loop will
>> fail most of the time and will fall back to the next option -- only once
>> in a while we will get a better one.
>> 
>> Do you have evidence that this could be helpful?  Or am I misunderstanding
>> the purpose of this retry loop?
>
> Yeah, I got better results with the retry, because using random queues
> really doesn't work well - it wakes up cores behind random queues and
> doesn't accumulate requests on the same queue, so it doesn't make use of
> the ring advantage. Random selection should therefore be avoided, if
> possible. Also, the more queues the system has, the worse the results
> with the plain random fallback. I can provide some numbers later on today.

From my side, there's no need to provide numbers -- I'm happy with your
answer.  I just thought that looping could be more expensive than just
picking whatever queue was selected, because in the end that would likely
be the end result anyway.  But I haven't tested it and I understand that
this design can definitely have an impact on big systems.

Cheers,
-- 
Luís

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 3/6] fuse: {io-uring} Use bitmaps to track registered queues
  2025-10-13 17:09 ` [PATCH v3 3/6] fuse: {io-uring} Use bitmaps to track registered queues Bernd Schubert
@ 2025-10-15 23:49   ` Joanne Koong
  2025-10-16 11:33     ` Bernd Schubert
  0 siblings, 1 reply; 26+ messages in thread
From: Joanne Koong @ 2025-10-15 23:49 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: Miklos Szeredi, linux-fsdevel, Luis Henriques, Gang He

> @@ -983,6 +1031,10 @@ static void fuse_uring_do_register(struct fuse_ring_ent *ent,
>         struct fuse_ring *ring = queue->ring;
>         struct fuse_conn *fc = ring->fc;
>         struct fuse_iqueue *fiq = &fc->iq;
> +       int node = cpu_to_node(queue->qid);

Am I reading the correct version of the libfuse implementation in [1]?
As I understand it, the libfuse implementation sets queue->qid
sequentially no matter the user-specified cpu-set [2], but the qid is
used as the cpu id, and we get the numa node from that. My
understanding is that in practice, sequential CPU numbering often
follows NUMA topology (eg NUMA node0 cpus: 0-15, 32-47; NUMA node1
cpus: 16-31), so it seems like this has a high chance of being all on
the same numa node? Or am I missing something here?

Thanks,
Joanne

[1] https://github.com/bsbernd/libfuse/commit/afcc654c8760f0c95f9e042fc92f05b1b9c3df13
[2] https://github.com/bsbernd/libfuse/blob/afcc654c8760f0c95f9e042fc92f05b1b9c3df13/lib/fuse_uring.c#L830


> +
> +       if (WARN_ON_ONCE(node >= ring->nr_numa_nodes))
> +               node = 0;
>
>         fuse_uring_prepare_cancel(cmd, issue_flags, ent);
>
> @@ -991,6 +1043,9 @@ static void fuse_uring_do_register(struct fuse_ring_ent *ent,
>         fuse_uring_ent_avail(ent, queue);
>         spin_unlock(&queue->lock);
>
> +       cpumask_set_cpu(queue->qid, ring->registered_q_mask);
> +       cpumask_set_cpu(queue->qid, ring->numa_registered_q_mask[node]);
> +
>         if (!ring->ready) {
>                 bool ready = is_ring_ready(ring, queue->qid);
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 0/6] fuse: {io-uring} Allow to reduce the number of queues and request distribution
  2025-10-14  9:14   ` Bernd Schubert
@ 2025-10-16  6:15     ` Gang He
  0 siblings, 0 replies; 26+ messages in thread
From: Gang He @ 2025-10-16  6:15 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, Joanne Koong, linux-fsdevel@vger.kernel.org,
	Luis Henriques

Hi Bernd,

The test case is as below,
First, I created a null file system, e.g.
cd libfuse/build/example/
./null -o io_uring -o io_uring_q_depth=32 /mnt/singfile

Then, I ran fio commands with different iodepth values.
The fio commands are as below:
fio -direct=1 --filename=/mnt/singfile --rw=read  -iodepth=4
--ioengine=libaio --bs=4k --size=4G --runtime=60 --numjobs=1
-name=test_fuse2
(the patches can help this case)
fio -direct=1 --filename=/mnt/singfile --rw=read  -iodepth=1
--ioengine=libaio --bs=4k --size=4G --runtime=60 --numjobs=1
-name=test_fuse1
(but the patches will have a regression problem for this case,
maybe we cannot handle both cases with the same code)

Thanks
Gang

Bernd Schubert <bschubert@ddn.com> wrote on Tue, Oct 14, 2025 at 17:14:
>
> On 10/14/25 10:43, Gang He wrote:
> >
> > Hi Bernd,
> >
> > Thanks for your optimization patches.
> > I applied these patches; for asynchronous IO (iodepth > 1), it looks
> > like the patches can improve the performance as expected.
> > But for synchronous IO (iodepth = 1), it looks like there is a
> > regression here (performance drop).
> > Did you check the above regression issue?
>
> Hi Gang,
>
> please see patch 6/6, it has numbers with iodepth=1 - I don't see a
> regression. Also, I don't think '--ioengine=io_uring --iodepth=4'
> results in synchronous IO, but for synchronous IOs you need
> '--ioengine=psync'.
>
> For truly synchronous IOs you need
> https://lore.kernel.org/all/20251013-wake-same-cpu-v1-1-45d8059adde7@ddn.com/
>
>
> Could you please provide the exact fio command you are using, so that I
> can give it a quick test?
>
> Thanks,
> Bernd

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 3/6] fuse: {io-uring} Use bitmaps to track registered queues
  2025-10-15 23:49   ` Joanne Koong
@ 2025-10-16 11:33     ` Bernd Schubert
  0 siblings, 0 replies; 26+ messages in thread
From: Bernd Schubert @ 2025-10-16 11:33 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Miklos Szeredi, linux-fsdevel@vger.kernel.org, Luis Henriques,
	Gang He

On 10/16/25 01:49, Joanne Koong wrote:
>> @@ -983,6 +1031,10 @@ static void fuse_uring_do_register(struct fuse_ring_ent *ent,
>>         struct fuse_ring *ring = queue->ring;
>>         struct fuse_conn *fc = ring->fc;
>>         struct fuse_iqueue *fiq = &fc->iq;
>> +       int node = cpu_to_node(queue->qid);
> 
> Am I reading the correct version of the libfuse implementation in [1]?
> As I understand it, the libfuse implementation sets queue->qid
> sequentially no matter the user-specified cpu-set [2], but the qid is
> used as the cpu id, and we get the numa node from that. My
> understanding is that in practice, sequential CPU numbering often
> follows NUMA topology (eg NUMA node0 cpus: 0-15, 32-47; NUMA node1
> cpus: 16-31), so it seems like this has a high chance of being all on
> the same numa node? Or am I missing something here?

Yeah, sorry about this. The tests hadn't been busted, but I had just
forgotten to push the branch update I had locally. Well, actually there
was another issue in fuse_create_cpu_set(), i.e. when giving something
like '-o io_uring_nr_qs=8' it was not correctly distributing across numa
nodes, which is why I had to select cpus manually for some tests. That
2nd issue had held me off from publishing the changes. Should be all
fixed now and pushed.

https://github.com/bsbernd/libfuse/tree/uring-reduce-nr-queues
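
For illustration only - this is not the actual fuse_create_cpu_set() fix,
and the helper name is made up - distributing queue CPUs round-robin
across NUMA nodes with libnuma could look roughly like this:

/* Sketch only, not the libfuse fix: pick nr_queues CPUs by walking the
 * NUMA nodes round-robin, so the selected queue CPUs get spread across
 * nodes instead of filling up node 0 first.
 */
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>

static int pick_queue_cpus(cpu_set_t *out, int nr_queues)
{
	int nr_cpus = numa_num_configured_cpus();
	int nr_nodes = numa_num_configured_nodes();
	int picked = 0;

	CPU_ZERO(out);
	for (int round = 0; picked < nr_queues; round++) {
		int progress = 0;

		for (int node = 0; node < nr_nodes && picked < nr_queues; node++) {
			int seen = 0;

			for (int cpu = 0; cpu < nr_cpus; cpu++) {
				if (numa_node_of_cpu(cpu) != node)
					continue;
				if (seen++ < round)
					continue; /* taken in an earlier round */
				CPU_SET(cpu, out);
				picked++;
				progress = 1;
				break;
			}
		}
		if (!progress)
			break; /* fewer CPUs than requested queues */
	}
	return picked;
}

On a two-node box that would pick one CPU from node 0, then one from
node 1, and so on - which is what the '-o io_uring_nr_qs=8' case needs.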


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 4/6] fuse: {io-uring} Distribute load among queues
  2025-10-13 17:10 ` [PATCH v3 4/6] fuse: {io-uring} Distribute load among queues Bernd Schubert
@ 2025-10-18  0:12   ` Joanne Koong
  2025-10-20 19:00     ` Bernd Schubert
  0 siblings, 1 reply; 26+ messages in thread
From: Joanne Koong @ 2025-10-18  0:12 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: Miklos Szeredi, linux-fsdevel, Luis Henriques, Gang He

On Mon, Oct 13, 2025 at 10:10 AM Bernd Schubert <bschubert@ddn.com> wrote:
>
> So far queue selection was only for the queue corresponding
> to the current core.
> With bitmaps about registered queues and counting of queued
> requests per queue, distributing the load is possible now.
>
> This is on purpose lockless and approximate, under the assumption that a lock
> between queues might become the limiting factor. Approximate selection
> based on queue->nr_reqs should be good enough. If queues get slightly
> more requests than given by that counter it should not be too bad,
> as number of kernel/userspace transitions gets reduced with higher
> queue sizes.
>
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> ---
>  fs/fuse/dev_uring.c | 92 ++++++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 84 insertions(+), 8 deletions(-)
>
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index 02c4b40e739c7aa43dc1c581d4ff1f721617cc79..92401adecf813b1c4570d925718be772c8f02975 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -19,6 +19,10 @@ MODULE_PARM_DESC(enable_uring,
>
>  #define FUSE_URING_IOV_SEGS 2 /* header and payload */
>
> +/* Number of queued fuse requests until a queue is considered full */
> +#define FURING_Q_LOCAL_THRESHOLD 2
> +#define FURING_Q_NUMA_THRESHOLD (FURING_Q_LOCAL_THRESHOLD + 1)
> +#define FURING_Q_GLOBAL_THRESHOLD (FURING_Q_LOCAL_THRESHOLD * 2)
>
>  bool fuse_uring_enabled(void)
>  {
> @@ -1285,22 +1289,94 @@ static void fuse_uring_send_in_task(struct io_uring_cmd *cmd,
>         fuse_uring_send(ent, cmd, err, issue_flags);
>  }
>
> -static struct fuse_ring_queue *fuse_uring_task_to_queue(struct fuse_ring *ring)
> +/*
> + * Pick best queue from mask. Follows the algorithm described in
> + * "The Power of Two Choices in Randomized Load Balancing"
> + *  (Michael David Mitzenmacher, 1991)
> + */
> +static struct fuse_ring_queue *fuse_uring_best_queue(const struct cpumask *mask,
> +                                                    struct fuse_ring *ring)
> +{
> +       unsigned int qid1, qid2;
> +       struct fuse_ring_queue *queue1, *queue2;
> +       int weight = cpumask_weight(mask);
> +
> +       if (weight == 0)
> +               return NULL;
> +
> +       if (weight == 1) {
> +               qid1 = cpumask_first(mask);
> +               return READ_ONCE(ring->queues[qid1]);
> +       }
> +
> +       /* Get two different queues using optimized bounded random */
> +       qid1 = cpumask_nth(get_random_u32_below(weight), mask);
> +       queue1 = READ_ONCE(ring->queues[qid1]);
> +
> +       qid2 = cpumask_nth(get_random_u32_below(weight), mask);
> +
> +       /* Avoid retries and take this queue for code simplicity */
> +       if (qid1 == qid2)
> +               return queue1;
> +
> +       queue2 = READ_ONCE(ring->queues[qid2]);
> +
> +       if (WARN_ON_ONCE(!queue1 || !queue2))
> +               return NULL;
> +
> +       return (READ_ONCE(queue1->nr_reqs) < READ_ONCE(queue2->nr_reqs)) ?
> +               queue1 : queue2;
> +}
> +
> +/*
> + * Get the best queue for the current CPU
> + */
> +static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
>  {
>         unsigned int qid;
> -       struct fuse_ring_queue *queue;
> +       struct fuse_ring_queue *local_queue, *best_numa, *best_global;
> +       int local_node;
> +       const struct cpumask *numa_mask, *global_mask;
>
>         qid = task_cpu(current);
> -
>         if (WARN_ONCE(qid >= ring->max_nr_queues,
>                       "Core number (%u) exceeds nr queues (%zu)\n", qid,
>                       ring->max_nr_queues))
>                 qid = 0;
>
> -       queue = ring->queues[qid];
> -       WARN_ONCE(!queue, "Missing queue for qid %d\n", qid);
> +       local_queue = READ_ONCE(ring->queues[qid]);
> +       local_node = cpu_to_node(qid);
> +       if (WARN_ON_ONCE(local_node > ring->nr_numa_nodes))
> +               local_node = 0;
>
> -       return queue;
> +       /* Fast path: if local queue exists and is not overloaded, use it */
> +       if (local_queue &&
> +           READ_ONCE(local_queue->nr_reqs) <= FURING_Q_LOCAL_THRESHOLD)
> +               return local_queue;
> +
> +       /* Find best NUMA-local queue */
> +       numa_mask = ring->numa_registered_q_mask[local_node];
> +       best_numa = fuse_uring_best_queue(numa_mask, ring);
> +
> +       /* If NUMA queue is under threshold, use it */
> +       if (best_numa &&
> +           READ_ONCE(best_numa->nr_reqs) <= FURING_Q_NUMA_THRESHOLD)
> +               return best_numa;
> +
> +       /* NUMA queues above threshold, try global queues */
> +       global_mask = ring->registered_q_mask;
> +       best_global = fuse_uring_best_queue(global_mask, ring);
> +
> +       /* Might happen during tear down */
> +       if (!best_global)
> +               return NULL;
> +
> +       /* If global queue is under double threshold, use it */
> +       if (READ_ONCE(best_global->nr_reqs) <= FURING_Q_GLOBAL_THRESHOLD)
> +               return best_global;
> +
> +       /* There is no ideal queue, stay numa_local if possible */
> +       return best_numa ? best_numa : best_global;
>  }

Hi Bernd,

I started looking a bit at the block layer blk-mq.c code because, as I
understand it, they have to address this same problem of allocating
requests to queues while taking into account NUMA locality.

I haven't looked at the code deeply yet but I think what it does is
maintain a static mapping (that considers numa topology) of cpus to
queues which then makes queue selection very simple with minimal
overhead. For distributing load, I think it relies on the CPU
scheduler to distribute application tasks fairly across CPUs rather
than doing load balancing itself (which would also then have to break
numa locality if the request gets moved to a different queue).
Regarding load balancing, my read of this patch is that it uses the
number of current requests on queues as the metric of load but I'm not
sure that's accurate - for example, some requests may be more
intensive (eg fetching a read over a network) where even if there's
only a few requests on that queue, that queue could still be more
loaded with higher latency than other queues.

I'm curious to hear your thoughts on whether you think a simple
mapping solution like what the block layer does would suffice or not
for fuse uring queue selection.
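
Just to make the comparison concrete - this is only a sketch, nothing
like a cpu_to_qid table exists in the series - a blk-mq-style static
mapping could be built once at registration time and then indexed
directly at submission:

/* Sketch, not from the series: static cpu -> qid table in the spirit of
 * blk-mq's mq_map.  'cpu_to_qid' is a hypothetical array in struct
 * fuse_ring; a real version would also handle a node without any
 * registered queue.
 */
static void fuse_uring_build_cpu_map(struct fuse_ring *ring)
{
	unsigned int cpu;

	for_each_possible_cpu(cpu) {
		unsigned int qid = cpu;

		/* No queue registered for this cpu: fall back to any
		 * registered queue on the same NUMA node.
		 */
		if (!cpumask_test_cpu(cpu, ring->registered_q_mask))
			qid = cpumask_any_and(ring->registered_q_mask,
					      cpumask_of_node(cpu_to_node(cpu)));

		ring->cpu_to_qid[cpu] = qid;
	}
}

/* Submission then reduces to a single table lookup: */
static struct fuse_ring_queue *fuse_uring_map_queue(struct fuse_ring *ring)
{
	return READ_ONCE(ring->queues[ring->cpu_to_qid[task_cpu(current)]]);
}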

Thanks,
Joanne

>
>  static void fuse_uring_dispatch_ent(struct fuse_ring_ent *ent)
> @@ -1321,7 +1397,7 @@ void fuse_uring_queue_fuse_req(struct fuse_iqueue *fiq, struct fuse_req *req)
>         int err;
>
>         err = -EINVAL;
> -       queue = fuse_uring_task_to_queue(ring);
> +       queue = fuse_uring_get_queue(ring);
>         if (!queue)
>                 goto err;
>
> @@ -1365,7 +1441,7 @@ bool fuse_uring_queue_bq_req(struct fuse_req *req)
>         struct fuse_ring_queue *queue;
>         struct fuse_ring_ent *ent = NULL;
>
> -       queue = fuse_uring_task_to_queue(ring);
> +       queue = fuse_uring_get_queue(ring);
>         if (!queue)
>                 return false;
>
>
> --
> 2.43.0
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 6/6] fuse: {io-uring} Queue background requests on a different core
  2025-10-13 17:10 ` [PATCH v3 6/6] fuse: {io-uring} Queue background requests on a different core Bernd Schubert
  2025-10-15  9:50   ` Luis Henriques
@ 2025-10-20  7:15   ` Dan Carpenter
  1 sibling, 0 replies; 26+ messages in thread
From: Dan Carpenter @ 2025-10-20  7:15 UTC (permalink / raw)
  To: oe-kbuild, Bernd Schubert, Miklos Szeredi
  Cc: lkp, oe-kbuild-all, Joanne Koong, linux-fsdevel, Luis Henriques,
	Gang He, Bernd Schubert

Hi Bernd,

kernel test robot noticed the following build warnings:

url:    https://github.com/intel-lab-lkp/linux/commits/Bernd-Schubert/fuse-io-uring-Add-queue-length-counters/20251014-024703
base:   ec714e371f22f716a04e6ecb2a24988c92b26911
patch link:    https://lore.kernel.org/r/20251013-reduced-nr-ring-queues_3-v3-6-6d87c8aa31ae%40ddn.com
patch subject: [PATCH v3 6/6] fuse: {io-uring} Queue background requests on a different core
config: loongarch-randconfig-r072-20251019 (https://download.01.org/0day-ci/archive/20251020/202510201259.MevZAfl5-lkp@intel.com/config)
compiler: loongarch64-linux-gcc (GCC) 15.1.0

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
| Closes: https://lore.kernel.org/r/202510201259.MevZAfl5-lkp@intel.com/

smatch warnings:
fs/fuse/dev_uring.c:1389 fuse_uring_get_queue() error: uninitialized symbol 'best_numa'.

vim +/best_numa +1389 fs/fuse/dev_uring.c

2482ae85881b957 Bernd Schubert 2025-10-13  1306  static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring,
2482ae85881b957 Bernd Schubert 2025-10-13  1307  						    bool background)
c2c9af9a0b13261 Bernd Schubert 2025-01-20  1308  {
c2c9af9a0b13261 Bernd Schubert 2025-01-20  1309  	unsigned int qid;
aca09b212467554 Bernd Schubert 2025-10-13  1310  	struct fuse_ring_queue *local_queue, *best_numa, *best_global;
aca09b212467554 Bernd Schubert 2025-10-13  1311  	int local_node;
aca09b212467554 Bernd Schubert 2025-10-13  1312  	const struct cpumask *numa_mask, *global_mask;
2482ae85881b957 Bernd Schubert 2025-10-13  1313  	int retries = 0;
2482ae85881b957 Bernd Schubert 2025-10-13  1314  	int weight = -1;
c2c9af9a0b13261 Bernd Schubert 2025-01-20  1315  
c2c9af9a0b13261 Bernd Schubert 2025-01-20  1316  	qid = task_cpu(current);
868e7728394dbc8 Bernd Schubert 2025-10-13  1317  	if (WARN_ONCE(qid >= ring->max_nr_queues,
c2c9af9a0b13261 Bernd Schubert 2025-01-20  1318  		      "Core number (%u) exceeds nr queues (%zu)\n", qid,
868e7728394dbc8 Bernd Schubert 2025-10-13  1319  		      ring->max_nr_queues))
c2c9af9a0b13261 Bernd Schubert 2025-01-20  1320  		qid = 0;
c2c9af9a0b13261 Bernd Schubert 2025-01-20  1321  
aca09b212467554 Bernd Schubert 2025-10-13  1322  	local_node = cpu_to_node(qid);
aca09b212467554 Bernd Schubert 2025-10-13  1323  	if (WARN_ON_ONCE(local_node > ring->nr_numa_nodes))
aca09b212467554 Bernd Schubert 2025-10-13  1324  		local_node = 0;
c2c9af9a0b13261 Bernd Schubert 2025-01-20  1325  
2482ae85881b957 Bernd Schubert 2025-10-13  1326  	local_queue = READ_ONCE(ring->queues[qid]);
2482ae85881b957 Bernd Schubert 2025-10-13  1327  
2482ae85881b957 Bernd Schubert 2025-10-13  1328  retry:
2482ae85881b957 Bernd Schubert 2025-10-13  1329  	/*
2482ae85881b957 Bernd Schubert 2025-10-13  1330  	 * For background requests, try next CPU in same NUMA domain.
2482ae85881b957 Bernd Schubert 2025-10-13  1331  	 * I.e. cpu-0 creates async requests, cpu-1 io processes.
2482ae85881b957 Bernd Schubert 2025-10-13  1332  	 * Similar for foreground requests, when the local queue does not
2482ae85881b957 Bernd Schubert 2025-10-13  1333  	 * exist - still better to always wake the same cpu id.
2482ae85881b957 Bernd Schubert 2025-10-13  1334  	 */
2482ae85881b957 Bernd Schubert 2025-10-13  1335  	if (background || !local_queue) {
2482ae85881b957 Bernd Schubert 2025-10-13  1336  		numa_mask = ring->numa_registered_q_mask[local_node];
2482ae85881b957 Bernd Schubert 2025-10-13  1337  
2482ae85881b957 Bernd Schubert 2025-10-13  1338  		if (weight == -1)
2482ae85881b957 Bernd Schubert 2025-10-13  1339  			weight = cpumask_weight(numa_mask);
2482ae85881b957 Bernd Schubert 2025-10-13  1340  
2482ae85881b957 Bernd Schubert 2025-10-13  1341  		if (weight == 0)
2482ae85881b957 Bernd Schubert 2025-10-13  1342  			goto global;

best_numa not set on this path.

2482ae85881b957 Bernd Schubert 2025-10-13  1343  
2482ae85881b957 Bernd Schubert 2025-10-13  1344  		if (weight > 1) {
2482ae85881b957 Bernd Schubert 2025-10-13  1345  			int idx = (qid + 1) % weight;
2482ae85881b957 Bernd Schubert 2025-10-13  1346  
2482ae85881b957 Bernd Schubert 2025-10-13  1347  			qid = cpumask_nth(idx, numa_mask);
2482ae85881b957 Bernd Schubert 2025-10-13  1348  		} else {
2482ae85881b957 Bernd Schubert 2025-10-13  1349  			qid = cpumask_first(numa_mask);
2482ae85881b957 Bernd Schubert 2025-10-13  1350  		}
2482ae85881b957 Bernd Schubert 2025-10-13  1351  
2482ae85881b957 Bernd Schubert 2025-10-13  1352  		local_queue = READ_ONCE(ring->queues[qid]);
2482ae85881b957 Bernd Schubert 2025-10-13  1353  		if (WARN_ON_ONCE(!local_queue))
2482ae85881b957 Bernd Schubert 2025-10-13  1354  			return NULL;
2482ae85881b957 Bernd Schubert 2025-10-13  1355  	}
2482ae85881b957 Bernd Schubert 2025-10-13  1356  
2482ae85881b957 Bernd Schubert 2025-10-13  1357  	if (READ_ONCE(local_queue->nr_reqs) <= FURING_Q_NUMA_THRESHOLD)
aca09b212467554 Bernd Schubert 2025-10-13  1358  		return local_queue;
aca09b212467554 Bernd Schubert 2025-10-13  1359  
2482ae85881b957 Bernd Schubert 2025-10-13  1360  	if (retries < FURING_NEXT_QUEUE_RETRIES && weight > retries + 1) {
2482ae85881b957 Bernd Schubert 2025-10-13  1361  		retries++;
2482ae85881b957 Bernd Schubert 2025-10-13  1362  		local_queue = NULL;
2482ae85881b957 Bernd Schubert 2025-10-13  1363  		goto retry;
2482ae85881b957 Bernd Schubert 2025-10-13  1364  	}
2482ae85881b957 Bernd Schubert 2025-10-13  1365  
aca09b212467554 Bernd Schubert 2025-10-13  1366  	/* Find best NUMA-local queue */
aca09b212467554 Bernd Schubert 2025-10-13  1367  	numa_mask = ring->numa_registered_q_mask[local_node];
aca09b212467554 Bernd Schubert 2025-10-13  1368  	best_numa = fuse_uring_best_queue(numa_mask, ring);
aca09b212467554 Bernd Schubert 2025-10-13  1369  
aca09b212467554 Bernd Schubert 2025-10-13  1370  	/* If NUMA queue is under threshold, use it */
aca09b212467554 Bernd Schubert 2025-10-13  1371  	if (best_numa &&
aca09b212467554 Bernd Schubert 2025-10-13  1372  	    READ_ONCE(best_numa->nr_reqs) <= FURING_Q_NUMA_THRESHOLD)
aca09b212467554 Bernd Schubert 2025-10-13  1373  		return best_numa;
aca09b212467554 Bernd Schubert 2025-10-13  1374  
2482ae85881b957 Bernd Schubert 2025-10-13  1375  global:
aca09b212467554 Bernd Schubert 2025-10-13  1376  	/* NUMA queues above threshold, try global queues */
aca09b212467554 Bernd Schubert 2025-10-13  1377  	global_mask = ring->registered_q_mask;
aca09b212467554 Bernd Schubert 2025-10-13  1378  	best_global = fuse_uring_best_queue(global_mask, ring);
aca09b212467554 Bernd Schubert 2025-10-13  1379  
aca09b212467554 Bernd Schubert 2025-10-13  1380  	/* Might happen during tear down */
aca09b212467554 Bernd Schubert 2025-10-13  1381  	if (!best_global)
aca09b212467554 Bernd Schubert 2025-10-13  1382  		return NULL;
aca09b212467554 Bernd Schubert 2025-10-13  1383  
aca09b212467554 Bernd Schubert 2025-10-13  1384  	/* If global queue is under double threshold, use it */
aca09b212467554 Bernd Schubert 2025-10-13  1385  	if (READ_ONCE(best_global->nr_reqs) <= FURING_Q_GLOBAL_THRESHOLD)
aca09b212467554 Bernd Schubert 2025-10-13  1386  		return best_global;
aca09b212467554 Bernd Schubert 2025-10-13  1387  
aca09b212467554 Bernd Schubert 2025-10-13  1388  	/* There is no ideal queue, stay numa_local if possible */
aca09b212467554 Bernd Schubert 2025-10-13 @1389  	return best_numa ? best_numa : best_global;
                                                               ^^^^^^^^^
Uninitialized

c2c9af9a0b13261 Bernd Schubert 2025-01-20  1390  }
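
A minimal way to address this - just a sketch of the obvious fix, not
something posted in the thread - would be to NULL-initialize best_numa in
fuse_uring_get_queue(), so the weight == 0 path can safely jump to the
global label:

-	struct fuse_ring_queue *local_queue, *best_numa, *best_global;
+	struct fuse_ring_queue *local_queue, *best_global;
+	struct fuse_ring_queue *best_numa = NULL;

With that, the final "return best_numa ? best_numa : best_global;" stays
well-defined even when the background path takes the goto directly.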

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 4/6] fuse: {io-uring} Distribute load among queues
  2025-10-18  0:12   ` Joanne Koong
@ 2025-10-20 19:00     ` Bernd Schubert
  2025-10-20 22:59       ` Joanne Koong
  0 siblings, 1 reply; 26+ messages in thread
From: Bernd Schubert @ 2025-10-20 19:00 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Miklos Szeredi, linux-fsdevel@vger.kernel.org, Luis Henriques,
	Gang He

On 10/18/25 02:12, Joanne Koong wrote:
> On Mon, Oct 13, 2025 at 10:10 AM Bernd Schubert <bschubert@ddn.com> wrote:
>>
>> So far queue selection was only for the queue corresponding
>> to the current core.
>> With bitmaps about registered queues and counting of queued
>> requests per queue, distributing the load is possible now.
>>
>> This is on purpose lockless and approximate, under the assumption that a lock
>> between queues might become the limiting factor. Approximate selection
>> based on queue->nr_reqs should be good enough. If queues get slightly
>> more requests than given by that counter it should not be too bad,
>> as number of kernel/userspace transitions gets reduced with higher
>> queue sizes.
>>
>> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
>> ---
>>  fs/fuse/dev_uring.c | 92 ++++++++++++++++++++++++++++++++++++++++++++++++-----
>>  1 file changed, 84 insertions(+), 8 deletions(-)
>>
>> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
>> index 02c4b40e739c7aa43dc1c581d4ff1f721617cc79..92401adecf813b1c4570d925718be772c8f02975 100644
>> --- a/fs/fuse/dev_uring.c
>> +++ b/fs/fuse/dev_uring.c
>> @@ -19,6 +19,10 @@ MODULE_PARM_DESC(enable_uring,
>>
>>  #define FUSE_URING_IOV_SEGS 2 /* header and payload */
>>
>> +/* Number of queued fuse requests until a queue is considered full */
>> +#define FURING_Q_LOCAL_THRESHOLD 2
>> +#define FURING_Q_NUMA_THRESHOLD (FURING_Q_LOCAL_THRESHOLD + 1)
>> +#define FURING_Q_GLOBAL_THRESHOLD (FURING_Q_LOCAL_THRESHOLD * 2)
>>
>>  bool fuse_uring_enabled(void)
>>  {
>> @@ -1285,22 +1289,94 @@ static void fuse_uring_send_in_task(struct io_uring_cmd *cmd,
>>         fuse_uring_send(ent, cmd, err, issue_flags);
>>  }
>>
>> -static struct fuse_ring_queue *fuse_uring_task_to_queue(struct fuse_ring *ring)
>> +/*
>> + * Pick best queue from mask. Follows the algorithm described in
>> + * "The Power of Two Choices in Randomized Load Balancing"
>> + *  (Michael David Mitzenmacher, 1991)
>> + */
>> +static struct fuse_ring_queue *fuse_uring_best_queue(const struct cpumask *mask,
>> +                                                    struct fuse_ring *ring)
>> +{
>> +       unsigned int qid1, qid2;
>> +       struct fuse_ring_queue *queue1, *queue2;
>> +       int weight = cpumask_weight(mask);
>> +
>> +       if (weight == 0)
>> +               return NULL;
>> +
>> +       if (weight == 1) {
>> +               qid1 = cpumask_first(mask);
>> +               return READ_ONCE(ring->queues[qid1]);
>> +       }
>> +
>> +       /* Get two different queues using optimized bounded random */
>> +       qid1 = cpumask_nth(get_random_u32_below(weight), mask);
>> +       queue1 = READ_ONCE(ring->queues[qid1]);
>> +
>> +       qid2 = cpumask_nth(get_random_u32_below(weight), mask);
>> +
>> +       /* Avoid retries and take this queue for code simplicity */
>> +       if (qid1 == qid2)
>> +               return queue1;
>> +
>> +       queue2 = READ_ONCE(ring->queues[qid2]);
>> +
>> +       if (WARN_ON_ONCE(!queue1 || !queue2))
>> +               return NULL;
>> +
>> +       return (READ_ONCE(queue1->nr_reqs) < READ_ONCE(queue2->nr_reqs)) ?
>> +               queue1 : queue2;
>> +}
>> +
>> +/*
>> + * Get the best queue for the current CPU
>> + */
>> +static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
>>  {
>>         unsigned int qid;
>> -       struct fuse_ring_queue *queue;
>> +       struct fuse_ring_queue *local_queue, *best_numa, *best_global;
>> +       int local_node;
>> +       const struct cpumask *numa_mask, *global_mask;
>>
>>         qid = task_cpu(current);
>> -
>>         if (WARN_ONCE(qid >= ring->max_nr_queues,
>>                       "Core number (%u) exceeds nr queues (%zu)\n", qid,
>>                       ring->max_nr_queues))
>>                 qid = 0;
>>
>> -       queue = ring->queues[qid];
>> -       WARN_ONCE(!queue, "Missing queue for qid %d\n", qid);
>> +       local_queue = READ_ONCE(ring->queues[qid]);
>> +       local_node = cpu_to_node(qid);
>> +       if (WARN_ON_ONCE(local_node > ring->nr_numa_nodes))
>> +               local_node = 0;
>>
>> -       return queue;
>> +       /* Fast path: if local queue exists and is not overloaded, use it */
>> +       if (local_queue &&
>> +           READ_ONCE(local_queue->nr_reqs) <= FURING_Q_LOCAL_THRESHOLD)
>> +               return local_queue;
>> +
>> +       /* Find best NUMA-local queue */
>> +       numa_mask = ring->numa_registered_q_mask[local_node];
>> +       best_numa = fuse_uring_best_queue(numa_mask, ring);
>> +
>> +       /* If NUMA queue is under threshold, use it */
>> +       if (best_numa &&
>> +           READ_ONCE(best_numa->nr_reqs) <= FURING_Q_NUMA_THRESHOLD)
>> +               return best_numa;
>> +
>> +       /* NUMA queues above threshold, try global queues */
>> +       global_mask = ring->registered_q_mask;
>> +       best_global = fuse_uring_best_queue(global_mask, ring);
>> +
>> +       /* Might happen during tear down */
>> +       if (!best_global)
>> +               return NULL;
>> +
>> +       /* If global queue is under double threshold, use it */
>> +       if (READ_ONCE(best_global->nr_reqs) <= FURING_Q_GLOBAL_THRESHOLD)
>> +               return best_global;
>> +
>> +       /* There is no ideal queue, stay numa_local if possible */
>> +       return best_numa ? best_numa : best_global;
>>  }
> 
> Hi Bernd,
> 
> I started looking a bit at the block layer blk-mq.c code because, as I
> understand it, they have to address this same problem of allocating
> requests to queues while taking into account NUMA locality.
> 
> I haven't looked at the code deeply yet but I think what it does is
> maintain a static mapping (that considers numa topology) of cpus to
> queues which then makes queue selection very simple with minimal
> overhead. For distributing load, I think it relies on the CPU
> scheduler to distribute application tasks fairly across CPUs rather
> than doing load balancing itself (which would also then have to break
> numa locality if the request gets moved to a different queue).
> Regarding load balancing, my read of this patch is that it uses the
> number of current requests on queues as the metric of load but I'm not
> sure that's accurate - for example, some requests may be more
> intensive (eg fetching a read over a network) where even if there's
> only a few requests on that queue, that queue could still be more
> loaded with higher latency than other queues.
> 
> I'm curious to hear your thoughts on whether you think a simple
> mapping solution like what the block layer does would suffice or not
> for fuse uring queue selection.


Hi Joanne,

thanks for looking at the patch. I think we have primarily a static
mapping? For completeness, please also look at patch 6/6, which
updates queue selection. Basically with patch 6/6 we have a static
mapping to the local queue, with neighbor queues as retries. I
had already answered Luis' question - I can show that retries
to the neighbor QIDs improve performance, at least for fio's
'--ioengine=io_uring --numjobs={1..8} --iodepth={8..128} --direct=1'.

So that leaves the fallback to random QIDs - I don't have a strong
opinion about that, but I don't think the CPU scheduler can handle it.
Let's say you are doing write-back to a single file and let's say
fuse is tuned to allow lots of dirty pages. How should the scheduler
be able to distribute a single-threaded dirty page flush? Especially
also see in patch 6/6 that we really want to have a different CPU
to handle async data - the cpu scheduler will not even try to move
the application or migration thread to a different cpu, because
there is no conflict. And for cpu cache, C-states and frequency,
we actually also want the scheduler to limit migration to an
absolute minimum.

Another choice instead of the random fallback would be to distribute
requests to neighbor queues within FURING_NEXT_QUEUE_RETRIES.
Maybe that would even give better performance, as random queues so
far didn't have a positive effect in my testing.

The kind of ideal queue selection for async requests seems to be
to fill a queue and then to move to the next qid within a numa
domain. I just haven't found a way to do that locklessly yet.
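
For what it's worth, a lockless approximation of that could be a per-node
cursor that only advances once the current queue is over the threshold -
purely a sketch, 'node_cursor' does not exist in the series:

/* Sketch, not part of the series: stick to one qid per NUMA node until
 * that queue exceeds the threshold, then advance to the next registered
 * qid of the node.  'node_cursor' would be a hypothetical atomic_t array
 * in struct fuse_ring.
 */
static struct fuse_ring_queue *fuse_uring_fill_then_advance(struct fuse_ring *ring,
							    int node)
{
	const struct cpumask *mask = ring->numa_registered_q_mask[node];
	unsigned int weight = cpumask_weight(mask);
	unsigned int idx, qid;
	struct fuse_ring_queue *queue;

	if (!weight)
		return NULL;

	idx = (unsigned int)atomic_read(&ring->node_cursor[node]) % weight;
	qid = cpumask_nth(idx, mask);
	queue = READ_ONCE(ring->queues[qid]);

	/* Queue full: move the cursor so the next request tries the
	 * following qid in the same NUMA domain.
	 */
	if (queue && READ_ONCE(queue->nr_reqs) > FURING_Q_NUMA_THRESHOLD)
		atomic_inc(&ring->node_cursor[node]);

	return queue;
}

It is still only approximate (two CPUs can read the same cursor value),
but it avoids both a lock and the random jumps.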

Regarding the usage of the number of requests - I guess there will
always be workloads where the algorithm isn't perfect - see the
scheduler wake discussion. Maybe we can find a way in the future
to map queued requests in the fuse daemon and then use that plus the
number of unhandled io-uring CQEs to know if a queue is busy. Any
chance we can do it step by step?

I don't have a big problem with removing
the random queue selection fallback, but it would also be good
to have Miklos' opinion - Miklos had some concerns in September
last year that queues/ring-threads might end up being unused,
although they could take requests...


Thanks,
Bernd




^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 4/6] fuse: {io-uring} Distribute load among queues
  2025-10-20 19:00     ` Bernd Schubert
@ 2025-10-20 22:59       ` Joanne Koong
  2025-10-20 23:28         ` Bernd Schubert
  2025-10-24 17:05         ` Joanne Koong
  0 siblings, 2 replies; 26+ messages in thread
From: Joanne Koong @ 2025-10-20 22:59 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, linux-fsdevel@vger.kernel.org, Luis Henriques,
	Gang He

On Mon, Oct 20, 2025 at 12:00 PM Bernd Schubert <bschubert@ddn.com> wrote:
>
> On 10/18/25 02:12, Joanne Koong wrote:
> > On Mon, Oct 13, 2025 at 10:10 AM Bernd Schubert <bschubert@ddn.com> wrote:
> >>
> >> So far queue selection was only for the queue corresponding
> >> to the current core.
> >> With bitmaps about registered queues and counting of queued
> >> requests per queue, distributing the load is possible now.
> >>
> >> This is on purpose lockless and approximate, under the assumption that a lock
> >> between queues might become the limiting factor. Approximate selection
> >> based on queue->nr_reqs should be good enough. If queues get slightly
> >> more requests than given by that counter it should not be too bad,
> >> as number of kernel/userspace transitions gets reduced with higher
> >> queue sizes.
> >>
> >> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> >> ---
> >>  fs/fuse/dev_uring.c | 92 ++++++++++++++++++++++++++++++++++++++++++++++++-----
> >>  1 file changed, 84 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> >> index 02c4b40e739c7aa43dc1c581d4ff1f721617cc79..92401adecf813b1c4570d925718be772c8f02975 100644
> >> --- a/fs/fuse/dev_uring.c
> >> +++ b/fs/fuse/dev_uring.c
> >> @@ -19,6 +19,10 @@ MODULE_PARM_DESC(enable_uring,
> >>
> >>  #define FUSE_URING_IOV_SEGS 2 /* header and payload */
> >>
> >> +/* Number of queued fuse requests until a queue is considered full */
> >> +#define FURING_Q_LOCAL_THRESHOLD 2
> >> +#define FURING_Q_NUMA_THRESHOLD (FURING_Q_LOCAL_THRESHOLD + 1)
> >> +#define FURING_Q_GLOBAL_THRESHOLD (FURING_Q_LOCAL_THRESHOLD * 2)
> >>
> >>  bool fuse_uring_enabled(void)
> >>  {
> >> @@ -1285,22 +1289,94 @@ static void fuse_uring_send_in_task(struct io_uring_cmd *cmd,
> >>         fuse_uring_send(ent, cmd, err, issue_flags);
> >>  }
> >>
> >> -static struct fuse_ring_queue *fuse_uring_task_to_queue(struct fuse_ring *ring)
> >> +/*
> >> + * Pick best queue from mask. Follows the algorithm described in
> >> + * "The Power of Two Choices in Randomized Load Balancing"
> >> + *  (Michael David Mitzenmacher, 1991)
> >> + */
> >> +static struct fuse_ring_queue *fuse_uring_best_queue(const struct cpumask *mask,
> >> +                                                    struct fuse_ring *ring)
> >> +{
> >> +       unsigned int qid1, qid2;
> >> +       struct fuse_ring_queue *queue1, *queue2;
> >> +       int weight = cpumask_weight(mask);
> >> +
> >> +       if (weight == 0)
> >> +               return NULL;
> >> +
> >> +       if (weight == 1) {
> >> +               qid1 = cpumask_first(mask);
> >> +               return READ_ONCE(ring->queues[qid1]);
> >> +       }
> >> +
> >> +       /* Get two different queues using optimized bounded random */
> >> +       qid1 = cpumask_nth(get_random_u32_below(weight), mask);
> >> +       queue1 = READ_ONCE(ring->queues[qid1]);
> >> +
> >> +       qid2 = cpumask_nth(get_random_u32_below(weight), mask);
> >> +
> >> +       /* Avoid retries and take this queue for code simplicity */
> >> +       if (qid1 == qid2)
> >> +               return queue1;
> >> +
> >> +       queue2 = READ_ONCE(ring->queues[qid2]);
> >> +
> >> +       if (WARN_ON_ONCE(!queue1 || !queue2))
> >> +               return NULL;
> >> +
> >> +       return (READ_ONCE(queue1->nr_reqs) < READ_ONCE(queue2->nr_reqs)) ?
> >> +               queue1 : queue2;
> >> +}
> >> +
> >> +/*
> >> + * Get the best queue for the current CPU
> >> + */
> >> +static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
> >>  {
> >>         unsigned int qid;
> >> -       struct fuse_ring_queue *queue;
> >> +       struct fuse_ring_queue *local_queue, *best_numa, *best_global;
> >> +       int local_node;
> >> +       const struct cpumask *numa_mask, *global_mask;
> >>
> >>         qid = task_cpu(current);
> >> -
> >>         if (WARN_ONCE(qid >= ring->max_nr_queues,
> >>                       "Core number (%u) exceeds nr queues (%zu)\n", qid,
> >>                       ring->max_nr_queues))
> >>                 qid = 0;
> >>
> >> -       queue = ring->queues[qid];
> >> -       WARN_ONCE(!queue, "Missing queue for qid %d\n", qid);
> >> +       local_queue = READ_ONCE(ring->queues[qid]);
> >> +       local_node = cpu_to_node(qid);
> >> +       if (WARN_ON_ONCE(local_node > ring->nr_numa_nodes))
> >> +               local_node = 0;
> >>
> >> -       return queue;
> >> +       /* Fast path: if local queue exists and is not overloaded, use it */
> >> +       if (local_queue &&
> >> +           READ_ONCE(local_queue->nr_reqs) <= FURING_Q_LOCAL_THRESHOLD)
> >> +               return local_queue;
> >> +
> >> +       /* Find best NUMA-local queue */
> >> +       numa_mask = ring->numa_registered_q_mask[local_node];
> >> +       best_numa = fuse_uring_best_queue(numa_mask, ring);
> >> +
> >> +       /* If NUMA queue is under threshold, use it */
> >> +       if (best_numa &&
> >> +           READ_ONCE(best_numa->nr_reqs) <= FURING_Q_NUMA_THRESHOLD)
> >> +               return best_numa;
> >> +
> >> +       /* NUMA queues above threshold, try global queues */
> >> +       global_mask = ring->registered_q_mask;
> >> +       best_global = fuse_uring_best_queue(global_mask, ring);
> >> +
> >> +       /* Might happen during tear down */
> >> +       if (!best_global)
> >> +               return NULL;
> >> +
> >> +       /* If global queue is under double threshold, use it */
> >> +       if (READ_ONCE(best_global->nr_reqs) <= FURING_Q_GLOBAL_THRESHOLD)
> >> +               return best_global;
> >> +
> >> +       /* There is no ideal queue, stay numa_local if possible */
> >> +       return best_numa ? best_numa : best_global;
> >>  }
> >
> > Hi Bernd,
> >
> > I started looking a bit at the block layer blk-mq.c code because, as I
> > understand it, they have to address this same problem of allocating
> > requests to queues while taking into account NUMA locality.
> >
> > I haven't looked at the code deeply yet but I think what it does is
> > maintain a static mapping (that considers numa topology) of cpus to
> > queues which then makes queue selection very simple with minimal
> > overhead. For distributing load, I think it relies on the CPU
> > scheduler to distribute application tasks fairly across CPUs rather
> > than doing load balancing itself (which would also then have to break
> > numa locality if the request gets moved to a different queue).
> > Regarding load balancing, my read of this patch is that it uses the
> > number of current requests on queues as the metric of load but I'm not
> > sure that's accurate - for example, some requests may be more
> > intensive (eg fetching a read over a network) where even if there's
> > only a few requests on that queue, that queue could still be more
> > loaded with higher latency than other queues.
> >
> > I'm curious to hear your thoughts on whether you think a simple
> > mapping solution like what the block layer does would suffice or not
> > for fuse uring queue selection.
>
Hi Bernd,

Thanks for your reply and for sharing your thoughts on this.

>
> Hi Joanne,
>
> thanks for looking at the patch. I think we have primarily a static
> mapping? For completeness, please also look at the patch 6/6, which
> updates queue selection. Basically with patch 6/6 we have static
> mapping to the local queue, with neighbor queues as retries. I
> had already answered Luis' question - I can show that retries
> to the neighbor QIDs improves performance, at least for fio's
> '--ioengine=io_uring --numjobs={1..8} --iodepth={8..128} --direct=1'.
>
> So that leaves the fallback to random QIDs - I don't have a strong
> opinion about that, but I don't think the CPU scheduler can handle it.
> Let's say you are doing write-back to a single file and let's say
> fuse is tuned to allow lots of dirty pages. How should the scheduler
> be able to distribute a single-threaded dirty page flush? Especially

For writeback, I believe the writeback workqueue is unbound (I'm
seeing bdi_wq allocated with WQ_UNBOUND in default_bdi_init()) to any
cpu. As I understand it, the worker thread can be migrated by the
scheduler which will distribute writing back dirty data across
multiple cpus as it sees fit.
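
For reference, the allocation I'm looking at is roughly this (paraphrased
from memory from mm/backing-dev.c, so the exact flag set may differ
slightly):

static int __init default_bdi_init(void)
{
        /* WQ_UNBOUND: the workers are not pinned to a CPU, so the
         * scheduler is free to place and migrate them */
        bdi_wq = alloc_workqueue("writeback", WQ_MEM_RECLAIM |
                                 WQ_UNBOUND | WQ_SYSFS, 0);
        if (!bdi_wq)
                return -ENOMEM;
        return 0;
}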

> also see in patch 6/6 that we really want to have a different CPU
> to handle async data - the cpu scheduler will not even try to move
> the application or migration thread to a different cpu, because
> there is no conflict. And for cpu cache, C-states and frequency,
> we actually also want the scheduler to limit migration to
> an absolute minimum.
>
> Another choice instead of random fallback would be to distribute
> requests to neighbor queues within FURING_NEXT_QUEUE_RETRIES.
> Maybe that would even give better performance, as random queues so
> far didn't have a positive effect in my testing.
>
> The kind of ideal queue selection for async requests seems to be
> to fill a queue and then to move to the next qid, within a numa
> domain. I just hadn't found a way to do that lockless yet.
>
> Regarding usage of number of requests - I guess there always
> will be workloads where the algorithm isn't perfect - see the
> scheduler wake discussion. Maybe we can find a way in the future
> to map queued requests in fuse-daemon and then use that + number
> of unhandled io-uring CQEs to know if a queue is busy. Any chance
> we can do it step by step?
>
> I don't have a big problem to remove
> the random queue selection fallback, but also would be good
> to have Miklos' opinion - Miklos had some concerns in September

I agree, it'd be good to have Miklos's opinion on this and go with that.

My opinion looking at this is that the fuse uring problem of
distributing requests to queues is very similar to what the block
layer has to do with assigning bio submissions to hardware queues. The
block layer's solution to me seems more elegantly simple and flows
organically with the cpu scheduler's internal load balancing. I think
we should try to keep things as simple as possible, as I don't see how
the optimizations with the custom load balancing we're doing here can
be accurate enough to warrant the extra complexity and overhead.
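
To make that concrete, the kind of thing I have in mind is a one-time,
topology-aware map plus a trivial lookup. This is only a sketch;
cpu_to_qid and nr_registered are made-up names, not fields from this
series:

/*
 * Sketch: build a static cpu -> qid map once at ring setup, in the
 * spirit of blk-mq's queue mapping. Selection is then a plain lookup
 * with no load tracking at all.
 */
static void fuse_uring_build_cpu_map(struct fuse_ring *ring)
{
        unsigned int cpu, qid = 0;

        for_each_possible_cpu(cpu) {
                /* round-robin over the registered queues; a NUMA-aware
                 * variant would first spread within cpu_to_node(cpu) */
                ring->cpu_to_qid[cpu] = qid;
                qid = (qid + 1) % ring->nr_registered;
        }
}

static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
{
        unsigned int qid = ring->cpu_to_qid[task_cpu(current)];

        return READ_ONCE(ring->queues[qid]);
}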

But I defer to whatever approach you and Miklos think is best and
would rather go with.

Thanks,
Joanne

> last year that queues/ring-threads might end up being unused,
> although they could take requests...
>
>
> Thanks,
> Bernd
>
>
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 4/6] fuse: {io-uring} Distribute load among queues
  2025-10-20 22:59       ` Joanne Koong
@ 2025-10-20 23:28         ` Bernd Schubert
  2025-10-24 17:05         ` Joanne Koong
  1 sibling, 0 replies; 26+ messages in thread
From: Bernd Schubert @ 2025-10-20 23:28 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Miklos Szeredi, linux-fsdevel@vger.kernel.org, Luis Henriques,
	Gang He

On 10/21/25 00:59, Joanne Koong wrote:
> On Mon, Oct 20, 2025 at 12:00 PM Bernd Schubert <bschubert@ddn.com> wrote:
>>
>> On 10/18/25 02:12, Joanne Koong wrote:
>>> On Mon, Oct 13, 2025 at 10:10 AM Bernd Schubert <bschubert@ddn.com> wrote:
>>>>
>>>> So far queue selection was only for the queue corresponding
>>>> to the current core.
>>>> With bitmaps about registered queues and counting of queued
>>>> requests per queue, distributing the load is possible now.
>>>>
>>>> This is on purpose lockless and accurate, under the assumption that a lock
>>>> between queues might become the limiting factor. Approximate selection
>>>> based on queue->nr_reqs should be good enough. If queues get slightly
>>>> more requests than given by that counter it should not be too bad,
>>>> as number of kernel/userspace transitions gets reduced with higher
>>>> queue sizes.
>>>>
>>>> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
>>>> ---
>>>>  fs/fuse/dev_uring.c | 92 ++++++++++++++++++++++++++++++++++++++++++++++++-----
>>>>  1 file changed, 84 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
>>>> index 02c4b40e739c7aa43dc1c581d4ff1f721617cc79..92401adecf813b1c4570d925718be772c8f02975 100644
>>>> --- a/fs/fuse/dev_uring.c
>>>> +++ b/fs/fuse/dev_uring.c
>>>> @@ -19,6 +19,10 @@ MODULE_PARM_DESC(enable_uring,
>>>>
>>>>  #define FUSE_URING_IOV_SEGS 2 /* header and payload */
>>>>
>>>> +/* Number of queued fuse requests until a queue is considered full */
>>>> +#define FURING_Q_LOCAL_THRESHOLD 2
>>>> +#define FURING_Q_NUMA_THRESHOLD (FURING_Q_LOCAL_THRESHOLD + 1)
>>>> +#define FURING_Q_GLOBAL_THRESHOLD (FURING_Q_LOCAL_THRESHOLD * 2)
>>>>
>>>>  bool fuse_uring_enabled(void)
>>>>  {
>>>> @@ -1285,22 +1289,94 @@ static void fuse_uring_send_in_task(struct io_uring_cmd *cmd,
>>>>         fuse_uring_send(ent, cmd, err, issue_flags);
>>>>  }
>>>>
>>>> -static struct fuse_ring_queue *fuse_uring_task_to_queue(struct fuse_ring *ring)
>>>> +/*
>>>> + * Pick best queue from mask. Follows the algorithm described in
>>>> + * "The Power of Two Choices in Randomized Load Balancing"
>>>> + *  (Michael David Mitzenmacher, 1991)
>>>> + */
>>>> +static struct fuse_ring_queue *fuse_uring_best_queue(const struct cpumask *mask,
>>>> +                                                    struct fuse_ring *ring)
>>>> +{
>>>> +       unsigned int qid1, qid2;
>>>> +       struct fuse_ring_queue *queue1, *queue2;
>>>> +       int weight = cpumask_weight(mask);
>>>> +
>>>> +       if (weight == 0)
>>>> +               return NULL;
>>>> +
>>>> +       if (weight == 1) {
>>>> +               qid1 = cpumask_first(mask);
>>>> +               return READ_ONCE(ring->queues[qid1]);
>>>> +       }
>>>> +
>>>> +       /* Get two different queues using optimized bounded random */
>>>> +       qid1 = cpumask_nth(get_random_u32_below(weight), mask);
>>>> +       queue1 = READ_ONCE(ring->queues[qid1]);
>>>> +
>>>> +       qid2 = cpumask_nth(get_random_u32_below(weight), mask);
>>>> +
>>>> +       /* Avoid retries and take this queue for code simplicity */
>>>> +       if (qid1 == qid2)
>>>> +               return queue1;
>>>> +
>>>> +       queue2 = READ_ONCE(ring->queues[qid2]);
>>>> +
>>>> +       if (WARN_ON_ONCE(!queue1 || !queue2))
>>>> +               return NULL;
>>>> +
>>>> +       return (READ_ONCE(queue1->nr_reqs) < READ_ONCE(queue2->nr_reqs)) ?
>>>> +               queue1 : queue2;
>>>> +}
>>>> +
>>>> +/*
>>>> + * Get the best queue for the current CPU
>>>> + */
>>>> +static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
>>>>  {
>>>>         unsigned int qid;
>>>> -       struct fuse_ring_queue *queue;
>>>> +       struct fuse_ring_queue *local_queue, *best_numa, *best_global;
>>>> +       int local_node;
>>>> +       const struct cpumask *numa_mask, *global_mask;
>>>>
>>>>         qid = task_cpu(current);
>>>> -
>>>>         if (WARN_ONCE(qid >= ring->max_nr_queues,
>>>>                       "Core number (%u) exceeds nr queues (%zu)\n", qid,
>>>>                       ring->max_nr_queues))
>>>>                 qid = 0;
>>>>
>>>> -       queue = ring->queues[qid];
>>>> -       WARN_ONCE(!queue, "Missing queue for qid %d\n", qid);
>>>> +       local_queue = READ_ONCE(ring->queues[qid]);
>>>> +       local_node = cpu_to_node(qid);
>>>> +       if (WARN_ON_ONCE(local_node > ring->nr_numa_nodes))
>>>> +               local_node = 0;
>>>>
>>>> -       return queue;
>>>> +       /* Fast path: if local queue exists and is not overloaded, use it */
>>>> +       if (local_queue &&
>>>> +           READ_ONCE(local_queue->nr_reqs) <= FURING_Q_LOCAL_THRESHOLD)
>>>> +               return local_queue;
>>>> +
>>>> +       /* Find best NUMA-local queue */
>>>> +       numa_mask = ring->numa_registered_q_mask[local_node];
>>>> +       best_numa = fuse_uring_best_queue(numa_mask, ring);
>>>> +
>>>> +       /* If NUMA queue is under threshold, use it */
>>>> +       if (best_numa &&
>>>> +           READ_ONCE(best_numa->nr_reqs) <= FURING_Q_NUMA_THRESHOLD)
>>>> +               return best_numa;
>>>> +
>>>> +       /* NUMA queues above threshold, try global queues */
>>>> +       global_mask = ring->registered_q_mask;
>>>> +       best_global = fuse_uring_best_queue(global_mask, ring);
>>>> +
>>>> +       /* Might happen during tear down */
>>>> +       if (!best_global)
>>>> +               return NULL;
>>>> +
>>>> +       /* If global queue is under double threshold, use it */
>>>> +       if (READ_ONCE(best_global->nr_reqs) <= FURING_Q_GLOBAL_THRESHOLD)
>>>> +               return best_global;
>>>> +
>>>> +       /* There is no ideal queue, stay numa_local if possible */
>>>> +       return best_numa ? best_numa : best_global;
>>>>  }
>>>
>>> Hi Bernd,
>>>
>>> I started looking a bit at the block layer blk-mq.c code because, as I
>>> understand it, they have to address this same problem of allocating
>>> requests to queues while taking into account NUMA locality.
>>>
>>> I haven't looked at the code deeply yet but I think what it does is
>>> maintain a static mapping (that considers numa topology) of cpus to
>>> queues which then makes queue selection very simple with minimal
>>> overhead. For distributing load, I think it relies on the CPU
>>> scheduler to distribute application tasks fairly across CPUs rather
>>> than doing load balancing itself (which would also then have to break
>>> numa locality if the request gets moved to a different queue).
>>> Regarding load balancing, my read of this patch is that it uses the
>>> number of current requests on queues as the metric of load but I'm not
>>> sure that's accurate - for example, some requests may be more
>>> intensive (eg fetching a read over a network) where even if there's
>>> only a few requests on that queue, that queue could still be more
>>> loaded with higher latency than other queues.
>>>
>>> I'm curious to hear your thoughts on whether you think a simple
>>> mapping solution like what the block layer does would suffice or not
>>> for fuse uring queue selection.
>>
> Hi Bernd,
> 
> Thanks for your reply and for sharing your thoughts on this.
> 
>>
>> Hi Joanne,
>>
>> thanks for looking at the patch. I think we have primarily a static
>> mapping? For completeness, please also look at the patch 6/6, which
>> updates queue selection. Basically with patch 6/6 we have static
>> mapping to the local queue, with neighbor queues as retries. I
>> had already answered Luis' question - I can show that retries
>> to the neighbor QIDs improves performance, at least for fio's
>> '--ioengine=io_uring --numjobs={1..8} --iodepth={8..128} --direct=1'.
>>
>> So that leaves the fallback to random QIDs - I don't have a strong
>> opinion about that, but I don't think the CPU scheduler can handle it.
>> Let's say you are doing write-back to a single file and let's say
>> fuse is tuned to allow lots of dirty pages. How should the scheduler
>> be able to distribute a single-threaded dirty page flush? Especially
> 
> For writeback, I believe the writeback workqueue is unbound (I'm
> seeing bdi_wq allocated with WQ_UNBOUND in default_bdi_init()) to any
> cpu. As I understand it, the worker thread can be migrated by the
> scheduler which will distribute writing back dirty data across
> multiple cpus as it sees fit.

Short reply before I go to bed: let's say write-back wants to write out
1MB, but max_write/max_pages is set to one page - with 4K pages that
would result in 256 requests. Now why would the scheduler migrate during
request creation? And why would we even want that?



Thanks,
Bernd

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 4/6] fuse: {io-uring} Distribute load among queues
  2025-10-20 22:59       ` Joanne Koong
  2025-10-20 23:28         ` Bernd Schubert
@ 2025-10-24 17:05         ` Joanne Koong
  2025-10-24 17:52           ` Bernd Schubert
  1 sibling, 1 reply; 26+ messages in thread
From: Joanne Koong @ 2025-10-24 17:05 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, linux-fsdevel@vger.kernel.org, Luis Henriques,
	Gang He

On Mon, Oct 20, 2025 at 3:59 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Mon, Oct 20, 2025 at 12:00 PM Bernd Schubert <bschubert@ddn.com> wrote:
> >
> > On 10/18/25 02:12, Joanne Koong wrote:
> > > On Mon, Oct 13, 2025 at 10:10 AM Bernd Schubert <bschubert@ddn.com> wrote:
> > >>
> > >> So far queue selection was only for the queue corresponding
> > >> to the current core.
> > >> With bitmaps about registered queues and counting of queued
> > >> requests per queue, distributing the load is possible now.
> > >>
> > >> This is on purpose lockless and accurate, under the assumption that a lock
> > >> between queues might become the limiting factor. Approximate selection
> > >> based on queue->nr_reqs should be good enough. If queues get slightly
> > >> more requests than given by that counter it should not be too bad,
> > >> as number of kernel/userspace transitions gets reduced with higher
> > >> queue sizes.
> > >>
> > >> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> > >> ---
> > >>  fs/fuse/dev_uring.c | 92 ++++++++++++++++++++++++++++++++++++++++++++++++-----
> > >>  1 file changed, 84 insertions(+), 8 deletions(-)
> > >>
> > >> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> > >> index 02c4b40e739c7aa43dc1c581d4ff1f721617cc79..92401adecf813b1c4570d925718be772c8f02975 100644
> > >> --- a/fs/fuse/dev_uring.c
> > >> +++ b/fs/fuse/dev_uring.c
> > >> @@ -19,6 +19,10 @@ MODULE_PARM_DESC(enable_uring,
> > >>
> > >>  #define FUSE_URING_IOV_SEGS 2 /* header and payload */
> > >>
> > >> +/* Number of queued fuse requests until a queue is considered full */
> > >> +#define FURING_Q_LOCAL_THRESHOLD 2
> > >> +#define FURING_Q_NUMA_THRESHOLD (FURING_Q_LOCAL_THRESHOLD + 1)
> > >> +#define FURING_Q_GLOBAL_THRESHOLD (FURING_Q_LOCAL_THRESHOLD * 2)
> > >>
> > >>  bool fuse_uring_enabled(void)
> > >>  {
> > >> @@ -1285,22 +1289,94 @@ static void fuse_uring_send_in_task(struct io_uring_cmd *cmd,
> > >>         fuse_uring_send(ent, cmd, err, issue_flags);
> > >>  }
> > >>
> > >> -static struct fuse_ring_queue *fuse_uring_task_to_queue(struct fuse_ring *ring)
> > >> +/*
> > >> + * Pick best queue from mask. Follows the algorithm described in
> > >> + * "The Power of Two Choices in Randomized Load Balancing"
> > >> + *  (Michael David Mitzenmacher, 1991)
> > >> + */
> > >> +static struct fuse_ring_queue *fuse_uring_best_queue(const struct cpumask *mask,
> > >> +                                                    struct fuse_ring *ring)
> > >> +{
> > >> +       unsigned int qid1, qid2;
> > >> +       struct fuse_ring_queue *queue1, *queue2;
> > >> +       int weight = cpumask_weight(mask);
> > >> +
> > >> +       if (weight == 0)
> > >> +               return NULL;
> > >> +
> > >> +       if (weight == 1) {
> > >> +               qid1 = cpumask_first(mask);
> > >> +               return READ_ONCE(ring->queues[qid1]);
> > >> +       }
> > >> +
> > >> +       /* Get two different queues using optimized bounded random */
> > >> +       qid1 = cpumask_nth(get_random_u32_below(weight), mask);
> > >> +       queue1 = READ_ONCE(ring->queues[qid1]);
> > >> +
> > >> +       qid2 = cpumask_nth(get_random_u32_below(weight), mask);
> > >> +
> > >> +       /* Avoid retries and take this queue for code simplicity */
> > >> +       if (qid1 == qid2)
> > >> +               return queue1;
> > >> +
> > >> +       queue2 = READ_ONCE(ring->queues[qid2]);
> > >> +
> > >> +       if (WARN_ON_ONCE(!queue1 || !queue2))
> > >> +               return NULL;
> > >> +
> > >> +       return (READ_ONCE(queue1->nr_reqs) < READ_ONCE(queue2->nr_reqs)) ?
> > >> +               queue1 : queue2;
> > >> +}
> > >> +
> > >> +/*
> > >> + * Get the best queue for the current CPU
> > >> + */
> > >> +static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
> > >>  {
> > >>         unsigned int qid;
> > >> -       struct fuse_ring_queue *queue;
> > >> +       struct fuse_ring_queue *local_queue, *best_numa, *best_global;
> > >> +       int local_node;
> > >> +       const struct cpumask *numa_mask, *global_mask;
> > >>
> > >>         qid = task_cpu(current);
> > >> -
> > >>         if (WARN_ONCE(qid >= ring->max_nr_queues,
> > >>                       "Core number (%u) exceeds nr queues (%zu)\n", qid,
> > >>                       ring->max_nr_queues))
> > >>                 qid = 0;
> > >>
> > >> -       queue = ring->queues[qid];
> > >> -       WARN_ONCE(!queue, "Missing queue for qid %d\n", qid);
> > >> +       local_queue = READ_ONCE(ring->queues[qid]);
> > >> +       local_node = cpu_to_node(qid);
> > >> +       if (WARN_ON_ONCE(local_node > ring->nr_numa_nodes))
> > >> +               local_node = 0;
> > >>
> > >> -       return queue;
> > >> +       /* Fast path: if local queue exists and is not overloaded, use it */
> > >> +       if (local_queue &&
> > >> +           READ_ONCE(local_queue->nr_reqs) <= FURING_Q_LOCAL_THRESHOLD)
> > >> +               return local_queue;
> > >> +
> > >> +       /* Find best NUMA-local queue */
> > >> +       numa_mask = ring->numa_registered_q_mask[local_node];
> > >> +       best_numa = fuse_uring_best_queue(numa_mask, ring);
> > >> +
> > >> +       /* If NUMA queue is under threshold, use it */
> > >> +       if (best_numa &&
> > >> +           READ_ONCE(best_numa->nr_reqs) <= FURING_Q_NUMA_THRESHOLD)
> > >> +               return best_numa;
> > >> +
> > >> +       /* NUMA queues above threshold, try global queues */
> > >> +       global_mask = ring->registered_q_mask;
> > >> +       best_global = fuse_uring_best_queue(global_mask, ring);
> > >> +
> > >> +       /* Might happen during tear down */
> > >> +       if (!best_global)
> > >> +               return NULL;
> > >> +
> > >> +       /* If global queue is under double threshold, use it */
> > >> +       if (READ_ONCE(best_global->nr_reqs) <= FURING_Q_GLOBAL_THRESHOLD)
> > >> +               return best_global;
> > >> +
> > >> +       /* There is no ideal queue, stay numa_local if possible */
> > >> +       return best_numa ? best_numa : best_global;
> > >>  }
> > >
> > > Hi Bernd,
> > >
> > > I started looking a bit at the block layer blk-mq.c code because, as I
> > > understand it, they have to address this same problem of allocating
> > > requests to queues while taking into account NUMA locality.
> > >
> > > I haven't looked at the code deeply yet but I think what it does is
> > > maintain a static mapping (that considers numa topology) of cpus to
> > > queues which then makes queue selection very simple with minimal
> > > overhead. For distributing load, I think it relies on the CPU
> > > scheduler to distribute application tasks fairly across CPUs rather
> > > than doing load balancing itself (which would also then have to break
> > > numa locality if the request gets moved to a different queue).
> > > Regarding load balancing, my read of this patch is that it uses the
> > > number of current requests on queues as the metric of load but I'm not
> > > sure that's accurate - for example, some requests may be more
> > > intensive (eg fetching a read over a network) where even if there's
> > > only a few requests on that queue, that queue could still be more
> > > loaded with higher latency than other queues.
> > >
> > > I'm curious to hear your thoughts on whether you think a simple
> > > mapping solution like what the block layer does would suffice or not
> > > for fuse uring queue selection.
> >
> Hi Bernd,
>
> Thanks for your reply and for sharing your thoughts on this.
>
> >
> > Hi Joanne,
> >
> > thanks for looking at the patch. I think we have primarily a static
> > mapping? For completeness, please also look at the patch 6/6, which
> > updates queue selection. Basically with patch 6/6 we have static
> > mapping to the local queue, with neighbor queues as retries. I
> > had already answered Luis' question - I can show that retries
> > to the neighbor QIDs improves performance, at least for fio's
> > '--ioengine=io_uring --numjobs={1..8} --iodepth={8..128} --direct=1'.
> >
> > So that leaves the fallback to random QIDs - I don't have a strong
> > opinion about that, but I don't think the CPU scheduler can handle it.
> > Let's say you are doing write-back to a single file and let's say
> > fuse is tuned to allow lots of dirty pages. How should the scheduler
> > be able to distribute a single-threaded dirty page flush? Especially
>
> For writeback, I believe the writeback workqueue is unbound (I'm
> seeing bdi_wq allocated with WQ_UNBOUND in default_bdi_init()) to any
> cpu. As I understand it, the worker thread can be migrated by the
> scheduler which will distribute writing back dirty data across
> multiple cpus as it sees fit.
>
> > also see in patch 6/6 that we really want to have a different CPU
> > to handle async data - the cpu scheduler will not even try to move
> > the application or migration thread to a different cpu, because
> > there is no conflict. And for cpu cache, C-states and frequency,
> > we actually also want the scheduler to limit migration to
> > an absolute minimum.
> >
> > Another choice instead of random fallback would be to distribute
> > requests to neighbor queues within FURING_NEXT_QUEUE_RETRIES.
> > Maybe that would even give better performance, as random queues so
> > far didn't have a positive effect in my testing.
> >
> > The kind of ideal queue selection for async requests seems to be
> > to fill a queue and then to move to the next qid, within a numa
> > domain. I just hadn't found a way to do that lockless yet.
> >
> > Regarding usage of number of requests - I guess there always
> > will be workloads where the algorithm isn't perfect - see the
> > scheduler wake discussion. Maybe we can find a way in the future
> > to map queued requests in fuse-daemon and then use that + number
> > of unhandled io-uring CQEs to know if a queue is busy. Any chance
> > we can do it step by step?
> >
> > I don't have a big problem to remove
> > the random queue selection fallback, but also would be good
> > to have Miklos' opinion - Miklos had some concerns in September
>
> I agree, it'd be good to have Miklos's opinion on this and go with that.
>
> My opinion looking at this is that the fuse uring problem of
> distributing requests to queues is very similar to what the block
> layer has to do with assigning bio submissions to hardware queues. The
> block layer's solution to me seems more elegantly simple and flows
> organically with the cpu scheduler's internal load balancing. I think
> we should try to keep things as simple as possible, as I don't see how
> the optimizations with the custom load balancing we're doing here can
> be accurate enough to warrant the extra complexity and overhead.
>
> But I defer to whatever approach you and Miklos think is best and
> would rather go with.

Miklos, could you share your thoughts on how we should approach this?

Thanks,
Joanne
>
> Thanks,
> Joanne
>
> > last year that queues/ring-threads might end up being unused,
> > although they could take requests...
> >
> >
> > Thanks,
> > Bernd
> >
> >
> >

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 4/6] fuse: {io-uring} Distribute load among queues
  2025-10-24 17:05         ` Joanne Koong
@ 2025-10-24 17:52           ` Bernd Schubert
  2025-10-24 17:58             ` Bernd Schubert
  0 siblings, 1 reply; 26+ messages in thread
From: Bernd Schubert @ 2025-10-24 17:52 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Miklos Szeredi, linux-fsdevel@vger.kernel.org, Luis Henriques,
	Gang He

On 10/24/25 19:05, Joanne Koong wrote:
> On Mon, Oct 20, 2025 at 3:59 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>>
>> On Mon, Oct 20, 2025 at 12:00 PM Bernd Schubert <bschubert@ddn.com> wrote:
>>>
>>> On 10/18/25 02:12, Joanne Koong wrote:
>>>> On Mon, Oct 13, 2025 at 10:10 AM Bernd Schubert <bschubert@ddn.com> wrote:
>>>>>
>>>>> So far queue selection was only for the queue corresponding
>>>>> to the current core.
>>>>> With bitmaps about registered queues and counting of queued
>>>>> requests per queue, distributing the load is possible now.
>>>>>
>>>>> This is on purpose lockless and accurate, under the assumption that a lock
>>>>> between queues might become the limiting factor. Approximate selection
>>>>> based on queue->nr_reqs should be good enough. If queues get slightly
>>>>> more requests than given by that counter it should not be too bad,
>>>>> as number of kernel/userspace transitions gets reduced with higher
>>>>> queue sizes.
>>>>>
>>>>> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
>>>>> ---
>>>>>  fs/fuse/dev_uring.c | 92 ++++++++++++++++++++++++++++++++++++++++++++++++-----
>>>>>  1 file changed, 84 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
>>>>> index 02c4b40e739c7aa43dc1c581d4ff1f721617cc79..92401adecf813b1c4570d925718be772c8f02975 100644
>>>>> --- a/fs/fuse/dev_uring.c
>>>>> +++ b/fs/fuse/dev_uring.c
>>>>> @@ -19,6 +19,10 @@ MODULE_PARM_DESC(enable_uring,
>>>>>
>>>>>  #define FUSE_URING_IOV_SEGS 2 /* header and payload */
>>>>>
>>>>> +/* Number of queued fuse requests until a queue is considered full */
>>>>> +#define FURING_Q_LOCAL_THRESHOLD 2
>>>>> +#define FURING_Q_NUMA_THRESHOLD (FURING_Q_LOCAL_THRESHOLD + 1)
>>>>> +#define FURING_Q_GLOBAL_THRESHOLD (FURING_Q_LOCAL_THRESHOLD * 2)
>>>>>
>>>>>  bool fuse_uring_enabled(void)
>>>>>  {
>>>>> @@ -1285,22 +1289,94 @@ static void fuse_uring_send_in_task(struct io_uring_cmd *cmd,
>>>>>         fuse_uring_send(ent, cmd, err, issue_flags);
>>>>>  }
>>>>>
>>>>> -static struct fuse_ring_queue *fuse_uring_task_to_queue(struct fuse_ring *ring)
>>>>> +/*
>>>>> + * Pick best queue from mask. Follows the algorithm described in
>>>>> + * "The Power of Two Choices in Randomized Load Balancing"
>>>>> + *  (Michael David Mitzenmacher, 1991)
>>>>> + */
>>>>> +static struct fuse_ring_queue *fuse_uring_best_queue(const struct cpumask *mask,
>>>>> +                                                    struct fuse_ring *ring)
>>>>> +{
>>>>> +       unsigned int qid1, qid2;
>>>>> +       struct fuse_ring_queue *queue1, *queue2;
>>>>> +       int weight = cpumask_weight(mask);
>>>>> +
>>>>> +       if (weight == 0)
>>>>> +               return NULL;
>>>>> +
>>>>> +       if (weight == 1) {
>>>>> +               qid1 = cpumask_first(mask);
>>>>> +               return READ_ONCE(ring->queues[qid1]);
>>>>> +       }
>>>>> +
>>>>> +       /* Get two different queues using optimized bounded random */
>>>>> +       qid1 = cpumask_nth(get_random_u32_below(weight), mask);
>>>>> +       queue1 = READ_ONCE(ring->queues[qid1]);
>>>>> +
>>>>> +       qid2 = cpumask_nth(get_random_u32_below(weight), mask);
>>>>> +
>>>>> +       /* Avoid retries and take this queue for code simplicity */
>>>>> +       if (qid1 == qid2)
>>>>> +               return queue1;
>>>>> +
>>>>> +       queue2 = READ_ONCE(ring->queues[qid2]);
>>>>> +
>>>>> +       if (WARN_ON_ONCE(!queue1 || !queue2))
>>>>> +               return NULL;
>>>>> +
>>>>> +       return (READ_ONCE(queue1->nr_reqs) < READ_ONCE(queue2->nr_reqs)) ?
>>>>> +               queue1 : queue2;
>>>>> +}
>>>>> +
>>>>> +/*
>>>>> + * Get the best queue for the current CPU
>>>>> + */
>>>>> +static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
>>>>>  {
>>>>>         unsigned int qid;
>>>>> -       struct fuse_ring_queue *queue;
>>>>> +       struct fuse_ring_queue *local_queue, *best_numa, *best_global;
>>>>> +       int local_node;
>>>>> +       const struct cpumask *numa_mask, *global_mask;
>>>>>
>>>>>         qid = task_cpu(current);
>>>>> -
>>>>>         if (WARN_ONCE(qid >= ring->max_nr_queues,
>>>>>                       "Core number (%u) exceeds nr queues (%zu)\n", qid,
>>>>>                       ring->max_nr_queues))
>>>>>                 qid = 0;
>>>>>
>>>>> -       queue = ring->queues[qid];
>>>>> -       WARN_ONCE(!queue, "Missing queue for qid %d\n", qid);
>>>>> +       local_queue = READ_ONCE(ring->queues[qid]);
>>>>> +       local_node = cpu_to_node(qid);
>>>>> +       if (WARN_ON_ONCE(local_node > ring->nr_numa_nodes))
>>>>> +               local_node = 0;
>>>>>
>>>>> -       return queue;
>>>>> +       /* Fast path: if local queue exists and is not overloaded, use it */
>>>>> +       if (local_queue &&
>>>>> +           READ_ONCE(local_queue->nr_reqs) <= FURING_Q_LOCAL_THRESHOLD)
>>>>> +               return local_queue;
>>>>> +
>>>>> +       /* Find best NUMA-local queue */
>>>>> +       numa_mask = ring->numa_registered_q_mask[local_node];
>>>>> +       best_numa = fuse_uring_best_queue(numa_mask, ring);
>>>>> +
>>>>> +       /* If NUMA queue is under threshold, use it */
>>>>> +       if (best_numa &&
>>>>> +           READ_ONCE(best_numa->nr_reqs) <= FURING_Q_NUMA_THRESHOLD)
>>>>> +               return best_numa;
>>>>> +
>>>>> +       /* NUMA queues above threshold, try global queues */
>>>>> +       global_mask = ring->registered_q_mask;
>>>>> +       best_global = fuse_uring_best_queue(global_mask, ring);
>>>>> +
>>>>> +       /* Might happen during tear down */
>>>>> +       if (!best_global)
>>>>> +               return NULL;
>>>>> +
>>>>> +       /* If global queue is under double threshold, use it */
>>>>> +       if (READ_ONCE(best_global->nr_reqs) <= FURING_Q_GLOBAL_THRESHOLD)
>>>>> +               return best_global;
>>>>> +
>>>>> +       /* There is no ideal queue, stay numa_local if possible */
>>>>> +       return best_numa ? best_numa : best_global;
>>>>>  }
>>>>
>>>> Hi Bernd,
>>>>
>>>> I started looking a bit at the block layer blk-mq.c code because, as I
>>>> understand it, they have to address this same problem of allocating
>>>> requests to queues while taking into account NUMA locality.
>>>>
>>>> I haven't looked at the code deeply yet but I think what it does is
>>>> maintain a static mapping (that considers numa topology) of cpus to
>>>> queues which then makes queue selection very simple with minimal
>>>> overhead. For distributing load, I think it relies on the CPU
>>>> scheduler to distribute application tasks fairly across CPUs rather
>>>> than doing load balancing itself (which would also then have to break
>>>> numa locality if the request gets moved to a different queue).
>>>> Regarding load balancing, my read of this patch is that it uses the
>>>> number of current requests on queues as the metric of load but I'm not
>>>> sure that's accurate - for example, some requests may be more
>>>> intensive (eg fetching a read over a network) where even if there's
>>>> only a few requests on that queue, that queue could still be more
>>>> loaded with higher latency than other queues.
>>>>
>>>> I'm curious to hear your thoughts on whether you think a simple
>>>> mapping solution like what the block layer does would suffice or not
>>>> for fuse uring queue selection.
>>>
>> Hi Bernd,
>>
>> Thanks for your reply and for sharing your thoughts on this.
>>
>>>
>>> Hi Joanne,
>>>
>>> thanks for looking at the patch. I think we have primarily a static
>>> mapping? For completeness, please also look at the patch 6/6, which
>>> updates queue selection. Basically with patch 6/6 we have static
>>> mapping to the local queue, with neighbor queues as retries. I
>>> had already answered Luis' question - I can show that retries
>>> to the neighbor QIDs improves performance, at least for fio's
>>> '--ioengine=io_uring --numjobs={1..8} --iodepth={8..128} --direct=1'.
>>>
>>> So that leaves the fallback to random QIDs - I don't have a strong
>>> opinion about that, but I don't think the CPU scheduler can handle it.
>>> Let's say you are doing write-back to a single file and let's say
>>> fuse is tuned to allow lots of dirty pages. How should the scheduler
>>> be able to distribute a single-threaded dirty page flush? Especially
>>
>> For writeback, I believe the writeback workqueue is unbound (I'm
>> seeing bdi_wq allocated with WQ_UNBOUND in default_bdi_init()) to any
>> cpu. As I understand it, the worker thread can be migrated by the
>> scheduler which will distribute writing back dirty data across
>> multiple cpus as it sees fit.
>>
>>> also see in patch 6/6 that we really want to have a different CPU
>>> to handle async data - the cpu scheduler will not even try to move
>>> the application or migration thread to a different cpu, because
>>> there is no conflict. And for cpu cache, C-states and frequency,
>>> we actually also want the scheduler to limit migration to
>>> an absolute minimum.
>>>
>>> Another choice instead of random fallback would be to distribute
>>> requests to neighbor queues within FURING_NEXT_QUEUE_RETRIES.
>>> Maybe that would even give better performance, as random queues so
>>> far didn't have a positive effect in my testing.
>>>
>>> The kind of ideal queue selection for async requests seems to be
>>> to fill a queue and then to move to the next qid, within a numa
>>> domain. I just hadn't found a way to do that lockless yet.
>>>
>>> Regarding usage of number of requests - I guess there always
>>> will be workloads where the algorithm isn't perfect - see the
>>> scheduler wake discussion. Maybe we can find a way in the future
>>> to map queued requests in fuse-daemon and then use that + number
>>> of unhandled io-uring CQEs to know if a queue is busy. Any chance
>>> we can do it step by step?
>>>
>>> I don't have a big problem to remove
>>> the random queue selection fallback, but also would be good
>>> to have Miklos' opinion - Miklos had some concerns in September
>>
>> I agree, it'd be good to have Miklos's opinion on this and go with that.
>>
>> My opinion looking at this is that the fuse uring problem of
>> distributing requests to queues is very similar to what the block
>> layer has to do with assigning bio submissions to hardware queues. The
>> block layer's solution to me seems more elegantly simple and flows
>> organically with the cpu scheduler's internal load balancing. I think
>> we should try to keep things as simple as possible, as I don't see how
>> the optimizations with the custom load balancing we're doing here can
>> be accurate enough to warrant the extra complexity and overhead.
>>
>> But I defer to whatever approach you and Miklos think is best and
>> would rather go with.
> 
> Miklos, could you share your thoughts on how we should approach this?

I have already updated the series; v4 without the extended balancing
will come tomorrow - I won't manage to run benchmarks today. Still left
in, as a new commit, is the cpu+1 retry balancing, for the reason that
it gives better performance - I really don't want to remove that.
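
For clarity, the part I'm keeping is roughly the following - an
illustration of the idea only, not the exact v4 code (it reuses names
from this series, e.g. FURING_NEXT_QUEUE_RETRIES from patch 6/6):

/*
 * Prefer the queue of the current core; if it already has work queued,
 * retry the next qid(s) before falling back to the local queue.
 */
static struct fuse_ring_queue *
fuse_uring_retry_next_queue(struct fuse_ring *ring, unsigned int qid)
{
        unsigned int i;

        for (i = 0; i <= FURING_NEXT_QUEUE_RETRIES; i++) {
                unsigned int try_qid = (qid + i) % ring->max_nr_queues;
                struct fuse_ring_queue *queue =
                        READ_ONCE(ring->queues[try_qid]);

                if (queue &&
                    READ_ONCE(queue->nr_reqs) <= FURING_Q_LOCAL_THRESHOLD)
                        return queue;
        }

        /* everything nearby is busy, stay on the local queue */
        return READ_ONCE(ring->queues[qid]);
}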


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 4/6] fuse: {io-uring} Distribute load among queues
  2025-10-24 17:52           ` Bernd Schubert
@ 2025-10-24 17:58             ` Bernd Schubert
  0 siblings, 0 replies; 26+ messages in thread
From: Bernd Schubert @ 2025-10-24 17:58 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Miklos Szeredi, linux-fsdevel@vger.kernel.org, Luis Henriques,
	Gang He

On 10/24/25 19:52, Bernd Schubert wrote:
> On 10/24/25 19:05, Joanne Koong wrote:
>> On Mon, Oct 20, 2025 at 3:59 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>>>
>>> On Mon, Oct 20, 2025 at 12:00 PM Bernd Schubert <bschubert@ddn.com> wrote:
>>>>
>>>> On 10/18/25 02:12, Joanne Koong wrote:
>>>>> On Mon, Oct 13, 2025 at 10:10 AM Bernd Schubert <bschubert@ddn.com> wrote:
>>>>>>
>>>>>> So far queue selection was only for the queue corresponding
>>>>>> to the current core.
>>>>>> With bitmaps about registered queues and counting of queued
>>>>>> requests per queue, distributing the load is possible now.
>>>>>>
>>>>>> This is on purpose lockless and accurate, under the assumption that a lock
>>>>>> between queues might become the limiting factor. Approximate selection
>>>>>> based on queue->nr_reqs should be good enough. If queues get slightly
>>>>>> more requests than given by that counter it should not be too bad,
>>>>>> as number of kernel/userspace transitions gets reduced with higher
>>>>>> queue sizes.
>>>>>>
>>>>>> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
>>>>>> ---
>>>>>>  fs/fuse/dev_uring.c | 92 ++++++++++++++++++++++++++++++++++++++++++++++++-----
>>>>>>  1 file changed, 84 insertions(+), 8 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
>>>>>> index 02c4b40e739c7aa43dc1c581d4ff1f721617cc79..92401adecf813b1c4570d925718be772c8f02975 100644
>>>>>> --- a/fs/fuse/dev_uring.c
>>>>>> +++ b/fs/fuse/dev_uring.c
>>>>>> @@ -19,6 +19,10 @@ MODULE_PARM_DESC(enable_uring,
>>>>>>
>>>>>>  #define FUSE_URING_IOV_SEGS 2 /* header and payload */
>>>>>>
>>>>>> +/* Number of queued fuse requests until a queue is considered full */
>>>>>> +#define FURING_Q_LOCAL_THRESHOLD 2
>>>>>> +#define FURING_Q_NUMA_THRESHOLD (FURING_Q_LOCAL_THRESHOLD + 1)
>>>>>> +#define FURING_Q_GLOBAL_THRESHOLD (FURING_Q_LOCAL_THRESHOLD * 2)
>>>>>>
>>>>>>  bool fuse_uring_enabled(void)
>>>>>>  {
>>>>>> @@ -1285,22 +1289,94 @@ static void fuse_uring_send_in_task(struct io_uring_cmd *cmd,
>>>>>>         fuse_uring_send(ent, cmd, err, issue_flags);
>>>>>>  }
>>>>>>
>>>>>> -static struct fuse_ring_queue *fuse_uring_task_to_queue(struct fuse_ring *ring)
>>>>>> +/*
>>>>>> + * Pick best queue from mask. Follows the algorithm described in
>>>>>> + * "The Power of Two Choices in Randomized Load Balancing"
>>>>>> + *  (Michael David Mitzenmacher, 1991)
>>>>>> + */
>>>>>> +static struct fuse_ring_queue *fuse_uring_best_queue(const struct cpumask *mask,
>>>>>> +                                                    struct fuse_ring *ring)
>>>>>> +{
>>>>>> +       unsigned int qid1, qid2;
>>>>>> +       struct fuse_ring_queue *queue1, *queue2;
>>>>>> +       int weight = cpumask_weight(mask);
>>>>>> +
>>>>>> +       if (weight == 0)
>>>>>> +               return NULL;
>>>>>> +
>>>>>> +       if (weight == 1) {
>>>>>> +               qid1 = cpumask_first(mask);
>>>>>> +               return READ_ONCE(ring->queues[qid1]);
>>>>>> +       }
>>>>>> +
>>>>>> +       /* Get two different queues using optimized bounded random */
>>>>>> +       qid1 = cpumask_nth(get_random_u32_below(weight), mask);
>>>>>> +       queue1 = READ_ONCE(ring->queues[qid1]);
>>>>>> +
>>>>>> +       qid2 = cpumask_nth(get_random_u32_below(weight), mask);
>>>>>> +
>>>>>> +       /* Avoid retries and take this queue for code simplicity */
>>>>>> +       if (qid1 == qid2)
>>>>>> +               return queue1;
>>>>>> +
>>>>>> +       queue2 = READ_ONCE(ring->queues[qid2]);
>>>>>> +
>>>>>> +       if (WARN_ON_ONCE(!queue1 || !queue2))
>>>>>> +               return NULL;
>>>>>> +
>>>>>> +       return (READ_ONCE(queue1->nr_reqs) < READ_ONCE(queue2->nr_reqs)) ?
>>>>>> +               queue1 : queue2;
>>>>>> +}
>>>>>> +
>>>>>> +/*
>>>>>> + * Get the best queue for the current CPU
>>>>>> + */
>>>>>> +static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
>>>>>>  {
>>>>>>         unsigned int qid;
>>>>>> -       struct fuse_ring_queue *queue;
>>>>>> +       struct fuse_ring_queue *local_queue, *best_numa, *best_global;
>>>>>> +       int local_node;
>>>>>> +       const struct cpumask *numa_mask, *global_mask;
>>>>>>
>>>>>>         qid = task_cpu(current);
>>>>>> -
>>>>>>         if (WARN_ONCE(qid >= ring->max_nr_queues,
>>>>>>                       "Core number (%u) exceeds nr queues (%zu)\n", qid,
>>>>>>                       ring->max_nr_queues))
>>>>>>                 qid = 0;
>>>>>>
>>>>>> -       queue = ring->queues[qid];
>>>>>> -       WARN_ONCE(!queue, "Missing queue for qid %d\n", qid);
>>>>>> +       local_queue = READ_ONCE(ring->queues[qid]);
>>>>>> +       local_node = cpu_to_node(qid);
>>>>>> +       if (WARN_ON_ONCE(local_node > ring->nr_numa_nodes))
>>>>>> +               local_node = 0;
>>>>>>
>>>>>> -       return queue;
>>>>>> +       /* Fast path: if local queue exists and is not overloaded, use it */
>>>>>> +       if (local_queue &&
>>>>>> +           READ_ONCE(local_queue->nr_reqs) <= FURING_Q_LOCAL_THRESHOLD)
>>>>>> +               return local_queue;
>>>>>> +
>>>>>> +       /* Find best NUMA-local queue */
>>>>>> +       numa_mask = ring->numa_registered_q_mask[local_node];
>>>>>> +       best_numa = fuse_uring_best_queue(numa_mask, ring);
>>>>>> +
>>>>>> +       /* If NUMA queue is under threshold, use it */
>>>>>> +       if (best_numa &&
>>>>>> +           READ_ONCE(best_numa->nr_reqs) <= FURING_Q_NUMA_THRESHOLD)
>>>>>> +               return best_numa;
>>>>>> +
>>>>>> +       /* NUMA queues above threshold, try global queues */
>>>>>> +       global_mask = ring->registered_q_mask;
>>>>>> +       best_global = fuse_uring_best_queue(global_mask, ring);
>>>>>> +
>>>>>> +       /* Might happen during tear down */
>>>>>> +       if (!best_global)
>>>>>> +               return NULL;
>>>>>> +
>>>>>> +       /* If global queue is under double threshold, use it */
>>>>>> +       if (READ_ONCE(best_global->nr_reqs) <= FURING_Q_GLOBAL_THRESHOLD)
>>>>>> +               return best_global;
>>>>>> +
>>>>>> +       /* There is no ideal queue, stay numa_local if possible */
>>>>>> +       return best_numa ? best_numa : best_global;
>>>>>>  }
>>>>>
>>>>> Hi Bernd,
>>>>>
>>>>> I started looking a bit at the block layer blk-mq.c code because, as I
>>>>> understand it, they have to address this same problem of allocating
>>>>> requests to queues while taking into account NUMA locality.
>>>>>
>>>>> I haven't looked at the code deeply yet but I think what it does is
>>>>> maintain a static mapping (that considers numa topology) of cpus to
>>>>> queues which then makes queue selection very simple with minimal
>>>>> overhead. For distributing load, I think it relies on the CPU
>>>>> scheduler to distribute application tasks fairly across CPUs rather
>>>>> than doing load balancing itself (which would also then have to break
>>>>> numa locality if the request gets moved to a different queue).
>>>>> Regarding load balancing, my read of this patch is that it uses the
>>>>> number of current requests on queues as the metric of load but I'm not
>>>>> sure that's accurate - for example, some requests may be more
>>>>> intensive (eg fetching a read over a network) where even if there's
>>>>> only a few requests on that queue, that queue could still be more
>>>>> loaded with higher latency than other queues.
>>>>>
>>>>> I'm curious to hear your thoughts on whether you think a simple
>>>>> mapping solution like what the block layer does would suffice or not
>>>>> for fuse uring queue selection.
>>>>
>>> Hi Bernd,
>>>
>>> Thanks for your reply and for sharing your thoughts on this.
>>>
>>>>
>>>> Hi Joanne,
>>>>
>>>> thanks for looking at the patch. I think we have primarily a static
>>>> mapping? For completeness, please also look at the patch 6/6, which
>>>> updates queue selection. Basically with patch 6/6 we have static
>>>> mapping to the local queue, with neighbor queues as retries. I
>>>> had already answered Luis' question - I can show that retries
>>>> to the neighbor QIDs improves performance, at least for fio's
>>>> '--ioengine=io_uring --numjobs={1..8} --iodepth={8..128} --direct=1'.
>>>>
>>>> So that leaves the fallback to random QIDs - I don't have a strong
>>>> opinion about that, but I don't think the CPU scheduler can handle it.
>>>> Let's say you are doing write-back to a single file and let's say
>>>> fuse is tuned to allow lots of dirty pages. How should the scheduler
>>>> be able to distribute a single-threaded dirty page flush? Especially
>>>
>>> For writeback, I believe the writeback workqueue is unbound (I'm
>>> seeing bdi_wq allocated with WQ_UNBOUND in default_bdi_init()) to any
>>> cpu. As I understand it, the worker thread can be migrated by the
>>> scheduler which will distribute writing back dirty data across
>>> multiple cpus as it sees fit.
>>>
>>>> also see in patch 6/6 that we really want to have a different CPU
>>>> to handle async data - the cpu scheduler will not even try to move
>>>> the application or migration thread to a different cpu, because
>>>> there is no conflict. And for cpu cache, C-states and frequency,
>>>> we actually also want the scheduler to limit migration to
>>>> an absolute minimum.
>>>>
>>>> Another choice instead of random fallback would be to distribute
>>>> requests to neighbor queues within FURING_NEXT_QUEUE_RETRIES.
>>>> Maybe that would even give better performance, as random queues so
>>>> far didn't have a positive effect in my testing.
>>>>
>>>> The kind of ideal queue selection for async requests seems to be
>>>> to fill a queue and then to move to the next qid, within a numa
>>>> domain. I just hadn't found a way to do that lockless yet.
>>>>
>>>> Regarding usage of number of requests - I guess there always
>>>> will be workloads where the algorithm isn't perfect - see the
>>>> scheduler wake discussion. Maybe we can find a way in the future
>>>> to map queued requests in fuse-daemon and then use that + number
>>>> of unhandled io-uring CQEs to know if a queue is busy. Any chance
>>>> we can do it step by step?
>>>>
>>>> I don't have a big problem to remove
>>>> the random queue selection fallback, but also would be good
>>>> to have Miklos' opinion - Miklos had some concerns in September
>>>
>>> I agree, it'd be good to have Miklos's opinion on this and go with that.
>>>
>>> My opinion looking at this is that the fuse uring problem of
>>> distributing requests to queues is very similar to what the block
>>> layer has to do with assigning bio submissions to hardware queues. The
>>> block layer's solution to me seems more elegantly simple and flows
>>> organically with the cpu scheduler's internal load balancing. I think
>>> we should try to keep things as simple as possible, as I don't see how
>>> the optimizations with the custom load balancing we're doing here can
>>> be accurate enough to warrant the extra complexity and overhead.
>>>
>>> But I defer to whatever approach you and Miklos think is best and
>>> would rather go with.
>>
>> Miklos, could you share your thoughts on how we should approach this?
> 
> I have already updated the series, v4 without extended balancing
> will come tomorrow - I don't manage to run benchmarks now. Left in
> a new commit is still cpu+1 retry balancing, for the reason that
> it gives better performance - I really don't want to remove that.
> 


Forgot to write: I'm still going to add back the balancing, as I want to
remove

	spin_lock(&fc->bg_lock);
	fc->num_background++;

in the dev_uring.c code and replace it with a per-queue counter. For
that, num_background needs to be split equally between the queues.
The bg_lock is the cause of lock-ordering issues, which is why we
a) have to block the mount until io-uring is initialized and b) cannot
disable/re-enable io-uring later on (the latter is something that the
qemu team has asked for).
And obviously, a global counter and lock don't help performance.
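
As a rough sketch of the direction (the per-queue num_background field
and the nr_registered split are hypothetical names here, not code from
this series):

/*
 * Give each registered queue an equal share of max_background and
 * account background requests per queue, so the global fc->bg_lock is
 * no longer taken on this path.
 */
static bool fuse_uring_bg_queue_full(struct fuse_ring_queue *queue)
{
        struct fuse_ring *ring = queue->ring;
        unsigned int per_queue_max =
                max(1U, ring->fc->max_background / ring->nr_registered);

        if (atomic_inc_return(&queue->num_background) > per_queue_max) {
                atomic_dec(&queue->num_background);
                return true;    /* caller picks another queue or throttles */
        }

        return false;
}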


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2025-10-24 17:58 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-13 17:09 [PATCH v3 0/6] fuse: {io-uring} Allow to reduce the number of queues and request distribution Bernd Schubert
2025-10-13 17:09 ` [PATCH v3 1/6] fuse: {io-uring} Add queue length counters Bernd Schubert
2025-10-15  9:19   ` Luis Henriques
2025-10-13 17:09 ` [PATCH v3 2/6] fuse: {io-uring} Rename ring->nr_queues to max_nr_queues Bernd Schubert
2025-10-13 17:09 ` [PATCH v3 3/6] fuse: {io-uring} Use bitmaps to track registered queues Bernd Schubert
2025-10-15 23:49   ` Joanne Koong
2025-10-16 11:33     ` Bernd Schubert
2025-10-13 17:10 ` [PATCH v3 4/6] fuse: {io-uring} Distribute load among queues Bernd Schubert
2025-10-18  0:12   ` Joanne Koong
2025-10-20 19:00     ` Bernd Schubert
2025-10-20 22:59       ` Joanne Koong
2025-10-20 23:28         ` Bernd Schubert
2025-10-24 17:05         ` Joanne Koong
2025-10-24 17:52           ` Bernd Schubert
2025-10-24 17:58             ` Bernd Schubert
2025-10-13 17:10 ` [PATCH v3 5/6] fuse: {io-uring} Allow reduced number of ring queues Bernd Schubert
2025-10-15  9:25   ` Luis Henriques
2025-10-15  9:31     ` Bernd Schubert
2025-10-13 17:10 ` [PATCH v3 6/6] fuse: {io-uring} Queue background requests on a different core Bernd Schubert
2025-10-15  9:50   ` Luis Henriques
2025-10-15 10:27     ` Bernd Schubert
2025-10-15 11:05       ` Luis Henriques
2025-10-20  7:15   ` Dan Carpenter
2025-10-14  8:43 ` [PATCH v3 0/6] fuse: {io-uring} Allow to reduce the number of queues and request distribution Gang He
2025-10-14  9:14   ` Bernd Schubert
2025-10-16  6:15     ` Gang He

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).