* [PATCH v4 1/8] fuse: {io-uring} Add queue length counters
2026-04-13 9:41 [PATCH v4 0/8] fuse: {io-uring} Allow to reduce the number of queues and request distribution Bernd Schubert via B4 Relay
@ 2026-04-13 9:41 ` Bernd Schubert via B4 Relay
2026-04-13 9:41 ` [PATCH v4 2/8] fuse: {io-uring} Rename ring->nr_queues to max_nr_queues Bernd Schubert via B4 Relay
` (6 subsequent siblings)
7 siblings, 0 replies; 16+ messages in thread
From: Bernd Schubert via B4 Relay @ 2026-04-13 9:41 UTC (permalink / raw)
To: Miklos Szeredi
Cc: Joanne Koong, linux-fsdevel, Luis Henriques, Gang He,
Bernd Schubert
From: Bernd Schubert <bschubert@ddn.com>
This is another preparation patch; the counter will be used to
decide which queue to add a request to.
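For illustration only (a sketch, not part of this patch - the actual
selection policy follows in later patches of this series;
fuse_uring_pick_shorter() is a made-up helper):
static struct fuse_ring_queue *
fuse_uring_pick_shorter(struct fuse_ring_queue *a,
			struct fuse_ring_queue *b)
{
	/*
	 * nr_reqs is only a load hint here, so an unlocked READ_ONCE()
	 * is good enough - queue state itself stays protected by
	 * queue->lock.
	 */
	return READ_ONCE(a->nr_reqs) <= READ_ONCE(b->nr_reqs) ? a : b;
}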
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
---
fs/fuse/dev_uring.c | 17 +++++++++++++++--
fs/fuse/dev_uring_i.h | 3 +++
2 files changed, 18 insertions(+), 2 deletions(-)
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 3a38b61aac26f7387cdcb7b2d81624e1ac9ececa..f2f1e10da3429748da87884ac6d1f53613d41bc4 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -87,13 +87,13 @@ static void fuse_uring_req_end(struct fuse_ring_ent *ent, struct fuse_req *req,
spin_lock(&queue->lock);
ent->fuse_req = NULL;
list_del_init(&req->list);
+ queue->nr_reqs--;
if (test_bit(FR_BACKGROUND, &req->flags)) {
queue->active_background--;
spin_lock(&fc->bg_lock);
fuse_uring_flush_bg(queue);
spin_unlock(&fc->bg_lock);
}
-
spin_unlock(&queue->lock);
if (error)
@@ -113,6 +113,7 @@ static void fuse_uring_abort_end_queue_requests(struct fuse_ring_queue *queue)
list_for_each_entry(req, &queue->fuse_req_queue, list)
clear_bit(FR_PENDING, &req->flags);
list_splice_init(&queue->fuse_req_queue, &req_list);
+ queue->nr_reqs = 0;
spin_unlock(&queue->lock);
/* must not hold queue lock to avoid order issues with fi->lock */
@@ -1287,10 +1288,13 @@ void fuse_uring_queue_fuse_req(struct fuse_iqueue *fiq, struct fuse_req *req)
req->ring_queue = queue;
ent = list_first_entry_or_null(&queue->ent_avail_queue,
struct fuse_ring_ent, list);
+ queue->nr_reqs++;
+
if (ent)
fuse_uring_add_req_to_ring_ent(ent, req);
else
list_add_tail(&req->list, &queue->fuse_req_queue);
+
spin_unlock(&queue->lock);
if (ent)
@@ -1326,6 +1330,7 @@ bool fuse_uring_queue_bq_req(struct fuse_req *req)
set_bit(FR_URING, &req->flags);
req->ring_queue = queue;
list_add_tail(&req->list, &queue->fuse_req_bg_queue);
+ queue->nr_reqs++;
ent = list_first_entry_or_null(&queue->ent_avail_queue,
struct fuse_ring_ent, list);
@@ -1358,8 +1363,16 @@ bool fuse_uring_queue_bq_req(struct fuse_req *req)
bool fuse_uring_remove_pending_req(struct fuse_req *req)
{
struct fuse_ring_queue *queue = req->ring_queue;
+ bool removed = fuse_remove_pending_req(req, &queue->lock);
- return fuse_remove_pending_req(req, &queue->lock);
+ if (removed) {
+ /* Update counters after successful removal */
+ spin_lock(&queue->lock);
+ queue->nr_reqs--;
+ spin_unlock(&queue->lock);
+ }
+
+ return removed;
}
static const struct fuse_iqueue_ops fuse_io_uring_ops = {
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index 51a563922ce14158904a86c248c77767be4fe5ae..c63bed9f863d53d4ac2bed7bfbda61941cd99083 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -94,6 +94,9 @@ struct fuse_ring_queue {
/* background fuse requests */
struct list_head fuse_req_bg_queue;
+ /* number of requests queued or in userspace */
+ unsigned int nr_reqs;
+
struct fuse_pqueue fpq;
unsigned int active_background;
--
2.43.0
* [PATCH v4 2/8] fuse: {io-uring} Rename ring->nr_queues to max_nr_queues
2026-04-13 9:41 [PATCH v4 0/8] fuse: {io-uring} Allow to reduce the number of queues and request distribution Bernd Schubert via B4 Relay
2026-04-13 9:41 ` [PATCH v4 1/8] fuse: {io-uring} Add queue length counters Bernd Schubert via B4 Relay
@ 2026-04-13 9:41 ` Bernd Schubert via B4 Relay
2026-04-13 9:41 ` [PATCH v4 3/8] fuse: {io-uring} Use bitmaps to track registered queues Bernd Schubert via B4 Relay
` (5 subsequent siblings)
7 siblings, 0 replies; 16+ messages in thread
From: Bernd Schubert via B4 Relay @ 2026-04-13 9:41 UTC (permalink / raw)
To: Miklos Szeredi
Cc: Joanne Koong, linux-fsdevel, Luis Henriques, Gang He,
Bernd Schubert
From: Bernd Schubert <bschubert@ddn.com>
This is preparation for follow-up commits that allow running with a
reduced number of queues.
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
fs/fuse/dev_uring.c | 24 ++++++++++++------------
fs/fuse/dev_uring_i.h | 2 +-
2 files changed, 13 insertions(+), 13 deletions(-)
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index f2f1e10da3429748da87884ac6d1f53613d41bc4..08c49e5b65ee950a9749a034b1e9d0a8883e1d31 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -126,7 +126,7 @@ void fuse_uring_abort_end_requests(struct fuse_ring *ring)
struct fuse_ring_queue *queue;
struct fuse_conn *fc = ring->fc;
- for (qid = 0; qid < ring->nr_queues; qid++) {
+ for (qid = 0; qid < ring->max_nr_queues; qid++) {
queue = READ_ONCE(ring->queues[qid]);
if (!queue)
continue;
@@ -167,7 +167,7 @@ bool fuse_uring_request_expired(struct fuse_conn *fc)
if (!ring)
return false;
- for (qid = 0; qid < ring->nr_queues; qid++) {
+ for (qid = 0; qid < ring->max_nr_queues; qid++) {
queue = READ_ONCE(ring->queues[qid]);
if (!queue)
continue;
@@ -194,7 +194,7 @@ void fuse_uring_destruct(struct fuse_conn *fc)
if (!ring)
return;
- for (qid = 0; qid < ring->nr_queues; qid++) {
+ for (qid = 0; qid < ring->max_nr_queues; qid++) {
struct fuse_ring_queue *queue = ring->queues[qid];
struct fuse_ring_ent *ent, *next;
@@ -254,7 +254,7 @@ static struct fuse_ring *fuse_uring_create(struct fuse_conn *fc)
init_waitqueue_head(&ring->stop_waitq);
- ring->nr_queues = nr_queues;
+ ring->max_nr_queues = nr_queues;
ring->fc = fc;
ring->max_payload_sz = max_payload_size;
smp_store_release(&fc->ring, ring);
@@ -406,7 +406,7 @@ static void fuse_uring_log_ent_state(struct fuse_ring *ring)
int qid;
struct fuse_ring_ent *ent;
- for (qid = 0; qid < ring->nr_queues; qid++) {
+ for (qid = 0; qid < ring->max_nr_queues; qid++) {
struct fuse_ring_queue *queue = ring->queues[qid];
if (!queue)
@@ -437,7 +437,7 @@ static void fuse_uring_async_stop_queues(struct work_struct *work)
container_of(work, struct fuse_ring, async_teardown_work.work);
/* XXX code dup */
- for (qid = 0; qid < ring->nr_queues; qid++) {
+ for (qid = 0; qid < ring->max_nr_queues; qid++) {
struct fuse_ring_queue *queue = READ_ONCE(ring->queues[qid]);
if (!queue)
@@ -472,7 +472,7 @@ void fuse_uring_stop_queues(struct fuse_ring *ring)
{
int qid;
- for (qid = 0; qid < ring->nr_queues; qid++) {
+ for (qid = 0; qid < ring->max_nr_queues; qid++) {
struct fuse_ring_queue *queue = READ_ONCE(ring->queues[qid]);
if (!queue)
@@ -895,7 +895,7 @@ static int fuse_uring_commit_fetch(struct io_uring_cmd *cmd, int issue_flags,
if (!ring)
return err;
- if (qid >= ring->nr_queues)
+ if (qid >= ring->max_nr_queues)
return -EINVAL;
queue = ring->queues[qid];
@@ -958,7 +958,7 @@ static bool is_ring_ready(struct fuse_ring *ring, int current_qid)
struct fuse_ring_queue *queue;
bool ready = true;
- for (qid = 0; qid < ring->nr_queues && ready; qid++) {
+ for (qid = 0; qid < ring->max_nr_queues && ready; qid++) {
if (current_qid == qid)
continue;
@@ -1100,7 +1100,7 @@ static int fuse_uring_register(struct io_uring_cmd *cmd,
return err;
}
- if (qid >= ring->nr_queues) {
+ if (qid >= ring->max_nr_queues) {
pr_info_ratelimited("fuse: Invalid ring qid %u\n", qid);
return -EINVAL;
}
@@ -1244,9 +1244,9 @@ static struct fuse_ring_queue *fuse_uring_task_to_queue(struct fuse_ring *ring)
qid = task_cpu(current);
- if (WARN_ONCE(qid >= ring->nr_queues,
+ if (WARN_ONCE(qid >= ring->max_nr_queues,
"Core number (%u) exceeds nr queues (%zu)\n", qid,
- ring->nr_queues))
+ ring->max_nr_queues))
qid = 0;
queue = ring->queues[qid];
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index c63bed9f863d53d4ac2bed7bfbda61941cd99083..708412294982566919122a1a0d7f741217c763ce 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -113,7 +113,7 @@ struct fuse_ring {
struct fuse_conn *fc;
/* number of ring queues */
- size_t nr_queues;
+ size_t max_nr_queues;
/* maximum payload/arg size */
size_t max_payload_sz;
--
2.43.0
* [PATCH v4 3/8] fuse: {io-uring} Use bitmaps to track registered queues
2026-04-13 9:41 [PATCH v4 0/8] fuse: {io-uring} Allow to reduce the number of queues and request distribution Bernd Schubert via B4 Relay
2026-04-13 9:41 ` [PATCH v4 1/8] fuse: {io-uring} Add queue length counters Bernd Schubert via B4 Relay
2026-04-13 9:41 ` [PATCH v4 2/8] fuse: {io-uring} Rename ring->nr_queues to max_nr_queues Bernd Schubert via B4 Relay
@ 2026-04-13 9:41 ` Bernd Schubert via B4 Relay
2026-04-24 15:04 ` Luis Henriques
2026-04-13 9:41 ` [PATCH v4 4/8] fuse: Fetch a queued fuse request on command registration Bernd Schubert via B4 Relay
` (4 subsequent siblings)
7 siblings, 1 reply; 16+ messages in thread
From: Bernd Schubert via B4 Relay @ 2026-04-13 9:41 UTC (permalink / raw)
To: Miklos Szeredi
Cc: Joanne Koong, linux-fsdevel, Luis Henriques, Gang He,
Bernd Schubert
From: Bernd Schubert <bschubert@ddn.com>
Add per-CPU and per-NUMA node bitmasks to track which
io-uring queues are registered.
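For illustration (sketch only - the consumers of these maps follow in
later patches of this series): once the maps are populated, picking a
queue for the current CPU becomes a plain table lookup, roughly:
	/* assumes cpu_to_qid has been filled in for this node */
	int node = cpu_to_node(cpu);
	int qid = ring->numa_q_map[node].cpu_to_qid[cpu];
	struct fuse_ring_queue *queue = READ_ONCE(ring->queues[qid]);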
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
fs/fuse/dev_uring.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/dev_uring_i.h | 20 ++++++++++++++
2 files changed, 93 insertions(+)
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 08c49e5b65ee950a9749a034b1e9d0a8883e1d31..1d1305d38efa07641128a865aec1745d2e040b93 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -186,6 +186,25 @@ bool fuse_uring_request_expired(struct fuse_conn *fc)
return false;
}
+static void fuse_ring_destruct_q_map(struct fuse_queue_map *q_map)
+{
+ free_cpumask_var(q_map->registered_q_mask);
+ kfree(q_map->cpu_to_qid);
+}
+
+static void fuse_uring_destruct_q_masks(struct fuse_ring *ring)
+{
+ int node;
+
+ fuse_ring_destruct_q_map(&ring->q_map);
+
+ if (ring->numa_q_map) {
+ for (node = 0; node < ring->nr_numa_nodes; node++)
+ fuse_ring_destruct_q_map(&ring->numa_q_map[node]);
+ kfree(ring->numa_q_map);
+ }
+}
+
void fuse_uring_destruct(struct fuse_conn *fc)
{
struct fuse_ring *ring = fc->ring;
@@ -217,11 +236,44 @@ void fuse_uring_destruct(struct fuse_conn *fc)
ring->queues[qid] = NULL;
}
+ fuse_uring_destruct_q_masks(ring);
kfree(ring->queues);
kfree(ring);
fc->ring = NULL;
}
+static int fuse_uring_init_q_map(struct fuse_queue_map *q_map, size_t nr_cpu)
+{
+ if (!zalloc_cpumask_var(&q_map->registered_q_mask, GFP_KERNEL_ACCOUNT))
+ return -ENOMEM;
+
+ q_map->cpu_to_qid = kzalloc_objs(*q_map->cpu_to_qid, nr_cpu,
+ GFP_KERNEL_ACCOUNT);
+
+ return 0;
+}
+
+static int fuse_uring_create_q_masks(struct fuse_ring *ring)
+{
+ int err, node;
+
+ err = fuse_uring_init_q_map(&ring->q_map, ring->max_nr_queues);
+ if (err)
+ return err;
+
+ ring->numa_q_map = kzalloc_objs(*ring->numa_q_map, ring->nr_numa_nodes,
+ GFP_KERNEL_ACCOUNT);
+ if (!ring->numa_q_map)
+ return -ENOMEM;
+ for (node = 0; node < ring->nr_numa_nodes; node++) {
+ err = fuse_uring_init_q_map(&ring->numa_q_map[node],
+ ring->max_nr_queues);
+ if (err)
+ return err;
+ }
+ return 0;
+}
+
/*
* Basic ring setup for this connection based on the provided configuration
*/
@@ -231,6 +283,7 @@ static struct fuse_ring *fuse_uring_create(struct fuse_conn *fc)
size_t nr_queues = num_possible_cpus();
struct fuse_ring *res = NULL;
size_t max_payload_size;
+ int err;
ring = kzalloc_obj(*fc->ring, GFP_KERNEL_ACCOUNT);
if (!ring)
@@ -241,9 +294,15 @@ static struct fuse_ring *fuse_uring_create(struct fuse_conn *fc)
if (!ring->queues)
goto out_err;
+ ring->nr_numa_nodes = num_online_nodes();
+
max_payload_size = max(FUSE_MIN_READ_BUFFER, fc->max_write);
max_payload_size = max(max_payload_size, fc->max_pages * PAGE_SIZE);
+ err = fuse_uring_create_q_masks(ring);
+ if (err)
+ goto out_err;
+
spin_lock(&fc->lock);
if (fc->ring) {
/* race, another thread created the ring in the meantime */
@@ -263,6 +322,7 @@ static struct fuse_ring *fuse_uring_create(struct fuse_conn *fc)
return ring;
out_err:
+ fuse_uring_destruct_q_masks(ring);
kfree(ring->queues);
kfree(ring);
return res;
@@ -425,6 +485,7 @@ static void fuse_uring_log_ent_state(struct fuse_ring *ring)
pr_info(" ent-commit-queue ring=%p qid=%d ent=%p state=%d\n",
ring, qid, ent, ent->state);
}
+
spin_unlock(&queue->lock);
}
ring->stop_debug_log = 1;
@@ -471,6 +532,7 @@ static void fuse_uring_async_stop_queues(struct work_struct *work)
void fuse_uring_stop_queues(struct fuse_ring *ring)
{
int qid;
+ int node;
for (qid = 0; qid < ring->max_nr_queues; qid++) {
struct fuse_ring_queue *queue = READ_ONCE(ring->queues[qid]);
@@ -481,6 +543,13 @@ void fuse_uring_stop_queues(struct fuse_ring *ring)
fuse_uring_teardown_entries(queue);
}
+ /* Reset all queue masks, we won't process any more IO */
+ cpumask_clear(ring->q_map.registered_q_mask);
+ for (node = 0; node < ring->nr_numa_nodes; node++) {
+ if (ring->numa_q_map)
+ cpumask_clear(ring->numa_q_map[node].registered_q_mask);
+ }
+
if (atomic_read(&ring->queue_refs) > 0) {
ring->teardown_time = jiffies;
INIT_DELAYED_WORK(&ring->async_teardown_work,
@@ -988,6 +1057,10 @@ static void fuse_uring_do_register(struct fuse_ring_ent *ent,
struct fuse_ring *ring = queue->ring;
struct fuse_conn *fc = ring->fc;
struct fuse_iqueue *fiq = &fc->iq;
+ int node = cpu_to_node(queue->qid);
+
+ if (WARN_ON_ONCE(node >= ring->nr_numa_nodes))
+ node = 0;
fuse_uring_prepare_cancel(cmd, issue_flags, ent);
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index 708412294982566919122a1a0d7f741217c763ce..83506f431b97249c8f0c82f89f0fce41021288dd 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -104,6 +104,17 @@ struct fuse_ring_queue {
bool stopped;
};
+struct fuse_queue_map {
+ /* Tracks which queues are registered */
+ cpumask_var_t registered_q_mask;
+
+ /* number of registered queues */
+ size_t nr_queues;
+
+ /* cpu to qid mapping */
+ int *cpu_to_qid;
+};
+
/**
* Describes if uring is for communication and holds alls the data needed
* for uring communication
@@ -115,6 +126,9 @@ struct fuse_ring {
/* number of ring queues */
size_t max_nr_queues;
+ /* number of numa nodes */
+ int nr_numa_nodes;
+
/* maximum payload/arg size */
size_t max_payload_sz;
@@ -125,6 +139,12 @@ struct fuse_ring {
*/
unsigned int stop_debug_log : 1;
+ /* per numa node queue tracking */
+ struct fuse_queue_map *numa_q_map;
+
+ /* all queue tracking */
+ struct fuse_queue_map q_map;
+
wait_queue_head_t stop_waitq;
/* async tear down */
--
2.43.0
* Re: [PATCH v4 3/8] fuse: {io-uring} Use bitmaps to track registered queues
2026-04-13 9:41 ` [PATCH v4 3/8] fuse: {io-uring} Use bitmaps to track registered queues Bernd Schubert via B4 Relay
@ 2026-04-24 15:04 ` Luis Henriques
2026-04-24 15:33 ` Bernd Schubert
0 siblings, 1 reply; 16+ messages in thread
From: Luis Henriques @ 2026-04-24 15:04 UTC (permalink / raw)
To: Bernd Schubert via B4 Relay
Cc: Miklos Szeredi, bernd, Joanne Koong, linux-fsdevel, Gang He,
Bernd Schubert
On Mon, Apr 13 2026, Bernd Schubert via B4 Relay wrote:
> From: Bernd Schubert <bschubert@ddn.com>
>
> Add per-CPU and per-NUMA node bitmasks to track which
> io-uring queues are registered.
>
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> ---
> fs/fuse/dev_uring.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++++++
> fs/fuse/dev_uring_i.h | 20 ++++++++++++++
> 2 files changed, 93 insertions(+)
>
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index 08c49e5b65ee950a9749a034b1e9d0a8883e1d31..1d1305d38efa07641128a865aec1745d2e040b93 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -186,6 +186,25 @@ bool fuse_uring_request_expired(struct fuse_conn *fc)
> return false;
> }
>
> +static void fuse_ring_destruct_q_map(struct fuse_queue_map *q_map)
> +{
> + free_cpumask_var(q_map->registered_q_mask);
> + kfree(q_map->cpu_to_qid);
> +}
> +
> +static void fuse_uring_destruct_q_masks(struct fuse_ring *ring)
> +{
> + int node;
> +
> + fuse_ring_destruct_q_map(&ring->q_map);
> +
> + if (ring->numa_q_map) {
> + for (node = 0; node < ring->nr_numa_nodes; node++)
> + fuse_ring_destruct_q_map(&ring->numa_q_map[node]);
> + kfree(ring->numa_q_map);
> + }
> +}
> +
> void fuse_uring_destruct(struct fuse_conn *fc)
> {
> struct fuse_ring *ring = fc->ring;
> @@ -217,11 +236,44 @@ void fuse_uring_destruct(struct fuse_conn *fc)
> ring->queues[qid] = NULL;
> }
>
> + fuse_uring_destruct_q_masks(ring);
> kfree(ring->queues);
> kfree(ring);
> fc->ring = NULL;
> }
>
> +static int fuse_uring_init_q_map(struct fuse_queue_map *q_map, size_t nr_cpu)
> +{
> + if (!zalloc_cpumask_var(&q_map->registered_q_mask, GFP_KERNEL_ACCOUNT))
> + return -ENOMEM;
> +
> + q_map->cpu_to_qid = kzalloc_objs(*q_map->cpu_to_qid, nr_cpu,
> + GFP_KERNEL_ACCOUNT);
> +
Missing NULL check here (and corresponding call to free_cpumask_var()).
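I.e. something like this (untested):
	q_map->cpu_to_qid = kzalloc_objs(*q_map->cpu_to_qid, nr_cpu,
					 GFP_KERNEL_ACCOUNT);
	if (!q_map->cpu_to_qid) {
		free_cpumask_var(q_map->registered_q_mask);
		return -ENOMEM;
	}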
>
> + return 0;
> +}
> +
> +static int fuse_uring_create_q_masks(struct fuse_ring *ring)
> +{
> + int err, node;
> +
> + err = fuse_uring_init_q_map(&ring->q_map, ring->max_nr_queues);
> + if (err)
> + return err;
> +
> + ring->numa_q_map = kzalloc_objs(*ring->numa_q_map, ring->nr_numa_nodes,
> + GFP_KERNEL_ACCOUNT);
> + if (!ring->numa_q_map)
> + return -ENOMEM;
Missing call to fuse_ring_destruct_q_map().
> + for (node = 0; node < ring->nr_numa_nodes; node++) {
> + err = fuse_uring_init_q_map(&ring->numa_q_map[node],
> + ring->max_nr_queues);
> + if (err)
> + return err;
Cleanup also missing here.
Cheers,
--
Luís
> + }
> + return 0;
> +}
> +
> /*
> * Basic ring setup for this connection based on the provided configuration
> */
> @@ -231,6 +283,7 @@ static struct fuse_ring *fuse_uring_create(struct fuse_conn *fc)
> size_t nr_queues = num_possible_cpus();
> struct fuse_ring *res = NULL;
> size_t max_payload_size;
> + int err;
>
> ring = kzalloc_obj(*fc->ring, GFP_KERNEL_ACCOUNT);
> if (!ring)
> @@ -241,9 +294,15 @@ static struct fuse_ring *fuse_uring_create(struct fuse_conn *fc)
> if (!ring->queues)
> goto out_err;
>
> + ring->nr_numa_nodes = num_online_nodes();
> +
> max_payload_size = max(FUSE_MIN_READ_BUFFER, fc->max_write);
> max_payload_size = max(max_payload_size, fc->max_pages * PAGE_SIZE);
>
> + err = fuse_uring_create_q_masks(ring);
> + if (err)
> + goto out_err;
> +
> spin_lock(&fc->lock);
> if (fc->ring) {
> /* race, another thread created the ring in the meantime */
> @@ -263,6 +322,7 @@ static struct fuse_ring *fuse_uring_create(struct fuse_conn *fc)
> return ring;
>
> out_err:
> + fuse_uring_destruct_q_masks(ring);
> kfree(ring->queues);
> kfree(ring);
> return res;
> @@ -425,6 +485,7 @@ static void fuse_uring_log_ent_state(struct fuse_ring *ring)
> pr_info(" ent-commit-queue ring=%p qid=%d ent=%p state=%d\n",
> ring, qid, ent, ent->state);
> }
> +
> spin_unlock(&queue->lock);
> }
> ring->stop_debug_log = 1;
> @@ -471,6 +532,7 @@ static void fuse_uring_async_stop_queues(struct work_struct *work)
> void fuse_uring_stop_queues(struct fuse_ring *ring)
> {
> int qid;
> + int node;
>
> for (qid = 0; qid < ring->max_nr_queues; qid++) {
> struct fuse_ring_queue *queue = READ_ONCE(ring->queues[qid]);
> @@ -481,6 +543,13 @@ void fuse_uring_stop_queues(struct fuse_ring *ring)
> fuse_uring_teardown_entries(queue);
> }
>
> + /* Reset all queue masks, we won't process any more IO */
> + cpumask_clear(ring->q_map.registered_q_mask);
> + for (node = 0; node < ring->nr_numa_nodes; node++) {
> + if (ring->numa_q_map)
> + cpumask_clear(ring->numa_q_map[node].registered_q_mask);
> + }
> +
> if (atomic_read(&ring->queue_refs) > 0) {
> ring->teardown_time = jiffies;
> INIT_DELAYED_WORK(&ring->async_teardown_work,
> @@ -988,6 +1057,10 @@ static void fuse_uring_do_register(struct fuse_ring_ent *ent,
> struct fuse_ring *ring = queue->ring;
> struct fuse_conn *fc = ring->fc;
> struct fuse_iqueue *fiq = &fc->iq;
> + int node = cpu_to_node(queue->qid);
> +
> + if (WARN_ON_ONCE(node >= ring->nr_numa_nodes))
> + node = 0;
>
> fuse_uring_prepare_cancel(cmd, issue_flags, ent);
>
> diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
> index 708412294982566919122a1a0d7f741217c763ce..83506f431b97249c8f0c82f89f0fce41021288dd 100644
> --- a/fs/fuse/dev_uring_i.h
> +++ b/fs/fuse/dev_uring_i.h
> @@ -104,6 +104,17 @@ struct fuse_ring_queue {
> bool stopped;
> };
>
> +struct fuse_queue_map {
> + /* Tracks which queues are registered */
> + cpumask_var_t registered_q_mask;
> +
> + /* number of registered queues */
> + size_t nr_queues;
> +
> + /* cpu to qid mapping */
> + int *cpu_to_qid;
> +};
> +
> /**
> * Describes if uring is for communication and holds alls the data needed
> * for uring communication
> @@ -115,6 +126,9 @@ struct fuse_ring {
> /* number of ring queues */
> size_t max_nr_queues;
>
> + /* number of numa nodes */
> + int nr_numa_nodes;
> +
> /* maximum payload/arg size */
> size_t max_payload_sz;
>
> @@ -125,6 +139,12 @@ struct fuse_ring {
> */
> unsigned int stop_debug_log : 1;
>
> + /* per numa node queue tracking */
> + struct fuse_queue_map *numa_q_map;
> +
> + /* all queue tracking */
> + struct fuse_queue_map q_map;
> +
> wait_queue_head_t stop_waitq;
>
> /* async tear down */
>
> --
> 2.43.0
>
>
* Re: [PATCH v4 3/8] fuse: {io-uring} Use bitmaps to track registered queues
2026-04-24 15:04 ` Luis Henriques
@ 2026-04-24 15:33 ` Bernd Schubert
0 siblings, 0 replies; 16+ messages in thread
From: Bernd Schubert @ 2026-04-24 15:33 UTC (permalink / raw)
To: Luis Henriques, Bernd Schubert via B4 Relay
Cc: Miklos Szeredi, Joanne Koong, linux-fsdevel, Gang He,
Bernd Schubert
On 4/24/26 17:04, Luis Henriques wrote:
> On Mon, Apr 13 2026, Bernd Schubert via B4 Relay wrote:
>
>> From: Bernd Schubert <bschubert@ddn.com>
>>
>> Add per-CPU and per-NUMA node bitmasks to track which
>> io-uring queues are registered.
>>
>> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
>> ---
>> fs/fuse/dev_uring.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++++++
>> fs/fuse/dev_uring_i.h | 20 ++++++++++++++
>> 2 files changed, 93 insertions(+)
>>
>> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
>> index 08c49e5b65ee950a9749a034b1e9d0a8883e1d31..1d1305d38efa07641128a865aec1745d2e040b93 100644
>> --- a/fs/fuse/dev_uring.c
>> +++ b/fs/fuse/dev_uring.c
>> @@ -186,6 +186,25 @@ bool fuse_uring_request_expired(struct fuse_conn *fc)
>> return false;
>> }
>>
>> +static void fuse_ring_destruct_q_map(struct fuse_queue_map *q_map)
>> +{
>> + free_cpumask_var(q_map->registered_q_mask);
>> + kfree(q_map->cpu_to_qid);
>> +}
>> +
>> +static void fuse_uring_destruct_q_masks(struct fuse_ring *ring)
>> +{
>> + int node;
>> +
>> + fuse_ring_destruct_q_map(&ring->q_map);
>> +
>> + if (ring->numa_q_map) {
>> + for (node = 0; node < ring->nr_numa_nodes; node++)
>> + fuse_ring_destruct_q_map(&ring->numa_q_map[node]);
>> + kfree(ring->numa_q_map);
>> + }
>> +}
>> +
>> void fuse_uring_destruct(struct fuse_conn *fc)
>> {
>> struct fuse_ring *ring = fc->ring;
>> @@ -217,11 +236,44 @@ void fuse_uring_destruct(struct fuse_conn *fc)
>> ring->queues[qid] = NULL;
>> }
>>
>> + fuse_uring_destruct_q_masks(ring);
>> kfree(ring->queues);
>> kfree(ring);
>> fc->ring = NULL;
>> }
>>
>> +static int fuse_uring_init_q_map(struct fuse_queue_map *q_map, size_t nr_cpu)
>> +{
>> + if (!zalloc_cpumask_var(&q_map->registered_q_mask, GFP_KERNEL_ACCOUNT))
>> + return -ENOMEM;
>> +
>> + q_map->cpu_to_qid = kzalloc_objs(*q_map->cpu_to_qid, nr_cpu,
>> + GFP_KERNEL_ACCOUNT);
>> +
>
> Missing NULL check here (and corresponding call to free_cpumask_var()).
>
>>
>> + return 0;
>> +}
>> +
>> +static int fuse_uring_create_q_masks(struct fuse_ring *ring)
>> +{
>> + int err, node;
>> +
>> + err = fuse_uring_init_q_map(&ring->q_map, ring->max_nr_queues);
>> + if (err)
>> + return err;
>> +
>> + ring->numa_q_map = kzalloc_objs(*ring->numa_q_map, ring->nr_numa_nodes,
>> + GFP_KERNEL_ACCOUNT);
>> + if (!ring->numa_q_map)
>> + return -ENOMEM;
>
> Missing call to fuse_ring_destruct_q_map().
>
>> + for (node = 0; node < ring->nr_numa_nodes; node++) {
>> + err = fuse_uring_init_q_map(&ring->numa_q_map[node],
>> + ring->max_nr_queues);
>> + if (err)
>> + return err;
>
> Cleanup also missing here.
I don't think so, for either of them - see the error handling in
fuse_uring_create_q_masks().
Thanks,
Bernd
* [PATCH v4 4/8] fuse: Fetch a queued fuse request on command registration
2026-04-13 9:41 [PATCH v4 0/8] fuse: {io-uring} Allow to reduce the number of queues and request distribution Bernd Schubert via B4 Relay
` (2 preceding siblings ...)
2026-04-13 9:41 ` [PATCH v4 3/8] fuse: {io-uring} Use bitmaps to track registered queues Bernd Schubert via B4 Relay
@ 2026-04-13 9:41 ` Bernd Schubert via B4 Relay
2026-04-13 9:41 ` [PATCH v4 5/8] fuse: {io-uring} Allow reduced number of ring queues Bernd Schubert via B4 Relay
` (3 subsequent siblings)
7 siblings, 0 replies; 16+ messages in thread
From: Bernd Schubert via B4 Relay @ 2026-04-13 9:41 UTC (permalink / raw)
To: Miklos Szeredi
Cc: Joanne Koong, linux-fsdevel, Luis Henriques, Gang He,
Bernd Schubert, Bernd Schubert
From: Bernd Schubert <bschubert@ddn.com>
This is a preparation for the upcoming reduced fuse-io-uring queue
feature, i.e. fewer queues than system cores.
With the reduced queue feature, fuse-io-uring is marked as ready
after receiving the 1st ring entry, as any available queue
can then handle ring requests. At that time other queues might
still be in the process of registration, and a race can happen:
fuse_uring_queue_fuse_req -> no queue entry registered yet
list_add_tail -> fuse request gets queued
So far, fetching requests from the list only happened on
FUSE_IO_URING_CMD_COMMIT_AND_FETCH; without new requests on the
same queue, requests already queued there would never be sent -
they would be stuck.
Signed-off-by: Bernd Schubert <bernd@bsbernd.com>
---
fs/fuse/dev_uring.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 1d1305d38efa07641128a865aec1745d2e040b93..9dcbc39531f0e019e5abf58a29cdf6c75fafdca1 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -1196,6 +1196,8 @@ static int fuse_uring_register(struct io_uring_cmd *cmd,
fuse_uring_do_register(ent, cmd, issue_flags);
+ fuse_uring_next_fuse_req(ent, queue, issue_flags);
+
return 0;
}
--
2.43.0
* [PATCH v4 5/8] fuse: {io-uring} Allow reduced number of ring queues
2026-04-13 9:41 [PATCH v4 0/8] fuse: {io-uring} Allow to reduce the number of queues and request distribution Bernd Schubert via B4 Relay
` (3 preceding siblings ...)
2026-04-13 9:41 ` [PATCH v4 4/8] fuse: Fetch a queued fuse request on command registration Bernd Schubert via B4 Relay
@ 2026-04-13 9:41 ` Bernd Schubert via B4 Relay
2026-04-24 15:15 ` Luis Henriques
2026-04-24 18:28 ` Joanne Koong
2026-04-13 9:41 ` [PATCH v4 6/8] fuse: {io-uring} Queue background requests on a different core Bernd Schubert via B4 Relay
` (2 subsequent siblings)
7 siblings, 2 replies; 16+ messages in thread
From: Bernd Schubert via B4 Relay @ 2026-04-13 9:41 UTC (permalink / raw)
To: Miklos Szeredi
Cc: Joanne Koong, linux-fsdevel, Luis Henriques, Gang He,
Bernd Schubert
From: Bernd Schubert <bschubert@ddn.com>
Queues selection (fuse_uring_get_queue) can handle reduced number
queues - using io-uring is possible now even with a single
queue and entry.
The FUSE_URING_REDUCED_Q flag is being introduce tell fuse server that
reduced queues are possible, i.e. if the flag is set, fuse server
is free to reduce number queues.
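For illustration: assume 8 possible CPUs on a single NUMA node and
that queues 0, 2 and 5 have been registered. The mapping loop added
below then distributes the CPUs round-robin over the registered
queues:
	cpu: 0 1 2 3 4 5 6 7
	qid: 0 2 5 0 2 5 0 2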
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
fs/fuse/dev_uring.c | 160 ++++++++++++++++++++++++----------------------
fs/fuse/inode.c | 2 +-
include/uapi/linux/fuse.h | 3 +
3 files changed, 88 insertions(+), 77 deletions(-)
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 9dcbc39531f0e019e5abf58a29cdf6c75fafdca1..e68089babaf89fb81741e4a5e605c6e36a137f9e 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -249,15 +249,17 @@ static int fuse_uring_init_q_map(struct fuse_queue_map *q_map, size_t nr_cpu)
q_map->cpu_to_qid = kzalloc_objs(*q_map->cpu_to_qid, nr_cpu,
GFP_KERNEL_ACCOUNT);
+ if (!q_map->cpu_to_qid)
+ return -ENOMEM;
return 0;
}
-static int fuse_uring_create_q_masks(struct fuse_ring *ring)
+static int fuse_uring_create_q_masks(struct fuse_ring *ring, size_t nr_queues)
{
int err, node;
- err = fuse_uring_init_q_map(&ring->q_map, ring->max_nr_queues);
+ err = fuse_uring_init_q_map(&ring->q_map, nr_queues);
if (err)
return err;
@@ -267,7 +269,7 @@ static int fuse_uring_create_q_masks(struct fuse_ring *ring)
return -ENOMEM;
for (node = 0; node < ring->nr_numa_nodes; node++) {
err = fuse_uring_init_q_map(&ring->numa_q_map[node],
- ring->max_nr_queues);
+ nr_queues);
if (err)
return err;
}
@@ -299,7 +301,7 @@ static struct fuse_ring *fuse_uring_create(struct fuse_conn *fc)
max_payload_size = max(FUSE_MIN_READ_BUFFER, fc->max_write);
max_payload_size = max(max_payload_size, fc->max_pages * PAGE_SIZE);
- err = fuse_uring_create_q_masks(ring);
+ err = fuse_uring_create_q_masks(ring, nr_queues);
if (err)
goto out_err;
@@ -328,12 +330,36 @@ static struct fuse_ring *fuse_uring_create(struct fuse_conn *fc)
return res;
}
+static void fuse_uring_cpu_qid_mapping(struct fuse_ring *ring, int qid,
+ struct fuse_queue_map *q_map,
+ int node)
+{
+ int cpu, qid_idx, mapping_count = 0;
+ size_t nr_queues;
+
+ cpumask_set_cpu(qid, q_map->registered_q_mask);
+ nr_queues = cpumask_weight(q_map->registered_q_mask);
+ for (cpu = 0; cpu < ring->max_nr_queues; cpu++) {
+ if (node != -1 && cpu_to_node(cpu) != node)
+ continue;
+
+ qid_idx = mapping_count % nr_queues;
+ q_map->cpu_to_qid[cpu] = cpumask_nth(qid_idx,
+ q_map->registered_q_mask);
+ mapping_count++;
+ pr_debug("%s node=%d qid=%d qid_idx=%d nr_queues=%zu %d->%d\n",
+ __func__, node, qid, qid_idx, nr_queues, cpu,
+ q_map->cpu_to_qid[cpu]);
+ }
+}
+
static struct fuse_ring_queue *fuse_uring_create_queue(struct fuse_ring *ring,
int qid)
{
struct fuse_conn *fc = ring->fc;
struct fuse_ring_queue *queue;
struct list_head *pq;
+ int node;
queue = kzalloc_obj(*queue, GFP_KERNEL_ACCOUNT);
if (!queue)
@@ -371,6 +397,22 @@ static struct fuse_ring_queue *fuse_uring_create_queue(struct fuse_ring *ring,
* write_once and lock as the caller mostly doesn't take the lock at all
*/
WRITE_ONCE(ring->queues[qid], queue);
+
+ /* Static mapping from cpu to per numa queues */
+ node = cpu_to_node(qid);
+ fuse_uring_cpu_qid_mapping(ring, qid, &ring->numa_q_map[node], node);
+
+ /*
+ * smp_store_release, as the variable is read without fc->lock and
+ * we need to avoid compiler re-ordering of updating the nr_queues
+ * and setting ring->numa_queues[node].cpu_to_qid above
+ */
+ smp_store_release(&ring->numa_q_map[node].nr_queues,
+ ring->numa_q_map[node].nr_queues + 1);
+
+ /* global mapping */
+ fuse_uring_cpu_qid_mapping(ring, qid, &ring->q_map, -1);
+
spin_unlock(&fc->lock);
return queue;
@@ -1021,65 +1063,6 @@ static int fuse_uring_commit_fetch(struct io_uring_cmd *cmd, int issue_flags,
return 0;
}
-static bool is_ring_ready(struct fuse_ring *ring, int current_qid)
-{
- int qid;
- struct fuse_ring_queue *queue;
- bool ready = true;
-
- for (qid = 0; qid < ring->max_nr_queues && ready; qid++) {
- if (current_qid == qid)
- continue;
-
- queue = ring->queues[qid];
- if (!queue) {
- ready = false;
- break;
- }
-
- spin_lock(&queue->lock);
- if (list_empty(&queue->ent_avail_queue))
- ready = false;
- spin_unlock(&queue->lock);
- }
-
- return ready;
-}
-
-/*
- * fuse_uring_req_fetch command handling
- */
-static void fuse_uring_do_register(struct fuse_ring_ent *ent,
- struct io_uring_cmd *cmd,
- unsigned int issue_flags)
-{
- struct fuse_ring_queue *queue = ent->queue;
- struct fuse_ring *ring = queue->ring;
- struct fuse_conn *fc = ring->fc;
- struct fuse_iqueue *fiq = &fc->iq;
- int node = cpu_to_node(queue->qid);
-
- if (WARN_ON_ONCE(node >= ring->nr_numa_nodes))
- node = 0;
-
- fuse_uring_prepare_cancel(cmd, issue_flags, ent);
-
- spin_lock(&queue->lock);
- ent->cmd = cmd;
- fuse_uring_ent_avail(ent, queue);
- spin_unlock(&queue->lock);
-
- if (!ring->ready) {
- bool ready = is_ring_ready(ring, queue->qid);
-
- if (ready) {
- WRITE_ONCE(fiq->ops, &fuse_io_uring_ops);
- WRITE_ONCE(ring->ready, true);
- wake_up_all(&fc->blocked_waitq);
- }
- }
-}
-
/*
* sqe->addr is a ptr to an iovec array, iov[0] has the headers, iov[1]
* the payload
@@ -1163,6 +1146,7 @@ static int fuse_uring_register(struct io_uring_cmd *cmd,
struct fuse_ring *ring = smp_load_acquire(&fc->ring);
struct fuse_ring_queue *queue;
struct fuse_ring_ent *ent;
+ struct fuse_iqueue *fiq = &fc->iq;
int err;
unsigned int qid = READ_ONCE(cmd_req->qid);
@@ -1194,8 +1178,18 @@ static int fuse_uring_register(struct io_uring_cmd *cmd,
if (IS_ERR(ent))
return PTR_ERR(ent);
- fuse_uring_do_register(ent, cmd, issue_flags);
+ fuse_uring_prepare_cancel(cmd, issue_flags, ent);
+ if (!ring->ready) {
+ WRITE_ONCE(fiq->ops, &fuse_io_uring_ops);
+ WRITE_ONCE(ring->ready, true);
+ wake_up_all(&fc->blocked_waitq);
+ }
+ spin_lock(&queue->lock);
+ ent->cmd = cmd;
+ spin_unlock(&queue->lock);
+
+ /* Marks the ring entry as ready */
fuse_uring_next_fuse_req(ent, queue, issue_flags);
return 0;
@@ -1312,22 +1306,36 @@ static void fuse_uring_send_in_task(struct io_tw_req tw_req, io_tw_token_t tw)
fuse_uring_send(ent, cmd, err, issue_flags);
}
-static struct fuse_ring_queue *fuse_uring_task_to_queue(struct fuse_ring *ring)
+static struct fuse_ring_queue *fuse_uring_select_queue(struct fuse_ring *ring)
{
unsigned int qid;
- struct fuse_ring_queue *queue;
+ int node;
+ unsigned int nr_queues;
+ unsigned int cpu = task_cpu(current);
- qid = task_cpu(current);
+ cpu = cpu % ring->max_nr_queues;
- if (WARN_ONCE(qid >= ring->max_nr_queues,
- "Core number (%u) exceeds nr queues (%zu)\n", qid,
- ring->max_nr_queues))
- qid = 0;
+ /* numa local registered queue bitmap */
+ node = cpu_to_node(cpu);
+ if (WARN_ONCE(node >= ring->nr_numa_nodes,
+ "Node number (%d) exceeds nr nodes (%d)\n",
+ node, ring->nr_numa_nodes)) {
+ node = 0;
+ }
- queue = ring->queues[qid];
- WARN_ONCE(!queue, "Missing queue for qid %d\n", qid);
+ nr_queues = READ_ONCE(ring->numa_q_map[node].nr_queues);
+ if (nr_queues) {
+ qid = ring->numa_q_map[node].cpu_to_qid[cpu];
+ if (WARN_ON_ONCE(qid >= ring->max_nr_queues))
+ return NULL;
+ return READ_ONCE(ring->queues[qid]);
+ }
- return queue;
+ /* global registered queue bitmap */
+ qid = ring->q_map.cpu_to_qid[cpu];
+ if (WARN_ON_ONCE(qid >= ring->max_nr_queues))
+ return NULL;
+ return READ_ONCE(ring->queues[qid]);
}
static void fuse_uring_dispatch_ent(struct fuse_ring_ent *ent)
@@ -1348,7 +1356,7 @@ void fuse_uring_queue_fuse_req(struct fuse_iqueue *fiq, struct fuse_req *req)
int err;
err = -EINVAL;
- queue = fuse_uring_task_to_queue(ring);
+ queue = fuse_uring_select_queue(ring);
if (!queue)
goto err;
@@ -1392,7 +1400,7 @@ bool fuse_uring_queue_bq_req(struct fuse_req *req)
struct fuse_ring_queue *queue;
struct fuse_ring_ent *ent = NULL;
- queue = fuse_uring_task_to_queue(ring);
+ queue = fuse_uring_select_queue(ring);
if (!queue)
return false;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index c795abe47a4f4a488b9623c389e4afce43c6647d..5cb903186c29a77727551fe72c4cabf705a22258 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1506,7 +1506,7 @@ static struct fuse_init_args *fuse_new_init(struct fuse_mount *fm)
FUSE_SECURITY_CTX | FUSE_CREATE_SUPP_GROUP |
FUSE_HAS_EXPIRE_ONLY | FUSE_DIRECT_IO_ALLOW_MMAP |
FUSE_NO_EXPORT_SUPPORT | FUSE_HAS_RESEND | FUSE_ALLOW_IDMAP |
- FUSE_REQUEST_TIMEOUT;
+ FUSE_REQUEST_TIMEOUT | FUSE_URING_REDUCED_Q;
#ifdef CONFIG_FUSE_DAX
if (fm->fc->dax)
flags |= FUSE_MAP_ALIGNMENT;
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index c13e1f9a2f12bd39f535188cb5466688eba42263..3da20d9bba1cb6336734511d21da9f64cea0e720 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -448,6 +448,8 @@ struct fuse_file_lock {
* FUSE_OVER_IO_URING: Indicate that client supports io-uring
* FUSE_REQUEST_TIMEOUT: kernel supports timing out requests.
* init_out.request_timeout contains the timeout (in secs)
+ * FUSE_URING_REDUCED_Q: Client (kernel) supports less queues - Server is free
+ * to register between 1 and nr-core io-uring queues
*/
#define FUSE_ASYNC_READ (1 << 0)
#define FUSE_POSIX_LOCKS (1 << 1)
@@ -495,6 +497,7 @@ struct fuse_file_lock {
#define FUSE_ALLOW_IDMAP (1ULL << 40)
#define FUSE_OVER_IO_URING (1ULL << 41)
#define FUSE_REQUEST_TIMEOUT (1ULL << 42)
+#define FUSE_URING_REDUCED_Q (1ULL << 43)
/**
* CUSE INIT request/reply flags
--
2.43.0
* Re: [PATCH v4 5/8] fuse: {io-uring} Allow reduced number of ring queues
2026-04-13 9:41 ` [PATCH v4 5/8] fuse: {io-uring} Allow reduced number of ring queues Bernd Schubert via B4 Relay
@ 2026-04-24 15:15 ` Luis Henriques
2026-04-24 18:28 ` Joanne Koong
1 sibling, 0 replies; 16+ messages in thread
From: Luis Henriques @ 2026-04-24 15:15 UTC (permalink / raw)
To: Bernd Schubert via B4 Relay
Cc: Miklos Szeredi, bernd, Joanne Koong, linux-fsdevel, Gang He,
Bernd Schubert
On Mon, Apr 13 2026, Bernd Schubert via B4 Relay wrote:
> From: Bernd Schubert <bschubert@ddn.com>
>
> Queues selection (fuse_uring_get_queue) can handle reduced number
> queues - using io-uring is possible now even with a single
> queue and entry.
>
> The FUSE_URING_REDUCED_Q flag is being introduce tell fuse server that
nit: "[...] introduced to tell the fuse server [...]"
> reduced queues are possible, i.e. if the flag is set, fuse server
> is free to reduce number queues.
>
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> ---
> fs/fuse/dev_uring.c | 160 ++++++++++++++++++++++++----------------------
> fs/fuse/inode.c | 2 +-
> include/uapi/linux/fuse.h | 3 +
> 3 files changed, 88 insertions(+), 77 deletions(-)
>
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index 9dcbc39531f0e019e5abf58a29cdf6c75fafdca1..e68089babaf89fb81741e4a5e605c6e36a137f9e 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -249,15 +249,17 @@ static int fuse_uring_init_q_map(struct fuse_queue_map *q_map, size_t nr_cpu)
>
> q_map->cpu_to_qid = kzalloc_objs(*q_map->cpu_to_qid, nr_cpu,
> GFP_KERNEL_ACCOUNT);
> + if (!q_map->cpu_to_qid)
> + return -ENOMEM;
Ah! I guess this belongs to patch 0001. And, as I commented there, it
still misses the free_cpumask_var().
>
> return 0;
> }
>
> -static int fuse_uring_create_q_masks(struct fuse_ring *ring)
> +static int fuse_uring_create_q_masks(struct fuse_ring *ring, size_t nr_queues)
> {
> int err, node;
>
> - err = fuse_uring_init_q_map(&ring->q_map, ring->max_nr_queues);
> + err = fuse_uring_init_q_map(&ring->q_map, nr_queues);
> if (err)
> return err;
>
> @@ -267,7 +269,7 @@ static int fuse_uring_create_q_masks(struct fuse_ring *ring)
> return -ENOMEM;
> for (node = 0; node < ring->nr_numa_nodes; node++) {
> err = fuse_uring_init_q_map(&ring->numa_q_map[node],
> - ring->max_nr_queues);
> + nr_queues);
> if (err)
> return err;
> }
> @@ -299,7 +301,7 @@ static struct fuse_ring *fuse_uring_create(struct fuse_conn *fc)
> max_payload_size = max(FUSE_MIN_READ_BUFFER, fc->max_write);
> max_payload_size = max(max_payload_size, fc->max_pages * PAGE_SIZE);
>
> - err = fuse_uring_create_q_masks(ring);
> + err = fuse_uring_create_q_masks(ring, nr_queues);
> if (err)
> goto out_err;
>
> @@ -328,12 +330,36 @@ static struct fuse_ring *fuse_uring_create(struct fuse_conn *fc)
> return res;
> }
>
> +static void fuse_uring_cpu_qid_mapping(struct fuse_ring *ring, int qid,
> + struct fuse_queue_map *q_map,
> + int node)
> +{
> + int cpu, qid_idx, mapping_count = 0;
> + size_t nr_queues;
> +
> + cpumask_set_cpu(qid, q_map->registered_q_mask);
> + nr_queues = cpumask_weight(q_map->registered_q_mask);
> + for (cpu = 0; cpu < ring->max_nr_queues; cpu++) {
> + if (node != -1 && cpu_to_node(cpu) != node)
> + continue;
> +
> + qid_idx = mapping_count % nr_queues;
> + q_map->cpu_to_qid[cpu] = cpumask_nth(qid_idx,
> + q_map->registered_q_mask);
> + mapping_count++;
> + pr_debug("%s node=%d qid=%d qid_idx=%d nr_queues=%zu %d->%d\n",
> + __func__, node, qid, qid_idx, nr_queues, cpu,
> + q_map->cpu_to_qid[cpu]);
> + }
> +}
> +
> static struct fuse_ring_queue *fuse_uring_create_queue(struct fuse_ring *ring,
> int qid)
> {
> struct fuse_conn *fc = ring->fc;
> struct fuse_ring_queue *queue;
> struct list_head *pq;
> + int node;
>
> queue = kzalloc_obj(*queue, GFP_KERNEL_ACCOUNT);
> if (!queue)
> @@ -371,6 +397,22 @@ static struct fuse_ring_queue *fuse_uring_create_queue(struct fuse_ring *ring,
> * write_once and lock as the caller mostly doesn't take the lock at all
> */
> WRITE_ONCE(ring->queues[qid], queue);
> +
> + /* Static mapping from cpu to per numa queues */
> + node = cpu_to_node(qid);
> + fuse_uring_cpu_qid_mapping(ring, qid, &ring->numa_q_map[node], node);
> +
> + /*
> + * smp_store_release, as the variable is read without fc->lock and
> + * we need to avoid compiler re-ordering of updating the nr_queues
> + * and setting ring->numa_queues[node].cpu_to_qid above
> + */
> + smp_store_release(&ring->numa_q_map[node].nr_queues,
> + ring->numa_q_map[node].nr_queues + 1);
> +
> + /* global mapping */
> + fuse_uring_cpu_qid_mapping(ring, qid, &ring->q_map, -1);
> +
> spin_unlock(&fc->lock);
>
> return queue;
> @@ -1021,65 +1063,6 @@ static int fuse_uring_commit_fetch(struct io_uring_cmd *cmd, int issue_flags,
> return 0;
> }
>
> -static bool is_ring_ready(struct fuse_ring *ring, int current_qid)
> -{
> - int qid;
> - struct fuse_ring_queue *queue;
> - bool ready = true;
> -
> - for (qid = 0; qid < ring->max_nr_queues && ready; qid++) {
> - if (current_qid == qid)
> - continue;
> -
> - queue = ring->queues[qid];
> - if (!queue) {
> - ready = false;
> - break;
> - }
> -
> - spin_lock(&queue->lock);
> - if (list_empty(&queue->ent_avail_queue))
> - ready = false;
> - spin_unlock(&queue->lock);
> - }
> -
> - return ready;
> -}
> -
> -/*
> - * fuse_uring_req_fetch command handling
> - */
> -static void fuse_uring_do_register(struct fuse_ring_ent *ent,
> - struct io_uring_cmd *cmd,
> - unsigned int issue_flags)
> -{
> - struct fuse_ring_queue *queue = ent->queue;
> - struct fuse_ring *ring = queue->ring;
> - struct fuse_conn *fc = ring->fc;
> - struct fuse_iqueue *fiq = &fc->iq;
> - int node = cpu_to_node(queue->qid);
> -
> - if (WARN_ON_ONCE(node >= ring->nr_numa_nodes))
> - node = 0;
> -
> - fuse_uring_prepare_cancel(cmd, issue_flags, ent);
> -
> - spin_lock(&queue->lock);
> - ent->cmd = cmd;
> - fuse_uring_ent_avail(ent, queue);
> - spin_unlock(&queue->lock);
> -
> - if (!ring->ready) {
> - bool ready = is_ring_ready(ring, queue->qid);
> -
> - if (ready) {
> - WRITE_ONCE(fiq->ops, &fuse_io_uring_ops);
> - WRITE_ONCE(ring->ready, true);
> - wake_up_all(&fc->blocked_waitq);
> - }
> - }
> -}
> -
> /*
> * sqe->addr is a ptr to an iovec array, iov[0] has the headers, iov[1]
> * the payload
> @@ -1163,6 +1146,7 @@ static int fuse_uring_register(struct io_uring_cmd *cmd,
> struct fuse_ring *ring = smp_load_acquire(&fc->ring);
> struct fuse_ring_queue *queue;
> struct fuse_ring_ent *ent;
> + struct fuse_iqueue *fiq = &fc->iq;
> int err;
> unsigned int qid = READ_ONCE(cmd_req->qid);
>
> @@ -1194,8 +1178,18 @@ static int fuse_uring_register(struct io_uring_cmd *cmd,
> if (IS_ERR(ent))
> return PTR_ERR(ent);
>
> - fuse_uring_do_register(ent, cmd, issue_flags);
> + fuse_uring_prepare_cancel(cmd, issue_flags, ent);
> + if (!ring->ready) {
> + WRITE_ONCE(fiq->ops, &fuse_io_uring_ops);
> + WRITE_ONCE(ring->ready, true);
> + wake_up_all(&fc->blocked_waitq);
> + }
>
> + spin_lock(&queue->lock);
> + ent->cmd = cmd;
> + spin_unlock(&queue->lock);
> +
> + /* Marks the ring entry as ready */
> fuse_uring_next_fuse_req(ent, queue, issue_flags);
>
> return 0;
> @@ -1312,22 +1306,36 @@ static void fuse_uring_send_in_task(struct io_tw_req tw_req, io_tw_token_t tw)
> fuse_uring_send(ent, cmd, err, issue_flags);
> }
>
> -static struct fuse_ring_queue *fuse_uring_task_to_queue(struct fuse_ring *ring)
> +static struct fuse_ring_queue *fuse_uring_select_queue(struct fuse_ring *ring)
> {
> unsigned int qid;
> - struct fuse_ring_queue *queue;
> + int node;
> + unsigned int nr_queues;
> + unsigned int cpu = task_cpu(current);
>
> - qid = task_cpu(current);
> + cpu = cpu % ring->max_nr_queues;
nit: why set cpu twice?
Cheers,
--
Luís
>
> - if (WARN_ONCE(qid >= ring->max_nr_queues,
> - "Core number (%u) exceeds nr queues (%zu)\n", qid,
> - ring->max_nr_queues))
> - qid = 0;
> + /* numa local registered queue bitmap */
> + node = cpu_to_node(cpu);
> + if (WARN_ONCE(node >= ring->nr_numa_nodes,
> + "Node number (%d) exceeds nr nodes (%d)\n",
> + node, ring->nr_numa_nodes)) {
> + node = 0;
> + }
>
> - queue = ring->queues[qid];
> - WARN_ONCE(!queue, "Missing queue for qid %d\n", qid);
> + nr_queues = READ_ONCE(ring->numa_q_map[node].nr_queues);
> + if (nr_queues) {
> + qid = ring->numa_q_map[node].cpu_to_qid[cpu];
> + if (WARN_ON_ONCE(qid >= ring->max_nr_queues))
> + return NULL;
> + return READ_ONCE(ring->queues[qid]);
> + }
>
> - return queue;
> + /* global registered queue bitmap */
> + qid = ring->q_map.cpu_to_qid[cpu];
> + if (WARN_ON_ONCE(qid >= ring->max_nr_queues))
> + return NULL;
> + return READ_ONCE(ring->queues[qid]);
> }
>
> static void fuse_uring_dispatch_ent(struct fuse_ring_ent *ent)
> @@ -1348,7 +1356,7 @@ void fuse_uring_queue_fuse_req(struct fuse_iqueue *fiq, struct fuse_req *req)
> int err;
>
> err = -EINVAL;
> - queue = fuse_uring_task_to_queue(ring);
> + queue = fuse_uring_select_queue(ring);
> if (!queue)
> goto err;
>
> @@ -1392,7 +1400,7 @@ bool fuse_uring_queue_bq_req(struct fuse_req *req)
> struct fuse_ring_queue *queue;
> struct fuse_ring_ent *ent = NULL;
>
> - queue = fuse_uring_task_to_queue(ring);
> + queue = fuse_uring_select_queue(ring);
> if (!queue)
> return false;
>
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index c795abe47a4f4a488b9623c389e4afce43c6647d..5cb903186c29a77727551fe72c4cabf705a22258 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -1506,7 +1506,7 @@ static struct fuse_init_args *fuse_new_init(struct fuse_mount *fm)
> FUSE_SECURITY_CTX | FUSE_CREATE_SUPP_GROUP |
> FUSE_HAS_EXPIRE_ONLY | FUSE_DIRECT_IO_ALLOW_MMAP |
> FUSE_NO_EXPORT_SUPPORT | FUSE_HAS_RESEND | FUSE_ALLOW_IDMAP |
> - FUSE_REQUEST_TIMEOUT;
> + FUSE_REQUEST_TIMEOUT | FUSE_URING_REDUCED_Q;
> #ifdef CONFIG_FUSE_DAX
> if (fm->fc->dax)
> flags |= FUSE_MAP_ALIGNMENT;
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index c13e1f9a2f12bd39f535188cb5466688eba42263..3da20d9bba1cb6336734511d21da9f64cea0e720 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -448,6 +448,8 @@ struct fuse_file_lock {
> * FUSE_OVER_IO_URING: Indicate that client supports io-uring
> * FUSE_REQUEST_TIMEOUT: kernel supports timing out requests.
> * init_out.request_timeout contains the timeout (in secs)
> + * FUSE_URING_REDUCED_Q: Client (kernel) supports less queues - Server is free
> + * to register between 1 and nr-core io-uring queues
> */
> #define FUSE_ASYNC_READ (1 << 0)
> #define FUSE_POSIX_LOCKS (1 << 1)
> @@ -495,6 +497,7 @@ struct fuse_file_lock {
> #define FUSE_ALLOW_IDMAP (1ULL << 40)
> #define FUSE_OVER_IO_URING (1ULL << 41)
> #define FUSE_REQUEST_TIMEOUT (1ULL << 42)
> +#define FUSE_URING_REDUCED_Q (1ULL << 43)
>
> /**
> * CUSE INIT request/reply flags
>
> --
> 2.43.0
>
>
* Re: [PATCH v4 5/8] fuse: {io-uring} Allow reduced number of ring queues
2026-04-13 9:41 ` [PATCH v4 5/8] fuse: {io-uring} Allow reduced number of ring queues Bernd Schubert via B4 Relay
2026-04-24 15:15 ` Luis Henriques
@ 2026-04-24 18:28 ` Joanne Koong
2026-04-24 22:00 ` Bernd Schubert
1 sibling, 1 reply; 16+ messages in thread
From: Joanne Koong @ 2026-04-24 18:28 UTC (permalink / raw)
To: bernd
Cc: Miklos Szeredi, linux-fsdevel, Luis Henriques, Gang He,
Bernd Schubert
On Mon, Apr 13, 2026 at 2:41 AM Bernd Schubert via B4 Relay
<devnull+bernd.bsbernd.com@kernel.org> wrote:
>
> From: Bernd Schubert <bschubert@ddn.com>
>
> Queues selection (fuse_uring_get_queue) can handle reduced number
> queues - using io-uring is possible now even with a single
> queue and entry.
>
> The FUSE_URING_REDUCED_Q flag is being introduce tell fuse server that
> reduced queues are possible, i.e. if the flag is set, fuse server
> is free to reduce number queues.
>
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> ---
> fs/fuse/dev_uring.c | 160 ++++++++++++++++++++++++----------------------
> fs/fuse/inode.c | 2 +-
> include/uapi/linux/fuse.h | 3 +
> 3 files changed, 88 insertions(+), 77 deletions(-)
>
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index 9dcbc39531f0e019e5abf58a29cdf6c75fafdca1..e68089babaf89fb81741e4a5e605c6e36a137f9e 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -249,15 +249,17 @@ static int fuse_uring_init_q_map(struct fuse_queue_map *q_map, size_t nr_cpu)
>
> q_map->cpu_to_qid = kzalloc_objs(*q_map->cpu_to_qid, nr_cpu,
> GFP_KERNEL_ACCOUNT);
> + if (!q_map->cpu_to_qid)
> + return -ENOMEM;
>
> return 0;
> }
>
> -static int fuse_uring_create_q_masks(struct fuse_ring *ring)
> +static int fuse_uring_create_q_masks(struct fuse_ring *ring, size_t nr_queues)
> {
> int err, node;
>
> - err = fuse_uring_init_q_map(&ring->q_map, ring->max_nr_queues);
> + err = fuse_uring_init_q_map(&ring->q_map, nr_queues);
> if (err)
> return err;
>
> @@ -267,7 +269,7 @@ static int fuse_uring_create_q_masks(struct fuse_ring *ring)
> return -ENOMEM;
> for (node = 0; node < ring->nr_numa_nodes; node++) {
> err = fuse_uring_init_q_map(&ring->numa_q_map[node],
> - ring->max_nr_queues);
> + nr_queues);
> if (err)
> return err;
> }
> @@ -299,7 +301,7 @@ static struct fuse_ring *fuse_uring_create(struct fuse_conn *fc)
> max_payload_size = max(FUSE_MIN_READ_BUFFER, fc->max_write);
> max_payload_size = max(max_payload_size, fc->max_pages * PAGE_SIZE);
>
> - err = fuse_uring_create_q_masks(ring);
> + err = fuse_uring_create_q_masks(ring, nr_queues);
> if (err)
> goto out_err;
>
> @@ -328,12 +330,36 @@ static struct fuse_ring *fuse_uring_create(struct fuse_conn *fc)
> return res;
> }
>
> +static void fuse_uring_cpu_qid_mapping(struct fuse_ring *ring, int qid,
> + struct fuse_queue_map *q_map,
> + int node)
> +{
> + int cpu, qid_idx, mapping_count = 0;
> + size_t nr_queues;
> +
> + cpumask_set_cpu(qid, q_map->registered_q_mask);
> + nr_queues = cpumask_weight(q_map->registered_q_mask);
> + for (cpu = 0; cpu < ring->max_nr_queues; cpu++) {
> + if (node != -1 && cpu_to_node(cpu) != node)
> + continue;
> +
> + qid_idx = mapping_count % nr_queues;
> + q_map->cpu_to_qid[cpu] = cpumask_nth(qid_idx,
> + q_map->registered_q_mask);
> + mapping_count++;
> + pr_debug("%s node=%d qid=%d qid_idx=%d nr_queues=%zu %d->%d\n",
> + __func__, node, qid, qid_idx, nr_queues, cpu,
> + q_map->cpu_to_qid[cpu]);
> + }
> +}
> +
>
> -static struct fuse_ring_queue *fuse_uring_task_to_queue(struct fuse_ring *ring)
> +static struct fuse_ring_queue *fuse_uring_select_queue(struct fuse_ring *ring)
> {
> unsigned int qid;
> - struct fuse_ring_queue *queue;
> + int node;
> + unsigned int nr_queues;
> + unsigned int cpu = task_cpu(current);
>
> - qid = task_cpu(current);
> + cpu = cpu % ring->max_nr_queues;
>
> - if (WARN_ONCE(qid >= ring->max_nr_queues,
> - "Core number (%u) exceeds nr queues (%zu)\n", qid,
> - ring->max_nr_queues))
> - qid = 0;
> + /* numa local registered queue bitmap */
> + node = cpu_to_node(cpu);
> + if (WARN_ONCE(node >= ring->nr_numa_nodes,
> + "Node number (%d) exceeds nr nodes (%d)\n",
> + node, ring->nr_numa_nodes)) {
> + node = 0;
> + }
>
> - queue = ring->queues[qid];
> - WARN_ONCE(!queue, "Missing queue for qid %d\n", qid);
> + nr_queues = READ_ONCE(ring->numa_q_map[node].nr_queues);
> + if (nr_queues) {
> + qid = ring->numa_q_map[node].cpu_to_qid[cpu];
> + if (WARN_ON_ONCE(qid >= ring->max_nr_queues))
> + return NULL;
> + return READ_ONCE(ring->queues[qid]);
> + }
Hi Bernd,
Thanks for making the changes on this - I really like how much simpler
the logic is now.
I'm looking through how the block multiqueue code works
(block/blk-mq.c and block/blk-mq-cpumap.c) because I think they
basically have to solve the same problem, i.e. figuring out which
queue a request from a given cpu should be dispatched to.
It looks like what they do is use group_cpus_evenly(), which as I
understand it, will partition CPUs taking into account numa nodes (as
well as clustering and SMT siblings). I think if we use this for fuse
io-uring, it will make things a lot simpler and we could get rid of
the per-numa state tracking (eg numa_q_map, registered_q_mask,
nr_numa_nodes) and simplify queue selection where now that can just
be a cpu to qid lookup instead of a two-level
numa-then-global-fallback lookup.
Do you think something like this makes sense?
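A minimal sketch of what I mean, modeled on blk_mq_map_queues() in
block/blk-mq-cpumap.c (fuse_uring_build_cpu_map() is a made-up name,
untested):
static int fuse_uring_build_cpu_map(struct fuse_ring *ring,
				    unsigned int nr_queues)
{
	struct cpumask *masks;
	unsigned int queue, cpu;

	/* partition the CPUs into nr_queues topology-aware groups */
	masks = group_cpus_evenly(nr_queues);
	if (!masks)
		return -ENOMEM;

	for (queue = 0; queue < nr_queues; queue++)
		for_each_cpu(cpu, &masks[queue])
			ring->q_map.cpu_to_qid[cpu] = queue;

	kfree(masks);
	return 0;
}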
Additionally, as I understand it, in this series, the ring->q_map
mapping has to get rebuilt every time a new queue gets created. What
do you think about just having the server declare the total queue
count upfront and then the mapping can just get established at ring
creation time? group_cpus_evenly() would only need to be called once,
the cpu_to_qid map would only have to be built once, and we could
avoid the rebuild-on-each-queue-creation complexity entirely. Do you
think something like this makes sense?
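E.g. (purely hypothetical sketch - such a field does not exist in the
uapi today) the server could declare the count in the register
command, and the maps would be sized and filled once at ring creation:
	/* hypothetical: server-declared queue count */
	nr_queues = clamp_t(size_t, READ_ONCE(cmd_req->nr_queues), 1,
			    num_possible_cpus());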
Thanks,
Joanne
>
> - return queue;
> + /* global registered queue bitmap */
> + qid = ring->q_map.cpu_to_qid[cpu];
> + if (WARN_ON_ONCE(qid >= ring->max_nr_queues))
> + return NULL;
> + return READ_ONCE(ring->queues[qid]);
> }
>
> static void fuse_uring_dispatch_ent(struct fuse_ring_ent *ent)
> @@ -1348,7 +1356,7 @@ void fuse_uring_queue_fuse_req(struct fuse_iqueue *fiq, struct fuse_req *req)
> int err;
>
> err = -EINVAL;
> - queue = fuse_uring_task_to_queue(ring);
> + queue = fuse_uring_select_queue(ring);
> if (!queue)
> goto err;
>
> @@ -1392,7 +1400,7 @@ bool fuse_uring_queue_bq_req(struct fuse_req *req)
> struct fuse_ring_queue *queue;
> struct fuse_ring_ent *ent = NULL;
>
> - queue = fuse_uring_task_to_queue(ring);
> + queue = fuse_uring_select_queue(ring);
> if (!queue)
> return false;
>
* Re: [PATCH v4 5/8] fuse: {io-uring} Allow reduced number of ring queues
2026-04-24 18:28 ` Joanne Koong
@ 2026-04-24 22:00 ` Bernd Schubert
0 siblings, 0 replies; 16+ messages in thread
From: Bernd Schubert @ 2026-04-24 22:00 UTC (permalink / raw)
To: Joanne Koong, bernd@bsbernd.com
Cc: Miklos Szeredi, linux-fsdevel@vger.kernel.org, Luis Henriques,
Gang He
On 4/24/26 20:28, Joanne Koong wrote:
> On Mon, Apr 13, 2026 at 2:41 AM Bernd Schubert via B4 Relay
> <devnull+bernd.bsbernd.com@kernel.org> wrote:
>>
>> From: Bernd Schubert <bschubert@ddn.com>
>>
>> Queues selection (fuse_uring_get_queue) can handle reduced number
>> queues - using io-uring is possible now even with a single
>> queue and entry.
>>
>> The FUSE_URING_REDUCED_Q flag is being introduce tell fuse server that
>> reduced queues are possible, i.e. if the flag is set, fuse server
>> is free to reduce number queues.
>>
>> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
>> ---
>> fs/fuse/dev_uring.c | 160 ++++++++++++++++++++++++----------------------
>> fs/fuse/inode.c | 2 +-
>> include/uapi/linux/fuse.h | 3 +
>> 3 files changed, 88 insertions(+), 77 deletions(-)
>>
>> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
>> index 9dcbc39531f0e019e5abf58a29cdf6c75fafdca1..e68089babaf89fb81741e4a5e605c6e36a137f9e 100644
>> --- a/fs/fuse/dev_uring.c
>> +++ b/fs/fuse/dev_uring.c
>> @@ -249,15 +249,17 @@ static int fuse_uring_init_q_map(struct fuse_queue_map *q_map, size_t nr_cpu)
>>
>> q_map->cpu_to_qid = kzalloc_objs(*q_map->cpu_to_qid, nr_cpu,
>> GFP_KERNEL_ACCOUNT);
>> + if (!q_map->cpu_to_qid)
>> + return -ENOMEM;
>>
>> return 0;
>> }
>>
>> -static int fuse_uring_create_q_masks(struct fuse_ring *ring)
>> +static int fuse_uring_create_q_masks(struct fuse_ring *ring, size_t nr_queues)
>> {
>> int err, node;
>>
>> - err = fuse_uring_init_q_map(&ring->q_map, ring->max_nr_queues);
>> + err = fuse_uring_init_q_map(&ring->q_map, nr_queues);
>> if (err)
>> return err;
>>
>> @@ -267,7 +269,7 @@ static int fuse_uring_create_q_masks(struct fuse_ring *ring)
>> return -ENOMEM;
>> for (node = 0; node < ring->nr_numa_nodes; node++) {
>> err = fuse_uring_init_q_map(&ring->numa_q_map[node],
>> - ring->max_nr_queues);
>> + nr_queues);
>> if (err)
>> return err;
>> }
>> @@ -299,7 +301,7 @@ static struct fuse_ring *fuse_uring_create(struct fuse_conn *fc)
>> max_payload_size = max(FUSE_MIN_READ_BUFFER, fc->max_write);
>> max_payload_size = max(max_payload_size, fc->max_pages * PAGE_SIZE);
>>
>> - err = fuse_uring_create_q_masks(ring);
>> + err = fuse_uring_create_q_masks(ring, nr_queues);
>> if (err)
>> goto out_err;
>>
>> @@ -328,12 +330,36 @@ static struct fuse_ring *fuse_uring_create(struct fuse_conn *fc)
>> return res;
>> }
>>
>> +static void fuse_uring_cpu_qid_mapping(struct fuse_ring *ring, int qid,
>> + struct fuse_queue_map *q_map,
>> + int node)
>> +{
>> + int cpu, qid_idx, mapping_count = 0;
>> + size_t nr_queues;
>> +
>> + cpumask_set_cpu(qid, q_map->registered_q_mask);
>> + nr_queues = cpumask_weight(q_map->registered_q_mask);
>> + for (cpu = 0; cpu < ring->max_nr_queues; cpu++) {
>> + if (node != -1 && cpu_to_node(cpu) != node)
>> + continue;
>> +
>> + qid_idx = mapping_count % nr_queues;
>> + q_map->cpu_to_qid[cpu] = cpumask_nth(qid_idx,
>> + q_map->registered_q_mask);
>> + mapping_count++;
>> + pr_debug("%s node=%d qid=%d qid_idx=%d nr_queues=%zu %d->%d\n",
>> + __func__, node, qid, qid_idx, nr_queues, cpu,
>> + q_map->cpu_to_qid[cpu]);
>> + }
>> +}
>> +
>>
>> -static struct fuse_ring_queue *fuse_uring_task_to_queue(struct fuse_ring *ring)
>> +static struct fuse_ring_queue *fuse_uring_select_queue(struct fuse_ring *ring)
>> {
>> unsigned int qid;
>> - struct fuse_ring_queue *queue;
>> + int node;
>> + unsigned int nr_queues;
>> + unsigned int cpu = task_cpu(current);
>>
>> - qid = task_cpu(current);
>> + cpu = cpu % ring->max_nr_queues;
>>
>> - if (WARN_ONCE(qid >= ring->max_nr_queues,
>> - "Core number (%u) exceeds nr queues (%zu)\n", qid,
>> - ring->max_nr_queues))
>> - qid = 0;
>> + /* numa local registered queue bitmap */
>> + node = cpu_to_node(cpu);
>> + if (WARN_ONCE(node >= ring->nr_numa_nodes,
>> + "Node number (%d) exceeds nr nodes (%d)\n",
>> + node, ring->nr_numa_nodes)) {
>> + node = 0;
>> + }
>>
>> - queue = ring->queues[qid];
>> - WARN_ONCE(!queue, "Missing queue for qid %d\n", qid);
>> + nr_queues = READ_ONCE(ring->numa_q_map[node].nr_queues);
>> + if (nr_queues) {
>> + qid = ring->numa_q_map[node].cpu_to_qid[cpu];
>> + if (WARN_ON_ONCE(qid >= ring->max_nr_queues))
>> + return NULL;
>> + return READ_ONCE(ring->queues[qid]);
>> + }
>
> Hi Bernd,
>
> Thanks for making the changes on this - I really like how much simpler
> the logic is now.
>
> I'm looking through how the block multiqueue code works
> (block/blk-mq.c and block/blk-mq-cpumap.c) because I think they
> basically have to do the same thing with figuring out which cpu to
> dispatch a request to.
>
> It looks like what they do is use group_cpus_evenly(), which as I
> understand it, will partition CPUs taking into account numa nodes (as
> well as clustering and SMT siblings). I think if we use this for fuse
> io-uring, it will make things a lot simpler and we could get rid of
> the per-numa state tracking (eg numa_q_map, registered_q_mask,
> nr_numa_nodes) and simplify queue selection where now that can just
> be a cpu to qid lookup instead of a two-level
> numa-then-global-fallback lookup.
>
> Do you think something like this makes sense?
Maybe; I need to check that code. However, does this really need to be
done right now? Can't this be updated later? To me it looks a bit like
we would be replacing one piece of code with another, without a clear
advantage. I can look into group_cpus_evenly(), but I cannot promise
when that will happen.
My personal preference would be to work on real issues, like getting rid
of the two locks (queue->lock and bg->lock) and distributing max_bg
across queues. That probably requires the distribution across queues,
which you didn't like in the previous series. Anyway, even finding the
time for that is hard.
My personal opinion is that queue selection needs to return the qid, so
that the function can be overridden with eBPF. I haven't had time yet to
try that out.
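Just to illustrate what I mean, a completely untested sketch
(fuse_uring_default_qid() is a made-up name): the default policy becomes
a single function that returns a qid, which could later be the eBPF
override point:

	static unsigned int fuse_uring_default_qid(struct fuse_ring *ring,
						   unsigned int cpu)
	{
		/* default policy: per-cpu mapping, wrapped to the queue count */
		return ring->q_map.cpu_to_qid[cpu % ring->max_nr_queues];
	}

	static struct fuse_ring_queue *fuse_uring_select_queue(struct fuse_ring *ring)
	{
		unsigned int qid = fuse_uring_default_qid(ring, task_cpu(current));

		if (WARN_ON_ONCE(qid >= ring->max_nr_queues))
			return NULL;
		return READ_ONCE(ring->queues[qid]);
	}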
>
> Additionally, as I understand it, in this series, the ring->q_map
> mapping has to get rebuilt every time a new queue gets created. What
> do you think about just having the server declare the total queue
> count upfront and then the mapping can just get established at ring
> creation time? group_cpus_evenly() would only need to be called once,
> the cpu_to_qid map would only have to be built once, and we could
> avoid the rebuild-on-each-queue-creation complexity entirely. Do you
> think something like this makes sense?
That is why I said in another mail that a config SQE would make sense to
some extent. However, the part where I disagree: with the current
approach we could make it all entirely dynamic.
Only the logic for that in libfuse is missing. I.e. it _could_ start
with a single queue, or one queue per numa node, with one ring entry -
basically no memory usage then.
And then libfuse could add logic: for many small requests, set up ring
entries with a smaller payload size (or a smaller pBuf); for many large
requests, add more entries with a larger payload size. And with the
current approach queues can be added dynamically.
Thanks,
Bernd
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v4 6/8] fuse: {io-uring} Queue background requests on a different core
2026-04-13 9:41 [PATCH v4 0/8] fuse: {io-uring} Allow to reduce the number of queues and request distribution Bernd Schubert via B4 Relay
` (4 preceding siblings ...)
2026-04-13 9:41 ` [PATCH v4 5/8] fuse: {io-uring} Allow reduced number of ring queues Bernd Schubert via B4 Relay
@ 2026-04-13 9:41 ` Bernd Schubert via B4 Relay
2026-04-24 15:26 ` Luis Henriques
2026-04-13 9:41 ` [PATCH v4 7/8] fuse: Add retry attempts for numa local queues for load distribution Bernd Schubert via B4 Relay
2026-04-13 9:41 ` [PATCH v4 8/8] fuse: {io-uring} Prefer the current core over mapping Bernd Schubert via B4 Relay
7 siblings, 1 reply; 16+ messages in thread
From: Bernd Schubert via B4 Relay @ 2026-04-13 9:41 UTC (permalink / raw)
To: Miklos Szeredi
Cc: Joanne Koong, linux-fsdevel, Luis Henriques, Gang He,
Bernd Schubert
From: Bernd Schubert <bschubert@ddn.com>
Running background IO on a different core makes quite a difference.
fio --directory=/tmp/dest --name=iops.\$jobnum --rw=randread \
--bs=4k --size=1G --numjobs=1 --iodepth=4 --time_based\
--runtime=30s --group_reporting --ioengine=io_uring\
--direct=1
unpatched
READ: bw=272MiB/s (285MB/s) ...
patched
READ: bw=650MiB/s (682MB/s)
The reason is easily visible: the fio process migrates between CPUs
when requests are submitted on the queue for the same core.
With --iodepth=8
unpatched
READ: bw=466MiB/s (489MB/s)
patched
READ: bw=641MiB/s (672MB/s)
Without io-uring (--iodepth=8)
READ: bw=729MiB/s (764MB/s)
Without fuse (--iodepth=8)
READ: bw=2199MiB/s (2306MB/s)
(Tests were done with
<libfuse>/example/passthrough_hp -o allow_other --nopassthrough \
[-o io_uring] /tmp/source /tmp/dest
)
Additional notes:
With FURING_NEXT_QUEUE_RETRIES=0 (--iodepth=8)
READ: bw=903MiB/s (946MB/s)
With just a random qid (--iodepth=8)
READ: bw=429MiB/s (450MB/s)
With --iodepth=1
unpatched
READ: bw=195MiB/s (204MB/s)
patched
READ: bw=232MiB/s (243MB/s)
With --iodepth=1 --numjobs=2
unpatched
READ: bw=366MiB/s (384MB/s)
patched
READ: bw=472MiB/s (495MB/s)
With --iodepth=1 --numjobs=8
unpatched
READ: bw=1437MiB/s (1507MB/s)
patched
READ: bw=1529MiB/s (1603MB/s)
fuse without io-uring
READ: bw=1314MiB/s (1378MB/s), 1314MiB/s-1314MiB/s ...
no-fuse
READ: bw=2566MiB/s (2690MB/s), 2566MiB/s-2566MiB/s ...
In summary, for async requests the core doing application IO is busy
sending requests, so processing IO should be done on a different core.
Spreading the load on random cores is also not desirable, as such a core
might be frequency scaled down and/or in C1 sleep states. Not shown here,
but differences are much smaller when the system uses the performance
governor instead of schedutil (the Ubuntu default) - obviously at the
cost of higher system power consumption for the performance governor,
which is not desirable either.
Results without io-uring (io-uring uses fixed libfuse threads per queue)
heavily depend on the current number of active threads. Libfuse uses a
default of max 10 threads, but the actual max number of threads is a
parameter. Also, results without fuse-io-uring heavily depend on whether
another workload was already running, as libfuse starts these threads
dynamically - i.e. the more threads are active, the worse the
performance.
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
fs/fuse/dev_uring.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index e68089babaf89fb81741e4a5e605c6e36a137f9e..ed061e239b8ed70ff36deb51dd6957fe1704ec87 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -1306,13 +1306,21 @@ static void fuse_uring_send_in_task(struct io_tw_req tw_req, io_tw_token_t tw)
fuse_uring_send(ent, cmd, err, issue_flags);
}
-static struct fuse_ring_queue *fuse_uring_select_queue(struct fuse_ring *ring)
+static struct fuse_ring_queue *fuse_uring_select_queue(struct fuse_ring *ring,
+ bool background)
{
unsigned int qid;
int node;
unsigned int nr_queues;
unsigned int cpu = task_cpu(current);
+ /*
+ * Background requests result in better performance on a different
+ * CPU, unless CPUs are already busy.
+ */
+ if (background)
+ cpu++;
+
cpu = cpu % ring->max_nr_queues;
/* numa local registered queue bitmap */
@@ -1356,7 +1364,7 @@ void fuse_uring_queue_fuse_req(struct fuse_iqueue *fiq, struct fuse_req *req)
int err;
err = -EINVAL;
- queue = fuse_uring_select_queue(ring);
+ queue = fuse_uring_select_queue(ring, false);
if (!queue)
goto err;
@@ -1400,7 +1408,7 @@ bool fuse_uring_queue_bq_req(struct fuse_req *req)
struct fuse_ring_queue *queue;
struct fuse_ring_ent *ent = NULL;
- queue = fuse_uring_select_queue(ring);
+ queue = fuse_uring_select_queue(ring, true);
if (!queue)
return false;
--
2.43.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH v4 6/8] fuse: {io-uring} Queue background requests on a different core
2026-04-13 9:41 ` [PATCH v4 6/8] fuse: {io-uring} Queue background requests on a different core Bernd Schubert via B4 Relay
@ 2026-04-24 15:26 ` Luis Henriques
0 siblings, 0 replies; 16+ messages in thread
From: Luis Henriques @ 2026-04-24 15:26 UTC (permalink / raw)
To: Bernd Schubert via B4 Relay
Cc: Miklos Szeredi, bernd, Joanne Koong, linux-fsdevel, Gang He,
Bernd Schubert
On Mon, Apr 13 2026, Bernd Schubert via B4 Relay wrote:
> From: Bernd Schubert <bschubert@ddn.com>
>
> Running background IO on a different core makes quite a difference.
>
> fio --directory=/tmp/dest --name=iops.\$jobnum --rw=randread \
> --bs=4k --size=1G --numjobs=1 --iodepth=4 --time_based\
> --runtime=30s --group_reporting --ioengine=io_uring\
> --direct=1
>
> unpatched
> READ: bw=272MiB/s (285MB/s) ...
> patched
> READ: bw=650MiB/s (682MB/s)
>
> The reason is easily visible: the fio process migrates between CPUs
> when requests are submitted on the queue for the same core.
>
> With --iodepth=8
>
> unpatched
> READ: bw=466MiB/s (489MB/s)
> patched
> READ: bw=641MiB/s (672MB/s)
>
> Without io-uring (--iodepth=8)
> READ: bw=729MiB/s (764MB/s)
>
> Without fuse (--iodepth=8)
> READ: bw=2199MiB/s (2306MB/s)
>
> (Tests were done with
> <libfuse>/example/passthrough_hp -o allow_other --nopassthrough \
> [-o io_uring] /tmp/source /tmp/dest
> )
>
> Additional notes:
>
> With FURING_NEXT_QUEUE_RETRIES=0 (--iodepth=8)
> READ: bw=903MiB/s (946MB/s)
>
> With just a random qid (--iodepth=8)
> READ: bw=429MiB/s (450MB/s)
>
> With --iodepth=1
> unpatched
> READ: bw=195MiB/s (204MB/s)
> patched
> READ: bw=232MiB/s (243MB/s)
>
> With --iodepth=1 --numjobs=2
> unpatched
> READ: bw=366MiB/s (384MB/s)
> patched
> READ: bw=472MiB/s (495MB/s)
>
> With --iodepth=1 --numjobs=8
> unpatched
> READ: bw=1437MiB/s (1507MB/s)
> patched
> READ: bw=1529MiB/s (1603MB/s)
> fuse without io-uring
> READ: bw=1314MiB/s (1378MB/s), 1314MiB/s-1314MiB/s ...
> no-fuse
> READ: bw=2566MiB/s (2690MB/s), 2566MiB/s-2566MiB/s ...
>
> In summary, for async requests the core doing application IO is busy
> sending requests, so processing IO should be done on a different core.
> Spreading the load on random cores is also not desirable, as such a core
> might be frequency scaled down and/or in C1 sleep states. Not shown here,
> but differences are much smaller when the system uses the performance
> governor instead of schedutil (the Ubuntu default) - obviously at the
> cost of higher system power consumption for the performance governor,
> which is not desirable either.
>
> Results without io-uring (io-uring uses fixed libfuse threads per queue)
> heavily depend on the current number of active threads. Libfuse uses a
> default of max 10 threads, but the actual max number of threads is a
> parameter. Also, results without fuse-io-uring heavily depend on whether
> another workload was already running, as libfuse starts these threads
> dynamically - i.e. the more threads are active, the worse the
> performance.
>
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> ---
> fs/fuse/dev_uring.c | 14 +++++++++++---
> 1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index e68089babaf89fb81741e4a5e605c6e36a137f9e..ed061e239b8ed70ff36deb51dd6957fe1704ec87 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -1306,13 +1306,21 @@ static void fuse_uring_send_in_task(struct io_tw_req tw_req, io_tw_token_t tw)
> fuse_uring_send(ent, cmd, err, issue_flags);
> }
>
> -static struct fuse_ring_queue *fuse_uring_select_queue(struct fuse_ring *ring)
> +static struct fuse_ring_queue *fuse_uring_select_queue(struct fuse_ring *ring,
> + bool background)
> {
> unsigned int qid;
> int node;
> unsigned int nr_queues;
> unsigned int cpu = task_cpu(current);
>
> + /*
> + * Background requests result in better performance on a different
> + * CPU, unless CPUs are already busy.
> + */
> + if (background)
> + cpu++;
> +
The performance numbers look great, but I was wondering if you get
similar improvements for write operations.
Also, isn't 'cpu++' too arbitrary? I mean, isn't there some heuristic
that could be used? I understand the goal is just to push the request
somewhere else, but does it make sense to push it to the next cpu on the
same node? Or to the next cpu on a different core? I'm just thinking out
loud, and maybe this is nonsense ;-)
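For example, something along these lines (completely untested;
next_cpu_same_node() is a made-up helper, just to illustrate staying on
the local numa node):

	static unsigned int next_cpu_same_node(unsigned int cpu)
	{
		const struct cpumask *mask = cpumask_of_node(cpu_to_node(cpu));
		unsigned int next = cpumask_next(cpu, mask);

		/* wrap around within the node */
		if (next >= nr_cpu_ids)
			next = cpumask_first(mask);
		return next;
	}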
Finally, shouldn't this behaviour be behind some knob? Maybe it's
over-complicating things for no good reason, but it could be useful to:
1) enable/disable it, 2) enable it by pushing to the next cpu (this
behaviour), 3) enable it by pushing to the next cpu on the same/a
different node, etc.
Cheers,
--
Luís
> cpu = cpu % ring->max_nr_queues;
>
> /* numa local registered queue bitmap */
> @@ -1356,7 +1364,7 @@ void fuse_uring_queue_fuse_req(struct fuse_iqueue *fiq, struct fuse_req *req)
> int err;
>
> err = -EINVAL;
> - queue = fuse_uring_select_queue(ring);
> + queue = fuse_uring_select_queue(ring, false);
> if (!queue)
> goto err;
>
> @@ -1400,7 +1408,7 @@ bool fuse_uring_queue_bq_req(struct fuse_req *req)
> struct fuse_ring_queue *queue;
> struct fuse_ring_ent *ent = NULL;
>
> - queue = fuse_uring_select_queue(ring);
> + queue = fuse_uring_select_queue(ring, true);
> if (!queue)
> return false;
>
>
> --
> 2.43.0
>
>
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v4 7/8] fuse: Add retry attempts for numa local queues for load distribution
2026-04-13 9:41 [PATCH v4 0/8] fuse: {io-uring} Allow to reduce the number of queues and request distribution Bernd Schubert via B4 Relay
` (5 preceding siblings ...)
2026-04-13 9:41 ` [PATCH v4 6/8] fuse: {io-uring} Queue background requests on a different core Bernd Schubert via B4 Relay
@ 2026-04-13 9:41 ` Bernd Schubert via B4 Relay
2026-04-24 15:28 ` Luis Henriques
2026-04-13 9:41 ` [PATCH v4 8/8] fuse: {io-uring} Prefer the current core over mapping Bernd Schubert via B4 Relay
7 siblings, 1 reply; 16+ messages in thread
From: Bernd Schubert via B4 Relay @ 2026-04-13 9:41 UTC (permalink / raw)
To: Miklos Szeredi
Cc: Joanne Koong, linux-fsdevel, Luis Henriques, Gang He,
Bernd Schubert
From: Bernd Schubert <bschubert@ddn.com>
This is to further improve performance.
fio --directory=/tmp/dest --name=iops.\$jobnum --rw=randread \
--bs=4k --size=1G --numjobs=1 --iodepth=4 --time_based\
--runtime=30s --group_reporting --ioengine=io_uring\
--direct=1
unpatched
READ: bw=650MiB/s (682MB/s)
patched:
READ: bw=995MiB/s (1043MB/s)
with --iodepth=8
unpatched
READ: bw=641MiB/s (672MB/s)
patched
READ: bw=966MiB/s (1012MB/s)
The reason is that with --iodepth=x (x > 1) fio submits multiple async
requests and a single queue might become CPU limited, i.e. spreading
the load helps.
---
fs/fuse/dev_uring.c | 30 ++++++++++++++++++++++++++++--
1 file changed, 28 insertions(+), 2 deletions(-)
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index ed061e239b8ed70ff36deb51dd6957fe1704ec87..e06d45b161d5000e24431314b2222b66bdea58aa 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -19,6 +19,7 @@ MODULE_PARM_DESC(enable_uring,
#define FUSE_URING_IOV_SEGS 2 /* header and payload */
+#define FUSE_URING_Q_THRESHOLD 2
bool fuse_uring_enabled(void)
{
@@ -1310,9 +1311,10 @@ static struct fuse_ring_queue *fuse_uring_select_queue(struct fuse_ring *ring,
bool background)
{
unsigned int qid;
- int node;
+ int node, retries = 0;
unsigned int nr_queues;
unsigned int cpu = task_cpu(current);
+ struct fuse_ring_queue *queue, *primary_queue = NULL;
/*
* Background requests result in better performance on a different
@@ -1321,6 +1323,7 @@ static struct fuse_ring_queue *fuse_uring_select_queue(struct fuse_ring *ring,
if (background)
cpu++;
+retry:
cpu = cpu % ring->max_nr_queues;
/* numa local registered queue bitmap */
@@ -1336,12 +1339,35 @@ static struct fuse_ring_queue *fuse_uring_select_queue(struct fuse_ring *ring,
qid = ring->numa_q_map[node].cpu_to_qid[cpu];
if (WARN_ON_ONCE(qid >= ring->max_nr_queues))
return NULL;
- return READ_ONCE(ring->queues[qid]);
+ queue = READ_ONCE(ring->queues[qid]);
+
+ /* Might happen on teardown */
+ if (unlikely(!queue))
+ return NULL;
+
+ if (queue->nr_reqs < FUSE_URING_Q_THRESHOLD)
+ return queue;
+
+ /* Retries help for load balancing */
+ if (retries < FUSE_URING_Q_THRESHOLD) {
+ if (!retries)
+ primary_queue = queue;
+
+ /* Increase cpu, assuming it will map to a differet qid*/
+ cpu++;
+ retries++;
+ goto retry;
+ }
}
+ /* Retries exceeded, take the primary target queue */
+ if (primary_queue)
+ return primary_queue;
+
/* global registered queue bitmap */
qid = ring->q_map.cpu_to_qid[cpu];
if (WARN_ON_ONCE(qid >= ring->max_nr_queues))
+ /* Might happen on teardown */
return NULL;
return READ_ONCE(ring->queues[qid]);
}
--
2.43.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH v4 7/8] fuse: Add retry attempts for numa local queues for load distribution
2026-04-13 9:41 ` [PATCH v4 7/8] fuse: Add retry attempts for numa local queues for load distribution Bernd Schubert via B4 Relay
@ 2026-04-24 15:28 ` Luis Henriques
0 siblings, 0 replies; 16+ messages in thread
From: Luis Henriques @ 2026-04-24 15:28 UTC (permalink / raw)
To: Bernd Schubert via B4 Relay
Cc: Miklos Szeredi, bernd, Joanne Koong, linux-fsdevel, Gang He,
Bernd Schubert
On Mon, Apr 13 2026, Bernd Schubert via B4 Relay wrote:
> From: Bernd Schubert <bschubert@ddn.com>
>
> This is to further improve performance.
>
> fio --directory=/tmp/dest --name=iops.\$jobnum --rw=randread \
> --bs=4k --size=1G --numjobs=1 --iodepth=4 --time_based\
> --runtime=30s --group_reporting --ioengine=io_uring\
> --direct=1
>
> unpatched
> READ: bw=650MiB/s (682MB/s)
> patched:
> READ: bw=995MiB/s (1043MB/s)
>
> with --iodepth=8
>
> unpatched
> READ: bw=641MiB/s (672MB/s)
> patched
> READ: bw=966MiB/s (1012MB/s)
>
> The reason is that with --iodepth=x (x > 1) fio submits multiple async
> requests and a single queue might become CPU limited, i.e. spreading
> the load helps.
> ---
> fs/fuse/dev_uring.c | 30 ++++++++++++++++++++++++++++--
> 1 file changed, 28 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index ed061e239b8ed70ff36deb51dd6957fe1704ec87..e06d45b161d5000e24431314b2222b66bdea58aa 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -19,6 +19,7 @@ MODULE_PARM_DESC(enable_uring,
>
> #define FUSE_URING_IOV_SEGS 2 /* header and payload */
>
> +#define FUSE_URING_Q_THRESHOLD 2
>
> bool fuse_uring_enabled(void)
> {
> @@ -1310,9 +1311,10 @@ static struct fuse_ring_queue *fuse_uring_select_queue(struct fuse_ring *ring,
> bool background)
> {
> unsigned int qid;
> - int node;
> + int node, retries = 0;
> unsigned int nr_queues;
> unsigned int cpu = task_cpu(current);
> + struct fuse_ring_queue *queue, *primary_queue = NULL;
>
> /*
> * Background requests result in better performance on a different
> @@ -1321,6 +1323,7 @@ static struct fuse_ring_queue *fuse_uring_select_queue(struct fuse_ring *ring,
> if (background)
> cpu++;
>
> +retry:
> cpu = cpu % ring->max_nr_queues;
>
> /* numa local registered queue bitmap */
> @@ -1336,12 +1339,35 @@ static struct fuse_ring_queue *fuse_uring_select_queue(struct fuse_ring *ring,
> qid = ring->numa_q_map[node].cpu_to_qid[cpu];
> if (WARN_ON_ONCE(qid >= ring->max_nr_queues))
> return NULL;
> - return READ_ONCE(ring->queues[qid]);
> + queue = READ_ONCE(ring->queues[qid]);
> +
> + /* Might happen on teardown */
> + if (unlikely(!queue))
> + return NULL;
> +
> + if (queue->nr_reqs < FUSE_URING_Q_THRESHOLD)
> + return queue;
> +
> + /* Retries help for load balancing */
> + if (retries < FUSE_URING_Q_THRESHOLD) {
> + if (!retries)
> + primary_queue = queue;
> +
> + /* Increase cpu, assuming it will map to a differet qid*/
nit: "different"
> + cpu++;
> + retries++;
> + goto retry;
> + }
> }
>
> + /* Retries exceeded, take the primary target queue */
> + if (primary_queue)
> + return primary_queue;
> +
> /* global registered queue bitmap */
> qid = ring->q_map.cpu_to_qid[cpu];
> if (WARN_ON_ONCE(qid >= ring->max_nr_queues))
> + /* Might happen on teardown */
This comment should probably be on the line above the 'if' statement.
Cheers,
--
Luís
> return NULL;
> return READ_ONCE(ring->queues[qid]);
> }
>
> --
> 2.43.0
>
>
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v4 8/8] fuse: {io-uring} Prefer the current core over mapping
2026-04-13 9:41 [PATCH v4 0/8] fuse: {io-uring} Allow to reduce the number of queues and request distribution Bernd Schubert via B4 Relay
` (6 preceding siblings ...)
2026-04-13 9:41 ` [PATCH v4 7/8] fuse: Add retry attempts for numa local queues for load distribution Bernd Schubert via B4 Relay
@ 2026-04-13 9:41 ` Bernd Schubert via B4 Relay
7 siblings, 0 replies; 16+ messages in thread
From: Bernd Schubert via B4 Relay @ 2026-04-13 9:41 UTC (permalink / raw)
To: Miklos Szeredi
Cc: Joanne Koong, linux-fsdevel, Luis Henriques, Gang He,
Bernd Schubert, Bernd Schubert
From: Bernd Schubert <bschubert@ddn.com>
The mapping might point to a totally different core due to
random assignment. For performance, using the current
core might be beneficial.
Example (with core binding)
unpatched WRITE: bw=841MiB/s
patched WRITE: bw=1363MiB/s
With
fio --name=test --ioengine=psync --direct=1 \
--rw=write --bs=1M --iodepth=1 --numjobs=1 \
--filename_format=/redfs/testfile.\$jobnum --size=100G \
--thread --create_on_open=1 --runtime=30s --cpus_allowed=1
In order to get the good numbers, `--cpus_allowed=1` is needed.
This could be improved by a future change that avoids
cpu migration in fuse_request_end() on the wake_up() call.
Signed-off-by: Bernd Schubert <bernd@bsbernd.com>
---
fs/fuse/dev_uring.c | 41 ++++++++++++++++++++++++++++-------------
1 file changed, 28 insertions(+), 13 deletions(-)
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index e06d45b161d5000e24431314b2222b66bdea58aa..5c00fd047c8bd359ec34fb6a41abba44f6794517 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -19,8 +19,12 @@ MODULE_PARM_DESC(enable_uring,
#define FUSE_URING_IOV_SEGS 2 /* header and payload */
+/* Threshold that determines if a better queue should be searched for */
#define FUSE_URING_Q_THRESHOLD 2
+/* Number of (re)tries to find a better queue */
+#define FUSE_URING_Q_TRIES 3
+
bool fuse_uring_enabled(void)
{
return enable_uring;
@@ -1311,7 +1315,7 @@ static struct fuse_ring_queue *fuse_uring_select_queue(struct fuse_ring *ring,
bool background)
{
unsigned int qid;
- int node, retries = 0;
+ int node, tries = 0;
unsigned int nr_queues;
unsigned int cpu = task_cpu(current);
struct fuse_ring_queue *queue, *primary_queue = NULL;
@@ -1336,26 +1340,36 @@ static struct fuse_ring_queue *fuse_uring_select_queue(struct fuse_ring *ring,
nr_queues = READ_ONCE(ring->numa_q_map[node].nr_queues);
if (nr_queues) {
+ /* prefer the queue that corresponds to the current cpu */
+ queue = READ_ONCE(ring->queues[cpu]);
+ if (queue) {
+ if (queue->nr_reqs <= FUSE_URING_Q_THRESHOLD)
+ return queue;
+ primary_queue = queue;
+ }
+
qid = ring->numa_q_map[node].cpu_to_qid[cpu];
if (WARN_ON_ONCE(qid >= ring->max_nr_queues))
return NULL;
- queue = READ_ONCE(ring->queues[qid]);
+ if (qid != cpu) {
+ queue = READ_ONCE(ring->queues[qid]);
- /* Might happen on teardown */
- if (unlikely(!queue))
- return NULL;
+ /* Might happen on teardown */
+ if (unlikely(!queue))
+ return NULL;
- if (queue->nr_reqs < FUSE_URING_Q_THRESHOLD)
- return queue;
+ if (queue->nr_reqs <= FUSE_URING_Q_THRESHOLD)
+ return queue;
+ }
/* Retries help for load balancing */
- if (retries < FUSE_URING_Q_THRESHOLD) {
- if (!retries)
+ if (tries < FUSE_URING_Q_TRIES && tries + 1 < nr_queues) {
+ if (!primary_queue)
primary_queue = queue;
- /* Increase cpu, assuming it will map to a differet qid*/
+ /* Increase cpu, assuming it will map to a different qid*/
cpu++;
- retries++;
+ tries++;
goto retry;
}
}
@@ -1366,9 +1380,10 @@ static struct fuse_ring_queue *fuse_uring_select_queue(struct fuse_ring *ring,
/* global registered queue bitmap */
qid = ring->q_map.cpu_to_qid[cpu];
- if (WARN_ON_ONCE(qid >= ring->max_nr_queues))
- /* Might happen on teardown */
+ if (WARN_ON_ONCE(qid >= ring->max_nr_queues)) {
+ /* Might happen on teardown */
return NULL;
+ }
return READ_ONCE(ring->queues[qid]);
}
--
2.43.0
^ permalink raw reply related [flat|nested] 16+ messages in thread