[PATCH v2 0/7] fuse: {io-uring} Allow to reduce the number of queues and request distribution


* [PATCH v2 0/7] fuse: {io-uring} Allow to reduce the number of queues and request distribution
@ 2025-10-03 10:06 Bernd Schubert
  2025-10-03 10:06 ` [PATCH v2 1/7] fuse: {io-uring} Add queue length counters Bernd Schubert
                   ` (6 more replies)
  0 siblings, 7 replies; 12+ messages in thread
From: Bernd Schubert @ 2025-10-03 10:06 UTC (permalink / raw)
  To: Miklos Szeredi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider
  Cc: Joanne Koong, linux-fsdevel, Bernd Schubert

This adds bitmaps that track which queues are registered and which queues
do not have queued requests.
These bitmaps are then used to map the core issuing a request to a queue
and also allow load distribution. NUMA affinity is taken into account and
the fuse client/server protocol does not need changes; everything is handled
internally in the fuse client.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
Changes in v2:
- Overall code/logic changes, avail_q_masks were removed,
  decision which queue to use for a request was re-worked
  to achieve better balancing and performance
- Addressed Joanne's comments. Thanks a lot for
  kcalloc(..., sizeof(cpumask_var_t))!
- Added back optimizations that were part of fuse-io-uring to RFCv2,
  i.e. wake_up_on_current_cpu() for sync requests and
  queuing on a different cpu queue for async requests
- Added some benchmarks on the optimization commits.
- Link to v1: https://lore.kernel.org/r/20250722-reduced-nr-ring-queues_3-v1-0-aa8e37ae97e6@ddn.com

---
Bernd Schubert (7):
      fuse: {io-uring} Add queue length counters
      fuse: {io-uring} Rename ring->nr_queues to max_nr_queues
      fuse: {io-uring} Use bitmaps to track registered queues
      fuse: {io-uring} Distribute load among queues
      fuse: {io-uring} Allow reduced number of ring queues
      fuse: {io-uring} Queue background requests on a different core
      fuse: Wake requests on the same cpu

 fs/fuse/dev.c         |   8 +-
 fs/fuse/dev_uring.c   | 253 +++++++++++++++++++++++++++++++++++++++-----------
 fs/fuse/dev_uring_i.h |  14 ++-
 include/linux/wait.h  |   6 +-
 kernel/sched/wait.c   |  12 +++
 5 files changed, 234 insertions(+), 59 deletions(-)
---
base-commit: 8b789f2b7602a818e7c7488c74414fae21392b63
change-id: 20250722-reduced-nr-ring-queues_3-6acb79dad978

Best regards,
-- 
Bernd Schubert <bschubert@ddn.com>



* [PATCH v2 1/7] fuse: {io-uring} Add queue length counters
  2025-10-03 10:06 [PATCH v2 0/7] fuse: {io-uring} Allow to reduce the number of queues and request distribution Bernd Schubert
@ 2025-10-03 10:06 ` Bernd Schubert
  2025-10-03 10:06 ` [PATCH v2 2/7] fuse: {io-uring} Rename ring->nr_queues to max_nr_queues Bernd Schubert
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Bernd Schubert @ 2025-10-03 10:06 UTC (permalink / raw)
  To: Miklos Szeredi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider
  Cc: Joanne Koong, linux-fsdevel, Bernd Schubert

This is another preparation patch. The counters will be used to decide
which queue a request is added to.
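
A minimal userspace sketch of the counting rule, for illustration only.
The struct and function names are hypothetical and not part of the patch;
in the kernel every update happens under queue->lock, which is left out here.

```c
#include <assert.h>

/* Hypothetical stand-in for the nr_reqs field added to struct fuse_ring_queue */
struct queue_sketch {
	unsigned int nr_reqs;	/* requests queued or in userspace */
};

/* A request is handed to the queue (foreground or background) */
static void queue_add_req(struct queue_sketch *q)
{
	q->nr_reqs++;
}

/* A request completed, or was removed while still pending */
static void queue_req_done(struct queue_sketch *q)
{
	assert(q->nr_reqs > 0);
	q->nr_reqs--;
}

/* Abort path: the patch resets the counter when it splices the request list */
static void queue_abort(struct queue_sketch *q)
{
	q->nr_reqs = 0;
}
```

The counter only has to be a cheap load estimate that later patches can
compare between queues.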

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
---
 fs/fuse/dev_uring.c   | 17 +++++++++++++++--
 fs/fuse/dev_uring_i.h |  3 +++
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 249b210becb1cc2b40ae7b2fdf3a57dc57eaac42..2f2f7ff5e95a63a4df76f484d30cce1077b29123 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -85,13 +85,13 @@ static void fuse_uring_req_end(struct fuse_ring_ent *ent, struct fuse_req *req,
 	lockdep_assert_not_held(&queue->lock);
 	spin_lock(&queue->lock);
 	ent->fuse_req = NULL;
+	queue->nr_reqs--;
 	if (test_bit(FR_BACKGROUND, &req->flags)) {
 		queue->active_background--;
 		spin_lock(&fc->bg_lock);
 		fuse_uring_flush_bg(queue);
 		spin_unlock(&fc->bg_lock);
 	}
-
 	spin_unlock(&queue->lock);
 
 	if (error)
@@ -111,6 +111,7 @@ static void fuse_uring_abort_end_queue_requests(struct fuse_ring_queue *queue)
 	list_for_each_entry(req, &queue->fuse_req_queue, list)
 		clear_bit(FR_PENDING, &req->flags);
 	list_splice_init(&queue->fuse_req_queue, &req_list);
+	queue->nr_reqs = 0;
 	spin_unlock(&queue->lock);
 
 	/* must not hold queue lock to avoid order issues with fi->lock */
@@ -1280,10 +1281,13 @@ void fuse_uring_queue_fuse_req(struct fuse_iqueue *fiq, struct fuse_req *req)
 	req->ring_queue = queue;
 	ent = list_first_entry_or_null(&queue->ent_avail_queue,
 				       struct fuse_ring_ent, list);
+	queue->nr_reqs++;
+
 	if (ent)
 		fuse_uring_add_req_to_ring_ent(ent, req);
 	else
 		list_add_tail(&req->list, &queue->fuse_req_queue);
+
 	spin_unlock(&queue->lock);
 
 	if (ent)
@@ -1319,6 +1323,7 @@ bool fuse_uring_queue_bq_req(struct fuse_req *req)
 	set_bit(FR_URING, &req->flags);
 	req->ring_queue = queue;
 	list_add_tail(&req->list, &queue->fuse_req_bg_queue);
+	queue->nr_reqs++;
 
 	ent = list_first_entry_or_null(&queue->ent_avail_queue,
 				       struct fuse_ring_ent, list);
@@ -1351,8 +1356,16 @@ bool fuse_uring_queue_bq_req(struct fuse_req *req)
 bool fuse_uring_remove_pending_req(struct fuse_req *req)
 {
 	struct fuse_ring_queue *queue = req->ring_queue;
+	bool removed = fuse_remove_pending_req(req, &queue->lock);
 
-	return fuse_remove_pending_req(req, &queue->lock);
+	if (removed) {
+		/* Update counters after successful removal */
+		spin_lock(&queue->lock);
+		queue->nr_reqs--;
+		spin_unlock(&queue->lock);
+	}
+
+	return removed;
 }
 
 static const struct fuse_iqueue_ops fuse_io_uring_ops = {
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index 51a563922ce14158904a86c248c77767be4fe5ae..c63bed9f863d53d4ac2bed7bfbda61941cd99083 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -94,6 +94,9 @@ struct fuse_ring_queue {
 	/* background fuse requests */
 	struct list_head fuse_req_bg_queue;
 
+	/* number of requests queued or in userspace */
+	unsigned int nr_reqs;
+
 	struct fuse_pqueue fpq;
 
 	unsigned int active_background;

-- 
2.43.0



* [PATCH v2 2/7] fuse: {io-uring} Rename ring->nr_queues to max_nr_queues
  2025-10-03 10:06 [PATCH v2 0/7] fuse: {io-uring} Allow to reduce the number of queues and request distribution Bernd Schubert
  2025-10-03 10:06 ` [PATCH v2 1/7] fuse: {io-uring} Add queue length counters Bernd Schubert
@ 2025-10-03 10:06 ` Bernd Schubert
  2025-10-03 10:06 ` [PATCH v2 3/7] fuse: {io-uring} Use bitmaps to track registered queues Bernd Schubert
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Bernd Schubert @ 2025-10-03 10:06 UTC (permalink / raw)
  To: Miklos Szeredi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider
  Cc: Joanne Koong, linux-fsdevel, Bernd Schubert

This is preparation for follow-up commits that allow running with a
reduced number of queues.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
 fs/fuse/dev_uring.c   | 24 ++++++++++++------------
 fs/fuse/dev_uring_i.h |  2 +-
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 2f2f7ff5e95a63a4df76f484d30cce1077b29123..0f5ab27dacb66c9f5f10eac2713d9bd3eb4c26da 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -124,7 +124,7 @@ void fuse_uring_abort_end_requests(struct fuse_ring *ring)
 	struct fuse_ring_queue *queue;
 	struct fuse_conn *fc = ring->fc;
 
-	for (qid = 0; qid < ring->nr_queues; qid++) {
+	for (qid = 0; qid < ring->max_nr_queues; qid++) {
 		queue = READ_ONCE(ring->queues[qid]);
 		if (!queue)
 			continue;
@@ -165,7 +165,7 @@ bool fuse_uring_request_expired(struct fuse_conn *fc)
 	if (!ring)
 		return false;
 
-	for (qid = 0; qid < ring->nr_queues; qid++) {
+	for (qid = 0; qid < ring->max_nr_queues; qid++) {
 		queue = READ_ONCE(ring->queues[qid]);
 		if (!queue)
 			continue;
@@ -192,7 +192,7 @@ void fuse_uring_destruct(struct fuse_conn *fc)
 	if (!ring)
 		return;
 
-	for (qid = 0; qid < ring->nr_queues; qid++) {
+	for (qid = 0; qid < ring->max_nr_queues; qid++) {
 		struct fuse_ring_queue *queue = ring->queues[qid];
 		struct fuse_ring_ent *ent, *next;
 
@@ -252,7 +252,7 @@ static struct fuse_ring *fuse_uring_create(struct fuse_conn *fc)
 
 	init_waitqueue_head(&ring->stop_waitq);
 
-	ring->nr_queues = nr_queues;
+	ring->max_nr_queues = nr_queues;
 	ring->fc = fc;
 	ring->max_payload_sz = max_payload_size;
 	smp_store_release(&fc->ring, ring);
@@ -404,7 +404,7 @@ static void fuse_uring_log_ent_state(struct fuse_ring *ring)
 	int qid;
 	struct fuse_ring_ent *ent;
 
-	for (qid = 0; qid < ring->nr_queues; qid++) {
+	for (qid = 0; qid < ring->max_nr_queues; qid++) {
 		struct fuse_ring_queue *queue = ring->queues[qid];
 
 		if (!queue)
@@ -435,7 +435,7 @@ static void fuse_uring_async_stop_queues(struct work_struct *work)
 		container_of(work, struct fuse_ring, async_teardown_work.work);
 
 	/* XXX code dup */
-	for (qid = 0; qid < ring->nr_queues; qid++) {
+	for (qid = 0; qid < ring->max_nr_queues; qid++) {
 		struct fuse_ring_queue *queue = READ_ONCE(ring->queues[qid]);
 
 		if (!queue)
@@ -470,7 +470,7 @@ void fuse_uring_stop_queues(struct fuse_ring *ring)
 {
 	int qid;
 
-	for (qid = 0; qid < ring->nr_queues; qid++) {
+	for (qid = 0; qid < ring->max_nr_queues; qid++) {
 		struct fuse_ring_queue *queue = READ_ONCE(ring->queues[qid]);
 
 		if (!queue)
@@ -889,7 +889,7 @@ static int fuse_uring_commit_fetch(struct io_uring_cmd *cmd, int issue_flags,
 	if (!ring)
 		return err;
 
-	if (qid >= ring->nr_queues)
+	if (qid >= ring->max_nr_queues)
 		return -EINVAL;
 
 	queue = ring->queues[qid];
@@ -952,7 +952,7 @@ static bool is_ring_ready(struct fuse_ring *ring, int current_qid)
 	struct fuse_ring_queue *queue;
 	bool ready = true;
 
-	for (qid = 0; qid < ring->nr_queues && ready; qid++) {
+	for (qid = 0; qid < ring->max_nr_queues && ready; qid++) {
 		if (current_qid == qid)
 			continue;
 
@@ -1093,7 +1093,7 @@ static int fuse_uring_register(struct io_uring_cmd *cmd,
 			return err;
 	}
 
-	if (qid >= ring->nr_queues) {
+	if (qid >= ring->max_nr_queues) {
 		pr_info_ratelimited("fuse: Invalid ring qid %u\n", qid);
 		return -EINVAL;
 	}
@@ -1236,9 +1236,9 @@ static struct fuse_ring_queue *fuse_uring_task_to_queue(struct fuse_ring *ring)
 
 	qid = task_cpu(current);
 
-	if (WARN_ONCE(qid >= ring->nr_queues,
+	if (WARN_ONCE(qid >= ring->max_nr_queues,
 		      "Core number (%u) exceeds nr queues (%zu)\n", qid,
-		      ring->nr_queues))
+		      ring->max_nr_queues))
 		qid = 0;
 
 	queue = ring->queues[qid];
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index c63bed9f863d53d4ac2bed7bfbda61941cd99083..708412294982566919122a1a0d7f741217c763ce 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -113,7 +113,7 @@ struct fuse_ring {
 	struct fuse_conn *fc;
 
 	/* number of ring queues */
-	size_t nr_queues;
+	size_t max_nr_queues;
 
 	/* maximum payload/arg size */
 	size_t max_payload_sz;

-- 
2.43.0



* [PATCH v2 3/7] fuse: {io-uring} Use bitmaps to track registered queues
  2025-10-03 10:06 [PATCH v2 0/7] fuse: {io-uring} Allow to reduce the number of queues and request distribution Bernd Schubert
  2025-10-03 10:06 ` [PATCH v2 1/7] fuse: {io-uring} Add queue length counters Bernd Schubert
  2025-10-03 10:06 ` [PATCH v2 2/7] fuse: {io-uring} Rename ring->nr_queues to max_nr_queues Bernd Schubert
@ 2025-10-03 10:06 ` Bernd Schubert
  2025-10-06  9:51   ` Luis Henriques
  2025-10-03 10:06 ` [PATCH v2 4/7] fuse: {io-uring} Distribute load among queues Bernd Schubert
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 12+ messages in thread
From: Bernd Schubert @ 2025-10-03 10:06 UTC (permalink / raw)
  To: Miklos Szeredi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider
  Cc: Joanne Koong, linux-fsdevel, Bernd Schubert

Add per-CPU and per-NUMA node bitmasks to track which
io-uring queues are registered.
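
The bookkeeping can be sketched in userspace roughly as follows. This is
not the kernel code: plain 64-bit words stand in for cpumask_var_t, and
MAX_NODES, struct ring_masks and the masks_*() helpers are hypothetical
names chosen for the example.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_NODES 8	/* assumption for the sketch, not a kernel limit */

struct ring_masks {
	uint64_t registered;			/* bit qid set: queue qid is registered */
	uint64_t numa_registered[MAX_NODES];	/* same, split per NUMA node */
	int nr_nodes;
};

/* Mirrors the two cpumask_set_cpu() calls done on queue registration */
static void masks_register(struct ring_masks *m, unsigned int qid, int node)
{
	assert(qid < 64 && node >= 0 && node < m->nr_nodes);
	m->registered |= UINT64_C(1) << qid;
	m->numa_registered[node] |= UINT64_C(1) << qid;
}

/* Mirrors the mask reset done when the queues are stopped */
static void masks_clear(struct ring_masks *m)
{
	m->registered = 0;
	for (int node = 0; node < m->nr_nodes; node++)
		m->numa_registered[node] = 0;
}
```

Keeping a separate mask per node lets queue selection try NUMA-local
queues first and only then fall back to the global mask.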

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
 fs/fuse/dev_uring.c   | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/fuse/dev_uring_i.h |  9 +++++++++
 2 files changed, 63 insertions(+)

diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 0f5ab27dacb66c9f5f10eac2713d9bd3eb4c26da..dacc07f5b5b1a48acefa278279f851c3ae2b1489 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -18,6 +18,8 @@ MODULE_PARM_DESC(enable_uring,
 
 #define FUSE_URING_IOV_SEGS 2 /* header and payload */
 
+/* Number of queued fuse requests until a queue is considered full */
+#define FUSE_URING_QUEUE_THRESHOLD 5
 
 bool fuse_uring_enabled(void)
 {
@@ -184,6 +186,18 @@ bool fuse_uring_request_expired(struct fuse_conn *fc)
 	return false;
 }
 
+static void fuse_ring_destruct_q_masks(struct fuse_ring *ring)
+{
+	int node;
+
+	free_cpumask_var(ring->registered_q_mask);
+	if (ring->numa_registered_q_mask) {
+		for (node = 0; node < ring->nr_numa_nodes; node++)
+			free_cpumask_var(ring->numa_registered_q_mask[node]);
+		kfree(ring->numa_registered_q_mask);
+	}
+}
+
 void fuse_uring_destruct(struct fuse_conn *fc)
 {
 	struct fuse_ring *ring = fc->ring;
@@ -215,11 +229,32 @@ void fuse_uring_destruct(struct fuse_conn *fc)
 		ring->queues[qid] = NULL;
 	}
 
+	fuse_ring_destruct_q_masks(ring);
 	kfree(ring->queues);
 	kfree(ring);
 	fc->ring = NULL;
 }
 
+static int fuse_ring_create_q_masks(struct fuse_ring *ring)
+{
+	int node;
+
+	if (!zalloc_cpumask_var(&ring->registered_q_mask, GFP_KERNEL_ACCOUNT))
+		return -ENOMEM;
+
+	ring->numa_registered_q_mask = kcalloc(
+		ring->nr_numa_nodes, sizeof(cpumask_var_t), GFP_KERNEL_ACCOUNT);
+	if (!ring->numa_registered_q_mask)
+		return -ENOMEM;
+	for (node = 0; node < ring->nr_numa_nodes; node++) {
+		if (!zalloc_cpumask_var(&ring->numa_registered_q_mask[node],
+					GFP_KERNEL_ACCOUNT))
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
 /*
  * Basic ring setup for this connection based on the provided configuration
  */
@@ -229,11 +264,14 @@ static struct fuse_ring *fuse_uring_create(struct fuse_conn *fc)
 	size_t nr_queues = num_possible_cpus();
 	struct fuse_ring *res = NULL;
 	size_t max_payload_size;
+	int err;
 
 	ring = kzalloc(sizeof(*fc->ring), GFP_KERNEL_ACCOUNT);
 	if (!ring)
 		return NULL;
 
+	ring->nr_numa_nodes = num_online_nodes();
+
 	ring->queues = kcalloc(nr_queues, sizeof(struct fuse_ring_queue *),
 			       GFP_KERNEL_ACCOUNT);
 	if (!ring->queues)
@@ -242,6 +280,10 @@ static struct fuse_ring *fuse_uring_create(struct fuse_conn *fc)
 	max_payload_size = max(FUSE_MIN_READ_BUFFER, fc->max_write);
 	max_payload_size = max(max_payload_size, fc->max_pages * PAGE_SIZE);
 
+	err = fuse_ring_create_q_masks(ring);
+	if (err)
+		goto out_err;
+
 	spin_lock(&fc->lock);
 	if (fc->ring) {
 		/* race, another thread created the ring in the meantime */
@@ -261,6 +303,7 @@ static struct fuse_ring *fuse_uring_create(struct fuse_conn *fc)
 	return ring;
 
 out_err:
+	fuse_ring_destruct_q_masks(ring);
 	kfree(ring->queues);
 	kfree(ring);
 	return res;
@@ -423,6 +466,7 @@ static void fuse_uring_log_ent_state(struct fuse_ring *ring)
 			pr_info(" ent-commit-queue ring=%p qid=%d ent=%p state=%d\n",
 				ring, qid, ent, ent->state);
 		}
+
 		spin_unlock(&queue->lock);
 	}
 	ring->stop_debug_log = 1;
@@ -469,6 +513,7 @@ static void fuse_uring_async_stop_queues(struct work_struct *work)
 void fuse_uring_stop_queues(struct fuse_ring *ring)
 {
 	int qid;
+	int node;
 
 	for (qid = 0; qid < ring->max_nr_queues; qid++) {
 		struct fuse_ring_queue *queue = READ_ONCE(ring->queues[qid]);
@@ -479,6 +524,11 @@ void fuse_uring_stop_queues(struct fuse_ring *ring)
 		fuse_uring_teardown_entries(queue);
 	}
 
+	/* Reset all queue masks, we won't process any more IO */
+	cpumask_clear(ring->registered_q_mask);
+	for (node = 0; node < ring->nr_numa_nodes; node++)
+		cpumask_clear(ring->numa_registered_q_mask[node]);
+
 	if (atomic_read(&ring->queue_refs) > 0) {
 		ring->teardown_time = jiffies;
 		INIT_DELAYED_WORK(&ring->async_teardown_work,
@@ -982,6 +1032,7 @@ static void fuse_uring_do_register(struct fuse_ring_ent *ent,
 	struct fuse_ring *ring = queue->ring;
 	struct fuse_conn *fc = ring->fc;
 	struct fuse_iqueue *fiq = &fc->iq;
+	int node = cpu_to_node(queue->qid);
 
 	fuse_uring_prepare_cancel(cmd, issue_flags, ent);
 
@@ -990,6 +1041,9 @@ static void fuse_uring_do_register(struct fuse_ring_ent *ent,
 	fuse_uring_ent_avail(ent, queue);
 	spin_unlock(&queue->lock);
 
+	cpumask_set_cpu(queue->qid, ring->registered_q_mask);
+	cpumask_set_cpu(queue->qid, ring->numa_registered_q_mask[node]);
+
 	if (!ring->ready) {
 		bool ready = is_ring_ready(ring, queue->qid);
 
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index 708412294982566919122a1a0d7f741217c763ce..35e3b6808b60398848965afd3091b765444283ff 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -115,6 +115,9 @@ struct fuse_ring {
 	/* number of ring queues */
 	size_t max_nr_queues;
 
+	/* number of numa nodes */
+	int nr_numa_nodes;
+
 	/* maximum payload/arg size */
 	size_t max_payload_sz;
 
@@ -125,6 +128,12 @@ struct fuse_ring {
 	 */
 	unsigned int stop_debug_log : 1;
 
+	/* Tracks which queues are registered */
+	cpumask_var_t registered_q_mask;
+
+	/* Tracks which queues are registered per NUMA node */
+	cpumask_var_t *numa_registered_q_mask;
+
 	wait_queue_head_t stop_waitq;
 
 	/* async tear down */

-- 
2.43.0



* [PATCH v2 4/7] fuse: {io-uring} Distribute load among queues
  2025-10-03 10:06 [PATCH v2 0/7] fuse: {io-uring} Allow to reduce the number of queues and request distribution Bernd Schubert
                   ` (2 preceding siblings ...)
  2025-10-03 10:06 ` [PATCH v2 3/7] fuse: {io-uring} Use bitmaps to track registered queues Bernd Schubert
@ 2025-10-03 10:06 ` Bernd Schubert
  2025-10-03 10:06 ` [PATCH v2 5/7] fuse: {io-uring} Allow reduced number of ring queues Bernd Schubert
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Bernd Schubert @ 2025-10-03 10:06 UTC (permalink / raw)
  To: Miklos Szeredi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider
  Cc: Joanne Koong, linux-fsdevel, Bernd Schubert

So far, queue selection always picked the queue corresponding
to the current core.
A previous commit introduced bitmaps that track which queues
are registered - queue selection can make use of these bitmaps
and try to find another queue if the current one is loaded.
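
The "pick two random queues, take the less loaded one" step can be
sketched in userspace as below. This is an illustration under
assumptions, not the patch itself: a 64-bit word replaces the cpumask,
rand_below() is a simple xorshift stand-in for get_random_u32_below(),
and mask_weight(), mask_nth() and best_queue() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_QUEUES 64

/* Stand-in PRNG (xorshift32); the kernel code uses get_random_u32_below() */
static uint32_t rng_state = 2463534242u;

static uint32_t rand_below(uint32_t bound)
{
	assert(bound > 0);
	rng_state ^= rng_state << 13;
	rng_state ^= rng_state >> 17;
	rng_state ^= rng_state << 5;
	return rng_state % bound;
}

/* Number of set bits in the mask */
static int mask_weight(uint64_t mask)
{
	int w = 0;

	for (; mask; mask &= mask - 1)
		w++;
	return w;
}

/* Index of the n-th set bit (0-based), or -1 if there is none */
static int mask_nth(uint64_t mask, int n)
{
	for (int bit = 0; bit < MAX_QUEUES; bit++) {
		if (!(mask & (UINT64_C(1) << bit)))
			continue;
		if (n-- == 0)
			return bit;
	}
	return -1;
}

/* Return the less loaded of two distinct random queues from the mask */
static int best_queue(uint64_t mask, const unsigned int *nr_reqs)
{
	int weight = mask_weight(mask);
	int qid1, qid2;

	if (weight == 0)
		return -1;
	if (weight == 1)
		return mask_nth(mask, 0);

	qid1 = mask_nth(mask, (int)rand_below((uint32_t)weight));
	do {
		qid2 = mask_nth(mask, (int)rand_below((uint32_t)weight));
	} while (qid2 == qid1);

	return nr_reqs[qid1] <= nr_reqs[qid2] ? qid1 : qid2;
}
```

Comparing only two candidates keeps the cost per request small and
independent of the number of queues, while still avoiding the most
loaded queues most of the time.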

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
 fs/fuse/dev_uring.c | 88 +++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 79 insertions(+), 9 deletions(-)

diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index dacc07f5b5b1a48acefa278279f851c3ae2b1489..bb5d7a98536963ec2e4c10982d33633db2573f4d 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -19,7 +19,9 @@ MODULE_PARM_DESC(enable_uring,
 #define FUSE_URING_IOV_SEGS 2 /* header and payload */
 
 /* Number of queued fuse requests until a queue is considered full */
-#define FUSE_URING_QUEUE_THRESHOLD 5
+#define FURING_Q_LOCAL_THRESHOLD 2
+#define FURING_Q_NUMA_THRESHOLD (FURING_Q_LOCAL_THRESHOLD + 1)
+#define FURING_Q_GLOBAL_THRESHOLD (FURING_Q_LOCAL_THRESHOLD * 2)
 
 bool fuse_uring_enabled(void)
 {
@@ -1283,22 +1285,90 @@ static void fuse_uring_send_in_task(struct io_uring_cmd *cmd,
 	fuse_uring_send(ent, cmd, err, issue_flags);
 }
 
-static struct fuse_ring_queue *fuse_uring_task_to_queue(struct fuse_ring *ring)
+/*
+ * Pick best queue from mask. Follows the algorithm described in
+ * "The Power of Two Choices in Randomized Load Balancing"
+ *  (Michael David Mitzenmacher, 1991)
+ */
+static struct fuse_ring_queue *fuse_uring_best_queue(const struct cpumask *mask,
+						     struct fuse_ring *ring)
+{
+	unsigned int qid1, qid2;
+	struct fuse_ring_queue *queue1, *queue2;
+	int weight = cpumask_weight(mask);
+
+	if (weight == 0)
+		return NULL;
+
+	if (weight == 1) {
+		qid1 = cpumask_first(mask);
+		return READ_ONCE(ring->queues[qid1]);
+	}
+
+	/* Get two different queues using optimized bounded random */
+	qid1 = cpumask_nth(get_random_u32_below(weight), mask);
+	queue1 = READ_ONCE(ring->queues[qid1]);
+
+	do {
+		qid2 = cpumask_nth(get_random_u32_below(weight), mask);
+	} while (qid2 == qid1);
+
+	queue2 = READ_ONCE(ring->queues[qid2]);
+
+	/* Return lowest loaded queue */
+	if (!queue1)
+		return queue2;
+	if (!queue2)
+		return queue1;
+
+	return (queue1->nr_reqs <= queue2->nr_reqs) ? queue1 : queue2;
+}
+
+/*
+ * Get the best queue for the current CPU
+ */
+static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
 {
 	unsigned int qid;
-	struct fuse_ring_queue *queue;
+	struct fuse_ring_queue *local_queue, *best_numa, *best_global;
+	int local_node;
+	const struct cpumask *numa_mask, *global_mask;
 
 	qid = task_cpu(current);
-
 	if (WARN_ONCE(qid >= ring->max_nr_queues,
 		      "Core number (%u) exceeds nr queues (%zu)\n", qid,
 		      ring->max_nr_queues))
 		qid = 0;
 
-	queue = ring->queues[qid];
-	WARN_ONCE(!queue, "Missing queue for qid %d\n", qid);
+	local_queue = READ_ONCE(ring->queues[qid]);
+	local_node = cpu_to_node(qid);
 
-	return queue;
+	/* Fast path: if local queue exists and is not overloaded, use it */
+	if (local_queue && local_queue->nr_reqs <= FURING_Q_LOCAL_THRESHOLD)
+		return local_queue;
+
+	/* Find best NUMA-local queue */
+	numa_mask = ring->numa_registered_q_mask[local_node];
+	best_numa = fuse_uring_best_queue(numa_mask, ring);
+
+	/* If NUMA queue is under threshold, use it */
+	if (best_numa && best_numa->nr_reqs <= FURING_Q_NUMA_THRESHOLD)
+		return best_numa;
+
+	/* NUMA queues above threshold, try global queues */
+	global_mask = ring->registered_q_mask;
+	best_global = fuse_uring_best_queue(global_mask, ring);
+
+	/* Might happen during tear down */
+	if (!best_global)
+		return NULL;
+
+	/* If global queue is under double threshold, use it */
+	if (best_global->nr_reqs <= FURING_Q_GLOBAL_THRESHOLD)
+		return best_global;
+
+	/* Fall back to best available queue */
+	return best_numa ? best_numa : best_global;
 }
 
 static void fuse_uring_dispatch_ent(struct fuse_ring_ent *ent)
@@ -1319,7 +1389,7 @@ void fuse_uring_queue_fuse_req(struct fuse_iqueue *fiq, struct fuse_req *req)
 	int err;
 
 	err = -EINVAL;
-	queue = fuse_uring_task_to_queue(ring);
+	queue = fuse_uring_get_queue(ring);
 	if (!queue)
 		goto err;
 
@@ -1364,7 +1434,7 @@ bool fuse_uring_queue_bq_req(struct fuse_req *req)
 	struct fuse_ring_queue *queue;
 	struct fuse_ring_ent *ent = NULL;
 
-	queue = fuse_uring_task_to_queue(ring);
+	queue = fuse_uring_get_queue(ring);
 	if (!queue)
 		return false;
 

-- 
2.43.0



* [PATCH v2 5/7] fuse: {io-uring} Allow reduced number of ring queues
  2025-10-03 10:06 [PATCH v2 0/7] fuse: {io-uring} Allow to reduce the number of queues and request distribution Bernd Schubert
                   ` (3 preceding siblings ...)
  2025-10-03 10:06 ` [PATCH v2 4/7] fuse: {io-uring} Distribute load among queues Bernd Schubert
@ 2025-10-03 10:06 ` Bernd Schubert
  2025-10-06 10:35   ` Bernd Schubert
  2025-10-03 10:06 ` [PATCH v2 6/7] fuse: {io-uring} Queue background requests on a different core Bernd Schubert
  2025-10-03 10:06 ` [PATCH v2 7/7] fuse: Wake requests on the same cpu Bernd Schubert
  6 siblings, 1 reply; 12+ messages in thread
From: Bernd Schubert @ 2025-10-03 10:06 UTC (permalink / raw)
  To: Miklos Szeredi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider
  Cc: Joanne Koong, linux-fsdevel, Bernd Schubert

Queue selection (fuse_uring_get_queue) can handle a reduced number
of queues - using io-uring is now possible even with a single
queue and a single entry.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
 fs/fuse/dev_uring.c | 35 +++--------------------------------
 1 file changed, 3 insertions(+), 32 deletions(-)

diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index bb5d7a98536963ec2e4c10982d33633db2573f4d..f5946bb1bbea930522921d49c04e047c70d21ee2 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -998,31 +998,6 @@ static int fuse_uring_commit_fetch(struct io_uring_cmd *cmd, int issue_flags,
 	return 0;
 }
 
-static bool is_ring_ready(struct fuse_ring *ring, int current_qid)
-{
-	int qid;
-	struct fuse_ring_queue *queue;
-	bool ready = true;
-
-	for (qid = 0; qid < ring->max_nr_queues && ready; qid++) {
-		if (current_qid == qid)
-			continue;
-
-		queue = ring->queues[qid];
-		if (!queue) {
-			ready = false;
-			break;
-		}
-
-		spin_lock(&queue->lock);
-		if (list_empty(&queue->ent_avail_queue))
-			ready = false;
-		spin_unlock(&queue->lock);
-	}
-
-	return ready;
-}
-
 /*
  * fuse_uring_req_fetch command handling
  */
@@ -1047,13 +1022,9 @@ static void fuse_uring_do_register(struct fuse_ring_ent *ent,
 	cpumask_set_cpu(queue->qid, ring->numa_registered_q_mask[node]);
 
 	if (!ring->ready) {
-		bool ready = is_ring_ready(ring, queue->qid);
-
-		if (ready) {
-			WRITE_ONCE(fiq->ops, &fuse_io_uring_ops);
-			WRITE_ONCE(ring->ready, true);
-			wake_up_all(&fc->blocked_waitq);
-		}
+		WRITE_ONCE(fiq->ops, &fuse_io_uring_ops);
+		WRITE_ONCE(ring->ready, true);
+		wake_up_all(&fc->blocked_waitq);
 	}
 }
 

-- 
2.43.0



* [PATCH v2 6/7] fuse: {io-uring} Queue background requests on a different core
  2025-10-03 10:06 [PATCH v2 0/7] fuse: {io-uring} Allow to reduce the number of queues and request distribution Bernd Schubert
                   ` (4 preceding siblings ...)
  2025-10-03 10:06 ` [PATCH v2 5/7] fuse: {io-uring} Allow reduced number of ring queues Bernd Schubert
@ 2025-10-03 10:06 ` Bernd Schubert
  2025-10-06  9:53   ` Luis Henriques
  2025-10-03 10:06 ` [PATCH v2 7/7] fuse: Wake requests on the same cpu Bernd Schubert
  6 siblings, 1 reply; 12+ messages in thread
From: Bernd Schubert @ 2025-10-03 10:06 UTC (permalink / raw)
  To: Miklos Szeredi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider
  Cc: Joanne Koong, linux-fsdevel, Bernd Schubert

Running background IO on a different core makes quite a difference.

fio --directory=/tmp/dest --name=iops.\$jobnum --rw=randread \
--bs=4k --size=1G --numjobs=1 --iodepth=4 --time_based \
--runtime=30s --group_reporting --ioengine=io_uring \
--direct=1

unpatched
   READ: bw=272MiB/s (285MB/s), 272MiB/s-272MiB/s ...
patched
   READ: bw=760MiB/s (797MB/s), 760MiB/s-760MiB/s ...

With --iodepth=8

unpatched
   READ: bw=466MiB/s (489MB/s), 466MiB/s-466MiB/s ...
patched
   READ: bw=966MiB/s (1013MB/s), 966MiB/s-966MiB/s ...
2nd run:
   READ: bw=1014MiB/s (1064MB/s), 1014MiB/s-1014MiB/s ...

Without io-uring (--iodepth=8)
   READ: bw=729MiB/s (764MB/s), 729MiB/s-729MiB/s ...

Without fuse (--iodepth=8)
   READ: bw=2199MiB/s (2306MB/s), 2199MiB/s-2199MiB/s ...

(Tests were done with
<libfuse>/example/passthrough_hp -o allow_other --nopassthrough  \
[-o io_uring] /tmp/source /tmp/dest
)

Additional notes:

With FURING_NEXT_QUEUE_RETRIES=0 (--iodepth=8)
   READ: bw=903MiB/s (946MB/s), 903MiB/s-903MiB/s ...

With just a random qid (--iodepth=8)
   READ: bw=429MiB/s (450MB/s), 429MiB/s-429MiB/s ...

With --iodepth=1
unpatched
   READ: bw=195MiB/s (204MB/s), 195MiB/s-195MiB/s ...
patched
   READ: bw=232MiB/s (243MB/s), 232MiB/s-232MiB/s ...

With --iodepth=1 --numjobs=2
unpatched
   READ: bw=966MiB/s (1013MB/s), 966MiB/s-966MiB/s ...
patched
   READ: bw=1821MiB/s (1909MB/s), 1821MiB/s-1821MiB/s ...

With --iodepth=1 --numjobs=8
unpatched
   READ: bw=1138MiB/s (1193MB/s), 1138MiB/s-1138MiB/s ...
patched
   READ: bw=1650MiB/s (1730MB/s), 1650MiB/s-1650MiB/s ...
fuse without io-uring
   READ: bw=1314MiB/s (1378MB/s), 1314MiB/s-1314MiB/s ...
no-fuse
   READ: bw=2566MiB/s (2690MB/s), 2566MiB/s-2566MiB/s ...

In summary, for async requests the core doing application IO is busy
sending requests, so processing of IOs should be done on a different core.
Spreading the load on random cores is also not desirable, as the core
might be frequency scaled down and/or in C1 sleep states. Not shown here,
but differences are much smaller when the system uses the performance
governor instead of schedutil (ubuntu default). Obviously at the cost of
higher system power consumption for the performance governor - not
desirable either.
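
The "use the next queue in the same NUMA domain" rule applied to
background requests can be sketched in userspace as follows. It follows
the index arithmetic of the diff, (qid + 1) % weight, but a 64-bit word
replaces the cpumask and mask_weight(), mask_nth() and
next_queue_in_mask() are hypothetical names. Returning -1 for an empty
mask is a choice made for the sketch only.

```c
#include <assert.h>
#include <stdint.h>

/* Number of set bits in the mask */
static int mask_weight(uint64_t mask)
{
	int w = 0;

	for (; mask; mask &= mask - 1)
		w++;
	return w;
}

/* Index of the n-th set bit (0-based), or -1 if there is none */
static int mask_nth(uint64_t mask, int n)
{
	assert(n >= 0);
	for (int bit = 0; bit < 64; bit++) {
		if (!(mask & (UINT64_C(1) << bit)))
			continue;
		if (n-- == 0)
			return bit;
	}
	return -1;
}

/*
 * Pick the queue that follows the submitting core within the NUMA mask,
 * so the same neighbour is chosen every time instead of a random core.
 */
static int next_queue_in_mask(uint64_t numa_mask, unsigned int qid)
{
	int weight = mask_weight(numa_mask);

	if (weight == 0)
		return -1;
	return mask_nth(numa_mask, (int)((qid + 1) % (unsigned int)weight));
}
```

With queues 0-3 on one node, core 0 maps to queue 1 and core 3 wraps
around to queue 0, so a given core always wakes the same neighbour.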

Results without io-uring (which uses fixed libfuse threads per queue)
heavily depend on the current number of active threads. Libfuse uses
a default of max 10 threads, but the actual max number of threads is a
parameter. Also, results without fuse-io-uring heavily depend on whether
another workload was already running before, as libfuse starts these
threads dynamically - i.e. the more threads are active, the worse the
performance.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
 fs/fuse/dev_uring.c | 61 +++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 50 insertions(+), 11 deletions(-)

diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index f5946bb1bbea930522921d49c04e047c70d21ee2..296592fe3651926ab4982b8d80694b3dac8bbffa 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -22,6 +22,7 @@ MODULE_PARM_DESC(enable_uring,
 #define FURING_Q_LOCAL_THRESHOLD 2
 #define FURING_Q_NUMA_THRESHOLD (FURING_Q_LOCAL_THRESHOLD + 1)
 #define FURING_Q_GLOBAL_THRESHOLD (FURING_Q_LOCAL_THRESHOLD * 2)
+#define FURING_NEXT_QUEUE_RETRIES 2
 
 bool fuse_uring_enabled(void)
 {
@@ -1262,7 +1263,8 @@ static void fuse_uring_send_in_task(struct io_uring_cmd *cmd,
  *  (Michael David Mitzenmacher, 1991)
  */
 static struct fuse_ring_queue *fuse_uring_best_queue(const struct cpumask *mask,
-						     struct fuse_ring *ring)
+						     struct fuse_ring *ring,
+						     bool background)
 {
 	unsigned int qid1, qid2;
 	struct fuse_ring_queue *queue1, *queue2;
@@ -1277,9 +1279,14 @@ static struct fuse_ring_queue *fuse_uring_best_queue(const struct cpumask *mask,
 	}
 
 	/* Get two different queues using optimized bounded random */
-	qid1 = cpumask_nth(get_random_u32_below(weight), mask);
+
+	do {
+		qid1 = cpumask_nth(get_random_u32_below(weight), mask);
+	} while (background && qid1 == task_cpu(current));
 	queue1 = READ_ONCE(ring->queues[qid1]);
 
+	return queue1;
+
 	do {
 		qid2 = cpumask_nth(get_random_u32_below(weight), mask);
 	} while (qid2 == qid1);
@@ -1298,12 +1305,14 @@ static struct fuse_ring_queue *fuse_uring_best_queue(const struct cpumask *mask,
 /*
  * Get the best queue for the current CPU
  */
-static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
+static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring,
+						    bool background)
 {
 	unsigned int qid;
 	struct fuse_ring_queue *local_queue, *best_numa, *best_global;
 	int local_node;
 	const struct cpumask *numa_mask, *global_mask;
+	int retries = 0;
 
 	qid = task_cpu(current);
 	if (WARN_ONCE(qid >= ring->max_nr_queues,
@@ -1311,16 +1320,44 @@ static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
 		      ring->max_nr_queues))
 		qid = 0;
 
-	local_queue = READ_ONCE(ring->queues[qid]);
 	local_node = cpu_to_node(qid);
 
-	/* Fast path: if local queue exists and is not overloaded, use it */
-	if (local_queue && local_queue->nr_reqs <= FURING_Q_LOCAL_THRESHOLD)
+	local_queue = READ_ONCE(ring->queues[qid]);
+
+retry:
+	/*
+	 * For background requests, try next CPU in same NUMA domain.
+	 * I.e. cpu-0 creates async requests, cpu-1 io processes.
+	 * Similar for foreground requests, when the local queue does not
+	 * exist - still better to always wake the same cpu id.
+	 */
+	if (background || !local_queue) {
+		numa_mask = ring->numa_registered_q_mask[local_node];
+		int weight = cpumask_weight(numa_mask);
+
+		if (weight > 0) {
+			int idx = (qid + 1) % weight;
+
+			qid = cpumask_nth(idx, numa_mask);
+		} else {
+			qid = cpumask_first(numa_mask);
+		}
+
+		local_queue = READ_ONCE(ring->queues[qid]);
+	}
+
+	if (local_queue && local_queue->nr_reqs <= FURING_Q_NUMA_THRESHOLD)
 		return local_queue;
 
+	if (retries < FURING_NEXT_QUEUE_RETRIES) {
+		retries++;
+		local_queue = NULL;
+		goto retry;
+	}
+
 	/* Find best NUMA-local queue */
 	numa_mask = ring->numa_registered_q_mask[local_node];
-	best_numa = fuse_uring_best_queue(numa_mask, ring);
+	best_numa = fuse_uring_best_queue(numa_mask, ring, background);
 
 	/* If NUMA queue is under threshold, use it */
 	if (best_numa && best_numa->nr_reqs <= FURING_Q_NUMA_THRESHOLD)
@@ -1328,7 +1365,7 @@ static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
 
 	/* NUMA queues above threshold, try global queues */
 	global_mask = ring->registered_q_mask;
-	best_global = fuse_uring_best_queue(global_mask, ring);
+	best_global = fuse_uring_best_queue(global_mask, ring, background);
 
 	/* Might happen during tear down */
 	if (!best_global)
@@ -1338,8 +1375,10 @@ static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
 	if (best_global->nr_reqs <= FURING_Q_GLOBAL_THRESHOLD)
 		return best_global;
 
+	return best_global;
+
 	/* Fall back to best available queue */
-	return best_numa ? best_numa : best_global;
+	// return best_numa ? best_numa : best_global;
 }
 
 static void fuse_uring_dispatch_ent(struct fuse_ring_ent *ent)
@@ -1360,7 +1399,7 @@ void fuse_uring_queue_fuse_req(struct fuse_iqueue *fiq, struct fuse_req *req)
 	int err;
 
 	err = -EINVAL;
-	queue = fuse_uring_get_queue(ring);
+	queue = fuse_uring_get_queue(ring, false);
 	if (!queue)
 		goto err;
 
@@ -1405,7 +1444,7 @@ bool fuse_uring_queue_bq_req(struct fuse_req *req)
 	struct fuse_ring_queue *queue;
 	struct fuse_ring_ent *ent = NULL;
 
-	queue = fuse_uring_get_queue(ring);
+	queue = fuse_uring_get_queue(ring, true);
 	if (!queue)
 		return false;
 

-- 
2.43.0



* [PATCH v2 7/7] fuse: Wake requests on the same cpu
  2025-10-03 10:06 [PATCH v2 0/7] fuse: {io-uring} Allow to reduce the number of queues and request distribution Bernd Schubert
                   ` (5 preceding siblings ...)
  2025-10-03 10:06 ` [PATCH v2 6/7] fuse: {io-uring} Queue background requests on a different core Bernd Schubert
@ 2025-10-03 10:06 ` Bernd Schubert
  6 siblings, 0 replies; 12+ messages in thread
From: Bernd Schubert @ 2025-10-03 10:06 UTC (permalink / raw)
  To: Miklos Szeredi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider
  Cc: Joanne Koong, linux-fsdevel, Bernd Schubert

For io-uring it makes sense to wake the waiting application (synchronous
IO) on the same core.

With queue-per-core

fio --directory=/tmp/dest --name=iops.\$jobnum --rw=randread --bs=4k \
    --size=1G --numjobs=1 --iodepth=1 --time_based --runtime=30s \
    --group_reporting --ioengine=psync --direct=1

no-io-uring
   READ: bw=116MiB/s (122MB/s), 116MiB/s-116MiB/s
no-io-uring wake on the same core (not part of this patch)
   READ: bw=115MiB/s (120MB/s), 115MiB/s-115MiB/s
unpatched
   READ: bw=260MiB/s (273MB/s), 260MiB/s-260MiB/s
patched
   READ: bw=345MiB/s (362MB/s), 345MiB/s-345MiB/s

Without io-uring and core bound fuse-server queues there is almost
no difference. In fact, fio results fluctuate strongly, between
85MB/s and 205MB/s during the run.

With --numjobs=8

unpatched
   READ: bw=2378MiB/s (2493MB/s), 2378MiB/s-2378MiB/s
patched
   READ: bw=2402MiB/s (2518MB/s), 2402MiB/s-2402MiB/s
(differences within the confidence interval)

'-o io_uring_q_mask=0-3:8-11' (16 core / 32 SMT core system) and

unpatched
   READ: bw=1286MiB/s (1348MB/s), 1286MiB/s-1286MiB/s
patched
   READ: bw=1561MiB/s (1637MB/s), 1561MiB/s-1561MiB/s

I.e. no differences with many application threads and queue-per-core,
but perf gain with overloaded queues - a bit surprising.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
 fs/fuse/dev.c        |  8 ++++++--
 include/linux/wait.h |  6 +++---
 kernel/sched/wait.c  | 12 ++++++++++++
 3 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 5150aa25e64be91e17fc45b1dbefb92491c81346..cbff7091124cb1d74e04ad40d9f461b4815bcca4 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -479,8 +479,12 @@ void fuse_request_end(struct fuse_req *req)
 		flush_bg_queue(fc);
 		spin_unlock(&fc->bg_lock);
 	} else {
-		/* Wake up waiter sleeping in request_wait_answer() */
-		wake_up(&req->waitq);
+		if (test_bit(FR_URING, &req->flags)) {
+			wake_up_on_current_cpu(&req->waitq);
+		} else {
+			/* Wake up waiter sleeping in request_wait_answer() */
+			wake_up(&req->waitq);
+		}
 	}
 
 	if (test_bit(FR_ASYNC, &req->flags))
diff --git a/include/linux/wait.h b/include/linux/wait.h
index 09855d8194180e1848db857b2af95112df91128c..595979a601f8a943438482a33e8af2d20979d409 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -219,6 +219,7 @@ void __wake_up_sync(struct wait_queue_head *wq_head, unsigned int mode);
 void __wake_up_pollfree(struct wait_queue_head *wq_head);
 
 #define wake_up(x)			__wake_up(x, TASK_NORMAL, 1, NULL)
+#define wake_up_on_current_cpu(x)	__wake_up_on_current_cpu(x, TASK_NORMAL, NULL)
 #define wake_up_nr(x, nr)		__wake_up(x, TASK_NORMAL, nr, NULL)
 #define wake_up_all(x)			__wake_up(x, TASK_NORMAL, 0, NULL)
 #define wake_up_locked(x)		__wake_up_locked((x), TASK_NORMAL, 1)
@@ -479,9 +480,8 @@ do {										\
 	__wait_event_cmd(wq_head, condition, cmd1, cmd2);			\
 } while (0)
 
-#define __wait_event_interruptible(wq_head, condition)				\
-	___wait_event(wq_head, condition, TASK_INTERRUPTIBLE, 0, 0,		\
-		      schedule())
+#define __wait_event_interruptible(wq_head, condition) \
+	___wait_event(wq_head, condition, TASK_INTERRUPTIBLE, 0, 0, schedule())
 
 /**
  * wait_event_interruptible - sleep until a condition gets true
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 20f27e2cf7aec691af040fcf2236a20374ec66bf..1c6943a620ae389590a9d06577b998c320310923 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -147,10 +147,22 @@ int __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
 }
 EXPORT_SYMBOL(__wake_up);
 
+/**
+ * __wake_up_on_current_cpu - wake up threads blocked on a waitqueue, on the
+ *			      current cpu
+ * @wq_head: the waitqueue
+ * @mode: which threads
+ * @key: is directly passed to the wakeup function
+ *
+ * Wakes up at most one exclusive waiter. If this function wakes up a task,
+ * it executes a full memory barrier before accessing the task state, and the
+ * woken task is hinted to run on the current cpu (WF_CURRENT_CPU).
+ */
 void __wake_up_on_current_cpu(struct wait_queue_head *wq_head, unsigned int mode, void *key)
 {
 	__wake_up_common_lock(wq_head, mode, 1, WF_CURRENT_CPU, key);
 }
+EXPORT_SYMBOL_GPL(__wake_up_on_current_cpu);
 
 /*
  * Same as __wake_up but called with the spinlock in wait_queue_head_t held.

-- 
2.43.0



* Re: [PATCH v2 3/7] fuse: {io-uring} Use bitmaps to track registered queues
  2025-10-03 10:06 ` [PATCH v2 3/7] fuse: {io-uring} Use bitmaps to track registered queues Bernd Schubert
@ 2025-10-06  9:51   ` Luis Henriques
  0 siblings, 0 replies; 12+ messages in thread
From: Luis Henriques @ 2025-10-06  9:51 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Joanne Koong, linux-fsdevel

On Fri, Oct 03 2025, Bernd Schubert wrote:

> Add per-CPU and per-NUMA node bitmasks to track which
> io-uring queues are registered.
>
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> ---
>  fs/fuse/dev_uring.c   | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/fuse/dev_uring_i.h |  9 +++++++++
>  2 files changed, 63 insertions(+)
>
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index 0f5ab27dacb66c9f5f10eac2713d9bd3eb4c26da..dacc07f5b5b1a48acefa278279f851c3ae2b1489 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -18,6 +18,8 @@ MODULE_PARM_DESC(enable_uring,
>  
>  #define FUSE_URING_IOV_SEGS 2 /* header and payload */
>  
> +/* Number of queued fuse requests until a queue is considered full */
> +#define FUSE_URING_QUEUE_THRESHOLD 5

Nit: I guess this hunk can be removed from this patch as this constant is
never used and is removed in the next patch.

Cheers,
-- 
Luís

>  bool fuse_uring_enabled(void)
>  {
> @@ -184,6 +186,18 @@ bool fuse_uring_request_expired(struct fuse_conn *fc)
>  	return false;
>  }
>  
> +static void fuse_ring_destruct_q_masks(struct fuse_ring *ring)
> +{
> +	int node;
> +
> +	free_cpumask_var(ring->registered_q_mask);
> +	if (ring->numa_registered_q_mask) {
> +		for (node = 0; node < ring->nr_numa_nodes; node++)
> +			free_cpumask_var(ring->numa_registered_q_mask[node]);
> +		kfree(ring->numa_registered_q_mask);
> +	}
> +}
> +
>  void fuse_uring_destruct(struct fuse_conn *fc)
>  {
>  	struct fuse_ring *ring = fc->ring;
> @@ -215,11 +229,32 @@ void fuse_uring_destruct(struct fuse_conn *fc)
>  		ring->queues[qid] = NULL;
>  	}
>  
> +	fuse_ring_destruct_q_masks(ring);
>  	kfree(ring->queues);
>  	kfree(ring);
>  	fc->ring = NULL;
>  }
>  
> +static int fuse_ring_create_q_masks(struct fuse_ring *ring)
> +{
> +	int node;
> +
> +	if (!zalloc_cpumask_var(&ring->registered_q_mask, GFP_KERNEL_ACCOUNT))
> +		return -ENOMEM;
> +
> +	ring->numa_registered_q_mask = kcalloc(
> +		ring->nr_numa_nodes, sizeof(cpumask_var_t), GFP_KERNEL_ACCOUNT);
> +	if (!ring->numa_registered_q_mask)
> +		return -ENOMEM;
> +	for (node = 0; node < ring->nr_numa_nodes; node++) {
> +		if (!zalloc_cpumask_var(&ring->numa_registered_q_mask[node],
> +					GFP_KERNEL_ACCOUNT))
> +			return -ENOMEM;
> +	}
> +
> +	return 0;
> +}
> +
>  /*
>   * Basic ring setup for this connection based on the provided configuration
>   */
> @@ -229,11 +264,14 @@ static struct fuse_ring *fuse_uring_create(struct fuse_conn *fc)
>  	size_t nr_queues = num_possible_cpus();
>  	struct fuse_ring *res = NULL;
>  	size_t max_payload_size;
> +	int err;
>  
>  	ring = kzalloc(sizeof(*fc->ring), GFP_KERNEL_ACCOUNT);
>  	if (!ring)
>  		return NULL;
>  
> +	ring->nr_numa_nodes = num_online_nodes();
> +
>  	ring->queues = kcalloc(nr_queues, sizeof(struct fuse_ring_queue *),
>  			       GFP_KERNEL_ACCOUNT);
>  	if (!ring->queues)
> @@ -242,6 +280,10 @@ static struct fuse_ring *fuse_uring_create(struct fuse_conn *fc)
>  	max_payload_size = max(FUSE_MIN_READ_BUFFER, fc->max_write);
>  	max_payload_size = max(max_payload_size, fc->max_pages * PAGE_SIZE);
>  
> +	err = fuse_ring_create_q_masks(ring);
> +	if (err)
> +		goto out_err;
> +
>  	spin_lock(&fc->lock);
>  	if (fc->ring) {
>  		/* race, another thread created the ring in the meantime */
> @@ -261,6 +303,7 @@ static struct fuse_ring *fuse_uring_create(struct fuse_conn *fc)
>  	return ring;
>  
>  out_err:
> +	fuse_ring_destruct_q_masks(ring);
>  	kfree(ring->queues);
>  	kfree(ring);
>  	return res;
> @@ -423,6 +466,7 @@ static void fuse_uring_log_ent_state(struct fuse_ring *ring)
>  			pr_info(" ent-commit-queue ring=%p qid=%d ent=%p state=%d\n",
>  				ring, qid, ent, ent->state);
>  		}
> +
>  		spin_unlock(&queue->lock);
>  	}
>  	ring->stop_debug_log = 1;
> @@ -469,6 +513,7 @@ static void fuse_uring_async_stop_queues(struct work_struct *work)
>  void fuse_uring_stop_queues(struct fuse_ring *ring)
>  {
>  	int qid;
> +	int node;
>  
>  	for (qid = 0; qid < ring->max_nr_queues; qid++) {
>  		struct fuse_ring_queue *queue = READ_ONCE(ring->queues[qid]);
> @@ -479,6 +524,11 @@ void fuse_uring_stop_queues(struct fuse_ring *ring)
>  		fuse_uring_teardown_entries(queue);
>  	}
>  
> +	/* Reset all queue masks, we won't process any more IO */
> +	cpumask_clear(ring->registered_q_mask);
> +	for (node = 0; node < ring->nr_numa_nodes; node++)
> +		cpumask_clear(ring->numa_registered_q_mask[node]);
> +
>  	if (atomic_read(&ring->queue_refs) > 0) {
>  		ring->teardown_time = jiffies;
>  		INIT_DELAYED_WORK(&ring->async_teardown_work,
> @@ -982,6 +1032,7 @@ static void fuse_uring_do_register(struct fuse_ring_ent *ent,
>  	struct fuse_ring *ring = queue->ring;
>  	struct fuse_conn *fc = ring->fc;
>  	struct fuse_iqueue *fiq = &fc->iq;
> +	int node = cpu_to_node(queue->qid);
>  
>  	fuse_uring_prepare_cancel(cmd, issue_flags, ent);
>  
> @@ -990,6 +1041,9 @@ static void fuse_uring_do_register(struct fuse_ring_ent *ent,
>  	fuse_uring_ent_avail(ent, queue);
>  	spin_unlock(&queue->lock);
>  
> +	cpumask_set_cpu(queue->qid, ring->registered_q_mask);
> +	cpumask_set_cpu(queue->qid, ring->numa_registered_q_mask[node]);
> +
>  	if (!ring->ready) {
>  		bool ready = is_ring_ready(ring, queue->qid);
>  
> diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
> index 708412294982566919122a1a0d7f741217c763ce..35e3b6808b60398848965afd3091b765444283ff 100644
> --- a/fs/fuse/dev_uring_i.h
> +++ b/fs/fuse/dev_uring_i.h
> @@ -115,6 +115,9 @@ struct fuse_ring {
>  	/* number of ring queues */
>  	size_t max_nr_queues;
>  
> +	/* number of numa nodes */
> +	int nr_numa_nodes;
> +
>  	/* maximum payload/arg size */
>  	size_t max_payload_sz;
>  
> @@ -125,6 +128,12 @@ struct fuse_ring {
>  	 */
>  	unsigned int stop_debug_log : 1;
>  
> +	/* Tracks which queues are registered */
> +	cpumask_var_t registered_q_mask;
> +
> +	/* Tracks which queues are registered per NUMA node */
> +	cpumask_var_t *numa_registered_q_mask;
> +
>  	wait_queue_head_t stop_waitq;
>  
>  	/* async tear down */
>
> -- 
> 2.43.0
>
>



* Re: [PATCH v2 6/7] fuse: {io-uring} Queue background requests on a different core
  2025-10-03 10:06 ` [PATCH v2 6/7] fuse: {io-uring} Queue background requests on a different core Bernd Schubert
@ 2025-10-06  9:53   ` Luis Henriques
  2025-10-06 10:31     ` Bernd Schubert
  0 siblings, 1 reply; 12+ messages in thread
From: Luis Henriques @ 2025-10-06  9:53 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Joanne Koong, linux-fsdevel

On Fri, Oct 03 2025, Bernd Schubert wrote:

> Running background IO on a different core makes quite a difference.
>
> fio --directory=/tmp/dest --name=iops.\$jobnum --rw=randread \
> --bs=4k --size=1G --numjobs=1 --iodepth=4 --time_based \
> --runtime=30s --group_reporting --ioengine=io_uring \
> --direct=1
>
> unpatched
>    READ: bw=272MiB/s (285MB/s), 272MiB/s-272MiB/s ...
> patched
>    READ: bw=760MiB/s (797MB/s), 760MiB/s-760MiB/s ...
>
> With --iodepth=8
>
> unpatched
>    READ: bw=466MiB/s (489MB/s), 466MiB/s-466MiB/s ...
> patched
>    READ: bw=966MiB/s (1013MB/s), 966MiB/s-966MiB/s ...
> 2nd run:
>    READ: bw=1014MiB/s (1064MB/s), 1014MiB/s-1014MiB/s ...
>
> Without io-uring (--iodepth=8)
>    READ: bw=729MiB/s (764MB/s), 729MiB/s-729MiB/s ...
>
> Without fuse (--iodepth=8)
>    READ: bw=2199MiB/s (2306MB/s), 2199MiB/s-2199MiB/s ...
>
> (Tests were done with
> <libfuse>/example/passthrough_hp -o allow_other --nopassthrough  \
> [-o io_uring] /tmp/source /tmp/dest
> )
>
> Additional notes:
>
> With FURING_NEXT_QUEUE_RETRIES=0 (--iodepth=8)
>    READ: bw=903MiB/s (946MB/s), 903MiB/s-903MiB/s ...
>
> With just a random qid (--iodepth=8)
>    READ: bw=429MiB/s (450MB/s), 429MiB/s-429MiB/s ...
>
> With --iodepth=1
> unpatched
>    READ: bw=195MiB/s (204MB/s), 195MiB/s-195MiB/s ...
> patched
>    READ: bw=232MiB/s (243MB/s), 232MiB/s-232MiB/s ...
>
> With --iodepth=1 --numjobs=2
> unpatched
>    READ: bw=966MiB/s (1013MB/s), 966MiB/s-966MiB/s ...
> patched
>    READ: bw=1821MiB/s (1909MB/s), 1821MiB/s-1821MiB/s ...
>
> With --iodepth=1 --numjobs=8
> unpatched
>    READ: bw=1138MiB/s (1193MB/s), 1138MiB/s-1138MiB/s ...
> patched
>    READ: bw=1650MiB/s (1730MB/s), 1650MiB/s-1650MiB/s ...
> fuse without io-uring
>    READ: bw=1314MiB/s (1378MB/s), 1314MiB/s-1314MiB/s ...
> no-fuse
>    READ: bw=2566MiB/s (2690MB/s), 2566MiB/s-2566MiB/s ...
>
> In summary, for async requests the core doing application IO is busy
> sending requests, so processing IOs should be done on a different core.
> Spreading the load on random cores is also not desirable, as the core
> might be frequency scaled down and/or in C1 sleep states. Not shown here,
> but differences are much smaller when the system uses performance governor
> instead of schedutil (ubuntu default). Obviously at the cost of higher
> system power consumption for performance governor - not desirable either.
>
> Results without io-uring (which uses fixed libfuse threads per queue)
> heavily depend on the current number of active threads. Libfuse uses
> a default of max 10 threads, but the actual max number of threads is a
> parameter. Also, no-fuse-io-uring results heavily depend on whether
> another workload was already running before, as libfuse starts these
> threads dynamically - i.e. the more threads are active, the worse the
> performance.
>
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> ---
>  fs/fuse/dev_uring.c | 61 +++++++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 50 insertions(+), 11 deletions(-)
>
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index f5946bb1bbea930522921d49c04e047c70d21ee2..296592fe3651926ab4982b8d80694b3dac8bbffa 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -22,6 +22,7 @@ MODULE_PARM_DESC(enable_uring,
>  #define FURING_Q_LOCAL_THRESHOLD 2
>  #define FURING_Q_NUMA_THRESHOLD (FURING_Q_LOCAL_THRESHOLD + 1)
>  #define FURING_Q_GLOBAL_THRESHOLD (FURING_Q_LOCAL_THRESHOLD * 2)
> +#define FURING_NEXT_QUEUE_RETRIES 2
>  
>  bool fuse_uring_enabled(void)
>  {
> @@ -1262,7 +1263,8 @@ static void fuse_uring_send_in_task(struct io_uring_cmd *cmd,
>   *  (Michael David Mitzenmacher, 1991)
>   */
>  static struct fuse_ring_queue *fuse_uring_best_queue(const struct cpumask *mask,
> -						     struct fuse_ring *ring)
> +						     struct fuse_ring *ring,
> +						     bool background)
>  {
>  	unsigned int qid1, qid2;
>  	struct fuse_ring_queue *queue1, *queue2;
> @@ -1277,9 +1279,14 @@ static struct fuse_ring_queue *fuse_uring_best_queue(const struct cpumask *mask,
>  	}
>  
>  	/* Get two different queues using optimized bounded random */
> -	qid1 = cpumask_nth(get_random_u32_below(weight), mask);
> +
> +	do {
> +		qid1 = cpumask_nth(get_random_u32_below(weight), mask);
> +	} while (background && qid1 == task_cpu(current));
>  	queue1 = READ_ONCE(ring->queues[qid1]);
>  
> +	return queue1;

Hmmm?  I guess this was left from some local testing, right?

Cheers,
-- 
Luís


> +
>  	do {
>  		qid2 = cpumask_nth(get_random_u32_below(weight), mask);
>  	} while (qid2 == qid1);
> @@ -1298,12 +1305,14 @@ static struct fuse_ring_queue *fuse_uring_best_queue(const struct cpumask *mask,
>  /*
>   * Get the best queue for the current CPU
>   */
> -static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
> +static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring,
> +						    bool background)
>  {
>  	unsigned int qid;
>  	struct fuse_ring_queue *local_queue, *best_numa, *best_global;
>  	int local_node;
>  	const struct cpumask *numa_mask, *global_mask;
> +	int retries = 0;
>  
>  	qid = task_cpu(current);
>  	if (WARN_ONCE(qid >= ring->max_nr_queues,
> @@ -1311,16 +1320,44 @@ static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
>  		      ring->max_nr_queues))
>  		qid = 0;
>  
> -	local_queue = READ_ONCE(ring->queues[qid]);
>  	local_node = cpu_to_node(qid);
>  
> -	/* Fast path: if local queue exists and is not overloaded, use it */
> -	if (local_queue && local_queue->nr_reqs <= FURING_Q_LOCAL_THRESHOLD)
> +	local_queue = READ_ONCE(ring->queues[qid]);
> +
> +retry:
> +	/*
> +	 * For background requests, try next CPU in same NUMA domain.
> +	 * I.e. cpu-0 creates async requests, cpu-1 io processes.
> +	 * Similar for foreground requests, when the local queue does not
> +	 * exist - still better to always wake the same cpu id.
> +	 */
> +	if (background || !local_queue) {
> +		numa_mask = ring->numa_registered_q_mask[local_node];
> +		int weight = cpumask_weight(numa_mask);
> +
> +		if (weight > 0) {
> +			int idx = (qid + 1) % weight;
> +
> +			qid = cpumask_nth(idx, numa_mask);
> +		} else {
> +			qid = cpumask_first(numa_mask);
> +		}
> +
> +		local_queue = READ_ONCE(ring->queues[qid]);
> +	}
> +
> +	if (local_queue && local_queue->nr_reqs <= FURING_Q_NUMA_THRESHOLD)
>  		return local_queue;
>  
> +	if (retries < FURING_NEXT_QUEUE_RETRIES) {
> +		retries++;
> +		local_queue = NULL;
> +		goto retry;
> +	}
> +
>  	/* Find best NUMA-local queue */
>  	numa_mask = ring->numa_registered_q_mask[local_node];
> -	best_numa = fuse_uring_best_queue(numa_mask, ring);
> +	best_numa = fuse_uring_best_queue(numa_mask, ring, background);
>  
>  	/* If NUMA queue is under threshold, use it */
>  	if (best_numa && best_numa->nr_reqs <= FURING_Q_NUMA_THRESHOLD)
> @@ -1328,7 +1365,7 @@ static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
>  
>  	/* NUMA queues above threshold, try global queues */
>  	global_mask = ring->registered_q_mask;
> -	best_global = fuse_uring_best_queue(global_mask, ring);
> +	best_global = fuse_uring_best_queue(global_mask, ring, background);
>  
>  	/* Might happen during tear down */
>  	if (!best_global)
> @@ -1338,8 +1375,10 @@ static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *ring)
>  	if (best_global->nr_reqs <= FURING_Q_GLOBAL_THRESHOLD)
>  		return best_global;
>  
> +	return best_global;
> +
>  	/* Fall back to best available queue */
> -	return best_numa ? best_numa : best_global;
> +	// return best_numa ? best_numa : best_global;
>  }
>  
>  static void fuse_uring_dispatch_ent(struct fuse_ring_ent *ent)
> @@ -1360,7 +1399,7 @@ void fuse_uring_queue_fuse_req(struct fuse_iqueue *fiq, struct fuse_req *req)
>  	int err;
>  
>  	err = -EINVAL;
> -	queue = fuse_uring_get_queue(ring);
> +	queue = fuse_uring_get_queue(ring, false);
>  	if (!queue)
>  		goto err;
>  
> @@ -1405,7 +1444,7 @@ bool fuse_uring_queue_bq_req(struct fuse_req *req)
>  	struct fuse_ring_queue *queue;
>  	struct fuse_ring_ent *ent = NULL;
>  
> -	queue = fuse_uring_get_queue(ring);
> +	queue = fuse_uring_get_queue(ring, true);
>  	if (!queue)
>  		return false;
>  
>
> -- 
> 2.43.0
>
>



* Re: [PATCH v2 6/7] fuse: {io-uring} Queue background requests on a different core
  2025-10-06  9:53   ` Luis Henriques
@ 2025-10-06 10:31     ` Bernd Schubert
  0 siblings, 0 replies; 12+ messages in thread
From: Bernd Schubert @ 2025-10-06 10:31 UTC (permalink / raw)
  To: Luis Henriques
  Cc: Miklos Szeredi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Joanne Koong,
	linux-fsdevel@vger.kernel.org

On 10/6/25 11:53, Luis Henriques wrote:
> On Fri, Oct 03 2025, Bernd Schubert wrote:
> 
>> Running background IO on a different core makes quite a difference.
>>
>> fio --directory=/tmp/dest --name=iops.\$jobnum --rw=randread \
>> --bs=4k --size=1G --numjobs=1 --iodepth=4 --time_based \
>> --runtime=30s --group_reporting --ioengine=io_uring \
>> --direct=1
>>
>> unpatched
>>    READ: bw=272MiB/s (285MB/s), 272MiB/s-272MiB/s ...
>> patched
>>    READ: bw=760MiB/s (797MB/s), 760MiB/s-760MiB/s ...
>>
>> With --iodepth=8
>>
>> unpatched
>>    READ: bw=466MiB/s (489MB/s), 466MiB/s-466MiB/s ...
>> patched
>>    READ: bw=966MiB/s (1013MB/s), 966MiB/s-966MiB/s ...
>> 2nd run:
>>    READ: bw=1014MiB/s (1064MB/s), 1014MiB/s-1014MiB/s ...
>>
>> Without io-uring (--iodepth=8)
>>    READ: bw=729MiB/s (764MB/s), 729MiB/s-729MiB/s ...
>>
>> Without fuse (--iodepth=8)
>>    READ: bw=2199MiB/s (2306MB/s), 2199MiB/s-2199MiB/s ...
>>
>> (Tests were done with
>> <libfuse>/example/passthrough_hp -o allow_other --nopassthrough  \
>> [-o io_uring] /tmp/source /tmp/dest
>> )
>>
>> Additional notes:
>>
>> With FURING_NEXT_QUEUE_RETRIES=0 (--iodepth=8)
>>    READ: bw=903MiB/s (946MB/s), 903MiB/s-903MiB/s ...
>>
>> With just a random qid (--iodepth=8)
>>    READ: bw=429MiB/s (450MB/s), 429MiB/s-429MiB/s ...
>>
>> With --iodepth=1
>> unpatched
>>    READ: bw=195MiB/s (204MB/s), 195MiB/s-195MiB/s ...
>> patched
>>    READ: bw=232MiB/s (243MB/s), 232MiB/s-232MiB/s ...
>>
>> With --iodepth=1 --numjobs=2
>> unpatched
>>    READ: bw=966MiB/s (1013MB/s), 966MiB/s-966MiB/s ...
>> patched
>>    READ: bw=1821MiB/s (1909MB/s), 1821MiB/s-1821MiB/s ...
>>
>> With --iodepth=1 --numjobs=8
>> unpatched
>>    READ: bw=1138MiB/s (1193MB/s), 1138MiB/s-1138MiB/s ...
>> patched
>>    READ: bw=1650MiB/s (1730MB/s), 1650MiB/s-1650MiB/s ...
>> fuse without io-uring
>>    READ: bw=1314MiB/s (1378MB/s), 1314MiB/s-1314MiB/s ...
>> no-fuse
>>    READ: bw=2566MiB/s (2690MB/s), 2566MiB/s-2566MiB/s ...
>>
>> In summary, for async requests the core doing application IO is busy
>> sending requests, so processing IOs should be done on a different core.
>> Spreading the load on random cores is also not desirable, as the core
>> might be frequency scaled down and/or in C1 sleep states. Not shown here,
>> but differences are much smaller when the system uses performance governor
>> instead of schedutil (ubuntu default). Obviously at the cost of higher
>> system power consumption for performance governor - not desirable either.
>>
>> Results without io-uring (which uses fixed libfuse threads per queue)
>> heavily depend on the current number of active threads. Libfuse uses
>> a default of max 10 threads, but the actual max number of threads is a
>> parameter. Also, no-fuse-io-uring results heavily depend on whether
>> another workload was already running before, as libfuse starts these
>> threads dynamically - i.e. the more threads are active, the worse the
>> performance.
>>
>> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
>> ---
>>  fs/fuse/dev_uring.c | 61 +++++++++++++++++++++++++++++++++++++++++++----------
>>  1 file changed, 50 insertions(+), 11 deletions(-)
>>
>> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
>> index f5946bb1bbea930522921d49c04e047c70d21ee2..296592fe3651926ab4982b8d80694b3dac8bbffa 100644
>> --- a/fs/fuse/dev_uring.c
>> +++ b/fs/fuse/dev_uring.c
>> @@ -22,6 +22,7 @@ MODULE_PARM_DESC(enable_uring,
>>  #define FURING_Q_LOCAL_THRESHOLD 2
>>  #define FURING_Q_NUMA_THRESHOLD (FURING_Q_LOCAL_THRESHOLD + 1)
>>  #define FURING_Q_GLOBAL_THRESHOLD (FURING_Q_LOCAL_THRESHOLD * 2)
>> +#define FURING_NEXT_QUEUE_RETRIES 2
>>  
>>  bool fuse_uring_enabled(void)
>>  {
>> @@ -1262,7 +1263,8 @@ static void fuse_uring_send_in_task(struct io_uring_cmd *cmd,
>>   *  (Michael David Mitzenmacher, 1991)
>>   */
>>  static struct fuse_ring_queue *fuse_uring_best_queue(const struct cpumask *mask,
>> -						     struct fuse_ring *ring)
>> +						     struct fuse_ring *ring,
>> +						     bool background)
>>  {
>>  	unsigned int qid1, qid2;
>>  	struct fuse_ring_queue *queue1, *queue2;
>> @@ -1277,9 +1279,14 @@ static struct fuse_ring_queue *fuse_uring_best_queue(const struct cpumask *mask,
>>  	}
>>  
>>  	/* Get two different queues using optimized bounded random */
>> -	qid1 = cpumask_nth(get_random_u32_below(weight), mask);
>> +
>> +	do {
>> +		qid1 = cpumask_nth(get_random_u32_below(weight), mask);
>> +	} while (background && qid1 == task_cpu(current));
>>  	queue1 = READ_ONCE(ring->queues[qid1]);
>>  
>> +	return queue1;
> 
> Hmmm?  I guess this was left from some local testing, right?

Oh yeah, sorry, thanks for spotting that.

Thanks,
Bernd




* Re: [PATCH v2 5/7] fuse: {io-uring} Allow reduced number of ring queues
  2025-10-03 10:06 ` [PATCH v2 5/7] fuse: {io-uring} Allow reduced number of ring queues Bernd Schubert
@ 2025-10-06 10:35   ` Bernd Schubert
  0 siblings, 0 replies; 12+ messages in thread
From: Bernd Schubert @ 2025-10-06 10:35 UTC (permalink / raw)
  To: Miklos Szeredi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider
  Cc: Joanne Koong, linux-fsdevel@vger.kernel.org

On 10/3/25 12:06, Bernd Schubert wrote:
> Queue selection (fuse_uring_get_queue) can handle a reduced number
> of queues - using io-uring is possible now even with a single
> queue and entry.
> 
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> ---
>  fs/fuse/dev_uring.c | 35 +++--------------------------------
>  1 file changed, 3 insertions(+), 32 deletions(-)
> 
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index bb5d7a98536963ec2e4c10982d33633db2573f4d..f5946bb1bbea930522921d49c04e047c70d21ee2 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -998,31 +998,6 @@ static int fuse_uring_commit_fetch(struct io_uring_cmd *cmd, int issue_flags,
>  	return 0;
>  }
>  
> -static bool is_ring_ready(struct fuse_ring *ring, int current_qid)
> -{
> -	int qid;
> -	struct fuse_ring_queue *queue;
> -	bool ready = true;
> -
> -	for (qid = 0; qid < ring->max_nr_queues && ready; qid++) {
> -		if (current_qid == qid)
> -			continue;
> -
> -		queue = ring->queues[qid];
> -		if (!queue) {
> -			ready = false;
> -			break;
> -		}
> -
> -		spin_lock(&queue->lock);
> -		if (list_empty(&queue->ent_avail_queue))
> -			ready = false;
> -		spin_unlock(&queue->lock);
> -	}
> -
> -	return ready;
> -}
> -
>  /*
>   * fuse_uring_req_fetch command handling
>   */
> @@ -1047,13 +1022,9 @@ static void fuse_uring_do_register(struct fuse_ring_ent *ent,
>  	cpumask_set_cpu(queue->qid, ring->numa_registered_q_mask[node]);
>  
>  	if (!ring->ready) {
> -		bool ready = is_ring_ready(ring, queue->qid);
> -
> -		if (ready) {
> -			WRITE_ONCE(fiq->ops, &fuse_io_uring_ops);
> -			WRITE_ONCE(ring->ready, true);
> -			wake_up_all(&fc->blocked_waitq);
> -		}
> +		WRITE_ONCE(fiq->ops, &fuse_io_uring_ops);
> +		WRITE_ONCE(ring->ready, true);
> +		wake_up_all(&fc->blocked_waitq);
>  	}
>  }
>  
> 

We actually need to add a FUSE_INIT flag: fuse-server needs
to know when it can reduce the number of queues. If fuse-server
used a lower number of queues unconditionally, the mount point
would hang after FUSE_INIT on kernels without this change, as
those still wait for all queues to be registered.


Thanks,
Bernd


Thread overview: 12+ messages
2025-10-03 10:06 [PATCH v2 0/7] fuse: {io-uring} Allow to reduce the number of queues and request distribution Bernd Schubert
2025-10-03 10:06 ` [PATCH v2 1/7] fuse: {io-uring} Add queue length counters Bernd Schubert
2025-10-03 10:06 ` [PATCH v2 2/7] fuse: {io-uring} Rename ring->nr_queues to max_nr_queues Bernd Schubert
2025-10-03 10:06 ` [PATCH v2 3/7] fuse: {io-uring} Use bitmaps to track registered queues Bernd Schubert
2025-10-06  9:51   ` Luis Henriques
2025-10-03 10:06 ` [PATCH v2 4/7] fuse: {io-uring} Distribute load among queues Bernd Schubert
2025-10-03 10:06 ` [PATCH v2 5/7] fuse: {io-uring} Allow reduced number of ring queues Bernd Schubert
2025-10-06 10:35   ` Bernd Schubert
2025-10-03 10:06 ` [PATCH v2 6/7] fuse: {io-uring} Queue background requests on a different core Bernd Schubert
2025-10-06  9:53   ` Luis Henriques
2025-10-06 10:31     ` Bernd Schubert
2025-10-03 10:06 ` [PATCH v2 7/7] fuse: Wake requests on the same cpu Bernd Schubert
