cgroups.vger.kernel.org archive mirror
* [PATCH RFC v2 0/5] blk-mq-sched: support request batch dispatching for sq elevator
@ 2025-06-14  9:25 Yu Kuai
  2025-06-14  9:25 ` [PATCH RFC v2 1/5] elevator: introduce global lock for sq_shared elevator Yu Kuai
                   ` (5 more replies)
  0 siblings, 6 replies; 15+ messages in thread
From: Yu Kuai @ 2025-06-14  9:25 UTC (permalink / raw)
  To: ming.lei, yukuai3, tj, josef, axboe
  Cc: linux-block, cgroups, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

Before this patch, each dispatch context holds a global lock to
dispatch one request at a time, which introduces intense lock contention:

lock
ops.dispatch_request
unlock

Hence, support dispatching a batch of requests while holding the lock to
reduce lock contention.
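
With this set, the hot path becomes, roughly (a simplified sketch; the
actual batched loop in patch 5 also acquires a dispatch budget and a
driver tag for each request):

lock
ops.dispatch_request, repeated for up to a batch of requests
unlock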

nullblk setup:
modprobe null_blk nr_devices=0 &&
    udevadm settle &&
    cd /sys/kernel/config/nullb &&
    mkdir nullb0 &&
    cd nullb0 &&
    echo 0 > completion_nsec &&
    echo 512 > blocksize &&
    echo 0 > home_node &&
    echo 0 > irqmode &&
    echo 128 > submit_queues &&
    echo 1024 > hw_queue_depth &&
    echo 1024 > size &&
    echo 0 > memory_backed &&
    echo 2 > queue_mode &&
    echo 1 > power ||
    exit $?

Test script:
fio -filename=/dev/$disk -name=test -rw=randwrite -bs=4k -iodepth=32 \
  -numjobs=16 --iodepth_batch_submit=8 --iodepth_batch_complete=8 \
  -direct=1 -ioengine=io_uring -group_reporting -time_based -runtime=30

Test result (elevator is deadline): IOPS
|                 | null_blk | scsi hdd |
| --------------- | -------- | -------- |
| before this set | 263k     | 24       |
| after this set  | 475k     | 272      |

Yu Kuai (5):
  elevator: introduce global lock for sq_shared elevator
  mq-deadline: switch to use elevator lock
  block, bfq: switch to use elevator lock
  blk-mq-sched: refactor __blk_mq_do_dispatch_sched()
  blk-mq-sched: support request batch dispatching for sq elevator

 block/bfq-cgroup.c   |   4 +-
 block/bfq-iosched.c  |  53 ++++------
 block/bfq-iosched.h  |   2 +-
 block/blk-mq-sched.c | 226 +++++++++++++++++++++++++++++--------------
 block/blk-mq.c       |   5 +-
 block/blk-mq.h       |  21 ++++
 block/elevator.c     |   1 +
 block/elevator.h     |  61 +++++++++++-
 block/mq-deadline.c  |  58 +++++------
 9 files changed, 282 insertions(+), 149 deletions(-)

-- 
2.39.2


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH RFC v2 1/5] elevator: introduce global lock for sq_shared elevator
  2025-06-14  9:25 [PATCH RFC v2 0/5] blk-mq-sched: support request batch dispatching for sq elevator Yu Kuai
@ 2025-06-14  9:25 ` Yu Kuai
  2025-06-16  2:38   ` Damien Le Moal
  2025-06-14  9:25 ` [PATCH RFC v2 2/5] mq-deadline: switch to use elevator lock Yu Kuai
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 15+ messages in thread
From: Yu Kuai @ 2025-06-14  9:25 UTC (permalink / raw)
  To: ming.lei, yukuai3, tj, josef, axboe
  Cc: linux-block, cgroups, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

Currently, both mq-deadline and bfq have an internal global lock; prepare
to convert them to use this high-level lock and support batch request
dispatching.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-sched.c |  4 +--
 block/blk-mq.c       |  5 ++--
 block/elevator.c     |  1 +
 block/elevator.h     | 61 ++++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 64 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 55a0fd105147..c1390d3e6381 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -113,7 +113,7 @@ static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 		if (budget_token < 0)
 			break;
 
-		rq = e->type->ops.dispatch_request(hctx);
+		rq = elevator_dispatch_request(hctx);
 		if (!rq) {
 			blk_mq_put_dispatch_budget(q, budget_token);
 			/*
@@ -342,7 +342,7 @@ bool blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio,
 	enum hctx_type type;
 
 	if (e && e->type->ops.bio_merge) {
-		ret = e->type->ops.bio_merge(q, bio, nr_segs);
+		ret = elevator_bio_merge(q, bio, nr_segs);
 		goto out_put;
 	}
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4806b867e37d..2650b7b28d1e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2637,7 +2637,7 @@ static void blk_mq_insert_request(struct request *rq, blk_insert_t flags)
 		WARN_ON_ONCE(rq->tag != BLK_MQ_NO_TAG);
 
 		list_add(&rq->queuelist, &list);
-		q->elevator->type->ops.insert_requests(hctx, &list, flags);
+		elevator_insert_requests(hctx, &list, flags);
 	} else {
 		trace_block_rq_insert(rq);
 
@@ -2912,8 +2912,7 @@ static void blk_mq_dispatch_list(struct rq_list *rqs, bool from_sched)
 		spin_unlock(&this_hctx->lock);
 		blk_mq_run_hw_queue(this_hctx, from_sched);
 	} else if (this_hctx->queue->elevator) {
-		this_hctx->queue->elevator->type->ops.insert_requests(this_hctx,
-				&list, 0);
+		elevator_insert_requests(this_hctx, &list, 0);
 		blk_mq_run_hw_queue(this_hctx, from_sched);
 	} else {
 		blk_mq_insert_requests(this_hctx, this_ctx, &list, from_sched);
diff --git a/block/elevator.c b/block/elevator.c
index ab22542e6cf0..91df270d9d91 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -144,6 +144,7 @@ struct elevator_queue *elevator_alloc(struct request_queue *q,
 	eq->type = e;
 	kobject_init(&eq->kobj, &elv_ktype);
 	mutex_init(&eq->sysfs_lock);
+	spin_lock_init(&eq->lock);
 	hash_init(eq->hash);
 
 	return eq;
diff --git a/block/elevator.h b/block/elevator.h
index a07ce773a38f..8399dfe5c3b6 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -110,12 +110,12 @@ struct request *elv_rqhash_find(struct request_queue *q, sector_t offset);
 /*
  * each queue has an elevator_queue associated with it
  */
-struct elevator_queue
-{
+struct elevator_queue {
 	struct elevator_type *type;
 	void *elevator_data;
 	struct kobject kobj;
 	struct mutex sysfs_lock;
+	spinlock_t lock;
 	unsigned long flags;
 	DECLARE_HASHTABLE(hash, ELV_HASH_BITS);
 };
@@ -186,4 +186,61 @@ extern struct request *elv_rb_find(struct rb_root *, sector_t);
 void blk_mq_sched_reg_debugfs(struct request_queue *q);
 void blk_mq_sched_unreg_debugfs(struct request_queue *q);
 
+#define elevator_lock(e)	spin_lock_irq(&(e)->lock)
+#define elevator_unlock(e)	spin_unlock_irq(&(e)->lock)
+
+static inline struct request *elevator_dispatch_request(
+		struct blk_mq_hw_ctx *hctx)
+{
+	struct request_queue *q = hctx->queue;
+	struct elevator_queue *e = q->elevator;
+	bool sq_shared = blk_queue_sq_sched(q);
+	struct request *rq;
+
+	if (sq_shared)
+		elevator_lock(e);
+
+	rq = e->type->ops.dispatch_request(hctx);
+
+	if (sq_shared)
+		elevator_unlock(e);
+
+	return rq;
+}
+
+static inline void elevator_insert_requests(struct blk_mq_hw_ctx *hctx,
+					    struct list_head *list,
+					    blk_insert_t flags)
+{
+	struct request_queue *q = hctx->queue;
+	struct elevator_queue *e = q->elevator;
+	bool sq_shared = blk_queue_sq_sched(q);
+
+	if (sq_shared)
+		elevator_lock(e);
+
+	e->type->ops.insert_requests(hctx, list, flags);
+
+	if (sq_shared)
+		elevator_unlock(e);
+}
+
+static inline bool elevator_bio_merge(struct request_queue *q, struct bio *bio,
+				      unsigned int nr_segs)
+{
+	struct elevator_queue *e = q->elevator;
+	bool sq_shared = blk_queue_sq_sched(q);
+	bool ret;
+
+	if (sq_shared)
+		elevator_lock(e);
+
+	ret = e->type->ops.bio_merge(q, bio, nr_segs);
+
+	if (sq_shared)
+		elevator_unlock(e);
+
+	return ret;
+}
+
 #endif /* _ELEVATOR_H */
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH RFC v2 2/5] mq-deadline: switch to use elevator lock
  2025-06-14  9:25 [PATCH RFC v2 0/5] blk-mq-sched: support request batch dispatching for sq elevator Yu Kuai
  2025-06-14  9:25 ` [PATCH RFC v2 1/5] elevator: introduce global lock for sq_shared elevator Yu Kuai
@ 2025-06-14  9:25 ` Yu Kuai
  2025-06-16  2:41   ` Damien Le Moal
  2025-06-14  9:25 ` [PATCH RFC v2 3/5] block, bfq: " Yu Kuai
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 15+ messages in thread
From: Yu Kuai @ 2025-06-14  9:25 UTC (permalink / raw)
  To: ming.lei, yukuai3, tj, josef, axboe
  Cc: linux-block, cgroups, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

Convert 'dd->lock' to the high-level 'q->elevator->lock', to prepare to
support batch requests dispatching.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/mq-deadline.c | 58 ++++++++++++++++++---------------------------
 1 file changed, 23 insertions(+), 35 deletions(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 2edf1cac06d5..6b993a5bf69f 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -101,7 +101,7 @@ struct deadline_data {
 	u32 async_depth;
 	int prio_aging_expire;
 
-	spinlock_t lock;
+	spinlock_t *lock;
 };
 
 /* Maps an I/O priority class to a deadline scheduler priority. */
@@ -213,7 +213,7 @@ static void dd_merged_requests(struct request_queue *q, struct request *req,
 	const u8 ioprio_class = dd_rq_ioclass(next);
 	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
 
-	lockdep_assert_held(&dd->lock);
+	lockdep_assert_held(dd->lock);
 
 	dd->per_prio[prio].stats.merged++;
 
@@ -253,7 +253,7 @@ static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
 {
 	const struct io_stats_per_prio *stats = &dd->per_prio[prio].stats;
 
-	lockdep_assert_held(&dd->lock);
+	lockdep_assert_held(dd->lock);
 
 	return stats->inserted - atomic_read(&stats->completed);
 }
@@ -323,7 +323,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
 	enum dd_prio prio;
 	u8 ioprio_class;
 
-	lockdep_assert_held(&dd->lock);
+	lockdep_assert_held(dd->lock);
 
 	if (!list_empty(&per_prio->dispatch)) {
 		rq = list_first_entry(&per_prio->dispatch, struct request,
@@ -434,7 +434,7 @@ static struct request *dd_dispatch_prio_aged_requests(struct deadline_data *dd,
 	enum dd_prio prio;
 	int prio_cnt;
 
-	lockdep_assert_held(&dd->lock);
+	lockdep_assert_held(dd->lock);
 
 	prio_cnt = !!dd_queued(dd, DD_RT_PRIO) + !!dd_queued(dd, DD_BE_PRIO) +
 		   !!dd_queued(dd, DD_IDLE_PRIO);
@@ -466,10 +466,9 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 	struct request *rq;
 	enum dd_prio prio;
 
-	spin_lock(&dd->lock);
 	rq = dd_dispatch_prio_aged_requests(dd, now);
 	if (rq)
-		goto unlock;
+		return rq;
 
 	/*
 	 * Next, dispatch requests in priority order. Ignore lower priority
@@ -481,9 +480,6 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 			break;
 	}
 
-unlock:
-	spin_unlock(&dd->lock);
-
 	return rq;
 }
 
@@ -552,9 +548,9 @@ static void dd_exit_sched(struct elevator_queue *e)
 		WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_READ]));
 		WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_WRITE]));
 
-		spin_lock(&dd->lock);
+		spin_lock(dd->lock);
 		queued = dd_queued(dd, prio);
-		spin_unlock(&dd->lock);
+		spin_unlock(dd->lock);
 
 		WARN_ONCE(queued != 0,
 			  "statistics for priority %d: i %u m %u d %u c %u\n",
@@ -601,7 +597,7 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	dd->last_dir = DD_WRITE;
 	dd->fifo_batch = fifo_batch;
 	dd->prio_aging_expire = prio_aging_expire;
-	spin_lock_init(&dd->lock);
+	dd->lock = &eq->lock;
 
 	/* We dispatch from request queue wide instead of hw queue */
 	blk_queue_flag_set(QUEUE_FLAG_SQ_SCHED, q);
@@ -653,14 +649,10 @@ static int dd_request_merge(struct request_queue *q, struct request **rq,
 static bool dd_bio_merge(struct request_queue *q, struct bio *bio,
 		unsigned int nr_segs)
 {
-	struct deadline_data *dd = q->elevator->elevator_data;
 	struct request *free = NULL;
 	bool ret;
 
-	spin_lock(&dd->lock);
 	ret = blk_mq_sched_try_merge(q, bio, nr_segs, &free);
-	spin_unlock(&dd->lock);
-
 	if (free)
 		blk_mq_free_request(free);
 
@@ -681,7 +673,7 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	struct dd_per_prio *per_prio;
 	enum dd_prio prio;
 
-	lockdep_assert_held(&dd->lock);
+	lockdep_assert_held(dd->lock);
 
 	prio = ioprio_class_to_prio[ioprio_class];
 	per_prio = &dd->per_prio[prio];
@@ -721,11 +713,8 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
 			       struct list_head *list,
 			       blk_insert_t flags)
 {
-	struct request_queue *q = hctx->queue;
-	struct deadline_data *dd = q->elevator->elevator_data;
 	LIST_HEAD(free);
 
-	spin_lock(&dd->lock);
 	while (!list_empty(list)) {
 		struct request *rq;
 
@@ -733,7 +722,6 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
 		list_del_init(&rq->queuelist);
 		dd_insert_request(hctx, rq, flags, &free);
 	}
-	spin_unlock(&dd->lock);
 
 	blk_mq_free_requests(&free);
 }
@@ -849,13 +837,13 @@ static const struct elv_fs_entry deadline_attrs[] = {
 #define DEADLINE_DEBUGFS_DDIR_ATTRS(prio, data_dir, name)		\
 static void *deadline_##name##_fifo_start(struct seq_file *m,		\
 					  loff_t *pos)			\
-	__acquires(&dd->lock)						\
+	__acquires(dd->lock)						\
 {									\
 	struct request_queue *q = m->private;				\
 	struct deadline_data *dd = q->elevator->elevator_data;		\
 	struct dd_per_prio *per_prio = &dd->per_prio[prio];		\
 									\
-	spin_lock(&dd->lock);						\
+	spin_lock(dd->lock);						\
 	return seq_list_start(&per_prio->fifo_list[data_dir], *pos);	\
 }									\
 									\
@@ -870,12 +858,12 @@ static void *deadline_##name##_fifo_next(struct seq_file *m, void *v,	\
 }									\
 									\
 static void deadline_##name##_fifo_stop(struct seq_file *m, void *v)	\
-	__releases(&dd->lock)						\
+	__releases(dd->lock)						\
 {									\
 	struct request_queue *q = m->private;				\
 	struct deadline_data *dd = q->elevator->elevator_data;		\
 									\
-	spin_unlock(&dd->lock);						\
+	spin_unlock(dd->lock);						\
 }									\
 									\
 static const struct seq_operations deadline_##name##_fifo_seq_ops = {	\
@@ -941,11 +929,11 @@ static int dd_queued_show(void *data, struct seq_file *m)
 	struct deadline_data *dd = q->elevator->elevator_data;
 	u32 rt, be, idle;
 
-	spin_lock(&dd->lock);
+	spin_lock(dd->lock);
 	rt = dd_queued(dd, DD_RT_PRIO);
 	be = dd_queued(dd, DD_BE_PRIO);
 	idle = dd_queued(dd, DD_IDLE_PRIO);
-	spin_unlock(&dd->lock);
+	spin_unlock(dd->lock);
 
 	seq_printf(m, "%u %u %u\n", rt, be, idle);
 
@@ -957,7 +945,7 @@ static u32 dd_owned_by_driver(struct deadline_data *dd, enum dd_prio prio)
 {
 	const struct io_stats_per_prio *stats = &dd->per_prio[prio].stats;
 
-	lockdep_assert_held(&dd->lock);
+	lockdep_assert_held(dd->lock);
 
 	return stats->dispatched + stats->merged -
 		atomic_read(&stats->completed);
@@ -969,11 +957,11 @@ static int dd_owned_by_driver_show(void *data, struct seq_file *m)
 	struct deadline_data *dd = q->elevator->elevator_data;
 	u32 rt, be, idle;
 
-	spin_lock(&dd->lock);
+	spin_lock(dd->lock);
 	rt = dd_owned_by_driver(dd, DD_RT_PRIO);
 	be = dd_owned_by_driver(dd, DD_BE_PRIO);
 	idle = dd_owned_by_driver(dd, DD_IDLE_PRIO);
-	spin_unlock(&dd->lock);
+	spin_unlock(dd->lock);
 
 	seq_printf(m, "%u %u %u\n", rt, be, idle);
 
@@ -983,13 +971,13 @@ static int dd_owned_by_driver_show(void *data, struct seq_file *m)
 #define DEADLINE_DISPATCH_ATTR(prio)					\
 static void *deadline_dispatch##prio##_start(struct seq_file *m,	\
 					     loff_t *pos)		\
-	__acquires(&dd->lock)						\
+	__acquires(dd->lock)						\
 {									\
 	struct request_queue *q = m->private;				\
 	struct deadline_data *dd = q->elevator->elevator_data;		\
 	struct dd_per_prio *per_prio = &dd->per_prio[prio];		\
 									\
-	spin_lock(&dd->lock);						\
+	spin_lock(dd->lock);						\
 	return seq_list_start(&per_prio->dispatch, *pos);		\
 }									\
 									\
@@ -1004,12 +992,12 @@ static void *deadline_dispatch##prio##_next(struct seq_file *m,		\
 }									\
 									\
 static void deadline_dispatch##prio##_stop(struct seq_file *m, void *v)	\
-	__releases(&dd->lock)						\
+	__releases(dd->lock)						\
 {									\
 	struct request_queue *q = m->private;				\
 	struct deadline_data *dd = q->elevator->elevator_data;		\
 									\
-	spin_unlock(&dd->lock);						\
+	spin_unlock(dd->lock);						\
 }									\
 									\
 static const struct seq_operations deadline_dispatch##prio##_seq_ops = { \
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH RFC v2 3/5] block, bfq: switch to use elevator lock
  2025-06-14  9:25 [PATCH RFC v2 0/5] blk-mq-sched: support request batch dispatching for sq elevator Yu Kuai
  2025-06-14  9:25 ` [PATCH RFC v2 1/5] elevator: introduce global lock for sq_shared elevator Yu Kuai
  2025-06-14  9:25 ` [PATCH RFC v2 2/5] mq-deadline: switch to use elevator lock Yu Kuai
@ 2025-06-14  9:25 ` Yu Kuai
  2025-06-16  2:46   ` Damien Le Moal
  2025-06-14  9:25 ` [PATCH RFC v2 4/5] blk-mq-sched: refactor __blk_mq_do_dispatch_sched() Yu Kuai
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 15+ messages in thread
From: Yu Kuai @ 2025-06-14  9:25 UTC (permalink / raw)
  To: ming.lei, yukuai3, tj, josef, axboe
  Cc: linux-block, cgroups, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

Convert 'bfqd->lock' to the high-level 'q->elevator->lock', to prepare to
support batch requests dispatching.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/bfq-cgroup.c  |  4 ++--
 block/bfq-iosched.c | 53 +++++++++++++++------------------------------
 block/bfq-iosched.h |  2 +-
 3 files changed, 21 insertions(+), 38 deletions(-)

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index 9fb9f3533150..1717bac7eccc 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -878,7 +878,7 @@ static void bfq_pd_offline(struct blkg_policy_data *pd)
 	unsigned long flags;
 	int i;
 
-	spin_lock_irqsave(&bfqd->lock, flags);
+	spin_lock_irqsave(bfqd->lock, flags);
 
 	if (!entity) /* root group */
 		goto put_async_queues;
@@ -923,7 +923,7 @@ static void bfq_pd_offline(struct blkg_policy_data *pd)
 put_async_queues:
 	bfq_put_async_queues(bfqd, bfqg);
 
-	spin_unlock_irqrestore(&bfqd->lock, flags);
+	spin_unlock_irqrestore(bfqd->lock, flags);
 	/*
 	 * @blkg is going offline and will be ignored by
 	 * blkg_[rw]stat_recursive_sum().  Transfer stats to the parent so
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 0cb1e9873aab..fd6d81a185f7 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -473,7 +473,7 @@ static struct bfq_io_cq *bfq_bic_lookup(struct request_queue *q)
  */
 void bfq_schedule_dispatch(struct bfq_data *bfqd)
 {
-	lockdep_assert_held(&bfqd->lock);
+	lockdep_assert_held(bfqd->lock);
 
 	if (bfqd->queued != 0) {
 		bfq_log(bfqd, "schedule dispatch");
@@ -598,7 +598,7 @@ static bool bfqq_request_over_limit(struct bfq_data *bfqd,
 	int level;
 
 retry:
-	spin_lock_irq(&bfqd->lock);
+	spin_lock_irq(bfqd->lock);
 	bfqq = bic_to_bfqq(bic, op_is_sync(opf), act_idx);
 	if (!bfqq)
 		goto out;
@@ -610,7 +610,7 @@ static bool bfqq_request_over_limit(struct bfq_data *bfqd,
 	/* +1 for bfqq entity, root cgroup not included */
 	depth = bfqg_to_blkg(bfqq_group(bfqq))->blkcg->css.cgroup->level + 1;
 	if (depth > alloc_depth) {
-		spin_unlock_irq(&bfqd->lock);
+		spin_unlock_irq(bfqd->lock);
 		if (entities != inline_entities)
 			kfree(entities);
 		entities = kmalloc_array(depth, sizeof(*entities), GFP_NOIO);
@@ -668,7 +668,7 @@ static bool bfqq_request_over_limit(struct bfq_data *bfqd,
 		}
 	}
 out:
-	spin_unlock_irq(&bfqd->lock);
+	spin_unlock_irq(bfqd->lock);
 	if (entities != inline_entities)
 		kfree(entities);
 	return ret;
@@ -2458,18 +2458,9 @@ static bool bfq_bio_merge(struct request_queue *q, struct bio *bio,
 {
 	struct bfq_data *bfqd = q->elevator->elevator_data;
 	struct request *free = NULL;
-	/*
-	 * bfq_bic_lookup grabs the queue_lock: invoke it now and
-	 * store its return value for later use, to avoid nesting
-	 * queue_lock inside the bfqd->lock. We assume that the bic
-	 * returned by bfq_bic_lookup does not go away before
-	 * bfqd->lock is taken.
-	 */
 	struct bfq_io_cq *bic = bfq_bic_lookup(q);
 	bool ret;
 
-	spin_lock_irq(&bfqd->lock);
-
 	if (bic) {
 		/*
 		 * Make sure cgroup info is uptodate for current process before
@@ -2485,8 +2476,6 @@ static bool bfq_bio_merge(struct request_queue *q, struct bio *bio,
 	bfqd->bio_bic = bic;
 
 	ret = blk_mq_sched_try_merge(q, bio, nr_segs, &free);
-
-	spin_unlock_irq(&bfqd->lock);
 	if (free)
 		blk_mq_free_request(free);
 
@@ -2661,7 +2650,7 @@ static void bfq_end_wr(struct bfq_data *bfqd)
 	struct bfq_queue *bfqq;
 	int i;
 
-	spin_lock_irq(&bfqd->lock);
+	spin_lock_irq(bfqd->lock);
 
 	for (i = 0; i < bfqd->num_actuators; i++) {
 		list_for_each_entry(bfqq, &bfqd->active_list[i], bfqq_list)
@@ -2671,7 +2660,7 @@ static void bfq_end_wr(struct bfq_data *bfqd)
 		bfq_bfqq_end_wr(bfqq);
 	bfq_end_wr_async(bfqd);
 
-	spin_unlock_irq(&bfqd->lock);
+	spin_unlock_irq(bfqd->lock);
 }
 
 static sector_t bfq_io_struct_pos(void *io_struct, bool request)
@@ -5317,8 +5306,6 @@ static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
 	struct bfq_queue *in_serv_queue;
 	bool waiting_rq, idle_timer_disabled = false;
 
-	spin_lock_irq(&bfqd->lock);
-
 	in_serv_queue = bfqd->in_service_queue;
 	waiting_rq = in_serv_queue && bfq_bfqq_wait_request(in_serv_queue);
 
@@ -5328,7 +5315,6 @@ static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
 			waiting_rq && !bfq_bfqq_wait_request(in_serv_queue);
 	}
 
-	spin_unlock_irq(&bfqd->lock);
 	bfq_update_dispatch_stats(hctx->queue, rq,
 			idle_timer_disabled ? in_serv_queue : NULL,
 				idle_timer_disabled);
@@ -5506,9 +5492,9 @@ static void bfq_exit_icq(struct io_cq *icq)
 	 * this is the last time these queues are accessed.
 	 */
 	if (bfqd) {
-		spin_lock_irqsave(&bfqd->lock, flags);
+		spin_lock_irqsave(bfqd->lock, flags);
 		_bfq_exit_icq(bic, bfqd->num_actuators);
-		spin_unlock_irqrestore(&bfqd->lock, flags);
+		spin_unlock_irqrestore(bfqd->lock, flags);
 	} else {
 		_bfq_exit_icq(bic, BFQ_MAX_ACTUATORS);
 	}
@@ -6264,10 +6250,8 @@ static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	if (!cgroup_subsys_on_dfl(io_cgrp_subsys) && rq->bio)
 		bfqg_stats_update_legacy_io(q, rq);
 #endif
-	spin_lock_irq(&bfqd->lock);
 	bfqq = bfq_init_rq(rq);
 	if (blk_mq_sched_try_insert_merge(q, rq, &free)) {
-		spin_unlock_irq(&bfqd->lock);
 		blk_mq_free_requests(&free);
 		return;
 	}
@@ -6300,7 +6284,6 @@ static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	 * merge).
 	 */
 	cmd_flags = rq->cmd_flags;
-	spin_unlock_irq(&bfqd->lock);
 
 	bfq_update_insert_stats(q, bfqq, idle_timer_disabled,
 				cmd_flags);
@@ -6681,7 +6664,7 @@ static void bfq_finish_requeue_request(struct request *rq)
 					     rq->io_start_time_ns,
 					     rq->cmd_flags);
 
-	spin_lock_irqsave(&bfqd->lock, flags);
+	spin_lock_irqsave(bfqd->lock, flags);
 	if (likely(rq->rq_flags & RQF_STARTED)) {
 		if (rq == bfqd->waited_rq)
 			bfq_update_inject_limit(bfqd, bfqq);
@@ -6691,7 +6674,7 @@ static void bfq_finish_requeue_request(struct request *rq)
 	bfqq_request_freed(bfqq);
 	bfq_put_queue(bfqq);
 	RQ_BIC(rq)->requests--;
-	spin_unlock_irqrestore(&bfqd->lock, flags);
+	spin_unlock_irqrestore(bfqd->lock, flags);
 
 	/*
 	 * Reset private fields. In case of a requeue, this allows
@@ -7022,7 +7005,7 @@ bfq_idle_slice_timer_body(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 	enum bfqq_expiration reason;
 	unsigned long flags;
 
-	spin_lock_irqsave(&bfqd->lock, flags);
+	spin_lock_irqsave(bfqd->lock, flags);
 
 	/*
 	 * Considering that bfqq may be in race, we should firstly check
@@ -7032,7 +7015,7 @@ bfq_idle_slice_timer_body(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 	 * been cleared in __bfq_bfqd_reset_in_service func.
 	 */
 	if (bfqq != bfqd->in_service_queue) {
-		spin_unlock_irqrestore(&bfqd->lock, flags);
+		spin_unlock_irqrestore(bfqd->lock, flags);
 		return;
 	}
 
@@ -7060,7 +7043,7 @@ bfq_idle_slice_timer_body(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 
 schedule_dispatch:
 	bfq_schedule_dispatch(bfqd);
-	spin_unlock_irqrestore(&bfqd->lock, flags);
+	spin_unlock_irqrestore(bfqd->lock, flags);
 }
 
 /*
@@ -7186,10 +7169,10 @@ static void bfq_exit_queue(struct elevator_queue *e)
 
 	hrtimer_cancel(&bfqd->idle_slice_timer);
 
-	spin_lock_irq(&bfqd->lock);
+	spin_lock_irq(bfqd->lock);
 	list_for_each_entry_safe(bfqq, n, &bfqd->idle_list, bfqq_list)
 		bfq_deactivate_bfqq(bfqd, bfqq, false, false);
-	spin_unlock_irq(&bfqd->lock);
+	spin_unlock_irq(bfqd->lock);
 
 	for (actuator = 0; actuator < bfqd->num_actuators; actuator++)
 		WARN_ON_ONCE(bfqd->rq_in_driver[actuator]);
@@ -7203,10 +7186,10 @@ static void bfq_exit_queue(struct elevator_queue *e)
 #ifdef CONFIG_BFQ_GROUP_IOSCHED
 	blkcg_deactivate_policy(bfqd->queue->disk, &blkcg_policy_bfq);
 #else
-	spin_lock_irq(&bfqd->lock);
+	spin_lock_irq(bfqd->lock);
 	bfq_put_async_queues(bfqd, bfqd->root_group);
 	kfree(bfqd->root_group);
-	spin_unlock_irq(&bfqd->lock);
+	spin_unlock_irq(bfqd->lock);
 #endif
 
 	blk_stat_disable_accounting(bfqd->queue);
@@ -7371,7 +7354,7 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 	/* see comments on the definition of next field inside bfq_data */
 	bfqd->actuator_load_threshold = 4;
 
-	spin_lock_init(&bfqd->lock);
+	bfqd->lock = &eq->lock;
 
 	/*
 	 * The invocation of the next bfq_create_group_hierarchy
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 687a3a7ba784..d70eb6529dab 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -795,7 +795,7 @@ struct bfq_data {
 	/* fallback dummy bfqq for extreme OOM conditions */
 	struct bfq_queue oom_bfqq;
 
-	spinlock_t lock;
+	spinlock_t *lock;
 
 	/*
 	 * bic associated with the task issuing current bio for
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH RFC v2 4/5] blk-mq-sched: refactor __blk_mq_do_dispatch_sched()
  2025-06-14  9:25 [PATCH RFC v2 0/5] blk-mq-sched: support request batch dispatching for sq elevator Yu Kuai
                   ` (2 preceding siblings ...)
  2025-06-14  9:25 ` [PATCH RFC v2 3/5] block, bfq: " Yu Kuai
@ 2025-06-14  9:25 ` Yu Kuai
  2025-06-16  2:54   ` Damien Le Moal
  2025-06-14  9:25 ` [PATCH RFC v2 5/5] blk-mq-sched: support request batch dispatching for sq elevator Yu Kuai
  2025-06-16  4:03 ` [PATCH RFC v2 0/5] " Damien Le Moal
  5 siblings, 1 reply; 15+ messages in thread
From: Yu Kuai @ 2025-06-14  9:25 UTC (permalink / raw)
  To: ming.lei, yukuai3, tj, josef, axboe
  Cc: linux-block, cgroups, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

Introduce struct sched_dispatch_ctx, and split the helper into
elevator_dispatch_one_request() and elevator_finish_dispatch(). This makes
the code cleaner and prepares to support request batch dispatching.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-sched.c | 181 ++++++++++++++++++++++++++-----------------
 1 file changed, 109 insertions(+), 72 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index c1390d3e6381..990d0f19594a 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -74,85 +74,88 @@ static bool blk_mq_dispatch_hctx_list(struct list_head *rq_list)
 
 #define BLK_MQ_BUDGET_DELAY	3		/* ms units */
 
-/*
- * Only SCSI implements .get_budget and .put_budget, and SCSI restarts
- * its queue by itself in its completion handler, so we don't need to
- * restart queue if .get_budget() fails to get the budget.
- *
- * Returns -EAGAIN if hctx->dispatch was found non-empty and run_work has to
- * be run again.  This is necessary to avoid starving flushes.
- */
-static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
-{
-	struct request_queue *q = hctx->queue;
-	struct elevator_queue *e = q->elevator;
-	bool multi_hctxs = false, run_queue = false;
-	bool dispatched = false, busy = false;
-	unsigned int max_dispatch;
-	LIST_HEAD(rq_list);
-	int count = 0;
+struct sched_dispatch_ctx {
+	struct blk_mq_hw_ctx *hctx;
+	struct elevator_queue *e;
+	struct request_queue *q;
 
-	if (hctx->dispatch_busy)
-		max_dispatch = 1;
-	else
-		max_dispatch = hctx->queue->nr_requests;
+	struct list_head rq_list;
+	int count;
 
-	do {
-		struct request *rq;
-		int budget_token;
+	bool multi_hctxs;
+	bool run_queue;
+	bool busy;
+};
 
-		if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
-			break;
+static bool elevator_can_dispatch(struct sched_dispatch_ctx *ctx)
+{
+	if (ctx->e->type->ops.has_work &&
+	    !ctx->e->type->ops.has_work(ctx->hctx))
+		return false;
 
-		if (!list_empty_careful(&hctx->dispatch)) {
-			busy = true;
-			break;
-		}
+	if (!list_empty_careful(&ctx->hctx->dispatch)) {
+		ctx->busy = true;
+		return false;
+	}
 
-		budget_token = blk_mq_get_dispatch_budget(q);
-		if (budget_token < 0)
-			break;
+	return true;
+}
 
-		rq = elevator_dispatch_request(hctx);
-		if (!rq) {
-			blk_mq_put_dispatch_budget(q, budget_token);
-			/*
-			 * We're releasing without dispatching. Holding the
-			 * budget could have blocked any "hctx"s with the
-			 * same queue and if we didn't dispatch then there's
-			 * no guarantee anyone will kick the queue.  Kick it
-			 * ourselves.
-			 */
-			run_queue = true;
-			break;
-		}
+static bool elevator_dispatch_one_request(struct sched_dispatch_ctx *ctx)
+{
+	struct request *rq;
+	int budget_token;
 
-		blk_mq_set_rq_budget_token(rq, budget_token);
+	if (!elevator_can_dispatch(ctx))
+		return false;
 
-		/*
-		 * Now this rq owns the budget which has to be released
-		 * if this rq won't be queued to driver via .queue_rq()
-		 * in blk_mq_dispatch_rq_list().
-		 */
-		list_add_tail(&rq->queuelist, &rq_list);
-		count++;
-		if (rq->mq_hctx != hctx)
-			multi_hctxs = true;
+	budget_token = blk_mq_get_dispatch_budget(ctx->q);
+	if (budget_token < 0)
+		return false;
 
+	rq = elevator_dispatch_request(ctx->hctx);
+	if (!rq) {
+		blk_mq_put_dispatch_budget(ctx->q, budget_token);
 		/*
-		 * If we cannot get tag for the request, stop dequeueing
-		 * requests from the IO scheduler. We are unlikely to be able
-		 * to submit them anyway and it creates false impression for
-		 * scheduling heuristics that the device can take more IO.
+		 * We're releasing without dispatching. Holding the
+		 * budget could have blocked any "hctx"s with the
+		 * same queue and if we didn't dispatch then there's
+		 * no guarantee anyone will kick the queue.  Kick it
+		 * ourselves.
 		 */
-		if (!blk_mq_get_driver_tag(rq))
-			break;
-	} while (count < max_dispatch);
+		ctx->run_queue = true;
+		return false;
+	}
 
-	if (!count) {
-		if (run_queue)
-			blk_mq_delay_run_hw_queues(q, BLK_MQ_BUDGET_DELAY);
-	} else if (multi_hctxs) {
+	blk_mq_set_rq_budget_token(rq, budget_token);
+
+	/*
+	 * Now this rq owns the budget which has to be released
+	 * if this rq won't be queued to driver via .queue_rq()
+	 * in blk_mq_dispatch_rq_list().
+	 */
+	list_add_tail(&rq->queuelist, &ctx->rq_list);
+	ctx->count++;
+	if (rq->mq_hctx != ctx->hctx)
+		ctx->multi_hctxs = true;
+
+	/*
+	 * If we cannot get tag for the request, stop dequeueing
+	 * requests from the IO scheduler. We are unlikely to be able
+	 * to submit them anyway and it creates false impression for
+	 * scheduling heuristics that the device can take more IO.
+	 */
+	return blk_mq_get_driver_tag(rq);
+}
+
+static int elevator_finish_dispatch(struct sched_dispatch_ctx *ctx)
+{
+	bool dispatched = false;
+
+	if (!ctx->count) {
+		if (ctx->run_queue)
+			blk_mq_delay_run_hw_queues(ctx->q, BLK_MQ_BUDGET_DELAY);
+	} else if (ctx->multi_hctxs) {
 		/*
 		 * Requests from different hctx may be dequeued from some
 		 * schedulers, such as bfq and deadline.
@@ -160,19 +163,53 @@ static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 		 * Sort the requests in the list according to their hctx,
 		 * dispatch batching requests from same hctx at a time.
 		 */
-		list_sort(NULL, &rq_list, sched_rq_cmp);
+		list_sort(NULL, &ctx->rq_list, sched_rq_cmp);
 		do {
-			dispatched |= blk_mq_dispatch_hctx_list(&rq_list);
-		} while (!list_empty(&rq_list));
+			dispatched |= blk_mq_dispatch_hctx_list(&ctx->rq_list);
+		} while (!list_empty(&ctx->rq_list));
 	} else {
-		dispatched = blk_mq_dispatch_rq_list(hctx, &rq_list, false);
+		dispatched = blk_mq_dispatch_rq_list(ctx->hctx, &ctx->rq_list,
+						     false);
 	}
 
-	if (busy)
+	if (ctx->busy)
 		return -EAGAIN;
+
 	return !!dispatched;
 }
 
+/*
+ * Only SCSI implements .get_budget and .put_budget, and SCSI restarts
+ * its queue by itself in its completion handler, so we don't need to
+ * restart queue if .get_budget() fails to get the budget.
+ *
+ * Returns -EAGAIN if hctx->dispatch was found non-empty and run_work has to
+ * be run again.  This is necessary to avoid starving flushes.
+ */
+static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
+{
+	unsigned int max_dispatch;
+	struct sched_dispatch_ctx ctx = {
+		.hctx	= hctx,
+		.q	= hctx->queue,
+		.e	= hctx->queue->elevator,
+	};
+
+	INIT_LIST_HEAD(&ctx.rq_list);
+
+	if (hctx->dispatch_busy)
+		max_dispatch = 1;
+	else
+		max_dispatch = hctx->queue->nr_requests;
+
+	do {
+		if (!elevator_dispatch_one_request(&ctx))
+			break;
+	} while (ctx.count < max_dispatch);
+
+	return elevator_finish_dispatch(&ctx);
+}
+
 static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 {
 	unsigned long end = jiffies + HZ;
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH RFC v2 5/5] blk-mq-sched: support request batch dispatching for sq elevator
  2025-06-14  9:25 [PATCH RFC v2 0/5] blk-mq-sched: support request batch dispatching for sq elevator Yu Kuai
                   ` (3 preceding siblings ...)
  2025-06-14  9:25 ` [PATCH RFC v2 4/5] blk-mq-sched: refactor __blk_mq_do_dispatch_sched() Yu Kuai
@ 2025-06-14  9:25 ` Yu Kuai
  2025-06-16  3:07   ` Damien Le Moal
  2025-06-16  4:03 ` [PATCH RFC v2 0/5] " Damien Le Moal
  5 siblings, 1 reply; 15+ messages in thread
From: Yu Kuai @ 2025-06-14  9:25 UTC (permalink / raw)
  To: ming.lei, yukuai3, tj, josef, axboe
  Cc: linux-block, cgroups, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

Before this patch, each context holds a global lock to dispatch one
request at a time, which introduces intense lock contention:

lock
ops.dispatch_request
unlock

Hence, support dispatching a batch of requests while holding the lock to
reduce lock contention.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-sched.c | 55 ++++++++++++++++++++++++++++++++++++++++----
 block/blk-mq.h       | 21 +++++++++++++++++
 2 files changed, 72 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 990d0f19594a..d7cb88c8e8c7 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -101,6 +101,49 @@ static bool elevator_can_dispatch(struct sched_dispatch_ctx *ctx)
 	return true;
 }
 
+static void elevator_dispatch_requests(struct sched_dispatch_ctx *ctx)
+{
+	struct request *rq;
+	int budget_token[BUDGET_TOKEN_BATCH];
+	int count;
+	int i;
+
+	while (true) {
+		if (!elevator_can_dispatch(ctx))
+			return;
+
+		count = blk_mq_get_dispatch_budgets(ctx->q, budget_token);
+		if (count <= 0)
+			return;
+
+		elevator_lock(ctx->e);
+		for (i = 0; i < count; ++i) {
+			rq = ctx->e->type->ops.dispatch_request(ctx->hctx);
+			if (!rq) {
+				ctx->run_queue = true;
+				goto err_free_budgets;
+			}
+
+			blk_mq_set_rq_budget_token(rq, budget_token[i]);
+			list_add_tail(&rq->queuelist, &ctx->rq_list);
+			ctx->count++;
+			if (rq->mq_hctx != ctx->hctx)
+				ctx->multi_hctxs = true;
+
+			if (!blk_mq_get_driver_tag(rq)) {
+				i++;
+				goto err_free_budgets;
+			}
+		}
+		elevator_unlock(ctx->e);
+	}
+
+err_free_budgets:
+	elevator_unlock(ctx->e);
+	for (; i < count; ++i)
+		blk_mq_put_dispatch_budget(ctx->q, budget_token[i]);
+}
+
 static bool elevator_dispatch_one_request(struct sched_dispatch_ctx *ctx)
 {
 	struct request *rq;
@@ -202,10 +245,14 @@ static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 	else
 		max_dispatch = hctx->queue->nr_requests;
 
-	do {
-		if (!elevator_dispatch_one_request(&ctx))
-			break;
-	} while (ctx.count < max_dispatch);
+	if (!hctx->dispatch_busy && blk_queue_sq_sched(ctx.q))
+		elevator_dispatch_requests(&ctx);
+	else {
+		do {
+			if (!elevator_dispatch_one_request(&ctx))
+				break;
+		} while (ctx.count < max_dispatch);
+	}
 
 	return elevator_finish_dispatch(&ctx);
 }
diff --git a/block/blk-mq.h b/block/blk-mq.h
index affb2e14b56e..450c16a07841 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -37,6 +37,7 @@ enum {
 };
 
 #define BLK_MQ_CPU_WORK_BATCH	(8)
+#define BUDGET_TOKEN_BATCH	(8)
 
 typedef unsigned int __bitwise blk_insert_t;
 #define BLK_MQ_INSERT_AT_HEAD		((__force blk_insert_t)0x01)
@@ -262,6 +263,26 @@ static inline int blk_mq_get_dispatch_budget(struct request_queue *q)
 	return 0;
 }
 
+static inline int blk_mq_get_dispatch_budgets(struct request_queue *q,
+					      int *budget_token)
+{
+	int count = 0;
+
+	while (count < BUDGET_TOKEN_BATCH) {
+		int token = 0;
+
+		if (q->mq_ops->get_budget)
+			token = q->mq_ops->get_budget(q);
+
+		if (token < 0)
+			return count;
+
+		budget_token[count++] = token;
+	}
+
+	return count;
+}
+
 static inline void blk_mq_set_rq_budget_token(struct request *rq, int token)
 {
 	if (token < 0)
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH RFC v2 1/5] elevator: introduce global lock for sq_shared elevator
  2025-06-14  9:25 ` [PATCH RFC v2 1/5] elevator: introduce global lock for sq_shared elevator Yu Kuai
@ 2025-06-16  2:38   ` Damien Le Moal
  0 siblings, 0 replies; 15+ messages in thread
From: Damien Le Moal @ 2025-06-16  2:38 UTC (permalink / raw)
  To: Yu Kuai, ming.lei, yukuai3, tj, josef, axboe
  Cc: linux-block, cgroups, linux-kernel, yi.zhang, yangerkun,
	johnny.chenyi

On 6/14/25 18:25, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> Currently, both mq-deadline and bfq have an internal global lock; prepare
> to convert them to use this high-level lock and support batch request
> dispatching.

There is not enough explanation, and what you have is misleading since this is
not just a "preparation" patch: you are adding a lock. So please explain that
in more detail.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFC v2 2/5] mq-deadline: switch to use elevator lock
  2025-06-14  9:25 ` [PATCH RFC v2 2/5] mq-deadline: switch to use elevator lock Yu Kuai
@ 2025-06-16  2:41   ` Damien Le Moal
  0 siblings, 0 replies; 15+ messages in thread
From: Damien Le Moal @ 2025-06-16  2:41 UTC (permalink / raw)
  To: Yu Kuai, ming.lei, yukuai3, tj, josef, axboe
  Cc: linux-block, cgroups, linux-kernel, yi.zhang, yangerkun,
	johnny.chenyi

On 6/14/25 18:25, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> Convert 'dd->lock' to the high-level 'q->elevator->lock', to prepare to
> support batch requests dispatching.

Again here the explanation is rather light. You should mention that you can do
this conversion because now the elevator lock is already taken by the block
layer when calling the scheduler dispatch and merge methods.

As for the "prepare to support batch requests dispatching", it is unclear what
that is in this patch. So please explain that or remove this.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFC v2 3/5] block, bfq: switch to use elevator lock
  2025-06-14  9:25 ` [PATCH RFC v2 3/5] block, bfq: " Yu Kuai
@ 2025-06-16  2:46   ` Damien Le Moal
  0 siblings, 0 replies; 15+ messages in thread
From: Damien Le Moal @ 2025-06-16  2:46 UTC (permalink / raw)
  To: Yu Kuai, ming.lei, yukuai3, tj, josef, axboe
  Cc: linux-block, cgroups, linux-kernel, yi.zhang, yangerkun,
	johnny.chenyi

On 6/14/25 18:25, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> Convert 'bfqd->lock' to the high-level 'q->elevator->lock', to prepare to
> support batch requests dispatching.

Same comment as for the previous patch.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFC v2 4/5] blk-mq-sched: refactor __blk_mq_do_dispatch_sched()
  2025-06-14  9:25 ` [PATCH RFC v2 4/5] blk-mq-sched: refactor __blk_mq_do_dispatch_sched() Yu Kuai
@ 2025-06-16  2:54   ` Damien Le Moal
  0 siblings, 0 replies; 15+ messages in thread
From: Damien Le Moal @ 2025-06-16  2:54 UTC (permalink / raw)
  To: Yu Kuai, ming.lei, yukuai3, tj, josef, axboe
  Cc: linux-block, cgroups, linux-kernel, yi.zhang, yangerkun,
	johnny.chenyi

On 6/14/25 18:25, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> Introduce struct sched_dispatch_ctx, and split the helper into
> elevator_dispatch_one_request() and elevator_finish_dispatch(). This makes
> the code cleaner and prepares to support request batch dispatching.

It is not clear how this patch prepares for supporting batch dispatching. Since
this is only a refactor without any semantic change, I would either drop this
comment or explain more clearly what you mean. This patch can also probably come
earlier in the series.

> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>


> +static int elevator_finish_dispatch(struct sched_dispatch_ctx *ctx)

Please add a comment here to document the return values.

> +/*
> + * Only SCSI implements .get_budget and .put_budget, and SCSI restarts
> + * its queue by itself in its completion handler, so we don't need to
> + * restart queue if .get_budget() fails to get the budget.
> + *
> + * Returns -EAGAIN if hctx->dispatch was found non-empty and run_work has to
> + * be run again.  This is necessary to avoid starving flushes.

And what is returned for the non-error case? (i.e. document the meaning of the
0 and 1 return values).

> + */
> +static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)



-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFC v2 5/5] blk-mq-sched: support request batch dispatching for sq elevator
  2025-06-14  9:25 ` [PATCH RFC v2 5/5] blk-mq-sched: support request batch dispatching for sq elevator Yu Kuai
@ 2025-06-16  3:07   ` Damien Le Moal
  0 siblings, 0 replies; 15+ messages in thread
From: Damien Le Moal @ 2025-06-16  3:07 UTC (permalink / raw)
  To: Yu Kuai, ming.lei, yukuai3, tj, josef, axboe
  Cc: linux-block, cgroups, linux-kernel, yi.zhang, yangerkun,
	johnny.chenyi

On 6/14/25 18:25, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> Before this patch, each context holds a global lock to dispatch one
> request at a time, which introduces intense lock contention:

How so? If there is only a single context issuing IOs, there will not be any
contention on the lock.

> lock
> ops.dispatch_request
> unlock
> 
> Hence, support dispatching a batch of requests while holding the lock to
> reduce lock contention.

Lock contention would happen only if you have multiple processes issuing I/Os.
For a single context case, this simply reduces the overhead of dispatching
commands by avoiding the lock+unlock per request. So please explain that clearly.

> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  block/blk-mq-sched.c | 55 ++++++++++++++++++++++++++++++++++++++++----
>  block/blk-mq.h       | 21 +++++++++++++++++
>  2 files changed, 72 insertions(+), 4 deletions(-)
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 990d0f19594a..d7cb88c8e8c7 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -101,6 +101,49 @@ static bool elevator_can_dispatch(struct sched_dispatch_ctx *ctx)
>  	return true;
>  }
>  
> +static void elevator_dispatch_requests(struct sched_dispatch_ctx *ctx)
> +{
> +	struct request *rq;
> +	int budget_token[BUDGET_TOKEN_BATCH];
> +	int count;
> +	int i;

These 2 can be declared on the same line.

> +
> +	while (true) {
> +		if (!elevator_can_dispatch(ctx))
> +			return;
> +
> +		count = blk_mq_get_dispatch_budgets(ctx->q, budget_token);
> +		if (count <= 0)
> +			return;
> +
> +		elevator_lock(ctx->e);
> +		for (i = 0; i < count; ++i) {
> +			rq = ctx->e->type->ops.dispatch_request(ctx->hctx);
> +			if (!rq) {
> +				ctx->run_queue = true;
> +				goto err_free_budgets;
> +			}
> +
> +			blk_mq_set_rq_budget_token(rq, budget_token[i]);
> +			list_add_tail(&rq->queuelist, &ctx->rq_list);
> +			ctx->count++;
> +			if (rq->mq_hctx != ctx->hctx)
> +				ctx->multi_hctxs = true;
> +
> +			if (!blk_mq_get_driver_tag(rq)) {
> +				i++;
> +				goto err_free_budgets;
> +			}
> +		}
> +		elevator_unlock(ctx->e);
> +	}
> +
> +err_free_budgets:
> +	elevator_unlock(ctx->e);
> +	for (; i < count; ++i)
> +		blk_mq_put_dispatch_budget(ctx->q, budget_token[i]);
> +}


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFC v2 0/5] blk-mq-sched: support request batch dispatching for sq elevator
  2025-06-14  9:25 [PATCH RFC v2 0/5] blk-mq-sched: support request batch dispatching for sq elevator Yu Kuai
                   ` (4 preceding siblings ...)
  2025-06-14  9:25 ` [PATCH RFC v2 5/5] blk-mq-sched: support request batch dispatching for sq elevator Yu Kuai
@ 2025-06-16  4:03 ` Damien Le Moal
  2025-06-16  7:22   ` Yu Kuai
  5 siblings, 1 reply; 15+ messages in thread
From: Damien Le Moal @ 2025-06-16  4:03 UTC (permalink / raw)
  To: Yu Kuai, ming.lei, yukuai3, tj, josef, axboe
  Cc: linux-block, cgroups, linux-kernel, yi.zhang, yangerkun,
	johnny.chenyi

On 6/14/25 18:25, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> Before this patch, each dispatch context holds a global lock to
> dispatch one request at a time, which introduces intense lock contention:
> 
> lock
> ops.dispatch_request
> unlock
> 
> Hence, support dispatching a batch of requests while holding the lock to
> reduce lock contention.
> 
> nullblk setup:
> modprobe null_blk nr_devices=0 &&
>     udevadm settle &&
>     cd /sys/kernel/config/nullb &&
>     mkdir nullb0 &&
>     cd nullb0 &&
>     echo 0 > completion_nsec &&
>     echo 512 > blocksize &&
>     echo 0 > home_node &&
>     echo 0 > irqmode &&
>     echo 128 > submit_queues &&
>     echo 1024 > hw_queue_depth &&
>     echo 1024 > size &&
>     echo 0 > memory_backed &&
>     echo 2 > queue_mode &&
>     echo 1 > power ||
>     exit $?
> 
> Test script:
> fio -filename=/dev/$disk -name=test -rw=randwrite -bs=4k -iodepth=32 \
>   -numjobs=16 --iodepth_batch_submit=8 --iodepth_batch_complete=8 \
>   -direct=1 -ioengine=io_uring -group_reporting -time_based -runtime=30
> 
> Test result (elevator is deadline): IOPS
> |                 | null_blk | scsi hdd |
> | --------------- | -------- | -------- |
> | before this set | 263k     | 24       |
> | after this set  | 475k     | 272      |

For the HDD, these numbers are very low, and I do not understand how you can get
any improvement from reducing lock contention, since contention should not be an
issue with this kind of performance. What HW did you use for testing? Was this
a VM?

I tested this null_blk setup and your fio command on a bare-metal 16-cores Xeon
machine. For the scsi disk, I used a 26TB SATA HDD connected to an AHCI port).
With this setup, results are like this:

|                 | null_blk | hdd (write) | hdd (read) |
| --------------- | -------- | ----------- | ---------- |
| before this set | 613k     | 1088        | 211        |
| after this set  | 940k     | 1093        | 212        |

So not surprisingly, there is no improvement for the SATA HDD because of the low
max IOPS these devices can achieve: lock contention is not really an issue when
you are dealing with a slow device. And a SAS HDD will be the same. Gains will
likely be more significant with a fast SAS/FC RAID array, but I do not have
access to that.

But the improvement for a fast device like null_blk is indeed excellent (+53%).

With LOCKDEP & KASAN disabled, the results are like this:

|                 | null_blk | hdd (write) | hdd (read) |
| --------------- | -------- | ----------- | ---------- |
| before this set | 625k     | 1092        | 213        |
| after this set  | 984k     | 1095        | 215        |

No real changes for the HDD, as expected, and the improvement for null_blk is
still good.

So maybe drop the RFC tag on these patches and repost after cleaning things up?


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFC v2 0/5] blk-mq-sched: support request batch dispatching for sq elevator
  2025-06-16  4:03 ` [PATCH RFC v2 0/5] " Damien Le Moal
@ 2025-06-16  7:22   ` Yu Kuai
  2025-06-16  7:37     ` Damien Le Moal
  0 siblings, 1 reply; 15+ messages in thread
From: Yu Kuai @ 2025-06-16  7:22 UTC (permalink / raw)
  To: Damien Le Moal, Yu Kuai, ming.lei, tj, josef, axboe
  Cc: linux-block, cgroups, linux-kernel, yi.zhang, yangerkun,
	johnny.chenyi, yukuai (C)

Hi,

On 2025/06/16 12:03, Damien Le Moal wrote:
> On 6/14/25 18:25, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Before this patch, each dispatch context holds a global lock to
>> dispatch one request at a time, which introduces intense lock contention:
>>
>> lock
>> ops.dispatch_request
>> unlock
>>
>> Hence, support dispatching a batch of requests while holding the lock to
>> reduce lock contention.
>>
>> nullblk setup:
>> modprobe null_blk nr_devices=0 &&
>>      udevadm settle &&
>>      cd /sys/kernel/config/nullb &&
>>      mkdir nullb0 &&
>>      cd nullb0 &&
>>      echo 0 > completion_nsec &&
>>      echo 512 > blocksize &&
>>      echo 0 > home_node &&
>>      echo 0 > irqmode &&
>>      echo 128 > submit_queues &&
>>      echo 1024 > hw_queue_depth &&
>>      echo 1024 > size &&
>>      echo 0 > memory_backed &&
>>      echo 2 > queue_mode &&
>>      echo 1 > power ||
>>      exit $?
>>
>> Test script:
>> fio -filename=/dev/$disk -name=test -rw=randwrite -bs=4k -iodepth=32 \
>>    -numjobs=16 --iodepth_batch_submit=8 --iodepth_batch_complete=8 \
>>    -direct=1 -ioengine=io_uring -group_reporting -time_based -runtime=30
>>
>> Test result (elevator is deadline): IOPS
>> |                 | null_blk | scsi hdd |
>> | --------------- | -------- | -------- |
>> | before this set | 263k     | 24       |
>> | after this set  | 475k     | 272      |
> 
> For the HDD, these numbers are very low, and I do not understand how you can get
> any improvement from reducing lock contention, since contention should not be an
> issue with this kind of performance. What HW did you use for testing ? Was this
> a VM ?
> 

Thanks for reviewing this RFC set! I'm curious why there is an improvement
as well; I didn't have the answer when I sent this set.

I'm testing on a 256-core Kunpeng-920 server, with an MG04ACA600E 5TB HDD
attached to hisi_sas_v3, and the disk has been used for testing for more
than 5 years; perhaps this is why the randwrite numbers are so low.

> I tested this null_blk setup and your fio command on a bare-metal 16-cores Xeon
> machine. For the scsi disk, I used a 26TB SATA HDD connected to an AHCI port).
> With this setup, results are like this:
> 
> |                 | null_blk | hdd (write) | hdd (read) |
> | --------------- | -------- | ----------- | ---------- |
> | before this set | 613k     | 1088        | 211        |
> | after this set  | 940k     | 1093        | 212        |
> 
> So not surprisingly, there is no improvement for the SATA HDD because of the low
> max IOPS these devices can achieve: lock contention is not really an issue when
> you are dealing with a slow device. And a SAS HDD will be the same. Gains may
> likely be more significant with a fast SAS/FC RAID array but I do not have
> access to that.
> 
> But the improvement for a fast device like null_blk is indeed excellent (+53%).
> 
> With LOCKDEP & KASAN disabled, the results are like this:
> 
> |                 | null_blk | hdd (write) | hdd (read) |
> | --------------- | -------- | ----------- | ---------- |
> | before this set | 625k     | 1092        | 213        |
> | after this set  | 984k     | 1095        | 215        |
> 
> No real changes for the HDD, as expected, and the improvement for null_blk is
> still good.

I agree that lock contention here will not affect HDD performance.
What I suspect makes the difference in my environment is that the order of rqs
might change between the elevator dispatching them and the disk handling
them.

For example, the order can easily be changed if more than one context
dispatches one request at a time:

t1:

lock
rq1 = dd_dispatch_request
unlock
			t2:
			lock
			rq2 = dd_dispatch_request
			unlock

lock
rq3 = dd_dispatch_request
unlock

			lock
			rq4 = dd_dispatch_request
			unlock

//rq1,rq3 issue to disk
			// rq2, rq4 issue to disk

In this case, the elevator dispatch order is rq 1-2-3-4; however, the
order seen by the disk is rq 1-3-2-4.

And with batch request dispatching, would this be less likely to happen?
> 
> So maybe drop the RFC tag on these patches and repost after cleaning things up?

Sure, thanks again for reviewing this RFC set.
Kuai

> 
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFC v2 0/5] blk-mq-sched: support request batch dispatching for sq elevator
  2025-06-16  7:22   ` Yu Kuai
@ 2025-06-16  7:37     ` Damien Le Moal
  2025-06-16  9:14       ` Yu Kuai
  0 siblings, 1 reply; 15+ messages in thread
From: Damien Le Moal @ 2025-06-16  7:37 UTC (permalink / raw)
  To: Yu Kuai, ming.lei, tj, josef, axboe
  Cc: linux-block, cgroups, linux-kernel, yi.zhang, yangerkun,
	johnny.chenyi, yukuai (C)

On 6/16/25 16:22, Yu Kuai wrote:
> I agree that lock contention here will not affect HDD performance.
> What I suspect the difference in my environment is that the order of rqs
> might be changed from elevator dispatching them and the disk handling
> them.
> 
> For example, the order can be easily revised if more than one context
> dispatch one request at a time:
> 
> t1:
> 
> lock
> rq1 = dd_dispatch_request
> unlock
> 			t2:
> 			lock
> 			rq2 = dd_dispatch_request
> 			unlock
> 
> lock
> rq3 = dd_dispatch_request
> unlock
> 
> 			lock
> 			rq4 = dd_dispatch_request
> 			unlock
> 
> //rq1,rq3 issue to disk
> 			// rq2, rq4 issue to disk
> 
> In this case, the elevator dispatch order is rq 1-2-3-4, however,
> such order in disk is rq 1-3-2-4.
> 
> And with batch requests dispatch, will this less likely to happen?

If you are running a write test with the HDD write cache enabled, such
reordering will most likely not matter at all. Running the same workload with
"none", I get the same IOPS for writes.

Check your disk. If you do have the HDD write cache disabled, then sure, the
order will matter more depending on how your drive handles WCD writes (recent
drives have very similar performance with WCE and WCD).

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFC v2 0/5] blk-mq-sched: support request batch dispatching for sq elevator
  2025-06-16  7:37     ` Damien Le Moal
@ 2025-06-16  9:14       ` Yu Kuai
  0 siblings, 0 replies; 15+ messages in thread
From: Yu Kuai @ 2025-06-16  9:14 UTC (permalink / raw)
  To: Damien Le Moal, Yu Kuai, ming.lei, tj, josef, axboe
  Cc: linux-block, cgroups, linux-kernel, yi.zhang, yangerkun,
	johnny.chenyi, yukuai (C)

Hi,

On 2025/06/16 15:37, Damien Le Moal wrote:
> On 6/16/25 16:22, Yu Kuai wrote:
>> I agree that lock contention here will not affect HDD performance.
>> What I suspect the difference in my environment is that the order of rqs
>> might be changed from elevator dispatching them and the disk handling
>> them.
>>
>> For example, the order can be easily revised if more than one context
>> dispatch one request at a time:
>>
>> t1:
>>
>> lock
>> rq1 = dd_dispatch_request
>> unlock
>> 			t2:
>> 			lock
>> 			rq2 = dd_dispatch_request
>> 			unlock
>>
>> lock
>> rq3 = dd_dispatch_request
>> unlock
>>
>> 			lock
>> 			rq4 = dd_dispatch_request
>> 			unlock
>>
>> //rq1,rq3 issue to disk
>> 			// rq2, rq4 issue to disk
>>
>> In this case, the elevator dispatch order is rq 1-2-3-4, however,
>> such order in disk is rq 1-3-2-4.
>>
>> And with batch requests dispatch, will this less likely to happen?
> 
> If you are running a write test with the HDD write cache enabled, such
> reordering will most liley not matter at all. Running the same workload with
> "none" and I get the same IOPS for writes.
> 
> Check your disk. If you do have the HDD write cache disabled, then sure, the
> order will matter more depending on how your drive handles WCD writes (recent
> drives have very similar performance with WCE and WCD).
> 
Thanks for the explanation. I'll test more workloads on more disks and,
of course, explain the details more in the formal version as you suggested.

Kuai


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread

Thread overview: 15+ messages
2025-06-14  9:25 [PATCH RFC v2 0/5] blk-mq-sched: support request batch dispatching for sq elevator Yu Kuai
2025-06-14  9:25 ` [PATCH RFC v2 1/5] elevator: introduce global lock for sq_shared elevator Yu Kuai
2025-06-16  2:38   ` Damien Le Moal
2025-06-14  9:25 ` [PATCH RFC v2 2/5] mq-deadline: switch to use elevator lock Yu Kuai
2025-06-16  2:41   ` Damien Le Moal
2025-06-14  9:25 ` [PATCH RFC v2 3/5] block, bfq: " Yu Kuai
2025-06-16  2:46   ` Damien Le Moal
2025-06-14  9:25 ` [PATCH RFC v2 4/5] blk-mq-sched: refactor __blk_mq_do_dispatch_sched() Yu Kuai
2025-06-16  2:54   ` Damien Le Moal
2025-06-14  9:25 ` [PATCH RFC v2 5/5] blk-mq-sched: support request batch dispatching for sq elevator Yu Kuai
2025-06-16  3:07   ` Damien Le Moal
2025-06-16  4:03 ` [PATCH RFC v2 0/5] " Damien Le Moal
2025-06-16  7:22   ` Yu Kuai
2025-06-16  7:37     ` Damien Le Moal
2025-06-16  9:14       ` Yu Kuai
