cgroups.vger.kernel.org archive mirror
* [PATCH v2 0/5] blk-mq-sched: support request batch dispatching for sq elevator
@ 2025-07-30  8:22 Yu Kuai
  2025-07-30  8:22 ` [PATCH v2 1/5] blk-mq-sched: introduce high level elevator lock Yu Kuai
                   ` (5 more replies)
  0 siblings, 6 replies; 26+ messages in thread
From: Yu Kuai @ 2025-07-30  8:22 UTC (permalink / raw)
  To: dlemoal, hare, jack, tj, josef, axboe, yukuai3
  Cc: cgroups, linux-block, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

Changes from v1:
 - the ioc changes are sent separately;
 - change the patch 1-3 order as suggested by Damien;

Currently, both mq-deadline and bfq have a global spin lock that is
grabbed inside elevator methods like dispatch_request, insert_requests,
and bio_merge. This global lock is the main reason mq-deadline and
bfq can't scale well.

For the dispatch_request method, the current behavior is to dispatch one
request at a time. With multiple dispatching contexts, this behavior, on the
one hand, introduces intense lock contention:

t1:                     t2:                     t3:
lock                    lock                    lock
// grab lock
ops.dispatch_request
unlock
                        // grab lock
                        ops.dispatch_request
                        unlock
                                                // grab lock
                                                ops.dispatch_request
                                                unlock

On the other hand, it messes up the request dispatch order:
t1:

lock
rq1 = ops.dispatch_request
unlock
                        t2:
                        lock
                        rq2 = ops.dispatch_request
                        unlock

lock
rq3 = ops.dispatch_request
unlock

                        lock
                        rq4 = ops.dispatch_request
                        unlock

//rq1,rq3 issue to disk
                        // rq2, rq4 issue to disk

In this case, the elevator dispatch order is rq 1-2-3-4; however, the order
seen by the disk is rq 1-3-2-4, i.e. the order of rq2 and rq3 is inverted.

While dispatching a request, blk_mq_get_dispatch_budget() and
blk_mq_get_driver_tag() must be called, and they are not ready to be
called inside elevator methods; hence introducing a new method like
dispatch_requests is not possible.

In conclusion, this set factors the global lock out of the dispatch_request
method, and supports request batch dispatching by calling the method
multiple times while holding the lock.
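
To make the idea concrete, here is a simplified sketch of the batched
dispatch path, condensed from patch 5. It is for illustration only: the
function name below is made up, and the has_work/dispatch-list checks, the
non-budget case and the outer retry loop of the real helper are omitted.

/*
 * Sketch only: hold the elevator lock across a batch of
 * ops.dispatch_request() calls instead of locking around each request.
 */
static void elevator_dispatch_batch_sketch(struct sched_dispatch_ctx *ctx)
{
        int budget_token[BUDGET_TOKEN_BATCH];
        int i, count;

        /* grab a batch of dispatch budgets up front */
        count = blk_mq_get_dispatch_budgets(ctx->q, budget_token);
        if (count <= 0)
                return;

        spin_lock_irq(&ctx->e->lock);
        for (i = 0; i < count; i++) {
                struct request *rq = ctx->e->type->ops.dispatch_request(ctx->hctx);

                if (!rq)
                        break;          /* budgets from index i on are unused */

                blk_mq_set_rq_budget_token(rq, budget_token[i]);
                list_add_tail(&rq->queuelist, &ctx->rq_list);
                ctx->count++;

                if (!blk_mq_get_driver_tag(rq)) {
                        i++;            /* this rq keeps its budget */
                        break;
                }
        }
        spin_unlock_irq(&ctx->e->lock);

        /* release the budgets that were not attached to any request */
        while (i < count)
                blk_mq_put_dispatch_budget(ctx->q, budget_token[i++]);
}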

nullblk setup:
modprobe null_blk nr_devices=0 &&
    udevadm settle &&
    cd /sys/kernel/config/nullb &&
    mkdir nullb0 &&
    cd nullb0 &&
    echo 0 > completion_nsec &&
    echo 512 > blocksize &&
    echo 0 > home_node &&
    echo 0 > irqmode &&
    echo 128 > submit_queues &&
    echo 1024 > hw_queue_depth &&
    echo 1024 > size &&
    echo 0 > memory_backed &&
    echo 2 > queue_mode &&
    echo 1 > power ||
    exit $?

Test script:
fio -filename=/dev/$disk -name=test -rw=randwrite -bs=4k -iodepth=32 \
  -numjobs=16 --iodepth_batch_submit=8 --iodepth_batch_complete=8 \
  -direct=1 -ioengine=io_uring -group_reporting -time_based -runtime=30

Test result: iops

|                 | deadline | bfq      |
| --------------- | -------- | -------- |
| before this set | 263k     | 124k     |
| after this set  | 475k     | 292k     |

Yu Kuai (5):
  blk-mq-sched: introduce high level elevator lock
  mq-deadline: switch to use elevator lock
  block, bfq: switch to use elevator lock
  blk-mq-sched: refactor __blk_mq_do_dispatch_sched()
  blk-mq-sched: support request batch dispatching for sq elevator

 block/bfq-cgroup.c   |   4 +-
 block/bfq-iosched.c  |  49 +++++----
 block/bfq-iosched.h  |   2 +-
 block/blk-mq-sched.c | 241 ++++++++++++++++++++++++++++++-------------
 block/blk-mq.h       |  21 ++++
 block/elevator.c     |   1 +
 block/elevator.h     |   4 +-
 block/mq-deadline.c  |  58 +++++------
 8 files changed, 248 insertions(+), 132 deletions(-)

-- 
2.39.2


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v2 1/5] blk-mq-sched: introduce high level elevator lock
  2025-07-30  8:22 [PATCH v2 0/5] blk-mq-sched: support request batch dispatching for sq elevator Yu Kuai
@ 2025-07-30  8:22 ` Yu Kuai
  2025-07-30 17:19   ` Bart Van Assche
  2025-07-31  6:17   ` Hannes Reinecke
  2025-07-30  8:22 ` [PATCH v2 2/5] mq-deadline: switch to use " Yu Kuai
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 26+ messages in thread
From: Yu Kuai @ 2025-07-30  8:22 UTC (permalink / raw)
  To: dlemoal, hare, jack, tj, josef, axboe, yukuai3
  Cc: cgroups, linux-block, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

Currently, both mq-deadline and bfq have a global spin lock that is
grabbed inside elevator methods like dispatch_request, insert_requests,
and bio_merge. This global lock is the main reason mq-deadline and
bfq can't scale well.

While dispatching a request, blk_mq_get_dispatch_budget() and
blk_mq_get_driver_tag() must be called, and they are not ready to be called
inside elevator methods; hence introducing a new method like
dispatch_requests is not possible.

Hence introduce a new high-level elevator lock; currently it protects
dispatch_request only. Following patches will convert mq-deadline and bfq
to use this lock and finally support request batch dispatching by calling
the method multiple times while holding the lock.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-sched.c | 6 ++++++
 block/elevator.c     | 1 +
 block/elevator.h     | 4 ++--
 3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 55a0fd105147..7911fae75ce4 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -98,6 +98,7 @@ static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 		max_dispatch = hctx->queue->nr_requests;
 
 	do {
+		bool sq_sched = blk_queue_sq_sched(q);
 		struct request *rq;
 		int budget_token;
 
@@ -113,7 +114,12 @@ static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 		if (budget_token < 0)
 			break;
 
+		if (sq_sched)
+			spin_lock(&e->lock);
 		rq = e->type->ops.dispatch_request(hctx);
+		if (sq_sched)
+			spin_unlock(&e->lock);
+
 		if (!rq) {
 			blk_mq_put_dispatch_budget(q, budget_token);
 			/*
diff --git a/block/elevator.c b/block/elevator.c
index 88f8f36bed98..45303af0ca73 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -144,6 +144,7 @@ struct elevator_queue *elevator_alloc(struct request_queue *q,
 	eq->type = e;
 	kobject_init(&eq->kobj, &elv_ktype);
 	mutex_init(&eq->sysfs_lock);
+	spin_lock_init(&eq->lock);
 	hash_init(eq->hash);
 
 	return eq;
diff --git a/block/elevator.h b/block/elevator.h
index a07ce773a38f..cbbac4f7825c 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -110,12 +110,12 @@ struct request *elv_rqhash_find(struct request_queue *q, sector_t offset);
 /*
  * each queue has an elevator_queue associated with it
  */
-struct elevator_queue
-{
+struct elevator_queue {
 	struct elevator_type *type;
 	void *elevator_data;
 	struct kobject kobj;
 	struct mutex sysfs_lock;
+	spinlock_t lock;
 	unsigned long flags;
 	DECLARE_HASHTABLE(hash, ELV_HASH_BITS);
 };
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 2/5] mq-deadline: switch to use elevator lock
  2025-07-30  8:22 [PATCH v2 0/5] blk-mq-sched: support request batch dispatching for sq elevator Yu Kuai
  2025-07-30  8:22 ` [PATCH v2 1/5] blk-mq-sched: introduce high level elevator lock Yu Kuai
@ 2025-07-30  8:22 ` Yu Kuai
  2025-07-30 17:21   ` Bart Van Assche
  2025-07-31  6:20   ` Hannes Reinecke
  2025-07-30  8:22 ` [PATCH v2 3/5] block, bfq: " Yu Kuai
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 26+ messages in thread
From: Yu Kuai @ 2025-07-30  8:22 UTC (permalink / raw)
  To: dlemoal, hare, jack, tj, josef, axboe, yukuai3
  Cc: cgroups, linux-block, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

Replace the internal spinlock 'dd->lock' with the new spinlock in
elevator_queue, there are no functional changes.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/mq-deadline.c | 58 +++++++++++++++++++++------------------------
 1 file changed, 27 insertions(+), 31 deletions(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 9ab6c6256695..2054c023e855 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -101,7 +101,7 @@ struct deadline_data {
 	u32 async_depth;
 	int prio_aging_expire;
 
-	spinlock_t lock;
+	spinlock_t *lock;
 };
 
 /* Maps an I/O priority class to a deadline scheduler priority. */
@@ -213,7 +213,7 @@ static void dd_merged_requests(struct request_queue *q, struct request *req,
 	const u8 ioprio_class = dd_rq_ioclass(next);
 	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
 
-	lockdep_assert_held(&dd->lock);
+	lockdep_assert_held(dd->lock);
 
 	dd->per_prio[prio].stats.merged++;
 
@@ -253,7 +253,7 @@ static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
 {
 	const struct io_stats_per_prio *stats = &dd->per_prio[prio].stats;
 
-	lockdep_assert_held(&dd->lock);
+	lockdep_assert_held(dd->lock);
 
 	return stats->inserted - atomic_read(&stats->completed);
 }
@@ -323,7 +323,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
 	enum dd_prio prio;
 	u8 ioprio_class;
 
-	lockdep_assert_held(&dd->lock);
+	lockdep_assert_held(dd->lock);
 
 	if (!list_empty(&per_prio->dispatch)) {
 		rq = list_first_entry(&per_prio->dispatch, struct request,
@@ -434,7 +434,7 @@ static struct request *dd_dispatch_prio_aged_requests(struct deadline_data *dd,
 	enum dd_prio prio;
 	int prio_cnt;
 
-	lockdep_assert_held(&dd->lock);
+	lockdep_assert_held(dd->lock);
 
 	prio_cnt = !!dd_queued(dd, DD_RT_PRIO) + !!dd_queued(dd, DD_BE_PRIO) +
 		   !!dd_queued(dd, DD_IDLE_PRIO);
@@ -466,10 +466,9 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 	struct request *rq;
 	enum dd_prio prio;
 
-	spin_lock(&dd->lock);
 	rq = dd_dispatch_prio_aged_requests(dd, now);
 	if (rq)
-		goto unlock;
+		return rq;
 
 	/*
 	 * Next, dispatch requests in priority order. Ignore lower priority
@@ -481,9 +480,6 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 			break;
 	}
 
-unlock:
-	spin_unlock(&dd->lock);
-
 	return rq;
 }
 
@@ -538,9 +534,9 @@ static void dd_exit_sched(struct elevator_queue *e)
 		WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_READ]));
 		WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_WRITE]));
 
-		spin_lock(&dd->lock);
+		spin_lock(dd->lock);
 		queued = dd_queued(dd, prio);
-		spin_unlock(&dd->lock);
+		spin_unlock(dd->lock);
 
 		WARN_ONCE(queued != 0,
 			  "statistics for priority %d: i %u m %u d %u c %u\n",
@@ -587,7 +583,7 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	dd->last_dir = DD_WRITE;
 	dd->fifo_batch = fifo_batch;
 	dd->prio_aging_expire = prio_aging_expire;
-	spin_lock_init(&dd->lock);
+	dd->lock = &eq->lock;
 
 	/* We dispatch from request queue wide instead of hw queue */
 	blk_queue_flag_set(QUEUE_FLAG_SQ_SCHED, q);
@@ -643,9 +639,9 @@ static bool dd_bio_merge(struct request_queue *q, struct bio *bio,
 	struct request *free = NULL;
 	bool ret;
 
-	spin_lock(&dd->lock);
+	spin_lock(dd->lock);
 	ret = blk_mq_sched_try_merge(q, bio, nr_segs, &free);
-	spin_unlock(&dd->lock);
+	spin_unlock(dd->lock);
 
 	if (free)
 		blk_mq_free_request(free);
@@ -667,7 +663,7 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	struct dd_per_prio *per_prio;
 	enum dd_prio prio;
 
-	lockdep_assert_held(&dd->lock);
+	lockdep_assert_held(dd->lock);
 
 	prio = ioprio_class_to_prio[ioprio_class];
 	per_prio = &dd->per_prio[prio];
@@ -711,7 +707,7 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
 	struct deadline_data *dd = q->elevator->elevator_data;
 	LIST_HEAD(free);
 
-	spin_lock(&dd->lock);
+	spin_lock(dd->lock);
 	while (!list_empty(list)) {
 		struct request *rq;
 
@@ -719,7 +715,7 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
 		list_del_init(&rq->queuelist);
 		dd_insert_request(hctx, rq, flags, &free);
 	}
-	spin_unlock(&dd->lock);
+	spin_unlock(dd->lock);
 
 	blk_mq_free_requests(&free);
 }
@@ -835,13 +831,13 @@ static const struct elv_fs_entry deadline_attrs[] = {
 #define DEADLINE_DEBUGFS_DDIR_ATTRS(prio, data_dir, name)		\
 static void *deadline_##name##_fifo_start(struct seq_file *m,		\
 					  loff_t *pos)			\
-	__acquires(&dd->lock)						\
+	__acquires(dd->lock)						\
 {									\
 	struct request_queue *q = m->private;				\
 	struct deadline_data *dd = q->elevator->elevator_data;		\
 	struct dd_per_prio *per_prio = &dd->per_prio[prio];		\
 									\
-	spin_lock(&dd->lock);						\
+	spin_lock(dd->lock);						\
 	return seq_list_start(&per_prio->fifo_list[data_dir], *pos);	\
 }									\
 									\
@@ -856,12 +852,12 @@ static void *deadline_##name##_fifo_next(struct seq_file *m, void *v,	\
 }									\
 									\
 static void deadline_##name##_fifo_stop(struct seq_file *m, void *v)	\
-	__releases(&dd->lock)						\
+	__releases(dd->lock)						\
 {									\
 	struct request_queue *q = m->private;				\
 	struct deadline_data *dd = q->elevator->elevator_data;		\
 									\
-	spin_unlock(&dd->lock);						\
+	spin_unlock(dd->lock);						\
 }									\
 									\
 static const struct seq_operations deadline_##name##_fifo_seq_ops = {	\
@@ -927,11 +923,11 @@ static int dd_queued_show(void *data, struct seq_file *m)
 	struct deadline_data *dd = q->elevator->elevator_data;
 	u32 rt, be, idle;
 
-	spin_lock(&dd->lock);
+	spin_lock(dd->lock);
 	rt = dd_queued(dd, DD_RT_PRIO);
 	be = dd_queued(dd, DD_BE_PRIO);
 	idle = dd_queued(dd, DD_IDLE_PRIO);
-	spin_unlock(&dd->lock);
+	spin_unlock(dd->lock);
 
 	seq_printf(m, "%u %u %u\n", rt, be, idle);
 
@@ -943,7 +939,7 @@ static u32 dd_owned_by_driver(struct deadline_data *dd, enum dd_prio prio)
 {
 	const struct io_stats_per_prio *stats = &dd->per_prio[prio].stats;
 
-	lockdep_assert_held(&dd->lock);
+	lockdep_assert_held(dd->lock);
 
 	return stats->dispatched + stats->merged -
 		atomic_read(&stats->completed);
@@ -955,11 +951,11 @@ static int dd_owned_by_driver_show(void *data, struct seq_file *m)
 	struct deadline_data *dd = q->elevator->elevator_data;
 	u32 rt, be, idle;
 
-	spin_lock(&dd->lock);
+	spin_lock(dd->lock);
 	rt = dd_owned_by_driver(dd, DD_RT_PRIO);
 	be = dd_owned_by_driver(dd, DD_BE_PRIO);
 	idle = dd_owned_by_driver(dd, DD_IDLE_PRIO);
-	spin_unlock(&dd->lock);
+	spin_unlock(dd->lock);
 
 	seq_printf(m, "%u %u %u\n", rt, be, idle);
 
@@ -969,13 +965,13 @@ static int dd_owned_by_driver_show(void *data, struct seq_file *m)
 #define DEADLINE_DISPATCH_ATTR(prio)					\
 static void *deadline_dispatch##prio##_start(struct seq_file *m,	\
 					     loff_t *pos)		\
-	__acquires(&dd->lock)						\
+	__acquires(dd->lock)						\
 {									\
 	struct request_queue *q = m->private;				\
 	struct deadline_data *dd = q->elevator->elevator_data;		\
 	struct dd_per_prio *per_prio = &dd->per_prio[prio];		\
 									\
-	spin_lock(&dd->lock);						\
+	spin_lock(dd->lock);						\
 	return seq_list_start(&per_prio->dispatch, *pos);		\
 }									\
 									\
@@ -990,12 +986,12 @@ static void *deadline_dispatch##prio##_next(struct seq_file *m,		\
 }									\
 									\
 static void deadline_dispatch##prio##_stop(struct seq_file *m, void *v)	\
-	__releases(&dd->lock)						\
+	__releases(dd->lock)						\
 {									\
 	struct request_queue *q = m->private;				\
 	struct deadline_data *dd = q->elevator->elevator_data;		\
 									\
-	spin_unlock(&dd->lock);						\
+	spin_unlock(dd->lock);						\
 }									\
 									\
 static const struct seq_operations deadline_dispatch##prio##_seq_ops = { \
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 3/5] block, bfq: switch to use elevator lock
  2025-07-30  8:22 [PATCH v2 0/5] blk-mq-sched: support request batch dispatching for sq elevator Yu Kuai
  2025-07-30  8:22 ` [PATCH v2 1/5] blk-mq-sched: introduce high level elevator lock Yu Kuai
  2025-07-30  8:22 ` [PATCH v2 2/5] mq-deadline: switch to use " Yu Kuai
@ 2025-07-30  8:22 ` Yu Kuai
  2025-07-30 17:24   ` Bart Van Assche
  2025-07-31  6:22   ` Hannes Reinecke
  2025-07-30  8:22 ` [PATCH v2 4/5] blk-mq-sched: refactor __blk_mq_do_dispatch_sched() Yu Kuai
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 26+ messages in thread
From: Yu Kuai @ 2025-07-30  8:22 UTC (permalink / raw)
  To: dlemoal, hare, jack, tj, josef, axboe, yukuai3
  Cc: cgroups, linux-block, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

Replace the internal spinlock bfqd->lock with the new spinlock in
elevator_queue. There are no functional changes.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/bfq-cgroup.c   |  4 ++--
 block/bfq-iosched.c  | 49 +++++++++++++++++++++-----------------------
 block/bfq-iosched.h  |  2 +-
 block/blk-mq-sched.c |  4 ++--
 4 files changed, 28 insertions(+), 31 deletions(-)

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index 9fb9f3533150..1717bac7eccc 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -878,7 +878,7 @@ static void bfq_pd_offline(struct blkg_policy_data *pd)
 	unsigned long flags;
 	int i;
 
-	spin_lock_irqsave(&bfqd->lock, flags);
+	spin_lock_irqsave(bfqd->lock, flags);
 
 	if (!entity) /* root group */
 		goto put_async_queues;
@@ -923,7 +923,7 @@ static void bfq_pd_offline(struct blkg_policy_data *pd)
 put_async_queues:
 	bfq_put_async_queues(bfqd, bfqg);
 
-	spin_unlock_irqrestore(&bfqd->lock, flags);
+	spin_unlock_irqrestore(bfqd->lock, flags);
 	/*
 	 * @blkg is going offline and will be ignored by
 	 * blkg_[rw]stat_recursive_sum().  Transfer stats to the parent so
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index a6a574a8eac9..1b4e01d15cfe 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -466,7 +466,7 @@ static struct bfq_io_cq *bfq_bic_lookup(struct request_queue *q)
  */
 void bfq_schedule_dispatch(struct bfq_data *bfqd)
 {
-	lockdep_assert_held(&bfqd->lock);
+	lockdep_assert_held(bfqd->lock);
 
 	if (bfqd->queued != 0) {
 		bfq_log(bfqd, "schedule dispatch");
@@ -591,7 +591,7 @@ static bool bfqq_request_over_limit(struct bfq_data *bfqd,
 	int level;
 
 retry:
-	spin_lock_irq(&bfqd->lock);
+	spin_lock_irq(bfqd->lock);
 	bfqq = bic_to_bfqq(bic, op_is_sync(opf), act_idx);
 	if (!bfqq)
 		goto out;
@@ -603,7 +603,7 @@ static bool bfqq_request_over_limit(struct bfq_data *bfqd,
 	/* +1 for bfqq entity, root cgroup not included */
 	depth = bfqg_to_blkg(bfqq_group(bfqq))->blkcg->css.cgroup->level + 1;
 	if (depth > alloc_depth) {
-		spin_unlock_irq(&bfqd->lock);
+		spin_unlock_irq(bfqd->lock);
 		if (entities != inline_entities)
 			kfree(entities);
 		entities = kmalloc_array(depth, sizeof(*entities), GFP_NOIO);
@@ -661,7 +661,7 @@ static bool bfqq_request_over_limit(struct bfq_data *bfqd,
 		}
 	}
 out:
-	spin_unlock_irq(&bfqd->lock);
+	spin_unlock_irq(bfqd->lock);
 	if (entities != inline_entities)
 		kfree(entities);
 	return ret;
@@ -2452,7 +2452,7 @@ static bool bfq_bio_merge(struct request_queue *q, struct bio *bio,
 	struct request *free = NULL;
 	bool ret;
 
-	spin_lock_irq(&bfqd->lock);
+	spin_lock_irq(bfqd->lock);
 
 	if (bic) {
 		/*
@@ -2470,7 +2470,7 @@ static bool bfq_bio_merge(struct request_queue *q, struct bio *bio,
 
 	ret = blk_mq_sched_try_merge(q, bio, nr_segs, &free);
 
-	spin_unlock_irq(&bfqd->lock);
+	spin_unlock_irq(bfqd->lock);
 	if (free)
 		blk_mq_free_request(free);
 
@@ -2645,7 +2645,7 @@ static void bfq_end_wr(struct bfq_data *bfqd)
 	struct bfq_queue *bfqq;
 	int i;
 
-	spin_lock_irq(&bfqd->lock);
+	spin_lock_irq(bfqd->lock);
 
 	for (i = 0; i < bfqd->num_actuators; i++) {
 		list_for_each_entry(bfqq, &bfqd->active_list[i], bfqq_list)
@@ -2655,7 +2655,7 @@ static void bfq_end_wr(struct bfq_data *bfqd)
 		bfq_bfqq_end_wr(bfqq);
 	bfq_end_wr_async(bfqd);
 
-	spin_unlock_irq(&bfqd->lock);
+	spin_unlock_irq(bfqd->lock);
 }
 
 static sector_t bfq_io_struct_pos(void *io_struct, bool request)
@@ -5301,8 +5301,6 @@ static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
 	struct bfq_queue *in_serv_queue;
 	bool waiting_rq, idle_timer_disabled = false;
 
-	spin_lock_irq(&bfqd->lock);
-
 	in_serv_queue = bfqd->in_service_queue;
 	waiting_rq = in_serv_queue && bfq_bfqq_wait_request(in_serv_queue);
 
@@ -5312,7 +5310,6 @@ static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
 			waiting_rq && !bfq_bfqq_wait_request(in_serv_queue);
 	}
 
-	spin_unlock_irq(&bfqd->lock);
 	bfq_update_dispatch_stats(hctx->queue, rq,
 			idle_timer_disabled ? in_serv_queue : NULL,
 				idle_timer_disabled);
@@ -5490,9 +5487,9 @@ static void bfq_exit_icq(struct io_cq *icq)
 	 * this is the last time these queues are accessed.
 	 */
 	if (bfqd) {
-		spin_lock_irqsave(&bfqd->lock, flags);
+		spin_lock_irqsave(bfqd->lock, flags);
 		_bfq_exit_icq(bic, bfqd->num_actuators);
-		spin_unlock_irqrestore(&bfqd->lock, flags);
+		spin_unlock_irqrestore(bfqd->lock, flags);
 	} else {
 		_bfq_exit_icq(bic, BFQ_MAX_ACTUATORS);
 	}
@@ -6248,10 +6245,10 @@ static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	if (!cgroup_subsys_on_dfl(io_cgrp_subsys) && rq->bio)
 		bfqg_stats_update_legacy_io(q, rq);
 #endif
-	spin_lock_irq(&bfqd->lock);
+	spin_lock_irq(bfqd->lock);
 	bfqq = bfq_init_rq(rq);
 	if (blk_mq_sched_try_insert_merge(q, rq, &free)) {
-		spin_unlock_irq(&bfqd->lock);
+		spin_unlock_irq(bfqd->lock);
 		blk_mq_free_requests(&free);
 		return;
 	}
@@ -6284,7 +6281,7 @@ static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	 * merge).
 	 */
 	cmd_flags = rq->cmd_flags;
-	spin_unlock_irq(&bfqd->lock);
+	spin_unlock_irq(bfqd->lock);
 
 	bfq_update_insert_stats(q, bfqq, idle_timer_disabled,
 				cmd_flags);
@@ -6665,7 +6662,7 @@ static void bfq_finish_requeue_request(struct request *rq)
 					     rq->io_start_time_ns,
 					     rq->cmd_flags);
 
-	spin_lock_irqsave(&bfqd->lock, flags);
+	spin_lock_irqsave(bfqd->lock, flags);
 	if (likely(rq->rq_flags & RQF_STARTED)) {
 		if (rq == bfqd->waited_rq)
 			bfq_update_inject_limit(bfqd, bfqq);
@@ -6675,7 +6672,7 @@ static void bfq_finish_requeue_request(struct request *rq)
 	bfqq_request_freed(bfqq);
 	bfq_put_queue(bfqq);
 	RQ_BIC(rq)->requests--;
-	spin_unlock_irqrestore(&bfqd->lock, flags);
+	spin_unlock_irqrestore(bfqd->lock, flags);
 
 	/*
 	 * Reset private fields. In case of a requeue, this allows
@@ -7006,7 +7003,7 @@ bfq_idle_slice_timer_body(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 	enum bfqq_expiration reason;
 	unsigned long flags;
 
-	spin_lock_irqsave(&bfqd->lock, flags);
+	spin_lock_irqsave(bfqd->lock, flags);
 
 	/*
 	 * Considering that bfqq may be in race, we should firstly check
@@ -7016,7 +7013,7 @@ bfq_idle_slice_timer_body(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 	 * been cleared in __bfq_bfqd_reset_in_service func.
 	 */
 	if (bfqq != bfqd->in_service_queue) {
-		spin_unlock_irqrestore(&bfqd->lock, flags);
+		spin_unlock_irqrestore(bfqd->lock, flags);
 		return;
 	}
 
@@ -7044,7 +7041,7 @@ bfq_idle_slice_timer_body(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 
 schedule_dispatch:
 	bfq_schedule_dispatch(bfqd);
-	spin_unlock_irqrestore(&bfqd->lock, flags);
+	spin_unlock_irqrestore(bfqd->lock, flags);
 }
 
 /*
@@ -7169,10 +7166,10 @@ static void bfq_exit_queue(struct elevator_queue *e)
 
 	hrtimer_cancel(&bfqd->idle_slice_timer);
 
-	spin_lock_irq(&bfqd->lock);
+	spin_lock_irq(bfqd->lock);
 	list_for_each_entry_safe(bfqq, n, &bfqd->idle_list, bfqq_list)
 		bfq_deactivate_bfqq(bfqd, bfqq, false, false);
-	spin_unlock_irq(&bfqd->lock);
+	spin_unlock_irq(bfqd->lock);
 
 	for (actuator = 0; actuator < bfqd->num_actuators; actuator++)
 		WARN_ON_ONCE(bfqd->rq_in_driver[actuator]);
@@ -7186,10 +7183,10 @@ static void bfq_exit_queue(struct elevator_queue *e)
 #ifdef CONFIG_BFQ_GROUP_IOSCHED
 	blkcg_deactivate_policy(bfqd->queue->disk, &blkcg_policy_bfq);
 #else
-	spin_lock_irq(&bfqd->lock);
+	spin_lock_irq(bfqd->lock);
 	bfq_put_async_queues(bfqd, bfqd->root_group);
 	kfree(bfqd->root_group);
-	spin_unlock_irq(&bfqd->lock);
+	spin_unlock_irq(bfqd->lock);
 #endif
 
 	blk_stat_disable_accounting(bfqd->queue);
@@ -7354,7 +7351,7 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 	/* see comments on the definition of next field inside bfq_data */
 	bfqd->actuator_load_threshold = 4;
 
-	spin_lock_init(&bfqd->lock);
+	bfqd->lock = &eq->lock;
 
 	/*
 	 * The invocation of the next bfq_create_group_hierarchy
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 31217f196f4f..3f612c3e6fae 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -795,7 +795,7 @@ struct bfq_data {
 	/* fallback dummy bfqq for extreme OOM conditions */
 	struct bfq_queue oom_bfqq;
 
-	spinlock_t lock;
+	spinlock_t *lock;
 
 	/*
 	 * bic associated with the task issuing current bio for
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 7911fae75ce4..82c4f4eef9ed 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -115,10 +115,10 @@ static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 			break;
 
 		if (sq_sched)
-			spin_lock(&e->lock);
+			spin_lock_irq(&e->lock);
 		rq = e->type->ops.dispatch_request(hctx);
 		if (sq_sched)
-			spin_unlock(&e->lock);
+			spin_unlock_irq(&e->lock);
 
 		if (!rq) {
 			blk_mq_put_dispatch_budget(q, budget_token);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 4/5] blk-mq-sched: refactor __blk_mq_do_dispatch_sched()
  2025-07-30  8:22 [PATCH v2 0/5] blk-mq-sched: support request batch dispatching for sq elevator Yu Kuai
                   ` (2 preceding siblings ...)
  2025-07-30  8:22 ` [PATCH v2 3/5] block, bfq: " Yu Kuai
@ 2025-07-30  8:22 ` Yu Kuai
  2025-07-30 18:32   ` Bart Van Assche
  2025-07-30  8:22 ` [PATCH v2 5/5] blk-mq-sched: support request batch dispatching for sq elevator Yu Kuai
  2025-07-31  8:18 ` [PATCH v2 0/5] " Ming Lei
  5 siblings, 1 reply; 26+ messages in thread
From: Yu Kuai @ 2025-07-30  8:22 UTC (permalink / raw)
  To: dlemoal, hare, jack, tj, josef, axboe, yukuai3
  Cc: cgroups, linux-block, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

Introduce struct sched_dispatch_ctx, and split the helper into
elevator_dispatch_one_request() and elevator_finish_dispatch(). Also
and comments about the non-error return value.

Make code cleaner, and make it easier to add a new branch to dispatch
a batch of requests at a time in the next patch.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-sched.c | 196 ++++++++++++++++++++++++++-----------------
 1 file changed, 119 insertions(+), 77 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 82c4f4eef9ed..f18aecf710ad 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -74,91 +74,100 @@ static bool blk_mq_dispatch_hctx_list(struct list_head *rq_list)
 
 #define BLK_MQ_BUDGET_DELAY	3		/* ms units */
 
-/*
- * Only SCSI implements .get_budget and .put_budget, and SCSI restarts
- * its queue by itself in its completion handler, so we don't need to
- * restart queue if .get_budget() fails to get the budget.
- *
- * Returns -EAGAIN if hctx->dispatch was found non-empty and run_work has to
- * be run again.  This is necessary to avoid starving flushes.
- */
-static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
-{
-	struct request_queue *q = hctx->queue;
-	struct elevator_queue *e = q->elevator;
-	bool multi_hctxs = false, run_queue = false;
-	bool dispatched = false, busy = false;
-	unsigned int max_dispatch;
-	LIST_HEAD(rq_list);
-	int count = 0;
+struct sched_dispatch_ctx {
+	struct blk_mq_hw_ctx *hctx;
+	struct elevator_queue *e;
+	struct request_queue *q;
 
-	if (hctx->dispatch_busy)
-		max_dispatch = 1;
-	else
-		max_dispatch = hctx->queue->nr_requests;
+	struct list_head rq_list;
+	int count;
 
-	do {
-		bool sq_sched = blk_queue_sq_sched(q);
-		struct request *rq;
-		int budget_token;
+	bool multi_hctxs;
+	bool run_queue;
+	bool busy;
+};
 
-		if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
-			break;
+static bool elevator_can_dispatch(struct sched_dispatch_ctx *ctx)
+{
+	if (ctx->e->type->ops.has_work &&
+	    !ctx->e->type->ops.has_work(ctx->hctx))
+		return false;
 
-		if (!list_empty_careful(&hctx->dispatch)) {
-			busy = true;
-			break;
-		}
+	if (!list_empty_careful(&ctx->hctx->dispatch)) {
+		ctx->busy = true;
+		return false;
+	}
 
-		budget_token = blk_mq_get_dispatch_budget(q);
-		if (budget_token < 0)
-			break;
+	return true;
+}
 
-		if (sq_sched)
-			spin_lock_irq(&e->lock);
-		rq = e->type->ops.dispatch_request(hctx);
-		if (sq_sched)
-			spin_unlock_irq(&e->lock);
+static bool elevator_dispatch_one_request(struct sched_dispatch_ctx *ctx)
+{
+	bool sq_sched = blk_queue_sq_sched(ctx->q);
+	struct request *rq;
+	int budget_token;
 
-		if (!rq) {
-			blk_mq_put_dispatch_budget(q, budget_token);
-			/*
-			 * We're releasing without dispatching. Holding the
-			 * budget could have blocked any "hctx"s with the
-			 * same queue and if we didn't dispatch then there's
-			 * no guarantee anyone will kick the queue.  Kick it
-			 * ourselves.
-			 */
-			run_queue = true;
-			break;
-		}
+	if (!elevator_can_dispatch(ctx))
+		return false;
 
-		blk_mq_set_rq_budget_token(rq, budget_token);
+	budget_token = blk_mq_get_dispatch_budget(ctx->q);
+	if (budget_token < 0)
+		return false;
 
-		/*
-		 * Now this rq owns the budget which has to be released
-		 * if this rq won't be queued to driver via .queue_rq()
-		 * in blk_mq_dispatch_rq_list().
-		 */
-		list_add_tail(&rq->queuelist, &rq_list);
-		count++;
-		if (rq->mq_hctx != hctx)
-			multi_hctxs = true;
+	if (sq_sched)
+		spin_lock_irq(&ctx->e->lock);
+	rq = ctx->e->type->ops.dispatch_request(ctx->hctx);
+	if (sq_sched)
+		spin_unlock_irq(&ctx->e->lock);
 
+	if (!rq) {
+		blk_mq_put_dispatch_budget(ctx->q, budget_token);
 		/*
-		 * If we cannot get tag for the request, stop dequeueing
-		 * requests from the IO scheduler. We are unlikely to be able
-		 * to submit them anyway and it creates false impression for
-		 * scheduling heuristics that the device can take more IO.
+		 * We're releasing without dispatching. Holding the
+		 * budget could have blocked any "hctx"s with the
+		 * same queue and if we didn't dispatch then there's
+		 * no guarantee anyone will kick the queue.  Kick it
+		 * ourselves.
 		 */
-		if (!blk_mq_get_driver_tag(rq))
-			break;
-	} while (count < max_dispatch);
+		ctx->run_queue = true;
+		return false;
+	}
 
-	if (!count) {
-		if (run_queue)
-			blk_mq_delay_run_hw_queues(q, BLK_MQ_BUDGET_DELAY);
-	} else if (multi_hctxs) {
+	blk_mq_set_rq_budget_token(rq, budget_token);
+
+	/*
+	 * Now this rq owns the budget which has to be released
+	 * if this rq won't be queued to driver via .queue_rq()
+	 * in blk_mq_dispatch_rq_list().
+	 */
+	list_add_tail(&rq->queuelist, &ctx->rq_list);
+	ctx->count++;
+	if (rq->mq_hctx != ctx->hctx)
+		ctx->multi_hctxs = true;
+
+	/*
+	 * If we cannot get tag for the request, stop dequeueing
+	 * requests from the IO scheduler. We are unlikely to be able
+	 * to submit them anyway and it creates false impression for
+	 * scheduling heuristics that the device can take more IO.
+	 */
+	return blk_mq_get_driver_tag(rq);
+}
+
+/*
+ * Returns -EAGAIN if hctx->dispatch was found non-empty and run_work has to
+ * be run again. This is necessary to avoid starving flushes.
+ * Return 0 if no request is dispatched.
+ * Return 1 if at least one request is dispatched.
+ */
+static int elevator_finish_dispatch(struct sched_dispatch_ctx *ctx)
+{
+	bool dispatched = false;
+
+	if (!ctx->count) {
+		if (ctx->run_queue)
+			blk_mq_delay_run_hw_queues(ctx->q, BLK_MQ_BUDGET_DELAY);
+	} else if (ctx->multi_hctxs) {
 		/*
 		 * Requests from different hctx may be dequeued from some
 		 * schedulers, such as bfq and deadline.
@@ -166,19 +175,52 @@ static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 		 * Sort the requests in the list according to their hctx,
 		 * dispatch batching requests from same hctx at a time.
 		 */
-		list_sort(NULL, &rq_list, sched_rq_cmp);
+		list_sort(NULL, &ctx->rq_list, sched_rq_cmp);
 		do {
-			dispatched |= blk_mq_dispatch_hctx_list(&rq_list);
-		} while (!list_empty(&rq_list));
+			dispatched |= blk_mq_dispatch_hctx_list(&ctx->rq_list);
+		} while (!list_empty(&ctx->rq_list));
 	} else {
-		dispatched = blk_mq_dispatch_rq_list(hctx, &rq_list, false);
+		dispatched = blk_mq_dispatch_rq_list(ctx->hctx, &ctx->rq_list,
+						     false);
 	}
 
-	if (busy)
+	if (ctx->busy)
 		return -EAGAIN;
+
 	return !!dispatched;
 }
 
+/*
+ * Only SCSI implements .get_budget and .put_budget, and SCSI restarts
+ * its queue by itself in its completion handler, so we don't need to
+ * restart queue if .get_budget() fails to get the budget.
+ *
+ * See elevator_finish_dispatch() for return values.
+ */
+static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
+{
+	unsigned int max_dispatch;
+	struct sched_dispatch_ctx ctx = {
+		.hctx	= hctx,
+		.q	= hctx->queue,
+		.e	= hctx->queue->elevator,
+	};
+
+	INIT_LIST_HEAD(&ctx.rq_list);
+
+	if (hctx->dispatch_busy)
+		max_dispatch = 1;
+	else
+		max_dispatch = hctx->queue->nr_requests;
+
+	do {
+		if (!elevator_dispatch_one_request(&ctx))
+			break;
+	} while (ctx.count < max_dispatch);
+
+	return elevator_finish_dispatch(&ctx);
+}
+
 static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 {
 	unsigned long end = jiffies + HZ;
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 5/5] blk-mq-sched: support request batch dispatching for sq elevator
  2025-07-30  8:22 [PATCH v2 0/5] blk-mq-sched: support request batch dispatching for sq elevator Yu Kuai
                   ` (3 preceding siblings ...)
  2025-07-30  8:22 ` [PATCH v2 4/5] blk-mq-sched: refactor __blk_mq_do_dispatch_sched() Yu Kuai
@ 2025-07-30  8:22 ` Yu Kuai
  2025-07-31  8:18 ` [PATCH v2 0/5] " Ming Lei
  5 siblings, 0 replies; 26+ messages in thread
From: Yu Kuai @ 2025-07-30  8:22 UTC (permalink / raw)
  To: dlemoal, hare, jack, tj, josef, axboe, yukuai3
  Cc: cgroups, linux-block, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

For the dispatch_request method, the current behavior is to dispatch one
request at a time. With multiple dispatching contexts, this behavior, on the
one hand, introduces intense lock contention:

t1:                     t2:                     t3:
lock                    lock                    lock
// grab lock
ops.dispatch_request
unlock
                        // grab lock
                        ops.dispatch_request
                        unlock
                                                // grab lock
                                                ops.dispatch_request
                                                unlock

On the other hand, it messes up the request dispatch order:
t1:

lock
rq1 = ops.dispatch_request
unlock
                        t2:
                        lock
                        rq2 = ops.dispatch_request
                        unlock

lock
rq3 = ops.dispatch_request
unlock

                        lock
                        rq4 = ops.dispatch_request
                        unlock

//rq1,rq3 issue to disk
                        // rq2, rq4 issue to disk

In this case, the elevator dispatch order is rq 1-2-3-4; however, the order
seen by the disk is rq 1-3-2-4, i.e. the order of rq2 and rq3 is inverted.

Fix those problems by introducing elevator_dispatch_requests(): this
helper grabs the lock and dispatches a batch of requests while holding
it.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-sched.c | 61 +++++++++++++++++++++++++++++++++++++++++---
 block/blk-mq.h       | 21 +++++++++++++++
 2 files changed, 78 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index f18aecf710ad..9ee05d6e8350 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -101,6 +101,55 @@ static bool elevator_can_dispatch(struct sched_dispatch_ctx *ctx)
 	return true;
 }
 
+static void elevator_dispatch_requests(struct sched_dispatch_ctx *ctx)
+{
+	bool has_get_budget = ctx->q->mq_ops->get_budget != NULL;
+	int budget_token[BUDGET_TOKEN_BATCH];
+	int count = ctx->q->nr_requests;
+	int i;
+
+	while (true) {
+		if (!elevator_can_dispatch(ctx))
+			return;
+
+		if (has_get_budget) {
+			count = blk_mq_get_dispatch_budgets(ctx->q, budget_token);
+			if (count <= 0)
+				return;
+		}
+
+		spin_lock_irq(&ctx->e->lock);
+		for (i = 0; i < count; ++i) {
+			struct request *rq =
+				ctx->e->type->ops.dispatch_request(ctx->hctx);
+
+			if (!rq) {
+				ctx->run_queue = true;
+				goto err_free_budgets;
+			}
+
+			if (has_get_budget)
+				blk_mq_set_rq_budget_token(rq, budget_token[i]);
+			list_add_tail(&rq->queuelist, &ctx->rq_list);
+			ctx->count++;
+			if (rq->mq_hctx != ctx->hctx)
+				ctx->multi_hctxs = true;
+
+			if (!blk_mq_get_driver_tag(rq)) {
+				i++;
+				goto err_free_budgets;
+			}
+		}
+		spin_unlock_irq(&ctx->e->lock);
+	}
+
+err_free_budgets:
+	spin_unlock_irq(&ctx->e->lock);
+	if (has_get_budget)
+		for (; i < count; ++i)
+			blk_mq_put_dispatch_budget(ctx->q, budget_token[i]);
+}
+
 static bool elevator_dispatch_one_request(struct sched_dispatch_ctx *ctx)
 {
 	bool sq_sched = blk_queue_sq_sched(ctx->q);
@@ -213,10 +262,14 @@ static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 	else
 		max_dispatch = hctx->queue->nr_requests;
 
-	do {
-		if (!elevator_dispatch_one_request(&ctx))
-			break;
-	} while (ctx.count < max_dispatch);
+	if (!hctx->dispatch_busy && blk_queue_sq_sched(ctx.q))
+		elevator_dispatch_requests(&ctx);
+	else {
+		do {
+			if (!elevator_dispatch_one_request(&ctx))
+				break;
+		} while (ctx.count < max_dispatch);
+	}
 
 	return elevator_finish_dispatch(&ctx);
 }
diff --git a/block/blk-mq.h b/block/blk-mq.h
index affb2e14b56e..450c16a07841 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -37,6 +37,7 @@ enum {
 };
 
 #define BLK_MQ_CPU_WORK_BATCH	(8)
+#define BUDGET_TOKEN_BATCH	(8)
 
 typedef unsigned int __bitwise blk_insert_t;
 #define BLK_MQ_INSERT_AT_HEAD		((__force blk_insert_t)0x01)
@@ -262,6 +263,26 @@ static inline int blk_mq_get_dispatch_budget(struct request_queue *q)
 	return 0;
 }
 
+static inline int blk_mq_get_dispatch_budgets(struct request_queue *q,
+					      int *budget_token)
+{
+	int count = 0;
+
+	while (count < BUDGET_TOKEN_BATCH) {
+		int token = 0;
+
+		if (q->mq_ops->get_budget)
+			token = q->mq_ops->get_budget(q);
+
+		if (token < 0)
+			return count;
+
+		budget_token[count++] = token;
+	}
+
+	return count;
+}
+
 static inline void blk_mq_set_rq_budget_token(struct request *rq, int token)
 {
 	if (token < 0)
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/5] blk-mq-sched: introduce high level elevator lock
  2025-07-30  8:22 ` [PATCH v2 1/5] blk-mq-sched: introduce high level elevator lock Yu Kuai
@ 2025-07-30 17:19   ` Bart Van Assche
  2025-07-30 17:59     ` Yu Kuai
  2025-07-31  6:17   ` Hannes Reinecke
  1 sibling, 1 reply; 26+ messages in thread
From: Bart Van Assche @ 2025-07-30 17:19 UTC (permalink / raw)
  To: Yu Kuai, dlemoal, hare, jack, tj, josef, axboe, yukuai3
  Cc: cgroups, linux-block, linux-kernel, yi.zhang, yangerkun,
	johnny.chenyi

On 7/30/25 1:22 AM, Yu Kuai wrote:
> +		if (sq_sched)
> +			spin_lock(&e->lock);
>   		rq = e->type->ops.dispatch_request(hctx);
> +		if (sq_sched)
> +			spin_unlock(&e->lock);

The above will confuse static analyzers. Please change it into something
like the following:

if (blk_queue_sq_sched(q)) {
	spin_lock(&e->lock);
	rq = e->type->ops.dispatch_request(hctx);
	spin_unlock(&e->lock);
} else {
	rq = e->type->ops.dispatch_request(hctx);
}

Otherwise this patch looks good to me.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 2/5] mq-deadline: switch to use elevator lock
  2025-07-30  8:22 ` [PATCH v2 2/5] mq-deadline: switch to use " Yu Kuai
@ 2025-07-30 17:21   ` Bart Van Assche
  2025-07-30 18:01     ` Yu Kuai
  2025-07-31  6:20   ` Hannes Reinecke
  1 sibling, 1 reply; 26+ messages in thread
From: Bart Van Assche @ 2025-07-30 17:21 UTC (permalink / raw)
  To: Yu Kuai, dlemoal, hare, jack, tj, josef, axboe, yukuai3
  Cc: cgroups, linux-block, linux-kernel, yi.zhang, yangerkun,
	johnny.chenyi

On 7/30/25 1:22 AM, Yu Kuai wrote:
> @@ -466,10 +466,9 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
>   	struct request *rq;
>   	enum dd_prio prio;
>   
> -	spin_lock(&dd->lock);
>   	rq = dd_dispatch_prio_aged_requests(dd, now);

The description says "no functional changes" but I think I see a 
functional change above. Please restrict this patch to changing
&dd->lock into dd->lock only.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 3/5] block, bfq: switch to use elevator lock
  2025-07-30  8:22 ` [PATCH v2 3/5] block, bfq: " Yu Kuai
@ 2025-07-30 17:24   ` Bart Van Assche
  2025-07-31  6:22   ` Hannes Reinecke
  1 sibling, 0 replies; 26+ messages in thread
From: Bart Van Assche @ 2025-07-30 17:24 UTC (permalink / raw)
  To: Yu Kuai, dlemoal, hare, jack, tj, josef, axboe, yukuai3
  Cc: cgroups, linux-block, linux-kernel, yi.zhang, yangerkun,
	johnny.chenyi

On 7/30/25 1:22 AM, Yu Kuai wrote:
>   static sector_t bfq_io_struct_pos(void *io_struct, bool request)
> @@ -5301,8 +5301,6 @@ static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
>   	struct bfq_queue *in_serv_queue;
>   	bool waiting_rq, idle_timer_disabled = false;
>   
> -	spin_lock_irq(&bfqd->lock);
> -
>   	in_serv_queue = bfqd->in_service_queue;
>   	waiting_rq = in_serv_queue && bfq_bfqq_wait_request(in_serv_queue);
>   

Please restrict this patch to changing &bfqd->lock into bfqd->lock only.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/5] blk-mq-sched: introduce high level elevator lock
  2025-07-30 17:19   ` Bart Van Assche
@ 2025-07-30 17:59     ` Yu Kuai
  0 siblings, 0 replies; 26+ messages in thread
From: Yu Kuai @ 2025-07-30 17:59 UTC (permalink / raw)
  To: Bart Van Assche, Yu Kuai, dlemoal, hare, jack, tj, josef, axboe,
	yukuai3
  Cc: cgroups, linux-block, linux-kernel, yi.zhang, yangerkun,
	johnny.chenyi

Hi

On 2025/7/31 1:19, Bart Van Assche wrote:
> On 7/30/25 1:22 AM, Yu Kuai wrote:
>> +        if (sq_sched)
>> +            spin_lock(&e->lock);
>>           rq = e->type->ops.dispatch_request(hctx);
>> +        if (sq_sched)
>> +            spin_unlock(&e->lock);
>
> The above will confuse static analyzers. Please change it into something
> like the following:
>
> if (blk_queue_sq_sched(q)) {
>     spin_lock(&e->lock);
>     rq = e->type->ops.dispatch_request(hctx);
>     spin_unlock(&e->lock);
> } else {
>     rq = e->type->ops.dispatch_request(hctx);
> }
>
> Otherwise this patch looks good to me.
Ok, thanks for the review, will change in the next version.
Kuai

>
> Thanks,
>
> Bart.
>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 2/5] mq-deadline: switch to use elevator lock
  2025-07-30 17:21   ` Bart Van Assche
@ 2025-07-30 18:01     ` Yu Kuai
  2025-07-30 18:10       ` Bart Van Assche
  0 siblings, 1 reply; 26+ messages in thread
From: Yu Kuai @ 2025-07-30 18:01 UTC (permalink / raw)
  To: Bart Van Assche, Yu Kuai, dlemoal, hare, jack, tj, josef, axboe,
	yukuai3
  Cc: cgroups, linux-block, linux-kernel, yi.zhang, yangerkun,
	johnny.chenyi

Hi,

On 2025/7/31 1:21, Bart Van Assche wrote:
> On 7/30/25 1:22 AM, Yu Kuai wrote:
>> @@ -466,10 +466,9 @@ static struct request 
>> *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
>>       struct request *rq;
>>       enum dd_prio prio;
>>   -    spin_lock(&dd->lock);
>>       rq = dd_dispatch_prio_aged_requests(dd, now);
>
> The description says "no functional changes" but I think I see a 
> functional change above. Please restrict this patch to changing
> &dd->lock into dd->lock only.
Ok, you mean that moving the lock to the caller is a functional change,
right?

Thanks,
Kuai
>
> Thanks,
>
> Bart.
>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 2/5] mq-deadline: switch to use elevator lock
  2025-07-30 18:01     ` Yu Kuai
@ 2025-07-30 18:10       ` Bart Van Assche
  0 siblings, 0 replies; 26+ messages in thread
From: Bart Van Assche @ 2025-07-30 18:10 UTC (permalink / raw)
  To: yukuai, Yu Kuai, dlemoal, hare, jack, tj, josef, axboe, yukuai3
  Cc: cgroups, linux-block, linux-kernel, yi.zhang, yangerkun,
	johnny.chenyi

On 7/30/25 11:01 AM, Yu Kuai wrote:
> Ok, you mean that moving the lock to the caller is a functional change,
> right?

That's something one could argue about. But I think there is agreement
that each patch should only include one logical change.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 4/5] blk-mq-sched: refactor __blk_mq_do_dispatch_sched()
  2025-07-30  8:22 ` [PATCH v2 4/5] blk-mq-sched: refactor __blk_mq_do_dispatch_sched() Yu Kuai
@ 2025-07-30 18:32   ` Bart Van Assche
  2025-07-31  0:49     ` Yu Kuai
  0 siblings, 1 reply; 26+ messages in thread
From: Bart Van Assche @ 2025-07-30 18:32 UTC (permalink / raw)
  To: Yu Kuai, dlemoal, hare, jack, tj, josef, axboe, yukuai3
  Cc: cgroups, linux-block, linux-kernel, yi.zhang, yangerkun,
	johnny.chenyi

On 7/30/25 1:22 AM, Yu Kuai wrote:
> Introduce struct sched_dispatch_ctx, and split the helper into
> elevator_dispatch_one_request() and elevator_finish_dispatch(). Also
> and comments about the non-error return value.

and -> add

> +struct sched_dispatch_ctx {
> +	struct blk_mq_hw_ctx *hctx;
> +	struct elevator_queue *e;
> +	struct request_queue *q;

'e' is always equal to q->elevator so I'm not sure whether it's worth
having the member 'e'?

> +static bool elevator_can_dispatch(struct sched_dispatch_ctx *ctx)
> +{
> +	if (ctx->e->type->ops.has_work &&
> +	    !ctx->e->type->ops.has_work(ctx->hctx))
> +		return false;
>   
> -		if (!list_empty_careful(&hctx->dispatch)) {
> -			busy = true;
> -			break;
> -		}
> +	if (!list_empty_careful(&ctx->hctx->dispatch)) {
> +		ctx->busy = true;
> +		return false;
> +	}
>   
> -		budget_token = blk_mq_get_dispatch_budget(q);
> -		if (budget_token < 0)
> -			break;
> +	return true;
> +}

Shouldn't all function names in this file start with the blk_mq_ prefix?

Additionally, please rename elevator_can_dispatch() into
elevator_should_dispatch(). I think the latter name better reflects the
purpose of this function.

> +	if (sq_sched)
> +		spin_lock_irq(&ctx->e->lock);
> +	rq = ctx->e->type->ops.dispatch_request(ctx->hctx);
> +	if (sq_sched)
> +		spin_unlock_irq(&ctx->e->lock);

Same comment here as on patch 1/5: code like the above makes it
harder than necessary for static analyzers to verify this code.

>   
> +	if (!rq) {
> +		blk_mq_put_dispatch_budget(ctx->q, budget_token);
>   		/*
> -		 * If we cannot get tag for the request, stop dequeueing
> -		 * requests from the IO scheduler. We are unlikely to be able
> -		 * to submit them anyway and it creates false impression for
> -		 * scheduling heuristics that the device can take more IO.
> +		 * We're releasing without dispatching. Holding the
> +		 * budget could have blocked any "hctx"s with the
> +		 * same queue and if we didn't dispatch then there's
> +		 * no guarantee anyone will kick the queue.  Kick it
> +		 * ourselves.
>   		 */

Please keep the original comment. To me the new comment seems less clear
than the existing comment.

> +static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> +{
> +	unsigned int max_dispatch;
> +	struct sched_dispatch_ctx ctx = {
> +		.hctx	= hctx,
> +		.q	= hctx->queue,
> +		.e	= hctx->queue->elevator,
> +	};
> +
> +	INIT_LIST_HEAD(&ctx.rq_list);

Please remove the INIT_LIST_HEAD() invocation and add the following in
the ctx declaration:

	.rq_list = LIST_HEAD_INIT(ctx.rq_list),

This is a common pattern in kernel code. The following grep command
yields about 200 results:

$ git grep -nH '= LIST_HEAD_INIT.*\.'

Otherwise this patch looks good to me.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 4/5] blk-mq-sched: refactor __blk_mq_do_dispatch_sched()
  2025-07-30 18:32   ` Bart Van Assche
@ 2025-07-31  0:49     ` Yu Kuai
  0 siblings, 0 replies; 26+ messages in thread
From: Yu Kuai @ 2025-07-31  0:49 UTC (permalink / raw)
  To: Bart Van Assche, Yu Kuai, dlemoal, hare, jack, tj, josef, axboe
  Cc: cgroups, linux-block, linux-kernel, yi.zhang, yangerkun,
	johnny.chenyi, yukuai (C)

Hi,

On 2025/07/31 2:32, Bart Van Assche wrote:
> On 7/30/25 1:22 AM, Yu Kuai wrote:
>> Introduce struct sched_dispatch_ctx, and split the helper into
>> elevator_dispatch_one_request() and elevator_finish_dispatch(). Also
>> and comments about the non-error return value.
> 
> and -> add
> 
>> +struct sched_dispatch_ctx {
>> +    struct blk_mq_hw_ctx *hctx;
>> +    struct elevator_queue *e;
>> +    struct request_queue *q;
> 
> 'e' is always equal to q->elevator so I'm not sure whether it's worth to
> have the member 'e'?
> 
>> +static bool elevator_can_dispatch(struct sched_dispatch_ctx *ctx)
>> +{
>> +    if (ctx->e->type->ops.has_work &&
>> +        !ctx->e->type->ops.has_work(ctx->hctx))
>> +        return false;
>> -        if (!list_empty_careful(&hctx->dispatch)) {
>> -            busy = true;
>> -            break;
>> -        }
>> +    if (!list_empty_careful(&ctx->hctx->dispatch)) {
>> +        ctx->busy = true;
>> +        return false;
>> +    }
>> -        budget_token = blk_mq_get_dispatch_budget(q);
>> -        if (budget_token < 0)
>> -            break;
>> +    return true;
>> +}
> 
> Shouldn't all function names in this file start with the blk_mq_ prefix?
Ok

> 
> Additionally, please rename elevator_can_dispatch() into
> elevator_should_dispatch(). I think the latter name better reflects the
> purpose of this function.
Sounds good.

> 
>> +    if (sq_sched)
>> +        spin_lock_irq(&ctx->e->lock);
>> +    rq = ctx->e->type->ops.dispatch_request(ctx->hctx);
>> +    if (sq_sched)
>> +        spin_unlock_irq(&ctx->e->lock);
> 
> Same comment here as on patch 1/5: code like the above makes it
> harder than necessary for static analyzers to verify this code.
Ok

> 
>> +    if (!rq) {
>> +        blk_mq_put_dispatch_budget(ctx->q, budget_token);
>>           /*
>> -         * If we cannot get tag for the request, stop dequeueing
>> -         * requests from the IO scheduler. We are unlikely to be able
>> -         * to submit them anyway and it creates false impression for
>> -         * scheduling heuristics that the device can take more IO.
>> +         * We're releasing without dispatching. Holding the
>> +         * budget could have blocked any "hctx"s with the
>> +         * same queue and if we didn't dispatch then there's
>> +         * no guarantee anyone will kick the queue.  Kick it
>> +         * ourselves.
>>            */
> 
> Please keep the original comment. To me the new comment seems less clear
> than the existing comment.
Please note that I didn't change the comment here; the above comment is for
setting run_queue. The original comment for blk_mq_get_driver_tag()
is still there.

> 
>> +static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>> +{
>> +    unsigned int max_dispatch;
>> +    struct sched_dispatch_ctx ctx = {
>> +        .hctx    = hctx,
>> +        .q    = hctx->queue,
>> +        .e    = hctx->queue->elevator,
>> +    };
>> +
>> +    INIT_LIST_HEAD(&ctx.rq_list);
> 
> Please remove the INIT_LIST_HEAD() invocation and add the following in
> the ctx declaration:
> 
>      .rq_list = LIST_HEAD_INIT(ctx.rq_list),
> 
> This is a common pattern in kernel code. The following grep command
> yields about 200 results:
> 
> $ git grep -nH '= LIST_HEAD_INIT.*\.'
Ok
> 
> Otherwise this patch looks good to me.
> 
Thanks for the review!
Kuai

> Thanks,
> 
> Bart.
> .
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/5] blk-mq-sched: introduce high level elevator lock
  2025-07-30  8:22 ` [PATCH v2 1/5] blk-mq-sched: introduce high level elevator lock Yu Kuai
  2025-07-30 17:19   ` Bart Van Assche
@ 2025-07-31  6:17   ` Hannes Reinecke
  1 sibling, 0 replies; 26+ messages in thread
From: Hannes Reinecke @ 2025-07-31  6:17 UTC (permalink / raw)
  To: Yu Kuai, dlemoal, jack, tj, josef, axboe, yukuai3
  Cc: cgroups, linux-block, linux-kernel, yi.zhang, yangerkun,
	johnny.chenyi

On 7/30/25 10:22, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> Currently, both mq-deadline and bfq have a global spin lock that is
> grabbed inside elevator methods like dispatch_request, insert_requests,
> and bio_merge. This global lock is the main reason mq-deadline and
> bfq can't scale well.
> 
> While dispatching a request, blk_mq_get_dispatch_budget() and
> blk_mq_get_driver_tag() must be called, and they are not ready to be called
> inside elevator methods; hence introducing a new method like
> dispatch_requests is not possible.
> 
> Hence introduce a new high-level elevator lock; currently it protects
> dispatch_request only. Following patches will convert mq-deadline and bfq
> to use this lock and finally support request batch dispatching by calling
> the method multiple times while holding the lock.
> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>   block/blk-mq-sched.c | 6 ++++++
>   block/elevator.c     | 1 +
>   block/elevator.h     | 4 ++--
>   3 files changed, 9 insertions(+), 2 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 2/5] mq-deadline: switch to use elevator lock
  2025-07-30  8:22 ` [PATCH v2 2/5] mq-deadline: switch to use " Yu Kuai
  2025-07-30 17:21   ` Bart Van Assche
@ 2025-07-31  6:20   ` Hannes Reinecke
  2025-07-31  6:22     ` Damien Le Moal
  1 sibling, 1 reply; 26+ messages in thread
From: Hannes Reinecke @ 2025-07-31  6:20 UTC (permalink / raw)
  To: Yu Kuai, dlemoal, jack, tj, josef, axboe, yukuai3
  Cc: cgroups, linux-block, linux-kernel, yi.zhang, yangerkun,
	johnny.chenyi

On 7/30/25 10:22, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> Replace the internal spinlock 'dd->lock' with the new spinlock in
> elevator_queue, there are no functional changes.
> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>   block/mq-deadline.c | 58 +++++++++++++++++++++------------------------
>   1 file changed, 27 insertions(+), 31 deletions(-)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 9ab6c6256695..2054c023e855 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -101,7 +101,7 @@ struct deadline_data {
>   	u32 async_depth;
>   	int prio_aging_expire;
>   
> -	spinlock_t lock;
> +	spinlock_t *lock;
>   };
>   
>   /* Maps an I/O priority class to a deadline scheduler priority. */
> @@ -213,7 +213,7 @@ static void dd_merged_requests(struct request_queue *q, struct request *req,
>   	const u8 ioprio_class = dd_rq_ioclass(next);
>   	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
>   
> -	lockdep_assert_held(&dd->lock);
> +	lockdep_assert_held(dd->lock);
>   
>   	dd->per_prio[prio].stats.merged++;
>   
> @@ -253,7 +253,7 @@ static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
>   {
>   	const struct io_stats_per_prio *stats = &dd->per_prio[prio].stats;
>   
> -	lockdep_assert_held(&dd->lock);
> +	lockdep_assert_held(dd->lock);
>   
>   	return stats->inserted - atomic_read(&stats->completed);
>   }
> @@ -323,7 +323,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
>   	enum dd_prio prio;
>   	u8 ioprio_class;
>   
> -	lockdep_assert_held(&dd->lock);
> +	lockdep_assert_held(dd->lock);
>   
>   	if (!list_empty(&per_prio->dispatch)) {
>   		rq = list_first_entry(&per_prio->dispatch, struct request,
> @@ -434,7 +434,7 @@ static struct request *dd_dispatch_prio_aged_requests(struct deadline_data *dd,
>   	enum dd_prio prio;
>   	int prio_cnt;
>   
> -	lockdep_assert_held(&dd->lock);
> +	lockdep_assert_held(dd->lock);
>   
>   	prio_cnt = !!dd_queued(dd, DD_RT_PRIO) + !!dd_queued(dd, DD_BE_PRIO) +
>   		   !!dd_queued(dd, DD_IDLE_PRIO);
> @@ -466,10 +466,9 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
>   	struct request *rq;
>   	enum dd_prio prio;
>   
> -	spin_lock(&dd->lock);
>   	rq = dd_dispatch_prio_aged_requests(dd, now);
>   	if (rq)
> -		goto unlock;
> +		return rq;
>   
>   	/*
>   	 * Next, dispatch requests in priority order. Ignore lower priority
> @@ -481,9 +480,6 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
>   			break;
>   	}
>   
> -unlock:
> -	spin_unlock(&dd->lock);
> -
>   	return rq;
>   }
>   
> @@ -538,9 +534,9 @@ static void dd_exit_sched(struct elevator_queue *e)
>   		WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_READ]));
>   		WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_WRITE]));
>   
> -		spin_lock(&dd->lock);
> +		spin_lock(dd->lock);
>   		queued = dd_queued(dd, prio);
> -		spin_unlock(&dd->lock);
> +		spin_unlock(dd->lock);
>   
>   		WARN_ONCE(queued != 0,
>   			  "statistics for priority %d: i %u m %u d %u c %u\n",

Do you still need 'dd->lock'? Can't you just refer to the lock from the
elevator_queue structure directly?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 3/5] block, bfq: switch to use elevator lock
  2025-07-30  8:22 ` [PATCH v2 3/5] block, bfq: " Yu Kuai
  2025-07-30 17:24   ` Bart Van Assche
@ 2025-07-31  6:22   ` Hannes Reinecke
  1 sibling, 0 replies; 26+ messages in thread
From: Hannes Reinecke @ 2025-07-31  6:22 UTC (permalink / raw)
  To: Yu Kuai, dlemoal, jack, tj, josef, axboe, yukuai3
  Cc: cgroups, linux-block, linux-kernel, yi.zhang, yangerkun,
	johnny.chenyi

On 7/30/25 10:22, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> Replace the internal spinlock bfqd->lock with the new spinlock in
> elevator_queue. There are no functional changes.
> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>   block/bfq-cgroup.c   |  4 ++--
>   block/bfq-iosched.c  | 49 +++++++++++++++++++++-----------------------
>   block/bfq-iosched.h  |  2 +-
>   block/blk-mq-sched.c |  4 ++--
>   4 files changed, 28 insertions(+), 31 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 2/5] mq-deadline: switch to use elevator lock
  2025-07-31  6:20   ` Hannes Reinecke
@ 2025-07-31  6:22     ` Damien Le Moal
  2025-07-31  6:32       ` Yu Kuai
  0 siblings, 1 reply; 26+ messages in thread
From: Damien Le Moal @ 2025-07-31  6:22 UTC (permalink / raw)
  To: Hannes Reinecke, Yu Kuai, jack, tj, josef, axboe, yukuai3
  Cc: cgroups, linux-block, linux-kernel, yi.zhang, yangerkun,
	johnny.chenyi

On 7/31/25 3:20 PM, Hannes Reinecke wrote:
> On 7/30/25 10:22, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Replace the internal spinlock 'dd->lock' with the new spinlock in
>> elevator_queue, there are no functional changes.
>>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> ---
>>   block/mq-deadline.c | 58 +++++++++++++++++++++------------------------
>>   1 file changed, 27 insertions(+), 31 deletions(-)
>>
>> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
>> index 9ab6c6256695..2054c023e855 100644
>> --- a/block/mq-deadline.c
>> +++ b/block/mq-deadline.c
>> @@ -101,7 +101,7 @@ struct deadline_data {
>>       u32 async_depth;
>>       int prio_aging_expire;
>>   -    spinlock_t lock;
>> +    spinlock_t *lock;
>>   };
>>     /* Maps an I/O priority class to a deadline scheduler priority. */
>> @@ -213,7 +213,7 @@ static void dd_merged_requests(struct request_queue *q,
>> struct request *req,
>>       const u8 ioprio_class = dd_rq_ioclass(next);
>>       const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
>>   -    lockdep_assert_held(&dd->lock);
>> +    lockdep_assert_held(dd->lock);
>>         dd->per_prio[prio].stats.merged++;
>>   @@ -253,7 +253,7 @@ static u32 dd_queued(struct deadline_data *dd, enum
>> dd_prio prio)
>>   {
>>       const struct io_stats_per_prio *stats = &dd->per_prio[prio].stats;
>>   -    lockdep_assert_held(&dd->lock);
>> +    lockdep_assert_held(dd->lock);
>>         return stats->inserted - atomic_read(&stats->completed);
>>   }
>> @@ -323,7 +323,7 @@ static struct request *__dd_dispatch_request(struct
>> deadline_data *dd,
>>       enum dd_prio prio;
>>       u8 ioprio_class;
>>   -    lockdep_assert_held(&dd->lock);
>> +    lockdep_assert_held(dd->lock);
>>         if (!list_empty(&per_prio->dispatch)) {
>>           rq = list_first_entry(&per_prio->dispatch, struct request,
>> @@ -434,7 +434,7 @@ static struct request
>> *dd_dispatch_prio_aged_requests(struct deadline_data *dd,
>>       enum dd_prio prio;
>>       int prio_cnt;
>>   -    lockdep_assert_held(&dd->lock);
>> +    lockdep_assert_held(dd->lock);
>>         prio_cnt = !!dd_queued(dd, DD_RT_PRIO) + !!dd_queued(dd, DD_BE_PRIO) +
>>              !!dd_queued(dd, DD_IDLE_PRIO);
>> @@ -466,10 +466,9 @@ static struct request *dd_dispatch_request(struct
>> blk_mq_hw_ctx *hctx)
>>       struct request *rq;
>>       enum dd_prio prio;
>>   -    spin_lock(&dd->lock);
>>       rq = dd_dispatch_prio_aged_requests(dd, now);
>>       if (rq)
>> -        goto unlock;
>> +        return rq;
>>         /*
>>        * Next, dispatch requests in priority order. Ignore lower priority
>> @@ -481,9 +480,6 @@ static struct request *dd_dispatch_request(struct
>> blk_mq_hw_ctx *hctx)
>>               break;
>>       }
>>   -unlock:
>> -    spin_unlock(&dd->lock);
>> -
>>       return rq;
>>   }
>>   @@ -538,9 +534,9 @@ static void dd_exit_sched(struct elevator_queue *e)
>>           WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_READ]));
>>           WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_WRITE]));
>>   -        spin_lock(&dd->lock);
>> +        spin_lock(dd->lock);
>>           queued = dd_queued(dd, prio);
>> -        spin_unlock(&dd->lock);
>> +        spin_unlock(dd->lock);
>>             WARN_ONCE(queued != 0,
>>                 "statistics for priority %d: i %u m %u d %u c %u\n",
> 
> Do you still need 'dd->lock'? Can't you just refer to the lock from the
> elevator_queue structure directly?

Indeed. Little inline helpers for locking/unlocking q->elevator->lock would be
nice.
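
E.g. something like this (just a rough sketch; the helper names and an
elv->lock field in struct elevator_queue are assumed here, not taken from
the posted patches):

/* Sketch: elevator-level lock helpers, e.g. in block/elevator.h */
static inline void elv_lock(struct elevator_queue *eq)
{
        spin_lock(&eq->lock);
}

static inline void elv_unlock(struct elevator_queue *eq)
{
        spin_unlock(&eq->lock);
}

/*
 * Schedulers would then simply call elv_lock(q->elevator) /
 * elv_unlock(q->elevator) instead of keeping their own spinlock.
 */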

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 2/5] mq-deadline: switch to use elevator lock
  2025-07-31  6:22     ` Damien Le Moal
@ 2025-07-31  6:32       ` Yu Kuai
  2025-07-31  7:04         ` Damien Le Moal
  0 siblings, 1 reply; 26+ messages in thread
From: Yu Kuai @ 2025-07-31  6:32 UTC (permalink / raw)
  To: Damien Le Moal, Hannes Reinecke, Yu Kuai, jack, tj, josef, axboe
  Cc: cgroups, linux-block, linux-kernel, yi.zhang, yangerkun,
	johnny.chenyi, yukuai (C)

Hi,

在 2025/07/31 14:22, Damien Le Moal 写道:
> On 7/31/25 3:20 PM, Hannes Reinecke wrote:
>> On 7/30/25 10:22, Yu Kuai wrote:
>>> From: Yu Kuai <yukuai3@huawei.com>
>>>
>>> Replace the internal spinlock 'dd->lock' with the new spinlock in
>>> elevator_queue, there are no functional changes.
>>>
>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>>> ---
>>>    block/mq-deadline.c | 58 +++++++++++++++++++++------------------------
>>>    1 file changed, 27 insertions(+), 31 deletions(-)
>>>
>>> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
>>> index 9ab6c6256695..2054c023e855 100644
>>> --- a/block/mq-deadline.c
>>> +++ b/block/mq-deadline.c
>>> @@ -101,7 +101,7 @@ struct deadline_data {
>>>        u32 async_depth;
>>>        int prio_aging_expire;
>>>    -    spinlock_t lock;
>>> +    spinlock_t *lock;
>>>    };
>>>      /* Maps an I/O priority class to a deadline scheduler priority. */
>>> @@ -213,7 +213,7 @@ static void dd_merged_requests(struct request_queue *q,
>>> struct request *req,
>>>        const u8 ioprio_class = dd_rq_ioclass(next);
>>>        const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
>>>    -    lockdep_assert_held(&dd->lock);
>>> +    lockdep_assert_held(dd->lock);
>>>          dd->per_prio[prio].stats.merged++;
>>>    @@ -253,7 +253,7 @@ static u32 dd_queued(struct deadline_data *dd, enum
>>> dd_prio prio)
>>>    {
>>>        const struct io_stats_per_prio *stats = &dd->per_prio[prio].stats;
>>>    -    lockdep_assert_held(&dd->lock);
>>> +    lockdep_assert_held(dd->lock);
>>>          return stats->inserted - atomic_read(&stats->completed);
>>>    }
>>> @@ -323,7 +323,7 @@ static struct request *__dd_dispatch_request(struct
>>> deadline_data *dd,
>>>        enum dd_prio prio;
>>>        u8 ioprio_class;
>>>    -    lockdep_assert_held(&dd->lock);
>>> +    lockdep_assert_held(dd->lock);
>>>          if (!list_empty(&per_prio->dispatch)) {
>>>            rq = list_first_entry(&per_prio->dispatch, struct request,
>>> @@ -434,7 +434,7 @@ static struct request
>>> *dd_dispatch_prio_aged_requests(struct deadline_data *dd,
>>>        enum dd_prio prio;
>>>        int prio_cnt;
>>>    -    lockdep_assert_held(&dd->lock);
>>> +    lockdep_assert_held(dd->lock);
>>>          prio_cnt = !!dd_queued(dd, DD_RT_PRIO) + !!dd_queued(dd, DD_BE_PRIO) +
>>>               !!dd_queued(dd, DD_IDLE_PRIO);
>>> @@ -466,10 +466,9 @@ static struct request *dd_dispatch_request(struct
>>> blk_mq_hw_ctx *hctx)
>>>        struct request *rq;
>>>        enum dd_prio prio;
>>>    -    spin_lock(&dd->lock);
>>>        rq = dd_dispatch_prio_aged_requests(dd, now);
>>>        if (rq)
>>> -        goto unlock;
>>> +        return rq;
>>>          /*
>>>         * Next, dispatch requests in priority order. Ignore lower priority
>>> @@ -481,9 +480,6 @@ static struct request *dd_dispatch_request(struct
>>> blk_mq_hw_ctx *hctx)
>>>                break;
>>>        }
>>>    -unlock:
>>> -    spin_unlock(&dd->lock);
>>> -
>>>        return rq;
>>>    }
>>>    @@ -538,9 +534,9 @@ static void dd_exit_sched(struct elevator_queue *e)
>>>            WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_READ]));
>>>            WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_WRITE]));
>>>    -        spin_lock(&dd->lock);
>>> +        spin_lock(dd->lock);
>>>            queued = dd_queued(dd, prio);
>>> -        spin_unlock(&dd->lock);
>>> +        spin_unlock(dd->lock);
>>>              WARN_ONCE(queued != 0,
>>>                  "statistics for priority %d: i %u m %u d %u c %u\n",
>>
>> Do you still need 'dd->lock'? Can't you just refer to the lock from the
>> elevator_queue structure directly?
> 
> Indeed. Little inline helpers for locking/unlocking q->elevator->lock would be
> nice.

How about making the first patch factor out inline helpers like dd_lock()
and dd_unlock() that still use dd->lock, without any functional changes, and
then switching to q->elevator->lock in the next patch? (same for bfq)
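
For example (only a sketch of the idea, the exact helper form isn't settled
yet):

/* Patch 1: helpers in block/mq-deadline.c, still backed by dd->lock */
static inline void dd_lock(struct deadline_data *dd)
{
        spin_lock(&dd->lock);
}

static inline void dd_unlock(struct deadline_data *dd)
{
        spin_unlock(&dd->lock);
}

/*
 * Patch 2 would then only repoint these helpers at the elevator lock
 * (e.g. via a spinlock_t *lock member, as in this v2), leaving all
 * call sites untouched.
 */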

Thanks,
Kuai

> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 2/5] mq-deadline: switch to use elevator lock
  2025-07-31  6:32       ` Yu Kuai
@ 2025-07-31  7:04         ` Damien Le Moal
  2025-07-31  7:14           ` Yu Kuai
  0 siblings, 1 reply; 26+ messages in thread
From: Damien Le Moal @ 2025-07-31  7:04 UTC (permalink / raw)
  To: Yu Kuai, Hannes Reinecke, jack, tj, josef, axboe
  Cc: cgroups, linux-block, linux-kernel, yi.zhang, yangerkun,
	johnny.chenyi, yukuai (C)

On 7/31/25 3:32 PM, Yu Kuai wrote:
> Hi,
> 
> 在 2025/07/31 14:22, Damien Le Moal 写道:
>> On 7/31/25 3:20 PM, Hannes Reinecke wrote:
>>> On 7/30/25 10:22, Yu Kuai wrote:
>>>> From: Yu Kuai <yukuai3@huawei.com>
>>>>
>>>> Replace the internal spinlock 'dd->lock' with the new spinlock in
>>>> elevator_queue, there are no functional changes.
>>>>
>>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>>>> ---
>>>>    block/mq-deadline.c | 58 +++++++++++++++++++++------------------------
>>>>    1 file changed, 27 insertions(+), 31 deletions(-)
>>>>
>>>> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
>>>> index 9ab6c6256695..2054c023e855 100644
>>>> --- a/block/mq-deadline.c
>>>> +++ b/block/mq-deadline.c
>>>> @@ -101,7 +101,7 @@ struct deadline_data {
>>>>        u32 async_depth;
>>>>        int prio_aging_expire;
>>>>    -    spinlock_t lock;
>>>> +    spinlock_t *lock;
>>>>    };
>>>>      /* Maps an I/O priority class to a deadline scheduler priority. */
>>>> @@ -213,7 +213,7 @@ static void dd_merged_requests(struct request_queue *q,
>>>> struct request *req,
>>>>        const u8 ioprio_class = dd_rq_ioclass(next);
>>>>        const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
>>>>    -    lockdep_assert_held(&dd->lock);
>>>> +    lockdep_assert_held(dd->lock);
>>>>          dd->per_prio[prio].stats.merged++;
>>>>    @@ -253,7 +253,7 @@ static u32 dd_queued(struct deadline_data *dd, enum
>>>> dd_prio prio)
>>>>    {
>>>>        const struct io_stats_per_prio *stats = &dd->per_prio[prio].stats;
>>>>    -    lockdep_assert_held(&dd->lock);
>>>> +    lockdep_assert_held(dd->lock);
>>>>          return stats->inserted - atomic_read(&stats->completed);
>>>>    }
>>>> @@ -323,7 +323,7 @@ static struct request *__dd_dispatch_request(struct
>>>> deadline_data *dd,
>>>>        enum dd_prio prio;
>>>>        u8 ioprio_class;
>>>>    -    lockdep_assert_held(&dd->lock);
>>>> +    lockdep_assert_held(dd->lock);
>>>>          if (!list_empty(&per_prio->dispatch)) {
>>>>            rq = list_first_entry(&per_prio->dispatch, struct request,
>>>> @@ -434,7 +434,7 @@ static struct request
>>>> *dd_dispatch_prio_aged_requests(struct deadline_data *dd,
>>>>        enum dd_prio prio;
>>>>        int prio_cnt;
>>>>    -    lockdep_assert_held(&dd->lock);
>>>> +    lockdep_assert_held(dd->lock);
>>>>          prio_cnt = !!dd_queued(dd, DD_RT_PRIO) + !!dd_queued(dd,
>>>> DD_BE_PRIO) +
>>>>               !!dd_queued(dd, DD_IDLE_PRIO);
>>>> @@ -466,10 +466,9 @@ static struct request *dd_dispatch_request(struct
>>>> blk_mq_hw_ctx *hctx)
>>>>        struct request *rq;
>>>>        enum dd_prio prio;
>>>>    -    spin_lock(&dd->lock);
>>>>        rq = dd_dispatch_prio_aged_requests(dd, now);
>>>>        if (rq)
>>>> -        goto unlock;
>>>> +        return rq;
>>>>          /*
>>>>         * Next, dispatch requests in priority order. Ignore lower priority
>>>> @@ -481,9 +480,6 @@ static struct request *dd_dispatch_request(struct
>>>> blk_mq_hw_ctx *hctx)
>>>>                break;
>>>>        }
>>>>    -unlock:
>>>> -    spin_unlock(&dd->lock);
>>>> -
>>>>        return rq;
>>>>    }
>>>>    @@ -538,9 +534,9 @@ static void dd_exit_sched(struct elevator_queue *e)
>>>>            WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_READ]));
>>>>            WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_WRITE]));
>>>>    -        spin_lock(&dd->lock);
>>>> +        spin_lock(dd->lock);
>>>>            queued = dd_queued(dd, prio);
>>>> -        spin_unlock(&dd->lock);
>>>> +        spin_unlock(dd->lock);
>>>>              WARN_ONCE(queued != 0,
>>>>                  "statistics for priority %d: i %u m %u d %u c %u\n",
>>>
>>> Do you still need 'dd->lock'? Can't you just refer to the lock from the
>>> elevator_queue structure directly?
>>
>> Indeed. Little inline helpers for locking/unlocking q->elevator->lock would be
>> nice.
> 
> How about the first patch to factor out inline helpers like dd_lock()
> and dd_unlock(), still use dd->lock without any functional changes, and
> then switch to use q->elevator->lock in the next patch? (same for bfq)

Patch one can introduce elv->lock and the helpers, then patch 2 can use the
helpers to replace dd->lock. Just don't say "no functional change" in the commit
message; rather, explain that things keep working the same way as before, but
using a different lock. That will address Bart's comment too.
And the same for bfq in patch 3.


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 2/5] mq-deadline: switch to use elevator lock
  2025-07-31  7:04         ` Damien Le Moal
@ 2025-07-31  7:14           ` Yu Kuai
  0 siblings, 0 replies; 26+ messages in thread
From: Yu Kuai @ 2025-07-31  7:14 UTC (permalink / raw)
  To: Damien Le Moal, Yu Kuai, Hannes Reinecke, jack, tj, josef, axboe
  Cc: cgroups, linux-block, linux-kernel, yi.zhang, yangerkun,
	johnny.chenyi, yukuai (C)

Hi,

在 2025/07/31 15:04, Damien Le Moal 写道:
> On 7/31/25 3:32 PM, Yu Kuai wrote:
>> Hi,
>>
>> 在 2025/07/31 14:22, Damien Le Moal 写道:
>>> On 7/31/25 3:20 PM, Hannes Reinecke wrote:
>>>> On 7/30/25 10:22, Yu Kuai wrote:
>>>>> From: Yu Kuai <yukuai3@huawei.com>
>>>>>
>>>>> Replace the internal spinlock 'dd->lock' with the new spinlock in
>>>>> elevator_queue, there are no functional changes.
>>>>>
>>>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>>>>> ---
>>>>>     block/mq-deadline.c | 58 +++++++++++++++++++++------------------------
>>>>>     1 file changed, 27 insertions(+), 31 deletions(-)
>>>>>
>>>>> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
>>>>> index 9ab6c6256695..2054c023e855 100644
>>>>> --- a/block/mq-deadline.c
>>>>> +++ b/block/mq-deadline.c
>>>>> @@ -101,7 +101,7 @@ struct deadline_data {
>>>>>         u32 async_depth;
>>>>>         int prio_aging_expire;
>>>>>     -    spinlock_t lock;
>>>>> +    spinlock_t *lock;
>>>>>     };
>>>>>       /* Maps an I/O priority class to a deadline scheduler priority. */
>>>>> @@ -213,7 +213,7 @@ static void dd_merged_requests(struct request_queue *q,
>>>>> struct request *req,
>>>>>         const u8 ioprio_class = dd_rq_ioclass(next);
>>>>>         const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
>>>>>     -    lockdep_assert_held(&dd->lock);
>>>>> +    lockdep_assert_held(dd->lock);
>>>>>           dd->per_prio[prio].stats.merged++;
>>>>>     @@ -253,7 +253,7 @@ static u32 dd_queued(struct deadline_data *dd, enum
>>>>> dd_prio prio)
>>>>>     {
>>>>>         const struct io_stats_per_prio *stats = &dd->per_prio[prio].stats;
>>>>>     -    lockdep_assert_held(&dd->lock);
>>>>> +    lockdep_assert_held(dd->lock);
>>>>>           return stats->inserted - atomic_read(&stats->completed);
>>>>>     }
>>>>> @@ -323,7 +323,7 @@ static struct request *__dd_dispatch_request(struct
>>>>> deadline_data *dd,
>>>>>         enum dd_prio prio;
>>>>>         u8 ioprio_class;
>>>>>     -    lockdep_assert_held(&dd->lock);
>>>>> +    lockdep_assert_held(dd->lock);
>>>>>           if (!list_empty(&per_prio->dispatch)) {
>>>>>             rq = list_first_entry(&per_prio->dispatch, struct request,
>>>>> @@ -434,7 +434,7 @@ static struct request
>>>>> *dd_dispatch_prio_aged_requests(struct deadline_data *dd,
>>>>>         enum dd_prio prio;
>>>>>         int prio_cnt;
>>>>>     -    lockdep_assert_held(&dd->lock);
>>>>> +    lockdep_assert_held(dd->lock);
>>>>>           prio_cnt = !!dd_queued(dd, DD_RT_PRIO) + !!dd_queued(dd,
>>>>> DD_BE_PRIO) +
>>>>>                !!dd_queued(dd, DD_IDLE_PRIO);
>>>>> @@ -466,10 +466,9 @@ static struct request *dd_dispatch_request(struct
>>>>> blk_mq_hw_ctx *hctx)
>>>>>         struct request *rq;
>>>>>         enum dd_prio prio;
>>>>>     -    spin_lock(&dd->lock);
>>>>>         rq = dd_dispatch_prio_aged_requests(dd, now);
>>>>>         if (rq)
>>>>> -        goto unlock;
>>>>> +        return rq;
>>>>>           /*
>>>>>          * Next, dispatch requests in priority order. Ignore lower priority
>>>>> @@ -481,9 +480,6 @@ static struct request *dd_dispatch_request(struct
>>>>> blk_mq_hw_ctx *hctx)
>>>>>                 break;
>>>>>         }
>>>>>     -unlock:
>>>>> -    spin_unlock(&dd->lock);
>>>>> -
>>>>>         return rq;
>>>>>     }
>>>>>     @@ -538,9 +534,9 @@ static void dd_exit_sched(struct elevator_queue *e)
>>>>>             WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_READ]));
>>>>>             WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_WRITE]));
>>>>>     -        spin_lock(&dd->lock);
>>>>> +        spin_lock(dd->lock);
>>>>>             queued = dd_queued(dd, prio);
>>>>> -        spin_unlock(&dd->lock);
>>>>> +        spin_unlock(dd->lock);
>>>>>               WARN_ONCE(queued != 0,
>>>>>                   "statistics for priority %d: i %u m %u d %u c %u\n",
>>>>
>>>> Do you still need 'dd->lock'? Can't you just refer to the lock from the
>>>> elevator_queue structure directly?
>>>
>>> Indeed. Little inline helpers for locking/unlocking q->elevator->lock would be
>>> nice.
>>
>> How about the first patch to factor out inline helpers like dd_lock()
>> and dd_unlock(), still use dd->lock without any functional changes, and
>> then switch to use q->elevator->lock in the next patch? (same for bfq)
> 
> Patch one can introduce elv->lock and the helpers, then patch 2 use the helpers
> to replace dd->lock. Just don't say "no functional change" in the commit
> message and rather explain that things keep working the same way as before, but
> using a different lock. That will address Bart's comment too.
> And same for bfq in patch 3.
> 
Ok, this is what I did in the first RFC version:

https://lore.kernel.org/all/20250530080355.1138759-3-yukuai1@huaweicloud.com/

I somehow convinced myself that using dd->lock was better. :(
Will change this in the next version.

Thanks,
Kuai

> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 0/5] blk-mq-sched: support request batch dispatching for sq elevator
  2025-07-30  8:22 [PATCH v2 0/5] blk-mq-sched: support request batch dispatching for sq elevator Yu Kuai
                   ` (4 preceding siblings ...)
  2025-07-30  8:22 ` [PATCH v2 5/5] blk-mq-sched: support request batch dispatching for sq elevator Yu Kuai
@ 2025-07-31  8:18 ` Ming Lei
  2025-07-31  8:42   ` Yu Kuai
  5 siblings, 1 reply; 26+ messages in thread
From: Ming Lei @ 2025-07-31  8:18 UTC (permalink / raw)
  To: Yu Kuai
  Cc: dlemoal, hare, jack, tj, josef, axboe, yukuai3, cgroups,
	linux-block, linux-kernel, yi.zhang, yangerkun, johnny.chenyi

On Wed, Jul 30, 2025 at 04:22:02PM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> Changes from v1:
>  - the ioc changes are send separately;
>  - change the patch 1-3 order as suggested by Damien;
> 
> Currently, both mq-deadline and bfq have global spin lock that will be
> grabbed inside elevator methods like dispatch_request, insert_requests,
> and bio_merge. And the global lock is the main reason mq-deadline and
> bfq can't scale very well.
> 
> For dispatch_request method, current behavior is dispatching one request at
> a time. In the case of multiple dispatching contexts, This behavior, on the
> one hand, introduce intense lock contention:
> 
> t1:                     t2:                     t3:
> lock                    lock                    lock
> // grab lock
> ops.dispatch_request
> unlock
>                         // grab lock
>                         ops.dispatch_request
>                         unlock
>                                                 // grab lock
>                                                 ops.dispatch_request
>                                                 unlock
> 
> on the other hand, messing up the requests dispatching order:
> t1:
> 
> lock
> rq1 = ops.dispatch_request
> unlock
>                         t2:
>                         lock
>                         rq2 = ops.dispatch_request
>                         unlock
> 
> lock
> rq3 = ops.dispatch_request
> unlock
> 
>                         lock
>                         rq4 = ops.dispatch_request
>                         unlock
> 
> //rq1,rq3 issue to disk
>                         // rq2, rq4 issue to disk
> 
> In this case, the elevator dispatch order is rq 1-2-3-4, however,
> such order in disk is rq 1-3-2-4, the order for rq2 and rq3 is inversed.
> 
> While dispatching request, blk_mq_get_disatpch_budget() and
> blk_mq_get_driver_tag() must be called, and they are not ready to be
> called inside elevator methods, hence introduce a new method like
> dispatch_requests is not possible.
> 
> In conclusion, this set factor the global lock out of dispatch_request
> method, and support request batch dispatch by calling the methods
> multiple time while holding the lock.
> 
> nullblk setup:
> modprobe null_blk nr_devices=0 &&
>     udevadm settle &&
>     cd /sys/kernel/config/nullb &&
>     mkdir nullb0 &&
>     cd nullb0 &&
>     echo 0 > completion_nsec &&
>     echo 512 > blocksize &&
>     echo 0 > home_node &&
>     echo 0 > irqmode &&
>     echo 128 > submit_queues &&
>     echo 1024 > hw_queue_depth &&
>     echo 1024 > size &&
>     echo 0 > memory_backed &&
>     echo 2 > queue_mode &&
>     echo 1 > power ||
>     exit $?
> 
> Test script:
> fio -filename=/dev/$disk -name=test -rw=randwrite -bs=4k -iodepth=32 \
>   -numjobs=16 --iodepth_batch_submit=8 --iodepth_batch_complete=8 \
>   -direct=1 -ioengine=io_uring -group_reporting -time_based -runtime=30
> 
> Test result: iops
> 
> |                 | deadline | bfq      |
> | --------------- | -------- | -------- |
> | before this set | 263k     | 124k     |
> | after this set  | 475k     | 292k     |

Batch dispatch may hurt IO merge performance, which is important for an
elevator, so please provide test data on real HDD & SSD instead of
null_blk only. It would be perfect if a merge-sensitive workload
is evaluated too.



Thanks,
Ming


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 0/5] blk-mq-sched: support request batch dispatching for sq elevator
  2025-07-31  8:18 ` [PATCH v2 0/5] " Ming Lei
@ 2025-07-31  8:42   ` Yu Kuai
  2025-07-31  9:25     ` Ming Lei
  0 siblings, 1 reply; 26+ messages in thread
From: Yu Kuai @ 2025-07-31  8:42 UTC (permalink / raw)
  To: Ming Lei, Yu Kuai
  Cc: dlemoal, hare, jack, tj, josef, axboe, cgroups, linux-block,
	linux-kernel, yi.zhang, yangerkun, johnny.chenyi, yukuai (C)

Hi,

在 2025/07/31 16:18, Ming Lei 写道:
> batch dispatch may hurt io merge performance which is important for
> elevator, so please provide test data on real HDD. & SSD., instead of
> null_blk only, and it can be perfect if merge sensitive workload
> is evaluated.

Ok, I'll provide test data on the HDD and SSD that I have for now.

For the elevator IO merge case, what I have in mind is issuing
small sequential IO one by one from multiple contexts, so that bios
won't be merged in the plug; this also requires the IO issue rate to
exceed the IO completion rate. Is this case enough?

Thanks,
Kuai


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 0/5] blk-mq-sched: support request batch dispatching for sq elevator
  2025-07-31  8:42   ` Yu Kuai
@ 2025-07-31  9:25     ` Ming Lei
  2025-07-31  9:33       ` Yu Kuai
  0 siblings, 1 reply; 26+ messages in thread
From: Ming Lei @ 2025-07-31  9:25 UTC (permalink / raw)
  To: Yu Kuai
  Cc: dlemoal, hare, jack, tj, josef, axboe, cgroups, linux-block,
	linux-kernel, yi.zhang, yangerkun, johnny.chenyi, yukuai (C)

On Thu, Jul 31, 2025 at 04:42:10PM +0800, Yu Kuai wrote:
> Hi,
> 
> 在 2025/07/31 16:18, Ming Lei 写道:
> > batch dispatch may hurt io merge performance which is important for
> > elevator, so please provide test data on real HDD. & SSD., instead of
> > null_blk only, and it can be perfect if merge sensitive workload
> > is evaluated.
> 
> Ok, I'll provide test data on HDD and SSD that I have for now.
> 
> For the elevator IO merge case, what I have in mind is that we issue
> small sequential IO one by one with multiple contexts, so that bios
> won't be merged in plug, and this will require IO issue > IO done, is
> this case enough?

A long time ago I investigated one such issue triggered by a qemu
workload, but I'm not sure I can find it now.

Also, many SCSI devices can easily run into queue busy; that is when
scheduler merging starts to work, and it may perform worse if you
dispatch more in this situation.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 0/5] blk-mq-sched: support request batch dispatching for sq elevator
  2025-07-31  9:25     ` Ming Lei
@ 2025-07-31  9:33       ` Yu Kuai
  2025-07-31 10:22         ` Ming Lei
  0 siblings, 1 reply; 26+ messages in thread
From: Yu Kuai @ 2025-07-31  9:33 UTC (permalink / raw)
  To: Ming Lei, Yu Kuai
  Cc: dlemoal, hare, jack, tj, josef, axboe, cgroups, linux-block,
	linux-kernel, yi.zhang, yangerkun, johnny.chenyi, yukuai (C)

Hi,

在 2025/07/31 17:25, Ming Lei 写道:
> On Thu, Jul 31, 2025 at 04:42:10PM +0800, Yu Kuai wrote:
>> Hi,
>>
>> 在 2025/07/31 16:18, Ming Lei 写道:
>>> batch dispatch may hurt io merge performance which is important for
>>> elevator, so please provide test data on real HDD. & SSD., instead of
>>> null_blk only, and it can be perfect if merge sensitive workload
>>> is evaluated.
>>
>> Ok, I'll provide test data on HDD and SSD that I have for now.
>>
>> For the elevator IO merge case, what I have in mind is that we issue
>> small sequential IO one by one with multiple contexts, so that bios
>> won't be merged in plug, and this will require IO issue > IO done, is
>> this case enough?
> 
> Long time ago, I investigated one such issue which is triggered in qemu
> workload, but not sure if I can find it now.
> 
> Also many scsi devices may easily run into queue busy, then scheduler merge
> starts to work, and it may perform worse if you dispatch more in this
> situation.

I think we won't dispatch more in this case. On the one hand, we get
budgets first, to make sure we never dispatch more than queue_depth; on the
other hand, when hctx->dispatch_busy is set, we still fall back
to the old behavior of dispatching one request at a time.

So if IO merging starts because many rqs are accumulated inside the
elevator while the driver is full, I expect IO merging to behave the
same with this set applied. I'll run a test to check.
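
To make this concrete, the batch loop I have in mind looks roughly like the
following (a simplified sketch, not the actual __blk_mq_do_dispatch_sched()
change; the batch size and the e->lock name are illustrative):

        struct request_queue *q = hctx->queue;
        struct elevator_queue *e = q->elevator;
        struct request *rq;
        LIST_HEAD(rq_list);
        int count = 0, max_rqs = 8;     /* illustrative batch size */

        /* When the queue is already busy, fall back to the old
         * one-request-at-a-time behavior. */
        if (hctx->dispatch_busy)
                max_rqs = 1;

        spin_lock(&e->lock);
        while (count < max_rqs) {
                /* A budget is taken before every dispatch, so we never
                 * pull more than queue_depth requests out of the elevator. */
                int budget_token = blk_mq_get_dispatch_budget(q);

                if (budget_token < 0)
                        break;

                rq = e->type->ops.dispatch_request(hctx);
                if (!rq) {
                        blk_mq_put_dispatch_budget(q, budget_token);
                        break;
                }
                blk_mq_set_rq_budget_token(rq, budget_token);
                list_add_tail(&rq->queuelist, &rq_list);
                count++;
        }
        spin_unlock(&e->lock);

        /* rq_list is then issued to the driver outside the lock, so the
         * order produced by the elevator is preserved. */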

Thanks,
Kuai

> 
> Thanks,
> Ming
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 0/5] blk-mq-sched: support request batch dispatching for sq elevator
  2025-07-31  9:33       ` Yu Kuai
@ 2025-07-31 10:22         ` Ming Lei
  0 siblings, 0 replies; 26+ messages in thread
From: Ming Lei @ 2025-07-31 10:22 UTC (permalink / raw)
  To: Yu Kuai
  Cc: dlemoal, hare, jack, tj, josef, axboe, cgroups, linux-block,
	linux-kernel, yi.zhang, yangerkun, johnny.chenyi, yukuai (C)

On Thu, Jul 31, 2025 at 05:33:24PM +0800, Yu Kuai wrote:
> Hi,
> 
> 在 2025/07/31 17:25, Ming Lei 写道:
> > On Thu, Jul 31, 2025 at 04:42:10PM +0800, Yu Kuai wrote:
> > > Hi,
> > > 
> > > 在 2025/07/31 16:18, Ming Lei 写道:
> > > > batch dispatch may hurt io merge performance which is important for
> > > > elevator, so please provide test data on real HDD. & SSD., instead of
> > > > null_blk only, and it can be perfect if merge sensitive workload
> > > > is evaluated.
> > > 
> > > Ok, I'll provide test data on HDD and SSD that I have for now.
> > > 
> > > For the elevator IO merge case, what I have in mind is that we issue
> > > small sequential IO one by one with multiple contexts, so that bios
> > > won't be merged in plug, and this will require IO issue > IO done, is
> > > this case enough?
> > 
> > Long time ago, I investigated one such issue which is triggered in qemu
> > workload, but not sure if I can find it now.
> > 
> > Also many scsi devices may easily run into queue busy, then scheduler merge
> > starts to work, and it may perform worse if you dispatch more in this
> > situation.
> 
> I think we won't dispatch more in this case, on the one hand we will get
> budgets first, to make sure never dispatch more than queue_depth; on the

OK.

> onther hand, in the case hctx->dispatch_busy is set, we still fallback
> to the old case to dispatch one at a time;

hctx->dispatch_busy is lockless; all requests may already have been
dispatched before hctx->dispatch_busy is set.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, newest: 2025-07-31 10:22 UTC

Thread overview: 26+ messages
2025-07-30  8:22 [PATCH v2 0/5] blk-mq-sched: support request batch dispatching for sq elevator Yu Kuai
2025-07-30  8:22 ` [PATCH v2 1/5] blk-mq-sched: introduce high level elevator lock Yu Kuai
2025-07-30 17:19   ` Bart Van Assche
2025-07-30 17:59     ` Yu Kuai
2025-07-31  6:17   ` Hannes Reinecke
2025-07-30  8:22 ` [PATCH v2 2/5] mq-deadline: switch to use " Yu Kuai
2025-07-30 17:21   ` Bart Van Assche
2025-07-30 18:01     ` Yu Kuai
2025-07-30 18:10       ` Bart Van Assche
2025-07-31  6:20   ` Hannes Reinecke
2025-07-31  6:22     ` Damien Le Moal
2025-07-31  6:32       ` Yu Kuai
2025-07-31  7:04         ` Damien Le Moal
2025-07-31  7:14           ` Yu Kuai
2025-07-30  8:22 ` [PATCH v2 3/5] block, bfq: " Yu Kuai
2025-07-30 17:24   ` Bart Van Assche
2025-07-31  6:22   ` Hannes Reinecke
2025-07-30  8:22 ` [PATCH v2 4/5] blk-mq-sched: refactor __blk_mq_do_dispatch_sched() Yu Kuai
2025-07-30 18:32   ` Bart Van Assche
2025-07-31  0:49     ` Yu Kuai
2025-07-30  8:22 ` [PATCH v2 5/5] blk-mq-sched: support request batch dispatching for sq elevator Yu Kuai
2025-07-31  8:18 ` [PATCH v2 0/5] " Ming Lei
2025-07-31  8:42   ` Yu Kuai
2025-07-31  9:25     ` Ming Lei
2025-07-31  9:33       ` Yu Kuai
2025-07-31 10:22         ` Ming Lei
