* [PATCHSET RFC v2 0/4] mq-deadline scalability improvements
@ 2024-01-19 16:02 Jens Axboe
2024-01-19 16:02 ` [PATCH 1/4] block/mq-deadline: pass in queue directly to dd_insert_request() Jens Axboe
` (3 more replies)
0 siblings, 4 replies; 13+ messages in thread
From: Jens Axboe @ 2024-01-19 16:02 UTC (permalink / raw)
To: linux-block; +Cc: bvanassche
Hi,
It's no secret that mq-deadline doesn't scale very well - it was
originally done as a proof-of-concept conversion from deadline, when the
blk-mq multiqueue layer was written. In the single queue world, the
queue lock protected the IO scheduler as well, and mq-deadline simply
adopted an internal dd->lock to fill the place of that.
While mq-deadline works under blk-mq and doesn't suffer any scaling
issues on that side, as soon as request insertion or dispatch is done, we're
hitting the per-queue dd->lock quite intensely. On a basic test box
with 16 cores / 32 threads, running a number of IO intensive threads
on either null_blk (single hw queue) or nvme0n1 (many hw queues) shows
this quite easily:
The test case looks like this:
fio --bs=512 --group_reporting=1 --gtod_reduce=1 --invalidate=1 \
--ioengine=io_uring --norandommap --runtime=60 --rw=randread \
--thread --time_based=1 --buffered=0 --fixedbufs=1 --numjobs=32 \
--iodepth=4 --iodepth_batch_submit=4 --iodepth_batch_complete=4 \
--name=scaletest --filename=/dev/$DEV
and is run on a desktop 7950X box. That's 32 threads each doing 4 IOs,
for a total queue depth of 128.
Results before the patches:
Device      IOPS     sys     contention   diff
====================================================
null_blk    879K     89%     93.6%
nvme0n1     901K     86%     94.5%
which looks pretty miserable, most of the time is spent contending on
the queue lock.
This RFC patchset attempts to address that by:
1) Serializing dispatch of requests. If we fail dispatching, rely on
the next completion to dispatch the next one. This could potentially
reduce the overall depth achieved on the device side, however even
for the heavily contended test I'm running here, no observable
change is seen. This is patch 2.
2) Serialize request insertion, using internal per-cpu lists to
temporarily store requests until insertion can proceed. This is
patch 3.
3) Skip expensive merges if the queue is already contended. Reasoning is
provided in that patch, patch 4.
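The dispatch gate in (1) can be sketched as a userspace C11 analogue. This is
only an illustration of the pattern, not the kernel code: the patch itself uses
test_bit()/test_and_set_bit() on dd->run_state, and the function names here
are made up for the sketch.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define DD_DISPATCHING 0

static atomic_ulong run_state;

/* Cheap read-only test first: if another CPU is already dispatching,
 * skip without dirtying the cacheline. Only if the bit looks clear do
 * we pay for the atomic RMW, which may still lose the race. */
static bool try_enter_dispatch(void)
{
	unsigned long mask = 1UL << DD_DISPATCHING;

	if (atomic_load_explicit(&run_state, memory_order_relaxed) & mask)
		return false;
	return !(atomic_fetch_or_explicit(&run_state, mask,
					  memory_order_acquire) & mask);
}

/* Release the gate; a completion will trigger the next dispatch. */
static void exit_dispatch(void)
{
	atomic_fetch_and_explicit(&run_state, ~(1UL << DD_DISPATCHING),
				  memory_order_release);
}
```

A loser of the race simply returns without dispatching, relying on the next
completion to re-run dispatch, which is exactly the deferral described above.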
With that in place, the same test case now does:
Device      IOPS     sys     contention   diff
====================================================
null_blk    2311K    10.3%   21.1%        +257%
nvme0n1     2610K    11.0%   24.6%        +289%
and while that doesn't completely eliminate the lock contention, it's
oodles better than what it was before. The throughput increase shows
that nicely, with more than a 200% improvement for both cases.
Since the above is very high IOPS testing to show the scalability
limitations, I also ran this on a more normal drive on a Dell R7525 test
box. It doesn't change the performance there (around 66K IOPS), but
it does reduce the system time required to do the IO from 12.6% to
10.7%, or about 15% less time spent in the kernel.
TODO: probably make DD_CPU_BUCKETS depend on number of online CPUs
in the system, rather than just default to 32.
block/mq-deadline.c | 178 +++++++++++++++++++++++++++++++++++++++-----
1 file changed, 161 insertions(+), 17 deletions(-)
Since v1:
- Redid numbers as I enabled fixedbufs for the workload,
and did not disable expensive merges but rather left it
at the kernel default settings.
- Cleanup the insertion handling, also more efficient now.
- Add merge contention patch
- Expanded some of the commit messages
- Add/expand code comments
- Rebase to current master as block bits landed
--
Jens Axboe
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 1/4] block/mq-deadline: pass in queue directly to dd_insert_request()
2024-01-19 16:02 [PATCHSET RFC v2 0/4] mq-deadline scalability improvements Jens Axboe
@ 2024-01-19 16:02 ` Jens Axboe
2024-01-19 23:35 ` Bart Van Assche
2024-01-19 16:02 ` [PATCH 2/4] block/mq-deadline: serialize request dispatching Jens Axboe
` (2 subsequent siblings)
3 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2024-01-19 16:02 UTC (permalink / raw)
To: linux-block; +Cc: bvanassche, Jens Axboe
The hardware queue isn't relevant, deadline only operates on the queue
itself. Pass in the queue directly rather than the hardware queue, as
that more clearly explains what is being operated on.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
block/mq-deadline.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index f958e79277b8..9b7563e9d638 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -792,10 +792,9 @@ static bool dd_bio_merge(struct request_queue *q, struct bio *bio,
/*
* add rq to rbtree and fifo
*/
-static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
+static void dd_insert_request(struct request_queue *q, struct request *rq,
blk_insert_t flags, struct list_head *free)
{
- struct request_queue *q = hctx->queue;
struct deadline_data *dd = q->elevator->elevator_data;
const enum dd_data_dir data_dir = rq_data_dir(rq);
u16 ioprio = req_get_ioprio(rq);
@@ -875,7 +874,7 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
rq = list_first_entry(list, struct request, queuelist);
list_del_init(&rq->queuelist);
- dd_insert_request(hctx, rq, flags, &free);
+ dd_insert_request(q, rq, flags, &free);
}
spin_unlock(&dd->lock);
--
2.43.0
* [PATCH 2/4] block/mq-deadline: serialize request dispatching
2024-01-19 16:02 [PATCHSET RFC v2 0/4] mq-deadline scalability improvements Jens Axboe
2024-01-19 16:02 ` [PATCH 1/4] block/mq-deadline: pass in queue directly to dd_insert_request() Jens Axboe
@ 2024-01-19 16:02 ` Jens Axboe
2024-01-19 23:24 ` Bart Van Assche
2024-01-19 16:02 ` [PATCH 3/4] block/mq-deadline: fallback to per-cpu insertion buckets under contention Jens Axboe
2024-01-19 16:02 ` [PATCH 4/4] block/mq-deadline: skip expensive merge lookups if contended Jens Axboe
3 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2024-01-19 16:02 UTC (permalink / raw)
To: linux-block; +Cc: bvanassche, Jens Axboe
If we're entering request dispatch but someone else is already
dispatching, then just skip this dispatch. We know IO is inflight and
this will trigger another dispatch event for any completion. This will
potentially cause slightly lower queue depth for contended cases, but
those are slowed down anyway and this should not cause an issue.
By itself, this patch doesn't help a whole lot, as the reduction in
dispatch lock contention is just eaten up by the same dd->lock now
seeing increased insertion contention. But it's required work to be
able to reduce the lock contention in general.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
block/mq-deadline.c | 31 ++++++++++++++++++++++++++-----
1 file changed, 26 insertions(+), 5 deletions(-)
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 9b7563e9d638..b579ce282176 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -79,10 +79,20 @@ struct dd_per_prio {
struct io_stats_per_prio stats;
};
+enum {
+ DD_DISPATCHING = 0,
+};
+
struct deadline_data {
/*
* run time data
*/
+ struct {
+ spinlock_t lock;
+ spinlock_t zone_lock;
+ } ____cacheline_aligned_in_smp;
+
+ unsigned long run_state;
struct dd_per_prio per_prio[DD_PRIO_COUNT];
@@ -100,9 +110,6 @@ struct deadline_data {
int front_merges;
u32 async_depth;
int prio_aging_expire;
-
- spinlock_t lock;
- spinlock_t zone_lock;
};
/* Maps an I/O priority class to a deadline scheduler priority. */
@@ -600,6 +607,18 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
struct request *rq;
enum dd_prio prio;
+ /*
+ * If someone else is already dispatching, skip this one. This will
+ * defer the next dispatch event to when something completes, and could
+ * potentially lower the queue depth for contended cases.
+ *
+ * See the logic in blk_mq_do_dispatch_sched(), which loops and
+ * retries if nothing is dispatched.
+ */
+ if (test_bit(DD_DISPATCHING, &dd->run_state) ||
+ test_and_set_bit(DD_DISPATCHING, &dd->run_state))
+ return NULL;
+
spin_lock(&dd->lock);
rq = dd_dispatch_prio_aged_requests(dd, now);
if (rq)
@@ -616,6 +635,7 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
}
unlock:
+ clear_bit(DD_DISPATCHING, &dd->run_state);
spin_unlock(&dd->lock);
return rq;
@@ -706,6 +726,9 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
eq->elevator_data = dd;
+ spin_lock_init(&dd->lock);
+ spin_lock_init(&dd->zone_lock);
+
for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
struct dd_per_prio *per_prio = &dd->per_prio[prio];
@@ -722,8 +745,6 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
dd->last_dir = DD_WRITE;
dd->fifo_batch = fifo_batch;
dd->prio_aging_expire = prio_aging_expire;
- spin_lock_init(&dd->lock);
- spin_lock_init(&dd->zone_lock);
/* We dispatch from request queue wide instead of hw queue */
blk_queue_flag_set(QUEUE_FLAG_SQ_SCHED, q);
--
2.43.0
* [PATCH 3/4] block/mq-deadline: fallback to per-cpu insertion buckets under contention
2024-01-19 16:02 [PATCHSET RFC v2 0/4] mq-deadline scalability improvements Jens Axboe
2024-01-19 16:02 ` [PATCH 1/4] block/mq-deadline: pass in queue directly to dd_insert_request() Jens Axboe
2024-01-19 16:02 ` [PATCH 2/4] block/mq-deadline: serialize request dispatching Jens Axboe
@ 2024-01-19 16:02 ` Jens Axboe
2024-01-19 23:16 ` Bart Van Assche
2024-01-19 16:02 ` [PATCH 4/4] block/mq-deadline: skip expensive merge lookups if contended Jens Axboe
3 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2024-01-19 16:02 UTC (permalink / raw)
To: linux-block; +Cc: bvanassche, Jens Axboe
If we attempt to insert a list of requests, but someone else is already
running an insertion, then fallback to queueing that list internally and
let the existing inserter finish the operation. The current inserter
will either see and flush this list, or, if it ends before we're done
doing our bucket insert, we'll flush it and insert ourselves.
This reduces contention on the dd->lock, which protects any request
insertion or dispatch, by having a backup point to insert into which
will either be flushed immediately or by an existing inserter. As the
alternative is to just keep spinning on the dd->lock, it's very easy
to get into a situation where multiple processes are trying to do IO
and all sit and spin on this lock.
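The fallback flow can be sketched as a userspace pthread analogue. This is a
minimal sketch under stated assumptions, not the kernel implementation: the
names (insert_or_bucket, NBUCKETS, the plain int "inserting" flag standing in
for DD_INSERTING) are all illustrative, and the real patch uses spin_trylock()
and per-CPU bucket lists.

```c
#include <pthread.h>

#define NBUCKETS 4

static pthread_mutex_t main_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t bucket_lock[NBUCKETS] = {
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
};
static int bucket[NBUCKETS];	/* stand-in for per-CPU request lists */
static int inserting;		/* stand-in for the DD_INSERTING bit */

/* Returns 1 if the work was parked in a side bucket for the active
 * inserter to flush, 0 if the caller now holds main_lock and should
 * insert directly. */
static int insert_or_bucket(unsigned cpu, int item)
{
	/* Uncontended: grab the lock and insert as usual. */
	if (pthread_mutex_trylock(&main_lock) == 0)
		return 0;
	/* Lock is held but nobody is inserting (e.g. a dispatcher):
	 * just wait for the lock rather than queueing behind nothing. */
	if (!inserting) {
		pthread_mutex_lock(&main_lock);
		return 0;
	}
	/* An inserter is active: park the work in a cheap side bucket
	 * that the inserter will flush before it drops main_lock. */
	pthread_mutex_lock(&bucket_lock[cpu % NBUCKETS]);
	bucket[cpu % NBUCKETS] += item;
	pthread_mutex_unlock(&bucket_lock[cpu % NBUCKETS]);
	return 1;
}
```

The point of the design is that contenders do bounded work on a lightly
contended bucket lock instead of spinning on the single hot lock.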
With the previous dispatch optimization, this drastically reduces
contention for a sample case of 32 threads doing IO to devices. The
test case looks as follows:
fio --bs=512 --group_reporting=1 --gtod_reduce=1 --invalidate=1 \
--ioengine=io_uring --norandommap --runtime=60 --rw=randread \
--thread --time_based=1 --buffered=0 --fixedbufs=1 --numjobs=32 \
--iodepth=4 --iodepth_batch_submit=4 --iodepth_batch_complete=4 \
--name=scaletest --filename=/dev/$DEV
Before:
Device      IOPS     sys     contention   diff
====================================================
null_blk    879K     89%     93.6%
nvme0n1     901K     86%     94.5%
and after this and the previous dispatch patch:
Device      IOPS     sys     contention   diff
====================================================
null_blk    2311K    10.3%   21.1%        +257%
nvme0n1     2610K    11.0%   24.6%        +289%
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
block/mq-deadline.c | 130 +++++++++++++++++++++++++++++++++++++++++---
1 file changed, 121 insertions(+), 9 deletions(-)
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index b579ce282176..cc3155d50e0d 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -81,8 +81,18 @@ struct dd_per_prio {
enum {
DD_DISPATCHING = 0,
+ DD_INSERTING = 1,
+ DD_BUCKETS = 2,
};
+#define DD_CPU_BUCKETS 32
+#define DD_CPU_BUCKETS_MASK (DD_CPU_BUCKETS - 1)
+
+struct dd_bucket_list {
+ struct list_head list;
+ spinlock_t lock;
+} ____cacheline_aligned_in_smp;
+
struct deadline_data {
/*
* run time data
@@ -94,6 +104,8 @@ struct deadline_data {
unsigned long run_state;
+ struct dd_bucket_list bucket_lists[DD_CPU_BUCKETS];
+
struct dd_per_prio per_prio[DD_PRIO_COUNT];
/* Data direction of latest dispatched request. */
@@ -714,7 +726,7 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
struct deadline_data *dd;
struct elevator_queue *eq;
enum dd_prio prio;
- int ret = -ENOMEM;
+ int i, ret = -ENOMEM;
eq = elevator_alloc(q, e);
if (!eq)
@@ -729,6 +741,11 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
spin_lock_init(&dd->lock);
spin_lock_init(&dd->zone_lock);
+ for (i = 0; i < DD_CPU_BUCKETS; i++) {
+ INIT_LIST_HEAD(&dd->bucket_lists[i].list);
+ spin_lock_init(&dd->bucket_lists[i].lock);
+ }
+
for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
struct dd_per_prio *per_prio = &dd->per_prio[prio];
@@ -878,6 +895,94 @@ static void dd_insert_request(struct request_queue *q, struct request *rq,
}
}
+static void dd_dispatch_from_buckets(struct deadline_data *dd,
+ struct list_head *list)
+{
+ int i;
+
+ if (!test_bit(DD_BUCKETS, &dd->run_state) ||
+ !test_and_clear_bit(DD_BUCKETS, &dd->run_state))
+ return;
+
+ for (i = 0; i < DD_CPU_BUCKETS; i++) {
+ struct dd_bucket_list *bucket = &dd->bucket_lists[i];
+
+ if (list_empty_careful(&bucket->list))
+ continue;
+ spin_lock(&bucket->lock);
+ list_splice_init(&bucket->list, list);
+ spin_unlock(&bucket->lock);
+ }
+}
+
+/*
+ * If we can grab the dd->lock, then just return and do the insertion as per
+ * usual. If not, add to one of our internal buckets, and afterwards recheck
+ * if we should retry.
+ */
+static bool dd_insert_to_bucket(struct deadline_data *dd,
+ struct list_head *list)
+ __acquires(&dd->lock)
+{
+ struct dd_bucket_list *bucket;
+
+ /*
+ * If we can grab the lock, proceed as per usual. If not, and insert
+ * isn't running, force grab the lock and proceed as per usual.
+ */
+ if (spin_trylock(&dd->lock))
+ return false;
+ if (!test_bit(DD_INSERTING, &dd->run_state)) {
+ spin_lock(&dd->lock);
+ return false;
+ }
+
+ if (!test_bit(DD_BUCKETS, &dd->run_state))
+ set_bit(DD_BUCKETS, &dd->run_state);
+
+ bucket = &dd->bucket_lists[get_cpu() & DD_CPU_BUCKETS_MASK];
+ spin_lock(&bucket->lock);
+ list_splice_init(list, &bucket->list);
+ spin_unlock(&bucket->lock);
+ put_cpu();
+
+ /*
+ * Insertion still running, we are done.
+ */
+ if (test_bit(DD_INSERTING, &dd->run_state))
+ return true;
+
+ /*
+ * We may be too late, play it safe and grab the lock. This will
+ * flush the above bucket insert as well and insert it.
+ */
+ spin_lock(&dd->lock);
+ return false;
+}
+
+static void __dd_insert_requests(struct request_queue *q,
+ struct deadline_data *dd,
+ struct list_head *list, blk_insert_t flags,
+ struct list_head *free)
+{
+ set_bit(DD_INSERTING, &dd->run_state);
+ do {
+ while (!list_empty(list)) {
+ struct request *rq;
+
+ rq = list_first_entry(list, struct request, queuelist);
+ list_del_init(&rq->queuelist);
+ dd_insert_request(q, rq, flags, free);
+ }
+
+ dd_dispatch_from_buckets(dd, list);
+ if (list_empty(list))
+ break;
+ } while (1);
+
+ clear_bit(DD_INSERTING, &dd->run_state);
+}
+
/*
* Called from blk_mq_insert_request() or blk_mq_dispatch_plug_list().
*/
@@ -889,16 +994,23 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
struct deadline_data *dd = q->elevator->elevator_data;
LIST_HEAD(free);
- spin_lock(&dd->lock);
- while (!list_empty(list)) {
- struct request *rq;
+ /*
+ * If dispatch is busy and we ended up adding to our internal bucket,
+ * then we're done for now.
+ */
+ if (dd_insert_to_bucket(dd, list))
+ return;
- rq = list_first_entry(list, struct request, queuelist);
- list_del_init(&rq->queuelist);
- dd_insert_request(q, rq, flags, &free);
- }
- spin_unlock(&dd->lock);
+ do {
+ __dd_insert_requests(q, dd, list, flags, &free);
+ /*
+ * If buckets is set after inserting was cleared, be safe and do
+ * another loop as we could be racing with bucket insertion.
+ */
+ } while (test_bit(DD_BUCKETS, &dd->run_state));
+
+ spin_unlock(&dd->lock);
blk_mq_free_requests(&free);
}
--
2.43.0
* [PATCH 4/4] block/mq-deadline: skip expensive merge lookups if contended
2024-01-19 16:02 [PATCHSET RFC v2 0/4] mq-deadline scalability improvements Jens Axboe
` (2 preceding siblings ...)
2024-01-19 16:02 ` [PATCH 3/4] block/mq-deadline: fallback to per-cpu insertion buckets under contention Jens Axboe
@ 2024-01-19 16:02 ` Jens Axboe
3 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2024-01-19 16:02 UTC (permalink / raw)
To: linux-block; +Cc: bvanassche, Jens Axboe
We do several stages of merging in the block layer - the most likely one
to work is also the cheap one, merging directly in the per-task plug when
IO is submitted. Getting merges outside of that is a lot less likely,
but IO schedulers may still maintain internal data structures to
facilitate merge lookups outside of the plug.
Make mq-deadline skip expensive merge lookups if the queue lock is
already contended. Getting a merge here is not very likely, hence it
should not be a problem to skip the attempt in the equally unlikely
event that the queue is already contended.
Perf diff shows the difference between a random read/write workload
with 4 threads doing IO, with expensive merges turned on and off:
25.00% +61.94% [kernel.kallsyms] [k] queued_spin_lock_slowpath
where we almost quadruple the lock contention by attempting these
expensive merges.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
block/mq-deadline.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index cc3155d50e0d..2de0832b1e5d 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -817,7 +817,19 @@ static bool dd_bio_merge(struct request_queue *q, struct bio *bio,
struct request *free = NULL;
bool ret;
- spin_lock(&dd->lock);
+ /*
+ * bio merging is called for every bio queued, and it's very easy
+ * to run into contention because of that. If we fail getting
+ * the dd lock, just skip this merge attempt. For related IO, the
+ * plug will be the successful merging point. If we get here, we
+ * already failed doing the obvious merge. Chances of actually
+ * getting a merge off this path is a lot slimmer, so skipping an
+ * occasional lookup that will most likely not succeed anyway should
+ * not be a problem.
+ */
+ if (!spin_trylock(&dd->lock))
+ return false;
+
ret = blk_mq_sched_try_merge(q, bio, nr_segs, &free);
spin_unlock(&dd->lock);
--
2.43.0
* Re: [PATCH 3/4] block/mq-deadline: fallback to per-cpu insertion buckets under contention
2024-01-19 16:02 ` [PATCH 3/4] block/mq-deadline: fallback to per-cpu insertion buckets under contention Jens Axboe
@ 2024-01-19 23:16 ` Bart Van Assche
2024-01-20 0:05 ` Jens Axboe
0 siblings, 1 reply; 13+ messages in thread
From: Bart Van Assche @ 2024-01-19 23:16 UTC (permalink / raw)
To: Jens Axboe, linux-block
On 1/19/24 08:02, Jens Axboe wrote:
> If we attempt to insert a list of requests, but someone else is already
> running an insertion, then fallback to queueing that list internally and
> let the existing inserter finish the operation. The current inserter
> will either see and flush this list, or, if it ends before we're done
> doing our bucket insert, then we'll flush it and insert ourselves.
>
> This reduces contention on the dd->lock, which protects any request
> insertion or dispatch, by having a backup point to insert into which
> will either be flushed immediately or by an existing inserter. As the
> alternative is to just keep spinning on the dd->lock, it's very easy
> to get into a situation where multiple processes are trying to do IO
> and all sit and spin on this lock.
With this alternative patch I achieve 20% higher IOPS than with patch
3/4 of this series for 1..4 CPU cores (null_blk + fio in an x86 VM):
---
block/mq-deadline.c | 61 ++++++++++++++++++++++++++++++++++++---------
1 file changed, 49 insertions(+), 12 deletions(-)
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 83bc21801226..fd9038a84dc5 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -89,11 +89,15 @@ struct deadline_data {
*/
struct {
spinlock_t lock;
+ spinlock_t insert_lock;
spinlock_t zone_lock;
} ____cacheline_aligned_in_smp;
unsigned long run_state;
+ struct list_head at_head;
+ struct list_head at_tail;
+
struct dd_per_prio per_prio[DD_PRIO_COUNT];
/* Data direction of latest dispatched request. */
@@ -120,6 +124,9 @@ static const enum dd_prio ioprio_class_to_prio[] = {
[IOPRIO_CLASS_IDLE] = DD_IDLE_PRIO,
};
+static void dd_insert_request(struct request_queue *q, struct request *rq,
+ blk_insert_t flags, struct list_head *free);
+
static inline struct rb_root *
deadline_rb_root(struct dd_per_prio *per_prio, struct request *rq)
{
@@ -592,6 +599,30 @@ static struct request *dd_dispatch_prio_aged_requests(struct deadline_data *dd,
return NULL;
}
+static void dd_do_insert(struct request_queue *q, struct list_head *free)
+{
+ struct deadline_data *dd = q->elevator->elevator_data;
+ struct request *rq;
+ LIST_HEAD(at_head);
+ LIST_HEAD(at_tail);
+
+ spin_lock(&dd->insert_lock);
+ list_splice_init(&dd->at_head, &at_head);
+ list_splice_init(&dd->at_tail, &at_tail);
+ spin_unlock(&dd->insert_lock);
+
+ while (!list_empty(&at_head)) {
+ rq = list_first_entry(&at_head, struct request, queuelist);
+ list_del_init(&rq->queuelist);
+ dd_insert_request(q, rq, BLK_MQ_INSERT_AT_HEAD, free);
+ }
+ while (!list_empty(&at_tail)) {
+ rq = list_first_entry(&at_tail, struct request, queuelist);
+ list_del_init(&rq->queuelist);
+ dd_insert_request(q, rq, 0, free);
+ }
+}
+
/*
* Called from blk_mq_run_hw_queue() -> __blk_mq_sched_dispatch_requests().
*
@@ -606,6 +637,7 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
const unsigned long now = jiffies;
struct request *rq;
enum dd_prio prio;
+ LIST_HEAD(free);
/*
* If someone else is already dispatching, skip this one. This will
@@ -620,6 +652,7 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
return NULL;
spin_lock(&dd->lock);
+ dd_do_insert(hctx->queue, &free);
rq = dd_dispatch_prio_aged_requests(dd, now);
if (rq)
goto unlock;
@@ -638,6 +671,8 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
clear_bit(DD_DISPATCHING, &dd->run_state);
spin_unlock(&dd->lock);
+ blk_mq_free_requests(&free);
+
return rq;
}
@@ -727,8 +762,12 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
eq->elevator_data = dd;
spin_lock_init(&dd->lock);
+ spin_lock_init(&dd->insert_lock);
spin_lock_init(&dd->zone_lock);
+ INIT_LIST_HEAD(&dd->at_head);
+ INIT_LIST_HEAD(&dd->at_tail);
+
for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
struct dd_per_prio *per_prio = &dd->per_prio[prio];
@@ -899,19 +938,13 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
{
struct request_queue *q = hctx->queue;
struct deadline_data *dd = q->elevator->elevator_data;
- LIST_HEAD(free);
-
- spin_lock(&dd->lock);
- while (!list_empty(list)) {
- struct request *rq;
- rq = list_first_entry(list, struct request, queuelist);
- list_del_init(&rq->queuelist);
- dd_insert_request(q, rq, flags, &free);
- }
- spin_unlock(&dd->lock);
-
- blk_mq_free_requests(&free);
+ spin_lock(&dd->insert_lock);
+ if (flags & BLK_MQ_INSERT_AT_HEAD)
+ list_splice_init(list, &dd->at_head);
+ else
+ list_splice_init(list, &dd->at_tail);
+ spin_unlock(&dd->insert_lock);
}
/* Callback from inside blk_mq_rq_ctx_init(). */
@@ -990,6 +1023,10 @@ static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
struct deadline_data *dd = hctx->queue->elevator->elevator_data;
enum dd_prio prio;
+ if (!list_empty_careful(&dd->at_head) ||
+ !list_empty_careful(&dd->at_tail))
+ return true;
+
for (prio = 0; prio <= DD_PRIO_MAX; prio++)
if (dd_has_work_for_prio(&dd->per_prio[prio]))
return true;
* Re: [PATCH 2/4] block/mq-deadline: serialize request dispatching
2024-01-19 16:02 ` [PATCH 2/4] block/mq-deadline: serialize request dispatching Jens Axboe
@ 2024-01-19 23:24 ` Bart Van Assche
2024-01-20 0:00 ` Jens Axboe
0 siblings, 1 reply; 13+ messages in thread
From: Bart Van Assche @ 2024-01-19 23:24 UTC (permalink / raw)
To: Jens Axboe, linux-block
On 1/19/24 08:02, Jens Axboe wrote:
> + /*
> + * If someone else is already dispatching, skip this one. This will
> + * defer the next dispatch event to when something completes, and could
> + * potentially lower the queue depth for contended cases.
> + *
> + * See the logic in blk_mq_do_dispatch_sched(), which loops and
> + * retries if nothing is dispatched.
> + */
> + if (test_bit(DD_DISPATCHING, &dd->run_state) ||
> + test_and_set_bit(DD_DISPATCHING, &dd->run_state))
> + return NULL;
> +
> spin_lock(&dd->lock);
> rq = dd_dispatch_prio_aged_requests(dd, now);
> if (rq)
> @@ -616,6 +635,7 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
> }
>
> unlock:
> + clear_bit(DD_DISPATCHING, &dd->run_state);
> spin_unlock(&dd->lock);
From Documentation/memory-barriers.txt: "These are also used for atomic RMW
bitop functions that do not imply a memory barrier (such as set_bit and
clear_bit)." Does this mean that CPUs with a weak memory model (e.g. ARM)
are allowed to execute the clear_bit() call earlier than where it occurs in
the code? I think that spin_trylock() has "acquire" semantics and also that
"spin_unlock()" has release semantics. While a CPU is allowed to execute
clear_bit() before the memory operations that come before it, I don't think
that is the case for spin_unlock(). See also
tools/memory-model/Documentation/locking.txt.
Thanks,
Bart.
* Re: [PATCH 1/4] block/mq-deadline: pass in queue directly to dd_insert_request()
2024-01-19 16:02 ` [PATCH 1/4] block/mq-deadline: pass in queue directly to dd_insert_request() Jens Axboe
@ 2024-01-19 23:35 ` Bart Van Assche
0 siblings, 0 replies; 13+ messages in thread
From: Bart Van Assche @ 2024-01-19 23:35 UTC (permalink / raw)
To: Jens Axboe, linux-block
On 1/19/24 08:02, Jens Axboe wrote:
> The hardware queue isn't relevant, deadline only operates on the queue
> itself. Pass in the queue directly rather than the hardware queue, as
> that more clearly explains what is being operated on.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
> block/mq-deadline.c | 5 ++---
> 1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index f958e79277b8..9b7563e9d638 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -792,10 +792,9 @@ static bool dd_bio_merge(struct request_queue *q, struct bio *bio,
> /*
> * add rq to rbtree and fifo
> */
> -static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
> +static void dd_insert_request(struct request_queue *q, struct request *rq,
> blk_insert_t flags, struct list_head *free)
> {
> - struct request_queue *q = hctx->queue;
> struct deadline_data *dd = q->elevator->elevator_data;
> const enum dd_data_dir data_dir = rq_data_dir(rq);
> u16 ioprio = req_get_ioprio(rq);
> @@ -875,7 +874,7 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
>
> rq = list_first_entry(list, struct request, queuelist);
> list_del_init(&rq->queuelist);
> - dd_insert_request(hctx, rq, flags, &free);
> + dd_insert_request(q, rq, flags, &free);
> }
> spin_unlock(&dd->lock);
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
* Re: [PATCH 2/4] block/mq-deadline: serialize request dispatching
2024-01-19 23:24 ` Bart Van Assche
@ 2024-01-20 0:00 ` Jens Axboe
0 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2024-01-20 0:00 UTC (permalink / raw)
To: Bart Van Assche, linux-block
On 1/19/24 4:24 PM, Bart Van Assche wrote:
> On 1/19/24 08:02, Jens Axboe wrote:
>> + /*
>> + * If someone else is already dispatching, skip this one. This will
>> + * defer the next dispatch event to when something completes, and could
>> + * potentially lower the queue depth for contended cases.
>> + *
>> + * See the logic in blk_mq_do_dispatch_sched(), which loops and
>> + * retries if nothing is dispatched.
>> + */
>> + if (test_bit(DD_DISPATCHING, &dd->run_state) ||
>> + test_and_set_bit(DD_DISPATCHING, &dd->run_state))
>> + return NULL;
>> +
>> spin_lock(&dd->lock);
>> rq = dd_dispatch_prio_aged_requests(dd, now);
>> if (rq)
>> @@ -616,6 +635,7 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
>> }
>> unlock:
>> + clear_bit(DD_DISPATCHING, &dd->run_state);
>> spin_unlock(&dd->lock);
>
> From Documentation/memory-barriers.txt: "These are also used for atomic RMW
> bitop functions that do not imply a memory barrier (such as set_bit and
> clear_bit)." Does this mean that CPUs with a weak memory model (e.g. ARM)
> are allowed to execute the clear_bit() call earlier than where it occurs in
> the code? I think that spin_trylock() has "acquire" semantics and also that
> "spin_unlock()" has release semantics. While a CPU is allowed to execute
> clear_bit() before the memory operations that come before it, I don't think
> that is the case for spin_unlock(). See also
> tools/memory-model/Documentation/locking.txt.
Not sure why I didn't do it upfront, but they just need to be the _lock
variants of the bitops. I'll make that change.
--
Jens Axboe
* Re: [PATCH 3/4] block/mq-deadline: fallback to per-cpu insertion buckets under contention
2024-01-19 23:16 ` Bart Van Assche
@ 2024-01-20 0:05 ` Jens Axboe
2024-01-20 0:13 ` Jens Axboe
` (2 more replies)
0 siblings, 3 replies; 13+ messages in thread
From: Jens Axboe @ 2024-01-20 0:05 UTC (permalink / raw)
To: Bart Van Assche, linux-block
On 1/19/24 4:16 PM, Bart Van Assche wrote:
> On 1/19/24 08:02, Jens Axboe wrote:
>> If we attempt to insert a list of requests, but someone else is already
>> running an insertion, then fallback to queueing that list internally and
>> let the existing inserter finish the operation. The current inserter
>> will either see and flush this list, or, if it ends before we're done
>> doing our bucket insert, then we'll flush it and insert ourselves.
>>
>> This reduces contention on the dd->lock, which protects any request
>> insertion or dispatch, by having a backup point to insert into which
>> will either be flushed immediately or by an existing inserter. As the
>> alternative is to just keep spinning on the dd->lock, it's very easy
>> to get into a situation where multiple processes are trying to do IO
>> and all sit and spin on this lock.
>
> With this alternative patch I achieve 20% higher IOPS than with patch
> 3/4 of this series for 1..4 CPU cores (null_blk + fio in an x86 VM):
Performance aside, I think this is a much better approach than mine.
Haven't tested yet, but I think using this instead of my patch 3,
combined with the other patches, should further drastically cut down on
the overhead. Can you send a "proper" patch and I'll just replace the
one that I have?
--
Jens Axboe
* Re: [PATCH 3/4] block/mq-deadline: fallback to per-cpu insertion buckets under contention
2024-01-20 0:05 ` Jens Axboe
@ 2024-01-20 0:13 ` Jens Axboe
2024-01-20 0:31 ` Jens Axboe
2024-01-22 23:55 ` Bart Van Assche
2 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2024-01-20 0:13 UTC (permalink / raw)
To: Bart Van Assche, linux-block
On 1/19/24 5:05 PM, Jens Axboe wrote:
> On 1/19/24 4:16 PM, Bart Van Assche wrote:
>> On 1/19/24 08:02, Jens Axboe wrote:
>>> If we attempt to insert a list of requests, but someone else is already
>>> running an insertion, then fall back to queueing that list internally and
>>> let the existing inserter finish the operation. The current inserter
>>> will either see and flush this list, or if it ends before we're done
>>> doing our bucket insert, then we'll flush it and insert ourselves.
>>>
>>> This reduces contention on the dd->lock, which protects any request
>>> insertion or dispatch, by having a backup point to insert into which
>>> will either be flushed immediately or by an existing inserter. As the
>>> alternative is to just keep spinning on the dd->lock, it's very easy
>>> to get into a situation where multiple processes are trying to do IO
>>> and all sit and spin on this lock.
>>
>> With this alternative patch I achieve 20% higher IOPS than with patch
>> 3/4 of this series for 1..4 CPU cores (null_blk + fio in an x86 VM):
>
> Performance aside, I think this is a much better approach than mine. I
> haven't tested it yet, but I think using this instead of my patch 3, on
> top of the other patches, should further drastically cut down on the
> overhead. Can you send a "proper" patch and I'll just replace the one
> that I have?
I'd probably just fold in this incremental, as I think it cleans it up.
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 88991a791c56..977c512465ca 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -599,10 +599,21 @@ static struct request *dd_dispatch_prio_aged_requests(struct deadline_data *dd,
return NULL;
}
+static void __dd_do_insert(struct request_queue *q, blk_insert_t flags,
+ struct list_head *list, struct list_head *free)
+{
+ while (!list_empty(list)) {
+ struct request *rq;
+
+ rq = list_first_entry(list, struct request, queuelist);
+ list_del_init(&rq->queuelist);
+ dd_insert_request(q, rq, flags, free);
+ }
+}
+
static void dd_do_insert(struct request_queue *q, struct list_head *free)
{
struct deadline_data *dd = q->elevator->elevator_data;
- struct request *rq;
LIST_HEAD(at_head);
LIST_HEAD(at_tail);
@@ -611,16 +622,8 @@ static void dd_do_insert(struct request_queue *q, struct list_head *free)
list_splice_init(&dd->at_tail, &at_tail);
spin_unlock(&dd->insert_lock);
- while (!list_empty(&at_head)) {
- rq = list_first_entry(&at_head, struct request, queuelist);
- list_del_init(&rq->queuelist);
- dd_insert_request(q, rq, BLK_MQ_INSERT_AT_HEAD, free);
- }
- while (!list_empty(&at_tail)) {
- rq = list_first_entry(&at_tail, struct request, queuelist);
- list_del_init(&rq->queuelist);
- dd_insert_request(q, rq, 0, free);
- }
+ __dd_do_insert(q, BLK_MQ_INSERT_AT_HEAD, &at_head, free);
+ __dd_do_insert(q, 0, &at_tail, free);
}
/*
--
Jens Axboe
* Re: [PATCH 3/4] block/mq-deadline: fallback to per-cpu insertion buckets under contention
2024-01-20 0:05 ` Jens Axboe
2024-01-20 0:13 ` Jens Axboe
@ 2024-01-20 0:31 ` Jens Axboe
2024-01-22 23:55 ` Bart Van Assche
2 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2024-01-20 0:31 UTC (permalink / raw)
To: Bart Van Assche, linux-block
On 1/19/24 5:05 PM, Jens Axboe wrote:
> On 1/19/24 4:16 PM, Bart Van Assche wrote:
>> On 1/19/24 08:02, Jens Axboe wrote:
>>> If we attempt to insert a list of requests, but someone else is already
>>> running an insertion, then fall back to queueing that list internally and
>>> let the existing inserter finish the operation. The current inserter
>>> will either see and flush this list, or if it ends before we're done
>>> doing our bucket insert, then we'll flush it and insert ourselves.
>>>
>>> This reduces contention on the dd->lock, which protects any request
>>> insertion or dispatch, by having a backup point to insert into which
>>> will either be flushed immediately or by an existing inserter. As the
>>> alternative is to just keep spinning on the dd->lock, it's very easy
>>> to get into a situation where multiple processes are trying to do IO
>>> and all sit and spin on this lock.
>>
>> With this alternative patch I achieve 20% higher IOPS than with patch
>> 3/4 of this series for 1..4 CPU cores (null_blk + fio in an x86 VM):
>
> Performance aside, I think this is a much better approach than mine. I
> haven't tested it yet, but I think using this instead of my patch 3, on
> top of the other patches, should further drastically cut down on the
> overhead. Can you send a "proper" patch and I'll just replace the one
> that I have?
Ran with this real quick and the incremental I sent, here's what I see.
For reference, this is before the series:
Device      IOPS     sys     contention   diff
====================================================
null_blk    879K     89%     93.6%
nvme0n1     901K     86%     94.5%
and now with the series:
Device      IOPS     sys     contention   diff
====================================================
null_blk    2867K    11.1%   ~6.0%        +226%
nvme0n1     3162K     9.9%   ~5.0%        +251%
which looks really good; it removes the last bit of contention that was
still there. And talk about a combined improvement...
--
Jens Axboe
* Re: [PATCH 3/4] block/mq-deadline: fallback to per-cpu insertion buckets under contention
2024-01-20 0:05 ` Jens Axboe
2024-01-20 0:13 ` Jens Axboe
2024-01-20 0:31 ` Jens Axboe
@ 2024-01-22 23:55 ` Bart Van Assche
2 siblings, 0 replies; 13+ messages in thread
From: Bart Van Assche @ 2024-01-22 23:55 UTC (permalink / raw)
To: Jens Axboe, linux-block
On 1/19/24 16:05, Jens Axboe wrote:
> Can you send a "proper" patch and I'll just replace the one that I have?
Please take a look at
https://lore.kernel.org/linux-block/20240122235332.2299150-1-bvanassche@acm.org/T/#u
Thanks,
Bart.
end of thread, other threads: [~2024-01-22 23:55 UTC | newest]

Thread overview: 13+ messages
2024-01-19 16:02 [PATCHSET RFC v2 0/4] mq-deadline scalability improvements Jens Axboe
2024-01-19 16:02 ` [PATCH 1/4] block/mq-deadline: pass in queue directly to dd_insert_request() Jens Axboe
2024-01-19 23:35 ` Bart Van Assche
2024-01-19 16:02 ` [PATCH 2/4] block/mq-deadline: serialize request dispatching Jens Axboe
2024-01-19 23:24 ` Bart Van Assche
2024-01-20 0:00 ` Jens Axboe
2024-01-19 16:02 ` [PATCH 3/4] block/mq-deadline: fallback to per-cpu insertion buckets under contention Jens Axboe
2024-01-19 23:16 ` Bart Van Assche
2024-01-20 0:05 ` Jens Axboe
2024-01-20 0:13 ` Jens Axboe
2024-01-20 0:31 ` Jens Axboe
2024-01-22 23:55 ` Bart Van Assche
2024-01-19 16:02 ` [PATCH 4/4] block/mq-deadline: skip expensive merge lookups if contended Jens Axboe