public inbox for linux-block@vger.kernel.org
* [PATCHSET RFC 0/2] mq-deadline scalability improvements
@ 2024-01-18 18:04 Jens Axboe
  2024-01-18 18:04 ` [PATCH 1/2] block/mq-deadline: serialize request dispatching Jens Axboe
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Jens Axboe @ 2024-01-18 18:04 UTC (permalink / raw)
  To: linux-block; +Cc: bvanassche

Hi,

It's no secret that mq-deadline doesn't scale very well - it was
originally done as a proof-of-concept conversion from deadline, when the
blk-mq multiqueue layer was written. In the single queue world, the
queue lock protected the IO scheduler as well, and mq-deadline simply
adopted an internal dd->lock to fill that role.

While mq-deadline works under blk-mq and doesn't suffer any scaling
issues on that side, as soon as request insertion or dispatch is done,
we're hitting the per-queue dd->lock quite intensely. On a basic test box
with 16 cores / 32 threads, running a number of IO intensive threads
on either null_blk (single hw queue) or nvme0n1 (many hw queues) shows
this quite easily:

Device		QD	Jobs	IOPS	Lock contention
=======================================================
null_blk	4	32	1090K	92%
nvme0n1		4	32	1070K	94%

which looks pretty miserable: most of the time is spent contending on
the per-queue dd->lock.

This RFC patchset attempts to address that by:

1) Serializing dispatch of requests. If we fail dispatching, rely on
   the next completion to dispatch the next one. This could potentially
   reduce the overall depth achieved on the device side, however even
   for the heavily contended test I'm running here, no observable
   change is seen. This is patch 1.

2) Serialize request insertion, using internal per-cpu lists to
   temporarily store requests until insertion can proceed. This is
   patch 2.
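For intuition, the dispatch serialization in step 1 can be sketched with
userspace C11 atomics. This is a stand-in for the kernel's
test_bit()/test_and_set_bit() on dd->run_state, not the actual
implementation; the struct and function names are hypothetical:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for dd->run_state and the DD_DISPATCHING bit. */
struct fake_dd {
	atomic_ulong run_state;
};

#define DD_DISPATCHING_BIT 1UL

static bool try_start_dispatch(struct fake_dd *dd)
{
	/* Plain read first: contended callers bail without a RMW. */
	if (atomic_load_explicit(&dd->run_state, memory_order_relaxed) &
	    DD_DISPATCHING_BIT)
		return false;
	/* Exactly one caller wins the test-and-set and gets to dispatch. */
	return !(atomic_fetch_or_explicit(&dd->run_state, DD_DISPATCHING_BIT,
					  memory_order_acquire) &
		 DD_DISPATCHING_BIT);
}

static void end_dispatch(struct fake_dd *dd)
{
	atomic_fetch_and_explicit(&dd->run_state, ~DD_DISPATCHING_BIT,
				  memory_order_release);
}

/* Returns 1 if exactly one of two back-to-back callers wins the gate,
 * and the gate reopens once the winner is done. */
int demo_dispatch_gate(void)
{
	struct fake_dd dd = { .run_state = 0 };
	bool first = try_start_dispatch(&dd);	/* wins */
	bool second = try_start_dispatch(&dd);	/* skips; a completion
						 * would re-run dispatch */
	end_dispatch(&dd);
	return first && !second && try_start_dispatch(&dd);
}
```

A caller that loses the gate simply returns NULL from dispatch and relies
on the next completion to trigger another dispatch event.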

With that in place, the same test case now does:

Device		QD	Jobs	IOPS	Contention	Diff
=============================================================
null_blk	4	32	2250K	28%		+106%
nvme0n1		4	32	2560K	23%		+112%

and while that doesn't completely eliminate the lock contention, it's
oodles better than what it was before. The throughput increase shows
that nicely, with more than 100% improvement for both cases.

 block/mq-deadline.c | 146 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 133 insertions(+), 13 deletions(-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/2] block/mq-deadline: serialize request dispatching
  2024-01-18 18:04 [PATCHSET RFC 0/2] mq-deadline scalability improvements Jens Axboe
@ 2024-01-18 18:04 ` Jens Axboe
  2024-01-18 18:24   ` Bart Van Assche
  2024-01-19  2:40   ` Ming Lei
  2024-01-18 18:04 ` [PATCH 2/2] block/mq-deadline: fallback to per-cpu insertion buckets under contention Jens Axboe
  2024-01-18 19:29 ` [PATCHSET RFC 0/2] mq-deadline scalability improvements Jens Axboe
  2 siblings, 2 replies; 20+ messages in thread
From: Jens Axboe @ 2024-01-18 18:04 UTC (permalink / raw)
  To: linux-block; +Cc: bvanassche, Jens Axboe

If we're entering request dispatch but someone else is already
dispatching, then just skip this dispatch. We know IO is inflight and
this will trigger another dispatch event for any completion. This will
potentially cause slightly lower queue depth for contended cases, but
those are slowed down anyway and this should not cause an issue.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/mq-deadline.c | 28 +++++++++++++++++++++++-----
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index f958e79277b8..9e0ab3ea728a 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -79,10 +79,20 @@ struct dd_per_prio {
 	struct io_stats_per_prio stats;
 };
 
+enum {
+	DD_DISPATCHING	= 0,
+};
+
 struct deadline_data {
 	/*
 	 * run time data
 	 */
+	struct {
+		spinlock_t lock;
+		spinlock_t zone_lock;
+	} ____cacheline_aligned_in_smp;
+
+	unsigned long run_state;
 
 	struct dd_per_prio per_prio[DD_PRIO_COUNT];
 
@@ -100,9 +110,6 @@ struct deadline_data {
 	int front_merges;
 	u32 async_depth;
 	int prio_aging_expire;
-
-	spinlock_t lock;
-	spinlock_t zone_lock;
 };
 
 /* Maps an I/O priority class to a deadline scheduler priority. */
@@ -600,6 +607,15 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 	struct request *rq;
 	enum dd_prio prio;
 
+	/*
+	 * If someone else is already dispatching, skip this one. This will
+	 * defer the next dispatch event to when something completes, and could
+	 * potentially lower the queue depth for contended cases.
+	 */
+	if (test_bit(DD_DISPATCHING, &dd->run_state) ||
+	    test_and_set_bit(DD_DISPATCHING, &dd->run_state))
+		return NULL;
+
 	spin_lock(&dd->lock);
 	rq = dd_dispatch_prio_aged_requests(dd, now);
 	if (rq)
@@ -616,6 +632,7 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 	}
 
 unlock:
+	clear_bit(DD_DISPATCHING, &dd->run_state);
 	spin_unlock(&dd->lock);
 
 	return rq;
@@ -706,6 +723,9 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 
 	eq->elevator_data = dd;
 
+	spin_lock_init(&dd->lock);
+	spin_lock_init(&dd->zone_lock);
+
 	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
 		struct dd_per_prio *per_prio = &dd->per_prio[prio];
 
@@ -722,8 +742,6 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	dd->last_dir = DD_WRITE;
 	dd->fifo_batch = fifo_batch;
 	dd->prio_aging_expire = prio_aging_expire;
-	spin_lock_init(&dd->lock);
-	spin_lock_init(&dd->zone_lock);
 
 	/* We dispatch from request queue wide instead of hw queue */
 	blk_queue_flag_set(QUEUE_FLAG_SQ_SCHED, q);
-- 
2.43.0



* [PATCH 2/2] block/mq-deadline: fallback to per-cpu insertion buckets under contention
  2024-01-18 18:04 [PATCHSET RFC 0/2] mq-deadline scalability improvements Jens Axboe
  2024-01-18 18:04 ` [PATCH 1/2] block/mq-deadline: serialize request dispatching Jens Axboe
@ 2024-01-18 18:04 ` Jens Axboe
  2024-01-18 18:25   ` Keith Busch
  2024-01-18 18:31   ` Bart Van Assche
  2024-01-18 19:29 ` [PATCHSET RFC 0/2] mq-deadline scalability improvements Jens Axboe
  2 siblings, 2 replies; 20+ messages in thread
From: Jens Axboe @ 2024-01-18 18:04 UTC (permalink / raw)
  To: linux-block; +Cc: bvanassche, Jens Axboe

If we attempt to insert a list of requests but someone else is already
running an insertion, then fall back to queueing them internally and let
the existing inserter finish the operation.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/mq-deadline.c | 118 +++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 110 insertions(+), 8 deletions(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 9e0ab3ea728a..eeeaaff189e1 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -81,8 +81,17 @@ struct dd_per_prio {
 
 enum {
 	DD_DISPATCHING	= 0,
+	DD_INSERTING	= 1,
 };
 
+#define DD_CPU_BUCKETS		32
+#define DD_CPU_BUCKETS_MASK	(DD_CPU_BUCKETS - 1)
+
+struct dd_bucket_list {
+	struct list_head list;
+	spinlock_t lock;
+} ____cacheline_aligned_in_smp;
+
 struct deadline_data {
 	/*
 	 * run time data
@@ -94,6 +103,9 @@ struct deadline_data {
 
 	unsigned long run_state;
 
+	atomic_t insert_seq;
+	struct dd_bucket_list bucket_lists[DD_CPU_BUCKETS];
+
 	struct dd_per_prio per_prio[DD_PRIO_COUNT];
 
 	/* Data direction of latest dispatched request. */
@@ -711,7 +723,7 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	struct deadline_data *dd;
 	struct elevator_queue *eq;
 	enum dd_prio prio;
-	int ret = -ENOMEM;
+	int i, ret = -ENOMEM;
 
 	eq = elevator_alloc(q, e);
 	if (!eq)
@@ -725,6 +737,12 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 
 	spin_lock_init(&dd->lock);
 	spin_lock_init(&dd->zone_lock);
+	atomic_set(&dd->insert_seq, 0);
+
+	for (i = 0; i < DD_CPU_BUCKETS; i++) {
+		INIT_LIST_HEAD(&dd->bucket_lists[i].list);
+		spin_lock_init(&dd->bucket_lists[i].lock);
+	}
 
 	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
 		struct dd_per_prio *per_prio = &dd->per_prio[prio];
@@ -876,6 +894,67 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	}
 }
 
+static void dd_dispatch_from_buckets(struct deadline_data *dd,
+				     struct list_head *list)
+{
+	int i;
+
+	for (i = 0; i < DD_CPU_BUCKETS; i++) {
+		struct dd_bucket_list *bucket = &dd->bucket_lists[i];
+
+		if (list_empty_careful(&bucket->list))
+			continue;
+		spin_lock(&bucket->lock);
+		list_splice_init(&bucket->list, list);
+		spin_unlock(&bucket->lock);
+	}
+}
+
+/*
+ * If we can grab the dd->lock, then just return and do the insertion as per
+ * usual. If not, add to one of our internal buckets, and afterwards recheck
+ * if we should retry.
+ */
+static bool dd_insert_to_bucket(struct deadline_data *dd,
+				struct list_head *list, int *seq)
+	__acquires(&dd->lock)
+{
+	struct dd_bucket_list *bucket;
+	int next_seq;
+
+	*seq = atomic_read(&dd->insert_seq);
+
+	if (spin_trylock(&dd->lock))
+		return false;
+	if (!test_bit(DD_INSERTING, &dd->run_state)) {
+		spin_lock(&dd->lock);
+		return false;
+	}
+
+	*seq = atomic_inc_return(&dd->insert_seq);
+
+	bucket = &dd->bucket_lists[get_cpu() & DD_CPU_BUCKETS_MASK];
+	spin_lock(&bucket->lock);
+	list_splice_init(list, &bucket->list);
+	spin_unlock(&bucket->lock);
+	put_cpu();
+
+	/*
+	 * If seq still matches, we should be safe to just exit with the
+	 * pending requests queued in a bucket.
+	 */
+	next_seq = atomic_inc_return(&dd->insert_seq);
+	if (next_seq == *seq + 1)
+		return true;
+
+	/*
+	 * Seq changed, be safe and grab the lock and insert. Don't update
+	 * sequence, so that we flush the buckets too.
+	 */
+	spin_lock(&dd->lock);
+	return false;
+}
+
 /*
  * Called from blk_mq_insert_request() or blk_mq_dispatch_plug_list().
  */
@@ -886,16 +965,39 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
 	struct request_queue *q = hctx->queue;
 	struct deadline_data *dd = q->elevator->elevator_data;
 	LIST_HEAD(free);
+	int seq;
 
-	spin_lock(&dd->lock);
-	while (!list_empty(list)) {
-		struct request *rq;
+	/*
+	 * If dispatch is busy and we ended up adding to our internal bucket,
+	 * then we're done for now.
+	 */
+	if (dd_insert_to_bucket(dd, list, &seq))
+		return;
+
+	set_bit(DD_INSERTING, &dd->run_state);
+	do {
+		int next_seq;
+
+		while (!list_empty(list)) {
+			struct request *rq;
+
+			rq = list_first_entry(list, struct request, queuelist);
+			list_del_init(&rq->queuelist);
+			dd_insert_request(hctx, rq, flags, &free);
+		}
+
+		/*
+		 * If sequence changed, flush internal buckets
+		 */
+		next_seq = atomic_inc_return(&dd->insert_seq);
+		if (next_seq == seq + 1)
+			break;
+		seq = next_seq;
+		dd_dispatch_from_buckets(dd, list);
+	} while (1);
 
-		rq = list_first_entry(list, struct request, queuelist);
-		list_del_init(&rq->queuelist);
-		dd_insert_request(hctx, rq, flags, &free);
-	}
 	spin_unlock(&dd->lock);
+	clear_bit(DD_INSERTING, &dd->run_state);
 
 	blk_mq_free_requests(&free);
 }
-- 
2.43.0
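
To make the bucket/sequence handoff above concrete, here is a hedged
single-threaded sketch in plain C: ints stand in for requests, arrays for
the lists, and a function pointer simulates a racing inserter hitting the
bucket path while the holder works. All names are made up for
illustration; this is not the kernel code:

```c
#include <stdatomic.h>

#define NBUCKETS 4

struct sketch_dd {
	atomic_int insert_seq;
	int bucket[NBUCKETS][8];	/* per-cpu overflow buckets */
	int bucket_len[NBUCKETS];
	int queue[64];			/* the "real" scheduler queue */
	int queue_len;
};

/* Contended path: park the request in this "CPU"'s bucket and bump the
 * sequence so the current holder knows there is work to flush. */
static void insert_to_bucket(struct sketch_dd *dd, int cpu, int rq)
{
	int b = cpu % NBUCKETS;

	dd->bucket[b][dd->bucket_len[b]++] = rq;
	atomic_fetch_add(&dd->insert_seq, 1);
}

/* Holder path: insert, then re-check the sequence and flush any buckets
 * filled behind our back (mirrors the do/while in dd_insert_requests). */
static void insert_as_holder(struct sketch_dd *dd, int rq,
			     void (*racer)(struct sketch_dd *))
{
	int seq = atomic_load(&dd->insert_seq);

	dd->queue[dd->queue_len++] = rq;
	if (racer)
		racer(dd);	/* simulate a concurrent inserter */

	for (;;) {
		int next_seq = atomic_fetch_add(&dd->insert_seq, 1) + 1;

		if (next_seq == seq + 1)
			break;	/* nobody raced us; buckets are empty */
		seq = next_seq;
		for (int i = 0; i < NBUCKETS; i++) {
			for (int j = 0; j < dd->bucket_len[i]; j++)
				dd->queue[dd->queue_len++] = dd->bucket[i][j];
			dd->bucket_len[i] = 0;
		}
	}
}

static void racer_cpu3(struct sketch_dd *dd)
{
	insert_to_bucket(dd, 3, 77);
}

/* Returns 1 iff both the holder's request and the bucketed one land in
 * the queue, with nothing left behind in the buckets. */
int demo_bucket_handoff(void)
{
	struct sketch_dd dd = { 0 };

	insert_as_holder(&dd, 11, racer_cpu3);
	return dd.queue_len == 2 && dd.queue[0] == 11 && dd.queue[1] == 77 &&
	       dd.bucket_len[3] == 0;
}
```

The sequence counter is what lets the holder detect that a bucket was
filled while it was busy and flush it before giving up the inserter role.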



* Re: [PATCH 1/2] block/mq-deadline: serialize request dispatching
  2024-01-18 18:04 ` [PATCH 1/2] block/mq-deadline: serialize request dispatching Jens Axboe
@ 2024-01-18 18:24   ` Bart Van Assche
  2024-01-18 18:45     ` Jens Axboe
  2024-01-19  2:40   ` Ming Lei
  1 sibling, 1 reply; 20+ messages in thread
From: Bart Van Assche @ 2024-01-18 18:24 UTC (permalink / raw)
  To: Jens Axboe, linux-block

On 1/18/24 10:04, Jens Axboe wrote:
> If we're entering request dispatch but someone else is already
> dispatching, then just skip this dispatch. We know IO is inflight and
> this will trigger another dispatch event for any completion. This will
> potentially cause slightly lower queue depth for contended cases, but
> those are slowed down anyway and this should not cause an issue.

Shouldn't a performance optimization patch include numbers that show by
how much that patch improves the performance of different workloads?

>   struct deadline_data {
>   	/*
>   	 * run time data
>   	 */
> +	struct {
> +		spinlock_t lock;
> +		spinlock_t zone_lock;
> +	} ____cacheline_aligned_in_smp;

Is ____cacheline_aligned_in_smp useful here? struct deadline_data
is not embedded in any other data structure but is allocated with 
kzalloc_node().

> +	/*
> +	 * If someone else is already dispatching, skip this one. This will
> +	 * defer the next dispatch event to when something completes, and could
> +	 * potentially lower the queue depth for contended cases.
> +	 */
> +	if (test_bit(DD_DISPATCHING, &dd->run_state) ||
> +	    test_and_set_bit(DD_DISPATCHING, &dd->run_state))
> +		return NULL;
> +

The above code behaves similar to spin_trylock(). Has it been considered 
to use spin_trylock() instead?

Thanks,

Bart.


* Re: [PATCH 2/2] block/mq-deadline: fallback to per-cpu insertion buckets under contention
  2024-01-18 18:04 ` [PATCH 2/2] block/mq-deadline: fallback to per-cpu insertion buckets under contention Jens Axboe
@ 2024-01-18 18:25   ` Keith Busch
  2024-01-18 18:28     ` Jens Axboe
  2024-01-18 18:31   ` Bart Van Assche
  1 sibling, 1 reply; 20+ messages in thread
From: Keith Busch @ 2024-01-18 18:25 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, bvanassche

On Thu, Jan 18, 2024 at 11:04:57AM -0700, Jens Axboe wrote:
> +#define DD_CPU_BUCKETS		32
> +#define DD_CPU_BUCKETS_MASK	(DD_CPU_BUCKETS - 1)

A bit of wasted space on machines with < 32 CPUs, no? Sure, it's not
much and the fixed size makes the implementation simpler, but these add
up.


* Re: [PATCH 2/2] block/mq-deadline: fallback to per-cpu insertion buckets under contention
  2024-01-18 18:25   ` Keith Busch
@ 2024-01-18 18:28     ` Jens Axboe
  0 siblings, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2024-01-18 18:28 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-block, bvanassche

On 1/18/24 11:25 AM, Keith Busch wrote:
> On Thu, Jan 18, 2024 at 11:04:57AM -0700, Jens Axboe wrote:
>> +#define DD_CPU_BUCKETS		32
>> +#define DD_CPU_BUCKETS_MASK	(DD_CPU_BUCKETS - 1)
> 
> A bit of wasted space on machines with < 32 CPUs, no? Sure, it's not
> much and the fixed size makes the implementation simpler, but these add
> up.

True, could make it allocated instead, 32 was just pulled out of thin
air. Might make sense to make the magic 8 or something instead, that's
probably Good Enough in general. Or we can make it dependent on the
number of online CPUs when initialized. Honestly just didn't want to
overcomplicate this initial RFC.

-- 
Jens Axboe




* Re: [PATCH 2/2] block/mq-deadline: fallback to per-cpu insertion buckets under contention
  2024-01-18 18:04 ` [PATCH 2/2] block/mq-deadline: fallback to per-cpu insertion buckets under contention Jens Axboe
  2024-01-18 18:25   ` Keith Busch
@ 2024-01-18 18:31   ` Bart Van Assche
  2024-01-18 18:33     ` Jens Axboe
  1 sibling, 1 reply; 20+ messages in thread
From: Bart Van Assche @ 2024-01-18 18:31 UTC (permalink / raw)
  To: Jens Axboe, linux-block

On 1/18/24 10:04, Jens Axboe wrote:
> If we attempt to insert a list of requests but someone else is already
> running an insertion, then fall back to queueing them internally and let
> the existing inserter finish the operation.

Because this patch adds significant complexity: what are the use cases
that benefit from this kind of optimization? Are these perhaps workloads
on systems with many CPU cores and fast storage? If the storage is fast,
why use mq-deadline instead of "none" as the I/O scheduler?

Thanks,

Bart.



* Re: [PATCH 2/2] block/mq-deadline: fallback to per-cpu insertion buckets under contention
  2024-01-18 18:31   ` Bart Van Assche
@ 2024-01-18 18:33     ` Jens Axboe
  2024-01-18 18:53       ` Bart Van Assche
  0 siblings, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2024-01-18 18:33 UTC (permalink / raw)
  To: Bart Van Assche, linux-block

On 1/18/24 11:31 AM, Bart Van Assche wrote:
> On 1/18/24 10:04, Jens Axboe wrote:
>> If we attempt to insert a list of requests but someone else is already
>> running an insertion, then fall back to queueing them internally and let
>> the existing inserter finish the operation.
> 
> Because this patch adds significant complexity: what are the use cases
> that benefit from this kind of optimization? Are these perhaps workloads
> on systems with many CPU cores and fast storage? If the storage is fast,
> why use mq-deadline instead of "none" as the I/O scheduler?

You and others complain that mq-deadline is slow and doesn't scale,
these two patches help improve that situation. Not sure why this is even
a question?

-- 
Jens Axboe



* Re: [PATCH 1/2] block/mq-deadline: serialize request dispatching
  2024-01-18 18:24   ` Bart Van Assche
@ 2024-01-18 18:45     ` Jens Axboe
  2024-01-18 18:51       ` Bart Van Assche
  0 siblings, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2024-01-18 18:45 UTC (permalink / raw)
  To: Bart Van Assche, linux-block

On 1/18/24 11:24 AM, Bart Van Assche wrote:
> On 1/18/24 10:04, Jens Axboe wrote:
>> If we're entering request dispatch but someone else is already
>> dispatching, then just skip this dispatch. We know IO is inflight and
>> this will trigger another dispatch event for any completion. This will
>> potentially cause slightly lower queue depth for contended cases, but
>> those are slowed down anyway and this should not cause an issue.
> 
> Shouldn't a performance optimization patch include numbers that show by
> how much that patch improves the performance of different workloads?

The commit messages may want some editing, but all of that is in the
cover letter that I'm sure you saw. This is just a POC/RFC posting.

>>   struct deadline_data {
>>       /*
>>        * run time data
>>        */
>> +    struct {
>> +        spinlock_t lock;
>> +        spinlock_t zone_lock;
>> +    } ____cacheline_aligned_in_smp;
> 
> Is ____cacheline_aligned_in_smp useful here? struct deadline_data
> is not embedded in any other data structure but is allocated with kzalloc_node().

It's not for alignment of deadline_data, it's for alignment within
deadline_data so that grabbing/releasing these locks don't share a
cacheline with other bits. We could ensure that deadline_data itself
is aligned as well, that might be a good idea.
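
The layout intent can be illustrated with a userspace analogue, using C11
alignas as a stand-in for ____cacheline_aligned_in_smp and assuming
64-byte cache lines (a sketch, not the kernel macro):

```c
#include <stdalign.h>
#include <stddef.h>

struct padded_dd {
	struct {
		alignas(64) int lock;	/* stand-ins for the two spinlocks */
		int zone_lock;
	} locks;
	/* kept off the locks' cacheline, so readers of run_state don't
	 * bounce the line the locks live on */
	alignas(64) unsigned long run_state;
};

/* Returns 1 iff the locks start a cacheline and run_state sits on a
 * different, also cacheline-aligned, offset. */
int demo_alignment(void)
{
	return offsetof(struct padded_dd, locks) % 64 == 0 &&
	       offsetof(struct padded_dd, run_state) % 64 == 0 &&
	       offsetof(struct padded_dd, run_state) >= 64;
}
```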

>> +    /*
>> +     * If someone else is already dispatching, skip this one. This will
>> +     * defer the next dispatch event to when something completes, and could
>> +     * potentially lower the queue depth for contended cases.
>> +     */
>> +    if (test_bit(DD_DISPATCHING, &dd->run_state) ||
>> +        test_and_set_bit(DD_DISPATCHING, &dd->run_state))
>> +        return NULL;
>> +
> 
> The above code behaves similar to spin_trylock(). Has it been
> considered to use spin_trylock() instead?

Do you read the replies to the emails, from the other thread? First of
all, you'd need another lock for this. And secondly, this avoids a RMW
if it's already set. So no, I think this is cleaner than a separate
lock, and you'd still have cacheline bouncing on that for ALL calls if
you did it that way.

-- 
Jens Axboe



* Re: [PATCH 1/2] block/mq-deadline: serialize request dispatching
  2024-01-18 18:45     ` Jens Axboe
@ 2024-01-18 18:51       ` Bart Van Assche
  2024-01-18 18:55         ` Jens Axboe
  0 siblings, 1 reply; 20+ messages in thread
From: Bart Van Assche @ 2024-01-18 18:51 UTC (permalink / raw)
  To: Jens Axboe, linux-block

On 1/18/24 10:45, Jens Axboe wrote:
> Do you read the replies to the emails, from the other thread?

Yes.

> And secondly, this avoids a RMW if it's already set.

From the spinlock implementation (something I looked up before I wrote
my previous reply):

static __always_inline int queued_spin_trylock(struct qspinlock *lock)
{
	int val = atomic_read(&lock->val);

	if (unlikely(val))
		return 0;

	return likely(atomic_try_cmpxchg_acquire(&lock->val, &val,
                                                  _Q_LOCKED_VAL));
}

I think this implementation does not perform a RMW if the spinlock is
already locked.

Thanks,

Bart.


* Re: [PATCH 2/2] block/mq-deadline: fallback to per-cpu insertion buckets under contention
  2024-01-18 18:33     ` Jens Axboe
@ 2024-01-18 18:53       ` Bart Van Assche
  2024-01-18 18:56         ` Jens Axboe
  0 siblings, 1 reply; 20+ messages in thread
From: Bart Van Assche @ 2024-01-18 18:53 UTC (permalink / raw)
  To: Jens Axboe, linux-block

On 1/18/24 10:33, Jens Axboe wrote:
> On 1/18/24 11:31 AM, Bart Van Assche wrote:
>> On 1/18/24 10:04, Jens Axboe wrote:
>>> If we attempt to insert a list of requests but someone else is already
>>> running an insertion, then fall back to queueing them internally and let
>>> the existing inserter finish the operation.
>>
>> Because this patch adds significant complexity: what are the use cases
>> that benefit from this kind of optimization? Are these perhaps workloads
>> on systems with many CPU cores and fast storage? If the storage is fast,
>> why use mq-deadline instead of "none" as the I/O scheduler?
> 
> You and others complain that mq-deadline is slow and doesn't scale,
> these two patches help improve that situation. Not sure why this is even
> a question?

How much does this patch improve performance?

Thanks,

Bart.




* Re: [PATCH 1/2] block/mq-deadline: serialize request dispatching
  2024-01-18 18:51       ` Bart Van Assche
@ 2024-01-18 18:55         ` Jens Axboe
  0 siblings, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2024-01-18 18:55 UTC (permalink / raw)
  To: Bart Van Assche, linux-block

On 1/18/24 11:51 AM, Bart Van Assche wrote:
> On 1/18/24 10:45, Jens Axboe wrote:
>> Do you read the replies to the emails, from the other thread?
> 
> Yes.

Well it's been a bit frustrating to say the least, as you seem to either
not read them or just ignore what's in them.

>  > And secondly, this avoids a RMW if it's already set.
> From the spinlock implementation (something I looked up before I wrote
> my previous reply):
> 
> static __always_inline int queued_spin_trylock(struct qspinlock *lock)
> {
>     int val = atomic_read(&lock->val);
> 
>     if (unlikely(val))
>         return 0;
> 
>     return likely(atomic_try_cmpxchg_acquire(&lock->val, &val,
>                                                  _Q_LOCKED_VAL));
> }
> 
> I think this implementation does not perform a RMW if the spinlock is
> already locked.

This looks like a generic trylock; is the same true for the arch
variants? Because your attempt either failed because of the RMW or
because you are hammering it repeatedly, or both.

In any case, even if any trylock would avoid the initial atomic, it
doesn't give you anything that a bitop would not just do, and we use the
latter trick elsewhere in blk-mq.
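
The trade-off being debated can be illustrated with C11 atomics. This is
only a sketch of the plain-load-before-RMW idea, not the kernel's bitops
or qspinlock code, and the counter exists purely to make the effect
observable:

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int rmw_count;	/* counts actual read-modify-write attempts */

static bool gate_acquire(atomic_bool *gate)
{
	/* Plain load: an already-set gate is rejected without ever taking
	 * the cacheline exclusive. */
	if (atomic_load_explicit(gate, memory_order_relaxed))
		return false;
	atomic_fetch_add(&rmw_count, 1);
	return !atomic_exchange_explicit(gate, true, memory_order_acquire);
}

/* Returns 1 iff only the winning caller performed an RMW; the two losing
 * callers were turned away by the plain load alone. */
int demo_rmw_avoidance(void)
{
	atomic_bool gate = false;
	bool a = gate_acquire(&gate);	/* wins, performs the one RMW */
	bool b = gate_acquire(&gate);	/* loses on the plain load */
	bool c = gate_acquire(&gate);	/* ditto */

	return a && !b && !c && atomic_load(&rmw_count) == 1;
}
```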

-- 
Jens Axboe



* Re: [PATCH 2/2] block/mq-deadline: fallback to per-cpu insertion buckets under contention
  2024-01-18 18:53       ` Bart Van Assche
@ 2024-01-18 18:56         ` Jens Axboe
  2024-01-18 20:46           ` Bart Van Assche
  0 siblings, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2024-01-18 18:56 UTC (permalink / raw)
  To: Bart Van Assche, linux-block

On 1/18/24 11:53 AM, Bart Van Assche wrote:
> On 1/18/24 10:33, Jens Axboe wrote:
>> On 1/18/24 11:31 AM, Bart Van Assche wrote:
>>> On 1/18/24 10:04, Jens Axboe wrote:
>>>> If we attempt to insert a list of requests but someone else is already
>>>> running an insertion, then fall back to queueing them internally and let
>>>> the existing inserter finish the operation.
>>>
>>> Because this patch adds significant complexity: what are the use cases
>>> that benefit from this kind of optimization? Are these perhaps workloads
>>> on systems with many CPU cores and fast storage? If the storage is fast,
>>> why use mq-deadline instead of "none" as the I/O scheduler?
>>
>> You and others complain that mq-deadline is slow and doesn't scale,
>> these two patches help improve that situation. Not sure why this is even
>> a question?
> 
> How much does this patch improve performance?

Do you need me to link the cover letter that you were CC'ed on?

-- 
Jens Axboe




* Re: [PATCHSET RFC 0/2] mq-deadline scalability improvements
  2024-01-18 18:04 [PATCHSET RFC 0/2] mq-deadline scalability improvements Jens Axboe
  2024-01-18 18:04 ` [PATCH 1/2] block/mq-deadline: serialize request dispatching Jens Axboe
  2024-01-18 18:04 ` [PATCH 2/2] block/mq-deadline: fallback to per-cpu insertion buckets under contention Jens Axboe
@ 2024-01-18 19:29 ` Jens Axboe
  2024-01-18 20:22   ` Jens Axboe
  2 siblings, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2024-01-18 19:29 UTC (permalink / raw)
  To: linux-block; +Cc: bvanassche

On 1/18/24 11:04 AM, Jens Axboe wrote:
> With that in place, the same test case now does:
> 
> Device	QD	Jobs	IOPS	Contention	Diff
> =============================================================
> null_blk	4	32	2250K	28%		+106%
> nvme0n1	4	32	2560K	23%		+112%

nvme0n1		4	32	2560K	23%		+139%

Apparently I can't math, this is a +139% improvement for the nvme
case... Just wanted to make it clear that the IOPS number was correct,
it's just the diff math that was wrong.

-- 
Jens Axboe



* Re: [PATCHSET RFC 0/2] mq-deadline scalability improvements
  2024-01-18 19:29 ` [PATCHSET RFC 0/2] mq-deadline scalability improvements Jens Axboe
@ 2024-01-18 20:22   ` Jens Axboe
  0 siblings, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2024-01-18 20:22 UTC (permalink / raw)
  To: linux-block; +Cc: bvanassche

On 1/18/24 12:29 PM, Jens Axboe wrote:
> On 1/18/24 11:04 AM, Jens Axboe wrote:
>> With that in place, the same test case now does:
>>
>> Device	QD	Jobs	IOPS	Contention	Diff
>> =============================================================
>> null_blk	4	32	2250K	28%		+106%
>> nvme0n1	4	32	2560K	23%		+112%
> 
> nvme0n1		4	32	2560K	23%		+139%
> 
> Apparently I can't math, this is a +139% improvement for the nvme
> case... Just wanted to make it clear that the IOPS number was correct,
> it's just the diff math that was wrong.

And further followup, since I ran some quick testing on another box that
has a raid1 of more normal drives (SATA, 32 tags). Both pre and post the
patches, the performance is roughly the same. The bigger difference is
that the pre result is using 8% systime to do ~73K IOPS, and with the
patches we're using 1% systime to do the same work.

This should help answer the question "does this matter at all?". The
answer is definitely yes. As is usually the case with improvements like
this, it's not just about scalability, it's about efficiency as well.
8x the sys time is ridiculous.

-- 
Jens Axboe



* Re: [PATCH 2/2] block/mq-deadline: fallback to per-cpu insertion buckets under contention
  2024-01-18 18:56         ` Jens Axboe
@ 2024-01-18 20:46           ` Bart Van Assche
  2024-01-18 20:52             ` Jens Axboe
  0 siblings, 1 reply; 20+ messages in thread
From: Bart Van Assche @ 2024-01-18 20:46 UTC (permalink / raw)
  To: Jens Axboe, linux-block

On 1/18/24 10:56, Jens Axboe wrote:
> Do you need me to link the cover letter that you were CC'ed on?

There is no reason to use such an aggressive tone in your emails.

In the cover letter I see performance numbers for the patch series in
its entirety but not for the individual patches. I'd like to know by
how much this patch by itself improves performance because whether or
not I will review this patch will depend on that data.

Thanks,

Bart.




* Re: [PATCH 2/2] block/mq-deadline: fallback to per-cpu insertion buckets under contention
  2024-01-18 20:46           ` Bart Van Assche
@ 2024-01-18 20:52             ` Jens Axboe
  2024-01-19 23:11               ` Bart Van Assche
  0 siblings, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2024-01-18 20:52 UTC (permalink / raw)
  To: Bart Van Assche, linux-block

On 1/18/24 1:46 PM, Bart Van Assche wrote:
> On 1/18/24 10:56, Jens Axboe wrote:
>> Do you need me to link the cover letter that you were CC'ed on?
> 
> There is no reason to use such an aggressive tone in your emails.

I'm getting frustrated with you because I need to say the same things
multiple times, and it doesn't seem like it's getting received at the
other end. And you had clearly seen the results, which means that rather
than the passive aggressive question, you could have said

"It'd probably be useful to include some performance numbers in
 the commit messages themselves as well."

which would be received way differently than asking a question that you
already have the answer to.

> In the cover letter I see performance numbers for the patch series in
> its entirety but not for the individual patches. I'd like to know by
> how much this patch by itself improves performance because whether or
> not I will review this patch will depend on that data.

You can't really split them up, as you need both to see the real
benefit. The only reason they are split is because it makes sense to do
so in terms of the code, as they are two different paths and points of
contention. This obviously makes patch 2 look better than it is, and
that would be the case even if I swapped the order of them as well. As
mentioned in a previous reply, I'll be editing the commit messages and
probably just include the various performance results in patch 2 and
reference patch 1 from that too.

-- 
Jens Axboe



* Re: [PATCH 1/2] block/mq-deadline: serialize request dispatching
  2024-01-18 18:04 ` [PATCH 1/2] block/mq-deadline: serialize request dispatching Jens Axboe
  2024-01-18 18:24   ` Bart Van Assche
@ 2024-01-19  2:40   ` Ming Lei
  2024-01-19 15:49     ` Jens Axboe
  1 sibling, 1 reply; 20+ messages in thread
From: Ming Lei @ 2024-01-19  2:40 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, bvanassche, ming.lei

On Thu, Jan 18, 2024 at 11:04:56AM -0700, Jens Axboe wrote:
> If we're entering request dispatch but someone else is already
> dispatching, then just skip this dispatch. We know IO is inflight and
> this will trigger another dispatch event for any completion. This will
> potentially cause slightly lower queue depth for contended cases, but
> those are slowed down anyway and this should not cause an issue.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  block/mq-deadline.c | 28 +++++++++++++++++++++++-----
>  1 file changed, 23 insertions(+), 5 deletions(-)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index f958e79277b8..9e0ab3ea728a 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -79,10 +79,20 @@ struct dd_per_prio {
>  	struct io_stats_per_prio stats;
>  };
>  
> +enum {
> +	DD_DISPATCHING	= 0,
> +};
> +
>  struct deadline_data {
>  	/*
>  	 * run time data
>  	 */
> +	struct {
> +		spinlock_t lock;
> +		spinlock_t zone_lock;
> +	} ____cacheline_aligned_in_smp;
> +
> +	unsigned long run_state;
>  
>  	struct dd_per_prio per_prio[DD_PRIO_COUNT];
>  
> @@ -100,9 +110,6 @@ struct deadline_data {
>  	int front_merges;
>  	u32 async_depth;
>  	int prio_aging_expire;
> -
> -	spinlock_t lock;
> -	spinlock_t zone_lock;
>  };
>  
>  /* Maps an I/O priority class to a deadline scheduler priority. */
> @@ -600,6 +607,15 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
>  	struct request *rq;
>  	enum dd_prio prio;
>  
> +	/*
> +	 * If someone else is already dispatching, skip this one. This will
> +	 * defer the next dispatch event to when something completes, and could
> +	 * potentially lower the queue depth for contended cases.
> +	 */
> +	if (test_bit(DD_DISPATCHING, &dd->run_state) ||
> +	    test_and_set_bit(DD_DISPATCHING, &dd->run_state))
> +		return NULL;
> +

This patch looks fine.

BTW, the current dispatch actually piggybacks on the in-progress dispatch,
see blk_mq_do_dispatch_sched(). Correctness depends on
blk_mq_do_dispatch_sched() looping and retrying when nothing was
dispatched, so maybe we need to document that here.


Thanks,
Ming



* Re: [PATCH 1/2] block/mq-deadline: serialize request dispatching
  2024-01-19  2:40   ` Ming Lei
@ 2024-01-19 15:49     ` Jens Axboe
  0 siblings, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2024-01-19 15:49 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-block, bvanassche

On 1/18/24 7:40 PM, Ming Lei wrote:
> On Thu, Jan 18, 2024 at 11:04:56AM -0700, Jens Axboe wrote:
>> If we're entering request dispatch but someone else is already
>> dispatching, then just skip this dispatch. We know IO is inflight and
>> this will trigger another dispatch event for any completion. This will
>> potentially cause slightly lower queue depth for contended cases, but
>> those are slowed down anyway and this should not cause an issue.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  block/mq-deadline.c | 28 +++++++++++++++++++++++-----
>>  1 file changed, 23 insertions(+), 5 deletions(-)
>>
>> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
>> index f958e79277b8..9e0ab3ea728a 100644
>> --- a/block/mq-deadline.c
>> +++ b/block/mq-deadline.c
>> @@ -79,10 +79,20 @@ struct dd_per_prio {
>>  	struct io_stats_per_prio stats;
>>  };
>>  
>> +enum {
>> +	DD_DISPATCHING	= 0,
>> +};
>> +
>>  struct deadline_data {
>>  	/*
>>  	 * run time data
>>  	 */
>> +	struct {
>> +		spinlock_t lock;
>> +		spinlock_t zone_lock;
>> +	} ____cacheline_aligned_in_smp;
>> +
>> +	unsigned long run_state;
>>  
>>  	struct dd_per_prio per_prio[DD_PRIO_COUNT];
>>  
>> @@ -100,9 +110,6 @@ struct deadline_data {
>>  	int front_merges;
>>  	u32 async_depth;
>>  	int prio_aging_expire;
>> -
>> -	spinlock_t lock;
>> -	spinlock_t zone_lock;
>>  };
>>  
>>  /* Maps an I/O priority class to a deadline scheduler priority. */
>> @@ -600,6 +607,15 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
>>  	struct request *rq;
>>  	enum dd_prio prio;
>>  
>> +	/*
>> +	 * If someone else is already dispatching, skip this one. This will
>> +	 * defer the next dispatch event to when something completes, and could
>> +	 * potentially lower the queue depth for contended cases.
>> +	 */
>> +	if (test_bit(DD_DISPATCHING, &dd->run_state) ||
>> +	    test_and_set_bit(DD_DISPATCHING, &dd->run_state))
>> +		return NULL;
>> +
> 
> This patch looks fine.
> 
> BTW, the current dispatch actually piggybacks on the in-progress dispatch,
> see blk_mq_do_dispatch_sched(). Correctness depends on
> blk_mq_do_dispatch_sched() looping and retrying when nothing was
> dispatched, so maybe we need to document that here.

Thanks for taking a look, I'll add a comment.

-- 
Jens Axboe



* Re: [PATCH 2/2] block/mq-deadline: fallback to per-cpu insertion buckets under contention
  2024-01-18 20:52             ` Jens Axboe
@ 2024-01-19 23:11               ` Bart Van Assche
  0 siblings, 0 replies; 20+ messages in thread
From: Bart Van Assche @ 2024-01-19 23:11 UTC (permalink / raw)
  To: Jens Axboe, linux-block

On 1/18/24 12:52, Jens Axboe wrote:
> And you had clearly seen the results, which means that rather
> than the passive aggressive question, you could have said [ ... ]

I'm never passive-aggressive. If that is how my message was received,
I want to apologize.

Bart.


end of thread, other threads:[~2024-01-19 23:11 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-18 18:04 [PATCHSET RFC 0/2] mq-deadline scalability improvements Jens Axboe
2024-01-18 18:04 ` [PATCH 1/2] block/mq-deadline: serialize request dispatching Jens Axboe
2024-01-18 18:24   ` Bart Van Assche
2024-01-18 18:45     ` Jens Axboe
2024-01-18 18:51       ` Bart Van Assche
2024-01-18 18:55         ` Jens Axboe
2024-01-19  2:40   ` Ming Lei
2024-01-19 15:49     ` Jens Axboe
2024-01-18 18:04 ` [PATCH 2/2] block/mq-deadline: fallback to per-cpu insertion buckets under contention Jens Axboe
2024-01-18 18:25   ` Keith Busch
2024-01-18 18:28     ` Jens Axboe
2024-01-18 18:31   ` Bart Van Assche
2024-01-18 18:33     ` Jens Axboe
2024-01-18 18:53       ` Bart Van Assche
2024-01-18 18:56         ` Jens Axboe
2024-01-18 20:46           ` Bart Van Assche
2024-01-18 20:52             ` Jens Axboe
2024-01-19 23:11               ` Bart Van Assche
2024-01-18 19:29 ` [PATCHSET RFC 0/2] mq-deadline scalability improvements Jens Axboe
2024-01-18 20:22   ` Jens Axboe
