* [PATCH] block: Improve shared tag set performance
@ 2023-10-18 18:00 Bart Van Assche
From: Bart Van Assche @ 2023-10-18 18:00 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Bart Van Assche, Christoph Hellwig,
Martin K . Petersen, Ming Lei, Keith Busch, Damien Le Moal,
Yu Kuai, Ed Tsai
Remove the code for fair tag sharing because it significantly hurts
performance for UFS devices. Removing this code is safe because the
legacy block layer worked fine without any equivalent fairness
algorithm.
This algorithm hurts performance for UFS devices because UFS devices
have multiple logical units. One of these logical units (WLUN) is used
to submit control commands, e.g. START STOP UNIT. If any request is
submitted to the WLUN, the queue depth is reduced from 31 to 15 or
lower for data LUNs.
See also https://lore.kernel.org/linux-scsi/20221229030645.11558-1-ed.tsai@mediatek.com/
Note: it has been attempted to rework this algorithm. See also "[PATCH
RFC 0/7] blk-mq: improve tag fair sharing"
(https://lore.kernel.org/linux-block/20230618160738.54385-1-yukuai1@huaweicloud.com/).
Given the complexity of that patch series, I do not expect that patch
series to be merged.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Yu Kuai <yukuai1@huaweicloud.com>
Cc: Ed Tsai <ed.tsai@mediatek.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-mq-tag.c | 4 ----
block/blk-mq.c | 3 ---
block/blk-mq.h | 39 ---------------------------------------
3 files changed, 46 deletions(-)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index cc57e2dd9a0b..25334bfcabf8 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -105,10 +105,6 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
struct sbitmap_queue *bt)
{
- if (!data->q->elevator && !(data->flags & BLK_MQ_REQ_RESERVED) &&
- !hctx_may_queue(data->hctx, bt))
- return BLK_MQ_NO_TAG;
-
if (data->shallow_depth)
return sbitmap_queue_get_shallow(bt, data->shallow_depth);
else
diff --git a/block/blk-mq.c b/block/blk-mq.c
index e2d11183f62e..502dafa76716 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1760,9 +1760,6 @@ bool __blk_mq_alloc_driver_tag(struct request *rq)
if (blk_mq_tag_is_reserved(rq->mq_hctx->sched_tags, rq->internal_tag)) {
bt = &rq->mq_hctx->tags->breserved_tags;
tag_offset = 0;
- } else {
- if (!hctx_may_queue(rq->mq_hctx, bt))
- return false;
}
tag = __sbitmap_queue_get(bt);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index f75a9ecfebde..14a22f6d3fdf 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -407,45 +407,6 @@ static inline void blk_mq_free_requests(struct list_head *list)
}
}
-/*
- * For shared tag users, we track the number of currently active users
- * and attempt to provide a fair share of the tag depth for each of them.
- */
-static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
- struct sbitmap_queue *bt)
-{
- unsigned int depth, users;
-
- if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
- return true;
-
- /*
- * Don't try dividing an ant
- */
- if (bt->sb.depth == 1)
- return true;
-
- if (blk_mq_is_shared_tags(hctx->flags)) {
- struct request_queue *q = hctx->queue;
-
- if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
- return true;
- } else {
- if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
- return true;
- }
-
- users = READ_ONCE(hctx->tags->active_queues);
- if (!users)
- return true;
-
- /*
- * Allow at least some tags
- */
- depth = max((bt->sb.depth + users - 1) / users, 4U);
- return __blk_mq_active_requests(hctx) < depth;
-}
-
/* run the code block in @dispatch_ops with rcu/srcu read lock held */
#define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops) \
do { \
* Re: [PATCH] block: Improve shared tag set performance
From: Christoph Hellwig @ 2023-10-20 4:41 UTC (permalink / raw)
To: Bart Van Assche
Cc: Jens Axboe, linux-block, Christoph Hellwig, Martin K . Petersen,
Ming Lei, Keith Busch, Damien Le Moal, Yu Kuai, Ed Tsai
On Wed, Oct 18, 2023 at 11:00:56AM -0700, Bart Van Assche wrote:
> Note: it has been attempted to rework this algorithm. See also "[PATCH
> RFC 0/7] blk-mq: improve tag fair sharing"
> (https://lore.kernel.org/linux-block/20230618160738.54385-1-yukuai1@huaweicloud.com/).
> Given the complexity of that patch series, I do not expect that patch
> series to be merged.
Work is hard, so let's skip it? That's not really the most convincing
argument. Hey, I'm the biggest advocate for code improvement by code
removal, but you better have a really good argument why it doesn't hurt
anyone.
* Re: [PATCH] block: Improve shared tag set performance
From: Bart Van Assche @ 2023-10-20 16:17 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, linux-block, Martin K . Petersen, Ming Lei,
Keith Busch, Damien Le Moal, Yu Kuai, Ed Tsai
On 10/19/23 21:41, Christoph Hellwig wrote:
> On Wed, Oct 18, 2023 at 11:00:56AM -0700, Bart Van Assche wrote:
>> Note: it has been attempted to rework this algorithm. See also "[PATCH
>> RFC 0/7] blk-mq: improve tag fair sharing"
>> (https://lore.kernel.org/linux-block/20230618160738.54385-1-yukuai1@huaweicloud.com/).
>> Given the complexity of that patch series, I do not expect that patch
>> series to be merged.
>
> Work is hard, so let's skip it? That's not really the most convincing
> argument. Hey, I'm the biggest advocate for code improvement by code
> removal, but you better have a really good argument why it doesn't hurt
> anyone.
Hi Christoph,
No, it's not because it's hard to improve the tag fairness algorithm
that I'm proposing to skip this work. It's because I'm convinced that
an improved fairness algorithm will have a negative performance impact
that is larger than that of the current algorithm.
Do you agree that the legacy block layer never had any such fairness
algorithm and also that nobody ever complained about fairness issues
for the legacy block layer? I think that's a strong argument in favor of
removing the fairness algorithm.
Thanks,
Bart.
* Re: [PATCH] block: Improve shared tag set performance
From: Keith Busch @ 2023-10-20 16:25 UTC (permalink / raw)
To: Bart Van Assche
Cc: Christoph Hellwig, Jens Axboe, linux-block, Martin K . Petersen,
Ming Lei, Damien Le Moal, Yu Kuai, Ed Tsai
On Fri, Oct 20, 2023 at 09:17:11AM -0700, Bart Van Assche wrote:
> On 10/19/23 21:41, Christoph Hellwig wrote:
> > On Wed, Oct 18, 2023 at 11:00:56AM -0700, Bart Van Assche wrote:
> > > Note: it has been attempted to rework this algorithm. See also "[PATCH
> > > RFC 0/7] blk-mq: improve tag fair sharing"
> > > (https://lore.kernel.org/linux-block/20230618160738.54385-1-yukuai1@huaweicloud.com/).
> > > Given the complexity of that patch series, I do not expect that patch
> > > series to be merged.
> >
> > Work is hard, so let's skip it? That's not really the most convincing
> > argument. Hey, I'm the biggest advocate for code improvement by code
> > removal, but you better have a really good argument why it doesn't hurt
> > anyone.
>
> Hi Christoph,
>
> No, it's not because it's hard to improve the tag fairness algorithm
> that I'm proposing to skip this work. It's because I'm convinced that
> an improved fairness algorithm will have a negative performance impact
> that is larger than that of the current algorithm.
>
> Do you agree that the legacy block layer never had any such fairness
> algorithm and also that nobody ever complained about fairness issues
> for the legacy block layer? I think that's a strong argument in favor of
> removing the fairness algorithm.
The legacy block request layer didn't have a tag resource shared among
multiple request queues. Each queue had their own mempool for allocating
requests. The mempool, I think, would always guarantee everyone could
get at least one request.
* Re: [PATCH] block: Improve shared tag set performance
From: Bart Van Assche @ 2023-10-20 16:45 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Jens Axboe, linux-block, Martin K . Petersen,
Ming Lei, Damien Le Moal, Yu Kuai, Ed Tsai
On 10/20/23 09:25, Keith Busch wrote:
> The legacy block request layer didn't have a tag resource shared among
> multiple request queues. Each queue had their own mempool for allocating
> requests. The mempool, I think, would always guarantee everyone could
> get at least one request.
I think that the above is irrelevant in this context. As an example,
SCSI devices have always shared a pool of tags across multiple logical
units. This behavior has not been changed by the conversion of the
SCSI core from the legacy block layer to blk-mq.
For other (hardware) block devices it didn't matter either that there
was no upper limit to the number of requests the legacy block layer
could allocate. All hardware block devices I know support fixed size
queues for queuing requests to the block device.
Bart.
* Re: [PATCH] block: Improve shared tag set performance
From: Keith Busch @ 2023-10-20 17:09 UTC (permalink / raw)
To: Bart Van Assche
Cc: Christoph Hellwig, Jens Axboe, linux-block, Martin K . Petersen,
Ming Lei, Damien Le Moal, Yu Kuai, Ed Tsai
On Fri, Oct 20, 2023 at 09:45:53AM -0700, Bart Van Assche wrote:
>
> On 10/20/23 09:25, Keith Busch wrote:
> > The legacy block request layer didn't have a tag resource shared among
> > multiple request queues. Each queue had their own mempool for allocating
> > requests. The mempool, I think, would always guarantee everyone could
> > get at least one request.
>
> I think that the above is irrelevant in this context. As an example, SCSI
> devices have always shared a pool of tags across multiple logical
> units. This behavior has not been changed by the conversion of the
> SCSI core from the legacy block layer to blk-mq.
>
> For other (hardware) block devices it didn't matter either that there
> was no upper limit to the number of requests the legacy block layer
> could allocate. All hardware block devices I know support fixed size
> queues for queuing requests to the block device.
I am not sure I understand your point. Those lower layers always were
able to get at least one request per request_queue. They can do whatever
they want with it after that. This change removes that guarantee for
blk-mq in some cases, right? I just don't think you can readily conclude
that is "safe" by appealing to the legacy behavior, that's all.
* Re: [PATCH] block: Improve shared tag set performance
From: Bart Van Assche @ 2023-10-20 17:54 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Jens Axboe, linux-block, Martin K . Petersen,
Ming Lei, Damien Le Moal, Yu Kuai, Ed Tsai
On 10/20/23 10:09, Keith Busch wrote:
> On Fri, Oct 20, 2023 at 09:45:53AM -0700, Bart Van Assche wrote:
>>
>> On 10/20/23 09:25, Keith Busch wrote:
>>> The legacy block request layer didn't have a tag resource shared among
>>> multiple request queues. Each queue had their own mempool for allocating
>>> requests. The mempool, I think, would always guarantee everyone could
>>> get at least one request.
>>
>> I think that the above is irrelevant in this context. As an example, SCSI
>> devices have always shared a pool of tags across multiple logical
>> units. This behavior has not been changed by the conversion of the
>> SCSI core from the legacy block layer to blk-mq.
>>
>> For other (hardware) block devices it didn't matter either that there
>> was no upper limit to the number of requests the legacy block layer
>> could allocate. All hardware block devices I know support fixed size
>> queues for queuing requests to the block device.
>
> I am not sure I understand your point. Those lower layers always were
> able to get at least one request per request_queue. They can do whatever
> they want with it after that. This change removes that guarantee for
> blk-mq in some cases, right? I just don't think you can readily conclude
> that is "safe" by appealing to the legacy behavior, that's all.
Hi Keith,
How requests were allocated in the legacy block layer is irrelevant in
this context. The patch I posted affects the tag allocation strategy.
Tag allocation happened in the legacy block layer by calling
blk_queue_start_tag(). From Linux kernel v4.20:
/**
* blk_queue_start_tag - find a free tag and assign it
* @q: the request queue for the device
* @rq: the block request that needs tagging
* [ ... ]
**/
That function supports sharing tags between request queues but did not
attempt to be fair at all. This is how the SCSI core in Linux kernel
v4.20 sets up tag sharing between request queues (from
drivers/scsi/scsi_scan.c):
	if (!shost_use_blk_mq(sdev->host)) {
		blk_queue_init_tags(sdev->request_queue,
				    sdev->host->cmd_per_lun, shost->bqt,
				    shost->hostt->tag_alloc_policy);
	}
blk-mq has always had a fairness algorithm in case a tag set is shared
across request queues. If a tag set is shared across request queues, the
number of tags per request queue is restricted to the total number of
tags divided by the number of users (ignoring rounding). From
block/blk-mq.c in the latest kernel:
depth = max((bt->sb.depth + users - 1) / users, 4U);
What my patch does is to remove this fairness guarantee. There was no
equivalent of this fairness guarantee in the legacy block layer.
Thanks,
Bart.
* Re: [PATCH] block: Improve shared tag set performance
From: Ming Lei @ 2023-10-21 1:31 UTC (permalink / raw)
To: Bart Van Assche
Cc: Keith Busch, Christoph Hellwig, Jens Axboe, linux-block,
Martin K . Petersen, Damien Le Moal, Yu Kuai, Ed Tsai
On Fri, Oct 20, 2023 at 10:54:59AM -0700, Bart Van Assche wrote:
>
> On 10/20/23 10:09, Keith Busch wrote:
> > On Fri, Oct 20, 2023 at 09:45:53AM -0700, Bart Van Assche wrote:
> > >
> > > On 10/20/23 09:25, Keith Busch wrote:
> > > > The legacy block request layer didn't have a tag resource shared among
> > > > multiple request queues. Each queue had their own mempool for allocating
> > > > requests. The mempool, I think, would always guarantee everyone could
> > > > get at least one request.
> > >
> > > I think that the above is irrelevant in this context. As an example, SCSI
> > > devices have always shared a pool of tags across multiple logical
> > > units. This behavior has not been changed by the conversion of the
> > > SCSI core from the legacy block layer to blk-mq.
> > >
> > > For other (hardware) block devices it didn't matter either that there
> > > was no upper limit to the number of requests the legacy block layer
> > > could allocate. All hardware block devices I know support fixed size
> > > queues for queuing requests to the block device.
> >
> > I am not sure I understand your point. Those lower layers always were
> > able to get at least one request per request_queue. They can do whatever
> > they want with it after that. This change removes that guarantee for
> > blk-mq in some cases, right? I just don't think you can readily conclude
> > that is "safe" by appealing to the legacy behavior, that's all.
>
> Hi Keith,
>
> How requests were allocated in the legacy block layer is irrelevant in
> this context. The patch I posted affects the tag allocation strategy.
> Tag allocation happened in the legacy block layer by calling
> blk_queue_start_tag(). From Linux kernel v4.20:
>
> /**
> * blk_queue_start_tag - find a free tag and assign it
> * @q: the request queue for the device
> * @rq: the block request that needs tagging
> * [ ... ]
> **/
>
> That function supports sharing tags between request queues but did not
> attempt to be fair at all. This is how the SCSI core in Linux kernel v4.20
> sets up tag sharing between request queues (from drivers/scsi/scsi_scan.c):
>
> 	if (!shost_use_blk_mq(sdev->host)) {
> 		blk_queue_init_tags(sdev->request_queue,
> 				    sdev->host->cmd_per_lun, shost->bqt,
> 				    shost->hostt->tag_alloc_policy);
> 	}
>
> blk-mq has always had a fairness algorithm in case a tag set is shared
> across request queues. If a tag set is shared across request queues, the
> number of tags per request queue is restricted to the total number of
> tags divided by the number of users (ignoring rounding). From
> block/blk-mq.c in the latest kernel:
>
> depth = max((bt->sb.depth + users - 1) / users, 4U);
>
> What my patch does is to remove this fairness guarantee. There was no
> equivalent of this fairness guarantee in the legacy block layer.
If two LUNs are attached to the same host, one slow and the other fast,
the slow LUN can easily slow down the fast LUN without this fairness
algorithm.
Your motivation is that "One of these logical units (WLUN) is used
to submit control commands, e.g. START STOP UNIT. If any request is
submitted to the WLUN, the queue depth is reduced from 31 to 15 or
lower for data LUNs." I guess one simple fix is to not account queues
of this non-IO LUN as active queues?
Thanks,
Ming
* Re: [PATCH] block: Improve shared tag set performance
From: Bart Van Assche @ 2023-10-21 16:13 UTC (permalink / raw)
To: Ming Lei
Cc: Keith Busch, Christoph Hellwig, Jens Axboe, linux-block,
Martin K . Petersen, Damien Le Moal, Yu Kuai, Ed Tsai
On 10/20/23 18:31, Ming Lei wrote:
> If two LUNs are attached to the same host, one slow and the other fast,
> the slow LUN can easily slow down the fast LUN without this fairness
> algorithm.
>
> Your motivation is that "One of these logical units (WLUN) is used
> to submit control commands, e.g. START STOP UNIT. If any request is
> submitted to the WLUN, the queue depth is reduced from 31 to 15 or
> lower for data LUNs." I guess one simple fix is to not account queues
> of this non-IO LUN as active queues?
Hi Ming,
For fast storage devices (e.g. UFS), any time spent in an algorithm for
fair sharing reduces IOPS. Fair sharing is beneficial if there are big
differences in request processing latency between different request
queues. Whether or not the fair sharing algorithm is improved, how about
making it easy to disable fair sharing, e.g. with something like the
untested patch below? I think that will work better than ignoring fair
sharing per LUN. UFS devices support multiple logical units, and with
the current fair sharing approach it takes a long time before tags are
taken away from an inactive LUN (the request queue timeout).
Thanks,
Bart.
diff --git a/block/blk-mq.h b/block/blk-mq.h
index f75a9ecfebde..b06b161d06de 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -416,7 +416,8 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
{
unsigned int depth, users;
- if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
+ if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) ||
+ hctx->queue->disable_fair_sharing)
return true;
/*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index eef450f25982..63b04cf65887 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -523,6 +523,7 @@ struct request_queue {
struct mutex debugfs_mutex;
bool mq_sysfs_init_done;
+ bool disable_fair_sharing;
};
/* Keep blk_queue_flag_name[] in sync with the definitions below */
* Re: [PATCH] block: Improve shared tag set performance
From: Ming Lei @ 2023-10-23 3:44 UTC (permalink / raw)
To: Bart Van Assche
Cc: Keith Busch, Christoph Hellwig, Jens Axboe, linux-block,
Martin K . Petersen, Damien Le Moal, Yu Kuai, Ed Tsai, ming.lei
On Sat, Oct 21, 2023 at 09:13:38AM -0700, Bart Van Assche wrote:
> On 10/20/23 18:31, Ming Lei wrote:
> > If two LUNs are attached to the same host, one slow and the other fast,
> > the slow LUN can easily slow down the fast LUN without this fairness
> > algorithm.
> >
> > Your motivation is that "One of these logical units (WLUN) is used
> > to submit control commands, e.g. START STOP UNIT. If any request is
> > submitted to the WLUN, the queue depth is reduced from 31 to 15 or
> > lower for data LUNs." I guess one simple fix is to not account queues
> > of this non-IO LUN as active queues?
>
> Hi Ming,
>
> For fast storage devices (e.g. UFS), any time spent in an algorithm for
> fair sharing reduces IOPS. Fair sharing is beneficial if there are big
> differences in request processing latency between different request
> queues. Whether or not the fair sharing algorithm is improved, how about
> making it easy to disable fair sharing, e.g. with something like the
> untested patch below? I think that will work better than ignoring fair
> sharing per LUN. UFS devices support multiple logical units, and with
> the current fair sharing approach it takes a long time before tags are
> taken away from an inactive LUN (the request queue timeout).
>
> Thanks,
>
> Bart.
>
>
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index f75a9ecfebde..b06b161d06de 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -416,7 +416,8 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
> {
> unsigned int depth, users;
>
> - if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
> + if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) ||
> + hctx->queue->disable_fair_sharing)
Maybe you can propagate this flag into hctx->flags, so that dereferencing
hctx->queue->disable_fair_sharing can be avoided in the fast path.
> return true;
>
> /*
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index eef450f25982..63b04cf65887 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -523,6 +523,7 @@ struct request_queue {
> struct mutex debugfs_mutex;
>
> bool mq_sysfs_init_done;
> + bool disable_fair_sharing;
You also need to bypass blk_mq_tag_busy() & blk_mq_tag_idle() when
disable_fair_sharing is set, which should only be the case for non-I/O
queues such as the UFS WLUN, and maybe NVMe connection queues.
Thanks,
Ming
* Re: [PATCH] block: Improve shared tag set performance
From: Bart Van Assche @ 2023-10-20 19:11 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Martin K . Petersen, Ming Lei,
Keith Busch, Damien Le Moal, Yu Kuai, Ed Tsai
On 10/18/23 11:00, Bart Van Assche wrote:
> -/*
> - * For shared tag users, we track the number of currently active users
> - * and attempt to provide a fair share of the tag depth for each of them.
> - */
> -static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
> - struct sbitmap_queue *bt)
> -{
> - unsigned int depth, users;
> -
> - if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
> - return true;
> -
> - /*
> - * Don't try dividing an ant
> - */
> - if (bt->sb.depth == 1)
> - return true;
> -
> - if (blk_mq_is_shared_tags(hctx->flags)) {
> - struct request_queue *q = hctx->queue;
> -
> - if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
> - return true;
> - } else {
> - if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> - return true;
> - }
> -
> - users = READ_ONCE(hctx->tags->active_queues);
> - if (!users)
> - return true;
> -
> - /*
> - * Allow at least some tags
> - */
> - depth = max((bt->sb.depth + users - 1) / users, 4U);
> - return __blk_mq_active_requests(hctx) < depth;
> -}
(replying to my own email)
Hi Jens,
Do you perhaps remember why the above code was introduced by commit
0d2602ca30e4 ("blk-mq: improve support for shared tags maps") in 2014
(nine years ago)?
Thanks,
Bart.
* Re: [PATCH] block: Improve shared tag set performance
From: Yu Kuai @ 2023-10-21 7:32 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, Christoph Hellwig, Martin K . Petersen, Ming Lei,
Keith Busch, Damien Le Moal, Yu Kuai, Ed Tsai, yukuai (C),
yukuai (C)
Hi,
On 2023/10/19 2:00, Bart Van Assche wrote:
> Remove the code for fair tag sharing because it significantly hurts
> performance for UFS devices. Removing this code is safe because the
> legacy block layer worked fine without any equivalent fairness
> algorithm.
>
> This algorithm hurts performance for UFS devices because UFS devices
> have multiple logical units. One of these logical units (WLUN) is used
> to submit control commands, e.g. START STOP UNIT. If any request is
> submitted to the WLUN, the queue depth is reduced from 31 to 15 or
> lower for data LUNs.
>
> See also https://lore.kernel.org/linux-scsi/20221229030645.11558-1-ed.tsai@mediatek.com/
>
> Note: it has been attempted to rework this algorithm. See also "[PATCH
> RFC 0/7] blk-mq: improve tag fair sharing"
> (https://lore.kernel.org/linux-block/20230618160738.54385-1-yukuai1@huaweicloud.com/).
> Given the complexity of that patch series, I do not expect that patch
> series to be merged.
Sorry for the huge delay; I struggled to implement a smooth algorithm to
borrow tags and return borrowed tags, and later I put this on ice and
focused on other stuff.
I had an idea to implement a state machine, but the amount of code was
excessive and I gave up. Later I implemented a simple version and tested
it in your case (32 tags and 2 sharing nodes); the results look good (see
below). However, I'm not confident that this works well in general.
Anyway, I'll send a new RFC version for this; please let me know if you
still think this approach is unacceptable.
Thanks,
Kuai
Test script:
[global]
ioengine=libaio
iodepth=2
bs=4k
direct=1
rw=randrw
group_reporting
[sda]
numjobs=32
filename=/dev/sda
[sdb]
numjobs=1
filename=/dev/sdb
Test results, monitored via the new debugfs entry shared_tag_info:
time   active      available
       sda  sdb    sda  sdb
0        0    0     32   32
1       16    2     16   16   -> start fair sharing
2       19    2     20   16
3       24    2     24   16
4       26    2     28   16   -> borrow 32/8=4 tags each round
5       28    2     28   16   -> save at least 4 tags for sdb
...
>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Martin K. Petersen <martin.petersen@oracle.com>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Keith Busch <kbusch@kernel.org>
> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> Cc: Yu Kuai <yukuai1@huaweicloud.com>
> Cc: Ed Tsai <ed.tsai@mediatek.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> block/blk-mq-tag.c | 4 ----
> block/blk-mq.c | 3 ---
> block/blk-mq.h | 39 ---------------------------------------
> 3 files changed, 46 deletions(-)
>
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index cc57e2dd9a0b..25334bfcabf8 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -105,10 +105,6 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
> static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
> struct sbitmap_queue *bt)
> {
> - if (!data->q->elevator && !(data->flags & BLK_MQ_REQ_RESERVED) &&
> - !hctx_may_queue(data->hctx, bt))
> - return BLK_MQ_NO_TAG;
> -
> if (data->shallow_depth)
> return sbitmap_queue_get_shallow(bt, data->shallow_depth);
> else
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index e2d11183f62e..502dafa76716 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1760,9 +1760,6 @@ bool __blk_mq_alloc_driver_tag(struct request *rq)
> if (blk_mq_tag_is_reserved(rq->mq_hctx->sched_tags, rq->internal_tag)) {
> bt = &rq->mq_hctx->tags->breserved_tags;
> tag_offset = 0;
> - } else {
> - if (!hctx_may_queue(rq->mq_hctx, bt))
> - return false;
> }
>
> tag = __sbitmap_queue_get(bt);
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index f75a9ecfebde..14a22f6d3fdf 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -407,45 +407,6 @@ static inline void blk_mq_free_requests(struct list_head *list)
> }
> }
>
> -/*
> - * For shared tag users, we track the number of currently active users
> - * and attempt to provide a fair share of the tag depth for each of them.
> - */
> -static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
> - struct sbitmap_queue *bt)
> -{
> - unsigned int depth, users;
> -
> - if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
> - return true;
> -
> - /*
> - * Don't try dividing an ant
> - */
> - if (bt->sb.depth == 1)
> - return true;
> -
> - if (blk_mq_is_shared_tags(hctx->flags)) {
> - struct request_queue *q = hctx->queue;
> -
> - if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
> - return true;
> - } else {
> - if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> - return true;
> - }
> -
> - users = READ_ONCE(hctx->tags->active_queues);
> - if (!users)
> - return true;
> -
> - /*
> - * Allow at least some tags
> - */
> - depth = max((bt->sb.depth + users - 1) / users, 4U);
> - return __blk_mq_active_requests(hctx) < depth;
> -}
> -
> /* run the code block in @dispatch_ops with rcu/srcu read lock held */
> #define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops) \
> do { \
>
> .
>
* Re: [PATCH] block: Improve shared tag set performance
From: Bart Van Assche @ 2023-10-21 16:21 UTC (permalink / raw)
To: Yu Kuai, Jens Axboe
Cc: linux-block, Christoph Hellwig, Martin K . Petersen, Ming Lei,
Keith Busch, Damien Le Moal, Ed Tsai, yukuai (C)
On 10/21/23 00:32, Yu Kuai wrote:
> Sorry for the huge delay; I struggled to implement a smooth algorithm to
> borrow tags and return borrowed tags, and later I put this on ice and
> focused on other stuff.
>
> I had an idea to implement a state machine, but the amount of code was
> excessive and I gave up. Later I implemented a simple version and tested
> it in your case (32 tags and 2 sharing nodes); the results look good (see
> below). However, I'm not confident that this works well in general.
>
> Anyway, I'll send a new RFC version for this; please let me know if you
> still think this approach is unacceptable.
>
> Thanks,
> Kuai
>
> Test script:
>
> [global]
> ioengine=libaio
> iodepth=2
> bs=4k
> direct=1
> rw=randrw
> group_reporting
>
> [sda]
> numjobs=32
> filename=/dev/sda
>
> [sdb]
> numjobs=1
> filename=/dev/sdb
>
> Test result, by monitoring the new debugfs entry shared_tag_info:
>
> time    active      available
>         sda  sdb    sda  sdb
> 0       0    0      32   32
> 1       16   2      16   16   -> start fair sharing
> 2       19   2      20   16
> 3       24   2      24   16
> 4       26   2      28   16   -> borrow 32/8 = 4 tags each round
> 5       28   2      28   16   -> keep at least 4 tags for sdb
Hi Yu,
Thank you for having shared these results. What is the unit of the
numbers in the time column?
In the above I see that more tags are assigned to sda than to sdb
although I/O is being submitted to both LUNs. I think the current
algorithm defines fairness as dividing tags fairly across active
LUNs. Do the above results show that tags are divided per active job
instead of per active LUN? If so, I'm not sure that everyone will agree
that this is a fair way to distribute tags ...
Thanks,
Bart.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] block: Improve shared tag set performance
2023-10-21 16:21 ` Bart Van Assche
@ 2023-10-23 1:11 ` Yu Kuai
0 siblings, 0 replies; 15+ messages in thread
From: Yu Kuai @ 2023-10-23 1:11 UTC (permalink / raw)
To: Bart Van Assche, Yu Kuai, Jens Axboe
Cc: linux-block, Christoph Hellwig, Martin K . Petersen, Ming Lei,
Keith Busch, Damien Le Moal, Ed Tsai, yukuai (C)
Hi,
在 2023/10/22 0:21, Bart Van Assche 写道:
>
> On 10/21/23 00:32, Yu Kuai wrote:
>> Sorry for the huge delay; I struggled with implementing a smooth
>> algorithm for borrowing tags and returning borrowed tags, and later put
>> this on ice to focus on other work.
>>
>> I had an idea to implement a state machine, but the amount of code
>> was excessive and I gave up. Later I implemented a simple version and
>> tested it in your case (32 tags, 2 sharing nodes); the result looks
>> good (see below). However, I'm not confident this works well in general.
>>
>> Anyway, I'll send a new RFC version for this; please let me know if
>> you still think this approach is unacceptable.
>>
>> Thanks,
>> Kuai
>>
>> Test script:
>>
>> [global]
>> ioengine=libaio
>> iodepth=2
>> bs=4k
>> direct=1
>> rw=randrw
>> group_reporting
>>
>> [sda]
>> numjobs=32
>> filename=/dev/sda
>>
>> [sdb]
>> numjobs=1
>> filename=/dev/sdb
>>
>> Test result, by monitoring the new debugfs entry shared_tag_info:
>>
>> time    active      available
>>         sda  sdb    sda  sdb
>> 0       0    0      32   32
>> 1       16   2      16   16   -> start fair sharing
>> 2       19   2      20   16
>> 3       24   2      24   16
>> 4       26   2      28   16   -> borrow 32/8 = 4 tags each round
>> 5       28   2      28   16   -> keep at least 4 tags for sdb
>
> Hi Yu,
>
> Thank you for having shared these results. What is the unit of the
> numbers in the time column?
I added a timer to blk_mq_tags; the timer function implements borrowing
tags and returning borrowed tags. The timer is started when one node is
busy and expires after HZ jiffies, so each step in the time column is
one second.
>
> In the above I see that more tags are assigned to sda than to sdb
> although I/O is being submitted to both LUNs. I think the current
> algorithm defines fairness as dividing tags fairly across active
> LUNs. Do the above results show that tags are divided per active job
> instead of per active LUN? If so, I'm not sure that everyone will agree
> that this is a fair way to distribute tags ...
Yes, the tags are divided per active LUN, or more precisely per
request_queue or hctx that shares the tag set.
Thanks,
Kuai
>
> Thanks,
>
> Bart.
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH] block: Improve shared tag set performance
@ 2023-01-02 17:39 Bart Van Assche
0 siblings, 0 replies; 15+ messages in thread
From: Bart Van Assche @ 2023-01-02 17:39 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Bart Van Assche, Christoph Hellwig,
Martin K . Petersen, Ming Lei, Keith Busch, Damien Le Moal,
Ed Tsai
Remove the code for fair tag sharing because it significantly hurts
performance for UFS devices. Removing this code is safe because the
legacy block layer worked fine without any equivalent fairness
algorithm.
This algorithm hurts performance for UFS devices because UFS devices
have multiple logical units. One of these logical units (WLUN) is used
to submit control commands, e.g. START STOP UNIT. If any request is
submitted to the WLUN, the queue depth is reduced from 31 to 15 or
lower for data LUNs.
See also https://lore.kernel.org/linux-scsi/20221229030645.11558-1-ed.tsai@mediatek.com/
Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Ed Tsai <ed.tsai@mediatek.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-mq-debugfs.c | 2 --
block/blk-mq-tag.c | 8 --------
block/blk-mq.c | 3 ---
block/blk-mq.h | 40 ----------------------------------------
4 files changed, 53 deletions(-)
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index bd942341b638..00af8d2c8816 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -426,8 +426,6 @@ static void blk_mq_debugfs_tags_show(struct seq_file *m,
{
seq_printf(m, "nr_tags=%u\n", tags->nr_tags);
seq_printf(m, "nr_reserved_tags=%u\n", tags->nr_reserved_tags);
- seq_printf(m, "active_queues=%d\n",
- atomic_read(&tags->active_queues));
seq_puts(m, "\nbitmap_tags:\n");
sbitmap_queue_show(&tags->bitmap_tags, m);
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 9eb968e14d31..e9243cae41c7 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -53,8 +53,6 @@ void __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state);
}
- users = atomic_inc_return(&hctx->tags->active_queues);
-
blk_mq_update_wake_batch(hctx->tags, users);
}
@@ -88,8 +86,6 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
return;
}
- users = atomic_dec_return(&tags->active_queues);
-
blk_mq_update_wake_batch(tags, users);
blk_mq_tag_wakeup_all(tags, false);
@@ -98,10 +94,6 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
struct sbitmap_queue *bt)
{
- if (!data->q->elevator && !(data->flags & BLK_MQ_REQ_RESERVED) &&
- !hctx_may_queue(data->hctx, bt))
- return BLK_MQ_NO_TAG;
-
if (data->shallow_depth)
return sbitmap_queue_get_shallow(bt, data->shallow_depth);
else
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4a7805390cf8..c570f134850a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1698,9 +1698,6 @@ static bool __blk_mq_alloc_driver_tag(struct request *rq)
if (blk_mq_tag_is_reserved(rq->mq_hctx->sched_tags, rq->internal_tag)) {
bt = &rq->mq_hctx->tags->breserved_tags;
tag_offset = 0;
- } else {
- if (!hctx_may_queue(rq->mq_hctx, bt))
- return false;
}
tag = __sbitmap_queue_get(bt);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 0b2870839cdd..908f830931f0 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -334,46 +334,6 @@ static inline void blk_mq_free_requests(struct list_head *list)
}
}
-/*
- * For shared tag users, we track the number of currently active users
- * and attempt to provide a fair share of the tag depth for each of them.
- */
-static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
- struct sbitmap_queue *bt)
-{
- unsigned int depth, users;
-
- if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
- return true;
-
- /*
- * Don't try dividing an ant
- */
- if (bt->sb.depth == 1)
- return true;
-
- if (blk_mq_is_shared_tags(hctx->flags)) {
- struct request_queue *q = hctx->queue;
-
- if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
- return true;
- } else {
- if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
- return true;
- }
-
- users = atomic_read(&hctx->tags->active_queues);
-
- if (!users)
- return true;
-
- /*
- * Allow at least some tags
- */
- depth = max((bt->sb.depth + users - 1) / users, 4U);
- return __blk_mq_active_requests(hctx) < depth;
-}
-
/* run the code block in @dispatch_ops with rcu/srcu read lock held */
#define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops) \
do { \
^ permalink raw reply related [flat|nested] 15+ messages in thread
end of thread, other threads:[~2023-10-23 3:45 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-18 18:00 [PATCH] block: Improve shared tag set performance Bart Van Assche
2023-10-20 4:41 ` Christoph Hellwig
2023-10-20 16:17 ` Bart Van Assche
2023-10-20 16:25 ` Keith Busch
2023-10-20 16:45 ` Bart Van Assche
2023-10-20 17:09 ` Keith Busch
2023-10-20 17:54 ` Bart Van Assche
2023-10-21 1:31 ` Ming Lei
2023-10-21 16:13 ` Bart Van Assche
2023-10-23 3:44 ` Ming Lei
2023-10-20 19:11 ` Bart Van Assche
2023-10-21 7:32 ` Yu Kuai
2023-10-21 16:21 ` Bart Van Assche
2023-10-23 1:11 ` Yu Kuai
-- strict thread matches above, loose matches on Subject: below --
2023-01-02 17:39 Bart Van Assche
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox