Re: [PATCH V6 4/5] blk-mq-sched: improve dispatching from sw queue

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Omar Sandoval <osandov@osandov.com>
To: Ming Lei <ming.lei@redhat.com>
Cc: Jens Axboe <axboe@fb.com>,
	linux-block@vger.kernel.org,
	Christoph Hellwig <hch@infradead.org>,
	Mike Snitzer <snitzer@redhat.com>,
	dm-devel@redhat.com, Bart Van Assche <bart.vanassche@sandisk.com>,
	Laurence Oberman <loberman@redhat.com>,
	Paolo Valente <paolo.valente@linaro.org>,
	Oleksandr Natalenko <oleksandr@natalenko.name>,
	Tom Nguyen <tom81094@gmail.com>,
	linux-kernel@vger.kernel.org, linux-scsi@vger.kernel.org,
	Omar Sandoval <osandov@fb.com>
Subject: Re: [PATCH V6 4/5] blk-mq-sched: improve dispatching from sw queue
Date: Tue, 10 Oct 2017 11:23:45 -0700	[thread overview]
Message-ID: <20171010182345.GD30738@vader.DHCP.thefacebook.com> (raw)
In-Reply-To: <20171009112424.30524-5-ming.lei@redhat.com>

On Mon, Oct 09, 2017 at 07:24:23PM +0800, Ming Lei wrote:
> SCSI devices use host-wide tagset, and the shared driver tag space is
> often quite big. Meantime there is also queue depth for each lun(
> .cmd_per_lun), which is often small, for example, on both lpfc and
> qla2xxx, .cmd_per_lun is just 3.
> 
> So lots of requests may stay in sw queue, and we always flush all
> belonging to same hw queue and dispatch them all to driver, unfortunately
> it is easy to cause queue busy because of the small .cmd_per_lun.
> Once these requests are flushed out, they have to stay in hctx->dispatch,
> and no bio merge can participate into these requests, and sequential IO
> performance is hurt a lot.
> 
> This patch introduces blk_mq_dequeue_from_ctx for dequeuing request from
> sw queue so that we can dispatch them in scheduler's way, then we can
> avoid to dequeue too many requests from sw queue when ->dispatch isn't
> flushed completely.
> 
> This patch improves dispatching from sw queue when there is per-request-queue
> queue depth by taking request one by one from sw queue, just like the way
> of IO scheduler.

This still didn't address Jens' concern about using q->queue_depth as
the heuristic for whether to do the full sw queue flush or one-by-one
dispatch. The EWMA approach is a bit too complex for now, can you please
try the heuristic of whether the driver ever returned BLK_STS_RESOURCE?

> Reviewed-by: Omar Sandoval <osandov@fb.com>
> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-mq-sched.c   | 50 ++++++++++++++++++++++++++++++++++++++++++++++++--
>  block/blk-mq.c         | 39 +++++++++++++++++++++++++++++++++++++++
>  block/blk-mq.h         |  2 ++
>  include/linux/blk-mq.h |  2 ++
>  4 files changed, 91 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index be29ba849408..14b354f617e5 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -104,6 +104,39 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>  	} while (blk_mq_dispatch_rq_list(q, &rq_list));
>  }
>  
> +static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
> +					  struct blk_mq_ctx *ctx)
> +{
> +	unsigned idx = ctx->index_hw;
> +
> +	if (++idx == hctx->nr_ctx)
> +		idx = 0;
> +
> +	return hctx->ctxs[idx];
> +}
> +
> +static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
> +{
> +	struct request_queue *q = hctx->queue;
> +	LIST_HEAD(rq_list);
> +	struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
> +
> +	do {
> +		struct request *rq;
> +
> +		rq = blk_mq_dequeue_from_ctx(hctx, ctx);
> +		if (!rq)
> +			break;
> +		list_add(&rq->queuelist, &rq_list);
> +
> +		/* round robin for fair dispatch */
> +		ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);
> +
> +	} while (blk_mq_dispatch_rq_list(q, &rq_list));
> +
> +	WRITE_ONCE(hctx->dispatch_from, ctx);
> +}
> +
>  void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>  {
>  	struct request_queue *q = hctx->queue;
> @@ -143,10 +176,23 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>  	 */
>  	if (!list_empty(&rq_list)) {
>  		blk_mq_sched_mark_restart_hctx(hctx);
> -		if (blk_mq_dispatch_rq_list(q, &rq_list) && has_sched_dispatch)
> -			blk_mq_do_dispatch_sched(hctx);
> +		if (blk_mq_dispatch_rq_list(q, &rq_list)) {
> +			if (has_sched_dispatch)
> +				blk_mq_do_dispatch_sched(hctx);
> +			else
> +				blk_mq_do_dispatch_ctx(hctx);
> +		}
>  	} else if (has_sched_dispatch) {
>  		blk_mq_do_dispatch_sched(hctx);
> +	} else if (q->queue_depth) {
> +		/*
> +		 * If there is per-request_queue depth, we dequeue
> +		 * request one by one from sw queue for avoiding to mess
> +		 * up I/O merge when dispatch runs out of resource, which
> +		 * can be triggered easily when there is per-request_queue
> +		 * queue depth or .cmd_per_lun, such as SCSI device.
> +		 */
> +		blk_mq_do_dispatch_ctx(hctx);
>  	} else {
>  		blk_mq_flush_busy_ctxs(hctx, &rq_list);
>  		blk_mq_dispatch_rq_list(q, &rq_list);
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 076cbab9c3e0..394cb75d66fa 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -911,6 +911,45 @@ void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_flush_busy_ctxs);
>  
> +struct dispatch_rq_data {
> +	struct blk_mq_hw_ctx *hctx;
> +	struct request *rq;
> +};
> +
> +static bool dispatch_rq_from_ctx(struct sbitmap *sb, unsigned int bitnr,
> +		void *data)
> +{
> +	struct dispatch_rq_data *dispatch_data = data;
> +	struct blk_mq_hw_ctx *hctx = dispatch_data->hctx;
> +	struct blk_mq_ctx *ctx = hctx->ctxs[bitnr];
> +
> +	spin_lock(&ctx->lock);
> +	if (unlikely(!list_empty(&ctx->rq_list))) {
> +		dispatch_data->rq = list_entry_rq(ctx->rq_list.next);
> +		list_del_init(&dispatch_data->rq->queuelist);
> +		if (list_empty(&ctx->rq_list))
> +			sbitmap_clear_bit(sb, bitnr);
> +	}
> +	spin_unlock(&ctx->lock);
> +
> +	return !dispatch_data->rq;
> +}
> +
> +struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx,
> +					struct blk_mq_ctx *start)
> +{
> +	unsigned off = start ? start->index_hw : 0;
> +	struct dispatch_rq_data data = {
> +		.hctx = hctx,
> +		.rq   = NULL,
> +	};
> +
> +	__sbitmap_for_each_set(&hctx->ctx_map, off,
> +			       dispatch_rq_from_ctx, &data);
> +
> +	return data.rq;
> +}
> +
>  static inline unsigned int queued_to_index(unsigned int queued)
>  {
>  	if (!queued)
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index ef15b3414da5..231cfb0d973b 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -35,6 +35,8 @@ void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
>  bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx);
>  bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
>  				bool wait);
> +struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx,
> +					struct blk_mq_ctx *start);
>  
>  /*
>   * Internal helpers for allocating/freeing the request map
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 50c6485cb04f..7b7a366a97f3 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -30,6 +30,8 @@ struct blk_mq_hw_ctx {
>  
>  	struct sbitmap		ctx_map;
>  
> +	struct blk_mq_ctx	*dispatch_from;
> +
>  	struct blk_mq_ctx	**ctxs;
>  	unsigned int		nr_ctx;
>  
> -- 
> 2.9.5
>

next prev parent reply	other threads:[~2017-10-10 18:23 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-10-09 11:24 [PATCH V6 0/5] blk-mq-sched: improve sequential I/O performance Ming Lei
2017-10-09 11:24 ` [PATCH V6 1/5] blk-mq-sched: fix scheduler bad performance Ming Lei
2017-10-10 18:10   ` Omar Sandoval
2017-10-09 11:24 ` [PATCH V6 2/5] blk-mq-sched: move actual dispatching into one helper Ming Lei
2017-10-09 11:24 ` [PATCH V6 3/5] sbitmap: introduce __sbitmap_for_each_set() Ming Lei
2017-10-10 18:15   ` Omar Sandoval
2017-10-09 11:24 ` [PATCH V6 4/5] blk-mq-sched: improve dispatching from sw queue Ming Lei
2017-10-10 18:23   ` Omar Sandoval [this message]
2017-10-12 10:01     ` Ming Lei
2017-10-12 14:52       ` Jens Axboe
2017-10-12 15:22         ` Ming Lei
2017-10-12 15:24           ` Jens Axboe
2017-10-12 15:33       ` Bart Van Assche
2017-10-12 15:33         ` Bart Van Assche
2017-10-12 15:37         ` Jens Axboe
2017-10-12 15:49           ` Ming Lei
2017-10-12 15:49             ` Ming Lei
2017-10-09 11:24 ` [PATCH V6 5/5] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed Ming Lei
2017-10-10 18:26   ` Omar Sandoval

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20171010182345.GD30738@vader.DHCP.thefacebook.com \
    --to=osandov@osandov.com \
    --cc=axboe@fb.com \
    --cc=bart.vanassche@sandisk.com \
    --cc=dm-devel@redhat.com \
    --cc=hch@infradead.org \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-scsi@vger.kernel.org \
    --cc=loberman@redhat.com \
    --cc=ming.lei@redhat.com \
    --cc=oleksandr@natalenko.name \
    --cc=osandov@fb.com \
    --cc=paolo.valente@linaro.org \
    --cc=snitzer@redhat.com \
    --cc=tom81094@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.