Re: [PATCH BUGFIX V3] block, bfq: add requeue-request hook

public inbox for linux-block@vger.kernel.org
 help / color / mirror / Atom feed

From: Ming Lei <ming.lei@redhat.com>
To: Paolo Valente <paolo.valente@linaro.org>
Cc: Jens Axboe <axboe@kernel.dk>,
	linux-block <linux-block@vger.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Ulf Hansson <ulf.hansson@linaro.org>,
	Mark Brown <broonie@kernel.org>,
	Linus Walleij <linus.walleij@linaro.org>,
	'Paolo Valente' via bfq-iosched <bfq-iosched@googlegroups.com>,
	Oleksandr Natalenko <oleksandr@natalenko.name>,
	Alban Browaeys <alban.browaeys@gmail.com>,
	ivan@ludios.org, 169364@studenti.unimore.it,
	holger@applied-asynchrony.com, efault@gmx.de,
	Serena Ziviani <ziviani.serena@gmail.com>
Subject: Re: [PATCH BUGFIX V3] block, bfq: add requeue-request hook
Date: Sat, 24 Feb 2018 00:17:57 +0800	[thread overview]
Message-ID: <20180223161751.GA17466@ming.t460p> (raw)
In-Reply-To: <8296094C-C3DB-434F-A83D-64CBDE927D09@linaro.org>

Hi Paolo,

On Fri, Feb 23, 2018 at 04:41:36PM +0100, Paolo Valente wrote:
> 
> 
> > Il giorno 23 feb 2018, alle ore 16:07, Ming Lei <ming.lei@redhat.com> ha scritto:
> > 
> > Hi Paolo,
> > 
> > On Wed, Feb 07, 2018 at 10:19:20PM +0100, Paolo Valente wrote:
> >> Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via
> >> RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device
> >> be re-inserted into the active I/O scheduler for that device. As a
> > 
> > No, this behaviour isn't related with commit a6a252e64914, and
> > it has been there since blk_mq_requeue_request() is introduced.
> > 
> 
> Hi Ming,
> actually, we wrote the above statement after simply following the call
> chain that led to the failure.  In this respect, the change in commit
> a6a252e64914:
> 
>  static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx,
> +                                      bool has_sched,
>                                        struct request *rq)
>  {
> -       if (rq->tag == -1) {
> +       /* dispatch flush rq directly */
> +       if (rq->rq_flags & RQF_FLUSH_SEQ) {
> +               spin_lock(&hctx->lock);
> +               list_add(&rq->queuelist, &hctx->dispatch);
> +               spin_unlock(&hctx->lock);
> +               return true;
> +       }
> +
> +       if (has_sched) {
>                 rq->rq_flags |= RQF_SORTED;
> -               return false;
> +               WARN_ON(rq->tag != -1);
>         }
>  
> -       /*
> -        * If we already have a real request tag, send directly to
> -        * the dispatch list.
> -        */
> -       spin_lock(&hctx->lock);
> -       list_add(&rq->queuelist, &hctx->dispatch);
> -       spin_unlock(&hctx->lock);
> -       return true;
> +       return false;
>  }
> 
> makes blk_mq_sched_bypass_insert return false for all non-flush
> requests.  From that, the anomaly described in our commit follows, for
> bfq any stateful scheduler that waits for the completion of requests
> that passed through it.  I'm elaborating again a little bit on this in
> my replies to your next points below.

Before a6a252e64914, follows blk_mq_sched_bypass_insert()

       if (rq->tag == -1) {
               rq->rq_flags |= RQF_SORTED;
               return false;
	   }

       /*
        * If we already have a real request tag, send directly to
        * the dispatch list.
        */
       spin_lock(&hctx->lock);
       list_add(&rq->queuelist, &hctx->dispatch);
       spin_unlock(&hctx->lock);
       return true;

This function still returns false for all non-flush request, so nothing
changes wrt. this kind of handling.

> 
> I don't mean that this change is an error, it simply sends a stateful
> scheduler in an inconsistent state, unless the scheduler properly
> handles the requeue that precedes the re-insertion into the
> scheduler.
> 
> If this clarifies the situation, but there is still some misleading
> statement in the commit, just let me better understand, and I'll be
> glad to rectify it, if possible somehow.
> 
> 
> > And you can see blk_mq_requeue_request() is called by lots of drivers,
> > especially it is often used in error handler, see SCSI's example.
> > 
> >> consequence, I/O schedulers may get the same request inserted again,
> >> even several times, without a finish_request invoked on that request
> >> before each re-insertion.
> >> 
> 
> ...
> 
> >> @@ -5426,7 +5482,8 @@ static struct elevator_type iosched_bfq_mq = {
> >> 	.ops.mq = {
> >> 		.limit_depth		= bfq_limit_depth,
> >> 		.prepare_request	= bfq_prepare_request,
> >> -		.finish_request		= bfq_finish_request,
> >> +		.requeue_request        = bfq_finish_requeue_request,
> >> +		.finish_request		= bfq_finish_requeue_request,
> >> 		.exit_icq		= bfq_exit_icq,
> >> 		.insert_requests	= bfq_insert_requests,
> >> 		.dispatch_request	= bfq_dispatch_request,
> > 
> > This way may not be correct since blk_mq_sched_requeue_request() can be
> > called for one request which won't enter io scheduler.
> > 
> 
> Exactly, there are two cases: requeues that lead to subsequent
> re-insertions, and requeues that don't.  The function
> bfq_finish_requeue_request handles both, and both need to be handled,
> to inform bfq that it has not to wait for the completion of rq any
> longer.
> 
> One special case is when bfq_finish_requeue_request gets invoked even
> if rq has nothing to do with any scheduler.  In that case,
> bfq_finish_requeue_request exists immediately.
> 
> 
> > __blk_mq_requeue_request() is called for two cases:
> > 
> > 	- one is that the requeued request is added to hctx->dispatch, such
> > 	as blk_mq_dispatch_rq_list()
> 
> yes
> 
> > 	- another case is that the request is requeued to io scheduler, such as
> > 	blk_mq_requeue_request().
> > 
> 
> yes
> 
> > For the 1st case, blk_mq_sched_requeue_request() shouldn't be called
> > since it is nothing to do with scheduler,
> 
> No, if that rq has been inserted and then extracted from the scheduler
> through a dispatch_request, then it has.  The scheduler is stateful,
> and keeps state for rq, because it must do so, until a completion or a
> requeue arrive.  In particular, bfq may decide that no other

This 'requeue' to hctx->dispatch is invisible to io scheduler, and from
io scheduler view, seems no difference between STS_OK and STS_RESOURCE,
since this request will be submitted to lld automatically by blk-mq
core.

Also in legacy path, no such notification to io scheduler too.

> bfq_queues must be served before rq has been completed, because this
> is necessary to preserve its target service guarantees.  If bfq is not
> informed, either through its completion or its requeue hook, then it
> will wait forever.

The .finish_request will be called after this request is completed for this
case, either after this io is completed by device, or timeout.

Or .requeue_request will be called if the driver need to requeue it to
io scheduler via blk_mq_requeue_request() explicitly.

So io scheduler will be made sure to get notified.

> 
> > seems we only need to do that
> > for 2nd case.
> > 
> 
> > So looks we need the following patch:
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 23de7fd8099a..a216f3c3c3ce 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -712,7 +714,6 @@ static void __blk_mq_requeue_request(struct request *rq)
> > 
> > 	trace_block_rq_requeue(q, rq);
> > 	wbt_requeue(q->rq_wb, &rq->issue_stat);
> > -	blk_mq_sched_requeue_request(rq);
> > 
> 
> By doing so, if rq has 'passed' through bfq, then you block again bfq
> forever.

Could you explain it a bit why bfq is blocked by this way?

> 
> Of course, I may be wrong, as I don't have a complete mental model of
> all what blk-mq does around bfq.  But I thin that you can quickly
> check whether a hang occurs again if you add this change.  In
> particular, it should happen in the USB former failure case.

USB/BFQ runs well on my parted test with this change, and this test can
trigger the IO hang issue easily on kyber or without your patch of 'block,
bfq: add requeue-request hook'.

Thanks,
Ming

next prev parent reply	other threads:[~2018-02-23 16:17 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-02-07 21:19 [PATCH BUGFIX V3] block, bfq: add requeue-request hook Paolo Valente
2018-02-07 22:18 ` Jens Axboe
2018-02-08  7:16   ` Paolo Valente
2018-02-09 13:21     ` Oleksandr Natalenko
2018-02-09 17:17       ` Jens Axboe
2018-02-09 17:29       ` Mike Galbraith
2018-02-10  8:29         ` Oleksandr Natalenko
2018-02-12  7:24           ` Paolo Valente
2018-02-23 15:07 ` Ming Lei
2018-02-23 15:41   ` Paolo Valente
2018-02-23 16:17     ` Ming Lei [this message]
2018-02-24  7:54       ` Paolo Valente
2018-02-24 12:05         ` Ming Lei
2018-02-24 16:15           ` Paolo Valente

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180223161751.GA17466@ming.t460p \
    --to=ming.lei@redhat.com \
    --cc=169364@studenti.unimore.it \
    --cc=alban.browaeys@gmail.com \
    --cc=axboe@kernel.dk \
    --cc=bfq-iosched@googlegroups.com \
    --cc=broonie@kernel.org \
    --cc=efault@gmx.de \
    --cc=holger@applied-asynchrony.com \
    --cc=ivan@ludios.org \
    --cc=linus.walleij@linaro.org \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=oleksandr@natalenko.name \
    --cc=paolo.valente@linaro.org \
    --cc=ulf.hansson@linaro.org \
    --cc=ziviani.serena@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox