Re: [PATCH] blk-mq: Start to fix memory ordering...

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Andrea Parri <parri.andrea@gmail.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Jens Axboe <axboe@fb.com>,
	linux-kernel@vger.kernel.org, stern@rowland.harvard.edu,
	will.deacon@arm.com, tom.leiming@gmail.com, boqun.feng@gmail.com,
	paulmck@linux.vnet.ibm.com, hch@lst.de, bart.vanassche@wdc.com
Subject: Re: [PATCH] blk-mq: Start to fix memory ordering...
Date: Wed, 6 Sep 2017 09:13:04 +0200	[thread overview]
Message-ID: <20170906071304.GA3519@andrea> (raw)
In-Reply-To: <20170904090932.ngfefw2clot3sbqi@hirez.programming.kicks-ass.net>

On Mon, Sep 04, 2017 at 11:09:32AM +0200, Peter Zijlstra wrote:
> 
> Attempt to untangle the ordering in blk-mq. The patch introducing the
> single smp_mb__before_atomic() is obviously broken in that it doesn't
> clearly specify a pairing barrier and an obtained guarantee.
> 
> The comment is further misleading in that it hints that the
> deadline store and the COMPLETE store also need to be ordered, but
> AFAICT there is no such dependency. However what does appear to be
> important is the clear happening _after_ the store, and that worked by
> pure accident.
> 
> This clarifies blk_mq_start_request() -- we should not get there with
> STARTING set -- this simplifies the code and makes the barrier usage
> sane (the old code could be read to allow not having _any_ atomic after
> the barrier, in which case the barrier hasn't got anything to order). We
> then also introduce the missing pairing barrier for it.
> 
> And it documents the STARTING vs COMPLETE ordering. Although I've not
> been entirely successful in reverse engineering the blk-mq state
> machine so there might still be more funnies around timeout vs
> requeue.
> 
> If I got anything wrong, feel free to educate me by adding comments to
> clarify things ;-)
> 
> Cc: Alan Stern <stern@rowland.harvard.edu>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Ming Lei <tom.leiming@gmail.com>
> Cc: Jens Axboe <axboe@fb.com>
> Cc: Andrea Parri <parri.andrea@gmail.com>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> Fixes: 538b75341835 ("blk-mq: request deadline must be visible before marking rq as started")
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  block/blk-mq.c |   48 +++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 37 insertions(+), 11 deletions(-)
> 
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -558,22 +558,29 @@ void blk_mq_start_request(struct request
>  
>  	blk_add_timer(rq);
>  
> -	/*
> -	 * Ensure that ->deadline is visible before set the started
> -	 * flag and clear the completed flag.
> -	 */
> -	smp_mb__before_atomic();
> +	WARN_ON_ONCE(test_bit(REQ_ATOM_STARTED, &rq->atomic_flags));
>  
>  	/*
>  	 * Mark us as started and clear complete. Complete might have been
>  	 * set if requeue raced with timeout, which then marked it as
>  	 * complete. So be sure to clear complete again when we start
>  	 * the request, otherwise we'll ignore the completion event.
> +	 *
> +	 * Ensure that ->deadline is visible before set STARTED, such that
> +	 * blk_mq_check_expired() is guaranteed to observe our ->deadline
> +	 * when it observes STARTED.
>  	 */
> -	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
> -		set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
> -	if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
> +	smp_mb__before_atomic();

I am wondering whether we should be using smp_wmb() instead: this would
provide the above guarantee and save a full barrier on powerpc/arm64.


> +	set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
> +	if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags)) {
> +		/*
> +		 * Coherence order guarantees these consequtive stores to a
> +		 * singe variable propagate in the specified order. Thus the
> +		 * clear_bit() is ordered _after_ the set bit. See
> +		 * blk_mq_check_expired().
> +		 */
>  		clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);

It could be useful to stress that set_bit(), clear_bit()  must "act" on
the same subword of the unsigned long (whatever "subword" means at this
level...) to rely on the coherence order (c.f., alpha's implementation).


> +	}
>  
>  	if (q->dma_drain_size && blk_rq_bytes(rq)) {
>  		/*
> @@ -744,11 +751,20 @@ static void blk_mq_check_expired(struct
>  		struct request *rq, void *priv, bool reserved)
>  {
>  	struct blk_mq_timeout_data *data = priv;
> +	unsigned long deadline;
>  
>  	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
>  		return;
>  
>  	/*
> +	 * Ensures that if we see STARTED we must also see our
> +	 * up-to-date deadline, see blk_mq_start_request().
> +	 */
> +	smp_rmb();
> +
> +	deadline = READ_ONCE(rq->deaedline);
> +
> +	/*
>  	 * The rq being checked may have been freed and reallocated
>  	 * out already here, we avoid this race by checking rq->deadline
>  	 * and REQ_ATOM_COMPLETE flag together:
> @@ -761,10 +777,20 @@ static void blk_mq_check_expired(struct
>  	 *   and clearing the flag in blk_mq_start_request(), so
>  	 *   this rq won't be timed out too.
>  	 */
> -	if (time_after_eq(jiffies, rq->deadline)) {
> -		if (!blk_mark_rq_complete(rq))
> +	if (time_after_eq(jiffies, deadline)) {
> +		if (!blk_mark_rq_complete(rq)) {
> +			/*
> +			 * Relies on the implied MB from test_and_clear() to
> +			 * order the COMPLETE load against the STARTED load.
> +			 * Orders against the coherence order in
> +			 * blk_mq_start_request().

I understand "from test_and_set_bit()" (in blk_mark_rq_complete()) and
that the interested cycle is:

   /* in blk_mq_start_request() */
   [STORE STARTED bit = 1 into atomic_flags]
      -->co [STORE COMPLETE bit = 0 into atomic_flags]
         /* in blk_mq_check_expired() */
         -->rf [LOAD COMPLETE bit = 0 from atomic_flags]
            -->po-loc [LOAD STARTED bit = 0 from atomic_flags]
               /* in blk_mq_start_request() again */
               -->fr [STORE STARTED bit = 1 into atomic_flags]

   (N.B. Assume all accesses happen to/from the same subword.)

This cycle being forbidden by the "coherence check", I'd say we do not
need to rely on the MB mentioned by the comment; what am I missing?

  Andrea


> +			 *
> +			 * This ensures that if we see !COMPLETE we must see
> +			 * STARTED and ignore this timeout.
> +			 */
>  			blk_mq_rq_timed_out(rq, reserved);
> -	} else if (!data->next_set || time_after(data->next, rq->deadline)) {
> +		}
> +	} else if (!data->next_set || time_after(data->next, deadline)) {
>  		data->next = rq->deadline;
>  		data->next_set = 1;
>  	}

next prev parent reply	other threads:[~2017-09-06  7:13 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-09-04  9:09 [PATCH] blk-mq: Start to fix memory ordering Peter Zijlstra
2017-09-05 14:51 ` Bart Van Assche
2017-09-06  7:22   ` Peter Zijlstra
2017-09-06  7:13 ` Andrea Parri [this message]
2017-09-06  8:48   ` Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170906071304.GA3519@andrea \
    --to=parri.andrea@gmail.com \
    --cc=axboe@fb.com \
    --cc=bart.vanassche@wdc.com \
    --cc=boqun.feng@gmail.com \
    --cc=hch@lst.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=peterz@infradead.org \
    --cc=stern@rowland.harvard.edu \
    --cc=tom.leiming@gmail.com \
    --cc=will.deacon@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.