From: Martin Steigerwald <martin@lichtvoll.de>
To: Ming Lei <ming.lei@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>,
linux-block@vger.kernel.org, Tejun Heo <tj@kernel.org>,
Bart Van Assche <bart.vanassche@wdc.com>,
Israel Rukshin <israelr@mellanox.com>
Subject: Re: [PATCH V4 0/2] blk-mq: fix race between completion and BLK_EH_RESET_TIMER
Date: Mon, 23 Apr 2018 10:41:31 +0200 [thread overview]
Message-ID: <1635378.MuJF6yyO9v@merkaba> (raw)
In-Reply-To: <20180418164629.GB22533@ming.t460p>
Hello Ming.
Ming Lei - 18.04.18, 18:46:
> On Mon, Apr 16, 2018 at 03:12:30PM +0200, Martin Steigerwald wrote:
> > Ming Lei - 16.04.18, 02:45:
> > > On Sun, Apr 15, 2018 at 06:31:44PM +0200, Martin Steigerwald
wrote:
> > > > Hi Ming.
> > > >
> > > > Ming Lei - 15.04.18, 17:43:
> > > > > Hi Jens,
> > > > >
> > > > > This two patches fixes the recently discussed race between
> > > > > completion
> > > > > and BLK_EH_RESET_TIMER.
> > > > >
> > > > > Israel & Martin, this one is a simpler fix on this issue and
> > > > > can
> > > > > cover the potencial hang of MQ_RQ_COMPLETE_IN_TIMEOUT request,
> > > > > could
> > > > > you test V4 and see if your issue can be fixed?
> > > >
> > > > In replacement of all the three other patches I applied?
> > > >
> > > > - '[PATCH] blk-mq_Directly schedule q->timeout_work when
> > > > aborting a
> > > > request.mbox'
> > > >
> > > > - '[PATCH v2] block: Change a rcu_read_{lock,unlock}_sched()
> > > > pair
> > > > into rcu_read_{lock,unlock}().mbox'
> > > >
> > > > - '[PATCH v4] blk-mq_Fix race conditions in request timeout
> > > > handling.mbox'
> > >
> > > You only need to replace the above one '[PATCH v4] blk-mq_Fix race
> > > conditions in request timeout' with V4 in this thread.
> >
> > Ming, a 4.16.2 with the patches:
> >
> > '[PATCH] blk-mq_Directly schedule q->timeout_work when aborting a
> > request.mbox'
> > '[PATCH v2] block: Change a rcu_read_{lock,unlock}_sched() pair into
> > rcu_read_{lock,unlock}().mbox'
> > '[PATCH V4 1_2] blk-mq_set RQF_MQ_TIMEOUT_EXPIRED when the rq'\''s
> > timeout isn'\''t handled.mbox'
> > '[PATCH V4 2_2] blk-mq_fix race between complete and
> > BLK_EH_RESET_TIMER.mbox'
> >
> > hung on boot 3 out of 4 times.
> >
> > See
> >
> > [Possible REGRESSION, 4.16-rc4] Error updating SMART data during
> > runtime and boot failures with blk_mq_terminate_expired in
> > backtrace https://bugzilla.kernel.org/show_bug.cgi?id=199077#c13
> >
> > I tried to add your mail address to Cc of the bug report, but
> > Bugzilla did not know it.
> >
> > Fortunately it booted on the fourth attempt, cause I forgot my GRUB
> > password.
> >
> > Reverting back to previous 4.16.1 kernel with patches from Bart.
> >
> > > > These patches worked reliably so far both for the hang on boot
> > > > and
> > > > error reading SMART data.
> > >
> > > And you may see the reason in the following thread:
> > >
> > > https://marc.info/?l=linux-block&m=152366441625786&w=2
> >
> > So requests could never be completed?
>
> Yes.
>
> I guess Jianchao's patch("[PATCH] blk-mq: start request gstate with
> gen 1") may work for your issue because you are using
> blk_abort_request().
>
> If it doesn't, please try the following V5 together the other two, and
> V5 fixes one big issue, in which the new rq state shouldn't be
> introduced, otherwise timeout is broken easily.
>
> I have tested V5 by running blktests(block/011), in which both these
> code paths are covered: EH_HANDLED, EH_RESET_TIMER, and normal
> completion during timeout, and the patch V5 works as expected.
I tested 4.16.3 with just Jianchao's patch "[PATCH] blk-mq: start
request gstate with gen 1" (+ the unrelated btrfs trimming fix I carry
for a long time already) and it did at least 15 boots successfully
(without hanging). So far also no "error loading smart data mail", but
it takes a few days with suspend/hibernation + resume cycles in order to
know for sure.
So if I read your mail correctly, there is no need to test your V5
patch.
Thanks,
Martin
> --
>
> From e81da316a953db999d155d08143fd5722b44e79e Mon Sep 17 00:00:00 2001
> From: Ming Lei <ming.lei@redhat.com>
> Date: Thu, 12 Apr 2018 04:23:09 +0800
> Subject: [PATCH] blk-mq: fix race between complete and
> BLK_EH_RESET_TIMER
>
> The normal request completion can be done before or during handling
> BLK_EH_RESET_TIMER, and this race may cause the request to never
> be completed since driver's .timeout() may always return
> BLK_EH_RESET_TIMER.
>
> This issue can't be fixed completely by driver, since the normal
> completion can be done between returning .timeout() and handling
> BLK_EH_RESET_TIMER.
>
> This patch fixes the race by introducing rq state of
> MQ_RQ_COMPLETE_IN_RESET, and reading/writing rq's state by holding
> queue lock, which can be per-request actually, but just not necessary
> to introduce one lock for so unusual event.
>
> Also handle the timeout requests in two steps:
>
> 1) in 1st step, call .timeout(), and reset timer for
> BLK_EH_RESET_TIMER
>
> 2) in 2nd step, sync with normal completion path by holding queue lock
> for avoiding race between BLK_EH_RESET_TIMER and normal completion.
>
> Another change is that one request is always handled as time-out
> exclusively in this patch with help of queue lock.
>
> Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Sagi Grimberg <sagi@grimberg.me>
> Cc: Israel Rukshin <israelr@mellanox.com>,
> Cc: Max Gurtovoy <maxg@mellanox.com>
> Cc: stable@vger.kernel.org
> Cc: Martin Steigerwald <Martin@Lichtvoll.de>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> block/blk-mq.c | 139
> +++++++++++++++++++++++++++++++++++++++----------
> include/linux/blkdev.h | 13 +++++
> 2 files changed, 125 insertions(+), 27 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 0dc9e341c2a7..b70b00910b41 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -630,10 +630,27 @@ void blk_mq_complete_request(struct request *rq)
> * However, that would complicate paths which want to synchronize *
> against us. Let stay in sync with the issue path so that *
> hctx_lock() covers both issue and completion paths.
> + *
> + * Cover complete vs BLK_EH_RESET_TIMER race in slow path with
> + * holding queue lock.
> */
> hctx_lock(hctx, &srcu_idx);
> if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
> __blk_mq_complete_request(rq);
> + else {
> + unsigned long flags;
> + bool need_complete = false;
> +
> + spin_lock_irqsave(q->queue_lock, flags);
> + if (!blk_mq_rq_aborted_gstate(rq))
> + need_complete = true;
> + else
> + rq->rq_flags |= RQF_MQ_COMPLETE_IN_EH;
> + spin_unlock_irqrestore(q->queue_lock, flags);
> +
> + if (need_complete)
> + __blk_mq_complete_request(rq);
> + }
> hctx_unlock(hctx, srcu_idx);
> }
> EXPORT_SYMBOL(blk_mq_complete_request);
> @@ -810,28 +827,50 @@ struct blk_mq_timeout_data {
> unsigned int nr_expired;
> };
>
> -static void blk_mq_rq_timed_out(struct request *req, bool reserved)
> +static void blk_mq_rq_pre_timed_out(struct request *req, bool
> reserved) {
> const struct blk_mq_ops *ops = req->q->mq_ops;
> enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
>
> - req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
> -
> if (ops->timeout)
> ret = ops->timeout(req, reserved);
>
> + if (ret == BLK_EH_RESET_TIMER)
> + blk_add_timer(req);
> + req->eh_data.ret = ret;
> +}
> +
> +static void blk_mq_rq_timed_out(struct request *req)
> +{
> + enum blk_eh_timer_return ret = req->eh_data.ret;
> +
> switch (ret) {
> case BLK_EH_HANDLED:
> + spin_lock_irq(req->q->queue_lock);
> + complete_rq:
> + if (req->rq_flags & RQF_MQ_COMPLETE_IN_EH)
> + req->rq_flags &= ~RQF_MQ_COMPLETE_IN_EH;
> + spin_unlock_irq(req->q->queue_lock);
> __blk_mq_complete_request(req);
> break;
> case BLK_EH_RESET_TIMER:
> /*
> - * As nothing prevents from completion happening while
> - * ->aborted_gstate is set, this may lead to ignored
> - * completions and further spurious timeouts.
> + * The normal completion may happen during handling the
> + * timeout, or even after returning from .timeout(), so
> + * once the request has been completed, we can't reset
> + * timer any more since this request may be handled as
> + * BLK_EH_RESET_TIMER in next timeout handling too, and
> + * it has to be completed in this situation.
> + *
> + * Holding the queue lock to cover read/write rq's
> + * aborted_gstate and normal state, so the race can be
> + * avoided completely.
> */
> + spin_lock_irq(req->q->queue_lock);
> + if (req->rq_flags & RQF_MQ_COMPLETE_IN_EH)
> + goto complete_rq;
> blk_mq_rq_update_aborted_gstate(req, 0);
> - blk_add_timer(req);
> + spin_unlock_irq(req->q->queue_lock);
> break;
> case BLK_EH_NOT_HANDLED:
> break;
> @@ -847,10 +886,15 @@ static void blk_mq_check_expired(struct
> blk_mq_hw_ctx *hctx, struct blk_mq_timeout_data *data = priv;
> unsigned long gstate, deadline;
> int start;
> + bool expired;
>
> might_sleep();
>
> - if (rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED)
> + spin_lock_irq(rq->q->queue_lock);
> + expired = rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED;
> + spin_unlock_irq(rq->q->queue_lock);
> +
> + if (expired)
> return;
>
> /* read coherent snapshots of @rq->state_gen and @rq->deadline */
> @@ -875,19 +919,53 @@ static void blk_mq_check_expired(struct
> blk_mq_hw_ctx *hctx, }
> }
>
> -static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
> +static void blk_mq_prepare_expired(struct blk_mq_hw_ctx *hctx,
> struct request *rq, void *priv, bool reserved)
> {
> + struct list_head *list = priv;
> /*
> * We marked @rq->aborted_gstate and waited for RCU. If there were
> * completions that we lost to, they would have finished and
> * updated @rq->gstate by now; otherwise, the completion path is
> * now guaranteed to see @rq->aborted_gstate and yield. If
> * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
> + *
> + * One request is always handled as time-out exclusively with
> + * queue lock.
> */
> + spin_lock_irq(rq->q->queue_lock);
> if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
> - READ_ONCE(rq->gstate) == rq->aborted_gstate)
> - blk_mq_rq_timed_out(rq, reserved);
> + READ_ONCE(rq->gstate) == rq->aborted_gstate) {
> + rq->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
> + spin_unlock_irq(rq->q->queue_lock);
> +
> + blk_mq_rq_pre_timed_out(rq, reserved);
> + list_add_tail(&rq->eh_data.node, list);
> + } else
> + spin_unlock_irq(rq->q->queue_lock);
> +}
> +
> +static void blk_mq_timeout_synchronize_rcu(struct request_queue *q,
> + bool reset_expired)
> +{
> + struct blk_mq_hw_ctx *hctx;
> + int i;
> + bool has_rcu = false;
> +
> + queue_for_each_hw_ctx(q, hctx, i) {
> + if (!hctx->nr_expired)
> + continue;
> +
> + if (!(hctx->flags & BLK_MQ_F_BLOCKING))
> + has_rcu = true;
> + else
> + synchronize_srcu(hctx->srcu);
> +
> + if (reset_expired)
> + hctx->nr_expired = 0;
> + }
> + if (has_rcu)
> + synchronize_rcu();
> }
>
> static void blk_mq_timeout_work(struct work_struct *work)
> @@ -899,8 +977,6 @@ static void blk_mq_timeout_work(struct work_struct
> *work) .next_set = 0,
> .nr_expired = 0,
> };
> - struct blk_mq_hw_ctx *hctx;
> - int i;
>
> /* A deadlock might occur if a request is stuck requiring a
> * timeout at the same time a queue freeze is waiting
> @@ -922,7 +998,8 @@ static void blk_mq_timeout_work(struct work_struct
> *work) blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data);
>
> if (data.nr_expired) {
> - bool has_rcu = false;
> + struct request *rq;
> + LIST_HEAD(rq_list);
>
> /*
> * Wait till everyone sees ->aborted_gstate. The
> @@ -930,28 +1007,36 @@ static void blk_mq_timeout_work(struct
> work_struct *work) * becomes a problem, we can add per-hw_ctx
> rcu_head and
> * wait in parallel.
> */
> - queue_for_each_hw_ctx(q, hctx, i) {
> - if (!hctx->nr_expired)
> - continue;
> + blk_mq_timeout_synchronize_rcu(q, false);
>
> - if (!(hctx->flags & BLK_MQ_F_BLOCKING))
> - has_rcu = true;
> - else
> - synchronize_srcu(hctx->srcu);
> + /* call .timeout() for timed-out requests */
> + blk_mq_queue_tag_busy_iter(q, blk_mq_prepare_expired, &rq_list);
>
> - hctx->nr_expired = 0;
> - }
> - if (has_rcu)
> - synchronize_rcu();
> + /*
> + * If .timeout returns BLK_EH_HANDLED, wait till current
> + * completion is done, for avoiding to update state on
> + * completed request.
> + *
> + * If .timeout returns BLK_EH_RESET_TIMER, wait till
> + * blk_add_timer() is commited before completing this rq.
> + */
> + blk_mq_timeout_synchronize_rcu(q, true);
>
> - /* terminate the ones we won */
> - blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired, NULL);
> + /* terminate timed-out requests */
> + while (!list_empty(&rq_list)) {
> + rq = list_entry(rq_list.next, struct request, eh_data.node);
> + list_del_init(&rq->eh_data.node);
> + blk_mq_rq_timed_out(rq);
> + }
> }
>
> if (data.next_set) {
> data.next = blk_rq_timeout(round_jiffies_up(data.next));
> mod_timer(&q->timeout, data.next);
> } else {
> + struct blk_mq_hw_ctx *hctx;
> + int i;
> +
> /*
> * Request timeouts are handled as a forward rolling timer. If
> * we end up here it means that no requests are pending and
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 9af3e0f430bc..704d54ce7702 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -129,11 +129,18 @@ typedef __u32 __bitwise req_flags_t;
> #define RQF_MQ_TIMEOUT_EXPIRED ((__force req_flags_t)(1 << 20))
> /* already slept for hybrid poll */
> #define RQF_MQ_POLL_SLEPT ((__force req_flags_t)(1 << 21))
> +/* already slept for hybrid poll */
> +#define RQF_MQ_COMPLETE_IN_EH ((__force req_flags_t)(1 << 22))
>
> /* flags that prevent us from merging requests: */
> #define RQF_NOMERGE_FLAGS \
> (RQF_STARTED | RQF_SOFTBARRIER | RQF_FLUSH_SEQ |
> RQF_SPECIAL_PAYLOAD)
>
> +struct req_eh_data {
> + int ret;
> + struct list_head node;
> +};
> +
> /*
> * Try to put the fields that are referenced together in the same
> cacheline. *
> @@ -252,8 +259,14 @@ struct request {
> struct list_head timeout_list;
>
> union {
> + /* used after completion */
> struct __call_single_data csd;
> +
> + /* used in io scheduler, before dispatch */
> u64 fifo_time;
> +
> + /* used after dispatch and before completion */
> + struct req_eh_data eh_data;
> };
>
> /*
--
Martin
prev parent reply other threads:[~2018-04-23 8:41 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-04-15 15:43 [PATCH V4 0/2] blk-mq: fix race between completion and BLK_EH_RESET_TIMER Ming Lei
2018-04-15 15:43 ` [PATCH V4 1/2] blk-mq: set RQF_MQ_TIMEOUT_EXPIRED when the rq's timeout isn't handled Ming Lei
2018-04-15 15:43 ` [PATCH V4 2/2] blk-mq: fix race between complete and BLK_EH_RESET_TIMER Ming Lei
2018-04-15 16:31 ` [PATCH V4 0/2] blk-mq: fix race between completion " Martin Steigerwald
2018-04-16 0:45 ` Ming Lei
2018-04-16 13:12 ` Martin Steigerwald
2018-04-16 16:04 ` jianchao.wang
2018-04-17 0:15 ` Bart Van Assche
2018-04-17 3:49 ` jianchao.wang
2018-04-18 16:46 ` Ming Lei
2018-04-23 8:41 ` Martin Steigerwald [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1635378.MuJF6yyO9v@merkaba \
--to=martin@lichtvoll.de \
--cc=axboe@kernel.dk \
--cc=bart.vanassche@wdc.com \
--cc=israelr@mellanox.com \
--cc=linux-block@vger.kernel.org \
--cc=ming.lei@redhat.com \
--cc=tj@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox