* [PATCH V2 1/1] blk-mq: avoid double ->queue_rq() because of early timeout
@ 2022-10-26 1:55 Ming Lei
[not found] ` <Y1ilaQV3hz6kudH3@kbusch-mbp.dhcp.thefacebook.com>
0 siblings, 1 reply; 2+ messages in thread
From: Ming Lei @ 2022-10-26 1:55 UTC (permalink / raw)
To: Jens Axboe
Cc: David Jeffery, Bart Van Assche, virtualization, linux-block,
Stefan Hajnoczi, Keith Busch, Ming Lei
From: David Jeffery <djeffery@redhat.com>
David Jeffery found a double ->queue_rq() issue. So far it can be
triggered in VM use cases because of long vmexit latency, preempt
latency of the vCPU pthread, or a long page fault in the vCPU pthread.
A block IO request can then time out before it is queued to hardware but
after blk_mq_start_request() has been called during ->queue_rq(). The
timeout handler may handle it by requeueing, which causes a double
->queue_rq() and a kernel panic.
So far, it has been the driver's responsibility to cover the race between
timeout and completion, so in theory this should be solved in the driver,
given that the driver has enough knowledge.
But this is really a common problem: lots of drivers could have a similar
issue, it could be hard to fix all affected drivers, and it isn't even
easy for a driver to handle the race. So David suggests this patch, which
drains in-progress ->queue_rq() calls to solve the issue.
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: virtualization@lists.linux-foundation.org
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
V2:
- follow Jens's suggestion to run sync rcu only if there is timeout
- rename 'now' as 'timeout_start'
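For readers following along, the two-pass flow this patch adds to
blk_mq_timeout_work() can be sketched in userspace C. This is only an
illustration under stated assumptions: every sketch_* name is hypothetical,
the wait hook merely stands in for blk_mq_wait_quiesce_done(), and
tick_after_eq()/tick_after() imitate the kernel's time_after_eq() and
time_after() instead of reusing them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Wraparound-tolerant comparisons, imitating time_after_eq()/time_after(). */
static bool tick_after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

static bool tick_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

struct sketch_rq {			/* hypothetical stand-in for struct request */
	bool started;			/* blk_mq_start_request() was called */
	unsigned long deadline;
	int timed_out;			/* times the timeout handler ran */
};

struct sketch_expired {			/* same fields as blk_expired_data */
	bool check_only;
	bool has_timedout_rq;
	unsigned long next;
	unsigned long timeout_start;
};

/* Returns false to stop the walk, true to continue with the next request. */
static bool sketch_check_expired(struct sketch_rq *rq, struct sketch_expired *e)
{
	if (!rq->started)
		return true;
	/* Compare against the snapshot, not against the current time. */
	if (tick_after_eq(e->timeout_start, rq->deadline)) {
		if (e->check_only) {
			e->has_timedout_rq = true;
			return false;
		}
		rq->timed_out++;
		return true;
	}
	/* Track the earliest deadline that has not expired yet. */
	if (e->next == 0 || tick_after(e->next, rq->deadline))
		e->next = rq->deadline;
	return true;
}

static void sketch_iter(struct sketch_rq *rqs, size_t n, struct sketch_expired *e)
{
	assert(rqs != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (!sketch_check_expired(&rqs[i], e))
			break;
}

/* Example hook: only counts how often the timeout work had to wait. */
static int sketch_waits;

static void sketch_count_wait(void)
{
	sketch_waits++;
}

/* Returns the next timer expiry, or 0 if no started request is pending. */
static unsigned long sketch_timeout_work(struct sketch_rq *rqs, size_t n,
					 unsigned long now,
					 void (*wait_for_submitters)(void))
{
	struct sketch_expired e = { .check_only = true, .timeout_start = now };

	/* Pass 1: only detect whether anything has expired. */
	sketch_iter(rqs, n, &e);
	if (e.has_timedout_rq) {
		/* Drain in-progress submits before touching any request. */
		wait_for_submitters();
		/* Pass 2: actually handle the expired requests. */
		e.check_only = false;
		e.next = 0;
		sketch_iter(rqs, n, &e);
	}
	return e.next;
}
```

The wait is paid only when the first pass finds an expired request, which
is the V2 change noted above. Because both passes compare deadlines against
the same timeout_start snapshot, a request started during the wait with a
nonzero timeout is not treated as expired by the second pass.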
block/blk-mq.c | 53 ++++++++++++++++++++++++++++++++++++++------------
1 file changed, 41 insertions(+), 12 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 33292c01875d..431e71af0e06 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1523,7 +1523,14 @@ static void blk_mq_rq_timed_out(struct request *req)
blk_add_timer(req);
}
-static bool blk_mq_req_expired(struct request *rq, unsigned long *next)
+struct blk_expired_data {
+ bool check_only;
+ bool has_timedout_rq;
+ unsigned long next;
+ unsigned long timeout_start;
+};
+
+static bool blk_mq_req_expired(struct request *rq, struct blk_expired_data *expired)
{
unsigned long deadline;
@@ -1533,13 +1540,13 @@ static bool blk_mq_req_expired(struct request *rq, unsigned long *next)
return false;
deadline = READ_ONCE(rq->deadline);
- if (time_after_eq(jiffies, deadline))
+ if (time_after_eq(expired->timeout_start, deadline))
return true;
- if (*next == 0)
- *next = deadline;
- else if (time_after(*next, deadline))
- *next = deadline;
+ if (expired->next == 0)
+ expired->next = deadline;
+ else if (time_after(expired->next, deadline))
+ expired->next = deadline;
return false;
}
@@ -1555,7 +1562,7 @@ void blk_mq_put_rq_ref(struct request *rq)
static bool blk_mq_check_expired(struct request *rq, void *priv)
{
- unsigned long *next = priv;
+ struct blk_expired_data *expired = priv;
/*
* blk_mq_queue_tag_busy_iter() has locked the request, so it cannot
@@ -1564,8 +1571,13 @@ static bool blk_mq_check_expired(struct request *rq, void *priv)
* it was completed and reallocated as a new request after returning
* from blk_mq_check_expired().
*/
- if (blk_mq_req_expired(rq, next))
+ if (blk_mq_req_expired(rq, expired)) {
+ if (expired->check_only) {
+ expired->has_timedout_rq = true;
+ return false;
+ }
blk_mq_rq_timed_out(rq);
+ }
return true;
}
@@ -1573,7 +1585,10 @@ static void blk_mq_timeout_work(struct work_struct *work)
{
struct request_queue *q =
container_of(work, struct request_queue, timeout_work);
- unsigned long next = 0;
+ struct blk_expired_data expired = {
+ .check_only = true,
+ .timeout_start = jiffies,
+ };
struct blk_mq_hw_ctx *hctx;
unsigned long i;
@@ -1593,10 +1608,24 @@ static void blk_mq_timeout_work(struct work_struct *work)
if (!percpu_ref_tryget(&q->q_usage_counter))
return;
- blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next);
+ /* check if there is any timed-out request */
+ blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &expired);
+ if (expired.has_timedout_rq) {
+ /*
+ * Before walking tags, we must ensure any submit started
+ * before the current time has finished. Since the submit
+ * uses srcu or rcu, wait for a synchronization point to
+ * ensure all running submits have finished
+ */
+ blk_mq_wait_quiesce_done(q);
+
+ expired.check_only = false;
+ expired.next = 0;
+ blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &expired);
+ }
- if (next != 0) {
- mod_timer(&q->timeout, next);
+ if (expired.next != 0) {
+ mod_timer(&q->timeout, expired.next);
} else {
/*
* Request timeouts are handled as a forward rolling timer. If
--
2.31.1
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
* Re: [PATCH V2 1/1] blk-mq: avoid double ->queue_rq() because of early timeout
[not found] ` <Y1ilaQV3hz6kudH3@kbusch-mbp.dhcp.thefacebook.com>
@ 2022-10-26 3:29 ` Ming Lei
0 siblings, 0 replies; 2+ messages in thread
From: Ming Lei @ 2022-10-26 3:29 UTC (permalink / raw)
To: Keith Busch
Cc: Jens Axboe, David Jeffery, Bart Van Assche, virtualization,
linux-block, Stefan Hajnoczi
On Tue, Oct 25, 2022 at 09:11:37PM -0600, Keith Busch wrote:
> On Wed, Oct 26, 2022 at 09:55:21AM +0800, Ming Lei wrote:
> > @@ -1564,8 +1571,13 @@ static bool blk_mq_check_expired(struct request *rq, void *priv)
> > * it was completed and reallocated as a new request after returning
> > * from blk_mq_check_expired().
> > */
> > - if (blk_mq_req_expired(rq, next))
> > + if (blk_mq_req_expired(rq, expired)) {
> > + if (expired->check_only) {
> > + expired->has_timedout_rq = true;
> > + return false;
> > + }
> > blk_mq_rq_timed_out(rq);
> > + }
> > return true;
> > }
> >
> > @@ -1573,7 +1585,10 @@ static void blk_mq_timeout_work(struct work_struct *work)
> > {
> > struct request_queue *q =
> > container_of(work, struct request_queue, timeout_work);
> > - unsigned long next = 0;
> > + struct blk_expired_data expired = {
> > + .check_only = true,
> > + .timeout_start = jiffies,
> > + };
> > struct blk_mq_hw_ctx *hctx;
> > unsigned long i;
> >
> > @@ -1593,10 +1608,24 @@ static void blk_mq_timeout_work(struct work_struct *work)
> > if (!percpu_ref_tryget(&q->q_usage_counter))
> > return;
> >
> > - blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next);
> > + /* check if there is any timed-out request */
> > + blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &expired);
> > + if (expired.has_timedout_rq) {
> > + /*
> > + * Before walking tags, we must ensure any submit started
> > + * before the current time has finished. Since the submit
> > + * uses srcu or rcu, wait for a synchronization point to
> > + * ensure all running submits have finished
> > + */
> > + blk_mq_wait_quiesce_done(q);
> > +
> > + expired.check_only = false;
> > + expired.next = 0;
> > + blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &expired);
>
> I think it would be easier to follow with separate callbacks instead of
> special casing for 'check_only'. One callback for checking timeouts, and
> a different one for handling them?
The two are basically the same: with two callbacks, only .check_only is
saved and nothing else, while one extra, similar callback is added.
If you or anyone else thinks it is a big deal, I can switch to the
two-callback version.
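For comparison, a two-callback version might look roughly like the
following userspace sketch. It is not code from this thread: all demo_*
names are hypothetical, and demo_busy_iter() is only a toy stand-in for
blk_mq_queue_tag_busy_iter(). It shows the trade-off described above: the
check_only flag disappears, and a second, nearly identical callback appears.

```c
#include <stdbool.h>
#include <stddef.h>

struct demo_rq {			/* hypothetical stand-in for struct request */
	bool started;
	unsigned long deadline;
	int handled;			/* times the timeout handler ran */
};

struct demo_expired {			/* blk_expired_data without check_only */
	bool has_timedout_rq;
	unsigned long next;
	unsigned long timeout_start;
};

/* Shared helper: true if rq expired relative to the snapshot. */
static bool demo_rq_expired(const struct demo_rq *rq, struct demo_expired *e)
{
	if (!rq->started)
		return false;
	if ((long)(e->timeout_start - rq->deadline) >= 0)
		return true;
	/* Remember the earliest deadline that is still pending. */
	if (e->next == 0 || (long)(rq->deadline - e->next) < 0)
		e->next = rq->deadline;
	return false;
}

/* Callback 1: detect only, and stop the walk at the first expired request. */
static bool demo_check_expired(struct demo_rq *rq, void *priv)
{
	struct demo_expired *e = priv;

	if (demo_rq_expired(rq, e)) {
		e->has_timedout_rq = true;
		return false;
	}
	return true;
}

/* Callback 2: handle every expired request and keep walking. */
static bool demo_handle_expired(struct demo_rq *rq, void *priv)
{
	struct demo_expired *e = priv;

	if (demo_rq_expired(rq, e))
		rq->handled++;
	return true;
}

/* Toy iterator: calls fn on each request until fn returns false. */
static void demo_busy_iter(struct demo_rq *rqs, size_t n,
			   bool (*fn)(struct demo_rq *, void *), void *priv)
{
	for (size_t i = 0; i < n; i++)
		if (!fn(&rqs[i], priv))
			break;
}
```

A caller would run demo_check_expired first and, only if has_timedout_rq
is set, wait for in-progress submits, reset next to 0, and then run
demo_handle_expired.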
Thanks,
Ming