Re: blk-mq flush fix

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Shaohua Li <shli@kernel.org>
To: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>,
	Alexander Gordeev <agordeev@redhat.com>,
	Tejun Heo <tj@kernel.org>,
	Nicholas Bellinger <nab@linux-iscsi.org>,
	linux-kernel <linux-kernel@vger.kernel.org>
Subject: Re: blk-mq flush fix
Date: Tue, 29 Oct 2013 03:47:41 +0800	[thread overview]
Message-ID: <20131028194741.GA24664@kernel.org> (raw)
In-Reply-To: <526EBD51.1010804@kernel.dk>

On Mon, Oct 28, 2013 at 01:38:57PM -0600, Jens Axboe wrote:
> On 10/28/2013 10:57 AM, Shaohua Li wrote:
> > 
> > 
> > 
> > 2013/10/28 Jens Axboe <axboe@kernel.dk <mailto:axboe@kernel.dk>>
> > 
> >     On 10/28/2013 02:48 AM, Christoph Hellwig wrote:
> >     > On Sun, Oct 27, 2013 at 10:29:25PM +0000, Jens Axboe wrote:
> >     >> On Sat, Oct 26 2013, Christoph Hellwig wrote:
> >     >>> I think this variant of the patch from Alexander should fix the
> >     issue
> >     >>> in a minimally invasive way.  Longer term I'd prefer to use
> >     q->flush_rq
> >     >>> like in the non-mq case by copying over the context and tag
> >     information.
> >     >>
> >     >> This one is pretty simple, we could definitely use it as a band
> >     aid. I
> >     >> too would greatly prefer using the static ->flush_rq instead.
> >     Just have
> >     >> it marked to bypass most of the free logic.
> >     >
> >     > We already bypass the free logical by setting and end_io callback for
> >     > a while, similar to what the old code does.  Maybe it's not all that
> >     > hard to prealloc the request, let me give a sping.  Using the static
> >     > allocated one will be hard due to the driver-specific extra data,
> >     > though.
> > 
> >     It's not that I think the existing patch is THAT bad, it fits in alright
> >     with the reserved tagging and works regardless of whether a driver uses
> >     reserved tags or not. And it does have the upside of not requiring
> >     special checks or logic for this special non-tagged request that using
> >     the preallocated would might need.
> > 
> >     >> I'll add this one.
> >     >
> >     > Gimme another day or so to figure this out.
> > 
> >     OK, holding off.
> > 
> > 
> > Another option: we could throttle flush-request allocation in
> > blk_mq_alloc_request(), for example, flush_req_nr >= max_tags - 1, make
> > the allocation wait.
> 
> That could work too. If we back off, then we could restart it once a
> request completes. That does, however, requiring checking that and
> potentially kicking all the queues on completion when that happens.

Sounds not a big problem because the case flush_req uses all tags is very rare.
The good side is we can avoid reserving a tag, which is precious.

I cooked a patch to demonstrate the idea, only compiled yet.

diff --git a/block/blk-flush.c b/block/blk-flush.c
index 3e4cc9c..192c2aa 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -284,7 +284,6 @@ static void mq_flush_work(struct work_struct *work)
 
 	q = container_of(work, struct request_queue, mq_flush_work);
 
-	/* We don't need set REQ_FLUSH_SEQ, it's for consistency */
 	rq = blk_mq_alloc_request(q, WRITE_FLUSH|REQ_FLUSH_SEQ,
 		__GFP_WAIT|GFP_ATOMIC);
 	rq->cmd_type = REQ_TYPE_FS;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index ac804c6..fbbe0cc 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -180,8 +180,21 @@ static void blk_mq_rq_ctx_init(struct blk_mq_ctx *ctx, struct request *rq,
 }
 
 static struct request *__blk_mq_alloc_request(struct blk_mq_hw_ctx *hctx,
-					      gfp_t gfp, bool reserved)
+					      gfp_t gfp, bool reserved,
+					      int rw)
 {
+
+	/*
+	 * flush need allocate a request, leave at least one request for
+	 * non-flush IO to avoid deadlock
+	 */
+	if ((rw & REQ_FLUSH) && !(rw & REQ_FLUSH_SEQ)) {
+		atomic_inc(&hctx->pending_flush);
+		/* fallback to a wait allocation */
+		if (atomic_read(&hctx->pending_flush) >= hctx->queue_depth -
+		    hctx->reserved_tags - 1)
+			return NULL;
+	}
 	return blk_mq_alloc_rq(hctx, gfp, reserved);
 }
 
@@ -195,7 +208,7 @@ static struct request *blk_mq_alloc_request_pinned(struct request_queue *q,
 		struct blk_mq_ctx *ctx = blk_mq_get_ctx(q);
 		struct blk_mq_hw_ctx *hctx = q->mq_ops->map_queue(q, ctx->cpu);
 
-		rq = __blk_mq_alloc_request(hctx, gfp & ~__GFP_WAIT, reserved);
+		rq = __blk_mq_alloc_request(hctx, gfp & ~__GFP_WAIT, reserved, rw);
 		if (rq) {
 			blk_mq_rq_ctx_init(ctx, rq, rw);
 			break;
@@ -253,6 +266,10 @@ static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx,
 	const int tag = rq->tag;
 	struct request_queue *q = rq->q;
 
+	if ((rq->cmd_flags & REQ_FLUSH) && !(rq->cmd_flags & REQ_FLUSH_SEQ)) {
+		atomic_dec(&hctx->pending_flush);
+	}
+
 	blk_mq_rq_init(hctx, rq);
 	blk_mq_put_tag(hctx->tags, tag);
 
@@ -918,7 +935,7 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	hctx = q->mq_ops->map_queue(q, ctx->cpu);
 
 	trace_block_getrq(q, bio, rw);
-	rq = __blk_mq_alloc_request(hctx, GFP_ATOMIC, false);
+	rq = __blk_mq_alloc_request(hctx, GFP_ATOMIC, false, rw);
 	if (likely(rq))
 		blk_mq_rq_ctx_init(ctx, rq, rw);
 	else {
@@ -1202,6 +1219,7 @@ static int blk_mq_init_hw_queues(struct request_queue *q,
 		hctx->queue_num = i;
 		hctx->flags = reg->flags;
 		hctx->queue_depth = reg->queue_depth;
+		hctx->reserved_tags = reg->reserved_tags;
 		hctx->cmd_size = reg->cmd_size;
 
 		blk_mq_init_cpu_notifier(&hctx->cpu_notifier,
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 3368b97..0f81528 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -36,12 +36,15 @@ struct blk_mq_hw_ctx {
 	struct list_head	page_list;
 	struct blk_mq_tags	*tags;
 
+	atomic_t		pending_flush;
+
 	unsigned long		queued;
 	unsigned long		run;
 #define BLK_MQ_MAX_DISPATCH_ORDER	10
 	unsigned long		dispatched[BLK_MQ_MAX_DISPATCH_ORDER];
 
 	unsigned int		queue_depth;
+	unsigned int		reserved_tags;
 	unsigned int		numa_node;
 	unsigned int		cmd_size;	/* per-request extra data */

     prev parent reply	other threads:[~2013-10-28 20:37 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-10-26 11:46 blk-mq flush fix Christoph Hellwig
2013-10-26 15:31 ` Christoph Hellwig
2013-10-27 22:29   ` Jens Axboe
2013-10-28  8:48     ` Christoph Hellwig
2013-10-28 16:29       ` Jens Axboe
2013-10-28 16:46         ` Christoph Hellwig
2013-10-28 16:59           ` Jens Axboe
2013-10-28 19:30             ` Christoph Hellwig
2013-10-28 19:39               ` Jens Axboe
     [not found]         ` <CANejiEWyznEOtRAXrsgEqGoo2EWJDGBt7XH4AZFksWSmR4UY+Q@mail.gmail.com>
2013-10-28 19:38           ` Jens Axboe
2013-10-28 19:47             ` Shaohua Li [this message]

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:3e4cc9c dfblob:192c2aa dfblob:ac804c6 dfblob:fbbe0cc
dfblob:3368b97 dfblob:0f81528 )
 OR (
bs:"Re: blk-mq flush fix" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20131028194741.GA24664@kernel.org \
    --to=shli@kernel.org \
    --cc=agordeev@redhat.com \
    --cc=axboe@kernel.dk \
    --cc=hch@infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=nab@linux-iscsi.org \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).