From: Shaohua Li <shli@kernel.org>
To: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>,
Alexander Gordeev <agordeev@redhat.com>,
Tejun Heo <tj@kernel.org>,
Nicholas Bellinger <nab@linux-iscsi.org>,
linux-kernel <linux-kernel@vger.kernel.org>
Subject: Re: blk-mq flush fix
Date: Tue, 29 Oct 2013 03:47:41 +0800
Message-ID: <20131028194741.GA24664@kernel.org>
In-Reply-To: <526EBD51.1010804@kernel.dk>
On Mon, Oct 28, 2013 at 01:38:57PM -0600, Jens Axboe wrote:
> On 10/28/2013 10:57 AM, Shaohua Li wrote:
> >
> > 2013/10/28 Jens Axboe <axboe@kernel.dk>
> >
> > On 10/28/2013 02:48 AM, Christoph Hellwig wrote:
> > > On Sun, Oct 27, 2013 at 10:29:25PM +0000, Jens Axboe wrote:
> > >> On Sat, Oct 26 2013, Christoph Hellwig wrote:
> > >>> I think this variant of the patch from Alexander should fix the issue
> > >>> in a minimally invasive way. Longer term I'd prefer to use q->flush_rq
> > >>> like in the non-mq case by copying over the context and tag information.
> > >>
> > >> This one is pretty simple, we could definitely use it as a band aid. I
> > >> too would greatly prefer using the static ->flush_rq instead. Just have
> > >> it marked to bypass most of the free logic.
> > >
> > > We already bypass the free logic by setting an end_io callback for
> > > a while, similar to what the old code does. Maybe it's not all that
> > > hard to prealloc the request, let me give it a spin. Using the
> > > statically allocated one will be hard due to the driver-specific
> > > extra data, though.
> >
> > It's not that I think the existing patch is THAT bad, it fits in alright
> > with the reserved tagging and works regardless of whether a driver uses
> > reserved tags or not. And it does have the upside of not requiring
> > special checks or logic for this special non-tagged request that using
> > the preallocated one might need.
> >
> > >> I'll add this one.
> > >
> > > Gimme another day or so to figure this out.
> >
> > OK, holding off.
> >
> >
> > Another option: we could throttle flush-request allocation in
> > blk_mq_alloc_request(); for example, when flush_req_nr >= max_tags - 1,
> > make the allocation wait.
>
> That could work too. If we back off, then we could restart it once a
> request completes. That does, however, require checking that and
> potentially kicking all the queues on completion when that happens.
That doesn't sound like a big problem, because the case where flush requests
use up all the tags is very rare. The upside is that we avoid reserving a tag,
and tags are precious.
I cooked up a patch to demonstrate the idea; it is only compile-tested so far.
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 3e4cc9c..192c2aa 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -284,7 +284,6 @@ static void mq_flush_work(struct work_struct *work)
q = container_of(work, struct request_queue, mq_flush_work);
- /* We don't need set REQ_FLUSH_SEQ, it's for consistency */
rq = blk_mq_alloc_request(q, WRITE_FLUSH|REQ_FLUSH_SEQ,
__GFP_WAIT|GFP_ATOMIC);
rq->cmd_type = REQ_TYPE_FS;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index ac804c6..fbbe0cc 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -180,8 +180,21 @@ static void blk_mq_rq_ctx_init(struct blk_mq_ctx *ctx, struct request *rq,
}
static struct request *__blk_mq_alloc_request(struct blk_mq_hw_ctx *hctx,
- gfp_t gfp, bool reserved)
+ gfp_t gfp, bool reserved,
+ int rw)
{
+
+ /*
+ * A flush needs to allocate a request; leave at least one request
+ * for non-flush I/O to avoid deadlock.
+ */
+ if ((rw & REQ_FLUSH) && !(rw & REQ_FLUSH_SEQ)) {
+ atomic_inc(&hctx->pending_flush);
+ /* fall back to a waiting allocation */
+ if (atomic_read(&hctx->pending_flush) >= hctx->queue_depth -
+ hctx->reserved_tags - 1)
+ return NULL;
+ }
return blk_mq_alloc_rq(hctx, gfp, reserved);
}
@@ -195,7 +208,7 @@ static struct request *blk_mq_alloc_request_pinned(struct request_queue *q,
struct blk_mq_ctx *ctx = blk_mq_get_ctx(q);
struct blk_mq_hw_ctx *hctx = q->mq_ops->map_queue(q, ctx->cpu);
- rq = __blk_mq_alloc_request(hctx, gfp & ~__GFP_WAIT, reserved);
+ rq = __blk_mq_alloc_request(hctx, gfp & ~__GFP_WAIT, reserved, rw);
if (rq) {
blk_mq_rq_ctx_init(ctx, rq, rw);
break;
@@ -253,6 +266,10 @@ static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx,
const int tag = rq->tag;
struct request_queue *q = rq->q;
+ if ((rq->cmd_flags & REQ_FLUSH) && !(rq->cmd_flags & REQ_FLUSH_SEQ)) {
+ atomic_dec(&hctx->pending_flush);
+ }
+
blk_mq_rq_init(hctx, rq);
blk_mq_put_tag(hctx->tags, tag);
@@ -918,7 +935,7 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
hctx = q->mq_ops->map_queue(q, ctx->cpu);
trace_block_getrq(q, bio, rw);
- rq = __blk_mq_alloc_request(hctx, GFP_ATOMIC, false);
+ rq = __blk_mq_alloc_request(hctx, GFP_ATOMIC, false, rw);
if (likely(rq))
blk_mq_rq_ctx_init(ctx, rq, rw);
else {
@@ -1202,6 +1219,7 @@ static int blk_mq_init_hw_queues(struct request_queue *q,
hctx->queue_num = i;
hctx->flags = reg->flags;
hctx->queue_depth = reg->queue_depth;
+ hctx->reserved_tags = reg->reserved_tags;
hctx->cmd_size = reg->cmd_size;
blk_mq_init_cpu_notifier(&hctx->cpu_notifier,
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 3368b97..0f81528 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -36,12 +36,15 @@ struct blk_mq_hw_ctx {
struct list_head page_list;
struct blk_mq_tags *tags;
+ atomic_t pending_flush;
+
unsigned long queued;
unsigned long run;
#define BLK_MQ_MAX_DISPATCH_ORDER 10
unsigned long dispatched[BLK_MQ_MAX_DISPATCH_ORDER];
unsigned int queue_depth;
+ unsigned int reserved_tags;
unsigned int numa_node;
unsigned int cmd_size; /* per-request extra data */
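
The accounting in the patch above can be modeled in user space. What follows is a minimal sketch, not kernel code: `struct hw_ctx_model`, `flush_limit()`, `flush_admit()`, `flush_release()` and `count_admitted()` are hypothetical names, and C11 `<stdatomic.h>` stands in for the kernel's `atomic_t` helpers. It keeps the patch's comparison (increment first, then refuse once the count reaches `queue_depth - reserved_tags - 1`). One assumption goes beyond the posted hunk: as posted, the hunk appears to leave `pending_flush` incremented on the path that returns NULL, so the sketch undoes the increment whenever it refuses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the fields the patch adds to blk_mq_hw_ctx. */
struct hw_ctx_model {
	atomic_int pending_flush;	/* admitted flush requests */
	unsigned int queue_depth;	/* total number of tags */
	unsigned int reserved_tags;	/* tags set aside by the driver */
};

/* Threshold from the patch: queue_depth - reserved_tags - 1. */
static int flush_limit(const struct hw_ctx_model *h)
{
	return (int)h->queue_depth - (int)h->reserved_tags - 1;
}

/*
 * Try to admit one flush request.  Returns true if the caller may go on
 * to allocate a tag, false if it should back off and wait.
 * Assumption: a refused attempt must not stay counted, so the increment
 * is undone before returning false.
 */
static bool flush_admit(struct hw_ctx_model *h)
{
	int now = atomic_fetch_add(&h->pending_flush, 1) + 1;

	if (now >= flush_limit(h)) {
		atomic_fetch_sub(&h->pending_flush, 1);
		return false;
	}
	return true;
}

/* Called when an admitted flush request is freed. */
static void flush_release(struct hw_ctx_model *h)
{
	int prev = atomic_fetch_sub(&h->pending_flush, 1);

	assert(prev > 0);	/* every release must pair with an admit */
	(void)prev;
}

/* Helper for experiments: how many of 'attempts' back-to-back flushes get in. */
static int count_admitted(unsigned int depth, unsigned int reserved,
			  int attempts)
{
	struct hw_ctx_model h = { .queue_depth = depth,
				  .reserved_tags = reserved };
	int admitted = 0;

	atomic_init(&h.pending_flush, 0);
	for (int i = 0; i < attempts; i++)
		if (flush_admit(&h))
			admitted++;
	return admitted;
}
```

With `queue_depth = 4` and `reserved_tags = 1` the threshold is 2, so the first flush is admitted and a second one is refused until the first is released. Because the comparison runs after the increment and uses `>=`, the model admits at most `queue_depth - reserved_tags - 2` flushes at a time.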