From: Ming Lei <ming.lei@redhat.com>
To: JeffleXu <jefflexu@linux.alibaba.com>
Cc: axboe@kernel.dk, linux-block@vger.kernel.org,
joseph.qi@linux.alibaba.com, xiaoguang.wang@linux.alibaba.com,
Christoph Hellwig <hch@lst.de>
Subject: Re: [PATCH] block: set NOWAIT for sync polling
Date: Wed, 14 Oct 2020 17:49:06 +0800 [thread overview]
Message-ID: <20201014094906.GD775684@T590> (raw)
In-Reply-To: <17357ee1-7662-f20b-0a49-2fb3fdf01ebc@linux.alibaba.com>
On Wed, Oct 14, 2020 at 04:31:49PM +0800, JeffleXu wrote:
> What about just disabling HIPRI in preadv2(2)/pwritev2(2)? Christoph Hellwig
> disabled HIPRI for libaio in
Then people will complain that poll can't be used for sync dio, and it is
an regression.
>
> commit 154989e45fd8de9bfb52bbd6e5ea763e437e54c5 ("aio: clear IOCB_HIPRI").
> What do you think, @Christoph?
>
> (cc Christoph Hellwig)
>
>
> > static inline void bio_set_polled(struct bio *bio, struct kiocb *kiocb)
> > {
> > - bio->bi_opf |= REQ_HIPRI;
> > - if (!is_sync_kiocb(kiocb))
> > - bio->bi_opf |= REQ_NOWAIT;
> > + bio->bi_opf |= REQ_HIPRI | REQ_NOWAIT;
> > }
>
> The original patch indeed could not fix the problem. Though it could fix the
> potential deadlock,
>
> the VFS code read(2)/write(2) is not ready by handling the returned -EAGAIN
> gracefully. Currently
>
> read(2)/write(2) will just return -EAGAIN to user space.
>
>
>
> On 10/13/20 8:09 PM, Ming Lei wrote:
> > On Tue, Oct 13, 2020 at 04:40:51PM +0800, Jeffle Xu wrote:
> > > Sync polling also needs REQ_NOWAIT flag. One sync read/write may be
> > > split into several bios (and thus several requests), and can used up the
> > > queue depth sometimes. Thus the following bio in the same sync
> > > read/write will wait for usable request if REQ_NOWAIT flag not set, in
> > > which case the following sync polling will cause a deadlock.
> > >
> > > One case (maybe the only case) for above situation is preadv2/pwritev2
> > > + direct + highpri. Two conditions need to be satisfied to trigger the
> > > deadlock.
> > >
> > > 1. HIPRI IO in sync routine. Normal read(2)/pread(2)/readv(2)/preadv(2)
> > > and corresponding write family syscalls don't support high-priority IO and
> > > thus won't trigger polling routine. Only preadv2(2)/pwritev2(2) supports
> > > high-priority IO by RWF_HIPRI flag of @flags parameter.
> > >
> > > 2. Polling support in sync routine. Currently both the blkdev and
> > > iomap-based fs (ext4/xfs, etc) support polling in direct IO routine. The
> > > general routine is described as follows.
> > >
> > > submit_bio
> > > wait for blk_mq_get_tag(), waiting for requests completion, which
> > > should be done by the following polling, thus causing a deadlock.
> > Another blocking point is rq_qos_throttle(),
>
> What is the issue here in rq_qos_throttle()? More details?
Like allocating request, rq_qos_throttle() may wait until rq completion
is done, see wbt_wait() and wbt_done().
>
>
> > so I guess falling back to
> > REQ_NOWAIT may not fix the issue completely.
>
>
>
> > Given iopoll isn't supposed to in case of big IO, another solution
> > may be to disable iopoll when bio splitting is needed, something
> > like the following change:
> >
> > diff --git a/block/blk-merge.c b/block/blk-merge.c
> > index bcf5e4580603..8e762215660b 100644
> > --- a/block/blk-merge.c
> > +++ b/block/blk-merge.c
> > @@ -279,6 +279,12 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
> > return NULL;
> > split:
> > *segs = nsegs;
> > +
> > + /*
> > + * bio splitting may cause more trouble for iopoll which isn't supposed
> > + * to be used in case of big IO
> > + */
> > + bio->bi_opf &= ~REQ_HIPRI;
> > return bio_split(bio, sectors, GFP_NOIO, bs);
> > }
>
> Actually split is not only from blk_mq_submit_bio->__blk_queue_split. In
> __blkdev_direct_IO,
>
> one input iov_iter could be split to several bios.
>
> ```
>
> __blkdev_direct_IO:
>
> for (;;) {
> ret = bio_iov_iter_get_pages(bio, iter);
> submit_bio(bio);
> }
>
> for (;;) {
> blk_poll()
>
> ...
>
> }
>
> ```
>
> Since one single bio can contain at most BIO_MAX_PAGES, i.e. 256 bio_vec in
> @bio->bi_io_vec,
As Jens mentioned, it is weird to poll on multiple bios, so we can
disable io poll simply in __blkdev_direct_IO(). And it is reasonable from
user's viewpoint to not poll on big bio cause io poll is often used
in latency sensitive cases.
>
> if the @iovcnt parameter of preadv2(2)/pwritev2(2) is larger than 256, then
> one call of
>
> preadv2(2)/pwritev2(2) can be split into several bios. These bios are
> submitted at once, and then
>
> start sync polling in the process context.
>
>
> If the number of bios split from one call of preadv2(2)/pwritev2(2) is
> larger than the queue depth,
>
> bios from single preadv2(2)/pwritev2(2) call can exhaust the queue depth and
> thus cause deadlock.
>
> Fortunately the maximum of @iovcnt parameter of preadv2(2)/pwritev2(2) is
> UIO_MAXIOV, i.e. 1024,
>
> and the minimum of queue depth is BLKDEV_MIN_RQ i.e. 4. That means one
> preadv2(2)/pwritev2(2)
>
> call can submit at most 4 bios, which will fill up the queue depth exactly
> and there's no deadlock in this
>
> case. I'm not sure if the setting of UIO_MAXIOV/BIO_MAX_PAGES/BLKDEV_MIN_RQ
> is coincident or
>
> deliberately tuned. At least it will not cause deadlock currently , though
> the constraint may be a little fragile.
>
>
> By the way, this patch could fix the potential hang I mentioned in
>
> https://patchwork.kernel.org/project/linux-block/patch/20200911032958.125068-1-jefflexu@linux.alibaba.com/
Right, I remembered that race.
Thanks,
Ming
next prev parent reply other threads:[~2020-10-14 9:49 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-10-13 8:40 [PATCH] block: set NOWAIT for sync polling Jeffle Xu
2020-10-13 12:09 ` Ming Lei
2020-10-13 19:28 ` Jens Axboe
2020-10-14 8:31 ` JeffleXu
2020-10-14 9:49 ` Ming Lei [this message]
2020-10-15 7:43 ` Christoph Hellwig
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20201014094906.GD775684@T590 \
--to=ming.lei@redhat.com \
--cc=axboe@kernel.dk \
--cc=hch@lst.de \
--cc=jefflexu@linux.alibaba.com \
--cc=joseph.qi@linux.alibaba.com \
--cc=linux-block@vger.kernel.org \
--cc=xiaoguang.wang@linux.alibaba.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).