linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Ming Lei <ming.lei@redhat.com>
To: JeffleXu <jefflexu@linux.alibaba.com>
Cc: axboe@kernel.dk, linux-block@vger.kernel.org,
	joseph.qi@linux.alibaba.com, xiaoguang.wang@linux.alibaba.com,
	Christoph Hellwig <hch@lst.de>
Subject: Re: [PATCH] block: set NOWAIT for sync polling
Date: Wed, 14 Oct 2020 17:49:06 +0800	[thread overview]
Message-ID: <20201014094906.GD775684@T590> (raw)
In-Reply-To: <17357ee1-7662-f20b-0a49-2fb3fdf01ebc@linux.alibaba.com>

On Wed, Oct 14, 2020 at 04:31:49PM +0800, JeffleXu wrote:
> What about just disabling HIPRI in preadv2(2)/pwritev2(2)? Christoph Hellwig
> disabled HIPRI for libaio in

Then people will complain that poll can't be used for sync dio, and it is
an regression.

> 
> commit 154989e45fd8de9bfb52bbd6e5ea763e437e54c5 ("aio: clear IOCB_HIPRI").
> What do you think, @Christoph?
> 
> (cc Christoph Hellwig)
> 
> 
> >   static inline void bio_set_polled(struct bio *bio, struct kiocb *kiocb)
> >   {
> > -	bio->bi_opf |= REQ_HIPRI;
> > -	if (!is_sync_kiocb(kiocb))
> > -		bio->bi_opf |= REQ_NOWAIT;
> > +	bio->bi_opf |= REQ_HIPRI | REQ_NOWAIT;
> >   }
> 
> The original patch indeed could not fix the problem. Though it could fix the
> potential deadlock,
> 
> the VFS code read(2)/write(2) is not ready by handling the returned -EAGAIN
> gracefully. Currently
> 
> read(2)/write(2) will just return -EAGAIN to user space.
> 
> 
> 
> On 10/13/20 8:09 PM, Ming Lei wrote:
> > On Tue, Oct 13, 2020 at 04:40:51PM +0800, Jeffle Xu wrote:
> > > Sync polling also needs REQ_NOWAIT flag. One sync read/write may be
> > > split into several bios (and thus several requests), and can used up the
> > > queue depth sometimes. Thus the following bio in the same sync
> > > read/write will wait for usable request if REQ_NOWAIT flag not set, in
> > > which case the following sync polling will cause a deadlock.
> > > 
> > > One case (maybe the only case) for above situation is preadv2/pwritev2
> > > + direct + highpri. Two conditions need to be satisfied to trigger the
> > > deadlock.
> > > 
> > > 1. HIPRI IO in sync routine. Normal read(2)/pread(2)/readv(2)/preadv(2)
> > > and corresponding write family syscalls don't support high-priority IO and
> > > thus won't trigger polling routine. Only preadv2(2)/pwritev2(2) supports
> > > high-priority IO by RWF_HIPRI flag of @flags parameter.
> > > 
> > > 2. Polling support in sync routine. Currently both the blkdev and
> > > iomap-based fs (ext4/xfs, etc) support polling in direct IO routine. The
> > > general routine is described as follows.
> > > 
> > > submit_bio
> > >    wait for blk_mq_get_tag(), waiting for requests completion, which
> > >    should be done by the following polling, thus causing a deadlock.
> > Another blocking point is rq_qos_throttle(),
> 
> What is the issue here in rq_qos_throttle()? More details?

Like allocating request, rq_qos_throttle() may wait until rq completion
is done, see wbt_wait() and wbt_done().

> 
> 
> > so I guess falling back to
> > REQ_NOWAIT may not fix the issue completely.
> 
> 
> 
> > Given iopoll isn't supposed to in case of big IO, another solution
> > may be to disable iopoll when bio splitting is needed, something
> > like the following change:
> > 
> > diff --git a/block/blk-merge.c b/block/blk-merge.c
> > index bcf5e4580603..8e762215660b 100644
> > --- a/block/blk-merge.c
> > +++ b/block/blk-merge.c
> > @@ -279,6 +279,12 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
> >   	return NULL;
> >   split:
> >   	*segs = nsegs;
> > +
> > +	/*
> > +	 * bio splitting may cause more trouble for iopoll which isn't supposed
> > +	 * to be used in case of big IO
> > +	 */
> > +	bio->bi_opf &= ~REQ_HIPRI;
> >   	return bio_split(bio, sectors, GFP_NOIO, bs);
> >   }
> 
> Actually split is not only from blk_mq_submit_bio->__blk_queue_split. In
> __blkdev_direct_IO,
> 
> one input iov_iter could be split to several bios.
> 
> ```
> 
> __blkdev_direct_IO:
> 
> for (;;) {
>         ret = bio_iov_iter_get_pages(bio, iter);
>         submit_bio(bio);
> }
> 
> for (;;) {
>         blk_poll()
> 
>         ...
> 
> }
> 
> ```
> 
> Since  one single bio can contain at most BIO_MAX_PAGES, i.e. 256 bio_vec in
> @bio->bi_io_vec,

As Jens mentioned, it is weird to poll on multiple bios, so we can
disable io poll simply in __blkdev_direct_IO(). And it is reasonable from
user's viewpoint to not poll on big bio cause io poll is often used
in latency sensitive cases.

> 
> if the @iovcnt parameter of preadv2(2)/pwritev2(2) is larger than 256, then
> one call of
> 
> preadv2(2)/pwritev2(2) can be split into several bios. These bios are
> submitted at once, and then
> 
> start sync polling in the process context.
> 
> 
> If the number of bios split from one call of preadv2(2)/pwritev2(2) is
> larger than the queue depth,
> 
> bios from single preadv2(2)/pwritev2(2) call can exhaust the queue depth and
> thus cause deadlock.
> 
> Fortunately the maximum of @iovcnt parameter of preadv2(2)/pwritev2(2) is
> UIO_MAXIOV, i.e. 1024,
> 
> and the minimum of queue depth is BLKDEV_MIN_RQ i.e. 4. That means one
> preadv2(2)/pwritev2(2)
> 
> call can submit at most 4 bios, which will fill up the queue depth exactly
> and there's no deadlock in this
> 
> case. I'm not sure if the setting of UIO_MAXIOV/BIO_MAX_PAGES/BLKDEV_MIN_RQ
> is coincident or
> 
> deliberately tuned. At least it will not cause deadlock currently , though
> the constraint may be a little fragile.
> 
> 
> By the way, this patch could fix the potential hang I mentioned in
> 
> https://patchwork.kernel.org/project/linux-block/patch/20200911032958.125068-1-jefflexu@linux.alibaba.com/

Right, I remembered that race.


Thanks,
Ming


  reply	other threads:[~2020-10-14  9:49 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-13  8:40 [PATCH] block: set NOWAIT for sync polling Jeffle Xu
2020-10-13 12:09 ` Ming Lei
2020-10-13 19:28   ` Jens Axboe
2020-10-14  8:31   ` JeffleXu
2020-10-14  9:49     ` Ming Lei [this message]
2020-10-15  7:43     ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201014094906.GD775684@T590 \
    --to=ming.lei@redhat.com \
    --cc=axboe@kernel.dk \
    --cc=hch@lst.de \
    --cc=jefflexu@linux.alibaba.com \
    --cc=joseph.qi@linux.alibaba.com \
    --cc=linux-block@vger.kernel.org \
    --cc=xiaoguang.wang@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).