From: Peter Zijlstra <peterz@infradead.org>
To: Christoph Hellwig <hch@lst.de>
Cc: Hou Tao <houtao1@huawei.com>, Jens Axboe <axboe@kernel.dk>,
Alexander Viro <viro@zeniv.linux.org.uk>,
linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org,
yukuai3@huawei.com, paulmck@kernel.org, will@kernel.org
Subject: Re: [PATCH] block: ensure the memory order between bi_private and bi_status
Date: Thu, 15 Jul 2021 10:13:48 +0200 [thread overview]
Message-ID: <20210715081348.GG2725@worktop.programming.kicks-ass.net> (raw)
In-Reply-To: <20210715070148.GA8088@lst.de>
On Thu, Jul 15, 2021 at 09:01:48AM +0200, Christoph Hellwig wrote:
> On Thu, Jul 01, 2021 at 07:35:37PM +0800, Hou Tao wrote:
> > When running stress test on null_blk under linux-4.19.y, the following
> > warning is reported:
> >
> > percpu_ref_switch_to_atomic_rcu: percpu ref (css_release) <= 0 (-3) after switching to atomic
> >
> > The cause is that css_put() is invoked twice on the same bio as shown below:
> >
> > CPU 1: CPU 2:
> >
> > // IO completion kworker // IO submit thread
> > __blkdev_direct_IO_simple
> > submit_bio
> >
> > bio_endio
> > bio_uninit(bio)
> > css_put(bi_css)
> > bi_css = NULL
> > set_current_state(TASK_UNINTERRUPTIBLE)
> > bio->bi_end_io
> > blkdev_bio_end_io_simple
> > bio->bi_private = NULL
> > // bi_private is NULL
> > READ_ONCE(bio->bi_private)
> > wake_up_process
> > smp_mb__after_spinlock
> >
> > bio_unint(bio)
> > // read bi_css as no-NULL
> > // so call css_put() again
> > css_put(bi_css)
> >
> > Because there is no memory barriers between the reading and the writing of
> > bi_private and bi_css, so reading bi_private as NULL can not guarantee
> > bi_css will also be NULL on weak-memory model host (e.g, ARM64).
> >
> > For the latest kernel source, css_put() has been removed from bio_unint(),
> > but the memory-order problem still exists, because the order between
> > bio->bi_private and {bi_status|bi_blkg} is also assumed in
> > __blkdev_direct_IO_simple(). It is reproducible that
> > __blkdev_direct_IO_simple() may read bi_status as 0 event if
> > bi_status is set as an errno in req_bio_endio().
> >
> > In __blkdev_direct_IO(), the memory order between dio->waiter and
> > dio->bio.bi_status is not guaranteed neither. Until now it is unable to
> > reproduce it, maybe because dio->waiter and dio->bio.bi_status are
> > in the same cache-line. But it is better to add guarantee for memory
> > order.
Cachelines don't guarantee anything, you can get partial forwards.
> > Fixing it by using smp_load_acquire() & smp_store_release() to guarantee
> > the order between {bio->bi_private|dio->waiter} and {bi_status|bi_blkg}.
> >
> > Fixes: 189ce2b9dcc3 ("block: fast-path for small and simple direct I/O requests")
>
> This obviously does not look broken, but smp_load_acquire /
> smp_store_release is way beyond my paygrade. Adding some CCs.
This block stuff is a bit beyond me, lets see if we can make sense of
it.
> > Signed-off-by: Hou Tao <houtao1@huawei.com>
> > ---
> > fs/block_dev.c | 19 +++++++++++++++----
> > 1 file changed, 15 insertions(+), 4 deletions(-)
> >
> > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > index eb34f5c357cf..a602c6315b0b 100644
> > --- a/fs/block_dev.c
> > +++ b/fs/block_dev.c
> > @@ -224,7 +224,11 @@ static void blkdev_bio_end_io_simple(struct bio *bio)
> > {
> > struct task_struct *waiter = bio->bi_private;
> >
> > - WRITE_ONCE(bio->bi_private, NULL);
> > + /*
> > + * Paired with smp_load_acquire in __blkdev_direct_IO_simple()
> > + * to ensure the order between bi_private and bi_xxx
> > + */
This comment doesn't help me; where are the other stores? Presumably
somewhere before this is called, but how does one go about finding them?
The Changelog seems to suggest you only care about bi_css, not bi_xxx in
general. In specific you can only care about stores that happen before
this; is all of bi_xxx written before here? If not, you have to be more
specific.
Also, this being a clear, this very much isn't the typical publish
pattern.
On top of all that, smp_wmb() would be sufficient here and would be
cheaper on some platforms (ARM comes to mind).
> > + smp_store_release(&bio->bi_private, NULL);
> > blk_wake_io_task(waiter);
> > }
> >
> > @@ -283,7 +287,8 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
> > qc = submit_bio(&bio);
> > for (;;) {
> > set_current_state(TASK_UNINTERRUPTIBLE);
> > - if (!READ_ONCE(bio.bi_private))
> > + /* Refer to comments in blkdev_bio_end_io_simple() */
> > + if (!smp_load_acquire(&bio.bi_private))
> > break;
> > if (!(iocb->ki_flags & IOCB_HIPRI) ||
> > !blk_poll(bdev_get_queue(bdev), qc, true))
That comment there doesn't help me find any relevant later loads and is
thus again inadequate.
Here the purpose seems to be to ensure the bi_css load happens after the
bi_private load, and this again is cheaper done using smp_rmb().
Also, the implication seems to be -- but is not spelled out anywhere --
that if bi_private is !NULL, it is stable.
> > @@ -353,7 +358,12 @@ static void blkdev_bio_end_io(struct bio *bio)
> > } else {
> > struct task_struct *waiter = dio->waiter;
> >
> > - WRITE_ONCE(dio->waiter, NULL);
> > + /*
> > + * Paired with smp_load_acquire() in
> > + * __blkdev_direct_IO() to ensure the order between
> > + * dio->waiter and bio->bi_xxx
> > + */
> > + smp_store_release(&dio->waiter, NULL);
> > blk_wake_io_task(waiter);
> > }
> > }
> > @@ -478,7 +488,8 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
> >
> > for (;;) {
> > set_current_state(TASK_UNINTERRUPTIBLE);
> > - if (!READ_ONCE(dio->waiter))
> > + /* Refer to comments in blkdev_bio_end_io */
> > + if (!smp_load_acquire(&dio->waiter))
> > break;
> >
> > if (!(iocb->ki_flags & IOCB_HIPRI) ||
Idem for these...
next prev parent reply other threads:[~2021-07-15 8:14 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-07-01 11:35 [PATCH] block: ensure the memory order between bi_private and bi_status Hou Tao
2021-07-07 6:29 ` Hou Tao
2021-07-13 1:14 ` Hou Tao
2021-07-15 7:01 ` Christoph Hellwig
2021-07-15 8:13 ` Peter Zijlstra [this message]
2021-07-16 9:02 ` Hou Tao
2021-07-16 10:19 ` Peter Zijlstra
2021-07-19 18:09 ` Paul E. McKenney
2021-07-19 18:16 ` Paul E. McKenney
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20210715081348.GG2725@worktop.programming.kicks-ass.net \
--to=peterz@infradead.org \
--cc=axboe@kernel.dk \
--cc=hch@lst.de \
--cc=houtao1@huawei.com \
--cc=linux-block@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=paulmck@kernel.org \
--cc=viro@zeniv.linux.org.uk \
--cc=will@kernel.org \
--cc=yukuai3@huawei.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.