From: Benjamin Marzinski <bmarzins@redhat.com>
To: Xiao Ni <xni@redhat.com>
Cc: Yu Kuai
<yukuai@alb-78bjiv52429oh8qptp.cn-shenzhen.alb.aliyuncs.com>,
Song Liu <song@kernel.org>, Li Nan <linan122@huawei.com>,
linux-raid@vger.kernel.org, dm-devel@lists.linux.dev,
Nigel Croxon <ncroxon@redhat.com>
Subject: Re: [PATCH v2] md/raid5: Fix UAF on IO across the reshape position
Date: Wed, 8 Apr 2026 15:57:53 -0400 [thread overview]
Message-ID: <adazQdvXjE1RlMid@redhat.com> (raw)
In-Reply-To: <CALTww28H_Qs_Oh0WkDFvVynvbyUhgUK9u0cuU7bpoF+jOg1mYA@mail.gmail.com>
On Wed, Apr 08, 2026 at 07:22:38PM +0800, Xiao Ni wrote:
> On Wed, Apr 8, 2026 at 12:35 PM Benjamin Marzinski <bmarzins@redhat.com> wrote:
> >
> > If make_stripe_request() returns STRIPE_WAIT_RESHAPE,
> > raid5_make_request() will free the cloned bio. But raid5_make_request()
> > can call make_stripe_request() multiple times, writing to the various
> > stripes. If that bio got added to the toread or towrite lists of a
> > stripe disk in an earlier call to make_stripe_request(), then it's not
> > safe to just free the bio if a later part of it is found to cross the
> > reshape position. Doing so can lead to a UAF error, when bio_endio()
> > is called on the bio for the earlier stripes.
> >
> > Instead, raid5_make_request() needs to wait until all parts of the bio
> > have called bio_endio(). To do this, bios that cross the reshape
> > position while the reshape can't make progress are flagged as needing to
> > wait for all parts to complete. When raid5_make_request() has a bio that
> > failed make_stripe_request() with STRIPE_WAIT_RESHAPE, it sets
> > bi->bi_private to a completion struct and waits for completion after
> > ending the bio. When the bio_endio() is called for the last time on a
> > clone bio with bi->bi_private set, it wakes up the waiter. This
> > guarantees that raid5_make_request() doesn't return until the cloned bio
> > needing a retry for io across the reshape boundary is safely cleaned up.
> >
> > There is a simple reproducer available at [1]. Compile the kernel with
> > KASAN for more useful reporting when the error is triggered (this is not
> > necessary to see the bug).
> >
> > [1] https://gist.github.com/bmarzins/e48598824305cf2171289e47d7241fa5
> >
> > Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
> > ---
> >
> > Changes from v1:
> > - Removed mddev->pending_retry_bios, mddev->retry_bios_wait, and
> > md_io_clone->must_retry. Instead, use a completion struct
> > pointed to by bi->bi_private, as suggested by Xiao Ni and Yu Kuai.
> >
> > drivers/md/md.c | 31 ++++++++-----------------------
> > drivers/md/md.h | 1 -
> > drivers/md/raid5.c | 7 ++++++-
> > 3 files changed, 14 insertions(+), 25 deletions(-)
> >
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index 3ce6f9e9d38e..4318d875a5f6 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -9215,9 +9215,11 @@ static void md_bitmap_end(struct mddev *mddev, struct md_io_clone *md_io_clone)
> >
> > static void md_end_clone_io(struct bio *bio)
> > {
> > - struct md_io_clone *md_io_clone = bio->bi_private;
> > + struct md_io_clone *md_io_clone = container_of(bio, struct md_io_clone,
> > + bio_clone);
> > struct bio *orig_bio = md_io_clone->orig_bio;
> > struct mddev *mddev = md_io_clone->mddev;
> > + struct completion *reshape_completion = bio->bi_private;
> >
> > if (bio_data_dir(orig_bio) == WRITE && md_bitmap_enabled(mddev, false))
> > md_bitmap_end(mddev, md_io_clone);
> > @@ -9229,7 +9231,10 @@ static void md_end_clone_io(struct bio *bio)
> > bio_end_io_acct(orig_bio, md_io_clone->start_time);
> >
> > bio_put(bio);
> > - bio_endio(orig_bio);
> > + if (unlikely(reshape_completion))
> > + complete(reshape_completion);
> > + else
> > + bio_endio(orig_bio);
> > percpu_ref_put(&mddev->active_io);
> > }
> >
> > @@ -9254,7 +9259,7 @@ static void md_clone_bio(struct mddev *mddev, struct bio **bio)
> > }
> >
> > clone->bi_end_io = md_end_clone_io;
> > - clone->bi_private = md_io_clone;
> > + clone->bi_private = NULL;
> > *bio = clone;
> > }
> >
> > @@ -9265,26 +9270,6 @@ void md_account_bio(struct mddev *mddev, struct bio **bio)
> > }
> > EXPORT_SYMBOL_GPL(md_account_bio);
> >
> > -void md_free_cloned_bio(struct bio *bio)
> > -{
> > - struct md_io_clone *md_io_clone = bio->bi_private;
> > - struct bio *orig_bio = md_io_clone->orig_bio;
> > - struct mddev *mddev = md_io_clone->mddev;
> > -
> > - if (bio_data_dir(orig_bio) == WRITE && md_bitmap_enabled(mddev, false))
> > - md_bitmap_end(mddev, md_io_clone);
> > -
> > - if (bio->bi_status && !orig_bio->bi_status)
> > - orig_bio->bi_status = bio->bi_status;
> > -
> > - if (md_io_clone->start_time)
> > - bio_end_io_acct(orig_bio, md_io_clone->start_time);
> > -
> > - bio_put(bio);
> > - percpu_ref_put(&mddev->active_io);
> > -}
> > -EXPORT_SYMBOL_GPL(md_free_cloned_bio);
> > -
> > /* md_allow_write(mddev)
> > * Calling this ensures that the array is marked 'active' so that writes
> > * may proceed without blocking. It is important to call this before
> > diff --git a/drivers/md/md.h b/drivers/md/md.h
> > index ac84289664cd..5d57fee22901 100644
> > --- a/drivers/md/md.h
> > +++ b/drivers/md/md.h
> > @@ -917,7 +917,6 @@ extern void md_finish_reshape(struct mddev *mddev);
> > void md_submit_discard_bio(struct mddev *mddev, struct md_rdev *rdev,
> > struct bio *bio, sector_t start, sector_t size);
> > void md_account_bio(struct mddev *mddev, struct bio **bio);
> > -void md_free_cloned_bio(struct bio *bio);
> >
> > extern bool __must_check md_flush_request(struct mddev *mddev, struct bio *bio);
> > void md_write_metadata(struct mddev *mddev, struct md_rdev *rdev,
> > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > index a8e8d431071b..dc0c680ca199 100644
> > --- a/drivers/md/raid5.c
> > +++ b/drivers/md/raid5.c
> > @@ -6217,7 +6217,12 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
> >
> > mempool_free(ctx, conf->ctx_pool);
> > if (res == STRIPE_WAIT_RESHAPE) {
> > - md_free_cloned_bio(bi);
> > + DECLARE_COMPLETION_ONSTACK(done);
> > + WRITE_ONCE(bi->bi_private, &done);
> > +
> > + bio_endio(bi);
>
> Hi Ben
>
> You gave an explanation why it doesn't need WRITE_ONCE. As you said,
> bio_endio uses atomic_dec_and_test, so it guarantees a full memory
> barrier. Why do you use WRITE_ONCE here?
You're correct. I don't believe it's necessary. The compiler has to
update bi->bi_private before calling bio_endio(bi), which can free bi,
and either the bio was never chained, and bi->bi_private will be read by
the same process that set it, or it was chained, and the
atomic_dec_and_test() in bio_remaining_done() will guarantee a memory
barrier.
I just patterned my updated fix off your idea, and left the WRITE_ONCE
there because it doesn't really hurt anything, since this is already the
slow (and unlikely) path. I can pull it out if you'd like.
I actually have another question about this code. My patch doesn't mess
with the code at the end of make_stripe_request() to return
STRIPE_WAIT_RESHAPE, but I'm not sure that it's right. That code
includes:
bi->bi_status = BLK_STS_RESOURCE;
This will update the orig_bio's bi_status in md_end_clone_io():
if (bio->bi_status && !orig_bio->bi_status)
orig_bio->bi_status = bio->bi_status;
For dm-raid, that orig_bio is itself a clone, and will eventually
get ended with DM_MAPIO_REQUEUE, which will requeue the actual
original bio (assuming that this happens when the device is
in a noflush suspend).
But for md, md_handle_request() can just loop and retry it. If the
mddev->pers->make_request() call succeeds on retry, the orig_bio will
still have the BLK_STS_RESOURCE status that got set when the earlier
call to make_stripe_request() returned STRIPE_WAIT_RESHAPE.
Perhaps make_stripe_request() shouldn't set bi->bi_status if
it's going to return STRIPE_WAIT_RESHAPE. The only thing I can see that
it does is set orig_bio->bi_status, which I don't think we want it to
do. Am I missing something here?
-Ben
>
> Regards
> Xiao
>
>
> > +
> > + wait_for_completion(&done);
> > return false;
> > }
> >
> > --
> > 2.53.0
> >
next prev parent reply other threads:[~2026-04-08 19:58 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-08 4:35 [PATCH v2] md/raid5: Fix UAF on IO across the reshape position Benjamin Marzinski
2026-04-08 11:22 ` Xiao Ni
2026-04-08 19:57 ` Benjamin Marzinski [this message]
2026-04-09 2:31 ` Xiao Ni
2026-04-13 2:08 ` Xiao Ni
2026-04-13 2:07 ` Xiao Ni
2026-04-19 3:51 ` Yu Kuai
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=adazQdvXjE1RlMid@redhat.com \
--to=bmarzins@redhat.com \
--cc=dm-devel@lists.linux.dev \
--cc=linan122@huawei.com \
--cc=linux-raid@vger.kernel.org \
--cc=ncroxon@redhat.com \
--cc=song@kernel.org \
--cc=xni@redhat.com \
--cc=yukuai@alb-78bjiv52429oh8qptp.cn-shenzhen.alb.aliyuncs.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.