public inbox for linux-kernel@vger.kernel.org
From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Benjamin Marzinski <bmarzins@redhat.com>,
	Xiao Ni <xni@redhat.com>, Yu Kuai <yukuai@fnnas.com>,
	Sasha Levin <sashal@kernel.org>,
	song@kernel.org, linux-raid@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH AUTOSEL 7.0-6.18] md/raid5: Fix UAF on IO across the reshape position
Date: Tue,  5 May 2026 05:51:35 -0400
Message-ID: <20260505095149.512052-19-sashal@kernel.org>
In-Reply-To: <20260505095149.512052-1-sashal@kernel.org>

From: Benjamin Marzinski <bmarzins@redhat.com>

[ Upstream commit 418b3e64e4459feb3f75979de9ec89e085745343 ]

If make_stripe_request() returns STRIPE_WAIT_RESHAPE,
raid5_make_request() will free the cloned bio. But raid5_make_request()
can call make_stripe_request() multiple times, writing to the various
stripes. If that bio got added to the toread or towrite lists of a
stripe disk in an earlier call to make_stripe_request(), then it's not
safe to just free the bio if a later part of it is found to cross the
reshape position. Doing so can lead to a UAF error, when bio_endio()
is called on the bio for the earlier stripes.

Instead, raid5_make_request() needs to wait until all parts of the bio
have called bio_endio(). To do this, bios that cross the reshape
position while the reshape can't make progress are flagged as needing to
wait for all parts to complete. When raid5_make_request() has a bio that
failed make_stripe_request() with STRIPE_WAIT_RESHAPE, it sets
bi->bi_private to a completion struct and waits for completion after
ending the bio.  When the bio_endio() is called for the last time on a
clone bio with bi->bi_private set, it wakes up the waiter. This
guarantees that raid5_make_request() doesn't return until the cloned bio
needing a retry for io across the reshape boundary is safely cleaned up.

There is a simple reproducer available at [1]. Compile the kernel with
KASAN for more useful reporting when the error is triggered (this is not
necessary to see the bug).

[1] https://gist.github.com/bmarzins/e48598824305cf2171289e47d7241fa5

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/r/20260408043548.1695157-1-bmarzins@redhat.com
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

## Phase Walkthrough
1. Commit message forensics
Record: subsystem `md/raid5`; action verb `Fix`; intent is to prevent a
use-after-free when RAID5 IO crosses a reshape position and
`make_stripe_request()` returns `STRIPE_WAIT_RESHAPE`. Tags found:
`Signed-off-by: Benjamin Marzinski`, `Reviewed-by: Xiao Ni`, lore
`Link`, `Signed-off-by: Yu Kuai`; no `Fixes:`, `Reported-by:`,
`Tested-by:`, or `Cc: stable`. Body describes a real UAF, a KASAN-aided
reproducer, and the root cause: a cloned bio can already be linked into
earlier stripe `toread`/`towrite` lists when a later stripe path frees
it.

2. Diff analysis
Record: 3 files, `14 insertions/25 deletions`; functions changed are
`md_end_clone_io()`, `md_clone_bio()`, removed `md_free_cloned_bio()`,
and `raid5_make_request()`. Before: `STRIPE_WAIT_RESHAPE` directly
called `md_free_cloned_bio(bi)`. After: it sets `bi_private` to a stack
completion, calls `bio_endio(bi)`, and waits until the cloned bio’s
final endio completes. Bug category: memory safety/UAF, caused by
freeing a clone still referenced by stripe bio chains. Fix is small and
contained.
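
Because `bi_private` is repurposed to carry the completion pointer, the
diff recovers the `md_io_clone` from the embedded `bio_clone` member via
`container_of()` instead. A minimal userspace sketch of that pointer
arithmetic follows; `sketch_bio`, `sketch_io_clone`,
`sketch_container_of`, and `clone_from_bio()` are hypothetical stand-ins
for illustration, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace equivalent of the kernel's container_of() (illustrative only). */
#define sketch_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for struct bio. */
struct sketch_bio {
	void *private_data;	/* models bi_private */
};

/* Hypothetical stand-in for struct md_io_clone: the bio is embedded. */
struct sketch_io_clone {
	unsigned long start_time;
	struct sketch_bio bio_clone;
};

/* Recover the enclosing clone from a pointer to its embedded bio. */
static struct sketch_io_clone *clone_from_bio(struct sketch_bio *bio)
{
	return sketch_container_of(bio, struct sketch_io_clone, bio_clone);
}
```

With this layout the end-io handler no longer needs `bi_private` to find
its context, which leaves that field free for the reshape waiter.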

3. Git history investigation
Record: upstream commit is `418b3e64e4459`. Blame shows the problematic
`STRIPE_WAIT_RESHAPE`/`md_free_cloned_bio()` path came from
`41425f96d7aa` (`dm-raid456, md/raid456: fix a deadlock...`), first
contained in `v6.9-rc1`/`v6.9`. That introducing commit was itself
stable-marked for `v6.7+` and is present in checked stable branches
`6.12.y`, `6.18.y`, `6.19.y`, and `7.0.y`; the specific buggy
helper/path was not found in `6.6.y` or `6.1.y`. No `Fixes:` tag exists,
so blame was used instead.

4. Mailing list and external research
Record: `b4 dig -c 418b3e64e4459` found `[PATCH v2]` at
`https://patch.msgid.link/20260408043548.1695157-1-bmarzins@redhat.com`.
`b4 dig -a` found v1 RFC and v2; committed patch matches v2. `b4 dig -w`
shows `linux-raid`, `dm-devel`, Song Liu, Yu Kuai, Li Nan, Xiao Ni, and
Red Hat participants were included. Thread review: Xiao Ni gave
`Reviewed-by`; Yu Kuai replied “Applied.” Reviewer asked about
`WRITE_ONCE`; author explained it was unnecessary but harmless on the
slow path, and Xiao accepted keeping it. No NAKs found. WebFetch for
lore was blocked by Anubis, but b4 retrieved the mbox. The gist
reproducer uses dmsetup/LVM RAID5 reshape loops.

5. Code semantic analysis
Record: `md_submit_bio()` reaches `md_handle_request()`, which calls the
RAID personality `.make_request = raid5_make_request` for RAID4/5/6.
`raid5_make_request()` calls `make_stripe_request()` repeatedly over
stripe bits. `add_all_stripe_bios()` calls `__add_stripe_bio()`, which
links the cloned bio into `toread`/`towrite` and calls
`bio_inc_remaining()`. `bio_endio()` only invokes `bi_end_io` after
`__bi_remaining` reaches zero, so the completion wait correctly waits
for all earlier stripe references to drain.
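
The counting behaviour described above can be modelled in a few lines of
single-threaded userspace C. This is only a sketch under stated
assumptions: `model_bio`, `model_completion`, and every `model_*`
function are hypothetical stand-ins, the completion is reduced to a
flag, and the loop plays the role of stripes finishing their IO.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bio; not the kernel type. */
struct model_bio {
	int remaining;				/* models __bi_remaining */
	void (*end_io)(struct model_bio *);	/* models bi_end_io */
	void *private_data;			/* models bi_private */
};

/* Hypothetical stand-in for struct completion. */
struct model_completion {
	bool done;
};

static void model_bio_init(struct model_bio *b,
			   void (*end_io)(struct model_bio *))
{
	b->remaining = 1;	/* the submitter holds the initial count */
	b->end_io = end_io;
	b->private_data = NULL;
}

/* Models bio_inc_remaining(): one more party must call endio. */
static void model_bio_inc_remaining(struct model_bio *b)
{
	b->remaining++;
}

/* Models bio_endio(): the callback runs only when the count hits zero. */
static void model_bio_endio(struct model_bio *b)
{
	assert(b->remaining > 0);	/* catch over-completion */
	if (--b->remaining > 0)
		return;
	if (b->end_io)
		b->end_io(b);
}

/* Models the patched end-io handler: wake the waiter if one is set. */
static void model_end_clone_io(struct model_bio *b)
{
	struct model_completion *c = b->private_data;

	if (c)
		c->done = true;
}

/* Returns how many endio calls were needed before the waiter woke up. */
static int model_wait_reshape(int stripes_attached)
{
	struct model_bio bio;
	struct model_completion done = { false };
	int calls = 0;

	model_bio_init(&bio, model_end_clone_io);
	for (int i = 0; i < stripes_attached; i++)
		model_bio_inc_remaining(&bio);	/* earlier stripes hold the bio */

	bio.private_data = &done;
	model_bio_endio(&bio);			/* submitter drops its count */
	calls++;
	while (!done.done) {			/* stripes complete one by one */
		model_bio_endio(&bio);
		calls++;
	}
	return calls;
}
```

In this model, a bio already attached to two stripes needs three endio
calls before the waiter is released, which is why freeing it right after
`STRIPE_WAIT_RESHAPE` was too early.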

6. Stable tree analysis
Record: buggy code exists in `stable/linux-7.0.y`, `6.19.y`, `6.18.y`,
and `6.12.y`; not in checked `6.6.y`/`6.1.y`. Patch applies cleanly to
current `7.0.y` with `git apply --check`. Direct upstream patch does not
apply cleanly to `6.19.y` and `6.12.y` because nearby context differs
(`ctx` allocation/field access and older bitmap helpers), but the same
core code is present, so a small backport adjustment is needed.

7. Subsystem context
Record: subsystem is software RAID / MD, `drivers/md`; MAINTAINERS lists
Song Liu and Yu Kuai as maintainers and `linux-raid@vger.kernel.org` as
the list. Criticality: important storage subsystem, with data
availability and kernel memory-safety impact during RAID reshape.
Activity level is high; recent logs show multiple md/raid fixes and
refactors.

8. Impact and risk assessment
Record: affected users are md/dm RAID4/5/6 users doing IO while reshape
is blocked/interrupted. Trigger is specialized but real; reproducer
exists. Admin privileges are generally needed to create/reshape the
array, but ordinary IO can hit the path once the system is in that
state. Failure mode is UAF after freeing a cloned bio still referenced
by stripe lists; severity HIGH/CRITICAL because it can crash or corrupt
memory. Benefit is high for affected stable users; risk is low to
medium: small fix, no API change, but it touches bio completion
semantics in a storage path.

9. Final synthesis
Record: Evidence for backporting: explicit UAF, available reproducer,
KASAN-aided diagnosis, small contained patch, reviewed by a linux-raid
reviewer, applied by an md maintainer, buggy code present in active
stable trees from `6.12.y` upward. Evidence against: not applicable to older
`6.6.y`/`6.1.y`; older applicable trees need a manual backport rather
than direct cherry-pick. Stable rules: obviously correct by code
inspection and review; fixes a real serious bug; small and contained; no
new features/APIs; applies cleanly to `7.0.y`, needs minor backport for
older affected branches. Exception categories: none needed.

## Verification
- Phase 1: `git show --format=fuller --stat 418b3e64e4459` verified
  subject, body, tags, reviewer, lore link, and line counts.
- Phase 2: `git show --no-ext-diff 418b3e64e4459` verified exact changes
  to `drivers/md/md.c`, `drivers/md/md.h`, and `drivers/md/raid5.c`.
- Phase 3: `git blame` verified `41425f96d7aa` introduced
  `STRIPE_WAIT_RESHAPE` freeing via `md_free_cloned_bio()`.
- Phase 3: `git describe --contains 41425f96d7aa` and tag checks
  verified first mainline containment around `v6.9`.
- Phase 4: `b4 dig -c/-a/-w/-m 418b3e64e4459` verified v1/v2 series,
  recipients, review, and applied response.
- Phase 4: WebFetch verified the raw gist reproducer content; lore
  WebFetch was blocked by Anubis.
- Phase 5: `rg`/file reads verified call chain, stripe list insertion,
  `bio_inc_remaining()`, and `bio_endio()` final-callback semantics.
- Phase 6: `git grep` verified buggy code in `6.12.y`, `6.18.y`,
  `6.19.y`, and `7.0.y`, absent from checked `6.6.y` and `6.1.y`.
- Phase 6: `git apply --check` verified clean apply to `7.0.y`; worktree
  checks showed direct patch conflicts on `6.19.y` and `6.12.y`.
- Unverified: I did not run the reproducer locally or test a built
  kernel.

This is stable material for affected trees, especially `6.12.y+`, with
backport adjustment where context differs.

**YES**

 drivers/md/md.c    | 31 ++++++++-----------------------
 drivers/md/md.h    |  1 -
 drivers/md/raid5.c |  7 ++++++-
 3 files changed, 14 insertions(+), 25 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 3ce6f9e9d38e6..4318d875a5f63 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9215,9 +9215,11 @@ static void md_bitmap_end(struct mddev *mddev, struct md_io_clone *md_io_clone)
 
 static void md_end_clone_io(struct bio *bio)
 {
-	struct md_io_clone *md_io_clone = bio->bi_private;
+	struct md_io_clone *md_io_clone = container_of(bio, struct md_io_clone,
+						       bio_clone);
 	struct bio *orig_bio = md_io_clone->orig_bio;
 	struct mddev *mddev = md_io_clone->mddev;
+	struct completion *reshape_completion = bio->bi_private;
 
 	if (bio_data_dir(orig_bio) == WRITE && md_bitmap_enabled(mddev, false))
 		md_bitmap_end(mddev, md_io_clone);
@@ -9229,7 +9231,10 @@ static void md_end_clone_io(struct bio *bio)
 		bio_end_io_acct(orig_bio, md_io_clone->start_time);
 
 	bio_put(bio);
-	bio_endio(orig_bio);
+	if (unlikely(reshape_completion))
+		complete(reshape_completion);
+	else
+		bio_endio(orig_bio);
 	percpu_ref_put(&mddev->active_io);
 }
 
@@ -9254,7 +9259,7 @@ static void md_clone_bio(struct mddev *mddev, struct bio **bio)
 	}
 
 	clone->bi_end_io = md_end_clone_io;
-	clone->bi_private = md_io_clone;
+	clone->bi_private = NULL;
 	*bio = clone;
 }
 
@@ -9265,26 +9270,6 @@ void md_account_bio(struct mddev *mddev, struct bio **bio)
 }
 EXPORT_SYMBOL_GPL(md_account_bio);
 
-void md_free_cloned_bio(struct bio *bio)
-{
-	struct md_io_clone *md_io_clone = bio->bi_private;
-	struct bio *orig_bio = md_io_clone->orig_bio;
-	struct mddev *mddev = md_io_clone->mddev;
-
-	if (bio_data_dir(orig_bio) == WRITE && md_bitmap_enabled(mddev, false))
-		md_bitmap_end(mddev, md_io_clone);
-
-	if (bio->bi_status && !orig_bio->bi_status)
-		orig_bio->bi_status = bio->bi_status;
-
-	if (md_io_clone->start_time)
-		bio_end_io_acct(orig_bio, md_io_clone->start_time);
-
-	bio_put(bio);
-	percpu_ref_put(&mddev->active_io);
-}
-EXPORT_SYMBOL_GPL(md_free_cloned_bio);
-
 /* md_allow_write(mddev)
  * Calling this ensures that the array is marked 'active' so that writes
  * may proceed without blocking.  It is important to call this before
diff --git a/drivers/md/md.h b/drivers/md/md.h
index ac84289664cd7..5d57fee22901f 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -917,7 +917,6 @@ extern void md_finish_reshape(struct mddev *mddev);
 void md_submit_discard_bio(struct mddev *mddev, struct md_rdev *rdev,
 			struct bio *bio, sector_t start, sector_t size);
 void md_account_bio(struct mddev *mddev, struct bio **bio);
-void md_free_cloned_bio(struct bio *bio);
 
 extern bool __must_check md_flush_request(struct mddev *mddev, struct bio *bio);
 void md_write_metadata(struct mddev *mddev, struct md_rdev *rdev,
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index a8e8d431071ba..dc0c680ca199b 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6217,7 +6217,12 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 
 	mempool_free(ctx, conf->ctx_pool);
 	if (res == STRIPE_WAIT_RESHAPE) {
-		md_free_cloned_bio(bi);
+		DECLARE_COMPLETION_ONSTACK(done);
+		WRITE_ONCE(bi->bi_private, &done);
+
+		bio_endio(bi);
+
+		wait_for_completion(&done);
 		return false;
 	}
 
-- 
2.53.0

