Subject: Re: [RFC PATCH] md/raid5: Fix UAF on IO across the reshape position
From: "Yu Kuai"
Reply-To: yukuai@fnnas.com
Date: Tue, 7 Apr 2026 14:24:53 +0800
To: "Xiao Ni", "Benjamin Marzinski"
Cc: "Yu Kuai", "Song Liu", "Li Nan", "Nigel Croxon"
References: <20260323225836.1037060-1-bmarzins@redhat.com>

Hi,

On 2026/4/1 8:22, Xiao Ni wrote:
> On Wed, Apr 1, 2026 at 12:36 AM Benjamin Marzinski wrote:
>> On Tue, Mar 31, 2026 at 10:24:48AM +0800, Xiao Ni wrote:
>>> On Tue, Mar 31, 2026 at 3:16 AM Benjamin Marzinski wrote:
>>>> On Mon, Mar 30, 2026 at 11:44:54PM +0800, Xiao Ni wrote:
>>>>> On Tue, Mar 24, 2026 at 6:58 AM Benjamin Marzinski wrote:
>>>>>> If make_stripe_request() returns STRIPE_WAIT_RESHAPE,
>>>>>> raid5_make_request() will free the cloned bio. But raid5_make_request()
>>>>>> can call make_stripe_request() multiple times, writing to the various
>>>>>> stripes. If that bio got added to the toread or towrite lists of a
>>>>>> stripe disk in an earlier call to make_stripe_request(), then it's not
>>>>>> safe to just free the bio if a later part of it is found to cross the
>>>>>> reshape position. Doing so can lead to a UAF error when bio_endio()
>>>>>> is called on the bio for the earlier stripes.
>>>>>>
>>>>>> Instead, raid5_make_request() needs to wait until all parts of the bio
>>>>>> have called bio_endio(). To do this, bios that cross the reshape
>>>>>> position while the reshape can't make progress are flagged as needing a
>>>>>> retry, and mddev tracks the number of bios needing a retry which have
>>>>>> not yet completed. When raid5_make_request() has a bio that failed
>>>>>> make_stripe_request() with STRIPE_WAIT_RESHAPE, it waits for this
>>>>>> counter to reach zero. When bio_endio() is called for the last time
>>>>>> on a bio needing a retry, it decrements mddev's count of outstanding
>>>>>> bios needing a retry. This guarantees that raid5_make_request() doesn't
>>>>>> return until the cloned bio needing a retry for IO across the reshape
>>>>>> boundary is safely cleaned up.
>>>>>>
>>>>>> There is a simple reproducer available at [1]. Compile the kernel with
>>>>>> KASAN for more useful reporting when the error is triggered (this is not
>>>>>> necessary to see the bug).
>>>>>>
>>>>>> [1] https://gist.github.com/bmarzins/e48598824305cf2171289e47d7241fa5
>>>>>>
>>>>>> Signed-off-by: Benjamin Marzinski
>>>>>> ---
>>>>>>
>>>>>> I've tested this for regressions with the lvm2-testsuite raid tests. I
>>>>>> have not run any md-specific tests on it.
>>>>>>
>>>>>>  drivers/md/md.c    | 30 +++++++++---------------------
>>>>>>  drivers/md/md.h    |  5 ++++-
>>>>>>  drivers/md/raid5.c |  8 +++++++-
>>>>>>  3 files changed, 20 insertions(+), 23 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>>>>> index 3ce6f9e9d38e..5ec116b9da32 100644
>>>>>> --- a/drivers/md/md.c
>>>>>> +++ b/drivers/md/md.c
>>>>>> @@ -776,9 +776,11 @@ int mddev_init(struct mddev *mddev)
>>>>>>          atomic_set(&mddev->active, 1);
>>>>>>          atomic_set(&mddev->openers, 0);
>>>>>>          atomic_set(&mddev->sync_seq, 0);
>>>>>> +        atomic_set(&mddev->pending_retry_bios, 0);
>>>>>>          spin_lock_init(&mddev->lock);
>>>>>>          init_waitqueue_head(&mddev->sb_wait);
>>>>>>          init_waitqueue_head(&mddev->recovery_wait);
>>>>>> +        init_waitqueue_head(&mddev->retry_bios_wait);
>>>>>>          mddev->reshape_position = MaxSector;
>>>>>>          mddev->reshape_backwards = 0;
>>>>>>          mddev->last_sync_action = ACTION_IDLE;
>>>>>> @@ -9218,6 +9220,7 @@ static void md_end_clone_io(struct bio *bio)
>>>>>>          struct md_io_clone *md_io_clone = bio->bi_private;
>>>>>>          struct bio *orig_bio = md_io_clone->orig_bio;
>>>>>>          struct mddev *mddev = md_io_clone->mddev;
>>>>>> +        unsigned int must_retry = md_io_clone->must_retry;
>>>>>>
>>>>>>          if (bio_data_dir(orig_bio) == WRITE && md_bitmap_enabled(mddev, false))
>>>>>>                  md_bitmap_end(mddev, md_io_clone);
>>>>>> @@ -9229,7 +9232,11 @@ static void md_end_clone_io(struct bio *bio)
>>>>>>                  bio_end_io_acct(orig_bio, md_io_clone->start_time);
>>>>>>
>>>>>>          bio_put(bio);
>>>>>> -        bio_endio(orig_bio);
>>>>>> +        if (unlikely(must_retry)) {
>>>>>> +                if (atomic_dec_and_test(&mddev->pending_retry_bios))
>>>>>> +                        wake_up(&mddev->retry_bios_wait);
>>>>>> +        } else
>>>>>> +                bio_endio(orig_bio);
>>>>>>          percpu_ref_put(&mddev->active_io);
>>>>>>  }
>>>>>>
>>>>>> @@ -9243,6 +9250,7 @@ static void md_clone_bio(struct mddev *mddev, struct bio **bio)
>>>>>>          md_io_clone = container_of(clone, struct md_io_clone, bio_clone);
>>>>>>          md_io_clone->orig_bio = *bio;
>>>>>>          md_io_clone->mddev = mddev;
>>>>>> +        md_io_clone->must_retry = 0;
>>>>>>          if (blk_queue_io_stat(bdev->bd_disk->queue))
>>>>>>                  md_io_clone->start_time = bio_start_io_acct(*bio);
>>>>>>
>>>>>> @@ -9265,26 +9273,6 @@ void md_account_bio(struct mddev *mddev, struct bio **bio)
>>>>>>  }
>>>>>>  EXPORT_SYMBOL_GPL(md_account_bio);
>>>>>>
>>>>>> -void md_free_cloned_bio(struct bio *bio)
>>>>>> -{
>>>>>> -        struct md_io_clone *md_io_clone = bio->bi_private;
>>>>>> -        struct bio *orig_bio = md_io_clone->orig_bio;
>>>>>> -        struct mddev *mddev = md_io_clone->mddev;
>>>>>> -
>>>>>> -        if (bio_data_dir(orig_bio) == WRITE && md_bitmap_enabled(mddev, false))
>>>>>> -                md_bitmap_end(mddev, md_io_clone);
>>>>>> -
>>>>>> -        if (bio->bi_status && !orig_bio->bi_status)
>>>>>> -                orig_bio->bi_status = bio->bi_status;
>>>>>> -
>>>>>> -        if (md_io_clone->start_time)
>>>>>> -                bio_end_io_acct(orig_bio, md_io_clone->start_time);
>>>>>> -
>>>>>> -        bio_put(bio);
>>>>>> -        percpu_ref_put(&mddev->active_io);
>>>>>> -}
>>>>>> -EXPORT_SYMBOL_GPL(md_free_cloned_bio);
>>>>>> -
>>>>>>  /* md_allow_write(mddev)
>>>>>>  * Calling this ensures that the array is marked 'active' so that writes
>>>>>>  * may proceed without blocking. It is important to call this before
>>>>>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>>>>>> index ac84289664cd..49a231f11676 100644
>>>>>> --- a/drivers/md/md.h
>>>>>> +++ b/drivers/md/md.h
>>>>>> @@ -626,6 +626,9 @@ struct mddev {
>>>>>>
>>>>>>          /* The sequence number for sync thread */
>>>>>>          atomic_t sync_seq;
>>>>>> +
>>>>>> +        wait_queue_head_t retry_bios_wait;
>>>>>> +        atomic_t pending_retry_bios;
>>>>>>  };
>>>>>>
>>>>>>  enum recovery_flags {
>>>>>> @@ -877,6 +880,7 @@ struct md_io_clone {
>>>>>>          sector_t offset;
>>>>>>          unsigned long sectors;
>>>>>>          enum stat_group rw;
>>>>>> +        unsigned int must_retry;
>>>>>>          struct bio bio_clone;
>>>>>>  };
>>>>>>
>>>>>> @@ -917,7 +921,6 @@ extern void md_finish_reshape(struct mddev *mddev);
>>>>>>  void md_submit_discard_bio(struct mddev *mddev, struct md_rdev *rdev,
>>>>>>                             struct bio *bio, sector_t start, sector_t size);
>>>>>>  void md_account_bio(struct mddev *mddev, struct bio **bio);
>>>>>> -void md_free_cloned_bio(struct bio *bio);
>>>>>>
>>>>>>  extern bool __must_check md_flush_request(struct mddev *mddev, struct bio *bio);
>>>>>>  void md_write_metadata(struct mddev *mddev, struct md_rdev *rdev,
>>>>>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>>>>>> index a8e8d431071b..fb78a757f2fd 100644
>>>>>> --- a/drivers/md/raid5.c
>>>>>> +++ b/drivers/md/raid5.c
>>>>>> @@ -6217,7 +6217,13 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
>>>>>>
>>>>>>          mempool_free(ctx, conf->ctx_pool);
>>>>>>          if (res == STRIPE_WAIT_RESHAPE) {
>>>>>> -                md_free_cloned_bio(bi);
>>>>>> +                struct md_io_clone *md_io_clone = bi->bi_private;
>>>>>> +
>>>>>> +                md_io_clone->must_retry = 1;
>>>>>> +                atomic_inc(&mddev->pending_retry_bios);
>>>>>> +                bio_endio(bi);
>>>>>> +                wait_event(mddev->retry_bios_wait,
>>>>>> +                           atomic_read(&mddev->pending_retry_bios) == 0);
>>>>>
>>>>> Hi Ben,
>>>>>
>>>>> There is a problem here. The new counter pending_retry_bios above
>>>>> doesn't represent the bios which have been added to stripes, so a UAF
>>>>> can still happen.
>>>> I don't think it can. That counter won't be zero until there are no bios
>>>> that need to get retried. So we may end up waiting too long, but we
>>>> shouldn't ever end up not waiting long enough.
>>> Ah, yes, you're right. I was thinking about it wrongly. The counter
>>> tracks multiple threads that encounter BLK_STS_RESOURCE. For each
>>> thread, md_end_clone_io is called only when all bios return. But why do
>>> different threads need to sync at raid5_make_request?
>>>
>>>>> Do we need to add a new counter?
>>>> I did it that way because I wanted to avoid growing md_io_clone
>>>> ("must_retry" fit nicely in a hole in the structure, so it didn't take
>>>> up any extra space). But if you are fine with adding the extra
>>> Nice design.
>>>
>>>> reshape_completion pointer to md_io_clone, then your way avoids the
>>>> chance that we might wait when we don't need to.
>>> If you mean the thread syncing here, yes.
>> I can certainly add comments explaining why I left off the READ_ONCE()
>> and WRITE_ONCE() (actually, there's no real reason not to keep the
>> WRITE_ONCE(), since that's not in the fast path) if you think it makes
>> sense to leave them off.
>>
>> I'm fine with either my code or yours to deal with the waiting,
>> whichever you think makes more sense. Since this is a very hard bug to
>> hit, and the fix involves adding extra data to every bio and extra code
>> in the end_io function that all the raid bios call, I wanted to make the
>> impact as small as possible. But md_io_clone is still 8 bytes away from
>> filling up 3 cachelines (at 64 bytes a cacheline), so there's space to
>> add the reshape_completion pointer without increasing the number of
>> cachelines, which means that your solution for waiting is probably the
>> way to go, unless you think something else might end up needing that
>> space.

I think it's not good to add a global counter to mddev, and I agree it's
not good to grow md_io_clone for this corner case. How about we remove
the use of the bi_private field and use container_of() to get the
md_io_clone in md_end_clone_io() instead? Then bi_private is free to
record the completion without adding a new field.
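An untested sketch of what I mean, borrowing the completion idea from
Xiao's snippet quoted below (it assumes md_clone_bio() is changed to
leave clone->bi_private NULL, so a non-NULL value means "retry"):

static void md_end_clone_io(struct bio *bio)
{
        /* the clone is embedded in md_io_clone, so bi_private is not needed */
        struct md_io_clone *md_io_clone =
                container_of(bio, struct md_io_clone, bio_clone);
        struct bio *orig_bio = md_io_clone->orig_bio;
        struct mddev *mddev = md_io_clone->mddev;
        /* load before bio_put() frees the clone and md_io_clone */
        struct completion *done = bio->bi_private;

        if (bio_data_dir(orig_bio) == WRITE && md_bitmap_enabled(mddev, false))
                md_bitmap_end(mddev, md_io_clone);

        if (bio->bi_status && !orig_bio->bi_status)
                orig_bio->bi_status = bio->bi_status;

        if (md_io_clone->start_time)
                bio_end_io_acct(orig_bio, md_io_clone->start_time);

        bio_put(bio);
        if (unlikely(done))
                complete(done); /* the submitting thread retries orig_bio */
        else
                bio_endio(orig_bio);
        percpu_ref_put(&mddev->active_io);
}

And in raid5_make_request():

        if (res == STRIPE_WAIT_RESHAPE) {
                DECLARE_COMPLETION_ONSTACK(done);

                /* bi_private is otherwise unused on the clone */
                bi->bi_private = &done;
                bio_endio(bi);
                wait_for_completion(&done);
                return false;
        }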
> Thanks. After thinking about it, as you said, this is a corner case, so
> both ways are fine with me too. Let's wait for Kuai's decision.
>
> Regards
> Xiao
>> -Ben
>>
>>>>> Because
>>>>> md_end_clone_io is only called when all bios return. How about
>>>>> something like:
>>>>>
>>>>> md_end_clone_io:
>>>>>
>>>>> +        if (unlikely(READ_ONCE(md_io_clone->waiting_reshape))) {
>>>>> +                complete(md_io_clone->reshape_completion);
>>>>> +        } else {
>>>>> +                bio_endio(orig_bio);
>>>>> +        }
>>>>>
>>>>> raid5_make_request:
>>>>>
>>>>>          if (res == STRIPE_WAIT_RESHAPE) {
>>>>> -                md_free_cloned_bio(bi);
>>>>> +                struct md_io_clone *md_io_clone = bi->bi_private;
>>>>> +                struct completion done;
>>>>> +
>>>>> +                init_completion(&done);
>>>>> +                md_io_clone->reshape_completion = &done;
>>>>> +                WRITE_ONCE(md_io_clone->waiting_reshape, 1);
>>>>> +
>>>>> +                bio_endio(bi);
>>>>> +
>>>>> +                wait_for_completion(&done);
>>>> This looks fine.
>>>>
>>>> I'm pretty sure that the READ_ONCE() and WRITE_ONCE() are unnecessary.
>>>> First, there's no chance that these operations will ever happen at the
>>>> same time, and we're just writing a 1, so there's no chance of tearing.
>>> For the first point, I have a question. They can't happen
>>> simultaneously, but the caches of different CPUs may contain different
>>> values.
>>>
>>>> The compiler can't reorder setting md_io_clone->waiting_reshape to after
>>>> bio_endio(), since bio_endio() can free md_io_clone. Also, the compiler
>>>> obviously can't reorder the read to before md_end_clone_io() is called
>>>> from bio_endio().
>>> Thanks very much for pointing this out.
>>>
>>>> If the bio was never chained, then there can't be a race, since only
>>>> raid5_make_request() will be calling bio_endio(). waiting_reshape will
>>>> be read by the same process that sets it.
>>>>
>>>> If the bio was chained, then like you said, md_end_clone_io() will only
>>>> get called by the final caller of bio_endio(). This is determined by
>>>> bio_remaining_done(), which guarantees a full memory barrier for
>>>> atomic_dec_and_test(&bio->__bi_remaining).
>>> Nice analysis! Thanks.
>>>
>>>> Since we know the compiler will keep these instructions on the proper
>>>> side of a full memory barrier, I'm pretty sure that this is concurrency
>>>> race free, even without the READ_ONCE() and WRITE_ONCE(). Like with
>>>> trying to avoid growing md_io_clone, I was trying to keep
>>>> md_end_clone_io() as fast as possible (at least on architectures where
>>>> READ_ONCE() can have more of a performance impact). Again, if I'm
>>>> overthinking this (or if my concurrency argument is flawed), feel free to
>>>> ignore me.
>>> Thanks for your patience in explaining the concurrency problem. I
>>> agree with you, and it would be better to add a comment describing the
>>> reasoning; it will help people understand it better.
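Agreed that a comment will help. Roughly something like this above the
store (just a suggestion for the wording; the same reasoning applies if
we end up using bi_private instead of a waiting_reshape flag):

        /*
         * No WRITE_ONCE()/READ_ONCE() needed: if the bio was never
         * chained, this store and the load in md_end_clone_io() happen
         * in the same task. If it was chained, the final bio_endio()
         * caller goes through bio_remaining_done(), whose
         * atomic_dec_and_test(&bio->__bi_remaining) implies a full
         * memory barrier, so this store is ordered before the load.
         */
        md_io_clone->waiting_reshape = 1;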
>>>
>>> By the way, the md raid maintainer's email has changed. I fixed the
>>> email address in the To list.
>>>
>>> Regards
>>> Xiao
>>>
>>>> -Ben
>>>>
>>>>> Best Regards
>>>>> Xiao
>>>>>
>>>>>
>>>>>>                  return false;
>>>>>>          }
>>>>>> --
>>>>>> 2.53.0
>>>>>>
>

--
Thanks,
Kuai