From: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
To: Paul Menzel <pmenzel@molgen.mpg.de>
Cc: song@kernel.org, yukuai@fnnas.com, shli@fb.com, neilb@suse.com,
linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] md/raid1: fix bio splitting in raid1 thread to avoid recursion and deadlock
Date: Mon, 27 Apr 2026 19:44:31 +0200 [thread overview]
Message-ID: <m2se8g47sw.fsf@gmail.com> (raw)
In-Reply-To: <c859ab17-cfc6-4e54-9a79-8b0d2b145adc@molgen.mpg.de>
Hi Paul,
Thank you for the feedback.
On Mon, Apr 27, 2026 at 16:49 +0200, Paul Menzel wrote:
> Dear Abd-Alrhman,
>
>
> Thank you for your patch.
>
> Am 27.04.26 um 12:34 schrieb Abd-Alrhman Masalkhi:
>> Splitting a bio while executing in the raid1 thread can lead to
>> recursion, as task->bio_list is NULL in this context.
>>
>> In addition, resubmitting an md_cloned_bio after splitting may lead to
>> a deadlock if the array is suspended before the md driver calls
>> percpu_ref_tryget_live(&mddev->active_io) on it's path to
>> pers->make_request().
>>
>> Avoid splitting the bio in this context and require that it is either
>> read in full or not at all.
>>
>> This prevents recursion and avoids potential deadlocks during array
>> suspension.
>
> Do you have a reproducer?
I found this issue while reviewing the code and trying to understand the
read path.
The problem can be triggered when the first rdev cannot complete the
md_cloned_bio successfully, and RAID1 selects another rdev that cannot
fulfil the entire request. In that case, raid1_read_request() will split
the bio (the md_cloned_bio) via bio_submit_split_bioset(), which in turn
calls submit_bio_noacct_nocheck(). Since current->bio_list is NULL in
this context, this leads to raid1_read_request() being called again,
resulting in recursion.
I have also created a test to confirm that this can occur by modifying
the code with the following patch:
drivers/md/raid1.c | 52 ++++++++++++++++++++++++++++++++++++++++------
1 file changed, 46 insertions(+), 6 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index cc9914bd15c1..145e3ad0b1b8 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -362,11 +362,34 @@ static int find_bio_disk(struct r1bio *r1_bio, struct bio *bio)
static void raid1_end_read_request(struct bio *bio)
{
+ static int tmp = 75;
int uptodate = !bio->bi_status;
struct r1bio *r1_bio = bio->bi_private;
struct r1conf *conf = r1_bio->mddev->private;
struct md_rdev *rdev = conf->mirrors[r1_bio->read_disk].rdev;
+ if (tmp > 2) {
+ tmp --;
+ pr_info("--tmp = %d\n", tmp);
+ } else if (tmp == 2) {
+ if (r1_bio->sectors > 2 && uptodate) {
+ pr_info("I will start omitting errors\n");
+ bio->bi_status = BLK_STS_IOERR;
+ uptodate = false;
+ tmp = 0;
+ }
+ } else {
+ if (r1_bio->sectors > 2) {
+ if (tmp) {
+ bio->bi_status = BLK_STS_IOERR;
+ uptodate = false;
+ tmp = false;
+ } else {
+ tmp = true;
+ }
+ }
+ }
+
/*
* this branch is our 'one mirror IO has finished' event handler:
*/
@@ -607,7 +630,7 @@ static int choose_first_rdev(struct r1conf *conf, struct r1bio *r1_bio,
/* choose the first disk even if it has some bad blocks. */
read_len = raid1_check_read_range(rdev, this_sector, &len);
- if (read_len > 0) {
+ if (read_len > 0 && (!*max_sectors || read_len == r1_bio->sectors)) {
update_read_sectors(conf, disk, this_sector, read_len);
*max_sectors = read_len;
return disk;
@@ -704,8 +727,12 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
}
if (bb_disk != -1) {
- *max_sectors = bb_read_len;
- update_read_sectors(conf, bb_disk, this_sector, bb_read_len);
+ if (!*max_sectors || bb_read_len == r1_bio->sectors) {
+ *max_sectors = bb_read_len;
+ update_read_sectors(conf, bb_disk, this_sector, bb_read_len);
+ } else {
+ bb_disk = -1;
+ }
}
return bb_disk;
@@ -884,9 +911,11 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio,
* now spend a bit more time trying to find one with the most good
* sectors.
*/
- disk = choose_bb_rdev(conf, r1_bio, max_sectors);
- if (disk >= 0)
- return disk;
+ if (!*max_sectors) {
+ disk = choose_bb_rdev(conf, r1_bio, max_sectors);
+ if (disk >= 0)
+ return disk;
+ }
return choose_slow_rdev(conf, r1_bio, max_sectors);
}
@@ -1320,6 +1349,11 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
int rdisk;
bool r1bio_existed = !!r1_bio;
+ if (mddev->thread && mddev->thread->tsk == current) {
+ pr_info("taask->bio_list = %p, %d\n", current->bio_list,
+ r1bio_existed);
+ }
+
/*
* If r1_bio is set, we are blocking the raid1d thread
* so there is a tiny risk of deadlock. So ask for
@@ -1347,6 +1381,7 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
* make_request() can abort the operation when read-ahead is being
* used and no empty request is available.
*/
+ max_sectors = r1bio_existed;
rdisk = read_balance(conf, r1_bio, &max_sectors);
if (rdisk < 0) {
/* couldn't find anywhere to read from */
@@ -1376,7 +1411,12 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
mddev->bitmap_ops->wait_behind_writes(mddev);
}
+ if (r1bio_existed)
+ max_sectors = max_sectors - 1;
+
if (max_sectors < bio_sectors(bio)) {
+ /* we are not allowed */
+ /* BUG_ON(r1bio_existed); */
bio = bio_submit_split_bioset(bio, max_sectors,
&conf->bio_split);
if (!bio) {
--
2.43.0
using trace-cmd, it shows the recursion clearly, its output:
raid1_read_request() {
bio_submit_split_bioset() {
bio_split() {
bio_alloc_clone() {
bio_alloc_bioset() {
mempool_alloc_noprof() {
__cond_resched();
mempool_alloc_slab() {
kmem_cache_alloc_noprof();
}
}
bio_associate_blkg() {
__rcu_read_lock();
kthread_blkcg();
bio_associate_blkg_from_css() {
__rcu_read_lock();
__rcu_read_unlock();
}
__rcu_read_unlock();
}
}
bio_clone_blkg_association() {
bio_associate_blkg_from_css() {
__rcu_read_lock();
__rcu_read_unlock();
__rcu_read_lock();
__rcu_read_unlock();
}
}
}
}
bio_chain();
should_fail_bio();
submit_bio_noacct_nocheck() {
blk_cgroup_bio_start();
__submit_bio() {
__rcu_read_lock();
__rcu_read_unlock();
md_submit_bio() {
bio_split_to_limits() {
bio_split_rw() {
bio_split_io_at();
bio_submit_split();
}
}
md_handle_request() {
__rcu_read_lock();
__rcu_read_unlock();
raid1_make_request() {
raid1_read_request() {
_printk() {
vprintk() {
vprintk_default() {
vprintk_emit() {
panic_on_other_cpu();
nbcon_get_default_prio() {
panic_on_this_cpu();
nbcon_get_cpu_emergency_nesting() {
printk_percpu_data_ready();
}
}
is_printk_legacy_deferred() {
is_printk_cpu_sync_owner();
}
vprintk_store() {
local_clock();
printk_parse_prefix();
is_printk_force_console();
prb_reserve() {
data_alloc() {
data_push_tail();
}
space_used.isra.0();
}
printk_sprint() {
printk_parse_prefix();
}
prb_final_commit() {
desc_update_last_finalized() {
_prb_read_valid() {
desc_read_finalized_seq();
}
_prb_read_valid() {
desc_read_finalized_seq();
panic_on_this_cpu();
}
}
}
}
console_trylock() {
panic_on_other_cpu();
__printk_safe_enter();
down_trylock() {
_raw_spin_lock_irqsave();
_raw_spin_unlock_irqrestore();
}
__printk_safe_exit();
}
console_unlock() {
nbcon_get_default_prio() {
panic_on_this_cpu();
nbcon_get_cpu_emergency_nesting() {
printk_percpu_data_ready();
}
}
is_printk_legacy_deferred() {
is_printk_cpu_sync_owner();
}
console_flush_one_record() {
nbcon_get_default_prio() {
panic_on_this_cpu();
nbcon_get_cpu_emergency_nesting() {
printk_percpu_data_ready();
}
}
is_printk_legacy_deferred() {
is_printk_cpu_sync_owner();
}
__srcu_read_lock();
printk_get_next_message() {
prb_read_valid() {
_prb_read_valid() {
desc_read_finalized_seq();
get_data();
desc_read_finalized_seq();
}
}
}
panic_on_other_cpu();
__srcu_read_unlock();
}
console_flush_one_record() {
nbcon_get_default_prio() {
panic_on_this_cpu();
nbcon_get_cpu_emergency_nesting() {
printk_percpu_data_ready();
}
}
is_printk_legacy_deferred() {
is_printk_cpu_sync_owner();
}
__srcu_read_lock();
printk_get_next_message() {
prb_read_valid() {
_prb_read_valid() {
desc_read_finalized_seq();
panic_on_this_cpu();
}
}
}
__srcu_read_unlock();
}
__printk_safe_enter();
up() {
_raw_spin_lock_irqsave();
_raw_spin_unlock_irqrestore();
}
__printk_safe_exit();
prb_read_valid() {
_prb_read_valid() {
desc_read_finalized_seq();
panic_on_this_cpu();
}
}
}
__wake_up_klogd();
}
}
}
}
mempool_alloc_noprof() {
__cond_resched();
mempool_kmalloc() {
__kmalloc_noprof();
}
}
md_account_bio() {
__rcu_read_lock();
__rcu_read_unlock();
bio_alloc_clone() {
bio_alloc_bioset() {
mempool_alloc_noprof() {
__cond_resched();
mempool_alloc_slab() {
kmem_cache_alloc_noprof();
}
}
bio_associate_blkg() {
__rcu_read_lock();
kthread_blkcg();
bio_associate_blkg_from_css() {
__rcu_read_lock();
__rcu_read_unlock();
}
__rcu_read_unlock();
}
}
bio_clone_blkg_association() {
bio_associate_blkg_from_css() {
__rcu_read_lock();
__rcu_read_unlock();
__rcu_read_lock();
__rcu_read_unlock();
}
}
}
bio_start_io_acct() {
update_io_ticks();
}
}
bio_alloc_clone() {
bio_alloc_bioset() {
mempool_alloc_noprof() {
__cond_resched();
mempool_alloc_slab() {
kmem_cache_alloc_noprof();
}
}
bio_associate_blkg() {
__rcu_read_lock();
kthread_blkcg();
bio_associate_blkg_from_css() {
__rcu_read_lock();
__rcu_read_unlock();
}
__rcu_read_unlock();
}
}
bio_clone_blkg_association() {
bio_associate_blkg_from_css() {
__rcu_read_lock();
__rcu_read_unlock();
__rcu_read_lock();
__rcu_read_unlock();
}
}
}
submit_bio_noacct() {
__cond_resched();
submit_bio_noacct_nocheck() {
blk_cgroup_bio_start();
}
}
}
}
__rcu_read_lock();
__rcu_read_unlock();
}
}
__rcu_read_lock();
__rcu_read_unlock();
}
__submit_bio() {
blk_mq_submit_bio() {
__rcu_read_lock();
__rcu_read_unlock();
bio_split_rw() {
bio_split_io_at();
bio_submit_split();
}
blk_attempt_plug_merge();
blk_mq_sched_bio_merge();
__blk_mq_alloc_requests() {
blk_mq_get_tag() {
__blk_mq_get_tag();
}
blk_mq_rq_ctx_init.isra.0();
}
ktime_get();
update_io_ticks();
blk_add_rq_to_plug();
}
}
}
}
bio_alloc_clone() {
bio_alloc_bioset() {
mempool_alloc_noprof() {
__cond_resched();
mempool_alloc_slab() {
kmem_cache_alloc_noprof() {
__slab_alloc.isra.0() {
___slab_alloc();
}
}
}
}
bio_associate_blkg() {
__rcu_read_lock();
kthread_blkcg();
bio_associate_blkg_from_css() {
__rcu_read_lock();
__rcu_read_unlock();
}
__rcu_read_unlock();
}
}
bio_clone_blkg_association() {
bio_associate_blkg_from_css() {
__rcu_read_lock();
__rcu_read_unlock();
__rcu_read_lock();
__rcu_read_unlock();
}
}
}
submit_bio_noacct() {
__cond_resched();
submit_bio_noacct_nocheck() {
blk_cgroup_bio_start();
__submit_bio() {
blk_mq_submit_bio() {
__rcu_read_lock();
__rcu_read_unlock();
bio_split_rw() {
bio_split_io_at();
bio_submit_split();
}
blk_attempt_plug_merge();
blk_mq_sched_bio_merge();
__blk_mq_alloc_requests() {
blk_mq_get_tag() {
__blk_mq_get_tag();
}
blk_mq_rq_ctx_init.isra.0();
}
update_io_ticks() {
bdev_count_inflight();
}
blk_add_rq_to_plug();
}
}
}
}
}
>
>> Fixes: 689389a06ce7 ("md/raid1: simplify handle_read_error().")
>> Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
>> ---
>> I sent an email about this issue two days ago, but at the time I was not
>> sure whether it was a real problem or a misunderstanding on my part.
>>
>> After further analysis, it appears that this issue can occur.
>>
>> Apologies for the earlier confusion, and thank you for your time.
>>
>> Abd-Alrhman
>
> I suggest to always share the URL (lore.kernel.org), when referencing
> another thread. If relevant, maybe even reference your message with a
> Link: tag in the commit message.
Yes, i will make sure to do that next time. Here is the link:
https://lore.kernel.org/linux-raid/20260425142938.5555-1-abd.masalkhi@gmail.com/T
>
>> ---
>> drivers/md/raid1.c | 33 ++++++++++++++++++++++++---------
>> 1 file changed, 24 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index cc9914bd15c1..14f6d6625811 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -607,7 +607,7 @@ static int choose_first_rdev(struct r1conf *conf, struct r1bio *r1_bio,
>>
>> /* choose the first disk even if it has some bad blocks. */
>> read_len = raid1_check_read_range(rdev, this_sector, &len);
>> - if (read_len > 0) {
>> + if (read_len > 0 && (!*max_sectors || read_len == r1_bio->sectors)) {
>> update_read_sectors(conf, disk, this_sector, read_len);
>> *max_sectors = read_len;
>> return disk;
>> @@ -704,8 +704,13 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
>> }
>>
>> if (bb_disk != -1) {
>> - *max_sectors = bb_read_len;
>> - update_read_sectors(conf, bb_disk, this_sector, bb_read_len);
>> + if (!*max_sectors || bb_read_len == r1_bio->sectors) {
>> + *max_sectors = bb_read_len;
>> + update_read_sectors(conf, bb_disk, this_sector,
>> + bb_read_len);
>> + } else {
>> + bb_disk = -1;
>> + }
>> }
>>
>> return bb_disk;
>> @@ -852,8 +857,9 @@ static int choose_best_rdev(struct r1conf *conf, struct r1bio *r1_bio)
>> * disks and disks with bad blocks for now. Only pay attention to key disk
>> * choice.
>> *
>> - * 3) If we've made it this far, now look for disks with bad blocks and choose
>> - * the one with most number of sectors.
>> + * 3) If we've made it this far and *max_sectors is 0 (i.e., we are tolerant
>> + * of bad blocks), look for disks with bad blocks and choose the one with
>> + * the most sectors.
>> *
>> * 4) If we are all the way at the end, we have no choice but to use a disk even
>> * if it is write mostly.
>> @@ -882,11 +888,13 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio,
>> /*
>> * If we are here it means we didn't find a perfectly good disk so
>> * now spend a bit more time trying to find one with the most good
>> - * sectors.
>> + * sectors. but only if we are tolerant of bad blocks.
>
> s/but/But/
>
I will fix this in v2.
>> */
>> - disk = choose_bb_rdev(conf, r1_bio, max_sectors);
>> - if (disk >= 0)
>> - return disk;
>> + if (!*max_sectors) {
>> + disk = choose_bb_rdev(conf, r1_bio, max_sectors);
>> + if (disk >= 0)
>> + return disk;
>> + }
>>
>> return choose_slow_rdev(conf, r1_bio, max_sectors);
>> }
>> @@ -1346,7 +1354,14 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
>> /*
>> * make_request() can abort the operation when read-ahead is being
>> * used and no empty request is available.
>> + *
>> + * If we allow splitting the bio while executing in the raid1 thread,
>> + * we may end up recursing (current->bio_list is NULL), and we might
>> + * also deadlock if we try to suspend the array, since we are
>> + * resubmitting an md_cloned_bio. Therefore, we must be read either
>
> … we must read …
>
I will fix this in v2.
>> + * all the sectors or none.
>> */
>> + max_sectors = r1bio_existed;
>
> Excuse my ignorance, but I do not get why a bool is assigned to an int
> representing the maximum sector value.
>
I modified read_balance() to interpret *max_sectors as a flag. If it is
0, the read path is allowed to be tolerant of bad blocks; otherwise, it
is not. In both cases, *max_sectors will eventually be updated to the
maximum number of readable sectors if a suitable disk is found.
I used r1bio_existed to initialize this value, so assigning
max_sectors = r1bio_existed effectively encodes this behavior
without introducing an additional parameter.
>> rdisk = read_balance(conf, r1_bio, &max_sectors);
>> if (rdisk < 0) {
>> /* couldn't find anywhere to read from */
>
>
> Kind regards,
>
> Paul
--
Best Regards,
Abd-Alrhman
next prev parent reply other threads:[~2026-04-27 17:44 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-27 10:34 [PATCH] md/raid1: fix bio splitting in raid1 thread to avoid recursion and deadlock Abd-Alrhman Masalkhi
2026-04-27 14:49 ` Paul Menzel
2026-04-27 17:44 ` Abd-Alrhman Masalkhi [this message]
2026-04-28 8:16 ` Abd-Alrhman Masalkhi
2026-04-28 8:54 ` Yu Kuai
2026-04-28 9:46 ` Abd-Alrhman Masalkhi
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=m2se8g47sw.fsf@gmail.com \
--to=abd.masalkhi@gmail.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-raid@vger.kernel.org \
--cc=neilb@suse.com \
--cc=pmenzel@molgen.mpg.de \
--cc=shli@fb.com \
--cc=song@kernel.org \
--cc=yukuai@fnnas.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox