* [PATCH] md/raid1: fix bio splitting in raid1 thread to avoid recursion and deadlock
@ 2026-04-27 10:34 Abd-Alrhman Masalkhi
2026-04-27 14:49 ` Paul Menzel
2026-04-28 8:54 ` Yu Kuai
0 siblings, 2 replies; 6+ messages in thread
From: Abd-Alrhman Masalkhi @ 2026-04-27 10:34 UTC (permalink / raw)
To: song, yukuai, shli, neilb, linux-raid, linux-kernel; +Cc: Abd-Alrhman Masalkhi
Splitting a bio while executing in the raid1 thread can lead to
recursion, as task->bio_list is NULL in this context.
In addition, resubmitting an md_cloned_bio after splitting may lead to
a deadlock if the array is suspended before the md driver calls
percpu_ref_tryget_live(&mddev->active_io) on its path to
pers->make_request().
Avoid splitting the bio in this context and require that it is either
read in full or not at all.
This prevents recursion and avoids potential deadlocks during array
suspension.
Fixes: 689389a06ce7 ("md/raid1: simplify handle_read_error().")
Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
---
I sent an email about this issue two days ago, but at the time I was not
sure whether it was a real problem or a misunderstanding on my part.
After further analysis, it appears that this issue can occur.
Apologies for the earlier confusion, and thank you for your time.
Abd-Alrhman
---
drivers/md/raid1.c | 33 ++++++++++++++++++++++++---------
1 file changed, 24 insertions(+), 9 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index cc9914bd15c1..14f6d6625811 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -607,7 +607,7 @@ static int choose_first_rdev(struct r1conf *conf, struct r1bio *r1_bio,
/* choose the first disk even if it has some bad blocks. */
read_len = raid1_check_read_range(rdev, this_sector, &len);
- if (read_len > 0) {
+ if (read_len > 0 && (!*max_sectors || read_len == r1_bio->sectors)) {
update_read_sectors(conf, disk, this_sector, read_len);
*max_sectors = read_len;
return disk;
@@ -704,8 +704,13 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
}
if (bb_disk != -1) {
- *max_sectors = bb_read_len;
- update_read_sectors(conf, bb_disk, this_sector, bb_read_len);
+ if (!*max_sectors || bb_read_len == r1_bio->sectors) {
+ *max_sectors = bb_read_len;
+ update_read_sectors(conf, bb_disk, this_sector,
+ bb_read_len);
+ } else {
+ bb_disk = -1;
+ }
}
return bb_disk;
@@ -852,8 +857,9 @@ static int choose_best_rdev(struct r1conf *conf, struct r1bio *r1_bio)
* disks and disks with bad blocks for now. Only pay attention to key disk
* choice.
*
- * 3) If we've made it this far, now look for disks with bad blocks and choose
- * the one with most number of sectors.
+ * 3) If we've made it this far and *max_sectors is 0 (i.e., we are tolerant
+ * of bad blocks), look for disks with bad blocks and choose the one with
+ * the most sectors.
*
* 4) If we are all the way at the end, we have no choice but to use a disk even
* if it is write mostly.
@@ -882,11 +888,13 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio,
/*
* If we are here it means we didn't find a perfectly good disk so
* now spend a bit more time trying to find one with the most good
- * sectors.
+ * sectors. but only if we are tolerant of bad blocks.
*/
- disk = choose_bb_rdev(conf, r1_bio, max_sectors);
- if (disk >= 0)
- return disk;
+ if (!*max_sectors) {
+ disk = choose_bb_rdev(conf, r1_bio, max_sectors);
+ if (disk >= 0)
+ return disk;
+ }
return choose_slow_rdev(conf, r1_bio, max_sectors);
}
@@ -1346,7 +1354,14 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
/*
* make_request() can abort the operation when read-ahead is being
* used and no empty request is available.
+ *
+ * If we allow splitting the bio while executing in the raid1 thread,
+ * we may end up recursing (current->bio_list is NULL), and we might
+ * also deadlock if we try to suspend the array, since we are
+ * resubmitting an md_cloned_bio. Therefore, we must be read either
+ * all the sectors or none.
*/
+ max_sectors = r1bio_existed;
rdisk = read_balance(conf, r1_bio, &max_sectors);
if (rdisk < 0) {
/* couldn't find anywhere to read from */
--
2.43.0
* Re: [PATCH] md/raid1: fix bio splitting in raid1 thread to avoid recursion and deadlock
2026-04-27 10:34 [PATCH] md/raid1: fix bio splitting in raid1 thread to avoid recursion and deadlock Abd-Alrhman Masalkhi
@ 2026-04-27 14:49 ` Paul Menzel
2026-04-27 17:44 ` Abd-Alrhman Masalkhi
2026-04-28 8:16 ` Abd-Alrhman Masalkhi
2026-04-28 8:54 ` Yu Kuai
1 sibling, 2 replies; 6+ messages in thread
From: Paul Menzel @ 2026-04-27 14:49 UTC (permalink / raw)
To: Abd-Alrhman Masalkhi; +Cc: song, yukuai, shli, neilb, linux-raid, linux-kernel
Dear Abd-Alrhman,
Thank you for your patch.
On 27.04.26 at 12:34, Abd-Alrhman Masalkhi wrote:
> Splitting a bio while executing in the raid1 thread can lead to
> recursion, as task->bio_list is NULL in this context.
>
> In addition, resubmitting an md_cloned_bio after splitting may lead to
> a deadlock if the array is suspended before the md driver calls
> percpu_ref_tryget_live(&mddev->active_io) on it's path to
> pers->make_request().
>
> Avoid splitting the bio in this context and require that it is either
> read in full or not at all.
>
> This prevents recursion and avoids potential deadlocks during array
> suspension.
Do you have a reproducer?
> Fixes: 689389a06ce7 ("md/raid1: simplify handle_read_error().")
> Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
> ---
> I sent an email about this issue two days ago, but at the time I was not
> sure whether it was a real problem or a misunderstanding on my part.
>
> After further analysis, it appears that this issue can occur.
>
> Apologies for the earlier confusion, and thank you for your time.
>
> Abd-Alrhman
I suggest always sharing the URL (lore.kernel.org) when referencing
another thread. If relevant, maybe even reference your message with a
Link: tag in the commit message.
> ---
> drivers/md/raid1.c | 33 ++++++++++++++++++++++++---------
> 1 file changed, 24 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index cc9914bd15c1..14f6d6625811 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -607,7 +607,7 @@ static int choose_first_rdev(struct r1conf *conf, struct r1bio *r1_bio,
>
> /* choose the first disk even if it has some bad blocks. */
> read_len = raid1_check_read_range(rdev, this_sector, &len);
> - if (read_len > 0) {
> + if (read_len > 0 && (!*max_sectors || read_len == r1_bio->sectors)) {
> update_read_sectors(conf, disk, this_sector, read_len);
> *max_sectors = read_len;
> return disk;
> @@ -704,8 +704,13 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
> }
>
> if (bb_disk != -1) {
> - *max_sectors = bb_read_len;
> - update_read_sectors(conf, bb_disk, this_sector, bb_read_len);
> + if (!*max_sectors || bb_read_len == r1_bio->sectors) {
> + *max_sectors = bb_read_len;
> + update_read_sectors(conf, bb_disk, this_sector,
> + bb_read_len);
> + } else {
> + bb_disk = -1;
> + }
> }
>
> return bb_disk;
> @@ -852,8 +857,9 @@ static int choose_best_rdev(struct r1conf *conf, struct r1bio *r1_bio)
> * disks and disks with bad blocks for now. Only pay attention to key disk
> * choice.
> *
> - * 3) If we've made it this far, now look for disks with bad blocks and choose
> - * the one with most number of sectors.
> + * 3) If we've made it this far and *max_sectors is 0 (i.e., we are tolerant
> + * of bad blocks), look for disks with bad blocks and choose the one with
> + * the most sectors.
> *
> * 4) If we are all the way at the end, we have no choice but to use a disk even
> * if it is write mostly.
> @@ -882,11 +888,13 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio,
> /*
> * If we are here it means we didn't find a perfectly good disk so
> * now spend a bit more time trying to find one with the most good
> - * sectors.
> + * sectors. but only if we are tolerant of bad blocks.
s/but/But/
> */
> - disk = choose_bb_rdev(conf, r1_bio, max_sectors);
> - if (disk >= 0)
> - return disk;
> + if (!*max_sectors) {
> + disk = choose_bb_rdev(conf, r1_bio, max_sectors);
> + if (disk >= 0)
> + return disk;
> + }
>
> return choose_slow_rdev(conf, r1_bio, max_sectors);
> }
> @@ -1346,7 +1354,14 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
> /*
> * make_request() can abort the operation when read-ahead is being
> * used and no empty request is available.
> + *
> + * If we allow splitting the bio while executing in the raid1 thread,
> + * we may end up recursing (current->bio_list is NULL), and we might
> + * also deadlock if we try to suspend the array, since we are
> + * resubmitting an md_cloned_bio. Therefore, we must be read either
… we must read …
> + * all the sectors or none.
> */
> + max_sectors = r1bio_existed;
Excuse my ignorance, but I do not get why a bool is assigned to an int
representing the maximum sector value.
> rdisk = read_balance(conf, r1_bio, &max_sectors);
> if (rdisk < 0) {
> /* couldn't find anywhere to read from */
Kind regards,
Paul
* Re: [PATCH] md/raid1: fix bio splitting in raid1 thread to avoid recursion and deadlock
2026-04-27 14:49 ` Paul Menzel
@ 2026-04-27 17:44 ` Abd-Alrhman Masalkhi
2026-04-28 8:16 ` Abd-Alrhman Masalkhi
1 sibling, 0 replies; 6+ messages in thread
From: Abd-Alrhman Masalkhi @ 2026-04-27 17:44 UTC (permalink / raw)
To: Paul Menzel; +Cc: song, yukuai, shli, neilb, linux-raid, linux-kernel
Hi Paul,
Thank you for the feedback.
On Mon, Apr 27, 2026 at 16:49 +0200, Paul Menzel wrote:
> Dear Abd-Alrhman,
>
>
> Thank you for your patch.
>
> On 27.04.26 at 12:34, Abd-Alrhman Masalkhi wrote:
>> Splitting a bio while executing in the raid1 thread can lead to
>> recursion, as task->bio_list is NULL in this context.
>>
>> In addition, resubmitting an md_cloned_bio after splitting may lead to
>> a deadlock if the array is suspended before the md driver calls
>> percpu_ref_tryget_live(&mddev->active_io) on it's path to
>> pers->make_request().
>>
>> Avoid splitting the bio in this context and require that it is either
>> read in full or not at all.
>>
>> This prevents recursion and avoids potential deadlocks during array
>> suspension.
>
> Do you have a reproducer?
I found this issue while reviewing the code and trying to understand the
read path.
The problem can be triggered when the first rdev cannot complete the
md_cloned_bio successfully, and RAID1 selects another rdev that cannot
fulfil the entire request. In that case, raid1_read_request() will split
the bio (the md_cloned_bio) via bio_submit_split_bioset(), which in turn
calls submit_bio_noacct_nocheck(). Since current->bio_list is NULL in
this context, this leads to raid1_read_request() being called again,
resulting in recursion.
I have also created a test to confirm that this can occur by modifying
the code with the following patch:
drivers/md/raid1.c | 52 ++++++++++++++++++++++++++++++++++++++++------
1 file changed, 46 insertions(+), 6 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index cc9914bd15c1..145e3ad0b1b8 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -362,11 +362,34 @@ static int find_bio_disk(struct r1bio *r1_bio, struct bio *bio)
static void raid1_end_read_request(struct bio *bio)
{
+ static int tmp = 75;
int uptodate = !bio->bi_status;
struct r1bio *r1_bio = bio->bi_private;
struct r1conf *conf = r1_bio->mddev->private;
struct md_rdev *rdev = conf->mirrors[r1_bio->read_disk].rdev;
+ if (tmp > 2) {
+ tmp --;
+ pr_info("--tmp = %d\n", tmp);
+ } else if (tmp == 2) {
+ if (r1_bio->sectors > 2 && uptodate) {
+ pr_info("I will start omitting errors\n");
+ bio->bi_status = BLK_STS_IOERR;
+ uptodate = false;
+ tmp = 0;
+ }
+ } else {
+ if (r1_bio->sectors > 2) {
+ if (tmp) {
+ bio->bi_status = BLK_STS_IOERR;
+ uptodate = false;
+ tmp = false;
+ } else {
+ tmp = true;
+ }
+ }
+ }
+
/*
* this branch is our 'one mirror IO has finished' event handler:
*/
@@ -607,7 +630,7 @@ static int choose_first_rdev(struct r1conf *conf, struct r1bio *r1_bio,
/* choose the first disk even if it has some bad blocks. */
read_len = raid1_check_read_range(rdev, this_sector, &len);
- if (read_len > 0) {
+ if (read_len > 0 && (!*max_sectors || read_len == r1_bio->sectors)) {
update_read_sectors(conf, disk, this_sector, read_len);
*max_sectors = read_len;
return disk;
@@ -704,8 +727,12 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
}
if (bb_disk != -1) {
- *max_sectors = bb_read_len;
- update_read_sectors(conf, bb_disk, this_sector, bb_read_len);
+ if (!*max_sectors || bb_read_len == r1_bio->sectors) {
+ *max_sectors = bb_read_len;
+ update_read_sectors(conf, bb_disk, this_sector, bb_read_len);
+ } else {
+ bb_disk = -1;
+ }
}
return bb_disk;
@@ -884,9 +911,11 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio,
* now spend a bit more time trying to find one with the most good
* sectors.
*/
- disk = choose_bb_rdev(conf, r1_bio, max_sectors);
- if (disk >= 0)
- return disk;
+ if (!*max_sectors) {
+ disk = choose_bb_rdev(conf, r1_bio, max_sectors);
+ if (disk >= 0)
+ return disk;
+ }
return choose_slow_rdev(conf, r1_bio, max_sectors);
}
@@ -1320,6 +1349,11 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
int rdisk;
bool r1bio_existed = !!r1_bio;
+ if (mddev->thread && mddev->thread->tsk == current) {
+ pr_info("task->bio_list = %p, %d\n", current->bio_list,
+ r1bio_existed);
+ }
+
/*
* If r1_bio is set, we are blocking the raid1d thread
* so there is a tiny risk of deadlock. So ask for
@@ -1347,6 +1381,7 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
* make_request() can abort the operation when read-ahead is being
* used and no empty request is available.
*/
+ max_sectors = r1bio_existed;
rdisk = read_balance(conf, r1_bio, &max_sectors);
if (rdisk < 0) {
/* couldn't find anywhere to read from */
@@ -1376,7 +1411,12 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
mddev->bitmap_ops->wait_behind_writes(mddev);
}
+ if (r1bio_existed)
+ max_sectors = max_sectors - 1;
+
if (max_sectors < bio_sectors(bio)) {
+ /* we are not allowed */
+ /* BUG_ON(r1bio_existed); */
bio = bio_submit_split_bioset(bio, max_sectors,
&conf->bio_split);
if (!bio) {
--
2.43.0
Using trace-cmd, the trace shows the recursion clearly; here is its output:
raid1_read_request() {
bio_submit_split_bioset() {
bio_split() {
bio_alloc_clone() {
bio_alloc_bioset() {
mempool_alloc_noprof() {
__cond_resched();
mempool_alloc_slab() {
kmem_cache_alloc_noprof();
}
}
bio_associate_blkg() {
__rcu_read_lock();
kthread_blkcg();
bio_associate_blkg_from_css() {
__rcu_read_lock();
__rcu_read_unlock();
}
__rcu_read_unlock();
}
}
bio_clone_blkg_association() {
bio_associate_blkg_from_css() {
__rcu_read_lock();
__rcu_read_unlock();
__rcu_read_lock();
__rcu_read_unlock();
}
}
}
}
bio_chain();
should_fail_bio();
submit_bio_noacct_nocheck() {
blk_cgroup_bio_start();
__submit_bio() {
__rcu_read_lock();
__rcu_read_unlock();
md_submit_bio() {
bio_split_to_limits() {
bio_split_rw() {
bio_split_io_at();
bio_submit_split();
}
}
md_handle_request() {
__rcu_read_lock();
__rcu_read_unlock();
raid1_make_request() {
raid1_read_request() {
_printk() {
vprintk() {
vprintk_default() {
vprintk_emit() {
panic_on_other_cpu();
nbcon_get_default_prio() {
panic_on_this_cpu();
nbcon_get_cpu_emergency_nesting() {
printk_percpu_data_ready();
}
}
is_printk_legacy_deferred() {
is_printk_cpu_sync_owner();
}
vprintk_store() {
local_clock();
printk_parse_prefix();
is_printk_force_console();
prb_reserve() {
data_alloc() {
data_push_tail();
}
space_used.isra.0();
}
printk_sprint() {
printk_parse_prefix();
}
prb_final_commit() {
desc_update_last_finalized() {
_prb_read_valid() {
desc_read_finalized_seq();
}
_prb_read_valid() {
desc_read_finalized_seq();
panic_on_this_cpu();
}
}
}
}
console_trylock() {
panic_on_other_cpu();
__printk_safe_enter();
down_trylock() {
_raw_spin_lock_irqsave();
_raw_spin_unlock_irqrestore();
}
__printk_safe_exit();
}
console_unlock() {
nbcon_get_default_prio() {
panic_on_this_cpu();
nbcon_get_cpu_emergency_nesting() {
printk_percpu_data_ready();
}
}
is_printk_legacy_deferred() {
is_printk_cpu_sync_owner();
}
console_flush_one_record() {
nbcon_get_default_prio() {
panic_on_this_cpu();
nbcon_get_cpu_emergency_nesting() {
printk_percpu_data_ready();
}
}
is_printk_legacy_deferred() {
is_printk_cpu_sync_owner();
}
__srcu_read_lock();
printk_get_next_message() {
prb_read_valid() {
_prb_read_valid() {
desc_read_finalized_seq();
get_data();
desc_read_finalized_seq();
}
}
}
panic_on_other_cpu();
__srcu_read_unlock();
}
console_flush_one_record() {
nbcon_get_default_prio() {
panic_on_this_cpu();
nbcon_get_cpu_emergency_nesting() {
printk_percpu_data_ready();
}
}
is_printk_legacy_deferred() {
is_printk_cpu_sync_owner();
}
__srcu_read_lock();
printk_get_next_message() {
prb_read_valid() {
_prb_read_valid() {
desc_read_finalized_seq();
panic_on_this_cpu();
}
}
}
__srcu_read_unlock();
}
__printk_safe_enter();
up() {
_raw_spin_lock_irqsave();
_raw_spin_unlock_irqrestore();
}
__printk_safe_exit();
prb_read_valid() {
_prb_read_valid() {
desc_read_finalized_seq();
panic_on_this_cpu();
}
}
}
__wake_up_klogd();
}
}
}
}
mempool_alloc_noprof() {
__cond_resched();
mempool_kmalloc() {
__kmalloc_noprof();
}
}
md_account_bio() {
__rcu_read_lock();
__rcu_read_unlock();
bio_alloc_clone() {
bio_alloc_bioset() {
mempool_alloc_noprof() {
__cond_resched();
mempool_alloc_slab() {
kmem_cache_alloc_noprof();
}
}
bio_associate_blkg() {
__rcu_read_lock();
kthread_blkcg();
bio_associate_blkg_from_css() {
__rcu_read_lock();
__rcu_read_unlock();
}
__rcu_read_unlock();
}
}
bio_clone_blkg_association() {
bio_associate_blkg_from_css() {
__rcu_read_lock();
__rcu_read_unlock();
__rcu_read_lock();
__rcu_read_unlock();
}
}
}
bio_start_io_acct() {
update_io_ticks();
}
}
bio_alloc_clone() {
bio_alloc_bioset() {
mempool_alloc_noprof() {
__cond_resched();
mempool_alloc_slab() {
kmem_cache_alloc_noprof();
}
}
bio_associate_blkg() {
__rcu_read_lock();
kthread_blkcg();
bio_associate_blkg_from_css() {
__rcu_read_lock();
__rcu_read_unlock();
}
__rcu_read_unlock();
}
}
bio_clone_blkg_association() {
bio_associate_blkg_from_css() {
__rcu_read_lock();
__rcu_read_unlock();
__rcu_read_lock();
__rcu_read_unlock();
}
}
}
submit_bio_noacct() {
__cond_resched();
submit_bio_noacct_nocheck() {
blk_cgroup_bio_start();
}
}
}
}
__rcu_read_lock();
__rcu_read_unlock();
}
}
__rcu_read_lock();
__rcu_read_unlock();
}
__submit_bio() {
blk_mq_submit_bio() {
__rcu_read_lock();
__rcu_read_unlock();
bio_split_rw() {
bio_split_io_at();
bio_submit_split();
}
blk_attempt_plug_merge();
blk_mq_sched_bio_merge();
__blk_mq_alloc_requests() {
blk_mq_get_tag() {
__blk_mq_get_tag();
}
blk_mq_rq_ctx_init.isra.0();
}
ktime_get();
update_io_ticks();
blk_add_rq_to_plug();
}
}
}
}
bio_alloc_clone() {
bio_alloc_bioset() {
mempool_alloc_noprof() {
__cond_resched();
mempool_alloc_slab() {
kmem_cache_alloc_noprof() {
__slab_alloc.isra.0() {
___slab_alloc();
}
}
}
}
bio_associate_blkg() {
__rcu_read_lock();
kthread_blkcg();
bio_associate_blkg_from_css() {
__rcu_read_lock();
__rcu_read_unlock();
}
__rcu_read_unlock();
}
}
bio_clone_blkg_association() {
bio_associate_blkg_from_css() {
__rcu_read_lock();
__rcu_read_unlock();
__rcu_read_lock();
__rcu_read_unlock();
}
}
}
submit_bio_noacct() {
__cond_resched();
submit_bio_noacct_nocheck() {
blk_cgroup_bio_start();
__submit_bio() {
blk_mq_submit_bio() {
__rcu_read_lock();
__rcu_read_unlock();
bio_split_rw() {
bio_split_io_at();
bio_submit_split();
}
blk_attempt_plug_merge();
blk_mq_sched_bio_merge();
__blk_mq_alloc_requests() {
blk_mq_get_tag() {
__blk_mq_get_tag();
}
blk_mq_rq_ctx_init.isra.0();
}
update_io_ticks() {
bdev_count_inflight();
}
blk_add_rq_to_plug();
}
}
}
}
}
>
>> Fixes: 689389a06ce7 ("md/raid1: simplify handle_read_error().")
>> Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
>> ---
>> I sent an email about this issue two days ago, but at the time I was not
>> sure whether it was a real problem or a misunderstanding on my part.
>>
>> After further analysis, it appears that this issue can occur.
>>
>> Apologies for the earlier confusion, and thank you for your time.
>>
>> Abd-Alrhman
>
> I suggest to always share the URL (lore.kernel.org), when referencing
> another thread. If relevant, maybe even reference your message with a
> Link: tag in the commit message.
Yes, I will make sure to do that next time. Here is the link:
https://lore.kernel.org/linux-raid/20260425142938.5555-1-abd.masalkhi@gmail.com/T
>
>> ---
>> drivers/md/raid1.c | 33 ++++++++++++++++++++++++---------
>> 1 file changed, 24 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index cc9914bd15c1..14f6d6625811 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -607,7 +607,7 @@ static int choose_first_rdev(struct r1conf *conf, struct r1bio *r1_bio,
>>
>> /* choose the first disk even if it has some bad blocks. */
>> read_len = raid1_check_read_range(rdev, this_sector, &len);
>> - if (read_len > 0) {
>> + if (read_len > 0 && (!*max_sectors || read_len == r1_bio->sectors)) {
>> update_read_sectors(conf, disk, this_sector, read_len);
>> *max_sectors = read_len;
>> return disk;
>> @@ -704,8 +704,13 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
>> }
>>
>> if (bb_disk != -1) {
>> - *max_sectors = bb_read_len;
>> - update_read_sectors(conf, bb_disk, this_sector, bb_read_len);
>> + if (!*max_sectors || bb_read_len == r1_bio->sectors) {
>> + *max_sectors = bb_read_len;
>> + update_read_sectors(conf, bb_disk, this_sector,
>> + bb_read_len);
>> + } else {
>> + bb_disk = -1;
>> + }
>> }
>>
>> return bb_disk;
>> @@ -852,8 +857,9 @@ static int choose_best_rdev(struct r1conf *conf, struct r1bio *r1_bio)
>> * disks and disks with bad blocks for now. Only pay attention to key disk
>> * choice.
>> *
>> - * 3) If we've made it this far, now look for disks with bad blocks and choose
>> - * the one with most number of sectors.
>> + * 3) If we've made it this far and *max_sectors is 0 (i.e., we are tolerant
>> + * of bad blocks), look for disks with bad blocks and choose the one with
>> + * the most sectors.
>> *
>> * 4) If we are all the way at the end, we have no choice but to use a disk even
>> * if it is write mostly.
>> @@ -882,11 +888,13 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio,
>> /*
>> * If we are here it means we didn't find a perfectly good disk so
>> * now spend a bit more time trying to find one with the most good
>> - * sectors.
>> + * sectors. but only if we are tolerant of bad blocks.
>
> s/but/But/
>
I will fix this in v2.
>> */
>> - disk = choose_bb_rdev(conf, r1_bio, max_sectors);
>> - if (disk >= 0)
>> - return disk;
>> + if (!*max_sectors) {
>> + disk = choose_bb_rdev(conf, r1_bio, max_sectors);
>> + if (disk >= 0)
>> + return disk;
>> + }
>>
>> return choose_slow_rdev(conf, r1_bio, max_sectors);
>> }
>> @@ -1346,7 +1354,14 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
>> /*
>> * make_request() can abort the operation when read-ahead is being
>> * used and no empty request is available.
>> + *
>> + * If we allow splitting the bio while executing in the raid1 thread,
>> + * we may end up recursing (current->bio_list is NULL), and we might
>> + * also deadlock if we try to suspend the array, since we are
>> + * resubmitting an md_cloned_bio. Therefore, we must be read either
>
> … we must read …
>
I will fix this in v2.
>> + * all the sectors or none.
>> */
>> + max_sectors = r1bio_existed;
>
> Excuse my ignorance, but I do not get why a bool is assigned to an int
> representing the maximum sector value.
>
I modified read_balance() to interpret *max_sectors as a flag. If it is
0, the read path is allowed to be tolerant of bad blocks; otherwise, it
is not. In both cases, *max_sectors will eventually be updated to the
maximum number of readable sectors if a suitable disk is found.
I used r1bio_existed to initialize this value, so assigning
max_sectors = r1bio_existed effectively encodes this behavior
without introducing an additional parameter.
>> rdisk = read_balance(conf, r1_bio, &max_sectors);
>> if (rdisk < 0) {
>> /* couldn't find anywhere to read from */
>
>
> Kind regards,
>
> Paul
--
Best Regards,
Abd-Alrhman
* Re: [PATCH] md/raid1: fix bio splitting in raid1 thread to avoid recursion and deadlock
2026-04-27 14:49 ` Paul Menzel
2026-04-27 17:44 ` Abd-Alrhman Masalkhi
@ 2026-04-28 8:16 ` Abd-Alrhman Masalkhi
1 sibling, 0 replies; 6+ messages in thread
From: Abd-Alrhman Masalkhi @ 2026-04-28 8:16 UTC (permalink / raw)
To: Paul Menzel; +Cc: song, yukuai, shli, neilb, linux-raid, linux-kernel
Hi Paul,
On Mon, Apr 27, 2026 at 16:49 +0200, Paul Menzel wrote:
> Dear Abd-Alrhman,
>
>
> Thank you for your patch.
>
RAID10 seems to have a similar issue; I will fix it too.
> On 27.04.26 at 12:34, Abd-Alrhman Masalkhi wrote:
>> Splitting a bio while executing in the raid1 thread can lead to
>> recursion, as task->bio_list is NULL in this context.
>>
>> In addition, resubmitting an md_cloned_bio after splitting may lead to
>> a deadlock if the array is suspended before the md driver calls
>> percpu_ref_tryget_live(&mddev->active_io) on it's path to
>> pers->make_request().
>>
>> Avoid splitting the bio in this context and require that it is either
>> read in full or not at all.
>>
>> This prevents recursion and avoids potential deadlocks during array
>> suspension.
>
> Do you have a reproducer?
>
>> Fixes: 689389a06ce7 ("md/raid1: simplify handle_read_error().")
>> Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
>> ---
>> I sent an email about this issue two days ago, but at the time I was not
>> sure whether it was a real problem or a misunderstanding on my part.
>>
>> After further analysis, it appears that this issue can occur.
>>
>> Apologies for the earlier confusion, and thank you for your time.
>>
>> Abd-Alrhman
>
> I suggest to always share the URL (lore.kernel.org), when referencing
> another thread. If relevant, maybe even reference your message with a
> Link: tag in the commit message.
>
>> ---
>> drivers/md/raid1.c | 33 ++++++++++++++++++++++++---------
>> 1 file changed, 24 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index cc9914bd15c1..14f6d6625811 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -607,7 +607,7 @@ static int choose_first_rdev(struct r1conf *conf, struct r1bio *r1_bio,
>>
>> /* choose the first disk even if it has some bad blocks. */
>> read_len = raid1_check_read_range(rdev, this_sector, &len);
>> - if (read_len > 0) {
>> + if (read_len > 0 && (!*max_sectors || read_len == r1_bio->sectors)) {
>> update_read_sectors(conf, disk, this_sector, read_len);
>> *max_sectors = read_len;
>> return disk;
>> @@ -704,8 +704,13 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
>> }
>>
>> if (bb_disk != -1) {
>> - *max_sectors = bb_read_len;
>> - update_read_sectors(conf, bb_disk, this_sector, bb_read_len);
>> + if (!*max_sectors || bb_read_len == r1_bio->sectors) {
>> + *max_sectors = bb_read_len;
>> + update_read_sectors(conf, bb_disk, this_sector,
>> + bb_read_len);
>> + } else {
>> + bb_disk = -1;
>> + }
>> }
>>
>> return bb_disk;
>> @@ -852,8 +857,9 @@ static int choose_best_rdev(struct r1conf *conf, struct r1bio *r1_bio)
>> * disks and disks with bad blocks for now. Only pay attention to key disk
>> * choice.
>> *
>> - * 3) If we've made it this far, now look for disks with bad blocks and choose
>> - * the one with most number of sectors.
>> + * 3) If we've made it this far and *max_sectors is 0 (i.e., we are tolerant
>> + * of bad blocks), look for disks with bad blocks and choose the one with
>> + * the most sectors.
>> *
>> * 4) If we are all the way at the end, we have no choice but to use a disk even
>> * if it is write mostly.
>> @@ -882,11 +888,13 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio,
>> /*
>> * If we are here it means we didn't find a perfectly good disk so
>> * now spend a bit more time trying to find one with the most good
>> - * sectors.
>> + * sectors. but only if we are tolerant of bad blocks.
>
> s/but/But/
>
>> */
>> - disk = choose_bb_rdev(conf, r1_bio, max_sectors);
>> - if (disk >= 0)
>> - return disk;
>> + if (!*max_sectors) {
>> + disk = choose_bb_rdev(conf, r1_bio, max_sectors);
>> + if (disk >= 0)
>> + return disk;
>> + }
>>
>> return choose_slow_rdev(conf, r1_bio, max_sectors);
>> }
>> @@ -1346,7 +1354,14 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
>> /*
>> * make_request() can abort the operation when read-ahead is being
>> * used and no empty request is available.
>> + *
>> + * If we allow splitting the bio while executing in the raid1 thread,
>> + * we may end up recursing (current->bio_list is NULL), and we might
>> + * also deadlock if we try to suspend the array, since we are
>> + * resubmitting an md_cloned_bio. Therefore, we must be read either
>
> … we must read …
>
>> + * all the sectors or none.
>> */
>> + max_sectors = r1bio_existed;
>
> Excuse my ignorance, but I do not get why a bool is assigned to an int
> representing the maximum sector value.
>
>> rdisk = read_balance(conf, r1_bio, &max_sectors);
>> if (rdisk < 0) {
>> /* couldn't find anywhere to read from */
>
>
> Kind regards,
>
> Paul
--
Best Regards,
Abd-Alrhman
* Re: [PATCH] md/raid1: fix bio splitting in raid1 thread to avoid recursion and deadlock
2026-04-27 10:34 [PATCH] md/raid1: fix bio splitting in raid1 thread to avoid recursion and deadlock Abd-Alrhman Masalkhi
2026-04-27 14:49 ` Paul Menzel
@ 2026-04-28 8:54 ` Yu Kuai
2026-04-28 9:46 ` Abd-Alrhman Masalkhi
1 sibling, 1 reply; 6+ messages in thread
From: Yu Kuai @ 2026-04-28 8:54 UTC (permalink / raw)
To: Abd-Alrhman Masalkhi, song, shli, neilb, linux-raid, linux-kernel,
yukuai
Hi,
On 2026/4/27 18:34, Abd-Alrhman Masalkhi wrote:
> Splitting a bio while executing in the raid1 thread can lead to
> recursion, as task->bio_list is NULL in this context.
>
> In addition, resubmitting an md_cloned_bio after splitting may lead to
> a deadlock if the array is suspended before the md driver calls
> percpu_ref_tryget_live(&mddev->active_io) on it's path to
> pers->make_request().
I don't understand. I agree this is problematic in the suspend case, but
what's wrong with task->bio_list being NULL? This can only cause reversed
ordering, because the split bio will be submitted first. However, this is
not a big deal, as this is the slow error path.
If suspend is the only problem here, the simple fix is to add checking
in md_handle_request().
>
> Avoid splitting the bio in this context and require that it is either
> read in full or not at all.
>
> This prevents recursion and avoids potential deadlocks during array
> suspension.
>
> Fixes: 689389a06ce7 ("md/raid1: simplify handle_read_error().")
> Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
> ---
> I sent an email about this issue two days ago, but at the time I was not
> sure whether it was a real problem or a misunderstanding on my part.
>
> After further analysis, it appears that this issue can occur.
>
> Apologies for the earlier confusion, and thank you for your time.
>
> Abd-Alrhman
> ---
> drivers/md/raid1.c | 33 ++++++++++++++++++++++++---------
> 1 file changed, 24 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index cc9914bd15c1..14f6d6625811 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -607,7 +607,7 @@ static int choose_first_rdev(struct r1conf *conf, struct r1bio *r1_bio,
>
> /* choose the first disk even if it has some bad blocks. */
> read_len = raid1_check_read_range(rdev, this_sector, &len);
> - if (read_len > 0) {
> + if (read_len > 0 && (!*max_sectors || read_len == r1_bio->sectors)) {
> update_read_sectors(conf, disk, this_sector, read_len);
> *max_sectors = read_len;
> return disk;
> @@ -704,8 +704,13 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
> }
>
> if (bb_disk != -1) {
> - *max_sectors = bb_read_len;
> - update_read_sectors(conf, bb_disk, this_sector, bb_read_len);
> + if (!*max_sectors || bb_read_len == r1_bio->sectors) {
> + *max_sectors = bb_read_len;
> + update_read_sectors(conf, bb_disk, this_sector,
> + bb_read_len);
> + } else {
> + bb_disk = -1;
> + }
> }
>
> return bb_disk;
> @@ -852,8 +857,9 @@ static int choose_best_rdev(struct r1conf *conf, struct r1bio *r1_bio)
> * disks and disks with bad blocks for now. Only pay attention to key disk
> * choice.
> *
> - * 3) If we've made it this far, now look for disks with bad blocks and choose
> - * the one with most number of sectors.
> + * 3) If we've made it this far and *max_sectors is 0 (i.e., we are tolerant
> + * of bad blocks), look for disks with bad blocks and choose the one with
> + * the most sectors.
> *
> * 4) If we are all the way at the end, we have no choice but to use a disk even
> * if it is write mostly.
> @@ -882,11 +888,13 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio,
> /*
> * If we are here it means we didn't find a perfectly good disk so
> * now spend a bit more time trying to find one with the most good
> - * sectors.
> + * sectors, but only if we are tolerant of bad blocks.
> */
> - disk = choose_bb_rdev(conf, r1_bio, max_sectors);
> - if (disk >= 0)
> - return disk;
> + if (!*max_sectors) {
> + disk = choose_bb_rdev(conf, r1_bio, max_sectors);
> + if (disk >= 0)
> + return disk;
> + }
>
> return choose_slow_rdev(conf, r1_bio, max_sectors);
> }
> @@ -1346,7 +1354,14 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
> /*
> * make_request() can abort the operation when read-ahead is being
> * used and no empty request is available.
> + *
> + * If we allow splitting the bio while executing in the raid1 thread,
> + * we may end up recursing (current->bio_list is NULL), and we might
> + * also deadlock if we try to suspend the array, since we are
> + * resubmitting an md_cloned_bio. Therefore, we must either read
> + * all of the sectors or none.
> */
> + max_sectors = r1bio_existed;
> rdisk = read_balance(conf, r1_bio, &max_sectors);
> if (rdisk < 0) {
> /* couldn't find anywhere to read from */
--
Thanks,
Kuai
* Re: [PATCH] md/raid1: fix bio splitting in raid1 thread to avoid recursion and deadlock
2026-04-28 8:54 ` Yu Kuai
@ 2026-04-28 9:46 ` Abd-Alrhman Masalkhi
0 siblings, 0 replies; 6+ messages in thread
From: Abd-Alrhman Masalkhi @ 2026-04-28 9:46 UTC (permalink / raw)
To: Yu Kuai, song, shli, neilb, linux-raid, linux-kernel, yukuai
Hi Kuai,
Thank you for the feedback.
On Tue, Apr 28, 2026 at 16:54 +0800, Yu Kuai wrote:
> Hi,
>
> On 2026/4/27 18:34, Abd-Alrhman Masalkhi wrote:
>> Splitting a bio while executing in the raid1 thread can lead to
>> recursion, as task->bio_list is NULL in this context.
>>
>> In addition, resubmitting an md_cloned_bio after splitting may lead to
>> a deadlock if the array is suspended before the md driver calls
>> percpu_ref_tryget_live(&mddev->active_io) on its path to
>> pers->make_request().
>
> I don't understand. I agree this is problematic in the suspend case, but
> what's wrong with task->bio_list being NULL? That can only reverse the
> submission order, because the split bio will be submitted first. However,
> this is not a big deal, as this is the slow error path.
>
> If suspend is the only problem here, the simpler fix is to add a check
> in md_handle_request().
>
I meant that when current->bio_list is NULL, raid1_read_request()
can recurse into itself, as shown in the following trace-cmd output:
raid1_read_request() {                            <---
    bio_submit_split_bioset() {
        bio_split() {..}
        bio_chain();
        submit_bio_noacct_nocheck() {
            __submit_bio() {
                md_submit_bio() {
                    md_handle_request() {
                        raid1_make_request() {
                            raid1_read_request() { <---
                                md_account_bio() {
If this behavior is not an issue, I will follow your suggestion and
only add the check in md_handle_request().
>>
>> Avoid splitting the bio in this context and require that it is either
>> read in full or not at all.
>>
>> This prevents recursion and avoids potential deadlocks during array
>> suspension.
>>
>> Fixes: 689389a06ce7 ("md/raid1: simplify handle_read_error().")
>> Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
>> ---
>> I sent an email about this issue two days ago, but at the time I was not
>> sure whether it was a real problem or a misunderstanding on my part.
>>
>> After further analysis, it appears that this issue can occur.
>>
>> Apologies for the earlier confusion, and thank you for your time.
>>
>> Abd-Alrhman
>> ---
>> drivers/md/raid1.c | 33 ++++++++++++++++++++++++---------
>> 1 file changed, 24 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index cc9914bd15c1..14f6d6625811 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -607,7 +607,7 @@ static int choose_first_rdev(struct r1conf *conf, struct r1bio *r1_bio,
>>
>> /* choose the first disk even if it has some bad blocks. */
>> read_len = raid1_check_read_range(rdev, this_sector, &len);
>> - if (read_len > 0) {
>> + if (read_len > 0 && (!*max_sectors || read_len == r1_bio->sectors)) {
>> update_read_sectors(conf, disk, this_sector, read_len);
>> *max_sectors = read_len;
>> return disk;
>> @@ -704,8 +704,13 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
>> }
>>
>> if (bb_disk != -1) {
>> - *max_sectors = bb_read_len;
>> - update_read_sectors(conf, bb_disk, this_sector, bb_read_len);
>> + if (!*max_sectors || bb_read_len == r1_bio->sectors) {
>> + *max_sectors = bb_read_len;
>> + update_read_sectors(conf, bb_disk, this_sector,
>> + bb_read_len);
>> + } else {
>> + bb_disk = -1;
>> + }
>> }
>>
>> return bb_disk;
>> @@ -852,8 +857,9 @@ static int choose_best_rdev(struct r1conf *conf, struct r1bio *r1_bio)
>> * disks and disks with bad blocks for now. Only pay attention to key disk
>> * choice.
>> *
>> - * 3) If we've made it this far, now look for disks with bad blocks and choose
>> - * the one with most number of sectors.
>> + * 3) If we've made it this far and *max_sectors is 0 (i.e., we are tolerant
>> + * of bad blocks), look for disks with bad blocks and choose the one with
>> + * the most sectors.
>> *
>> * 4) If we are all the way at the end, we have no choice but to use a disk even
>> * if it is write mostly.
>> @@ -882,11 +888,13 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio,
>> /*
>> * If we are here it means we didn't find a perfectly good disk so
>> * now spend a bit more time trying to find one with the most good
>> - * sectors.
>> + * sectors, but only if we are tolerant of bad blocks.
>> */
>> - disk = choose_bb_rdev(conf, r1_bio, max_sectors);
>> - if (disk >= 0)
>> - return disk;
>> + if (!*max_sectors) {
>> + disk = choose_bb_rdev(conf, r1_bio, max_sectors);
>> + if (disk >= 0)
>> + return disk;
>> + }
>>
>> return choose_slow_rdev(conf, r1_bio, max_sectors);
>> }
>> @@ -1346,7 +1354,14 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
>> /*
>> * make_request() can abort the operation when read-ahead is being
>> * used and no empty request is available.
>> + *
>> + * If we allow splitting the bio while executing in the raid1 thread,
>> + * we may end up recursing (current->bio_list is NULL), and we might
>> + * also deadlock if we try to suspend the array, since we are
>> + * resubmitting an md_cloned_bio. Therefore, we must either read
>> + * all of the sectors or none.
>> */
>> + max_sectors = r1bio_existed;
>> rdisk = read_balance(conf, r1_bio, &max_sectors);
>> if (rdisk < 0) {
>> /* couldn't find anywhere to read from */
>
> --
> Thanks,
> Kuai
--
Best Regards,
Abd-Alrhman
Thread overview: 6+ messages (newest: 2026-04-28 9:46 UTC)
2026-04-27 10:34 [PATCH] md/raid1: fix bio splitting in raid1 thread to avoid recursion and deadlock Abd-Alrhman Masalkhi
2026-04-27 14:49 ` Paul Menzel
2026-04-27 17:44 ` Abd-Alrhman Masalkhi
2026-04-28 8:16 ` Abd-Alrhman Masalkhi
2026-04-28 8:54 ` Yu Kuai
2026-04-28 9:46 ` Abd-Alrhman Masalkhi