* [PATCH v2] md/raid1: Add check for missing source disk in process_checks()
@ 2025-04-08 14:38 Meir Elisha
2025-04-10 7:15 ` Yu Kuai
0 siblings, 1 reply; 4+ messages in thread
From: Meir Elisha @ 2025-04-08 14:38 UTC (permalink / raw)
To: Song Liu, Yu Kuai; +Cc: linux-raid, Meir Elisha
During recovery/check operations, the process_checks function loops
through available disks to find a 'primary' source with successfully
read data.
If no suitable source disk is found after checking all possibilities,
the 'primary' index will reach conf->raid_disks * 2. Add an explicit
check for this condition after the loop. If no source disk was found,
print an error message and return early to prevent further processing
without a valid primary source.
Signed-off-by: Meir Elisha <meir.elisha@volumez.com>
---
Changes from v1:
- Don't fix read errors on recovery/check
This was observed when forcefully disconnecting all iSCSI devices backing
a RAID1 array(using --failfast flag) during a check operation, causing
all reads within process_checks to fail. The resulting kernel oops shows:
BUG: kernel NULL pointer dereference, address: 0000000000000040
RIP: 0010:process_checks+0x25e/0x5e0 [raid1]
Code: ... <4c> 8b 53 40 ... // mov r10,[rbx+0x40]
Call Trace:
process_checks
sync_request_write
raid1d
md_thread
kthread
ret_from_fork
drivers/md/raid1.c | 26 ++++++++++++++++----------
1 file changed, 16 insertions(+), 10 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 0efc03cea24e..de9bccbe7337 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2200,14 +2200,9 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
if (!rdev_set_badblocks(rdev, sect, s, 0))
abort = 1;
}
- if (abort) {
- conf->recovery_disabled =
- mddev->recovery_disabled;
- set_bit(MD_RECOVERY_INTR, &mddev->recovery);
- md_done_sync(mddev, r1_bio->sectors, 0);
- put_buf(r1_bio);
+ if (abort)
return 0;
- }
+
/* Try next page */
sectors -= s;
sect += s;
@@ -2346,10 +2341,21 @@ static void sync_request_write(struct mddev *mddev, struct r1bio *r1_bio)
int disks = conf->raid_disks * 2;
struct bio *wbio;
- if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
- /* ouch - failed to read all of that. */
- if (!fix_sync_read_error(r1_bio))
+ if (!test_bit(R1BIO_Uptodate, &r1_bio->state)) {
+ /*
+ * ouch - failed to read all of that.
+ * No need to fix read error for check/repair
+ * because all member disks are read.
+ */
+ if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) ||
+ !fix_sync_read_error(r1_bio)) {
+ conf->recovery_disabled = mddev->recovery_disabled;
+ set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+ md_done_sync(mddev, r1_bio->sectors, 0);
+ put_buf(r1_bio);
return;
+ }
+ }
if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery))
process_checks(r1_bio);
--
2.34.1
^ permalink raw reply related [flat|nested] 4+ messages in thread* Re: [PATCH v2] md/raid1: Add check for missing source disk in process_checks()
2025-04-08 14:38 [PATCH v2] md/raid1: Add check for missing source disk in process_checks() Meir Elisha
@ 2025-04-10 7:15 ` Yu Kuai
2025-04-10 8:59 ` Meir Elisha
0 siblings, 1 reply; 4+ messages in thread
From: Yu Kuai @ 2025-04-10 7:15 UTC (permalink / raw)
To: Meir Elisha, Song Liu; +Cc: linux-raid, yukuai (C)
Hi,
在 2025/04/08 22:38, Meir Elisha 写道:
> During recovery/check operations, the process_checks function loops
> through available disks to find a 'primary' source with successfully
> read data.
>
> If no suitable source disk is found after checking all possibilities,
> the 'primary' index will reach conf->raid_disks * 2. Add an explicit
> check for this condition after the loop. If no source disk was found,
> print an error message and return early to prevent further processing
> without a valid primary source.
>
> Signed-off-by: Meir Elisha <meir.elisha@volumez.com>
> ---
Just to be sure, do you tested this version?
Thanks,
Kuai
> Changes from v1:
> - Don't fix read errors on recovery/check
>
> This was observed when forcefully disconnecting all iSCSI devices backing
> a RAID1 array(using --failfast flag) during a check operation, causing
> all reads within process_checks to fail. The resulting kernel oops shows:
>
> BUG: kernel NULL pointer dereference, address: 0000000000000040
> RIP: 0010:process_checks+0x25e/0x5e0 [raid1]
> Code: ... <4c> 8b 53 40 ... // mov r10,[rbx+0x40]
> Call Trace:
> process_checks
> sync_request_write
> raid1d
> md_thread
> kthread
> ret_from_fork
>
> drivers/md/raid1.c | 26 ++++++++++++++++----------
> 1 file changed, 16 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 0efc03cea24e..de9bccbe7337 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -2200,14 +2200,9 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
> if (!rdev_set_badblocks(rdev, sect, s, 0))
> abort = 1;
> }
> - if (abort) {
> - conf->recovery_disabled =
> - mddev->recovery_disabled;
> - set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> - md_done_sync(mddev, r1_bio->sectors, 0);
> - put_buf(r1_bio);
> + if (abort)
> return 0;
> - }
> +
> /* Try next page */
> sectors -= s;
> sect += s;
> @@ -2346,10 +2341,21 @@ static void sync_request_write(struct mddev *mddev, struct r1bio *r1_bio)
> int disks = conf->raid_disks * 2;
> struct bio *wbio;
>
> - if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
> - /* ouch - failed to read all of that. */
> - if (!fix_sync_read_error(r1_bio))
> + if (!test_bit(R1BIO_Uptodate, &r1_bio->state)) {
> + /*
> + * ouch - failed to read all of that.
> + * No need to fix read error for check/repair
> + * because all member disks are read.
> + */
> + if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) ||
> + !fix_sync_read_error(r1_bio)) {
> + conf->recovery_disabled = mddev->recovery_disabled;
> + set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> + md_done_sync(mddev, r1_bio->sectors, 0);
> + put_buf(r1_bio);
> return;
> + }
> + }
>
> if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery))
> process_checks(r1_bio);
>
^ permalink raw reply [flat|nested] 4+ messages in thread* Re: [PATCH v2] md/raid1: Add check for missing source disk in process_checks()
2025-04-10 7:15 ` Yu Kuai
@ 2025-04-10 8:59 ` Meir Elisha
2025-04-12 1:28 ` Yu Kuai
0 siblings, 1 reply; 4+ messages in thread
From: Meir Elisha @ 2025-04-10 8:59 UTC (permalink / raw)
To: Yu Kuai, Song Liu; +Cc: linux-raid, yukuai (C)
Hi,
On 10/04/2025 10:15, Yu Kuai wrote:
> Hi,
>
> 在 2025/04/08 22:38, Meir Elisha 写道:
>> During recovery/check operations, the process_checks function loops
>> through available disks to find a 'primary' source with successfully
>> read data.
>>
>> If no suitable source disk is found after checking all possibilities,
>> the 'primary' index will reach conf->raid_disks * 2. Add an explicit
>> check for this condition after the loop. If no source disk was found,
>> print an error message and return early to prevent further processing
>> without a valid primary source.
>>
>> Signed-off-by: Meir Elisha <meir.elisha@volumez.com>
>> ---
>
> Just to be sure, do you tested this version?
>
> Thanks,
> Kuai
I wasn't able to reproduce the crash when using this patch.
>> Changes from v1:
>> - Don't fix read errors on recovery/check
>>
>> This was observed when forcefully disconnecting all iSCSI devices backing
>> a RAID1 array(using --failfast flag) during a check operation, causing
>> all reads within process_checks to fail. The resulting kernel oops shows:
>>
>> BUG: kernel NULL pointer dereference, address: 0000000000000040
>> RIP: 0010:process_checks+0x25e/0x5e0 [raid1]
>> Code: ... <4c> 8b 53 40 ... // mov r10,[rbx+0x40]
>> Call Trace:
>> process_checks
>> sync_request_write
>> raid1d
>> md_thread
>> kthread
>> ret_from_fork
>>
>> drivers/md/raid1.c | 26 ++++++++++++++++----------
>> 1 file changed, 16 insertions(+), 10 deletions(-)
>>
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index 0efc03cea24e..de9bccbe7337 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -2200,14 +2200,9 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
>> if (!rdev_set_badblocks(rdev, sect, s, 0))
>> abort = 1;
>> }
>> - if (abort) {
>> - conf->recovery_disabled =
>> - mddev->recovery_disabled;
>> - set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>> - md_done_sync(mddev, r1_bio->sectors, 0);
>> - put_buf(r1_bio);
>> + if (abort)
>> return 0;
>> - }
>> +
>> /* Try next page */
>> sectors -= s;
>> sect += s;
>> @@ -2346,10 +2341,21 @@ static void sync_request_write(struct mddev *mddev, struct r1bio *r1_bio)
>> int disks = conf->raid_disks * 2;
>> struct bio *wbio;
>> - if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
>> - /* ouch - failed to read all of that. */
>> - if (!fix_sync_read_error(r1_bio))
>> + if (!test_bit(R1BIO_Uptodate, &r1_bio->state)) {
>> + /*
>> + * ouch - failed to read all of that.
>> + * No need to fix read error for check/repair
>> + * because all member disks are read.
>> + */
>> + if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) ||
>> + !fix_sync_read_error(r1_bio)) {
>> + conf->recovery_disabled = mddev->recovery_disabled;
>> + set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>> + md_done_sync(mddev, r1_bio->sectors, 0);
>> + put_buf(r1_bio);
>> return;
>> + }
>> + }
>> if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery))
>> process_checks(r1_bio);
>>
>
^ permalink raw reply [flat|nested] 4+ messages in thread* Re: [PATCH v2] md/raid1: Add check for missing source disk in process_checks()
2025-04-10 8:59 ` Meir Elisha
@ 2025-04-12 1:28 ` Yu Kuai
0 siblings, 0 replies; 4+ messages in thread
From: Yu Kuai @ 2025-04-12 1:28 UTC (permalink / raw)
To: Meir Elisha, Yu Kuai, Song Liu; +Cc: linux-raid, yukuai (C)
Hi,
在 2025/04/10 16:59, Meir Elisha 写道:
> Hi,
>
> On 10/04/2025 10:15, Yu Kuai wrote:
>> Hi,
>>
>> 在 2025/04/08 22:38, Meir Elisha 写道:
>>> During recovery/check operations, the process_checks function loops
>>> through available disks to find a 'primary' source with successfully
>>> read data.
>>>
>>> If no suitable source disk is found after checking all possibilities,
>>> the 'primary' index will reach conf->raid_disks * 2. Add an explicit
>>> check for this condition after the loop. If no source disk was found,
>>> print an error message and return early to prevent further processing
>>> without a valid primary source.
>>>
>>> Signed-off-by: Meir Elisha <meir.elisha@volumez.com>
>>> ---
>>
>> Just to be sure, do you tested this version?
>>
>> Thanks,
>> Kuai
>
> I wasn't able to reproduce the crash when using this patch.
Good.
Suggested-and-reviewed-by: Yu Kuai <yukuai3@huawei.com>
>
>>> Changes from v1:
>>> - Don't fix read errors on recovery/check
>>>
>>> This was observed when forcefully disconnecting all iSCSI devices backing
>>> a RAID1 array(using --failfast flag) during a check operation, causing
>>> all reads within process_checks to fail. The resulting kernel oops shows:
>>>
>>> BUG: kernel NULL pointer dereference, address: 0000000000000040
>>> RIP: 0010:process_checks+0x25e/0x5e0 [raid1]
>>> Code: ... <4c> 8b 53 40 ... // mov r10,[rbx+0x40]
>>> Call Trace:
>>> process_checks
>>> sync_request_write
>>> raid1d
>>> md_thread
>>> kthread
>>> ret_from_fork
>>>
>>> drivers/md/raid1.c | 26 ++++++++++++++++----------
>>> 1 file changed, 16 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>> index 0efc03cea24e..de9bccbe7337 100644
>>> --- a/drivers/md/raid1.c
>>> +++ b/drivers/md/raid1.c
>>> @@ -2200,14 +2200,9 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
>>> if (!rdev_set_badblocks(rdev, sect, s, 0))
>>> abort = 1;
>>> }
>>> - if (abort) {
>>> - conf->recovery_disabled =
>>> - mddev->recovery_disabled;
>>> - set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>>> - md_done_sync(mddev, r1_bio->sectors, 0);
>>> - put_buf(r1_bio);
>>> + if (abort)
>>> return 0;
>>> - }
>>> +
>>> /* Try next page */
>>> sectors -= s;
>>> sect += s;
>>> @@ -2346,10 +2341,21 @@ static void sync_request_write(struct mddev *mddev, struct r1bio *r1_bio)
>>> int disks = conf->raid_disks * 2;
>>> struct bio *wbio;
>>> - if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
>>> - /* ouch - failed to read all of that. */
>>> - if (!fix_sync_read_error(r1_bio))
>>> + if (!test_bit(R1BIO_Uptodate, &r1_bio->state)) {
>>> + /*
>>> + * ouch - failed to read all of that.
>>> + * No need to fix read error for check/repair
>>> + * because all member disks are read.
>>> + */
>>> + if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) ||
>>> + !fix_sync_read_error(r1_bio)) {
>>> + conf->recovery_disabled = mddev->recovery_disabled;
>>> + set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>>> + md_done_sync(mddev, r1_bio->sectors, 0);
>>> + put_buf(r1_bio);
>>> return;
>>> + }
>>> + }
>>> if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery))
>>> process_checks(r1_bio);
>>>
>>
> .
>
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2025-04-12 1:28 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-08 14:38 [PATCH v2] md/raid1: Add check for missing source disk in process_checks() Meir Elisha
2025-04-10 7:15 ` Yu Kuai
2025-04-10 8:59 ` Meir Elisha
2025-04-12 1:28 ` Yu Kuai
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox