* [PATCH] md/raid1,raid10: don't broken array on failfast metadata write fails
@ 2025-08-12 9:01 Kenta Akagi
2025-08-13 0:59 ` Yu Kuai
0 siblings, 1 reply; 5+ messages in thread
From: Kenta Akagi @ 2025-08-12 9:01 UTC (permalink / raw)
To: Song Liu, Yu Kuai, Mariusz Tkaczyk; +Cc: linux-raid, linux-kernel, Kenta Akagi
It is not intended for the array to fail when a metadata write with
MD_FAILFAST fails.
After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"),
when md_error is called on the last device in RAID1/10,
the MD_BROKEN flag is set on the array.
Because of this, a failfast metadata write failure
puts the array into the "broken" state.
If rdev is not Faulty even after calling md_error,
the rdev is the last device, and there is nothing except
MD_BROKEN that prevents writes to the array.
Therefore, by clearing MD_BROKEN, the array will not become
"broken" after a failfast metadata write failure.
Fixes: 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10")
Signed-off-by: Kenta Akagi <k@mgml.me>
---
drivers/md/md.c | 1 +
drivers/md/md.h | 2 +-
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index ac85ec73a409..3ec4abf02fa0 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1002,6 +1002,7 @@ static void super_written(struct bio *bio)
md_error(mddev, rdev);
if (!test_bit(Faulty, &rdev->flags)
&& (bio->bi_opf & MD_FAILFAST)) {
+ clear_bit(MD_BROKEN, &mddev->flags);
set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
set_bit(LastDev, &rdev->flags);
}
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 51af29a03079..2f87bcc5d834 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -332,7 +332,7 @@ struct md_cluster_operations;
* resync lock, need to release the lock.
* @MD_FAILFAST_SUPPORTED: Using MD_FAILFAST on metadata writes is supported as
* calls to md_error() will never cause the array to
- * become failed.
+ * become failed while fail_last_dev is not set.
* @MD_HAS_PPL: The raid array has PPL feature set.
* @MD_HAS_MULTIPLE_PPLS: The raid array has multiple PPLs feature set.
* @MD_NOT_READY: do_md_run() is active, so 'array_state', ust not report that
--
2.50.1
* Re: [PATCH] md/raid1,raid10: don't broken array on failfast metadata write fails
2025-08-12 9:01 [PATCH] md/raid1,raid10: don't broken array on failfast metadata write fails Kenta Akagi
@ 2025-08-13 0:59 ` Yu Kuai
2025-08-14 15:54 ` Kenta Akagi
0 siblings, 1 reply; 5+ messages in thread
From: Yu Kuai @ 2025-08-13 0:59 UTC (permalink / raw)
To: Kenta Akagi, Song Liu, Mariusz Tkaczyk
Cc: linux-raid, linux-kernel, yukuai (C)
Hi,
On 2025/08/12 17:01, Kenta Akagi wrote:
> It is not intended for the array to fail when a metadata write with
> MD_FAILFAST fails.
> After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"),
> when md_error is called on the last device in RAID1/10,
> the MD_BROKEN flag is set on the array.
> Because of this, a failfast metadata write failure
> puts the array into the "broken" state.
>
> If rdev is not Faulty even after calling md_error,
> the rdev is the last device, and there is nothing except
> MD_BROKEN that prevents writes to the array.
> Therefore, by clearing MD_BROKEN, the array will not become
> "broken" after a failfast metadata write failure.
I don't follow this. I think MD_BROKEN is expected here: the last
rdev hit an I/O error while updating metadata, so the array is now broken
and can only be read afterwards. Allowing this broken array to be used
read-write might cause more severe problems such as data loss.
Thanks,
Kuai
* Re: [PATCH] md/raid1,raid10: don't broken array on failfast metadata write fails
2025-08-13 0:59 ` Yu Kuai
@ 2025-08-14 15:54 ` Kenta Akagi
2025-08-15 1:26 ` Yu Kuai
0 siblings, 1 reply; 5+ messages in thread
From: Kenta Akagi @ 2025-08-14 15:54 UTC (permalink / raw)
To: Yu Kuai, Song Liu, Mariusz Tkaczyk
Cc: linux-raid, linux-kernel, yukuai (C), Kenta Akagi
On 2025/08/13 9:59, Yu Kuai wrote:
> Hi,
>
> On 2025/08/12 17:01, Kenta Akagi wrote:
>> It is not intended for the array to fail when a metadata write with
>> MD_FAILFAST fails.
>> After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"),
>> when md_error is called on the last device in RAID1/10,
>> the MD_BROKEN flag is set on the array.
>> Because of this, a failfast metadata write failure
>> puts the array into the "broken" state.
>>
>> If rdev is not Faulty even after calling md_error,
>> the rdev is the last device, and there is nothing except
>> MD_BROKEN that prevents writes to the array.
>> Therefore, by clearing MD_BROKEN, the array will not become
>> "broken" after a failfast metadata write failure.
>
> I don't follow this. I think MD_BROKEN is expected here: the last
> rdev hit an I/O error while updating metadata, so the array is now broken
> and can only be read afterwards. Allowing this broken array to be used
> read-write might cause more severe problems such as data loss.
>
Thank you for reviewing.
I think that when the bio has the MD_FAILFAST flag, and only then,
a metadata write failure on the last rdev should not make the
array broken at that point.
This is because a metadata write with MD_FAILFAST is retried after
failure as follows:
1. In super_written, MD_SB_NEED_REWRITE is set in sb_flags.
2. md_super_wait, which is called by the function that
issued md_super_write and is waiting for completion,
returns -EAGAIN because MD_SB_NEED_REWRITE is set.
3. The caller of md_super_wait (such as md_update_sb)
receives the negative return value and retries md_super_write.
4. md_super_write, called again to perform
the same metadata write, issues a write bio
without MD_FAILFAST this time, because the rdev has the LastDev flag.
When super_written sees a failed bio without MD_FAILFAST,
the array is truly broken, and MD_BROKEN should be set.
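As a rough illustration of steps 1-4, here is a minimal userspace model.
It is not kernel code: the model_* names and both structs are
hypothetical stand-ins for the md flags named above, and it assumes the
last remaining device with fail_last_dev unset. The
clear_broken_on_failfast switch models super_written with and without
the clear_bit() added by this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy userspace model; the structs and model_* names are hypothetical. */
struct model_rdev {
	bool faulty;    /* stands in for the Faulty rdev flag */
	bool last_dev;  /* stands in for the LastDev rdev flag */
};

struct model_array {
	bool broken;        /* stands in for MD_BROKEN */
	bool need_rewrite;  /* stands in for MD_SB_NEED_REWRITE */
	bool clear_broken_on_failfast; /* true models the patched super_written */
	struct model_rdev rdev;        /* the last remaining device */
};

/* Models md_error() on the last device with fail_last_dev unset:
 * the array is flagged broken, but the rdev is not marked Faulty. */
static void model_md_error(struct model_array *a)
{
	a->broken = true;
}

/* Models the error path of super_written() for one failed bio (step 1). */
static void model_super_written_error(struct model_array *a, bool failfast)
{
	model_md_error(a);
	if (!a->rdev.faulty && failfast) {
		if (a->clear_broken_on_failfast)
			a->broken = false;
		a->need_rewrite = true;
		a->rdev.last_dev = true;
	}
}

/* Models md_super_wait(): -EAGAIN asks the caller to rewrite (step 2). */
static int model_super_wait(struct model_array *a)
{
	if (a->need_rewrite) {
		a->need_rewrite = false;
		return -EAGAIN;
	}
	return 0;
}

/* Models the retry loop in a caller such as md_update_sb() (step 3).
 * fail_first: the first (failfast) write fails, e.g. a short link outage.
 * fail_retry: the retried write (without failfast) fails as well.
 * Returns the number of writes issued. */
static int model_update_sb(struct model_array *a, bool fail_first,
			   bool fail_retry)
{
	int writes = 0;

	do {
		/* Failfast is used only while LastDev is not set (step 4). */
		bool failfast = !a->rdev.last_dev;
		bool fails = (writes == 0) ? fail_first : fail_retry;

		writes++;
		if (fails)
			model_super_written_error(a, failfast);
		else
			a->rdev.last_dev = false; /* success clears LastDev */
	} while (model_super_wait(a) == -EAGAIN);

	return writes;
}

/* Short outage: the failfast write fails, the retry succeeds. */
void model_selfcheck(void)
{
	struct model_array cur = { .clear_broken_on_failfast = false };
	struct model_array fix = { .clear_broken_on_failfast = true };

	assert(model_update_sb(&cur, true, false) == 2 && cur.broken);
	assert(model_update_sb(&fix, true, false) == 2 && !fix.broken);
}
```

In this model the retry succeeds in both cases, but only the variant
that clears the flag leaves the array writable afterwards. If the
retry also fails, both variants end up broken.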
A failfast bio, for example in the case of nvme-tcp,
fails immediately if the connection to the target is
lost for a few seconds and the device enters a reconnecting
state, even though it would recover if given a few seconds.
This behavior is exactly as intended by the design of failfast.
However, md treats a failed failfast superblock write as fatal.
For example, if an initiator (that is, the machine running the md module)
loses all connections for a few seconds, the array becomes
broken and subsequent writes are no longer possible.
This is the issue I am currently facing, and the one this patch aims to fix.
Should I add more context to the commit message? Please advise.
Thanks,
AKAGI
* Re: [PATCH] md/raid1,raid10: don't broken array on failfast metadata write fails
2025-08-14 15:54 ` Kenta Akagi
@ 2025-08-15 1:26 ` Yu Kuai
2025-08-15 19:12 ` Kenta Akagi
0 siblings, 1 reply; 5+ messages in thread
From: Yu Kuai @ 2025-08-15 1:26 UTC (permalink / raw)
To: Kenta Akagi, Yu Kuai, Song Liu, Mariusz Tkaczyk
Cc: linux-raid, linux-kernel, yukuai (C)
Hi,
On 2025/08/14 23:54, Kenta Akagi wrote:
> On 2025/08/13 9:59, Yu Kuai wrote:
>> Hi,
>>
>> On 2025/08/12 17:01, Kenta Akagi wrote:
>>> It is not intended for the array to fail when a metadata write with
>>> MD_FAILFAST fails.
>>> After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"),
>>> when md_error is called on the last device in RAID1/10,
>>> the MD_BROKEN flag is set on the array.
>>> Because of this, a failfast metadata write failure
>>> puts the array into the "broken" state.
>>>
>>> If rdev is not Faulty even after calling md_error,
>>> the rdev is the last device, and there is nothing except
>>> MD_BROKEN that prevents writes to the array.
>>> Therefore, by clearing MD_BROKEN, the array will not become
>>> "broken" after a failfast metadata write failure.
>>
>> I don't follow this. I think MD_BROKEN is expected here: the last
>> rdev hit an I/O error while updating metadata, so the array is now broken
>> and can only be read afterwards. Allowing this broken array to be used
>> read-write might cause more severe problems such as data loss.
>>
> Thank you for reviewing.
>
> I think that when the bio has the MD_FAILFAST flag, and only then,
> a metadata write failure on the last rdev should not make the
> array broken at that point.
>
> This is because a metadata write with MD_FAILFAST is retried after
> failure as follows:
>
> 1. In super_written, MD_SB_NEED_REWRITE is set in sb_flags.
>
> 2. md_super_wait, which is called by the function that
> issued md_super_write and is waiting for completion,
> returns -EAGAIN because MD_SB_NEED_REWRITE is set.
>
> 3. The caller of md_super_wait (such as md_update_sb)
> receives the negative return value and retries md_super_write.
>
> 4. md_super_write, called again to perform
> the same metadata write, issues a write bio
> without MD_FAILFAST this time, because the rdev has the LastDev flag.
>
> When super_written sees a failed bio without MD_FAILFAST,
> the array is truly broken, and MD_BROKEN should be set.
>
> A failfast bio, for example in the case of nvme-tcp,
> fails immediately if the connection to the target is
> lost for a few seconds and the device enters a reconnecting
> state, even though it would recover if given a few seconds.
> This behavior is exactly as intended by the design of failfast.
>
> However, md treats a failed failfast superblock write as fatal.
> For example, if an initiator (that is, the machine running the md module)
> loses all connections for a few seconds, the array becomes
> broken and subsequent writes are no longer possible.
> This is the issue I am currently facing, and the one this patch aims to fix.
>
> Should I add more context to the commit message? Please advise.
Yes, please explain this in detail in the commit message.
>
> Thanks,
> AKAGI
>
>> Thanks,
>> Kuai
>>
>>>
>>> Fixes: 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10")
>>> Signed-off-by: Kenta Akagi <k@mgml.me>
>>> ---
>>> drivers/md/md.c | 1 +
>>> drivers/md/md.h | 2 +-
>>> 2 files changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>> index ac85ec73a409..3ec4abf02fa0 100644
>>> --- a/drivers/md/md.c
>>> +++ b/drivers/md/md.c
>>> @@ -1002,6 +1002,7 @@ static void super_written(struct bio *bio)
>>> md_error(mddev, rdev);
>>> if (!test_bit(Faulty, &rdev->flags)
>>> && (bio->bi_opf & MD_FAILFAST)) {
>>> + clear_bit(MD_BROKEN, &mddev->flags);
And I feel a better way is to set MD_BROKEN only if the last rdev
failed; setting it in the middle and then clearing it is weird.
Thanks,
Kuai
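One possible reading of this suggestion, as a toy userspace sketch: the
error handler decides once whether the array is broken, instead of
setting the flag and clearing it afterwards. Everything here is
hypothetical (the struct, the sketch_* name, and the extra failfast
argument) and it is not taken from any posted version of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy state; not the kernel's struct mddev. */
struct sketch_array {
	int working_devs;  /* devices still usable */
	bool broken;       /* stands in for MD_BROKEN */
	bool last_dev;     /* stands in for LastDev on the remaining rdev */
};

/* Error handler that decides once, instead of set-then-clear.
 * Returns true if the caller should retry the write without failfast. */
bool sketch_handle_write_error(struct sketch_array *a, bool failfast)
{
	assert(a->working_devs >= 1);

	if (a->working_devs > 1) {
		a->working_devs--;   /* a mirror remains: just drop this device */
		return false;
	}
	if (failfast) {
		a->last_dev = true;  /* last device, failfast: retry, not broken yet */
		return true;
	}
	a->broken = true;            /* last device failed a normal write */
	return false;
}
```

With this shape, the broken flag is only ever set when a write without
failfast fails on the last device, so there is nothing to undo in the
caller.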
* Re: [PATCH] md/raid1,raid10: don't broken array on failfast metadata write fails
2025-08-15 1:26 ` Yu Kuai
@ 2025-08-15 19:12 ` Kenta Akagi
0 siblings, 0 replies; 5+ messages in thread
From: Kenta Akagi @ 2025-08-15 19:12 UTC (permalink / raw)
To: Yu Kuai, Song Liu, Mariusz Tkaczyk; +Cc: linux-raid, linux-kernel, yukuai (C)
On 2025/08/15 10:26, Yu Kuai wrote:
> Hi,
>
> On 2025/08/14 23:54, Kenta Akagi wrote:
>> On 2025/08/13 9:59, Yu Kuai wrote:
>>> Hi,
>>>
>>> On 2025/08/12 17:01, Kenta Akagi wrote:
>>>> It is not intended for the array to fail when a metadata write with
>>>> MD_FAILFAST fails.
>>>> After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"),
>>>> when md_error is called on the last device in RAID1/10,
>>>> the MD_BROKEN flag is set on the array.
>>>> Because of this, a failfast metadata write failure
>>>> puts the array into the "broken" state.
>>>>
>>>> If rdev is not Faulty even after calling md_error,
>>>> the rdev is the last device, and there is nothing except
>>>> MD_BROKEN that prevents writes to the array.
>>>> Therefore, by clearing MD_BROKEN, the array will not become
>>>> "broken" after a failfast metadata write failure.
>>>
>>> I don't follow this. I think MD_BROKEN is expected here: the last
>>> rdev hit an I/O error while updating metadata, so the array is now broken
>>> and can only be read afterwards. Allowing this broken array to be used
>>> read-write might cause more severe problems such as data loss.
>>>
>> Thank you for reviewing.
>>
>> I think that when the bio has the MD_FAILFAST flag, and only then,
>> a metadata write failure on the last rdev should not make the
>> array broken at that point.
>>
>> This is because a metadata write with MD_FAILFAST is retried after
>> failure as follows:
>>
>> 1. In super_written, MD_SB_NEED_REWRITE is set in sb_flags.
>>
>> 2. md_super_wait, which is called by the function that
>> issued md_super_write and is waiting for completion,
>> returns -EAGAIN because MD_SB_NEED_REWRITE is set.
>>
>> 3. The caller of md_super_wait (such as md_update_sb)
>> receives the negative return value and retries md_super_write.
>>
>> 4. md_super_write, called again to perform
>> the same metadata write, issues a write bio
>> without MD_FAILFAST this time, because the rdev has the LastDev flag.
>>
>> When super_written sees a failed bio without MD_FAILFAST,
>> the array is truly broken, and MD_BROKEN should be set.
>>
>> A failfast bio, for example in the case of nvme-tcp,
>> fails immediately if the connection to the target is
>> lost for a few seconds and the device enters a reconnecting
>> state, even though it would recover if given a few seconds.
>> This behavior is exactly as intended by the design of failfast.
>>
>> However, md treats a failed failfast superblock write as fatal.
>> For example, if an initiator (that is, the machine running the md module)
>> loses all connections for a few seconds, the array becomes
>> broken and subsequent writes are no longer possible.
>> This is the issue I am currently facing, and the one this patch aims to fix.
>>
>> Should I add more context to the commit message? Please advise.
>
> Yes, please explain this in detail in the commit message.
>>
>> Thanks,
>> AKAGI
>>
>>> Thanks,
>>> Kuai
>>>
>>>>
>>>> Fixes: 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10")
>>>> Signed-off-by: Kenta Akagi <k@mgml.me>
>>>> ---
>>>> drivers/md/md.c | 1 +
>>>> drivers/md/md.h | 2 +-
>>>> 2 files changed, 2 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>>> index ac85ec73a409..3ec4abf02fa0 100644
>>>> --- a/drivers/md/md.c
>>>> +++ b/drivers/md/md.c
>>>> @@ -1002,6 +1002,7 @@ static void super_written(struct bio *bio)
>>>> md_error(mddev, rdev);
>>>> if (!test_bit(Faulty, &rdev->flags)
>>>> && (bio->bi_opf & MD_FAILFAST)) {
>>>> + clear_bit(MD_BROKEN, &mddev->flags);
>
> And I feel a better way is to set MD_BROKEN only if the last rdev
> failed; setting it in the middle and then clearing it is weird.
Understood.
I'll revise the logic and the commit message, then send it out as v2.
Thanks,
Akagi