* [PATCH] md/raid1,raid10: don't broken array on failfast metadata write fails
From: Kenta Akagi @ 2025-08-12 9:01 UTC
To: Song Liu, Yu Kuai, Mariusz Tkaczyk
Cc: linux-raid, linux-kernel, Kenta Akagi

It is not intended for the array to fail when a metadata write with
MD_FAILFAST fails.
After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"),
when md_error is called on the last device in RAID1/10,
the MD_BROKEN flag is set on the array.
Because of this, a failfast metadata write failure will
put the array into the "broken" state.

If rdev is not Faulty even after calling md_error,
the rdev is the last device, and there is nothing except
MD_BROKEN that prevents writes to the array.
Therefore, by clearing MD_BROKEN, the array will not become
"broken" after a failfast metadata write failure.

Fixes: 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10")
Signed-off-by: Kenta Akagi <k@mgml.me>
---
 drivers/md/md.c | 1 +
 drivers/md/md.h | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index ac85ec73a409..3ec4abf02fa0 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1002,6 +1002,7 @@ static void super_written(struct bio *bio)
 		md_error(mddev, rdev);
 		if (!test_bit(Faulty, &rdev->flags)
 		    && (bio->bi_opf & MD_FAILFAST)) {
+			clear_bit(MD_BROKEN, &mddev->flags);
 			set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
 			set_bit(LastDev, &rdev->flags);
 		}
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 51af29a03079..2f87bcc5d834 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -332,7 +332,7 @@ struct md_cluster_operations;
  *			   resync lock, need to release the lock.
  * @MD_FAILFAST_SUPPORTED: Using MD_FAILFAST on metadata writes is supported as
  *			   calls to md_error() will never cause the array to
- *			   become failed.
+ *			   become failed while fail_last_dev is not set.
  * @MD_HAS_PPL:  The raid array has PPL feature set.
  * @MD_HAS_MULTIPLE_PPLS: The raid array has multiple PPLs feature set.
  * @MD_NOT_READY: do_md_run() is active, so 'array_state', ust not report that
-- 
2.50.1
* Re: [PATCH] md/raid1,raid10: don't broken array on failfast metadata write fails
From: Yu Kuai @ 2025-08-13 0:59 UTC
To: Kenta Akagi, Song Liu, Mariusz Tkaczyk
Cc: linux-raid, linux-kernel, yukuai (C)

Hi,

On 2025/08/12 17:01, Kenta Akagi wrote:
> It is not intended for the array to fail when a metadata write with
> MD_FAILFAST fails.
> After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"),
> when md_error is called on the last device in RAID1/10,
> the MD_BROKEN flag is set on the array.
> Because of this, a failfast metadata write failure will
> put the array into the "broken" state.
>
> If rdev is not Faulty even after calling md_error,
> the rdev is the last device, and there is nothing except
> MD_BROKEN that prevents writes to the array.
> Therefore, by clearing MD_BROKEN, the array will not become
> "broken" after a failfast metadata write failure.

I don't understand this. I think MD_BROKEN is expected: the last
rdev had an IO error while updating metadata, so the array is now broken
and you can only read it afterwards. Allowing this broken array to be
used read-write might cause more severe problems, such as data loss.

Thanks,
Kuai

>
> Fixes: 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10")
> Signed-off-by: Kenta Akagi <k@mgml.me>
> ---
>  drivers/md/md.c | 1 +
>  drivers/md/md.h | 2 +-
>  2 files changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index ac85ec73a409..3ec4abf02fa0 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -1002,6 +1002,7 @@ static void super_written(struct bio *bio)
>  		md_error(mddev, rdev);
>  		if (!test_bit(Faulty, &rdev->flags)
>  		    && (bio->bi_opf & MD_FAILFAST)) {
> +			clear_bit(MD_BROKEN, &mddev->flags);
>  			set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
>  			set_bit(LastDev, &rdev->flags);
>  		}
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 51af29a03079..2f87bcc5d834 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -332,7 +332,7 @@ struct md_cluster_operations;
>   *			   resync lock, need to release the lock.
>   * @MD_FAILFAST_SUPPORTED: Using MD_FAILFAST on metadata writes is supported as
>   *			   calls to md_error() will never cause the array to
> - *			   become failed.
> + *			   become failed while fail_last_dev is not set.
>   * @MD_HAS_PPL:  The raid array has PPL feature set.
>   * @MD_HAS_MULTIPLE_PPLS: The raid array has multiple PPLs feature set.
>   * @MD_NOT_READY: do_md_run() is active, so 'array_state', ust not report that
>
* Re: [PATCH] md/raid1,raid10: don't broken array on failfast metadata write fails
From: Kenta Akagi @ 2025-08-14 15:54 UTC
To: Yu Kuai, Song Liu, Mariusz Tkaczyk
Cc: linux-raid, linux-kernel, yukuai (C), Kenta Akagi

On 2025/08/13 9:59, Yu Kuai wrote:
> Hi,
>
> On 2025/08/12 17:01, Kenta Akagi wrote:
>> It is not intended for the array to fail when a metadata write with
>> MD_FAILFAST fails.
>> After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"),
>> when md_error is called on the last device in RAID1/10,
>> the MD_BROKEN flag is set on the array.
>> Because of this, a failfast metadata write failure will
>> put the array into the "broken" state.
>>
>> If rdev is not Faulty even after calling md_error,
>> the rdev is the last device, and there is nothing except
>> MD_BROKEN that prevents writes to the array.
>> Therefore, by clearing MD_BROKEN, the array will not become
>> "broken" after a failfast metadata write failure.
>
> I don't understand this. I think MD_BROKEN is expected: the last
> rdev had an IO error while updating metadata, so the array is now broken
> and you can only read it afterwards. Allowing this broken array to be
> used read-write might cause more severe problems, such as data loss.
>

Thank you for reviewing.

I think that, only when the bio has the MD_FAILFAST flag,
a metadata write failure to the last rdev should not make
the array broken at that point.

This is because a metadata write with MD_FAILFAST is retried after
failure as follows:

1. In super_written, MD_SB_NEED_REWRITE is set in sb_flags.

2. In md_super_wait, which is called by the function that
   executed md_super_write and waits for completion,
   -EAGAIN is returned because MD_SB_NEED_REWRITE is set.

3. The caller of md_super_wait (such as md_update_sb)
   receives a negative return value and then retries md_super_write.

4. The md_super_write function, which is called to perform
   the same metadata write, issues a write bio
   without MD_FAILFAST this time, because the rdev has the LastDev flag.

When a bio from super_written without MD_FAILFAST fails,
the array is truly broken, and MD_BROKEN should be set.

A failfast bio, for example in the case of nvme-tcp,
will fail immediately if the connection to the target is
lost for a few seconds and the device enters a reconnecting
state - even though it would recover if given a few seconds.
This behavior is exactly as intended by the design of failfast.

However, md treats super_write operations that fail with failfast as fatal.
For example, if an initiator - that is, a machine loading the md module -
loses all connections for a few seconds, the array becomes
broken and subsequent writes are no longer possible.
This is the issue I am currently facing, and which this patch aims to fix.

Should I add more context to the commit message? Please advise.

Thanks,
AKAGI

> Thanks,
> Kuai
>
>>
>> Fixes: 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10")
>> Signed-off-by: Kenta Akagi <k@mgml.me>
>> ---
>>  drivers/md/md.c | 1 +
>>  drivers/md/md.h | 2 +-
>>  2 files changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index ac85ec73a409..3ec4abf02fa0 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -1002,6 +1002,7 @@ static void super_written(struct bio *bio)
>>  		md_error(mddev, rdev);
>>  		if (!test_bit(Faulty, &rdev->flags)
>>  		    && (bio->bi_opf & MD_FAILFAST)) {
>> +			clear_bit(MD_BROKEN, &mddev->flags);
>>  			set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
>>  			set_bit(LastDev, &rdev->flags);
>>  		}
>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>> index 51af29a03079..2f87bcc5d834 100644
>> --- a/drivers/md/md.h
>> +++ b/drivers/md/md.h
>> @@ -332,7 +332,7 @@ struct md_cluster_operations;
>>   *			   resync lock, need to release the lock.
>>   * @MD_FAILFAST_SUPPORTED: Using MD_FAILFAST on metadata writes is supported as
>>   *			   calls to md_error() will never cause the array to
>> - *			   become failed.
>> + *			   become failed while fail_last_dev is not set.
>>   * @MD_HAS_PPL:  The raid array has PPL feature set.
>>   * @MD_HAS_MULTIPLE_PPLS: The raid array has multiple PPLs feature set.
>>   * @MD_NOT_READY: do_md_run() is active, so 'array_state', ust not report that
>>
>
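
Editor's sketch: the four-step retry sequence described in the message above can be modelled in a few lines of userspace C. This is a hedged illustration, not kernel code. Every model_* name, both structs, and the clear_broken switch are hypothetical, and the device is assumed to be in a transient outage in which failfast writes fail at once while non-failfast writes eventually succeed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the rdev and mddev flag bits named in the thread. */
struct model_rdev {
	bool faulty;		/* models the Faulty bit */
	bool last_dev;		/* models the LastDev bit */
};

struct model_mddev {
	bool broken;		/* models MD_BROKEN */
	bool sb_need_rewrite;	/* models MD_SB_NEED_REWRITE */
	bool clear_broken;	/* true = behaviour proposed by the patch */
};

/* Models md_error() on the last device of RAID1/10 as described in the
 * commit message: the rdev is not marked Faulty, the array is flagged broken. */
static void model_md_error(struct model_mddev *mddev, struct model_rdev *rdev)
{
	(void)rdev;
	mddev->broken = true;
}

/* Models the error path of super_written() shown in the diff (step 1). */
static void model_super_written(struct model_mddev *mddev,
				struct model_rdev *rdev,
				bool io_failed, bool failfast)
{
	if (!io_failed)
		return;
	model_md_error(mddev, rdev);
	if (!rdev->faulty && failfast) {
		if (mddev->clear_broken)
			mddev->broken = false;	/* the line the patch adds */
		mddev->sb_need_rewrite = true;
		rdev->last_dev = true;
	}
}

/* Step 4: failfast is dropped once the rdev carries LastDev. */
static void model_super_write(struct model_mddev *mddev, struct model_rdev *rdev)
{
	bool failfast = !rdev->last_dev;
	bool io_failed = failfast;	/* assumption: transient outage */

	model_super_written(mddev, rdev, io_failed, failfast);
}

/* Step 2: report -EAGAIN when a rewrite was requested. */
static int model_super_wait(struct model_mddev *mddev)
{
	if (mddev->sb_need_rewrite) {
		mddev->sb_need_rewrite = false;
		return -EAGAIN;
	}
	return 0;
}

/* Step 3: the caller retries until the wait succeeds.
 * Returns the final state of the modelled MD_BROKEN flag. */
bool model_run(bool clear_broken, int *attempts)
{
	struct model_mddev mddev = { .clear_broken = clear_broken };
	struct model_rdev rdev = { 0 };
	int n = 0;

	do {
		model_super_write(&mddev, &rdev);
		n++;
	} while (model_super_wait(&mddev) < 0);

	assert(n <= 2);	/* in this model there is at most one retry */
	if (attempts != NULL)
		*attempts = n;
	return mddev.broken;
}
```

Under these assumptions, model_run(false, &n) returns true: the modelled array stays broken even though the non-failfast retry succeeded. model_run(true, &n) returns false. Both calls take two write attempts.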
* Re: [PATCH] md/raid1,raid10: don't broken array on failfast metadata write fails
From: Yu Kuai @ 2025-08-15 1:26 UTC
To: Kenta Akagi, Yu Kuai, Song Liu, Mariusz Tkaczyk
Cc: linux-raid, linux-kernel, yukuai (C)

Hi,

On 2025/08/14 23:54, Kenta Akagi wrote:
> On 2025/08/13 9:59, Yu Kuai wrote:
>> Hi,
>>
>> On 2025/08/12 17:01, Kenta Akagi wrote:
>>> It is not intended for the array to fail when a metadata write with
>>> MD_FAILFAST fails.
>>> After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"),
>>> when md_error is called on the last device in RAID1/10,
>>> the MD_BROKEN flag is set on the array.
>>> Because of this, a failfast metadata write failure will
>>> put the array into the "broken" state.
>>>
>>> If rdev is not Faulty even after calling md_error,
>>> the rdev is the last device, and there is nothing except
>>> MD_BROKEN that prevents writes to the array.
>>> Therefore, by clearing MD_BROKEN, the array will not become
>>> "broken" after a failfast metadata write failure.
>>
>> I don't understand this. I think MD_BROKEN is expected: the last
>> rdev had an IO error while updating metadata, so the array is now broken
>> and you can only read it afterwards. Allowing this broken array to be
>> used read-write might cause more severe problems, such as data loss.
>>
> Thank you for reviewing.
>
> I think that, only when the bio has the MD_FAILFAST flag,
> a metadata write failure to the last rdev should not make
> the array broken at that point.
>
> This is because a metadata write with MD_FAILFAST is retried after
> failure as follows:
>
> 1. In super_written, MD_SB_NEED_REWRITE is set in sb_flags.
>
> 2. In md_super_wait, which is called by the function that
>    executed md_super_write and waits for completion,
>    -EAGAIN is returned because MD_SB_NEED_REWRITE is set.
>
> 3. The caller of md_super_wait (such as md_update_sb)
>    receives a negative return value and then retries md_super_write.
>
> 4. The md_super_write function, which is called to perform
>    the same metadata write, issues a write bio
>    without MD_FAILFAST this time, because the rdev has the LastDev flag.
>
> When a bio from super_written without MD_FAILFAST fails,
> the array is truly broken, and MD_BROKEN should be set.
>
> A failfast bio, for example in the case of nvme-tcp,
> will fail immediately if the connection to the target is
> lost for a few seconds and the device enters a reconnecting
> state - even though it would recover if given a few seconds.
> This behavior is exactly as intended by the design of failfast.
>
> However, md treats super_write operations that fail with failfast as fatal.
> For example, if an initiator - that is, a machine loading the md module -
> loses all connections for a few seconds, the array becomes
> broken and subsequent writes are no longer possible.
> This is the issue I am currently facing, and which this patch aims to fix.
>
> Should I add more context to the commit message? Please advise.

Yes, please explain in detail in the commit message.

>
> Thanks,
> AKAGI
>
>> Thanks,
>> Kuai
>>
>>>
>>> Fixes: 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10")
>>> Signed-off-by: Kenta Akagi <k@mgml.me>
>>> ---
>>>  drivers/md/md.c | 1 +
>>>  drivers/md/md.h | 2 +-
>>>  2 files changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>> index ac85ec73a409..3ec4abf02fa0 100644
>>> --- a/drivers/md/md.c
>>> +++ b/drivers/md/md.c
>>> @@ -1002,6 +1002,7 @@ static void super_written(struct bio *bio)
>>>  		md_error(mddev, rdev);
>>>  		if (!test_bit(Faulty, &rdev->flags)
>>>  		    && (bio->bi_opf & MD_FAILFAST)) {
>>> +			clear_bit(MD_BROKEN, &mddev->flags);

And I feel a better way is to set MD_BROKEN only if the last rdev
failed; setting it in the middle and then clearing it is weird.

Thanks,
Kuai

>>>  			set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
>>>  			set_bit(LastDev, &rdev->flags);
>>>  		}
>>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>>> index 51af29a03079..2f87bcc5d834 100644
>>> --- a/drivers/md/md.h
>>> +++ b/drivers/md/md.h
>>> @@ -332,7 +332,7 @@ struct md_cluster_operations;
>>>   *			   resync lock, need to release the lock.
>>>   * @MD_FAILFAST_SUPPORTED: Using MD_FAILFAST on metadata writes is supported as
>>>   *			   calls to md_error() will never cause the array to
>>> - *			   become failed.
>>> + *			   become failed while fail_last_dev is not set.
>>>   * @MD_HAS_PPL:  The raid array has PPL feature set.
>>>   * @MD_HAS_MULTIPLE_PPLS: The raid array has multiple PPLs feature set.
>>>   * @MD_NOT_READY: do_md_run() is active, so 'array_state', ust not report that
>>>
>>
>
* Re: [PATCH] md/raid1,raid10: don't broken array on failfast metadata write fails
From: Kenta Akagi @ 2025-08-15 19:12 UTC
To: Yu Kuai, Song Liu, Mariusz Tkaczyk
Cc: linux-raid, linux-kernel, yukuai (C)

On 2025/08/15 10:26, Yu Kuai wrote:
> Hi,
>
> On 2025/08/14 23:54, Kenta Akagi wrote:
>> On 2025/08/13 9:59, Yu Kuai wrote:
>>> Hi,
>>>
>>> On 2025/08/12 17:01, Kenta Akagi wrote:
>>>> It is not intended for the array to fail when a metadata write with
>>>> MD_FAILFAST fails.
>>>> After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"),
>>>> when md_error is called on the last device in RAID1/10,
>>>> the MD_BROKEN flag is set on the array.
>>>> Because of this, a failfast metadata write failure will
>>>> put the array into the "broken" state.
>>>>
>>>> If rdev is not Faulty even after calling md_error,
>>>> the rdev is the last device, and there is nothing except
>>>> MD_BROKEN that prevents writes to the array.
>>>> Therefore, by clearing MD_BROKEN, the array will not become
>>>> "broken" after a failfast metadata write failure.
>>>
>>> I don't understand this. I think MD_BROKEN is expected: the last
>>> rdev had an IO error while updating metadata, so the array is now broken
>>> and you can only read it afterwards. Allowing this broken array to be
>>> used read-write might cause more severe problems, such as data loss.
>>>
>> Thank you for reviewing.
>>
>> I think that, only when the bio has the MD_FAILFAST flag,
>> a metadata write failure to the last rdev should not make
>> the array broken at that point.
>>
>> This is because a metadata write with MD_FAILFAST is retried after
>> failure as follows:
>>
>> 1. In super_written, MD_SB_NEED_REWRITE is set in sb_flags.
>>
>> 2. In md_super_wait, which is called by the function that
>>    executed md_super_write and waits for completion,
>>    -EAGAIN is returned because MD_SB_NEED_REWRITE is set.
>>
>> 3. The caller of md_super_wait (such as md_update_sb)
>>    receives a negative return value and then retries md_super_write.
>>
>> 4. The md_super_write function, which is called to perform
>>    the same metadata write, issues a write bio
>>    without MD_FAILFAST this time, because the rdev has the LastDev flag.
>>
>> When a bio from super_written without MD_FAILFAST fails,
>> the array is truly broken, and MD_BROKEN should be set.
>>
>> A failfast bio, for example in the case of nvme-tcp,
>> will fail immediately if the connection to the target is
>> lost for a few seconds and the device enters a reconnecting
>> state - even though it would recover if given a few seconds.
>> This behavior is exactly as intended by the design of failfast.
>>
>> However, md treats super_write operations that fail with failfast as fatal.
>> For example, if an initiator - that is, a machine loading the md module -
>> loses all connections for a few seconds, the array becomes
>> broken and subsequent writes are no longer possible.
>> This is the issue I am currently facing, and which this patch aims to fix.
>>
>> Should I add more context to the commit message? Please advise.
>
> Yes, please explain in detail in the commit message.
>>
>> Thanks,
>> AKAGI
>>
>>> Thanks,
>>> Kuai
>>>
>>>>
>>>> Fixes: 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10")
>>>> Signed-off-by: Kenta Akagi <k@mgml.me>
>>>> ---
>>>>  drivers/md/md.c | 1 +
>>>>  drivers/md/md.h | 2 +-
>>>>  2 files changed, 2 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>>> index ac85ec73a409..3ec4abf02fa0 100644
>>>> --- a/drivers/md/md.c
>>>> +++ b/drivers/md/md.c
>>>> @@ -1002,6 +1002,7 @@ static void super_written(struct bio *bio)
>>>>  		md_error(mddev, rdev);
>>>>  		if (!test_bit(Faulty, &rdev->flags)
>>>>  		    && (bio->bi_opf & MD_FAILFAST)) {
>>>> +			clear_bit(MD_BROKEN, &mddev->flags);
>
> And I feel a better way is to set MD_BROKEN only if the last rdev
> failed; setting it in the middle and then clearing it is weird.

Copy. I'll modify the logic and the commit message, then send it out as v2.

Thanks,
Akagi

> Thanks,
> Kuai
>
>>>>  			set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
>>>>  			set_bit(LastDev, &rdev->flags);
>>>>  		}
>>>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>>>> index 51af29a03079..2f87bcc5d834 100644
>>>> --- a/drivers/md/md.h
>>>> +++ b/drivers/md/md.h
>>>> @@ -332,7 +332,7 @@ struct md_cluster_operations;
>>>>   *			   resync lock, need to release the lock.
>>>>   * @MD_FAILFAST_SUPPORTED: Using MD_FAILFAST on metadata writes is supported as
>>>>   *			   calls to md_error() will never cause the array to
>>>> - *			   become failed.
>>>> + *			   become failed while fail_last_dev is not set.
>>>>   * @MD_HAS_PPL:  The raid array has PPL feature set.
>>>>   * @MD_HAS_MULTIPLE_PPLS: The raid array has multiple PPLs feature set.
>>>>   * @MD_NOT_READY: do_md_run() is active, so 'array_state', ust not report that
>>>>
>>>
>>
>