* [PATCH v3 0/3] Do not set MD_BROKEN on failfast io failure @ 2025-08-28 16:32 Kenta Akagi 2025-08-28 16:32 ` [PATCH v3 1/3] md/raid1,raid10: " Kenta Akagi ` (2 more replies) 0 siblings, 3 replies; 13+ messages in thread From: Kenta Akagi @ 2025-08-28 16:32 UTC (permalink / raw) To: Song Liu, Yu Kuai, Mariusz Tkaczyk, Guoqing Jiang Cc: linux-raid, linux-kernel, Kenta Akagi Changes from V2: - Prevent the array from being marked broken for all failfast IOs, not just metadata writes. - Reflecting the review, update raid{1,10}_error to clear FailfastIOFailure so that devices are properly marked Faulty. Changes from V1: - Avoid setting MD_BROKEN instead of clearing it - Add pr_crit() when setting MD_BROKEN - Fix the message that may be shown after all rdevs have failed: "Operation continuing on 0 devices" v2: https://lore.kernel.org/linux-raid/20250817172710.4892-1-k@mgml.me/ v1: https://lore.kernel.org/linux-raid/20250812090119.153697-1-k@mgml.me/ A failfast bio - for example on nvme-tcp - fails immediately if the connection to the target is briefly lost and the device enters a reconnecting state, even though it would recover given a few seconds. This behavior is by design in failfast. However, md treats Failfast IO failures as fatal, potentially marking the array as MD_BROKEN when a connection is lost. For example, if an initiator - that is, a machine loading the md module - briefly loses all connections, the array is marked as MD_BROKEN, preventing subsequent writes. This is the issue I am currently facing, and the one this patch series aims to fix. The 1st patch changes the behavior on MD_FAILFAST IO failures on the last rdev. The 2nd and 3rd patches modify the pr_crit messages. Kenta Akagi (3): md/raid1,raid10: Do not set MD_BROKEN on failfast io failure md/raid1,raid10: Add error message when setting MD_BROKEN md/raid1,raid10: Fix: Operation continuing on 0 devices. drivers/md/md.c | 14 +++++++++----- drivers/md/md.h | 13 +++++++------ drivers/md/raid1.c | 32 ++++++++++++++++++++++++++------ drivers/md/raid10.c | 35 ++++++++++++++++++++++++++++------- 4 files changed, 70 insertions(+), 24 deletions(-) -- 2.50.1 ^ permalink raw reply [flat|nested] 13+ messages in thread
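For orientation, MD_FAILFAST is built on the block layer's failfast request hints, which is why a brief transport outage (e.g. a reconnecting nvme-tcp controller) surfaces as an immediate IO error instead of being retried below md. A minimal sketch, paraphrased from drivers/md/md.h and include/linux/blk_types.h rather than quoted verbatim:

	/*
	 * A bio flagged MD_FAILFAST asks both the device and the transport
	 * to fail quickly instead of retrying; md is then expected to handle
	 * the error itself, normally by using the remaining mirror(s).
	 */
	#define MD_FAILFAST	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT)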
* [PATCH v3 1/3] md/raid1,raid10: Do not set MD_BROKEN on failfast io failure 2025-08-28 16:32 [PATCH v3 0/3] Do not set MD_BROKEN on failfast io failure Kenta Akagi @ 2025-08-28 16:32 ` Kenta Akagi 2025-08-29 2:54 ` Li Nan 2025-08-28 16:32 ` [PATCH v3 2/3] md/raid1,raid10: Add error message when setting MD_BROKEN Kenta Akagi 2025-08-28 16:32 ` [PATCH v3 3/3] md/raid1,raid10: Fix: Operation continuing on 0 devices Kenta Akagi 2 siblings, 1 reply; 13+ messages in thread From: Kenta Akagi @ 2025-08-28 16:32 UTC (permalink / raw) To: Song Liu, Yu Kuai, Mariusz Tkaczyk, Guoqing Jiang Cc: linux-raid, linux-kernel, Kenta Akagi This commit ensures that an MD_FAILFAST IO failure does not put the array into a broken state. When failfast is enabled on rdev in RAID1 or RAID10, the array may be flagged MD_BROKEN in the following cases. - If MD_FAILFAST IOs to multiple rdevs fail simultaneously - If an MD_FAILFAST metadata write to the 'last' rdev fails The MD_FAILFAST bio error handler always calls md_error on IO failure, under the assumption that raid{1,10}_error will neither fail the last rdev nor break the array. After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"), calling md_error on the 'last' rdev in RAID1/10 always sets the MD_BROKEN flag on the array. As a result, when failfast IO fails on the last rdev, the array immediately becomes failed. Normally, MD_FAILFAST IOs are not issued to the 'last' rdev, so this is an edge case; however, it can occur if rdevs in a non-degraded array share the same path and that path is lost, or if a metadata write is triggered in a degraded array and fails due to failfast. When a failfast metadata write fails, it is retried through the following sequence. If a metadata write without failfast fails, the array will be marked with MD_BROKEN. 1. MD_SB_NEED_REWRITE is set in sb_flags by super_written. 2. md_super_wait, called by the function executing md_super_write, returns -EAGAIN due to MD_SB_NEED_REWRITE. 3. The caller of md_super_wait (e.g., md_update_sb) receives the negative return value and retries md_super_write. 4. md_super_write issues the metadata write again, this time without MD_FAILFAST. 
Fixes: 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10") Signed-off-by: Kenta Akagi <k@mgml.me> --- drivers/md/md.c | 14 +++++++++----- drivers/md/md.h | 13 +++++++------ drivers/md/raid1.c | 18 ++++++++++++++++-- drivers/md/raid10.c | 21 ++++++++++++++++++--- 4 files changed, 50 insertions(+), 16 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index ac85ec73a409..547b88e253f7 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -999,14 +999,18 @@ static void super_written(struct bio *bio) if (bio->bi_status) { pr_err("md: %s gets error=%d\n", __func__, blk_status_to_errno(bio->bi_status)); + if (bio->bi_opf & MD_FAILFAST) + set_bit(FailfastIOFailure, &rdev->flags); md_error(mddev, rdev); if (!test_bit(Faulty, &rdev->flags) && (bio->bi_opf & MD_FAILFAST)) { + pr_warn("md: %s: Metadata write will be repeated to %pg\n", + mdname(mddev), rdev->bdev); set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags); - set_bit(LastDev, &rdev->flags); } - } else - clear_bit(LastDev, &rdev->flags); + } else { + clear_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags); + } bio_put(bio); @@ -1048,7 +1052,7 @@ void md_super_write(struct mddev *mddev, struct md_rdev *rdev, if (test_bit(MD_FAILFAST_SUPPORTED, &mddev->flags) && test_bit(FailFast, &rdev->flags) && - !test_bit(LastDev, &rdev->flags)) + !test_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags)) bio->bi_opf |= MD_FAILFAST; atomic_inc(&mddev->pending_writes); @@ -1059,7 +1063,7 @@ int md_super_wait(struct mddev *mddev) { /* wait for all superblock writes that were scheduled to complete */ wait_event(mddev->sb_wait, atomic_read(&mddev->pending_writes)==0); - if (test_and_clear_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags)) + if (test_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags)) return -EAGAIN; return 0; } diff --git a/drivers/md/md.h b/drivers/md/md.h index 51af29a03079..52c9fc759015 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -281,9 +281,10 @@ enum flag_bits { * It is expects that no bad block log * is present. */ - LastDev, /* Seems to be the last working dev as - * it didn't fail, so don't use FailFast - * any more for metadata + FailfastIOFailure, /* rdev with failfast IO failure + * but md_error not yet completed. + * If the last rdev has this flag, + * error_handler must not fail the array */ CollisionCheck, /* * check if there is collision between raid1 @@ -331,8 +332,8 @@ struct md_cluster_operations; * @MD_CLUSTER_RESYNC_LOCKED: cluster raid only, which means node, already took * resync lock, need to release the lock. * @MD_FAILFAST_SUPPORTED: Using MD_FAILFAST on metadata writes is supported as - * calls to md_error() will never cause the array to - * become failed. + * calls to md_error() with FailfastIOFailure will + * never cause the array to become failed. * @MD_HAS_PPL: The raid array has PPL feature set. * @MD_HAS_MULTIPLE_PPLS: The raid array has multiple PPLs feature set. 
* @MD_NOT_READY: do_md_run() is active, so 'array_state', ust not report that @@ -360,7 +361,7 @@ enum mddev_sb_flags { MD_SB_CHANGE_DEVS, /* Some device status has changed */ MD_SB_CHANGE_CLEAN, /* transition to or from 'clean' */ MD_SB_CHANGE_PENDING, /* switch from 'clean' to 'active' in progress */ - MD_SB_NEED_REWRITE, /* metadata write needs to be repeated */ + MD_SB_NEED_REWRITE, /* metadata write needs to be repeated, do not use failfast */ }; #define NR_SERIAL_INFOS 8 diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 408c26398321..8a61fd93b3ff 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -470,6 +470,7 @@ static void raid1_end_write_request(struct bio *bio) (bio->bi_opf & MD_FAILFAST) && /* We never try FailFast to WriteMostly devices */ !test_bit(WriteMostly, &rdev->flags)) { + set_bit(FailfastIOFailure, &rdev->flags); md_error(r1_bio->mddev, rdev); } @@ -1746,8 +1747,12 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev) * - recovery is interrupted. * - &mddev->degraded is bumped. * - * @rdev is marked as &Faulty excluding case when array is failed and - * &mddev->fail_last_dev is off. + * If @rdev has &FailfastIOFailure and it is the 'last' rdev, + * then @mddev and @rdev will not be marked as failed. + * + * @rdev is marked as &Faulty excluding any cases: + * - when @mddev is failed and &mddev->fail_last_dev is off + * - when @rdev is last device and &FailfastIOFailure flag is set */ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) { @@ -1758,6 +1763,13 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) if (test_bit(In_sync, &rdev->flags) && (conf->raid_disks - mddev->degraded) == 1) { + if (test_and_clear_bit(FailfastIOFailure, &rdev->flags)) { + spin_unlock_irqrestore(&conf->device_lock, flags); + pr_warn_ratelimited("md/raid1:%s: Failfast IO failure on %pg, " + "last device but ignoring it\n", + mdname(mddev), rdev->bdev); + return; + } set_bit(MD_BROKEN, &mddev->flags); if (!mddev->fail_last_dev) { @@ -2148,6 +2160,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio) if (test_bit(FailFast, &rdev->flags)) { /* Don't try recovering from here - just fail it * ... unless it is the last working device of course */ + set_bit(FailfastIOFailure, &rdev->flags); md_error(mddev, rdev); if (test_bit(Faulty, &rdev->flags)) /* Don't try to read from here, but make sure @@ -2652,6 +2665,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio) fix_read_error(conf, r1_bio); unfreeze_array(conf); } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) { + set_bit(FailfastIOFailure, &rdev->flags); md_error(mddev, rdev); } else { r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED; diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c index b60c30bfb6c7..530ad6503189 100644 --- a/drivers/md/raid10.c +++ b/drivers/md/raid10.c @@ -488,6 +488,7 @@ static void raid10_end_write_request(struct bio *bio) dec_rdev = 0; if (test_bit(FailFast, &rdev->flags) && (bio->bi_opf & MD_FAILFAST)) { + set_bit(FailfastIOFailure, &rdev->flags); md_error(rdev->mddev, rdev); } @@ -1995,8 +1996,12 @@ static int enough(struct r10conf *conf, int ignore) * - recovery is interrupted. * - &mddev->degraded is bumped. * - * @rdev is marked as &Faulty excluding case when array is failed and - * &mddev->fail_last_dev is off. + * If @rdev has &FailfastIOFailure and it is the 'last' rdev, + * then @mddev and @rdev will not be marked as failed. 
+ * + * @rdev is marked as &Faulty excluding any cases: + * - when @mddev is failed and &mddev->fail_last_dev is off + * - when @rdev is last device and &FailfastIOFailure flag is set */ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev) { @@ -2006,6 +2011,13 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev) spin_lock_irqsave(&conf->device_lock, flags); if (test_bit(In_sync, &rdev->flags) && !enough(conf, rdev->raid_disk)) { + if (test_and_clear_bit(FailfastIOFailure, &rdev->flags)) { + spin_unlock_irqrestore(&conf->device_lock, flags); + pr_warn_ratelimited("md/raid10:%s: Failfast IO failure on %pg, " + "last device but ignoring it\n", + mdname(mddev), rdev->bdev); + return; + } set_bit(MD_BROKEN, &mddev->flags); if (!mddev->fail_last_dev) { @@ -2413,6 +2425,7 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio) continue; } else if (test_bit(FailFast, &rdev->flags)) { /* Just give up on this device */ + set_bit(FailfastIOFailure, &rdev->flags); md_error(rdev->mddev, rdev); continue; } @@ -2865,8 +2878,10 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio) freeze_array(conf, 1); fix_read_error(conf, mddev, r10_bio); unfreeze_array(conf); - } else + } else { + set_bit(FailfastIOFailure, &rdev->flags); md_error(mddev, rdev); + } rdev_dec_pending(rdev, mddev); r10_bio->state = 0; -- 2.50.1 ^ permalink raw reply related [flat|nested] 13+ messages in thread
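To make the four-step retry sequence described in the commit message concrete, the caller side looks roughly like the following. This is a simplified paraphrase of the md_update_sb() loop, not a verbatim excerpt; with this patch, a still-set MD_SB_NEED_REWRITE also keeps MD_FAILFAST off the repeated superblock write:

rewrite:
	rdev_for_each(rdev, mddev) {
		if (rdev->sb_loaded != 1 || test_bit(Faulty, &rdev->flags))
			continue;	/* nothing to (re)write for this member */
		md_super_write(mddev, rdev, rdev->sb_start,
			       rdev->sb_size, rdev->sb_page);
	}
	/*
	 * md_super_wait() returns -EAGAIN while MD_SB_NEED_REWRITE is set,
	 * i.e. a failfast superblock write failed; the caller then repeats
	 * the write, and md_super_write() skips MD_FAILFAST on the retry.
	 */
	if (md_super_wait(mddev) < 0)
		goto rewrite;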
* Re: [PATCH v3 1/3] md/raid1,raid10: Do not set MD_BROKEN on failfast io failure 2025-08-28 16:32 ` [PATCH v3 1/3] md/raid1,raid10: " Kenta Akagi @ 2025-08-29 2:54 ` Li Nan 2025-08-29 12:21 ` Kenta Akagi 0 siblings, 1 reply; 13+ messages in thread From: Li Nan @ 2025-08-29 2:54 UTC (permalink / raw) To: Kenta Akagi, Song Liu, Yu Kuai, Mariusz Tkaczyk, Guoqing Jiang Cc: linux-raid, linux-kernel 在 2025/8/29 0:32, Kenta Akagi 写道: > This commit ensures that an MD_FAILFAST IO failure does not put > the array into a broken state. > > When failfast is enabled on rdev in RAID1 or RAID10, > the array may be flagged MD_BROKEN in the following cases. > - If MD_FAILFAST IOs to multiple rdevs fail simultaneously > - If an MD_FAILFAST metadata write to the 'last' rdev fails [...] > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c > index 408c26398321..8a61fd93b3ff 100644 > --- a/drivers/md/raid1.c > +++ b/drivers/md/raid1.c > @@ -470,6 +470,7 @@ static void raid1_end_write_request(struct bio *bio) > (bio->bi_opf & MD_FAILFAST) && > /* We never try FailFast to WriteMostly devices */ > !test_bit(WriteMostly, &rdev->flags)) { > + set_bit(FailfastIOFailure, &rdev->flags); > md_error(r1_bio->mddev, rdev); > } > > @@ -1746,8 +1747,12 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev) > * - recovery is interrupted. > * - &mddev->degraded is bumped. > * > - * @rdev is marked as &Faulty excluding case when array is failed and > - * &mddev->fail_last_dev is off. > + * If @rdev has &FailfastIOFailure and it is the 'last' rdev, > + * then @mddev and @rdev will not be marked as failed. > + * > + * @rdev is marked as &Faulty excluding any cases: > + * - when @mddev is failed and &mddev->fail_last_dev is off > + * - when @rdev is last device and &FailfastIOFailure flag is set > */ > static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) > { > @@ -1758,6 +1763,13 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) > > if (test_bit(In_sync, &rdev->flags) && > (conf->raid_disks - mddev->degraded) == 1) { > + if (test_and_clear_bit(FailfastIOFailure, &rdev->flags)) { > + spin_unlock_irqrestore(&conf->device_lock, flags); > + pr_warn_ratelimited("md/raid1:%s: Failfast IO failure on %pg, " > + "last device but ignoring it\n", > + mdname(mddev), rdev->bdev); > + return; > + } > set_bit(MD_BROKEN, &mddev->flags); > > if (!mddev->fail_last_dev) { > @@ -2148,6 +2160,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio) > if (test_bit(FailFast, &rdev->flags)) { > /* Don't try recovering from here - just fail it > * ... 
unless it is the last working device of course */ > + set_bit(FailfastIOFailure, &rdev->flags); > md_error(mddev, rdev); > if (test_bit(Faulty, &rdev->flags)) > /* Don't try to read from here, but make sure > @@ -2652,6 +2665,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio) > fix_read_error(conf, r1_bio); > unfreeze_array(conf); > } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) { > + set_bit(FailfastIOFailure, &rdev->flags); > md_error(mddev, rdev); > } else { > r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED; > diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c > index b60c30bfb6c7..530ad6503189 100644 > --- a/drivers/md/raid10.c > +++ b/drivers/md/raid10.c > @@ -488,6 +488,7 @@ static void raid10_end_write_request(struct bio *bio) > dec_rdev = 0; > if (test_bit(FailFast, &rdev->flags) && > (bio->bi_opf & MD_FAILFAST)) { > + set_bit(FailfastIOFailure, &rdev->flags); > md_error(rdev->mddev, rdev); > } > Thank you for the patch. There may be an issue with 'test_and_clear'. If two write IO go to the same rdev, MD_BROKEN may be set as below: IO1 IO2 set FailfastIOFailure set FailfastIOFailure md_error raid1_error test_and_clear FailfastIOFailur md_error raid1_error //FailfastIOFailur is cleared set MD_BROKEN Maybe we should check whether FailfastIOFailure is already set before setting it. It also needs to be considered in metadata writes. > @@ -1995,8 +1996,12 @@ static int enough(struct r10conf *conf, int ignore) > * - recovery is interrupted. > * - &mddev->degraded is bumped. > * > - * @rdev is marked as &Faulty excluding case when array is failed and > - * &mddev->fail_last_dev is off. > + * If @rdev has &FailfastIOFailure and it is the 'last' rdev, > + * then @mddev and @rdev will not be marked as failed. > + * > + * @rdev is marked as &Faulty excluding any cases: > + * - when @mddev is failed and &mddev->fail_last_dev is off > + * - when @rdev is last device and &FailfastIOFailure flag is set > */ > static void raid10_error(struct mddev *mddev, struct md_rdev *rdev) > { > @@ -2006,6 +2011,13 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev) > spin_lock_irqsave(&conf->device_lock, flags); > > if (test_bit(In_sync, &rdev->flags) && !enough(conf, rdev->raid_disk)) { > + if (test_and_clear_bit(FailfastIOFailure, &rdev->flags)) { > + spin_unlock_irqrestore(&conf->device_lock, flags); > + pr_warn_ratelimited("md/raid10:%s: Failfast IO failure on %pg, " > + "last device but ignoring it\n", > + mdname(mddev), rdev->bdev); > + return; > + > set_bit(MD_BROKEN, &mddev->flags); > > if (!mddev->fail_last_dev) { > @@ -2413,6 +2425,7 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio) > continue; > } else if (test_bit(FailFast, &rdev->flags)) { > /* Just give up on this device */ > + set_bit(FailfastIOFailure, &rdev->flags); > md_error(rdev->mddev, rdev); > continue; > } > @@ -2865,8 +2878,10 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio) > freeze_array(conf, 1); > fix_read_error(conf, mddev, r10_bio); > unfreeze_array(conf); > - } else > + } else { > + set_bit(FailfastIOFailure, &rdev->flags); > md_error(mddev, rdev); > + } > > rdev_dec_pending(rdev, mddev); > r10_bio->state = 0; -- Thanks, Nan ^ permalink raw reply [flat|nested] 13+ messages in thread
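One way to read the "check whether FailfastIOFailure is already set before setting it" suggestion is to make the submitter aware that a failfast failure on the same rdev is already in flight, so the flag cannot be consumed by one md_error() call and then be missing for the next. An illustrative sketch only, not a proposed hunk:

	/* e.g. in raid1_end_write_request(), illustrative only */
	if (!test_and_set_bit(FailfastIOFailure, &rdev->flags))
		md_error(r1_bio->mddev, rdev);
	else
		/* another failfast failure is already being handled for this rdev */
		pr_debug("md/raid1: failfast failure on %pg already pending\n",
			 rdev->bdev);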
* Re: [PATCH v3 1/3] md/raid1,raid10: Do not set MD_BROKEN on failfast io failure 2025-08-29 2:54 ` Li Nan @ 2025-08-29 12:21 ` Kenta Akagi 2025-08-30 8:48 ` Li Nan 0 siblings, 1 reply; 13+ messages in thread From: Kenta Akagi @ 2025-08-29 12:21 UTC (permalink / raw) To: Li Nan, Song Liu, Yu Kuai, Mariusz Tkaczyk, Guoqing Jiang Cc: linux-raid, linux-kernel, Kenta Akagi On 2025/08/29 11:54, Li Nan wrote: > > 在 2025/8/29 0:32, Kenta Akagi 写道: >> This commit ensures that an MD_FAILFAST IO failure does not put >> the array into a broken state. >> >> When failfast is enabled on rdev in RAID1 or RAID10, >> the array may be flagged MD_BROKEN in the following cases. >> - If MD_FAILFAST IOs to multiple rdevs fail simultaneously >> - If an MD_FAILFAST metadata write to the 'last' rdev fails > > [...] > >> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c >> index 408c26398321..8a61fd93b3ff 100644 >> --- a/drivers/md/raid1.c >> +++ b/drivers/md/raid1.c >> @@ -470,6 +470,7 @@ static void raid1_end_write_request(struct bio *bio) >> (bio->bi_opf & MD_FAILFAST) && >> /* We never try FailFast to WriteMostly devices */ >> !test_bit(WriteMostly, &rdev->flags)) { >> + set_bit(FailfastIOFailure, &rdev->flags); >> md_error(r1_bio->mddev, rdev); >> } >> @@ -1746,8 +1747,12 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev) >> * - recovery is interrupted. >> * - &mddev->degraded is bumped. >> * >> - * @rdev is marked as &Faulty excluding case when array is failed and >> - * &mddev->fail_last_dev is off. >> + * If @rdev has &FailfastIOFailure and it is the 'last' rdev, >> + * then @mddev and @rdev will not be marked as failed. >> + * >> + * @rdev is marked as &Faulty excluding any cases: >> + * - when @mddev is failed and &mddev->fail_last_dev is off >> + * - when @rdev is last device and &FailfastIOFailure flag is set >> */ >> static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) >> { >> @@ -1758,6 +1763,13 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) >> if (test_bit(In_sync, &rdev->flags) && >> (conf->raid_disks - mddev->degraded) == 1) { >> + if (test_and_clear_bit(FailfastIOFailure, &rdev->flags)) { >> + spin_unlock_irqrestore(&conf->device_lock, flags); >> + pr_warn_ratelimited("md/raid1:%s: Failfast IO failure on %pg, " >> + "last device but ignoring it\n", >> + mdname(mddev), rdev->bdev); >> + return; >> + } >> set_bit(MD_BROKEN, &mddev->flags); >> if (!mddev->fail_last_dev) { >> @@ -2148,6 +2160,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio) >> if (test_bit(FailFast, &rdev->flags)) { >> /* Don't try recovering from here - just fail it >> * ... 
unless it is the last working device of course */ >> + set_bit(FailfastIOFailure, &rdev->flags); >> md_error(mddev, rdev); >> if (test_bit(Faulty, &rdev->flags)) >> /* Don't try to read from here, but make sure >> @@ -2652,6 +2665,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio) >> fix_read_error(conf, r1_bio); >> unfreeze_array(conf); >> } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) { >> + set_bit(FailfastIOFailure, &rdev->flags); >> md_error(mddev, rdev); >> } else { >> r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED; >> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c >> index b60c30bfb6c7..530ad6503189 100644 >> --- a/drivers/md/raid10.c >> +++ b/drivers/md/raid10.c >> @@ -488,6 +488,7 @@ static void raid10_end_write_request(struct bio *bio) >> dec_rdev = 0; >> if (test_bit(FailFast, &rdev->flags) && >> (bio->bi_opf & MD_FAILFAST)) { >> + set_bit(FailfastIOFailure, &rdev->flags); >> md_error(rdev->mddev, rdev); >> } >> > > Thank you for the patch. There may be an issue with 'test_and_clear'. > > If two write IO go to the same rdev, MD_BROKEN may be set as below: > IO1 IO2 > set FailfastIOFailure > set FailfastIOFailure > md_error > raid1_error > test_and_clear FailfastIOFailur > md_error > raid1_error > //FailfastIOFailur is cleared > set MD_BROKEN > > Maybe we should check whether FailfastIOFailure is already set before > setting it. It also needs to be considered in metadata writes. Thank you for reviewing. I agree, this seems to be as you described. So, should it be implemented as follows? bool old=false; do{ spin_lock_irqsave(&conf->device_lock, flags); old = test_and_set_bit(FailfastIOFailure, &rdev->flags); spin_unlock_irqrestore(&conf->device_lock, flags); }while(old); However, since I am concerned about potential deadlocks, so I am considering two alternative approaches: * Add an atomic_t counter to md_rdev to track failfast IO failures. This may set MD_BROKEN at a slightly incorrect timing, but mixing error handling of Failfast and non-Failfast IOs appears to be rare. In any case, the final outcome would be the same, i.e. the array ends up with MD_BROKEN. Therefore, I think this should not cause issues. I think the same applies to test_and_set_bit. IO1 IO2 IO3 FailfastIOFailure Normal IOFailure FailfastIOFailure atomic_inc md_error atomic_inc raid1_error atomic_dec //2to1 md_error raid1_error md_error atomic_dec //1to0 raid1_error atomic_dec //0 set MD_BROKEN * Alternatively, create a separate error handler, e.g. md_error_failfast(), that clearly does not fail the array. This approach would require somewhat larger changes and may not be very elegant, but it seems to be a reliable way to ensure MD_BROKEN is never set at the wrong timing. Which of these three approaches would you consider preferable? I would appreciate your feedback. For metadata writes, I plan to clear_bit MD_FAILFAST_SUPPORTED when the array is degraded. Thanks, Akagi >> @@ -1995,8 +1996,12 @@ static int enough(struct r10conf *conf, int ignore) >> * - recovery is interrupted. >> * - &mddev->degraded is bumped. >> * >> - * @rdev is marked as &Faulty excluding case when array is failed and >> - * &mddev->fail_last_dev is off. >> + * If @rdev has &FailfastIOFailure and it is the 'last' rdev, >> + * then @mddev and @rdev will not be marked as failed. 
>> + * >> + * @rdev is marked as &Faulty excluding any cases: >> + * - when @mddev is failed and &mddev->fail_last_dev is off >> + * - when @rdev is last device and &FailfastIOFailure flag is set >> */ >> static void raid10_error(struct mddev *mddev, struct md_rdev *rdev) >> { >> @@ -2006,6 +2011,13 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev) >> spin_lock_irqsave(&conf->device_lock, flags); >> if (test_bit(In_sync, &rdev->flags) && !enough(conf, rdev->raid_disk)) { >> + if (test_and_clear_bit(FailfastIOFailure, &rdev->flags)) { >> + spin_unlock_irqrestore(&conf->device_lock, flags); >> + pr_warn_ratelimited("md/raid10:%s: Failfast IO failure on %pg, " >> + "last device but ignoring it\n", >> + mdname(mddev), rdev->bdev); >> + return; >> + > set_bit(MD_BROKEN, &mddev->flags); >> if (!mddev->fail_last_dev) { >> @@ -2413,6 +2425,7 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio) >> continue; >> } else if (test_bit(FailFast, &rdev->flags)) { >> /* Just give up on this device */ >> + set_bit(FailfastIOFailure, &rdev->flags); >> md_error(rdev->mddev, rdev); >> continue; >> } >> @@ -2865,8 +2878,10 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio) >> freeze_array(conf, 1); >> fix_read_error(conf, mddev, r10_bio); >> unfreeze_array(conf); >> - } else >> + } else { >> + set_bit(FailfastIOFailure, &rdev->flags); >> md_error(mddev, rdev); >> + } >> rdev_dec_pending(rdev, mddev); >> r10_bio->state = 0; > > -- > Thanks, > Nan > > ^ permalink raw reply [flat|nested] 13+ messages in thread
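As a rough sketch of the atomic_t counter alternative mentioned above (the field name is hypothetical, not part of the posted series), the idea is that the error handler only skips the MD_BROKEN path while at least one failfast failure is known to be outstanding on that rdev:

	/* hypothetical new member of struct md_rdev */
	atomic_t nr_failfast_failures;	/* failfast IO failures not yet seen by error_handler */

	/* submitter side, e.g. raid1_end_write_request() */
	atomic_inc(&rdev->nr_failfast_failures);
	md_error(r1_bio->mddev, rdev);

	/* raid1_error(), in the 'last device' branch */
	if (atomic_add_unless(&rdev->nr_failfast_failures, -1, 0)) {
		/* decremented: this failure came from a failfast bio, keep the array alive */
		spin_unlock_irqrestore(&conf->device_lock, flags);
		return;
	}
	set_bit(MD_BROKEN, &mddev->flags);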
* Re: [PATCH v3 1/3] md/raid1,raid10: Do not set MD_BROKEN on failfast io failure 2025-08-29 12:21 ` Kenta Akagi @ 2025-08-30 8:48 ` Li Nan 2025-08-30 18:10 ` Kenta Akagi 0 siblings, 1 reply; 13+ messages in thread From: Li Nan @ 2025-08-30 8:48 UTC (permalink / raw) To: Kenta Akagi, Li Nan, Song Liu, Yu Kuai, Mariusz Tkaczyk, Guoqing Jiang Cc: linux-raid, linux-kernel 在 2025/8/29 20:21, Kenta Akagi 写道: > > > On 2025/08/29 11:54, Li Nan wrote: >> >> 在 2025/8/29 0:32, Kenta Akagi 写道: >>> This commit ensures that an MD_FAILFAST IO failure does not put >>> the array into a broken state. >>> >>> When failfast is enabled on rdev in RAID1 or RAID10, >>> the array may be flagged MD_BROKEN in the following cases. >>> - If MD_FAILFAST IOs to multiple rdevs fail simultaneously >>> - If an MD_FAILFAST metadata write to the 'last' rdev fails >> >> [...] >> >>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c >>> index 408c26398321..8a61fd93b3ff 100644 >>> --- a/drivers/md/raid1.c >>> +++ b/drivers/md/raid1.c >>> @@ -470,6 +470,7 @@ static void raid1_end_write_request(struct bio *bio) >>> (bio->bi_opf & MD_FAILFAST) && >>> /* We never try FailFast to WriteMostly devices */ >>> !test_bit(WriteMostly, &rdev->flags)) { >>> + set_bit(FailfastIOFailure, &rdev->flags); >>> md_error(r1_bio->mddev, rdev); >>> } >>> @@ -1746,8 +1747,12 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev) >>> * - recovery is interrupted. >>> * - &mddev->degraded is bumped. >>> * >>> - * @rdev is marked as &Faulty excluding case when array is failed and >>> - * &mddev->fail_last_dev is off. >>> + * If @rdev has &FailfastIOFailure and it is the 'last' rdev, >>> + * then @mddev and @rdev will not be marked as failed. >>> + * >>> + * @rdev is marked as &Faulty excluding any cases: >>> + * - when @mddev is failed and &mddev->fail_last_dev is off >>> + * - when @rdev is last device and &FailfastIOFailure flag is set >>> */ >>> static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) >>> { >>> @@ -1758,6 +1763,13 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) >>> if (test_bit(In_sync, &rdev->flags) && >>> (conf->raid_disks - mddev->degraded) == 1) { >>> + if (test_and_clear_bit(FailfastIOFailure, &rdev->flags)) { >>> + spin_unlock_irqrestore(&conf->device_lock, flags); >>> + pr_warn_ratelimited("md/raid1:%s: Failfast IO failure on %pg, " >>> + "last device but ignoring it\n", >>> + mdname(mddev), rdev->bdev); >>> + return; >>> + } >>> set_bit(MD_BROKEN, &mddev->flags); >>> if (!mddev->fail_last_dev) { >>> @@ -2148,6 +2160,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio) >>> if (test_bit(FailFast, &rdev->flags)) { >>> /* Don't try recovering from here - just fail it >>> * ... 
unless it is the last working device of course */ >>> + set_bit(FailfastIOFailure, &rdev->flags); >>> md_error(mddev, rdev); >>> if (test_bit(Faulty, &rdev->flags)) >>> /* Don't try to read from here, but make sure >>> @@ -2652,6 +2665,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio) >>> fix_read_error(conf, r1_bio); >>> unfreeze_array(conf); >>> } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) { >>> + set_bit(FailfastIOFailure, &rdev->flags); >>> md_error(mddev, rdev); >>> } else { >>> r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED; >>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c >>> index b60c30bfb6c7..530ad6503189 100644 >>> --- a/drivers/md/raid10.c >>> +++ b/drivers/md/raid10.c >>> @@ -488,6 +488,7 @@ static void raid10_end_write_request(struct bio *bio) >>> dec_rdev = 0; >>> if (test_bit(FailFast, &rdev->flags) && >>> (bio->bi_opf & MD_FAILFAST)) { >>> + set_bit(FailfastIOFailure, &rdev->flags); >>> md_error(rdev->mddev, rdev); >>> } >>> >> >> Thank you for the patch. There may be an issue with 'test_and_clear'. >> >> If two write IO go to the same rdev, MD_BROKEN may be set as below: > >> IO1 IO2 >> set FailfastIOFailure >> set FailfastIOFailure >> md_error >> raid1_error >> test_and_clear FailfastIOFailur >> md_error >> raid1_error >> //FailfastIOFailur is cleared >> set MD_BROKEN >> >> Maybe we should check whether FailfastIOFailure is already set before >> setting it. It also needs to be considered in metadata writes. > Thank you for reviewing. > > I agree, this seems to be as you described. > So, should it be implemented as follows? > > bool old=false; > do{ > spin_lock_irqsave(&conf->device_lock, flags); > old = test_and_set_bit(FailfastIOFailure, &rdev->flags); > spin_unlock_irqrestore(&conf->device_lock, flags); > }while(old); > > However, since I am concerned about potential deadlocks, > so I am considering two alternative approaches: > > * Add an atomic_t counter to md_rdev to track failfast IO failures. > > This may set MD_BROKEN at a slightly incorrect timing, but mixing > error handling of Failfast and non-Failfast IOs appears to be rare. > In any case, the final outcome would be the same, i.e. the array > ends up with MD_BROKEN. Therefore, I think this should not cause > issues. I think the same applies to test_and_set_bit. > > IO1 IO2 IO3 > FailfastIOFailure Normal IOFailure FailfastIOFailure > atomic_inc > > md_error atomic_inc > raid1_error > atomic_dec //2to1 > md_error > raid1_error md_error > atomic_dec //1to0 raid1_error > atomic_dec //0 > set MD_BROKEN > > * Alternatively, create a separate error handler, > e.g. md_error_failfast(), that clearly does not fail the array. > > This approach would require somewhat larger changes and may not > be very elegant, but it seems to be a reliable way to ensure > MD_BROKEN is never set at the wrong timing. > > Which of these three approaches would you consider preferable? > I would appreciate your feedback. > > > For metadata writes, I plan to clear_bit MD_FAILFAST_SUPPORTED > when the array is degraded. > > Thanks, > Akagi > I took a closer look at the FailFast code and found a few issues, using RAID1 as an example: For normal read/write IO, FailFast is only triggered when there is another disk is available, as seen in read_balance() and raid1_write_request(). In raid1_error(), MD_BROKEN is set only when no other disks are available. So, the FailFast for normal read/write is not triggered in the scenario you described in cover-letter. 
Normal writes may call md_error() in narrow_write_error. Normal reads do not execute md_error() on the last disk. Perhaps you should get more information to confirm how MD_BROKEN is set in normal read/write IO. -- Thanks, Nan ^ permalink raw reply [flat|nested] 13+ messages in thread
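For reference, the guard being referred to looks roughly like this in the raid1 write path (paraphrased rather than quoted from raid1_write_request()); it is what normally keeps plain data writes from issuing failfast bios to the last remaining rdev:

	if (test_bit(FailFast, &rdev->flags) &&
	    !test_bit(WriteMostly, &rdev->flags) &&
	    conf->raid_disks - mddev->degraded > 1)
		/* only request failfast while another in-sync mirror remains */
		mbio->bi_opf |= MD_FAILFAST;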
* Re: [PATCH v3 1/3] md/raid1,raid10: Do not set MD_BROKEN on failfast io failure 2025-08-30 8:48 ` Li Nan @ 2025-08-30 18:10 ` Kenta Akagi 2025-09-01 3:22 ` Li Nan 0 siblings, 1 reply; 13+ messages in thread From: Kenta Akagi @ 2025-08-30 18:10 UTC (permalink / raw) To: Li Nan, Song Liu, Yu Kuai, Mariusz Tkaczyk, Guoqing Jiang Cc: linux-raid, linux-kernel, Kenta Akagi On 2025/08/30 17:48, Li Nan wrote: > > > 在 2025/8/29 20:21, Kenta Akagi 写道: >> >> >> On 2025/08/29 11:54, Li Nan wrote: >>> >>> 在 2025/8/29 0:32, Kenta Akagi 写道: >>>> This commit ensures that an MD_FAILFAST IO failure does not put >>>> the array into a broken state. >>>> >>>> When failfast is enabled on rdev in RAID1 or RAID10, >>>> the array may be flagged MD_BROKEN in the following cases. >>>> - If MD_FAILFAST IOs to multiple rdevs fail simultaneously >>>> - If an MD_FAILFAST metadata write to the 'last' rdev fails >>> >>> [...] >>> >>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c >>>> index 408c26398321..8a61fd93b3ff 100644 >>>> --- a/drivers/md/raid1.c >>>> +++ b/drivers/md/raid1.c >>>> @@ -470,6 +470,7 @@ static void raid1_end_write_request(struct bio *bio) >>>> (bio->bi_opf & MD_FAILFAST) && >>>> /* We never try FailFast to WriteMostly devices */ >>>> !test_bit(WriteMostly, &rdev->flags)) { >>>> + set_bit(FailfastIOFailure, &rdev->flags); >>>> md_error(r1_bio->mddev, rdev); >>>> } >>>> @@ -1746,8 +1747,12 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev) >>>> * - recovery is interrupted. >>>> * - &mddev->degraded is bumped. >>>> * >>>> - * @rdev is marked as &Faulty excluding case when array is failed and >>>> - * &mddev->fail_last_dev is off. >>>> + * If @rdev has &FailfastIOFailure and it is the 'last' rdev, >>>> + * then @mddev and @rdev will not be marked as failed. >>>> + * >>>> + * @rdev is marked as &Faulty excluding any cases: >>>> + * - when @mddev is failed and &mddev->fail_last_dev is off >>>> + * - when @rdev is last device and &FailfastIOFailure flag is set >>>> */ >>>> static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) >>>> { >>>> @@ -1758,6 +1763,13 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) >>>> if (test_bit(In_sync, &rdev->flags) && >>>> (conf->raid_disks - mddev->degraded) == 1) { >>>> + if (test_and_clear_bit(FailfastIOFailure, &rdev->flags)) { >>>> + spin_unlock_irqrestore(&conf->device_lock, flags); >>>> + pr_warn_ratelimited("md/raid1:%s: Failfast IO failure on %pg, " >>>> + "last device but ignoring it\n", >>>> + mdname(mddev), rdev->bdev); >>>> + return; >>>> + } >>>> set_bit(MD_BROKEN, &mddev->flags); >>>> if (!mddev->fail_last_dev) { >>>> @@ -2148,6 +2160,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio) >>>> if (test_bit(FailFast, &rdev->flags)) { >>>> /* Don't try recovering from here - just fail it >>>> * ... 
unless it is the last working device of course */ >>>> + set_bit(FailfastIOFailure, &rdev->flags); >>>> md_error(mddev, rdev); >>>> if (test_bit(Faulty, &rdev->flags)) >>>> /* Don't try to read from here, but make sure >>>> @@ -2652,6 +2665,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio) >>>> fix_read_error(conf, r1_bio); >>>> unfreeze_array(conf); >>>> } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) { >>>> + set_bit(FailfastIOFailure, &rdev->flags); >>>> md_error(mddev, rdev); >>>> } else { >>>> r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED; >>>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c >>>> index b60c30bfb6c7..530ad6503189 100644 >>>> --- a/drivers/md/raid10.c >>>> +++ b/drivers/md/raid10.c >>>> @@ -488,6 +488,7 @@ static void raid10_end_write_request(struct bio *bio) >>>> dec_rdev = 0; >>>> if (test_bit(FailFast, &rdev->flags) && >>>> (bio->bi_opf & MD_FAILFAST)) { >>>> + set_bit(FailfastIOFailure, &rdev->flags); >>>> md_error(rdev->mddev, rdev); >>>> } >>>> >>> >>> Thank you for the patch. There may be an issue with 'test_and_clear'. >>> >>> If two write IO go to the same rdev, MD_BROKEN may be set as below: >> >>> IO1 IO2 >>> set FailfastIOFailure >>> set FailfastIOFailure >>> md_error >>> raid1_error >>> test_and_clear FailfastIOFailur >>> md_error >>> raid1_error >>> //FailfastIOFailur is cleared >>> set MD_BROKEN >>> >>> Maybe we should check whether FailfastIOFailure is already set before >>> setting it. It also needs to be considered in metadata writes. >> Thank you for reviewing. >> >> I agree, this seems to be as you described. >> So, should it be implemented as follows? >> >> bool old=false; >> do{ >> spin_lock_irqsave(&conf->device_lock, flags); >> old = test_and_set_bit(FailfastIOFailure, &rdev->flags); >> spin_unlock_irqrestore(&conf->device_lock, flags); >> }while(old); >> >> However, since I am concerned about potential deadlocks, >> so I am considering two alternative approaches: >> >> * Add an atomic_t counter to md_rdev to track failfast IO failures. >> >> This may set MD_BROKEN at a slightly incorrect timing, but mixing >> error handling of Failfast and non-Failfast IOs appears to be rare. >> In any case, the final outcome would be the same, i.e. the array >> ends up with MD_BROKEN. Therefore, I think this should not cause >> issues. I think the same applies to test_and_set_bit. >> >> IO1 IO2 IO3 >> FailfastIOFailure Normal IOFailure FailfastIOFailure >> atomic_inc >> md_error atomic_inc >> raid1_error >> atomic_dec //2to1 >> md_error >> raid1_error md_error >> atomic_dec //1to0 raid1_error >> atomic_dec //0 >> set MD_BROKEN >> >> * Alternatively, create a separate error handler, >> e.g. md_error_failfast(), that clearly does not fail the array. >> >> This approach would require somewhat larger changes and may not >> be very elegant, but it seems to be a reliable way to ensure >> MD_BROKEN is never set at the wrong timing. >> >> Which of these three approaches would you consider preferable? >> I would appreciate your feedback. >> >> >> For metadata writes, I plan to clear_bit MD_FAILFAST_SUPPORTED >> when the array is degraded. >> >> Thanks, >> Akagi >> > > I took a closer look at the FailFast code and found a few issues, using > RAID1 as an example: > > For normal read/write IO, FailFast is only triggered when there is another > disk is available, as seen in read_balance() and raid1_write_request(). > In raid1_error(), MD_BROKEN is set only when no other disks are available. Hi, Agree, I think so too. 
> So, the FailFast for normal read/write is not triggered in the scenario you > described in cover-letter. This corresponds to the case described in the commit message of PATCH v3 1/3. "Normally, MD_FAILFAST IOs are not issued to the 'last' rdev, so this is an edge case; however, it can occur if rdevs in a non-degraded array share the same path and that path is lost, or if a metadata write is triggered in a degraded array and fails due to failfast." To describe it in more detail, the flow is as follows: Prerequisites: - Both rdevs are in-sync - Both rdevs have in-flight MD_FAILFAST bios - Both rdevs depend on the same lower-level path (e.g., nvme-tcp over a single Ethernet interface) Sequence: - A bios with REQ_FAILFAST_DEV fails (e.g., due to a temporary network outage), in the case of nvme-tcp: - The Ethernet connection is lost on the node where md is running over 5 seconds - Then the connection is restored. Idk the details of nvme-tcp implementation, but it seems that failfast IOs finish only after the connection is back. - All failfast bios fail, raid1_end_write_request is called. - md_error() marks one rdev Faulty; the other rdev becomes the 'last' rdev. - md_error() on the last rdev sets MD_BROKEN on the array - fail_last_dev=1 is unlikely. - The write is retried via handle_write_finished -> narrow_write_error, usually succeeding. - MD_BROKEN remains set, leaving the array in a state where no further writes can occur. > Normal writes may call md_error() in narrow_write_error. Normal reads do > not execute md_error() on the last disk. > > Perhaps you should get more information to confirm how MD_BROKEN is set in > normal read/write IO. Should I add the above sequence of events to the cover letter, or commit message? Thanks, Akagi > -- > Thanks, > Nan > > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v3 1/3] md/raid1,raid10: Do not set MD_BROKEN on failfast io failure 2025-08-30 18:10 ` Kenta Akagi @ 2025-09-01 3:22 ` Li Nan 2025-09-01 4:22 ` Kenta Akagi 0 siblings, 1 reply; 13+ messages in thread From: Li Nan @ 2025-09-01 3:22 UTC (permalink / raw) To: Kenta Akagi, Li Nan, Song Liu, Yu Kuai, Mariusz Tkaczyk, Guoqing Jiang Cc: linux-raid, linux-kernel 在 2025/8/31 2:10, Kenta Akagi 写道: > > > On 2025/08/30 17:48, Li Nan wrote: >> >> >> 在 2025/8/29 20:21, Kenta Akagi 写道: >>> >>> >>> On 2025/08/29 11:54, Li Nan wrote: >>>> >>>> 在 2025/8/29 0:32, Kenta Akagi 写道: >>>>> This commit ensures that an MD_FAILFAST IO failure does not put >>>>> the array into a broken state. >>>>> >>>>> When failfast is enabled on rdev in RAID1 or RAID10, >>>>> the array may be flagged MD_BROKEN in the following cases. >>>>> - If MD_FAILFAST IOs to multiple rdevs fail simultaneously >>>>> - If an MD_FAILFAST metadata write to the 'last' rdev fails >>>> >>>> [...] >>>> >>>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c >>>>> index 408c26398321..8a61fd93b3ff 100644 >>>>> --- a/drivers/md/raid1.c >>>>> +++ b/drivers/md/raid1.c >>>>> @@ -470,6 +470,7 @@ static void raid1_end_write_request(struct bio *bio) >>>>> (bio->bi_opf & MD_FAILFAST) && >>>>> /* We never try FailFast to WriteMostly devices */ >>>>> !test_bit(WriteMostly, &rdev->flags)) { >>>>> + set_bit(FailfastIOFailure, &rdev->flags); >>>>> md_error(r1_bio->mddev, rdev); >>>>> } >>>>> @@ -1746,8 +1747,12 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev) >>>>> * - recovery is interrupted. >>>>> * - &mddev->degraded is bumped. >>>>> * >>>>> - * @rdev is marked as &Faulty excluding case when array is failed and >>>>> - * &mddev->fail_last_dev is off. >>>>> + * If @rdev has &FailfastIOFailure and it is the 'last' rdev, >>>>> + * then @mddev and @rdev will not be marked as failed. >>>>> + * >>>>> + * @rdev is marked as &Faulty excluding any cases: >>>>> + * - when @mddev is failed and &mddev->fail_last_dev is off >>>>> + * - when @rdev is last device and &FailfastIOFailure flag is set >>>>> */ >>>>> static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) >>>>> { >>>>> @@ -1758,6 +1763,13 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) >>>>> if (test_bit(In_sync, &rdev->flags) && >>>>> (conf->raid_disks - mddev->degraded) == 1) { >>>>> + if (test_and_clear_bit(FailfastIOFailure, &rdev->flags)) { >>>>> + spin_unlock_irqrestore(&conf->device_lock, flags); >>>>> + pr_warn_ratelimited("md/raid1:%s: Failfast IO failure on %pg, " >>>>> + "last device but ignoring it\n", >>>>> + mdname(mddev), rdev->bdev); >>>>> + return; >>>>> + } >>>>> set_bit(MD_BROKEN, &mddev->flags); >>>>> if (!mddev->fail_last_dev) { >>>>> @@ -2148,6 +2160,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio) >>>>> if (test_bit(FailFast, &rdev->flags)) { >>>>> /* Don't try recovering from here - just fail it >>>>> * ... 
unless it is the last working device of course */ >>>>> + set_bit(FailfastIOFailure, &rdev->flags); >>>>> md_error(mddev, rdev); >>>>> if (test_bit(Faulty, &rdev->flags)) >>>>> /* Don't try to read from here, but make sure >>>>> @@ -2652,6 +2665,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio) >>>>> fix_read_error(conf, r1_bio); >>>>> unfreeze_array(conf); >>>>> } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) { >>>>> + set_bit(FailfastIOFailure, &rdev->flags); >>>>> md_error(mddev, rdev); >>>>> } else { >>>>> r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED; >>>>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c >>>>> index b60c30bfb6c7..530ad6503189 100644 >>>>> --- a/drivers/md/raid10.c >>>>> +++ b/drivers/md/raid10.c >>>>> @@ -488,6 +488,7 @@ static void raid10_end_write_request(struct bio *bio) >>>>> dec_rdev = 0; >>>>> if (test_bit(FailFast, &rdev->flags) && >>>>> (bio->bi_opf & MD_FAILFAST)) { >>>>> + set_bit(FailfastIOFailure, &rdev->flags); >>>>> md_error(rdev->mddev, rdev); >>>>> } >>>>> >>>> >>>> Thank you for the patch. There may be an issue with 'test_and_clear'. >>>> >>>> If two write IO go to the same rdev, MD_BROKEN may be set as below: >>> >>>> IO1 IO2 >>>> set FailfastIOFailure >>>> set FailfastIOFailure >>>> md_error >>>> raid1_error >>>> test_and_clear FailfastIOFailur >>>> md_error >>>> raid1_error >>>> //FailfastIOFailur is cleared >>>> set MD_BROKEN >>>> >>>> Maybe we should check whether FailfastIOFailure is already set before >>>> setting it. It also needs to be considered in metadata writes. >>> Thank you for reviewing. >>> >>> I agree, this seems to be as you described. >>> So, should it be implemented as follows? >>> >>> bool old=false; >>> do{ >>> spin_lock_irqsave(&conf->device_lock, flags); >>> old = test_and_set_bit(FailfastIOFailure, &rdev->flags); >>> spin_unlock_irqrestore(&conf->device_lock, flags); >>> }while(old); >>> >>> However, since I am concerned about potential deadlocks, >>> so I am considering two alternative approaches: >>> >>> * Add an atomic_t counter to md_rdev to track failfast IO failures. >>> >>> This may set MD_BROKEN at a slightly incorrect timing, but mixing >>> error handling of Failfast and non-Failfast IOs appears to be rare. >>> In any case, the final outcome would be the same, i.e. the array >>> ends up with MD_BROKEN. Therefore, I think this should not cause >>> issues. I think the same applies to test_and_set_bit. >>> >>> IO1 IO2 IO3 >>> FailfastIOFailure Normal IOFailure FailfastIOFailure >>> atomic_inc >>> md_error atomic_inc >>> raid1_error >>> atomic_dec //2to1 >>> md_error >>> raid1_error md_error >>> atomic_dec //1to0 raid1_error >>> atomic_dec //0 >>> set MD_BROKEN >>> >>> * Alternatively, create a separate error handler, >>> e.g. md_error_failfast(), that clearly does not fail the array. >>> >>> This approach would require somewhat larger changes and may not >>> be very elegant, but it seems to be a reliable way to ensure >>> MD_BROKEN is never set at the wrong timing. >>> >>> Which of these three approaches would you consider preferable? >>> I would appreciate your feedback. >>> >>> >>> For metadata writes, I plan to clear_bit MD_FAILFAST_SUPPORTED >>> when the array is degraded. >>> >>> Thanks, >>> Akagi >>> >> >> I took a closer look at the FailFast code and found a few issues, using >> RAID1 as an example: >> >> For normal read/write IO, FailFast is only triggered when there is another >> disk is available, as seen in read_balance() and raid1_write_request(). 
>> In raid1_error(), MD_BROKEN is set only when no other disks are available. > > Hi, > Agree, I think so too. > >> So, the FailFast for normal read/write is not triggered in the scenario you >> described in cover-letter. > > This corresponds to the case described in the commit message of PATCH v3 1/3. > "Normally, MD_FAILFAST IOs are not issued to the 'last' rdev, so this is > an edge case; however, it can occur if rdevs in a non-degraded > array share the same path and that path is lost, or if a metadata > write is triggered in a degraded array and fails due to failfast." > > To describe it in more detail, the flow is as follows: > > Prerequisites: > > - Both rdevs are in-sync > - Both rdevs have in-flight MD_FAILFAST bios > - Both rdevs depend on the same lower-level path > (e.g., nvme-tcp over a single Ethernet interface) > > Sequence: > > - A bios with REQ_FAILFAST_DEV fails (e.g., due to a temporary network outage), > in the case of nvme-tcp: > - The Ethernet connection is lost on the node where md is running over 5 seconds > - Then the connection is restored. Idk the details of nvme-tcp implementation, > but it seems that failfast IOs finish only after the connection is back. > - All failfast bios fail, raid1_end_write_request is called. > - md_error() marks one rdev Faulty; the other rdev becomes the 'last' rdev. > - md_error() on the last rdev sets MD_BROKEN on the array - fail_last_dev=1 is unlikely. > - The write is retried via handle_write_finished -> narrow_write_error, usually succeeding. > - MD_BROKEN remains set, leaving the array in a state where no further writes can occur. > Thanks for your patient explanation. I understand. Maybe we need a separate error-handling path for failfast. How about adding an extra parameter to md_error()? Kuai, do you have any suggestions? >> Normal writes may call md_error() in narrow_write_error. Normal reads do >> not execute md_error() on the last disk. >> >> Perhaps you should get more information to confirm how MD_BROKEN is set in >> normal read/write IO. > > Should I add the above sequence of events to the cover letter, or commit message? > I think we should mention this in the commit message. > Thanks, > Akagi > >> -- >> Thanks, >> Nan >> >> > > > . -- Thanks, Nan ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v3 1/3] md/raid1,raid10: Do not set MD_BROKEN on failfast io failure 2025-09-01 3:22 ` Li Nan @ 2025-09-01 4:22 ` Kenta Akagi 2025-09-01 7:48 ` Yu Kuai 0 siblings, 1 reply; 13+ messages in thread From: Kenta Akagi @ 2025-09-01 4:22 UTC (permalink / raw) To: Li Nan, Song Liu, Yu Kuai, Mariusz Tkaczyk, Guoqing Jiang Cc: linux-raid, linux-kernel, Kenta Akagi On 2025/09/01 12:22, Li Nan wrote: > > > 在 2025/8/31 2:10, Kenta Akagi 写道: >> >> >> On 2025/08/30 17:48, Li Nan wrote: >>> >>> >>> 在 2025/8/29 20:21, Kenta Akagi 写道: >>>> >>>> >>>> On 2025/08/29 11:54, Li Nan wrote: >>>>> >>>>> 在 2025/8/29 0:32, Kenta Akagi 写道: >>>>>> This commit ensures that an MD_FAILFAST IO failure does not put >>>>>> the array into a broken state. >>>>>> >>>>>> When failfast is enabled on rdev in RAID1 or RAID10, >>>>>> the array may be flagged MD_BROKEN in the following cases. >>>>>> - If MD_FAILFAST IOs to multiple rdevs fail simultaneously >>>>>> - If an MD_FAILFAST metadata write to the 'last' rdev fails >>>>> >>>>> [...] >>>>> >>>>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c >>>>>> index 408c26398321..8a61fd93b3ff 100644 >>>>>> --- a/drivers/md/raid1.c >>>>>> +++ b/drivers/md/raid1.c >>>>>> @@ -470,6 +470,7 @@ static void raid1_end_write_request(struct bio *bio) >>>>>> (bio->bi_opf & MD_FAILFAST) && >>>>>> /* We never try FailFast to WriteMostly devices */ >>>>>> !test_bit(WriteMostly, &rdev->flags)) { >>>>>> + set_bit(FailfastIOFailure, &rdev->flags); >>>>>> md_error(r1_bio->mddev, rdev); >>>>>> } >>>>>> @@ -1746,8 +1747,12 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev) >>>>>> * - recovery is interrupted. >>>>>> * - &mddev->degraded is bumped. >>>>>> * >>>>>> - * @rdev is marked as &Faulty excluding case when array is failed and >>>>>> - * &mddev->fail_last_dev is off. >>>>>> + * If @rdev has &FailfastIOFailure and it is the 'last' rdev, >>>>>> + * then @mddev and @rdev will not be marked as failed. >>>>>> + * >>>>>> + * @rdev is marked as &Faulty excluding any cases: >>>>>> + * - when @mddev is failed and &mddev->fail_last_dev is off >>>>>> + * - when @rdev is last device and &FailfastIOFailure flag is set >>>>>> */ >>>>>> static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) >>>>>> { >>>>>> @@ -1758,6 +1763,13 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) >>>>>> if (test_bit(In_sync, &rdev->flags) && >>>>>> (conf->raid_disks - mddev->degraded) == 1) { >>>>>> + if (test_and_clear_bit(FailfastIOFailure, &rdev->flags)) { >>>>>> + spin_unlock_irqrestore(&conf->device_lock, flags); >>>>>> + pr_warn_ratelimited("md/raid1:%s: Failfast IO failure on %pg, " >>>>>> + "last device but ignoring it\n", >>>>>> + mdname(mddev), rdev->bdev); >>>>>> + return; >>>>>> + } >>>>>> set_bit(MD_BROKEN, &mddev->flags); >>>>>> if (!mddev->fail_last_dev) { >>>>>> @@ -2148,6 +2160,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio) >>>>>> if (test_bit(FailFast, &rdev->flags)) { >>>>>> /* Don't try recovering from here - just fail it >>>>>> * ... 
unless it is the last working device of course */ >>>>>> + set_bit(FailfastIOFailure, &rdev->flags); >>>>>> md_error(mddev, rdev); >>>>>> if (test_bit(Faulty, &rdev->flags)) >>>>>> /* Don't try to read from here, but make sure >>>>>> @@ -2652,6 +2665,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio) >>>>>> fix_read_error(conf, r1_bio); >>>>>> unfreeze_array(conf); >>>>>> } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) { >>>>>> + set_bit(FailfastIOFailure, &rdev->flags); >>>>>> md_error(mddev, rdev); >>>>>> } else { >>>>>> r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED; >>>>>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c >>>>>> index b60c30bfb6c7..530ad6503189 100644 >>>>>> --- a/drivers/md/raid10.c >>>>>> +++ b/drivers/md/raid10.c >>>>>> @@ -488,6 +488,7 @@ static void raid10_end_write_request(struct bio *bio) >>>>>> dec_rdev = 0; >>>>>> if (test_bit(FailFast, &rdev->flags) && >>>>>> (bio->bi_opf & MD_FAILFAST)) { >>>>>> + set_bit(FailfastIOFailure, &rdev->flags); >>>>>> md_error(rdev->mddev, rdev); >>>>>> } >>>>>> >>>>> >>>>> Thank you for the patch. There may be an issue with 'test_and_clear'. >>>>> >>>>> If two write IO go to the same rdev, MD_BROKEN may be set as below: >>>> >>>>> IO1 IO2 >>>>> set FailfastIOFailure >>>>> set FailfastIOFailure >>>>> md_error >>>>> raid1_error >>>>> test_and_clear FailfastIOFailur >>>>> md_error >>>>> raid1_error >>>>> //FailfastIOFailur is cleared >>>>> set MD_BROKEN >>>>> >>>>> Maybe we should check whether FailfastIOFailure is already set before >>>>> setting it. It also needs to be considered in metadata writes. >>>> Thank you for reviewing. >>>> >>>> I agree, this seems to be as you described. >>>> So, should it be implemented as follows? >>>> >>>> bool old=false; >>>> do{ >>>> spin_lock_irqsave(&conf->device_lock, flags); >>>> old = test_and_set_bit(FailfastIOFailure, &rdev->flags); >>>> spin_unlock_irqrestore(&conf->device_lock, flags); >>>> }while(old); >>>> >>>> However, since I am concerned about potential deadlocks, >>>> so I am considering two alternative approaches: >>>> >>>> * Add an atomic_t counter to md_rdev to track failfast IO failures. >>>> >>>> This may set MD_BROKEN at a slightly incorrect timing, but mixing >>>> error handling of Failfast and non-Failfast IOs appears to be rare. >>>> In any case, the final outcome would be the same, i.e. the array >>>> ends up with MD_BROKEN. Therefore, I think this should not cause >>>> issues. I think the same applies to test_and_set_bit. >>>> >>>> IO1 IO2 IO3 >>>> FailfastIOFailure Normal IOFailure FailfastIOFailure >>>> atomic_inc >>>> md_error atomic_inc >>>> raid1_error >>>> atomic_dec //2to1 >>>> md_error >>>> raid1_error md_error >>>> atomic_dec //1to0 raid1_error >>>> atomic_dec //0 >>>> set MD_BROKEN >>>> >>>> * Alternatively, create a separate error handler, >>>> e.g. md_error_failfast(), that clearly does not fail the array. >>>> >>>> This approach would require somewhat larger changes and may not >>>> be very elegant, but it seems to be a reliable way to ensure >>>> MD_BROKEN is never set at the wrong timing. >>>> >>>> Which of these three approaches would you consider preferable? >>>> I would appreciate your feedback. >>>> >>>> >>>> For metadata writes, I plan to clear_bit MD_FAILFAST_SUPPORTED >>>> when the array is degraded. 
>>>> >>>> Thanks, >>>> Akagi >>>> >>> >>> I took a closer look at the FailFast code and found a few issues, using >>> RAID1 as an example: >>> >>> For normal read/write IO, FailFast is only triggered when there is another >>> disk is available, as seen in read_balance() and raid1_write_request(). >>> In raid1_error(), MD_BROKEN is set only when no other disks are available. >> >> Hi, >> Agree, I think so too. >> >>> So, the FailFast for normal read/write is not triggered in the scenario you >>> described in cover-letter. >> >> This corresponds to the case described in the commit message of PATCH v3 1/3. >> "Normally, MD_FAILFAST IOs are not issued to the 'last' rdev, so this is >> an edge case; however, it can occur if rdevs in a non-degraded >> array share the same path and that path is lost, or if a metadata >> write is triggered in a degraded array and fails due to failfast." >> >> To describe it in more detail, the flow is as follows: >> >> Prerequisites: >> >> - Both rdevs are in-sync >> - Both rdevs have in-flight MD_FAILFAST bios >> - Both rdevs depend on the same lower-level path >> (e.g., nvme-tcp over a single Ethernet interface) >> >> Sequence: >> >> - A bios with REQ_FAILFAST_DEV fails (e.g., due to a temporary network outage), >> in the case of nvme-tcp: >> - The Ethernet connection is lost on the node where md is running over 5 seconds >> - Then the connection is restored. Idk the details of nvme-tcp implementation, >> but it seems that failfast IOs finish only after the connection is back. >> - All failfast bios fail, raid1_end_write_request is called. >> - md_error() marks one rdev Faulty; the other rdev becomes the 'last' rdev. >> - md_error() on the last rdev sets MD_BROKEN on the array - fail_last_dev=1 is unlikely. >> - The write is retried via handle_write_finished -> narrow_write_error, usually succeeding. >> - MD_BROKEN remains set, leaving the array in a state where no further writes can occur. >> > > Thanks for your patient explanation. I understand. Maybe we need a separate > error-handling path for failfast. How about adding an extra parameter to md_error()? Thank you for reviewing. I am thinking of proceeding like as follows. md_error is EXPORT_SYMBOL. I think that it is undesirable to change the ABI of this function. ... 
diff --git a/drivers/md/md.c b/drivers/md/md.c index ac85ec73a409..855cddeb0c09 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -8197,3 +8197,3 @@ EXPORT_SYMBOL(md_unregister_thread); -void md_error(struct mddev *mddev, struct md_rdev *rdev) +void _md_error(struct mddev *mddev, struct md_rdev *rdev, bool nofail) { @@ -8204,3 +8204,3 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev) return; - mddev->pers->error_handler(mddev, rdev); + mddev->pers->error_handler(mddev, rdev, nofail); @@ -8222,4 +8222,26 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev) } + +void md_error(struct mddev *mddev, struct md_rdev *rdev) +{ + return _md_error(mddev, rdev, false); +} EXPORT_SYMBOL(md_error); +void md_error_failfast(struct mddev *mddev, struct md_rdev *rdev) +{ + WARN_ON(mddev->pers->head.id != ID_RAID1 && + mddev->pers->head.id != ID_RAID10); + return _md_error(mddev, rdev, true); +} +EXPORT_SYMBOL(md_error_failfast); + /* seq_file implementation /proc/mdstat */ diff --git a/drivers/md/md.h b/drivers/md/md.h index 51af29a03079..6ca1aea630ce 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -758,3 +758,3 @@ struct md_personality */ - void (*error_handler)(struct mddev *mddev, struct md_rdev *rdev); + void (*error_handler)(struct mddev *mddev, struct md_rdev *rdev, bool nofail); int (*hot_add_disk) (struct mddev *mddev, struct md_rdev *rdev); @@ -903,3 +903,5 @@ extern void md_write_end(struct mddev *mddev); extern void md_done_sync(struct mddev *mddev, int blocks, int ok); +void _md_error(struct mddev *mddev, struct md_rdev *rdev, bool nofail); extern void md_error(struct mddev *mddev, struct md_rdev *rdev); +extern void md_error_failfast(struct mddev *mddev, struct md_rdev *rdev); extern void md_finish_reshape(struct mddev *mddev); diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c index f1d8811a542a..8aea51227a96 100644 --- a/drivers/md/raid0.c +++ b/drivers/md/raid0.c @@ -637,3 +637,4 @@ static void raid0_status(struct seq_file *seq, struct mddev *mddev) -static void raid0_error(struct mddev *mddev, struct md_rdev *rdev) +static void raid0_error(struct mddev *mddev, struct md_rdev *rdev, + bool nofail __maybe_unused) { diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 408c26398321..d93275899e9e 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -1739,2 +1739,3 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev) * @rdev: member device to fail. + * @nofail: @mdev and @rdev must not fail even if @rdev is the last when @nofail set * @@ -1748,6 +1749,8 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev) * - * @rdev is marked as &Faulty excluding case when array is failed and - * &mddev->fail_last_dev is off. + * @rdev is marked as &Faulty excluding any cases: + * - when @mddev is failed and &mddev->fail_last_dev is off + * - when @rdev is last device and @nofail is true */ -static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) +static void raid1_error(struct mddev *mddev, struct md_rdev *rdev, + bool nofail) { @@ -1760,2 +1763,9 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) (conf->raid_disks - mddev->degraded) == 1) { + if (nofail) { + spin_unlock_irqrestore(&conf->device_lock, flags); + pr_warn_ratelimited("md/raid1:%s: IO failure on %pg, " + "last device but ignoring it\n", + mdname(mddev), rdev->bdev); + return; + } set_bit(MD_BROKEN, &mddev->flags); ... > Kuai, do you have any suggestions? > >>> Normal writes may call md_error() in narrow_write_error. 
Normal reads do >>> not execute md_error() on the last disk. >>> >>> Perhaps you should get more information to confirm how MD_BROKEN is set in >>> normal read/write IO. >> >> Should I add the above sequence of events to the cover letter, or commit message? >> > > I think we should mention this in the commit message. Understood. I will explicitly describe this in the commit message in v4. Thanks, Akagi >> Thanks, >> Akagi >> >>> -- >>> Thanks, >>> Nan >>> >>> >> >> >> . > > -- > Thanks, > Nan > > ^ permalink raw reply related [flat|nested] 13+ messages in thread
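For reference, the atomic-counter alternative floated in the message above could look roughly like the following. This is an illustration only: the field name failfast_io_failures and its placement in struct md_rdev are assumptions made for this sketch, not code from the posted series.

	/*
	 * Illustration only: assumes an atomic_t failfast_io_failures
	 * field added to struct md_rdev; not part of the posted patches.
	 */

	/* caller side, e.g. a failfast write error in raid1_end_write_request() */
	atomic_inc(&rdev->failfast_io_failures);
	md_error(r1_bio->mddev, rdev);
	atomic_dec(&rdev->failfast_io_failures);

	/* raid1_error(), before marking the array MD_BROKEN on the last rdev */
	if (atomic_read(&rdev->failfast_io_failures) > 0) {
		spin_unlock_irqrestore(&conf->device_lock, flags);
		pr_warn_ratelimited("md/raid1:%s: Failfast IO failure on %pg, "
				    "last device but ignoring it\n",
				    mdname(mddev), rdev->bdev);
		return;
	}

As noted in the thread, a plain counter can still attribute a concurrent non-failfast failure to the failfast path at a slightly wrong moment, but the end state is the same: the array only becomes MD_BROKEN once a non-failfast failure on the last rdev is handled.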
* Re: [PATCH v3 1/3] md/raid1,raid10: Do not set MD_BROKEN on failfast io failure 2025-09-01 4:22 ` Kenta Akagi @ 2025-09-01 7:48 ` Yu Kuai 2025-09-01 16:48 ` Kenta Akagi 0 siblings, 1 reply; 13+ messages in thread From: Yu Kuai @ 2025-09-01 7:48 UTC (permalink / raw) To: Kenta Akagi, Li Nan, Song Liu, Mariusz Tkaczyk, Guoqing Jiang Cc: linux-raid, linux-kernel, yukuai (C) Hi, 在 2025/09/01 12:22, Kenta Akagi 写道: > > > On 2025/09/01 12:22, Li Nan wrote: >> >> >> 在 2025/8/31 2:10, Kenta Akagi 写道: >>> >>> >>> On 2025/08/30 17:48, Li Nan wrote: >>>> >>>> >>>> 在 2025/8/29 20:21, Kenta Akagi 写道: >>>>> >>>>> >>>>> On 2025/08/29 11:54, Li Nan wrote: >>>>>> >>>>>> 在 2025/8/29 0:32, Kenta Akagi 写道: >>>>>>> This commit ensures that an MD_FAILFAST IO failure does not put >>>>>>> the array into a broken state. >>>>>>> >>>>>>> When failfast is enabled on rdev in RAID1 or RAID10, >>>>>>> the array may be flagged MD_BROKEN in the following cases. >>>>>>> - If MD_FAILFAST IOs to multiple rdevs fail simultaneously >>>>>>> - If an MD_FAILFAST metadata write to the 'last' rdev fails >>>>>> >>>>>> [...] >>>>>> >>>>>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c >>>>>>> index 408c26398321..8a61fd93b3ff 100644 >>>>>>> --- a/drivers/md/raid1.c >>>>>>> +++ b/drivers/md/raid1.c >>>>>>> @@ -470,6 +470,7 @@ static void raid1_end_write_request(struct bio *bio) >>>>>>> (bio->bi_opf & MD_FAILFAST) && >>>>>>> /* We never try FailFast to WriteMostly devices */ >>>>>>> !test_bit(WriteMostly, &rdev->flags)) { >>>>>>> + set_bit(FailfastIOFailure, &rdev->flags); >>>>>>> md_error(r1_bio->mddev, rdev); >>>>>>> } >>>>>>> @@ -1746,8 +1747,12 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev) >>>>>>> * - recovery is interrupted. >>>>>>> * - &mddev->degraded is bumped. >>>>>>> * >>>>>>> - * @rdev is marked as &Faulty excluding case when array is failed and >>>>>>> - * &mddev->fail_last_dev is off. >>>>>>> + * If @rdev has &FailfastIOFailure and it is the 'last' rdev, >>>>>>> + * then @mddev and @rdev will not be marked as failed. >>>>>>> + * >>>>>>> + * @rdev is marked as &Faulty excluding any cases: >>>>>>> + * - when @mddev is failed and &mddev->fail_last_dev is off >>>>>>> + * - when @rdev is last device and &FailfastIOFailure flag is set >>>>>>> */ >>>>>>> static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) >>>>>>> { >>>>>>> @@ -1758,6 +1763,13 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) >>>>>>> if (test_bit(In_sync, &rdev->flags) && >>>>>>> (conf->raid_disks - mddev->degraded) == 1) { >>>>>>> + if (test_and_clear_bit(FailfastIOFailure, &rdev->flags)) { >>>>>>> + spin_unlock_irqrestore(&conf->device_lock, flags); >>>>>>> + pr_warn_ratelimited("md/raid1:%s: Failfast IO failure on %pg, " >>>>>>> + "last device but ignoring it\n", >>>>>>> + mdname(mddev), rdev->bdev); >>>>>>> + return; >>>>>>> + } >>>>>>> set_bit(MD_BROKEN, &mddev->flags); >>>>>>> if (!mddev->fail_last_dev) { >>>>>>> @@ -2148,6 +2160,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio) >>>>>>> if (test_bit(FailFast, &rdev->flags)) { >>>>>>> /* Don't try recovering from here - just fail it >>>>>>> * ... 
unless it is the last working device of course */ >>>>>>> + set_bit(FailfastIOFailure, &rdev->flags); >>>>>>> md_error(mddev, rdev); >>>>>>> if (test_bit(Faulty, &rdev->flags)) >>>>>>> /* Don't try to read from here, but make sure >>>>>>> @@ -2652,6 +2665,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio) >>>>>>> fix_read_error(conf, r1_bio); >>>>>>> unfreeze_array(conf); >>>>>>> } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) { >>>>>>> + set_bit(FailfastIOFailure, &rdev->flags); >>>>>>> md_error(mddev, rdev); >>>>>>> } else { >>>>>>> r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED; >>>>>>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c >>>>>>> index b60c30bfb6c7..530ad6503189 100644 >>>>>>> --- a/drivers/md/raid10.c >>>>>>> +++ b/drivers/md/raid10.c >>>>>>> @@ -488,6 +488,7 @@ static void raid10_end_write_request(struct bio *bio) >>>>>>> dec_rdev = 0; >>>>>>> if (test_bit(FailFast, &rdev->flags) && >>>>>>> (bio->bi_opf & MD_FAILFAST)) { >>>>>>> + set_bit(FailfastIOFailure, &rdev->flags); >>>>>>> md_error(rdev->mddev, rdev); >>>>>>> } >>>>>>> >>>>>> >>>>>> Thank you for the patch. There may be an issue with 'test_and_clear'. >>>>>> >>>>>> If two write IO go to the same rdev, MD_BROKEN may be set as below: >>>>> >>>>>> IO1 IO2 >>>>>> set FailfastIOFailure >>>>>> set FailfastIOFailure >>>>>> md_error >>>>>> raid1_error >>>>>> test_and_clear FailfastIOFailur >>>>>> md_error >>>>>> raid1_error >>>>>> //FailfastIOFailur is cleared >>>>>> set MD_BROKEN >>>>>> >>>>>> Maybe we should check whether FailfastIOFailure is already set before >>>>>> setting it. It also needs to be considered in metadata writes. >>>>> Thank you for reviewing. >>>>> >>>>> I agree, this seems to be as you described. >>>>> So, should it be implemented as follows? >>>>> >>>>> bool old=false; >>>>> do{ >>>>> spin_lock_irqsave(&conf->device_lock, flags); >>>>> old = test_and_set_bit(FailfastIOFailure, &rdev->flags); >>>>> spin_unlock_irqrestore(&conf->device_lock, flags); >>>>> }while(old); >>>>> >>>>> However, since I am concerned about potential deadlocks, >>>>> so I am considering two alternative approaches: >>>>> >>>>> * Add an atomic_t counter to md_rdev to track failfast IO failures. >>>>> >>>>> This may set MD_BROKEN at a slightly incorrect timing, but mixing >>>>> error handling of Failfast and non-Failfast IOs appears to be rare. >>>>> In any case, the final outcome would be the same, i.e. the array >>>>> ends up with MD_BROKEN. Therefore, I think this should not cause >>>>> issues. I think the same applies to test_and_set_bit. >>>>> >>>>> IO1 IO2 IO3 >>>>> FailfastIOFailure Normal IOFailure FailfastIOFailure >>>>> atomic_inc >>>>> md_error atomic_inc >>>>> raid1_error >>>>> atomic_dec //2to1 >>>>> md_error >>>>> raid1_error md_error >>>>> atomic_dec //1to0 raid1_error >>>>> atomic_dec //0 >>>>> set MD_BROKEN >>>>> >>>>> * Alternatively, create a separate error handler, >>>>> e.g. md_error_failfast(), that clearly does not fail the array. >>>>> >>>>> This approach would require somewhat larger changes and may not >>>>> be very elegant, but it seems to be a reliable way to ensure >>>>> MD_BROKEN is never set at the wrong timing. >>>>> >>>>> Which of these three approaches would you consider preferable? >>>>> I would appreciate your feedback. >>>>> >>>>> >>>>> For metadata writes, I plan to clear_bit MD_FAILFAST_SUPPORTED >>>>> when the array is degraded. 
>>>>> >>>>> Thanks, >>>>> Akagi >>>>> >>>> >>>> I took a closer look at the FailFast code and found a few issues, using >>>> RAID1 as an example: >>>> >>>> For normal read/write IO, FailFast is only triggered when there is another >>>> disk is available, as seen in read_balance() and raid1_write_request(). >>>> In raid1_error(), MD_BROKEN is set only when no other disks are available. >>> >>> Hi, >>> Agree, I think so too. >>> >>>> So, the FailFast for normal read/write is not triggered in the scenario you >>>> described in cover-letter. >>> >>> This corresponds to the case described in the commit message of PATCH v3 1/3. >>> "Normally, MD_FAILFAST IOs are not issued to the 'last' rdev, so this is >>> an edge case; however, it can occur if rdevs in a non-degraded >>> array share the same path and that path is lost, or if a metadata >>> write is triggered in a degraded array and fails due to failfast." >>> >>> To describe it in more detail, the flow is as follows: >>> >>> Prerequisites: >>> >>> - Both rdevs are in-sync >>> - Both rdevs have in-flight MD_FAILFAST bios >>> - Both rdevs depend on the same lower-level path >>> (e.g., nvme-tcp over a single Ethernet interface) >>> >>> Sequence: >>> >>> - A bios with REQ_FAILFAST_DEV fails (e.g., due to a temporary network outage), >>> in the case of nvme-tcp: >>> - The Ethernet connection is lost on the node where md is running over 5 seconds >>> - Then the connection is restored. Idk the details of nvme-tcp implementation, >>> but it seems that failfast IOs finish only after the connection is back. >>> - All failfast bios fail, raid1_end_write_request is called. >>> - md_error() marks one rdev Faulty; the other rdev becomes the 'last' rdev. >>> - md_error() on the last rdev sets MD_BROKEN on the array - fail_last_dev=1 is unlikely. >>> - The write is retried via handle_write_finished -> narrow_write_error, usually succeeding. >>> - MD_BROKEN remains set, leaving the array in a state where no further writes can occur. >>> >> >> Thanks for your patient explanation. I understand. Maybe we need a separate >> error-handling path for failfast. How about adding an extra parameter to md_error()? > > Thank you for reviewing. > > I am thinking of proceeding like as follows. > md_error is EXPORT_SYMBOL. I think that it is undesirable to change the ABI of this function. > It doesn't matter if it's a exported symbol, we should just keep code as simple as possible. > ... > diff --git a/drivers/md/md.c b/drivers/md/md.c > index ac85ec73a409..855cddeb0c09 100644 > --- a/drivers/md/md.c > +++ b/drivers/md/md.c > @@ -8197,3 +8197,3 @@ EXPORT_SYMBOL(md_unregister_thread); > > -void md_error(struct mddev *mddev, struct md_rdev *rdev) > +void _md_error(struct mddev *mddev, struct md_rdev *rdev, bool nofail) > { > @@ -8204,3 +8204,3 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev) > return; > - mddev->pers->error_handler(mddev, rdev); > + mddev->pers->error_handler(mddev, rdev, nofail); > > @@ -8222,4 +8222,26 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev) > } > + > +void md_error(struct mddev *mddev, struct md_rdev *rdev) > +{ > + return _md_error(mddev, rdev, false); > +} > EXPORT_SYMBOL(md_error); > > +void md_error_failfast(struct mddev *mddev, struct md_rdev *rdev) > +{ > + WARN_ON(mddev->pers->head.id != ID_RAID1 && > + mddev->pers->head.id != ID_RAID10); > + return _md_error(mddev, rdev, true); > +} > +EXPORT_SYMBOL(md_error_failfast); > + I will prefer we add a common procedures to fix this problme. 
How about the first patch to serialize all the md_error(), and then and a new helper md_bio_failue_error(mddev, rdev, bio), called when bio is failed, in this helper: 1) if bio is not failfast, call md_error() and return true; otherwise: 2) if rdev contain the last data copy, return false directly, caller should check return value and retry, otherwise: 3) call md_error() and return true; Then, for raid1, the callers will look like: iff --git a/drivers/md/md.c b/drivers/md/md.c index 1baaf52c603c..c6d150e9f1a7 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -1003,9 +1003,7 @@ static void super_written(struct bio *bio) if (bio->bi_status) { pr_err("md: %s gets error=%d\n", __func__, blk_status_to_errno(bio->bi_status)); - md_error(mddev, rdev); - if (!test_bit(Faulty, &rdev->flags) - && (bio->bi_opf & MD_FAILFAST)) { + if (!md_bio_failure_error(mddev, rdev, bio)) { set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags); set_bit(LastDev, &rdev->flags); } @@ -466,20 +472,11 @@ static void raid1_end_write_request(struct bio *bio) set_bit(MD_RECOVERY_NEEDED, & conf->mddev->recovery); - if (test_bit(FailFast, &rdev->flags) && - (bio->bi_opf & MD_FAILFAST) && /* We never try FailFast to WriteMostly devices */ - !test_bit(WriteMostly, &rdev->flags)) { - md_error(r1_bio->mddev, rdev); - } - - /* - * When the device is faulty, it is not necessary to - * handle write error. - */ - if (!test_bit(Faulty, &rdev->flags)) + if(!test_bit(WriteMostly, &rdev->flags) && + !md_bio_failure_error(mddev, rdev, bio)) { set_bit(R1BIO_WriteError, &r1_bio->state); - else { + } else { /* Finished with this branch */ r1_bio->bios[mirror] = NULL; to_put = bio; @@ -2630,7 +2627,6 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio) */ bio = r1_bio->bios[r1_bio->read_disk]; - bio_put(bio); r1_bio->bios[r1_bio->read_disk] = NULL; rdev = conf->mirrors[r1_bio->read_disk].rdev; @@ -2639,19 +2635,18 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio) freeze_array(conf, 1); fix_read_error(conf, r1_bio); unfreeze_array(conf); - } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) { - md_error(mddev, rdev); - } else { + } else if (mddev->ro == 0 && + !md_bio_failure_error(mddev, rdev, bio)) { r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED; } + bio_put(bio); rdev_dec_pending(rdev, conf->mddev); sector = r1_bio->sector; - bio = r1_bio->master_bio; /* Reuse the old r1_bio so that the IO_BLOCKED settings are preserved */ r1_bio->state = 0; - raid1_read_request(mddev, bio, r1_bio->sectors, r1_bio); + raid1_read_request(mddev, r1_bio->maxter_bio, r1_bio->sectors, r1_bio); allow_barrier(conf, sector); } > /* seq_file implementation /proc/mdstat */ > diff --git a/drivers/md/md.h b/drivers/md/md.h > index 51af29a03079..6ca1aea630ce 100644 > --- a/drivers/md/md.h > +++ b/drivers/md/md.h > @@ -758,3 +758,3 @@ struct md_personality > */ > - void (*error_handler)(struct mddev *mddev, struct md_rdev *rdev); > + void (*error_handler)(struct mddev *mddev, struct md_rdev *rdev, bool nofail); > int (*hot_add_disk) (struct mddev *mddev, struct md_rdev *rdev); > @@ -903,3 +903,5 @@ extern void md_write_end(struct mddev *mddev); > extern void md_done_sync(struct mddev *mddev, int blocks, int ok); > +void _md_error(struct mddev *mddev, struct md_rdev *rdev, bool nofail); > extern void md_error(struct mddev *mddev, struct md_rdev *rdev); > +extern void md_error_failfast(struct mddev *mddev, struct md_rdev *rdev); > extern void md_finish_reshape(struct mddev *mddev); > diff --git a/drivers/md/raid0.c 
b/drivers/md/raid0.c > index f1d8811a542a..8aea51227a96 100644 > --- a/drivers/md/raid0.c > +++ b/drivers/md/raid0.c > @@ -637,3 +637,4 @@ static void raid0_status(struct seq_file *seq, struct mddev *mddev) > > -static void raid0_error(struct mddev *mddev, struct md_rdev *rdev) > +static void raid0_error(struct mddev *mddev, struct md_rdev *rdev, > + bool nofail __maybe_unused) > { > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c > index 408c26398321..d93275899e9e 100644 > --- a/drivers/md/raid1.c > +++ b/drivers/md/raid1.c > @@ -1739,2 +1739,3 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev) > * @rdev: member device to fail. > + * @nofail: @mdev and @rdev must not fail even if @rdev is the last when @nofail set > * > @@ -1748,6 +1749,8 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev) > * > - * @rdev is marked as &Faulty excluding case when array is failed and > - * &mddev->fail_last_dev is off. > + * @rdev is marked as &Faulty excluding any cases: > + * - when @mddev is failed and &mddev->fail_last_dev is off > + * - when @rdev is last device and @nofail is true > */ > -static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) > +static void raid1_error(struct mddev *mddev, struct md_rdev *rdev, > + bool nofail) > { > @@ -1760,2 +1763,9 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) > (conf->raid_disks - mddev->degraded) == 1) { > + if (nofail) { > + spin_unlock_irqrestore(&conf->device_lock, flags); > + pr_warn_ratelimited("md/raid1:%s: IO failure on %pg, " > + "last device but ignoring it\n", > + mdname(mddev), rdev->bdev); > + return; > + } > set_bit(MD_BROKEN, &mddev->flags); > ... > >> Kuai, do you have any suggestions? >> >>>> Normal writes may call md_error() in narrow_write_error. Normal reads do >>>> not execute md_error() on the last disk. >>>> >>>> Perhaps you should get more information to confirm how MD_BROKEN is set in >>>> normal read/write IO. >>> >>> Should I add the above sequence of events to the cover letter, or commit message? >>> >> >> I think we should mention this in the commit message. > > Understood. I will explicitly describe this in the commit message in v4. > > Thanks, > Akagi > >>> Thanks, >>> Akagi >>> >>>> -- >>>> Thanks, >>>> Nan >>>> >>>> >>> >>> >>> . >> >> -- >> Thanks, >> Nan >> >> > > . > ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH v3 1/3] md/raid1,raid10: Do not set MD_BROKEN on failfast io failure 2025-09-01 7:48 ` Yu Kuai @ 2025-09-01 16:48 ` Kenta Akagi 2025-09-05 15:07 ` Kenta Akagi 0 siblings, 1 reply; 13+ messages in thread From: Kenta Akagi @ 2025-09-01 16:48 UTC (permalink / raw) To: Yu Kuai, Li Nan, Song Liu, Mariusz Tkaczyk, Guoqing Jiang Cc: linux-raid, linux-kernel, yukuai (C), Kenta Akagi On 2025/09/01 16:48, Yu Kuai wrote: > Hi, > > 在 2025/09/01 12:22, Kenta Akagi 写道: >> >> >> On 2025/09/01 12:22, Li Nan wrote: >>> >>> >>> 在 2025/8/31 2:10, Kenta Akagi 写道: >>>> >>>> >>>> On 2025/08/30 17:48, Li Nan wrote: >>>>> >>>>> >>>>> 在 2025/8/29 20:21, Kenta Akagi 写道: >>>>>> >>>>>> >>>>>> On 2025/08/29 11:54, Li Nan wrote: >>>>>>> >>>>>>> 在 2025/8/29 0:32, Kenta Akagi 写道: >>>>>>>> This commit ensures that an MD_FAILFAST IO failure does not put >>>>>>>> the array into a broken state. >>>>>>>> >>>>>>>> When failfast is enabled on rdev in RAID1 or RAID10, >>>>>>>> the array may be flagged MD_BROKEN in the following cases. >>>>>>>> - If MD_FAILFAST IOs to multiple rdevs fail simultaneously >>>>>>>> - If an MD_FAILFAST metadata write to the 'last' rdev fails >>>>>>> >>>>>>> [...] >>>>>>> >>>>>>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c >>>>>>>> index 408c26398321..8a61fd93b3ff 100644 >>>>>>>> --- a/drivers/md/raid1.c >>>>>>>> +++ b/drivers/md/raid1.c >>>>>>>> @@ -470,6 +470,7 @@ static void raid1_end_write_request(struct bio *bio) >>>>>>>> (bio->bi_opf & MD_FAILFAST) && >>>>>>>> /* We never try FailFast to WriteMostly devices */ >>>>>>>> !test_bit(WriteMostly, &rdev->flags)) { >>>>>>>> + set_bit(FailfastIOFailure, &rdev->flags); >>>>>>>> md_error(r1_bio->mddev, rdev); >>>>>>>> } >>>>>>>> @@ -1746,8 +1747,12 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev) >>>>>>>> * - recovery is interrupted. >>>>>>>> * - &mddev->degraded is bumped. >>>>>>>> * >>>>>>>> - * @rdev is marked as &Faulty excluding case when array is failed and >>>>>>>> - * &mddev->fail_last_dev is off. >>>>>>>> + * If @rdev has &FailfastIOFailure and it is the 'last' rdev, >>>>>>>> + * then @mddev and @rdev will not be marked as failed. >>>>>>>> + * >>>>>>>> + * @rdev is marked as &Faulty excluding any cases: >>>>>>>> + * - when @mddev is failed and &mddev->fail_last_dev is off >>>>>>>> + * - when @rdev is last device and &FailfastIOFailure flag is set >>>>>>>> */ >>>>>>>> static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) >>>>>>>> { >>>>>>>> @@ -1758,6 +1763,13 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) >>>>>>>> if (test_bit(In_sync, &rdev->flags) && >>>>>>>> (conf->raid_disks - mddev->degraded) == 1) { >>>>>>>> + if (test_and_clear_bit(FailfastIOFailure, &rdev->flags)) { >>>>>>>> + spin_unlock_irqrestore(&conf->device_lock, flags); >>>>>>>> + pr_warn_ratelimited("md/raid1:%s: Failfast IO failure on %pg, " >>>>>>>> + "last device but ignoring it\n", >>>>>>>> + mdname(mddev), rdev->bdev); >>>>>>>> + return; >>>>>>>> + } >>>>>>>> set_bit(MD_BROKEN, &mddev->flags); >>>>>>>> if (!mddev->fail_last_dev) { >>>>>>>> @@ -2148,6 +2160,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio) >>>>>>>> if (test_bit(FailFast, &rdev->flags)) { >>>>>>>> /* Don't try recovering from here - just fail it >>>>>>>> * ... 
unless it is the last working device of course */ >>>>>>>> + set_bit(FailfastIOFailure, &rdev->flags); >>>>>>>> md_error(mddev, rdev); >>>>>>>> if (test_bit(Faulty, &rdev->flags)) >>>>>>>> /* Don't try to read from here, but make sure >>>>>>>> @@ -2652,6 +2665,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio) >>>>>>>> fix_read_error(conf, r1_bio); >>>>>>>> unfreeze_array(conf); >>>>>>>> } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) { >>>>>>>> + set_bit(FailfastIOFailure, &rdev->flags); >>>>>>>> md_error(mddev, rdev); >>>>>>>> } else { >>>>>>>> r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED; >>>>>>>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c >>>>>>>> index b60c30bfb6c7..530ad6503189 100644 >>>>>>>> --- a/drivers/md/raid10.c >>>>>>>> +++ b/drivers/md/raid10.c >>>>>>>> @@ -488,6 +488,7 @@ static void raid10_end_write_request(struct bio *bio) >>>>>>>> dec_rdev = 0; >>>>>>>> if (test_bit(FailFast, &rdev->flags) && >>>>>>>> (bio->bi_opf & MD_FAILFAST)) { >>>>>>>> + set_bit(FailfastIOFailure, &rdev->flags); >>>>>>>> md_error(rdev->mddev, rdev); >>>>>>>> } >>>>>>>> >>>>>>> >>>>>>> Thank you for the patch. There may be an issue with 'test_and_clear'. >>>>>>> >>>>>>> If two write IO go to the same rdev, MD_BROKEN may be set as below: >>>>>> >>>>>>> IO1 IO2 >>>>>>> set FailfastIOFailure >>>>>>> set FailfastIOFailure >>>>>>> md_error >>>>>>> raid1_error >>>>>>> test_and_clear FailfastIOFailur >>>>>>> md_error >>>>>>> raid1_error >>>>>>> //FailfastIOFailur is cleared >>>>>>> set MD_BROKEN >>>>>>> >>>>>>> Maybe we should check whether FailfastIOFailure is already set before >>>>>>> setting it. It also needs to be considered in metadata writes. >>>>>> Thank you for reviewing. >>>>>> >>>>>> I agree, this seems to be as you described. >>>>>> So, should it be implemented as follows? >>>>>> >>>>>> bool old=false; >>>>>> do{ >>>>>> spin_lock_irqsave(&conf->device_lock, flags); >>>>>> old = test_and_set_bit(FailfastIOFailure, &rdev->flags); >>>>>> spin_unlock_irqrestore(&conf->device_lock, flags); >>>>>> }while(old); >>>>>> >>>>>> However, since I am concerned about potential deadlocks, >>>>>> so I am considering two alternative approaches: >>>>>> >>>>>> * Add an atomic_t counter to md_rdev to track failfast IO failures. >>>>>> >>>>>> This may set MD_BROKEN at a slightly incorrect timing, but mixing >>>>>> error handling of Failfast and non-Failfast IOs appears to be rare. >>>>>> In any case, the final outcome would be the same, i.e. the array >>>>>> ends up with MD_BROKEN. Therefore, I think this should not cause >>>>>> issues. I think the same applies to test_and_set_bit. >>>>>> >>>>>> IO1 IO2 IO3 >>>>>> FailfastIOFailure Normal IOFailure FailfastIOFailure >>>>>> atomic_inc >>>>>> md_error atomic_inc >>>>>> raid1_error >>>>>> atomic_dec //2to1 >>>>>> md_error >>>>>> raid1_error md_error >>>>>> atomic_dec //1to0 raid1_error >>>>>> atomic_dec //0 >>>>>> set MD_BROKEN >>>>>> >>>>>> * Alternatively, create a separate error handler, >>>>>> e.g. md_error_failfast(), that clearly does not fail the array. >>>>>> >>>>>> This approach would require somewhat larger changes and may not >>>>>> be very elegant, but it seems to be a reliable way to ensure >>>>>> MD_BROKEN is never set at the wrong timing. >>>>>> >>>>>> Which of these three approaches would you consider preferable? >>>>>> I would appreciate your feedback. >>>>>> >>>>>> >>>>>> For metadata writes, I plan to clear_bit MD_FAILFAST_SUPPORTED >>>>>> when the array is degraded. 
>>>>>> >>>>>> Thanks, >>>>>> Akagi >>>>>> >>>>> >>>>> I took a closer look at the FailFast code and found a few issues, using >>>>> RAID1 as an example: >>>>> >>>>> For normal read/write IO, FailFast is only triggered when there is another >>>>> disk is available, as seen in read_balance() and raid1_write_request(). >>>>> In raid1_error(), MD_BROKEN is set only when no other disks are available. >>>> >>>> Hi, >>>> Agree, I think so too. >>>> >>>>> So, the FailFast for normal read/write is not triggered in the scenario you >>>>> described in cover-letter. >>>> >>>> This corresponds to the case described in the commit message of PATCH v3 1/3. >>>> "Normally, MD_FAILFAST IOs are not issued to the 'last' rdev, so this is >>>> an edge case; however, it can occur if rdevs in a non-degraded >>>> array share the same path and that path is lost, or if a metadata >>>> write is triggered in a degraded array and fails due to failfast." >>>> >>>> To describe it in more detail, the flow is as follows: >>>> >>>> Prerequisites: >>>> >>>> - Both rdevs are in-sync >>>> - Both rdevs have in-flight MD_FAILFAST bios >>>> - Both rdevs depend on the same lower-level path >>>> (e.g., nvme-tcp over a single Ethernet interface) >>>> >>>> Sequence: >>>> >>>> - A bios with REQ_FAILFAST_DEV fails (e.g., due to a temporary network outage), >>>> in the case of nvme-tcp: >>>> - The Ethernet connection is lost on the node where md is running over 5 seconds >>>> - Then the connection is restored. Idk the details of nvme-tcp implementation, >>>> but it seems that failfast IOs finish only after the connection is back. >>>> - All failfast bios fail, raid1_end_write_request is called. >>>> - md_error() marks one rdev Faulty; the other rdev becomes the 'last' rdev. >>>> - md_error() on the last rdev sets MD_BROKEN on the array - fail_last_dev=1 is unlikely. >>>> - The write is retried via handle_write_finished -> narrow_write_error, usually succeeding. >>>> - MD_BROKEN remains set, leaving the array in a state where no further writes can occur. >>>> >>> >>> Thanks for your patient explanation. I understand. Maybe we need a separate >>> error-handling path for failfast. How about adding an extra parameter to md_error()? >> >> Thank you for reviewing. >> >> I am thinking of proceeding like as follows. >> md_error is EXPORT_SYMBOL. I think that it is undesirable to change the ABI of this function. >> > > It doesn't matter if it's a exported symbol, we should just keep code as > simple as possible. >> ... 
>> diff --git a/drivers/md/md.c b/drivers/md/md.c >> index ac85ec73a409..855cddeb0c09 100644 >> --- a/drivers/md/md.c >> +++ b/drivers/md/md.c >> @@ -8197,3 +8197,3 @@ EXPORT_SYMBOL(md_unregister_thread); >> >> -void md_error(struct mddev *mddev, struct md_rdev *rdev) >> +void _md_error(struct mddev *mddev, struct md_rdev *rdev, bool nofail) >> { >> @@ -8204,3 +8204,3 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev) >> return; >> - mddev->pers->error_handler(mddev, rdev); >> + mddev->pers->error_handler(mddev, rdev, nofail); >> >> @@ -8222,4 +8222,26 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev) >> } >> + >> +void md_error(struct mddev *mddev, struct md_rdev *rdev) >> +{ >> + return _md_error(mddev, rdev, false); >> +} >> EXPORT_SYMBOL(md_error); >> >> +void md_error_failfast(struct mddev *mddev, struct md_rdev *rdev) >> +{ >> + WARN_ON(mddev->pers->head.id != ID_RAID1 && >> + mddev->pers->head.id != ID_RAID10); >> + return _md_error(mddev, rdev, true); >> +} >> +EXPORT_SYMBOL(md_error_failfast); >> + > > I will prefer we add a common procedures to fix this problme. > > How about the first patch to serialize all the md_error(), and then > and a new helper md_bio_failue_error(mddev, rdev, bio), called when > bio is failed, in this helper: > > 1) if bio is not failfast, call md_error() and return true; otherwise: > 2) if rdev contain the last data copy, return false directly, caller > should check return value and retry, otherwise: > 3) call md_error() and return true; Hi, I think this approach has some issues. There are cases where md_error is called only when MD_FAILFAST is set. One example is the processing below in raid1_end_write_request. > Then, for raid1, the callers will look like: > > iff --git a/drivers/md/md.c b/drivers/md/md.c > index 1baaf52c603c..c6d150e9f1a7 100644 > --- a/drivers/md/md.c > +++ b/drivers/md/md.c > @@ -1003,9 +1003,7 @@ static void super_written(struct bio *bio) > if (bio->bi_status) { > pr_err("md: %s gets error=%d\n", __func__, > blk_status_to_errno(bio->bi_status)); > - md_error(mddev, rdev); > - if (!test_bit(Faulty, &rdev->flags) > - && (bio->bi_opf & MD_FAILFAST)) { > + if (!md_bio_failure_error(mddev, rdev, bio)) { > set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags); > set_bit(LastDev, &rdev->flags); > } > > @@ -466,20 +472,11 @@ static void raid1_end_write_request(struct bio *bio) > set_bit(MD_RECOVERY_NEEDED, & > conf->mddev->recovery); > > - if (test_bit(FailFast, &rdev->flags) && > - (bio->bi_opf & MD_FAILFAST) && > /* We never try FailFast to WriteMostly devices */ > - !test_bit(WriteMostly, &rdev->flags)) { > - md_error(r1_bio->mddev, rdev); > - } > - > - /* > - * When the device is faulty, it is not necessary to > - * handle write error. > - */ > - if (!test_bit(Faulty, &rdev->flags)) > + if(!test_bit(WriteMostly, &rdev->flags) && > + !md_bio_failure_error(mddev, rdev, bio)) { > set_bit(R1BIO_WriteError, &r1_bio->state); > - else { > + } else { > /* Finished with this branch */ > r1_bio->bios[mirror] = NULL; > to_put = bio; In the current raid1_end_write_request implementation, - md_error is called only in the Failfast case. - Afterwards, if the rdev is not Faulty (that is, not Failfast, or Failfast but the last rdev — which originally was not expected MD_BROKEN in RAID1), R1BIO_WriteError is set. In the suggested implementation, it seems that a non-Failfast write failure will immediately mark the rdev as Faulty, without retries. 
This could be avoided by testing MD_FAILFAST before call the new helper md_bio_failure_error, but I believe duplicating the same check in both caller/callee would be undesirable. Should we try to avoid modifying pers->error_handler? One possible alternative approach is as follows. - serialize calls to md_error regardless of whether Failfast or not - raid{1,10}_error is: - The remaining copy (rdev) is marked with the LastDev flag - clear MD_FAILFAST_SUPPORTED for prohibit super_write using Failfast - super_written will simply put MD_SB_NEED_REWRITE without calling md_error when MD_FAILFAST bio and LastDev rdev. After the changes, I believe it is rare for super_written to be called with error on multiple rdevs due to failfast. super_write is caused by errors from normal failfast IO and invoked via MD_SB_CHANGE_DEVS through the serialized raid1_error. Since MD_FAILFAST_SUPPORTED is cleared, metadata writes occur without failfast. It's not exactly a common procedure, but as it doesn't add functions to md.c, I think this approach is preferable to adding md_error_failfast(). ... diff --git a/drivers/md/md.c b/drivers/md/md.c index 1baaf52c603c..ba524fa96091 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -1003,14 +1003,15 @@ static void super_written(struct bio *bio) if (bio->bi_status) { pr_err("md: %s gets error=%d\n", __func__, blk_status_to_errno(bio->bi_status)); - md_error(mddev, rdev); - if (!test_bit(Faulty, &rdev->flags) + if (test_bit(LastDev, &rdev->flags) && (bio->bi_opf & MD_FAILFAST)) { + pr_warn("md: %s: Metadata write will be repeated to %pg\n", + mdname(mddev), rdev->bdev); set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags); - set_bit(LastDev, &rdev->flags); + } else { + md_error(mddev, rdev); } - } else - clear_bit(LastDev, &rdev->flags); + } bio_put(bio); diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 408c26398321..a52c5277add7 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -470,7 +470,7 @@ static void raid1_end_write_request(struct bio *bio) (bio->bi_opf & MD_FAILFAST) && /* We never try FailFast to WriteMostly devices */ !test_bit(WriteMostly, &rdev->flags)) { - md_error(r1_bio->mddev, rdev); + raid1_md_error_failfast(r1_bio->mddev, rdev); } /* @@ -1733,6 +1733,27 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev) seq_printf(seq, "]"); } +static void _raid1_md_error(struct mddev *mddev, struct md_rdev *rdev, bool failfast){ + struct r1conf *conf = mddev->private; + unsigned long flags; + + spin_lock_irqsave(&conf->new_lock_for_md_error, flags); + if (failfast) + set_bit(FailfastIOFailure, &rdev->flags); + md_error(mddev, rdev); + if (failfast) + WARN_ON(!test_and_clear_bit(FailfastIOFailure, &rdev->flags)); + spin_unlock_irqrestore(&conf->new_lock_for_md_error, flags); +} + +static void raid1_md_error(struct mddev *mddev, struct md_rdev *rdev){ + return _raid1_md_error(mddev, rdev, false); +} + +static void raid1_md_error_failfast(struct mddev *mddev, struct md_rdev *rdev){ + return _raid1_md_error(mddev, rdev, true); +} + /** * raid1_error() - RAID1 error handler. * @mddev: affected md device. 
@@ -1758,6 +1783,13 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) if (test_bit(In_sync, &rdev->flags) && (conf->raid_disks - mddev->degraded) == 1) { + if (test_bit(FailfastIOFailure, &rdev->flags)) { + spin_unlock_irqrestore(&conf->device_lock, flags); + pr_warn_ratelimited("md/raid1:%s: Failfast IO failure on %pg, " + "last device but ignoring it\n", + mdname(mddev), rdev->bdev); + return; + } set_bit(MD_BROKEN, &mddev->flags); if (!mddev->fail_last_dev) { @@ -1767,8 +1799,16 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) } } set_bit(Blocked, &rdev->flags); - if (test_and_clear_bit(In_sync, &rdev->flags)) + if (test_and_clear_bit(In_sync, &rdev->flags)) { mddev->degraded++; + clear_bit(MD_FAILFAST_SUPPORTED, &mddev->flags); + for (i = 0; i < conf->raid_disks; i++) { + struct md_rdev *rdev2 = conf->mirrors[i].rdev; + if (rdev2 && rdev != rdev2 && + test_bit(In_sync, &rdev2->flags)) + set_bit(LastDev, &rdev2->flags); + } + } set_bit(Faulty, &rdev->flags); spin_unlock_irqrestore(&conf->device_lock, flags); /* @@ -2118,7 +2158,7 @@ static int r1_sync_page_io(struct md_rdev *rdev, sector_t sector, } /* need to record an error - either for the block or the device */ if (!rdev_set_badblocks(rdev, sector, sectors, 0)) - md_error(rdev->mddev, rdev); + raid1_md_error(rdev->mddev, rdev); return 0; } ... Thanks, Akagi > @@ -2630,7 +2627,6 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio) > */ > > bio = r1_bio->bios[r1_bio->read_disk]; > - bio_put(bio); > r1_bio->bios[r1_bio->read_disk] = NULL; > > rdev = conf->mirrors[r1_bio->read_disk].rdev; > @@ -2639,19 +2635,18 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio) > freeze_array(conf, 1); > fix_read_error(conf, r1_bio); > unfreeze_array(conf); > - } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) { > - md_error(mddev, rdev); > - } else { > + } else if (mddev->ro == 0 && > + !md_bio_failure_error(mddev, rdev, bio)) { > r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED; > } > > + bio_put(bio); > rdev_dec_pending(rdev, conf->mddev); > sector = r1_bio->sector; > - bio = r1_bio->master_bio; > > /* Reuse the old r1_bio so that the IO_BLOCKED settings are preserved */ > r1_bio->state = 0; > - raid1_read_request(mddev, bio, r1_bio->sectors, r1_bio); > + raid1_read_request(mddev, r1_bio->maxter_bio, r1_bio->sectors, r1_bio); > allow_barrier(conf, sector); > } > > >> /* seq_file implementation /proc/mdstat */ >> diff --git a/drivers/md/md.h b/drivers/md/md.h >> index 51af29a03079..6ca1aea630ce 100644 >> --- a/drivers/md/md.h >> +++ b/drivers/md/md.h >> @@ -758,3 +758,3 @@ struct md_personality >> */ >> - void (*error_handler)(struct mddev *mddev, struct md_rdev *rdev); >> + void (*error_handler)(struct mddev *mddev, struct md_rdev *rdev, bool nofail); >> int (*hot_add_disk) (struct mddev *mddev, struct md_rdev *rdev); >> @@ -903,3 +903,5 @@ extern void md_write_end(struct mddev *mddev); >> extern void md_done_sync(struct mddev *mddev, int blocks, int ok); >> +void _md_error(struct mddev *mddev, struct md_rdev *rdev, bool nofail); >> extern void md_error(struct mddev *mddev, struct md_rdev *rdev); >> +extern void md_error_failfast(struct mddev *mddev, struct md_rdev *rdev); >> extern void md_finish_reshape(struct mddev *mddev); >> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c >> index f1d8811a542a..8aea51227a96 100644 >> --- a/drivers/md/raid0.c >> +++ b/drivers/md/raid0.c >> @@ -637,3 +637,4 @@ static void raid0_status(struct seq_file *seq, struct 
mddev *mddev) >> >> -static void raid0_error(struct mddev *mddev, struct md_rdev *rdev) >> +static void raid0_error(struct mddev *mddev, struct md_rdev *rdev, >> + bool nofail __maybe_unused) >> { >> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c >> index 408c26398321..d93275899e9e 100644 >> --- a/drivers/md/raid1.c >> +++ b/drivers/md/raid1.c >> @@ -1739,2 +1739,3 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev) >> * @rdev: member device to fail. >> + * @nofail: @mdev and @rdev must not fail even if @rdev is the last when @nofail set >> * >> @@ -1748,6 +1749,8 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev) >> * >> - * @rdev is marked as &Faulty excluding case when array is failed and >> - * &mddev->fail_last_dev is off. >> + * @rdev is marked as &Faulty excluding any cases: >> + * - when @mddev is failed and &mddev->fail_last_dev is off >> + * - when @rdev is last device and @nofail is true >> */ >> -static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) >> +static void raid1_error(struct mddev *mddev, struct md_rdev *rdev, >> + bool nofail) >> { >> @@ -1760,2 +1763,9 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) >> (conf->raid_disks - mddev->degraded) == 1) { >> + if (nofail) { >> + spin_unlock_irqrestore(&conf->device_lock, flags); >> + pr_warn_ratelimited("md/raid1:%s: IO failure on %pg, " >> + "last device but ignoring it\n", >> + mdname(mddev), rdev->bdev); >> + return; >> + } >> set_bit(MD_BROKEN, &mddev->flags); >> ... >> >>> Kuai, do you have any suggestions? >>> >>>>> Normal writes may call md_error() in narrow_write_error. Normal reads do >>>>> not execute md_error() on the last disk. >>>>> >>>>> Perhaps you should get more information to confirm how MD_BROKEN is set in >>>>> normal read/write IO. >>>> >>>> Should I add the above sequence of events to the cover letter, or commit message? >>>> >>> >>> I think we should mention this in the commit message. >> >> Understood. I will explicitly describe this in the commit message in v4. >> >> Thanks, >> Akagi >> >>>> Thanks, >>>> Akagi >>>> >>>>> -- >>>>> Thanks, >>>>> Nan >>>>> >>>>> >>>> >>>> >>>> . >>> >>> -- >>> Thanks, >>> Nan >>> >>> >> >> . >> > > ^ permalink raw reply related [flat|nested] 13+ messages in thread
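The raid1_error() hunk with the FailfastIOFailure check is quoted several times in this thread, but the raid10 side is not shown here. A sketch of how the analogous check might look in raid10_error() follows; it mirrors the quoted raid1 hunk and is not taken from the posted series. enough() is raid10's existing helper that reports whether all data would remain available without @rdev.

	/* sketch only: analogous last-copy handling for raid10_error() */
	static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
	{
		struct r10conf *conf = mddev->private;
		unsigned long flags;

		spin_lock_irqsave(&conf->device_lock, flags);
		if (test_bit(In_sync, &rdev->flags) &&
		    !enough(conf, rdev->raid_disk)) {
			if (test_and_clear_bit(FailfastIOFailure, &rdev->flags)) {
				/* failfast failure on the last copy: keep the
				 * array writable and let the caller retry */
				spin_unlock_irqrestore(&conf->device_lock, flags);
				pr_warn_ratelimited("md/raid10:%s: Failfast IO failure on %pg, "
						    "last device but ignoring it\n",
						    mdname(mddev), rdev->bdev);
				return;
			}
			set_bit(MD_BROKEN, &mddev->flags);
			if (!mddev->fail_last_dev) {
				spin_unlock_irqrestore(&conf->device_lock, flags);
				return;
			}
		}
		/* ... existing Faulty/degraded bookkeeping continues here ... */
		spin_unlock_irqrestore(&conf->device_lock, flags);
	}

Under the serialized scheme sketched in the message above, the test_and_clear_bit() here would become a plain test_bit(), with the set and clear done in the per-personality wrapper around md_error().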
* Re: [PATCH v3 1/3] md/raid1,raid10: Do not set MD_BROKEN on failfast io failure 2025-09-01 16:48 ` Kenta Akagi @ 2025-09-05 15:07 ` Kenta Akagi 0 siblings, 0 replies; 13+ messages in thread From: Kenta Akagi @ 2025-09-05 15:07 UTC (permalink / raw) To: Yu Kuai, Li Nan, Song Liu, Mariusz Tkaczyk, Guoqing Jiang Cc: linux-raid, linux-kernel, yukuai (C), Kenta Akagi On 2025/09/02 1:48, Kenta Akagi wrote: > Hi, By the way, I found two more issues. This can be reproduced by causing a Failfast IO failure using the attached procedure. I wrote patches for them. Would it make sense to send these as v4? * raid1,raid10: IO error returned even when all writes succeed in narrow_write_error When only one rdev remains, a failfast write bio is retried in narrow_write_error() via handle_write_completed(). Even if all retries succeed, R1BIO_Uptodate is not set. As a result, the upper layer get BLK_STS_IOERR. This behavior appears to be intentional, under the assumption of at least one bad sector. However, this behavior is undesirable for failfast retries. In addition, although less likely, even on a normal HDD or SSD, if all split writes succeed, the upper layer would still receive BLK_STS_IOERR despite the write having ultimately completed successfully. Of course, this does not apply if the other rdev writes succeed. I'm not exactly sure how to handle the following point. - Can a retried write be marked R1BIO_Uptodate after one failure in non-failfast? - Should it only be marked Uptodate for failfast? - Should failfast retries be handled outside narrow_write_error? For now, I attempt to set R1BIO_Uptodate whenever narrow_write_error() succeeds and there are no bad blocks, regardless of MD_FAILFAST. * raid10: no retry scheduled for failfast read failure raid10_end_read_request lacks a path to retry when a FailFast IO fails. As a result, when Failfast Read IOs fail on all rdevs, the upper layer receives BLK_STS_IOERR without a retry, Looking at the two commits below, it seems only raid10_end_read_request lacks the failfast read retry handling, while raid1_end_read_request has it. In RAID1, the retry works as expected. I don't know why raid10_end_read_request lacks this, but it is probably just a simple oversight. - 8d3ca83dcf9c ("md/raid10: add failfast handling for reads.") - 2e52d449bcec ("md/raid1: add failfast handling for reads.") Here's how to reproduce : # prepare nvmet/nvme-tcp and md array sh-5.2# cat << 'EOF' > loopback-nvme.sh > set -eu nqn="nqn.2025-08.io.example:nvmet-test-$1" back=$2 cd /sys/kernel/config/nvmet/ mkdir subsystems/$nqn echo 1 > subsystems/${nqn}/attr_allow_any_host mkdir subsystems/${nqn}/namespaces/1 echo -n ${back} > subsystems/${nqn}/namespaces/1/device_path echo 1 > subsystems/${nqn}/namespaces/1/enable ports="ports/1" if [ ! -d $ports ]; then mkdir $ports cd $ports echo 127.0.0.1 > addr_traddr echo tcp > addr_trtype echo 4420 > addr_trsvcid echo ipv4 > addr_adrfam cd ../../ fi ln -s /sys/kernel/config/nvmet/subsystems/${nqn} ${ports}/subsystems/ nvme connect -t tcp -n $nqn -a 127.0.0.1 -s 4420 > EOF sh-5.2# chmod +x loopback-nvme.sh sh-5.2# modprobe -a nvme-tcp nvmet-tcp sh-5.2# truncate -s 1g a.img b.img sh-5.2# losetup --show -f a.img /dev/loop0 sh-5.2# losetup --show -f b.img /dev/loop1 sh-5.2# ./loopback-nvme.sh 0 /dev/loop0 connecting to device: nvme0 sh-5.2# ./loopback-nvme.sh 1 /dev/loop1 connecting to device: nvme1 sh-5.2# mdadm --create --verbose /dev/md0 --level=1 --raid-devices=2 \ --failfast /dev/nvme0n1 --failfast /dev/nvme1n1 ... mdadm: array /dev/md0 started. 
# run fio sh-5.2# fio --name=test --filename=/dev/md0 --rw=randrw --rwmixread=50 \ --bs=4k --numjobs=9 --time_based --runtime=300s --group_reporting --direct=1 # It can reproduce the issue by block nvme traffic during fio sh-5.2# iptables -A INPUT -m tcp -p tcp --dport 4420 -j DROP; sh-5.2# sleep 10; # twice the default KATO value sh-5.2# iptables -D INPUT -m tcp -p tcp --dport 4420 -j DROP diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 408c26398321..ce4dff63f50a 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c static bool narrow_write_error(struct r1bio *r1_bio, int i) { struct mddev *mddev = r1_bio->mddev; @@ -2519,13 +2521,16 @@ static bool narrow_write_error(struct r1bio *r1_bio, int i) sector_t sector; int sectors; int sect_to_write = r1_bio->sectors; - bool ok = true; + bool write_ok = true; + bool setbad_ok = true; + bool bbl_enabled = !(rdev->badblocks.shift < 0); - if (rdev->badblocks.shift < 0) - return false; + if (bbl_enabled) + block_sectors = roundup(1 << rdev->badblocks.shift, + bdev_logical_block_size(rdev->bdev) >> 9); + else + block_sectors = 1; - block_sectors = roundup(1 << rdev->badblocks.shift, - bdev_logical_block_size(rdev->bdev) >> 9); sector = r1_bio->sector; sectors = ((sector + block_sectors) & ~(sector_t)(block_sectors - 1)) @@ -2543,18 +2554,25 @@ static bool narrow_write_error(struct r1bio *r1_bio, int i) bio_trim(wbio, sector - r1_bio->sector, sectors); wbio->bi_iter.bi_sector += rdev->data_offset; - if (submit_bio_wait(wbio) < 0) + if (submit_bio_wait(wbio) < 0) { /* failure! */ - ok = rdev_set_badblocks(rdev, sector, - sectors, 0) - && ok; + write_ok = false; + if (bbl_enabled) + setbad_ok = rdev_set_badblocks(rdev, sector, + sectors, 0) + && setbad_ok; + } bio_put(wbio); sect_to_write -= sectors; sector += sectors; sectors = block_sectors; } - return ok; + + if (!write_ok && + (!setbad_ok || !bbl_enabled)) + return false; + return true; } static void handle_sync_write_finished(struct r1conf *conf, struct r1bio *r1_bio) @@ -2585,26 +2603,34 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio) int m, idx; bool fail = false; - for (m = 0; m < conf->raid_disks * 2 ; m++) - if (r1_bio->bios[m] == IO_MADE_GOOD) { - struct md_rdev *rdev = conf->mirrors[m].rdev; + for (m = 0; m < conf->raid_disks * 2 ; m++) { + struct md_rdev *rdev = conf->mirrors[m].rdev; + struct bio *bio = r1_bio->bios[m]; + if (bio == IO_MADE_GOOD) { rdev_clear_badblocks(rdev, r1_bio->sector, r1_bio->sectors, 0); rdev_dec_pending(rdev, conf->mddev); - } else if (r1_bio->bios[m] != NULL) { + } else if (bio != NULL) { /* This drive got a write error. We need to * narrow down and record precise write * errors. 
*/ fail = true; - if (!narrow_write_error(r1_bio, m)) - md_error(conf->mddev, - conf->mirrors[m].rdev); + if (!narrow_write_error(r1_bio, m)){ + md_error(conf->mddev, rdev); /* an I/O failed, we can't clear the bitmap */ - rdev_dec_pending(conf->mirrors[m].rdev, - conf->mddev); + } else if(test_bit(In_sync, &rdev->flags) && + !test_bit(Faulty, &rdev->flags) && + rdev_has_badblock(rdev, + r1_bio->sector, + r1_bio->sectors) == 0) { + set_bit(R1BIO_Uptodate, &r1_bio->state); + } + rdev_dec_pending(rdev, conf->mddev); } + } if (fail) { spin_lock_irq(&conf->device_lock); list_add(&r1_bio->retry_list, &conf->bio_end_io_list); diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c index b60c30bfb6c7..7145daf1543b 100644 --- a/drivers/md/raid10.c +++ b/drivers/md/raid10.c @@ -399,6 +399,11 @@ static void raid10_end_read_request(struct bio *bio) * wait for the 'master' bio. */ set_bit(R10BIO_Uptodate, &r10_bio->state); + } else if (test_bit(FailFast, &rdev->flags) && + test_bit(R10BIO_FailFast, &r10_bio->state)) { + /* This was a fail-fast read so we definitely + * want to retry */ + ; } else if (!raid1_should_handle_error(bio)) { uptodate = 1; } else { > > On 2025/09/01 16:48, Yu Kuai wrote: >> Hi, >> >> 在 2025/09/01 12:22, Kenta Akagi 写道: >>> >>> >>> On 2025/09/01 12:22, Li Nan wrote: >>>> >>>> >>>> 在 2025/8/31 2:10, Kenta Akagi 写道: >>>>> >>>>> >>>>> On 2025/08/30 17:48, Li Nan wrote: >>>>>> >>>>>> >>>>>> 在 2025/8/29 20:21, Kenta Akagi 写道: >>>>>>> >>>>>>> >>>>>>> On 2025/08/29 11:54, Li Nan wrote: >>>>>>>> >>>>>>>> 在 2025/8/29 0:32, Kenta Akagi 写道: >>>>>>>>> This commit ensures that an MD_FAILFAST IO failure does not put >>>>>>>>> the array into a broken state. >>>>>>>>> >>>>>>>>> When failfast is enabled on rdev in RAID1 or RAID10, >>>>>>>>> the array may be flagged MD_BROKEN in the following cases. >>>>>>>>> - If MD_FAILFAST IOs to multiple rdevs fail simultaneously >>>>>>>>> - If an MD_FAILFAST metadata write to the 'last' rdev fails >>>>>>>> >>>>>>>> [...] >>>>>>>> >>>>>>>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c >>>>>>>>> index 408c26398321..8a61fd93b3ff 100644 >>>>>>>>> --- a/drivers/md/raid1.c >>>>>>>>> +++ b/drivers/md/raid1.c >>>>>>>>> @@ -470,6 +470,7 @@ static void raid1_end_write_request(struct bio *bio) >>>>>>>>> (bio->bi_opf & MD_FAILFAST) && >>>>>>>>> /* We never try FailFast to WriteMostly devices */ >>>>>>>>> !test_bit(WriteMostly, &rdev->flags)) { >>>>>>>>> + set_bit(FailfastIOFailure, &rdev->flags); >>>>>>>>> md_error(r1_bio->mddev, rdev); >>>>>>>>> } >>>>>>>>> @@ -1746,8 +1747,12 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev) >>>>>>>>> * - recovery is interrupted. >>>>>>>>> * - &mddev->degraded is bumped. >>>>>>>>> * >>>>>>>>> - * @rdev is marked as &Faulty excluding case when array is failed and >>>>>>>>> - * &mddev->fail_last_dev is off. >>>>>>>>> + * If @rdev has &FailfastIOFailure and it is the 'last' rdev, >>>>>>>>> + * then @mddev and @rdev will not be marked as failed. 
>>>>>>>>> + * >>>>>>>>> + * @rdev is marked as &Faulty excluding any cases: >>>>>>>>> + * - when @mddev is failed and &mddev->fail_last_dev is off >>>>>>>>> + * - when @rdev is last device and &FailfastIOFailure flag is set >>>>>>>>> */ >>>>>>>>> static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) >>>>>>>>> { >>>>>>>>> @@ -1758,6 +1763,13 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) >>>>>>>>> if (test_bit(In_sync, &rdev->flags) && >>>>>>>>> (conf->raid_disks - mddev->degraded) == 1) { >>>>>>>>> + if (test_and_clear_bit(FailfastIOFailure, &rdev->flags)) { >>>>>>>>> + spin_unlock_irqrestore(&conf->device_lock, flags); >>>>>>>>> + pr_warn_ratelimited("md/raid1:%s: Failfast IO failure on %pg, " >>>>>>>>> + "last device but ignoring it\n", >>>>>>>>> + mdname(mddev), rdev->bdev); >>>>>>>>> + return; >>>>>>>>> + } >>>>>>>>> set_bit(MD_BROKEN, &mddev->flags); >>>>>>>>> if (!mddev->fail_last_dev) { >>>>>>>>> @@ -2148,6 +2160,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio) >>>>>>>>> if (test_bit(FailFast, &rdev->flags)) { >>>>>>>>> /* Don't try recovering from here - just fail it >>>>>>>>> * ... unless it is the last working device of course */ >>>>>>>>> + set_bit(FailfastIOFailure, &rdev->flags); >>>>>>>>> md_error(mddev, rdev); >>>>>>>>> if (test_bit(Faulty, &rdev->flags)) >>>>>>>>> /* Don't try to read from here, but make sure >>>>>>>>> @@ -2652,6 +2665,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio) >>>>>>>>> fix_read_error(conf, r1_bio); >>>>>>>>> unfreeze_array(conf); >>>>>>>>> } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) { >>>>>>>>> + set_bit(FailfastIOFailure, &rdev->flags); >>>>>>>>> md_error(mddev, rdev); >>>>>>>>> } else { >>>>>>>>> r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED; >>>>>>>>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c >>>>>>>>> index b60c30bfb6c7..530ad6503189 100644 >>>>>>>>> --- a/drivers/md/raid10.c >>>>>>>>> +++ b/drivers/md/raid10.c >>>>>>>>> @@ -488,6 +488,7 @@ static void raid10_end_write_request(struct bio *bio) >>>>>>>>> dec_rdev = 0; >>>>>>>>> if (test_bit(FailFast, &rdev->flags) && >>>>>>>>> (bio->bi_opf & MD_FAILFAST)) { >>>>>>>>> + set_bit(FailfastIOFailure, &rdev->flags); >>>>>>>>> md_error(rdev->mddev, rdev); >>>>>>>>> } >>>>>>>>> >>>>>>>> >>>>>>>> Thank you for the patch. There may be an issue with 'test_and_clear'. >>>>>>>> >>>>>>>> If two write IO go to the same rdev, MD_BROKEN may be set as below: >>>>>>> >>>>>>>> IO1 IO2 >>>>>>>> set FailfastIOFailure >>>>>>>> set FailfastIOFailure >>>>>>>> md_error >>>>>>>> raid1_error >>>>>>>> test_and_clear FailfastIOFailur >>>>>>>> md_error >>>>>>>> raid1_error >>>>>>>> //FailfastIOFailur is cleared >>>>>>>> set MD_BROKEN >>>>>>>> >>>>>>>> Maybe we should check whether FailfastIOFailure is already set before >>>>>>>> setting it. It also needs to be considered in metadata writes. >>>>>>> Thank you for reviewing. >>>>>>> >>>>>>> I agree, this seems to be as you described. >>>>>>> So, should it be implemented as follows? >>>>>>> >>>>>>> bool old=false; >>>>>>> do{ >>>>>>> spin_lock_irqsave(&conf->device_lock, flags); >>>>>>> old = test_and_set_bit(FailfastIOFailure, &rdev->flags); >>>>>>> spin_unlock_irqrestore(&conf->device_lock, flags); >>>>>>> }while(old); >>>>>>> >>>>>>> However, since I am concerned about potential deadlocks, >>>>>>> so I am considering two alternative approaches: >>>>>>> >>>>>>> * Add an atomic_t counter to md_rdev to track failfast IO failures. 
>>>>>>> >>>>>>> This may set MD_BROKEN at a slightly incorrect timing, but mixing >>>>>>> error handling of Failfast and non-Failfast IOs appears to be rare. >>>>>>> In any case, the final outcome would be the same, i.e. the array >>>>>>> ends up with MD_BROKEN. Therefore, I think this should not cause >>>>>>> issues. I think the same applies to test_and_set_bit. >>>>>>> >>>>>>> IO1 IO2 IO3 >>>>>>> FailfastIOFailure Normal IOFailure FailfastIOFailure >>>>>>> atomic_inc >>>>>>> md_error atomic_inc >>>>>>> raid1_error >>>>>>> atomic_dec //2to1 >>>>>>> md_error >>>>>>> raid1_error md_error >>>>>>> atomic_dec //1to0 raid1_error >>>>>>> atomic_dec //0 >>>>>>> set MD_BROKEN >>>>>>> >>>>>>> * Alternatively, create a separate error handler, >>>>>>> e.g. md_error_failfast(), that clearly does not fail the array. >>>>>>> >>>>>>> This approach would require somewhat larger changes and may not >>>>>>> be very elegant, but it seems to be a reliable way to ensure >>>>>>> MD_BROKEN is never set at the wrong timing. >>>>>>> >>>>>>> Which of these three approaches would you consider preferable? >>>>>>> I would appreciate your feedback. >>>>>>> >>>>>>> >>>>>>> For metadata writes, I plan to clear_bit MD_FAILFAST_SUPPORTED >>>>>>> when the array is degraded. >>>>>>> >>>>>>> Thanks, >>>>>>> Akagi >>>>>>> >>>>>> >>>>>> I took a closer look at the FailFast code and found a few issues, using >>>>>> RAID1 as an example: >>>>>> >>>>>> For normal read/write IO, FailFast is only triggered when there is another >>>>>> disk is available, as seen in read_balance() and raid1_write_request(). >>>>>> In raid1_error(), MD_BROKEN is set only when no other disks are available. >>>>> >>>>> Hi, >>>>> Agree, I think so too. >>>>> >>>>>> So, the FailFast for normal read/write is not triggered in the scenario you >>>>>> described in cover-letter. >>>>> >>>>> This corresponds to the case described in the commit message of PATCH v3 1/3. >>>>> "Normally, MD_FAILFAST IOs are not issued to the 'last' rdev, so this is >>>>> an edge case; however, it can occur if rdevs in a non-degraded >>>>> array share the same path and that path is lost, or if a metadata >>>>> write is triggered in a degraded array and fails due to failfast." >>>>> >>>>> To describe it in more detail, the flow is as follows: >>>>> >>>>> Prerequisites: >>>>> >>>>> - Both rdevs are in-sync >>>>> - Both rdevs have in-flight MD_FAILFAST bios >>>>> - Both rdevs depend on the same lower-level path >>>>> (e.g., nvme-tcp over a single Ethernet interface) >>>>> >>>>> Sequence: >>>>> >>>>> - A bios with REQ_FAILFAST_DEV fails (e.g., due to a temporary network outage), >>>>> in the case of nvme-tcp: >>>>> - The Ethernet connection is lost on the node where md is running over 5 seconds >>>>> - Then the connection is restored. Idk the details of nvme-tcp implementation, >>>>> but it seems that failfast IOs finish only after the connection is back. >>>>> - All failfast bios fail, raid1_end_write_request is called. >>>>> - md_error() marks one rdev Faulty; the other rdev becomes the 'last' rdev. >>>>> - md_error() on the last rdev sets MD_BROKEN on the array - fail_last_dev=1 is unlikely. >>>>> - The write is retried via handle_write_finished -> narrow_write_error, usually succeeding. >>>>> - MD_BROKEN remains set, leaving the array in a state where no further writes can occur. >>>>> >>>> >>>> Thanks for your patient explanation. I understand. Maybe we need a separate >>>> error-handling path for failfast. How about adding an extra parameter to md_error()? 
>>> >>> Thank you for reviewing. >>> >>> I am thinking of proceeding as follows. >>> md_error is EXPORT_SYMBOL. I think that it is undesirable to change the ABI of this function. >>> >> >> It doesn't matter if it's an exported symbol, we should just keep the code as >> simple as possible. >>> ... >>> diff --git a/drivers/md/md.c b/drivers/md/md.c >>> index ac85ec73a409..855cddeb0c09 100644 >>> --- a/drivers/md/md.c >>> +++ b/drivers/md/md.c >>> @@ -8197,3 +8197,3 @@ EXPORT_SYMBOL(md_unregister_thread); >>> >>> -void md_error(struct mddev *mddev, struct md_rdev *rdev) >>> +void _md_error(struct mddev *mddev, struct md_rdev *rdev, bool nofail) >>> { >>> @@ -8204,3 +8204,3 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev) >>> return; >>> - mddev->pers->error_handler(mddev, rdev); >>> + mddev->pers->error_handler(mddev, rdev, nofail); >>> >>> @@ -8222,4 +8222,26 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev) >>> } >>> + >>> +void md_error(struct mddev *mddev, struct md_rdev *rdev) >>> +{ >>> + return _md_error(mddev, rdev, false); >>> +} >>> EXPORT_SYMBOL(md_error); >>> >>> +void md_error_failfast(struct mddev *mddev, struct md_rdev *rdev) >>> +{ >>> + WARN_ON(mddev->pers->head.id != ID_RAID1 && >>> + mddev->pers->head.id != ID_RAID10); >>> + return _md_error(mddev, rdev, true); >>> +} >>> +EXPORT_SYMBOL(md_error_failfast); >>> + >> >> I would prefer we add a common procedure to fix this problem. >> >> How about having the first patch serialize all md_error() calls, and then >> adding a new helper md_bio_failure_error(mddev, rdev, bio), called when a >> bio fails; in this helper: >> >> 1) if the bio is not failfast, call md_error() and return true; otherwise: >> 2) if the rdev contains the last data copy, return false directly, and the caller >> should check the return value and retry; otherwise: >> 3) call md_error() and return true; > > Hi, > I think this approach has some issues. There are cases where md_error is > called only when MD_FAILFAST is set. > > One example is the processing below in raid1_end_write_request. > >> Then, for raid1, the callers will look like: >> >> diff --git a/drivers/md/md.c b/drivers/md/md.c >> index 1baaf52c603c..c6d150e9f1a7 100644 >> --- a/drivers/md/md.c >> +++ b/drivers/md/md.c >> @@ -1003,9 +1003,7 @@ static void super_written(struct bio *bio) >> if (bio->bi_status) { >> pr_err("md: %s gets error=%d\n", __func__, >> blk_status_to_errno(bio->bi_status)); >> - md_error(mddev, rdev); >> - if (!test_bit(Faulty, &rdev->flags) >> - && (bio->bi_opf & MD_FAILFAST)) { >> + if (!md_bio_failure_error(mddev, rdev, bio)) { >> set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags); >> set_bit(LastDev, &rdev->flags); >> } >> >> @@ -466,20 +472,11 @@ static void raid1_end_write_request(struct bio *bio) >> set_bit(MD_RECOVERY_NEEDED, & >> conf->mddev->recovery); >> >> - if (test_bit(FailFast, &rdev->flags) && >> - (bio->bi_opf & MD_FAILFAST) && >> /* We never try FailFast to WriteMostly devices */ >> - !test_bit(WriteMostly, &rdev->flags)) { >> - md_error(r1_bio->mddev, rdev); >> - } >> - >> - /* >> - * When the device is faulty, it is not necessary to >> - * handle write error. >> - */ >> - if (!test_bit(Faulty, &rdev->flags)) >> + if (!test_bit(WriteMostly, &rdev->flags) && >> + !md_bio_failure_error(mddev, rdev, bio)) { >> set_bit(R1BIO_WriteError, &r1_bio->state); >> - else { >> + } else { >> /* Finished with this branch */ >> r1_bio->bios[mirror] = NULL; >> to_put = bio; > > In the current raid1_end_write_request implementation, > - md_error is called only in the Failfast case.
> - Afterwards, if the rdev is not Faulty (that is, not Failfast, > or Failfast but the last rdev, which originally was not expected > to set MD_BROKEN in RAID1), R1BIO_WriteError is set. > In the suggested implementation, it seems that a non-Failfast write > failure will immediately mark the rdev as Faulty, without retries. > > This could be avoided by testing MD_FAILFAST before calling the > new helper md_bio_failure_error, but I believe duplicating the > same check in both caller and callee would be undesirable. > > Should we try to avoid modifying pers->error_handler? > One possible alternative approach is as follows. > > - serialize calls to md_error regardless of whether Failfast or not > - in raid{1,10}_error: > - the remaining copy (rdev) is marked with the LastDev flag > - clear MD_FAILFAST_SUPPORTED to prohibit super_write from using Failfast > - super_written will simply set MD_SB_NEED_REWRITE without calling > md_error when the bio is MD_FAILFAST and the rdev is LastDev. > > After the changes, I believe it is rare for super_written to be called with an error on > multiple rdevs due to failfast. A super_write is triggered by errors from normal failfast > IO and invoked via MD_SB_CHANGE_DEVS through the serialized raid1_error. Since > MD_FAILFAST_SUPPORTED is cleared, metadata writes occur without failfast. > > It's not exactly a common procedure, but as it doesn't add functions to md.c, > I think this approach is preferable to adding md_error_failfast(). > > ... > diff --git a/drivers/md/md.c b/drivers/md/md.c > index 1baaf52c603c..ba524fa96091 100644 > --- a/drivers/md/md.c > +++ b/drivers/md/md.c > @@ -1003,14 +1003,15 @@ static void super_written(struct bio *bio) > if (bio->bi_status) { > pr_err("md: %s gets error=%d\n", __func__, > blk_status_to_errno(bio->bi_status)); > - md_error(mddev, rdev); > - if (!test_bit(Faulty, &rdev->flags) > + if (test_bit(LastDev, &rdev->flags) > && (bio->bi_opf & MD_FAILFAST)) { > + pr_warn("md: %s: Metadata write will be repeated to %pg\n", > + mdname(mddev), rdev->bdev); > set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags); > - set_bit(LastDev, &rdev->flags); > + } else { > + md_error(mddev, rdev); > } > - } else > - clear_bit(LastDev, &rdev->flags); > + } > > bio_put(bio); > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c > index 408c26398321..a52c5277add7 100644 > --- a/drivers/md/raid1.c > +++ b/drivers/md/raid1.c > @@ -470,7 +470,7 @@ static void raid1_end_write_request(struct bio *bio) > (bio->bi_opf & MD_FAILFAST) && > /* We never try FailFast to WriteMostly devices */ > !test_bit(WriteMostly, &rdev->flags)) { > - md_error(r1_bio->mddev, rdev); > + raid1_md_error_failfast(r1_bio->mddev, rdev); > } > > /* > @@ -1733,6 +1733,27 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev) > seq_printf(seq, "]"); > } > > +static void _raid1_md_error(struct mddev *mddev, struct md_rdev *rdev, bool failfast){ > + struct r1conf *conf = mddev->private; > + unsigned long flags; > + > + spin_lock_irqsave(&conf->new_lock_for_md_error, flags); > + if (failfast) > + set_bit(FailfastIOFailure, &rdev->flags); > + md_error(mddev, rdev); > + if (failfast) > + WARN_ON(!test_and_clear_bit(FailfastIOFailure, &rdev->flags)); > + spin_unlock_irqrestore(&conf->new_lock_for_md_error, flags); > +} > + > +static void raid1_md_error(struct mddev *mddev, struct md_rdev *rdev){ > + return _raid1_md_error(mddev, rdev, false); > +} > + > +static void raid1_md_error_failfast(struct mddev *mddev, struct md_rdev *rdev){ > + return _raid1_md_error(mddev, rdev, true); > +} > + > /** > * raid1_error()
- RAID1 error handler. > * @mddev: affected md device. > @@ -1758,6 +1783,13 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) > > if (test_bit(In_sync, &rdev->flags) && > (conf->raid_disks - mddev->degraded) == 1) { > + if (test_bit(FailfastIOFailure, &rdev->flags)) { > + spin_unlock_irqrestore(&conf->device_lock, flags); > + pr_warn_ratelimited("md/raid1:%s: Failfast IO failure on %pg, " > + "last device but ignoring it\n", > + mdname(mddev), rdev->bdev); > + return; > + } > set_bit(MD_BROKEN, &mddev->flags); > > if (!mddev->fail_last_dev) { > @@ -1767,8 +1799,16 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) > } > } > set_bit(Blocked, &rdev->flags); > - if (test_and_clear_bit(In_sync, &rdev->flags)) > + if (test_and_clear_bit(In_sync, &rdev->flags)) { > mddev->degraded++; > + clear_bit(MD_FAILFAST_SUPPORTED, &mddev->flags); > + for (i = 0; i < conf->raid_disks; i++) { > + struct md_rdev *rdev2 = conf->mirrors[i].rdev; > + if (rdev2 && rdev != rdev2 && > + test_bit(In_sync, &rdev2->flags)) > + set_bit(LastDev, &rdev2->flags); > + } > + } > set_bit(Faulty, &rdev->flags); > spin_unlock_irqrestore(&conf->device_lock, flags); > /* > @@ -2118,7 +2158,7 @@ static int r1_sync_page_io(struct md_rdev *rdev, sector_t sector, > } > /* need to record an error - either for the block or the device */ > if (!rdev_set_badblocks(rdev, sector, sectors, 0)) > - md_error(rdev->mddev, rdev); > + raid1_md_error(rdev->mddev, rdev); > return 0; > } > ... > > > Thanks, > Akagi > >> @@ -2630,7 +2627,6 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio) >> */ >> >> bio = r1_bio->bios[r1_bio->read_disk]; >> - bio_put(bio); >> r1_bio->bios[r1_bio->read_disk] = NULL; >> >> rdev = conf->mirrors[r1_bio->read_disk].rdev; >> @@ -2639,19 +2635,18 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio) >> freeze_array(conf, 1); >> fix_read_error(conf, r1_bio); >> unfreeze_array(conf); >> - } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) { >> - md_error(mddev, rdev); >> - } else { >> + } else if (mddev->ro == 0 && >> + !md_bio_failure_error(mddev, rdev, bio)) { >> r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED; >> } >> >> + bio_put(bio); >> rdev_dec_pending(rdev, conf->mddev); >> sector = r1_bio->sector; >> - bio = r1_bio->master_bio; >> >> /* Reuse the old r1_bio so that the IO_BLOCKED settings are preserved */ >> r1_bio->state = 0; >> - raid1_read_request(mddev, bio, r1_bio->sectors, r1_bio); >> + raid1_read_request(mddev, r1_bio->maxter_bio, r1_bio->sectors, r1_bio); >> allow_barrier(conf, sector); >> } >> >> >>> /* seq_file implementation /proc/mdstat */ >>> diff --git a/drivers/md/md.h b/drivers/md/md.h >>> index 51af29a03079..6ca1aea630ce 100644 >>> --- a/drivers/md/md.h >>> +++ b/drivers/md/md.h >>> @@ -758,3 +758,3 @@ struct md_personality >>> */ >>> - void (*error_handler)(struct mddev *mddev, struct md_rdev *rdev); >>> + void (*error_handler)(struct mddev *mddev, struct md_rdev *rdev, bool nofail); >>> int (*hot_add_disk) (struct mddev *mddev, struct md_rdev *rdev); >>> @@ -903,3 +903,5 @@ extern void md_write_end(struct mddev *mddev); >>> extern void md_done_sync(struct mddev *mddev, int blocks, int ok); >>> +void _md_error(struct mddev *mddev, struct md_rdev *rdev, bool nofail); >>> extern void md_error(struct mddev *mddev, struct md_rdev *rdev); >>> +extern void md_error_failfast(struct mddev *mddev, struct md_rdev *rdev); >>> extern void md_finish_reshape(struct mddev *mddev); >>> diff --git a/drivers/md/raid0.c 
b/drivers/md/raid0.c >>> index f1d8811a542a..8aea51227a96 100644 >>> --- a/drivers/md/raid0.c >>> +++ b/drivers/md/raid0.c >>> @@ -637,3 +637,4 @@ static void raid0_status(struct seq_file *seq, struct mddev *mddev) >>> >>> -static void raid0_error(struct mddev *mddev, struct md_rdev *rdev) >>> +static void raid0_error(struct mddev *mddev, struct md_rdev *rdev, >>> + bool nofail __maybe_unused) >>> { >>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c >>> index 408c26398321..d93275899e9e 100644 >>> --- a/drivers/md/raid1.c >>> +++ b/drivers/md/raid1.c >>> @@ -1739,2 +1739,3 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev) >>> * @rdev: member device to fail. >>> + * @nofail: @mdev and @rdev must not fail even if @rdev is the last when @nofail set >>> * >>> @@ -1748,6 +1749,8 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev) >>> * >>> - * @rdev is marked as &Faulty excluding case when array is failed and >>> - * &mddev->fail_last_dev is off. >>> + * @rdev is marked as &Faulty excluding any cases: >>> + * - when @mddev is failed and &mddev->fail_last_dev is off >>> + * - when @rdev is last device and @nofail is true >>> */ >>> -static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) >>> +static void raid1_error(struct mddev *mddev, struct md_rdev *rdev, >>> + bool nofail) >>> { >>> @@ -1760,2 +1763,9 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) >>> (conf->raid_disks - mddev->degraded) == 1) { >>> + if (nofail) { >>> + spin_unlock_irqrestore(&conf->device_lock, flags); >>> + pr_warn_ratelimited("md/raid1:%s: IO failure on %pg, " >>> + "last device but ignoring it\n", >>> + mdname(mddev), rdev->bdev); >>> + return; >>> + } >>> set_bit(MD_BROKEN, &mddev->flags); >>> ... >>> >>>> Kuai, do you have any suggestions? >>>> >>>>>> Normal writes may call md_error() in narrow_write_error. Normal reads do >>>>>> not execute md_error() on the last disk. >>>>>> >>>>>> Perhaps you should get more information to confirm how MD_BROKEN is set in >>>>>> normal read/write IO. >>>>> >>>>> Should I add the above sequence of events to the cover letter, or commit message? >>>>> >>>> >>>> I think we should mention this in the commit message. >>> >>> Understood. I will explicitly describe this in the commit message in v4. >>> >>> Thanks, >>> Akagi >>> >>>>> Thanks, >>>>> Akagi >>>>> >>>>>> -- >>>>>> Thanks, >>>>>> Nan >>>>>> >>>>>> >>>>> >>>>> >>>>> . >>>> >>>> -- >>>> Thanks, >>>> Nan >>>> >>>> >>> >>> . >>> >> >> > ^ permalink raw reply related [flat|nested] 13+ messages in thread
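To make the interleaving discussed in this sub-thread easier to follow, here is a minimal userspace model (plain C, not kernel code; every name in it is illustrative rather than an actual md symbol) of the two-failfast-write race Li Nan describes: with a single FailfastIOFailure-style bit, the first error-handler call consumes the bit with a test-and-clear, so the second call no longer sees it and marks the array broken, whereas a per-rdev counter survives the same ordering.

#include <stdbool.h>
#include <stdio.h>

struct model_rdev {
	bool failfast_flag;	/* models the FailfastIOFailure rdev bit */
	int  failfast_count;	/* models the proposed per-rdev counter   */
};

static bool broken_with_flag;
static bool broken_with_counter;

/* models the error handler on the last in-sync device, bit variant */
static void error_handler_flag(struct model_rdev *rdev)
{
	if (rdev->failfast_flag) {	/* test_and_clear_bit() */
		rdev->failfast_flag = false;
		return;			/* failfast failure: do not break the array */
	}
	broken_with_flag = true;	/* set_bit(MD_BROKEN) */
}

/* models the error handler on the last in-sync device, counter variant */
static void error_handler_counter(struct model_rdev *rdev)
{
	if (rdev->failfast_count > 0) {
		rdev->failfast_count--;
		return;
	}
	broken_with_counter = true;
}

int main(void)
{
	struct model_rdev rdev = { 0 };

	/* IO1 and IO2 both fail with MD_FAILFAST before either handler runs */
	rdev.failfast_flag = true;	/* IO1: set_bit() */
	rdev.failfast_flag = true;	/* IO2: set_bit() is a no-op */
	rdev.failfast_count++;		/* IO1: atomic_inc() */
	rdev.failfast_count++;		/* IO2: atomic_inc() */

	/* both md_error()-style calls are then handled back to back */
	error_handler_flag(&rdev);	/* clears the only bit */
	error_handler_flag(&rdev);	/* sees no bit -> array marked broken */

	error_handler_counter(&rdev);	/* 2 -> 1 */
	error_handler_counter(&rdev);	/* 1 -> 0, array still not broken */

	printf("bit variant broke the array:     %s\n", broken_with_flag ? "yes" : "no");
	printf("counter variant broke the array: %s\n", broken_with_counter ? "yes" : "no");
	return 0;
}

This only models the ordering problem; it says nothing about which variant behaves better once normal (non-failfast) failures are mixed in, which is the trade-off weighed in the discussion above.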
* [PATCH v3 2/3] md/raid1,raid10: Add error message when setting MD_BROKEN 2025-08-28 16:32 [PATCH v3 0/3] Do not set MD_BROKEN on failfast io failure Kenta Akagi 2025-08-28 16:32 ` [PATCH v3 1/3] md/raid1,raid10: " Kenta Akagi @ 2025-08-28 16:32 ` Kenta Akagi 2025-08-28 16:32 ` [PATCH v3 3/3] md/raid1,raid10: Fix: Operation continuing on 0 devices Kenta Akagi 2 siblings, 0 replies; 13+ messages in thread From: Kenta Akagi @ 2025-08-28 16:32 UTC (permalink / raw) To: Song Liu, Yu Kuai, Mariusz Tkaczyk, Guoqing Jiang Cc: linux-raid, linux-kernel, Kenta Akagi Once MD_BROKEN is set on an array, no further writes can be performed to it. The user must be informed that the array cannot continue operation. Signed-off-by: Kenta Akagi <k@mgml.me> --- drivers/md/raid1.c | 5 +++++ drivers/md/raid10.c | 5 +++++ 2 files changed, 10 insertions(+) diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 8a61fd93b3ff..547635bcfdb9 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -1770,7 +1770,12 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) mdname(mddev), rdev->bdev); return; } + set_bit(MD_BROKEN, &mddev->flags); + pr_crit("md/raid1:%s: Disk failure on %pg, this is the last device.\n" + "md/raid1:%s: Cannot continue operation (%d/%d failed).\n", + mdname(mddev), rdev->bdev, + mdname(mddev), mddev->degraded + 1, conf->raid_disks); if (!mddev->fail_last_dev) { conf->recovery_disabled = mddev->recovery_disabled; diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c index 530ad6503189..b940ab4f6618 100644 --- a/drivers/md/raid10.c +++ b/drivers/md/raid10.c @@ -2018,7 +2018,12 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev) mdname(mddev), rdev->bdev); return; } + set_bit(MD_BROKEN, &mddev->flags); + pr_crit("md/raid10:%s: Disk failure on %pg, this is the last device.\n" + "md/raid10:%s: Cannot continue operation (%d/%d failed).\n", + mdname(mddev), rdev->bdev, + mdname(mddev), mddev->degraded + 1, conf->geo.raid_disks); if (!mddev->fail_last_dev) { spin_unlock_irqrestore(&conf->device_lock, flags); -- 2.50.1 ^ permalink raw reply related [flat|nested] 13+ messages in thread
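For reference, here is how the new message would read in one hypothetical case, derived purely from the format string in the hunk above: a two-device RAID1 named md0 built from loop0/loop1, where loop0 has already failed (mddev->degraded == 1, conf->raid_disks == 2) and loop1 now hits the last-device path. The device and array names are only an example; the counts come from mddev->degraded + 1 and conf->raid_disks as passed to pr_crit().

md/raid1:md0: Disk failure on loop1, this is the last device.
md/raid1:md0: Cannot continue operation (2/2 failed).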
* [PATCH v3 3/3] md/raid1,raid10: Fix: Operation continuing on 0 devices. 2025-08-28 16:32 [PATCH v3 0/3] Do not set MD_BROKEN on failfast io failure Kenta Akagi 2025-08-28 16:32 ` [PATCH v3 1/3] md/raid1,raid10: " Kenta Akagi 2025-08-28 16:32 ` [PATCH v3 2/3] md/raid1,raid10: Add error message when setting MD_BROKEN Kenta Akagi @ 2025-08-28 16:32 ` Kenta Akagi 2 siblings, 0 replies; 13+ messages in thread From: Kenta Akagi @ 2025-08-28 16:32 UTC (permalink / raw) To: Song Liu, Yu Kuai, Mariusz Tkaczyk, Guoqing Jiang Cc: linux-raid, linux-kernel, Kenta Akagi Since commit 9a567843f7ce ("md: allow last device to be forcibly removed from RAID1/RAID10."), RAID1/10 arrays can now lose all rdevs. Before that commit, losing the array's last rdev, or reaching the end of raid{1,10}_error without an early return, never occurred. However, both situations can occur in the current implementation. As a result, when mddev->fail_last_dev is set, a spurious pr_crit message can be printed. This patch prevents the "Operation continuing" message from being printed if the array is not operational. root@fedora:~# mdadm --create --verbose /dev/md0 --level=1 \ --raid-devices=2 /dev/loop0 /dev/loop1 mdadm: Note: this array has metadata at the start and may not be suitable as a boot device. If you plan to store '/boot' on this device please ensure that your boot-loader understands md/v1.x metadata, or use --metadata=0.90 mdadm: size set to 1046528K Continue creating array? y mdadm: Defaulting to version 1.2 metadata mdadm: array /dev/md0 started. root@fedora:~# echo 1 > /sys/block/md0/md/fail_last_dev root@fedora:~# mdadm --fail /dev/md0 loop0 mdadm: set loop0 faulty in /dev/md0 root@fedora:~# mdadm --fail /dev/md0 loop1 mdadm: set device faulty failed for loop1: Device or resource busy root@fedora:~# dmesg | tail -n 4 [ 1314.359674] md/raid1:md0: Disk failure on loop0, disabling device. md/raid1:md0: Operation continuing on 1 devices. [ 1315.506633] md/raid1:md0: Disk failure on loop1, disabling device. md/raid1:md0: Operation continuing on 0 devices. root@fedora:~# Fixes: 9a567843f7ce ("md: allow last device to be forcibly removed from RAID1/RAID10.") Signed-off-by: Kenta Akagi <k@mgml.me> --- drivers/md/raid1.c | 9 +++++---- drivers/md/raid10.c | 9 +++++---- 2 files changed, 10 insertions(+), 8 deletions(-) diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 547635bcfdb9..e774c207eb70 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -1787,6 +1787,11 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) if (test_and_clear_bit(In_sync, &rdev->flags)) mddev->degraded++; set_bit(Faulty, &rdev->flags); + if ((conf->raid_disks - mddev->degraded) > 0) + pr_crit("md/raid1:%s: Disk failure on %pg, disabling device.\n" + "md/raid1:%s: Operation continuing on %d devices.\n", + mdname(mddev), rdev->bdev, + mdname(mddev), conf->raid_disks - mddev->degraded); spin_unlock_irqrestore(&conf->device_lock, flags); /* * if recovery is running, make sure it aborts.
@@ -1794,10 +1799,6 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) set_bit(MD_RECOVERY_INTR, &mddev->recovery); set_mask_bits(&mddev->sb_flags, 0, BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING)); - pr_crit("md/raid1:%s: Disk failure on %pg, disabling device.\n" - "md/raid1:%s: Operation continuing on %d devices.\n", - mdname(mddev), rdev->bdev, - mdname(mddev), conf->raid_disks - mddev->degraded); } static void print_conf(struct r1conf *conf) diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c index b940ab4f6618..3c9b2173a8a8 100644 --- a/drivers/md/raid10.c +++ b/drivers/md/raid10.c @@ -2038,11 +2038,12 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev) set_bit(Faulty, &rdev->flags); set_mask_bits(&mddev->sb_flags, 0, BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING)); + if (enough(conf, -1)) + pr_crit("md/raid10:%s: Disk failure on %pg, disabling device.\n" + "md/raid10:%s: Operation continuing on %d devices.\n", + mdname(mddev), rdev->bdev, + mdname(mddev), conf->geo.raid_disks - mddev->degraded); spin_unlock_irqrestore(&conf->device_lock, flags); - pr_crit("md/raid10:%s: Disk failure on %pg, disabling device.\n" - "md/raid10:%s: Operation continuing on %d devices.\n", - mdname(mddev), rdev->bdev, - mdname(mddev), conf->geo.raid_disks - mddev->degraded); } static void print_conf(struct r10conf *conf) -- 2.50.1 ^ permalink raw reply related [flat|nested] 13+ messages in thread
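As a quick sanity check of the new placement, the following is a small userspace sketch (plain C, not kernel code) of the guarded raid1 path added by this patch; it replays the mdadm session from the commit message: the first failure still logs "Operation continuing on 1 devices", while the second produces no message instead of "... on 0 devices". The raid10 hunk guards the same message with enough(conf, -1) instead of a simple remaining-device count.

#include <stdio.h>

/* models only the fields that matter for the message */
struct model {
	int raid_disks;
	int degraded;
};

/* models the tail of raid1_error() after this patch: the pr_crit is
 * reached only while at least one working device remains */
static void fail_one_device(struct model *m, const char *bdev)
{
	m->degraded++;			/* the In_sync -> Faulty transition */
	if (m->raid_disks - m->degraded > 0)
		printf("md/raid1:md0: Disk failure on %s, disabling device.\n"
		       "md/raid1:md0: Operation continuing on %d devices.\n",
		       bdev, m->raid_disks - m->degraded);
}

int main(void)
{
	struct model m = { .raid_disks = 2, .degraded = 0 };

	fail_one_device(&m, "loop0");	/* prints: continuing on 1 devices */
	fail_one_device(&m, "loop1");	/* prints nothing after this patch */
	return 0;
}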
end of thread, other threads:[~2025-09-05 15:07 UTC | newest] Thread overview: 13+ messages -- 2025-08-28 16:32 [PATCH v3 0/3] Do not set MD_BROKEN on failfast io failure Kenta Akagi 2025-08-28 16:32 ` [PATCH v3 1/3] md/raid1,raid10: " Kenta Akagi 2025-08-29 2:54 ` Li Nan 2025-08-29 12:21 ` Kenta Akagi 2025-08-30 8:48 ` Li Nan 2025-08-30 18:10 ` Kenta Akagi 2025-09-01 3:22 ` Li Nan 2025-09-01 4:22 ` Kenta Akagi 2025-09-01 7:48 ` Yu Kuai 2025-09-01 16:48 ` Kenta Akagi 2025-09-05 15:07 ` Kenta Akagi 2025-08-28 16:32 ` [PATCH v3 2/3] md/raid1,raid10: Add error message when setting MD_BROKEN Kenta Akagi 2025-08-28 16:32 ` [PATCH v3 3/3] md/raid1,raid10: Fix: Operation continuing on 0 devices Kenta Akagi