* [PATCH v3 0/3] Do not set MD_BROKEN on failfast io failure
@ 2025-08-28 16:32 Kenta Akagi
  2025-08-28 16:32 ` [PATCH v3 1/3] md/raid1,raid10: " Kenta Akagi
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Kenta Akagi @ 2025-08-28 16:32 UTC (permalink / raw)
  To: Song Liu, Yu Kuai, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, Kenta Akagi

Changes from V2:
- Prevent the array from being marked broken on any failfast IO
  failure, not just metadata writes.
- Following review feedback, update raid{1,10}_error to clear
  FailfastIOFailure so that devices are properly marked Faulty.
Changes from V1:
- Avoid setting MD_BROKEN instead of clearing it
- Add pr_crit() when setting MD_BROKEN
- Fix the message that may be shown after all rdevs have failed:
  "Operation continuing on 0 devices"

v2: https://lore.kernel.org/linux-raid/20250817172710.4892-1-k@mgml.me/
v1: https://lore.kernel.org/linux-raid/20250812090119.153697-1-k@mgml.me/

A failfast bio, for example in the case of nvme-tcp, fails
immediately if the connection to the target is briefly lost and
the device enters a reconnecting state - even though it would
recover within a few seconds. This behavior is by design for failfast.
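
For reference, the MD_FAILFAST flag that md puts on such bios is simply
the combination of the two block-layer failfast flags (drivers/md/md.h
in current mainline):

    #define MD_FAILFAST	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT)

so the lower layer completes the bio with an error as soon as the
transport reports trouble, instead of retrying internally.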

However, md treats Failfast IO failures as fatal,
potentially marking the array as MD_BROKEN when a connection is lost.

For example, if an initiator - that is, a machine loading the md
module - briefly loses all connections, the array is marked
as MD_BROKEN, preventing subsequent writes.
This is the issue I am currently facing, and it is what this series aims to fix.

The 1st patch changes the behavior of MD_FAILFAST IO failures on
the last rdev. The 2nd and 3rd patches modify the pr_crit messages.

Kenta Akagi (3):
  md/raid1,raid10: Do not set MD_BROKEN on failfast io failure
  md/raid1,raid10: Add error message when setting MD_BROKEN
  md/raid1,raid10: Fix: Operation continuing on 0 devices.

 drivers/md/md.c     | 14 +++++++++-----
 drivers/md/md.h     | 13 +++++++------
 drivers/md/raid1.c  | 32 ++++++++++++++++++++++++++------
 drivers/md/raid10.c | 35 ++++++++++++++++++++++++++++-------
 4 files changed, 70 insertions(+), 24 deletions(-)

-- 
2.50.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v3 1/3] md/raid1,raid10: Do not set MD_BROKEN on failfast io failure
  2025-08-28 16:32 [PATCH v3 0/3] Do not set MD_BROKEN on failfast io failure Kenta Akagi
@ 2025-08-28 16:32 ` Kenta Akagi
  2025-08-29  2:54   ` Li Nan
  2025-08-28 16:32 ` [PATCH v3 2/3] md/raid1,raid10: Add error message when setting MD_BROKEN Kenta Akagi
  2025-08-28 16:32 ` [PATCH v3 3/3] md/raid1,raid10: Fix: Operation continuing on 0 devices Kenta Akagi
  2 siblings, 1 reply; 12+ messages in thread
From: Kenta Akagi @ 2025-08-28 16:32 UTC (permalink / raw)
  To: Song Liu, Yu Kuai, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, Kenta Akagi

This commit ensures that an MD_FAILFAST IO failure does not put
the array into a broken state.

When failfast is enabled on an rdev in RAID1 or RAID10,
the array may be flagged MD_BROKEN in the following cases:
- If MD_FAILFAST IOs to multiple rdevs fail simultaneously
- If an MD_FAILFAST metadata write to the 'last' rdev fails

The MD_FAILFAST bio error handler always calls md_error on IO failure,
under the assumption that raid{1,10}_error will neither fail
the last rdev nor break the array.
After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"),
calling md_error on the 'last' rdev in RAID1/10 always sets
the MD_BROKEN flag on the array.
As a result, when failfast IO fails on the last rdev, the array
immediately becomes failed.

Normally, MD_FAILFAST IOs are not issued to the 'last' rdev, so this is
an edge case; however, it can occur if rdevs in a non-degraded
array share the same path and that path is lost, or if a metadata
write is triggered in a degraded array and fails due to failfast.

When a failfast metadata write fails, it is retried through the
following sequence (a simplified sketch of the retry loop follows
the list). If the retried metadata write, issued without failfast,
also fails, the array will be marked with MD_BROKEN.

1. MD_SB_NEED_REWRITE is set in sb_flags by super_written.
2. md_super_wait, called by the function executing md_super_write,
   returns -EAGAIN due to MD_SB_NEED_REWRITE.
3. The caller of md_super_wait (e.g., md_update_sb) receives the
   negative return value and retries md_super_write.
4. md_super_write issues the metadata write again,
   this time without MD_FAILFAST.
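
A simplified sketch of that retry loop, paraphrased from md_update_sb()
(locking, bitmap handling and other details omitted):

rewrite:
	rdev_for_each(rdev, mddev) {
		if (rdev->sb_loaded != 1)
			continue;	/* no noise on spare devices */
		if (!test_bit(Faulty, &rdev->flags))
			md_super_write(mddev, rdev, rdev->sb_start,
				       rdev->sb_size, rdev->sb_page);
	}
	if (md_super_wait(mddev) < 0)
		/* -EAGAIN: MD_SB_NEED_REWRITE was set; reissue the writes,
		 * and this time md_super_write() will not use MD_FAILFAST */
		goto rewrite;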

Fixes: 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10")
Signed-off-by: Kenta Akagi <k@mgml.me>
---
 drivers/md/md.c     | 14 +++++++++-----
 drivers/md/md.h     | 13 +++++++------
 drivers/md/raid1.c  | 18 ++++++++++++++++--
 drivers/md/raid10.c | 21 ++++++++++++++++++---
 4 files changed, 50 insertions(+), 16 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index ac85ec73a409..547b88e253f7 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -999,14 +999,18 @@ static void super_written(struct bio *bio)
 	if (bio->bi_status) {
 		pr_err("md: %s gets error=%d\n", __func__,
 		       blk_status_to_errno(bio->bi_status));
+		if (bio->bi_opf & MD_FAILFAST)
+			set_bit(FailfastIOFailure, &rdev->flags);
 		md_error(mddev, rdev);
 		if (!test_bit(Faulty, &rdev->flags)
 		    && (bio->bi_opf & MD_FAILFAST)) {
+			pr_warn("md: %s: Metadata write will be repeated to %pg\n",
+				mdname(mddev), rdev->bdev);
 			set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
-			set_bit(LastDev, &rdev->flags);
 		}
-	} else
-		clear_bit(LastDev, &rdev->flags);
+	} else {
+		clear_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
+	}
 
 	bio_put(bio);
 
@@ -1048,7 +1052,7 @@ void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
 
 	if (test_bit(MD_FAILFAST_SUPPORTED, &mddev->flags) &&
 	    test_bit(FailFast, &rdev->flags) &&
-	    !test_bit(LastDev, &rdev->flags))
+	    !test_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags))
 		bio->bi_opf |= MD_FAILFAST;
 
 	atomic_inc(&mddev->pending_writes);
@@ -1059,7 +1063,7 @@ int md_super_wait(struct mddev *mddev)
 {
 	/* wait for all superblock writes that were scheduled to complete */
 	wait_event(mddev->sb_wait, atomic_read(&mddev->pending_writes)==0);
-	if (test_and_clear_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags))
+	if (test_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags))
 		return -EAGAIN;
 	return 0;
 }
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 51af29a03079..52c9fc759015 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -281,9 +281,10 @@ enum flag_bits {
 				 * It is expects that no bad block log
 				 * is present.
 				 */
-	LastDev,		/* Seems to be the last working dev as
-				 * it didn't fail, so don't use FailFast
-				 * any more for metadata
+	FailfastIOFailure,	/* rdev with failfast IO failure
+				 * but md_error not yet completed.
+				 * If the last rdev has this flag,
+				 * error_handler must not fail the array
 				 */
 	CollisionCheck,		/*
 				 * check if there is collision between raid1
@@ -331,8 +332,8 @@ struct md_cluster_operations;
  * @MD_CLUSTER_RESYNC_LOCKED: cluster raid only, which means node, already took
  *			       resync lock, need to release the lock.
  * @MD_FAILFAST_SUPPORTED: Using MD_FAILFAST on metadata writes is supported as
- *			    calls to md_error() will never cause the array to
- *			    become failed.
+ *			    calls to md_error() with FailfastIOFailure will
+ *			    never cause the array to become failed.
  * @MD_HAS_PPL:  The raid array has PPL feature set.
  * @MD_HAS_MULTIPLE_PPLS: The raid array has multiple PPLs feature set.
  * @MD_NOT_READY: do_md_run() is active, so 'array_state', ust not report that
@@ -360,7 +361,7 @@ enum mddev_sb_flags {
 	MD_SB_CHANGE_DEVS,		/* Some device status has changed */
 	MD_SB_CHANGE_CLEAN,	/* transition to or from 'clean' */
 	MD_SB_CHANGE_PENDING,	/* switch from 'clean' to 'active' in progress */
-	MD_SB_NEED_REWRITE,	/* metadata write needs to be repeated */
+	MD_SB_NEED_REWRITE,	/* metadata write needs to be repeated, do not use failfast */
 };
 
 #define NR_SERIAL_INFOS		8
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 408c26398321..8a61fd93b3ff 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -470,6 +470,7 @@ static void raid1_end_write_request(struct bio *bio)
 		    (bio->bi_opf & MD_FAILFAST) &&
 		    /* We never try FailFast to WriteMostly devices */
 		    !test_bit(WriteMostly, &rdev->flags)) {
+			set_bit(FailfastIOFailure, &rdev->flags);
 			md_error(r1_bio->mddev, rdev);
 		}
 
@@ -1746,8 +1747,12 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
  *	- recovery is interrupted.
  *	- &mddev->degraded is bumped.
  *
- * @rdev is marked as &Faulty excluding case when array is failed and
- * &mddev->fail_last_dev is off.
+ * If @rdev has &FailfastIOFailure and it is the 'last' rdev,
+ * then @mddev and @rdev will not be marked as failed.
+ *
+ * @rdev is marked as &Faulty excluding any cases:
+ *	- when @mddev is failed and &mddev->fail_last_dev is off
+ *	- when @rdev is last device and &FailfastIOFailure flag is set
  */
 static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
 {
@@ -1758,6 +1763,13 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
 
 	if (test_bit(In_sync, &rdev->flags) &&
 	    (conf->raid_disks - mddev->degraded) == 1) {
+		if (test_and_clear_bit(FailfastIOFailure, &rdev->flags)) {
+			spin_unlock_irqrestore(&conf->device_lock, flags);
+			pr_warn_ratelimited("md/raid1:%s: Failfast IO failure on %pg, "
+				"last device but ignoring it\n",
+				mdname(mddev), rdev->bdev);
+			return;
+		}
 		set_bit(MD_BROKEN, &mddev->flags);
 
 		if (!mddev->fail_last_dev) {
@@ -2148,6 +2160,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
 	if (test_bit(FailFast, &rdev->flags)) {
 		/* Don't try recovering from here - just fail it
 		 * ... unless it is the last working device of course */
+		set_bit(FailfastIOFailure, &rdev->flags);
 		md_error(mddev, rdev);
 		if (test_bit(Faulty, &rdev->flags))
 			/* Don't try to read from here, but make sure
@@ -2652,6 +2665,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
 		fix_read_error(conf, r1_bio);
 		unfreeze_array(conf);
 	} else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) {
+		set_bit(FailfastIOFailure, &rdev->flags);
 		md_error(mddev, rdev);
 	} else {
 		r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index b60c30bfb6c7..530ad6503189 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -488,6 +488,7 @@ static void raid10_end_write_request(struct bio *bio)
 			dec_rdev = 0;
 			if (test_bit(FailFast, &rdev->flags) &&
 			    (bio->bi_opf & MD_FAILFAST)) {
+				set_bit(FailfastIOFailure, &rdev->flags);
 				md_error(rdev->mddev, rdev);
 			}
 
@@ -1995,8 +1996,12 @@ static int enough(struct r10conf *conf, int ignore)
  *	- recovery is interrupted.
  *	- &mddev->degraded is bumped.
  *
- * @rdev is marked as &Faulty excluding case when array is failed and
- * &mddev->fail_last_dev is off.
+ * If @rdev has &FailfastIOFailure and it is the 'last' rdev,
+ * then @mddev and @rdev will not be marked as failed.
+ *
+ * @rdev is marked as &Faulty excluding any cases:
+ *	- when @mddev is failed and &mddev->fail_last_dev is off
+ *	- when @rdev is last device and &FailfastIOFailure flag is set
  */
 static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
 {
@@ -2006,6 +2011,13 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
 	spin_lock_irqsave(&conf->device_lock, flags);
 
 	if (test_bit(In_sync, &rdev->flags) && !enough(conf, rdev->raid_disk)) {
+		if (test_and_clear_bit(FailfastIOFailure, &rdev->flags)) {
+			spin_unlock_irqrestore(&conf->device_lock, flags);
+			pr_warn_ratelimited("md/raid10:%s: Failfast IO failure on %pg, "
+				"last device but ignoring it\n",
+				mdname(mddev), rdev->bdev);
+			return;
+		}
 		set_bit(MD_BROKEN, &mddev->flags);
 
 		if (!mddev->fail_last_dev) {
@@ -2413,6 +2425,7 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
 				continue;
 		} else if (test_bit(FailFast, &rdev->flags)) {
 			/* Just give up on this device */
+			set_bit(FailfastIOFailure, &rdev->flags);
 			md_error(rdev->mddev, rdev);
 			continue;
 		}
@@ -2865,8 +2878,10 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
 		freeze_array(conf, 1);
 		fix_read_error(conf, mddev, r10_bio);
 		unfreeze_array(conf);
-	} else
+	} else {
+		set_bit(FailfastIOFailure, &rdev->flags);
 		md_error(mddev, rdev);
+	}
 
 	rdev_dec_pending(rdev, mddev);
 	r10_bio->state = 0;
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v3 2/3] md/raid1,raid10: Add error message when setting MD_BROKEN
  2025-08-28 16:32 [PATCH v3 0/3] Do not set MD_BROKEN on failfast io failure Kenta Akagi
  2025-08-28 16:32 ` [PATCH v3 1/3] md/raid1,raid10: " Kenta Akagi
@ 2025-08-28 16:32 ` Kenta Akagi
  2025-08-28 16:32 ` [PATCH v3 3/3] md/raid1,raid10: Fix: Operation continuing on 0 devices Kenta Akagi
  2 siblings, 0 replies; 12+ messages in thread
From: Kenta Akagi @ 2025-08-28 16:32 UTC (permalink / raw)
  To: Song Liu, Yu Kuai, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, Kenta Akagi

Once MD_BROKEN is set on an array, no further writes can be
performed to it.
The user must be informed that the array cannot continue operation.

Signed-off-by: Kenta Akagi <k@mgml.me>
---
 drivers/md/raid1.c  | 5 +++++
 drivers/md/raid10.c | 5 +++++
 2 files changed, 10 insertions(+)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 8a61fd93b3ff..547635bcfdb9 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1770,7 +1770,12 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
 				mdname(mddev), rdev->bdev);
 			return;
 		}
+
 		set_bit(MD_BROKEN, &mddev->flags);
+		pr_crit("md/raid1:%s: Disk failure on %pg, this is the last device.\n"
+			"md/raid1:%s: Cannot continue operation (%d/%d failed).\n",
+			mdname(mddev), rdev->bdev,
+			mdname(mddev), mddev->degraded + 1, conf->raid_disks);
 
 		if (!mddev->fail_last_dev) {
 			conf->recovery_disabled = mddev->recovery_disabled;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 530ad6503189..b940ab4f6618 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2018,7 +2018,12 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
 				mdname(mddev), rdev->bdev);
 			return;
 		}
+
 		set_bit(MD_BROKEN, &mddev->flags);
+		pr_crit("md/raid10:%s: Disk failure on %pg, this is the last device.\n"
+			"md/raid10:%s: Cannot continue operation (%d/%d failed).\n",
+			mdname(mddev), rdev->bdev,
+			mdname(mddev), mddev->degraded + 1, conf->geo.raid_disks);
 
 		if (!mddev->fail_last_dev) {
 			spin_unlock_irqrestore(&conf->device_lock, flags);
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v3 3/3] md/raid1,raid10: Fix: Operation continuing on 0 devices.
  2025-08-28 16:32 [PATCH v3 0/3] Do not set MD_BROKEN on failfast io failure Kenta Akagi
  2025-08-28 16:32 ` [PATCH v3 1/3] md/raid1,raid10: " Kenta Akagi
  2025-08-28 16:32 ` [PATCH v3 2/3] md/raid1,raid10: Add error message when setting MD_BROKEN Kenta Akagi
@ 2025-08-28 16:32 ` Kenta Akagi
  2 siblings, 0 replies; 12+ messages in thread
From: Kenta Akagi @ 2025-08-28 16:32 UTC (permalink / raw)
  To: Song Liu, Yu Kuai, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, Kenta Akagi

Since commit 9a567843f7ce ("md: allow last device to be forcibly
removed from RAID1/RAID10."), RAID1/10 arrays can now lose all rdevs.

Before that commit, losing the array's last rdev, or reaching the end
of raid{1,10}_error without taking an early return, never occurred.
However, both situations can occur in the current implementation.

As a result, when mddev->fail_last_dev is set, a spurious pr_crit
message can be printed.

This patch prevents the "Operation continuing" message from being
printed if the array is not operational.

root@fedora:~# mdadm --create --verbose /dev/md0 --level=1 \
--raid-devices=2  /dev/loop0 /dev/loop1
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
mdadm: size set to 1046528K
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
root@fedora:~# echo 1 > /sys/block/md0/md/fail_last_dev
root@fedora:~# mdadm --fail /dev/md0 loop0
mdadm: set loop0 faulty in /dev/md0
root@fedora:~# mdadm --fail /dev/md0 loop1
mdadm: set device faulty failed for loop1:  Device or resource busy
root@fedora:~# dmesg | tail -n 4
[ 1314.359674] md/raid1:md0: Disk failure on loop0, disabling device.
               md/raid1:md0: Operation continuing on 1 devices.
[ 1315.506633] md/raid1:md0: Disk failure on loop1, disabling device.
               md/raid1:md0: Operation continuing on 0 devices.
root@fedora:~#

Fixes: 9a567843f7ce ("md: allow last device to be forcibly removed from RAID1/RAID10.")
Signed-off-by: Kenta Akagi <k@mgml.me>
---
 drivers/md/raid1.c  | 9 +++++----
 drivers/md/raid10.c | 9 +++++----
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 547635bcfdb9..e774c207eb70 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1787,6 +1787,11 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
 	if (test_and_clear_bit(In_sync, &rdev->flags))
 		mddev->degraded++;
 	set_bit(Faulty, &rdev->flags);
+	if ((conf->raid_disks - mddev->degraded) > 0)
+		pr_crit("md/raid1:%s: Disk failure on %pg, disabling device.\n"
+			"md/raid1:%s: Operation continuing on %d devices.\n",
+			mdname(mddev), rdev->bdev,
+			mdname(mddev), conf->raid_disks - mddev->degraded);
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 	/*
 	 * if recovery is running, make sure it aborts.
@@ -1794,10 +1799,6 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
 	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
 	set_mask_bits(&mddev->sb_flags, 0,
 		      BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
-	pr_crit("md/raid1:%s: Disk failure on %pg, disabling device.\n"
-		"md/raid1:%s: Operation continuing on %d devices.\n",
-		mdname(mddev), rdev->bdev,
-		mdname(mddev), conf->raid_disks - mddev->degraded);
 }
 
 static void print_conf(struct r1conf *conf)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index b940ab4f6618..3c9b2173a8a8 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2038,11 +2038,12 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
 	set_bit(Faulty, &rdev->flags);
 	set_mask_bits(&mddev->sb_flags, 0,
 		      BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
+	if (enough(conf, -1))
+		pr_crit("md/raid10:%s: Disk failure on %pg, disabling device.\n"
+			"md/raid10:%s: Operation continuing on %d devices.\n",
+			mdname(mddev), rdev->bdev,
+			mdname(mddev), conf->geo.raid_disks - mddev->degraded);
 	spin_unlock_irqrestore(&conf->device_lock, flags);
-	pr_crit("md/raid10:%s: Disk failure on %pg, disabling device.\n"
-		"md/raid10:%s: Operation continuing on %d devices.\n",
-		mdname(mddev), rdev->bdev,
-		mdname(mddev), conf->geo.raid_disks - mddev->degraded);
 }
 
 static void print_conf(struct r10conf *conf)
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 1/3] md/raid1,raid10: Do not set MD_BROKEN on failfast io failure
  2025-08-28 16:32 ` [PATCH v3 1/3] md/raid1,raid10: " Kenta Akagi
@ 2025-08-29  2:54   ` Li Nan
  2025-08-29 12:21     ` Kenta Akagi
  0 siblings, 1 reply; 12+ messages in thread
From: Li Nan @ 2025-08-29  2:54 UTC (permalink / raw)
  To: Kenta Akagi, Song Liu, Yu Kuai, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel


On 2025/8/29 0:32, Kenta Akagi wrote:
> This commit ensures that an MD_FAILFAST IO failure does not put
> the array into a broken state.
> 
> When failfast is enabled on rdev in RAID1 or RAID10,
> the array may be flagged MD_BROKEN in the following cases.
> - If MD_FAILFAST IOs to multiple rdevs fail simultaneously
> - If an MD_FAILFAST metadata write to the 'last' rdev fails

[...]

> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 408c26398321..8a61fd93b3ff 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -470,6 +470,7 @@ static void raid1_end_write_request(struct bio *bio)
>   		    (bio->bi_opf & MD_FAILFAST) &&
>   		    /* We never try FailFast to WriteMostly devices */
>   		    !test_bit(WriteMostly, &rdev->flags)) {
> +			set_bit(FailfastIOFailure, &rdev->flags);
>   			md_error(r1_bio->mddev, rdev);
>   		}
>   
> @@ -1746,8 +1747,12 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>    *	- recovery is interrupted.
>    *	- &mddev->degraded is bumped.
>    *
> - * @rdev is marked as &Faulty excluding case when array is failed and
> - * &mddev->fail_last_dev is off.
> + * If @rdev has &FailfastIOFailure and it is the 'last' rdev,
> + * then @mddev and @rdev will not be marked as failed.
> + *
> + * @rdev is marked as &Faulty excluding any cases:
> + *	- when @mddev is failed and &mddev->fail_last_dev is off
> + *	- when @rdev is last device and &FailfastIOFailure flag is set
>    */
>   static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>   {
> @@ -1758,6 +1763,13 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>   
>   	if (test_bit(In_sync, &rdev->flags) &&
>   	    (conf->raid_disks - mddev->degraded) == 1) {
> +		if (test_and_clear_bit(FailfastIOFailure, &rdev->flags)) {
> +			spin_unlock_irqrestore(&conf->device_lock, flags);
> +			pr_warn_ratelimited("md/raid1:%s: Failfast IO failure on %pg, "
> +				"last device but ignoring it\n",
> +				mdname(mddev), rdev->bdev);
> +			return;
> +		}
>   		set_bit(MD_BROKEN, &mddev->flags);
>   
>   		if (!mddev->fail_last_dev) {
> @@ -2148,6 +2160,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
>   	if (test_bit(FailFast, &rdev->flags)) {
>   		/* Don't try recovering from here - just fail it
>   		 * ... unless it is the last working device of course */
> +		set_bit(FailfastIOFailure, &rdev->flags);
>   		md_error(mddev, rdev);
>   		if (test_bit(Faulty, &rdev->flags))
>   			/* Don't try to read from here, but make sure
> @@ -2652,6 +2665,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
>   		fix_read_error(conf, r1_bio);
>   		unfreeze_array(conf);
>   	} else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) {
> +		set_bit(FailfastIOFailure, &rdev->flags);
>   		md_error(mddev, rdev);
>   	} else {
>   		r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index b60c30bfb6c7..530ad6503189 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -488,6 +488,7 @@ static void raid10_end_write_request(struct bio *bio)
>   			dec_rdev = 0;
>   			if (test_bit(FailFast, &rdev->flags) &&
>   			    (bio->bi_opf & MD_FAILFAST)) {
> +				set_bit(FailfastIOFailure, &rdev->flags);
>   				md_error(rdev->mddev, rdev);
>   			}
>   

Thank you for the patch. There may be an issue with 'test_and_clear'.

If two write IOs go to the same rdev, MD_BROKEN may be set as below:

IO1					IO2
set FailfastIOFailure
					set FailfastIOFailure
 md_error
  raid1_error
   test_and_clear FailfastIOFailure
					 md_error
					  raid1_error
					   //FailfastIOFailure is cleared
					   set MD_BROKEN

Maybe we should check whether FailfastIOFailure is already set before
setting it. It also needs to be considered in metadata writes.

> @@ -1995,8 +1996,12 @@ static int enough(struct r10conf *conf, int ignore)
>    *	- recovery is interrupted.
>    *	- &mddev->degraded is bumped.
>    *
> - * @rdev is marked as &Faulty excluding case when array is failed and
> - * &mddev->fail_last_dev is off.
> + * If @rdev has &FailfastIOFailure and it is the 'last' rdev,
> + * then @mddev and @rdev will not be marked as failed.
> + *
> + * @rdev is marked as &Faulty excluding any cases:
> + *	- when @mddev is failed and &mddev->fail_last_dev is off
> + *	- when @rdev is last device and &FailfastIOFailure flag is set
>    */
>   static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
>   {
> @@ -2006,6 +2011,13 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
>   	spin_lock_irqsave(&conf->device_lock, flags);
>   
>   	if (test_bit(In_sync, &rdev->flags) && !enough(conf, rdev->raid_disk)) {
> +		if (test_and_clear_bit(FailfastIOFailure, &rdev->flags)) {
> +			spin_unlock_irqrestore(&conf->device_lock, flags);
> +			pr_warn_ratelimited("md/raid10:%s: Failfast IO failure on %pg, "
> +				"last device but ignoring it\n",
> +				mdname(mddev), rdev->bdev);
> +			return;
> +		}
>   		set_bit(MD_BROKEN, &mddev->flags);
>   
>   		if (!mddev->fail_last_dev) {
> @@ -2413,6 +2425,7 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
>   				continue;
>   		} else if (test_bit(FailFast, &rdev->flags)) {
>   			/* Just give up on this device */
> +			set_bit(FailfastIOFailure, &rdev->flags);
>   			md_error(rdev->mddev, rdev);
>   			continue;
>   		}
> @@ -2865,8 +2878,10 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
>   		freeze_array(conf, 1);
>   		fix_read_error(conf, mddev, r10_bio);
>   		unfreeze_array(conf);
> -	} else
> +	} else {
> +		set_bit(FailfastIOFailure, &rdev->flags);
>   		md_error(mddev, rdev);
> +	}
>   
>   	rdev_dec_pending(rdev, mddev);
>   	r10_bio->state = 0;

-- 
Thanks,
Nan


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 1/3] md/raid1,raid10: Do not set MD_BROKEN on failfast io failure
  2025-08-29  2:54   ` Li Nan
@ 2025-08-29 12:21     ` Kenta Akagi
  2025-08-30  8:48       ` Li Nan
  0 siblings, 1 reply; 12+ messages in thread
From: Kenta Akagi @ 2025-08-29 12:21 UTC (permalink / raw)
  To: Li Nan, Song Liu, Yu Kuai, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, Kenta Akagi



On 2025/08/29 11:54, Li Nan wrote:
> 
> On 2025/8/29 0:32, Kenta Akagi wrote:
>> This commit ensures that an MD_FAILFAST IO failure does not put
>> the array into a broken state.
>>
>> When failfast is enabled on rdev in RAID1 or RAID10,
>> the array may be flagged MD_BROKEN in the following cases.
>> - If MD_FAILFAST IOs to multiple rdevs fail simultaneously
>> - If an MD_FAILFAST metadata write to the 'last' rdev fails
> 
> [...]
> 
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index 408c26398321..8a61fd93b3ff 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -470,6 +470,7 @@ static void raid1_end_write_request(struct bio *bio)
>>               (bio->bi_opf & MD_FAILFAST) &&
>>               /* We never try FailFast to WriteMostly devices */
>>               !test_bit(WriteMostly, &rdev->flags)) {
>> +            set_bit(FailfastIOFailure, &rdev->flags);
>>               md_error(r1_bio->mddev, rdev);
>>           }
>>   @@ -1746,8 +1747,12 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>>    *    - recovery is interrupted.
>>    *    - &mddev->degraded is bumped.
>>    *
>> - * @rdev is marked as &Faulty excluding case when array is failed and
>> - * &mddev->fail_last_dev is off.
>> + * If @rdev has &FailfastIOFailure and it is the 'last' rdev,
>> + * then @mddev and @rdev will not be marked as failed.
>> + *
>> + * @rdev is marked as &Faulty excluding any cases:
>> + *    - when @mddev is failed and &mddev->fail_last_dev is off
>> + *    - when @rdev is last device and &FailfastIOFailure flag is set
>>    */
>>   static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>   {
>> @@ -1758,6 +1763,13 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>         if (test_bit(In_sync, &rdev->flags) &&
>>           (conf->raid_disks - mddev->degraded) == 1) {
>> +        if (test_and_clear_bit(FailfastIOFailure, &rdev->flags)) {
>> +            spin_unlock_irqrestore(&conf->device_lock, flags);
>> +            pr_warn_ratelimited("md/raid1:%s: Failfast IO failure on %pg, "
>> +                "last device but ignoring it\n",
>> +                mdname(mddev), rdev->bdev);
>> +            return;
>> +        }
>>           set_bit(MD_BROKEN, &mddev->flags);
>>             if (!mddev->fail_last_dev) {
>> @@ -2148,6 +2160,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
>>       if (test_bit(FailFast, &rdev->flags)) {
>>           /* Don't try recovering from here - just fail it
>>            * ... unless it is the last working device of course */
>> +        set_bit(FailfastIOFailure, &rdev->flags);
>>           md_error(mddev, rdev);
>>           if (test_bit(Faulty, &rdev->flags))
>>               /* Don't try to read from here, but make sure
>> @@ -2652,6 +2665,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
>>           fix_read_error(conf, r1_bio);
>>           unfreeze_array(conf);
>>       } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) {
>> +        set_bit(FailfastIOFailure, &rdev->flags);
>>           md_error(mddev, rdev);
>>       } else {
>>           r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>> index b60c30bfb6c7..530ad6503189 100644
>> --- a/drivers/md/raid10.c
>> +++ b/drivers/md/raid10.c
>> @@ -488,6 +488,7 @@ static void raid10_end_write_request(struct bio *bio)
>>               dec_rdev = 0;
>>               if (test_bit(FailFast, &rdev->flags) &&
>>                   (bio->bi_opf & MD_FAILFAST)) {
>> +                set_bit(FailfastIOFailure, &rdev->flags);
>>                   md_error(rdev->mddev, rdev);
>>               }
>>   
> 
> Thank you for the patch. There may be an issue with 'test_and_clear'.
> 
> If two write IO go to the same rdev, MD_BROKEN may be set as below:

> IO1                    IO2
> set FailfastIOFailure
>                     set FailfastIOFailure       
>  md_error
>   raid1_error
>    test_and_clear FailfastIOFailur
>                        md_error
>                       raid1_error
>                        //FailfastIOFailur is cleared
>                        set MD_BROKEN
> 
> Maybe we should check whether FailfastIOFailure is already set before
> setting it. It also needs to be considered in metadata writes.
Thank you for reviewing.

I agree, this seems to be as you described.
So, should it be implemented as follows?

bool old = false;

do {
	/* spin until any previously recorded failfast failure has been
	 * consumed (cleared) by raid1_error() */
	spin_lock_irqsave(&conf->device_lock, flags);
	old = test_and_set_bit(FailfastIOFailure, &rdev->flags);
	spin_unlock_irqrestore(&conf->device_lock, flags);
} while (old);

However, since I am concerned about potential deadlocks,
I am considering two alternative approaches:

* Add an atomic_t counter to md_rdev to track failfast IO failures
  (a rough sketch follows below).

This may set MD_BROKEN at slightly the wrong time, but mixing
error handling of failfast and non-failfast IOs appears to be rare.
In any case, the final outcome would be the same, i.e. the array
ends up with MD_BROKEN. Therefore, I think this should not cause
issues. I think the same applies to test_and_set_bit.

IO1                    IO2                    IO3
FailfastIOFailure      Normal IOFailure       FailfastIOFailure
atomic_inc                            
                                              
 md_error                                     atomic_inc 
  raid1_error
   atomic_dec //2to1          
                       md_error          
                        raid1_error           md_error
                         atomic_dec //1to0     raid1_error
                                                atomic_dec //0
                                                  set MD_BROKEN

* Alternatively, create a separate error handler,
  e.g. md_error_failfast(), that clearly does not fail the array.

This approach would require somewhat larger changes and may not
be very elegant, but it seems to be a reliable way to ensure
MD_BROKEN is never set at the wrong time.
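
If it helps the discussion, a minimal sketch of the counter idea
(the field name and the exact hook points are only assumptions on my
side, not part of the posted patches):

struct md_rdev {
	...
	atomic_t	nr_failfast_failures;	/* pending failfast IO failures */
};

/* failfast completion paths, instead of set_bit(FailfastIOFailure, ...): */
atomic_inc(&rdev->nr_failfast_failures);
md_error(mddev, rdev);

/* in raid1_error()/raid10_error(), when @rdev is the last rdev: */
if (atomic_add_unless(&rdev->nr_failfast_failures, -1, 0)) {
	/* a failfast failure is pending: ignore it, do not set MD_BROKEN */
	spin_unlock_irqrestore(&conf->device_lock, flags);
	return;
}
set_bit(MD_BROKEN, &mddev->flags);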

Which of these three approaches would you consider preferable?
I would appreciate your feedback.


For metadata writes, I plan to clear_bit MD_FAILFAST_SUPPORTED
when the array is degraded.
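
Roughly like this (sketch only; where exactly the check belongs is
still undecided):

	if (mddev->degraded)
		clear_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
	else
		set_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);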

Thanks,
Akagi

>> @@ -1995,8 +1996,12 @@ static int enough(struct r10conf *conf, int ignore)
>>    *    - recovery is interrupted.
>>    *    - &mddev->degraded is bumped.
>>    *
>> - * @rdev is marked as &Faulty excluding case when array is failed and
>> - * &mddev->fail_last_dev is off.
>> + * If @rdev has &FailfastIOFailure and it is the 'last' rdev,
>> + * then @mddev and @rdev will not be marked as failed.
>> + *
>> + * @rdev is marked as &Faulty excluding any cases:
>> + *    - when @mddev is failed and &mddev->fail_last_dev is off
>> + *    - when @rdev is last device and &FailfastIOFailure flag is set
>>    */
>>   static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
>>   {
>> @@ -2006,6 +2011,13 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
>>       spin_lock_irqsave(&conf->device_lock, flags);
>>         if (test_bit(In_sync, &rdev->flags) && !enough(conf, rdev->raid_disk)) {
>> +        if (test_and_clear_bit(FailfastIOFailure, &rdev->flags)) {
>> +            spin_unlock_irqrestore(&conf->device_lock, flags);
>> +            pr_warn_ratelimited("md/raid10:%s: Failfast IO failure on %pg, "
>> +                "last device but ignoring it\n",
>> +                mdname(mddev), rdev->bdev);
>> +            return;
>> +        }
>>           set_bit(MD_BROKEN, &mddev->flags);
>>             if (!mddev->fail_last_dev) {
>> @@ -2413,6 +2425,7 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
>>                   continue;
>>           } else if (test_bit(FailFast, &rdev->flags)) {
>>               /* Just give up on this device */
>> +            set_bit(FailfastIOFailure, &rdev->flags);
>>               md_error(rdev->mddev, rdev);
>>               continue;
>>           }
>> @@ -2865,8 +2878,10 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
>>           freeze_array(conf, 1);
>>           fix_read_error(conf, mddev, r10_bio);
>>           unfreeze_array(conf);
>> -    } else
>> +    } else {
>> +        set_bit(FailfastIOFailure, &rdev->flags);
>>           md_error(mddev, rdev);
>> +    }
>>         rdev_dec_pending(rdev, mddev);
>>       r10_bio->state = 0;
> 
> -- 
> Thanks,
> Nan
> 
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 1/3] md/raid1,raid10: Do not set MD_BROKEN on failfast io failure
  2025-08-29 12:21     ` Kenta Akagi
@ 2025-08-30  8:48       ` Li Nan
  2025-08-30 18:10         ` Kenta Akagi
  0 siblings, 1 reply; 12+ messages in thread
From: Li Nan @ 2025-08-30  8:48 UTC (permalink / raw)
  To: Kenta Akagi, Li Nan, Song Liu, Yu Kuai, Mariusz Tkaczyk,
	Guoqing Jiang
  Cc: linux-raid, linux-kernel



On 2025/8/29 20:21, Kenta Akagi wrote:
> 
> 
> On 2025/08/29 11:54, Li Nan wrote:
>>
>> On 2025/8/29 0:32, Kenta Akagi wrote:
>>> This commit ensures that an MD_FAILFAST IO failure does not put
>>> the array into a broken state.
>>>
>>> When failfast is enabled on rdev in RAID1 or RAID10,
>>> the array may be flagged MD_BROKEN in the following cases.
>>> - If MD_FAILFAST IOs to multiple rdevs fail simultaneously
>>> - If an MD_FAILFAST metadata write to the 'last' rdev fails
>>
>> [...]
>>
>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>> index 408c26398321..8a61fd93b3ff 100644
>>> --- a/drivers/md/raid1.c
>>> +++ b/drivers/md/raid1.c
>>> @@ -470,6 +470,7 @@ static void raid1_end_write_request(struct bio *bio)
>>>                (bio->bi_opf & MD_FAILFAST) &&
>>>                /* We never try FailFast to WriteMostly devices */
>>>                !test_bit(WriteMostly, &rdev->flags)) {
>>> +            set_bit(FailfastIOFailure, &rdev->flags);
>>>                md_error(r1_bio->mddev, rdev);
>>>            }
>>>    @@ -1746,8 +1747,12 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>>>     *    - recovery is interrupted.
>>>     *    - &mddev->degraded is bumped.
>>>     *
>>> - * @rdev is marked as &Faulty excluding case when array is failed and
>>> - * &mddev->fail_last_dev is off.
>>> + * If @rdev has &FailfastIOFailure and it is the 'last' rdev,
>>> + * then @mddev and @rdev will not be marked as failed.
>>> + *
>>> + * @rdev is marked as &Faulty excluding any cases:
>>> + *    - when @mddev is failed and &mddev->fail_last_dev is off
>>> + *    - when @rdev is last device and &FailfastIOFailure flag is set
>>>     */
>>>    static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>>    {
>>> @@ -1758,6 +1763,13 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>>          if (test_bit(In_sync, &rdev->flags) &&
>>>            (conf->raid_disks - mddev->degraded) == 1) {
>>> +        if (test_and_clear_bit(FailfastIOFailure, &rdev->flags)) {
>>> +            spin_unlock_irqrestore(&conf->device_lock, flags);
>>> +            pr_warn_ratelimited("md/raid1:%s: Failfast IO failure on %pg, "
>>> +                "last device but ignoring it\n",
>>> +                mdname(mddev), rdev->bdev);
>>> +            return;
>>> +        }
>>>            set_bit(MD_BROKEN, &mddev->flags);
>>>              if (!mddev->fail_last_dev) {
>>> @@ -2148,6 +2160,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
>>>        if (test_bit(FailFast, &rdev->flags)) {
>>>            /* Don't try recovering from here - just fail it
>>>             * ... unless it is the last working device of course */
>>> +        set_bit(FailfastIOFailure, &rdev->flags);
>>>            md_error(mddev, rdev);
>>>            if (test_bit(Faulty, &rdev->flags))
>>>                /* Don't try to read from here, but make sure
>>> @@ -2652,6 +2665,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
>>>            fix_read_error(conf, r1_bio);
>>>            unfreeze_array(conf);
>>>        } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) {
>>> +        set_bit(FailfastIOFailure, &rdev->flags);
>>>            md_error(mddev, rdev);
>>>        } else {
>>>            r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
>>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>>> index b60c30bfb6c7..530ad6503189 100644
>>> --- a/drivers/md/raid10.c
>>> +++ b/drivers/md/raid10.c
>>> @@ -488,6 +488,7 @@ static void raid10_end_write_request(struct bio *bio)
>>>                dec_rdev = 0;
>>>                if (test_bit(FailFast, &rdev->flags) &&
>>>                    (bio->bi_opf & MD_FAILFAST)) {
>>> +                set_bit(FailfastIOFailure, &rdev->flags);
>>>                    md_error(rdev->mddev, rdev);
>>>                }
>>>    
>>
>> Thank you for the patch. There may be an issue with 'test_and_clear'.
>>
>> If two write IO go to the same rdev, MD_BROKEN may be set as below:
> 
>> IO1                    IO2
>> set FailfastIOFailure
>>                      set FailfastIOFailure
>>   md_error
>>    raid1_error
>>     test_and_clear FailfastIOFailur
>>                         md_error
>>                        raid1_error
>>                         //FailfastIOFailur is cleared
>>                         set MD_BROKEN
>>
>> Maybe we should check whether FailfastIOFailure is already set before
>> setting it. It also needs to be considered in metadata writes.
> Thank you for reviewing.
> 
> I agree, this seems to be as you described.
> So, should it be implemented as follows?
> 
> bool old=false;
> do{
>   spin_lock_irqsave(&conf->device_lock, flags);
>   old = test_and_set_bit(FailfastIOFailure, &rdev->flags);
>   spin_unlock_irqrestore(&conf->device_lock, flags);
> }while(old);
> 
> However, since I am concerned about potential deadlocks,
> so I am considering two alternative approaches:
> 
> * Add an atomic_t counter to md_rdev to track failfast IO failures.
> 
> This may set MD_BROKEN at a slightly incorrect timing, but mixing
> error handling of Failfast and non-Failfast IOs appears to be rare.
> In any case, the final outcome would be the same, i.e. the array
> ends up with MD_BROKEN. Therefore, I think this should not cause
> issues. I think the same applies to test_and_set_bit.
> 
> IO1                    IO2                    IO3
> FailfastIOFailure      Normal IOFailure       FailfastIOFailure
> atomic_inc
>                                                
>   md_error                                     atomic_inc
>    raid1_error
>     atomic_dec //2to1
>                         md_error
>                          raid1_error           md_error
>                           atomic_dec //1to0     raid1_error
>                                                  atomic_dec //0
>                                                    set MD_BROKEN
> 
> * Alternatively, create a separate error handler,
>    e.g. md_error_failfast(), that clearly does not fail the array.
> 
> This approach would require somewhat larger changes and may not
> be very elegant, but it seems to be a reliable way to ensure
> MD_BROKEN is never set at the wrong timing.
> 
> Which of these three approaches would you consider preferable?
> I would appreciate your feedback.
> 
> 
> For metadata writes, I plan to clear_bit MD_FAILFAST_SUPPORTED
> when the array is degraded.
> 
> Thanks,
> Akagi
> 

I took a closer look at the FailFast code and found a few issues, using
RAID1 as an example:

For normal read/write IO, FailFast is only triggered when another disk
is available, as seen in read_balance() and raid1_write_request().
In raid1_error(), MD_BROKEN is set only when no other disks are available.

So, FailFast for normal read/write is not triggered in the scenario you
described in the cover letter.

Normal writes may call md_error() in narrow_write_error. Normal reads do
not execute md_error() on the last disk.
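
For reference, that write path looks roughly like this (paraphrased
from handle_write_finished() in raid1.c, details omitted):

	} else if (r1_bio->bios[m] != NULL) {
		/* This drive got a write error: try to record the failed
		 * range as bad blocks first. */
		if (!narrow_write_error(r1_bio, m))
			/* Recording failed, so fail the rdev; this can
			 * happen on the last disk as well. */
			md_error(conf->mddev, conf->mirrors[m].rdev);
	}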

Perhaps you should get more information to confirm how MD_BROKEN is set in
normal read/write IO.

-- 
Thanks,
Nan


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 1/3] md/raid1,raid10: Do not set MD_BROKEN on failfast io failure
  2025-08-30  8:48       ` Li Nan
@ 2025-08-30 18:10         ` Kenta Akagi
  2025-09-01  3:22           ` Li Nan
  0 siblings, 1 reply; 12+ messages in thread
From: Kenta Akagi @ 2025-08-30 18:10 UTC (permalink / raw)
  To: Li Nan, Song Liu, Yu Kuai, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, Kenta Akagi



On 2025/08/30 17:48, Li Nan wrote:
> 
> 
> On 2025/8/29 20:21, Kenta Akagi wrote:
>>
>>
>> On 2025/08/29 11:54, Li Nan wrote:
>>>
>>> On 2025/8/29 0:32, Kenta Akagi wrote:
>>>> This commit ensures that an MD_FAILFAST IO failure does not put
>>>> the array into a broken state.
>>>>
>>>> When failfast is enabled on rdev in RAID1 or RAID10,
>>>> the array may be flagged MD_BROKEN in the following cases.
>>>> - If MD_FAILFAST IOs to multiple rdevs fail simultaneously
>>>> - If an MD_FAILFAST metadata write to the 'last' rdev fails
>>>
>>> [...]
>>>
>>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>>> index 408c26398321..8a61fd93b3ff 100644
>>>> --- a/drivers/md/raid1.c
>>>> +++ b/drivers/md/raid1.c
>>>> @@ -470,6 +470,7 @@ static void raid1_end_write_request(struct bio *bio)
>>>>                (bio->bi_opf & MD_FAILFAST) &&
>>>>                /* We never try FailFast to WriteMostly devices */
>>>>                !test_bit(WriteMostly, &rdev->flags)) {
>>>> +            set_bit(FailfastIOFailure, &rdev->flags);
>>>>                md_error(r1_bio->mddev, rdev);
>>>>            }
>>>>    @@ -1746,8 +1747,12 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>>>>     *    - recovery is interrupted.
>>>>     *    - &mddev->degraded is bumped.
>>>>     *
>>>> - * @rdev is marked as &Faulty excluding case when array is failed and
>>>> - * &mddev->fail_last_dev is off.
>>>> + * If @rdev has &FailfastIOFailure and it is the 'last' rdev,
>>>> + * then @mddev and @rdev will not be marked as failed.
>>>> + *
>>>> + * @rdev is marked as &Faulty excluding any cases:
>>>> + *    - when @mddev is failed and &mddev->fail_last_dev is off
>>>> + *    - when @rdev is last device and &FailfastIOFailure flag is set
>>>>     */
>>>>    static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>>>    {
>>>> @@ -1758,6 +1763,13 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>>>          if (test_bit(In_sync, &rdev->flags) &&
>>>>            (conf->raid_disks - mddev->degraded) == 1) {
>>>> +        if (test_and_clear_bit(FailfastIOFailure, &rdev->flags)) {
>>>> +            spin_unlock_irqrestore(&conf->device_lock, flags);
>>>> +            pr_warn_ratelimited("md/raid1:%s: Failfast IO failure on %pg, "
>>>> +                "last device but ignoring it\n",
>>>> +                mdname(mddev), rdev->bdev);
>>>> +            return;
>>>> +        }
>>>>            set_bit(MD_BROKEN, &mddev->flags);
>>>>              if (!mddev->fail_last_dev) {
>>>> @@ -2148,6 +2160,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
>>>>        if (test_bit(FailFast, &rdev->flags)) {
>>>>            /* Don't try recovering from here - just fail it
>>>>             * ... unless it is the last working device of course */
>>>> +        set_bit(FailfastIOFailure, &rdev->flags);
>>>>            md_error(mddev, rdev);
>>>>            if (test_bit(Faulty, &rdev->flags))
>>>>                /* Don't try to read from here, but make sure
>>>> @@ -2652,6 +2665,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
>>>>            fix_read_error(conf, r1_bio);
>>>>            unfreeze_array(conf);
>>>>        } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) {
>>>> +        set_bit(FailfastIOFailure, &rdev->flags);
>>>>            md_error(mddev, rdev);
>>>>        } else {
>>>>            r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
>>>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>>>> index b60c30bfb6c7..530ad6503189 100644
>>>> --- a/drivers/md/raid10.c
>>>> +++ b/drivers/md/raid10.c
>>>> @@ -488,6 +488,7 @@ static void raid10_end_write_request(struct bio *bio)
>>>>                dec_rdev = 0;
>>>>                if (test_bit(FailFast, &rdev->flags) &&
>>>>                    (bio->bi_opf & MD_FAILFAST)) {
>>>> +                set_bit(FailfastIOFailure, &rdev->flags);
>>>>                    md_error(rdev->mddev, rdev);
>>>>                }
>>>>    
>>>
>>> Thank you for the patch. There may be an issue with 'test_and_clear'.
>>>
>>> If two write IO go to the same rdev, MD_BROKEN may be set as below:
>>
>>> IO1                    IO2
>>> set FailfastIOFailure
>>>                      set FailfastIOFailure
>>>   md_error
>>>    raid1_error
>>>     test_and_clear FailfastIOFailur
>>>                         md_error
>>>                        raid1_error
>>>                         //FailfastIOFailur is cleared
>>>                         set MD_BROKEN
>>>
>>> Maybe we should check whether FailfastIOFailure is already set before
>>> setting it. It also needs to be considered in metadata writes.
>> Thank you for reviewing.
>>
>> I agree, this seems to be as you described.
>> So, should it be implemented as follows?
>>
>> bool old=false;
>> do{
>>   spin_lock_irqsave(&conf->device_lock, flags);
>>   old = test_and_set_bit(FailfastIOFailure, &rdev->flags);
>>   spin_unlock_irqrestore(&conf->device_lock, flags);
>> }while(old);
>>
>> However, since I am concerned about potential deadlocks,
>> so I am considering two alternative approaches:
>>
>> * Add an atomic_t counter to md_rdev to track failfast IO failures.
>>
>> This may set MD_BROKEN at a slightly incorrect timing, but mixing
>> error handling of Failfast and non-Failfast IOs appears to be rare.
>> In any case, the final outcome would be the same, i.e. the array
>> ends up with MD_BROKEN. Therefore, I think this should not cause
>> issues. I think the same applies to test_and_set_bit.
>>
>> IO1                    IO2                    IO3
>> FailfastIOFailure      Normal IOFailure       FailfastIOFailure
>> atomic_inc
>>
>>   md_error                                     atomic_inc
>>    raid1_error
>>     atomic_dec //2to1
>>                         md_error
>>                          raid1_error           md_error
>>                           atomic_dec //1to0     raid1_error
>>                                                  atomic_dec //0
>>                                                    set MD_BROKEN
>>
>> * Alternatively, create a separate error handler,
>>    e.g. md_error_failfast(), that clearly does not fail the array.
>>
>> This approach would require somewhat larger changes and may not
>> be very elegant, but it seems to be a reliable way to ensure
>> MD_BROKEN is never set at the wrong timing.
>>
>> Which of these three approaches would you consider preferable?
>> I would appreciate your feedback.
>>
>>
>> For metadata writes, I plan to clear_bit MD_FAILFAST_SUPPORTED
>> when the array is degraded.
>>
>> Thanks,
>> Akagi
>>
> 
> I took a closer look at the FailFast code and found a few issues, using
> RAID1 as an example:
> 
> For normal read/write IO, FailFast is only triggered when there is another
> disk is available, as seen in read_balance() and raid1_write_request().
> In raid1_error(), MD_BROKEN is set only when no other disks are available.

Hi,
Agree, I think so too.

> So, the FailFast for normal read/write is not triggered in the scenario you
> described in cover-letter.

This corresponds to the case described in the commit message of PATCH v3 1/3.
"Normally, MD_FAILFAST IOs are not issued to the 'last' rdev, so this is
an edge case; however, it can occur if rdevs in a non-degraded
array share the same path and that path is lost, or if a metadata
write is triggered in a degraded array and fails due to failfast."

To describe it in more detail, the flow is as follows:

Prerequisites:

- Both rdevs are in-sync
- Both rdevs have in-flight MD_FAILFAST bios
- Both rdevs depend on the same lower-level path
  (e.g., nvme-tcp over a single Ethernet interface)

Sequence:

- A bio with REQ_FAILFAST_DEV fails (e.g., due to a temporary network outage);
  in the case of nvme-tcp:
   - The Ethernet connection is lost for more than 5 seconds on the node
     where md is running
   - Then the connection is restored. I don't know the details of the
     nvme-tcp implementation, but it seems that failfast IOs complete
     only after the connection is back.
- All failfast bios fail, raid1_end_write_request is called.
- md_error() marks one rdev Faulty; the other rdev becomes the 'last' rdev.
- md_error() on the last rdev sets MD_BROKEN on the array (fail_last_dev=1 is unlikely to be set).
- The write is retried via handle_write_finished -> narrow_write_error, usually succeeding.
- MD_BROKEN remains set, leaving the array in a state where no further writes can occur.

> Normal writes may call md_error() in narrow_write_error. Normal reads do
> not execute md_error() on the last disk.
> 
> Perhaps you should get more information to confirm how MD_BROKEN is set in
> normal read/write IO.

Should I add the above sequence of events to the cover letter, or commit message?

Thanks,
Akagi

> -- 
> Thanks,
> Nan
> 
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 1/3] md/raid1,raid10: Do not set MD_BROKEN on failfast io failure
  2025-08-30 18:10         ` Kenta Akagi
@ 2025-09-01  3:22           ` Li Nan
  2025-09-01  4:22             ` Kenta Akagi
  0 siblings, 1 reply; 12+ messages in thread
From: Li Nan @ 2025-09-01  3:22 UTC (permalink / raw)
  To: Kenta Akagi, Li Nan, Song Liu, Yu Kuai, Mariusz Tkaczyk,
	Guoqing Jiang
  Cc: linux-raid, linux-kernel



On 2025/8/31 2:10, Kenta Akagi wrote:
> 
> 
> On 2025/08/30 17:48, Li Nan wrote:
>>
>>
>> On 2025/8/29 20:21, Kenta Akagi wrote:
>>>
>>>
>>> On 2025/08/29 11:54, Li Nan wrote:
>>>>
>>>> On 2025/8/29 0:32, Kenta Akagi wrote:
>>>>> This commit ensures that an MD_FAILFAST IO failure does not put
>>>>> the array into a broken state.
>>>>>
>>>>> When failfast is enabled on rdev in RAID1 or RAID10,
>>>>> the array may be flagged MD_BROKEN in the following cases.
>>>>> - If MD_FAILFAST IOs to multiple rdevs fail simultaneously
>>>>> - If an MD_FAILFAST metadata write to the 'last' rdev fails
>>>>
>>>> [...]
>>>>
>>>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>>>> index 408c26398321..8a61fd93b3ff 100644
>>>>> --- a/drivers/md/raid1.c
>>>>> +++ b/drivers/md/raid1.c
>>>>> @@ -470,6 +470,7 @@ static void raid1_end_write_request(struct bio *bio)
>>>>>                 (bio->bi_opf & MD_FAILFAST) &&
>>>>>                 /* We never try FailFast to WriteMostly devices */
>>>>>                 !test_bit(WriteMostly, &rdev->flags)) {
>>>>> +            set_bit(FailfastIOFailure, &rdev->flags);
>>>>>                 md_error(r1_bio->mddev, rdev);
>>>>>             }
>>>>>     @@ -1746,8 +1747,12 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>>>>>      *    - recovery is interrupted.
>>>>>      *    - &mddev->degraded is bumped.
>>>>>      *
>>>>> - * @rdev is marked as &Faulty excluding case when array is failed and
>>>>> - * &mddev->fail_last_dev is off.
>>>>> + * If @rdev has &FailfastIOFailure and it is the 'last' rdev,
>>>>> + * then @mddev and @rdev will not be marked as failed.
>>>>> + *
>>>>> + * @rdev is marked as &Faulty excluding any cases:
>>>>> + *    - when @mddev is failed and &mddev->fail_last_dev is off
>>>>> + *    - when @rdev is last device and &FailfastIOFailure flag is set
>>>>>      */
>>>>>     static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>>>>     {
>>>>> @@ -1758,6 +1763,13 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>>>>           if (test_bit(In_sync, &rdev->flags) &&
>>>>>             (conf->raid_disks - mddev->degraded) == 1) {
>>>>> +        if (test_and_clear_bit(FailfastIOFailure, &rdev->flags)) {
>>>>> +            spin_unlock_irqrestore(&conf->device_lock, flags);
>>>>> +            pr_warn_ratelimited("md/raid1:%s: Failfast IO failure on %pg, "
>>>>> +                "last device but ignoring it\n",
>>>>> +                mdname(mddev), rdev->bdev);
>>>>> +            return;
>>>>> +        }
>>>>>             set_bit(MD_BROKEN, &mddev->flags);
>>>>>               if (!mddev->fail_last_dev) {
>>>>> @@ -2148,6 +2160,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
>>>>>         if (test_bit(FailFast, &rdev->flags)) {
>>>>>             /* Don't try recovering from here - just fail it
>>>>>              * ... unless it is the last working device of course */
>>>>> +        set_bit(FailfastIOFailure, &rdev->flags);
>>>>>             md_error(mddev, rdev);
>>>>>             if (test_bit(Faulty, &rdev->flags))
>>>>>                 /* Don't try to read from here, but make sure
>>>>> @@ -2652,6 +2665,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
>>>>>             fix_read_error(conf, r1_bio);
>>>>>             unfreeze_array(conf);
>>>>>         } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) {
>>>>> +        set_bit(FailfastIOFailure, &rdev->flags);
>>>>>             md_error(mddev, rdev);
>>>>>         } else {
>>>>>             r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
>>>>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>>>>> index b60c30bfb6c7..530ad6503189 100644
>>>>> --- a/drivers/md/raid10.c
>>>>> +++ b/drivers/md/raid10.c
>>>>> @@ -488,6 +488,7 @@ static void raid10_end_write_request(struct bio *bio)
>>>>>                 dec_rdev = 0;
>>>>>                 if (test_bit(FailFast, &rdev->flags) &&
>>>>>                     (bio->bi_opf & MD_FAILFAST)) {
>>>>> +                set_bit(FailfastIOFailure, &rdev->flags);
>>>>>                     md_error(rdev->mddev, rdev);
>>>>>                 }
>>>>>     
>>>>
>>>> Thank you for the patch. There may be an issue with 'test_and_clear'.
>>>>
>>>> If two write IO go to the same rdev, MD_BROKEN may be set as below:
>>>
>>>> IO1                    IO2
>>>> set FailfastIOFailure
>>>>                       set FailfastIOFailure
>>>>    md_error
>>>>     raid1_error
>>>>      test_and_clear FailfastIOFailur
>>>>                          md_error
>>>>                         raid1_error
>>>>                          //FailfastIOFailur is cleared
>>>>                          set MD_BROKEN
>>>>
>>>> Maybe we should check whether FailfastIOFailure is already set before
>>>> setting it. It also needs to be considered in metadata writes.
>>> Thank you for reviewing.
>>>
>>> I agree, this seems to be as you described.
>>> So, should it be implemented as follows?
>>>
>>> bool old=false;
>>> do{
>>>    spin_lock_irqsave(&conf->device_lock, flags);
>>>    old = test_and_set_bit(FailfastIOFailure, &rdev->flags);
>>>    spin_unlock_irqrestore(&conf->device_lock, flags);
>>> }while(old);
>>>
>>> However, since I am concerned about potential deadlocks,
>>> so I am considering two alternative approaches:
>>>
>>> * Add an atomic_t counter to md_rdev to track failfast IO failures.
>>>
>>> This may set MD_BROKEN at a slightly incorrect timing, but mixing
>>> error handling of Failfast and non-Failfast IOs appears to be rare.
>>> In any case, the final outcome would be the same, i.e. the array
>>> ends up with MD_BROKEN. Therefore, I think this should not cause
>>> issues. I think the same applies to test_and_set_bit.
>>>
>>> IO1                    IO2                    IO3
>>> FailfastIOFailure      Normal IOFailure       FailfastIOFailure
>>> atomic_inc
>>>                                                   md_error                                     atomic_inc
>>>     raid1_error
>>>      atomic_dec //2to1
>>>                          md_error
>>>                           raid1_error           md_error
>>>                            atomic_dec //1to0     raid1_error
>>>                                                   atomic_dec //0
>>>                                                     set MD_BROKEN
>>>
>>> * Alternatively, create a separate error handler,
>>>     e.g. md_error_failfast(), that clearly does not fail the array.
>>>
>>> This approach would require somewhat larger changes and may not
>>> be very elegant, but it seems to be a reliable way to ensure
>>> MD_BROKEN is never set at the wrong timing.
>>>
>>> Which of these three approaches would you consider preferable?
>>> I would appreciate your feedback.
>>>
>>>
>>> For metadata writes, I plan to clear_bit MD_FAILFAST_SUPPORTED
>>> when the array is degraded.
>>>
>>> Thanks,
>>> Akagi
>>>
>>
>> I took a closer look at the FailFast code and found a few issues, using
>> RAID1 as an example:
>>
>> For normal read/write IO, FailFast is only triggered when there is another
>> disk is available, as seen in read_balance() and raid1_write_request().
>> In raid1_error(), MD_BROKEN is set only when no other disks are available.
> 
> Hi,
> Agree, I think so too.
> 
>> So, the FailFast for normal read/write is not triggered in the scenario you
>> described in cover-letter.
> 
> This corresponds to the case described in the commit message of PATCH v3 1/3.
> "Normally, MD_FAILFAST IOs are not issued to the 'last' rdev, so this is
> an edge case; however, it can occur if rdevs in a non-degraded
> array share the same path and that path is lost, or if a metadata
> write is triggered in a degraded array and fails due to failfast."
> 
> To describe it in more detail, the flow is as follows:
> 
> Prerequisites:
> 
> - Both rdevs are in-sync
> - Both rdevs have in-flight MD_FAILFAST bios
> - Both rdevs depend on the same lower-level path
>    (e.g., nvme-tcp over a single Ethernet interface)
> 
> Sequence:
> 
> - A bios with REQ_FAILFAST_DEV fails (e.g., due to a temporary network outage),
>    in the case of nvme-tcp:
>     - The Ethernet connection is lost on the node where md is running over 5 seconds
>     - Then the connection is restored. Idk the details of nvme-tcp implementation,
>       but it seems that failfast IOs finish only after the connection is back.
> - All failfast bios fail, raid1_end_write_request is called.
> - md_error() marks one rdev Faulty; the other rdev becomes the 'last' rdev.
> - md_error() on the last rdev sets MD_BROKEN on the array - fail_last_dev=1 is unlikely.
> - The write is retried via handle_write_finished -> narrow_write_error, usually succeeding.
> - MD_BROKEN remains set, leaving the array in a state where no further writes can occur.
> 

Thanks for your patient explanation. I understand. Maybe we need a separate
error-handling path for failfast. How about adding an extra parameter to 
md_error()?

Kuai, do you have any suggestions?

>> Normal writes may call md_error() in narrow_write_error. Normal reads do
>> not execute md_error() on the last disk.
>>
>> Perhaps you should get more information to confirm how MD_BROKEN is set in
>> normal read/write IO.
> 
> Should I add the above sequence of events to the cover letter, or commit message?
> 

I think we should mention this in the commit message.

> Thanks,
> Akagi
> 
>> -- 
>> Thanks,
>> Nan
>>
>>
> 
> 
> .

-- 
Thanks,
Nan


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 1/3] md/raid1,raid10: Do not set MD_BROKEN on failfast io failure
  2025-09-01  3:22           ` Li Nan
@ 2025-09-01  4:22             ` Kenta Akagi
  2025-09-01  7:48               ` Yu Kuai
  0 siblings, 1 reply; 12+ messages in thread
From: Kenta Akagi @ 2025-09-01  4:22 UTC (permalink / raw)
  To: Li Nan, Song Liu, Yu Kuai, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, Kenta Akagi



On 2025/09/01 12:22, Li Nan wrote:
> 
> 
> On 2025/8/31 2:10, Kenta Akagi wrote:
>>
>>
>> On 2025/08/30 17:48, Li Nan wrote:
>>>
>>>
>>> On 2025/8/29 20:21, Kenta Akagi wrote:
>>>>
>>>>
>>>> On 2025/08/29 11:54, Li Nan wrote:
>>>>>
>>>>> On 2025/8/29 0:32, Kenta Akagi wrote:
>>>>>> This commit ensures that an MD_FAILFAST IO failure does not put
>>>>>> the array into a broken state.
>>>>>>
>>>>>> When failfast is enabled on rdev in RAID1 or RAID10,
>>>>>> the array may be flagged MD_BROKEN in the following cases.
>>>>>> - If MD_FAILFAST IOs to multiple rdevs fail simultaneously
>>>>>> - If an MD_FAILFAST metadata write to the 'last' rdev fails
>>>>>
>>>>> [...]
>>>>>
>>>>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>>>>> index 408c26398321..8a61fd93b3ff 100644
>>>>>> --- a/drivers/md/raid1.c
>>>>>> +++ b/drivers/md/raid1.c
>>>>>> @@ -470,6 +470,7 @@ static void raid1_end_write_request(struct bio *bio)
>>>>>>                 (bio->bi_opf & MD_FAILFAST) &&
>>>>>>                 /* We never try FailFast to WriteMostly devices */
>>>>>>                 !test_bit(WriteMostly, &rdev->flags)) {
>>>>>> +            set_bit(FailfastIOFailure, &rdev->flags);
>>>>>>                 md_error(r1_bio->mddev, rdev);
>>>>>>             }
>>>>>>     @@ -1746,8 +1747,12 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>>>>>>      *    - recovery is interrupted.
>>>>>>      *    - &mddev->degraded is bumped.
>>>>>>      *
>>>>>> - * @rdev is marked as &Faulty excluding case when array is failed and
>>>>>> - * &mddev->fail_last_dev is off.
>>>>>> + * If @rdev has &FailfastIOFailure and it is the 'last' rdev,
>>>>>> + * then @mddev and @rdev will not be marked as failed.
>>>>>> + *
>>>>>> + * @rdev is marked as &Faulty excluding any cases:
>>>>>> + *    - when @mddev is failed and &mddev->fail_last_dev is off
>>>>>> + *    - when @rdev is last device and &FailfastIOFailure flag is set
>>>>>>      */
>>>>>>     static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>>>>>     {
>>>>>> @@ -1758,6 +1763,13 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>>>>>           if (test_bit(In_sync, &rdev->flags) &&
>>>>>>             (conf->raid_disks - mddev->degraded) == 1) {
>>>>>> +        if (test_and_clear_bit(FailfastIOFailure, &rdev->flags)) {
>>>>>> +            spin_unlock_irqrestore(&conf->device_lock, flags);
>>>>>> +            pr_warn_ratelimited("md/raid1:%s: Failfast IO failure on %pg, "
>>>>>> +                "last device but ignoring it\n",
>>>>>> +                mdname(mddev), rdev->bdev);
>>>>>> +            return;
>>>>>> +        }
>>>>>>             set_bit(MD_BROKEN, &mddev->flags);
>>>>>>               if (!mddev->fail_last_dev) {
>>>>>> @@ -2148,6 +2160,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
>>>>>>         if (test_bit(FailFast, &rdev->flags)) {
>>>>>>             /* Don't try recovering from here - just fail it
>>>>>>              * ... unless it is the last working device of course */
>>>>>> +        set_bit(FailfastIOFailure, &rdev->flags);
>>>>>>             md_error(mddev, rdev);
>>>>>>             if (test_bit(Faulty, &rdev->flags))
>>>>>>                 /* Don't try to read from here, but make sure
>>>>>> @@ -2652,6 +2665,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
>>>>>>             fix_read_error(conf, r1_bio);
>>>>>>             unfreeze_array(conf);
>>>>>>         } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) {
>>>>>> +        set_bit(FailfastIOFailure, &rdev->flags);
>>>>>>             md_error(mddev, rdev);
>>>>>>         } else {
>>>>>>             r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
>>>>>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>>>>>> index b60c30bfb6c7..530ad6503189 100644
>>>>>> --- a/drivers/md/raid10.c
>>>>>> +++ b/drivers/md/raid10.c
>>>>>> @@ -488,6 +488,7 @@ static void raid10_end_write_request(struct bio *bio)
>>>>>>                 dec_rdev = 0;
>>>>>>                 if (test_bit(FailFast, &rdev->flags) &&
>>>>>>                     (bio->bi_opf & MD_FAILFAST)) {
>>>>>> +                set_bit(FailfastIOFailure, &rdev->flags);
>>>>>>                     md_error(rdev->mddev, rdev);
>>>>>>                 }
>>>>>>     
>>>>>
>>>>> Thank you for the patch. There may be an issue with 'test_and_clear'.
>>>>>
>>>>> If two write IO go to the same rdev, MD_BROKEN may be set as below:
>>>>
>>>>> IO1                    IO2
>>>>> set FailfastIOFailure
>>>>>                       set FailfastIOFailure
>>>>>    md_error
>>>>>     raid1_error
>>>>>      test_and_clear FailfastIOFailur
>>>>>                          md_error
>>>>>                         raid1_error
>>>>>                          //FailfastIOFailur is cleared
>>>>>                          set MD_BROKEN
>>>>>
>>>>> Maybe we should check whether FailfastIOFailure is already set before
>>>>> setting it. It also needs to be considered in metadata writes.
>>>> Thank you for reviewing.
>>>>
>>>> I agree, this seems to be as you described.
>>>> So, should it be implemented as follows?
>>>>
>>>> bool old=false;
>>>> do{
>>>>    spin_lock_irqsave(&conf->device_lock, flags);
>>>>    old = test_and_set_bit(FailfastIOFailure, &rdev->flags);
>>>>    spin_unlock_irqrestore(&conf->device_lock, flags);
>>>> }while(old);
>>>>
>>>> However, since I am concerned about potential deadlocks,
>>>> so I am considering two alternative approaches:
>>>>
>>>> * Add an atomic_t counter to md_rdev to track failfast IO failures.
>>>>
>>>> This may set MD_BROKEN at a slightly incorrect timing, but mixing
>>>> error handling of Failfast and non-Failfast IOs appears to be rare.
>>>> In any case, the final outcome would be the same, i.e. the array
>>>> ends up with MD_BROKEN. Therefore, I think this should not cause
>>>> issues. I think the same applies to test_and_set_bit.
>>>>
>>>> IO1                    IO2                    IO3
>>>> FailfastIOFailure      Normal IOFailure       FailfastIOFailure
>>>> atomic_inc
>>>>                                                   md_error                                     atomic_inc
>>>>     raid1_error
>>>>      atomic_dec //2to1
>>>>                          md_error
>>>>                           raid1_error           md_error
>>>>                            atomic_dec //1to0     raid1_error
>>>>                                                   atomic_dec //0
>>>>                                                     set MD_BROKEN
>>>>
>>>> * Alternatively, create a separate error handler,
>>>>     e.g. md_error_failfast(), that clearly does not fail the array.
>>>>
>>>> This approach would require somewhat larger changes and may not
>>>> be very elegant, but it seems to be a reliable way to ensure
>>>> MD_BROKEN is never set at the wrong timing.
>>>>
>>>> Which of these three approaches would you consider preferable?
>>>> I would appreciate your feedback.
>>>>
>>>>
>>>> For metadata writes, I plan to clear_bit MD_FAILFAST_SUPPORTED
>>>> when the array is degraded.
>>>>
>>>> Thanks,
>>>> Akagi
>>>>
>>>
>>> I took a closer look at the FailFast code and found a few issues, using
>>> RAID1 as an example:
>>>
>>> For normal read/write IO, FailFast is only triggered when there is another
>>> disk is available, as seen in read_balance() and raid1_write_request().
>>> In raid1_error(), MD_BROKEN is set only when no other disks are available.
>>
>> Hi,
>> Agree, I think so too.
>>
>>> So, the FailFast for normal read/write is not triggered in the scenario you
>>> described in cover-letter.
>>
>> This corresponds to the case described in the commit message of PATCH v3 1/3.
>> "Normally, MD_FAILFAST IOs are not issued to the 'last' rdev, so this is
>> an edge case; however, it can occur if rdevs in a non-degraded
>> array share the same path and that path is lost, or if a metadata
>> write is triggered in a degraded array and fails due to failfast."
>>
>> To describe it in more detail, the flow is as follows:
>>
>> Prerequisites:
>>
>> - Both rdevs are in-sync
>> - Both rdevs have in-flight MD_FAILFAST bios
>> - Both rdevs depend on the same lower-level path
>>    (e.g., nvme-tcp over a single Ethernet interface)
>>
>> Sequence:
>>
>> - A bios with REQ_FAILFAST_DEV fails (e.g., due to a temporary network outage),
>>    in the case of nvme-tcp:
>>     - The Ethernet connection is lost on the node where md is running over 5 seconds
>>     - Then the connection is restored. Idk the details of nvme-tcp implementation,
>>       but it seems that failfast IOs finish only after the connection is back.
>> - All failfast bios fail, raid1_end_write_request is called.
>> - md_error() marks one rdev Faulty; the other rdev becomes the 'last' rdev.
>> - md_error() on the last rdev sets MD_BROKEN on the array - fail_last_dev=1 is unlikely.
>> - The write is retried via handle_write_finished -> narrow_write_error, usually succeeding.
>> - MD_BROKEN remains set, leaving the array in a state where no further writes can occur.
>>
> 
> Thanks for your patient explanation. I understand. Maybe we need a separate
> error-handling path for failfast. How about adding an extra parameter to md_error()?

Thank you for reviewing.

I am thinking of proceeding as follows.
md_error is an EXPORT_SYMBOL, so I think it is undesirable to change this function's ABI.

...
diff --git a/drivers/md/md.c b/drivers/md/md.c
index ac85ec73a409..855cddeb0c09 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8197,3 +8197,3 @@ EXPORT_SYMBOL(md_unregister_thread);

-void md_error(struct mddev *mddev, struct md_rdev *rdev)
+void _md_error(struct mddev *mddev, struct md_rdev *rdev, bool nofail)
 {
@@ -8204,3 +8204,3 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
                return;
-       mddev->pers->error_handler(mddev, rdev);
+       mddev->pers->error_handler(mddev, rdev, nofail);

@@ -8222,4 +8222,26 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
 }
+
+void md_error(struct mddev *mddev, struct md_rdev *rdev)
+{
+       return _md_error(mddev, rdev, false);
+}
 EXPORT_SYMBOL(md_error);

+void md_error_failfast(struct mddev *mddev, struct md_rdev *rdev)
+{
+       WARN_ON(mddev->pers->head.id != ID_RAID1 &&
+               mddev->pers->head.id != ID_RAID10);
+       return _md_error(mddev, rdev, true);
+}
+EXPORT_SYMBOL(md_error_failfast);
+
 /* seq_file implementation /proc/mdstat */
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 51af29a03079..6ca1aea630ce 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -758,3 +758,3 @@ struct md_personality
         */
-       void (*error_handler)(struct mddev *mddev, struct md_rdev *rdev);
+       void (*error_handler)(struct mddev *mddev, struct md_rdev *rdev, bool nofail);
        int (*hot_add_disk) (struct mddev *mddev, struct md_rdev *rdev);
@@ -903,3 +903,5 @@ extern void md_write_end(struct mddev *mddev);
 extern void md_done_sync(struct mddev *mddev, int blocks, int ok);
+void _md_error(struct mddev *mddev, struct md_rdev *rdev, bool nofail);
 extern void md_error(struct mddev *mddev, struct md_rdev *rdev);
+extern void md_error_failfast(struct mddev *mddev, struct md_rdev *rdev);
 extern void md_finish_reshape(struct mddev *mddev);
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index f1d8811a542a..8aea51227a96 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -637,3 +637,4 @@ static void raid0_status(struct seq_file *seq, struct mddev *mddev)

-static void raid0_error(struct mddev *mddev, struct md_rdev *rdev)
+static void raid0_error(struct mddev *mddev, struct md_rdev *rdev,
+       bool nofail __maybe_unused)
 {
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 408c26398321..d93275899e9e 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1739,2 +1739,3 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
  * @rdev: member device to fail.
+ * @nofail: when set, @mddev and @rdev must not be failed even if @rdev is the last device
  *
@@ -1748,6 +1749,8 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
  *
- * @rdev is marked as &Faulty excluding case when array is failed and
- * &mddev->fail_last_dev is off.
+ * @rdev is marked as &Faulty excluding any cases:
+ *     - when @mddev is failed and &mddev->fail_last_dev is off
+ *     - when @rdev is last device and @nofail is true
  */
-static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
+static void raid1_error(struct mddev *mddev, struct md_rdev *rdev,
+       bool nofail)
 {
@@ -1760,2 +1763,9 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
            (conf->raid_disks - mddev->degraded) == 1) {
+               if (nofail) {
+                       spin_unlock_irqrestore(&conf->device_lock, flags);
+                       pr_warn_ratelimited("md/raid1:%s: IO failure on %pg, "
+                               "last device but ignoring it\n",
+                               mdname(mddev), rdev->bdev);
+                       return;
+               }
                set_bit(MD_BROKEN, &mddev->flags);
...

> Kuai, do you have any suggestions?
> 
>>> Normal writes may call md_error() in narrow_write_error. Normal reads do
>>> not execute md_error() on the last disk.
>>>
>>> Perhaps you should get more information to confirm how MD_BROKEN is set in
>>> normal read/write IO.
>>
>> Should I add the above sequence of events to the cover letter, or commit message?
>>
> 
> I think we should mention this in the commit message.

Understood. I will explicitly describe this in the commit message in v4.

Thanks,
Akagi

>> Thanks,
>> Akagi
>>
>>> -- 
>>> Thanks,
>>> Nan
>>>
>>>
>>
>>
>> .
> 
> -- 
> Thanks,
> Nan
> 
> 


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 1/3] md/raid1,raid10: Do not set MD_BROKEN on failfast io failure
  2025-09-01  4:22             ` Kenta Akagi
@ 2025-09-01  7:48               ` Yu Kuai
  2025-09-01 16:48                 ` Kenta Akagi
  0 siblings, 1 reply; 12+ messages in thread
From: Yu Kuai @ 2025-09-01  7:48 UTC (permalink / raw)
  To: Kenta Akagi, Li Nan, Song Liu, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, yukuai (C)

Hi,

On 2025/09/01 12:22, Kenta Akagi wrote:
> 
> 
> On 2025/09/01 12:22, Li Nan wrote:
>>
>>
>> On 2025/8/31 2:10, Kenta Akagi wrote:
>>>
>>>
>>> On 2025/08/30 17:48, Li Nan wrote:
>>>>
>>>>
>>>> On 2025/8/29 20:21, Kenta Akagi wrote:
>>>>>
>>>>>
>>>>> On 2025/08/29 11:54, Li Nan wrote:
>>>>>>
>>>>>> On 2025/8/29 0:32, Kenta Akagi wrote:
>>>>>>> This commit ensures that an MD_FAILFAST IO failure does not put
>>>>>>> the array into a broken state.
>>>>>>>
>>>>>>> When failfast is enabled on rdev in RAID1 or RAID10,
>>>>>>> the array may be flagged MD_BROKEN in the following cases.
>>>>>>> - If MD_FAILFAST IOs to multiple rdevs fail simultaneously
>>>>>>> - If an MD_FAILFAST metadata write to the 'last' rdev fails
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>>>>>> index 408c26398321..8a61fd93b3ff 100644
>>>>>>> --- a/drivers/md/raid1.c
>>>>>>> +++ b/drivers/md/raid1.c
>>>>>>> @@ -470,6 +470,7 @@ static void raid1_end_write_request(struct bio *bio)
>>>>>>>                  (bio->bi_opf & MD_FAILFAST) &&
>>>>>>>                  /* We never try FailFast to WriteMostly devices */
>>>>>>>                  !test_bit(WriteMostly, &rdev->flags)) {
>>>>>>> +            set_bit(FailfastIOFailure, &rdev->flags);
>>>>>>>                  md_error(r1_bio->mddev, rdev);
>>>>>>>              }
>>>>>>>      @@ -1746,8 +1747,12 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>>>>>>>       *    - recovery is interrupted.
>>>>>>>       *    - &mddev->degraded is bumped.
>>>>>>>       *
>>>>>>> - * @rdev is marked as &Faulty excluding case when array is failed and
>>>>>>> - * &mddev->fail_last_dev is off.
>>>>>>> + * If @rdev has &FailfastIOFailure and it is the 'last' rdev,
>>>>>>> + * then @mddev and @rdev will not be marked as failed.
>>>>>>> + *
>>>>>>> + * @rdev is marked as &Faulty excluding any cases:
>>>>>>> + *    - when @mddev is failed and &mddev->fail_last_dev is off
>>>>>>> + *    - when @rdev is last device and &FailfastIOFailure flag is set
>>>>>>>       */
>>>>>>>      static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>>>>>>      {
>>>>>>> @@ -1758,6 +1763,13 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>>>>>>            if (test_bit(In_sync, &rdev->flags) &&
>>>>>>>              (conf->raid_disks - mddev->degraded) == 1) {
>>>>>>> +        if (test_and_clear_bit(FailfastIOFailure, &rdev->flags)) {
>>>>>>> +            spin_unlock_irqrestore(&conf->device_lock, flags);
>>>>>>> +            pr_warn_ratelimited("md/raid1:%s: Failfast IO failure on %pg, "
>>>>>>> +                "last device but ignoring it\n",
>>>>>>> +                mdname(mddev), rdev->bdev);
>>>>>>> +            return;
>>>>>>> +        }
>>>>>>>              set_bit(MD_BROKEN, &mddev->flags);
>>>>>>>                if (!mddev->fail_last_dev) {
>>>>>>> @@ -2148,6 +2160,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
>>>>>>>          if (test_bit(FailFast, &rdev->flags)) {
>>>>>>>              /* Don't try recovering from here - just fail it
>>>>>>>               * ... unless it is the last working device of course */
>>>>>>> +        set_bit(FailfastIOFailure, &rdev->flags);
>>>>>>>              md_error(mddev, rdev);
>>>>>>>              if (test_bit(Faulty, &rdev->flags))
>>>>>>>                  /* Don't try to read from here, but make sure
>>>>>>> @@ -2652,6 +2665,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
>>>>>>>              fix_read_error(conf, r1_bio);
>>>>>>>              unfreeze_array(conf);
>>>>>>>          } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) {
>>>>>>> +        set_bit(FailfastIOFailure, &rdev->flags);
>>>>>>>              md_error(mddev, rdev);
>>>>>>>          } else {
>>>>>>>              r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
>>>>>>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>>>>>>> index b60c30bfb6c7..530ad6503189 100644
>>>>>>> --- a/drivers/md/raid10.c
>>>>>>> +++ b/drivers/md/raid10.c
>>>>>>> @@ -488,6 +488,7 @@ static void raid10_end_write_request(struct bio *bio)
>>>>>>>                  dec_rdev = 0;
>>>>>>>                  if (test_bit(FailFast, &rdev->flags) &&
>>>>>>>                      (bio->bi_opf & MD_FAILFAST)) {
>>>>>>> +                set_bit(FailfastIOFailure, &rdev->flags);
>>>>>>>                      md_error(rdev->mddev, rdev);
>>>>>>>                  }
>>>>>>>      
>>>>>>
>>>>>> Thank you for the patch. There may be an issue with 'test_and_clear'.
>>>>>>
>>>>>> If two write IO go to the same rdev, MD_BROKEN may be set as below:
>>>>>
>>>>>> IO1                    IO2
>>>>>> set FailfastIOFailure
>>>>>>                        set FailfastIOFailure
>>>>>>     md_error
>>>>>>      raid1_error
>>>>>>       test_and_clear FailfastIOFailur
>>>>>>                           md_error
>>>>>>                          raid1_error
>>>>>>                           //FailfastIOFailur is cleared
>>>>>>                           set MD_BROKEN
>>>>>>
>>>>>> Maybe we should check whether FailfastIOFailure is already set before
>>>>>> setting it. It also needs to be considered in metadata writes.
>>>>> Thank you for reviewing.
>>>>>
>>>>> I agree, this seems to be as you described.
>>>>> So, should it be implemented as follows?
>>>>>
>>>>> bool old=false;
>>>>> do{
>>>>>     spin_lock_irqsave(&conf->device_lock, flags);
>>>>>     old = test_and_set_bit(FailfastIOFailure, &rdev->flags);
>>>>>     spin_unlock_irqrestore(&conf->device_lock, flags);
>>>>> }while(old);
>>>>>
>>>>> However, since I am concerned about potential deadlocks,
>>>>> so I am considering two alternative approaches:
>>>>>
>>>>> * Add an atomic_t counter to md_rdev to track failfast IO failures.
>>>>>
>>>>> This may set MD_BROKEN at a slightly incorrect timing, but mixing
>>>>> error handling of Failfast and non-Failfast IOs appears to be rare.
>>>>> In any case, the final outcome would be the same, i.e. the array
>>>>> ends up with MD_BROKEN. Therefore, I think this should not cause
>>>>> issues. I think the same applies to test_and_set_bit.
>>>>>
>>>>> IO1                    IO2                    IO3
>>>>> FailfastIOFailure      Normal IOFailure       FailfastIOFailure
>>>>> atomic_inc
>>>>>                                                    md_error                                     atomic_inc
>>>>>      raid1_error
>>>>>       atomic_dec //2to1
>>>>>                           md_error
>>>>>                            raid1_error           md_error
>>>>>                             atomic_dec //1to0     raid1_error
>>>>>                                                    atomic_dec //0
>>>>>                                                      set MD_BROKEN
>>>>>
>>>>> * Alternatively, create a separate error handler,
>>>>>      e.g. md_error_failfast(), that clearly does not fail the array.
>>>>>
>>>>> This approach would require somewhat larger changes and may not
>>>>> be very elegant, but it seems to be a reliable way to ensure
>>>>> MD_BROKEN is never set at the wrong timing.
>>>>>
>>>>> Which of these three approaches would you consider preferable?
>>>>> I would appreciate your feedback.
>>>>>
>>>>>
>>>>> For metadata writes, I plan to clear_bit MD_FAILFAST_SUPPORTED
>>>>> when the array is degraded.
>>>>>
>>>>> Thanks,
>>>>> Akagi
>>>>>
>>>>
>>>> I took a closer look at the FailFast code and found a few issues, using
>>>> RAID1 as an example:
>>>>
>>>> For normal read/write IO, FailFast is only triggered when there is another
>>>> disk is available, as seen in read_balance() and raid1_write_request().
>>>> In raid1_error(), MD_BROKEN is set only when no other disks are available.
>>>
>>> Hi,
>>> Agree, I think so too.
>>>
>>>> So, the FailFast for normal read/write is not triggered in the scenario you
>>>> described in cover-letter.
>>>
>>> This corresponds to the case described in the commit message of PATCH v3 1/3.
>>> "Normally, MD_FAILFAST IOs are not issued to the 'last' rdev, so this is
>>> an edge case; however, it can occur if rdevs in a non-degraded
>>> array share the same path and that path is lost, or if a metadata
>>> write is triggered in a degraded array and fails due to failfast."
>>>
>>> To describe it in more detail, the flow is as follows:
>>>
>>> Prerequisites:
>>>
>>> - Both rdevs are in-sync
>>> - Both rdevs have in-flight MD_FAILFAST bios
>>> - Both rdevs depend on the same lower-level path
>>>     (e.g., nvme-tcp over a single Ethernet interface)
>>>
>>> Sequence:
>>>
>>> - A bios with REQ_FAILFAST_DEV fails (e.g., due to a temporary network outage),
>>>     in the case of nvme-tcp:
>>>      - The Ethernet connection is lost on the node where md is running over 5 seconds
>>>      - Then the connection is restored. Idk the details of nvme-tcp implementation,
>>>        but it seems that failfast IOs finish only after the connection is back.
>>> - All failfast bios fail, raid1_end_write_request is called.
>>> - md_error() marks one rdev Faulty; the other rdev becomes the 'last' rdev.
>>> - md_error() on the last rdev sets MD_BROKEN on the array - fail_last_dev=1 is unlikely.
>>> - The write is retried via handle_write_finished -> narrow_write_error, usually succeeding.
>>> - MD_BROKEN remains set, leaving the array in a state where no further writes can occur.
>>>
>>
>> Thanks for your patient explanation. I understand. Maybe we need a separate
>> error-handling path for failfast. How about adding an extra parameter to md_error()?
> 
> Thank you for reviewing.
> 
> I am thinking of proceeding like as follows.
> md_error is EXPORT_SYMBOL. I think that it is undesirable to change the ABI of this function.
> 

It doesn't matter if it's an exported symbol; we should just keep the code as
simple as possible.
> ...
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index ac85ec73a409..855cddeb0c09 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -8197,3 +8197,3 @@ EXPORT_SYMBOL(md_unregister_thread);
> 
> -void md_error(struct mddev *mddev, struct md_rdev *rdev)
> +void _md_error(struct mddev *mddev, struct md_rdev *rdev, bool nofail)
>   {
> @@ -8204,3 +8204,3 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
>                  return;
> -       mddev->pers->error_handler(mddev, rdev);
> +       mddev->pers->error_handler(mddev, rdev, nofail);
> 
> @@ -8222,4 +8222,26 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
>   }
> +
> +void md_error(struct mddev *mddev, struct md_rdev *rdev)
> +{
> +       return _md_error(mddev, rdev, false);
> +}
>   EXPORT_SYMBOL(md_error);
> 
> +void md_error_failfast(struct mddev *mddev, struct md_rdev *rdev)
> +{
> +       WARN_ON(mddev->pers->head.id != ID_RAID1 &&
> +               mddev->pers->head.id != ID_RAID10);
> +       return _md_error(mddev, rdev, true);
> +}
> +EXPORT_SYMBOL(md_error_failfast);
> +

I would prefer that we add a common procedure to fix this problem.

How about a first patch to serialize all the md_error() calls, and then
add a new helper md_bio_failure_error(mddev, rdev, bio), called when a
bio fails. In this helper:

1) if the bio is not failfast, call md_error() and return true; otherwise:
2) if the rdev contains the last data copy, return false directly; the
caller should check the return value and retry; otherwise:
3) call md_error() and return true (a rough sketch of such a helper follows below);
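
To make this concrete, here is a minimal sketch of such a helper (not part of
any posted patch; rdev_is_last_data_copy() is a hypothetical placeholder,
since the "last data copy" test is personality-specific):

bool md_bio_failure_error(struct mddev *mddev, struct md_rdev *rdev,
			  struct bio *bio)
{
	/* 1) not a failfast bio: fail the device as before */
	if (!(bio->bi_opf & MD_FAILFAST)) {
		md_error(mddev, rdev);
		return true;
	}

	/* 2) failfast bio and @rdev holds the last data copy: do not
	 * fail the device; the caller must retry the IO instead */
	if (rdev_is_last_data_copy(mddev, rdev))
		return false;

	/* 3) failfast bio, but other copies remain: fail @rdev */
	md_error(mddev, rdev);
	return true;
}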

Then, for raid1, the callers will look like:

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 1baaf52c603c..c6d150e9f1a7 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1003,9 +1003,7 @@ static void super_written(struct bio *bio)
         if (bio->bi_status) {
                 pr_err("md: %s gets error=%d\n", __func__,
                        blk_status_to_errno(bio->bi_status));
-               md_error(mddev, rdev);
-               if (!test_bit(Faulty, &rdev->flags)
-                   && (bio->bi_opf & MD_FAILFAST)) {
+               if (!md_bio_failure_error(mddev, rdev, bio)) {
                         set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
                         set_bit(LastDev, &rdev->flags);
                 }

@@ -466,20 +472,11 @@ static void raid1_end_write_request(struct bio *bio)
                         set_bit(MD_RECOVERY_NEEDED, &
                                 conf->mddev->recovery);

-               if (test_bit(FailFast, &rdev->flags) &&
-                   (bio->bi_opf & MD_FAILFAST) &&
                     /* We never try FailFast to WriteMostly devices */
-                   !test_bit(WriteMostly, &rdev->flags)) {
-                       md_error(r1_bio->mddev, rdev);
-               }
-
-               /*
-                * When the device is faulty, it is not necessary to
-                * handle write error.
-                */
-               if (!test_bit(Faulty, &rdev->flags))
+               if(!test_bit(WriteMostly, &rdev->flags) &&
+                  !md_bio_failure_error(mddev, rdev, bio)) {
                         set_bit(R1BIO_WriteError, &r1_bio->state);
-               else {
+               } else {
                         /* Finished with this branch */
                         r1_bio->bios[mirror] = NULL;
                         to_put = bio;
@@ -2630,7 +2627,6 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
          */

         bio = r1_bio->bios[r1_bio->read_disk];
-       bio_put(bio);
         r1_bio->bios[r1_bio->read_disk] = NULL;

         rdev = conf->mirrors[r1_bio->read_disk].rdev;
@@ -2639,19 +2635,18 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
                 freeze_array(conf, 1);
                 fix_read_error(conf, r1_bio);
                 unfreeze_array(conf);
-       } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) {
-               md_error(mddev, rdev);
-       } else {
+       } else if (mddev->ro == 0 &&
+                  !md_bio_failure_error(mddev, rdev, bio)) {
                 r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
         }

+       bio_put(bio);
         rdev_dec_pending(rdev, conf->mddev);
         sector = r1_bio->sector;
-       bio = r1_bio->master_bio;

        /* Reuse the old r1_bio so that the IO_BLOCKED settings are preserved */
        r1_bio->state = 0;
-       raid1_read_request(mddev, bio, r1_bio->sectors, r1_bio);
+       raid1_read_request(mddev, r1_bio->master_bio, r1_bio->sectors, r1_bio);
         allow_barrier(conf, sector);
  }


>   /* seq_file implementation /proc/mdstat */
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 51af29a03079..6ca1aea630ce 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -758,3 +758,3 @@ struct md_personality
>           */
> -       void (*error_handler)(struct mddev *mddev, struct md_rdev *rdev);
> +       void (*error_handler)(struct mddev *mddev, struct md_rdev *rdev, bool nofail);
>          int (*hot_add_disk) (struct mddev *mddev, struct md_rdev *rdev);
> @@ -903,3 +903,5 @@ extern void md_write_end(struct mddev *mddev);
>   extern void md_done_sync(struct mddev *mddev, int blocks, int ok);
> +void _md_error(struct mddev *mddev, struct md_rdev *rdev, bool nofail);
>   extern void md_error(struct mddev *mddev, struct md_rdev *rdev);
> +extern void md_error_failfast(struct mddev *mddev, struct md_rdev *rdev);
>   extern void md_finish_reshape(struct mddev *mddev);
> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
> index f1d8811a542a..8aea51227a96 100644
> --- a/drivers/md/raid0.c
> +++ b/drivers/md/raid0.c
> @@ -637,3 +637,4 @@ static void raid0_status(struct seq_file *seq, struct mddev *mddev)
> 
> -static void raid0_error(struct mddev *mddev, struct md_rdev *rdev)
> +static void raid0_error(struct mddev *mddev, struct md_rdev *rdev,
> +       bool nofail __maybe_unused)
>   {
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 408c26398321..d93275899e9e 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1739,2 +1739,3 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>    * @rdev: member device to fail.
> + * @nofail: @mdev and @rdev must not fail even if @rdev is the last when @nofail set
>    *
> @@ -1748,6 +1749,8 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>    *
> - * @rdev is marked as &Faulty excluding case when array is failed and
> - * &mddev->fail_last_dev is off.
> + * @rdev is marked as &Faulty excluding any cases:
> + *     - when @mddev is failed and &mddev->fail_last_dev is off
> + *     - when @rdev is last device and @nofail is true
>    */
> -static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
> +static void raid1_error(struct mddev *mddev, struct md_rdev *rdev,
> +       bool nofail)
>   {
> @@ -1760,2 +1763,9 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>              (conf->raid_disks - mddev->degraded) == 1) {
> +               if (nofail) {
> +                       spin_unlock_irqrestore(&conf->device_lock, flags);
> +                       pr_warn_ratelimited("md/raid1:%s: IO failure on %pg, "
> +                               "last device but ignoring it\n",
> +                               mdname(mddev), rdev->bdev);
> +                       return;
> +               }
>                  set_bit(MD_BROKEN, &mddev->flags);
> ...
> 
>> Kuai, do you have any suggestions?
>>
>>>> Normal writes may call md_error() in narrow_write_error. Normal reads do
>>>> not execute md_error() on the last disk.
>>>>
>>>> Perhaps you should get more information to confirm how MD_BROKEN is set in
>>>> normal read/write IO.
>>>
>>> Should I add the above sequence of events to the cover letter, or commit message?
>>>
>>
>> I think we should mention this in the commit message.
> 
> Understood. I will explicitly describe this in the commit message in v4.
> 
> Thanks,
> Akagi
> 
>>> Thanks,
>>> Akagi
>>>
>>>> -- 
>>>> Thanks,
>>>> Nan
>>>>
>>>>
>>>
>>>
>>> .
>>
>> -- 
>> Thanks,
>> Nan
>>
>>
> 
> .
> 


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 1/3] md/raid1,raid10: Do not set MD_BROKEN on failfast io failure
  2025-09-01  7:48               ` Yu Kuai
@ 2025-09-01 16:48                 ` Kenta Akagi
  0 siblings, 0 replies; 12+ messages in thread
From: Kenta Akagi @ 2025-09-01 16:48 UTC (permalink / raw)
  To: Yu Kuai, Li Nan, Song Liu, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, yukuai (C), Kenta Akagi



On 2025/09/01 16:48, Yu Kuai wrote:
> Hi,
> 
> On 2025/09/01 12:22, Kenta Akagi wrote:
>>
>>
>> On 2025/09/01 12:22, Li Nan wrote:
>>>
>>>
>>> On 2025/8/31 2:10, Kenta Akagi wrote:
>>>>
>>>>
>>>> On 2025/08/30 17:48, Li Nan wrote:
>>>>>
>>>>>
>>>>> On 2025/8/29 20:21, Kenta Akagi wrote:
>>>>>>
>>>>>>
>>>>>> On 2025/08/29 11:54, Li Nan wrote:
>>>>>>>
>>>>>>> On 2025/8/29 0:32, Kenta Akagi wrote:
>>>>>>>> This commit ensures that an MD_FAILFAST IO failure does not put
>>>>>>>> the array into a broken state.
>>>>>>>>
>>>>>>>> When failfast is enabled on rdev in RAID1 or RAID10,
>>>>>>>> the array may be flagged MD_BROKEN in the following cases.
>>>>>>>> - If MD_FAILFAST IOs to multiple rdevs fail simultaneously
>>>>>>>> - If an MD_FAILFAST metadata write to the 'last' rdev fails
>>>>>>>
>>>>>>> [...]
>>>>>>>
>>>>>>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>>>>>>> index 408c26398321..8a61fd93b3ff 100644
>>>>>>>> --- a/drivers/md/raid1.c
>>>>>>>> +++ b/drivers/md/raid1.c
>>>>>>>> @@ -470,6 +470,7 @@ static void raid1_end_write_request(struct bio *bio)
>>>>>>>>                  (bio->bi_opf & MD_FAILFAST) &&
>>>>>>>>                  /* We never try FailFast to WriteMostly devices */
>>>>>>>>                  !test_bit(WriteMostly, &rdev->flags)) {
>>>>>>>> +            set_bit(FailfastIOFailure, &rdev->flags);
>>>>>>>>                  md_error(r1_bio->mddev, rdev);
>>>>>>>>              }
>>>>>>>>      @@ -1746,8 +1747,12 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>>>>>>>>       *    - recovery is interrupted.
>>>>>>>>       *    - &mddev->degraded is bumped.
>>>>>>>>       *
>>>>>>>> - * @rdev is marked as &Faulty excluding case when array is failed and
>>>>>>>> - * &mddev->fail_last_dev is off.
>>>>>>>> + * If @rdev has &FailfastIOFailure and it is the 'last' rdev,
>>>>>>>> + * then @mddev and @rdev will not be marked as failed.
>>>>>>>> + *
>>>>>>>> + * @rdev is marked as &Faulty excluding any cases:
>>>>>>>> + *    - when @mddev is failed and &mddev->fail_last_dev is off
>>>>>>>> + *    - when @rdev is last device and &FailfastIOFailure flag is set
>>>>>>>>       */
>>>>>>>>      static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>>>>>>>      {
>>>>>>>> @@ -1758,6 +1763,13 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>>>>>>>            if (test_bit(In_sync, &rdev->flags) &&
>>>>>>>>              (conf->raid_disks - mddev->degraded) == 1) {
>>>>>>>> +        if (test_and_clear_bit(FailfastIOFailure, &rdev->flags)) {
>>>>>>>> +            spin_unlock_irqrestore(&conf->device_lock, flags);
>>>>>>>> +            pr_warn_ratelimited("md/raid1:%s: Failfast IO failure on %pg, "
>>>>>>>> +                "last device but ignoring it\n",
>>>>>>>> +                mdname(mddev), rdev->bdev);
>>>>>>>> +            return;
>>>>>>>> +        }
>>>>>>>>              set_bit(MD_BROKEN, &mddev->flags);
>>>>>>>>                if (!mddev->fail_last_dev) {
>>>>>>>> @@ -2148,6 +2160,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
>>>>>>>>          if (test_bit(FailFast, &rdev->flags)) {
>>>>>>>>              /* Don't try recovering from here - just fail it
>>>>>>>>               * ... unless it is the last working device of course */
>>>>>>>> +        set_bit(FailfastIOFailure, &rdev->flags);
>>>>>>>>              md_error(mddev, rdev);
>>>>>>>>              if (test_bit(Faulty, &rdev->flags))
>>>>>>>>                  /* Don't try to read from here, but make sure
>>>>>>>> @@ -2652,6 +2665,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
>>>>>>>>              fix_read_error(conf, r1_bio);
>>>>>>>>              unfreeze_array(conf);
>>>>>>>>          } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) {
>>>>>>>> +        set_bit(FailfastIOFailure, &rdev->flags);
>>>>>>>>              md_error(mddev, rdev);
>>>>>>>>          } else {
>>>>>>>>              r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
>>>>>>>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>>>>>>>> index b60c30bfb6c7..530ad6503189 100644
>>>>>>>> --- a/drivers/md/raid10.c
>>>>>>>> +++ b/drivers/md/raid10.c
>>>>>>>> @@ -488,6 +488,7 @@ static void raid10_end_write_request(struct bio *bio)
>>>>>>>>                  dec_rdev = 0;
>>>>>>>>                  if (test_bit(FailFast, &rdev->flags) &&
>>>>>>>>                      (bio->bi_opf & MD_FAILFAST)) {
>>>>>>>> +                set_bit(FailfastIOFailure, &rdev->flags);
>>>>>>>>                      md_error(rdev->mddev, rdev);
>>>>>>>>                  }
>>>>>>>>      
>>>>>>>
>>>>>>> Thank you for the patch. There may be an issue with 'test_and_clear'.
>>>>>>>
>>>>>>> If two write IO go to the same rdev, MD_BROKEN may be set as below:
>>>>>>
>>>>>>> IO1                    IO2
>>>>>>> set FailfastIOFailure
>>>>>>>                        set FailfastIOFailure
>>>>>>>     md_error
>>>>>>>      raid1_error
>>>>>>>       test_and_clear FailfastIOFailur
>>>>>>>                           md_error
>>>>>>>                          raid1_error
>>>>>>>                           //FailfastIOFailur is cleared
>>>>>>>                           set MD_BROKEN
>>>>>>>
>>>>>>> Maybe we should check whether FailfastIOFailure is already set before
>>>>>>> setting it. It also needs to be considered in metadata writes.
>>>>>> Thank you for reviewing.
>>>>>>
>>>>>> I agree, this seems to be as you described.
>>>>>> So, should it be implemented as follows?
>>>>>>
>>>>>> bool old=false;
>>>>>> do{
>>>>>>     spin_lock_irqsave(&conf->device_lock, flags);
>>>>>>     old = test_and_set_bit(FailfastIOFailure, &rdev->flags);
>>>>>>     spin_unlock_irqrestore(&conf->device_lock, flags);
>>>>>> }while(old);
>>>>>>
>>>>>> However, since I am concerned about potential deadlocks,
>>>>>> so I am considering two alternative approaches:
>>>>>>
>>>>>> * Add an atomic_t counter to md_rdev to track failfast IO failures.
>>>>>>
>>>>>> This may set MD_BROKEN at a slightly incorrect timing, but mixing
>>>>>> error handling of Failfast and non-Failfast IOs appears to be rare.
>>>>>> In any case, the final outcome would be the same, i.e. the array
>>>>>> ends up with MD_BROKEN. Therefore, I think this should not cause
>>>>>> issues. I think the same applies to test_and_set_bit.
>>>>>>
>>>>>> IO1                    IO2                    IO3
>>>>>> FailfastIOFailure      Normal IOFailure       FailfastIOFailure
>>>>>> atomic_inc
>>>>>>                                                    md_error                                     atomic_inc
>>>>>>      raid1_error
>>>>>>       atomic_dec //2to1
>>>>>>                           md_error
>>>>>>                            raid1_error           md_error
>>>>>>                             atomic_dec //1to0     raid1_error
>>>>>>                                                    atomic_dec //0
>>>>>>                                                      set MD_BROKEN
>>>>>>
>>>>>> * Alternatively, create a separate error handler,
>>>>>>      e.g. md_error_failfast(), that clearly does not fail the array.
>>>>>>
>>>>>> This approach would require somewhat larger changes and may not
>>>>>> be very elegant, but it seems to be a reliable way to ensure
>>>>>> MD_BROKEN is never set at the wrong timing.
>>>>>>
>>>>>> Which of these three approaches would you consider preferable?
>>>>>> I would appreciate your feedback.
>>>>>>
>>>>>>
>>>>>> For metadata writes, I plan to clear_bit MD_FAILFAST_SUPPORTED
>>>>>> when the array is degraded.
>>>>>>
>>>>>> Thanks,
>>>>>> Akagi
>>>>>>
>>>>>
>>>>> I took a closer look at the FailFast code and found a few issues, using
>>>>> RAID1 as an example:
>>>>>
>>>>> For normal read/write IO, FailFast is only triggered when there is another
>>>>> disk is available, as seen in read_balance() and raid1_write_request().
>>>>> In raid1_error(), MD_BROKEN is set only when no other disks are available.
>>>>
>>>> Hi,
>>>> Agree, I think so too.
>>>>
>>>>> So, the FailFast for normal read/write is not triggered in the scenario you
>>>>> described in cover-letter.
>>>>
>>>> This corresponds to the case described in the commit message of PATCH v3 1/3.
>>>> "Normally, MD_FAILFAST IOs are not issued to the 'last' rdev, so this is
>>>> an edge case; however, it can occur if rdevs in a non-degraded
>>>> array share the same path and that path is lost, or if a metadata
>>>> write is triggered in a degraded array and fails due to failfast."
>>>>
>>>> To describe it in more detail, the flow is as follows:
>>>>
>>>> Prerequisites:
>>>>
>>>> - Both rdevs are in-sync
>>>> - Both rdevs have in-flight MD_FAILFAST bios
>>>> - Both rdevs depend on the same lower-level path
>>>>     (e.g., nvme-tcp over a single Ethernet interface)
>>>>
>>>> Sequence:
>>>>
>>>> - A bios with REQ_FAILFAST_DEV fails (e.g., due to a temporary network outage),
>>>>     in the case of nvme-tcp:
>>>>      - The Ethernet connection is lost on the node where md is running over 5 seconds
>>>>      - Then the connection is restored. Idk the details of nvme-tcp implementation,
>>>>        but it seems that failfast IOs finish only after the connection is back.
>>>> - All failfast bios fail, raid1_end_write_request is called.
>>>> - md_error() marks one rdev Faulty; the other rdev becomes the 'last' rdev.
>>>> - md_error() on the last rdev sets MD_BROKEN on the array - fail_last_dev=1 is unlikely.
>>>> - The write is retried via handle_write_finished -> narrow_write_error, usually succeeding.
>>>> - MD_BROKEN remains set, leaving the array in a state where no further writes can occur.
>>>>
>>>
>>> Thanks for your patient explanation. I understand. Maybe we need a separate
>>> error-handling path for failfast. How about adding an extra parameter to md_error()?
>>
>> Thank you for reviewing.
>>
>> I am thinking of proceeding like as follows.
>> md_error is EXPORT_SYMBOL. I think that it is undesirable to change the ABI of this function.
>>
> 
> It doesn't matter if it's a exported symbol, we should just keep code as
> simple as possible.
>> ...
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index ac85ec73a409..855cddeb0c09 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -8197,3 +8197,3 @@ EXPORT_SYMBOL(md_unregister_thread);
>>
>> -void md_error(struct mddev *mddev, struct md_rdev *rdev)
>> +void _md_error(struct mddev *mddev, struct md_rdev *rdev, bool nofail)
>>   {
>> @@ -8204,3 +8204,3 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
>>                  return;
>> -       mddev->pers->error_handler(mddev, rdev);
>> +       mddev->pers->error_handler(mddev, rdev, nofail);
>>
>> @@ -8222,4 +8222,26 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
>>   }
>> +
>> +void md_error(struct mddev *mddev, struct md_rdev *rdev)
>> +{
>> +       return _md_error(mddev, rdev, false);
>> +}
>>   EXPORT_SYMBOL(md_error);
>>
>> +void md_error_failfast(struct mddev *mddev, struct md_rdev *rdev)
>> +{
>> +       WARN_ON(mddev->pers->head.id != ID_RAID1 &&
>> +               mddev->pers->head.id != ID_RAID10);
>> +       return _md_error(mddev, rdev, true);
>> +}
>> +EXPORT_SYMBOL(md_error_failfast);
>> +
> 
> I will prefer we add a common procedures to fix this problme.
> 
> How about the first patch to serialize all the md_error(), and then
> and a new helper md_bio_failue_error(mddev, rdev, bio), called when
> bio is failed, in this helper:
> 
> 1) if bio is not failfast, call md_error() and return true; otherwise:
> 2) if rdev contain the last data copy, return false directly, caller
> should check return value and retry, otherwise:
> 3) call md_error() and return true;

Hi,
I think this approach has some issues. There are cases where md_error is
called only when MD_FAILFAST is set.

One example is the processing below in raid1_end_write_request.

> Then, for raid1, the callers will look like:
> 
> iff --git a/drivers/md/md.c b/drivers/md/md.c
> index 1baaf52c603c..c6d150e9f1a7 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -1003,9 +1003,7 @@ static void super_written(struct bio *bio)
>         if (bio->bi_status) {
>                 pr_err("md: %s gets error=%d\n", __func__,
>                        blk_status_to_errno(bio->bi_status));
> -               md_error(mddev, rdev);
> -               if (!test_bit(Faulty, &rdev->flags)
> -                   && (bio->bi_opf & MD_FAILFAST)) {
> +               if (!md_bio_failure_error(mddev, rdev, bio)) {
>                         set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
>                         set_bit(LastDev, &rdev->flags);
>                 }
> 
> @@ -466,20 +472,11 @@ static void raid1_end_write_request(struct bio *bio)
>                         set_bit(MD_RECOVERY_NEEDED, &
>                                 conf->mddev->recovery);
> 
> -               if (test_bit(FailFast, &rdev->flags) &&
> -                   (bio->bi_opf & MD_FAILFAST) &&
>                     /* We never try FailFast to WriteMostly devices */
> -                   !test_bit(WriteMostly, &rdev->flags)) {
> -                       md_error(r1_bio->mddev, rdev);
> -               }
> -
> -               /*
> -                * When the device is faulty, it is not necessary to
> -                * handle write error.
> -                */
> -               if (!test_bit(Faulty, &rdev->flags))
> +               if(!test_bit(WriteMostly, &rdev->flags) &&
> +                  !md_bio_failure_error(mddev, rdev, bio)) {
>                         set_bit(R1BIO_WriteError, &r1_bio->state);
> -               else {
> +               } else {
>                         /* Finished with this branch */
>                         r1_bio->bios[mirror] = NULL;
>                         to_put = bio;

In the current raid1_end_write_request implementation,
- md_error is called only in the Failfast case.
- Afterwards, if the rdev is not Faulty (that is, the IO was not Failfast,
  or it was Failfast but hit the last rdev, for which MD_BROKEN was
  originally not expected in RAID1), R1BIO_WriteError is set so the
  write is retried.
In the suggested implementation, it seems that a non-Failfast write
failure would immediately mark the rdev as Faulty, without any retry.

This could be avoided by testing MD_FAILFAST before calling the new helper
md_bio_failure_error, but I believe duplicating the same check in both the
caller and the callee would be undesirable, as sketched below.
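
As a rough illustration only (not from any posted patch), keeping today's
behaviour would force the write-completion path into something like:

		if (test_bit(FailFast, &rdev->flags) &&
		    (bio->bi_opf & MD_FAILFAST) &&
		    !test_bit(WriteMostly, &rdev->flags)) {
			/* failfast: the helper repeats this test internally
			 * and refuses to fail the last rdev, so retry */
			if (!md_bio_failure_error(r1_bio->mddev, rdev, bio))
				set_bit(R1BIO_WriteError, &r1_bio->state);
		} else if (!test_bit(Faulty, &rdev->flags)) {
			/* non-failfast failure: keep retrying as today */
			set_bit(R1BIO_WriteError, &r1_bio->state);
		}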

Should we try to avoid modifying pers->error_handler?
One possible alternative approach is as follows.

- Serialize calls to md_error regardless of whether the IO is Failfast or not.
- In raid{1,10}_error:
  - mark the remaining copy (rdev) with the LastDev flag
  - clear MD_FAILFAST_SUPPORTED so that super_write no longer uses Failfast
- super_written will simply set MD_SB_NEED_REWRITE, without calling
  md_error, when the bio is MD_FAILFAST and the rdev has LastDev set.

After these changes, I believe it is rare for super_written to be called with
an error on multiple rdevs due to failfast: the super_write is triggered by
errors on normal failfast IO and is invoked via MD_SB_CHANGE_DEVS through the
serialized raid1_error, and since MD_FAILFAST_SUPPORTED has been cleared by
then, the metadata writes are issued without failfast.

It is not exactly a common procedure, but since it does not add any functions
to md.c, I think this approach is preferable to adding md_error_failfast().

...
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 1baaf52c603c..ba524fa96091 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1003,14 +1003,15 @@ static void super_written(struct bio *bio)
        if (bio->bi_status) {
                pr_err("md: %s gets error=%d\n", __func__,
                       blk_status_to_errno(bio->bi_status));
-               md_error(mddev, rdev);
-               if (!test_bit(Faulty, &rdev->flags)
+               if (test_bit(LastDev, &rdev->flags)
                    && (bio->bi_opf & MD_FAILFAST)) {
+                       pr_warn("md: %s: Metadata write will be repeated to %pg\n",
+                               mdname(mddev), rdev->bdev);
                        set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
-                       set_bit(LastDev, &rdev->flags);
+               } else {
+                       md_error(mddev, rdev);
                }
-       } else
-               clear_bit(LastDev, &rdev->flags);
+       }

        bio_put(bio);
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 408c26398321..a52c5277add7 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -470,7 +470,7 @@ static void raid1_end_write_request(struct bio *bio)
                    (bio->bi_opf & MD_FAILFAST) &&
                    /* We never try FailFast to WriteMostly devices */
                    !test_bit(WriteMostly, &rdev->flags)) {
-                       md_error(r1_bio->mddev, rdev);
+                       raid1_md_error_failfast(r1_bio->mddev, rdev);
                }

                /*
@@ -1733,6 +1733,27 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
        seq_printf(seq, "]");
 }

+static void _raid1_md_error(struct mddev *mddev, struct md_rdev *rdev, bool failfast){
+       struct r1conf *conf = mddev->private;
+       unsigned long flags;
+
+       spin_lock_irqsave(&conf->new_lock_for_md_error, flags);
+       if (failfast)
+               set_bit(FailfastIOFailure, &rdev->flags);
+       md_error(mddev, rdev);
+       if (failfast)
+               WARN_ON(!test_and_clear_bit(FailfastIOFailure, &rdev->flags));
+       spin_unlock_irqrestore(&conf->new_lock_for_md_error, flags);
+}
+
+static void raid1_md_error(struct mddev *mddev, struct md_rdev *rdev){
+       return _raid1_md_error(mddev, rdev, false);
+}
+
+static void raid1_md_error_failfast(struct mddev *mddev, struct md_rdev *rdev){
+       return _raid1_md_error(mddev, rdev, true);
+}
+
 /**
  * raid1_error() - RAID1 error handler.
  * @mddev: affected md device.
@@ -1758,6 +1783,13 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)

        if (test_bit(In_sync, &rdev->flags) &&
            (conf->raid_disks - mddev->degraded) == 1) {
+               if (test_bit(FailfastIOFailure, &rdev->flags)) {
+                       spin_unlock_irqrestore(&conf->device_lock, flags);
+                       pr_warn_ratelimited("md/raid1:%s: Failfast IO failure on %pg, "
+                               "last device but ignoring it\n",
+                               mdname(mddev), rdev->bdev);
+                       return;
+               }
                set_bit(MD_BROKEN, &mddev->flags);

                if (!mddev->fail_last_dev) {
@@ -1767,8 +1799,16 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
                }
        }
        set_bit(Blocked, &rdev->flags);
-       if (test_and_clear_bit(In_sync, &rdev->flags))
+       if (test_and_clear_bit(In_sync, &rdev->flags)) {
                mddev->degraded++;
+               clear_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
+               for (i = 0; i < conf->raid_disks; i++) {
+                       struct md_rdev *rdev2 = conf->mirrors[i].rdev;
+                       if (rdev2 && rdev != rdev2 &&
+                           test_bit(In_sync, &rdev2->flags))
+                               set_bit(LastDev, &rdev2->flags);
+               }
+       }
        set_bit(Faulty, &rdev->flags);
        spin_unlock_irqrestore(&conf->device_lock, flags);
        /*
@@ -2118,7 +2158,7 @@ static int r1_sync_page_io(struct md_rdev *rdev, sector_t sector,
        }
        /* need to record an error - either for the block or the device */
        if (!rdev_set_badblocks(rdev, sector, sectors, 0))
-               md_error(rdev->mddev, rdev);
+               raid1_md_error(rdev->mddev, rdev);
        return 0;
 }
...
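
To make the intended behavior easier to follow, here is a minimal
user-space model of the handshake (illustrative names only, not the md
structures): the failfast completion path sets a per-rdev flag before
calling the error handler, and the error handler refuses to fail the
array when that flag is set on the last in-sync device, while an
ordinary failure still marks the device Faulty.

#include <stdbool.h>
#include <stdio.h>

/* Illustrative model only: names do not match the md code. */
struct dev {
	bool in_sync;
	bool faulty;
	bool failfast_io_failure;	/* models FailfastIOFailure */
};

struct array {
	struct dev devs[2];
	bool broken;			/* models MD_BROKEN */
};

static int in_sync_count(const struct array *a)
{
	int n = 0;

	for (int i = 0; i < 2; i++)
		n += a->devs[i].in_sync;
	return n;
}

/* models raid1_error(): refuse to break the array on a failfast failure */
static void array_error(struct array *a, struct dev *d)
{
	if (d->in_sync && in_sync_count(a) == 1) {
		if (d->failfast_io_failure) {
			printf("failfast failure on last device ignored\n");
			return;
		}
		a->broken = true;	/* a real last-device failure */
	}
	d->in_sync = false;
	d->faulty = true;
}

/* models raid1_md_error_failfast(): set the flag only around the call */
static void array_error_failfast(struct array *a, struct dev *d)
{
	d->failfast_io_failure = true;
	array_error(a, d);
	d->failfast_io_failure = false;
}

int main(void)
{
	struct array a = { .devs = { { .in_sync = true }, { .in_sync = true } } };

	array_error(&a, &a.devs[0]);		/* ordinary failure: dev0 becomes Faulty */
	array_error_failfast(&a, &a.devs[1]);	/* failfast failure on the last device */
	printf("broken=%d dev1.faulty=%d\n", a.broken, a.devs[1].faulty);
	return 0;
}

In the real patch the flag is set and cleared around md_error() under a
per-conf lock, so a concurrent non-failfast failure cannot observe it.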


Thanks,
Akagi

> @@ -2630,7 +2627,6 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
>          */
> 
>         bio = r1_bio->bios[r1_bio->read_disk];
> -       bio_put(bio);
>         r1_bio->bios[r1_bio->read_disk] = NULL;
> 
>         rdev = conf->mirrors[r1_bio->read_disk].rdev;
> @@ -2639,19 +2635,18 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
>                 freeze_array(conf, 1);
>                 fix_read_error(conf, r1_bio);
>                 unfreeze_array(conf);
> -       } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) {
> -               md_error(mddev, rdev);
> -       } else {
> +       } else if (mddev->ro == 0 &&
> +                  !md_bio_failure_error(mddev, rdev, bio)) {
>                 r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
>         }
> 
> +       bio_put(bio);
>         rdev_dec_pending(rdev, conf->mddev);
>         sector = r1_bio->sector;
> -       bio = r1_bio->master_bio;
> 
>         /* Reuse the old r1_bio so that the IO_BLOCKED settings are preserved */
>         r1_bio->state = 0;
> -       raid1_read_request(mddev, bio, r1_bio->sectors, r1_bio);
> +       raid1_read_request(mddev, r1_bio->master_bio, r1_bio->sectors, r1_bio);
>         allow_barrier(conf, sector);
>  }
> 
> 
>>   /* seq_file implementation /proc/mdstat */
>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>> index 51af29a03079..6ca1aea630ce 100644
>> --- a/drivers/md/md.h
>> +++ b/drivers/md/md.h
>> @@ -758,3 +758,3 @@ struct md_personality
>>           */
>> -       void (*error_handler)(struct mddev *mddev, struct md_rdev *rdev);
>> +       void (*error_handler)(struct mddev *mddev, struct md_rdev *rdev, bool nofail);
>>          int (*hot_add_disk) (struct mddev *mddev, struct md_rdev *rdev);
>> @@ -903,3 +903,5 @@ extern void md_write_end(struct mddev *mddev);
>>   extern void md_done_sync(struct mddev *mddev, int blocks, int ok);
>> +void _md_error(struct mddev *mddev, struct md_rdev *rdev, bool nofail);
>>   extern void md_error(struct mddev *mddev, struct md_rdev *rdev);
>> +extern void md_error_failfast(struct mddev *mddev, struct md_rdev *rdev);
>>   extern void md_finish_reshape(struct mddev *mddev);
>> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
>> index f1d8811a542a..8aea51227a96 100644
>> --- a/drivers/md/raid0.c
>> +++ b/drivers/md/raid0.c
>> @@ -637,3 +637,4 @@ static void raid0_status(struct seq_file *seq, struct mddev *mddev)
>>
>> -static void raid0_error(struct mddev *mddev, struct md_rdev *rdev)
>> +static void raid0_error(struct mddev *mddev, struct md_rdev *rdev,
>> +       bool nofail __maybe_unused)
>>   {
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index 408c26398321..d93275899e9e 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -1739,2 +1739,3 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>>    * @rdev: member device to fail.
>> + * @nofail: do not fail @mddev or @rdev even if @rdev is the last device
>>    *
>> @@ -1748,6 +1749,8 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>>    *
>> - * @rdev is marked as &Faulty excluding case when array is failed and
>> - * &mddev->fail_last_dev is off.
>> + * @rdev is marked as &Faulty except in the following cases:
>> + *     - @mddev is failed and &mddev->fail_last_dev is off
>> + *     - @rdev is the last device and @nofail is true
>>    */
>> -static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>> +static void raid1_error(struct mddev *mddev, struct md_rdev *rdev,
>> +       bool nofail)
>>   {
>> @@ -1760,2 +1763,9 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>              (conf->raid_disks - mddev->degraded) == 1) {
>> +               if (nofail) {
>> +                       spin_unlock_irqrestore(&conf->device_lock, flags);
>> +                       pr_warn_ratelimited("md/raid1:%s: IO failure on %pg, "
>> +                               "last device but ignoring it\n",
>> +                               mdname(mddev), rdev->bdev);
>> +                       return;
>> +               }
>>                  set_bit(MD_BROKEN, &mddev->flags);
>> ...
>>
>>> Kuai, do you have any suggestions?
>>>
>>>>> Normal writes may call md_error() in narrow_write_error. Normal reads do
>>>>> not execute md_error() on the last disk.
>>>>>
>>>>> Perhaps you should get more information to confirm how MD_BROKEN is set in
>>>>> normal read/write IO.
>>>>
>>>> Should I add the above sequence of events to the cover letter, or commit message?
>>>>
>>>
>>> I think we should mention this in the commit message.
>>
>> Understood. I will explicitly describe this in the commit message in v4.
>>
>> Thanks,
>> Akagi
>>
>>>> Thanks,
>>>> Akagi
>>>>
>>>>> -- 
>>>>> Thanks,
>>>>> Nan
>>>>>
>>>>>
>>>>
>>>>
>>>> .
>>>
>>> -- 
>>> Thanks,
>>> Nan
>>>
>>>
>>
>> .
>>
> 
> 


^ permalink raw reply related	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2025-09-01 16:48 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-08-28 16:32 [PATCH v3 0/3] Do not set MD_BROKEN on failfast io failure Kenta Akagi
2025-08-28 16:32 ` [PATCH v3 1/3] md/raid1,raid10: " Kenta Akagi
2025-08-29  2:54   ` Li Nan
2025-08-29 12:21     ` Kenta Akagi
2025-08-30  8:48       ` Li Nan
2025-08-30 18:10         ` Kenta Akagi
2025-09-01  3:22           ` Li Nan
2025-09-01  4:22             ` Kenta Akagi
2025-09-01  7:48               ` Yu Kuai
2025-09-01 16:48                 ` Kenta Akagi
2025-08-28 16:32 ` [PATCH v3 2/3] md/raid1,raid10: Add error message when setting MD_BROKEN Kenta Akagi
2025-08-28 16:32 ` [PATCH v3 3/3] md/raid1,raid10: Fix: Operation continuing on 0 devices Kenta Akagi
