linux-raid.vger.kernel.org archive mirror
* [PATCH v2 0/3] md/raid1,raid10: don't broken array on failfast metadata write fails
@ 2025-08-17 17:27 Kenta Akagi
  2025-08-17 17:27 ` [PATCH v2 1/3] " Kenta Akagi
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Kenta Akagi @ 2025-08-17 17:27 UTC (permalink / raw)
  To: Song Liu, Yu Kuai, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, Kenta Akagi

Changes from V1:
- Avoid setting MD_BROKEN instead of clearing it
- Add pr_crit() when setting MD_BROKEN
- Fix the message that may be shown after all rdevs have failed:
  "Operation continuing on 0 devices"

A failfast bio, for example with nvme-tcp, will fail immediately
if the connection to the target is lost and the device enters a
reconnecting state - even though it would recover if given a few
seconds. This behavior is exactly as intended by the design of
failfast.

However, md treats super_write failures with failfast as fatal.
For example, if an initiator - that is, a machine loading the md
module - loses all connections for a few seconds, the array becomes
broken and subsequent writes are no longer possible.
This is the issue I am currently facing, and the one this series
aims to fix.

The 1st patch changes the behavior on super_write MD_FAILFAST IO failures.
The 2nd and 3rd patches modify the output of pr_crit.

Kenta Akagi (3):
  md/raid1,raid10: don't broken array on failfast metadata write fails
  md/raid1,raid10: Add error message when setting MD_BROKEN
  md/raid1,raid10: Fix: Operation continuing on 0 devices.

 drivers/md/md.c     |  9 ++++++---
 drivers/md/md.h     |  7 ++++---
 drivers/md/raid1.c  | 26 ++++++++++++++++++++------
 drivers/md/raid10.c | 26 ++++++++++++++++++++------
 4 files changed, 50 insertions(+), 18 deletions(-)

-- 
2.50.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v2 1/3] md/raid1,raid10: don't broken array on failfast metadata write fails
  2025-08-17 17:27 [PATCH v2 0/3] md/raid1,raid10: don't broken array on failfast metadata write fails Kenta Akagi
@ 2025-08-17 17:27 ` Kenta Akagi
  2025-08-18  2:05   ` Yu Kuai
  2025-08-23  1:54   ` Li Nan
  2025-08-17 17:27 ` [PATCH v2 2/3] md/raid1,raid10: Add error message when setting MD_BROKEN Kenta Akagi
  2025-08-17 17:27 ` [PATCH v2 3/3] md/raid1,raid10: Fix: Operation continuing on 0 devices Kenta Akagi
  2 siblings, 2 replies; 11+ messages in thread
From: Kenta Akagi @ 2025-08-17 17:27 UTC (permalink / raw)
  To: Song Liu, Yu Kuai, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, Kenta Akagi

A super_write IO failure with MD_FAILFAST must not cause the array
to fail.

Because a failfast bio may fail even when the rdev is not broken,
the IO must be retried rather than failing the array when a metadata
write with MD_FAILFAST fails on the last rdev.

A metadata write with MD_FAILFAST is retried after failure as
follows:

1. In super_written, MD_SB_NEED_REWRITE is set in sb_flags.

2. In md_super_wait, which is called by the function that
executed md_super_write and waits for completion,
-EAGAIN is returned because MD_SB_NEED_REWRITE is set.

3. The caller of md_super_wait (such as md_update_sb)
receives a negative return value and then retries md_super_write.

4. The md_super_write function, which is called to perform
the same metadata write, issues a write bio without MD_FAILFAST
this time.

When a write without MD_FAILFAST fails in super_written,
the array may be broken, and MD_BROKEN should be set.

After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"),
calling md_error on the last rdev in RAID1/10 always sets
the MD_BROKEN flag on the array.
As a result, when failfast IO fails on the last rdev, the array
immediately becomes failed.

This commit prevents MD_BROKEN from being set when a super_write with
MD_FAILFAST fails on the last rdev, ensuring that the array does
not become failed due to failfast IO failures.

Failfast IO failures on any rdev except the last one are not retried
and are marked as Faulty immediately. This minimizes array IO latency
when an rdev fails.

Fixes: 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10")
Signed-off-by: Kenta Akagi <k@mgml.me>
---
 drivers/md/md.c     |  9 ++++++---
 drivers/md/md.h     |  7 ++++---
 drivers/md/raid1.c  | 12 ++++++++++--
 drivers/md/raid10.c | 12 ++++++++++--
 4 files changed, 30 insertions(+), 10 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index ac85ec73a409..61a8188849a3 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -999,14 +999,17 @@ static void super_written(struct bio *bio)
 	if (bio->bi_status) {
 		pr_err("md: %s gets error=%d\n", __func__,
 		       blk_status_to_errno(bio->bi_status));
+		if (bio->bi_opf & MD_FAILFAST)
+			set_bit(FailfastIOFailure, &rdev->flags);
 		md_error(mddev, rdev);
 		if (!test_bit(Faulty, &rdev->flags)
 		    && (bio->bi_opf & MD_FAILFAST)) {
+			pr_warn("md: %s: Metadata write will be repeated to %pg\n",
+				mdname(mddev), rdev->bdev);
 			set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
-			set_bit(LastDev, &rdev->flags);
 		}
 	} else
-		clear_bit(LastDev, &rdev->flags);
+		clear_bit(FailfastIOFailure, &rdev->flags);
 
 	bio_put(bio);
 
@@ -1048,7 +1051,7 @@ void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
 
 	if (test_bit(MD_FAILFAST_SUPPORTED, &mddev->flags) &&
 	    test_bit(FailFast, &rdev->flags) &&
-	    !test_bit(LastDev, &rdev->flags))
+	    !test_bit(FailfastIOFailure, &rdev->flags))
 		bio->bi_opf |= MD_FAILFAST;
 
 	atomic_inc(&mddev->pending_writes);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 51af29a03079..cf989aca72ad 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -281,9 +281,10 @@ enum flag_bits {
 				 * It is expects that no bad block log
 				 * is present.
 				 */
-	LastDev,		/* Seems to be the last working dev as
-				 * it didn't fail, so don't use FailFast
-				 * any more for metadata
+	FailfastIOFailure,	/* A device that failed a metadata write
+				 * with failfast.
+				 * error_handler must not fail the array
+				 * if the last device has this flag.
 				 */
 	CollisionCheck,		/*
 				 * check if there is collision between raid1
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 408c26398321..fc7195e58f80 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1746,8 +1746,12 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
  *	- recovery is interrupted.
  *	- &mddev->degraded is bumped.
  *
- * @rdev is marked as &Faulty excluding case when array is failed and
- * &mddev->fail_last_dev is off.
+ * If @rdev is marked with &FailfastIOFailure, it means that a super_write
+ * failed in failfast mode and will be retried, so @mddev must not be failed.
+ *
+ * @rdev is marked as &Faulty except in the following cases:
+ *	- when @mddev is failed and &mddev->fail_last_dev is off
+ *	- when @rdev is the last device and &FailfastIOFailure is set
  */
 static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
 {
@@ -1758,6 +1762,10 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
 
 	if (test_bit(In_sync, &rdev->flags) &&
 	    (conf->raid_disks - mddev->degraded) == 1) {
+		if (test_bit(FailfastIOFailure, &rdev->flags)) {
+			spin_unlock_irqrestore(&conf->device_lock, flags);
+			return;
+		}
 		set_bit(MD_BROKEN, &mddev->flags);
 
 		if (!mddev->fail_last_dev) {
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index b60c30bfb6c7..ff105a0dcd05 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1995,8 +1995,12 @@ static int enough(struct r10conf *conf, int ignore)
  *	- recovery is interrupted.
  *	- &mddev->degraded is bumped.
  *
- * @rdev is marked as &Faulty excluding case when array is failed and
- * &mddev->fail_last_dev is off.
+ * If @rdev is marked with &FailfastIOFailure, it means that a super_write
+ * failed in failfast mode and will be retried, so @mddev must not be failed.
+ *
+ * @rdev is marked as &Faulty except in the following cases:
+ *	- when @mddev is failed and &mddev->fail_last_dev is off
+ *	- when @rdev is the last device and &FailfastIOFailure is set
  */
 static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
 {
@@ -2006,6 +2010,10 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
 	spin_lock_irqsave(&conf->device_lock, flags);
 
 	if (test_bit(In_sync, &rdev->flags) && !enough(conf, rdev->raid_disk)) {
+		if (test_bit(FailfastIOFailure, &rdev->flags)) {
+			spin_unlock_irqrestore(&conf->device_lock, flags);
+			return;
+		}
 		set_bit(MD_BROKEN, &mddev->flags);
 
 		if (!mddev->fail_last_dev) {
-- 
2.50.1



* [PATCH v2 2/3] md/raid1,raid10: Add error message when setting MD_BROKEN
  2025-08-17 17:27 [PATCH v2 0/3] md/raid1,raid10: don't broken array on failfast metadata write fails Kenta Akagi
  2025-08-17 17:27 ` [PATCH v2 1/3] " Kenta Akagi
@ 2025-08-17 17:27 ` Kenta Akagi
  2025-08-17 17:27 ` [PATCH v2 3/3] md/raid1,raid10: Fix: Operation continuing on 0 devices Kenta Akagi
  2 siblings, 0 replies; 11+ messages in thread
From: Kenta Akagi @ 2025-08-17 17:27 UTC (permalink / raw)
  To: Song Liu, Yu Kuai, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, Kenta Akagi

Once MD_BROKEN is set on an array, no further writes can be
performed to it.
The user must be informed that the array cannot continue operation.

Signed-off-by: Kenta Akagi <k@mgml.me>
---
 drivers/md/raid1.c  | 5 +++++
 drivers/md/raid10.c | 5 +++++
 2 files changed, 10 insertions(+)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index fc7195e58f80..007e825c2e07 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1766,7 +1766,12 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
 			spin_unlock_irqrestore(&conf->device_lock, flags);
 			return;
 		}
+
 		set_bit(MD_BROKEN, &mddev->flags);
+		pr_crit("md/raid1:%s: Disk failure on %pg, this is the last device.\n"
+			"md/raid1:%s: Cannot continue operation (%d/%d failed).\n",
+			mdname(mddev), rdev->bdev,
+			mdname(mddev), mddev->degraded + 1, conf->raid_disks);
 
 		if (!mddev->fail_last_dev) {
 			conf->recovery_disabled = mddev->recovery_disabled;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index ff105a0dcd05..07248142ac52 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2014,7 +2014,12 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
 			spin_unlock_irqrestore(&conf->device_lock, flags);
 			return;
 		}
+
 		set_bit(MD_BROKEN, &mddev->flags);
+		pr_crit("md/raid10:%s: Disk failure on %pg, this is the last device.\n"
+			"md/raid10:%s: Cannot continue operation (%d/%d failed).\n",
+			mdname(mddev), rdev->bdev,
+			mdname(mddev), mddev->degraded + 1, conf->geo.raid_disks);
 
 		if (!mddev->fail_last_dev) {
 			spin_unlock_irqrestore(&conf->device_lock, flags);
-- 
2.50.1



* [PATCH v2 3/3] md/raid1,raid10: Fix: Operation continuing on 0 devices.
  2025-08-17 17:27 [PATCH v2 0/3] md/raid1,raid10: don't broken array on failfast metadata write fails Kenta Akagi
  2025-08-17 17:27 ` [PATCH v2 1/3] " Kenta Akagi
  2025-08-17 17:27 ` [PATCH v2 2/3] md/raid1,raid10: Add error message when setting MD_BROKEN Kenta Akagi
@ 2025-08-17 17:27 ` Kenta Akagi
  2 siblings, 0 replies; 11+ messages in thread
From: Kenta Akagi @ 2025-08-17 17:27 UTC (permalink / raw)
  To: Song Liu, Yu Kuai, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, Kenta Akagi

Since commit 9a567843f7ce ("md: allow last device to be forcibly
removed from RAID1/RAID10."), RAID1/10 arrays can now lose all rdevs.

Before that commit, losing the array's last rdev, or reaching the end
of raid{1,10}_error without an early return, never occurred.
However, both situations can occur in the current implementation.

As a result, when mddev->fail_last_dev is set, a spurious pr_crit
message can be printed.

This patch prevents the "Operation continuing" message from being
printed when the array is not operational.

root@fedora:~# mdadm --create --verbose /dev/md0 --level=1 \
--raid-devices=2  /dev/loop0 /dev/loop1
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
mdadm: size set to 1046528K
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
root@fedora:~# echo 1 > /sys/block/md0/md/fail_last_dev
root@fedora:~# mdadm --fail /dev/md0 loop0
mdadm: set loop0 faulty in /dev/md0
root@fedora:~# mdadm --fail /dev/md0 loop1
mdadm: set device faulty failed for loop1:  Device or resource busy
root@fedora:~# dmesg | tail -n 4
[ 1314.359674] md/raid1:md0: Disk failure on loop0, disabling device.
               md/raid1:md0: Operation continuing on 1 devices.
[ 1315.506633] md/raid1:md0: Disk failure on loop1, disabling device.
               md/raid1:md0: Operation continuing on 0 devices.
root@fedora:~#

Fixes: 9a567843f7ce ("md: allow last device to be forcibly removed from RAID1/RAID10.")
Signed-off-by: Kenta Akagi <k@mgml.me>
---
 drivers/md/raid1.c  | 9 +++++----
 drivers/md/raid10.c | 9 +++++----
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 007e825c2e07..095a0dbb5167 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1783,6 +1783,11 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
 	if (test_and_clear_bit(In_sync, &rdev->flags))
 		mddev->degraded++;
 	set_bit(Faulty, &rdev->flags);
+	if ((conf->raid_disks - mddev->degraded) > 0)
+		pr_crit("md/raid1:%s: Disk failure on %pg, disabling device.\n"
+			"md/raid1:%s: Operation continuing on %d devices.\n",
+			mdname(mddev), rdev->bdev,
+			mdname(mddev), conf->raid_disks - mddev->degraded);
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 	/*
 	 * if recovery is running, make sure it aborts.
@@ -1790,10 +1795,6 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
 	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
 	set_mask_bits(&mddev->sb_flags, 0,
 		      BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
-	pr_crit("md/raid1:%s: Disk failure on %pg, disabling device.\n"
-		"md/raid1:%s: Operation continuing on %d devices.\n",
-		mdname(mddev), rdev->bdev,
-		mdname(mddev), conf->raid_disks - mddev->degraded);
 }
 
 static void print_conf(struct r1conf *conf)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 07248142ac52..407edf1b9708 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2034,11 +2034,12 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
 	set_bit(Faulty, &rdev->flags);
 	set_mask_bits(&mddev->sb_flags, 0,
 		      BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
+	if (enough(conf, -1))
+		pr_crit("md/raid10:%s: Disk failure on %pg, disabling device.\n"
+			"md/raid10:%s: Operation continuing on %d devices.\n",
+			mdname(mddev), rdev->bdev,
+			mdname(mddev), conf->geo.raid_disks - mddev->degraded);
 	spin_unlock_irqrestore(&conf->device_lock, flags);
-	pr_crit("md/raid10:%s: Disk failure on %pg, disabling device.\n"
-		"md/raid10:%s: Operation continuing on %d devices.\n",
-		mdname(mddev), rdev->bdev,
-		mdname(mddev), conf->geo.raid_disks - mddev->degraded);
 }
 
 static void print_conf(struct r10conf *conf)
-- 
2.50.1



* Re: [PATCH v2 1/3] md/raid1,raid10: don't broken array on failfast metadata write fails
  2025-08-17 17:27 ` [PATCH v2 1/3] " Kenta Akagi
@ 2025-08-18  2:05   ` Yu Kuai
  2025-08-18  2:48     ` Yu Kuai
  2025-08-23  1:54   ` Li Nan
  1 sibling, 1 reply; 11+ messages in thread
From: Yu Kuai @ 2025-08-18  2:05 UTC (permalink / raw)
  To: Kenta Akagi, Song Liu, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, yukuai (C)

Hi,

On 2025/08/18 1:27, Kenta Akagi wrote:
> A super_write IO failure with MD_FAILFAST must not cause the array
> to fail.
> 
> Because a failfast bio may fail even when the rdev is not broken,
> so IO must be retried rather than failing the array when a metadata
> write with MD_FAILFAST fails on the last rdev.

Why just the last rdev? If failfast can fail when the rdev is not broken, I
feel we should retry for all rdevs.
> 
> A metadata write with MD_FAILFAST is retried after failure as
> follows:
> 
> 1. In super_written, MD_SB_NEED_REWRITE is set in sb_flags.
> 
> 2. In md_super_wait, which is called by the function that
> executed md_super_write and waits for completion,
> -EAGAIN is returned because MD_SB_NEED_REWRITE is set.
> 
> 3. The caller of md_super_wait (such as md_update_sb)
> receives a negative return value and then retries md_super_write.
> 
> 4. The md_super_write function, which is called to perform
> the same metadata write, issues a write bio without MD_FAILFAST
> this time.
> 
> When a write from super_written without MD_FAILFAST fails,
> the array may broken, and MD_BROKEN should be set.
> 
> After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"),
> calling md_error on the last rdev in RAID1/10 always sets
> the MD_BROKEN flag on the array.
> As a result, when failfast IO fails on the last rdev, the array
> immediately becomes failed.
> 
> This commit prevents MD_BROKEN from being set when a super_write with
> MD_FAILFAST fails on the last rdev, ensuring that the array does
> not become failed due to failfast IO failures.
> 
> Failfast IO failures on any rdev except the last one are not retried
> and are marked as Faulty immediately. This minimizes array IO latency
> when an rdev fails.
> 
> Fixes: 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10")
> Signed-off-by: Kenta Akagi <k@mgml.me>
> ---
>   drivers/md/md.c     |  9 ++++++---
>   drivers/md/md.h     |  7 ++++---
>   drivers/md/raid1.c  | 12 ++++++++++--
>   drivers/md/raid10.c | 12 ++++++++++--
>   4 files changed, 30 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index ac85ec73a409..61a8188849a3 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -999,14 +999,17 @@ static void super_written(struct bio *bio)
>   	if (bio->bi_status) {
>   		pr_err("md: %s gets error=%d\n", __func__,
>   		       blk_status_to_errno(bio->bi_status));
> +		if (bio->bi_opf & MD_FAILFAST)
> +			set_bit(FailfastIOFailure, &rdev->flags);

I think it's better to retry the bio with the flag cleared, then all
underlying procedures can stay the same.

Thanks,
Kuai

>   		md_error(mddev, rdev);
>   		if (!test_bit(Faulty, &rdev->flags)
>   		    && (bio->bi_opf & MD_FAILFAST)) {
> +			pr_warn("md: %s: Metadata write will be repeated to %pg\n",
> +				mdname(mddev), rdev->bdev);
>   			set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
> -			set_bit(LastDev, &rdev->flags);
>   		}
>   	} else
> -		clear_bit(LastDev, &rdev->flags);
> +		clear_bit(FailfastIOFailure, &rdev->flags);
>   
>   	bio_put(bio);
>   
> @@ -1048,7 +1051,7 @@ void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
>   
>   	if (test_bit(MD_FAILFAST_SUPPORTED, &mddev->flags) &&
>   	    test_bit(FailFast, &rdev->flags) &&
> -	    !test_bit(LastDev, &rdev->flags))
> +	    !test_bit(FailfastIOFailure, &rdev->flags))
>   		bio->bi_opf |= MD_FAILFAST;
>   
>   	atomic_inc(&mddev->pending_writes);
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 51af29a03079..cf989aca72ad 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -281,9 +281,10 @@ enum flag_bits {
>   				 * It is expects that no bad block log
>   				 * is present.
>   				 */
> -	LastDev,		/* Seems to be the last working dev as
> -				 * it didn't fail, so don't use FailFast
> -				 * any more for metadata
> +	FailfastIOFailure,	/* A device that failled a metadata write
> +				 * with failfast.
> +				 * error_handler must not fail the array
> +				 * if last device has this flag.
>   				 */
>   	CollisionCheck,		/*
>   				 * check if there is collision between raid1
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 408c26398321..fc7195e58f80 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1746,8 +1746,12 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>    *	- recovery is interrupted.
>    *	- &mddev->degraded is bumped.
>    *
> - * @rdev is marked as &Faulty excluding case when array is failed and
> - * &mddev->fail_last_dev is off.
> + * If @rdev is marked with &FailfastIOFailure, it means that super_write
> + * failed in failfast and will be retried, so the @mddev did not fail.
> + *
> + * @rdev is marked as &Faulty excluding any cases:
> + *	- when @mddev is failed and &mddev->fail_last_dev is off
> + *	- when @rdev is last device and &FailfastIOFailure flag is set
>    */
>   static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>   {
> @@ -1758,6 +1762,10 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>   
>   	if (test_bit(In_sync, &rdev->flags) &&
>   	    (conf->raid_disks - mddev->degraded) == 1) {
> +		if (test_bit(FailfastIOFailure, &rdev->flags)) {
> +			spin_unlock_irqrestore(&conf->device_lock, flags);
> +			return;
> +		}
>   		set_bit(MD_BROKEN, &mddev->flags);
>   
>   		if (!mddev->fail_last_dev) {
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index b60c30bfb6c7..ff105a0dcd05 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -1995,8 +1995,12 @@ static int enough(struct r10conf *conf, int ignore)
>    *	- recovery is interrupted.
>    *	- &mddev->degraded is bumped.
>    *
> - * @rdev is marked as &Faulty excluding case when array is failed and
> - * &mddev->fail_last_dev is off.
> + * If @rdev is marked with &FailfastIOFailure, it means that super_write
> + * failed in failfast, so the @mddev did not fail.
> + *
> + * @rdev is marked as &Faulty excluding any cases:
> + *	- when @mddev is failed and &mddev->fail_last_dev is off
> + *	- when @rdev is last device and &FailfastIOFailure flag is set
>    */
>   static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
>   {
> @@ -2006,6 +2010,10 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
>   	spin_lock_irqsave(&conf->device_lock, flags);
>   
>   	if (test_bit(In_sync, &rdev->flags) && !enough(conf, rdev->raid_disk)) {
> +		if (test_bit(FailfastIOFailure, &rdev->flags)) {
> +			spin_unlock_irqrestore(&conf->device_lock, flags);
> +			return;
> +		}
>   		set_bit(MD_BROKEN, &mddev->flags);
>   
>   		if (!mddev->fail_last_dev) {
> 



* Re: [PATCH v2 1/3] md/raid1,raid10: don't broken array on failfast metadata write fails
  2025-08-18  2:05   ` Yu Kuai
@ 2025-08-18  2:48     ` Yu Kuai
  2025-08-18 12:48       ` Kenta Akagi
  0 siblings, 1 reply; 11+ messages in thread
From: Yu Kuai @ 2025-08-18  2:48 UTC (permalink / raw)
  To: Yu Kuai, Kenta Akagi, Song Liu, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, yukuai (C)

Hi,

On 2025/08/18 10:05, Yu Kuai wrote:
> Hi,
> 
> On 2025/08/18 1:27, Kenta Akagi wrote:
>> A super_write IO failure with MD_FAILFAST must not cause the array
>> to fail.
>>
>> Because a failfast bio may fail even when the rdev is not broken,
>> so IO must be retried rather than failing the array when a metadata
>> write with MD_FAILFAST fails on the last rdev.
> 
> Why just last rdev? If failfast can fail when the rdev is not broken, I
> feel we should retry for all the rdev.

BTW, I couldn't figure out why failfast is added for the metadata
write. I feel that simply removing this flag for metadata writes would
fix this problem.

Thanks,
Kuai

>>
>> A metadata write with MD_FAILFAST is retried after failure as
>> follows:
>>
>> 1. In super_written, MD_SB_NEED_REWRITE is set in sb_flags.
>>
>> 2. In md_super_wait, which is called by the function that
>> executed md_super_write and waits for completion,
>> -EAGAIN is returned because MD_SB_NEED_REWRITE is set.
>>
>> 3. The caller of md_super_wait (such as md_update_sb)
>> receives a negative return value and then retries md_super_write.
>>
>> 4. The md_super_write function, which is called to perform
>> the same metadata write, issues a write bio without MD_FAILFAST
>> this time.
>>
>> When a write from super_written without MD_FAILFAST fails,
>> the array may broken, and MD_BROKEN should be set.
>>
>> After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"),
>> calling md_error on the last rdev in RAID1/10 always sets
>> the MD_BROKEN flag on the array.
>> As a result, when failfast IO fails on the last rdev, the array
>> immediately becomes failed.
>>
>> This commit prevents MD_BROKEN from being set when a super_write with
>> MD_FAILFAST fails on the last rdev, ensuring that the array does
>> not become failed due to failfast IO failures.
>>
>> Failfast IO failures on any rdev except the last one are not retried
>> and are marked as Faulty immediately. This minimizes array IO latency
>> when an rdev fails.
>>
>> Fixes: 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10")
>> Signed-off-by: Kenta Akagi <k@mgml.me>
>> ---
>>   drivers/md/md.c     |  9 ++++++---
>>   drivers/md/md.h     |  7 ++++---
>>   drivers/md/raid1.c  | 12 ++++++++++--
>>   drivers/md/raid10.c | 12 ++++++++++--
>>   4 files changed, 30 insertions(+), 10 deletions(-)
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index ac85ec73a409..61a8188849a3 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -999,14 +999,17 @@ static void super_written(struct bio *bio)
>>       if (bio->bi_status) {
>>           pr_err("md: %s gets error=%d\n", __func__,
>>                  blk_status_to_errno(bio->bi_status));
>> +        if (bio->bi_opf & MD_FAILFAST)
>> +            set_bit(FailfastIOFailure, &rdev->flags);
> 
> I think it's better to retry the bio with the flag cleared, then all
> underlying procedures can stay the same.
> 
> Thanks,
> Kuai
> 
>>           md_error(mddev, rdev);
>>           if (!test_bit(Faulty, &rdev->flags)
>>               && (bio->bi_opf & MD_FAILFAST)) {
>> +            pr_warn("md: %s: Metadata write will be repeated to %pg\n",
>> +                mdname(mddev), rdev->bdev);
>>               set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
>> -            set_bit(LastDev, &rdev->flags);
>>           }
>>       } else
>> -        clear_bit(LastDev, &rdev->flags);
>> +        clear_bit(FailfastIOFailure, &rdev->flags);
>>       bio_put(bio);
>> @@ -1048,7 +1051,7 @@ void md_super_write(struct mddev *mddev, struct 
>> md_rdev *rdev,
>>       if (test_bit(MD_FAILFAST_SUPPORTED, &mddev->flags) &&
>>           test_bit(FailFast, &rdev->flags) &&
>> -        !test_bit(LastDev, &rdev->flags))
>> +        !test_bit(FailfastIOFailure, &rdev->flags))
>>           bio->bi_opf |= MD_FAILFAST;
>>       atomic_inc(&mddev->pending_writes);
>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>> index 51af29a03079..cf989aca72ad 100644
>> --- a/drivers/md/md.h
>> +++ b/drivers/md/md.h
>> @@ -281,9 +281,10 @@ enum flag_bits {
>>                    * It is expects that no bad block log
>>                    * is present.
>>                    */
>> -    LastDev,        /* Seems to be the last working dev as
>> -                 * it didn't fail, so don't use FailFast
>> -                 * any more for metadata
>> +    FailfastIOFailure,    /* A device that failled a metadata write
>> +                 * with failfast.
>> +                 * error_handler must not fail the array
>> +                 * if last device has this flag.
>>                    */
>>       CollisionCheck,        /*
>>                    * check if there is collision between raid1
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index 408c26398321..fc7195e58f80 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -1746,8 +1746,12 @@ static void raid1_status(struct seq_file *seq, 
>> struct mddev *mddev)
>>    *    - recovery is interrupted.
>>    *    - &mddev->degraded is bumped.
>>    *
>> - * @rdev is marked as &Faulty excluding case when array is failed and
>> - * &mddev->fail_last_dev is off.
>> + * If @rdev is marked with &FailfastIOFailure, it means that super_write
>> + * failed in failfast and will be retried, so the @mddev did not fail.
>> + *
>> + * @rdev is marked as &Faulty excluding any cases:
>> + *    - when @mddev is failed and &mddev->fail_last_dev is off
>> + *    - when @rdev is last device and &FailfastIOFailure flag is set
>>    */
>>   static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>   {
>> @@ -1758,6 +1762,10 @@ static void raid1_error(struct mddev *mddev, 
>> struct md_rdev *rdev)
>>       if (test_bit(In_sync, &rdev->flags) &&
>>           (conf->raid_disks - mddev->degraded) == 1) {
>> +        if (test_bit(FailfastIOFailure, &rdev->flags)) {
>> +            spin_unlock_irqrestore(&conf->device_lock, flags);
>> +            return;
>> +        }
>>           set_bit(MD_BROKEN, &mddev->flags);
>>           if (!mddev->fail_last_dev) {
>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>> index b60c30bfb6c7..ff105a0dcd05 100644
>> --- a/drivers/md/raid10.c
>> +++ b/drivers/md/raid10.c
>> @@ -1995,8 +1995,12 @@ static int enough(struct r10conf *conf, int 
>> ignore)
>>    *    - recovery is interrupted.
>>    *    - &mddev->degraded is bumped.
>>    *
>> - * @rdev is marked as &Faulty excluding case when array is failed and
>> - * &mddev->fail_last_dev is off.
>> + * If @rdev is marked with &FailfastIOFailure, it means that super_write
>> + * failed in failfast, so the @mddev did not fail.
>> + *
>> + * @rdev is marked as &Faulty excluding any cases:
>> + *    - when @mddev is failed and &mddev->fail_last_dev is off
>> + *    - when @rdev is last device and &FailfastIOFailure flag is set
>>    */
>>   static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
>>   {
>> @@ -2006,6 +2010,10 @@ static void raid10_error(struct mddev *mddev, 
>> struct md_rdev *rdev)
>>       spin_lock_irqsave(&conf->device_lock, flags);
>>       if (test_bit(In_sync, &rdev->flags) && !enough(conf, 
>> rdev->raid_disk)) {
>> +        if (test_bit(FailfastIOFailure, &rdev->flags)) {
>> +            spin_unlock_irqrestore(&conf->device_lock, flags);
>> +            return;
>> +        }
>>           set_bit(MD_BROKEN, &mddev->flags);
>>           if (!mddev->fail_last_dev) {
>>
> 
> .
> 



* Re: [PATCH v2 1/3] md/raid1,raid10: don't broken array on failfast metadata write fails
  2025-08-18  2:48     ` Yu Kuai
@ 2025-08-18 12:48       ` Kenta Akagi
  2025-08-18 15:45         ` Yu Kuai
  0 siblings, 1 reply; 11+ messages in thread
From: Kenta Akagi @ 2025-08-18 12:48 UTC (permalink / raw)
  To: Yu Kuai, Song Liu, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, yukuai (C), Kenta Akagi

On 2025/08/18 11:48, Yu Kuai wrote:
> Hi,
> 
> On 2025/08/18 10:05, Yu Kuai wrote:
>> Hi,
>>
>> On 2025/08/18 1:27, Kenta Akagi wrote:
>>> A super_write IO failure with MD_FAILFAST must not cause the array
>>> to fail.
>>>
>>> Because a failfast bio may fail even when the rdev is not broken,
>>> so IO must be retried rather than failing the array when a metadata
>>> write with MD_FAILFAST fails on the last rdev.
>>
>> Why just last rdev? If failfast can fail when the rdev is not broken, I
>> feel we should retry for all the rdev.

Thank you for reviewing.

The reason this retry applies only to the last rdev is that the purpose
of failfast is to quickly detach a faulty device and thereby minimize
mddev IO latency on rdev failure.
If md retried all rdevs, the Faulty handler would no longer act
quickly enough, which would always "cause long delays" [1].
I believe this is not the behavior users want.

[1] https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git/tree/mdadm.8.in?h=main&id=34f21b7acea8afbea9348d0f421beeeedca7a136#n784

> 
> BTW, I couldn't figure out the reason, why failfast is added for the
> meta write. I do feel just remove this flag for metadata write will fix
> this problem.

By issuing metadata writes with failfast in md, it becomes possible to 
detect rdev failures quickly.
Most applications never issue IO with the REQ_FAILFAST flag set, 
so if md issued its metadata writes without failfast, 
rdev failures would not be detected quickly.
This would undermine the point of md's failfast feature, 
and would also "cause long delays" [1].
I believe this is also not what users want.

> Thanks,
> Kuai
> 
>>>
>>> A metadata write with MD_FAILFAST is retried after failure as
>>> follows:
>>>
>>> 1. In super_written, MD_SB_NEED_REWRITE is set in sb_flags.
>>>
>>> 2. In md_super_wait, which is called by the function that
>>> executed md_super_write and waits for completion,
>>> -EAGAIN is returned because MD_SB_NEED_REWRITE is set.
>>>
>>> 3. The caller of md_super_wait (such as md_update_sb)
>>> receives a negative return value and then retries md_super_write.
>>>
>>> 4. The md_super_write function, which is called to perform
>>> the same metadata write, issues a write bio without MD_FAILFAST
>>> this time.
>>>
>>> When a write from super_written without MD_FAILFAST fails,
>>> the array may broken, and MD_BROKEN should be set.
>>>
>>> After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"),
>>> calling md_error on the last rdev in RAID1/10 always sets
>>> the MD_BROKEN flag on the array.
>>> As a result, when failfast IO fails on the last rdev, the array
>>> immediately becomes failed.
>>>
>>> This commit prevents MD_BROKEN from being set when a super_write with
>>> MD_FAILFAST fails on the last rdev, ensuring that the array does
>>> not become failed due to failfast IO failures.
>>>
>>> Failfast IO failures on any rdev except the last one are not retried
>>> and are marked as Faulty immediately. This minimizes array IO latency
>>> when an rdev fails.
>>>
>>> Fixes: 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10")
>>> Signed-off-by: Kenta Akagi <k@mgml.me>
>>> ---
>>>   drivers/md/md.c     |  9 ++++++---
>>>   drivers/md/md.h     |  7 ++++---
>>>   drivers/md/raid1.c  | 12 ++++++++++--
>>>   drivers/md/raid10.c | 12 ++++++++++--
>>>   4 files changed, 30 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>> index ac85ec73a409..61a8188849a3 100644
>>> --- a/drivers/md/md.c
>>> +++ b/drivers/md/md.c
>>> @@ -999,14 +999,17 @@ static void super_written(struct bio *bio)
>>>       if (bio->bi_status) {
>>>           pr_err("md: %s gets error=%d\n", __func__,
>>>                  blk_status_to_errno(bio->bi_status));
>>> +        if (bio->bi_opf & MD_FAILFAST)
>>> +            set_bit(FailfastIOFailure, &rdev->flags);
>>
>> I think it's better to retry the bio with the flag cleared, then all
>> underlying procedures can stay the same.

That might be a better approach. I'll check the call hierarchy and lock dependencies.

Thanks,
Akagi


>>
>> Thanks,
>> Kuai
>>
>>>           md_error(mddev, rdev);
>>>           if (!test_bit(Faulty, &rdev->flags)
>>>               && (bio->bi_opf & MD_FAILFAST)) {
>>> +            pr_warn("md: %s: Metadata write will be repeated to %pg\n",
>>> +                mdname(mddev), rdev->bdev);
>>>               set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
>>> -            set_bit(LastDev, &rdev->flags);
>>>           }
>>>       } else
>>> -        clear_bit(LastDev, &rdev->flags);
>>> +        clear_bit(FailfastIOFailure, &rdev->flags);
>>>       bio_put(bio);
>>> @@ -1048,7 +1051,7 @@ void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
>>>       if (test_bit(MD_FAILFAST_SUPPORTED, &mddev->flags) &&
>>>           test_bit(FailFast, &rdev->flags) &&
>>> -        !test_bit(LastDev, &rdev->flags))
>>> +        !test_bit(FailfastIOFailure, &rdev->flags))
>>>           bio->bi_opf |= MD_FAILFAST;
>>>       atomic_inc(&mddev->pending_writes);
>>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>>> index 51af29a03079..cf989aca72ad 100644
>>> --- a/drivers/md/md.h
>>> +++ b/drivers/md/md.h
>>> @@ -281,9 +281,10 @@ enum flag_bits {
>>>                    * It is expects that no bad block log
>>>                    * is present.
>>>                    */
>>> -    LastDev,        /* Seems to be the last working dev as
>>> -                 * it didn't fail, so don't use FailFast
>>> -                 * any more for metadata
>>> +    FailfastIOFailure,    /* A device that failled a metadata write
>>> +                 * with failfast.
>>> +                 * error_handler must not fail the array
>>> +                 * if last device has this flag.
>>>                    */
>>>       CollisionCheck,        /*
>>>                    * check if there is collision between raid1
>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>> index 408c26398321..fc7195e58f80 100644
>>> --- a/drivers/md/raid1.c
>>> +++ b/drivers/md/raid1.c
>>> @@ -1746,8 +1746,12 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>>>    *    - recovery is interrupted.
>>>    *    - &mddev->degraded is bumped.
>>>    *
>>> - * @rdev is marked as &Faulty excluding case when array is failed and
>>> - * &mddev->fail_last_dev is off.
>>> + * If @rdev is marked with &FailfastIOFailure, it means that super_write
>>> + * failed in failfast and will be retried, so the @mddev did not fail.
>>> + *
>>> + * @rdev is marked as &Faulty excluding any cases:
>>> + *    - when @mddev is failed and &mddev->fail_last_dev is off
>>> + *    - when @rdev is last device and &FailfastIOFailure flag is set
>>>    */
>>>   static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>>   {
>>> @@ -1758,6 +1762,10 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>>       if (test_bit(In_sync, &rdev->flags) &&
>>>           (conf->raid_disks - mddev->degraded) == 1) {
>>> +        if (test_bit(FailfastIOFailure, &rdev->flags)) {
>>> +            spin_unlock_irqrestore(&conf->device_lock, flags);
>>> +            return;
>>> +        }
>>>           set_bit(MD_BROKEN, &mddev->flags);
>>>           if (!mddev->fail_last_dev) {
>>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>>> index b60c30bfb6c7..ff105a0dcd05 100644
>>> --- a/drivers/md/raid10.c
>>> +++ b/drivers/md/raid10.c
>>> @@ -1995,8 +1995,12 @@ static int enough(struct r10conf *conf, int ignore)
>>>    *    - recovery is interrupted.
>>>    *    - &mddev->degraded is bumped.
>>>    *
>>> - * @rdev is marked as &Faulty excluding case when array is failed and
>>> - * &mddev->fail_last_dev is off.
>>> + * If @rdev is marked with &FailfastIOFailure, it means that super_write
>>> + * failed in failfast, so the @mddev did not fail.
>>> + *
>>> + * @rdev is marked as &Faulty excluding any cases:
>>> + *    - when @mddev is failed and &mddev->fail_last_dev is off
>>> + *    - when @rdev is last device and &FailfastIOFailure flag is set
>>>    */
>>>   static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
>>>   {
>>> @@ -2006,6 +2010,10 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
>>>       spin_lock_irqsave(&conf->device_lock, flags);
>>>       if (test_bit(In_sync, &rdev->flags) && !enough(conf, rdev->raid_disk)) {
>>> +        if (test_bit(FailfastIOFailure, &rdev->flags)) {
>>> +            spin_unlock_irqrestore(&conf->device_lock, flags);
>>> +            return;
>>> +        }
>>>           set_bit(MD_BROKEN, &mddev->flags);
>>>           if (!mddev->fail_last_dev) {
>>>
>>
>> .
>>
> 
> 


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/3] md/raid1,raid10: don't broken array on failfast metadata write fails
  2025-08-18 12:48       ` Kenta Akagi
@ 2025-08-18 15:45         ` Yu Kuai
  2025-08-20 17:09           ` Kenta Akagi
  0 siblings, 1 reply; 11+ messages in thread
From: Yu Kuai @ 2025-08-18 15:45 UTC (permalink / raw)
  To: Kenta Akagi, Yu Kuai, Song Liu, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, yukuai (C)

Hi,

在 2025/8/18 20:48, Kenta Akagi 写道:
> On 2025/08/18 11:48, Yu Kuai wrote:
>> Hi,
>>
>> 在 2025/08/18 10:05, Yu Kuai 写道:
>>> Hi,
>>>
>>> 在 2025/08/18 1:27, Kenta Akagi 写道:
>>>> A super_write IO failure with MD_FAILFAST must not cause the array
>>>> to fail.
>>>>
>>>> Because a failfast bio may fail even when the rdev is not broken,
>>>> so IO must be retried rather than failing the array when a metadata
>>>> write with MD_FAILFAST fails on the last rdev.
>>> Why just last rdev? If failfast can fail when the rdev is not broken, I
>>> feel we should retry for all the rdev.
> Thank you for reviewing.
>
> The reason this retry applies only to the last rdev is that the purpose
> of failfast is to quickly detach a faulty device and thereby minimize
> mdev IO latency on rdev failure.
> If md retries all rdevs, the Faulty handler will no longer act
> quickly enough, which will always "cause long delays" [1].
> I believe this is not the behavior users want.
>
> [1] https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git/tree/mdadm.8.in?h=main&id=34f21b7acea8afbea9348d0f421beeeedca7a136#n784
>
>> BTW, I couldn't figure out the reason, why failfast is added for the
>> meta write. I do feel just remove this flag for metadata write will fix
>> this problem.
> By issuing metadata writes with failfast in md, it becomes possible to
> detect rdev failures quickly.
> Most applications never issue IO with the REQ_FAILFAST flag set,
> so if md issues its metadata writes without failfast,
> rdev failures would not be detected quickly.
> This would undermine the point of the md's failfast feature.
> And this would also "cause long delays" [1].
> I believe this is also not what users want.

Yes, this makes sense. But I was thinking failfast would apply to normal IO,
not metadata IO like updating the superblock, which doesn't happen very often
for users. But considering we have had this behavior for such a long time, I agree
we'd better not change it.

>> Thanks,
>> Kuai
>>
>>>> A metadata write with MD_FAILFAST is retried after failure as
>>>> follows:
>>>>
>>>> 1. In super_written, MD_SB_NEED_REWRITE is set in sb_flags.
>>>>
>>>> 2. In md_super_wait, which is called by the function that
>>>> executed md_super_write and waits for completion,
>>>> -EAGAIN is returned because MD_SB_NEED_REWRITE is set.
>>>>
>>>> 3. The caller of md_super_wait (such as md_update_sb)
>>>> receives a negative return value and then retries md_super_write.
>>>>
>>>> 4. The md_super_write function, which is called to perform
>>>> the same metadata write, issues a write bio without MD_FAILFAST
>>>> this time.
>>>>
>>>> When a write from super_written without MD_FAILFAST fails,
>>>> the array may broken, and MD_BROKEN should be set.
>>>>
>>>> After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"),
>>>> calling md_error on the last rdev in RAID1/10 always sets
>>>> the MD_BROKEN flag on the array.
>>>> As a result, when failfast IO fails on the last rdev, the array
>>>> immediately becomes failed.
>>>>
>>>> This commit prevents MD_BROKEN from being set when a super_write with
>>>> MD_FAILFAST fails on the last rdev, ensuring that the array does
>>>> not become failed due to failfast IO failures.
>>>>
>>>> Failfast IO failures on any rdev except the last one are not retried
>>>> and are marked as Faulty immediately. This minimizes array IO latency
>>>> when an rdev fails.
>>>>
>>>> Fixes: 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10")
>>>> Signed-off-by: Kenta Akagi <k@mgml.me>
>>>> ---
>>>>    drivers/md/md.c     |  9 ++++++---
>>>>    drivers/md/md.h     |  7 ++++---
>>>>    drivers/md/raid1.c  | 12 ++++++++++--
>>>>    drivers/md/raid10.c | 12 ++++++++++--
>>>>    4 files changed, 30 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>>> index ac85ec73a409..61a8188849a3 100644
>>>> --- a/drivers/md/md.c
>>>> +++ b/drivers/md/md.c
>>>> @@ -999,14 +999,17 @@ static void super_written(struct bio *bio)
>>>>        if (bio->bi_status) {
>>>>            pr_err("md: %s gets error=%d\n", __func__,
>>>>                   blk_status_to_errno(bio->bi_status));
>>>> +        if (bio->bi_opf & MD_FAILFAST)
>>>> +            set_bit(FailfastIOFailure, &rdev->flags);
>>> I think it's better to retry the bio with the flag cleared, then all
>>> underlying procedures can stay the same.
> That might be a better approach. I'll check the call hierarchy and lock dependencies.

You might need to add a new async work to resubmit this bio.

Thanks,
Kuai

> Thanks,
> Akagi
>
>
>>> Thanks,
>>> Kuai
>>>
>>>>            md_error(mddev, rdev);
>>>>            if (!test_bit(Faulty, &rdev->flags)
>>>>                && (bio->bi_opf & MD_FAILFAST)) {
>>>> +            pr_warn("md: %s: Metadata write will be repeated to %pg\n",
>>>> +                mdname(mddev), rdev->bdev);
>>>>                set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
>>>> -            set_bit(LastDev, &rdev->flags);
>>>>            }
>>>>        } else
>>>> -        clear_bit(LastDev, &rdev->flags);
>>>> +        clear_bit(FailfastIOFailure, &rdev->flags);
>>>>        bio_put(bio);
>>>> @@ -1048,7 +1051,7 @@ void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
>>>>        if (test_bit(MD_FAILFAST_SUPPORTED, &mddev->flags) &&
>>>>            test_bit(FailFast, &rdev->flags) &&
>>>> -        !test_bit(LastDev, &rdev->flags))
>>>> +        !test_bit(FailfastIOFailure, &rdev->flags))
>>>>            bio->bi_opf |= MD_FAILFAST;
>>>>        atomic_inc(&mddev->pending_writes);
>>>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>>>> index 51af29a03079..cf989aca72ad 100644
>>>> --- a/drivers/md/md.h
>>>> +++ b/drivers/md/md.h
>>>> @@ -281,9 +281,10 @@ enum flag_bits {
>>>>                     * It is expects that no bad block log
>>>>                     * is present.
>>>>                     */
>>>> -    LastDev,        /* Seems to be the last working dev as
>>>> -                 * it didn't fail, so don't use FailFast
>>>> -                 * any more for metadata
>>>> +    FailfastIOFailure,    /* A device that failled a metadata write
>>>> +                 * with failfast.
>>>> +                 * error_handler must not fail the array
>>>> +                 * if last device has this flag.
>>>>                     */
>>>>        CollisionCheck,        /*
>>>>                     * check if there is collision between raid1
>>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>>> index 408c26398321..fc7195e58f80 100644
>>>> --- a/drivers/md/raid1.c
>>>> +++ b/drivers/md/raid1.c
>>>> @@ -1746,8 +1746,12 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>>>>     *    - recovery is interrupted.
>>>>     *    - &mddev->degraded is bumped.
>>>>     *
>>>> - * @rdev is marked as &Faulty excluding case when array is failed and
>>>> - * &mddev->fail_last_dev is off.
>>>> + * If @rdev is marked with &FailfastIOFailure, it means that super_write
>>>> + * failed in failfast and will be retried, so the @mddev did not fail.
>>>> + *
>>>> + * @rdev is marked as &Faulty excluding any cases:
>>>> + *    - when @mddev is failed and &mddev->fail_last_dev is off
>>>> + *    - when @rdev is last device and &FailfastIOFailure flag is set
>>>>     */
>>>>    static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>>>    {
>>>> @@ -1758,6 +1762,10 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>>>        if (test_bit(In_sync, &rdev->flags) &&
>>>>            (conf->raid_disks - mddev->degraded) == 1) {
>>>> +        if (test_bit(FailfastIOFailure, &rdev->flags)) {
>>>> +            spin_unlock_irqrestore(&conf->device_lock, flags);
>>>> +            return;
>>>> +        }
>>>>            set_bit(MD_BROKEN, &mddev->flags);
>>>>            if (!mddev->fail_last_dev) {
>>>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>>>> index b60c30bfb6c7..ff105a0dcd05 100644
>>>> --- a/drivers/md/raid10.c
>>>> +++ b/drivers/md/raid10.c
>>>> @@ -1995,8 +1995,12 @@ static int enough(struct r10conf *conf, int ignore)
>>>>     *    - recovery is interrupted.
>>>>     *    - &mddev->degraded is bumped.
>>>>     *
>>>> - * @rdev is marked as &Faulty excluding case when array is failed and
>>>> - * &mddev->fail_last_dev is off.
>>>> + * If @rdev is marked with &FailfastIOFailure, it means that super_write
>>>> + * failed in failfast, so the @mddev did not fail.
>>>> + *
>>>> + * @rdev is marked as &Faulty excluding any cases:
>>>> + *    - when @mddev is failed and &mddev->fail_last_dev is off
>>>> + *    - when @rdev is last device and &FailfastIOFailure flag is set
>>>>     */
>>>>    static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
>>>>    {
>>>> @@ -2006,6 +2010,10 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
>>>>        spin_lock_irqsave(&conf->device_lock, flags);
>>>>        if (test_bit(In_sync, &rdev->flags) && !enough(conf, rdev->raid_disk)) {
>>>> +        if (test_bit(FailfastIOFailure, &rdev->flags)) {
>>>> +            spin_unlock_irqrestore(&conf->device_lock, flags);
>>>> +            return;
>>>> +        }
>>>>            set_bit(MD_BROKEN, &mddev->flags);
>>>>            if (!mddev->fail_last_dev) {
>>>>
>>> .
>>>
>>
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/3] md/raid1,raid10: don't broken array on failfast metadata write fails
  2025-08-18 15:45         ` Yu Kuai
@ 2025-08-20 17:09           ` Kenta Akagi
  0 siblings, 0 replies; 11+ messages in thread
From: Kenta Akagi @ 2025-08-20 17:09 UTC (permalink / raw)
  To: yukuai, Yu Kuai, Song Liu, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, yukuai (C), Kenta Akagi



On 2025/08/19 0:45, Yu Kuai wrote:
> Hi,
> 
> 在 2025/8/18 20:48, Kenta Akagi 写道:
>> On 2025/08/18 11:48, Yu Kuai wrote:
>>> Hi,
>>>
>>> 在 2025/08/18 10:05, Yu Kuai 写道:
>>>> Hi,
>>>>
>>>> 在 2025/08/18 1:27, Kenta Akagi 写道:
>>>>> A super_write IO failure with MD_FAILFAST must not cause the array
>>>>> to fail.
>>>>>
>>>>> Because a failfast bio may fail even when the rdev is not broken,
>>>>> so IO must be retried rather than failing the array when a metadata
>>>>> write with MD_FAILFAST fails on the last rdev.
>>>> Why just last rdev? If failfast can fail when the rdev is not broken, I
>>>> feel we should retry for all the rdev.
>> Thank you for reviewing.
>>
>> The reason this retry applies only to the last rdev is that the purpose
>> of failfast is to quickly detach a faulty device and thereby minimize
>> mdev IO latency on rdev failure.
>> If md retries all rdevs, the Faulty handler will no longer act
>> quickly enough, which will always "cause long delays" [1].
>> I believe this is not the behavior users want.
>>
>> [1] https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git/tree/mdadm.8.in?h=main&id=34f21b7acea8afbea9348d0f421beeeedca7a136#n784
>>
>>> BTW, I couldn't figure out the reason, why failfast is added for the
>>> meta write. I do feel just remove this flag for metadata write will fix
>>> this problem.
>> By issuing metadata writes with failfast in md, it becomes possible to
>> detect rdev failures quickly.
>> Most applications never issue IO with the REQ_FAILFAST flag set,
>> so if md issues its metadata writes without failfast,
>> rdev failures would not be detected quickly.
>> This would undermine the point of the md's failfast feature.
>> And this would also "cause long delays" [1].
>> I believe this is also not what users want.
> 
> Yes, this make sense. But I was thinking failfast will work on normal IO,
> not metadata IO like updating superblock, which doesn't happen quite often
> for user. But consider we have this behavior for such a long time, I agree
> we'd better not change it.

Thank you for reviewing.

Sorry, I forgot to mention that in my environment the bitmap is enabled, so 
super_write is called frequently.

Also, what I said earlier was incorrect. md actively attaches MD_FAILFAST 
not only to metadata writes but also to other requests.
Therefore, the current patch is insufficient to prevent the last
rdev from being failed by failfast.

I will submit a new patch with the fix.

Thanks,
Akagi

> 
>>> Thanks,
>>> Kuai
>>>
>>>>> A metadata write with MD_FAILFAST is retried after failure as
>>>>> follows:
>>>>>
>>>>> 1. In super_written, MD_SB_NEED_REWRITE is set in sb_flags.
>>>>>
>>>>> 2. In md_super_wait, which is called by the function that
>>>>> executed md_super_write and waits for completion,
>>>>> -EAGAIN is returned because MD_SB_NEED_REWRITE is set.
>>>>>
>>>>> 3. The caller of md_super_wait (such as md_update_sb)
>>>>> receives a negative return value and then retries md_super_write.
>>>>>
>>>>> 4. The md_super_write function, which is called to perform
>>>>> the same metadata write, issues a write bio without MD_FAILFAST
>>>>> this time.
>>>>>
>>>>> When a write from super_written without MD_FAILFAST fails,
>>>>> the array may broken, and MD_BROKEN should be set.
>>>>>
>>>>> After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"),
>>>>> calling md_error on the last rdev in RAID1/10 always sets
>>>>> the MD_BROKEN flag on the array.
>>>>> As a result, when failfast IO fails on the last rdev, the array
>>>>> immediately becomes failed.
>>>>>
>>>>> This commit prevents MD_BROKEN from being set when a super_write with
>>>>> MD_FAILFAST fails on the last rdev, ensuring that the array does
>>>>> not become failed due to failfast IO failures.
>>>>>
>>>>> Failfast IO failures on any rdev except the last one are not retried
>>>>> and are marked as Faulty immediately. This minimizes array IO latency
>>>>> when an rdev fails.
>>>>>
>>>>> Fixes: 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10")
>>>>> Signed-off-by: Kenta Akagi <k@mgml.me>
>>>>> ---
>>>>>    drivers/md/md.c     |  9 ++++++---
>>>>>    drivers/md/md.h     |  7 ++++---
>>>>>    drivers/md/raid1.c  | 12 ++++++++++--
>>>>>    drivers/md/raid10.c | 12 ++++++++++--
>>>>>    4 files changed, 30 insertions(+), 10 deletions(-)
>>>>>
>>>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>>>> index ac85ec73a409..61a8188849a3 100644
>>>>> --- a/drivers/md/md.c
>>>>> +++ b/drivers/md/md.c
>>>>> @@ -999,14 +999,17 @@ static void super_written(struct bio *bio)
>>>>>        if (bio->bi_status) {
>>>>>            pr_err("md: %s gets error=%d\n", __func__,
>>>>>                   blk_status_to_errno(bio->bi_status));
>>>>> +        if (bio->bi_opf & MD_FAILFAST)
>>>>> +            set_bit(FailfastIOFailure, &rdev->flags);
>>>> I think it's better to retry the bio with the flag cleared, then all
>>>> underlying procedures can stay the same.
>> That might be a better approach. I'll check the call hierarchy and lock dependencies.
> 
> You might need to add a new async work to resubmit this bio.
> 
> Thanks,
> Kuai
> 
>> Thanks,
>> Akagi
>>
>>
>>>> Thanks,
>>>> Kuai
>>>>
>>>>>            md_error(mddev, rdev);
>>>>>            if (!test_bit(Faulty, &rdev->flags)
>>>>>                && (bio->bi_opf & MD_FAILFAST)) {
>>>>> +            pr_warn("md: %s: Metadata write will be repeated to %pg\n",
>>>>> +                mdname(mddev), rdev->bdev);
>>>>>                set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
>>>>> -            set_bit(LastDev, &rdev->flags);
>>>>>            }
>>>>>        } else
>>>>> -        clear_bit(LastDev, &rdev->flags);
>>>>> +        clear_bit(FailfastIOFailure, &rdev->flags);
>>>>>        bio_put(bio);
>>>>> @@ -1048,7 +1051,7 @@ void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
>>>>>        if (test_bit(MD_FAILFAST_SUPPORTED, &mddev->flags) &&
>>>>>            test_bit(FailFast, &rdev->flags) &&
>>>>> -        !test_bit(LastDev, &rdev->flags))
>>>>> +        !test_bit(FailfastIOFailure, &rdev->flags))
>>>>>            bio->bi_opf |= MD_FAILFAST;
>>>>>        atomic_inc(&mddev->pending_writes);
>>>>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>>>>> index 51af29a03079..cf989aca72ad 100644
>>>>> --- a/drivers/md/md.h
>>>>> +++ b/drivers/md/md.h
>>>>> @@ -281,9 +281,10 @@ enum flag_bits {
>>>>>                     * It is expects that no bad block log
>>>>>                     * is present.
>>>>>                     */
>>>>> -    LastDev,        /* Seems to be the last working dev as
>>>>> -                 * it didn't fail, so don't use FailFast
>>>>> -                 * any more for metadata
>>>>> +    FailfastIOFailure,    /* A device that failled a metadata write
>>>>> +                 * with failfast.
>>>>> +                 * error_handler must not fail the array
>>>>> +                 * if last device has this flag.
>>>>>                     */
>>>>>        CollisionCheck,        /*
>>>>>                     * check if there is collision between raid1
>>>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>>>> index 408c26398321..fc7195e58f80 100644
>>>>> --- a/drivers/md/raid1.c
>>>>> +++ b/drivers/md/raid1.c
>>>>> @@ -1746,8 +1746,12 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>>>>>     *    - recovery is interrupted.
>>>>>     *    - &mddev->degraded is bumped.
>>>>>     *
>>>>> - * @rdev is marked as &Faulty excluding case when array is failed and
>>>>> - * &mddev->fail_last_dev is off.
>>>>> + * If @rdev is marked with &FailfastIOFailure, it means that super_write
>>>>> + * failed in failfast and will be retried, so the @mddev did not fail.
>>>>> + *
>>>>> + * @rdev is marked as &Faulty excluding any cases:
>>>>> + *    - when @mddev is failed and &mddev->fail_last_dev is off
>>>>> + *    - when @rdev is last device and &FailfastIOFailure flag is set
>>>>>     */
>>>>>    static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>>>>    {
>>>>> @@ -1758,6 +1762,10 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>>>>        if (test_bit(In_sync, &rdev->flags) &&
>>>>>            (conf->raid_disks - mddev->degraded) == 1) {
>>>>> +        if (test_bit(FailfastIOFailure, &rdev->flags)) {
>>>>> +            spin_unlock_irqrestore(&conf->device_lock, flags);
>>>>> +            return;
>>>>> +        }
>>>>>            set_bit(MD_BROKEN, &mddev->flags);
>>>>>            if (!mddev->fail_last_dev) {
>>>>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>>>>> index b60c30bfb6c7..ff105a0dcd05 100644
>>>>> --- a/drivers/md/raid10.c
>>>>> +++ b/drivers/md/raid10.c
>>>>> @@ -1995,8 +1995,12 @@ static int enough(struct r10conf *conf, int ignore)
>>>>>     *    - recovery is interrupted.
>>>>>     *    - &mddev->degraded is bumped.
>>>>>     *
>>>>> - * @rdev is marked as &Faulty excluding case when array is failed and
>>>>> - * &mddev->fail_last_dev is off.
>>>>> + * If @rdev is marked with &FailfastIOFailure, it means that super_write
>>>>> + * failed in failfast, so the @mddev did not fail.
>>>>> + *
>>>>> + * @rdev is marked as &Faulty excluding any cases:
>>>>> + *    - when @mddev is failed and &mddev->fail_last_dev is off
>>>>> + *    - when @rdev is last device and &FailfastIOFailure flag is set
>>>>>     */
>>>>>    static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
>>>>>    {
>>>>> @@ -2006,6 +2010,10 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
>>>>>        spin_lock_irqsave(&conf->device_lock, flags);
>>>>>        if (test_bit(In_sync, &rdev->flags) && !enough(conf, rdev->raid_disk)) {
>>>>> +        if (test_bit(FailfastIOFailure, &rdev->flags)) {
>>>>> +            spin_unlock_irqrestore(&conf->device_lock, flags);
>>>>> +            return;
>>>>> +        }
>>>>>            set_bit(MD_BROKEN, &mddev->flags);
>>>>>            if (!mddev->fail_last_dev) {
>>>>>
>>>> .
>>>>
>>>
>>
> 


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/3] md/raid1,raid10: don't broken array on failfast metadata write fails
  2025-08-17 17:27 ` [PATCH v2 1/3] " Kenta Akagi
  2025-08-18  2:05   ` Yu Kuai
@ 2025-08-23  1:54   ` Li Nan
  2025-08-27 17:31     ` Kenta Akagi
  1 sibling, 1 reply; 11+ messages in thread
From: Li Nan @ 2025-08-23  1:54 UTC (permalink / raw)
  To: Kenta Akagi, Song Liu, Yu Kuai, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel



在 2025/8/18 1:27, Kenta Akagi 写道:
> A super_write IO failure with MD_FAILFAST must not cause the array
> to fail.
> 
> Because a failfast bio may fail even when the rdev is not broken,
> so IO must be retried rather than failing the array when a metadata
> write with MD_FAILFAST fails on the last rdev.
> 
> A metadata write with MD_FAILFAST is retried after failure as
> follows:
> 
> 1. In super_written, MD_SB_NEED_REWRITE is set in sb_flags.
> 
> 2. In md_super_wait, which is called by the function that
> executed md_super_write and waits for completion,
> -EAGAIN is returned because MD_SB_NEED_REWRITE is set.
> 
> 3. The caller of md_super_wait (such as md_update_sb)
> receives a negative return value and then retries md_super_write.
> 
> 4. The md_super_write function, which is called to perform
> the same metadata write, issues a write bio without MD_FAILFAST
> this time.
> 
> When a write from super_written without MD_FAILFAST fails,
> the array may broken, and MD_BROKEN should be set.
> 
> After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"),
> calling md_error on the last rdev in RAID1/10 always sets
> the MD_BROKEN flag on the array.
> As a result, when failfast IO fails on the last rdev, the array
> immediately becomes failed.
> 
> This commit prevents MD_BROKEN from being set when a super_write with
> MD_FAILFAST fails on the last rdev, ensuring that the array does
> not become failed due to failfast IO failures.
> 
> Failfast IO failures on any rdev except the last one are not retried
> and are marked as Faulty immediately. This minimizes array IO latency
> when an rdev fails.
> 
> Fixes: 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10")
> Signed-off-by: Kenta Akagi <k@mgml.me>


[...]

> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1746,8 +1746,12 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>    *	- recovery is interrupted.
>    *	- &mddev->degraded is bumped.
>    *
> - * @rdev is marked as &Faulty excluding case when array is failed and
> - * &mddev->fail_last_dev is off.
> + * If @rdev is marked with &FailfastIOFailure, it means that super_write
> + * failed in failfast and will be retried, so the @mddev did not fail.
> + *
> + * @rdev is marked as &Faulty excluding any cases:
> + *	- when @mddev is failed and &mddev->fail_last_dev is off
> + *	- when @rdev is last device and &FailfastIOFailure flag is set
>    */
>   static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>   {
> @@ -1758,6 +1762,10 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>   
>   	if (test_bit(In_sync, &rdev->flags) &&
>   	    (conf->raid_disks - mddev->degraded) == 1) {
> +		if (test_bit(FailfastIOFailure, &rdev->flags)) {
> +			spin_unlock_irqrestore(&conf->device_lock, flags);
> +			return;
> +		}
>   		set_bit(MD_BROKEN, &mddev->flags);
>   
>   		if (!mddev->fail_last_dev) {

At this point, users who try to fail this rdev will get a successful return
even though the Faulty flag is not set. Should we consider this case?

-- 
Thanks,
Nan


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/3] md/raid1,raid10: don't broken array on failfast metadata write fails
  2025-08-23  1:54   ` Li Nan
@ 2025-08-27 17:31     ` Kenta Akagi
  0 siblings, 0 replies; 11+ messages in thread
From: Kenta Akagi @ 2025-08-27 17:31 UTC (permalink / raw)
  To: Li Nan, Song Liu, Yu Kuai, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, Kenta Akagi



On 2025/08/23 10:54, Li Nan wrote:
> 
> 
> 在 2025/8/18 1:27, Kenta Akagi 写道:
>> A super_write IO failure with MD_FAILFAST must not cause the array
>> to fail.
>>
>> Because a failfast bio may fail even when the rdev is not broken,
>> so IO must be retried rather than failing the array when a metadata
>> write with MD_FAILFAST fails on the last rdev.
>>
>> A metadata write with MD_FAILFAST is retried after failure as
>> follows:
>>
>> 1. In super_written, MD_SB_NEED_REWRITE is set in sb_flags.
>>
>> 2. md_super_wait, called by the function that issued
>> md_super_write to wait for completion, returns -EAGAIN
>> because MD_SB_NEED_REWRITE is set.
>>
>> 3. The caller of md_super_wait (such as md_update_sb)
>> receives the negative return value and retries md_super_write.
>>
>> 4. md_super_write, invoked again to perform the same metadata
>> write, this time issues the write bio without MD_FAILFAST.
>>
>> When a write from super_written without MD_FAILFAST fails,
>> the array may genuinely be broken, and MD_BROKEN should be set.
>>
>> After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"),
>> calling md_error on the last rdev in RAID1/10 always sets
>> the MD_BROKEN flag on the array.
>> As a result, when failfast IO fails on the last rdev, the array
>> immediately becomes failed.
>>
>> This commit prevents MD_BROKEN from being set when a super_write with
>> MD_FAILFAST fails on the last rdev, ensuring that the array does
>> not become failed due to failfast IO failures.
>>
>> Failfast IO failures on any rdev except the last one are not retried
>> and are marked as Faulty immediately. This minimizes array IO latency
>> when an rdev fails.
>>
>> Fixes: 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10")
>> Signed-off-by: Kenta Akagi <k@mgml.me>
> 
> 
> [...]
> 
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -1746,8 +1746,12 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>>    *    - recovery is interrupted.
>>    *    - &mddev->degraded is bumped.
>>    *
>> - * @rdev is marked as &Faulty excluding case when array is failed and
>> - * &mddev->fail_last_dev is off.
>> + * If @rdev is marked with &FailfastIOFailure, a super_write with
>> + * failfast failed and will be retried, so the @mddev does not fail.
>> + *
>> + * @rdev is marked as &Faulty excluding the following cases:
>> + *    - when @mddev is failed and &mddev->fail_last_dev is off
>> + *    - when @rdev is the last device and &FailfastIOFailure is set
>>    */
>>   static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>   {
>> @@ -1758,6 +1762,10 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>         if (test_bit(In_sync, &rdev->flags) &&
>>           (conf->raid_disks - mddev->degraded) == 1) {
>> +        if (test_bit(FailfastIOFailure, &rdev->flags)) {
>> +            spin_unlock_irqrestore(&conf->device_lock, flags);
>> +            return;
>> +        }
>>           set_bit(MD_BROKEN, &mddev->flags);
>>             if (!mddev->fail_last_dev) {
> 
> At this point, a user who tries to fail this rdev will get a successful
> return without the Faulty flag being set. Should we consider this?
Hi,
Sorry for the late reply.

I will submit a v3 patch that sets the flag just before md_error and
modifies raid{1,10}_error to use test_and_clear_bit.
Therefore, echoing "faulty" to dev-*/state should always correctly
mark the device as faulty.

Thanks,
Akagi

> 


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2025-08-27 17:37 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-08-17 17:27 [PATCH v2 0/3] md/raid1,raid10: don't broken array on failfast metadata write fails Kenta Akagi
2025-08-17 17:27 ` [PATCH v2 1/3] " Kenta Akagi
2025-08-18  2:05   ` Yu Kuai
2025-08-18  2:48     ` Yu Kuai
2025-08-18 12:48       ` Kenta Akagi
2025-08-18 15:45         ` Yu Kuai
2025-08-20 17:09           ` Kenta Akagi
2025-08-23  1:54   ` Li Nan
2025-08-27 17:31     ` Kenta Akagi
2025-08-17 17:27 ` [PATCH v2 2/3] md/raid1,raid10: Add error message when setting MD_BROKEN Kenta Akagi
2025-08-17 17:27 ` [PATCH v2 3/3] md/raid1,raid10: Fix: Operation continuing on 0 devices Kenta Akagi
