* [PATCH v6 0/2] Don't set MD_BROKEN on failfast bio failure
@ 2026-01-05 14:40 Kenta Akagi
2026-01-05 14:40 ` [PATCH v6 1/2] md: Don't set MD_BROKEN for RAID1 and RAID10 when using FailFast Kenta Akagi
2026-01-05 14:40 ` [PATCH v6 2/2] md/raid10: fix failfast read error not rescheduled Kenta Akagi
0 siblings, 2 replies; 13+ messages in thread
From: Kenta Akagi @ 2026-01-05 14:40 UTC (permalink / raw)
To: Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, Xiao Ni
Cc: linux-raid, linux-kernel, Kenta Akagi
Changes from V5:
- Prevent the md from being marked broken when the rdev has FailFast,
regardless of the bio's flags.
Thanks to Xiao for the advice:
https://lore.kernel.org/linux-raid/CALTww2_nJqyA99cG9YNarXEB4wimFK=pKy=qrxdkfB60PaUa1w@mail.gmail.com/#t
- Drop the preparation refactor, flag rename, and error logging
improvement commits.
Changes from V4:
- Use device_lock to serialize md_error() instead of adding a new
spinlock_t.
- Rename new function md_bio_failure_error() to md_cond_error().
- Add helper function pers->should_error() to determine whether to fail
rdev in failfast bio failure, instead of using the LastDev flag.
- Avoid changing the behavior of the LastDev flag.
- Drop fix for R{1,10}BIO_Uptodate not being set despite successful
retry; this will be sent separately after Nan's refactor.
- Drop fix for the message 'Operation continuing on 0 devices'; as it is
outside the scope of this patch, it will be sent separately.
- Improve logging when metadata writing fails.
- Rename LastDev to RetryingSBWrite.
Changes from V3:
- The error handling in md_error() is now serialized, and a new helper
function, md_bio_failure_error, has been introduced.
- MD_FAILFAST bio failures are now processed by md_bio_failure_error
instead of signaling via FailfastIOFailure.
- RAID10: Fix missing reschedule of failfast read bio failure
- Regardless of failfast, in narrow_write_error, writes that succeed
in retry are returned to the higher layer as success
Changes from V2:
- Fix to prevent the array from being marked broken for all
Failfast IOs, not just metadata.
- Reflecting the review, update raid{1,10}_error to clear
FailfastIOFailure so that devices are properly marked Faulty.
Changes from V1:
- Avoid setting MD_BROKEN instead of clearing it
- Add pr_crit() when setting MD_BROKEN
- Fix the message that may be shown after all rdevs have failed:
"Operation continuing on 0 devices"
v5: https://lore.kernel.org/linux-raid/20251027150433.18193-1-k@mgml.me/
v4: https://lore.kernel.org/linux-raid/20250915034210.8533-1-k@mgml.me/
v3: https://lore.kernel.org/linux-raid/20250828163216.4225-1-k@mgml.me/
v2: https://lore.kernel.org/linux-raid/20250817172710.4892-1-k@mgml.me/
v1: https://lore.kernel.org/linux-raid/20250812090119.153697-1-k@mgml.me/
When multiple MD_FAILFAST bios fail simultaneously on Failfast-enabled
rdevs in RAID1/RAID10, the following issues can occur:
* MD_BROKEN is set and the array halts, even though this should not occur
under the intended Failfast design.
* Writes retried through narrow_write_error succeed, but the I/O is still
reported as BLK_STS_IOERR
* NOTE: the fix for this was dropped in v5 and will be sent separately:
https://lore.kernel.org/linux-raid/6f0f9730-4bbe-7f3c-1b50-690bb77d5d90@huaweicloud.com/
* RAID10 only: If a Failfast read I/O fails, it is not retried on any
remaining rdev, and as a result, the upper layer receives an I/O error.
Simultaneous bio failures across multiple rdevs are uncommon; however,
rdevs serviced via nvme-tcp can still experience them due to something as
simple as an Ethernet fault. The issue can be reproduced using the
following steps.
# prepare nvmet/nvme-tcp and md array
sh-5.2# cat << 'EOF' > loopback-nvme.sh
set -eu
nqn="nqn.2025-08.io.example:nvmet-test-$1"
back=$2
cd /sys/kernel/config/nvmet/
mkdir subsystems/$nqn
echo 1 > subsystems/${nqn}/attr_allow_any_host
mkdir subsystems/${nqn}/namespaces/1
echo -n ${back} > subsystems/${nqn}/namespaces/1/device_path
echo 1 > subsystems/${nqn}/namespaces/1/enable
ports="ports/1"
if [ ! -d $ports ]; then
mkdir $ports
cd $ports
echo 127.0.0.1 > addr_traddr
echo tcp > addr_trtype
echo 4420 > addr_trsvcid
echo ipv4 > addr_adrfam
cd ../../
fi
ln -s /sys/kernel/config/nvmet/subsystems/${nqn} ${ports}/subsystems/
nvme connect -t tcp -n $nqn -a 127.0.0.1 -s 4420
EOF
sh-5.2# chmod +x loopback-nvme.sh
sh-5.2# modprobe -a nvme-tcp nvmet-tcp
sh-5.2# truncate -s 1g a.img b.img
sh-5.2# losetup --show -f a.img
/dev/loop0
sh-5.2# losetup --show -f b.img
/dev/loop1
sh-5.2# ./loopback-nvme.sh 0 /dev/loop0
connecting to device: nvme0
sh-5.2# ./loopback-nvme.sh 1 /dev/loop1
connecting to device: nvme1
sh-5.2# mdadm --create --verbose /dev/md0 --level=1 --raid-devices=2 \
--failfast /dev/nvme0n1 --failfast /dev/nvme1n1
...
mdadm: array /dev/md0 started.
# run fio
sh-5.2# fio --name=test --filename=/dev/md0 --rw=randrw --rwmixread=50 \
--bs=4k --numjobs=9 --time_based --runtime=300s --group_reporting --direct=1 &
# The issue can be reproduced by blocking nvme traffic during fio
sh-5.2# iptables -A INPUT -m tcp -p tcp --dport 4420 -j DROP;
sh-5.2# sleep 10; # twice the default KATO value
sh-5.2# iptables -D INPUT -m tcp -p tcp --dport 4420 -j DROP
Patch 1 prevents the array from being marked broken when FailFast is set
on the rdev.
Patch 2 adds the missing retry path for Failfast read errors in RAID10.
Kenta Akagi (2):
md: Don't set MD_BROKEN for RAID1 and RAID10 when using FailFast
md/raid10: fix failfast read error not rescheduled
drivers/md/md.c | 6 ++++--
drivers/md/raid1.c | 8 +++++++-
drivers/md/raid10.c | 15 ++++++++++++++-
3 files changed, 25 insertions(+), 4 deletions(-)
--
2.50.1
^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v6 1/2] md: Don't set MD_BROKEN for RAID1 and RAID10 when using FailFast
  2026-01-05 14:40 [PATCH v6 0/2] Don't set MD_BROKEN on failfast bio failure Kenta Akagi
@ 2026-01-05 14:40 ` Kenta Akagi
  2026-01-06  2:57   ` Li Nan
  2026-01-05 14:40 ` [PATCH v6 2/2] md/raid10: fix failfast read error not rescheduled Kenta Akagi
  1 sibling, 1 reply; 13+ messages in thread
From: Kenta Akagi @ 2026-01-05 14:40 UTC (permalink / raw)
  To: Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, Xiao Ni
  Cc: linux-raid, linux-kernel, Kenta Akagi

After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"),
if the error handler is called on the last rdev in RAID1 or RAID10,
the MD_BROKEN flag will be set on that mddev.
When MD_BROKEN is set, write bios to the md will result in an I/O error.

This causes a problem when using FailFast.
The current implementation of FailFast expects the array to continue
functioning without issues even after calling md_error for the last
rdev. Furthermore, due to the nature of its functionality, FailFast may
call md_error on all rdevs of the md. Even if retrying I/O on an rdev
would succeed, it first calls md_error before retrying.

To fix this issue, this commit ensures that for RAID1 and RAID10, if the
last In_sync rdev has the FailFast flag set and the mddev's fail_last_dev
is off, the MD_BROKEN flag will not be set on that mddev.

This change impacts userspace. After this commit, if the rdev has the
FailFast flag, the mddev is never marked broken even if the failing bio
is not FailFast. However, it is unlikely that any setup using FailFast
expects the array to halt when md_error is called on the last rdev.

Since FailFast is only implemented for RAID1 and RAID10, no changes are
needed for other personalities.
Fixes: 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10")
Suggested-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Kenta Akagi <k@mgml.me>
---
 drivers/md/md.c     | 6 ++++--
 drivers/md/raid1.c  | 8 +++++++-
 drivers/md/raid10.c | 8 +++++++-
 3 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 6062e0deb616..f1745f8921fc 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -3050,7 +3050,8 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
 	if (cmd_match(buf, "faulty") && rdev->mddev->pers) {
 		md_error(rdev->mddev, rdev);
 
-		if (test_bit(MD_BROKEN, &rdev->mddev->flags))
+		if (test_bit(MD_BROKEN, &rdev->mddev->flags) ||
+		    !test_bit(Faulty, &rdev->flags))
 			err = -EBUSY;
 		else
 			err = 0;
@@ -7915,7 +7916,8 @@ static int set_disk_faulty(struct mddev *mddev, dev_t dev)
 		err = -ENODEV;
 	else {
 		md_error(mddev, rdev);
-		if (test_bit(MD_BROKEN, &mddev->flags))
+		if (test_bit(MD_BROKEN, &mddev->flags) ||
+		    !test_bit(Faulty, &rdev->flags))
 			err = -EBUSY;
 	}
 	rcu_read_unlock();
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 592a40233004..459b34cd358b 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1745,6 +1745,10 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
  *	- recovery is interrupted.
  *	- &mddev->degraded is bumped.
  *
+ * If the following conditions are met, @mddev never fails:
+ *	- The last In_sync @rdev has the &FailFast flag set.
+ *	- &mddev->fail_last_dev is off.
+ *
  * @rdev is marked as &Faulty excluding case when array is failed and
  * &mddev->fail_last_dev is off.
  */
@@ -1757,7 +1761,9 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
 
 	if (test_bit(In_sync, &rdev->flags) &&
 	    (conf->raid_disks - mddev->degraded) == 1) {
-		set_bit(MD_BROKEN, &mddev->flags);
+		if (!test_bit(FailFast, &rdev->flags) ||
+		    mddev->fail_last_dev)
+			set_bit(MD_BROKEN, &mddev->flags);
 
 		if (!mddev->fail_last_dev) {
 			conf->recovery_disabled = mddev->recovery_disabled;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 14dcd5142eb4..b33149aa5b29 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1989,6 +1989,10 @@ static int enough(struct r10conf *conf, int ignore)
  *	- recovery is interrupted.
  *	- &mddev->degraded is bumped.
  *
+ * If the following conditions are met, @mddev never fails:
+ *	- The last In_sync @rdev has the &FailFast flag set.
+ *	- &mddev->fail_last_dev is off.
+ *
  * @rdev is marked as &Faulty excluding case when array is failed and
  * &mddev->fail_last_dev is off.
  */
@@ -2000,7 +2004,9 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
 
 	spin_lock_irqsave(&conf->device_lock, flags);
 	if (test_bit(In_sync, &rdev->flags) && !enough(conf, rdev->raid_disk)) {
-		set_bit(MD_BROKEN, &mddev->flags);
+		if (!test_bit(FailFast, &rdev->flags) ||
+		    mddev->fail_last_dev)
+			set_bit(MD_BROKEN, &mddev->flags);
 
 		if (!mddev->fail_last_dev) {
 			spin_unlock_irqrestore(&conf->device_lock, flags);
-- 
2.50.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread
* Re: [PATCH v6 1/2] md: Don't set MD_BROKEN for RAID1 and RAID10 when using FailFast
  2026-01-05 14:40 ` [PATCH v6 1/2] md: Don't set MD_BROKEN for RAID1 and RAID10 when using FailFast Kenta Akagi
@ 2026-01-06  2:57   ` Li Nan
  2026-01-06  7:59     ` Xiao Ni
  2026-01-06 12:30     ` Kenta Akagi
  0 siblings, 2 replies; 13+ messages in thread
From: Li Nan @ 2026-01-06  2:57 UTC (permalink / raw)
  To: Kenta Akagi, Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, Xiao Ni
  Cc: linux-raid, linux-kernel

On 2026/1/5 22:40, Kenta Akagi wrote:
> After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"),
> if the error handler is called on the last rdev in RAID1 or RAID10,
> the MD_BROKEN flag will be set on that mddev.
> When MD_BROKEN is set, write bios to the md will result in an I/O error.
> 
> This causes a problem when using FailFast.
> The current implementation of FailFast expects the array to continue
> functioning without issues even after calling md_error for the last
> rdev. Furthermore, due to the nature of its functionality, FailFast may
> call md_error on all rdevs of the md. Even if retrying I/O on an rdev
> would succeed, it first calls md_error before retrying.
> 
> To fix this issue, this commit ensures that for RAID1 and RAID10, if the
> last In_sync rdev has the FailFast flag set and the mddev's fail_last_dev
> is off, the MD_BROKEN flag will not be set on that mddev.
> 
> This change impacts userspace. After this commit, If the rdev has the
> FailFast flag, the mddev never broken even if the failing bio is not
> FailFast. However, it's unlikely that any setup using FailFast expects
> the array to halt when md_error is called on the last rdev.
> 

In the current RAID design, when an IO error occurs, RAID ensures faulty
data is not read via the following actions:
1. Mark the badblocks (no FailFast flag); if this fails,
2. Mark the disk as Faulty.

If neither action is taken, and BROKEN is not set to prevent continued RAID
use, errors on the last remaining disk will be ignored. Subsequent reads
may return incorrect data. This seems like a more serious issue in my opinion.

In scenarios with a large number of transient IO errors, is FailFast not a
suitable configuration? As you mentioned: "retrying I/O on an rdev would
succeed".

-- 
Thanks,
Nan

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [PATCH v6 1/2] md: Don't set MD_BROKEN for RAID1 and RAID10 when using FailFast
  2026-01-06  2:57 ` Li Nan
@ 2026-01-06  7:59   ` Xiao Ni
  2026-01-06  9:11     ` Li Nan
  0 siblings, 1 reply; 13+ messages in thread
From: Xiao Ni @ 2026-01-06  7:59 UTC (permalink / raw)
  To: Li Nan
  Cc: Kenta Akagi, Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk,
	linux-raid, linux-kernel

On Tue, Jan 6, 2026 at 10:57 AM Li Nan <linan666@huaweicloud.com> wrote:
>
>
>
> On 2026/1/5 22:40, Kenta Akagi wrote:
> > After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"),
> > if the error handler is called on the last rdev in RAID1 or RAID10,
> > the MD_BROKEN flag will be set on that mddev.
> > When MD_BROKEN is set, write bios to the md will result in an I/O error.
> >
> > This causes a problem when using FailFast.
> > The current implementation of FailFast expects the array to continue
> > functioning without issues even after calling md_error for the last
> > rdev. Furthermore, due to the nature of its functionality, FailFast may
> > call md_error on all rdevs of the md. Even if retrying I/O on an rdev
> > would succeed, it first calls md_error before retrying.
> >
> > To fix this issue, this commit ensures that for RAID1 and RAID10, if the
> > last In_sync rdev has the FailFast flag set and the mddev's fail_last_dev
> > is off, the MD_BROKEN flag will not be set on that mddev.
> >
> > This change impacts userspace. After this commit, If the rdev has the
> > FailFast flag, the mddev never broken even if the failing bio is not
> > FailFast. However, it's unlikely that any setup using FailFast expects
> > the array to halt when md_error is called on the last rdev.
> >
>
> In the current RAID design, when an IO error occurs, RAID ensures faulty
> data is not read via the following actions:
> 1. Mark the badblocks (no FailFast flag); if this fails,
> 2. Mark the disk as Faulty.
>
> If neither action is taken, and BROKEN is not set to prevent continued RAID
> use, errors on the last remaining disk will be ignored. Subsequent reads
> may return incorrect data. This seems like a more serious issue in my opinion.
>
> In scenarios with a large number of transient IO errors, is FailFast not a
> suitable configuration? As you mentioned: "retrying I/O on an rdev would
> succeed".

Hi Nan

According to my understanding, the policy here is to try to keep raid
work if io error happens on the last device. It doesn't set faulty on
the last in_sync device. It only sets MD_BROKEN to forbid write
requests. But it still can read data from the last device.

static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
{

	if (test_bit(In_sync, &rdev->flags) &&
	    (conf->raid_disks - mddev->degraded) == 1) {
		set_bit(MD_BROKEN, &mddev->flags);

		if (!mddev->fail_last_dev) {
			return; // return directly here
		}

static void md_submit_bio(struct bio *bio)
{
	if (unlikely(test_bit(MD_BROKEN, &mddev->flags)) && (rw == WRITE)) {
		bio_io_error(bio);
		return;
	}

Read requests can submit to the last working device. Right?

Best Regards
Xiao

> 
> -- 
> Thanks,
> Nan
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [PATCH v6 1/2] md: Don't set MD_BROKEN for RAID1 and RAID10 when using FailFast 2026-01-06 7:59 ` Xiao Ni @ 2026-01-06 9:11 ` Li Nan 2026-01-06 9:25 ` Xiao Ni 0 siblings, 1 reply; 13+ messages in thread From: Li Nan @ 2026-01-06 9:11 UTC (permalink / raw) To: Xiao Ni, Li Nan Cc: Kenta Akagi, Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, linux-raid, linux-kernel 在 2026/1/6 15:59, Xiao Ni 写道: > On Tue, Jan 6, 2026 at 10:57 AM Li Nan <linan666@huaweicloud.com> wrote: >> >> >> >> 在 2026/1/5 22:40, Kenta Akagi 写道: >>> After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"), >>> if the error handler is called on the last rdev in RAID1 or RAID10, >>> the MD_BROKEN flag will be set on that mddev. >>> When MD_BROKEN is set, write bios to the md will result in an I/O error. >>> >>> This causes a problem when using FailFast. >>> The current implementation of FailFast expects the array to continue >>> functioning without issues even after calling md_error for the last >>> rdev. Furthermore, due to the nature of its functionality, FailFast may >>> call md_error on all rdevs of the md. Even if retrying I/O on an rdev >>> would succeed, it first calls md_error before retrying. >>> >>> To fix this issue, this commit ensures that for RAID1 and RAID10, if the >>> last In_sync rdev has the FailFast flag set and the mddev's fail_last_dev >>> is off, the MD_BROKEN flag will not be set on that mddev. >>> >>> This change impacts userspace. After this commit, If the rdev has the >>> FailFast flag, the mddev never broken even if the failing bio is not >>> FailFast. However, it's unlikely that any setup using FailFast expects >>> the array to halt when md_error is called on the last rdev. >>> >> >> In the current RAID design, when an IO error occurs, RAID ensures faulty >> data is not read via the following actions: >> 1. Mark the badblocks (no FailFast flag); if this fails, >> 2. Mark the disk as Faulty. 
>> >> If neither action is taken, and BROKEN is not set to prevent continued RAID >> use, errors on the last remaining disk will be ignored. Subsequent reads >> may return incorrect data. This seems like a more serious issue in my opinion. >> >> In scenarios with a large number of transient IO errors, is FailFast not a >> suitable configuration? As you mentioned: "retrying I/O on an rdev would >> succeed". > > Hi Nan > > According to my understanding, the policy here is to try to keep raid > work if io error happens on the last device. It doesn't set faulty on > the last in_sync device. It only sets MD_BROKEN to forbid write > requests. But it still can read data from the last device. > > static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) > { > > if (test_bit(In_sync, &rdev->flags) && > (conf->raid_disks - mddev->degraded) == 1) { > set_bit(MD_BROKEN, &mddev->flags); > > if (!mddev->fail_last_dev) { > return; // return directly here > } > > > > static void md_submit_bio(struct bio *bio) > { > if (unlikely(test_bit(MD_BROKEN, &mddev->flags)) && (rw == WRITE)) { > bio_io_error(bio); > return; > } > > Read requests can submit to the last working device. Right? > > Best Regards > Xiao > Yeah, after MD_BROKEN is set, read are forbidden but writes remain allowed. IMO we preserve the RAID array in this state to enable users to retrieve stored data, not to continue using it. However, continued writes to the array will cause subsequent errors to fail to be logged, either due to failfast or the badblocks being full. Read errors have no impact as they do not damage the original data. -- Thanks, Nan ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v6 1/2] md: Don't set MD_BROKEN for RAID1 and RAID10 when using FailFast 2026-01-06 9:11 ` Li Nan @ 2026-01-06 9:25 ` Xiao Ni 2026-01-06 11:14 ` Li Nan 0 siblings, 1 reply; 13+ messages in thread From: Xiao Ni @ 2026-01-06 9:25 UTC (permalink / raw) To: Li Nan Cc: Kenta Akagi, Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, linux-raid, linux-kernel On Tue, Jan 6, 2026 at 5:11 PM Li Nan <linan666@huaweicloud.com> wrote: > > > > 在 2026/1/6 15:59, Xiao Ni 写道: > > On Tue, Jan 6, 2026 at 10:57 AM Li Nan <linan666@huaweicloud.com> wrote: > >> > >> > >> > >> 在 2026/1/5 22:40, Kenta Akagi 写道: > >>> After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"), > >>> if the error handler is called on the last rdev in RAID1 or RAID10, > >>> the MD_BROKEN flag will be set on that mddev. > >>> When MD_BROKEN is set, write bios to the md will result in an I/O error. > >>> > >>> This causes a problem when using FailFast. > >>> The current implementation of FailFast expects the array to continue > >>> functioning without issues even after calling md_error for the last > >>> rdev. Furthermore, due to the nature of its functionality, FailFast may > >>> call md_error on all rdevs of the md. Even if retrying I/O on an rdev > >>> would succeed, it first calls md_error before retrying. > >>> > >>> To fix this issue, this commit ensures that for RAID1 and RAID10, if the > >>> last In_sync rdev has the FailFast flag set and the mddev's fail_last_dev > >>> is off, the MD_BROKEN flag will not be set on that mddev. > >>> > >>> This change impacts userspace. After this commit, If the rdev has the > >>> FailFast flag, the mddev never broken even if the failing bio is not > >>> FailFast. However, it's unlikely that any setup using FailFast expects > >>> the array to halt when md_error is called on the last rdev. > >>> > >> > >> In the current RAID design, when an IO error occurs, RAID ensures faulty > >> data is not read via the following actions: > >> 1. 
Mark the badblocks (no FailFast flag); if this fails, > >> 2. Mark the disk as Faulty. > >> > >> If neither action is taken, and BROKEN is not set to prevent continued RAID > >> use, errors on the last remaining disk will be ignored. Subsequent reads > >> may return incorrect data. This seems like a more serious issue in my opinion. > >> > >> In scenarios with a large number of transient IO errors, is FailFast not a > >> suitable configuration? As you mentioned: "retrying I/O on an rdev would > >> succeed". > > > > Hi Nan > > > > According to my understanding, the policy here is to try to keep raid > > work if io error happens on the last device. It doesn't set faulty on > > the last in_sync device. It only sets MD_BROKEN to forbid write > > requests. But it still can read data from the last device. > > > > static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) > > { > > > > if (test_bit(In_sync, &rdev->flags) && > > (conf->raid_disks - mddev->degraded) == 1) { > > set_bit(MD_BROKEN, &mddev->flags); > > > > if (!mddev->fail_last_dev) { > > return; // return directly here > > } > > > > > > > > static void md_submit_bio(struct bio *bio) > > { > > if (unlikely(test_bit(MD_BROKEN, &mddev->flags)) && (rw == WRITE)) { > > bio_io_error(bio); > > return; > > } > > > > Read requests can submit to the last working device. Right? > > > > Best Regards > > Xiao > > > > Yeah, after MD_BROKEN is set, read are forbidden but writes remain allowed. Hmm, reverse way? Write requests are forbidden and read requests are allowed now. If MD_BROKEN is set, write requests return directly after bio_io_error. Regards Xiao > IMO we preserve the RAID array in this state to enable users to retrieve > stored data, not to continue using it. However, continued writes to the > array will cause subsequent errors to fail to be logged, either due to > failfast or the badblocks being full. Read errors have no impact as they do > not damage the original data. 
> > -- > Thanks, > Nan > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v6 1/2] md: Don't set MD_BROKEN for RAID1 and RAID10 when using FailFast 2026-01-06 9:25 ` Xiao Ni @ 2026-01-06 11:14 ` Li Nan 0 siblings, 0 replies; 13+ messages in thread From: Li Nan @ 2026-01-06 11:14 UTC (permalink / raw) To: Xiao Ni, Li Nan Cc: Kenta Akagi, Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, linux-raid, linux-kernel 在 2026/1/6 17:25, Xiao Ni 写道: > On Tue, Jan 6, 2026 at 5:11 PM Li Nan <linan666@huaweicloud.com> wrote: >> >> >> >> 在 2026/1/6 15:59, Xiao Ni 写道: >>> On Tue, Jan 6, 2026 at 10:57 AM Li Nan <linan666@huaweicloud.com> wrote: >>>> >>>> >>>> >>>> 在 2026/1/5 22:40, Kenta Akagi 写道: >>>>> After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"), >>>>> if the error handler is called on the last rdev in RAID1 or RAID10, >>>>> the MD_BROKEN flag will be set on that mddev. >>>>> When MD_BROKEN is set, write bios to the md will result in an I/O error. >>>>> >>>>> This causes a problem when using FailFast. >>>>> The current implementation of FailFast expects the array to continue >>>>> functioning without issues even after calling md_error for the last >>>>> rdev. Furthermore, due to the nature of its functionality, FailFast may >>>>> call md_error on all rdevs of the md. Even if retrying I/O on an rdev >>>>> would succeed, it first calls md_error before retrying. >>>>> >>>>> To fix this issue, this commit ensures that for RAID1 and RAID10, if the >>>>> last In_sync rdev has the FailFast flag set and the mddev's fail_last_dev >>>>> is off, the MD_BROKEN flag will not be set on that mddev. >>>>> >>>>> This change impacts userspace. After this commit, If the rdev has the >>>>> FailFast flag, the mddev never broken even if the failing bio is not >>>>> FailFast. However, it's unlikely that any setup using FailFast expects >>>>> the array to halt when md_error is called on the last rdev. 
>>>>> >>>> >>>> In the current RAID design, when an IO error occurs, RAID ensures faulty >>>> data is not read via the following actions: >>>> 1. Mark the badblocks (no FailFast flag); if this fails, >>>> 2. Mark the disk as Faulty. >>>> >>>> If neither action is taken, and BROKEN is not set to prevent continued RAID >>>> use, errors on the last remaining disk will be ignored. Subsequent reads >>>> may return incorrect data. This seems like a more serious issue in my opinion. >>>> >>>> In scenarios with a large number of transient IO errors, is FailFast not a >>>> suitable configuration? As you mentioned: "retrying I/O on an rdev would >>>> succeed". >>> >>> Hi Nan >>> >>> According to my understanding, the policy here is to try to keep raid >>> work if io error happens on the last device. It doesn't set faulty on >>> the last in_sync device. It only sets MD_BROKEN to forbid write >>> requests. But it still can read data from the last device. >>> >>> static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) >>> { >>> >>> if (test_bit(In_sync, &rdev->flags) && >>> (conf->raid_disks - mddev->degraded) == 1) { >>> set_bit(MD_BROKEN, &mddev->flags); >>> >>> if (!mddev->fail_last_dev) { >>> return; // return directly here >>> } >>> >>> >>> >>> static void md_submit_bio(struct bio *bio) >>> { >>> if (unlikely(test_bit(MD_BROKEN, &mddev->flags)) && (rw == WRITE)) { >>> bio_io_error(bio); >>> return; >>> } >>> >>> Read requests can submit to the last working device. Right? >>> >>> Best Regards >>> Xiao >>> >> >> Yeah, after MD_BROKEN is set, read are forbidden but writes remain allowed. > > Hmm, reverse way? Write requests are forbidden and read requests are > allowed now. If MD_BROKEN is set, write requests return directly after > bio_io_error. > > Regards > Xiao > Apologies for the typo... The rest of the content was written with this exact meaning in mind. 
>> IMO we preserve the RAID array in this state to enable users to retrieve >> stored data, not to continue using it. However, continued writes to the >> array will cause subsequent errors to fail to be logged, either due to >> failfast or the badblocks being full. Read errors have no impact as they do >> not damage the original data. >> -- Thanks, Nan ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v6 1/2] md: Don't set MD_BROKEN for RAID1 and RAID10 when using FailFast
  2026-01-06  2:57 ` Li Nan
  2026-01-06  7:59   ` Xiao Ni
@ 2026-01-06 12:30 ` Kenta Akagi
  2026-01-07  2:09   ` Li Nan
  2026-01-07  3:35   ` Xiao Ni
  1 sibling, 2 replies; 13+ messages in thread
From: Kenta Akagi @ 2026-01-06 12:30 UTC (permalink / raw)
  To: linan666, xni
  Cc: linux-raid, linux-kernel, song, yukuai, shli, mtkaczyk

Hi,
Thank you for reviewing.

On 2026/01/06 11:57, Li Nan wrote:
> 
> 
> On 2026/1/5 22:40, Kenta Akagi wrote:
>> After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"),
>> if the error handler is called on the last rdev in RAID1 or RAID10,
>> the MD_BROKEN flag will be set on that mddev.
>> When MD_BROKEN is set, write bios to the md will result in an I/O error.
>>
>> This causes a problem when using FailFast.
>> The current implementation of FailFast expects the array to continue
>> functioning without issues even after calling md_error for the last
>> rdev. Furthermore, due to the nature of its functionality, FailFast may
>> call md_error on all rdevs of the md. Even if retrying I/O on an rdev
>> would succeed, it first calls md_error before retrying.
>>
>> To fix this issue, this commit ensures that for RAID1 and RAID10, if the
>> last In_sync rdev has the FailFast flag set and the mddev's fail_last_dev
>> is off, the MD_BROKEN flag will not be set on that mddev.
>>
>> This change impacts userspace. After this commit, If the rdev has the
>> FailFast flag, the mddev never broken even if the failing bio is not
>> FailFast. However, it's unlikely that any setup using FailFast expects
>> the array to halt when md_error is called on the last rdev.
>>
> 
> In the current RAID design, when an IO error occurs, RAID ensures faulty
> data is not read via the following actions:
> 1. Mark the badblocks (no FailFast flag); if this fails,
> 2. Mark the disk as Faulty.
> 
> If neither action is taken, and BROKEN is not set to prevent continued RAID
> use, errors on the last remaining disk will be ignored. Subsequent reads
> may return incorrect data. This seems like a more serious issue in my opinion.

I agree that data inconsistency can certainly occur in this scenario.

However, a RAID1 with only one remaining rdev can be considered the same
as a plain disk. From that perspective, I do not believe it is the
mandatory responsibility of md raid to block subsequent writes or to
prevent data inconsistency in this situation.

The commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10") that
introduced BROKEN for RAID1/10 also does not seem to have done so for
that responsibility.

> 
> In scenarios with a large number of transient IO errors, is FailFast not a
> suitable configuration? As you mentioned: "retrying I/O on an rdev would

You seem to be right about that. Using FailFast with an unstable
underlying layer is not good. However, since md raid is the issuer of
FailFast bios, I believe it is incorrect to shut down the array due to
the failure of a FailFast bio.

Thanks,
Akagi

> succeed".
> 
> -- 
> Thanks,
> Nan
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [PATCH v6 1/2] md: Don't set MD_BROKEN for RAID1 and RAID10 when using FailFast
  2026-01-06 12:30       ` Kenta Akagi
@ 2026-01-07  2:09         ` Li Nan
  2026-01-07  3:35         ` Xiao Ni
  1 sibling, 0 replies; 13+ messages in thread
From: Li Nan @ 2026-01-07  2:09 UTC
To: Kenta Akagi, linan666, xni, Yu Kuai
Cc: linux-raid, linux-kernel, song, shli, mtkaczyk

On 2026/1/6 20:30, Kenta Akagi wrote:
> Hi,
> Thank you for reviewing.
>
> On 2026/01/06 11:57, Li Nan wrote:
>>
>> On 2026/1/5 22:40, Kenta Akagi wrote:
>>> After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"),
>>> if the error handler is called on the last rdev in RAID1 or RAID10,
>>> the MD_BROKEN flag will be set on that mddev.
>>> When MD_BROKEN is set, write bios to the md will result in an I/O error.
>>>
>>> This causes a problem when using FailFast.
>>> The current implementation of FailFast expects the array to continue
>>> functioning without issues even after calling md_error for the last
>>> rdev. Furthermore, due to the nature of its functionality, FailFast may
>>> call md_error on all rdevs of the md. Even if retrying I/O on an rdev
>>> would succeed, it first calls md_error before retrying.
>>>
>>> To fix this issue, this commit ensures that for RAID1 and RAID10, if the
>>> last In_sync rdev has the FailFast flag set and the mddev's fail_last_dev
>>> is off, the MD_BROKEN flag will not be set on that mddev.
>>>
>>> This change impacts userspace. After this commit, if the rdev has the
>>> FailFast flag, the mddev is never marked broken, even if the failing bio
>>> is not FailFast. However, it's unlikely that any setup using FailFast
>>> expects the array to halt when md_error is called on the last rdev.
>>>
>>
>> In the current RAID design, when an IO error occurs, RAID ensures faulty
>> data is not read via the following actions:
>> 1. Mark the badblocks (no FailFast flag); if this fails,
>> 2. Mark the disk as Faulty.
>>
>> If neither action is taken, and BROKEN is not set to prevent continued RAID
>> use, errors on the last remaining disk will be ignored. Subsequent reads
>> may return incorrect data. This seems like a more serious issue in my opinion.
>
> I agree that data inconsistency can certainly occur in this scenario.
>
> However, a RAID1 with only one remaining rdev can be considered the same as
> a plain disk. From that perspective, I do not believe it is the mandatory
> responsibility of md raid to block subsequent writes or to prevent data
> inconsistency in this situation.
>
> The commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10") that
> introduced BROKEN for RAID1/10 also does not seem to have taken on that
> responsibility.
>
>>
>> In scenarios with a large number of transient IO errors, is FailFast not a
>> suitable configuration? As you mentioned: "retrying I/O on an rdev would
>> succeed".
>
> You seem to be right about that: using FailFast on an unstable underlying
> layer is not a good idea. However, since md raid is the issuer of the
> FailFast bios, I believe it is incorrect for it to shut down the array
> because a FailFast bio failed.
>
> Thanks,
> Akagi

I get your point. Kuai, what's your take on this?

--
Thanks,
Nan
* Re: [PATCH v6 1/2] md: Don't set MD_BROKEN for RAID1 and RAID10 when using FailFast
  2026-01-06 12:30       ` Kenta Akagi
  2026-01-07  2:09         ` Li Nan
@ 2026-01-07  3:35         ` Xiao Ni
  2026-01-07  6:43           ` Kenta Akagi
  2026-01-16  2:04           ` Kenta Akagi
  1 sibling, 2 replies; 13+ messages in thread
From: Xiao Ni @ 2026-01-07  3:35 UTC
To: Kenta Akagi
Cc: linan666, linux-raid, linux-kernel, song, yukuai, shli, mtkaczyk

On Tue, Jan 6, 2026 at 8:30 PM Kenta Akagi <k@mgml.me> wrote:
>
> Hi,
> Thank you for reviewing.
>
> On 2026/01/06 11:57, Li Nan wrote:
> >
> > On 2026/1/5 22:40, Kenta Akagi wrote:
> >> After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"),
> >> if the error handler is called on the last rdev in RAID1 or RAID10,
> >> the MD_BROKEN flag will be set on that mddev.
> >> When MD_BROKEN is set, write bios to the md will result in an I/O error.
> >>
> >> This causes a problem when using FailFast.
> >> The current implementation of FailFast expects the array to continue
> >> functioning without issues even after calling md_error for the last
> >> rdev. Furthermore, due to the nature of its functionality, FailFast may
> >> call md_error on all rdevs of the md. Even if retrying I/O on an rdev
> >> would succeed, it first calls md_error before retrying.
> >>
> >> To fix this issue, this commit ensures that for RAID1 and RAID10, if the
> >> last In_sync rdev has the FailFast flag set and the mddev's fail_last_dev
> >> is off, the MD_BROKEN flag will not be set on that mddev.
> >>
> >> This change impacts userspace. After this commit, if the rdev has the
> >> FailFast flag, the mddev is never marked broken, even if the failing bio
> >> is not FailFast. However, it's unlikely that any setup using FailFast
> >> expects the array to halt when md_error is called on the last rdev.
> >>
> >
> > In the current RAID design, when an IO error occurs, RAID ensures faulty
> > data is not read via the following actions:
> > 1. Mark the badblocks (no FailFast flag); if this fails,
> > 2. Mark the disk as Faulty.
> >
> > If neither action is taken, and BROKEN is not set to prevent continued RAID
> > use, errors on the last remaining disk will be ignored. Subsequent reads
> > may return incorrect data. This seems like a more serious issue in my opinion.
>
> I agree that data inconsistency can certainly occur in this scenario.
>
> However, a RAID1 with only one remaining rdev can be considered the same as
> a plain disk. From that perspective, I do not believe it is the mandatory
> responsibility of md raid to block subsequent writes or to prevent data
> inconsistency in this situation.
>
> The commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10") that
> introduced BROKEN for RAID1/10 also does not seem to have taken on that
> responsibility.
>
> >
> > In scenarios with a large number of transient IO errors, is FailFast not a
> > suitable configuration? As you mentioned: "retrying I/O on an rdev would
> > succeed".
>
> You seem to be right about that: using FailFast on an unstable underlying
> layer is not a good idea. However, since md raid is the issuer of the
> FailFast bios, I believe it is incorrect for it to shut down the array
> because a FailFast bio failed.

Hi all

I understand @Li Nan's point now. The badblock can't be recorded in this
situation, and the last working device is not set to Faulty. To be frank,
I think consistency of data is more important. Users don't think of the
array as a single disk; they expect raid1 to guarantee consistency. But
the write request should return an error when calling raid1_error for the
last working device, right? So there is no consistency problem?

Hi Kenta, I have a question too. What will you do in your environment
after the network connection works again? Add those disks one by one to
do recovery?

Best Regards
Xiao

> Thanks,
> Akagi
* Re: [PATCH v6 1/2] md: Don't set MD_BROKEN for RAID1 and RAID10 when using FailFast
  2026-01-07  3:35         ` Xiao Ni
@ 2026-01-07  6:43           ` Kenta Akagi
  2026-01-16  2:04           ` Kenta Akagi
  1 sibling, 0 replies; 13+ messages in thread
From: Kenta Akagi @ 2026-01-07  6:43 UTC
To: Xiao Ni
Cc: k, linan666, linux-raid, linux-kernel, song, yukuai, shli, mtkaczyk

Hi,

On 2026/01/07 12:35, Xiao Ni wrote:
> On Tue, Jan 6, 2026 at 8:30 PM Kenta Akagi <k@mgml.me> wrote:
>>
>> Hi,
>> Thank you for reviewing.
>>
>> On 2026/01/06 11:57, Li Nan wrote:
>>>
>>> On 2026/1/5 22:40, Kenta Akagi wrote:
>>>> After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"),
>>>> if the error handler is called on the last rdev in RAID1 or RAID10,
>>>> the MD_BROKEN flag will be set on that mddev.
>>>> When MD_BROKEN is set, write bios to the md will result in an I/O error.
>>>>
>>>> This causes a problem when using FailFast.
>>>> The current implementation of FailFast expects the array to continue
>>>> functioning without issues even after calling md_error for the last
>>>> rdev. Furthermore, due to the nature of its functionality, FailFast may
>>>> call md_error on all rdevs of the md. Even if retrying I/O on an rdev
>>>> would succeed, it first calls md_error before retrying.
>>>>
>>>> To fix this issue, this commit ensures that for RAID1 and RAID10, if the
>>>> last In_sync rdev has the FailFast flag set and the mddev's fail_last_dev
>>>> is off, the MD_BROKEN flag will not be set on that mddev.
>>>>
>>>> This change impacts userspace. After this commit, if the rdev has the
>>>> FailFast flag, the mddev is never marked broken, even if the failing bio
>>>> is not FailFast. However, it's unlikely that any setup using FailFast
>>>> expects the array to halt when md_error is called on the last rdev.
>>>>
>>>
>>> In the current RAID design, when an IO error occurs, RAID ensures faulty
>>> data is not read via the following actions:
>>> 1. Mark the badblocks (no FailFast flag); if this fails,
>>> 2. Mark the disk as Faulty.
>>>
>>> If neither action is taken, and BROKEN is not set to prevent continued RAID
>>> use, errors on the last remaining disk will be ignored. Subsequent reads
>>> may return incorrect data. This seems like a more serious issue in my opinion.
>>
>> I agree that data inconsistency can certainly occur in this scenario.
>>
>> However, a RAID1 with only one remaining rdev can be considered the same as
>> a plain disk. From that perspective, I do not believe it is the mandatory
>> responsibility of md raid to block subsequent writes or to prevent data
>> inconsistency in this situation.
>>
>> The commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10") that
>> introduced BROKEN for RAID1/10 also does not seem to have taken on that
>> responsibility.
>>
>>>
>>> In scenarios with a large number of transient IO errors, is FailFast not a
>>> suitable configuration? As you mentioned: "retrying I/O on an rdev would
>>> succeed".
>>
>> You seem to be right about that: using FailFast on an unstable underlying
>> layer is not a good idea. However, since md raid is the issuer of the
>> FailFast bios, I believe it is incorrect for it to shut down the array
>> because a FailFast bio failed.
>
> Hi all
>
> I understand @Li Nan's point now. The badblock can't be recorded in this
> situation, and the last working device is not set to Faulty. To be frank,
> I think consistency of data is more important. Users don't think of the
> array as a single disk; they expect raid1 to guarantee consistency.

Hmm, I see...

> But the write request should return an error when calling raid1_error
> for the last working device, right? So there is no consistency problem?
>
> Hi Kenta, I have a question too. What will you do in your environment
> after the network connection works again? Add those disks one by one to
> do recovery?

Yes. We will have to add a new disk, or remove and re-add the rdev marked
as Faulty. Currently, the array is being recreated because it is marked
as broken.

Thanks,
Akagi

> Best Regards
> Xiao
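[Editor's note] The recovery steps Akagi describes above — removing the rdev that was marked Faulty and adding it back once the interconnect is stable — are driven from userspace with mdadm. A minimal sketch; the array and member names (/dev/md0, /dev/sdb1) are placeholders, not taken from the thread, and exact steps depend on the setup:

```shell
# The rdev was marked Faulty after the transient error; detach it
# from the array first.
mdadm /dev/md0 --remove /dev/sdb1

# Re-add the same device. If a write-intent bitmap is configured,
# md can perform an incremental resync instead of a full rebuild.
mdadm /dev/md0 --re-add /dev/sdb1

# Watch the recovery/resync progress.
cat /proc/mdstat
```

If the original device is gone for good, `mdadm /dev/md0 --add /dev/sdX1` with a fresh disk triggers a full rebuild instead.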
* Re: [PATCH v6 1/2] md: Don't set MD_BROKEN for RAID1 and RAID10 when using FailFast
  2026-01-07  3:35         ` Xiao Ni
  2026-01-07  6:43           ` Kenta Akagi
@ 2026-01-16  2:04           ` Kenta Akagi
  1 sibling, 0 replies; 13+ messages in thread
From: Kenta Akagi @ 2026-01-16  2:04 UTC
To: Xiao Ni, linan666
Cc: Kenta Akagi, linux-raid, linux-kernel, song, yukuai, shli, mtkaczyk

On 2026/01/07 12:35, Xiao Ni wrote:
> On Tue, Jan 6, 2026 at 8:30 PM Kenta Akagi <k@mgml.me> wrote:
>>
>> Hi,
>> Thank you for reviewing.
>>
>> On 2026/01/06 11:57, Li Nan wrote:
>>>
>>> On 2026/1/5 22:40, Kenta Akagi wrote:
>>>> After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"),
>>>> if the error handler is called on the last rdev in RAID1 or RAID10,
>>>> the MD_BROKEN flag will be set on that mddev.
>>>> When MD_BROKEN is set, write bios to the md will result in an I/O error.
>>>>
>>>> This causes a problem when using FailFast.
>>>> The current implementation of FailFast expects the array to continue
>>>> functioning without issues even after calling md_error for the last
>>>> rdev. Furthermore, due to the nature of its functionality, FailFast may
>>>> call md_error on all rdevs of the md. Even if retrying I/O on an rdev
>>>> would succeed, it first calls md_error before retrying.
>>>>
>>>> To fix this issue, this commit ensures that for RAID1 and RAID10, if the
>>>> last In_sync rdev has the FailFast flag set and the mddev's fail_last_dev
>>>> is off, the MD_BROKEN flag will not be set on that mddev.
>>>>
>>>> This change impacts userspace. After this commit, if the rdev has the
>>>> FailFast flag, the mddev is never marked broken, even if the failing bio
>>>> is not FailFast. However, it's unlikely that any setup using FailFast
>>>> expects the array to halt when md_error is called on the last rdev.
>>>>
>>>
>>> In the current RAID design, when an IO error occurs, RAID ensures faulty
>>> data is not read via the following actions:
>>> 1. Mark the badblocks (no FailFast flag); if this fails,
>>> 2. Mark the disk as Faulty.
>>>
>>> If neither action is taken, and BROKEN is not set to prevent continued RAID
>>> use, errors on the last remaining disk will be ignored. Subsequent reads
>>> may return incorrect data. This seems like a more serious issue in my opinion.
>>
>> I agree that data inconsistency can certainly occur in this scenario.
>>
>> However, a RAID1 with only one remaining rdev can be considered the same as
>> a plain disk. From that perspective, I do not believe it is the mandatory
>> responsibility of md raid to block subsequent writes or to prevent data
>> inconsistency in this situation.
>>
>> The commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10") that
>> introduced BROKEN for RAID1/10 also does not seem to have taken on that
>> responsibility.
>>
>>>
>>> In scenarios with a large number of transient IO errors, is FailFast not a
>>> suitable configuration? As you mentioned: "retrying I/O on an rdev would
>>> succeed".
>>
>> You seem to be right about that: using FailFast on an unstable underlying
>> layer is not a good idea. However, since md raid is the issuer of the
>> FailFast bios, I believe it is incorrect for it to shut down the array
>> because a FailFast bio failed.
>
> Hi all
>
> I understand @Li Nan's point now. The badblock can't be recorded in this
> situation, and the last working device is not set to Faulty. To be frank,
> I think consistency of data is more important. Users don't think of the
> array as a single disk; they expect raid1 to guarantee consistency. But
> the write request should return an error when calling raid1_error for the
> last working device, right? So there is no consistency problem?

Hi all,

I understand that when md_error is issued for the last remaining rdev,
the array should be stopped except in the failfast case. Also, it is no
longer appropriate to treat a RAID1 array that has lost redundancy as
"just a normal single drive" [1].

I will post a PATCH v7 based on v5.

[1] commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10")

Thanks.

> Hi Kenta, I have a question too. What will you do in your environment
> after the network connection works again? Add those disks one by one to
> do recovery?
>
> Best Regards
> Xiao
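[Editor's note] For context on the flag debated in this subthread: FailFast is a per-rdev property set from userspace, so whether a setup opted into the behavior discussed above is visible per member device. A sketch assuming mdadm 4.0 or later and placeholder device names (/dev/md0, /dev/sdc1):

```shell
# Mark a member device failfast while adding it; the same option can
# also be given at --create or --re-add time.
mdadm /dev/md0 --add --failfast /dev/sdc1

# The per-rdev sysfs state file lists "failfast" among the flags when
# the rdev carries it, e.g. "in_sync,failfast".
cat /sys/block/md0/md/dev-sdc1/state
```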
* [PATCH v6 2/2] md/raid10: fix failfast read error not rescheduled
  2026-01-05 14:40 [PATCH v6 0/2] Don't set MD_BROKEN on failfast bio failure Kenta Akagi
  2026-01-05 14:40 ` [PATCH v6 1/2] md: Don't set MD_BROKEN for RAID1 and RAID10 when using FailFast Kenta Akagi
@ 2026-01-05 14:40 ` Kenta Akagi
  1 sibling, 0 replies; 13+ messages in thread
From: Kenta Akagi @ 2026-01-05 14:40 UTC
To: Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, Xiao Ni
Cc: linux-raid, linux-kernel, Kenta Akagi, Li Nan

raid10_end_read_request lacks a path to retry when a FailFast IO fails.
As a result, when FailFast read IOs fail on all rdevs, the upper layer
receives EIO without the read being rescheduled.

Looking at the two commits below, it seems only raid10_end_read_request
lacks the failfast read retry handling, while raid1_end_read_request has
it. In RAID1, the retry works as expected.

* commit 8d3ca83dcf9c ("md/raid10: add failfast handling for reads.")
* commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")

This commit makes a failfast read bio that fails on the last rdev in
raid10 be retried.

Fixes: 8d3ca83dcf9c ("md/raid10: add failfast handling for reads.")
Signed-off-by: Kenta Akagi <k@mgml.me>
Reviewed-by: Li Nan <linan122@huawei.com>
---
 drivers/md/raid10.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index b33149aa5b29..8a254bab52e8 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -401,6 +401,13 @@ static void raid10_end_read_request(struct bio *bio)
 		 * wait for the 'master' bio.
 		 */
 		set_bit(R10BIO_Uptodate, &r10_bio->state);
+	} else if (test_bit(FailFast, &rdev->flags) &&
+		   test_bit(R10BIO_FailFast, &r10_bio->state)) {
+		/*
+		 * This was a fail-fast read so we definitely
+		 * want to retry
+		 */
+		;
 	} else if (!raid1_should_handle_error(bio)) {
 		uptodate = 1;
 	} else {
--
2.50.1
end of thread, other threads: [~2026-01-16  2:05 UTC | newest]

Thread overview: 13+ messages (links below jump to the message on this page):
2026-01-05 14:40 [PATCH v6 0/2] Don't set MD_BROKEN on failfast bio failure Kenta Akagi
2026-01-05 14:40 ` [PATCH v6 1/2] md: Don't set MD_BROKEN for RAID1 and RAID10 when using FailFast Kenta Akagi
2026-01-06  2:57   ` Li Nan
2026-01-06  7:59     ` Xiao Ni
2026-01-06  9:11       ` Li Nan
2026-01-06  9:25         ` Xiao Ni
2026-01-06 11:14           ` Li Nan
2026-01-06 12:30             ` Kenta Akagi
2026-01-07  2:09               ` Li Nan
2026-01-07  3:35               ` Xiao Ni
2026-01-07  6:43                 ` Kenta Akagi
2026-01-16  2:04                 ` Kenta Akagi
2026-01-05 14:40 ` [PATCH v6 2/2] md/raid10: fix failfast read error not rescheduled Kenta Akagi