* [PATCH v4 0/9] Don't set MD_BROKEN on failfast bio failure
@ 2025-09-15 3:42 Kenta Akagi
2025-09-15 3:42 ` [PATCH v4 1/9] md/raid1,raid10: Set the LastDev flag when the configuration changes Kenta Akagi
` (8 more replies)
0 siblings, 9 replies; 39+ messages in thread
From: Kenta Akagi @ 2025-09-15 3:42 UTC (permalink / raw)
To: Song Liu, Yu Kuai, Mariusz Tkaczyk, Shaohua Li, Guoqing Jiang
Cc: linux-raid, linux-kernel, Kenta Akagi
Changes from V3:
- The error handling in md_error() is now serialized, and a new helper
function, md_bio_failure_error, has been introduced.
- MD_FAILFAST bio failures are now processed by md_bio_failure_error
instead of signaling via FailfastIOFailure.
- RAID10: Fix missing reschedule of failfast read bio failure
- Regardless of failfast, writes that succeed on retry in
narrow_write_error are now reported to the upper layer as success
Changes from V2:
- Fix to prevent the array from being marked broken for all
Failfast IOs, not just metadata.
- Reflecting the review, update raid{1,10}_error to clear
FailfastIOFailure so that devices are properly marked Faulty.
Changes from V1:
- Avoid setting MD_BROKEN instead of clearing it
- Add pr_crit() when setting MD_BROKEN
- Fix the message that may be shown after all rdevs fail:
"Operation continuing on 0 devices"
v3: https://lore.kernel.org/linux-raid/20250828163216.4225-1-k@mgml.me/
v2: https://lore.kernel.org/linux-raid/20250817172710.4892-1-k@mgml.me/
v1: https://lore.kernel.org/linux-raid/20250812090119.153697-1-k@mgml.me/
When multiple MD_FAILFAST bios fail simultaneously on Failfast-enabled
rdevs in RAID1/RAID10, the following issues can occur:
* MD_BROKEN is set and the array halts, even though this should not occur
under the intended Failfast design.
* Writes retried through narrow_write_error succeed, but the I/O is still
reported as BLK_STS_IOERR
* RAID10 only: If a Failfast read I/O fails, it is not retried on any
remaining rdev, and as a result, the upper layer receives an I/O error.
Simultaneous bio failures across multiple rdevs are uncommon; however,
rdevs serviced via nvme-tcp can still experience them due to something as
simple as an Ethernet fault. The issue can be reproduced using the
following steps.
# prepare nvmet/nvme-tcp and md array #
sh-5.2# cat << 'EOF' > loopback-nvme.sh
set -eu
nqn="nqn.2025-08.io.example:nvmet-test-$1"
back=$2
cd /sys/kernel/config/nvmet/
mkdir subsystems/$nqn
echo 1 > subsystems/${nqn}/attr_allow_any_host
mkdir subsystems/${nqn}/namespaces/1
echo -n ${back} > subsystems/${nqn}/namespaces/1/device_path
echo 1 > subsystems/${nqn}/namespaces/1/enable
ports="ports/1"
if [ ! -d $ports ]; then
mkdir $ports
cd $ports
echo 127.0.0.1 > addr_traddr
echo tcp > addr_trtype
echo 4420 > addr_trsvcid
echo ipv4 > addr_adrfam
cd ../../
fi
ln -s /sys/kernel/config/nvmet/subsystems/${nqn} ${ports}/subsystems/
nvme connect -t tcp -n $nqn -a 127.0.0.1 -s 4420
EOF
sh-5.2# chmod +x loopback-nvme.sh
sh-5.2# modprobe -a nvme-tcp nvmet-tcp
sh-5.2# truncate -s 1g a.img b.img
sh-5.2# losetup --show -f a.img
/dev/loop0
sh-5.2# losetup --show -f b.img
/dev/loop1
sh-5.2# ./loopback-nvme.sh 0 /dev/loop0
connecting to device: nvme0
sh-5.2# ./loopback-nvme.sh 1 /dev/loop1
connecting to device: nvme1
sh-5.2# mdadm --create --verbose /dev/md0 --level=1 --raid-devices=2 \
--failfast /dev/nvme0n1 --failfast /dev/nvme1n1
...
mdadm: array /dev/md0 started.
# run fio #
sh-5.2# fio --name=test --filename=/dev/md0 --rw=randrw --rwmixread=50 \
--bs=4k --numjobs=9 --time_based --runtime=300s --group_reporting --direct=1
# It can reproduce the issue by block nvme traffic during fio #
sh-5.2# iptables -A INPUT -m tcp -p tcp --dport 4420 -j DROP;
sh-5.2# sleep 10; # twice the default KATO value
sh-5.2# iptables -D INPUT -m tcp -p tcp --dport 4420 -j DROP
Patches 1-3 are preparatory changes for patch 4.
Patch 4 prevents an MD_FAILFAST bio failure from causing the array to fail.
Patch 5 reports success to the upper layer, regardless of FAILFAST, when a
write retried in narrow_write_error succeeds without marking a badblock.
Patch 6 fixes an issue where writes are not retried in a no-bbl
configuration when an MD_FAILFAST bio fails on the last rdev.
Patch 7 adds the missing retry path for Failfast read errors in RAID10.
Patch 8-9 modify the pr_crit handling in raid{1,10}_error.
Kenta Akagi (9):
md/raid1,raid10: Set the LastDev flag when the configuration changes
md: serialize md_error()
md: introduce md_bio_failure_error()
md/raid1,raid10: Don't set MD_BROKEN on failfast bio failure
md/raid1,raid10: Set R{1,10}BIO_Uptodate when successful retry of a
failed bio
md/raid1,raid10: Fix missing retries Failfast write bios on no-bbl
rdevs
md/raid10: fix failfast read error not rescheduled
md/raid1,raid10: Add error message when setting MD_BROKEN
md/raid1,raid10: Fix: Operation continuing on 0 devices.
drivers/md/md.c | 61 ++++++++++++++---
drivers/md/md.h | 12 +++-
drivers/md/raid1.c | 156 ++++++++++++++++++++++++++++++++------------
drivers/md/raid10.c | 115 ++++++++++++++++++++++++++------
4 files changed, 270 insertions(+), 74 deletions(-)
--
2.50.1
^ permalink raw reply [flat|nested] 39+ messages in thread
* [PATCH v4 1/9] md/raid1,raid10: Set the LastDev flag when the configuration changes
2025-09-15 3:42 [PATCH v4 0/9] Don't set MD_BROKEN on failfast bio failure Kenta Akagi
@ 2025-09-15 3:42 ` Kenta Akagi
2025-09-18 1:00 ` Yu Kuai
2025-09-21 7:54 ` Xiao Ni
2025-09-15 3:42 ` [PATCH v4 2/9] md: serialize md_error() Kenta Akagi
` (7 subsequent siblings)
8 siblings, 2 replies; 39+ messages in thread
From: Kenta Akagi @ 2025-09-15 3:42 UTC (permalink / raw)
To: Song Liu, Yu Kuai, Mariusz Tkaczyk, Shaohua Li, Guoqing Jiang
Cc: linux-raid, linux-kernel, Kenta Akagi
Currently, the LastDev flag is set on an rdev that failed a failfast
metadata write and called md_error, but did not become Faulty. It is
cleared when the metadata write retry succeeds. This has problems for
the following reasons:
* Despite its name, the flag is only set during a metadata write window.
* Unlike when LastDev and Failfast were introduced, md_error on the last
rdev of a RAID1/10 array now sets MD_BROKEN. Thus, by the time LastDev
is set, the array is already unwritable.
A later commit will prevent failfast bios from breaking the array, which
requires knowing, from outside the personality, whether an rdev is the
last one. For that purpose, LastDev should be set on rdevs that must not
be lost.
This commit ensures that LastDev is set on the indispensable rdev in a
degraded RAID1/10 array.
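[Editor's note] The invariant this patch maintains can be modeled in
userspace. The following Python sketch is illustrative only (the dict keys
and helper name are assumptions, not kernel API): LastDev ends up set on
exactly those In_sync rdevs the array cannot afford to lose.

```python
def update_lastdev(rdevs):
    """Model of the RAID1 update_lastdev() invariant (illustrative only).

    rdevs: list of dicts with 'in_sync' and 'last_dev' keys.
    Sets 'last_dev' on an In_sync rdev only when it is the sole
    surviving In_sync member; clears it otherwise.
    """
    alive = sum(1 for r in rdevs if r["in_sync"])
    for r in rdevs:
        r["last_dev"] = r["in_sync"] and alive == 1

# A 2-way mirror: while both legs are In_sync, neither is "last".
mirror = [{"in_sync": True, "last_dev": False},
          {"in_sync": True, "last_dev": False}]
update_lastdev(mirror)
print([r["last_dev"] for r in mirror])  # [False, False]

# One leg fails: the survivor becomes the last working rdev.
mirror[0]["in_sync"] = False
update_lastdev(mirror)
print([r["last_dev"] for r in mirror])  # [False, True]
```

The same recomputation runs at every point where degraded state changes
(error handler, spare activation, run, reshape), which is why the patch
calls it from all four sites.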
Signed-off-by: Kenta Akagi <k@mgml.me>
---
drivers/md/md.c | 4 +---
drivers/md/md.h | 6 +++---
drivers/md/raid1.c | 34 +++++++++++++++++++++++++++++++++-
drivers/md/raid10.c | 34 +++++++++++++++++++++++++++++++++-
4 files changed, 70 insertions(+), 8 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 4e033c26fdd4..268410b66b83 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1007,10 +1007,8 @@ static void super_written(struct bio *bio)
if (!test_bit(Faulty, &rdev->flags)
&& (bio->bi_opf & MD_FAILFAST)) {
set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
- set_bit(LastDev, &rdev->flags);
}
- } else
- clear_bit(LastDev, &rdev->flags);
+ }
bio_put(bio);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 51af29a03079..ec598f9a8381 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -281,9 +281,9 @@ enum flag_bits {
* It is expects that no bad block log
* is present.
*/
- LastDev, /* Seems to be the last working dev as
- * it didn't fail, so don't use FailFast
- * any more for metadata
+ LastDev, /* This is the last working rdev.
+ * so don't use FailFast any more for
+ * metadata.
*/
CollisionCheck, /*
* check if there is collision between raid1
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index bf44878ec640..32ad6b102ff7 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1733,6 +1733,33 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
seq_printf(seq, "]");
}
+/**
+ * update_lastdev - Set or clear LastDev flag for all rdevs in array
+ * @conf: pointer to r1conf
+ *
+ * Sets LastDev if the device is In_sync and cannot be lost for the array.
+ * Otherwise, clear it.
+ *
+ * Caller must hold ->device_lock.
+ */
+static void update_lastdev(struct r1conf *conf)
+{
+ int i;
+ int alive_disks = conf->raid_disks - conf->mddev->degraded;
+
+ for (i = 0; i < conf->raid_disks; i++) {
+ struct md_rdev *rdev = conf->mirrors[i].rdev;
+
+ if (rdev) {
+ if (test_bit(In_sync, &rdev->flags) &&
+ alive_disks == 1)
+ set_bit(LastDev, &rdev->flags);
+ else
+ clear_bit(LastDev, &rdev->flags);
+ }
+ }
+}
+
/**
* raid1_error() - RAID1 error handler.
* @mddev: affected md device.
@@ -1767,8 +1794,10 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
}
}
set_bit(Blocked, &rdev->flags);
- if (test_and_clear_bit(In_sync, &rdev->flags))
+ if (test_and_clear_bit(In_sync, &rdev->flags)) {
mddev->degraded++;
+ update_lastdev(conf);
+ }
set_bit(Faulty, &rdev->flags);
spin_unlock_irqrestore(&conf->device_lock, flags);
/*
@@ -1864,6 +1893,7 @@ static int raid1_spare_active(struct mddev *mddev)
}
}
mddev->degraded -= count;
+ update_lastdev(conf);
spin_unlock_irqrestore(&conf->device_lock, flags);
print_conf(conf);
@@ -3290,6 +3320,7 @@ static int raid1_run(struct mddev *mddev)
rcu_assign_pointer(conf->thread, NULL);
mddev->private = conf;
set_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
+ update_lastdev(conf);
md_set_array_sectors(mddev, raid1_size(mddev, 0, 0));
@@ -3427,6 +3458,7 @@ static int raid1_reshape(struct mddev *mddev)
spin_lock_irqsave(&conf->device_lock, flags);
mddev->degraded += (raid_disks - conf->raid_disks);
+ update_lastdev(conf);
spin_unlock_irqrestore(&conf->device_lock, flags);
conf->raid_disks = mddev->raid_disks = raid_disks;
mddev->delta_disks = 0;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index b60c30bfb6c7..dc4edd4689f8 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1983,6 +1983,33 @@ static int enough(struct r10conf *conf, int ignore)
_enough(conf, 1, ignore);
}
+/**
+ * update_lastdev - Set or clear LastDev flag for all rdevs in array
+ * @conf: pointer to r10conf
+ *
+ * Sets LastDev if the device is In_sync and cannot be lost for the array.
+ * Otherwise, clear it.
+ *
+ * Caller must hold ->reconfig_mutex or ->device_lock.
+ */
+static void update_lastdev(struct r10conf *conf)
+{
+ int i;
+ int raid_disks = max(conf->geo.raid_disks, conf->prev.raid_disks);
+
+ for (i = 0; i < raid_disks; i++) {
+ struct md_rdev *rdev = conf->mirrors[i].rdev;
+
+ if (rdev) {
+ if (test_bit(In_sync, &rdev->flags) &&
+ !enough(conf, i))
+ set_bit(LastDev, &rdev->flags);
+ else
+ clear_bit(LastDev, &rdev->flags);
+ }
+ }
+}
+
/**
* raid10_error() - RAID10 error handler.
* @mddev: affected md device.
@@ -2013,8 +2040,10 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
return;
}
}
- if (test_and_clear_bit(In_sync, &rdev->flags))
+ if (test_and_clear_bit(In_sync, &rdev->flags)) {
mddev->degraded++;
+ update_lastdev(conf);
+ }
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
set_bit(Blocked, &rdev->flags);
@@ -2102,6 +2131,7 @@ static int raid10_spare_active(struct mddev *mddev)
}
spin_lock_irqsave(&conf->device_lock, flags);
mddev->degraded -= count;
+ update_lastdev(conf);
spin_unlock_irqrestore(&conf->device_lock, flags);
print_conf(conf);
@@ -4159,6 +4189,7 @@ static int raid10_run(struct mddev *mddev)
md_set_array_sectors(mddev, size);
mddev->resync_max_sectors = size;
set_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
+ update_lastdev(conf);
if (md_integrity_register(mddev))
goto out_free_conf;
@@ -4567,6 +4598,7 @@ static int raid10_start_reshape(struct mddev *mddev)
*/
spin_lock_irq(&conf->device_lock);
mddev->degraded = calc_degraded(conf);
+ update_lastdev(conf);
spin_unlock_irq(&conf->device_lock);
mddev->raid_disks = conf->geo.raid_disks;
mddev->reshape_position = conf->reshape_progress;
--
2.50.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH v4 2/9] md: serialize md_error()
2025-09-15 3:42 [PATCH v4 0/9] Don't set MD_BROKEN on failfast bio failure Kenta Akagi
2025-09-15 3:42 ` [PATCH v4 1/9] md/raid1,raid10: Set the LastDev flag when the configuration changes Kenta Akagi
@ 2025-09-15 3:42 ` Kenta Akagi
2025-09-18 1:04 ` Yu Kuai
2025-09-15 3:42 ` [PATCH v4 3/9] md: introduce md_bio_failure_error() Kenta Akagi
` (6 subsequent siblings)
8 siblings, 1 reply; 39+ messages in thread
From: Kenta Akagi @ 2025-09-15 3:42 UTC (permalink / raw)
To: Song Liu, Yu Kuai, Mariusz Tkaczyk, Shaohua Li, Guoqing Jiang
Cc: linux-raid, linux-kernel, Kenta Akagi
md_error is mainly called when a bio fails, so it can run in parallel.
Each personality’s error_handler locks with device_lock, so concurrent
calls are safe.
However, RAID1 and RAID10 require changes for Failfast bio error handling,
which needs a special helper for md_error. For that helper to work, the
regular md_error must also be serialized.
The helper function md_bio_failure_error for failfast will be introduced
in a subsequent commit.
This commit serializes md_error for all RAID personalities. While
unnecessary for RAID levels other than 1 and 10, it has no performance
impact as it is a cold path.
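[Editor's note] As a rough userspace analogy (Python; the class and method
names are invented for illustration), serializing the whole check-then-act
in the error path means concurrent failures are processed one at a time,
so the second caller always observes the state the first one left behind:

```python
import threading

class ArrayModel:
    """Toy model of serialized md_error() (illustrative, not the kernel API)."""
    def __init__(self, ndisks):
        self.faulty = [False] * ndisks
        self.broken = False
        self.error_handle_lock = threading.Lock()

    def md_error(self, i):
        # The whole check-then-act runs under one lock, mirroring how
        # md_error() is wrapped around _md_error() in this patch.
        with self.error_handle_lock:
            if self.faulty[i]:
                return
            alive = sum(1 for f in self.faulty if not f)
            if alive == 1:
                self.broken = True   # last rdev: array becomes unwritable
            else:
                self.faulty[i] = True

md = ArrayModel(2)
threads = [threading.Thread(target=md.md_error, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Serialization guarantees exactly one rdev is failed; whichever call runs
# second sees only one survivor and takes the "last rdev" branch instead.
print(sum(md.faulty), md.broken)  # 1 True
```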
Signed-off-by: Kenta Akagi <k@mgml.me>
---
drivers/md/md.c | 10 +++++++++-
drivers/md/md.h | 4 ++++
2 files changed, 13 insertions(+), 1 deletion(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 268410b66b83..5607578a6db9 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -705,6 +705,7 @@ int mddev_init(struct mddev *mddev)
atomic_set(&mddev->openers, 0);
atomic_set(&mddev->sync_seq, 0);
spin_lock_init(&mddev->lock);
+ spin_lock_init(&mddev->error_handle_lock);
init_waitqueue_head(&mddev->sb_wait);
init_waitqueue_head(&mddev->recovery_wait);
mddev->reshape_position = MaxSector;
@@ -8262,7 +8263,7 @@ void md_unregister_thread(struct mddev *mddev, struct md_thread __rcu **threadp)
}
EXPORT_SYMBOL(md_unregister_thread);
-void md_error(struct mddev *mddev, struct md_rdev *rdev)
+void _md_error(struct mddev *mddev, struct md_rdev *rdev)
{
if (!rdev || test_bit(Faulty, &rdev->flags))
return;
@@ -8287,6 +8288,13 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
queue_work(md_misc_wq, &mddev->event_work);
md_new_event();
}
+
+void md_error(struct mddev *mddev, struct md_rdev *rdev)
+{
+ spin_lock(&mddev->error_handle_lock);
+ _md_error(mddev, rdev);
+ spin_unlock(&mddev->error_handle_lock);
+}
EXPORT_SYMBOL(md_error);
/* seq_file implementation /proc/mdstat */
diff --git a/drivers/md/md.h b/drivers/md/md.h
index ec598f9a8381..5177cb609e4b 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -619,6 +619,9 @@ struct mddev {
/* The sequence number for sync thread */
atomic_t sync_seq;
+ /* Lock for serializing md_error */
+ spinlock_t error_handle_lock;
+
bool has_superblocks:1;
bool fail_last_dev:1;
bool serialize_policy:1;
@@ -901,6 +904,7 @@ extern void md_write_start(struct mddev *mddev, struct bio *bi);
extern void md_write_inc(struct mddev *mddev, struct bio *bi);
extern void md_write_end(struct mddev *mddev);
extern void md_done_sync(struct mddev *mddev, int blocks, int ok);
+void _md_error(struct mddev *mddev, struct md_rdev *rdev);
extern void md_error(struct mddev *mddev, struct md_rdev *rdev);
extern void md_finish_reshape(struct mddev *mddev);
void md_submit_discard_bio(struct mddev *mddev, struct md_rdev *rdev,
--
2.50.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH v4 3/9] md: introduce md_bio_failure_error()
2025-09-15 3:42 [PATCH v4 0/9] Don't set MD_BROKEN on failfast bio failure Kenta Akagi
2025-09-15 3:42 ` [PATCH v4 1/9] md/raid1,raid10: Set the LastDev flag when the configuration changes Kenta Akagi
2025-09-15 3:42 ` [PATCH v4 2/9] md: serialize md_error() Kenta Akagi
@ 2025-09-15 3:42 ` Kenta Akagi
2025-09-18 1:09 ` Yu Kuai
2025-09-15 3:42 ` [PATCH v4 4/9] md/raid1,raid10: Don't set MD_BROKEN on failfast bio failure Kenta Akagi
` (5 subsequent siblings)
8 siblings, 1 reply; 39+ messages in thread
From: Kenta Akagi @ 2025-09-15 3:42 UTC (permalink / raw)
To: Song Liu, Yu Kuai, Mariusz Tkaczyk, Shaohua Li, Guoqing Jiang
Cc: linux-raid, linux-kernel, Kenta Akagi
Add a new helper function md_bio_failure_error().
It is serialized with md_error() under the same lock and works
almost the same, but with two differences:
* Takes the failed bio as an argument
* If MD_FAILFAST is set in bi_opf and the target rdev is LastDev,
it does not mark the rdev faulty
Failfast bios must not break the array, but in the current implementation
this can happen. This is because MD_BROKEN was introduced in RAID1/RAID10
and is set when md_error() is called on an rdev required for mddev
operation. At the time failfast was introduced, this was not the case.
The preceding commits in this series already serialized md_error(), and
RAID1/RAID10 now mark rdevs that must not be set Faulty by failfast with
the LastDev flag.
The actual change in bio error handling will follow in a later commit.
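[Editor's note] The two differences above reduce to a small skip
predicate. A hedged Python model follows (argument names are illustrative,
not the kernel's types):

```python
def should_call_md_error(pers_is_raid1_or_10, rdev_last_dev,
                         rdev_failfast, bio_is_failfast):
    """Model of md_bio_failure_error()'s decision (illustrative).

    Returns True when the normal error path should run, False when the
    failure is ignored because failing this rdev would break the array
    and the bio was a best-effort failfast one.
    """
    skip = (pers_is_raid1_or_10 and rdev_last_dev and
            rdev_failfast and bio_is_failfast)
    return not skip

# A failfast bio failing on the last RAID1 rdev is tolerated:
print(should_call_md_error(True, True, True, True))    # False
# The same failure on a non-last rdev still faults the device:
print(should_call_md_error(True, False, True, True))   # True
# Non-failfast bios always go through the normal error path:
print(should_call_md_error(True, True, True, False))   # True
```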
Signed-off-by: Kenta Akagi <k@mgml.me>
---
drivers/md/md.c | 42 ++++++++++++++++++++++++++++++++++++++++++
drivers/md/md.h | 4 +++-
2 files changed, 45 insertions(+), 1 deletion(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 5607578a6db9..65fdd9bae8f4 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8297,6 +8297,48 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
}
EXPORT_SYMBOL(md_error);
+/** md_bio_failure_error() - md error handler for MD_FAILFAST bios
+ * @mddev: affected md device.
+ * @rdev: member device to fail.
+ * @bio: bio that triggered the device failure.
+ *
+ * This is almost the same as md_error(). That is, it is serialized at
+ * the same level as md_error, marks the rdev as Faulty, and changes
+ * the mddev status.
+ * However, if all of the following conditions are met, it does nothing.
+ * This is because MD_FAILFAST bios must not stop the array.
+ * * RAID1 or RAID10
+ * * LastDev - if rdev becomes Faulty, mddev will stop
+ * * The failed bio has MD_FAILFAST set
+ *
+ * Returns: true if _md_error() was called, false if not.
+ */
+bool md_bio_failure_error(struct mddev *mddev, struct md_rdev *rdev, struct bio *bio)
+{
+ bool do_md_error = true;
+
+ spin_lock(&mddev->error_handle_lock);
+ if (mddev->pers) {
+ if (mddev->pers->head.id == ID_RAID1 ||
+ mddev->pers->head.id == ID_RAID10) {
+ if (test_bit(LastDev, &rdev->flags) &&
+ test_bit(FailFast, &rdev->flags) &&
+ bio != NULL && (bio->bi_opf & MD_FAILFAST))
+ do_md_error = false;
+ }
+ }
+
+ if (do_md_error)
+ _md_error(mddev, rdev);
+ else
+ pr_warn_ratelimited("md: %s: %s didn't do anything for %pg\n",
+ mdname(mddev), __func__, rdev->bdev);
+
+ spin_unlock(&mddev->error_handle_lock);
+ return do_md_error;
+}
+EXPORT_SYMBOL(md_bio_failure_error);
+
/* seq_file implementation /proc/mdstat */
static void status_unused(struct seq_file *seq)
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 5177cb609e4b..11389ea58431 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -283,7 +283,8 @@ enum flag_bits {
*/
LastDev, /* This is the last working rdev.
* so don't use FailFast any more for
- * metadata.
+ * metadata and don't Fail rdev
+ * when FailFast bio failure.
*/
CollisionCheck, /*
* check if there is collision between raid1
@@ -906,6 +907,7 @@ extern void md_write_end(struct mddev *mddev);
extern void md_done_sync(struct mddev *mddev, int blocks, int ok);
void _md_error(struct mddev *mddev, struct md_rdev *rdev);
extern void md_error(struct mddev *mddev, struct md_rdev *rdev);
+extern bool md_bio_failure_error(struct mddev *mddev, struct md_rdev *rdev, struct bio *bio);
extern void md_finish_reshape(struct mddev *mddev);
void md_submit_discard_bio(struct mddev *mddev, struct md_rdev *rdev,
struct bio *bio, sector_t start, sector_t size);
--
2.50.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH v4 4/9] md/raid1,raid10: Don't set MD_BROKEN on failfast bio failure
2025-09-15 3:42 [PATCH v4 0/9] Don't set MD_BROKEN on failfast bio failure Kenta Akagi
` (2 preceding siblings ...)
2025-09-15 3:42 ` [PATCH v4 3/9] md: introduce md_bio_failure_error() Kenta Akagi
@ 2025-09-15 3:42 ` Kenta Akagi
2025-09-18 1:26 ` Yu Kuai
2025-09-15 3:42 ` [PATCH v4 5/9] md/raid1,raid10: Set R{1,10}BIO_Uptodate when successful retry of a failed bio Kenta Akagi
` (4 subsequent siblings)
8 siblings, 1 reply; 39+ messages in thread
From: Kenta Akagi @ 2025-09-15 3:42 UTC (permalink / raw)
To: Song Liu, Yu Kuai, Mariusz Tkaczyk, Shaohua Li, Guoqing Jiang
Cc: linux-raid, linux-kernel, Kenta Akagi
Failfast is a feature implemented only for RAID1 and RAID10. It instructs
the block device providing the rdev to immediately return a bio error
without retrying if any issue occurs. This allows quickly detaching a
problematic rdev and minimizes IO latency.
Due to its nature, failfast bios can fail easily, and md must not mark
an essential rdev as Faulty or set MD_BROKEN on the array just because
a failfast bio failed.
When failfast was introduced, RAID1 and RAID10 were designed to continue
operating normally even if md_error was called for the last rdev. However,
with the introduction of MD_BROKEN in RAID1/RAID10
in commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"), calling
md_error for the last rdev now prevents further writes to the array.
Despite this, the current failfast error handler still assumes that
calling md_error will not break the array.
Normally, this is not an issue because MD_FAILFAST is not set when a bio
is issued to the last rdev. However, if the array is not degraded and a
bio with MD_FAILFAST has been issued, simultaneous failures could
potentially break the array. This is unusual but can happen; for example,
this can occur when using NVMe over TCP if all rdevs depend on
a single Ethernet link.
In other words, this becomes a problem under the following conditions:
Preconditions:
* Failfast is enabled on all rdevs.
* All rdevs are In_sync - This is a requirement for a bio to be
submitted with MD_FAILFAST.
* At least one bio has been submitted but has not yet completed.
Trigger condition:
* All underlying devices of the rdevs return an error for their failfast
bios.
Whether the bio is a read or a write makes little difference to the
outcome.
In the write case, md_error is invoked on each rdev through its bi_end_io
handler.
In the read case, losing the first rdev triggers a metadata update.
md_super_write, unlike raid1_write_request, issues the bio with
MD_FAILFAST whenever the rdev supports it, so the bio fails immediately
(before this patchset, LastDev was set only by the failure path in
super_written). Consequently, super_written calls md_error on the
remaining rdev.
Prior to this commit, the following changes were introduced:
* The helper function md_bio_failure_error(), which skips the error
handler when a failfast bio targets the last rdev.
* Serialization of md_error() and md_bio_failure_error().
* Setting the LastDev flag on rdevs that must not be lost.
This commit uses md_bio_failure_error() instead of md_error() for failfast
bio failures, ensuring that failfast bios do not stop array operations.
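[Editor's note] The trigger scenario can be walked through with a toy
model. This Python sketch is illustrative only and collapses patches 1-4
into a few lines (all names are assumptions, not kernel API): both legs
of a 2-disk mirror fail failfast writes at nearly the same time, yet the
array survives on one rdev.

```python
class MirrorModel:
    """Toy 2-disk RAID1 model of the fixed failfast path (illustrative)."""
    def __init__(self):
        self.in_sync = [True, True]
        self.last_dev = [False, False]
        self.broken = False

    def _update_lastdev(self):
        alive = sum(self.in_sync)
        for i in range(2):
            self.last_dev[i] = self.in_sync[i] and alive == 1

    def bio_failure(self, i, failfast):
        # md_bio_failure_error(): skip when a failfast bio fails on LastDev.
        if failfast and self.last_dev[i]:
            return False            # nothing done; the bio is retried
        if self.in_sync[i]:
            if sum(self.in_sync) == 1:
                self.broken = True  # md_error() on the last rdev
            else:
                self.in_sync[i] = False
                self._update_lastdev()
        return True

md = MirrorModel()
# Both legs fail their failfast writes at (nearly) the same time.
md.bio_failure(0, failfast=True)   # rdev0 faulted, rdev1 becomes LastDev
md.bio_failure(1, failfast=True)   # skipped: would have broken the array
print(md.in_sync, md.broken)       # [False, True] False
```

With the old behavior (md_error() unconditionally), the second failure
would have set the equivalent of MD_BROKEN.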
Fixes: 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10")
Signed-off-by: Kenta Akagi <k@mgml.me>
---
drivers/md/md.c | 5 +----
drivers/md/raid1.c | 37 ++++++++++++++++++-------------------
drivers/md/raid10.c | 9 +++++----
3 files changed, 24 insertions(+), 27 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 65fdd9bae8f4..65814bbe9bad 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1004,11 +1004,8 @@ static void super_written(struct bio *bio)
if (bio->bi_status) {
pr_err("md: %s gets error=%d\n", __func__,
blk_status_to_errno(bio->bi_status));
- md_error(mddev, rdev);
- if (!test_bit(Faulty, &rdev->flags)
- && (bio->bi_opf & MD_FAILFAST)) {
+ if (!md_bio_failure_error(mddev, rdev, bio))
set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
- }
}
bio_put(bio);
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 32ad6b102ff7..8fff9dacc6e0 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -470,7 +470,7 @@ static void raid1_end_write_request(struct bio *bio)
(bio->bi_opf & MD_FAILFAST) &&
/* We never try FailFast to WriteMostly devices */
!test_bit(WriteMostly, &rdev->flags)) {
- md_error(r1_bio->mddev, rdev);
+ md_bio_failure_error(r1_bio->mddev, rdev, bio);
}
/*
@@ -2178,8 +2178,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
if (test_bit(FailFast, &rdev->flags)) {
/* Don't try recovering from here - just fail it
* ... unless it is the last working device of course */
- md_error(mddev, rdev);
- if (test_bit(Faulty, &rdev->flags))
+ if (md_bio_failure_error(mddev, rdev, bio))
/* Don't try to read from here, but make sure
* put_buf does it's thing
*/
@@ -2657,9 +2656,8 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
{
struct mddev *mddev = conf->mddev;
- struct bio *bio;
+ struct bio *bio, *updated_bio;
struct md_rdev *rdev;
- sector_t sector;
clear_bit(R1BIO_ReadError, &r1_bio->state);
/* we got a read error. Maybe the drive is bad. Maybe just
@@ -2672,29 +2670,30 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
*/
bio = r1_bio->bios[r1_bio->read_disk];
- bio_put(bio);
- r1_bio->bios[r1_bio->read_disk] = NULL;
+ updated_bio = NULL;
rdev = conf->mirrors[r1_bio->read_disk].rdev;
- if (mddev->ro == 0
- && !test_bit(FailFast, &rdev->flags)) {
- freeze_array(conf, 1);
- fix_read_error(conf, r1_bio);
- unfreeze_array(conf);
- } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) {
- md_error(mddev, rdev);
+ if (mddev->ro == 0) {
+ if (!test_bit(FailFast, &rdev->flags)) {
+ freeze_array(conf, 1);
+ fix_read_error(conf, r1_bio);
+ unfreeze_array(conf);
+ } else {
+ md_bio_failure_error(mddev, rdev, bio);
+ }
} else {
- r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
+ updated_bio = IO_BLOCKED;
}
+ bio_put(bio);
+ r1_bio->bios[r1_bio->read_disk] = updated_bio;
+
rdev_dec_pending(rdev, conf->mddev);
- sector = r1_bio->sector;
- bio = r1_bio->master_bio;
/* Reuse the old r1_bio so that the IO_BLOCKED settings are preserved */
r1_bio->state = 0;
- raid1_read_request(mddev, bio, r1_bio->sectors, r1_bio);
- allow_barrier(conf, sector);
+ raid1_read_request(mddev, r1_bio->master_bio, r1_bio->sectors, r1_bio);
+ allow_barrier(conf, r1_bio->sector);
}
static void raid1d(struct md_thread *thread)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index dc4edd4689f8..b73af94a88b0 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -488,7 +488,7 @@ static void raid10_end_write_request(struct bio *bio)
dec_rdev = 0;
if (test_bit(FailFast, &rdev->flags) &&
(bio->bi_opf & MD_FAILFAST)) {
- md_error(rdev->mddev, rdev);
+ md_bio_failure_error(rdev->mddev, rdev, bio);
}
/*
@@ -2443,7 +2443,7 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
continue;
} else if (test_bit(FailFast, &rdev->flags)) {
/* Just give up on this device */
- md_error(rdev->mddev, rdev);
+ md_bio_failure_error(rdev->mddev, rdev, tbio);
continue;
}
/* Ok, we need to write this bio, either to correct an
@@ -2895,8 +2895,9 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
freeze_array(conf, 1);
fix_read_error(conf, mddev, r10_bio);
unfreeze_array(conf);
- } else
- md_error(mddev, rdev);
+ } else {
+ md_bio_failure_error(mddev, rdev, bio);
+ }
rdev_dec_pending(rdev, mddev);
r10_bio->state = 0;
--
2.50.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH v4 5/9] md/raid1,raid10: Set R{1,10}BIO_Uptodate when successful retry of a failed bio
2025-09-15 3:42 [PATCH v4 0/9] Don't set MD_BROKEN on failfast bio failure Kenta Akagi
` (3 preceding siblings ...)
2025-09-15 3:42 ` [PATCH v4 4/9] md/raid1,raid10: Don't set MD_BROKEN on failfast bio failure Kenta Akagi
@ 2025-09-15 3:42 ` Kenta Akagi
2025-09-17 9:24 ` Li Nan
2025-09-15 3:42 ` [PATCH v4 6/9] md/raid1,raid10: Fix missing retries Failfast write bios on no-bbl rdevs Kenta Akagi
` (3 subsequent siblings)
8 siblings, 1 reply; 39+ messages in thread
From: Kenta Akagi @ 2025-09-15 3:42 UTC (permalink / raw)
To: Song Liu, Yu Kuai, Mariusz Tkaczyk, Shaohua Li, Guoqing Jiang
Cc: linux-raid, linux-kernel, Kenta Akagi
In the current implementation, when a write bio fails, the retry flow
is as follows:
* In bi_end_io, e.g. raid1_end_write_request, R1BIO_WriteError is
set on the r1bio.
* The md thread calls handle_write_finished corresponding to this r1bio.
* Inside handle_write_finished, narrow_write_error is invoked.
* narrow_write_error rewrites the r1bio on a per-sector basis, marking
any failed sectors as badblocks. It returns true if all sectors succeed,
or if failed sectors are successfully recorded via rdev_set_badblock.
It returns false if rdev_set_badblock fails or if badblocks are disabled.
* handle_write_finished faults the rdev if it receives false from
narrow_write_error. Otherwise, it does nothing.
This can cause a problem where an r1bio that succeeded on retry is
incorrectly reported as failed to the higher layer, for example in the
following case:
* Only one In_sync rdev exists, and
* The write bio initially failed but all retries in
narrow_write_error succeed.
This commit ensures that if a write initially fails but all retries in
narrow_write_error succeed, R1BIO_Uptodate or R10BIO_Uptodate is set
and the higher layer receives a successful write status.
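[Editor's note] The intended reporting logic can be sketched as a
userspace model (Python; the function names mirror the kernel's but the
code is illustrative, and it assumes rdev_set_badblocks itself succeeds):

```python
def narrow_write_error(sector_results, bbl_enabled):
    """Model of narrow_write_error() (illustrative).

    sector_results: per-sector retry outcome, True = write succeeded.
    Returns (ok, badblocks): ok is False when badblocks are disabled;
    badblocks lists the sectors that failed and were recorded as bad.
    """
    if not bbl_enabled:
        return False, []
    bad = [s for s, ok in enumerate(sector_results) if not ok]
    return True, bad

def write_status(sector_results, bbl_enabled, in_sync=True, faulty=False):
    """Status reported to the upper layer after the retry, per this patch."""
    ok, bad = narrow_write_error(sector_results, bbl_enabled)
    if not ok:
        return "error"      # rdev gets faulted via md_error()
    # R1BIO_Uptodate is set only if no badblock covers the range.
    if in_sync and not faulty and not bad:
        return "success"
    return "error"

print(write_status([True, True, True], bbl_enabled=True))    # success
print(write_status([True, False, True], bbl_enabled=True))   # error
```

Before this patch, the first case was still reported as an error even
though every retried sector was written successfully.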
Signed-off-by: Kenta Akagi <k@mgml.me>
---
drivers/md/raid1.c | 32 ++++++++++++++++++++++++++------
drivers/md/raid10.c | 21 +++++++++++++++++++++
2 files changed, 47 insertions(+), 6 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 8fff9dacc6e0..806f5cb33a8e 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2517,6 +2517,21 @@ static void fix_read_error(struct r1conf *conf, struct r1bio *r1_bio)
}
}
+/**
+ * narrow_write_error() - Retry write and set badblock
+ * @r1_bio: the r1bio containing the write error
+ * @i: which device to retry
+ *
+ * Rewrites the bio, splitting it at the least common multiple of the logical
+ * block size and the badblock size. Blocks that fail to be written are marked
+ * as bad. If badblocks are disabled, no write is attempted and false is
+ * returned immediately.
+ *
+ * Return:
+ * * %true - all blocks were written or marked bad successfully
+ * * %false - bbl disabled or
+ * one or more block writes failed and could not be marked bad
+ */
static bool narrow_write_error(struct r1bio *r1_bio, int i)
{
struct mddev *mddev = r1_bio->mddev;
@@ -2614,9 +2629,9 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
int m, idx;
bool fail = false;
- for (m = 0; m < conf->raid_disks * 2 ; m++)
+ for (m = 0; m < conf->raid_disks * 2 ; m++) {
+ struct md_rdev *rdev = conf->mirrors[m].rdev;
if (r1_bio->bios[m] == IO_MADE_GOOD) {
- struct md_rdev *rdev = conf->mirrors[m].rdev;
rdev_clear_badblocks(rdev,
r1_bio->sector,
r1_bio->sectors, 0);
@@ -2628,12 +2643,17 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
*/
fail = true;
if (!narrow_write_error(r1_bio, m))
- md_error(conf->mddev,
- conf->mirrors[m].rdev);
+ md_error(conf->mddev, rdev);
/* an I/O failed, we can't clear the bitmap */
- rdev_dec_pending(conf->mirrors[m].rdev,
- conf->mddev);
+ else if (test_bit(In_sync, &rdev->flags) &&
+ !test_bit(Faulty, &rdev->flags) &&
+ rdev_has_badblock(rdev,
+ r1_bio->sector,
+ r1_bio->sectors) == 0)
+ set_bit(R1BIO_Uptodate, &r1_bio->state);
+ rdev_dec_pending(rdev, conf->mddev);
}
+ }
if (fail) {
spin_lock_irq(&conf->device_lock);
list_add(&r1_bio->retry_list, &conf->bio_end_io_list);
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index b73af94a88b0..21c2821453e1 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2809,6 +2809,21 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10
}
}
+/**
+ * narrow_write_error() - Retry write and set badblock
+ * @r10_bio: the r10bio containing the write error
+ * @i: which device to retry
+ *
+ * Rewrites the bio, splitting it at the least common multiple of the logical
+ * block size and the badblock size. Blocks that fail to be written are marked
+ * as bad. If badblocks are disabled, no write is attempted and false is
+ * returned immediately.
+ *
+ * Return:
+ * * %true - all blocks were written or marked bad successfully
+ * * %false - bbl disabled or
+ * one or more blocks write failed and could not be marked bad
+ */
static bool narrow_write_error(struct r10bio *r10_bio, int i)
{
struct bio *bio = r10_bio->master_bio;
@@ -2975,6 +2990,12 @@ static void handle_write_completed(struct r10conf *conf, struct r10bio *r10_bio)
fail = true;
if (!narrow_write_error(r10_bio, m))
md_error(conf->mddev, rdev);
+ else if (test_bit(In_sync, &rdev->flags) &&
+ !test_bit(Faulty, &rdev->flags) &&
+ rdev_has_badblock(rdev,
+ r10_bio->devs[m].addr,
+ r10_bio->sectors) == 0)
+ set_bit(R10BIO_Uptodate, &r10_bio->state);
rdev_dec_pending(rdev, conf->mddev);
}
bio = r10_bio->devs[m].repl_bio;
--
2.50.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH v4 6/9] md/raid1,raid10: Fix missing retries Failfast write bios on no-bbl rdevs
2025-09-15 3:42 [PATCH v4 0/9] Don't set MD_BROKEN on failfast bio failure Kenta Akagi
` (4 preceding siblings ...)
2025-09-15 3:42 ` [PATCH v4 5/9] md/raid1,raid10: Set R{1,10}BIO_Uptodate when successful retry of a failed bio Kenta Akagi
@ 2025-09-15 3:42 ` Kenta Akagi
2025-09-17 10:06 ` Li Nan
2025-09-15 3:42 ` [PATCH v4 7/9] md/raid10: fix failfast read error not rescheduled Kenta Akagi
` (2 subsequent siblings)
8 siblings, 1 reply; 39+ messages in thread
From: Kenta Akagi @ 2025-09-15 3:42 UTC (permalink / raw)
To: Song Liu, Yu Kuai, Mariusz Tkaczyk, Shaohua Li, Guoqing Jiang
Cc: linux-raid, linux-kernel, Kenta Akagi
In the current implementation, write failures are not retried on rdevs
with badblocks disabled. This is because narrow_write_error, which issues
retry bios, immediately returns when badblocks are disabled. As a result,
a single write failure on such an rdev will immediately mark it as Faulty.
The retry mechanism appears to have been implemented under the assumption
that a bad block is involved in the failure. However, the retries after an
MD_FAILFAST write failure depend on this code, and a Failfast write request
may fail for reasons unrelated to bad blocks.
Consequently, if failfast is enabled and badblocks are disabled on all
rdevs, and all rdevs encounter a failfast write bio failure at the same
time, no retries will occur and the entire array can be lost.
This commit adds a path in narrow_write_error to retry writes even on rdevs
with bad blocks disabled; failed bios marked with MD_FAILFAST use this path.
For non-failfast cases, the behavior remains unchanged: no retry writes are
attempted on rdevs with bad blocks disabled.
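As a rough user-space sketch (illustrative names, not the kernel's code), the chunking arithmetic that narrow_write_error uses for these retries looks like this, where a negative badblocks shift means the badblock list (bbl) is disabled:

```c
#include <assert.h>

/*
 * User-space sketch (illustrative names, not kernel code) of the
 * splitting arithmetic in narrow_write_error(): retries are issued in
 * block_sectors-sized chunks, where block_sectors is the logical block
 * size rounded up to the badblock granularity when the bbl is enabled,
 * or just the logical block size (in 512-byte sectors) when it is not.
 */
static long retry_block_sectors(int bbl_shift, long logical_block_sectors)
{
	long block_sectors = logical_block_sectors;

	if (bbl_shift >= 0) {
		/* roundup(1 << bbl_shift, logical_block_sectors) */
		long bb = 1L << bbl_shift;

		block_sectors = ((bb + logical_block_sectors - 1) /
				 logical_block_sectors) * logical_block_sectors;
	}
	return block_sectors;
}

/*
 * Sectors in the first retry chunk: from `sector` up to the next
 * block_sectors-aligned boundary (block_sectors assumed a power of two,
 * as is typical for logical block and badblock sizes).
 */
static long first_chunk_sectors(long sector, long block_sectors)
{
	return ((sector + block_sectors) & ~(block_sectors - 1)) - sector;
}
```

For example, with the bbl disabled and a 4K logical block size, retry_block_sectors(-1, 8) yields 8-sector chunks, so retries proceed at logical-block granularity rather than being skipped entirely.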
Fixes: 1919cbb23bf1 ("md/raid10: add failfast handling for writes.")
Fixes: 212e7eb7a340 ("md/raid1: add failfast handling for writes.")
Signed-off-by: Kenta Akagi <k@mgml.me>
---
drivers/md/raid1.c | 44 +++++++++++++++++++++++++++++---------------
drivers/md/raid10.c | 37 ++++++++++++++++++++++++-------------
2 files changed, 53 insertions(+), 28 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 806f5cb33a8e..55213bcd82f4 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2521,18 +2521,19 @@ static void fix_read_error(struct r1conf *conf, struct r1bio *r1_bio)
* narrow_write_error() - Retry write and set badblock
* @r1_bio: the r1bio containing the write error
* @i: which device to retry
+ * @force: Retry writing even if badblock is disabled
*
* Rewrites the bio, splitting it at the least common multiple of the logical
* block size and the badblock size. Blocks that fail to be written are marked
- * as bad. If badblocks are disabled, no write is attempted and false is
- * returned immediately.
+ * as bad. If bbl disabled and @force is not set, no retry is attempted.
+ * If bbl disabled and @force is set, the write is retried in the same way.
*
* Return:
* * %true - all blocks were written or marked bad successfully
* * %false - bbl disabled or
* one or more blocks write failed and could not be marked bad
*/
-static bool narrow_write_error(struct r1bio *r1_bio, int i)
+static bool narrow_write_error(struct r1bio *r1_bio, int i, bool force)
{
struct mddev *mddev = r1_bio->mddev;
struct r1conf *conf = mddev->private;
@@ -2553,13 +2554,17 @@ static bool narrow_write_error(struct r1bio *r1_bio, int i)
sector_t sector;
int sectors;
int sect_to_write = r1_bio->sectors;
- bool ok = true;
+ bool write_ok = true;
+ bool setbad_ok = true;
+ bool bbl_enabled = !(rdev->badblocks.shift < 0);
- if (rdev->badblocks.shift < 0)
+ if (!force && !bbl_enabled)
return false;
- block_sectors = roundup(1 << rdev->badblocks.shift,
- bdev_logical_block_size(rdev->bdev) >> 9);
+ block_sectors = bdev_logical_block_size(rdev->bdev) >> 9;
+ if (bbl_enabled)
+ block_sectors = roundup(1 << rdev->badblocks.shift,
+ block_sectors);
sector = r1_bio->sector;
sectors = ((sector + block_sectors)
& ~(sector_t)(block_sectors - 1))
@@ -2587,18 +2592,22 @@ static bool narrow_write_error(struct r1bio *r1_bio, int i)
bio_trim(wbio, sector - r1_bio->sector, sectors);
wbio->bi_iter.bi_sector += rdev->data_offset;
- if (submit_bio_wait(wbio) < 0)
+ if (submit_bio_wait(wbio) < 0) {
/* failure! */
- ok = rdev_set_badblocks(rdev, sector,
- sectors, 0)
- && ok;
+ write_ok = false;
+ if (bbl_enabled)
+ setbad_ok = rdev_set_badblocks(rdev, sector,
+ sectors, 0)
+ && setbad_ok;
+ }
bio_put(wbio);
sect_to_write -= sectors;
sector += sectors;
sectors = block_sectors;
}
- return ok;
+ return (write_ok ||
+ (bbl_enabled && setbad_ok));
}
static void handle_sync_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
@@ -2631,18 +2640,23 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
for (m = 0; m < conf->raid_disks * 2 ; m++) {
struct md_rdev *rdev = conf->mirrors[m].rdev;
- if (r1_bio->bios[m] == IO_MADE_GOOD) {
+ struct bio *bio = r1_bio->bios[m];
+
+ if (bio == IO_MADE_GOOD) {
rdev_clear_badblocks(rdev,
r1_bio->sector,
r1_bio->sectors, 0);
rdev_dec_pending(rdev, conf->mddev);
- } else if (r1_bio->bios[m] != NULL) {
+ } else if (bio != NULL) {
/* This drive got a write error. We need to
* narrow down and record precise write
* errors.
*/
fail = true;
- if (!narrow_write_error(r1_bio, m))
+ if (!narrow_write_error(
+ r1_bio, m,
+ test_bit(FailFast, &rdev->flags) &&
+ (bio->bi_opf & MD_FAILFAST)))
md_error(conf->mddev, rdev);
/* an I/O failed, we can't clear the bitmap */
else if (test_bit(In_sync, &rdev->flags) &&
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 21c2821453e1..92cf3047dce6 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2813,18 +2813,18 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10
* narrow_write_error() - Retry write and set badblock
* @r10_bio: the r10bio containing the write error
* @i: which device to retry
+ * @force: Retry writing even if badblock is disabled
*
* Rewrites the bio, splitting it at the least common multiple of the logical
* block size and the badblock size. Blocks that fail to be written are marked
- * as bad. If badblocks are disabled, no write is attempted and false is
- * returned immediately.
+ * as bad. If bbl disabled and @force is not set, no retry is attempted.
*
* Return:
* * %true - all blocks were written or marked bad successfully
* * %false - bbl disabled or
* one or more blocks write failed and could not be marked bad
*/
-static bool narrow_write_error(struct r10bio *r10_bio, int i)
+static bool narrow_write_error(struct r10bio *r10_bio, int i, bool force)
{
struct bio *bio = r10_bio->master_bio;
struct mddev *mddev = r10_bio->mddev;
@@ -2845,13 +2845,17 @@ static bool narrow_write_error(struct r10bio *r10_bio, int i)
sector_t sector;
int sectors;
int sect_to_write = r10_bio->sectors;
- bool ok = true;
+ bool write_ok = true;
+ bool setbad_ok = true;
+ bool bbl_enabled = !(rdev->badblocks.shift < 0);
- if (rdev->badblocks.shift < 0)
+ if (!force && !bbl_enabled)
return false;
- block_sectors = roundup(1 << rdev->badblocks.shift,
- bdev_logical_block_size(rdev->bdev) >> 9);
+ block_sectors = bdev_logical_block_size(rdev->bdev) >> 9;
+ if (bbl_enabled)
+ block_sectors = roundup(1 << rdev->badblocks.shift,
+ block_sectors);
sector = r10_bio->sector;
sectors = ((r10_bio->sector + block_sectors)
& ~(sector_t)(block_sectors - 1))
@@ -2871,18 +2875,22 @@ static bool narrow_write_error(struct r10bio *r10_bio, int i)
choose_data_offset(r10_bio, rdev);
wbio->bi_opf = REQ_OP_WRITE;
- if (submit_bio_wait(wbio) < 0)
+ if (submit_bio_wait(wbio) < 0) {
/* Failure! */
- ok = rdev_set_badblocks(rdev, wsector,
- sectors, 0)
- && ok;
+ write_ok = false;
+ if (bbl_enabled)
+ setbad_ok = rdev_set_badblocks(rdev, wsector,
+ sectors, 0)
+ && setbad_ok;
+ }
bio_put(wbio);
sect_to_write -= sectors;
sector += sectors;
sectors = block_sectors;
}
- return ok;
+ return (write_ok ||
+ (bbl_enabled && setbad_ok));
}
static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
@@ -2988,7 +2996,10 @@ static void handle_write_completed(struct r10conf *conf, struct r10bio *r10_bio)
rdev_dec_pending(rdev, conf->mddev);
} else if (bio != NULL && bio->bi_status) {
fail = true;
- if (!narrow_write_error(r10_bio, m))
+ if (!narrow_write_error(
+ r10_bio, m,
+ test_bit(FailFast, &rdev->flags) &&
+ (bio->bi_opf & MD_FAILFAST)))
md_error(conf->mddev, rdev);
else if (test_bit(In_sync, &rdev->flags) &&
!test_bit(Faulty, &rdev->flags) &&
--
2.50.1
* [PATCH v4 7/9] md/raid10: fix failfast read error not rescheduled
2025-09-15 3:42 [PATCH v4 0/9] Don't set MD_BROKEN on failfast bio failure Kenta Akagi
` (5 preceding siblings ...)
2025-09-15 3:42 ` [PATCH v4 6/9] md/raid1,raid10: Fix missing retries Failfast write bios on no-bbl rdevs Kenta Akagi
@ 2025-09-15 3:42 ` Kenta Akagi
2025-09-18 7:38 ` Li Nan
2025-09-15 3:42 ` [PATCH v4 8/9] md/raid1,raid10: Add error message when setting MD_BROKEN Kenta Akagi
2025-09-15 3:42 ` [PATCH v4 9/9] md/raid1,raid10: Fix: Operation continuing on 0 devices Kenta Akagi
8 siblings, 1 reply; 39+ messages in thread
From: Kenta Akagi @ 2025-09-15 3:42 UTC (permalink / raw)
To: Song Liu, Yu Kuai, Mariusz Tkaczyk, Shaohua Li, Guoqing Jiang
Cc: linux-raid, linux-kernel, Kenta Akagi
raid10_end_read_request lacks a path to retry when a FailFast IO fails.
As a result, when Failfast read IOs fail on all rdevs, the upper layer
receives EIO without the read being rescheduled.
Looking at the two commits below, it seems only raid10_end_read_request
lacks the failfast read retry handling, while raid1_end_read_request has
it. In RAID1, the retry works as expected.
* commit 8d3ca83dcf9c ("md/raid10: add failfast handling for reads.")
* commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
I don't know why raid10_end_read_request lacks this, but it is probably
just a simple oversight.
With this commit, a failfast read bio that fails on the last rdev in raid10
is retried.
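The completion decision after this change can be modelled roughly as follows (an illustrative user-space sketch, not the kernel code; names are assumptions):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative model (not kernel code) of the decision in
 * raid10_end_read_request() after this patch: a failed read on a
 * FailFast-marked rdev/bio is always handed back for rescheduling on
 * another rdev, matching what raid1_end_read_request() already does.
 */
static bool should_reschedule_read(bool uptodate, bool rdev_failfast,
				   bool bio_failfast, bool handle_error)
{
	if (uptodate)
		return false;	/* read completed successfully */
	if (rdev_failfast && bio_failfast)
		return true;	/* fail-fast read: always retry */
	return handle_error;	/* otherwise the existing policy applies */
}
```

The new branch is the middle one: previously a failed fail-fast read fell through to the existing policy and could complete with EIO instead of being rescheduled.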
Fixes: 8d3ca83dcf9c ("md/raid10: add failfast handling for reads.")
Signed-off-by: Kenta Akagi <k@mgml.me>
---
drivers/md/raid10.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 92cf3047dce6..86c0eacd37cb 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -399,6 +399,11 @@ static void raid10_end_read_request(struct bio *bio)
* wait for the 'master' bio.
*/
set_bit(R10BIO_Uptodate, &r10_bio->state);
+ } else if (test_bit(FailFast, &rdev->flags) &&
+ test_bit(R10BIO_FailFast, &r10_bio->state)) {
+ /* This was a fail-fast read so we definitely
+ * want to retry */
+ ;
} else if (!raid1_should_handle_error(bio)) {
uptodate = 1;
} else {
--
2.50.1
* [PATCH v4 8/9] md/raid1,raid10: Add error message when setting MD_BROKEN
2025-09-15 3:42 [PATCH v4 0/9] Don't set MD_BROKEN on failfast bio failure Kenta Akagi
` (6 preceding siblings ...)
2025-09-15 3:42 ` [PATCH v4 7/9] md/raid10: fix failfast read error not rescheduled Kenta Akagi
@ 2025-09-15 3:42 ` Kenta Akagi
2025-09-15 3:42 ` [PATCH v4 9/9] md/raid1,raid10: Fix: Operation continuing on 0 devices Kenta Akagi
8 siblings, 0 replies; 39+ messages in thread
From: Kenta Akagi @ 2025-09-15 3:42 UTC (permalink / raw)
To: Song Liu, Yu Kuai, Mariusz Tkaczyk, Shaohua Li, Guoqing Jiang
Cc: linux-raid, linux-kernel, Kenta Akagi
Once MD_BROKEN is set on an array, no further writes can be
performed to it.
The user must be informed that the array cannot continue operation.
Signed-off-by: Kenta Akagi <k@mgml.me>
---
drivers/md/raid1.c | 4 ++++
drivers/md/raid10.c | 4 ++++
2 files changed, 8 insertions(+)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 55213bcd82f4..febe2849a71a 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1786,6 +1786,10 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
if (test_bit(In_sync, &rdev->flags) &&
(conf->raid_disks - mddev->degraded) == 1) {
set_bit(MD_BROKEN, &mddev->flags);
+ pr_crit("md/raid1:%s: Disk failure on %pg, this is the last device.\n"
+ "md/raid1:%s: Cannot continue operation (%d/%d failed).\n",
+ mdname(mddev), rdev->bdev,
+ mdname(mddev), mddev->degraded + 1, conf->raid_disks);
if (!mddev->fail_last_dev) {
conf->recovery_disabled = mddev->recovery_disabled;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 86c0eacd37cb..be5fd77da3e1 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2039,6 +2039,10 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
if (test_bit(In_sync, &rdev->flags) && !enough(conf, rdev->raid_disk)) {
set_bit(MD_BROKEN, &mddev->flags);
+ pr_crit("md/raid10:%s: Disk failure on %pg, this is the last device.\n"
+ "md/raid10:%s: Cannot continue operation (%d/%d failed).\n",
+ mdname(mddev), rdev->bdev,
+ mdname(mddev), mddev->degraded + 1, conf->geo.raid_disks);
if (!mddev->fail_last_dev) {
spin_unlock_irqrestore(&conf->device_lock, flags);
--
2.50.1
* [PATCH v4 9/9] md/raid1,raid10: Fix: Operation continuing on 0 devices.
2025-09-15 3:42 [PATCH v4 0/9] Don't set MD_BROKEN on failfast bio failure Kenta Akagi
` (7 preceding siblings ...)
2025-09-15 3:42 ` [PATCH v4 8/9] md/raid1,raid10: Add error message when setting MD_BROKEN Kenta Akagi
@ 2025-09-15 3:42 ` Kenta Akagi
2025-09-15 7:19 ` Paul Menzel
8 siblings, 1 reply; 39+ messages in thread
From: Kenta Akagi @ 2025-09-15 3:42 UTC (permalink / raw)
To: Song Liu, Yu Kuai, Mariusz Tkaczyk, Shaohua Li, Guoqing Jiang
Cc: linux-raid, linux-kernel, Kenta Akagi
Since commit 9a567843f7ce ("md: allow last device to be forcibly
removed from RAID1/RAID10."), RAID1/10 arrays can now lose all rdevs.
Before that commit, losing the array's last rdev, or reaching the end of
raid{1,10}_error without an early return, never occurred. However, both
situations can occur in the current implementation.
As a result, when mddev->fail_last_dev is set, a spurious pr_crit
message can be printed.
This patch prevents the "Operation continuing" message from being printed
when the array is no longer operational.
root@fedora:~# mdadm --create --verbose /dev/md0 --level=1 \
--raid-devices=2 /dev/loop0 /dev/loop1
mdadm: Note: this array has metadata at the start and
may not be suitable as a boot device. If you plan to
store '/boot' on this device please ensure that
your boot-loader understands md/v1.x metadata, or use
--metadata=0.90
mdadm: size set to 1046528K
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
root@fedora:~# echo 1 > /sys/block/md0/md/fail_last_dev
root@fedora:~# mdadm --fail /dev/md0 loop0
mdadm: set loop0 faulty in /dev/md0
root@fedora:~# mdadm --fail /dev/md0 loop1
mdadm: set device faulty failed for loop1: Device or resource busy
root@fedora:~# dmesg | tail -n 4
[ 1314.359674] md/raid1:md0: Disk failure on loop0, disabling device.
md/raid1:md0: Operation continuing on 1 devices.
[ 1315.506633] md/raid1:md0: Disk failure on loop1, disabling device.
md/raid1:md0: Operation continuing on 0 devices.
root@fedora:~#
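The guard this patch adds around the pr_crit call reduces to a simple check, sketched here in user-space C (illustrative names, not the kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the message guard introduced by this patch: the
 * "Operation continuing on N devices." notice is only printed while at
 * least one non-degraded device remains. (For raid10 the real code uses
 * enough(conf, -1) instead of this subtraction.)
 */
static bool should_print_continuing(int raid_disks, int degraded)
{
	return (raid_disks - degraded) > 0;
}
```

In the transcript above, the second failure leaves 2 - 2 = 0 devices, so the guard suppresses the spurious "Operation continuing on 0 devices." line.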
Fixes: 9a567843f7ce ("md: allow last device to be forcibly removed from RAID1/RAID10.")
Signed-off-by: Kenta Akagi <k@mgml.me>
---
drivers/md/raid1.c | 9 +++++----
drivers/md/raid10.c | 9 +++++----
2 files changed, 10 insertions(+), 8 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index febe2849a71a..b3c845855841 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1803,6 +1803,11 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
update_lastdev(conf);
}
set_bit(Faulty, &rdev->flags);
+ if ((conf->raid_disks - mddev->degraded) > 0)
+ pr_crit("md/raid1:%s: Disk failure on %pg, disabling device.\n"
+ "md/raid1:%s: Operation continuing on %d devices.\n",
+ mdname(mddev), rdev->bdev,
+ mdname(mddev), conf->raid_disks - mddev->degraded);
spin_unlock_irqrestore(&conf->device_lock, flags);
/*
* if recovery is running, make sure it aborts.
@@ -1810,10 +1815,6 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
set_mask_bits(&mddev->sb_flags, 0,
BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
- pr_crit("md/raid1:%s: Disk failure on %pg, disabling device.\n"
- "md/raid1:%s: Operation continuing on %d devices.\n",
- mdname(mddev), rdev->bdev,
- mdname(mddev), conf->raid_disks - mddev->degraded);
}
static void print_conf(struct r1conf *conf)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index be5fd77da3e1..4f3ef43ebd2a 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2059,11 +2059,12 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
set_bit(Faulty, &rdev->flags);
set_mask_bits(&mddev->sb_flags, 0,
BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
+ if (enough(conf, -1))
+ pr_crit("md/raid10:%s: Disk failure on %pg, disabling device.\n"
+ "md/raid10:%s: Operation continuing on %d devices.\n",
+ mdname(mddev), rdev->bdev,
+ mdname(mddev), conf->geo.raid_disks - mddev->degraded);
spin_unlock_irqrestore(&conf->device_lock, flags);
- pr_crit("md/raid10:%s: Disk failure on %pg, disabling device.\n"
- "md/raid10:%s: Operation continuing on %d devices.\n",
- mdname(mddev), rdev->bdev,
- mdname(mddev), conf->geo.raid_disks - mddev->degraded);
}
static void print_conf(struct r10conf *conf)
--
2.50.1
* Re: [PATCH v4 9/9] md/raid1,raid10: Fix: Operation continuing on 0 devices.
2025-09-15 3:42 ` [PATCH v4 9/9] md/raid1,raid10: Fix: Operation continuing on 0 devices Kenta Akagi
@ 2025-09-15 7:19 ` Paul Menzel
2025-09-15 8:19 ` Kenta Akagi
0 siblings, 1 reply; 39+ messages in thread
From: Paul Menzel @ 2025-09-15 7:19 UTC (permalink / raw)
To: Kenta Akagi
Cc: Song Liu, Yu Kuai, Mariusz Tkaczyk, Shaohua Li, Guoqing Jiang,
linux-raid, linux-kernel
Dear Kenta,
Thank you for your patch, and well-written commit message and test.
Reading just the summary, I didn’t know what it’s about. Should you
resend, maybe:
> md/raid1,raid10: Fix logging `Operation continuing on 0 devices.`
Am 15.09.25 um 05:42 schrieb Kenta Akagi:
> Since commit 9a567843f7ce ("md: allow last device to be forcibly
> removed from RAID1/RAID10."), RAID1/10 arrays can now lose all rdevs.
>
> Before that commit, losing the array last rdev or reaching the end of
> the function without early return in raid{1,10}_error never occurred.
> However, both situations can occur in the current implementation.
>
> As a result, when mddev->fail_last_dev is set, a spurious pr_crit
> message can be printed.
>
> This patch prevents "Operation continuing" printed if the array
> is not operational.
>
> root@fedora:~# mdadm --create --verbose /dev/md0 --level=1 \
> --raid-devices=2 /dev/loop0 /dev/loop1
> mdadm: Note: this array has metadata at the start and
> may not be suitable as a boot device. If you plan to
> store '/boot' on this device please ensure that
> your boot-loader understands md/v1.x metadata, or use
> --metadata=0.90
> mdadm: size set to 1046528K
> Continue creating array? y
> mdadm: Defaulting to version 1.2 metadata
> mdadm: array /dev/md0 started.
> root@fedora:~# echo 1 > /sys/block/md0/md/fail_last_dev
> root@fedora:~# mdadm --fail /dev/md0 loop0
> mdadm: set loop0 faulty in /dev/md0
> root@fedora:~# mdadm --fail /dev/md0 loop1
> mdadm: set device faulty failed for loop1: Device or resource busy
> root@fedora:~# dmesg | tail -n 4
> [ 1314.359674] md/raid1:md0: Disk failure on loop0, disabling device.
> md/raid1:md0: Operation continuing on 1 devices.
> [ 1315.506633] md/raid1:md0: Disk failure on loop1, disabling device.
> md/raid1:md0: Operation continuing on 0 devices.
Shouldn’t the first line of the critical log still be printed, but your
patch would remove it?
> root@fedora:~#
>
> Fixes: 9a567843f7ce ("md: allow last device to be forcibly removed from RAID1/RAID10.")
> Signed-off-by: Kenta Akagi <k@mgml.me>
> ---
> drivers/md/raid1.c | 9 +++++----
> drivers/md/raid10.c | 9 +++++----
> 2 files changed, 10 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index febe2849a71a..b3c845855841 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1803,6 +1803,11 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
> update_lastdev(conf);
> }
> set_bit(Faulty, &rdev->flags);
> + if ((conf->raid_disks - mddev->degraded) > 0)
> + pr_crit("md/raid1:%s: Disk failure on %pg, disabling device.\n"
> + "md/raid1:%s: Operation continuing on %d devices.\n",
> + mdname(mddev), rdev->bdev,
> + mdname(mddev), conf->raid_disks - mddev->degraded);
> spin_unlock_irqrestore(&conf->device_lock, flags);
> /*
> * if recovery is running, make sure it aborts.
> @@ -1810,10 +1815,6 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
> set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> set_mask_bits(&mddev->sb_flags, 0,
> BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
> - pr_crit("md/raid1:%s: Disk failure on %pg, disabling device.\n"
> - "md/raid1:%s: Operation continuing on %d devices.\n",
> - mdname(mddev), rdev->bdev,
> - mdname(mddev), conf->raid_disks - mddev->degraded);
> }
>
> static void print_conf(struct r1conf *conf)
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index be5fd77da3e1..4f3ef43ebd2a 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -2059,11 +2059,12 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
> set_bit(Faulty, &rdev->flags);
> set_mask_bits(&mddev->sb_flags, 0,
> BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
> + if (enough(conf, -1))
> + pr_crit("md/raid10:%s: Disk failure on %pg, disabling device.\n"
> + "md/raid10:%s: Operation continuing on %d devices.\n",
> + mdname(mddev), rdev->bdev,
> + mdname(mddev), conf->geo.raid_disks - mddev->degraded);
> spin_unlock_irqrestore(&conf->device_lock, flags);
> - pr_crit("md/raid10:%s: Disk failure on %pg, disabling device.\n"
> - "md/raid10:%s: Operation continuing on %d devices.\n",
> - mdname(mddev), rdev->bdev,
> - mdname(mddev), conf->geo.raid_disks - mddev->degraded);
> }
>
> static void print_conf(struct r10conf *conf)
Kind regards,
Paul
* Re: [PATCH v4 9/9] md/raid1,raid10: Fix: Operation continuing on 0 devices.
2025-09-15 7:19 ` Paul Menzel
@ 2025-09-15 8:19 ` Kenta Akagi
0 siblings, 0 replies; 39+ messages in thread
From: Kenta Akagi @ 2025-09-15 8:19 UTC (permalink / raw)
To: Paul Menzel
Cc: Song Liu, Yu Kuai, Mariusz Tkaczyk, Shaohua Li, Guoqing Jiang,
linux-raid, linux-kernel, Kenta Akagi
Hi,
Thank you for reviewing.
On 2025/09/15 16:19, Paul Menzel wrote:
> Dear Kenta,
>
>
> Thank you for your patch, and well-written commit message and test. Reading just the summary, I didn’t know what it’s about. Should you resend, maybe:
>
>> md/raid1,raid10: Fix logging `Operation continuing on 0 devices.`
Understood.
I'll fix the commit message and title as you advised.
>
> Am 15.09.25 um 05:42 schrieb Kenta Akagi:
>> Since commit 9a567843f7ce ("md: allow last device to be forcibly
>> removed from RAID1/RAID10."), RAID1/10 arrays can now lose all rdevs.
>>
>> Before that commit, losing the array last rdev or reaching the end of
>> the function without early return in raid{1,10}_error never occurred.
>> However, both situations can occur in the current implementation.
>>
>> As a result, when mddev->fail_last_dev is set, a spurious pr_crit
>> message can be printed.
>>
>> This patch prevents "Operation continuing" printed if the array
>> is not operational.
>>
>> root@fedora:~# mdadm --create --verbose /dev/md0 --level=1 \
>> --raid-devices=2 /dev/loop0 /dev/loop1
>> mdadm: Note: this array has metadata at the start and
>> may not be suitable as a boot device. If you plan to
>> store '/boot' on this device please ensure that
>> your boot-loader understands md/v1.x metadata, or use
>> --metadata=0.90
>> mdadm: size set to 1046528K
>> Continue creating array? y
>> mdadm: Defaulting to version 1.2 metadata
>> mdadm: array /dev/md0 started.
>> root@fedora:~# echo 1 > /sys/block/md0/md/fail_last_dev
>> root@fedora:~# mdadm --fail /dev/md0 loop0
>> mdadm: set loop0 faulty in /dev/md0
>> root@fedora:~# mdadm --fail /dev/md0 loop1
>> mdadm: set device faulty failed for loop1: Device or resource busy
>> root@fedora:~# dmesg | tail -n 4
>> [ 1314.359674] md/raid1:md0: Disk failure on loop0, disabling device.
>> md/raid1:md0: Operation continuing on 1 devices.
>> [ 1315.506633] md/raid1:md0: Disk failure on loop1, disabling device.
>> md/raid1:md0: Operation continuing on 0 devices.
>
> Shouldn’t the first line of the critical log still be printed, but your patch would remove it?
>
"Operation continuing on 1 devices." will remain after this patch.
"Operation continuing on 0 devices." will be removed.
I thought that since Patch 8 in this series introduces the message
"Cannot continue operation", the message "Continuing operation on 0 devices"
would no longer be necessary.
However, the conditions under which the messages are output differ slightly.
"Cannot continue operation" is output when md_error is called on the last
rdev and MD_BROKEN is set.
"Operation continuing on 0 devices." is output when, after md_error is called
on the last rdev and MD_BROKEN is set, the rdev becomes Faulty because
fail_last_dev is set. In both cases, there is no difference in the fact that
the array cannot operate normally, but the state of the rdevs and whether it
can still be read are different.
Would it be better to only adjust the message itself?
e.g. "Operation cannot continue, there are 0 devices."
Or, since setting fail_last_dev is presumably done only by power users,
perhaps it is acceptable to leave this odd "continuing on 0 device" message as is.
Thanks,
Akagi
>> root@fedora:~#
>>
>> Fixes: 9a567843f7ce ("md: allow last device to be forcibly removed from RAID1/RAID10.")
>> Signed-off-by: Kenta Akagi <k@mgml.me>
>> ---
>> drivers/md/raid1.c | 9 +++++----
>> drivers/md/raid10.c | 9 +++++----
>> 2 files changed, 10 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index febe2849a71a..b3c845855841 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -1803,6 +1803,11 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>> update_lastdev(conf);
>> }
>> set_bit(Faulty, &rdev->flags);
>> + if ((conf->raid_disks - mddev->degraded) > 0)
>> + pr_crit("md/raid1:%s: Disk failure on %pg, disabling device.\n"
>> + "md/raid1:%s: Operation continuing on %d devices.\n",
>> + mdname(mddev), rdev->bdev,
>> + mdname(mddev), conf->raid_disks - mddev->degraded);
>> spin_unlock_irqrestore(&conf->device_lock, flags);
>> /*
>> * if recovery is running, make sure it aborts.
>> @@ -1810,10 +1815,6 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>> set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>> set_mask_bits(&mddev->sb_flags, 0,
>> BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
>> - pr_crit("md/raid1:%s: Disk failure on %pg, disabling device.\n"
>> - "md/raid1:%s: Operation continuing on %d devices.\n",
>> - mdname(mddev), rdev->bdev,
>> - mdname(mddev), conf->raid_disks - mddev->degraded);
>> }
>> static void print_conf(struct r1conf *conf)
>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>> index be5fd77da3e1..4f3ef43ebd2a 100644
>> --- a/drivers/md/raid10.c
>> +++ b/drivers/md/raid10.c
>> @@ -2059,11 +2059,12 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
>> set_bit(Faulty, &rdev->flags);
>> set_mask_bits(&mddev->sb_flags, 0,
>> BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
>> + if (enough(conf, -1))
>> + pr_crit("md/raid10:%s: Disk failure on %pg, disabling device.\n"
>> + "md/raid10:%s: Operation continuing on %d devices.\n",
>> + mdname(mddev), rdev->bdev,
>> + mdname(mddev), conf->geo.raid_disks - mddev->degraded);
>> spin_unlock_irqrestore(&conf->device_lock, flags);
>> - pr_crit("md/raid10:%s: Disk failure on %pg, disabling device.\n"
>> - "md/raid10:%s: Operation continuing on %d devices.\n",
>> - mdname(mddev), rdev->bdev,
>> - mdname(mddev), conf->geo.raid_disks - mddev->degraded);
>> }
>> static void print_conf(struct r10conf *conf)
>
>
> Kind regards,
>
> Paul
>
* Re: [PATCH v4 5/9] md/raid1,raid10: Set R{1,10}BIO_Uptodate when successful retry of a failed bio
2025-09-15 3:42 ` [PATCH v4 5/9] md/raid1,raid10: Set R{1,10}BIO_Uptodate when successful retry of a failed bio Kenta Akagi
@ 2025-09-17 9:24 ` Li Nan
2025-09-17 13:20 ` Kenta Akagi
0 siblings, 1 reply; 39+ messages in thread
From: Li Nan @ 2025-09-17 9:24 UTC (permalink / raw)
To: Kenta Akagi, Song Liu, Yu Kuai, Mariusz Tkaczyk, Shaohua Li,
Guoqing Jiang
Cc: linux-raid, linux-kernel
在 2025/9/15 11:42, Kenta Akagi 写道:
> In the current implementation, when a write bio fails, the retry flow
> is as follows:
> * In bi_end_io, e.g. raid1_end_write_request, R1BIO_WriteError is
> set on the r1bio.
> * The md thread calls handle_write_finished corresponding to this r1bio.
> * Inside handle_write_finished, narrow_write_error is invoked.
> * narrow_write_error rewrites the r1bio on a per-sector basis, marking
> any failed sectors as badblocks. It returns true if all sectors succeed,
> or if failed sectors are successfully recorded via rdev_set_badblock.
> It returns false if rdev_set_badblock fails or if badblocks are disabled.
> * handle_write_finished faults the rdev if it receives false from
> narrow_write_error. Otherwise, it does nothing.
>
> This can cause a problem where an r1bio that succeeded on retry is
> incorrectly reported as failed to the higher layer, for example in the
> following case:
> * Only one In_sync rdev exists, and
> * The write bio initially failed but all retries in
> narrow_write_error succeed.
>
> This commit ensures that if a write initially fails but all retries in
> narrow_write_error succeed, R1BIO_Uptodate or R10BIO_Uptodate is set
> and the higher layer receives a successful write status.
>
> Signed-off-by: Kenta Akagi <k@mgml.me>
> ---
> drivers/md/raid1.c | 32 ++++++++++++++++++++++++++------
> drivers/md/raid10.c | 21 +++++++++++++++++++++
> 2 files changed, 47 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 8fff9dacc6e0..806f5cb33a8e 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -2517,6 +2517,21 @@ static void fix_read_error(struct r1conf *conf, struct r1bio *r1_bio)
> }
> }
>
> +/**
> + * narrow_write_error() - Retry write and set badblock
> + * @r1_bio: the r1bio containing the write error
> + * @i: which device to retry
> + *
> + * Rewrites the bio, splitting it at the least common multiple of the logical
> + * block size and the badblock size. Blocks that fail to be written are marked
> + * as bad. If badblocks are disabled, no write is attempted and false is
> + * returned immediately.
> + *
> + * Return:
> + * * %true - all blocks were written or marked bad successfully
> + * * %false - bbl disabled or
> + * one or more block writes failed and could not be marked bad
> + */
> static bool narrow_write_error(struct r1bio *r1_bio, int i)
> {
> struct mddev *mddev = r1_bio->mddev;
> @@ -2614,9 +2629,9 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
> int m, idx;
> bool fail = false;
>
> - for (m = 0; m < conf->raid_disks * 2 ; m++)
> + for (m = 0; m < conf->raid_disks * 2 ; m++) {
> + struct md_rdev *rdev = conf->mirrors[m].rdev;
> if (r1_bio->bios[m] == IO_MADE_GOOD) {
> - struct md_rdev *rdev = conf->mirrors[m].rdev;
> rdev_clear_badblocks(rdev,
> r1_bio->sector,
> r1_bio->sectors, 0);
> @@ -2628,12 +2643,17 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
> */
> fail = true;
'fail' should be false when the re-write is successful.
> if (!narrow_write_error(r1_bio, m))
> - md_error(conf->mddev,
> - conf->mirrors[m].rdev);
> + md_error(conf->mddev, rdev);
> /* an I/O failed, we can't clear the bitmap */
> - rdev_dec_pending(conf->mirrors[m].rdev,
> - conf->mddev);
> + else if (test_bit(In_sync, &rdev->flags) &&
> + !test_bit(Faulty, &rdev->flags) &&
> + rdev_has_badblock(rdev,
> + r1_bio->sector,
> + r1_bio->sectors) == 0)
Clear the badblock and set R1BIO_Uptodate if the rdev has a badblock.
> + set_bit(R1BIO_Uptodate, &r1_bio->state);
> + rdev_dec_pending(rdev, conf->mddev);
> }
> + }
> if (fail) {
> spin_lock_irq(&conf->device_lock);
> list_add(&r1_bio->retry_list, &conf->bio_end_io_list);
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index b73af94a88b0..21c2821453e1 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -2809,6 +2809,21 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10
> }
> }
>
> +/**
> + * narrow_write_error() - Retry write and set badblock
> + * @r10_bio: the r10bio containing the write error
> + * @i: which device to retry
> + *
> + * Rewrites the bio, splitting it at the least common multiple of the logical
> + * block size and the badblock size. Blocks that fail to be written are marked
> + * as bad. If badblocks are disabled, no write is attempted and false is
> + * returned immediately.
> + *
> + * Return:
> + * * %true - all blocks were written or marked bad successfully
> + * * %false - bbl disabled or
> + * one or more block writes failed and could not be marked bad
> + */
> static bool narrow_write_error(struct r10bio *r10_bio, int i)
> {
> struct bio *bio = r10_bio->master_bio;
> @@ -2975,6 +2990,12 @@ static void handle_write_completed(struct r10conf *conf, struct r10bio *r10_bio)
> fail = true;
> if (!narrow_write_error(r10_bio, m))
> md_error(conf->mddev, rdev);
> + else if (test_bit(In_sync, &rdev->flags) &&
> + !test_bit(Faulty, &rdev->flags) &&
> + rdev_has_badblock(rdev,
> + r10_bio->devs[m].addr,
> + r10_bio->sectors) == 0)
Same as raid1.
> + set_bit(R10BIO_Uptodate, &r10_bio->state);
> rdev_dec_pending(rdev, conf->mddev);
> }
> bio = r10_bio->devs[m].repl_bio;
--
Thanks,
Nan
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 6/9] md/raid1,raid10: Fix missing retries Failfast write bios on no-bbl rdevs
2025-09-15 3:42 ` [PATCH v4 6/9] md/raid1,raid10: Fix missing retries Failfast write bios on no-bbl rdevs Kenta Akagi
@ 2025-09-17 10:06 ` Li Nan
2025-09-17 13:33 ` Kenta Akagi
0 siblings, 1 reply; 39+ messages in thread
From: Li Nan @ 2025-09-17 10:06 UTC (permalink / raw)
To: Kenta Akagi, Song Liu, Yu Kuai, Mariusz Tkaczyk, Shaohua Li,
Guoqing Jiang
Cc: linux-raid, linux-kernel
On 2025/9/15 11:42, Kenta Akagi wrote:
> In the current implementation, write failures are not retried on rdevs
> with badblocks disabled. This is because narrow_write_error, which issues
> retry bios, immediately returns when badblocks are disabled. As a result,
> a single write failure on such an rdev will immediately mark it as Faulty.
>
IMO, there's no need to add extra logic for scenarios where badblocks
is not enabled. Do you have real-world scenarios where badblocks is
disabled?
> The retry mechanism appears to have been implemented under the assumption
> that a bad block is involved in the failure. However, the retry after
> MD_FAILFAST write failure depends on this code, and a Failfast write request
> may fail for reasons unrelated to bad blocks.
>
> Consequently, if failfast is enabled and badblocks are disabled on all
> rdevs, and all rdevs encounter a failfast write bio failure at the same
> time, no retries will occur and the entire array can be lost.
>
> This commit adds a path in narrow_write_error to retry writes even on rdevs
> where bad blocks are disabled, and failed bios marked with MD_FAILFAST will
> use this path. For non-failfast cases, the behavior remains unchanged: no
> retry writes are attempted to rdevs with bad blocks disabled.
>
> Fixes: 1919cbb23bf1 ("md/raid10: add failfast handling for writes.")
> Fixes: 212e7eb7a340 ("md/raid1: add failfast handling for writes.")
> Signed-off-by: Kenta Akagi <k@mgml.me>
> ---
> drivers/md/raid1.c | 44 +++++++++++++++++++++++++++++---------------
> drivers/md/raid10.c | 37 ++++++++++++++++++++++++-------------
> 2 files changed, 53 insertions(+), 28 deletions(-)
> !test_bit(Faulty, &rdev->flags) &&
--
Thanks,
Nan
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 5/9] md/raid1,raid10: Set R{1,10}BIO_Uptodate when successful retry of a failed bio
2025-09-17 9:24 ` Li Nan
@ 2025-09-17 13:20 ` Kenta Akagi
2025-09-18 6:39 ` Li Nan
0 siblings, 1 reply; 39+ messages in thread
From: Kenta Akagi @ 2025-09-17 13:20 UTC (permalink / raw)
To: linan666, song, yukuai3, mtkaczyk, shli, jgq516
Cc: linux-raid, linux-kernel, k
Hi,
Thank you for reviewing.
On 2025/09/17 18:24, Li Nan wrote:
>
>
> On 2025/9/15 11:42, Kenta Akagi wrote:
>> In the current implementation, when a write bio fails, the retry flow
>> is as follows:
>> * In bi_end_io, e.g. raid1_end_write_request, R1BIO_WriteError is
>> set on the r1bio.
>> * The md thread calls handle_write_finished corresponding to this r1bio.
>> * Inside handle_write_finished, narrow_write_error is invoked.
>> * narrow_write_error rewrites the r1bio on a per-sector basis, marking
>> any failed sectors as badblocks. It returns true if all sectors succeed,
>> or if failed sectors are successfully recorded via rdev_set_badblocks.
>> It returns false if rdev_set_badblocks fails or if badblocks are disabled.
>> * handle_write_finished faults the rdev if it receives false from
>> narrow_write_error. Otherwise, it does nothing.
>>
>> This can cause a problem where an r1bio that succeeded on retry is
>> incorrectly reported as failed to the higher layer, for example in the
>> following case:
>> * Only one In_sync rdev exists, and
>> * The write bio initially failed but all retries in
>> narrow_write_error succeed.
>>
>> This commit ensures that if a write initially fails but all retries in
>> narrow_write_error succeed, R1BIO_Uptodate or R10BIO_Uptodate is set
>> and the higher layer receives a successful write status.
>>
>> Signed-off-by: Kenta Akagi <k@mgml.me>
>> ---
>> drivers/md/raid1.c | 32 ++++++++++++++++++++++++++------
>> drivers/md/raid10.c | 21 +++++++++++++++++++++
>> 2 files changed, 47 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index 8fff9dacc6e0..806f5cb33a8e 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -2517,6 +2517,21 @@ static void fix_read_error(struct r1conf *conf, struct r1bio *r1_bio)
>> }
>> }
>> +/**
>> + * narrow_write_error() - Retry write and set badblock
>> + * @r1_bio: the r1bio containing the write error
>> + * @i: which device to retry
>> + *
>> + * Rewrites the bio, splitting it at the least common multiple of the logical
>> + * block size and the badblock size. Blocks that fail to be written are marked
>> + * as bad. If badblocks are disabled, no write is attempted and false is
>> + * returned immediately.
>> + *
>> + * Return:
>> + * * %true - all blocks were written or marked bad successfully
>> + * * %false - bbl disabled or
>> + * one or more block writes failed and could not be marked bad
>> + */
>> static bool narrow_write_error(struct r1bio *r1_bio, int i)
>> {
>> struct mddev *mddev = r1_bio->mddev;
>> @@ -2614,9 +2629,9 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
>> int m, idx;
>> bool fail = false;
>> - for (m = 0; m < conf->raid_disks * 2 ; m++)
>> + for (m = 0; m < conf->raid_disks * 2 ; m++) {
>> + struct md_rdev *rdev = conf->mirrors[m].rdev;
>> if (r1_bio->bios[m] == IO_MADE_GOOD) {
>> - struct md_rdev *rdev = conf->mirrors[m].rdev;
>> rdev_clear_badblocks(rdev,
>> r1_bio->sector,
>> r1_bio->sectors, 0);
>> @@ -2628,12 +2643,17 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
>> */
>> fail = true;
>
> 'fail' should be false when the re-write is successful.
Thanks, it seems there is no need to handle it via bio_end_io_list when the re-write is successful.
I will fix it.
>
>> if (!narrow_write_error(r1_bio, m))
>> - md_error(conf->mddev,
>> - conf->mirrors[m].rdev);
>> + md_error(conf->mddev, rdev);
>> /* an I/O failed, we can't clear the bitmap */
>> - rdev_dec_pending(conf->mirrors[m].rdev,
>> - conf->mddev);
>> + else if (test_bit(In_sync, &rdev->flags) &&
>> + !test_bit(Faulty, &rdev->flags) &&
>> + rdev_has_badblock(rdev,
>> + r1_bio->sector,
>> + r1_bio->sectors) == 0)
>
> Clear the badblock and set R1BIO_Uptodate if the rdev has a badblock.
narrow_write_error returns true when the write succeeds, or when the write
fails but rdev_set_badblocks succeeds. Here, the re-write is treated as
successful only if there is no badblock in the sectors written by r1_bio,
so we should not call rdev_clear_badblocks here.
>
>> + set_bit(R1BIO_Uptodate, &r1_bio->state);
>> + rdev_dec_pending(rdev, conf->mddev);
>> }
>> + }
>> if (fail) {
>> spin_lock_irq(&conf->device_lock);
>> list_add(&r1_bio->retry_list, &conf->bio_end_io_list);
>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>> index b73af94a88b0..21c2821453e1 100644
>> --- a/drivers/md/raid10.c
>> +++ b/drivers/md/raid10.c
>> @@ -2809,6 +2809,21 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10
>> }
>> }
>> +/**
>> + * narrow_write_error() - Retry write and set badblock
>> + * @r10_bio: the r10bio containing the write error
>> + * @i: which device to retry
>> + *
>> + * Rewrites the bio, splitting it at the least common multiple of the logical
>> + * block size and the badblock size. Blocks that fail to be written are marked
>> + * as bad. If badblocks are disabled, no write is attempted and false is
>> + * returned immediately.
>> + *
>> + * Return:
>> + * * %true - all blocks were written or marked bad successfully
>> + * * %false - bbl disabled or
>> + * one or more block writes failed and could not be marked bad
>> + */
>> static bool narrow_write_error(struct r10bio *r10_bio, int i)
>> {
>> struct bio *bio = r10_bio->master_bio;
>> @@ -2975,6 +2990,12 @@ static void handle_write_completed(struct r10conf *conf, struct r10bio *r10_bio)
>> fail = true;
>> if (!narrow_write_error(r10_bio, m))
>> md_error(conf->mddev, rdev);
>> + else if (test_bit(In_sync, &rdev->flags) &&
>> + !test_bit(Faulty, &rdev->flags) &&
>> + rdev_has_badblock(rdev,
>> + r10_bio->devs[m].addr,
>> + r10_bio->sectors) == 0)
>
> Same as raid1.
I think it is the same as RAID1: we should not call rdev_clear_badblocks.
Thanks,
Akagi
>
>> + set_bit(R10BIO_Uptodate, &r10_bio->state);
>> rdev_dec_pending(rdev, conf->mddev);
>> }
>> bio = r10_bio->devs[m].repl_bio;
>
> --
> Thanks,
> Nan
>
>
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 6/9] md/raid1,raid10: Fix missing retries Failfast write bios on no-bbl rdevs
2025-09-17 10:06 ` Li Nan
@ 2025-09-17 13:33 ` Kenta Akagi
2025-09-18 6:58 ` Li Nan
0 siblings, 1 reply; 39+ messages in thread
From: Kenta Akagi @ 2025-09-17 13:33 UTC (permalink / raw)
To: linan666, song, yukuai3, mtkaczyk, shli, jgq516
Cc: linux-raid, linux-kernel, k
On 2025/09/17 19:06, Li Nan wrote:
>
>
> On 2025/9/15 11:42, Kenta Akagi wrote:
>> In the current implementation, write failures are not retried on rdevs
>> with badblocks disabled. This is because narrow_write_error, which issues
>> retry bios, immediately returns when badblocks are disabled. As a result,
>> a single write failure on such an rdev will immediately mark it as Faulty.
>>
>
> IMO, there's no need to add extra logic for scenarios where badblocks
> is not enabled. Do you have real-world scenarios where badblocks is
> disabled?
No, badblocks are enabled in my environment.
I'm fine if it's not added, but I still think it's worth adding a WARN_ON like:
@@ -2553,13 +2554,17 @@ static bool narrow_write_error(struct r1bio *r1_bio, int i)
fail = true;
+ WARN_ON( (bio->bi_opf & MD_FAILFAST) && (rdev->badblocks.shift < 0) );
if (!narrow_write_error(r1_bio, m))
What do you think?
Thanks,
Akagi
>> The retry mechanism appears to have been implemented under the assumption
>> that a bad block is involved in the failure. However, the retry after
>> MD_FAILFAST write failure depends on this code, and a Failfast write request
>> may fail for reasons unrelated to bad blocks.
>>
>> Consequently, if failfast is enabled and badblocks are disabled on all
>> rdevs, and all rdevs encounter a failfast write bio failure at the same
>> time, no retries will occur and the entire array can be lost.
>>
>> This commit adds a path in narrow_write_error to retry writes even on rdevs
>> where bad blocks are disabled, and failed bios marked with MD_FAILFAST will
>> use this path. For non-failfast cases, the behavior remains unchanged: no
>> retry writes are attempted to rdevs with bad blocks disabled.
>>
>> Fixes: 1919cbb23bf1 ("md/raid10: add failfast handling for writes.")
>> Fixes: 212e7eb7a340 ("md/raid1: add failfast handling for writes.")
>> Signed-off-by: Kenta Akagi <k@mgml.me>
>> ---
>> drivers/md/raid1.c | 44 +++++++++++++++++++++++++++++---------------
>> drivers/md/raid10.c | 37 ++++++++++++++++++++++++-------------
>> 2 files changed, 53 insertions(+), 28 deletions(-)
>> !test_bit(Faulty, &rdev->flags) &&
>
> --
> Thanks,
> Nan
>
>
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 1/9] md/raid1,raid10: Set the LastDev flag when the configuration changes
2025-09-15 3:42 ` [PATCH v4 1/9] md/raid1,raid10: Set the LastDev flag when the configuration changes Kenta Akagi
@ 2025-09-18 1:00 ` Yu Kuai
2025-09-18 14:02 ` Kenta Akagi
2025-09-21 7:54 ` Xiao Ni
1 sibling, 1 reply; 39+ messages in thread
From: Yu Kuai @ 2025-09-18 1:00 UTC (permalink / raw)
To: Kenta Akagi, Song Liu, Mariusz Tkaczyk, Shaohua Li, Guoqing Jiang
Cc: linux-raid, linux-kernel, yukuai (C)
Hi,
On 2025/09/15 11:42, Kenta Akagi wrote:
> Currently, the LastDev flag is set on an rdev that failed a failfast
> metadata write and called md_error, but did not become Faulty. It is
> cleared when the metadata write retry succeeds. This has problems for
> the following reasons:
>
> * Despite its name, the flag is only set during a metadata write window.
> * Unlike when LastDev and Failfast were introduced, md_error on the last
> rdev of a RAID1/10 array now sets MD_BROKEN. Thus when LastDev is set,
> the array is already unwritable.
>
> A following commit will prevent failfast bios from breaking the array,
> which requires knowing from outside the personality whether an rdev is
> the last one. For that purpose, LastDev should be set on rdevs that must
> not be lost.
>
> This commit ensures that LastDev is set on the indispensable rdev in a
> degraded RAID1/10 array.
>
> Signed-off-by: Kenta Akagi <k@mgml.me>
> ---
> drivers/md/md.c | 4 +---
> drivers/md/md.h | 6 +++---
> drivers/md/raid1.c | 34 +++++++++++++++++++++++++++++++++-
> drivers/md/raid10.c | 34 +++++++++++++++++++++++++++++++++-
> 4 files changed, 70 insertions(+), 8 deletions(-)
>
After md_error() is serialized, why not check whether the rdev is the last
rdev directly from md_bio_failure_error() in patch 3? I think it is cleaner,
and the code changes should be much smaller, to add a new method like
pers->lastdev.
Thanks,
Kuai
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 4e033c26fdd4..268410b66b83 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -1007,10 +1007,8 @@ static void super_written(struct bio *bio)
> if (!test_bit(Faulty, &rdev->flags)
> && (bio->bi_opf & MD_FAILFAST)) {
> set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
> - set_bit(LastDev, &rdev->flags);
> }
> - } else
> - clear_bit(LastDev, &rdev->flags);
> + }
>
> bio_put(bio);
>
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 51af29a03079..ec598f9a8381 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -281,9 +281,9 @@ enum flag_bits {
> * It is expects that no bad block log
> * is present.
> */
> - LastDev, /* Seems to be the last working dev as
> - * it didn't fail, so don't use FailFast
> - * any more for metadata
> + LastDev, /* This is the last working rdev.
> + * so don't use FailFast any more for
> + * metadata.
> */
> CollisionCheck, /*
> * check if there is collision between raid1
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index bf44878ec640..32ad6b102ff7 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1733,6 +1733,33 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
> seq_printf(seq, "]");
> }
>
> +/**
> + * update_lastdev - Set or clear LastDev flag for all rdevs in array
> + * @conf: pointer to r1conf
> + *
> + * Sets LastDev if the device is In_sync and cannot be lost for the array.
> + * Otherwise, clears it.
> + *
> + * Caller must hold ->device_lock.
> + */
> +static void update_lastdev(struct r1conf *conf)
> +{
> + int i;
> + int alive_disks = conf->raid_disks - conf->mddev->degraded;
> +
> + for (i = 0; i < conf->raid_disks; i++) {
> + struct md_rdev *rdev = conf->mirrors[i].rdev;
> +
> + if (rdev) {
> + if (test_bit(In_sync, &rdev->flags) &&
> + alive_disks == 1)
> + set_bit(LastDev, &rdev->flags);
> + else
> + clear_bit(LastDev, &rdev->flags);
> + }
> + }
> +}
> +
> /**
> * raid1_error() - RAID1 error handler.
> * @mddev: affected md device.
> @@ -1767,8 +1794,10 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
> }
> }
> set_bit(Blocked, &rdev->flags);
> - if (test_and_clear_bit(In_sync, &rdev->flags))
> + if (test_and_clear_bit(In_sync, &rdev->flags)) {
> mddev->degraded++;
> + update_lastdev(conf);
> + }
> set_bit(Faulty, &rdev->flags);
> spin_unlock_irqrestore(&conf->device_lock, flags);
> /*
> @@ -1864,6 +1893,7 @@ static int raid1_spare_active(struct mddev *mddev)
> }
> }
> mddev->degraded -= count;
> + update_lastdev(conf);
> spin_unlock_irqrestore(&conf->device_lock, flags);
>
> print_conf(conf);
> @@ -3290,6 +3320,7 @@ static int raid1_run(struct mddev *mddev)
> rcu_assign_pointer(conf->thread, NULL);
> mddev->private = conf;
> set_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
> + update_lastdev(conf);
>
> md_set_array_sectors(mddev, raid1_size(mddev, 0, 0));
>
> @@ -3427,6 +3458,7 @@ static int raid1_reshape(struct mddev *mddev)
>
> spin_lock_irqsave(&conf->device_lock, flags);
> mddev->degraded += (raid_disks - conf->raid_disks);
> + update_lastdev(conf);
> spin_unlock_irqrestore(&conf->device_lock, flags);
> conf->raid_disks = mddev->raid_disks = raid_disks;
> mddev->delta_disks = 0;
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index b60c30bfb6c7..dc4edd4689f8 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -1983,6 +1983,33 @@ static int enough(struct r10conf *conf, int ignore)
> _enough(conf, 1, ignore);
> }
>
> +/**
> + * update_lastdev - Set or clear LastDev flag for all rdevs in array
> + * @conf: pointer to r10conf
> + *
> + * Sets LastDev if the device is In_sync and cannot be lost for the array.
> + * Otherwise, clear it.
> + *
> + * Caller must hold ->reconfig_mutex or ->device_lock.
> + */
> +static void update_lastdev(struct r10conf *conf)
> +{
> + int i;
> + int raid_disks = max(conf->geo.raid_disks, conf->prev.raid_disks);
> +
> + for (i = 0; i < raid_disks; i++) {
> + struct md_rdev *rdev = conf->mirrors[i].rdev;
> +
> + if (rdev) {
> + if (test_bit(In_sync, &rdev->flags) &&
> + !enough(conf, i))
> + set_bit(LastDev, &rdev->flags);
> + else
> + clear_bit(LastDev, &rdev->flags);
> + }
> + }
> +}
> +
> /**
> * raid10_error() - RAID10 error handler.
> * @mddev: affected md device.
> @@ -2013,8 +2040,10 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
> return;
> }
> }
> - if (test_and_clear_bit(In_sync, &rdev->flags))
> + if (test_and_clear_bit(In_sync, &rdev->flags)) {
> mddev->degraded++;
> + update_lastdev(conf);
> + }
>
> set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> set_bit(Blocked, &rdev->flags);
> @@ -2102,6 +2131,7 @@ static int raid10_spare_active(struct mddev *mddev)
> }
> spin_lock_irqsave(&conf->device_lock, flags);
> mddev->degraded -= count;
> + update_lastdev(conf);
> spin_unlock_irqrestore(&conf->device_lock, flags);
>
> print_conf(conf);
> @@ -4159,6 +4189,7 @@ static int raid10_run(struct mddev *mddev)
> md_set_array_sectors(mddev, size);
> mddev->resync_max_sectors = size;
> set_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
> + update_lastdev(conf);
>
> if (md_integrity_register(mddev))
> goto out_free_conf;
> @@ -4567,6 +4598,7 @@ static int raid10_start_reshape(struct mddev *mddev)
> */
> spin_lock_irq(&conf->device_lock);
> mddev->degraded = calc_degraded(conf);
> + update_lastdev(conf);
> spin_unlock_irq(&conf->device_lock);
> mddev->raid_disks = conf->geo.raid_disks;
> mddev->reshape_position = conf->reshape_progress;
>
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 2/9] md: serialize md_error()
2025-09-15 3:42 ` [PATCH v4 2/9] md: serialize md_error() Kenta Akagi
@ 2025-09-18 1:04 ` Yu Kuai
2025-09-21 6:11 ` Kenta Akagi
0 siblings, 1 reply; 39+ messages in thread
From: Yu Kuai @ 2025-09-18 1:04 UTC (permalink / raw)
To: Kenta Akagi, Song Liu, Mariusz Tkaczyk, Shaohua Li, Guoqing Jiang
Cc: linux-raid, linux-kernel, yukuai (C)
Hi,
On 2025/09/15 11:42, Kenta Akagi wrote:
> md_error is mainly called when a bio fails, so it can run in parallel.
> Each personality’s error_handler locks with device_lock, so concurrent
> calls are safe.
>
> However, RAID1 and RAID10 require changes for Failfast bio error handling,
> which needs a special helper for md_error. For that helper to work, the
> regular md_error must also be serialized.
>
> The helper function md_bio_failure_error for failfast will be introduced
> in a subsequent commit.
>
> This commit serializes md_error for all RAID personalities. While
> unnecessary for RAID levels other than 1 and 10, it has no performance
> impact as it is a cold path.
>
> Signed-off-by: Kenta Akagi <k@mgml.me>
> ---
> drivers/md/md.c | 10 +++++++++-
> drivers/md/md.h | 4 ++++
> 2 files changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 268410b66b83..5607578a6db9 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -705,6 +705,7 @@ int mddev_init(struct mddev *mddev)
> atomic_set(&mddev->openers, 0);
> atomic_set(&mddev->sync_seq, 0);
> spin_lock_init(&mddev->lock);
> + spin_lock_init(&mddev->error_handle_lock);
Instead of introducing a new lock, can we use device_lock directly?
It's held inside pers->error_handler() now; just move it forward
to md_error().
Thanks,
Kuai
> init_waitqueue_head(&mddev->sb_wait);
> init_waitqueue_head(&mddev->recovery_wait);
> mddev->reshape_position = MaxSector;
> @@ -8262,7 +8263,7 @@ void md_unregister_thread(struct mddev *mddev, struct md_thread __rcu **threadp)
> }
> EXPORT_SYMBOL(md_unregister_thread);
>
> -void md_error(struct mddev *mddev, struct md_rdev *rdev)
> +void _md_error(struct mddev *mddev, struct md_rdev *rdev)
> {
> if (!rdev || test_bit(Faulty, &rdev->flags))
> return;
> @@ -8287,6 +8288,13 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
> queue_work(md_misc_wq, &mddev->event_work);
> md_new_event();
> }
> +
> +void md_error(struct mddev *mddev, struct md_rdev *rdev)
> +{
> + spin_lock(&mddev->error_handle_lock);
> + _md_error(mddev, rdev);
> + spin_unlock(&mddev->error_handle_lock);
> +}
> EXPORT_SYMBOL(md_error);
>
> /* seq_file implementation /proc/mdstat */
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index ec598f9a8381..5177cb609e4b 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -619,6 +619,9 @@ struct mddev {
> /* The sequence number for sync thread */
> atomic_t sync_seq;
>
> + /* Lock for serializing md_error */
> + spinlock_t error_handle_lock;
> +
> bool has_superblocks:1;
> bool fail_last_dev:1;
> bool serialize_policy:1;
> @@ -901,6 +904,7 @@ extern void md_write_start(struct mddev *mddev, struct bio *bi);
> extern void md_write_inc(struct mddev *mddev, struct bio *bi);
> extern void md_write_end(struct mddev *mddev);
> extern void md_done_sync(struct mddev *mddev, int blocks, int ok);
> +void _md_error(struct mddev *mddev, struct md_rdev *rdev);
> extern void md_error(struct mddev *mddev, struct md_rdev *rdev);
> extern void md_finish_reshape(struct mddev *mddev);
> void md_submit_discard_bio(struct mddev *mddev, struct md_rdev *rdev,
>
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 3/9] md: introduce md_bio_failure_error()
2025-09-15 3:42 ` [PATCH v4 3/9] md: introduce md_bio_failure_error() Kenta Akagi
@ 2025-09-18 1:09 ` Yu Kuai
2025-09-18 14:56 ` Kenta Akagi
0 siblings, 1 reply; 39+ messages in thread
From: Yu Kuai @ 2025-09-18 1:09 UTC (permalink / raw)
To: Kenta Akagi, Song Liu, Mariusz Tkaczyk, Shaohua Li, Guoqing Jiang
Cc: linux-raid, linux-kernel, yukuai (C)
Hi,
On 2025/09/15 11:42, Kenta Akagi wrote:
> Add a new helper function md_bio_failure_error().
> It is serialized with md_error() under the same lock and works
> almost the same, but with two differences:
>
> * Takes the failed bio as an argument
> * If MD_FAILFAST is set in bi_opf and the target rdev is LastDev,
> it does not mark the rdev faulty
>
> Failfast bios must not break the array, but in the current implementation
> this can happen. This is because MD_BROKEN was introduced in RAID1/RAID10
> and is set when md_error() is called on an rdev required for mddev
> operation. At the time failfast was introduced, this was not the case.
>
> Before this commit, md_error() has already been serialized, and
> RAID1/RAID10 mark rdevs that must not be set Faulty by Failfast
> with the LastDev flag.
>
> The actual change in bio error handling will follow in a later commit.
>
> Signed-off-by: Kenta Akagi <k@mgml.me>
> ---
> drivers/md/md.c | 42 ++++++++++++++++++++++++++++++++++++++++++
> drivers/md/md.h | 4 +++-
> 2 files changed, 45 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 5607578a6db9..65fdd9bae8f4 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -8297,6 +8297,48 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
> }
> EXPORT_SYMBOL(md_error);
>
> +/** md_bio_failure_error() - md error handler for MD_FAILFAST bios
> + * @mddev: affected md device.
> + * @rdev: member device to fail.
> + * @bio: bio whose triggered device failure.
> + *
> + * This is almost the same as md_error(). That is, it is serialized at
> + * the same level as md_error, marks the rdev as Faulty, and changes
> + * the mddev status.
> + * However, if all of the following conditions are met, it does nothing.
> + * This is because MD_FAILFAST bios must not stop the array.
> + * * RAID1 or RAID10
> + * * LastDev - if rdev becomes Faulty, mddev will stop
> + * * The failed bio has MD_FAILFAST set
> + *
> + * Returns: true if _md_error() was called, false if not.
> + */
> +bool md_bio_failure_error(struct mddev *mddev, struct md_rdev *rdev, struct bio *bio)
> +{
> + bool do_md_error = true;
> +
> + spin_lock(&mddev->error_handle_lock);
> + if (mddev->pers) {
Given that this is only called from the IO path, mddev->pers must already
have been checked when submitting the bio. You can add a warning here like:
if (WARN_ON_ONCE(!mddev->pers))
	/* return true because we don't want caller to retry */
	return true;
> + if (mddev->pers->head.id == ID_RAID1 ||
> + mddev->pers->head.id == ID_RAID10) {
> + if (test_bit(LastDev, &rdev->flags) &&
> + test_bit(FailFast, &rdev->flags) &&
> + bio != NULL && (bio->bi_opf & MD_FAILFAST))
> + do_md_error = false;
> + }
As I suggested in patch 1, this can be:
if (!mddev->pers->lastdev ||
    !mddev->pers->lastdev(mddev, rdev, bio)) {
	__md_error(mddev, rdev);
	return true;
}
Thanks,
Kuai
> + }
> +
> + if (do_md_error)
> + _md_error(mddev, rdev);
> + else
> + pr_warn_ratelimited("md: %s: %s didn't do anything for %pg\n",
> + mdname(mddev), __func__, rdev->bdev);
> +
> + spin_unlock(&mddev->error_handle_lock);
> + return do_md_error;
> +}
> +EXPORT_SYMBOL(md_bio_failure_error);
> +
> /* seq_file implementation /proc/mdstat */
>
> static void status_unused(struct seq_file *seq)
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 5177cb609e4b..11389ea58431 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -283,7 +283,8 @@ enum flag_bits {
> */
> LastDev, /* This is the last working rdev.
> * so don't use FailFast any more for
> - * metadata.
> + * metadata and don't Fail rdev
> + * when FailFast bio failure.
> */
> CollisionCheck, /*
> * check if there is collision between raid1
> @@ -906,6 +907,7 @@ extern void md_write_end(struct mddev *mddev);
> extern void md_done_sync(struct mddev *mddev, int blocks, int ok);
> void _md_error(struct mddev *mddev, struct md_rdev *rdev);
> extern void md_error(struct mddev *mddev, struct md_rdev *rdev);
> +extern bool md_bio_failure_error(struct mddev *mddev, struct md_rdev *rdev, struct bio *bio);
> extern void md_finish_reshape(struct mddev *mddev);
> void md_submit_discard_bio(struct mddev *mddev, struct md_rdev *rdev,
> struct bio *bio, sector_t start, sector_t size);
>
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 4/9] md/raid1,raid10: Don't set MD_BROKEN on failfast bio failure
2025-09-15 3:42 ` [PATCH v4 4/9] md/raid1,raid10: Don't set MD_BROKEN on failfast bio failure Kenta Akagi
@ 2025-09-18 1:26 ` Yu Kuai
2025-09-18 15:22 ` Kenta Akagi
0 siblings, 1 reply; 39+ messages in thread
From: Yu Kuai @ 2025-09-18 1:26 UTC (permalink / raw)
To: Kenta Akagi, Song Liu, Mariusz Tkaczyk, Shaohua Li, Guoqing Jiang
Cc: linux-raid, linux-kernel, yukuai (C)
Hi,
On 2025/09/15 11:42, Kenta Akagi wrote:
> Failfast is a feature implemented only for RAID1 and RAID10. It instructs
> the block device providing the rdev to immediately return a bio error
> without retrying if any issue occurs. This allows quickly detaching a
> problematic rdev and minimizes IO latency.
>
> Due to its nature, failfast bios can fail easily, and md must not mark
> an essential rdev as Faulty or set MD_BROKEN on the array just because
> a failfast bio failed.
>
> When failfast was introduced, RAID1 and RAID10 were designed to continue
> operating normally even if md_error was called for the last rdev. However,
> with the introduction of MD_BROKEN in RAID1/RAID10
> in commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"), calling
> md_error for the last rdev now prevents further writes to the array.
> Despite this, the current failfast error handler still assumes that
> calling md_error will not break the array.
>
> Normally, this is not an issue because MD_FAILFAST is not set when a bio
> is issued to the last rdev. However, if the array is not degraded and a
> bio with MD_FAILFAST has been issued, simultaneous failures could
> potentially break the array. This is unusual but can happen; for example,
> this can occur when using NVMe over TCP if all rdevs depend on
> a single Ethernet link.
>
> In other words, this becomes a problem under the following conditions:
> Preconditions:
> * Failfast is enabled on all rdevs.
> * All rdevs are In_sync - This is a requirement for bio to be submit
> with MD_FAILFAST.
> * At least one bio has been submitted but has not yet completed.
>
> Trigger condition:
> * All underlying devices of the rdevs return an error for their failfast
> bios.
>
> Whether the bio is a read or a write makes little difference to the
> outcome.
> In the write case, md_error is invoked on each rdev through its bi_end_io
> handler.
> In the read case, losing the first rdev triggers a metadata
> update. Next, md_super_write, unlike raid1_write_request, issues the bio
> with MD_FAILFAST if the rdev supports it, causing the bio to fail
> immediately - Before this patchset, LastDev was set only by the failure
> path in super_written. Consequently, super_written calls md_error on the
> remaining rdev.
>
> Prior to this commit, the following changes were introduced:
> * The helper function md_bio_failure_error() that skips the error handler
> if a failfast bio targets the last rdev.
> * Serialization of md_error() and md_bio_failure_error().
> * Setting the LastDev flag for rdevs that must not be lost.
>
> This commit uses md_bio_failure_error() instead of md_error() for failfast
> bio failures, ensuring that failfast bios do not stop array operations.
>
> Fixes: 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10")
> Signed-off-by: Kenta Akagi <k@mgml.me>
> ---
> drivers/md/md.c | 5 +----
> drivers/md/raid1.c | 37 ++++++++++++++++++-------------------
> drivers/md/raid10.c | 9 +++++----
> 3 files changed, 24 insertions(+), 27 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 65fdd9bae8f4..65814bbe9bad 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -1004,11 +1004,8 @@ static void super_written(struct bio *bio)
> if (bio->bi_status) {
> pr_err("md: %s gets error=%d\n", __func__,
> blk_status_to_errno(bio->bi_status));
> - md_error(mddev, rdev);
> - if (!test_bit(Faulty, &rdev->flags)
> - && (bio->bi_opf & MD_FAILFAST)) {
> + if (!md_bio_failure_error(mddev, rdev, bio))
> set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
> - }
> }
>
> bio_put(bio);
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 32ad6b102ff7..8fff9dacc6e0 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -470,7 +470,7 @@ static void raid1_end_write_request(struct bio *bio)
> (bio->bi_opf & MD_FAILFAST) &&
> /* We never try FailFast to WriteMostly devices */
> !test_bit(WriteMostly, &rdev->flags)) {
> - md_error(r1_bio->mddev, rdev);
> + md_bio_failure_error(r1_bio->mddev, rdev, bio);
> }
Can the following check of Faulty be replaced with the return value?
>
> /*
> @@ -2178,8 +2178,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
> if (test_bit(FailFast, &rdev->flags)) {
> /* Don't try recovering from here - just fail it
> * ... unless it is the last working device of course */
> - md_error(mddev, rdev);
> - if (test_bit(Faulty, &rdev->flags))
> + if (md_bio_failure_error(mddev, rdev, bio))
> /* Don't try to read from here, but make sure
> * put_buf does it's thing
> */
> @@ -2657,9 +2656,8 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
> static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
> {
> struct mddev *mddev = conf->mddev;
> - struct bio *bio;
> + struct bio *bio, *updated_bio;
> struct md_rdev *rdev;
> - sector_t sector;
>
> clear_bit(R1BIO_ReadError, &r1_bio->state);
> /* we got a read error. Maybe the drive is bad. Maybe just
> @@ -2672,29 +2670,30 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
> */
>
> bio = r1_bio->bios[r1_bio->read_disk];
> - bio_put(bio);
> - r1_bio->bios[r1_bio->read_disk] = NULL;
> + updated_bio = NULL;
>
> rdev = conf->mirrors[r1_bio->read_disk].rdev;
> - if (mddev->ro == 0
> - && !test_bit(FailFast, &rdev->flags)) {
> - freeze_array(conf, 1);
> - fix_read_error(conf, r1_bio);
> - unfreeze_array(conf);
> - } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) {
> - md_error(mddev, rdev);
> + if (mddev->ro == 0) {
> + if (!test_bit(FailFast, &rdev->flags)) {
> + freeze_array(conf, 1);
> + fix_read_error(conf, r1_bio);
> + unfreeze_array(conf);
> + } else {
> + md_bio_failure_error(mddev, rdev, bio);
> + }
> } else {
> - r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
> + updated_bio = IO_BLOCKED;
> }
I'd suggest a separate patch to clean up the conditions first; it's
better for code review.
BTW, I'd prefer an if/else chain instead of nested if/else; perhaps
the following is better:
if (mddev->ro != 0) {
/* read-only */
} else if (!test_bit(FailFast, &rdev->flags)) {
/* read-write and failfast is not set */
} else {
/* read-write and failfast is set */
}
>
> + bio_put(bio);
> + r1_bio->bios[r1_bio->read_disk] = updated_bio;
> +
> rdev_dec_pending(rdev, conf->mddev);
> - sector = r1_bio->sector;
> - bio = r1_bio->master_bio;
>
> /* Reuse the old r1_bio so that the IO_BLOCKED settings are preserved */
> r1_bio->state = 0;
> - raid1_read_request(mddev, bio, r1_bio->sectors, r1_bio);
> - allow_barrier(conf, sector);
> + raid1_read_request(mddev, r1_bio->master_bio, r1_bio->sectors, r1_bio);
> + allow_barrier(conf, r1_bio->sector);
> }
>
> static void raid1d(struct md_thread *thread)
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index dc4edd4689f8..b73af94a88b0 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -488,7 +488,7 @@ static void raid10_end_write_request(struct bio *bio)
> dec_rdev = 0;
> if (test_bit(FailFast, &rdev->flags) &&
> (bio->bi_opf & MD_FAILFAST)) {
> - md_error(rdev->mddev, rdev);
> + md_bio_failure_error(rdev->mddev, rdev, bio);
> }
>
Same as raid1: can the following check of Faulty be replaced with the return value?
> /*
> @@ -2443,7 +2443,7 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
> continue;
> } else if (test_bit(FailFast, &rdev->flags)) {
> /* Just give up on this device */
> - md_error(rdev->mddev, rdev);
> + md_bio_failure_error(rdev->mddev, rdev, tbio);
> continue;
> }
> /* Ok, we need to write this bio, either to correct an
> @@ -2895,8 +2895,9 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
> freeze_array(conf, 1);
> fix_read_error(conf, mddev, r10_bio);
> unfreeze_array(conf);
> - } else
> - md_error(mddev, rdev);
> + } else {
> + md_bio_failure_error(mddev, rdev, bio);
> + }
>
> rdev_dec_pending(rdev, mddev);
> r10_bio->state = 0;
>
And please split this patch into separate raid1 and raid10 patches.
Thanks
Kuai
* Re: [PATCH v4 5/9] md/raid1,raid10: Set R{1,10}BIO_Uptodate when successful retry of a failed bio
2025-09-17 13:20 ` Kenta Akagi
@ 2025-09-18 6:39 ` Li Nan
2025-09-18 15:36 ` Kenta Akagi
0 siblings, 1 reply; 39+ messages in thread
From: Li Nan @ 2025-09-18 6:39 UTC (permalink / raw)
To: Kenta Akagi, linan666, song, yukuai3, mtkaczyk, shli, jgq516
Cc: linux-raid, linux-kernel
On 2025/9/17 21:20, Kenta Akagi wrote:
>>> if (!narrow_write_error(r1_bio, m))
>>> - md_error(conf->mddev,
>>> - conf->mirrors[m].rdev);
>>> + md_error(conf->mddev, rdev);
>>> /* an I/O failed, we can't clear the bitmap */
>>> - rdev_dec_pending(conf->mirrors[m].rdev,
>>> - conf->mddev);
>>> + else if (test_bit(In_sync, &rdev->flags) &&
>>> + !test_bit(Faulty, &rdev->flags) &&
>>> + rdev_has_badblock(rdev,
>>> + r1_bio->sector,
>>> + r1_bio->sectors) == 0)
>>
>> Clear the badblock and set R10BIO_Uptodate if the rdev has a badblock.
>
> narrow_write_error returns true when the write succeeds, or when the write
> fails but rdev_set_badblocks succeeds. Here, it determines that the rewrite
> succeeded if there is no badblock in the sector range written by r1_bio.
> So we should not call rdev_clear_badblocks here.
>
I am trying to cleanup narrow_write_error():
https://lore.kernel.org/linux-raid/20250917093508.456790-3-linan666@huaweicloud.com/T/#u
It may be clearer if narrow_write_error() returns true only when all fix
IOs succeed.
```
@@ -2553,11 +2551,17 @@ static bool narrow_write_error(struct r1bio *r1_bio, int i)
bio_trim(wbio, sector - r1_bio->sector, sectors);
wbio->bi_iter.bi_sector += rdev->data_offset;
- if (submit_bio_wait(wbio) < 0)
- /* failure! */
- ok = rdev_set_badblocks(rdev, sector,
- sectors, 0)
- && ok;
+ if (submit_bio_wait(wbio) < 0) {
+ ok = false;
+ if (rdev_set_badblocks(rdev, sector, sectors, 0)) {
+ /*
+ * Badblocks set failed, disk marked Faulty.
+ * No further operations needed.
+ */
+ bio_put(wbio);
+ break;
+ }
+ }
bio_put(wbio);
sect_to_write -= sectors;
```
We can clear badblocks and set R10BIO_Uptodate after it. What do you think?
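If it helps to pin down that contract, here is a hedged userspace model of the semantics the cleanup aims for (all names are hypothetical, not the kernel API): return true only when every retried chunk was actually written, go false on the first write failure, and stop early once recording the badblock also fails, since the disk would be marked Faulty at that point:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model of the proposed narrow_write_error() contract.
 * write_ok[i]  - whether the retried write of chunk i succeeds
 * set_bb_ok[i] - whether recording a badblock for chunk i succeeds
 * Returns true only if all retried writes succeeded.
 */
static bool narrow_write_model(const bool *write_ok, const bool *set_bb_ok,
			       size_t nchunks)
{
	bool ok = true;

	for (size_t i = 0; i < nchunks; i++) {
		if (write_ok[i])
			continue;
		ok = false;
		if (!set_bb_ok[i])
			break;	/* badblocks set failed: disk marked Faulty */
	}
	return ok;
}
```

Under this contract the caller can clear badblocks and set R{1,10}BIO_Uptodate whenever the function returns true, without re-checking the badblock table.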
--
Thanks,
Nan
* Re: [PATCH v4 6/9] md/raid1,raid10: Fix missing retries Failfast write bios on no-bbl rdevs
2025-09-17 13:33 ` Kenta Akagi
@ 2025-09-18 6:58 ` Li Nan
2025-09-18 16:23 ` Kenta Akagi
0 siblings, 1 reply; 39+ messages in thread
From: Li Nan @ 2025-09-18 6:58 UTC (permalink / raw)
To: Kenta Akagi, linan666, song, yukuai3, mtkaczyk, shli, jgq516
Cc: linux-raid, linux-kernel
On 2025/9/17 21:33, Kenta Akagi wrote:
>
>
> On 2025/09/17 19:06, Li Nan wrote:
>>
>>
>> On 2025/9/15 11:42, Kenta Akagi wrote:
>>> In the current implementation, write failures are not retried on rdevs
>>> with badblocks disabled. This is because narrow_write_error, which issues
>>> retry bios, immediately returns when badblocks are disabled. As a result,
>>> a single write failure on such an rdev will immediately mark it as Faulty.
>>>
>>
>> IMO, there's no need to add extra logic for scenarios where badblocks
>> is not enabled. Do you have real-world scenarios where badblocks is
>> disabled?
>
> No, badblocks are enabled in my environment.
> I'm fine if it's not added, but I still think it's worth adding a WARN_ON, like:
>
> @@ -2553,13 +2554,17 @@ static bool narrow_write_error(struct r1bio *r1_bio, int i)
> fail = true;
> + WARN_ON((bio->bi_opf & MD_FAILFAST) && (rdev->badblocks.shift < 0));
> if (!narrow_write_error(r1_bio, m))
>
> What do you think?
>
How about this?
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2522,10 +2522,11 @@ static bool narrow_write_error(struct r1bio *r1_bio, int i)
bool ok = true;
if (rdev->badblocks.shift < 0)
- return false;
+ block_sectors = bdev_logical_block_size(rdev->bdev) >> 9;
+ else
+ block_sectors = roundup(1 << rdev->badblocks.shift,
+ bdev_logical_block_size(rdev->bdev) >> 9);
- block_sectors = roundup(1 << rdev->badblocks.shift,
- bdev_logical_block_size(rdev->bdev) >> 9);
sector = r1_bio->sector;
sectors = ((sector + block_sectors)
& ~(sector_t)(block_sectors - 1))
rdev_set_badblocks() checks shift, too. The rdev is marked Faulty if setting
badblocks fails.
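As a quick sanity check of the alignment math above, here is a hedged userspace sketch. roundup() mirrors the kernel's include/linux/math.h macro; the function names (block_sectors, block_end) are illustrative, not kernel APIs:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t sector_t;

/* roundup() as defined in the kernel's include/linux/math.h */
#define roundup(x, y) ((((x) + ((y) - 1)) / (y)) * (y))

/*
 * Model of the proposed block_sectors computation: with badblocks
 * disabled (shift < 0), fall back to the logical block size expressed
 * in 512-byte sectors; otherwise round the badblocks granularity up
 * to that size, as narrow_write_error() does today.
 */
static sector_t block_sectors(int bb_shift, unsigned int lbs_bytes)
{
	if (bb_shift < 0)
		return lbs_bytes >> 9;
	return roundup((sector_t)1 << bb_shift, lbs_bytes >> 9);
}

/*
 * End of the block containing 'sector', i.e. the
 * ((sector + block_sectors) & ~(block_sectors - 1)) expression in
 * narrow_write_error(); 'bs' must be a power of two.
 */
static sector_t block_end(sector_t sector, sector_t bs)
{
	return (sector + bs) & ~(bs - 1);
}
```

So a 4096-byte logical block with badblocks disabled retries in 8-sector chunks, which is why the fallback keeps the loop correct on no-bbl rdevs.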
>
> Thanks,
> Akagi
>
>>> The retry mechanism appears to have been implemented under the assumption
>>> that a bad block is involved in the failure. However, the retry after
>>> MD_FAILFAST write failure depend on this code, and a Failfast write request
>>> may fail for reasons unrelated to bad blocks.
>>>
>>> Consequently, if failfast is enabled and badblocks are disabled on all
>>> rdevs, and all rdevs encounter a failfast write bio failure at the same
>>> time, no retries will occur and the entire array can be lost.
>>>
>>> This commit adds a path in narrow_write_error to retry writes even on rdevs
>>> where bad blocks are disabled, and failed bios marked with MD_FAILFAST will
>>> use this path. For non-failfast cases, the behavior remains unchanged: no
>>> retry writes are attempted to rdevs with bad blocks disabled.
>>>
>>> Fixes: 1919cbb23bf1 ("md/raid10: add failfast handling for writes.")
>>> Fixes: 212e7eb7a340 ("md/raid1: add failfast handling for writes.")
>>> Signed-off-by: Kenta Akagi <k@mgml.me>
>>> ---
>>> drivers/md/raid1.c | 44 +++++++++++++++++++++++++++++---------------
>>> drivers/md/raid10.c | 37 ++++++++++++++++++++++++-------------
>>> 2 files changed, 53 insertions(+), 28 deletions(-)
>>> !test_bit(Faulty, &rdev->flags) &&
>>
>> --
>> Thanks,
>> Nan
>>
--
Thanks,
Nan
* Re: [PATCH v4 7/9] md/raid10: fix failfast read error not rescheduled
2025-09-15 3:42 ` [PATCH v4 7/9] md/raid10: fix failfast read error not rescheduled Kenta Akagi
@ 2025-09-18 7:38 ` Li Nan
2025-09-18 16:12 ` Kenta Akagi
0 siblings, 1 reply; 39+ messages in thread
From: Li Nan @ 2025-09-18 7:38 UTC (permalink / raw)
To: Kenta Akagi, Song Liu, Yu Kuai, Mariusz Tkaczyk, Shaohua Li,
Guoqing Jiang
Cc: linux-raid, linux-kernel
On 2025/9/15 11:42, Kenta Akagi wrote:
> raid10_end_read_request lacks a path to retry when a FailFast IO fails.
> As a result, when Failfast Read IOs fail on all rdevs, the upper layer
> receives EIO, without read rescheduled.
>
> Looking at the two commits below, it seems only raid10_end_read_request
> lacks the failfast read retry handling, while raid1_end_read_request has
> it. In RAID1, the retry works as expected.
> * commit 8d3ca83dcf9c ("md/raid10: add failfast handling for reads.")
> * commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
>
> I don't know why raid10_end_read_request lacks this, but it is probably
> just a simple oversight.
Agreed, these two lines can be removed.
Other than that, LGTM.
Reviewed-by: Li Nan <linan122@huawei.com>
>
> This commit will make the failfast read bio for the last rdev in raid10
> retry if it fails.
>
> Fixes: 8d3ca83dcf9c ("md/raid10: add failfast handling for reads.")
> Signed-off-by: Kenta Akagi <k@mgml.me>
> ---
> drivers/md/raid10.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 92cf3047dce6..86c0eacd37cb 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -399,6 +399,11 @@ static void raid10_end_read_request(struct bio *bio)
> * wait for the 'master' bio.
> */
> set_bit(R10BIO_Uptodate, &r10_bio->state);
> + } else if (test_bit(FailFast, &rdev->flags) &&
> + test_bit(R10BIO_FailFast, &r10_bio->state)) {
> + /* This was a fail-fast read so we definitely
> + * want to retry */
> + ;
> } else if (!raid1_should_handle_error(bio)) {
> uptodate = 1;
> } else {
--
Thanks,
Nan
* Re: [PATCH v4 1/9] md/raid1,raid10: Set the LastDev flag when the configuration changes
2025-09-18 1:00 ` Yu Kuai
@ 2025-09-18 14:02 ` Kenta Akagi
0 siblings, 0 replies; 39+ messages in thread
From: Kenta Akagi @ 2025-09-18 14:02 UTC (permalink / raw)
To: yukuai1, song, mtkaczyk, shli, jgq516
Cc: linux-raid, linux-kernel, yukuai3, k
Hi,
Thank you for reviewing.
On 2025/09/18 10:00, Yu Kuai wrote:
> Hi,
>
> On 2025/09/15 11:42, Kenta Akagi wrote:
>> Currently, the LastDev flag is set on an rdev that failed a failfast
>> metadata write and called md_error, but did not become Faulty. It is
>> cleared when the metadata write retry succeeds. This has problems for
>> the following reasons:
>>
>> * Despite its name, the flag is only set during a metadata write window.
>> * Unlike when LastDev and Failfast were introduced, md_error on the last
>> rdev of a RAID1/10 array now sets MD_BROKEN. Thus when LastDev is set,
>> the array is already unwritable.
>>
>> A following commit will prevent failfast bios from breaking the array,
>> which requires knowing from outside the personality whether an rdev is
>> the last one. For that purpose, LastDev should be set on rdevs that must
>> not be lost.
>>
>> This commit ensures that LastDev is set on the indispensable rdev in a
>> degraded RAID1/10 array.
>>
>> Signed-off-by: Kenta Akagi <k@mgml.me>
>> ---
>> drivers/md/md.c | 4 +---
>> drivers/md/md.h | 6 +++---
>> drivers/md/raid1.c | 34 +++++++++++++++++++++++++++++++++-
>> drivers/md/raid10.c | 34 +++++++++++++++++++++++++++++++++-
>> 4 files changed, 70 insertions(+), 8 deletions(-)
>>
> After md_error() is serialized, why not check if rdev is the last rdev
> from md_bio_failure_error() in patch 3 directly? I think it is cleaner
> and code changes should be much less to add a new method like
> pers->lastdev.
That makes sense. I'll proceed that way.
Thanks,
Akagi
>
> Thanks,
> Kuai
>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 4e033c26fdd4..268410b66b83 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -1007,10 +1007,8 @@ static void super_written(struct bio *bio)
>> if (!test_bit(Faulty, &rdev->flags)
>> && (bio->bi_opf & MD_FAILFAST)) {
>> set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
>> - set_bit(LastDev, &rdev->flags);
>> }
>> - } else
>> - clear_bit(LastDev, &rdev->flags);
>> + }
>> bio_put(bio);
>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>> index 51af29a03079..ec598f9a8381 100644
>> --- a/drivers/md/md.h
>> +++ b/drivers/md/md.h
>> @@ -281,9 +281,9 @@ enum flag_bits {
>> * It is expects that no bad block log
>> * is present.
>> */
>> - LastDev, /* Seems to be the last working dev as
>> - * it didn't fail, so don't use FailFast
>> - * any more for metadata
>> + LastDev, /* This is the last working rdev.
>> + * so don't use FailFast any more for
>> + * metadata.
>> */
>> CollisionCheck, /*
>> * check if there is collision between raid1
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index bf44878ec640..32ad6b102ff7 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -1733,6 +1733,33 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>> seq_printf(seq, "]");
>> }
>> +/**
>> + * update_lastdev - Set or clear LastDev flag for all rdevs in array
>> + * @conf: pointer to r1conf
>> + *
>> + * Sets LastDev if the device is In_sync and cannot be lost for the array.
>> + * Otherwise, clear it.
>> + *
>> + * Caller must hold ->device_lock.
>> + */
>> +static void update_lastdev(struct r1conf *conf)
>> +{
>> + int i;
>> + int alive_disks = conf->raid_disks - conf->mddev->degraded;
>> +
>> + for (i = 0; i < conf->raid_disks; i++) {
>> + struct md_rdev *rdev = conf->mirrors[i].rdev;
>> +
>> + if (rdev) {
>> + if (test_bit(In_sync, &rdev->flags) &&
>> + alive_disks == 1)
>> + set_bit(LastDev, &rdev->flags);
>> + else
>> + clear_bit(LastDev, &rdev->flags);
>> + }
>> + }
>> +}
>> +
>> /**
>> * raid1_error() - RAID1 error handler.
>> * @mddev: affected md device.
>> @@ -1767,8 +1794,10 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>> }
>> }
>> set_bit(Blocked, &rdev->flags);
>> - if (test_and_clear_bit(In_sync, &rdev->flags))
>> + if (test_and_clear_bit(In_sync, &rdev->flags)) {
>> mddev->degraded++;
>> + update_lastdev(conf);
>> + }
>> set_bit(Faulty, &rdev->flags);
>> spin_unlock_irqrestore(&conf->device_lock, flags);
>> /*
>> @@ -1864,6 +1893,7 @@ static int raid1_spare_active(struct mddev *mddev)
>> }
>> }
>> mddev->degraded -= count;
>> + update_lastdev(conf);
>> spin_unlock_irqrestore(&conf->device_lock, flags);
>> print_conf(conf);
>> @@ -3290,6 +3320,7 @@ static int raid1_run(struct mddev *mddev)
>> rcu_assign_pointer(conf->thread, NULL);
>> mddev->private = conf;
>> set_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
>> + update_lastdev(conf);
>> md_set_array_sectors(mddev, raid1_size(mddev, 0, 0));
>> @@ -3427,6 +3458,7 @@ static int raid1_reshape(struct mddev *mddev)
>> spin_lock_irqsave(&conf->device_lock, flags);
>> mddev->degraded += (raid_disks - conf->raid_disks);
>> + update_lastdev(conf);
>> spin_unlock_irqrestore(&conf->device_lock, flags);
>> conf->raid_disks = mddev->raid_disks = raid_disks;
>> mddev->delta_disks = 0;
>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>> index b60c30bfb6c7..dc4edd4689f8 100644
>> --- a/drivers/md/raid10.c
>> +++ b/drivers/md/raid10.c
>> @@ -1983,6 +1983,33 @@ static int enough(struct r10conf *conf, int ignore)
>> _enough(conf, 1, ignore);
>> }
>> +/**
>> + * update_lastdev - Set or clear LastDev flag for all rdevs in array
>> + * @conf: pointer to r10conf
>> + *
>> + * Sets LastDev if the device is In_sync and cannot be lost for the array.
>> + * Otherwise, clear it.
>> + *
>> + * Caller must hold ->reconfig_mutex or ->device_lock.
>> + */
>> +static void update_lastdev(struct r10conf *conf)
>> +{
>> + int i;
>> + int raid_disks = max(conf->geo.raid_disks, conf->prev.raid_disks);
>> +
>> + for (i = 0; i < raid_disks; i++) {
>> + struct md_rdev *rdev = conf->mirrors[i].rdev;
>> +
>> + if (rdev) {
>> + if (test_bit(In_sync, &rdev->flags) &&
>> + !enough(conf, i))
>> + set_bit(LastDev, &rdev->flags);
>> + else
>> + clear_bit(LastDev, &rdev->flags);
>> + }
>> + }
>> +}
>> +
>> /**
>> * raid10_error() - RAID10 error handler.
>> * @mddev: affected md device.
>> @@ -2013,8 +2040,10 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
>> return;
>> }
>> }
>> - if (test_and_clear_bit(In_sync, &rdev->flags))
>> + if (test_and_clear_bit(In_sync, &rdev->flags)) {
>> mddev->degraded++;
>> + update_lastdev(conf);
>> + }
>> set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>> set_bit(Blocked, &rdev->flags);
>> @@ -2102,6 +2131,7 @@ static int raid10_spare_active(struct mddev *mddev)
>> }
>> spin_lock_irqsave(&conf->device_lock, flags);
>> mddev->degraded -= count;
>> + update_lastdev(conf);
>> spin_unlock_irqrestore(&conf->device_lock, flags);
>> print_conf(conf);
>> @@ -4159,6 +4189,7 @@ static int raid10_run(struct mddev *mddev)
>> md_set_array_sectors(mddev, size);
>> mddev->resync_max_sectors = size;
>> set_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
>> + update_lastdev(conf);
>> if (md_integrity_register(mddev))
>> goto out_free_conf;
>> @@ -4567,6 +4598,7 @@ static int raid10_start_reshape(struct mddev *mddev)
>> */
>> spin_lock_irq(&conf->device_lock);
>> mddev->degraded = calc_degraded(conf);
>> + update_lastdev(conf);
>> spin_unlock_irq(&conf->device_lock);
>> mddev->raid_disks = conf->geo.raid_disks;
>> mddev->reshape_position = conf->reshape_progress;
>>
>
>
* Re: [PATCH v4 3/9] md: introduce md_bio_failure_error()
2025-09-18 1:09 ` Yu Kuai
@ 2025-09-18 14:56 ` Kenta Akagi
0 siblings, 0 replies; 39+ messages in thread
From: Kenta Akagi @ 2025-09-18 14:56 UTC (permalink / raw)
To: yukuai1, song, mtkaczyk, shli, jgq516
Cc: linux-raid, linux-kernel, yukuai3, k
On 2025/09/18 10:09, Yu Kuai wrote:
> Hi,
>
> On 2025/09/15 11:42, Kenta Akagi wrote:
>> Add a new helper function md_bio_failure_error().
>> It is serialized with md_error() under the same lock and works
>> almost the same, but with two differences:
>>
>> * Takes the failed bio as an argument
>> * If MD_FAILFAST is set in bi_opf and the target rdev is LastDev,
>> it does not mark the rdev faulty
>>
>> Failfast bios must not break the array, but in the current implementation
>> this can happen. This is because MD_BROKEN was introduced in RAID1/RAID10
>> and is set when md_error() is called on an rdev required for mddev
>> operation. At the time failfast was introduced, this was not the case.
>>
>> Before this commit, md_error() has already been serialized, and
>> RAID1/RAID10 mark rdevs that must not be set Faulty by Failfast
>> with the LastDev flag.
>>
>> The actual change in bio error handling will follow in a later commit.
>>
>> Signed-off-by: Kenta Akagi <k@mgml.me>
>> ---
>> drivers/md/md.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>> drivers/md/md.h | 4 +++-
>> 2 files changed, 45 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 5607578a6db9..65fdd9bae8f4 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -8297,6 +8297,48 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
>> }
>> EXPORT_SYMBOL(md_error);
>> +/** md_bio_failure_error() - md error handler for MD_FAILFAST bios
>> + * @mddev: affected md device.
>> + * @rdev: member device to fail.
>> + * @bio: bio whose failure triggered the device error.
>> + *
>> + * This is almost the same as md_error(). That is, it is serialized at
>> + * the same level as md_error, marks the rdev as Faulty, and changes
>> + * the mddev status.
>> + * However, if all of the following conditions are met, it does nothing.
>> + * This is because MD_FAILFAST bios must not stop the array.
>> + * * RAID1 or RAID10
>> + * * LastDev - if rdev becomes Faulty, mddev will stop
>> + * * The failed bio has MD_FAILFAST set
>> + *
>> + * Returns: true if _md_error() was called, false if not.
>> + */
>> +bool md_bio_failure_error(struct mddev *mddev, struct md_rdev *rdev, struct bio *bio)
>> +{
>> + bool do_md_error = true;
>> +
>> + spin_lock(&mddev->error_handle_lock);
>> + if (mddev->pers) {
>
> Given that this is only called from the IO path, mddev->pers must
> already have been checked while submitting the bio. You can add a warning here like:
>
> if (WARN_ON_ONCE(!mddev->pers))
> /* return true because we don't want caller to retry */
> return true;
Thank you, Understood. I will fix it.
>> + if (mddev->pers->head.id == ID_RAID1 ||
>> + mddev->pers->head.id == ID_RAID10) {
>> + if (test_bit(LastDev, &rdev->flags) &&
>> + test_bit(FailFast, &rdev->flags) &&
>> + bio != NULL && (bio->bi_opf & MD_FAILFAST))
>> + do_md_error = false;
>> + }
>
> As I suggested in patch 1, this can be:
> if (!mddev->pers->lastdev ||
> !mddev->pers->lastdev(mddev, rdev, bio)) {
> __md_error(mddev, rdev);
> return true;
> }
Understood this too.
Thanks,
Akagi
>
> Thanks,
> Kuai
>
>> + }
>> +
>> + if (do_md_error)
>> + _md_error(mddev, rdev);
>> + else
>> + pr_warn_ratelimited("md: %s: %s didn't do anything for %pg\n",
>> + mdname(mddev), __func__, rdev->bdev);
>> +
>> + spin_unlock(&mddev->error_handle_lock);
>> + return do_md_error;
>> +}
>> +EXPORT_SYMBOL(md_bio_failure_error);
>> +
>> /* seq_file implementation /proc/mdstat */
>> static void status_unused(struct seq_file *seq)
>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>> index 5177cb609e4b..11389ea58431 100644
>> --- a/drivers/md/md.h
>> +++ b/drivers/md/md.h
>> @@ -283,7 +283,8 @@ enum flag_bits {
>> */
>> LastDev, /* This is the last working rdev.
>> * so don't use FailFast any more for
>> - * metadata.
>> + * metadata and don't fail the rdev
>> + * on FailFast bio failure.
>> */
>> CollisionCheck, /*
>> * check if there is collision between raid1
>> @@ -906,6 +907,7 @@ extern void md_write_end(struct mddev *mddev);
>> extern void md_done_sync(struct mddev *mddev, int blocks, int ok);
>> void _md_error(struct mddev *mddev, struct md_rdev *rdev);
>> extern void md_error(struct mddev *mddev, struct md_rdev *rdev);
>> +extern bool md_bio_failure_error(struct mddev *mddev, struct md_rdev *rdev, struct bio *bio);
>> extern void md_finish_reshape(struct mddev *mddev);
>> void md_submit_discard_bio(struct mddev *mddev, struct md_rdev *rdev,
>> struct bio *bio, sector_t start, sector_t size);
>>
>
>
* Re: [PATCH v4 4/9] md/raid1,raid10: Don't set MD_BROKEN on failfast bio failure
2025-09-18 1:26 ` Yu Kuai
@ 2025-09-18 15:22 ` Kenta Akagi
2025-09-19 1:36 ` Yu Kuai
0 siblings, 1 reply; 39+ messages in thread
From: Kenta Akagi @ 2025-09-18 15:22 UTC (permalink / raw)
To: yukuai1, song, mtkaczyk, shli, jgq516
Cc: linux-raid, linux-kernel, yukuai3, k
On 2025/09/18 10:26, Yu Kuai wrote:
> Hi,
>
> On 2025/09/15 11:42, Kenta Akagi wrote:
>> Failfast is a feature implemented only for RAID1 and RAID10. It instructs
>> the block device providing the rdev to immediately return a bio error
>> without retrying if any issue occurs. This allows quickly detaching a
>> problematic rdev and minimizes IO latency.
>>
>> Due to its nature, failfast bios can fail easily, and md must not mark
>> an essential rdev as Faulty or set MD_BROKEN on the array just because
>> a failfast bio failed.
>>
>> When failfast was introduced, RAID1 and RAID10 were designed to continue
>> operating normally even if md_error was called for the last rdev. However,
>> with the introduction of MD_BROKEN in RAID1/RAID10
>> in commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"), calling
>> md_error for the last rdev now prevents further writes to the array.
>> Despite this, the current failfast error handler still assumes that
>> calling md_error will not break the array.
>>
>> Normally, this is not an issue because MD_FAILFAST is not set when a bio
>> is issued to the last rdev. However, if the array is not degraded and a
>> bio with MD_FAILFAST has been issued, simultaneous failures could
>> potentially break the array. This is unusual but can happen; for example,
>> this can occur when using NVMe over TCP if all rdevs depend on
>> a single Ethernet link.
>>
>> In other words, this becomes a problem under the following conditions:
>> Preconditions:
>> * Failfast is enabled on all rdevs.
>> * All rdevs are In_sync - This is a requirement for bio to be submit
>> with MD_FAILFAST.
>> * At least one bio has been submitted but has not yet completed.
>>
>> Trigger condition:
>> * All underlying devices of the rdevs return an error for their failfast
>> bios.
>>
>> Whether the bio is a read or a write makes little difference to the
>> outcome.
>> In the write case, md_error is invoked on each rdev through its bi_end_io
>> handler.
>> In the read case, losing the first rdev triggers a metadata
>> update. Next, md_super_write, unlike raid1_write_request, issues the bio
>> with MD_FAILFAST if the rdev supports it, causing the bio to fail
>> immediately - Before this patchset, LastDev was set only by the failure
>> path in super_written. Consequently, super_written calls md_error on the
>> remaining rdev.
>>
>> Prior to this commit, the following changes were introduced:
>> * The helper function md_bio_failure_error() that skips the error handler
>> if a failfast bio targets the last rdev.
>> * Serialization of md_error() and md_bio_failure_error().
>> * Setting the LastDev flag for rdevs that must not be lost.
>>
>> This commit uses md_bio_failure_error() instead of md_error() for failfast
>> bio failures, ensuring that failfast bios do not stop array operations.
>>
>> Fixes: 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10")
>> Signed-off-by: Kenta Akagi <k@mgml.me>
>> ---
>> drivers/md/md.c | 5 +----
>> drivers/md/raid1.c | 37 ++++++++++++++++++-------------------
>> drivers/md/raid10.c | 9 +++++----
>> 3 files changed, 24 insertions(+), 27 deletions(-)
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 65fdd9bae8f4..65814bbe9bad 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -1004,11 +1004,8 @@ static void super_written(struct bio *bio)
>> if (bio->bi_status) {
>> pr_err("md: %s gets error=%d\n", __func__,
>> blk_status_to_errno(bio->bi_status));
>> - md_error(mddev, rdev);
>> - if (!test_bit(Faulty, &rdev->flags)
>> - && (bio->bi_opf & MD_FAILFAST)) {
>> + if (!md_bio_failure_error(mddev, rdev, bio))
>> set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
>> - }
>> }
>> bio_put(bio);
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index 32ad6b102ff7..8fff9dacc6e0 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -470,7 +470,7 @@ static void raid1_end_write_request(struct bio *bio)
>> (bio->bi_opf & MD_FAILFAST) &&
>> /* We never try FailFast to WriteMostly devices */
>> !test_bit(WriteMostly, &rdev->flags)) {
>> - md_error(r1_bio->mddev, rdev);
>> + md_bio_failure_error(r1_bio->mddev, rdev, bio);
>> }
>
> Can the following check of Faulty be replaced with the return value?
In the case where raid1_end_write_request is called for a non-failfast IO
and the rdev has already been marked Faulty by another bio, the IO must not
be retried either. I think it would be simpler not to use the return value here.
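For what it's worth, the distinction can be stated as a tiny predicate. This is a hedged userspace model (the function name is hypothetical): the error handler is skipped only when all three failfast conditions hold, so a Faulty check after a non-failfast failure stays independent of the helper's return value:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the md_bio_failure_error() decision for
 * RAID1/RAID10: the rdev is spared (handler skipped) only when it is
 * the last working device, failfast is enabled on it, and the failed
 * bio itself carried MD_FAILFAST. Returns true when _md_error() would
 * run, mirroring the helper's return value.
 */
static bool would_run_error_handler(bool last_dev, bool rdev_failfast,
				    bool bio_failfast)
{
	return !(last_dev && rdev_failfast && bio_failfast);
}
```

In particular, a non-failfast bio failure (bio_failfast == false) always runs the handler, so the helper's return value says nothing about whether an earlier bio already marked the rdev Faulty.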
>> /*
>> @@ -2178,8 +2178,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
>> if (test_bit(FailFast, &rdev->flags)) {
>> /* Don't try recovering from here - just fail it
>> * ... unless it is the last working device of course */
>> - md_error(mddev, rdev);
>> - if (test_bit(Faulty, &rdev->flags))
>> + if (md_bio_failure_error(mddev, rdev, bio))
>> /* Don't try to read from here, but make sure
>> * put_buf does it's thing
>> */
>> @@ -2657,9 +2656,8 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
>> static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
>> {
>> struct mddev *mddev = conf->mddev;
>> - struct bio *bio;
>> + struct bio *bio, *updated_bio;
>> struct md_rdev *rdev;
>> - sector_t sector;
>> clear_bit(R1BIO_ReadError, &r1_bio->state);
>> /* we got a read error. Maybe the drive is bad. Maybe just
>> @@ -2672,29 +2670,30 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
>> */
>> bio = r1_bio->bios[r1_bio->read_disk];
>> - bio_put(bio);
>> - r1_bio->bios[r1_bio->read_disk] = NULL;
>> + updated_bio = NULL;
>> rdev = conf->mirrors[r1_bio->read_disk].rdev;
>> - if (mddev->ro == 0
>> - && !test_bit(FailFast, &rdev->flags)) {
>> - freeze_array(conf, 1);
>> - fix_read_error(conf, r1_bio);
>> - unfreeze_array(conf);
>> - } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) {
>> - md_error(mddev, rdev);
>> + if (mddev->ro == 0) {
>> + if (!test_bit(FailFast, &rdev->flags)) {
>> + freeze_array(conf, 1);
>> + fix_read_error(conf, r1_bio);
>> + unfreeze_array(conf);
>> + } else {
>> + md_bio_failure_error(mddev, rdev, bio);
>> + }
>> } else {
>> - r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
>> + updated_bio = IO_BLOCKED;
>> }
>
> I'll suggest a separate patch to cleanup the conditions first, it's
> better for code review.
Thank you for your guidance. I will split the commit.
>
> BTW, I'd prefer an if-else chain instead of nested if-else; perhaps
> following is better:
>
> if (mddev->ro != 0) {
> /* read-only */
> } else if (!test_bit(FailFast, &rdev->flags)) {
> /* read-write and failfast is not set */
> } else {
> /* read-write and failfast is set */
> }
Ah, this looks much more readable. I'll fix it. Thanks.
>> + bio_put(bio);
>> + r1_bio->bios[r1_bio->read_disk] = updated_bio;
>> +
>> rdev_dec_pending(rdev, conf->mddev);
>> - sector = r1_bio->sector;
>> - bio = r1_bio->master_bio;
>> /* Reuse the old r1_bio so that the IO_BLOCKED settings are preserved */
>> r1_bio->state = 0;
>> - raid1_read_request(mddev, bio, r1_bio->sectors, r1_bio);
>> - allow_barrier(conf, sector);
>> + raid1_read_request(mddev, r1_bio->master_bio, r1_bio->sectors, r1_bio);
>> + allow_barrier(conf, r1_bio->sector);
>> }
>> static void raid1d(struct md_thread *thread)
>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>> index dc4edd4689f8..b73af94a88b0 100644
>> --- a/drivers/md/raid10.c
>> +++ b/drivers/md/raid10.c
>> @@ -488,7 +488,7 @@ static void raid10_end_write_request(struct bio *bio)
>> dec_rdev = 0;
>> if (test_bit(FailFast, &rdev->flags) &&
>> (bio->bi_opf & MD_FAILFAST)) {
>> - md_error(rdev->mddev, rdev);
>> + md_bio_failure_error(rdev->mddev, rdev, bio);
>> }
>>
>
> Same as raid1, can the following Faulty check be replaced with the return value?
As with RAID1, non-failfast IO must also be handled here. Therefore, the Faulty
bit should be checked instead of relying on the return value.
>> /*
>> @@ -2443,7 +2443,7 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
>> continue;
>> } else if (test_bit(FailFast, &rdev->flags)) {
>> /* Just give up on this device */
>> - md_error(rdev->mddev, rdev);
>> + md_bio_failure_error(rdev->mddev, rdev, tbio);
>> continue;
>> }
>> /* Ok, we need to write this bio, either to correct an
>> @@ -2895,8 +2895,9 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
>> freeze_array(conf, 1);
>> fix_read_error(conf, mddev, r10_bio);
>> unfreeze_array(conf);
>> - } else
>> - md_error(mddev, rdev);
>> + } else {
>> + md_bio_failure_error(mddev, rdev, bio);
>> + }
>> rdev_dec_pending(rdev, mddev);
>> r10_bio->state = 0;
>>
>
> And please split this patch for raid1 and raid10.
Understood, I will split it.
Thanks,
Akagi
>
> Thanks
> Kuai
>
>
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4 5/9] md/raid1,raid10: Set R{1,10}BIO_Uptodate when successful retry of a failed bio
2025-09-18 6:39 ` Li Nan
@ 2025-09-18 15:36 ` Kenta Akagi
2025-09-19 1:37 ` Li Nan
0 siblings, 1 reply; 39+ messages in thread
From: Kenta Akagi @ 2025-09-18 15:36 UTC (permalink / raw)
To: linan666, song, yukuai3, mtkaczyk, shli, jgq516
Cc: linux-raid, linux-kernel, k
On 2025/09/18 15:39, Li Nan wrote:
>
>
> On 2025/9/17 21:20, Kenta Akagi wrote:
>
>>>> if (!narrow_write_error(r1_bio, m))
>>>> - md_error(conf->mddev,
>>>> - conf->mirrors[m].rdev);
>>>> + md_error(conf->mddev, rdev);
>>>> /* an I/O failed, we can't clear the bitmap */
>>>> - rdev_dec_pending(conf->mirrors[m].rdev,
>>>> - conf->mddev);
>>>> + else if (test_bit(In_sync, &rdev->flags) &&
>>>> + !test_bit(Faulty, &rdev->flags) &&
>>>> + rdev_has_badblock(rdev,
>>>> + r1_bio->sector,
>>>> + r1_bio->sectors) == 0)
>>>
>>> Clear badblock and set R10BIO_Uptodate if rdev has badblock.
>>
>> narrow_write_error returns true when the write succeeds, or when the write
>> fails but rdev_set_badblocks succeeds. Here, it determines that the re-write
>> succeeded if there is no badblock in the sector to be written by r1_bio.
>> So we should not call rdev_clear_badblocks here.
>>
>
> I am trying to cleanup narrow_write_error():
>
> https://lore.kernel.org/linux-raid/20250917093508.456790-3-linan666@huaweicloud.com/T/#u
>
> It may be clearer if narrow_write_error() returns true when all fix IO
> succeeds.
>
> ```
> @@ -2553,11 +2551,17 @@ static bool narrow_write_error(struct r1bio *r1_bio, int i)
> bio_trim(wbio, sector - r1_bio->sector, sectors);
> wbio->bi_iter.bi_sector += rdev->data_offset;
>
> - if (submit_bio_wait(wbio) < 0)
> - /* failure! */
> - ok = rdev_set_badblocks(rdev, sector,
> - sectors, 0)
> - && ok;
> + if (submit_bio_wait(wbio) < 0) {
> + ok = false;
> + if (rdev_set_badblocks(rdev, sector, sectors, 0)) {
> + /*
> + * Badblocks set failed, disk marked Faulty.
> + * No further operations needed.
> + */
> + bio_put(wbio);
> + break;
> + }
> + }
>
> bio_put(wbio);
> sect_to_write -= sectors;
> ```
>
> We can clear badblocks and set R10BIO_Uptodate after it. What do you think?
Thanks for the detailed explanation.
narrow_write_error now returns whether the retried write succeeded.
If rdev_set_badblocks fails, narrow_write_error calls md_error. This looks good to me.
And after the narrow_write_error cleanup, rdev_clear_badblocks should be called.
BTW, I'm not familiar with the patch workflow. Should I submit my
handle_write_finished patch separately after your cleanup patch is merged?
Thanks,
Akagi
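The contract agreed on in this exchange, where narrow_write_error() returns true only when every retried chunk actually wrote and the device is failed when even setting a badblock fails, can be sketched in userspace C. This is a hedged illustration: `model_narrow_write_error` and its fields are stand-ins, not the kernel implementation.

```c
#include <stdbool.h>

/* Sketch of the proposed narrow_write_error() contract: return true only
 * when every retried chunk actually wrote. A failed chunk gets a badblock
 * recorded; if recording fails, the device must be failed (the md_error()
 * case) and no further chunks are attempted. */

enum chunk_result { CHUNK_WRITE_OK, CHUNK_WRITE_FAILED };

struct retry_model {
	bool badblock_works;  /* would rdev_set_badblocks() succeed? */
	bool device_failed;   /* failing the device became necessary */
};

static bool model_narrow_write_error(struct retry_model *m,
				     const enum chunk_result *chunks, int n)
{
	bool all_ok = true;

	for (int i = 0; i < n; i++) {
		if (chunks[i] != CHUNK_WRITE_OK) {
			all_ok = false;
			if (!m->badblock_works) {
				m->device_failed = true; /* badblocks set failed */
				break;                   /* no further operations needed */
			}
		}
	}
	return all_ok;
}
```

With this shape the caller can clear badblocks and set R{1,10}BIO_Uptodate when the function returns true, as discussed.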
> --
> Thanks,
> Nan
>
>
* Re: [PATCH v4 7/9] md/raid10: fix failfast read error not rescheduled
2025-09-18 7:38 ` Li Nan
@ 2025-09-18 16:12 ` Kenta Akagi
2025-09-19 1:20 ` Li Nan
0 siblings, 1 reply; 39+ messages in thread
From: Kenta Akagi @ 2025-09-18 16:12 UTC (permalink / raw)
To: linan666, song, yukuai3, mtkaczyk, shli, jgq516
Cc: linux-raid, linux-kernel, k
On 2025/09/18 16:38, Li Nan wrote:
>
>
>> On 2025/9/15 11:42, Kenta Akagi wrote:
>> raid10_end_read_request lacks a path to retry when a FailFast IO fails.
>> As a result, when Failfast Read IOs fail on all rdevs, the upper layer
>> receives EIO, without the read being rescheduled.
>>
>> Looking at the two commits below, it seems only raid10_end_read_request
>> lacks the failfast read retry handling, while raid1_end_read_request has
>> it. In RAID1, the retry works as expected.
>> * commit 8d3ca83dcf9c ("md/raid10: add failfast handling for reads.")
>> * commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
>>
>> I don't know why raid10_end_read_request lacks this, but it is probably
>> just a simple oversight.
>
> Agreed, these two lines can be removed.
I will revise the commit message.
>
> Other than that, LGTM.
>
> Reviewed-by: Li Nan <linan122@huawei.com>
Thank you. However, there is a WARNING due to the comment format that needs to be fixed.
I also received a failure email from the RAID CI system.
------------------------------------------------------------------------
patch-v4/v4-0007-md-raid10-fix-failfast-read-error-not-rescheduled.patch
------------------------------------------------------------------------
WARNING: Block comments use a trailing */ on a separate line
#39: FILE: drivers/md/raid10.c:405:
+ * want to retry */
total: 0 errors, 1 warnings, 11 lines checked
I will apply the corrections below and resubmit as v5.
Is it okay to add a Reviewed-by tag in this case?
Sorry to bother you.
+ } else if (test_bit(FailFast, &rdev->flags) &&
+ test_bit(R10BIO_FailFast, &r10_bio->state)) {
+ /* This was a fail-fast read so we definitely
+ * want to retry
+ */
+ ;
Thanks,
Akagi
>
>>
>> This commit will make the failfast read bio for the last rdev in raid10
>> retry if it fails.
>>
>> Fixes: 8d3ca83dcf9c ("md/raid10: add failfast handling for reads.")
>> Signed-off-by: Kenta Akagi <k@mgml.me>
>> ---
>> drivers/md/raid10.c | 5 +++++
>> 1 file changed, 5 insertions(+)
>>
>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>> index 92cf3047dce6..86c0eacd37cb 100644
>> --- a/drivers/md/raid10.c
>> +++ b/drivers/md/raid10.c
>> @@ -399,6 +399,11 @@ static void raid10_end_read_request(struct bio *bio)
>> * wait for the 'master' bio.
>> */
>> set_bit(R10BIO_Uptodate, &r10_bio->state);
>> + } else if (test_bit(FailFast, &rdev->flags) &&
>> + test_bit(R10BIO_FailFast, &r10_bio->state)) {
>> + /* This was a fail-fast read so we definitely
>> + * want to retry */
>> + ;
>> } else if (!raid1_should_handle_error(bio)) {
>> uptodate = 1;
>> } else {
>
> --
> Thanks,
> Nan
>
>
* Re: [PATCH v4 6/9] md/raid1,raid10: Fix missing retries Failfast write bios on no-bbl rdevs
2025-09-18 6:58 ` Li Nan
@ 2025-09-18 16:23 ` Kenta Akagi
2025-09-19 1:28 ` Li Nan
0 siblings, 1 reply; 39+ messages in thread
From: Kenta Akagi @ 2025-09-18 16:23 UTC (permalink / raw)
To: linan666, song, yukuai3, mtkaczyk, shli, jgq516
Cc: linux-raid, linux-kernel, k
On 2025/09/18 15:58, Li Nan wrote:
>
>
> On 2025/9/17 21:33, Kenta Akagi wrote:
>>
>>
>> On 2025/09/17 19:06, Li Nan wrote:
>>>
>>>
>>> On 2025/9/15 11:42, Kenta Akagi wrote:
>>>> In the current implementation, write failures are not retried on rdevs
>>>> with badblocks disabled. This is because narrow_write_error, which issues
>>>> retry bios, immediately returns when badblocks are disabled. As a result,
>>>> a single write failure on such an rdev will immediately mark it as Faulty.
>>>>
>>>
>>> IMO, there's no need to add extra logic for scenarios where badblocks
>>> is not enabled. Do you have real-world scenarios where badblocks is
>>> disabled?
>>
>> No, badblocks are enabled in my environment.
>> I'm fine with leaving it out, but I still think it's worth adding a WARN_ON like:
>>
>> @@ -2553,13 +2554,17 @@ static bool narrow_write_error(struct r1bio *r1_bio, int i)
>> fail = true;
>> + WARN_ON((bio->bi_opf & MD_FAILFAST) && (rdev->badblocks.shift < 0));
>> if (!narrow_write_error(r1_bio, m))
>>
>> What do you think?
>>
>
> How about this?
>
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -2522,10 +2522,11 @@ static bool narrow_write_error(struct r1bio *r1_bio, int i)
> bool ok = true;
>
> if (rdev->badblocks.shift < 0)
> - return false;
> + block_sectors = bdev_logical_block_size(rdev->bdev) >> 9;
> + else
> + block_sectors = roundup(1 << rdev->badblocks.shift,
> + bdev_logical_block_size(rdev->bdev) >> 9);
>
> - block_sectors = roundup(1 << rdev->badblocks.shift,
> - bdev_logical_block_size(rdev->bdev) >> 9);
> sector = r1_bio->sector;
> sectors = ((sector + block_sectors)
> & ~(sector_t)(block_sectors - 1))
>
> rdev_set_badblocks() checks shift, too. rdev is marked to Faulty if setting
> badblocks fails.
Sounds good. If badblocks are disabled, rdev_set_badblocks() returns false, so there
should be no problem.
Can this be included in the cleanup?
https://lore.kernel.org/linux-raid/20250917093508.456790-3-linan666@huaweicloud.com/T/#u
or should I resend this patch as proposed?
Thanks,
Akagi
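The fallback Li Nan proposes above can be sketched as a standalone userspace helper. This is an illustration of the arithmetic only; `model_block_sectors` is a hypothetical name, and the kernel hunk uses roundup() and bdev_logical_block_size() directly.

```c
/* Sector counts are 512-byte units. With badblocks disabled
 * (shift < 0), fall back to one logical block; otherwise round the
 * badblocks granularity up to a multiple of the logical block size,
 * mirroring roundup(1 << shift, bdev_logical_block_size(bdev) >> 9). */
static unsigned long model_block_sectors(int bb_shift,
					 unsigned int logical_block_bytes)
{
	unsigned long lbs = logical_block_bytes >> 9; /* bytes -> sectors */

	if (bb_shift < 0)
		return lbs;
	return ((1UL << bb_shift) + lbs - 1) / lbs * lbs;
}
```

For a 4096-byte logical block with badblocks disabled this yields 8 sectors, i.e. the retry granularity degrades to one logical block instead of bailing out entirely.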
>
>
>>
>> Thanks,
>> Akagi
>>
>>>> The retry mechanism appears to have been implemented under the assumption
>>>> that a bad block is involved in the failure. However, the retry after
>>>> MD_FAILFAST write failure depends on this code, and a Failfast write request
>>>> may fail for reasons unrelated to bad blocks.
>>>>
>>>> Consequently, if failfast is enabled and badblocks are disabled on all
>>>> rdevs, and all rdevs encounter a failfast write bio failure at the same
>>>> time, no retries will occur and the entire array can be lost.
>>>>
>>>> This commit adds a path in narrow_write_error to retry writes even on rdevs
>>>> where bad blocks are disabled, and failed bios marked with MD_FAILFAST will
>>>> use this path. For non-failfast cases, the behavior remains unchanged: no
>>>> retry writes are attempted to rdevs with bad blocks disabled.
>>>>
>>>> Fixes: 1919cbb23bf1 ("md/raid10: add failfast handling for writes.")
>>>> Fixes: 212e7eb7a340 ("md/raid1: add failfast handling for writes.")
>>>> Signed-off-by: Kenta Akagi <k@mgml.me>
>>>> ---
>>>> drivers/md/raid1.c | 44 +++++++++++++++++++++++++++++---------------
>>>> drivers/md/raid10.c | 37 ++++++++++++++++++++++++-------------
>>>> 2 files changed, 53 insertions(+), 28 deletions(-)
>>>> !test_bit(Faulty, &rdev->flags) &&
>>>
>>> --
>>> Thanks,
>>> Nan
>>>
>>>
>>
>>
>>
>> .
>
> --
> Thanks,
> Nan
>
>
* Re: [PATCH v4 7/9] md/raid10: fix failfast read error not rescheduled
2025-09-18 16:12 ` Kenta Akagi
@ 2025-09-19 1:20 ` Li Nan
0 siblings, 0 replies; 39+ messages in thread
From: Li Nan @ 2025-09-19 1:20 UTC (permalink / raw)
To: Kenta Akagi, linan666, song, yukuai3, mtkaczyk, shli, jgq516
Cc: linux-raid, linux-kernel
On 2025/9/19 0:12, Kenta Akagi wrote:
>
>
> On 2025/09/18 16:38, Li Nan wrote:
>>
>>
>> On 2025/9/15 11:42, Kenta Akagi wrote:
>>> raid10_end_read_request lacks a path to retry when a FailFast IO fails.
>>> As a result, when Failfast Read IOs fail on all rdevs, the upper layer
>>> receives EIO, without the read being rescheduled.
>>>
>>> Looking at the two commits below, it seems only raid10_end_read_request
>>> lacks the failfast read retry handling, while raid1_end_read_request has
>>> it. In RAID1, the retry works as expected.
>>> * commit 8d3ca83dcf9c ("md/raid10: add failfast handling for reads.")
>>> * commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
>>>
>>> I don't know why raid10_end_read_request lacks this, but it is probably
>>> just a simple oversight.
>>
>> Agreed, these two lines can be removed.
>
> I will revise the commit message.
>
>>
>> Other than that, LGTM.
>>
>> Reviewed-by: Li Nan <linan122@huawei.com>
>
> Thank you. However, there is a WARNING due to the comment format that needs to be fixed.
> I also received a failure email from the RAID CI system.
>
> ------------------------------------------------------------------------
> patch-v4/v4-0007-md-raid10-fix-failfast-read-error-not-rescheduled.patch
> ------------------------------------------------------------------------
> WARNING: Block comments use a trailing */ on a separate line
> #39: FILE: drivers/md/raid10.c:405:
> + * want to retry */
>
> total: 0 errors, 1 warnings, 11 lines checked
>
>
> I will apply the corrections below and resubmit as v5.
> Is it okay to add a Reviewed-by tag in this case?
> Sorry to bother you.
Yes, please feel free to add it.
>
> + } else if (test_bit(FailFast, &rdev->flags) &&
> + test_bit(R10BIO_FailFast, &r10_bio->state)) {
> + /* This was a fail-fast read so we definitely
/*
* This was ...
*/
This way is better.
> + * want to retry
> + */
> + ;
>
> Thanks,
> Akagi
>
>>
>>>
>>> This commit will make the failfast read bio for the last rdev in raid10
>>> retry if it fails.
>>>
>>> Fixes: 8d3ca83dcf9c ("md/raid10: add failfast handling for reads.")
>>> Signed-off-by: Kenta Akagi <k@mgml.me>
>>> ---
>>> drivers/md/raid10.c | 5 +++++
>>> 1 file changed, 5 insertions(+)
>>>
>>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>>> index 92cf3047dce6..86c0eacd37cb 100644
>>> --- a/drivers/md/raid10.c
>>> +++ b/drivers/md/raid10.c
>>> @@ -399,6 +399,11 @@ static void raid10_end_read_request(struct bio *bio)
>>> * wait for the 'master' bio.
>>> */
>>> set_bit(R10BIO_Uptodate, &r10_bio->state);
>>> + } else if (test_bit(FailFast, &rdev->flags) &&
>>> + test_bit(R10BIO_FailFast, &r10_bio->state)) {
>>> + /* This was a fail-fast read so we definitely
>>> + * want to retry */
>>> + ;
>>> } else if (!raid1_should_handle_error(bio)) {
>>> uptodate = 1;
>>> } else {
>>
>> --
>> Thanks,
>> Nan
>>
>>
>
>
> .
--
Thanks,
Nan
* Re: [PATCH v4 6/9] md/raid1,raid10: Fix missing retries Failfast write bios on no-bbl rdevs
2025-09-18 16:23 ` Kenta Akagi
@ 2025-09-19 1:28 ` Li Nan
0 siblings, 0 replies; 39+ messages in thread
From: Li Nan @ 2025-09-19 1:28 UTC (permalink / raw)
To: Kenta Akagi, linan666, song, yukuai3, mtkaczyk, shli, jgq516
Cc: linux-raid, linux-kernel
在 2025/9/19 0:23, Kenta Akagi 写道:
>
>
> On 2025/09/18 15:58, Li Nan wrote:
>>
>>
>> On 2025/9/17 21:33, Kenta Akagi wrote:
>>>
>>>
>>> On 2025/09/17 19:06, Li Nan wrote:
>>>>
>>>>
>>>> On 2025/9/15 11:42, Kenta Akagi wrote:
>>>>> In the current implementation, write failures are not retried on rdevs
>>>>> with badblocks disabled. This is because narrow_write_error, which issues
>>>>> retry bios, immediately returns when badblocks are disabled. As a result,
>>>>> a single write failure on such an rdev will immediately mark it as Faulty.
>>>>>
>>>>
>>>> IMO, there's no need to add extra logic for scenarios where badblocks
>>>> is not enabled. Do you have real-world scenarios where badblocks is
>>>> disabled?
>>>
>>> No, badblocks are enabled in my environment.
>>> I'm fine with leaving it out, but I still think it's worth adding a WARN_ON like:
>>>
>>> @@ -2553,13 +2554,17 @@ static bool narrow_write_error(struct r1bio *r1_bio, int i)
>>> fail = true;
>>> + WARN_ON((bio->bi_opf & MD_FAILFAST) && (rdev->badblocks.shift < 0));
>>> if (!narrow_write_error(r1_bio, m))
>>>
>>> What do you think?
>>>
>>
>> How about this?
>>
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -2522,10 +2522,11 @@ static bool narrow_write_error(struct r1bio *r1_bio, int i)
>> bool ok = true;
>>
>> if (rdev->badblocks.shift < 0)
>> - return false;
>> + block_sectors = bdev_logical_block_size(rdev->bdev) >> 9;
>> + else
>> + block_sectors = roundup(1 << rdev->badblocks.shift,
>> + bdev_logical_block_size(rdev->bdev) >> 9);
>>
>> - block_sectors = roundup(1 << rdev->badblocks.shift,
>> - bdev_logical_block_size(rdev->bdev) >> 9);
>> sector = r1_bio->sector;
>> sectors = ((sector + block_sectors)
>> & ~(sector_t)(block_sectors - 1))
>>
>> rdev_set_badblocks() checks shift, too. rdev is marked to Faulty if setting
>> badblocks fails.
>
> Sounds good. If badblocks are disabled, rdev_set_badblocks() returns false, so there
> should be no problem.
>
> Can this be included in the cleanup?
> https://lore.kernel.org/linux-raid/20250917093508.456790-3-linan666@huaweicloud.com/T/#u
>
> or should I resend this patch as proposed?
Yes, please resend. I will rewrite my patch.
--
Thanks,
Nan
* Re: [PATCH v4 4/9] md/raid1,raid10: Don't set MD_BROKEN on failfast bio failure
2025-09-18 15:22 ` Kenta Akagi
@ 2025-09-19 1:36 ` Yu Kuai
2025-09-20 6:30 ` Kenta Akagi
0 siblings, 1 reply; 39+ messages in thread
From: Yu Kuai @ 2025-09-19 1:36 UTC (permalink / raw)
To: Kenta Akagi, yukuai1, song, mtkaczyk, shli, jgq516
Cc: linux-raid, linux-kernel, yukuai3
Hi,
On 2025/9/18 23:22, Kenta Akagi wrote:
>>> @@ -470,7 +470,7 @@ static void raid1_end_write_request(struct bio *bio)
>>> (bio->bi_opf & MD_FAILFAST) &&
>>> /* We never try FailFast to WriteMostly devices */
>>> !test_bit(WriteMostly, &rdev->flags)) {
>>> - md_error(r1_bio->mddev, rdev);
>>> + md_bio_failure_error(r1_bio->mddev, rdev, bio);
>>> }
>> Can the following Faulty check be replaced with the return value?
> In the case where raid1_end_write_request is called for a non-failfast IO,
> and the rdev has already been marked Faulty by another bio, the write must not be retried either.
> I think it would be simpler not to use a return value here.
You can just add the Faulty check inside md_bio_failure_error() as well, and
the failfast and writemostly checks too.
Thanks,
Kuai
* Re: [PATCH v4 5/9] md/raid1,raid10: Set R{1,10}BIO_Uptodate when successful retry of a failed bio
2025-09-18 15:36 ` Kenta Akagi
@ 2025-09-19 1:37 ` Li Nan
0 siblings, 0 replies; 39+ messages in thread
From: Li Nan @ 2025-09-19 1:37 UTC (permalink / raw)
To: Kenta Akagi, linan666, song, yukuai3, mtkaczyk, shli, jgq516
Cc: linux-raid, linux-kernel
On 2025/9/18 23:36, Kenta Akagi wrote:
>
>
> On 2025/09/18 15:39, Li Nan wrote:
>>
>>
>> On 2025/9/17 21:20, Kenta Akagi wrote:
>>
>>>>> if (!narrow_write_error(r1_bio, m))
>>>>> - md_error(conf->mddev,
>>>>> - conf->mirrors[m].rdev);
>>>>> + md_error(conf->mddev, rdev);
>>>>> /* an I/O failed, we can't clear the bitmap */
>>>>> - rdev_dec_pending(conf->mirrors[m].rdev,
>>>>> - conf->mddev);
>>>>> + else if (test_bit(In_sync, &rdev->flags) &&
>>>>> + !test_bit(Faulty, &rdev->flags) &&
>>>>> + rdev_has_badblock(rdev,
>>>>> + r1_bio->sector,
>>>>> + r1_bio->sectors) == 0)
>>>>
>>>> Clear badblock and set R10BIO_Uptodate if rdev has badblock.
>>>
>>> narrow_write_error returns true when the write succeeds, or when the write
>>> fails but rdev_set_badblocks succeeds. Here, it determines that the re-write
>>> succeeded if there is no badblock in the sector to be written by r1_bio.
>>> So we should not call rdev_clear_badblocks here.
>>>
>>
>> I am trying to cleanup narrow_write_error():
>>
>> https://lore.kernel.org/linux-raid/20250917093508.456790-3-linan666@huaweicloud.com/T/#u
>>
>> It may be clearer if narrow_write_error() returns true when all fix IO
>> succeeds.
>>
>> ```
>> @@ -2553,11 +2551,17 @@ static bool narrow_write_error(struct r1bio *r1_bio, int i)
>> bio_trim(wbio, sector - r1_bio->sector, sectors);
>> wbio->bi_iter.bi_sector += rdev->data_offset;
>>
>> - if (submit_bio_wait(wbio) < 0)
>> - /* failure! */
>> - ok = rdev_set_badblocks(rdev, sector,
>> - sectors, 0)
>> - && ok;
>> + if (submit_bio_wait(wbio) < 0) {
>> + ok = false;
>> + if (rdev_set_badblocks(rdev, sector, sectors, 0)) {
>> + /*
>> + * Badblocks set failed, disk marked Faulty.
>> + * No further operations needed.
>> + */
>> + bio_put(wbio);
>> + break;
>> + }
>> + }
>>
>> bio_put(wbio);
>> sect_to_write -= sectors;
>> ```
>>
>> We can clear badblocks and set R10BIO_Uptodate after it. What do you think?
>
> Thanks for the detailed explanation.
> narrow_write_error now returns whether the retried write succeeded.
> If rdev_set_badblocks fails, narrow_write_error calls md_error. This looks good to me.
> And after the narrow_write_error cleanup, rdev_clear_badblocks should be called.
>
> BTW, I'm not familiar with the patch workflow. Should I submit my
> handle_write_finished patch separately after your cleanup patch is merged?
>
Sure, please send it separately after my cleanup patch. Thanks.
> Thanks,
> Akagi
>
>> --
>> Thanks,
>> Nan
>>
>>
>
>
>
> .
--
Thanks,
Nan
* Re: [PATCH v4 4/9] md/raid1,raid10: Don't set MD_BROKEN on failfast bio failure
2025-09-19 1:36 ` Yu Kuai
@ 2025-09-20 6:30 ` Kenta Akagi
2025-09-20 9:51 ` Yu Kuai
0 siblings, 1 reply; 39+ messages in thread
From: Kenta Akagi @ 2025-09-20 6:30 UTC (permalink / raw)
To: Yu Kuai, yukuai1, song, mtkaczyk, shli, jgq516
Cc: linux-raid, linux-kernel, yukuai3, k
Hi,
I have changed my email address because our primary MX server
suddenly started rejecting non-DKIM mail.
On 2025/09/19 10:36, Yu Kuai wrote:
> Hi,
>
> On 2025/9/18 23:22, Kenta Akagi wrote:
>>>> @@ -470,7 +470,7 @@ static void raid1_end_write_request(struct bio *bio)
>>>> (bio->bi_opf & MD_FAILFAST) &&
>>>> /* We never try FailFast to WriteMostly devices */
>>>> !test_bit(WriteMostly, &rdev->flags)) {
>>>> - md_error(r1_bio->mddev, rdev);
>>>> + md_bio_failure_error(r1_bio->mddev, rdev, bio);
>>>> }
>>> Can the following Faulty check be replaced with the return value?
>> In the case where raid1_end_write_request is called for a non-failfast IO,
>> and the rdev has already been marked Faulty by another bio, the write must not be retried either.
>> I think it would be simpler not to use a return value here.
>
> You can just add Faulty check inside md_bio_failure_error() as well, and both
> failfast and writemostly check.
Sorry, I'm not sure I understand this part.
In raid1_end_write_request, this code path is also used for a regular bio,
not only for FailFast.
You mean to change md_bio_failure_error as follows:
* If the rdev is Faulty, immediately return true.
* If the given bio is Failfast and the rdev is not the lastdev, call md_error.
* If the given bio is not Failfast, do nothing and return false.
And then apply this?
This is complicated. Wouldn't it be better to keep the Faulty check as it is?
@@ -466,18 +466,12 @@ static void raid1_end_write_request(struct bio *bio)
set_bit(MD_RECOVERY_NEEDED, &
conf->mddev->recovery);
- if (test_bit(FailFast, &rdev->flags) &&
- (bio->bi_opf & MD_FAILFAST) &&
- /* We never try FailFast to WriteMostly devices */
- !test_bit(WriteMostly, &rdev->flags)) {
- md_error(r1_bio->mddev, rdev);
- }
-
/*
* When the device is faulty, it is not necessary to
* handle write error.
*/
- if (!test_bit(Faulty, &rdev->flags))
+ if (!test_bit(Faulty, &rdev->flags) ||
+ !md_bio_failure_error(r1_bio->mddev, rdev, bio))
set_bit(R1BIO_WriteError, &r1_bio->state);
else {
/* Finished with this branch */
Or do you mean a fix like this?
@@ -466,23 +466,24 @@ static void raid1_end_write_request(struct bio *bio)
set_bit(MD_RECOVERY_NEEDED, &
conf->mddev->recovery);
- if (test_bit(FailFast, &rdev->flags) &&
- (bio->bi_opf & MD_FAILFAST) &&
- /* We never try FailFast to WriteMostly devices */
- !test_bit(WriteMostly, &rdev->flags)) {
- md_error(r1_bio->mddev, rdev);
- }
-
/*
* When the device is faulty, it is not necessary to
* handle write error.
*/
- if (!test_bit(Faulty, &rdev->flags))
- set_bit(R1BIO_WriteError, &r1_bio->state);
- else {
+ if (test_bit(Faulty, &rdev->flags) ||
+ (
+ test_bit(FailFast, &rdev->flags) &&
+ (bio->bi_opf & MD_FAILFAST) &&
+ /* We never try FailFast to WriteMostly devices */
+ !test_bit(WriteMostly, &rdev->flags) &&
+ md_bio_failure_error(r1_bio->mddev, rdev, bio)
+ )
+ ) {
/* Finished with this branch */
r1_bio->bios[mirror] = NULL;
to_put = bio;
+ } else {
+ set_bit(R1BIO_WriteError, &r1_bio->state);
}
} else {
/*
Thanks,
Akagi
> Thanks,
> Kuai
>
>
>
* Re: [PATCH v4 4/9] md/raid1,raid10: Don't set MD_BROKEN on failfast bio failure
2025-09-20 6:30 ` Kenta Akagi
@ 2025-09-20 9:51 ` Yu Kuai
2025-09-23 15:54 ` Kenta Akagi
0 siblings, 1 reply; 39+ messages in thread
From: Yu Kuai @ 2025-09-20 9:51 UTC (permalink / raw)
To: Kenta Akagi, Yu Kuai, yukuai1, song, mtkaczyk, shli, jgq516
Cc: linux-raid, linux-kernel, yukuai3
Hi,
On 2025/9/20 14:30, Kenta Akagi wrote:
> Hi,
>
> I have changed my email address because our primary MX server
> suddenly started rejecting non-DKIM mail.
>
> On 2025/09/19 10:36, Yu Kuai wrote:
>> Hi,
>>
>> On 2025/9/18 23:22, Kenta Akagi wrote:
>>>>> @@ -470,7 +470,7 @@ static void raid1_end_write_request(struct bio *bio)
>>>>> (bio->bi_opf & MD_FAILFAST) &&
>>>>> /* We never try FailFast to WriteMostly devices */
>>>>> !test_bit(WriteMostly, &rdev->flags)) {
>>>>> - md_error(r1_bio->mddev, rdev);
>>>>> + md_bio_failure_error(r1_bio->mddev, rdev, bio);
>>>>> }
>>>> Can the following Faulty check be replaced with the return value?
>>> In the case where raid1_end_write_request is called for a non-failfast IO,
>>> and the rdev has already been marked Faulty by another bio, the write must not be retried either.
>>> I think it would be simpler not to use a return value here.
>> You can just add the Faulty check inside md_bio_failure_error() as well, and
>> the failfast and writemostly checks too.
> Sorry, I'm not sure I understand this part.
> In raid1_end_write_request, this code path is also used for a regular bio,
> not only for FailFast.
>
> You mean to change md_bio_failure_error as follows:
> * If the rdev is Faulty, immediately return true.
> * If the given bio is Failfast and the rdev is not the lastdev, call md_error.
> * If the given bio is not Failfast, do nothing and return false.
Yes, doesn't that apply to all the callers?
>
> And then apply this?
> This is complicated. Wouldn't it be better to keep the Faulty check as it is?
>
> @@ -466,18 +466,12 @@ static void raid1_end_write_request(struct bio *bio)
> set_bit(MD_RECOVERY_NEEDED, &
> conf->mddev->recovery);
>
> - if (test_bit(FailFast, &rdev->flags) &&
> - (bio->bi_opf & MD_FAILFAST) &&
> - /* We never try FailFast to WriteMostly devices */
> - !test_bit(WriteMostly, &rdev->flags)) {
> - md_error(r1_bio->mddev, rdev);
> - }
> -
> /*
> * When the device is faulty, it is not necessary to
> * handle write error.
> */
> - if (!test_bit(Faulty, &rdev->flags))
> + if (!test_bit(Faulty, &rdev->flags) ||
> + !md_bio_failure_error(r1_bio->mddev, rdev, bio))
> set_bit(R1BIO_WriteError, &r1_bio->state);
> else {
> /* Finished with this branch */
Faulty is set with the lock held, so checking Faulty with the lock held as well can
prevent the rdev from becoming Faulty concurrently, and this check can be added to all
callers, I think.
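The locking argument can be illustrated with a small pthread model. This is a hedged sketch: the names stand in for device_lock, the Faulty bit, and the proposed helper, not the actual md API.

```c
#include <pthread.h>
#include <stdbool.h>

struct mddev_model {
	pthread_mutex_t lock;  /* stands in for device_lock */
	int working_disks;
};

struct rdev_state {
	bool faulty;
};

/* The Faulty check and the Faulty transition happen under the same lock,
 * so no caller can race a concurrent error handler, and the working-disk
 * count is decremented exactly once per device. */
static bool model_error_helper(struct mddev_model *mddev,
			       struct rdev_state *rdev, bool fail_it)
{
	bool faulty;

	pthread_mutex_lock(&mddev->lock);
	if (fail_it && !rdev->faulty) {
		rdev->faulty = true;
		mddev->working_disks--;
	}
	faulty = rdev->faulty;
	pthread_mutex_unlock(&mddev->lock);
	return faulty;
}
```

Calling the helper twice for the same device only decrements the working-disk count once, which is the serialization property the thread is after.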
>
> Or do you mean a fix like this?
>
> @@ -466,23 +466,24 @@ static void raid1_end_write_request(struct bio *bio)
> set_bit(MD_RECOVERY_NEEDED, &
> conf->mddev->recovery);
>
> - if (test_bit(FailFast, &rdev->flags) &&
> - (bio->bi_opf & MD_FAILFAST) &&
> - /* We never try FailFast to WriteMostly devices */
> - !test_bit(WriteMostly, &rdev->flags)) {
> - md_error(r1_bio->mddev, rdev);
> - }
> -
> /*
> * When the device is faulty, it is not necessary to
> * handle write error.
> */
> - if (!test_bit(Faulty, &rdev->flags))
> - set_bit(R1BIO_WriteError, &r1_bio->state);
> - else {
> + if (test_bit(Faulty, &rdev->flags) ||
> + (
> + test_bit(FailFast, &rdev->flags) &&
> + (bio->bi_opf & MD_FAILFAST) &&
> + /* We never try FailFast to WriteMostly devices */
> + !test_bit(WriteMostly, &rdev->flags) &&
> + md_bio_failure_error(r1_bio->mddev, rdev, bio)
> + )
> + ) {
> /* Finished with this branch */
> r1_bio->bios[mirror] = NULL;
> to_put = bio;
> + } else {
> + set_bit(R1BIO_WriteError, &r1_bio->state);
> }
> } else {
> /*
No, this just makes the code even more unreadable.
Thanks,
Kuai
> Thanks,
> Akagi
>
>> Thanks,
>> Kuai
>>
>>
>>
* Re: [PATCH v4 2/9] md: serialize md_error()
2025-09-18 1:04 ` Yu Kuai
@ 2025-09-21 6:11 ` Kenta Akagi
0 siblings, 0 replies; 39+ messages in thread
From: Kenta Akagi @ 2025-09-21 6:11 UTC (permalink / raw)
To: Yu Kuai, Song Liu, Mariusz Tkaczyk, Shaohua Li, Guoqing Jiang
Cc: linux-raid, linux-kernel, yukuai (C), Kenta Akagi
On 2025/09/18 10:04, Yu Kuai wrote:
> Hi,
>
>> On 2025/09/15 11:42, Kenta Akagi wrote:
>> md_error is mainly called when a bio fails, so it can run in parallel.
>> Each personality’s error_handler locks with device_lock, so concurrent
>> calls are safe.
>>
>> However, RAID1 and RAID10 require changes for Failfast bio error handling,
>> which needs a special helper for md_error. For that helper to work, the
>> regular md_error must also be serialized.
>>
>> The helper function md_bio_failure_error for failfast will be introduced
>> in a subsequent commit.
>>
>> This commit serializes md_error for all RAID personalities. While
>> unnecessary for RAID levels other than 1 and 10, it has no performance
>> impact as it is a cold path.
>>
>> Signed-off-by: Kenta Akagi <k@mgml.me>
>> ---
>> drivers/md/md.c | 10 +++++++++-
>> drivers/md/md.h | 4 ++++
>> 2 files changed, 13 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 268410b66b83..5607578a6db9 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -705,6 +705,7 @@ int mddev_init(struct mddev *mddev)
>> atomic_set(&mddev->openers, 0);
>> atomic_set(&mddev->sync_seq, 0);
>> spin_lock_init(&mddev->lock);
>> + spin_lock_init(&mddev->error_handle_lock);
>
> Instead of introducing a new lock, can we use device_lock directly?
> It's held inside pers->error_handler() now; just move it forward
> to md_error().
It seems possible. In all personalities, neither the caller nor the callee
of md_error() appears to depend on device_lock.
I will move device_lock into mddev and use it.
Thanks,
Akagi
>
> Thanks,
> Kuai
>
>> init_waitqueue_head(&mddev->sb_wait);
>> init_waitqueue_head(&mddev->recovery_wait);
>> mddev->reshape_position = MaxSector;
>> @@ -8262,7 +8263,7 @@ void md_unregister_thread(struct mddev *mddev, struct md_thread __rcu **threadp)
>> }
>> EXPORT_SYMBOL(md_unregister_thread);
>> -void md_error(struct mddev *mddev, struct md_rdev *rdev)
>> +void _md_error(struct mddev *mddev, struct md_rdev *rdev)
>> {
>> if (!rdev || test_bit(Faulty, &rdev->flags))
>> return;
>> @@ -8287,6 +8288,13 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
>> queue_work(md_misc_wq, &mddev->event_work);
>> md_new_event();
>> }
>> +
>> +void md_error(struct mddev *mddev, struct md_rdev *rdev)
>> +{
>> + spin_lock(&mddev->error_handle_lock);
>> + _md_error(mddev, rdev);
>> + spin_unlock(&mddev->error_handle_lock);
>> +}
>> EXPORT_SYMBOL(md_error);
>> /* seq_file implementation /proc/mdstat */
>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>> index ec598f9a8381..5177cb609e4b 100644
>> --- a/drivers/md/md.h
>> +++ b/drivers/md/md.h
>> @@ -619,6 +619,9 @@ struct mddev {
>> /* The sequence number for sync thread */
>> atomic_t sync_seq;
>> + /* Lock for serializing md_error */
>> + spinlock_t error_handle_lock;
>> +
>> bool has_superblocks:1;
>> bool fail_last_dev:1;
>> bool serialize_policy:1;
>> @@ -901,6 +904,7 @@ extern void md_write_start(struct mddev *mddev, struct bio *bi);
>> extern void md_write_inc(struct mddev *mddev, struct bio *bi);
>> extern void md_write_end(struct mddev *mddev);
>> extern void md_done_sync(struct mddev *mddev, int blocks, int ok);
>> +void _md_error(struct mddev *mddev, struct md_rdev *rdev);
>> extern void md_error(struct mddev *mddev, struct md_rdev *rdev);
>> extern void md_finish_reshape(struct mddev *mddev);
>> void md_submit_discard_bio(struct mddev *mddev, struct md_rdev *rdev,
>>
>
>
* Re: [PATCH v4 1/9] md/raid1,raid10: Set the LastDev flag when the configuration changes
2025-09-15 3:42 ` [PATCH v4 1/9] md/raid1,raid10: Set the LastDev flag when the configuration changes Kenta Akagi
2025-09-18 1:00 ` Yu Kuai
@ 2025-09-21 7:54 ` Xiao Ni
2025-09-21 14:48 ` Kenta Akagi
1 sibling, 1 reply; 39+ messages in thread
From: Xiao Ni @ 2025-09-21 7:54 UTC (permalink / raw)
To: Kenta Akagi
Cc: Song Liu, Yu Kuai, Mariusz Tkaczyk, Shaohua Li, Guoqing Jiang,
linux-raid, linux-kernel
Hi Kenta
On Mon, Sep 15, 2025 at 11:44 AM Kenta Akagi <k@mgml.me> wrote:
>
> Currently, the LastDev flag is set on an rdev that failed a failfast
> metadata write and called md_error, but did not become Faulty. It is
> cleared when the metadata write retry succeeds. This has problems for
> the following reasons:
>
> * Despite its name, the flag is only set during a metadata write window.
> * Unlike when LastDev and Failfast were introduced, md_error on the last
> rdev of a RAID1/10 array now sets MD_BROKEN. Thus when LastDev is set,
> the array is already unwritable.
>
> A following commit will prevent failfast bios from breaking the array,
> which requires knowing from outside the personality whether an rdev is
> the last one. For that purpose, LastDev should be set on rdevs that must
> not be lost.
>
> This commit ensures that LastDev is set on the indispensable rdev in a
> degraded RAID1/10 array.
>
> Signed-off-by: Kenta Akagi <k@mgml.me>
> ---
> drivers/md/md.c | 4 +---
> drivers/md/md.h | 6 +++---
> drivers/md/raid1.c | 34 +++++++++++++++++++++++++++++++++-
> drivers/md/raid10.c | 34 +++++++++++++++++++++++++++++++++-
> 4 files changed, 70 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 4e033c26fdd4..268410b66b83 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -1007,10 +1007,8 @@ static void super_written(struct bio *bio)
> if (!test_bit(Faulty, &rdev->flags)
> && (bio->bi_opf & MD_FAILFAST)) {
> set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
> - set_bit(LastDev, &rdev->flags);
> }
> - } else
> - clear_bit(LastDev, &rdev->flags);
> + }
>
> bio_put(bio);
>
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 51af29a03079..ec598f9a8381 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -281,9 +281,9 @@ enum flag_bits {
> * It is expects that no bad block log
> * is present.
> */
> - LastDev, /* Seems to be the last working dev as
> - * it didn't fail, so don't use FailFast
> - * any more for metadata
> + LastDev, /* This is the last working rdev.
> + * so don't use FailFast any more for
> + * metadata.
> */
> CollisionCheck, /*
> * check if there is collision between raid1
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index bf44878ec640..32ad6b102ff7 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1733,6 +1733,33 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
> seq_printf(seq, "]");
> }
>
> +/**
> + * update_lastdev - Set or clear LastDev flag for all rdevs in array
> + * @conf: pointer to r1conf
> + *
> + * Sets LastDev if the device is In_sync and cannot be lost for the array.
> + * Otherwise, clear it.
> + *
> + * Caller must hold ->device_lock.
> + */
> +static void update_lastdev(struct r1conf *conf)
> +{
> + int i;
> + int alive_disks = conf->raid_disks - conf->mddev->degraded;
> +
> + for (i = 0; i < conf->raid_disks; i++) {
> + struct md_rdev *rdev = conf->mirrors[i].rdev;
> +
> + if (rdev) {
> + if (test_bit(In_sync, &rdev->flags) &&
> + alive_disks == 1)
> + set_bit(LastDev, &rdev->flags);
> + else
> + clear_bit(LastDev, &rdev->flags);
> + }
> + }
> +}
> +
> /**
> * raid1_error() - RAID1 error handler.
> * @mddev: affected md device.
> @@ -1767,8 +1794,10 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
> }
> }
> set_bit(Blocked, &rdev->flags);
> - if (test_and_clear_bit(In_sync, &rdev->flags))
> + if (test_and_clear_bit(In_sync, &rdev->flags)) {
> mddev->degraded++;
> + update_lastdev(conf);
> + }
> set_bit(Faulty, &rdev->flags);
> spin_unlock_irqrestore(&conf->device_lock, flags);
> /*
> @@ -1864,6 +1893,7 @@ static int raid1_spare_active(struct mddev *mddev)
> }
> }
> mddev->degraded -= count;
> + update_lastdev(conf);
update_lastdev is called in raid1_spare_active, raid1_run and
raid1_reshape. Could you explain why it needs to be called in these
places? Is it because you want to clear the LastDev flag? If so, is
this the right place to do it? As your commit message says, the flag
is cleared when the metadata write retry succeeds. In raid1, wouldn't
the right place be where a failed read/write is successfully fixed?
Best Regards
Xiao
> spin_unlock_irqrestore(&conf->device_lock, flags);
>
> print_conf(conf);
> @@ -3290,6 +3320,7 @@ static int raid1_run(struct mddev *mddev)
> rcu_assign_pointer(conf->thread, NULL);
> mddev->private = conf;
> set_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
> + update_lastdev(conf);
>
> md_set_array_sectors(mddev, raid1_size(mddev, 0, 0));
>
> @@ -3427,6 +3458,7 @@ static int raid1_reshape(struct mddev *mddev)
>
> spin_lock_irqsave(&conf->device_lock, flags);
> mddev->degraded += (raid_disks - conf->raid_disks);
> + update_lastdev(conf);
> spin_unlock_irqrestore(&conf->device_lock, flags);
> conf->raid_disks = mddev->raid_disks = raid_disks;
> mddev->delta_disks = 0;
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index b60c30bfb6c7..dc4edd4689f8 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -1983,6 +1983,33 @@ static int enough(struct r10conf *conf, int ignore)
> _enough(conf, 1, ignore);
> }
>
> +/**
> + * update_lastdev - Set or clear LastDev flag for all rdevs in array
> + * @conf: pointer to r10conf
> + *
> + * Sets LastDev if the device is In_sync and cannot be lost for the array.
> + * Otherwise, clear it.
> + *
> + * Caller must hold ->reconfig_mutex or ->device_lock.
> + */
> +static void update_lastdev(struct r10conf *conf)
> +{
> + int i;
> + int raid_disks = max(conf->geo.raid_disks, conf->prev.raid_disks);
> +
> + for (i = 0; i < raid_disks; i++) {
> + struct md_rdev *rdev = conf->mirrors[i].rdev;
> +
> + if (rdev) {
> + if (test_bit(In_sync, &rdev->flags) &&
> + !enough(conf, i))
> + set_bit(LastDev, &rdev->flags);
> + else
> + clear_bit(LastDev, &rdev->flags);
> + }
> + }
> +}
> +
> /**
> * raid10_error() - RAID10 error handler.
> * @mddev: affected md device.
> @@ -2013,8 +2040,10 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
> return;
> }
> }
> - if (test_and_clear_bit(In_sync, &rdev->flags))
> + if (test_and_clear_bit(In_sync, &rdev->flags)) {
> mddev->degraded++;
> + update_lastdev(conf);
> + }
>
> set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> set_bit(Blocked, &rdev->flags);
> @@ -2102,6 +2131,7 @@ static int raid10_spare_active(struct mddev *mddev)
> }
> spin_lock_irqsave(&conf->device_lock, flags);
> mddev->degraded -= count;
> + update_lastdev(conf);
> spin_unlock_irqrestore(&conf->device_lock, flags);
>
> print_conf(conf);
> @@ -4159,6 +4189,7 @@ static int raid10_run(struct mddev *mddev)
> md_set_array_sectors(mddev, size);
> mddev->resync_max_sectors = size;
> set_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
> + update_lastdev(conf);
>
> if (md_integrity_register(mddev))
> goto out_free_conf;
> @@ -4567,6 +4598,7 @@ static int raid10_start_reshape(struct mddev *mddev)
> */
> spin_lock_irq(&conf->device_lock);
> mddev->degraded = calc_degraded(conf);
> + update_lastdev(conf);
> spin_unlock_irq(&conf->device_lock);
> mddev->raid_disks = conf->geo.raid_disks;
> mddev->reshape_position = conf->reshape_progress;
> --
> 2.50.1
>
>
* Re: [PATCH v4 1/9] md/raid1,raid10: Set the LastDev flag when the configuration changes
2025-09-21 7:54 ` Xiao Ni
@ 2025-09-21 14:48 ` Kenta Akagi
0 siblings, 0 replies; 39+ messages in thread
From: Kenta Akagi @ 2025-09-21 14:48 UTC (permalink / raw)
To: Xiao Ni
Cc: Song Liu, Yu Kuai, Mariusz Tkaczyk, Shaohua Li, Guoqing Jiang,
linux-raid, linux-kernel, Kenta Akagi
Hi,
Thank you for the review.
On 2025/09/21 16:54, Xiao Ni wrote:
> Hi Kenta
>
> On Mon, Sep 15, 2025 at 11:44 AM Kenta Akagi <k@mgml.me> wrote:
>>
>> Currently, the LastDev flag is set on an rdev that failed a failfast
>> metadata write and called md_error, but did not become Faulty. It is
>> cleared when the metadata write retry succeeds. This has problems for
>> the following reasons:
>>
>> * Despite its name, the flag is only set during a metadata write window.
>> * Unlike when LastDev and Failfast were introduced, md_error on the last
>> rdev of a RAID1/10 array now sets MD_BROKEN. Thus when LastDev is set,
>> the array is already unwritable.
>>
>> A following commit will prevent failfast bios from breaking the array,
>> which requires knowing from outside the personality whether an rdev is
>> the last one. For that purpose, LastDev should be set on rdevs that must
>> not be lost.
>>
>> This commit ensures that LastDev is set on the indispensable rdev in a
>> degraded RAID1/10 array.
>>
>> Signed-off-by: Kenta Akagi <k@mgml.me>
>> ---
>> drivers/md/md.c | 4 +---
>> drivers/md/md.h | 6 +++---
>> drivers/md/raid1.c | 34 +++++++++++++++++++++++++++++++++-
>> drivers/md/raid10.c | 34 +++++++++++++++++++++++++++++++++-
>> 4 files changed, 70 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 4e033c26fdd4..268410b66b83 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -1007,10 +1007,8 @@ static void super_written(struct bio *bio)
>> if (!test_bit(Faulty, &rdev->flags)
>> && (bio->bi_opf & MD_FAILFAST)) {
>> set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
>> - set_bit(LastDev, &rdev->flags);
>> }
>> - } else
>> - clear_bit(LastDev, &rdev->flags);
>> + }
>>
>> bio_put(bio);
>>
>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>> index 51af29a03079..ec598f9a8381 100644
>> --- a/drivers/md/md.h
>> +++ b/drivers/md/md.h
>> @@ -281,9 +281,9 @@ enum flag_bits {
>> * It is expects that no bad block log
>> * is present.
>> */
>> - LastDev, /* Seems to be the last working dev as
>> - * it didn't fail, so don't use FailFast
>> - * any more for metadata
>> + LastDev, /* This is the last working rdev.
>> + * so don't use FailFast any more for
>> + * metadata.
>> */
>> CollisionCheck, /*
>> * check if there is collision between raid1
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index bf44878ec640..32ad6b102ff7 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -1733,6 +1733,33 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>> seq_printf(seq, "]");
>> }
>>
>> +/**
>> + * update_lastdev - Set or clear LastDev flag for all rdevs in array
>> + * @conf: pointer to r1conf
>> + *
>> + * Sets LastDev if the device is In_sync and cannot be lost for the array.
>> + * Otherwise, clear it.
>> + *
>> + * Caller must hold ->device_lock.
>> + */
>> +static void update_lastdev(struct r1conf *conf)
>> +{
>> + int i;
>> + int alive_disks = conf->raid_disks - conf->mddev->degraded;
>> +
>> + for (i = 0; i < conf->raid_disks; i++) {
>> + struct md_rdev *rdev = conf->mirrors[i].rdev;
>> +
>> + if (rdev) {
>> + if (test_bit(In_sync, &rdev->flags) &&
>> + alive_disks == 1)
>> + set_bit(LastDev, &rdev->flags);
>> + else
>> + clear_bit(LastDev, &rdev->flags);
>> + }
>> + }
>> +}
>> +
>> /**
>> * raid1_error() - RAID1 error handler.
>> * @mddev: affected md device.
>> @@ -1767,8 +1794,10 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>> }
>> }
>> set_bit(Blocked, &rdev->flags);
>> - if (test_and_clear_bit(In_sync, &rdev->flags))
>> + if (test_and_clear_bit(In_sync, &rdev->flags)) {
>> mddev->degraded++;
>> + update_lastdev(conf);
>> + }
>> set_bit(Faulty, &rdev->flags);
>> spin_unlock_irqrestore(&conf->device_lock, flags);
>> /*
>> @@ -1864,6 +1893,7 @@ static int raid1_spare_active(struct mddev *mddev)
>> }
>> }
>> mddev->degraded -= count;
>> + update_lastdev(conf);
>
> update_lastdev is called in raid1_spare_active, raid1_run and
> raid1_reshape. Could you explain why it needs to be called in these
> places? Is it because you want to clear the LastDev flag? If so, is
> this the right place to do it? As your commit message says, the flag
> is cleared when the metadata write retry succeeds. In raid1, wouldn't
> the right place be where a failed read/write is successfully fixed?
The intention was to set LastDev only when the rdev is the last device in the array.
As suggested in the review, I will check whether it is the last device when
md_error is called, instead of using a flag.
As a result, update_lastdev() will be removed, and the purpose and behavior
of LastDev will remain unchanged in v5.
Thanks,
Akagi
>
> Best Regards
> Xiao
>
>> spin_unlock_irqrestore(&conf->device_lock, flags);
>>
>> print_conf(conf);
>> @@ -3290,6 +3320,7 @@ static int raid1_run(struct mddev *mddev)
>> rcu_assign_pointer(conf->thread, NULL);
>> mddev->private = conf;
>> set_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
>> + update_lastdev(conf);
>>
>> md_set_array_sectors(mddev, raid1_size(mddev, 0, 0));
>>
>> @@ -3427,6 +3458,7 @@ static int raid1_reshape(struct mddev *mddev)
>>
>> spin_lock_irqsave(&conf->device_lock, flags);
>> mddev->degraded += (raid_disks - conf->raid_disks);
>> + update_lastdev(conf);
>> spin_unlock_irqrestore(&conf->device_lock, flags);
>> conf->raid_disks = mddev->raid_disks = raid_disks;
>> mddev->delta_disks = 0;
>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>> index b60c30bfb6c7..dc4edd4689f8 100644
>> --- a/drivers/md/raid10.c
>> +++ b/drivers/md/raid10.c
>> @@ -1983,6 +1983,33 @@ static int enough(struct r10conf *conf, int ignore)
>> _enough(conf, 1, ignore);
>> }
>>
>> +/**
>> + * update_lastdev - Set or clear LastDev flag for all rdevs in array
>> + * @conf: pointer to r10conf
>> + *
>> + * Sets LastDev if the device is In_sync and cannot be lost for the array.
>> + * Otherwise, clear it.
>> + *
>> + * Caller must hold ->reconfig_mutex or ->device_lock.
>> + */
>> +static void update_lastdev(struct r10conf *conf)
>> +{
>> + int i;
>> + int raid_disks = max(conf->geo.raid_disks, conf->prev.raid_disks);
>> +
>> + for (i = 0; i < raid_disks; i++) {
>> + struct md_rdev *rdev = conf->mirrors[i].rdev;
>> +
>> + if (rdev) {
>> + if (test_bit(In_sync, &rdev->flags) &&
>> + !enough(conf, i))
>> + set_bit(LastDev, &rdev->flags);
>> + else
>> + clear_bit(LastDev, &rdev->flags);
>> + }
>> + }
>> +}
>> +
>> /**
>> * raid10_error() - RAID10 error handler.
>> * @mddev: affected md device.
>> @@ -2013,8 +2040,10 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
>> return;
>> }
>> }
>> - if (test_and_clear_bit(In_sync, &rdev->flags))
>> + if (test_and_clear_bit(In_sync, &rdev->flags)) {
>> mddev->degraded++;
>> + update_lastdev(conf);
>> + }
>>
>> set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>> set_bit(Blocked, &rdev->flags);
>> @@ -2102,6 +2131,7 @@ static int raid10_spare_active(struct mddev *mddev)
>> }
>> spin_lock_irqsave(&conf->device_lock, flags);
>> mddev->degraded -= count;
>> + update_lastdev(conf);
>> spin_unlock_irqrestore(&conf->device_lock, flags);
>>
>> print_conf(conf);
>> @@ -4159,6 +4189,7 @@ static int raid10_run(struct mddev *mddev)
>> md_set_array_sectors(mddev, size);
>> mddev->resync_max_sectors = size;
>> set_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
>> + update_lastdev(conf);
>>
>> if (md_integrity_register(mddev))
>> goto out_free_conf;
>> @@ -4567,6 +4598,7 @@ static int raid10_start_reshape(struct mddev *mddev)
>> */
>> spin_lock_irq(&conf->device_lock);
>> mddev->degraded = calc_degraded(conf);
>> + update_lastdev(conf);
>> spin_unlock_irq(&conf->device_lock);
>> mddev->raid_disks = conf->geo.raid_disks;
>> mddev->reshape_position = conf->reshape_progress;
>> --
>> 2.50.1
>>
>>
>
>
* Re: [PATCH v4 4/9] md/raid1,raid10: Don't set MD_BROKEN on failfast bio failure
2025-09-20 9:51 ` Yu Kuai
@ 2025-09-23 15:54 ` Kenta Akagi
0 siblings, 0 replies; 39+ messages in thread
From: Kenta Akagi @ 2025-09-23 15:54 UTC (permalink / raw)
To: Yu Kuai, yukuai1, song, mtkaczyk, shli, jgq516
Cc: linux-raid, linux-kernel, yukuai3, Kenta Akagi
Hi,
On 2025/09/20 18:51, Yu Kuai wrote:
> Hi,
>
> On 2025/9/20 14:30, Kenta Akagi wrote:
>> Hi,
>>
>> I have changed my email address because our primary MX server
>> suddenly started rejecting non-DKIM mail.
>>
>> On 2025/09/19 10:36, Yu Kuai wrote:
>>> Hi,
>>>
>>> On 2025/9/18 23:22, Kenta Akagi wrote:
>>>>>> @@ -470,7 +470,7 @@ static void raid1_end_write_request(struct bio *bio)
>>>>>> (bio->bi_opf & MD_FAILFAST) &&
>>>>>> /* We never try FailFast to WriteMostly devices */
>>>>>> !test_bit(WriteMostly, &rdev->flags)) {
>>>>>> - md_error(r1_bio->mddev, rdev);
>>>>>> + md_bio_failure_error(r1_bio->mddev, rdev, bio);
>>>>>> }
>>>>> Can following check of faulty replaced with return value?
>>>> In the case where raid1_end_write_request is called for a non-failfast IO,
>>>> and the rdev has already been marked Faulty by another bio, it must not retry either.
>>>> I think it would be simpler not to use a return value here.
>>> You can just add the Faulty check inside md_bio_failure_error() as well,
>>> and both the failfast and writemostly checks.
>> Sorry, I'm not sure I understand this part.
>> In raid1_end_write_request, this code path is also used for a regular bio,
>> not only for FailFast.
>>
>> You mean to change md_bio_failure_error as follows:
>> * If the rdev is Faulty, immediately return true.
>> * If the given bio is Failfast and the rdev is not the lastdev, call md_error.
>> * If the given bio is not Failfast, do nothing and return false.
>
> Yes, doesn't that apply to all the callers?
It's difficult because the flow differs from function to function.
For example, in raid1_end_write_request, if the rdev and bio are Failfast
but not WriteMostly, it calls md_error, and then performs extra handling
if the rdev is Faulty, regardless of whether it is Failfast. This flow is
specific to raid1_end_write_request.
The other functions that need to switch to md_bio_failure_error are
handle_read_error and fix_sync_read_error, but a path that checks Faulty
regardless of Failfast does not exist in those functions.
It may be possible with some refactoring,
but I think raid1_end_write_request's current style, that is
if (Failfast) md_bio_failure_error();
if (Faulty) something;
would be better, because we can see at a glance what is happening.
BTW, fix_sync_read_error can use the return value of md_bio_failure_error
as suggested, so I'll revise it as follows:
@@ -2167,8 +2174,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
if (test_bit(FailFast, &rdev->flags)) {
/* Don't try recovering from here - just fail it
* ... unless it is the last working device of course */
- md_bio_failure_error(mddev, rdev, bio);
- if (test_bit(Faulty, &rdev->flags))
+ if (md_bio_failure_error(mddev, rdev, bio))
/* Don't try to read from here, but make sure
* put_buf does it's thing
*/
>
>>
>> And then apply this?
>> This is complicated. Wouldn't it be better to keep the Faulty check as it is?
>>
>> @@ -466,18 +466,12 @@ static void raid1_end_write_request(struct bio *bio)
>> set_bit(MD_RECOVERY_NEEDED, &
>> conf->mddev->recovery);
>>
>> - if (test_bit(FailFast, &rdev->flags) &&
>> - (bio->bi_opf & MD_FAILFAST) &&
>> - /* We never try FailFast to WriteMostly devices */
>> - !test_bit(WriteMostly, &rdev->flags)) {
>> - md_error(r1_bio->mddev, rdev);
>> - }
>> -
>> /*
>> * When the device is faulty, it is not necessary to
>> * handle write error.
>> */
>> - if (!test_bit(Faulty, &rdev->flags))
>> + if (!test_bit(Faulty, &rdev->flags) ||
>> + !md_bio_failure_error(r1_bio->mddev, rdev, bio))
>> set_bit(R1BIO_WriteError, &r1_bio->state);
>> else {
>> /* Finished with this branch */
>
> Faulty is set with the lock held, so checking Faulty with the lock held
> as well can prevent the rdev from becoming Faulty concurrently, and this
> check can be added to all callers, I think.
>
>>
>> Or do you mean a fix like this?
>>
>> @@ -466,23 +466,24 @@ static void raid1_end_write_request(struct bio *bio)
>> set_bit(MD_RECOVERY_NEEDED, &
>> conf->mddev->recovery);
>>
>> - if (test_bit(FailFast, &rdev->flags) &&
>> - (bio->bi_opf & MD_FAILFAST) &&
>> - /* We never try FailFast to WriteMostly devices */
>> - !test_bit(WriteMostly, &rdev->flags)) {
>> - md_error(r1_bio->mddev, rdev);
>> - }
>> -
>> /*
>> * When the device is faulty, it is not necessary to
>> * handle write error.
>> */
>> - if (!test_bit(Faulty, &rdev->flags))
>> - set_bit(R1BIO_WriteError, &r1_bio->state);
>> - else {
>> + if (test_bit(Faulty, &rdev->flags) ||
>> + (
>> + test_bit(FailFast, &rdev->flags) &&
>> + (bio->bi_opf & MD_FAILFAST) &&
>> + /* We never try FailFast to WriteMostly devices */
>> + !test_bit(WriteMostly, &rdev->flags) &&
>> + md_bio_failure_error(r1_bio->mddev, rdev, bio)
>> + )
>> + ) {
>> /* Finished with this branch */
>> r1_bio->bios[mirror] = NULL;
>> to_put = bio;
>> + } else {
>> + set_bit(R1BIO_WriteError, &r1_bio->state);
>> }
>> } else {
>> /*
>
> No, this just makes the code even more unreadable.
Understood.
Thanks,
Akagi
>
> Thanks,
> Kuai
>
>> Thanks,
>> Akagi
>>
>>> Thanks,
>>> Kuai
>>>
>>>
>>>
end of thread, other threads:[~2025-09-23 16:03 UTC | newest]
Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
2025-09-15 3:42 [PATCH v4 0/9] Don't set MD_BROKEN on failfast bio failure Kenta Akagi
2025-09-15 3:42 ` [PATCH v4 1/9] md/raid1,raid10: Set the LastDev flag when the configuration changes Kenta Akagi
2025-09-18 1:00 ` Yu Kuai
2025-09-18 14:02 ` Kenta Akagi
2025-09-21 7:54 ` Xiao Ni
2025-09-21 14:48 ` Kenta Akagi
2025-09-15 3:42 ` [PATCH v4 2/9] md: serialize md_error() Kenta Akagi
2025-09-18 1:04 ` Yu Kuai
2025-09-21 6:11 ` Kenta Akagi
2025-09-15 3:42 ` [PATCH v4 3/9] md: introduce md_bio_failure_error() Kenta Akagi
2025-09-18 1:09 ` Yu Kuai
2025-09-18 14:56 ` Kenta Akagi
2025-09-15 3:42 ` [PATCH v4 4/9] md/raid1,raid10: Don't set MD_BROKEN on failfast bio failure Kenta Akagi
2025-09-18 1:26 ` Yu Kuai
2025-09-18 15:22 ` Kenta Akagi
2025-09-19 1:36 ` Yu Kuai
2025-09-20 6:30 ` Kenta Akagi
2025-09-20 9:51 ` Yu Kuai
2025-09-23 15:54 ` Kenta Akagi
2025-09-15 3:42 ` [PATCH v4 5/9] md/raid1,raid10: Set R{1,10}BIO_Uptodate when successful retry of a failed bio Kenta Akagi
2025-09-17 9:24 ` Li Nan
2025-09-17 13:20 ` Kenta Akagi
2025-09-18 6:39 ` Li Nan
2025-09-18 15:36 ` Kenta Akagi
2025-09-19 1:37 ` Li Nan
2025-09-15 3:42 ` [PATCH v4 6/9] md/raid1,raid10: Fix missing retries Failfast write bios on no-bbl rdevs Kenta Akagi
2025-09-17 10:06 ` Li Nan
2025-09-17 13:33 ` Kenta Akagi
2025-09-18 6:58 ` Li Nan
2025-09-18 16:23 ` Kenta Akagi
2025-09-19 1:28 ` Li Nan
2025-09-15 3:42 ` [PATCH v4 7/9] md/raid10: fix failfast read error not rescheduled Kenta Akagi
2025-09-18 7:38 ` Li Nan
2025-09-18 16:12 ` Kenta Akagi
2025-09-19 1:20 ` Li Nan
2025-09-15 3:42 ` [PATCH v4 8/9] md/raid1,raid10: Add error message when setting MD_BROKEN Kenta Akagi
2025-09-15 3:42 ` [PATCH v4 9/9] md/raid1,raid10: Fix: Operation continuing on 0 devices Kenta Akagi
2025-09-15 7:19 ` Paul Menzel
2025-09-15 8:19 ` Kenta Akagi
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox