public inbox for linux-raid@vger.kernel.org
* [PATCH v5 00/16] Don't set MD_BROKEN on failfast bio failure
@ 2025-10-27 15:04 Kenta Akagi
  2025-10-27 15:04 ` [PATCH v5 01/16] md: move device_lock from conf to mddev Kenta Akagi
                   ` (15 more replies)
  0 siblings, 16 replies; 25+ messages in thread
From: Kenta Akagi @ 2025-10-27 15:04 UTC (permalink / raw)
  To: Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, Kenta Akagi

Changes from V4:
- Use device_lock to serialize md_error() instead of adding a new
  spinlock_t.
- Rename new function md_bio_failure_error() to md_cond_error().
- Add helper function pers->should_error() to determine whether to fail
  rdev in failfast bio failure, instead of using the LastDev flag.
- Avoid changing the behavior of the LastDev flag.
- Drop fix for R{1,10}BIO_Uptodate not being set despite successful
  retry; this will be sent separately after Nan's refactor.
- Drop fix for the message 'Operation continuing on 0 devices'; as it is
  outside the scope of this patch, it will be sent separately.
- Improve logging when metadata writing fails.
- Rename LastDev to RetryingSBWrite.
Changes from V3:
- The error handling in md_error() is now serialized, and a new helper
  function, md_bio_failure_error, has been introduced.
- MD_FAILFAST bio failures are now processed by md_bio_failure_error
  instead of signaling via FailfastIOFailure.
- RAID10: Fix missing reschedule of failfast read bio failure
- In narrow_write_error, writes that succeed on retry are now reported
  to the upper layer as successful, regardless of failfast
Changes from V2:
- Fix to prevent the array from being marked broken for all
  Failfast IOs, not just metadata.
- Following the review feedback, update raid{1,10}_error to clear
  FailfastIOFailure so that devices are properly marked Faulty.
Changes from V1:
- Avoid setting MD_BROKEN instead of clearing it
- Add pr_crit() when setting MD_BROKEN
- Fix the message that may be shown after all rdevs have failed:
  "Operation continuing on 0 devices"

v4: https://lore.kernel.org/linux-raid/20250915034210.8533-1-k@mgml.me/
v3: https://lore.kernel.org/linux-raid/20250828163216.4225-1-k@mgml.me/
v2: https://lore.kernel.org/linux-raid/20250817172710.4892-1-k@mgml.me/
v1: https://lore.kernel.org/linux-raid/20250812090119.153697-1-k@mgml.me/

When multiple MD_FAILFAST bios fail simultaneously on Failfast-enabled
rdevs in RAID1/RAID10, the following issues can occur:
* MD_BROKEN is set and the array halts, even though this should not occur
  under the intended Failfast design.
* Writes retried through narrow_write_error succeed, but the I/O is still
  reported as BLK_STS_IOERR
  * NOTE: the fix for this was dropped in v5 and will be sent separately
    https://lore.kernel.org/linux-raid/6f0f9730-4bbe-7f3c-1b50-690bb77d5d90@huaweicloud.com/
* RAID10 only: If a Failfast read I/O fails, it is not retried on any
  remaining rdev, and as a result, the upper layer receives an I/O error.

Simultaneous bio failures across multiple rdevs are uncommon; however,
rdevs serviced via nvme-tcp can still experience them due to something as
simple as an Ethernet fault. The issue can be reproduced using the
following steps.

# prepare nvmet/nvme-tcp and md array #
sh-5.2# cat << 'EOF' > loopback-nvme.sh
set -eu
nqn="nqn.2025-08.io.example:nvmet-test-$1"
back=$2
cd /sys/kernel/config/nvmet/
mkdir subsystems/$nqn
echo 1 > subsystems/${nqn}/attr_allow_any_host
mkdir subsystems/${nqn}/namespaces/1
echo -n ${back} > subsystems/${nqn}/namespaces/1/device_path
echo 1 > subsystems/${nqn}/namespaces/1/enable
ports="ports/1"
if [ ! -d $ports ]; then
        mkdir $ports
        cd $ports
        echo 127.0.0.1 > addr_traddr
        echo tcp       > addr_trtype
        echo 4420      > addr_trsvcid
        echo ipv4      > addr_adrfam
        cd ../../
fi
ln -s /sys/kernel/config/nvmet/subsystems/${nqn} ${ports}/subsystems/
nvme connect -t tcp -n $nqn -a 127.0.0.1 -s 4420
EOF

sh-5.2# chmod +x loopback-nvme.sh
sh-5.2# modprobe -a nvme-tcp nvmet-tcp
sh-5.2# truncate -s 1g a.img b.img
sh-5.2# losetup --show -f a.img
/dev/loop0
sh-5.2# losetup --show -f b.img
/dev/loop1
sh-5.2# ./loopback-nvme.sh 0 /dev/loop0
connecting to device: nvme0
sh-5.2# ./loopback-nvme.sh 1 /dev/loop1
connecting to device: nvme1
sh-5.2# mdadm --create --verbose /dev/md0 --level=1 --raid-devices=2 \
--failfast /dev/nvme0n1 --failfast /dev/nvme1n1
...
mdadm: array /dev/md0 started.

# run fio #
sh-5.2# fio --name=test --filename=/dev/md0 --rw=randrw --rwmixread=50 \
--bs=4k --numjobs=9 --time_based --runtime=300s --group_reporting --direct=1

# The issue can be reproduced by blocking nvme traffic during fio #
sh-5.2# iptables -A INPUT -m tcp -p tcp --dport 4420 -j DROP;
sh-5.2# sleep 10; # twice the default KATO value
sh-5.2# iptables -D INPUT -m tcp -p tcp --dport 4420 -j DROP


Patches 1-2 serialize md_error() with device_lock.
Patches 3-6 introduce md_cond_error() and its dependent helpers.
Patches 7-8 are preparatory refactoring for patches 11-12.
Patch 9 adds the missing retry path for Failfast read errors in RAID10.
Patches 10-12 prevent MD_FAILFAST bio failures from causing the array to
fail; this is the main goal of this series.
Patches 13-14 add an error log when the array stops functioning.
Patch 15 simply renames the LastDev flag.
Patch 16 adds the mddev and rdev names to the error log when metadata
writing fails.
 
Kenta Akagi (16):
  md: move device_lock from conf to mddev
  md: serialize md_error()
  md: add pers->should_error() callback
  md: introduce md_cond_error()
  md/raid1: implement pers->should_error()
  md/raid10: implement pers->should_error()
  md/raid1: refactor handle_read_error()
  md/raid10: refactor handle_read_error()
  md/raid10: fix failfast read error not rescheduled
  md: prevent set MD_BROKEN on super_write failure with failfast
  md/raid1: Prevent set MD_BROKEN on failfast bio failure
  md/raid10: Prevent set MD_BROKEN on failfast bio failure
  md/raid1: add error message when setting MD_BROKEN
  md/raid10: Add error message when setting MD_BROKEN
  md: rename 'LastDev' rdev flag to 'RetryingSBWrite'
  md: Improve super_written() error logging

 drivers/md/md-linear.c   |   1 +
 drivers/md/md.c          |  58 ++++++++++++++--
 drivers/md/md.h          |  12 +++-
 drivers/md/raid0.c       |   1 +
 drivers/md/raid1.c       | 111 ++++++++++++++++++++-----------
 drivers/md/raid1.h       |   2 -
 drivers/md/raid10.c      | 116 ++++++++++++++++++++------------
 drivers/md/raid10.h      |   1 -
 drivers/md/raid5-cache.c |  16 ++---
 drivers/md/raid5.c       | 139 +++++++++++++++++++--------------------
 drivers/md/raid5.h       |   1 -
 11 files changed, 283 insertions(+), 175 deletions(-)

-- 
2.50.1



* [PATCH v5 01/16] md: move device_lock from conf to mddev
  2025-10-27 15:04 [PATCH v5 00/16] Don't set MD_BROKEN on failfast bio failure Kenta Akagi
@ 2025-10-27 15:04 ` Kenta Akagi
  2025-10-30  1:59   ` Yu Kuai
  2025-10-27 15:04 ` [PATCH v5 02/16] md: serialize md_error() Kenta Akagi
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 25+ messages in thread
From: Kenta Akagi @ 2025-10-27 15:04 UTC (permalink / raw)
  To: Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, Kenta Akagi

Move device_lock from mddev->private (r1conf, r10conf, r5conf)
to the mddev structure, in preparation for serializing md_error() and
introducing a new function that calls md_error() conditionally to
improve failfast bio error handling.

This commit only moves the device_lock location with no functional
changes. Subsequent commits will serialize md_error() and introduce
a new function that calls md_error() conditionally.

Signed-off-by: Kenta Akagi <k@mgml.me>
---
 drivers/md/md.c          |   1 +
 drivers/md/md.h          |   2 +
 drivers/md/raid1.c       |  51 +++++++-------
 drivers/md/raid1.h       |   2 -
 drivers/md/raid10.c      |  61 +++++++++--------
 drivers/md/raid10.h      |   1 -
 drivers/md/raid5-cache.c |  16 ++---
 drivers/md/raid5.c       | 139 +++++++++++++++++++--------------------
 drivers/md/raid5.h       |   1 -
 9 files changed, 135 insertions(+), 139 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 6062e0deb616..d667580e3125 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -760,6 +760,7 @@ int mddev_init(struct mddev *mddev)
 	atomic_set(&mddev->openers, 0);
 	atomic_set(&mddev->sync_seq, 0);
 	spin_lock_init(&mddev->lock);
+	spin_lock_init(&mddev->device_lock);
 	init_waitqueue_head(&mddev->sb_wait);
 	init_waitqueue_head(&mddev->recovery_wait);
 	mddev->reshape_position = MaxSector;
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 5d5f780b8447..64ac22edf372 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -543,6 +543,8 @@ struct mddev {
 	/* used for register new sync thread */
 	struct work_struct sync_work;
 
+	spinlock_t			device_lock;
+
 	/* "lock" protects:
 	 *   flush_bio transition from NULL to !NULL
 	 *   rdev superblocks, events
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 592a40233004..7924d5ee189d 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -282,10 +282,10 @@ static void reschedule_retry(struct r1bio *r1_bio)
 	int idx;
 
 	idx = sector_to_idx(r1_bio->sector);
-	spin_lock_irqsave(&conf->device_lock, flags);
+	spin_lock_irqsave(&conf->mddev->device_lock, flags);
 	list_add(&r1_bio->retry_list, &conf->retry_list);
 	atomic_inc(&conf->nr_queued[idx]);
-	spin_unlock_irqrestore(&conf->device_lock, flags);
+	spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
 
 	wake_up(&conf->wait_barrier);
 	md_wakeup_thread(mddev->thread);
@@ -387,12 +387,12 @@ static void raid1_end_read_request(struct bio *bio)
 		 * Here we redefine "uptodate" to mean "Don't want to retry"
 		 */
 		unsigned long flags;
-		spin_lock_irqsave(&conf->device_lock, flags);
+		spin_lock_irqsave(&conf->mddev->device_lock, flags);
 		if (r1_bio->mddev->degraded == conf->raid_disks ||
 		    (r1_bio->mddev->degraded == conf->raid_disks-1 &&
 		     test_bit(In_sync, &rdev->flags)))
 			uptodate = 1;
-		spin_unlock_irqrestore(&conf->device_lock, flags);
+		spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
 	}
 
 	if (uptodate) {
@@ -917,14 +917,14 @@ static void flush_pending_writes(struct r1conf *conf)
 	/* Any writes that have been queued but are awaiting
 	 * bitmap updates get flushed here.
 	 */
-	spin_lock_irq(&conf->device_lock);
+	spin_lock_irq(&conf->mddev->device_lock);
 
 	if (conf->pending_bio_list.head) {
 		struct blk_plug plug;
 		struct bio *bio;
 
 		bio = bio_list_get(&conf->pending_bio_list);
-		spin_unlock_irq(&conf->device_lock);
+		spin_unlock_irq(&conf->mddev->device_lock);
 
 		/*
 		 * As this is called in a wait_event() loop (see freeze_array),
@@ -940,7 +940,7 @@ static void flush_pending_writes(struct r1conf *conf)
 		flush_bio_list(conf, bio);
 		blk_finish_plug(&plug);
 	} else
-		spin_unlock_irq(&conf->device_lock);
+		spin_unlock_irq(&conf->mddev->device_lock);
 }
 
 /* Barriers....
@@ -1274,9 +1274,9 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
 	struct bio *bio;
 
 	if (from_schedule) {
-		spin_lock_irq(&conf->device_lock);
+		spin_lock_irq(&conf->mddev->device_lock);
 		bio_list_merge(&conf->pending_bio_list, &plug->pending);
-		spin_unlock_irq(&conf->device_lock);
+		spin_unlock_irq(&conf->mddev->device_lock);
 		wake_up_barrier(conf);
 		md_wakeup_thread(mddev->thread);
 		kfree(plug);
@@ -1664,9 +1664,9 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 		/* flush_pending_writes() needs access to the rdev so...*/
 		mbio->bi_bdev = (void *)rdev;
 		if (!raid1_add_bio_to_plug(mddev, mbio, raid1_unplug, disks)) {
-			spin_lock_irqsave(&conf->device_lock, flags);
+			spin_lock_irqsave(&conf->mddev->device_lock, flags);
 			bio_list_add(&conf->pending_bio_list, mbio);
-			spin_unlock_irqrestore(&conf->device_lock, flags);
+			spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
 			md_wakeup_thread(mddev->thread);
 		}
 	}
@@ -1753,7 +1753,7 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
 	struct r1conf *conf = mddev->private;
 	unsigned long flags;
 
-	spin_lock_irqsave(&conf->device_lock, flags);
+	spin_lock_irqsave(&conf->mddev->device_lock, flags);
 
 	if (test_bit(In_sync, &rdev->flags) &&
 	    (conf->raid_disks - mddev->degraded) == 1) {
@@ -1761,7 +1761,7 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
 
 		if (!mddev->fail_last_dev) {
 			conf->recovery_disabled = mddev->recovery_disabled;
-			spin_unlock_irqrestore(&conf->device_lock, flags);
+			spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
 			return;
 		}
 	}
@@ -1769,7 +1769,7 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
 	if (test_and_clear_bit(In_sync, &rdev->flags))
 		mddev->degraded++;
 	set_bit(Faulty, &rdev->flags);
-	spin_unlock_irqrestore(&conf->device_lock, flags);
+	spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
 	/*
 	 * if recovery is running, make sure it aborts.
 	 */
@@ -1831,7 +1831,7 @@ static int raid1_spare_active(struct mddev *mddev)
 	 * device_lock used to avoid races with raid1_end_read_request
 	 * which expects 'In_sync' flags and ->degraded to be consistent.
 	 */
-	spin_lock_irqsave(&conf->device_lock, flags);
+	spin_lock_irqsave(&conf->mddev->device_lock, flags);
 	for (i = 0; i < conf->raid_disks; i++) {
 		struct md_rdev *rdev = conf->mirrors[i].rdev;
 		struct md_rdev *repl = conf->mirrors[conf->raid_disks + i].rdev;
@@ -1863,7 +1863,7 @@ static int raid1_spare_active(struct mddev *mddev)
 		}
 	}
 	mddev->degraded -= count;
-	spin_unlock_irqrestore(&conf->device_lock, flags);
+	spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
 
 	print_conf(conf);
 	return count;
@@ -2605,11 +2605,11 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
 					 conf->mddev);
 		}
 	if (fail) {
-		spin_lock_irq(&conf->device_lock);
+		spin_lock_irq(&conf->mddev->device_lock);
 		list_add(&r1_bio->retry_list, &conf->bio_end_io_list);
 		idx = sector_to_idx(r1_bio->sector);
 		atomic_inc(&conf->nr_queued[idx]);
-		spin_unlock_irq(&conf->device_lock);
+		spin_unlock_irq(&conf->mddev->device_lock);
 		/*
 		 * In case freeze_array() is waiting for condition
 		 * get_unqueued_pending() == extra to be true.
@@ -2681,10 +2681,10 @@ static void raid1d(struct md_thread *thread)
 	if (!list_empty_careful(&conf->bio_end_io_list) &&
 	    !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
 		LIST_HEAD(tmp);
-		spin_lock_irqsave(&conf->device_lock, flags);
+		spin_lock_irqsave(&conf->mddev->device_lock, flags);
 		if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
 			list_splice_init(&conf->bio_end_io_list, &tmp);
-		spin_unlock_irqrestore(&conf->device_lock, flags);
+		spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
 		while (!list_empty(&tmp)) {
 			r1_bio = list_first_entry(&tmp, struct r1bio,
 						  retry_list);
@@ -2702,16 +2702,16 @@ static void raid1d(struct md_thread *thread)
 
 		flush_pending_writes(conf);
 
-		spin_lock_irqsave(&conf->device_lock, flags);
+		spin_lock_irqsave(&conf->mddev->device_lock, flags);
 		if (list_empty(head)) {
-			spin_unlock_irqrestore(&conf->device_lock, flags);
+			spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
 			break;
 		}
 		r1_bio = list_entry(head->prev, struct r1bio, retry_list);
 		list_del(head->prev);
 		idx = sector_to_idx(r1_bio->sector);
 		atomic_dec(&conf->nr_queued[idx]);
-		spin_unlock_irqrestore(&conf->device_lock, flags);
+		spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
 
 		mddev = r1_bio->mddev;
 		conf = mddev->private;
@@ -3131,7 +3131,6 @@ static struct r1conf *setup_conf(struct mddev *mddev)
 		goto abort;
 
 	err = -EINVAL;
-	spin_lock_init(&conf->device_lock);
 	conf->raid_disks = mddev->raid_disks;
 	rdev_for_each(rdev, mddev) {
 		int disk_idx = rdev->raid_disk;
@@ -3429,9 +3428,9 @@ static int raid1_reshape(struct mddev *mddev)
 	kfree(conf->mirrors);
 	conf->mirrors = newmirrors;
 
-	spin_lock_irqsave(&conf->device_lock, flags);
+	spin_lock_irqsave(&conf->mddev->device_lock, flags);
 	mddev->degraded += (raid_disks - conf->raid_disks);
-	spin_unlock_irqrestore(&conf->device_lock, flags);
+	spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
 	conf->raid_disks = mddev->raid_disks = raid_disks;
 	mddev->delta_disks = 0;
 
diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
index 2ebe35aaa534..7af8e294e7ae 100644
--- a/drivers/md/raid1.h
+++ b/drivers/md/raid1.h
@@ -57,8 +57,6 @@ struct r1conf {
 	int			raid_disks;
 	int			nonrot_disks;
 
-	spinlock_t		device_lock;
-
 	/* list of 'struct r1bio' that need to be processed by raid1d,
 	 * whether to retry a read, writeout a resync or recovery
 	 * block, or anything else.
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 14dcd5142eb4..57c887070df3 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -301,10 +301,10 @@ static void reschedule_retry(struct r10bio *r10_bio)
 	struct mddev *mddev = r10_bio->mddev;
 	struct r10conf *conf = mddev->private;
 
-	spin_lock_irqsave(&conf->device_lock, flags);
+	spin_lock_irqsave(&conf->mddev->device_lock, flags);
 	list_add(&r10_bio->retry_list, &conf->retry_list);
 	conf->nr_queued ++;
-	spin_unlock_irqrestore(&conf->device_lock, flags);
+	spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
 
 	/* wake up frozen array... */
 	wake_up(&conf->wait_barrier);
@@ -863,14 +863,14 @@ static void flush_pending_writes(struct r10conf *conf)
 	/* Any writes that have been queued but are awaiting
 	 * bitmap updates get flushed here.
 	 */
-	spin_lock_irq(&conf->device_lock);
+	spin_lock_irq(&conf->mddev->device_lock);
 
 	if (conf->pending_bio_list.head) {
 		struct blk_plug plug;
 		struct bio *bio;
 
 		bio = bio_list_get(&conf->pending_bio_list);
-		spin_unlock_irq(&conf->device_lock);
+		spin_unlock_irq(&conf->mddev->device_lock);
 
 		/*
 		 * As this is called in a wait_event() loop (see freeze_array),
@@ -896,7 +896,7 @@ static void flush_pending_writes(struct r10conf *conf)
 		}
 		blk_finish_plug(&plug);
 	} else
-		spin_unlock_irq(&conf->device_lock);
+		spin_unlock_irq(&conf->mddev->device_lock);
 }
 
 /* Barriers....
@@ -1089,9 +1089,9 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
 	struct bio *bio;
 
 	if (from_schedule) {
-		spin_lock_irq(&conf->device_lock);
+		spin_lock_irq(&conf->mddev->device_lock);
 		bio_list_merge(&conf->pending_bio_list, &plug->pending);
-		spin_unlock_irq(&conf->device_lock);
+		spin_unlock_irq(&conf->mddev->device_lock);
 		wake_up_barrier(conf);
 		md_wakeup_thread(mddev->thread);
 		kfree(plug);
@@ -1278,9 +1278,9 @@ static void raid10_write_one_disk(struct mddev *mddev, struct r10bio *r10_bio,
 	atomic_inc(&r10_bio->remaining);
 
 	if (!raid1_add_bio_to_plug(mddev, mbio, raid10_unplug, conf->copies)) {
-		spin_lock_irqsave(&conf->device_lock, flags);
+		spin_lock_irqsave(&conf->mddev->device_lock, flags);
 		bio_list_add(&conf->pending_bio_list, mbio);
-		spin_unlock_irqrestore(&conf->device_lock, flags);
+		spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
 		md_wakeup_thread(mddev->thread);
 	}
 }
@@ -1997,13 +1997,13 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
 	struct r10conf *conf = mddev->private;
 	unsigned long flags;
 
-	spin_lock_irqsave(&conf->device_lock, flags);
+	spin_lock_irqsave(&conf->mddev->device_lock, flags);
 
 	if (test_bit(In_sync, &rdev->flags) && !enough(conf, rdev->raid_disk)) {
 		set_bit(MD_BROKEN, &mddev->flags);
 
 		if (!mddev->fail_last_dev) {
-			spin_unlock_irqrestore(&conf->device_lock, flags);
+			spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
 			return;
 		}
 	}
@@ -2015,7 +2015,7 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
 	set_bit(Faulty, &rdev->flags);
 	set_mask_bits(&mddev->sb_flags, 0,
 		      BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
-	spin_unlock_irqrestore(&conf->device_lock, flags);
+	spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
 	pr_crit("md/raid10:%s: Disk failure on %pg, disabling device.\n"
 		"md/raid10:%s: Operation continuing on %d devices.\n",
 		mdname(mddev), rdev->bdev,
@@ -2094,9 +2094,9 @@ static int raid10_spare_active(struct mddev *mddev)
 			sysfs_notify_dirent_safe(tmp->rdev->sysfs_state);
 		}
 	}
-	spin_lock_irqsave(&conf->device_lock, flags);
+	spin_lock_irqsave(&conf->mddev->device_lock, flags);
 	mddev->degraded -= count;
-	spin_unlock_irqrestore(&conf->device_lock, flags);
+	spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
 
 	print_conf(conf);
 	return count;
@@ -2951,10 +2951,10 @@ static void handle_write_completed(struct r10conf *conf, struct r10bio *r10_bio)
 			}
 		}
 		if (fail) {
-			spin_lock_irq(&conf->device_lock);
+			spin_lock_irq(&conf->mddev->device_lock);
 			list_add(&r10_bio->retry_list, &conf->bio_end_io_list);
 			conf->nr_queued++;
-			spin_unlock_irq(&conf->device_lock);
+			spin_unlock_irq(&conf->mddev->device_lock);
 			/*
 			 * In case freeze_array() is waiting for condition
 			 * nr_pending == nr_queued + extra to be true.
@@ -2984,14 +2984,14 @@ static void raid10d(struct md_thread *thread)
 	if (!list_empty_careful(&conf->bio_end_io_list) &&
 	    !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
 		LIST_HEAD(tmp);
-		spin_lock_irqsave(&conf->device_lock, flags);
+		spin_lock_irqsave(&conf->mddev->device_lock, flags);
 		if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
 			while (!list_empty(&conf->bio_end_io_list)) {
 				list_move(conf->bio_end_io_list.prev, &tmp);
 				conf->nr_queued--;
 			}
 		}
-		spin_unlock_irqrestore(&conf->device_lock, flags);
+		spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
 		while (!list_empty(&tmp)) {
 			r10_bio = list_first_entry(&tmp, struct r10bio,
 						   retry_list);
@@ -3009,15 +3009,15 @@ static void raid10d(struct md_thread *thread)
 
 		flush_pending_writes(conf);
 
-		spin_lock_irqsave(&conf->device_lock, flags);
+		spin_lock_irqsave(&conf->mddev->device_lock, flags);
 		if (list_empty(head)) {
-			spin_unlock_irqrestore(&conf->device_lock, flags);
+			spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
 			break;
 		}
 		r10_bio = list_entry(head->prev, struct r10bio, retry_list);
 		list_del(head->prev);
 		conf->nr_queued--;
-		spin_unlock_irqrestore(&conf->device_lock, flags);
+		spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
 
 		mddev = r10_bio->mddev;
 		conf = mddev->private;
@@ -3960,7 +3960,6 @@ static struct r10conf *setup_conf(struct mddev *mddev)
 			conf->prev.stride = conf->dev_sectors;
 	}
 	conf->reshape_safe = conf->reshape_progress;
-	spin_lock_init(&conf->device_lock);
 	INIT_LIST_HEAD(&conf->retry_list);
 	INIT_LIST_HEAD(&conf->bio_end_io_list);
 
@@ -4467,7 +4466,7 @@ static int raid10_start_reshape(struct mddev *mddev)
 		return -EINVAL;
 
 	conf->offset_diff = min_offset_diff;
-	spin_lock_irq(&conf->device_lock);
+	spin_lock_irq(&conf->mddev->device_lock);
 	if (conf->mirrors_new) {
 		memcpy(conf->mirrors_new, conf->mirrors,
 		       sizeof(struct raid10_info)*conf->prev.raid_disks);
@@ -4482,7 +4481,7 @@ static int raid10_start_reshape(struct mddev *mddev)
 	if (mddev->reshape_backwards) {
 		sector_t size = raid10_size(mddev, 0, 0);
 		if (size < mddev->array_sectors) {
-			spin_unlock_irq(&conf->device_lock);
+			spin_unlock_irq(&conf->mddev->device_lock);
 			pr_warn("md/raid10:%s: array size must be reduce before number of disks\n",
 				mdname(mddev));
 			return -EINVAL;
@@ -4492,7 +4491,7 @@ static int raid10_start_reshape(struct mddev *mddev)
 	} else
 		conf->reshape_progress = 0;
 	conf->reshape_safe = conf->reshape_progress;
-	spin_unlock_irq(&conf->device_lock);
+	spin_unlock_irq(&conf->mddev->device_lock);
 
 	if (mddev->delta_disks && mddev->bitmap) {
 		struct mdp_superblock_1 *sb = NULL;
@@ -4561,9 +4560,9 @@ static int raid10_start_reshape(struct mddev *mddev)
 	 * ->degraded is measured against the larger of the
 	 * pre and  post numbers.
 	 */
-	spin_lock_irq(&conf->device_lock);
+	spin_lock_irq(&conf->mddev->device_lock);
 	mddev->degraded = calc_degraded(conf);
-	spin_unlock_irq(&conf->device_lock);
+	spin_unlock_irq(&conf->mddev->device_lock);
 	mddev->raid_disks = conf->geo.raid_disks;
 	mddev->reshape_position = conf->reshape_progress;
 	set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
@@ -4579,7 +4578,7 @@ static int raid10_start_reshape(struct mddev *mddev)
 
 abort:
 	mddev->recovery = 0;
-	spin_lock_irq(&conf->device_lock);
+	spin_lock_irq(&conf->mddev->device_lock);
 	conf->geo = conf->prev;
 	mddev->raid_disks = conf->geo.raid_disks;
 	rdev_for_each(rdev, mddev)
@@ -4588,7 +4587,7 @@ static int raid10_start_reshape(struct mddev *mddev)
 	conf->reshape_progress = MaxSector;
 	conf->reshape_safe = MaxSector;
 	mddev->reshape_position = MaxSector;
-	spin_unlock_irq(&conf->device_lock);
+	spin_unlock_irq(&conf->mddev->device_lock);
 	return ret;
 }
 
@@ -4947,13 +4946,13 @@ static void end_reshape(struct r10conf *conf)
 	if (test_bit(MD_RECOVERY_INTR, &conf->mddev->recovery))
 		return;
 
-	spin_lock_irq(&conf->device_lock);
+	spin_lock_irq(&conf->mddev->device_lock);
 	conf->prev = conf->geo;
 	md_finish_reshape(conf->mddev);
 	smp_wmb();
 	conf->reshape_progress = MaxSector;
 	conf->reshape_safe = MaxSector;
-	spin_unlock_irq(&conf->device_lock);
+	spin_unlock_irq(&conf->mddev->device_lock);
 
 	mddev_update_io_opt(conf->mddev, raid10_nr_stripes(conf));
 	conf->fullsync = 0;
diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h
index da00a55f7a55..5f6c8b21ecd0 100644
--- a/drivers/md/raid10.h
+++ b/drivers/md/raid10.h
@@ -29,7 +29,6 @@ struct r10conf {
 	struct mddev		*mddev;
 	struct raid10_info	*mirrors;
 	struct raid10_info	*mirrors_new, *mirrors_old;
-	spinlock_t		device_lock;
 
 	/* geometry */
 	struct geom {
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index ba768ca7f422..177759b18c72 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -1356,7 +1356,7 @@ static void r5l_write_super_and_discard_space(struct r5l_log *log,
  * r5c_flush_stripe moves stripe from cached list to handle_list. When called,
  * the stripe must be on r5c_cached_full_stripes or r5c_cached_partial_stripes.
  *
- * must hold conf->device_lock
+ * must hold conf->mddev->device_lock
  */
 static void r5c_flush_stripe(struct r5conf *conf, struct stripe_head *sh)
 {
@@ -1366,10 +1366,10 @@ static void r5c_flush_stripe(struct r5conf *conf, struct stripe_head *sh)
 
 	/*
 	 * The stripe is not ON_RELEASE_LIST, so it is safe to call
-	 * raid5_release_stripe() while holding conf->device_lock
+	 * raid5_release_stripe() while holding conf->mddev->device_lock
 	 */
 	BUG_ON(test_bit(STRIPE_ON_RELEASE_LIST, &sh->state));
-	lockdep_assert_held(&conf->device_lock);
+	lockdep_assert_held(&conf->mddev->device_lock);
 
 	list_del_init(&sh->lru);
 	atomic_inc(&sh->count);
@@ -1396,7 +1396,7 @@ void r5c_flush_cache(struct r5conf *conf, int num)
 	int count;
 	struct stripe_head *sh, *next;
 
-	lockdep_assert_held(&conf->device_lock);
+	lockdep_assert_held(&conf->mddev->device_lock);
 	if (!READ_ONCE(conf->log))
 		return;
 
@@ -1455,15 +1455,15 @@ static void r5c_do_reclaim(struct r5conf *conf)
 		stripes_to_flush = -1;
 
 	if (stripes_to_flush >= 0) {
-		spin_lock_irqsave(&conf->device_lock, flags);
+		spin_lock_irqsave(&conf->mddev->device_lock, flags);
 		r5c_flush_cache(conf, stripes_to_flush);
-		spin_unlock_irqrestore(&conf->device_lock, flags);
+		spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
 	}
 
 	/* if log space is tight, flush stripes on stripe_in_journal_list */
 	if (test_bit(R5C_LOG_TIGHT, &conf->cache_state)) {
 		spin_lock_irqsave(&log->stripe_in_journal_lock, flags);
-		spin_lock(&conf->device_lock);
+		spin_lock(&conf->mddev->device_lock);
 		list_for_each_entry(sh, &log->stripe_in_journal_list, r5c) {
 			/*
 			 * stripes on stripe_in_journal_list could be in any
@@ -1481,7 +1481,7 @@ static void r5c_do_reclaim(struct r5conf *conf)
 					break;
 			}
 		}
-		spin_unlock(&conf->device_lock);
+		spin_unlock(&conf->mddev->device_lock);
 		spin_unlock_irqrestore(&log->stripe_in_journal_lock, flags);
 	}
 
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 24b32a0c95b4..3350dcf9cab6 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -83,34 +83,34 @@ static inline int stripe_hash_locks_hash(struct r5conf *conf, sector_t sect)
 }
 
 static inline void lock_device_hash_lock(struct r5conf *conf, int hash)
-	__acquires(&conf->device_lock)
+	__acquires(&conf->mddev->device_lock)
 {
 	spin_lock_irq(conf->hash_locks + hash);
-	spin_lock(&conf->device_lock);
+	spin_lock(&conf->mddev->device_lock);
 }
 
 static inline void unlock_device_hash_lock(struct r5conf *conf, int hash)
-	__releases(&conf->device_lock)
+	__releases(&conf->mddev->device_lock)
 {
-	spin_unlock(&conf->device_lock);
+	spin_unlock(&conf->mddev->device_lock);
 	spin_unlock_irq(conf->hash_locks + hash);
 }
 
 static inline void lock_all_device_hash_locks_irq(struct r5conf *conf)
-	__acquires(&conf->device_lock)
+	__acquires(&conf->mddev->device_lock)
 {
 	int i;
 	spin_lock_irq(conf->hash_locks);
 	for (i = 1; i < NR_STRIPE_HASH_LOCKS; i++)
 		spin_lock_nest_lock(conf->hash_locks + i, conf->hash_locks);
-	spin_lock(&conf->device_lock);
+	spin_lock(&conf->mddev->device_lock);
 }
 
 static inline void unlock_all_device_hash_locks_irq(struct r5conf *conf)
-	__releases(&conf->device_lock)
+	__releases(&conf->mddev->device_lock)
 {
 	int i;
-	spin_unlock(&conf->device_lock);
+	spin_unlock(&conf->mddev->device_lock);
 	for (i = NR_STRIPE_HASH_LOCKS - 1; i; i--)
 		spin_unlock(conf->hash_locks + i);
 	spin_unlock_irq(conf->hash_locks);
@@ -172,7 +172,7 @@ static bool stripe_is_lowprio(struct stripe_head *sh)
 }
 
 static void raid5_wakeup_stripe_thread(struct stripe_head *sh)
-	__must_hold(&sh->raid_conf->device_lock)
+	__must_hold(&sh->raid_conf->mddev->device_lock)
 {
 	struct r5conf *conf = sh->raid_conf;
 	struct r5worker_group *group;
@@ -220,7 +220,7 @@ static void raid5_wakeup_stripe_thread(struct stripe_head *sh)
 
 static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,
 			      struct list_head *temp_inactive_list)
-	__must_hold(&conf->device_lock)
+	__must_hold(&conf->mddev->device_lock)
 {
 	int i;
 	int injournal = 0;	/* number of date pages with R5_InJournal */
@@ -306,7 +306,7 @@ static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,
 
 static void __release_stripe(struct r5conf *conf, struct stripe_head *sh,
 			     struct list_head *temp_inactive_list)
-	__must_hold(&conf->device_lock)
+	__must_hold(&conf->mddev->device_lock)
 {
 	if (atomic_dec_and_test(&sh->count))
 		do_release_stripe(conf, sh, temp_inactive_list);
@@ -363,7 +363,7 @@ static void release_inactive_stripe_list(struct r5conf *conf,
 
 static int release_stripe_list(struct r5conf *conf,
 			       struct list_head *temp_inactive_list)
-	__must_hold(&conf->device_lock)
+	__must_hold(&conf->mddev->device_lock)
 {
 	struct stripe_head *sh, *t;
 	int count = 0;
@@ -412,11 +412,11 @@ void raid5_release_stripe(struct stripe_head *sh)
 	return;
 slow_path:
 	/* we are ok here if STRIPE_ON_RELEASE_LIST is set or not */
-	if (atomic_dec_and_lock_irqsave(&sh->count, &conf->device_lock, flags)) {
+	if (atomic_dec_and_lock_irqsave(&sh->count, &conf->mddev->device_lock, flags)) {
 		INIT_LIST_HEAD(&list);
 		hash = sh->hash_lock_index;
 		do_release_stripe(conf, sh, &list);
-		spin_unlock_irqrestore(&conf->device_lock, flags);
+		spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
 		release_inactive_stripe_list(conf, &list, hash);
 	}
 }
@@ -647,7 +647,7 @@ static struct stripe_head *find_get_stripe(struct r5conf *conf,
 	 * references it with the device_lock held.
 	 */
 
-	spin_lock(&conf->device_lock);
+	spin_lock(&conf->mddev->device_lock);
 	if (!atomic_read(&sh->count)) {
 		if (!test_bit(STRIPE_HANDLE, &sh->state))
 			atomic_inc(&conf->active_stripes);
@@ -666,7 +666,7 @@ static struct stripe_head *find_get_stripe(struct r5conf *conf,
 		}
 	}
 	atomic_inc(&sh->count);
-	spin_unlock(&conf->device_lock);
+	spin_unlock(&conf->mddev->device_lock);
 
 	return sh;
 }
@@ -684,7 +684,7 @@ static struct stripe_head *find_get_stripe(struct r5conf *conf,
  * of the two sections, and some non-in_sync devices may
  * be insync in the section most affected by failed devices.
  *
- * Most calls to this function hold &conf->device_lock. Calls
+ * Most calls to this function hold &conf->mddev->device_lock. Calls
  * in raid5_run() do not require the lock as no other threads
  * have been started yet.
  */
@@ -2913,7 +2913,7 @@ static void raid5_error(struct mddev *mddev, struct md_rdev *rdev)
 	pr_crit("md/raid:%s: Disk failure on %pg, disabling device.\n",
 		mdname(mddev), rdev->bdev);
 
-	spin_lock_irqsave(&conf->device_lock, flags);
+	spin_lock_irqsave(&conf->mddev->device_lock, flags);
 	set_bit(Faulty, &rdev->flags);
 	clear_bit(In_sync, &rdev->flags);
 	mddev->degraded = raid5_calc_degraded(conf);
@@ -2929,7 +2929,7 @@ static void raid5_error(struct mddev *mddev, struct md_rdev *rdev)
 			mdname(mddev), conf->raid_disks - mddev->degraded);
 	}
 
-	spin_unlock_irqrestore(&conf->device_lock, flags);
+	spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
 	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
 
 	set_bit(Blocked, &rdev->flags);
@@ -5294,7 +5294,7 @@ static void handle_stripe(struct stripe_head *sh)
 }
 
 static void raid5_activate_delayed(struct r5conf *conf)
-	__must_hold(&conf->device_lock)
+	__must_hold(&conf->mddev->device_lock)
 {
 	if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD) {
 		while (!list_empty(&conf->delayed_list)) {
@@ -5313,7 +5313,7 @@ static void raid5_activate_delayed(struct r5conf *conf)
 
 static void activate_bit_delay(struct r5conf *conf,
 		struct list_head *temp_inactive_list)
-	__must_hold(&conf->device_lock)
+	__must_hold(&conf->mddev->device_lock)
 {
 	struct list_head head;
 	list_add(&head, &conf->bitmap_list);
@@ -5348,12 +5348,12 @@ static void add_bio_to_retry(struct bio *bi,struct r5conf *conf)
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&conf->device_lock, flags);
+	spin_lock_irqsave(&conf->mddev->device_lock, flags);
 
 	bi->bi_next = conf->retry_read_aligned_list;
 	conf->retry_read_aligned_list = bi;
 
-	spin_unlock_irqrestore(&conf->device_lock, flags);
+	spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
 	md_wakeup_thread(conf->mddev->thread);
 }
 
@@ -5472,11 +5472,11 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio)
 		 */
 		if (did_inc && atomic_dec_and_test(&conf->active_aligned_reads))
 			wake_up(&conf->wait_for_quiescent);
-		spin_lock_irq(&conf->device_lock);
+		spin_lock_irq(&conf->mddev->device_lock);
 		wait_event_lock_irq(conf->wait_for_quiescent, conf->quiesce == 0,
-				    conf->device_lock);
+				    conf->mddev->device_lock);
 		atomic_inc(&conf->active_aligned_reads);
-		spin_unlock_irq(&conf->device_lock);
+		spin_unlock_irq(&conf->mddev->device_lock);
 	}
 
 	mddev_trace_remap(mddev, align_bio, raid_bio->bi_iter.bi_sector);
@@ -5516,7 +5516,7 @@ static struct bio *chunk_aligned_read(struct mddev *mddev, struct bio *raid_bio)
  * handle_list.
  */
 static struct stripe_head *__get_priority_stripe(struct r5conf *conf, int group)
-	__must_hold(&conf->device_lock)
+	__must_hold(&conf->mddev->device_lock)
 {
 	struct stripe_head *sh, *tmp;
 	struct list_head *handle_list = NULL;
@@ -5625,7 +5625,7 @@ static void raid5_unplug(struct blk_plug_cb *blk_cb, bool from_schedule)
 	int hash;
 
 	if (cb->list.next && !list_empty(&cb->list)) {
-		spin_lock_irq(&conf->device_lock);
+		spin_lock_irq(&conf->mddev->device_lock);
 		while (!list_empty(&cb->list)) {
 			sh = list_first_entry(&cb->list, struct stripe_head, lru);
 			list_del_init(&sh->lru);
@@ -5644,7 +5644,7 @@ static void raid5_unplug(struct blk_plug_cb *blk_cb, bool from_schedule)
 			__release_stripe(conf, sh, &cb->temp_inactive_list[hash]);
 			cnt++;
 		}
-		spin_unlock_irq(&conf->device_lock);
+		spin_unlock_irq(&conf->mddev->device_lock);
 	}
 	release_inactive_stripe_list(conf, cb->temp_inactive_list,
 				     NR_STRIPE_HASH_LOCKS);
@@ -5793,14 +5793,14 @@ static bool stripe_ahead_of_reshape(struct mddev *mddev, struct r5conf *conf,
 		max_sector = max(max_sector, sh->dev[dd_idx].sector);
 	}
 
-	spin_lock_irq(&conf->device_lock);
+	spin_lock_irq(&conf->mddev->device_lock);
 
 	if (!range_ahead_of_reshape(mddev, min_sector, max_sector,
 				     conf->reshape_progress))
 		/* mismatch, need to try again */
 		ret = true;
 
-	spin_unlock_irq(&conf->device_lock);
+	spin_unlock_irq(&conf->mddev->device_lock);
 
 	return ret;
 }
@@ -5880,10 +5880,10 @@ static enum reshape_loc get_reshape_loc(struct mddev *mddev,
 	 * to the stripe that we think it is, we will have
 	 * to check again.
 	 */
-	spin_lock_irq(&conf->device_lock);
+	spin_lock_irq(&conf->mddev->device_lock);
 	reshape_progress = conf->reshape_progress;
 	reshape_safe = conf->reshape_safe;
-	spin_unlock_irq(&conf->device_lock);
+	spin_unlock_irq(&conf->mddev->device_lock);
 	if (reshape_progress == MaxSector)
 		return LOC_NO_RESHAPE;
 	if (ahead_of_reshape(mddev, logical_sector, reshape_progress))
@@ -6373,9 +6373,9 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
 			   test_bit(MD_RECOVERY_INTR, &mddev->recovery));
 		if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
 			return 0;
-		spin_lock_irq(&conf->device_lock);
+		spin_lock_irq(&conf->mddev->device_lock);
 		conf->reshape_safe = mddev->reshape_position;
-		spin_unlock_irq(&conf->device_lock);
+		spin_unlock_irq(&conf->mddev->device_lock);
 		wake_up(&conf->wait_for_reshape);
 		sysfs_notify_dirent_safe(mddev->sysfs_completed);
 	}
@@ -6413,12 +6413,12 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
 		}
 		list_add(&sh->lru, &stripes);
 	}
-	spin_lock_irq(&conf->device_lock);
+	spin_lock_irq(&conf->mddev->device_lock);
 	if (mddev->reshape_backwards)
 		conf->reshape_progress -= reshape_sectors * new_data_disks;
 	else
 		conf->reshape_progress += reshape_sectors * new_data_disks;
-	spin_unlock_irq(&conf->device_lock);
+	spin_unlock_irq(&conf->mddev->device_lock);
 	/* Ok, those stripe are ready. We can start scheduling
 	 * reads on the source stripes.
 	 * The source stripes are determined by mapping the first and last
@@ -6482,9 +6482,9 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
 			   || test_bit(MD_RECOVERY_INTR, &mddev->recovery));
 		if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
 			goto ret;
-		spin_lock_irq(&conf->device_lock);
+		spin_lock_irq(&conf->mddev->device_lock);
 		conf->reshape_safe = mddev->reshape_position;
-		spin_unlock_irq(&conf->device_lock);
+		spin_unlock_irq(&conf->mddev->device_lock);
 		wake_up(&conf->wait_for_reshape);
 		sysfs_notify_dirent_safe(mddev->sysfs_completed);
 	}
@@ -6651,7 +6651,7 @@ static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio,
 static int handle_active_stripes(struct r5conf *conf, int group,
 				 struct r5worker *worker,
 				 struct list_head *temp_inactive_list)
-		__must_hold(&conf->device_lock)
+		__must_hold(&conf->mddev->device_lock)
 {
 	struct stripe_head *batch[MAX_STRIPE_BATCH], *sh;
 	int i, batch_size = 0, hash;
@@ -6666,21 +6666,21 @@ static int handle_active_stripes(struct r5conf *conf, int group,
 			if (!list_empty(temp_inactive_list + i))
 				break;
 		if (i == NR_STRIPE_HASH_LOCKS) {
-			spin_unlock_irq(&conf->device_lock);
+			spin_unlock_irq(&conf->mddev->device_lock);
 			log_flush_stripe_to_raid(conf);
-			spin_lock_irq(&conf->device_lock);
+			spin_lock_irq(&conf->mddev->device_lock);
 			return batch_size;
 		}
 		release_inactive = true;
 	}
-	spin_unlock_irq(&conf->device_lock);
+	spin_unlock_irq(&conf->mddev->device_lock);
 
 	release_inactive_stripe_list(conf, temp_inactive_list,
 				     NR_STRIPE_HASH_LOCKS);
 
 	r5l_flush_stripe_to_raid(conf->log);
 	if (release_inactive) {
-		spin_lock_irq(&conf->device_lock);
+		spin_lock_irq(&conf->mddev->device_lock);
 		return 0;
 	}
 
@@ -6690,7 +6690,7 @@ static int handle_active_stripes(struct r5conf *conf, int group,
 
 	cond_resched();
 
-	spin_lock_irq(&conf->device_lock);
+	spin_lock_irq(&conf->mddev->device_lock);
 	for (i = 0; i < batch_size; i++) {
 		hash = batch[i]->hash_lock_index;
 		__release_stripe(conf, batch[i], &temp_inactive_list[hash]);
@@ -6712,7 +6712,7 @@ static void raid5_do_work(struct work_struct *work)
 
 	blk_start_plug(&plug);
 	handled = 0;
-	spin_lock_irq(&conf->device_lock);
+	spin_lock_irq(&conf->mddev->device_lock);
 	while (1) {
 		int batch_size, released;
 
@@ -6726,11 +6726,11 @@ static void raid5_do_work(struct work_struct *work)
 		handled += batch_size;
 		wait_event_lock_irq(mddev->sb_wait,
 			!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags),
-			conf->device_lock);
+			conf->mddev->device_lock);
 	}
 	pr_debug("%d stripes handled\n", handled);
 
-	spin_unlock_irq(&conf->device_lock);
+	spin_unlock_irq(&conf->mddev->device_lock);
 
 	flush_deferred_bios(conf);
 
@@ -6762,7 +6762,7 @@ static void raid5d(struct md_thread *thread)
 
 	blk_start_plug(&plug);
 	handled = 0;
-	spin_lock_irq(&conf->device_lock);
+	spin_lock_irq(&conf->mddev->device_lock);
 	while (1) {
 		struct bio *bio;
 		int batch_size, released;
@@ -6779,10 +6779,10 @@ static void raid5d(struct md_thread *thread)
 		    !list_empty(&conf->bitmap_list)) {
 			/* Now is a good time to flush some bitmap updates */
 			conf->seq_flush++;
-			spin_unlock_irq(&conf->device_lock);
+			spin_unlock_irq(&conf->mddev->device_lock);
 			if (md_bitmap_enabled(mddev, true))
 				mddev->bitmap_ops->unplug(mddev, true);
-			spin_lock_irq(&conf->device_lock);
+			spin_lock_irq(&conf->mddev->device_lock);
 			conf->seq_write = conf->seq_flush;
 			activate_bit_delay(conf, conf->temp_inactive_list);
 		}
@@ -6790,9 +6790,9 @@ static void raid5d(struct md_thread *thread)
 
 		while ((bio = remove_bio_from_retry(conf, &offset))) {
 			int ok;
-			spin_unlock_irq(&conf->device_lock);
+			spin_unlock_irq(&conf->mddev->device_lock);
 			ok = retry_aligned_read(conf, bio, offset);
-			spin_lock_irq(&conf->device_lock);
+			spin_lock_irq(&conf->mddev->device_lock);
 			if (!ok)
 				break;
 			handled++;
@@ -6805,14 +6805,14 @@ static void raid5d(struct md_thread *thread)
 		handled += batch_size;
 
 		if (mddev->sb_flags & ~(1 << MD_SB_CHANGE_PENDING)) {
-			spin_unlock_irq(&conf->device_lock);
+			spin_unlock_irq(&conf->mddev->device_lock);
 			md_check_recovery(mddev);
-			spin_lock_irq(&conf->device_lock);
+			spin_lock_irq(&conf->mddev->device_lock);
 		}
 	}
 	pr_debug("%d stripes handled\n", handled);
 
-	spin_unlock_irq(&conf->device_lock);
+	spin_unlock_irq(&conf->mddev->device_lock);
 	if (test_and_clear_bit(R5_ALLOC_MORE, &conf->cache_state) &&
 	    mutex_trylock(&conf->cache_size_mutex)) {
 		grow_one_stripe(conf, __GFP_NOWARN);
@@ -7197,11 +7197,11 @@ raid5_store_group_thread_cnt(struct mddev *mddev, const char *page, size_t len)
 
 		err = alloc_thread_groups(conf, new, &group_cnt, &new_groups);
 		if (!err) {
-			spin_lock_irq(&conf->device_lock);
+			spin_lock_irq(&conf->mddev->device_lock);
 			conf->group_cnt = group_cnt;
 			conf->worker_cnt_per_group = new;
 			conf->worker_groups = new_groups;
-			spin_unlock_irq(&conf->device_lock);
+			spin_unlock_irq(&conf->mddev->device_lock);
 
 			if (old_groups)
 				kfree(old_groups[0].workers);
@@ -7504,8 +7504,7 @@ static struct r5conf *setup_conf(struct mddev *mddev)
 		conf->worker_groups = new_group;
 	} else
 		goto abort;
-	spin_lock_init(&conf->device_lock);
-	seqcount_spinlock_init(&conf->gen_lock, &conf->device_lock);
+	seqcount_spinlock_init(&conf->gen_lock, &conf->mddev->device_lock);
 	mutex_init(&conf->cache_size_mutex);
 
 	init_waitqueue_head(&conf->wait_for_quiescent);
@@ -8151,9 +8150,9 @@ static int raid5_spare_active(struct mddev *mddev)
 			sysfs_notify_dirent_safe(rdev->sysfs_state);
 		}
 	}
-	spin_lock_irqsave(&conf->device_lock, flags);
+	spin_lock_irqsave(&conf->mddev->device_lock, flags);
 	mddev->degraded = raid5_calc_degraded(conf);
-	spin_unlock_irqrestore(&conf->device_lock, flags);
+	spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
 	print_raid5_conf(conf);
 	return count;
 }
@@ -8474,7 +8473,7 @@ static int raid5_start_reshape(struct mddev *mddev)
 	}
 
 	atomic_set(&conf->reshape_stripes, 0);
-	spin_lock_irq(&conf->device_lock);
+	spin_lock_irq(&conf->mddev->device_lock);
 	write_seqcount_begin(&conf->gen_lock);
 	conf->previous_raid_disks = conf->raid_disks;
 	conf->raid_disks += mddev->delta_disks;
@@ -8493,7 +8492,7 @@ static int raid5_start_reshape(struct mddev *mddev)
 		conf->reshape_progress = 0;
 	conf->reshape_safe = conf->reshape_progress;
 	write_seqcount_end(&conf->gen_lock);
-	spin_unlock_irq(&conf->device_lock);
+	spin_unlock_irq(&conf->mddev->device_lock);
 
 	/* Now make sure any requests that proceeded on the assumption
 	 * the reshape wasn't running - like Discard or Read - have
@@ -8533,9 +8532,9 @@ static int raid5_start_reshape(struct mddev *mddev)
 		 * ->degraded is measured against the larger of the
 		 * pre and post number of devices.
 		 */
-		spin_lock_irqsave(&conf->device_lock, flags);
+		spin_lock_irqsave(&conf->mddev->device_lock, flags);
 		mddev->degraded = raid5_calc_degraded(conf);
-		spin_unlock_irqrestore(&conf->device_lock, flags);
+		spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
 	}
 	mddev->raid_disks = conf->raid_disks;
 	mddev->reshape_position = conf->reshape_progress;
@@ -8560,7 +8559,7 @@ static void end_reshape(struct r5conf *conf)
 	if (!test_bit(MD_RECOVERY_INTR, &conf->mddev->recovery)) {
 		struct md_rdev *rdev;
 
-		spin_lock_irq(&conf->device_lock);
+		spin_lock_irq(&conf->mddev->device_lock);
 		conf->previous_raid_disks = conf->raid_disks;
 		md_finish_reshape(conf->mddev);
 		smp_wmb();
@@ -8571,7 +8570,7 @@ static void end_reshape(struct r5conf *conf)
 			    !test_bit(Journal, &rdev->flags) &&
 			    !test_bit(In_sync, &rdev->flags))
 				rdev->recovery_offset = MaxSector;
-		spin_unlock_irq(&conf->device_lock);
+		spin_unlock_irq(&conf->mddev->device_lock);
 		wake_up(&conf->wait_for_reshape);
 
 		mddev_update_io_opt(conf->mddev,
@@ -8591,9 +8590,9 @@ static void raid5_finish_reshape(struct mddev *mddev)
 
 		if (mddev->delta_disks <= 0) {
 			int d;
-			spin_lock_irq(&conf->device_lock);
+			spin_lock_irq(&conf->mddev->device_lock);
 			mddev->degraded = raid5_calc_degraded(conf);
-			spin_unlock_irq(&conf->device_lock);
+			spin_unlock_irq(&conf->mddev->device_lock);
 			for (d = conf->raid_disks ;
 			     d < conf->raid_disks - mddev->delta_disks;
 			     d++) {
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index eafc6e9ed6ee..8ec60e06dc05 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -668,7 +668,6 @@ struct r5conf {
 	unsigned long		cache_state;
 	struct shrinker		*shrinker;
 	int			pool_size; /* number of disks in stripeheads in pool */
-	spinlock_t		device_lock;
 	struct disk_info	*disks;
 	struct bio_set		bio_split;
 
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 02/16] md: serialize md_error()
  2025-10-27 15:04 [PATCH v5 00/16] Don't set MD_BROKEN on failfast bio failure Kenta Akagi
  2025-10-27 15:04 ` [PATCH v5 01/16] md: move device_lock from conf to mddev Kenta Akagi
@ 2025-10-27 15:04 ` Kenta Akagi
  2025-10-30  2:11   ` Yu Kuai
  2025-10-27 15:04 ` [PATCH v5 03/16] md: add pers->should_error() callback Kenta Akagi
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 25+ messages in thread
From: Kenta Akagi @ 2025-10-27 15:04 UTC (permalink / raw)
  To: Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, Kenta Akagi

Serialize the md_error() function in preparation for the introduction of
a conditional md_error() in a subsequent commit. The conditional
md_error() is intended to prevent unintentional setting of MD_BROKEN
during RAID1/10 failfast handling.

To enhance the failfast bio error handler, it must verify that the
affected rdev is not the last working device before marking it as
faulty. Without serialization, a race condition can occur if multiple
failfast bios attempt to call the error handler concurrently:

        failfast bio1                   failfast bio2
        ---                             ---
        md_cond_error(md,rdev1,bio)     md_cond_error(md,rdev2,bio)
          if(!is_degraded(md))            if(!is_degraded(md))
            raid1_error(md,rdev1)           raid1_error(md,rdev2)
              spin_lock(md)
              set_faulty(rdev1)
              spin_unlock(md)
                                            spin_lock(md)
                                            set_faulty(rdev2)
                                            set_broken(md)
                                            spin_unlock(md)

This can unintentionally cause the array to stop in situations where the
'Last' rdev should not be marked as Faulty.

This commit serializes the md_error() function for all RAID
personalities to avoid this race condition. Future commits will
introduce a conditional md_error() specifically for failfast bio
handling.

Serialization is applied to both the standard and conditional md_error()
for the following reasons:

- Both use the same error-handling mechanism, so it's clearer to
  serialize them consistently.
- The md_error() path is cold, meaning serialization has no performance
  impact.
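The race above is a classic check-then-act pattern. As a rough userspace
sketch (all names are illustrative stand-ins, not the kernel API), holding
one lock across both the "is this the last working device?" check and the
Faulty transition makes the two steps atomic, so two concurrent failures
cannot both observe a non-degraded array:

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical miniature of the md error path; illustrative only. */
struct mini_mddev {
	pthread_mutex_t device_lock;	/* plays the role of mddev->device_lock */
	int raid_disks;
	int degraded;
	bool broken;			/* plays the role of MD_BROKEN */
};

struct mini_rdev {
	bool in_sync;
	bool faulty;
};

/*
 * Serialized error handler: the last-device check and the Faulty
 * transition happen under one lock.  Without the lock, two concurrent
 * callers could both pass the check and the second would mark the
 * array broken even though its rdev was the last working one.
 */
static void mini_error(struct mini_mddev *md, struct mini_rdev *rdev)
{
	pthread_mutex_lock(&md->device_lock);
	if (rdev->in_sync && md->raid_disks - md->degraded == 1) {
		/* refusing to fail the last in-sync rdev; mark array broken */
		md->broken = true;
	} else if (rdev->in_sync) {
		rdev->in_sync = false;
		rdev->faulty = true;
		md->degraded++;
	}
	pthread_mutex_unlock(&md->device_lock);
}
```

With a 2-device mirror, the first failure degrades the array and the
second (now the last device) sets the broken flag instead of going Faulty.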

Signed-off-by: Kenta Akagi <k@mgml.me>
---
 drivers/md/md-linear.c |  1 +
 drivers/md/md.c        | 10 +++++++++-
 drivers/md/md.h        |  1 +
 drivers/md/raid0.c     |  1 +
 drivers/md/raid1.c     |  6 +-----
 drivers/md/raid10.c    |  9 ++-------
 drivers/md/raid5.c     |  4 +---
 7 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/drivers/md/md-linear.c b/drivers/md/md-linear.c
index 7033d982d377..0f6893e4b9f5 100644
--- a/drivers/md/md-linear.c
+++ b/drivers/md/md-linear.c
@@ -298,6 +298,7 @@ static void linear_status(struct seq_file *seq, struct mddev *mddev)
 }
 
 static void linear_error(struct mddev *mddev, struct md_rdev *rdev)
+	__must_hold(&mddev->device_lock)
 {
 	if (!test_and_set_bit(MD_BROKEN, &mddev->flags)) {
 		char *md_name = mdname(mddev);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index d667580e3125..4ad9cb0ac98c 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8444,7 +8444,8 @@ void md_unregister_thread(struct mddev *mddev, struct md_thread __rcu **threadp)
 }
 EXPORT_SYMBOL(md_unregister_thread);
 
-void md_error(struct mddev *mddev, struct md_rdev *rdev)
+void _md_error(struct mddev *mddev, struct md_rdev *rdev)
+	__must_hold(&mddev->device_lock)
 {
 	if (!rdev || test_bit(Faulty, &rdev->flags))
 		return;
@@ -8469,6 +8470,13 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
 		queue_work(md_misc_wq, &mddev->event_work);
 	md_new_event();
 }
+
+void md_error(struct mddev *mddev, struct md_rdev *rdev)
+{
+	spin_lock(&mddev->device_lock);
+	_md_error(mddev, rdev);
+	spin_unlock(&mddev->device_lock);
+}
 EXPORT_SYMBOL(md_error);
 
 /* seq_file implementation /proc/mdstat */
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 64ac22edf372..c982598cbf97 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -913,6 +913,7 @@ extern void md_write_start(struct mddev *mddev, struct bio *bi);
 extern void md_write_inc(struct mddev *mddev, struct bio *bi);
 extern void md_write_end(struct mddev *mddev);
 extern void md_done_sync(struct mddev *mddev, int blocks, int ok);
+void _md_error(struct mddev *mddev, struct md_rdev *rdev);
 extern void md_error(struct mddev *mddev, struct md_rdev *rdev);
 extern void md_finish_reshape(struct mddev *mddev);
 void md_submit_discard_bio(struct mddev *mddev, struct md_rdev *rdev,
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index e443e478645a..8cf3caf9defd 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -625,6 +625,7 @@ static void raid0_status(struct seq_file *seq, struct mddev *mddev)
 }
 
 static void raid0_error(struct mddev *mddev, struct md_rdev *rdev)
+	__must_hold(&mddev->device_lock)
 {
 	if (!test_and_set_bit(MD_BROKEN, &mddev->flags)) {
 		char *md_name = mdname(mddev);
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7924d5ee189d..202e510f73a4 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1749,11 +1749,9 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
  * &mddev->fail_last_dev is off.
  */
 static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
+	__must_hold(&mddev->device_lock)
 {
 	struct r1conf *conf = mddev->private;
-	unsigned long flags;
-
-	spin_lock_irqsave(&conf->mddev->device_lock, flags);
 
 	if (test_bit(In_sync, &rdev->flags) &&
 	    (conf->raid_disks - mddev->degraded) == 1) {
@@ -1761,7 +1759,6 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
 
 		if (!mddev->fail_last_dev) {
 			conf->recovery_disabled = mddev->recovery_disabled;
-			spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
 			return;
 		}
 	}
@@ -1769,7 +1766,6 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
 	if (test_and_clear_bit(In_sync, &rdev->flags))
 		mddev->degraded++;
 	set_bit(Faulty, &rdev->flags);
-	spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
 	/*
 	 * if recovery is running, make sure it aborts.
 	 */
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 57c887070df3..25c0ab09807b 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1993,19 +1993,15 @@ static int enough(struct r10conf *conf, int ignore)
  * &mddev->fail_last_dev is off.
  */
 static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
+	__must_hold(&mddev->device_lock)
 {
 	struct r10conf *conf = mddev->private;
-	unsigned long flags;
-
-	spin_lock_irqsave(&conf->mddev->device_lock, flags);
 
 	if (test_bit(In_sync, &rdev->flags) && !enough(conf, rdev->raid_disk)) {
 		set_bit(MD_BROKEN, &mddev->flags);
 
-		if (!mddev->fail_last_dev) {
-			spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
+		if (!mddev->fail_last_dev)
 			return;
-		}
 	}
 	if (test_and_clear_bit(In_sync, &rdev->flags))
 		mddev->degraded++;
@@ -2015,7 +2011,6 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
 	set_bit(Faulty, &rdev->flags);
 	set_mask_bits(&mddev->sb_flags, 0,
 		      BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
-	spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
 	pr_crit("md/raid10:%s: Disk failure on %pg, disabling device.\n"
 		"md/raid10:%s: Operation continuing on %d devices.\n",
 		mdname(mddev), rdev->bdev,
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3350dcf9cab6..d1372b1bc405 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2905,15 +2905,14 @@ static void raid5_end_write_request(struct bio *bi)
 }
 
 static void raid5_error(struct mddev *mddev, struct md_rdev *rdev)
+	__must_hold(&mddev->device_lock)
 {
 	struct r5conf *conf = mddev->private;
-	unsigned long flags;
 	pr_debug("raid456: error called\n");
 
 	pr_crit("md/raid:%s: Disk failure on %pg, disabling device.\n",
 		mdname(mddev), rdev->bdev);
 
-	spin_lock_irqsave(&conf->mddev->device_lock, flags);
 	set_bit(Faulty, &rdev->flags);
 	clear_bit(In_sync, &rdev->flags);
 	mddev->degraded = raid5_calc_degraded(conf);
@@ -2929,7 +2928,6 @@ static void raid5_error(struct mddev *mddev, struct md_rdev *rdev)
 			mdname(mddev), conf->raid_disks - mddev->degraded);
 	}
 
-	spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
 	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
 
 	set_bit(Blocked, &rdev->flags);
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 03/16] md: add pers->should_error() callback
  2025-10-27 15:04 [PATCH v5 00/16] Don't set MD_BROKEN on failfast bio failure Kenta Akagi
  2025-10-27 15:04 ` [PATCH v5 01/16] md: move device_lock from conf to mddev Kenta Akagi
  2025-10-27 15:04 ` [PATCH v5 02/16] md: serialize md_error() Kenta Akagi
@ 2025-10-27 15:04 ` Kenta Akagi
  2025-10-30  2:24   ` Yu Kuai
  2025-10-27 15:04 ` [PATCH v5 04/16] md: introduce md_cond_error() Kenta Akagi
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 25+ messages in thread
From: Kenta Akagi @ 2025-10-27 15:04 UTC (permalink / raw)
  To: Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, Kenta Akagi

The failfast feature in RAID1 and RAID10 assumes that when md_error() is
called, the array remains functional because the last rdev neither fails
nor sets MD_BROKEN.

However, the current implementation can cause the array to lose
its last in-sync device or be marked as MD_BROKEN, which breaks the
assumption and can lead to array failure.

To address this issue, a new handler md_cond_error() will be introduced
to ensure that failfast I/O does not mark the array as broken.

As preparation, this commit adds a helper pers->should_error() to determine
from outside the personality whether an rdev can fail safely, which is
needed by md_cond_error().

Signed-off-by: Kenta Akagi <k@mgml.me>
---
 drivers/md/md.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/md.h b/drivers/md/md.h
index c982598cbf97..01c8182431d1 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -763,6 +763,7 @@ struct md_personality
 	 * if appropriate, and should abort recovery if needed
 	 */
 	void (*error_handler)(struct mddev *mddev, struct md_rdev *rdev);
+	bool (*should_error)(struct mddev *mddev, struct md_rdev *rdev, struct bio *bio);
 	int (*hot_add_disk) (struct mddev *mddev, struct md_rdev *rdev);
 	int (*hot_remove_disk) (struct mddev *mddev, struct md_rdev *rdev);
 	int (*spare_active) (struct mddev *mddev);
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 04/16] md: introduce md_cond_error()
  2025-10-27 15:04 [PATCH v5 00/16] Don't set MD_BROKEN on failfast bio failure Kenta Akagi
                   ` (2 preceding siblings ...)
  2025-10-27 15:04 ` [PATCH v5 03/16] md: add pers->should_error() callback Kenta Akagi
@ 2025-10-27 15:04 ` Kenta Akagi
  2025-10-27 15:04 ` [PATCH v5 05/16] md/raid1: implement pers->should_error() Kenta Akagi
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Kenta Akagi @ 2025-10-27 15:04 UTC (permalink / raw)
  To: Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, Kenta Akagi

The failfast feature in RAID1 and RAID10 assumes that when md_error() is
called, the array remains functional because the last rdev neither fails
nor sets MD_BROKEN.

However, the current implementation can cause the array to lose
its last in-sync device or be marked as MD_BROKEN, which breaks the
assumption and can lead to array failure.

To address this issue, introduce md_cond_error(), which handles failfast
bio errors without stopping the array. This function checks whether the
array would become inoperable if the rdev failed, and if so, skips
error handling so that the array remains operational.

Callers of md_error() will be updated to use this new function in
subsequent commits to properly handle failfast scenarios.

Signed-off-by: Kenta Akagi <k@mgml.me>
---
 drivers/md/md.c | 33 +++++++++++++++++++++++++++++++++
 drivers/md/md.h |  1 +
 2 files changed, 34 insertions(+)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 4ad9cb0ac98c..e33ab564f26b 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8479,6 +8479,39 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
 }
 EXPORT_SYMBOL(md_error);
 
+/** md_cond_error() - conditionally call md_error()
+ * @mddev: affected md device
+ * @rdev: member device to fail
+ * @bio: the bio whose failure triggered the device failure
+ *
+ * Check whether the personality wants to fail this rdev for this bio,
+ * and if so, call _md_error().
+ * This function behaves the same as md_error() except for raid1/10
+ * rdevs with failfast enabled.
+ *
+ * Returns: %true if the rdev is already or becomes Faulty, %false if not.
+ */
+bool md_cond_error(struct mddev *mddev, struct md_rdev *rdev, struct bio *bio)
+{
+	/* return true because we don't want the caller to retry */
+	if (WARN_ON_ONCE(!mddev->pers))
+		return true;
+
+	spin_lock(&mddev->device_lock);
+
+	if (mddev->pers->should_error &&
+	    !mddev->pers->should_error(mddev, rdev, bio)) {
+		spin_unlock(&mddev->device_lock);
+		return test_bit(Faulty, &rdev->flags);
+	}
+
+	_md_error(mddev, rdev);
+	spin_unlock(&mddev->device_lock);
+
+	return !WARN_ON_ONCE(!test_bit(Faulty, &rdev->flags));
+}
+EXPORT_SYMBOL(md_cond_error);
+
 /* seq_file implementation /proc/mdstat */
 
 static void status_unused(struct seq_file *seq)
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 01c8182431d1..38f9874538a6 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -916,6 +916,7 @@ extern void md_write_end(struct mddev *mddev);
 extern void md_done_sync(struct mddev *mddev, int blocks, int ok);
 void _md_error(struct mddev *mddev, struct md_rdev *rdev);
 extern void md_error(struct mddev *mddev, struct md_rdev *rdev);
+extern bool md_cond_error(struct mddev *mddev, struct md_rdev *rdev, struct bio *bio);
 extern void md_finish_reshape(struct mddev *mddev);
 void md_submit_discard_bio(struct mddev *mddev, struct md_rdev *rdev,
 			struct bio *bio, sector_t start, sector_t size);
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 05/16] md/raid1: implement pers->should_error()
  2025-10-27 15:04 [PATCH v5 00/16] Don't set MD_BROKEN on failfast bio failure Kenta Akagi
                   ` (3 preceding siblings ...)
  2025-10-27 15:04 ` [PATCH v5 04/16] md: introduce md_cond_error() Kenta Akagi
@ 2025-10-27 15:04 ` Kenta Akagi
  2025-10-30  2:31   ` Yu Kuai
  2025-10-27 15:04 ` [PATCH v5 06/16] md/raid10: " Kenta Akagi
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 25+ messages in thread
From: Kenta Akagi @ 2025-10-27 15:04 UTC (permalink / raw)
  To: Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, Kenta Akagi

The failfast feature in RAID1 and RAID10 assumes that when md_error() is
called, the array remains functional because the last rdev neither fails
nor sets MD_BROKEN.

However, the current implementation can cause the array to lose
its last in-sync device or be marked as MD_BROKEN, which breaks the
assumption and can lead to array failure.

To address this issue, introduce a new handler, md_cond_error(), to
ensure that failfast I/O does not mark the array as broken.

md_cond_error() checks whether a device should be faulted based on
pers->should_error(). This commit implements the should_error() callback
for the raid1 personality; it returns false if failing the specified
rdev would leave the mddev non-functional.

Signed-off-by: Kenta Akagi <k@mgml.me>
---
 drivers/md/raid1.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 202e510f73a4..69b7730f3875 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1732,6 +1732,40 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
 	seq_printf(seq, "]");
 }
 
+/**
+ * raid1_should_error() - Determine whether this rdev should be failed
+ * @mddev: affected md device
+ * @rdev: member device to check
+ * @bio: the bio that caused the failure
+ *
+ * When a failfast bio fails, the rdev may be failed, but the mddev must not
+ * fail. This function tells md_cond_error() not to fail the rdev if the bio
+ * is failfast and the rdev is the last in-sync device.
+ *
+ * Returns: %false if the bio is failfast and the rdev is the last in-sync
+ *	     device. Otherwise %true - this rdev should be failed.
+ */
+static bool raid1_should_error(struct mddev *mddev, struct md_rdev *rdev, struct bio *bio)
+{
+	int i;
+	struct r1conf *conf = mddev->private;
+
+	if (!(bio->bi_opf & MD_FAILFAST) ||
+	    !test_bit(FailFast, &rdev->flags) ||
+	    test_bit(Faulty, &rdev->flags))
+		return true;
+
+	for (i = 0; i < conf->raid_disks; i++) {
+		struct md_rdev *rdev2 = conf->mirrors[i].rdev;
+
+		if (rdev2 && rdev2 != rdev &&
+		    test_bit(In_sync, &rdev2->flags) &&
+		    !test_bit(Faulty, &rdev2->flags))
+			return true;
+	}
+	return false;
+}
+
 /**
  * raid1_error() - RAID1 error handler.
  * @mddev: affected md device.
@@ -3486,6 +3520,7 @@ static struct md_personality raid1_personality =
 	.free		= raid1_free,
 	.status		= raid1_status,
 	.error_handler	= raid1_error,
+	.should_error	= raid1_should_error,
 	.hot_add_disk	= raid1_add_disk,
 	.hot_remove_disk= raid1_remove_disk,
 	.spare_active	= raid1_spare_active,
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 06/16] md/raid10: implement pers->should_error()
  2025-10-27 15:04 [PATCH v5 00/16] Don't set MD_BROKEN on failfast bio failure Kenta Akagi
                   ` (4 preceding siblings ...)
  2025-10-27 15:04 ` [PATCH v5 05/16] md/raid1: implement pers->should_error() Kenta Akagi
@ 2025-10-27 15:04 ` Kenta Akagi
  2025-10-27 15:04 ` [PATCH v5 07/16] md/raid1: refactor handle_read_error() Kenta Akagi
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Kenta Akagi @ 2025-10-27 15:04 UTC (permalink / raw)
  To: Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, Kenta Akagi

The failfast feature in RAID1 and RAID10 assumes that when md_error() is
called, the array remains functional: the last rdev is not failed and
MD_BROKEN is not set.

However, the current implementation can cause the array to lose its last
in-sync device or be marked as MD_BROKEN, which breaks the assumption
and can lead to array failure.

To address this issue, introduce a new handler, md_cond_error(), to
ensure that failfast I/O does not mark the array as broken.

md_cond_error() decides whether a device should be faulted based on
pers->should_error(). This commit implements the should_error() callback
for the raid10 personality, which returns false if faulting the specified
rdev would cause the mddev to become non-functional.

Signed-off-by: Kenta Akagi <k@mgml.me>
---
 drivers/md/raid10.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 25c0ab09807b..68dbab7b360b 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1977,6 +1977,31 @@ static int enough(struct r10conf *conf, int ignore)
 		_enough(conf, 1, ignore);
 }
 
+/**
+ * raid10_should_error() - Determine if this rdev should be failed
+ * @mddev: affected md device
+ * @rdev: member device to check
+ * @bio: the bio that caused the failure
+ *
+ * When a failfast bio fails, the rdev may be failed, but the mddev must
+ * not fail. This function tells md_cond_error() not to fail the rdev if
+ * the bio is failfast and the rdev is the last in-sync device.
+ *
+ * Returns: %false if bio is failfast and rdev is the last in-sync device.
+ *	     Otherwise %true - should fail this rdev.
+ */
+static bool raid10_should_error(struct mddev *mddev, struct md_rdev *rdev, struct bio *bio)
+{
+	struct r10conf *conf = mddev->private;
+
+	if (!(bio->bi_opf & MD_FAILFAST) ||
+	    !test_bit(FailFast, &rdev->flags) ||
+	    test_bit(Faulty, &rdev->flags))
+		return true;
+
+	return enough(conf, rdev->raid_disk);
+}
+
 /**
  * raid10_error() - RAID10 error handler.
  * @mddev: affected md device.
@@ -5116,6 +5141,7 @@ static struct md_personality raid10_personality =
 	.free		= raid10_free,
 	.status		= raid10_status,
 	.error_handler	= raid10_error,
+	.should_error	= raid10_should_error,
 	.hot_add_disk	= raid10_add_disk,
 	.hot_remove_disk= raid10_remove_disk,
 	.spare_active	= raid10_spare_active,
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 07/16] md/raid1: refactor handle_read_error()
  2025-10-27 15:04 [PATCH v5 00/16] Don't set MD_BROKEN on failfast bio failure Kenta Akagi
                   ` (5 preceding siblings ...)
  2025-10-27 15:04 ` [PATCH v5 06/16] md/raid10: " Kenta Akagi
@ 2025-10-27 15:04 ` Kenta Akagi
  2025-10-30  2:38   ` Yu Kuai
  2025-10-27 15:04 ` [PATCH v5 08/16] md/raid10: " Kenta Akagi
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 25+ messages in thread
From: Kenta Akagi @ 2025-10-27 15:04 UTC (permalink / raw)
  To: Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, Kenta Akagi

A subsequent commit will change the behavior of handle_read_error()
for the failfast bio feature; refactor the function first.

This commit only refactors the code, with no functional changes. The
subsequent commit will replace md_error() with md_cond_error() to
implement proper failfast error handling.

Signed-off-by: Kenta Akagi <k@mgml.me>
---
 drivers/md/raid1.c | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 69b7730f3875..a70ca6bc28f3 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2675,24 +2675,22 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
 	r1_bio->bios[r1_bio->read_disk] = NULL;
 
 	rdev = conf->mirrors[r1_bio->read_disk].rdev;
-	if (mddev->ro == 0
-	    && !test_bit(FailFast, &rdev->flags)) {
+	if (mddev->ro) {
+		r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
+	} else if (test_bit(FailFast, &rdev->flags)) {
+		md_error(mddev, rdev);
+	} else {
 		freeze_array(conf, 1);
 		fix_read_error(conf, r1_bio);
 		unfreeze_array(conf);
-	} else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) {
-		md_error(mddev, rdev);
-	} else {
-		r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
 	}
 
 	rdev_dec_pending(rdev, conf->mddev);
 	sector = r1_bio->sector;
-	bio = r1_bio->master_bio;
 
 	/* Reuse the old r1_bio so that the IO_BLOCKED settings are preserved */
 	r1_bio->state = 0;
-	raid1_read_request(mddev, bio, r1_bio->sectors, r1_bio);
+	raid1_read_request(mddev, r1_bio->master_bio, r1_bio->sectors, r1_bio);
 	allow_barrier(conf, sector);
 }
 
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 08/16] md/raid10: refactor handle_read_error()
  2025-10-27 15:04 [PATCH v5 00/16] Don't set MD_BROKEN on failfast bio failure Kenta Akagi
                   ` (6 preceding siblings ...)
  2025-10-27 15:04 ` [PATCH v5 07/16] md/raid1: refactor handle_read_error() Kenta Akagi
@ 2025-10-27 15:04 ` Kenta Akagi
  2025-10-30  2:39   ` Yu Kuai
  2025-10-27 15:04 ` [PATCH v5 09/16] md/raid10: fix failfast read error not rescheduled Kenta Akagi
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 25+ messages in thread
From: Kenta Akagi @ 2025-10-27 15:04 UTC (permalink / raw)
  To: Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, Kenta Akagi

A subsequent commit will change the behavior of handle_read_error()
for the failfast bio feature; refactor the function first.

This commit only refactors the code, with no functional changes. The
subsequent commit will replace md_error() with md_cond_error() to
implement proper failfast error handling.

Signed-off-by: Kenta Akagi <k@mgml.me>
---
 drivers/md/raid10.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 68dbab7b360b..87468113e31a 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2873,14 +2873,15 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
 	bio_put(bio);
 	r10_bio->devs[slot].bio = NULL;
 
-	if (mddev->ro)
+	if (mddev->ro) {
 		r10_bio->devs[slot].bio = IO_BLOCKED;
-	else if (!test_bit(FailFast, &rdev->flags)) {
+	} else if (test_bit(FailFast, &rdev->flags)) {
+		md_error(mddev, rdev);
+	} else {
 		freeze_array(conf, 1);
 		fix_read_error(conf, mddev, r10_bio);
 		unfreeze_array(conf);
-	} else
-		md_error(mddev, rdev);
+	}
 
 	rdev_dec_pending(rdev, mddev);
 	r10_bio->state = 0;
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 09/16] md/raid10: fix failfast read error not rescheduled
  2025-10-27 15:04 [PATCH v5 00/16] Don't set MD_BROKEN on failfast bio failure Kenta Akagi
                   ` (7 preceding siblings ...)
  2025-10-27 15:04 ` [PATCH v5 08/16] md/raid10: " Kenta Akagi
@ 2025-10-27 15:04 ` Kenta Akagi
  2025-10-30  2:41   ` Yu Kuai
  2025-10-27 15:04 ` [PATCH v5 10/16] md: prevent set MD_BROKEN on super_write failure with failfast Kenta Akagi
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 25+ messages in thread
From: Kenta Akagi @ 2025-10-27 15:04 UTC (permalink / raw)
  To: Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, Kenta Akagi, Li Nan

raid10_end_read_request() lacks a path to retry when a failfast IO fails.
As a result, when failfast read IOs fail on all rdevs, the upper layer
receives EIO without the read being rescheduled.

Looking at the two commits below, it seems only raid10_end_read_request
lacks the failfast read retry handling, while raid1_end_read_request has
it. In RAID1, the retry works as expected.
* commit 8d3ca83dcf9c ("md/raid10: add failfast handling for reads.")
* commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")

This commit makes raid10 retry a failfast read bio for the last rdev
when it fails.

Fixes: 8d3ca83dcf9c ("md/raid10: add failfast handling for reads.")
Signed-off-by: Kenta Akagi <k@mgml.me>
Reviewed-by: Li Nan <linan122@huawei.com>
---
 drivers/md/raid10.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 87468113e31a..1dd27b9ef48e 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -401,6 +401,13 @@ static void raid10_end_read_request(struct bio *bio)
 		 * wait for the 'master' bio.
 		 */
 		set_bit(R10BIO_Uptodate, &r10_bio->state);
+	} else if (test_bit(FailFast, &rdev->flags) &&
+		 test_bit(R10BIO_FailFast, &r10_bio->state)) {
+		/*
+		 * This was a fail-fast read so we definitely
+		 * want to retry
+		 */
+		;
 	} else if (!raid1_should_handle_error(bio)) {
 		uptodate = 1;
 	} else {
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 10/16] md: prevent set MD_BROKEN on super_write failure with failfast
  2025-10-27 15:04 [PATCH v5 00/16] Don't set MD_BROKEN on failfast bio failure Kenta Akagi
                   ` (8 preceding siblings ...)
  2025-10-27 15:04 ` [PATCH v5 09/16] md/raid10: fix failfast read error not rescheduled Kenta Akagi
@ 2025-10-27 15:04 ` Kenta Akagi
  2025-10-27 15:04 ` [PATCH v5 11/16] md/raid1: Prevent set MD_BROKEN on failfast bio failure Kenta Akagi
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Kenta Akagi @ 2025-10-27 15:04 UTC (permalink / raw)
  To: Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, Kenta Akagi

Failfast is a feature implemented only for RAID1 and RAID10. It instructs
the block device providing the rdev to immediately return a bio error
without retrying if any issue occurs. This allows quickly detaching a
problematic rdev and minimizes IO latency.

Due to its nature, failfast bios can fail easily, and md must not mark
an essential rdev as Faulty or set MD_BROKEN on the array just because
a failfast bio failed.

When failfast was introduced, RAID1 and RAID10 were designed to continue
operating normally even if md_error was called for the last rdev. However,
with the introduction of MD_BROKEN in RAID1/RAID10 in commit 9631abdbf406
("md: Set MD_BROKEN for RAID1 and RAID10"), calling md_error for the last
rdev now prevents further writes to the array. Despite this, the current
failfast error handler still assumes that calling md_error will not break
the array.

Normally, this is not an issue because MD_FAILFAST is not set when a bio
is issued to the last rdev. However, if the array is not degraded and a
bio with MD_FAILFAST has been issued, simultaneous failures could
potentially break the array. This is unusual but can happen; for example,
this can occur when using NVMe over TCP if all rdevs depend on a single
Ethernet link.

In other words, this becomes a problem under the following conditions:

Preconditions:
* Failfast is enabled on all rdevs.
* All rdevs are In_sync - this is a requirement for a bio to be
  submitted with MD_FAILFAST.
* At least one bio has been submitted but has not yet completed.

Trigger condition:
* All underlying devices of the rdevs return an error for their failfast
  bios.

Whether the bio is a read or a write, both rdevs will eventually be lost.
In the write case, md_error is invoked on each rdev through its
bi_end_io handler. In the read case, if the bio has been issued to
multiple rdevs via read_balance, the result is the same as the write
case. Even when the bio has been issued to only a single rdev, both
rdevs will be lost in the following sequence:
1. losing the first rdev triggers a metadata update
2. md_super_write issues the bio with MD_FAILFAST, causing the bio to
   fail immediately. md_super_write always issues a MD_FAILFAST bio if
   the rdev has FailFast set, regardless of whether other rdevs remain.
3. the bio issued by md_super_write fails, so super_written calls
   md_error on the remaining rdev.

This commit fixes the second read case: ensure that a failed failfast
metadata write does not mark the last rdev Faulty or set MD_BROKEN on
the array.

Fixes: 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10")
Fixes: 9a567843f7ce ("md: allow last device to be forcibly removed from RAID1/RAID10.")
Signed-off-by: Kenta Akagi <k@mgml.me>
---
 drivers/md/md.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index e33ab564f26b..3c3f5703531b 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1060,8 +1060,7 @@ static void super_written(struct bio *bio)
 	if (bio->bi_status) {
 		pr_err("md: %s gets error=%d\n", __func__,
 		       blk_status_to_errno(bio->bi_status));
-		md_error(mddev, rdev);
-		if (!test_bit(Faulty, &rdev->flags)
+		if (!md_cond_error(mddev, rdev, bio)
 		    && (bio->bi_opf & MD_FAILFAST)) {
 			set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
 			set_bit(LastDev, &rdev->flags);
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 11/16] md/raid1: Prevent set MD_BROKEN on failfast bio failure
  2025-10-27 15:04 [PATCH v5 00/16] Don't set MD_BROKEN on failfast bio failure Kenta Akagi
                   ` (9 preceding siblings ...)
  2025-10-27 15:04 ` [PATCH v5 10/16] md: prevent set MD_BROKEN on super_write failure with failfast Kenta Akagi
@ 2025-10-27 15:04 ` Kenta Akagi
  2025-10-27 15:04 ` [PATCH v5 12/16] md/raid10: " Kenta Akagi
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Kenta Akagi @ 2025-10-27 15:04 UTC (permalink / raw)
  To: Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, Kenta Akagi

Failfast is a feature implemented only for RAID1 and RAID10. It instructs
the block device providing the rdev to immediately return a bio error
without retrying if any issue occurs. This allows quickly detaching a
problematic rdev and minimizes IO latency.

Due to its nature, failfast bios can fail easily, and md must not mark
an essential rdev as Faulty or set MD_BROKEN on the array just because
a failfast bio failed.

When failfast was introduced, RAID1 and RAID10 were designed to continue
operating normally even if md_error was called for the last rdev. However,
with the introduction of MD_BROKEN in RAID1/RAID10
in commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"), calling
md_error for the last rdev now prevents further writes to the array.
Despite this, the current failfast error handler still assumes that
calling md_error will not break the array.

Normally, this is not an issue because MD_FAILFAST is not set when a bio
is issued to the last rdev. However, if the array is not degraded and a
bio with MD_FAILFAST has been issued, simultaneous failures could
potentially break the array. This is unusual but can happen; for example,
this can occur when using NVMe over TCP if all rdevs depend on
a single Ethernet link.

In other words, this becomes a problem under the following conditions:
Preconditions:
* Failfast is enabled on all rdevs.
* All rdevs are In_sync - this is a requirement for a bio to be
  submitted with MD_FAILFAST.
* At least one bio has been submitted but has not yet completed.

Trigger condition:
* All underlying devices of the rdevs return an error for their failfast
  bios.

Whether the bio is a read or a write, both rdevs will eventually be lost.
In the write case, md_error is invoked on each rdev through its
bi_end_io handler. In the read case, if the bio has been issued to
multiple rdevs via read_balance, the result is the same as the write
case. Even when the bio has been issued to only a single rdev, both
rdevs will be lost in the following sequence:
1. losing the first rdev triggers a metadata update
2. md_super_write issues the bio with MD_FAILFAST, causing the bio to
   fail immediately. md_super_write always issues a MD_FAILFAST bio if
   the rdev has FailFast set, regardless of whether other rdevs remain.
3. the bio issued by md_super_write fails, so super_written calls
   md_error on the remaining rdev.

This commit fixes the write case and the first read case: ensure that a
failfast bio failure does not cause the last rdev to become Faulty or
the array to be marked MD_BROKEN.

The second read case, i.e. failure of the metadata update, was already
fixed in the previous commit.

Signed-off-by: Kenta Akagi <k@mgml.me>
---
 drivers/md/raid1.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index a70ca6bc28f3..bf96ae78a8b1 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -470,7 +470,7 @@ static void raid1_end_write_request(struct bio *bio)
 		    (bio->bi_opf & MD_FAILFAST) &&
 		    /* We never try FailFast to WriteMostly devices */
 		    !test_bit(WriteMostly, &rdev->flags)) {
-			md_error(r1_bio->mddev, rdev);
+			md_cond_error(r1_bio->mddev, rdev, bio);
 		}
 
 		/*
@@ -2177,8 +2177,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
 	if (test_bit(FailFast, &rdev->flags)) {
 		/* Don't try recovering from here - just fail it
 		 * ... unless it is the last working device of course */
-		md_error(mddev, rdev);
-		if (test_bit(Faulty, &rdev->flags))
+		if (md_cond_error(mddev, rdev, bio))
 			/* Don't try to read from here, but make sure
 			 * put_buf does it's thing
 			 */
@@ -2671,20 +2670,20 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
 	 */
 
 	bio = r1_bio->bios[r1_bio->read_disk];
-	bio_put(bio);
 	r1_bio->bios[r1_bio->read_disk] = NULL;
 
 	rdev = conf->mirrors[r1_bio->read_disk].rdev;
 	if (mddev->ro) {
 		r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
 	} else if (test_bit(FailFast, &rdev->flags)) {
-		md_error(mddev, rdev);
+		md_cond_error(mddev, rdev, bio);
 	} else {
 		freeze_array(conf, 1);
 		fix_read_error(conf, r1_bio);
 		unfreeze_array(conf);
 	}
 
+	bio_put(bio);
 	rdev_dec_pending(rdev, conf->mddev);
 	sector = r1_bio->sector;
 
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 12/16] md/raid10: Prevent set MD_BROKEN on failfast bio failure
  2025-10-27 15:04 [PATCH v5 00/16] Don't set MD_BROKEN on failfast bio failure Kenta Akagi
                   ` (10 preceding siblings ...)
  2025-10-27 15:04 ` [PATCH v5 11/16] md/raid1: Prevent set MD_BROKEN on failfast bio failure Kenta Akagi
@ 2025-10-27 15:04 ` Kenta Akagi
  2025-10-27 15:04 ` [PATCH v5 13/16] md/raid1: add error message when setting MD_BROKEN Kenta Akagi
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Kenta Akagi @ 2025-10-27 15:04 UTC (permalink / raw)
  To: Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, Kenta Akagi

Failfast is a feature implemented only for RAID1 and RAID10. It instructs
the block device providing the rdev to immediately return a bio error
without retrying if any issue occurs. This allows quickly detaching a
problematic rdev and minimizes IO latency.

Due to its nature, failfast bios can fail easily, and md must not mark
an essential rdev as Faulty or set MD_BROKEN on the array just because
a failfast bio failed.

When failfast was introduced, RAID1 and RAID10 were designed to continue
operating normally even if md_error was called for the last rdev. However,
with the introduction of MD_BROKEN in RAID1/RAID10
in commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"), calling
md_error for the last rdev now prevents further writes to the array.
Despite this, the current failfast error handler still assumes that
calling md_error will not break the array.

Normally, this is not an issue because MD_FAILFAST is not set when a bio
is issued to the last rdev. However, if the array is not degraded and a
bio with MD_FAILFAST has been issued, simultaneous failures could
potentially break the array. This is unusual but can happen; for example,
this can occur when using NVMe over TCP if all rdevs depend on
a single Ethernet link.

In other words, this becomes a problem under the following conditions:
Preconditions:
* Failfast is enabled on all rdevs.
* All rdevs are In_sync - this is a requirement for a bio to be
  submitted with MD_FAILFAST.
* At least one bio has been submitted but has not yet completed.

Trigger condition:
* All underlying devices of the rdevs return an error for their failfast
  bios.

Whether the bio is a read or a write, both rdevs will eventually be lost.
In the write case, md_error is invoked on each rdev through its
bi_end_io handler. In the read case, if the bio has been issued to
multiple rdevs via read_balance, the result is the same as the write
case. Even when the bio has been issued to only a single rdev, both
rdevs will be lost in the following sequence:
1. losing the first rdev triggers a metadata update
2. md_super_write issues the bio with MD_FAILFAST, causing the bio to
   fail immediately. md_super_write always issues a MD_FAILFAST bio if
   the rdev has FailFast set, regardless of whether other rdevs remain.
3. the bio issued by md_super_write fails, so super_written calls
   md_error on the remaining rdev.

This commit fixes the write case and the first read case: ensure that a
failfast bio failure does not cause the last rdev to become Faulty or
the array to be marked MD_BROKEN.

The second read case, i.e. failure of the metadata update, was already
fixed in the previous commit.

Signed-off-by: Kenta Akagi <k@mgml.me>
---
 drivers/md/raid10.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 1dd27b9ef48e..aa9d328fe875 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -497,7 +497,7 @@ static void raid10_end_write_request(struct bio *bio)
 			dec_rdev = 0;
 			if (test_bit(FailFast, &rdev->flags) &&
 			    (bio->bi_opf & MD_FAILFAST)) {
-				md_error(rdev->mddev, rdev);
+				md_cond_error(rdev->mddev, rdev, bio);
 			}
 
 			/*
@@ -2434,7 +2434,7 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
 				continue;
 		} else if (test_bit(FailFast, &rdev->flags)) {
 			/* Just give up on this device */
-			md_error(rdev->mddev, rdev);
+			md_cond_error(rdev->mddev, rdev, r10_bio->devs[i].bio);
 			continue;
 		}
 		/* Ok, we need to write this bio, either to correct an
@@ -2877,19 +2877,19 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
 	 * frozen.
 	 */
 	bio = r10_bio->devs[slot].bio;
-	bio_put(bio);
 	r10_bio->devs[slot].bio = NULL;
 
 	if (mddev->ro) {
 		r10_bio->devs[slot].bio = IO_BLOCKED;
 	} else if (test_bit(FailFast, &rdev->flags)) {
-		md_error(mddev, rdev);
+		md_cond_error(mddev, rdev, bio);
 	} else {
 		freeze_array(conf, 1);
 		fix_read_error(conf, mddev, r10_bio);
 		unfreeze_array(conf);
 	}
 
+	bio_put(bio);
 	rdev_dec_pending(rdev, mddev);
 	r10_bio->state = 0;
 	raid10_read_request(mddev, r10_bio->master_bio, r10_bio, false);
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 13/16] md/raid1: add error message when setting MD_BROKEN
  2025-10-27 15:04 [PATCH v5 00/16] Don't set MD_BROKEN on failfast bio failure Kenta Akagi
                   ` (11 preceding siblings ...)
  2025-10-27 15:04 ` [PATCH v5 12/16] md/raid10: " Kenta Akagi
@ 2025-10-27 15:04 ` Kenta Akagi
  2025-10-27 15:04 ` [PATCH v5 14/16] md/raid10: Add " Kenta Akagi
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 25+ messages in thread
From: Kenta Akagi @ 2025-10-27 15:04 UTC (permalink / raw)
  To: Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, Kenta Akagi

Once MD_BROKEN is set on an array, no further writes can be performed.
The user must be informed that the array cannot continue operation.

Add error logging when the MD_BROKEN flag is set on raid1 arrays to
improve debugging and system administration visibility.

Signed-off-by: Kenta Akagi <k@mgml.me>
---
 drivers/md/raid1.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index bf96ae78a8b1..d58a60fb5b2f 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1790,6 +1790,10 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
 	if (test_bit(In_sync, &rdev->flags) &&
 	    (conf->raid_disks - mddev->degraded) == 1) {
 		set_bit(MD_BROKEN, &mddev->flags);
+		pr_crit("md/raid1:%s: Disk failure on %pg, this is the last device.\n"
+			"md/raid1:%s: Cannot continue operation (%d/%d failed).\n",
+			mdname(mddev), rdev->bdev,
+			mdname(mddev), mddev->degraded + 1, conf->raid_disks);
 
 		if (!mddev->fail_last_dev) {
 			conf->recovery_disabled = mddev->recovery_disabled;
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 14/16] md/raid10: Add error message when setting MD_BROKEN
  2025-10-27 15:04 [PATCH v5 00/16] Don't set MD_BROKEN on failfast bio failure Kenta Akagi
                   ` (12 preceding siblings ...)
  2025-10-27 15:04 ` [PATCH v5 13/16] md/raid1: add error message when setting MD_BROKEN Kenta Akagi
@ 2025-10-27 15:04 ` Kenta Akagi
  2025-10-27 15:04 ` [PATCH v5 15/16] md: rename 'LastDev' rdev flag to 'RetryingSBWrite' Kenta Akagi
  2025-10-27 15:04 ` [PATCH v5 16/16] md: Improve super_written() error logging Kenta Akagi
  15 siblings, 0 replies; 25+ messages in thread
From: Kenta Akagi @ 2025-10-27 15:04 UTC (permalink / raw)
  To: Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, Kenta Akagi

Once MD_BROKEN is set on an array, no further writes can be performed.
The user must be informed that the array cannot continue operation.

Add error logging when the MD_BROKEN flag is set on raid10 arrays to
improve debugging and system administration visibility.

Signed-off-by: Kenta Akagi <k@mgml.me>
---
 drivers/md/raid10.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index aa9d328fe875..369def3413c0 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2031,6 +2031,10 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
 
 	if (test_bit(In_sync, &rdev->flags) && !enough(conf, rdev->raid_disk)) {
 		set_bit(MD_BROKEN, &mddev->flags);
+		pr_crit("md/raid10:%s: Disk failure on %pg, this is the last device.\n"
+			"md/raid10:%s: Cannot continue operation (%d/%d failed).\n",
+			mdname(mddev), rdev->bdev,
+			mdname(mddev), mddev->degraded + 1, conf->geo.raid_disks);
 
 		if (!mddev->fail_last_dev)
 			return;
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 15/16] md: rename 'LastDev' rdev flag to 'RetryingSBWrite'
  2025-10-27 15:04 [PATCH v5 00/16] Don't set MD_BROKEN on failfast bio failure Kenta Akagi
                   ` (13 preceding siblings ...)
  2025-10-27 15:04 ` [PATCH v5 14/16] md/raid10: Add " Kenta Akagi
@ 2025-10-27 15:04 ` Kenta Akagi
  2025-10-27 15:04 ` [PATCH v5 16/16] md: Improve super_written() error logging Kenta Akagi
  15 siblings, 0 replies; 25+ messages in thread
From: Kenta Akagi @ 2025-10-27 15:04 UTC (permalink / raw)
  To: Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, Kenta Akagi

The LastDev rdev flag is set on a FailFast rdev whose metadata write
with MD_FAILFAST failed, but which did not become Faulty after
md_error(). The metadata write is then retried via the
MD_SB_NEED_REWRITE sb_flag; if LastDev is set at that point, FailFast
is not used for the retry, and LastDev is cleared when the write
succeeds.

Although it's called LastDev, this rdev flag actually means "metadata
write with FailFast failed and a retry is required. Do not use FailFast
when retrying for this rdev."

This is different from what we might expect from the name LastDev,
and can be confusing when reading the code.

This commit renames the LastDev flag to better reflect its actual
behavior and improve readability. The implementation remains
unchanged.

Signed-off-by: Kenta Akagi <k@mgml.me>
---
 drivers/md/md.c | 6 +++---
 drivers/md/md.h | 7 ++++---
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 3c3f5703531b..1cbb4fd8bbc0 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1063,10 +1063,10 @@ static void super_written(struct bio *bio)
 		if (!md_cond_error(mddev, rdev, bio)
 		    && (bio->bi_opf & MD_FAILFAST)) {
 			set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
-			set_bit(LastDev, &rdev->flags);
+			set_bit(RetryingSBWrite, &rdev->flags);
 		}
 	} else
-		clear_bit(LastDev, &rdev->flags);
+		clear_bit(RetryingSBWrite, &rdev->flags);
 
 	bio_put(bio);
 
@@ -1119,7 +1119,7 @@ void md_write_metadata(struct mddev *mddev, struct md_rdev *rdev,
 
 	if (test_bit(MD_FAILFAST_SUPPORTED, &mddev->flags) &&
 	    test_bit(FailFast, &rdev->flags) &&
-	    !test_bit(LastDev, &rdev->flags))
+	    !test_bit(RetryingSBWrite, &rdev->flags))
 		bio->bi_opf |= MD_FAILFAST;
 
 	atomic_inc(&mddev->pending_writes);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 38f9874538a6..0943cc5a86aa 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -282,9 +282,10 @@ enum flag_bits {
 				 * It is expects that no bad block log
 				 * is present.
 				 */
-	LastDev,		/* Seems to be the last working dev as
-				 * it didn't fail, so don't use FailFast
-				 * any more for metadata
+	RetryingSBWrite,	/*
+				 * metadata write with MD_FAILFAST failed,
+				 * so it is being retried. Failfast
+				 * will not be used during the retry.
 				 */
 	CollisionCheck,		/*
 				 * check if there is collision between raid1
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 16/16] md: Improve super_written() error logging
  2025-10-27 15:04 [PATCH v5 00/16] Don't set MD_BROKEN on failfast bio failure Kenta Akagi
                   ` (14 preceding siblings ...)
  2025-10-27 15:04 ` [PATCH v5 15/16] md: rename 'LastDev' rdev flag to 'RetryingSBWrite' Kenta Akagi
@ 2025-10-27 15:04 ` Kenta Akagi
  15 siblings, 0 replies; 25+ messages in thread
From: Kenta Akagi @ 2025-10-27 15:04 UTC (permalink / raw)
  To: Song Liu, Yu Kuai, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, Kenta Akagi

In the current implementation, when super_write fails, the output log
will be like this:
 md: super_written gets error=-5

Unless it is combined with other logs - e.g. the I/O error message from
the block layer - it is impossible to determine which md device and rdev
caused the problem, and if the problem occurs on multiple devices, it
becomes completely impossible to tell from the logs where the
super_write failed.

Also, super_written currently does not log anything when a metadata
write is retried. If the metadata write fails, the array may be
corrupted, but not if the retry succeeds. The user should be informed
when a retry is attempted.

This commit adds the array and device names to the error message so the
failing md device and rdev can be identified, and adds a message to
indicate when a metadata write retry is scheduled.

Signed-off-by: Kenta Akagi <k@mgml.me>
---
 drivers/md/md.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 1cbb4fd8bbc0..4cbb31552486 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1058,10 +1058,13 @@ static void super_written(struct bio *bio)
 	struct mddev *mddev = rdev->mddev;
 
 	if (bio->bi_status) {
-		pr_err("md: %s gets error=%d\n", __func__,
+		pr_err("md: %s: %pg: %s gets error=%d\n",
+		       mdname(mddev), rdev->bdev, __func__,
 		       blk_status_to_errno(bio->bi_status));
 		if (!md_cond_error(mddev, rdev, bio)
 		    && (bio->bi_opf & MD_FAILFAST)) {
+			pr_warn("md: %s: %pg: retrying metadata write\n",
+				mdname(mddev), rdev->bdev);
 			set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
 			set_bit(RetryingSBWrite, &rdev->flags);
 		}
-- 
2.50.1



* Re: [PATCH v5 01/16] md: move device_lock from conf to mddev
  2025-10-27 15:04 ` [PATCH v5 01/16] md: move device_lock from conf to mddev Kenta Akagi
@ 2025-10-30  1:59   ` Yu Kuai
  0 siblings, 0 replies; 25+ messages in thread
From: Yu Kuai @ 2025-10-30  1:59 UTC (permalink / raw)
  To: Kenta Akagi, Song Liu, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel

On 2025/10/27 23:04, Kenta Akagi wrote:

> Move device_lock from mddev->private (r1conf, r10conf, r5conf)
> to the mddev structure in preparation for serializing md_error() and
> introducing a new function that calls md_error() conditionally to
> improve failfast bio error handling.
>
> This commit only moves the device_lock location with no functional
> changes. Subsequent commits will serialize md_error() and introduce
> a new function that calls md_error() conditionally.
>
> Signed-off-by: Kenta Akagi<k@mgml.me>
> ---
>   drivers/md/md.c          |   1 +
>   drivers/md/md.h          |   2 +
>   drivers/md/raid1.c       |  51 +++++++-------
>   drivers/md/raid1.h       |   2 -
>   drivers/md/raid10.c      |  61 +++++++++--------
>   drivers/md/raid10.h      |   1 -
>   drivers/md/raid5-cache.c |  16 ++---
>   drivers/md/raid5.c       | 139 +++++++++++++++++++--------------------
>   drivers/md/raid5.h       |   1 -
>   9 files changed, 135 insertions(+), 139 deletions(-)

Reviewed-by: Yu Kuai <yukuai@fnnas.com>


* Re: [PATCH v5 02/16] md: serialize md_error()
  2025-10-27 15:04 ` [PATCH v5 02/16] md: serialize md_error() Kenta Akagi
@ 2025-10-30  2:11   ` Yu Kuai
  2025-11-06 17:06     ` Kenta Akagi
  0 siblings, 1 reply; 25+ messages in thread
From: Yu Kuai @ 2025-10-30  2:11 UTC (permalink / raw)
  To: Kenta Akagi, Song Liu, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel

Hi,

On 2025/10/27 23:04, Kenta Akagi wrote:
> Serialize the md_error() function in preparation for the introduction of
> a conditional md_error() in a subsequent commit. The conditional
> md_error() is intended to prevent unintentional setting of MD_BROKEN
> during RAID1/10 failfast handling.
>
> To enhance the failfast bio error handler, it must verify that the
> affected rdev is not the last working device before marking it as
> faulty. Without serialization, a race condition can occur if multiple
> failfast bios attempt to call the error handler concurrently:
>
>          failfast bio1			failfast bio2
>          ---                             ---
>          md_cond_error(md,rdev1,bio)	md_cond_error(md,rdev2,bio)
>            if(!is_degraded(md))		  if(!is_degraded(md))
>               raid1_error(md,rdev1)          raid1_error(md,rdev2)
>                 spin_lock(md)
>                 set_faulty(rdev1)
> 	       spin_unlock(md)
> 						spin_lock(md)
> 						set_faulty(rdev2)
> 						set_broken(md)
> 						spin_unlock(md)
>
> This can unintentionally cause the array to stop in situations where the
> 'Last' rdev should not be marked as Faulty.
>
> This commit serializes the md_error() function for all RAID
> personalities to avoid this race condition. Future commits will
> introduce a conditional md_error() specifically for failfast bio
> handling.
>
> Serialization is applied to both the standard and conditional md_error()
> for the following reasons:
>
> - Both use the same error-handling mechanism, so it's clearer to
>    serialize them consistently.
> - The md_error() path is cold, meaning serialization has no performance
>    impact.
>
> Signed-off-by: Kenta Akagi <k@mgml.me>
> ---
>   drivers/md/md-linear.c |  1 +
>   drivers/md/md.c        | 10 +++++++++-
>   drivers/md/md.h        |  1 +
>   drivers/md/raid0.c     |  1 +
>   drivers/md/raid1.c     |  6 +-----
>   drivers/md/raid10.c    |  9 ++-------
>   drivers/md/raid5.c     |  4 +---
>   7 files changed, 16 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/md/md-linear.c b/drivers/md/md-linear.c
> index 7033d982d377..0f6893e4b9f5 100644
> --- a/drivers/md/md-linear.c
> +++ b/drivers/md/md-linear.c
> @@ -298,6 +298,7 @@ static void linear_status(struct seq_file *seq, struct mddev *mddev)
>   }
>   
>   static void linear_error(struct mddev *mddev, struct md_rdev *rdev)
> +	__must_hold(&mddev->device_lock)

This is fine, however, I personally prefer lockdep_assert_held(). :)

Otherwise LGTM.

Thanks,
Kuai

>   {
>   	if (!test_and_set_bit(MD_BROKEN, &mddev->flags)) {
>   		char *md_name = mdname(mddev);
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index d667580e3125..4ad9cb0ac98c 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -8444,7 +8444,8 @@ void md_unregister_thread(struct mddev *mddev, struct md_thread __rcu **threadp)
>   }
>   EXPORT_SYMBOL(md_unregister_thread);
>   
> -void md_error(struct mddev *mddev, struct md_rdev *rdev)
> +void _md_error(struct mddev *mddev, struct md_rdev *rdev)
> +	__must_hold(&mddev->device_lock)
>   {
>   	if (!rdev || test_bit(Faulty, &rdev->flags))
>   		return;
> @@ -8469,6 +8470,13 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
>   		queue_work(md_misc_wq, &mddev->event_work);
>   	md_new_event();
>   }
> +
> +void md_error(struct mddev *mddev, struct md_rdev *rdev)
> +{
> +	spin_lock(&mddev->device_lock);
> +	_md_error(mddev, rdev);
> +	spin_unlock(&mddev->device_lock);
> +}
>   EXPORT_SYMBOL(md_error);
>   
>   /* seq_file implementation /proc/mdstat */
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 64ac22edf372..c982598cbf97 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -913,6 +913,7 @@ extern void md_write_start(struct mddev *mddev, struct bio *bi);
>   extern void md_write_inc(struct mddev *mddev, struct bio *bi);
>   extern void md_write_end(struct mddev *mddev);
>   extern void md_done_sync(struct mddev *mddev, int blocks, int ok);
> +void _md_error(struct mddev *mddev, struct md_rdev *rdev);
>   extern void md_error(struct mddev *mddev, struct md_rdev *rdev);
>   extern void md_finish_reshape(struct mddev *mddev);
>   void md_submit_discard_bio(struct mddev *mddev, struct md_rdev *rdev,
> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
> index e443e478645a..8cf3caf9defd 100644
> --- a/drivers/md/raid0.c
> +++ b/drivers/md/raid0.c
> @@ -625,6 +625,7 @@ static void raid0_status(struct seq_file *seq, struct mddev *mddev)
>   }
>   
>   static void raid0_error(struct mddev *mddev, struct md_rdev *rdev)
> +	__must_hold(&mddev->device_lock)
>   {
>   	if (!test_and_set_bit(MD_BROKEN, &mddev->flags)) {
>   		char *md_name = mdname(mddev);
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 7924d5ee189d..202e510f73a4 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1749,11 +1749,9 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>    * &mddev->fail_last_dev is off.
>    */
>   static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
> +	__must_hold(&mddev->device_lock)
>   {
>   	struct r1conf *conf = mddev->private;
> -	unsigned long flags;
> -
> -	spin_lock_irqsave(&conf->mddev->device_lock, flags);
>   
>   	if (test_bit(In_sync, &rdev->flags) &&
>   	    (conf->raid_disks - mddev->degraded) == 1) {
> @@ -1761,7 +1759,6 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>   
>   		if (!mddev->fail_last_dev) {
>   			conf->recovery_disabled = mddev->recovery_disabled;
> -			spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
>   			return;
>   		}
>   	}
> @@ -1769,7 +1766,6 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>   	if (test_and_clear_bit(In_sync, &rdev->flags))
>   		mddev->degraded++;
>   	set_bit(Faulty, &rdev->flags);
> -	spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
>   	/*
>   	 * if recovery is running, make sure it aborts.
>   	 */
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 57c887070df3..25c0ab09807b 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -1993,19 +1993,15 @@ static int enough(struct r10conf *conf, int ignore)
>    * &mddev->fail_last_dev is off.
>    */
>   static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
> +	__must_hold(&mddev->device_lock)
>   {
>   	struct r10conf *conf = mddev->private;
> -	unsigned long flags;
> -
> -	spin_lock_irqsave(&conf->mddev->device_lock, flags);
>   
>   	if (test_bit(In_sync, &rdev->flags) && !enough(conf, rdev->raid_disk)) {
>   		set_bit(MD_BROKEN, &mddev->flags);
>   
> -		if (!mddev->fail_last_dev) {
> -			spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
> +		if (!mddev->fail_last_dev)
>   			return;
> -		}
>   	}
>   	if (test_and_clear_bit(In_sync, &rdev->flags))
>   		mddev->degraded++;
> @@ -2015,7 +2011,6 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
>   	set_bit(Faulty, &rdev->flags);
>   	set_mask_bits(&mddev->sb_flags, 0,
>   		      BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
> -	spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
>   	pr_crit("md/raid10:%s: Disk failure on %pg, disabling device.\n"
>   		"md/raid10:%s: Operation continuing on %d devices.\n",
>   		mdname(mddev), rdev->bdev,
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 3350dcf9cab6..d1372b1bc405 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -2905,15 +2905,14 @@ static void raid5_end_write_request(struct bio *bi)
>   }
>   
>   static void raid5_error(struct mddev *mddev, struct md_rdev *rdev)
> +	__must_hold(&mddev->device_lock)
>   {
>   	struct r5conf *conf = mddev->private;
> -	unsigned long flags;
>   	pr_debug("raid456: error called\n");
>   
>   	pr_crit("md/raid:%s: Disk failure on %pg, disabling device.\n",
>   		mdname(mddev), rdev->bdev);
>   
> -	spin_lock_irqsave(&conf->mddev->device_lock, flags);
>   	set_bit(Faulty, &rdev->flags);
>   	clear_bit(In_sync, &rdev->flags);
>   	mddev->degraded = raid5_calc_degraded(conf);
> @@ -2929,7 +2928,6 @@ static void raid5_error(struct mddev *mddev, struct md_rdev *rdev)
>   			mdname(mddev), conf->raid_disks - mddev->degraded);
>   	}
>   
> -	spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
>   	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>   
>   	set_bit(Blocked, &rdev->flags);


* Re: [PATCH v5 03/16] md: add pers->should_error() callback
  2025-10-27 15:04 ` [PATCH v5 03/16] md: add pers->should_error() callback Kenta Akagi
@ 2025-10-30  2:24   ` Yu Kuai
  0 siblings, 0 replies; 25+ messages in thread
From: Yu Kuai @ 2025-10-30  2:24 UTC (permalink / raw)
  To: Kenta Akagi, Song Liu, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, yukuai

Hi,

On 2025/10/27 23:04, Kenta Akagi wrote:
> The failfast feature in RAID1 and RAID10 assumes that when md_error() is
> called, the array remains functional because the last rdev neither fails
> nor sets MD_BROKEN.
>
> However, the current implementation can cause the array to lose
> its last in-sync device or be marked as MD_BROKEN, which breaks the
> assumption and can lead to array failure.
>
> To address this issue, a new handler md_cond_error() will be introduced
> to ensure that failfast I/O does not mark the array as broken.
>
> As preparation, this commit adds a helper pers->should_error() to determine
> from outside the personality whether an rdev can fail safely, which is
> needed by md_cond_error().
>
> Signed-off-by: Kenta Akagi <k@mgml.me>
> ---
>   drivers/md/md.h | 1 +
>   1 file changed, 1 insertion(+)
>
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index c982598cbf97..01c8182431d1 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -763,6 +763,7 @@ struct md_personality
>   	 * if appropriate, and should abort recovery if needed
>   	 */
>   	void (*error_handler)(struct mddev *mddev, struct md_rdev *rdev);
> +	bool (*should_error)(struct mddev *mddev, struct md_rdev *rdev, struct bio *bio);

I think the name is not quite accurate, perhaps error_handler_check()?

Thanks,
Kuai

>   	int (*hot_add_disk) (struct mddev *mddev, struct md_rdev *rdev);
>   	int (*hot_remove_disk) (struct mddev *mddev, struct md_rdev *rdev);
>   	int (*spare_active) (struct mddev *mddev);


* Re: [PATCH v5 05/16] md/raid1: implement pers->should_error()
  2025-10-27 15:04 ` [PATCH v5 05/16] md/raid1: implement pers->should_error() Kenta Akagi
@ 2025-10-30  2:31   ` Yu Kuai
  0 siblings, 0 replies; 25+ messages in thread
From: Yu Kuai @ 2025-10-30  2:31 UTC (permalink / raw)
  To: Kenta Akagi, Song Liu, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, yukuai

Hi,

On 2025/10/27 23:04, Kenta Akagi wrote:
> The failfast feature in RAID1 and RAID10 assumes that when md_error() is
> called, the array remains functional because the last rdev neither fails
> nor sets MD_BROKEN.
>
> However, the current implementation can cause the array to lose
> its last in-sync device or be marked as MD_BROKEN, which breaks the
> assumption and can lead to array failure.
>
> To address this issue, introduce a new handler, md_cond_error(), to
> ensure that failfast I/O does not mark the array as broken.
>
> md_cond_error() checks whether a device should be faulted based on
> pers->should_error().  This commit implements should_error() callback
> for raid1 personality, which returns true if faulting the specified rdev
> would cause the mddev to become non-functional.
>
> Signed-off-by: Kenta Akagi <k@mgml.me>
> ---
>   drivers/md/raid1.c | 35 +++++++++++++++++++++++++++++++++++
>   1 file changed, 35 insertions(+)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 202e510f73a4..69b7730f3875 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1732,6 +1732,40 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>   	seq_printf(seq, "]");
>   }
>   
> +/**
> + * raid1_should_error() - Determine if this rdev should be failed
> + * @mddev: affected md device
> + * @rdev: member device to check
> + * @bio: the bio that caused the failure
> + *
> + * When a failfast bio fails, the rdev may be failed, but the mddev must not
> + * fail. This function tells md_cond_error() not to fail the rdev if the bio
> + * is failfast and the rdev is the last in-sync device.
> + *
> + * Returns: %false if bio is failfast and rdev is the last in-sync device.
> + *	     Otherwise %true - should fail this rdev.
> + */
> +static bool raid1_should_error(struct mddev *mddev, struct md_rdev *rdev, struct bio *bio)
> +{
> +	int i;
> +	struct r1conf *conf = mddev->private;
> +
> +	if (!(bio->bi_opf & MD_FAILFAST) ||
> +	    !test_bit(FailFast, &rdev->flags) ||
> +	    test_bit(Faulty, &rdev->flags))
> +		return true;

The above checking is the same for raid1 and raid10, so I think it should go to
the caller, and you don't need to pass in bio.

> +
> +	for (i = 0; i < conf->raid_disks; i++) {
> +		struct md_rdev *rdev2 = conf->mirrors[i].rdev;
> +
> +		if (rdev2 && rdev2 != rdev &&
> +		    test_bit(In_sync, &rdev2->flags) &&
> +		    !test_bit(Faulty, &rdev2->flags))
> +			return true;
> +	}

Why not use the same checking from raid1_error()? You can factor out a
helper for this. And BTW, now I feel it's better to name the new method
like rdev_last_in_sync().

Thanks,
Kuai

> +	return false;
> +}
> +
>   /**
>    * raid1_error() - RAID1 error handler.
>    * @mddev: affected md device.
> @@ -3486,6 +3520,7 @@ static struct md_personality raid1_personality =
>   	.free		= raid1_free,
>   	.status		= raid1_status,
>   	.error_handler	= raid1_error,
> +	.should_error	= raid1_should_error,
>   	.hot_add_disk	= raid1_add_disk,
>   	.hot_remove_disk= raid1_remove_disk,
>   	.spare_active	= raid1_spare_active,


* Re: [PATCH v5 07/16] md/raid1: refactor handle_read_error()
  2025-10-27 15:04 ` [PATCH v5 07/16] md/raid1: refactor handle_read_error() Kenta Akagi
@ 2025-10-30  2:38   ` Yu Kuai
  0 siblings, 0 replies; 25+ messages in thread
From: Yu Kuai @ 2025-10-30  2:38 UTC (permalink / raw)
  To: Kenta Akagi, Song Liu, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, yukuai

On 2025/10/27 23:04, Kenta Akagi wrote:

> For the failfast bio feature, the behavior of handle_read_error() will
> be changed in a subsequent commit, but refactor it first.
>
> This commit only refactors the code without functional changes. A
> subsequent commit will replace md_error() with md_cond_error()
> to implement proper failfast error handling.
>
> Signed-off-by: Kenta Akagi<k@mgml.me>
> ---
>   drivers/md/raid1.c | 14 ++++++--------
>   1 file changed, 6 insertions(+), 8 deletions(-)

Reviewed-by: Yu Kuai <yukuai@fnnas.com>


* Re: [PATCH v5 08/16] md/raid10: refactor handle_read_error()
  2025-10-27 15:04 ` [PATCH v5 08/16] md/raid10: " Kenta Akagi
@ 2025-10-30  2:39   ` Yu Kuai
  0 siblings, 0 replies; 25+ messages in thread
From: Yu Kuai @ 2025-10-30  2:39 UTC (permalink / raw)
  To: Kenta Akagi, Song Liu, Shaohua Li, Mariusz Tkaczyk, Guoqing Jiang
  Cc: linux-raid, linux-kernel, yukuai

On 2025/10/27 23:04, Kenta Akagi wrote:

> For the failfast bio feature, the behavior of handle_read_error() will
> be changed in a subsequent commit, but refactor it first.
>
> This commit only refactors the code without functional changes. A
> subsequent commit will replace md_error() with md_cond_error()
> to implement proper failfast error handling.
>
> Signed-off-by: Kenta Akagi<k@mgml.me>
> ---
>   drivers/md/raid10.c | 9 +++++----
>   1 file changed, 5 insertions(+), 4 deletions(-)

Reviewed-by: Yu Kuai <yukuai@fnnas.com>


* Re: [PATCH v5 09/16] md/raid10: fix failfast read error not rescheduled
  2025-10-27 15:04 ` [PATCH v5 09/16] md/raid10: fix failfast read error not rescheduled Kenta Akagi
@ 2025-10-30  2:41   ` Yu Kuai
  0 siblings, 0 replies; 25+ messages in thread
From: Yu Kuai @ 2025-10-30  2:41 UTC (permalink / raw)
  To: Kenta Akagi, Song Liu, Shaohua Li, Mariusz Tkaczyk
  Cc: linux-raid, linux-kernel, Li Nan, yukuai

On 2025/10/27 23:04, Kenta Akagi wrote:

> raid10_end_read_request lacks a path to retry when a failfast I/O fails.
> As a result, when failfast read I/Os fail on all rdevs, the upper layer
> receives EIO without the read being rescheduled.
>
> Looking at the two commits below, it seems only raid10_end_read_request
> lacks the failfast read retry handling, while raid1_end_read_request has
> it. In RAID1, the retry works as expected.
> * commit 8d3ca83dcf9c ("md/raid10: add failfast handling for reads.")
> * commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
>
> This commit makes the failfast read bio for the last rdev in raid10 be
> retried if it fails.
>
> Fixes: 8d3ca83dcf9c ("md/raid10: add failfast handling for reads.")
> Signed-off-by: Kenta Akagi<k@mgml.me>
> Reviewed-by: Li Nan<linan122@huawei.com>
> ---
>   drivers/md/raid10.c | 7 +++++++
>   1 file changed, 7 insertions(+)

Reviewed-by: Yu Kuai <yukuai@fnnas.com>


* Re: [PATCH v5 02/16] md: serialize md_error()
  2025-10-30  2:11   ` Yu Kuai
@ 2025-11-06 17:06     ` Kenta Akagi
  0 siblings, 0 replies; 25+ messages in thread
From: Kenta Akagi @ 2025-11-06 17:06 UTC (permalink / raw)
  To: yukuai, song, shli, mtkaczyk, jgq516; +Cc: linux-raid, linux-kernel

Hi, thank you for reviewing!

On 2025/10/30 11:11, Yu Kuai wrote:
> Hi,
> 
> On 2025/10/27 23:04, Kenta Akagi wrote:
>> Serialize the md_error() function in preparation for the introduction of
>> a conditional md_error() in a subsequent commit. The conditional
>> md_error() is intended to prevent unintentional setting of MD_BROKEN
>> during RAID1/10 failfast handling.
>>
>> To enhance the failfast bio error handler, it must verify that the
>> affected rdev is not the last working device before marking it as
>> faulty. Without serialization, a race condition can occur if multiple
>> failfast bios attempt to call the error handler concurrently:
>>
>>          failfast bio1			failfast bio2
>>          ---                             ---
>>          md_cond_error(md,rdev1,bio)	md_cond_error(md,rdev2,bio)
>>            if(!is_degraded(md))		  if(!is_degraded(md))
>>               raid1_error(md,rdev1)          raid1_error(md,rdev2)
>>                 spin_lock(md)
>>                 set_faulty(rdev1)
>> 	       spin_unlock(md)
>> 						spin_lock(md)
>> 						set_faulty(rdev2)
>> 						set_broken(md)
>> 						spin_unlock(md)
>>
>> This can unintentionally cause the array to stop in situations where the
>> 'Last' rdev should not be marked as Faulty.
>>
>> This commit serializes the md_error() function for all RAID
>> personalities to avoid this race condition. Future commits will
>> introduce a conditional md_error() specifically for failfast bio
>> handling.
>>
>> Serialization is applied to both the standard and conditional md_error()
>> for the following reasons:
>>
>> - Both use the same error-handling mechanism, so it's clearer to
>>    serialize them consistently.
>> - The md_error() path is cold, meaning serialization has no performance
>>    impact.
>>
>> Signed-off-by: Kenta Akagi <k@mgml.me>
>> ---
>>   drivers/md/md-linear.c |  1 +
>>   drivers/md/md.c        | 10 +++++++++-
>>   drivers/md/md.h        |  1 +
>>   drivers/md/raid0.c     |  1 +
>>   drivers/md/raid1.c     |  6 +-----
>>   drivers/md/raid10.c    |  9 ++-------
>>   drivers/md/raid5.c     |  4 +---
>>   7 files changed, 16 insertions(+), 16 deletions(-)
>>
>> diff --git a/drivers/md/md-linear.c b/drivers/md/md-linear.c
>> index 7033d982d377..0f6893e4b9f5 100644
>> --- a/drivers/md/md-linear.c
>> +++ b/drivers/md/md-linear.c
>> @@ -298,6 +298,7 @@ static void linear_status(struct seq_file *seq, struct mddev *mddev)
>>   }
>>   
>>   static void linear_error(struct mddev *mddev, struct md_rdev *rdev)
>> +	__must_hold(&mddev->device_lock)
> 
> This is fine, however, I personally prefer lockdep_assert_held(). :)

Certainly seems like lockdep is better. I'll fix it.

By the way, I realized this patch can cause a deadlock in the following case:
1. md_error() acquires device_lock with plain spin_lock(), so the holder can
  still be preempted or interrupted.
2. Something - I can't find the specific function or scenario - then calls
  spin_lock_irqsave(&mddev->device_lock, flags). Since spin_lock_irqsave()
  calls preempt_disable() first and then spins on the lock, on a non-SMP
  system device_lock will never be released, because the md_error() thread
  is never scheduled again.

It seems like one of the following changes is necessary; what do you think?
- Add a dedicated spinlock_t member to serialize md_error()
- Use spin_lock_irqsave() instead of spin_lock() in md_error()
It's not cool, but I think it would be better to add a new member.

Thanks,
Akagi

> 
> Otherwise LGTM.
> 
> Thanks,
> Kuai
> 
>>   {
>>   	if (!test_and_set_bit(MD_BROKEN, &mddev->flags)) {
>>   		char *md_name = mdname(mddev);
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index d667580e3125..4ad9cb0ac98c 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -8444,7 +8444,8 @@ void md_unregister_thread(struct mddev *mddev, struct md_thread __rcu **threadp)
>>   }
>>   EXPORT_SYMBOL(md_unregister_thread);
>>   
>> -void md_error(struct mddev *mddev, struct md_rdev *rdev)
>> +void _md_error(struct mddev *mddev, struct md_rdev *rdev)
>> +	__must_hold(&mddev->device_lock)
>>   {
>>   	if (!rdev || test_bit(Faulty, &rdev->flags))
>>   		return;
>> @@ -8469,6 +8470,13 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
>>   		queue_work(md_misc_wq, &mddev->event_work);
>>   	md_new_event();
>>   }
>> +
>> +void md_error(struct mddev *mddev, struct md_rdev *rdev)
>> +{
>> +	spin_lock(&mddev->device_lock);
>> +	_md_error(mddev, rdev);
>> +	spin_unlock(&mddev->device_lock);
>> +}
>>   EXPORT_SYMBOL(md_error);
>>   
>>   /* seq_file implementation /proc/mdstat */
>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>> index 64ac22edf372..c982598cbf97 100644
>> --- a/drivers/md/md.h
>> +++ b/drivers/md/md.h
>> @@ -913,6 +913,7 @@ extern void md_write_start(struct mddev *mddev, struct bio *bi);
>>   extern void md_write_inc(struct mddev *mddev, struct bio *bi);
>>   extern void md_write_end(struct mddev *mddev);
>>   extern void md_done_sync(struct mddev *mddev, int blocks, int ok);
>> +void _md_error(struct mddev *mddev, struct md_rdev *rdev);
>>   extern void md_error(struct mddev *mddev, struct md_rdev *rdev);
>>   extern void md_finish_reshape(struct mddev *mddev);
>>   void md_submit_discard_bio(struct mddev *mddev, struct md_rdev *rdev,
>> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
>> index e443e478645a..8cf3caf9defd 100644
>> --- a/drivers/md/raid0.c
>> +++ b/drivers/md/raid0.c
>> @@ -625,6 +625,7 @@ static void raid0_status(struct seq_file *seq, struct mddev *mddev)
>>   }
>>   
>>   static void raid0_error(struct mddev *mddev, struct md_rdev *rdev)
>> +	__must_hold(&mddev->device_lock)
>>   {
>>   	if (!test_and_set_bit(MD_BROKEN, &mddev->flags)) {
>>   		char *md_name = mdname(mddev);
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index 7924d5ee189d..202e510f73a4 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -1749,11 +1749,9 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>>    * &mddev->fail_last_dev is off.
>>    */
>>   static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>> +	__must_hold(&mddev->device_lock)
>>   {
>>   	struct r1conf *conf = mddev->private;
>> -	unsigned long flags;
>> -
>> -	spin_lock_irqsave(&conf->mddev->device_lock, flags);
>>   
>>   	if (test_bit(In_sync, &rdev->flags) &&
>>   	    (conf->raid_disks - mddev->degraded) == 1) {
>> @@ -1761,7 +1759,6 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>   
>>   		if (!mddev->fail_last_dev) {
>>   			conf->recovery_disabled = mddev->recovery_disabled;
>> -			spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
>>   			return;
>>   		}
>>   	}
>> @@ -1769,7 +1766,6 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>   	if (test_and_clear_bit(In_sync, &rdev->flags))
>>   		mddev->degraded++;
>>   	set_bit(Faulty, &rdev->flags);
>> -	spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
>>   	/*
>>   	 * if recovery is running, make sure it aborts.
>>   	 */
>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>> index 57c887070df3..25c0ab09807b 100644
>> --- a/drivers/md/raid10.c
>> +++ b/drivers/md/raid10.c
>> @@ -1993,19 +1993,15 @@ static int enough(struct r10conf *conf, int ignore)
>>    * &mddev->fail_last_dev is off.
>>    */
>>   static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
>> +	__must_hold(&mddev->device_lock)
>>   {
>>   	struct r10conf *conf = mddev->private;
>> -	unsigned long flags;
>> -
>> -	spin_lock_irqsave(&conf->mddev->device_lock, flags);
>>   
>>   	if (test_bit(In_sync, &rdev->flags) && !enough(conf, rdev->raid_disk)) {
>>   		set_bit(MD_BROKEN, &mddev->flags);
>>   
>> -		if (!mddev->fail_last_dev) {
>> -			spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
>> +		if (!mddev->fail_last_dev)
>>   			return;
>> -		}
>>   	}
>>   	if (test_and_clear_bit(In_sync, &rdev->flags))
>>   		mddev->degraded++;
>> @@ -2015,7 +2011,6 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
>>   	set_bit(Faulty, &rdev->flags);
>>   	set_mask_bits(&mddev->sb_flags, 0,
>>   		      BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
>> -	spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
>>   	pr_crit("md/raid10:%s: Disk failure on %pg, disabling device.\n"
>>   		"md/raid10:%s: Operation continuing on %d devices.\n",
>>   		mdname(mddev), rdev->bdev,
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index 3350dcf9cab6..d1372b1bc405 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -2905,15 +2905,14 @@ static void raid5_end_write_request(struct bio *bi)
>>   }
>>   
>>   static void raid5_error(struct mddev *mddev, struct md_rdev *rdev)
>> +	__must_hold(&mddev->device_lock)
>>   {
>>   	struct r5conf *conf = mddev->private;
>> -	unsigned long flags;
>>   	pr_debug("raid456: error called\n");
>>   
>>   	pr_crit("md/raid:%s: Disk failure on %pg, disabling device.\n",
>>   		mdname(mddev), rdev->bdev);
>>   
>> -	spin_lock_irqsave(&conf->mddev->device_lock, flags);
>>   	set_bit(Faulty, &rdev->flags);
>>   	clear_bit(In_sync, &rdev->flags);
>>   	mddev->degraded = raid5_calc_degraded(conf);
>> @@ -2929,7 +2928,6 @@ static void raid5_error(struct mddev *mddev, struct md_rdev *rdev)
>>   			mdname(mddev), conf->raid_disks - mddev->degraded);
>>   	}
>>   
>> -	spin_unlock_irqrestore(&conf->mddev->device_lock, flags);
>>   	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>>   
>>   	set_bit(Blocked, &rdev->flags);
> 



end of thread, other threads:[~2025-11-06 17:06 UTC | newest]

Thread overview: 25+ messages
2025-10-27 15:04 [PATCH v5 00/16] Don't set MD_BROKEN on failfast bio failure Kenta Akagi
2025-10-27 15:04 ` [PATCH v5 01/16] md: move device_lock from conf to mddev Kenta Akagi
2025-10-30  1:59   ` Yu Kuai
2025-10-27 15:04 ` [PATCH v5 02/16] md: serialize md_error() Kenta Akagi
2025-10-30  2:11   ` Yu Kuai
2025-11-06 17:06     ` Kenta Akagi
2025-10-27 15:04 ` [PATCH v5 03/16] md: add pers->should_error() callback Kenta Akagi
2025-10-30  2:24   ` Yu Kuai
2025-10-27 15:04 ` [PATCH v5 04/16] md: introduce md_cond_error() Kenta Akagi
2025-10-27 15:04 ` [PATCH v5 05/16] md/raid1: implement pers->should_error() Kenta Akagi
2025-10-30  2:31   ` Yu Kuai
2025-10-27 15:04 ` [PATCH v5 06/16] md/raid10: " Kenta Akagi
2025-10-27 15:04 ` [PATCH v5 07/16] md/raid1: refactor handle_read_error() Kenta Akagi
2025-10-30  2:38   ` Yu Kuai
2025-10-27 15:04 ` [PATCH v5 08/16] md/raid10: " Kenta Akagi
2025-10-30  2:39   ` Yu Kuai
2025-10-27 15:04 ` [PATCH v5 09/16] md/raid10: fix failfast read error not rescheduled Kenta Akagi
2025-10-30  2:41   ` Yu Kuai
2025-10-27 15:04 ` [PATCH v5 10/16] md: prevent set MD_BROKEN on super_write failure with failfast Kenta Akagi
2025-10-27 15:04 ` [PATCH v5 11/16] md/raid1: Prevent set MD_BROKEN on failfast bio failure Kenta Akagi
2025-10-27 15:04 ` [PATCH v5 12/16] md/raid10: " Kenta Akagi
2025-10-27 15:04 ` [PATCH v5 13/16] md/raid1: add error message when setting MD_BROKEN Kenta Akagi
2025-10-27 15:04 ` [PATCH v5 14/16] md/raid10: Add " Kenta Akagi
2025-10-27 15:04 ` [PATCH v5 15/16] md: rename 'LastDev' rdev flag to 'RetryingSBWrite' Kenta Akagi
2025-10-27 15:04 ` [PATCH v5 16/16] md: Improve super_written() error logging Kenta Akagi
