[PATCH 0/4] md: fix is_mddev

public inbox for linux-block@vger.kernel.org

* [PATCH 0/4] md: fix is_mddev_idle()
@ 2025-04-12  7:31 Yu Kuai
  2025-04-12  7:31 ` [PATCH 1/4] block: export part_in_flight() Yu Kuai
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Yu Kuai @ 2025-04-12  7:31 UTC (permalink / raw)
  To: axboe, song, yukuai3, xni
  Cc: linux-block, linux-kernel, linux-raid, yukuai1, yi.zhang,
	yangerkun

From: Yu Kuai <yukuai3@huawei.com>

If sync_speed is above speed_min, is_mddev_idle() is called for each
sync IO to check whether the array is idle, and inflight sync IO is
limited if the array is not idle.

However, when running mkfs.ext4 on a large raid5 array while recovery is
in progress, sync_speed was found to be already above speed_min while
lots of stripes were still used for sync IO, causing long delays for
mkfs.ext4.

The root cause is the following check in is_mddev_idle():

t1: submit sync IO: events1 = completed IO - issued sync IO
t2: submit next sync IO: events2  = completed IO - issued sync IO
if (events2 - events1 > 64)

As a consequence, the more sync IO is issued, the less likely the check
is to pass. When more normal IO has completed than sync IO has been
issued, the condition finally passes and is_mddev_idle() returns false.
However, last_events is updated at that point, so is_mddev_idle() can
only return false once in a while.
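
The masking effect can be reproduced with a small userspace model. This
is only a sketch: the struct and function names are hypothetical, and the
counters stand in for the per-disk statistics that the old check reads.

```c
#include <stdbool.h>

/* Hypothetical userspace model of one member disk's counters (sectors). */
struct old_idle_model {
	long completed;    /* all completed IO, normal and sync */
	long issued_sync;  /* sync IO, counted when it is submitted */
	long last_events;  /* snapshot taken by the last non-idle result */
};

/*
 * Mirrors the old test: the disk is reported busy only when
 * (completed - issued_sync) grew by more than 64 since the snapshot.
 */
static bool old_check_is_idle(struct old_idle_model *m)
{
	long curr_events = m->completed - m->issued_sync;

	if (curr_events - m->last_events > 64) {
		m->last_events = curr_events;	/* snapshot moves forward */
		return false;
	}
	return true;
}
```

In this model, submitting 1024 sectors of sync IO and then completing
512 sectors of normal IO still yields "idle", because the issued sync IO
pushes curr_events below the snapshot. The check reports "busy" once
when the sync IO completes, then falls back to "idle" as soon as more
sync IO is issued.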

Fix this problem by changing the check so that the array is considered
idle only if all of the following hold since the last check:

1) mddev doesn't have normal IO completed;
2) mddev doesn't have normal IO inflight;
3) if any member disk is a partition, the other partitions on that disk
   don't have IO completed.
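
The three conditions above can be sketched in userspace as follows. The
struct and field names are hypothetical and only model the counters; the
real check reads them from the block layer statistics.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the counters the new check looks at. */
struct new_idle_model {
	unsigned long array_sectors;	/* normal IO completed on the array */
	unsigned int array_inflight;	/* normal IO in flight on the array */
	unsigned long other_sectors;	/* IO completed on other partitions of
					 * a member's disk; 0 for whole disks */
	unsigned long last_array;	/* array_sectors at the previous check */
	unsigned long last_other;	/* other_sectors at the previous check */
};

static bool new_check_is_idle(struct new_idle_model *m)
{
	bool idle = true;

	/* Conditions 1 and 2: normal IO completed or still in flight. */
	if (m->array_sectors > m->last_array || m->array_inflight)
		idle = false;
	/* Condition 3: IO completed on sibling partitions. */
	if (m->other_sectors > m->last_other)
		idle = false;

	/* Snapshots are refreshed on every call, idle or not. */
	m->last_array = m->array_sectors;
	m->last_other = m->other_sectors;
	return idle;
}
```

Issued sync IO does not enter this model at all, so it can no longer
hide normal IO from the check.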

Yu Kuai (4):
  block: export part_in_flight()
  md: add a new api sync_io_depth
  md: fix is_mddev_idle()
  md: cleanup accounting for issued sync IO

 block/blk.h               |   1 -
 block/genhd.c             |   1 +
 drivers/md/md.c           | 181 ++++++++++++++++++++++++++------------
 drivers/md/md.h           |  15 +---
 drivers/md/raid1.c        |   3 -
 drivers/md/raid10.c       |   9 --
 drivers/md/raid5.c        |   8 --
 include/linux/blkdev.h    |   1 -
 include/linux/part_stat.h |   1 +
 9 files changed, 130 insertions(+), 90 deletions(-)

-- 
2.39.2


* [PATCH 1/4] block: export part_in_flight()
  2025-04-12  7:31 [PATCH 0/4] md: fix is_mddev_idle() Yu Kuai
@ 2025-04-12  7:31 ` Yu Kuai
  2025-04-14  6:32   ` Christoph Hellwig
  2025-04-12  7:32 ` [PATCH 2/4] md: add a new api sync_io_depth Yu Kuai
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 16+ messages in thread
From: Yu Kuai @ 2025-04-12  7:31 UTC (permalink / raw)
  To: axboe, song, yukuai3, xni
  Cc: linux-block, linux-kernel, linux-raid, yukuai1, yi.zhang,
	yangerkun

From: Yu Kuai <yukuai3@huawei.com>

This helper will be used by mdraid in later patches to check whether
normal IO is inflight while generating background sync IO, in order to
fix a problem in mdraid where foreground IO can be starved by background
sync IO.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk.h               | 1 -
 block/genhd.c             | 1 +
 include/linux/part_stat.h | 1 +
 3 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk.h b/block/blk.h
index 006e3be433d2..f476f233f195 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -418,7 +418,6 @@ void blk_apply_bdi_limits(struct backing_dev_info *bdi,
 int blk_dev_init(void);
 
 void update_io_ticks(struct block_device *part, unsigned long now, bool end);
-unsigned int part_in_flight(struct block_device *part);
 
 static inline void req_set_nomerge(struct request_queue *q, struct request *req)
 {
diff --git a/block/genhd.c b/block/genhd.c
index c2bd86cd09de..5b408d9b5a9d 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -139,6 +139,7 @@ unsigned int part_in_flight(struct block_device *part)
 
 	return inflight;
 }
+EXPORT_SYMBOL_GPL(part_in_flight);
 
 static void part_in_flight_rw(struct block_device *part,
 		unsigned int inflight[2])
diff --git a/include/linux/part_stat.h b/include/linux/part_stat.h
index c5e9cac0575e..79ed730a8d50 100644
--- a/include/linux/part_stat.h
+++ b/include/linux/part_stat.h
@@ -79,4 +79,5 @@ static inline void part_stat_set_all(struct block_device *part, int value)
 #define part_stat_local_read_cpu(part, field, cpu)			\
 	local_read(&(part_stat_get_cpu(part, field, cpu)))
 
+unsigned int part_in_flight(struct block_device *part);
 #endif /* _LINUX_PART_STAT_H */
-- 
2.39.2



* [PATCH 2/4] md: add a new api sync_io_depth
  2025-04-12  7:31 [PATCH 0/4] md: fix is_mddev_idle() Yu Kuai
  2025-04-12  7:31 ` [PATCH 1/4] block: export part_in_flight() Yu Kuai
@ 2025-04-12  7:32 ` Yu Kuai
  2025-04-16  5:32   ` Xiao Ni
  2025-04-12  7:32 ` [PATCH 3/4] md: fix is_mddev_idle() Yu Kuai
  2025-04-12  7:32 ` [PATCH 4/4] md: cleanup accounting for issued sync IO Yu Kuai
  3 siblings, 1 reply; 16+ messages in thread
From: Yu Kuai @ 2025-04-12  7:32 UTC (permalink / raw)
  To: axboe, song, yukuai3, xni
  Cc: linux-block, linux-kernel, linux-raid, yukuai1, yi.zhang,
	yangerkun

From: Yu Kuai <yukuai3@huawei.com>

Currently, if sync speed is above speed_min and below speed_max,
md_do_sync() will wait for all sync IOs to be done before issuing new
sync IO, which means sync IO depth is limited to just 1.

This limit is too low. To prevent sync speed from dropping conspicuously
after fixing is_mddev_idle() in the next patch, add a new api for
limiting sync IO depth; the default value is 32.
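
As a rough illustration of what the default depth means in sectors,
here is a userspace sketch of the arithmetic. The helper names are
hypothetical; the per-IO sizes (8 sectors for raid4/5/6, 128 sectors for
other levels) are the ones used by sync_io_within_limit() in the diff of
this patch.

```c
#include <stdbool.h>

#define DEFAULT_SYNC_IO_DEPTH 32	/* system-wide default from this patch */

/* A per-array value of 0 means "fall back to the system-wide setting". */
static int effective_depth(int array_depth, int system_depth)
{
	return array_depth ? array_depth : system_depth;
}

/* Sectors per sync IO: 8 (4k stripe) for raid4/5/6, 128 (64k) otherwise. */
static int sync_io_sectors(int level)
{
	return (level == 4 || level == 5 || level == 6) ? 8 : 128;
}

/* True while the active sync sectors stay below depth * sectors-per-IO. */
static bool within_limit(int active_sectors, int level, int depth)
{
	return active_sectors < sync_io_sectors(level) * depth;
}
```

With the default depth of 32, the limit works out to 256 sectors of
active sync IO for raid4/5/6 and 4096 sectors for other levels.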

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/md.c | 103 +++++++++++++++++++++++++++++++++++++++---------
 drivers/md/md.h |   1 +
 2 files changed, 85 insertions(+), 19 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 438e71e45c16..8966c4afc62a 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -111,32 +111,42 @@ static void md_wakeup_thread_directly(struct md_thread __rcu *thread);
 /* Default safemode delay: 200 msec */
 #define DEFAULT_SAFEMODE_DELAY ((200 * HZ)/1000 +1)
 /*
- * Current RAID-1,4,5 parallel reconstruction 'guaranteed speed limit'
- * is 1000 KB/sec, so the extra system load does not show up that much.
- * Increase it if you want to have more _guaranteed_ speed. Note that
- * the RAID driver will use the maximum available bandwidth if the IO
- * subsystem is idle. There is also an 'absolute maximum' reconstruction
- * speed limit - in case reconstruction slows down your system despite
- * idle IO detection.
+ * Background sync IO speed control:
  *
- * you can change it via /proc/sys/dev/raid/speed_limit_min and _max.
- * or /sys/block/mdX/md/sync_speed_{min,max}
+ * - below speed min:
+ *   no limit;
+ * - above speed min and below speed max:
+ *   a) if mddev is idle, then no limit;
+ *   b) if mddev is busy handling normal IO, then limit inflight sync IO
+ *   to sync_io_depth;
+ * - above speed max:
+ *   sync IO can't be issued;
+ *
+ * Following configurations can be changed via /proc/sys/dev/raid/ for system
+ * or /sys/block/mdX/md/ for one array.
  */
-
 static int sysctl_speed_limit_min = 1000;
 static int sysctl_speed_limit_max = 200000;
-static inline int speed_min(struct mddev *mddev)
+static int sysctl_sync_io_depth = 32;
+
+static int speed_min(struct mddev *mddev)
 {
 	return mddev->sync_speed_min ?
 		mddev->sync_speed_min : sysctl_speed_limit_min;
 }
 
-static inline int speed_max(struct mddev *mddev)
+static int speed_max(struct mddev *mddev)
 {
 	return mddev->sync_speed_max ?
 		mddev->sync_speed_max : sysctl_speed_limit_max;
 }
 
+static int sync_io_depth(struct mddev *mddev)
+{
+	return mddev->sync_io_depth ?
+		mddev->sync_io_depth : sysctl_sync_io_depth;
+}
+
 static void rdev_uninit_serial(struct md_rdev *rdev)
 {
 	if (!test_and_clear_bit(CollisionCheck, &rdev->flags))
@@ -293,14 +303,21 @@ static const struct ctl_table raid_table[] = {
 		.procname	= "speed_limit_min",
 		.data		= &sysctl_speed_limit_min,
 		.maxlen		= sizeof(int),
-		.mode		= S_IRUGO|S_IWUSR,
+		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
 	{
 		.procname	= "speed_limit_max",
 		.data		= &sysctl_speed_limit_max,
 		.maxlen		= sizeof(int),
-		.mode		= S_IRUGO|S_IWUSR,
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
+		.procname	= "sync_io_depth",
+		.data		= &sysctl_sync_io_depth,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
 };
@@ -5091,7 +5108,7 @@ static ssize_t
 sync_min_show(struct mddev *mddev, char *page)
 {
 	return sprintf(page, "%d (%s)\n", speed_min(mddev),
-		       mddev->sync_speed_min ? "local": "system");
+		       mddev->sync_speed_min ? "local" : "system");
 }
 
 static ssize_t
@@ -5100,7 +5117,7 @@ sync_min_store(struct mddev *mddev, const char *buf, size_t len)
 	unsigned int min;
 	int rv;
 
-	if (strncmp(buf, "system", 6)==0) {
+	if (strncmp(buf, "system", 6) == 0) {
 		min = 0;
 	} else {
 		rv = kstrtouint(buf, 10, &min);
@@ -5120,7 +5137,7 @@ static ssize_t
 sync_max_show(struct mddev *mddev, char *page)
 {
 	return sprintf(page, "%d (%s)\n", speed_max(mddev),
-		       mddev->sync_speed_max ? "local": "system");
+		       mddev->sync_speed_max ? "local" : "system");
 }
 
 static ssize_t
@@ -5129,7 +5146,7 @@ sync_max_store(struct mddev *mddev, const char *buf, size_t len)
 	unsigned int max;
 	int rv;
 
-	if (strncmp(buf, "system", 6)==0) {
+	if (strncmp(buf, "system", 6) == 0) {
 		max = 0;
 	} else {
 		rv = kstrtouint(buf, 10, &max);
@@ -5145,6 +5162,35 @@ sync_max_store(struct mddev *mddev, const char *buf, size_t len)
 static struct md_sysfs_entry md_sync_max =
 __ATTR(sync_speed_max, S_IRUGO|S_IWUSR, sync_max_show, sync_max_store);
 
+static ssize_t
+sync_io_depth_show(struct mddev *mddev, char *page)
+{
+	return sprintf(page, "%d (%s)\n", sync_io_depth(mddev),
+		       mddev->sync_io_depth ? "local" : "system");
+}
+
+static ssize_t
+sync_io_depth_store(struct mddev *mddev, const char *buf, size_t len)
+{
+	unsigned int max;
+	int rv;
+
+	if (strncmp(buf, "system", 6) == 0) {
+		max = 0;
+	} else {
+		rv = kstrtouint(buf, 10, &max);
+		if (rv < 0)
+			return rv;
+		if (max == 0)
+			return -EINVAL;
+	}
+	mddev->sync_io_depth = max;
+	return len;
+}
+
+static struct md_sysfs_entry md_sync_io_depth =
+__ATTR_RW(sync_io_depth);
+
 static ssize_t
 degraded_show(struct mddev *mddev, char *page)
 {
@@ -5671,6 +5717,7 @@ static struct attribute *md_redundancy_attrs[] = {
 	&md_mismatches.attr,
 	&md_sync_min.attr,
 	&md_sync_max.attr,
+	&md_sync_io_depth.attr,
 	&md_sync_speed.attr,
 	&md_sync_force_parallel.attr,
 	&md_sync_completed.attr,
@@ -8927,6 +8974,23 @@ static sector_t md_sync_position(struct mddev *mddev, enum sync_action action)
 	}
 }
 
+static bool sync_io_within_limit(struct mddev *mddev)
+{
+	int io_sectors;
+
+	/*
+	 * For raid456, sync IO is stripe(4k) per IO, for other levels, it's
+	 * RESYNC_PAGES(64k) per IO.
+	 */
+	if (mddev->level == 4 || mddev->level == 5 || mddev->level == 6)
+		io_sectors = 8;
+	else
+		io_sectors = 128;
+
+	return atomic_read(&mddev->recovery_active) <
+		io_sectors * sync_io_depth(mddev);
+}
+
 #define SYNC_MARKS	10
 #define	SYNC_MARK_STEP	(3*HZ)
 #define UPDATE_FREQUENCY (5*60*HZ)
@@ -9195,7 +9259,8 @@ void md_do_sync(struct md_thread *thread)
 				msleep(500);
 				goto repeat;
 			}
-			if (!is_mddev_idle(mddev, 0)) {
+			if (!sync_io_within_limit(mddev) &&
+			    !is_mddev_idle(mddev, 0)) {
 				/*
 				 * Give other IO more of a chance.
 				 * The faster the devices, the less we wait.
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 1cf00a04bcdd..63be622467c6 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -483,6 +483,7 @@ struct mddev {
 	/* if zero, use the system-wide default */
 	int				sync_speed_min;
 	int				sync_speed_max;
+	int				sync_io_depth;
 
 	/* resync even though the same disks are shared among md-devices */
 	int				parallel_resync;
-- 
2.39.2



* [PATCH 3/4] md: fix is_mddev_idle()
  2025-04-12  7:31 [PATCH 0/4] md: fix is_mddev_idle() Yu Kuai
  2025-04-12  7:31 ` [PATCH 1/4] block: export part_in_flight() Yu Kuai
  2025-04-12  7:32 ` [PATCH 2/4] md: add a new api sync_io_depth Yu Kuai
@ 2025-04-12  7:32 ` Yu Kuai
  2025-04-16  6:20   ` Xiao Ni
  2025-04-12  7:32 ` [PATCH 4/4] md: cleanup accounting for issued sync IO Yu Kuai
  3 siblings, 1 reply; 16+ messages in thread
From: Yu Kuai @ 2025-04-12  7:32 UTC (permalink / raw)
  To: axboe, song, yukuai3, xni
  Cc: linux-block, linux-kernel, linux-raid, yukuai1, yi.zhang,
	yangerkun

From: Yu Kuai <yukuai3@huawei.com>

If sync_speed is above speed_min, is_mddev_idle() is called for each
sync IO to check whether the array is idle, and inflight sync IO is
limited if the array is not idle.

However, when running mkfs.ext4 on a large raid5 array while recovery is
in progress, sync_speed was found to be already above speed_min while
lots of stripes were still used for sync IO, causing long delays for
mkfs.ext4.

The root cause is the following check in is_mddev_idle():

t1: submit sync IO: events1 = completed IO - issued sync IO
t2: submit next sync IO: events2 = completed IO - issued sync IO
if (events2 - events1 > 64)

As a consequence, the more sync IO is issued, the less likely the check
is to pass. When more normal IO has completed than sync IO has been
issued, the condition finally passes and is_mddev_idle() returns false.
However, last_events is updated at that point, so is_mddev_idle() can
only return false once in a while.

Fix this problem by changing the check so that the array is considered
idle only if all of the following hold since the last check:

1) mddev doesn't have normal IO completed;
2) mddev doesn't have normal IO inflight;
3) if any member disk is a partition, the other partitions on that disk
   don't have IO completed.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/md.c | 78 ++++++++++++++++++++++++++-----------------------
 drivers/md/md.h |  3 +-
 2 files changed, 43 insertions(+), 38 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 8966c4afc62a..19da93f8912c 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8619,50 +8619,54 @@ void md_cluster_stop(struct mddev *mddev)
 	put_cluster_ops(mddev);
 }
 
-static int is_mddev_idle(struct mddev *mddev, int init)
+static bool is_rdev_idle(struct md_rdev *rdev, bool init)
+{
+	unsigned long last_events = rdev->last_events;
+
+	if (!bdev_is_partition(rdev->bdev))
+		return true;
+
+	rdev->last_events = part_stat_read_accum(rdev->bdev->bd_disk->part0,
+						 sectors) -
+			    part_stat_read_accum(rdev->bdev, sectors);
+
+	if (!init && rdev->last_events > last_events)
+		return false;
+
+	return true;
+}
+
+/*
+ * mddev is idle if the following conditions are met since the last check:
+ * 1) mddev doesn't have normal IO completed;
+ * 2) mddev doesn't have inflight normal IO;
+ * 3) if any member disk is a partition, other partitions don't have IO
+ *    completed;
+ *
+ * Note that this check relies on IO accounting being enabled.
+ */
+static bool is_mddev_idle(struct mddev *mddev, int init)
 {
 	struct md_rdev *rdev;
-	int idle;
-	int curr_events;
+	bool idle = true;
 
-	idle = 1;
-	rcu_read_lock();
-	rdev_for_each_rcu(rdev, mddev) {
-		struct gendisk *disk = rdev->bdev->bd_disk;
+	if (!mddev_is_dm(mddev)) {
+		unsigned long last_events = mddev->last_events;
 
-		if (!init && !blk_queue_io_stat(disk->queue))
-			continue;
+		mddev->last_events = part_stat_read_accum(mddev->gendisk->part0,
+							  sectors);
 
-		curr_events = (int)part_stat_read_accum(disk->part0, sectors) -
-			      atomic_read(&disk->sync_io);
-		/* sync IO will cause sync_io to increase before the disk_stats
-		 * as sync_io is counted when a request starts, and
-		 * disk_stats is counted when it completes.
-		 * So resync activity will cause curr_events to be smaller than
-		 * when there was no such activity.
-		 * non-sync IO will cause disk_stat to increase without
-		 * increasing sync_io so curr_events will (eventually)
-		 * be larger than it was before.  Once it becomes
-		 * substantially larger, the test below will cause
-		 * the array to appear non-idle, and resync will slow
-		 * down.
-		 * If there is a lot of outstanding resync activity when
-		 * we set last_event to curr_events, then all that activity
-		 * completing might cause the array to appear non-idle
-		 * and resync will be slowed down even though there might
-		 * not have been non-resync activity.  This will only
-		 * happen once though.  'last_events' will soon reflect
-		 * the state where there is little or no outstanding
-		 * resync requests, and further resync activity will
-		 * always make curr_events less than last_events.
-		 *
-		 */
-		if (init || curr_events - rdev->last_events > 64) {
-			rdev->last_events = curr_events;
-			idle = 0;
-		}
+		if (!init && (mddev->last_events > last_events ||
+			      part_in_flight(mddev->gendisk->part0)))
+			idle = false;
 	}
+
+	rcu_read_lock();
+	rdev_for_each_rcu(rdev, mddev)
+		if (!is_rdev_idle(rdev, init))
+			idle = false;
 	rcu_read_unlock();
+
 	return idle;
 }
 
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 63be622467c6..95cf11c4abc6 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -132,7 +132,7 @@ struct md_rdev {
 
 	sector_t sectors;		/* Device size (in 512bytes sectors) */
 	struct mddev *mddev;		/* RAID array if running */
-	int last_events;		/* IO event timestamp */
+	unsigned long last_events;	/* IO event timestamp */
 
 	/*
 	 * If meta_bdev is non-NULL, it means that a separate device is
@@ -519,6 +519,7 @@ struct mddev {
 							 * adding a spare
 							 */
 
+	unsigned long			last_events;	/* IO event timestamp */
 	atomic_t			recovery_active; /* blocks scheduled, but not written */
 	wait_queue_head_t		recovery_wait;
 	sector_t			recovery_cp;
-- 
2.39.2



* [PATCH 4/4] md: cleanup accounting for issued sync IO
  2025-04-12  7:31 [PATCH 0/4] md: fix is_mddev_idle() Yu Kuai
                   ` (2 preceding siblings ...)
  2025-04-12  7:32 ` [PATCH 3/4] md: fix is_mddev_idle() Yu Kuai
@ 2025-04-12  7:32 ` Yu Kuai
  2025-04-16  6:27   ` Xiao Ni
  3 siblings, 1 reply; 16+ messages in thread
From: Yu Kuai @ 2025-04-12  7:32 UTC (permalink / raw)
  To: axboe, song, yukuai3, xni
  Cc: linux-block, linux-kernel, linux-raid, yukuai1, yi.zhang,
	yangerkun

From: Yu Kuai <yukuai3@huawei.com>

The accounting for issued sync IO (md_sync_acct() and md_sync_acct_bio())
is no longer used and can be removed; also remove the field
'gendisk->sync_io'.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/md.h        | 11 -----------
 drivers/md/raid1.c     |  3 ---
 drivers/md/raid10.c    |  9 ---------
 drivers/md/raid5.c     |  8 --------
 include/linux/blkdev.h |  1 -
 5 files changed, 32 deletions(-)

diff --git a/drivers/md/md.h b/drivers/md/md.h
index 95cf11c4abc6..6233ec9f10a3 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -716,17 +716,6 @@ static inline int mddev_trylock(struct mddev *mddev)
 }
 extern void mddev_unlock(struct mddev *mddev);
 
-static inline void md_sync_acct(struct block_device *bdev, unsigned long nr_sectors)
-{
-	if (blk_queue_io_stat(bdev->bd_disk->queue))
-		atomic_add(nr_sectors, &bdev->bd_disk->sync_io);
-}
-
-static inline void md_sync_acct_bio(struct bio *bio, unsigned long nr_sectors)
-{
-	md_sync_acct(bio->bi_bdev, nr_sectors);
-}
-
 struct md_personality
 {
 	struct md_submodule_head head;
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index de9bccbe7337..657d481525be 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2382,7 +2382,6 @@ static void sync_request_write(struct mddev *mddev, struct r1bio *r1_bio)
 
 		wbio->bi_end_io = end_sync_write;
 		atomic_inc(&r1_bio->remaining);
-		md_sync_acct(conf->mirrors[i].rdev->bdev, bio_sectors(wbio));
 
 		submit_bio_noacct(wbio);
 	}
@@ -3055,7 +3054,6 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
 			bio = r1_bio->bios[i];
 			if (bio->bi_end_io == end_sync_read) {
 				read_targets--;
-				md_sync_acct_bio(bio, nr_sectors);
 				if (read_targets == 1)
 					bio->bi_opf &= ~MD_FAILFAST;
 				submit_bio_noacct(bio);
@@ -3064,7 +3062,6 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
 	} else {
 		atomic_set(&r1_bio->remaining, 1);
 		bio = r1_bio->bios[r1_bio->read_disk];
-		md_sync_acct_bio(bio, nr_sectors);
 		if (read_targets == 1)
 			bio->bi_opf &= ~MD_FAILFAST;
 		submit_bio_noacct(bio);
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index ba32bac975b8..dce06bf65016 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2426,7 +2426,6 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
 
 		atomic_inc(&conf->mirrors[d].rdev->nr_pending);
 		atomic_inc(&r10_bio->remaining);
-		md_sync_acct(conf->mirrors[d].rdev->bdev, bio_sectors(tbio));
 
 		if (test_bit(FailFast, &conf->mirrors[d].rdev->flags))
 			tbio->bi_opf |= MD_FAILFAST;
@@ -2448,8 +2447,6 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
 			bio_copy_data(tbio, fbio);
 		d = r10_bio->devs[i].devnum;
 		atomic_inc(&r10_bio->remaining);
-		md_sync_acct(conf->mirrors[d].replacement->bdev,
-			     bio_sectors(tbio));
 		submit_bio_noacct(tbio);
 	}
 
@@ -2583,13 +2580,10 @@ static void recovery_request_write(struct mddev *mddev, struct r10bio *r10_bio)
 	d = r10_bio->devs[1].devnum;
 	if (wbio->bi_end_io) {
 		atomic_inc(&conf->mirrors[d].rdev->nr_pending);
-		md_sync_acct(conf->mirrors[d].rdev->bdev, bio_sectors(wbio));
 		submit_bio_noacct(wbio);
 	}
 	if (wbio2) {
 		atomic_inc(&conf->mirrors[d].replacement->nr_pending);
-		md_sync_acct(conf->mirrors[d].replacement->bdev,
-			     bio_sectors(wbio2));
 		submit_bio_noacct(wbio2);
 	}
 }
@@ -3757,7 +3751,6 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 		r10_bio->sectors = nr_sectors;
 
 		if (bio->bi_end_io == end_sync_read) {
-			md_sync_acct_bio(bio, nr_sectors);
 			bio->bi_status = 0;
 			submit_bio_noacct(bio);
 		}
@@ -4880,7 +4873,6 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr,
 	r10_bio->sectors = nr_sectors;
 
 	/* Now submit the read */
-	md_sync_acct_bio(read_bio, r10_bio->sectors);
 	atomic_inc(&r10_bio->remaining);
 	read_bio->bi_next = NULL;
 	submit_bio_noacct(read_bio);
@@ -4940,7 +4932,6 @@ static void reshape_request_write(struct mddev *mddev, struct r10bio *r10_bio)
 			continue;
 
 		atomic_inc(&rdev->nr_pending);
-		md_sync_acct_bio(b, r10_bio->sectors);
 		atomic_inc(&r10_bio->remaining);
 		b->bi_next = NULL;
 		submit_bio_noacct(b);
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 6389383166c0..ca5b0e8ba707 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1240,10 +1240,6 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 		}
 
 		if (rdev) {
-			if (s->syncing || s->expanding || s->expanded
-			    || s->replacing)
-				md_sync_acct(rdev->bdev, RAID5_STRIPE_SECTORS(conf));
-
 			set_bit(STRIPE_IO_STARTED, &sh->state);
 
 			bio_init(bi, rdev->bdev, &dev->vec, 1, op | op_flags);
@@ -1300,10 +1296,6 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 				submit_bio_noacct(bi);
 		}
 		if (rrdev) {
-			if (s->syncing || s->expanding || s->expanded
-			    || s->replacing)
-				md_sync_acct(rrdev->bdev, RAID5_STRIPE_SECTORS(conf));
-
 			set_bit(STRIPE_IO_STARTED, &sh->state);
 
 			bio_init(rbi, rrdev->bdev, &dev->rvec, 1, op | op_flags);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e39c45bc0a97..f3a625b00734 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -182,7 +182,6 @@ struct gendisk {
 	struct list_head slave_bdevs;
 #endif
 	struct timer_rand_state *random;
-	atomic_t sync_io;		/* RAID */
 	struct disk_events *ev;
 
 #ifdef CONFIG_BLK_DEV_ZONED
-- 
2.39.2



* Re: [PATCH 1/4] block: export part_in_flight()
  2025-04-12  7:31 ` [PATCH 1/4] block: export part_in_flight() Yu Kuai
@ 2025-04-14  6:32   ` Christoph Hellwig
  2025-04-14  6:48     ` Yu Kuai
  0 siblings, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2025-04-14  6:32 UTC (permalink / raw)
  To: Yu Kuai
  Cc: axboe, song, yukuai3, xni, linux-block, linux-kernel, linux-raid,
	yi.zhang, yangerkun

On Sat, Apr 12, 2025 at 03:31:59PM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> This helper will be used in mdraid in later patches, check if there
> are normal IO inflight while generating background sync IO, to fix a
> problem in mdraid that foreground IO can be starved by background sync
> IO.

If we export this it needs a kerneldoc comment, and probably also
a better name.

Looking at this I'm also a little confused about blk_mq_in_flight_rw vs
blk_mq_in_flight and why one needs blk-mq special casing and the other
not, maybe we need to dig into the history and try to understand that
as well while we're at it.



* Re: [PATCH 1/4] block: export part_in_flight()
  2025-04-14  6:32   ` Christoph Hellwig
@ 2025-04-14  6:48     ` Yu Kuai
  2025-04-14 11:39       ` Christoph Hellwig
  0 siblings, 1 reply; 16+ messages in thread
From: Yu Kuai @ 2025-04-14  6:48 UTC (permalink / raw)
  To: Christoph Hellwig, Yu Kuai
  Cc: axboe, song, xni, linux-block, linux-kernel, linux-raid, yi.zhang,
	yangerkun, yukuai (C)

Hi,

On 2025/04/14 14:32, Christoph Hellwig wrote:
> On Sat, Apr 12, 2025 at 03:31:59PM +0800, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> This helper will be used in mdraid in later patches, check if there
>> are normal IO inflight while generating background sync IO, to fix a
>> problem in mdraid that foreground IO can be starved by background sync
>> IO.
> 
> If we export this it needs a kerneldoc comment, and probably also
> a better name.

Sure, I will add a comment.
> 
> Looking at this I'm also a little confused about blk_mq_in_flight_rw vs
> blk_mq_in_flight and why one needs blk-mq special casing and the other
> not, maybe we need to dig into the history and try to understand that
> as well while we're at it.

There are two kinds of helpers:

1) part_in_flight and part_in_flight_rw
2) blk_mq_in_flight and blk_mq_in_flight_rw

1) is accounted at blk_account_io_start(), while 2) is accounted at
blk_mq_start_request(); I think this is the essential difference.

part_in_flight_rw() and blk_mq_in_flight_rw() are also used by the sysfs
API inflight for bio/rq based devices. And commit 7be835694dae ("block:
fix that util can be greater than 100%") converted the disk stats API
from blk_mq_in_flight() to part_in_flight(). I just checked, and there
are no users of blk_mq_in_flight() anymore, so maybe it can be removed.

Thanks,
Kuai




* Re: [PATCH 1/4] block: export part_in_flight()
  2025-04-14  6:48     ` Yu Kuai
@ 2025-04-14 11:39       ` Christoph Hellwig
  0 siblings, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2025-04-14 11:39 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Christoph Hellwig, axboe, song, xni, linux-block, linux-kernel,
	linux-raid, yi.zhang, yangerkun, yukuai (C)

On Mon, Apr 14, 2025 at 02:48:23PM +0800, Yu Kuai wrote:
> > If we export this it needs a kerneldoc comment, and probably also
> > a better name.
> 
> Sure about comment.

I think a name like bdev_count_inflight might also be helpful as there
is nothing partition-specific in the helper.

> There are two kinds of helpers:
> 
> 1) part_in_flight and part_in_flight_rw
> 2) blk_mq_in_flight and blk_mq_in_flight_rw
> 
> 1) is accounted at blk_account_io_start(), while 2) is
> blk_mq_start_request(), I think this is the essential difference.
> 
> part_in_flight_rw() and blk_mq_in_flight_rw() is also used in sysfs API
> inflight for bio/rq based device. And commit 7be835694dae ("block: fix
> that util can be greater than 100%") convert blk_mq_in_flight() to
> part_in_flight() from disk stats API. Now I just checked there is no use
> for blk_mq_in_flight() anymore and maybe it can be removed.

Yeah.  I'm still confused about having the different methods to count
the _rw vs non-_rw variants for blk-mq, but I guess that's not really
in scope for your series.


* Re: [PATCH 2/4] md: add a new api sync_io_depth
  2025-04-12  7:32 ` [PATCH 2/4] md: add a new api sync_io_depth Yu Kuai
@ 2025-04-16  5:32   ` Xiao Ni
  2025-04-16  8:19     ` Yu Kuai
  0 siblings, 1 reply; 16+ messages in thread
From: Xiao Ni @ 2025-04-16  5:32 UTC (permalink / raw)
  To: Yu Kuai
  Cc: axboe, song, yukuai3, linux-block, linux-kernel, linux-raid,
	yi.zhang, yangerkun

On Sat, Apr 12, 2025 at 3:39 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> From: Yu Kuai <yukuai3@huawei.com>
>
> Currently if sync speed is above speed_min and below speed_max,
> md_do_sync() will wait for all sync IOs to be done before issuing new
> sync IO, means sync IO depth is limited to just 1.
>
> This limit is too low, in order to prevent sync speed drop conspicuously
> after fixing is_mddev_idle() in the next patch, add a new api for
> limiting sync IO depth, the default value is 32.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  drivers/md/md.c | 103 +++++++++++++++++++++++++++++++++++++++---------
>  drivers/md/md.h |   1 +
>  2 files changed, 85 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 438e71e45c16..8966c4afc62a 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -111,32 +111,42 @@ static void md_wakeup_thread_directly(struct md_thread __rcu *thread);
>  /* Default safemode delay: 200 msec */
>  #define DEFAULT_SAFEMODE_DELAY ((200 * HZ)/1000 +1)
>  /*
> - * Current RAID-1,4,5 parallel reconstruction 'guaranteed speed limit'
> - * is 1000 KB/sec, so the extra system load does not show up that much.
> - * Increase it if you want to have more _guaranteed_ speed. Note that
> - * the RAID driver will use the maximum available bandwidth if the IO
> - * subsystem is idle. There is also an 'absolute maximum' reconstruction
> - * speed limit - in case reconstruction slows down your system despite
> - * idle IO detection.

These comments are useful: they describe the meaning of those control
values. Would it be better to keep them?

> + * Background sync IO speed control:
>   *
> - * you can change it via /proc/sys/dev/raid/speed_limit_min and _max.
> - * or /sys/block/mdX/md/sync_speed_{min,max}
> + * - below speed min:
> + *   no limit;
> + * - above speed min and below speed max:
> + *   a) if mddev is idle, then no limit;
> + *   b) if mddev is busy handling normal IO, then limit inflight sync IO
> + *   to sync_io_depth;
> + * - above speed max:
> + *   sync IO can't be issued;
> + *
> + * Following configurations can be changed via /proc/sys/dev/raid/ for system
> + * or /sys/block/mdX/md/ for one array.
>   */
> -
>  static int sysctl_speed_limit_min = 1000;
>  static int sysctl_speed_limit_max = 200000;
> -static inline int speed_min(struct mddev *mddev)
> +static int sysctl_sync_io_depth = 32;
> +
> +static int speed_min(struct mddev *mddev)
>  {
>         return mddev->sync_speed_min ?
>                 mddev->sync_speed_min : sysctl_speed_limit_min;
>  }
>
> -static inline int speed_max(struct mddev *mddev)
> +static int speed_max(struct mddev *mddev)
>  {
>         return mddev->sync_speed_max ?
>                 mddev->sync_speed_max : sysctl_speed_limit_max;
>  }
>
> +static int sync_io_depth(struct mddev *mddev)
> +{
> +       return mddev->sync_io_depth ?
> +               mddev->sync_io_depth : sysctl_sync_io_depth;
> +}
> +
>  static void rdev_uninit_serial(struct md_rdev *rdev)
>  {
>         if (!test_and_clear_bit(CollisionCheck, &rdev->flags))
> @@ -293,14 +303,21 @@ static const struct ctl_table raid_table[] = {
>                 .procname       = "speed_limit_min",
>                 .data           = &sysctl_speed_limit_min,
>                 .maxlen         = sizeof(int),
> -               .mode           = S_IRUGO|S_IWUSR,
> +               .mode           = 0644,

Would it be better to use the macros here rather than a raw number?

>                 .proc_handler   = proc_dointvec,
>         },
>         {
>                 .procname       = "speed_limit_max",
>                 .data           = &sysctl_speed_limit_max,
>                 .maxlen         = sizeof(int),
> -               .mode           = S_IRUGO|S_IWUSR,
> +               .mode           = 0644,
> +               .proc_handler   = proc_dointvec,
> +       },
> +       {
> +               .procname       = "sync_io_depth",
> +               .data           = &sysctl_sync_io_depth,
> +               .maxlen         = sizeof(int),
> +               .mode           = 0644,
>                 .proc_handler   = proc_dointvec,
>         },
>  };
> @@ -5091,7 +5108,7 @@ static ssize_t
>  sync_min_show(struct mddev *mddev, char *page)
>  {
>         return sprintf(page, "%d (%s)\n", speed_min(mddev),
> -                      mddev->sync_speed_min ? "local": "system");
> +                      mddev->sync_speed_min ? "local" : "system");
>  }
>
>  static ssize_t
> @@ -5100,7 +5117,7 @@ sync_min_store(struct mddev *mddev, const char *buf, size_t len)
>         unsigned int min;
>         int rv;
>
> -       if (strncmp(buf, "system", 6)==0) {
> +       if (strncmp(buf, "system", 6) == 0) {
>                 min = 0;
>         } else {
>                 rv = kstrtouint(buf, 10, &min);
> @@ -5120,7 +5137,7 @@ static ssize_t
>  sync_max_show(struct mddev *mddev, char *page)
>  {
>         return sprintf(page, "%d (%s)\n", speed_max(mddev),
> -                      mddev->sync_speed_max ? "local": "system");
> +                      mddev->sync_speed_max ? "local" : "system");
>  }
>
>  static ssize_t
> @@ -5129,7 +5146,7 @@ sync_max_store(struct mddev *mddev, const char *buf, size_t len)
>         unsigned int max;
>         int rv;
>
> -       if (strncmp(buf, "system", 6)==0) {
> +       if (strncmp(buf, "system", 6) == 0) {
>                 max = 0;
>         } else {
>                 rv = kstrtouint(buf, 10, &max);
> @@ -5145,6 +5162,35 @@ sync_max_store(struct mddev *mddev, const char *buf, size_t len)
>  static struct md_sysfs_entry md_sync_max =
>  __ATTR(sync_speed_max, S_IRUGO|S_IWUSR, sync_max_show, sync_max_store);
>
> +static ssize_t
> +sync_io_depth_show(struct mddev *mddev, char *page)
> +{
> +       return sprintf(page, "%d (%s)\n", sync_io_depth(mddev),
> +                      mddev->sync_io_depth ? "local" : "system");
> +}
> +
> +static ssize_t
> +sync_io_depth_store(struct mddev *mddev, const char *buf, size_t len)
> +{
> +       unsigned int max;
> +       int rv;
> +
> +       if (strncmp(buf, "system", 6) == 0) {
> +               max = 0;
> +       } else {
> +               rv = kstrtouint(buf, 10, &max);
> +               if (rv < 0)
> +                       return rv;
> +               if (max == 0)
> +                       return -EINVAL;
> +       }
> +       mddev->sync_io_depth = max;
> +       return len;
> +}
> +
> +static struct md_sysfs_entry md_sync_io_depth =
> +__ATTR_RW(sync_io_depth);
> +
>  static ssize_t
>  degraded_show(struct mddev *mddev, char *page)
>  {
> @@ -5671,6 +5717,7 @@ static struct attribute *md_redundancy_attrs[] = {
>         &md_mismatches.attr,
>         &md_sync_min.attr,
>         &md_sync_max.attr,
> +       &md_sync_io_depth.attr,
>         &md_sync_speed.attr,
>         &md_sync_force_parallel.attr,
>         &md_sync_completed.attr,
> @@ -8927,6 +8974,23 @@ static sector_t md_sync_position(struct mddev *mddev, enum sync_action action)
>         }
>  }
>
> +static bool sync_io_within_limit(struct mddev *mddev)
> +{
> +       int io_sectors;
> +
> +       /*
> +        * For raid456, sync IO is stripe(4k) per IO, for other levels, it's
> +        * RESYNC_PAGES(64k) per IO.
> +        */
> +       if (mddev->level == 4 || mddev->level == 5 || mddev->level == 6)
> +               io_sectors = 8;
> +       else
> +               io_sectors = 128;
> +
> +       return atomic_read(&mddev->recovery_active) <
> +               io_sectors * sync_io_depth(mddev);
> +}
> +
>  #define SYNC_MARKS     10
>  #define        SYNC_MARK_STEP  (3*HZ)
>  #define UPDATE_FREQUENCY (5*60*HZ)
> @@ -9195,7 +9259,8 @@ void md_do_sync(struct md_thread *thread)
>                                 msleep(500);
>                                 goto repeat;
>                         }
> -                       if (!is_mddev_idle(mddev, 0)) {
> +                       if (!sync_io_within_limit(mddev) &&
> +                           !is_mddev_idle(mddev, 0)) {
>                                 /*
>                                  * Give other IO more of a chance.
>                                  * The faster the devices, the less we wait.
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 1cf00a04bcdd..63be622467c6 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -483,6 +483,7 @@ struct mddev {
>         /* if zero, use the system-wide default */
>         int                             sync_speed_min;
>         int                             sync_speed_max;
> +       int                             sync_io_depth;
>
>         /* resync even though the same disks are shared among md-devices */
>         int                             parallel_resync;
> --
> 2.39.2
>

This part looks good to me.

Acked-by: Xiao Ni <xni@redhat.com>
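
The speed-control policy in the new comment block, together with the cap
computed by sync_io_within_limit(), can be sketched in user space as
follows. This is only a model under stated assumptions: the enum, decide()
and the parameter names are hypothetical and do not exist in the kernel;
the sector sizes (8 and 128) and the default depth of 32 are taken from
the patch.

```c
#include <stdbool.h>

/* Hypothetical user-space model of the policy; not kernel code. */
enum sync_action { SYNC_ISSUE, SYNC_WAIT_INFLIGHT, SYNC_THROTTLE };

/* Sectors per sync IO per the patch comment: 4k stripe for raid4/5/6,
 * RESYNC_PAGES (64k) for other levels. */
static int sync_io_sectors(int level)
{
	return (level == 4 || level == 5 || level == 6) ? 8 : 128;
}

/* Mirrors sync_io_within_limit(): true while the in-flight sync sectors
 * stay below io_sectors * depth. */
static bool within_limit(int level, int depth, int inflight_sectors)
{
	return inflight_sectors < sync_io_sectors(level) * depth;
}

/* Decide what the sync thread does before issuing the next sync IO. */
static enum sync_action decide(int speed, int min, int max, bool idle,
			       int level, int depth, int inflight_sectors)
{
	if (speed <= min)
		return SYNC_ISSUE;		/* below speed min: no limit */
	if (speed > max)
		return SYNC_THROTTLE;		/* above speed max: back off */
	if (idle || within_limit(level, depth, inflight_sectors))
		return SYNC_ISSUE;		/* idle, or still under the cap */
	return SYNC_WAIT_INFLIGHT;		/* busy and cap reached */
}
```

With the default depth of 32 the cap works out to 8 * 32 = 256 sectors
(128 KiB) for raid4/5/6 and 128 * 32 = 4096 sectors (2 MiB) for the other
levels.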



* Re: [PATCH 3/4] md: fix is_mddev_idle()
  2025-04-12  7:32 ` [PATCH 3/4] md: fix is_mddev_idle() Yu Kuai
@ 2025-04-16  6:20   ` Xiao Ni
  2025-04-16  7:42     ` Yu Kuai
  0 siblings, 1 reply; 16+ messages in thread
From: Xiao Ni @ 2025-04-16  6:20 UTC (permalink / raw)
  To: Yu Kuai, axboe, song, yukuai3
  Cc: linux-block, linux-kernel, linux-raid, yi.zhang, yangerkun


On 2025/4/12 3:32 PM, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> If sync_speed is above speed_min, then is_mddev_idle() will be called
> for each sync IO to check if the array is idle, and inflihgt sync_io
> will be limited if the array is not idle.
>
> However, while mkfs.ext4 for a large raid5 array while recovery is in
> progress, it's found that sync_speed is already above speed_min while
> lots of stripes are used for sync IO, causing long delay for mkfs.ext4.
>
> Root cause is the following checking from is_mddev_idle():
>
> t1: submit sync IO: events1 = completed IO - issued sync IO
> t2: submit next sync IO: events2  = completed IO - issued sync IO
> if (events2 - events1 > 64)
>
> For consequence, the more sync IO issued, the less likely checking will
> pass. And when completed normal IO is more than issued sync IO, the
> condition will finally pass and is_mddev_idle() will return false,
> however, last_events will be updated hence is_mddev_idle() can only
> return false once in a while.
>
> Fix this problem by changing the checking as following:
>
> 1) mddev doesn't have normal IO completed;
> 2) mddev doesn't have normal IO inflight;
> 3) if any member disks is partition, and all other partitions doesn't
>     have IO completed.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>   drivers/md/md.c | 78 ++++++++++++++++++++++++++-----------------------
>   drivers/md/md.h |  3 +-
>   2 files changed, 43 insertions(+), 38 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 8966c4afc62a..19da93f8912c 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -8619,50 +8619,54 @@ void md_cluster_stop(struct mddev *mddev)
>   	put_cluster_ops(mddev);
>   }
>   
> -static int is_mddev_idle(struct mddev *mddev, int init)
> +static bool is_rdev_idle(struct md_rdev *rdev, bool init)
> +{
> +	unsigned long last_events = rdev->last_events;
> +
> +	if (!bdev_is_partition(rdev->bdev))
> +		return true;


For an md array, I think is_rdev_idle() is not useful, because
mddev->last_events must increase when upper-layer IO comes in, and idle
will then be set to false. For a dm array, mddev->last_events can't work,
so is_rdev_idle() is for dm arrays. If a member disk is a partition,
is_rdev_idle() always returns true, and is_mddev_idle() always returns
true. That looks like a bug. Do we need to check bdev_is_partition() here?

Best Regards

Xiao

> +
> +	rdev->last_events = part_stat_read_accum(rdev->bdev->bd_disk->part0,
> +						 sectors) -
> +			    part_stat_read_accum(rdev->bdev, sectors);
> +
> +	if (!init && rdev->last_events > last_events)
> +		return false;
> +
> +	return true;
> +}
> +
> +/*
> + * mddev is idle if following conditions are match since last check:
> + * 1) mddev doesn't have normal IO completed;
> + * 2) mddev doesn't have inflight normal IO;
> + * 3) if any member disk is partition, and other partitions doesn't have IO
> + *    completed;
> + *
> + * Noted this checking rely on IO accounting is enabled.
> + */
> +static bool is_mddev_idle(struct mddev *mddev, int init)
>   {
>   	struct md_rdev *rdev;
> -	int idle;
> -	int curr_events;
> +	bool idle = true;
>   
> -	idle = 1;
> -	rcu_read_lock();
> -	rdev_for_each_rcu(rdev, mddev) {
> -		struct gendisk *disk = rdev->bdev->bd_disk;
> +	if (!mddev_is_dm(mddev)) {
> +		unsigned long last_events = mddev->last_events;
>   
> -		if (!init && !blk_queue_io_stat(disk->queue))
> -			continue;
> +		mddev->last_events = part_stat_read_accum(mddev->gendisk->part0,
> +							  sectors);
>   
> -		curr_events = (int)part_stat_read_accum(disk->part0, sectors) -
> -			      atomic_read(&disk->sync_io);
> -		/* sync IO will cause sync_io to increase before the disk_stats
> -		 * as sync_io is counted when a request starts, and
> -		 * disk_stats is counted when it completes.
> -		 * So resync activity will cause curr_events to be smaller than
> -		 * when there was no such activity.
> -		 * non-sync IO will cause disk_stat to increase without
> -		 * increasing sync_io so curr_events will (eventually)
> -		 * be larger than it was before.  Once it becomes
> -		 * substantially larger, the test below will cause
> -		 * the array to appear non-idle, and resync will slow
> -		 * down.
> -		 * If there is a lot of outstanding resync activity when
> -		 * we set last_event to curr_events, then all that activity
> -		 * completing might cause the array to appear non-idle
> -		 * and resync will be slowed down even though there might
> -		 * not have been non-resync activity.  This will only
> -		 * happen once though.  'last_events' will soon reflect
> -		 * the state where there is little or no outstanding
> -		 * resync requests, and further resync activity will
> -		 * always make curr_events less than last_events.
> -		 *
> -		 */
> -		if (init || curr_events - rdev->last_events > 64) {
> -			rdev->last_events = curr_events;
> -			idle = 0;
> -		}
> +		if (!init && (mddev->last_events > last_events ||
> +			      part_in_flight(mddev->gendisk->part0)))
> +			idle = false;
>   	}
> +
> +	rcu_read_lock();
> +	rdev_for_each_rcu(rdev, mddev)
> +		if (!is_rdev_idle(rdev, init))
> +			idle = false;
>   	rcu_read_unlock();
> +
>   	return idle;
>   }
>   
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 63be622467c6..95cf11c4abc6 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -132,7 +132,7 @@ struct md_rdev {
>   
>   	sector_t sectors;		/* Device size (in 512bytes sectors) */
>   	struct mddev *mddev;		/* RAID array if running */
> -	int last_events;		/* IO event timestamp */
> +	unsigned long last_events;	/* IO event timestamp */
>   
>   	/*
>   	 * If meta_bdev is non-NULL, it means that a separate device is
> @@ -519,6 +519,7 @@ struct mddev {
>   							 * adding a spare
>   							 */
>   
> +	unsigned long			last_events;	/* IO event timestamp */
>   	atomic_t			recovery_active; /* blocks scheduled, but not written */
>   	wait_queue_head_t		recovery_wait;
>   	sector_t			recovery_cp;
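
The root cause described in the commit message can be reproduced with a
small user-space model of the old per-rdev check. This is a sketch only:
the struct and function names are hypothetical, while the formula
(completed IO minus issued sync IO) and the threshold of 64 come from the
removed code.

```c
#include <stdbool.h>

/* Hypothetical model of the counters the old check used. */
struct old_rdev_model {
	long completed_sectors;   /* disk stats: completed IO, normal and sync */
	long issued_sync_sectors; /* the old gendisk->sync_io counter */
	long last_events;         /* value recorded at the last non-idle result */
};

/* Old logic: not idle only if events grew by more than 64 sectors, and
 * last_events is updated only in that case. */
static bool old_is_idle(struct old_rdev_model *r)
{
	long curr = r->completed_sectors - r->issued_sync_sectors;

	if (curr - r->last_events > 64) {
		r->last_events = curr;
		return false;
	}
	return true;
}
```

If 1000 sectors of normal IO complete while 2000 sectors of sync IO are
issued but not yet completed, curr drops to -1000 and the array looks
idle even though normal IO is clearly running. Issued sync IO masks
completed normal IO, which is the behaviour the patch removes.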



* Re: [PATCH 4/4] md: cleanup accounting for issued sync IO
  2025-04-12  7:32 ` [PATCH 4/4] md: cleanup accounting for issued sync IO Yu Kuai
@ 2025-04-16  6:27   ` Xiao Ni
  0 siblings, 0 replies; 16+ messages in thread
From: Xiao Ni @ 2025-04-16  6:27 UTC (permalink / raw)
  To: Yu Kuai, axboe, song, yukuai3
  Cc: linux-block, linux-kernel, linux-raid, yi.zhang, yangerkun


On 2025/4/12 3:32 PM, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> It's no longer used and can be removed, also remove the field
> 'gendisk->sync_io'.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>   drivers/md/md.h        | 11 -----------
>   drivers/md/raid1.c     |  3 ---
>   drivers/md/raid10.c    |  9 ---------
>   drivers/md/raid5.c     |  8 --------
>   include/linux/blkdev.h |  1 -
>   5 files changed, 32 deletions(-)
>
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 95cf11c4abc6..6233ec9f10a3 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -716,17 +716,6 @@ static inline int mddev_trylock(struct mddev *mddev)
>   }
>   extern void mddev_unlock(struct mddev *mddev);
>   
> -static inline void md_sync_acct(struct block_device *bdev, unsigned long nr_sectors)
> -{
> -	if (blk_queue_io_stat(bdev->bd_disk->queue))
> -		atomic_add(nr_sectors, &bdev->bd_disk->sync_io);
> -}
> -
> -static inline void md_sync_acct_bio(struct bio *bio, unsigned long nr_sectors)
> -{
> -	md_sync_acct(bio->bi_bdev, nr_sectors);
> -}
> -
>   struct md_personality
>   {
>   	struct md_submodule_head head;
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index de9bccbe7337..657d481525be 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -2382,7 +2382,6 @@ static void sync_request_write(struct mddev *mddev, struct r1bio *r1_bio)
>   
>   		wbio->bi_end_io = end_sync_write;
>   		atomic_inc(&r1_bio->remaining);
> -		md_sync_acct(conf->mirrors[i].rdev->bdev, bio_sectors(wbio));
>   
>   		submit_bio_noacct(wbio);
>   	}
> @@ -3055,7 +3054,6 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
>   			bio = r1_bio->bios[i];
>   			if (bio->bi_end_io == end_sync_read) {
>   				read_targets--;
> -				md_sync_acct_bio(bio, nr_sectors);
>   				if (read_targets == 1)
>   					bio->bi_opf &= ~MD_FAILFAST;
>   				submit_bio_noacct(bio);
> @@ -3064,7 +3062,6 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
>   	} else {
>   		atomic_set(&r1_bio->remaining, 1);
>   		bio = r1_bio->bios[r1_bio->read_disk];
> -		md_sync_acct_bio(bio, nr_sectors);
>   		if (read_targets == 1)
>   			bio->bi_opf &= ~MD_FAILFAST;
>   		submit_bio_noacct(bio);
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index ba32bac975b8..dce06bf65016 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -2426,7 +2426,6 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
>   
>   		atomic_inc(&conf->mirrors[d].rdev->nr_pending);
>   		atomic_inc(&r10_bio->remaining);
> -		md_sync_acct(conf->mirrors[d].rdev->bdev, bio_sectors(tbio));
>   
>   		if (test_bit(FailFast, &conf->mirrors[d].rdev->flags))
>   			tbio->bi_opf |= MD_FAILFAST;
> @@ -2448,8 +2447,6 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
>   			bio_copy_data(tbio, fbio);
>   		d = r10_bio->devs[i].devnum;
>   		atomic_inc(&r10_bio->remaining);
> -		md_sync_acct(conf->mirrors[d].replacement->bdev,
> -			     bio_sectors(tbio));
>   		submit_bio_noacct(tbio);
>   	}
>   
> @@ -2583,13 +2580,10 @@ static void recovery_request_write(struct mddev *mddev, struct r10bio *r10_bio)
>   	d = r10_bio->devs[1].devnum;
>   	if (wbio->bi_end_io) {
>   		atomic_inc(&conf->mirrors[d].rdev->nr_pending);
> -		md_sync_acct(conf->mirrors[d].rdev->bdev, bio_sectors(wbio));
>   		submit_bio_noacct(wbio);
>   	}
>   	if (wbio2) {
>   		atomic_inc(&conf->mirrors[d].replacement->nr_pending);
> -		md_sync_acct(conf->mirrors[d].replacement->bdev,
> -			     bio_sectors(wbio2));
>   		submit_bio_noacct(wbio2);
>   	}
>   }
> @@ -3757,7 +3751,6 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
>   		r10_bio->sectors = nr_sectors;
>   
>   		if (bio->bi_end_io == end_sync_read) {
> -			md_sync_acct_bio(bio, nr_sectors);
>   			bio->bi_status = 0;
>   			submit_bio_noacct(bio);
>   		}
> @@ -4880,7 +4873,6 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr,
>   	r10_bio->sectors = nr_sectors;
>   
>   	/* Now submit the read */
> -	md_sync_acct_bio(read_bio, r10_bio->sectors);
>   	atomic_inc(&r10_bio->remaining);
>   	read_bio->bi_next = NULL;
>   	submit_bio_noacct(read_bio);
> @@ -4940,7 +4932,6 @@ static void reshape_request_write(struct mddev *mddev, struct r10bio *r10_bio)
>   			continue;
>   
>   		atomic_inc(&rdev->nr_pending);
> -		md_sync_acct_bio(b, r10_bio->sectors);
>   		atomic_inc(&r10_bio->remaining);
>   		b->bi_next = NULL;
>   		submit_bio_noacct(b);
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 6389383166c0..ca5b0e8ba707 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -1240,10 +1240,6 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
>   		}
>   
>   		if (rdev) {
> -			if (s->syncing || s->expanding || s->expanded
> -			    || s->replacing)
> -				md_sync_acct(rdev->bdev, RAID5_STRIPE_SECTORS(conf));
> -
>   			set_bit(STRIPE_IO_STARTED, &sh->state);
>   
>   			bio_init(bi, rdev->bdev, &dev->vec, 1, op | op_flags);
> @@ -1300,10 +1296,6 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
>   				submit_bio_noacct(bi);
>   		}
>   		if (rrdev) {
> -			if (s->syncing || s->expanding || s->expanded
> -			    || s->replacing)
> -				md_sync_acct(rrdev->bdev, RAID5_STRIPE_SECTORS(conf));
> -
>   			set_bit(STRIPE_IO_STARTED, &sh->state);
>   
>   			bio_init(rbi, rrdev->bdev, &dev->rvec, 1, op | op_flags);
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index e39c45bc0a97..f3a625b00734 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -182,7 +182,6 @@ struct gendisk {
>   	struct list_head slave_bdevs;
>   #endif
>   	struct timer_rand_state *random;
> -	atomic_t sync_io;		/* RAID */
>   	struct disk_events *ev;
>   
>   #ifdef CONFIG_BLK_DEV_ZONED


Looks good to me.

Acked-by: Xiao Ni <xni@redhat.com>



* Re: [PATCH 3/4] md: fix is_mddev_idle()
  2025-04-16  6:20   ` Xiao Ni
@ 2025-04-16  7:42     ` Yu Kuai
  2025-04-16  9:28       ` Yu Kuai
  0 siblings, 1 reply; 16+ messages in thread
From: Yu Kuai @ 2025-04-16  7:42 UTC (permalink / raw)
  To: Xiao Ni, Yu Kuai, axboe, song
  Cc: linux-block, linux-kernel, linux-raid, yi.zhang, yangerkun,
	yukuai (C)

Hi,

On 2025/04/16 14:20, Xiao Ni wrote:
>> +static bool is_rdev_idle(struct md_rdev *rdev, bool init)
>> +{
>> +    unsigned long last_events = rdev->last_events;
>> +
>> +    if (!bdev_is_partition(rdev->bdev))
>> +        return true;
> 
> 
> For an md array, I think is_rdev_idle() is not useful, because
> mddev->last_events must increase when upper-layer IO comes in, and idle
> will then be set to false. For a dm array, mddev->last_events can't work,
> so is_rdev_idle() is for dm arrays. If a member disk is a partition,
> is_rdev_idle() always returns true, and is_mddev_idle() always returns
> true. That looks like a bug. Do we need to check bdev_is_partition() here?

is_rdev_idle() is not meant to check the current array; it checks the
other partitions of the member's disk. For example:

sda1 is a member of array md0, and the user doesn't issue IO to md0 but
does issue IO to sda2. In this case, is_mddev_idle() still returns false
for array md0 because is_rdev_idle() returns false.

This is just inherited from the old behaviour.
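
The sda1/sda2 case above can be sketched in user space as follows. The
struct and function names are hypothetical; the formula (whole-disk
sectors minus this partition's sectors) follows the
part_stat_read_accum() arithmetic in the patch.

```c
#include <stdbool.h>

/* Hypothetical model of one member device that may be a partition. */
struct rdev_model {
	unsigned long disk_sectors; /* sectors completed on the whole disk (part0) */
	unsigned long part_sectors; /* sectors completed on this member partition */
	unsigned long last_events;  /* IO seen on *other* partitions at last check */
	bool is_partition;
};

/* True when no IO hit the other partitions of the disk since the last check. */
static bool holder_idle(struct rdev_model *r, bool init)
{
	unsigned long last = r->last_events;

	if (!r->is_partition)
		return true; /* whole-disk member: no sibling partitions to watch */

	/* IO to this partition cancels out; only sibling IO moves the value. */
	r->last_events = r->disk_sectors - r->part_sectors;

	return init || r->last_events <= last;
}
```

IO issued to sda1 itself raises both counters by the same amount, so the
difference stays put and the member still looks idle; IO issued to sda2
raises only the whole-disk counter, so the next check returns false.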

Thanks,
Kuai

> 
> Best Regards
> 
> Xiao



* Re: [PATCH 2/4] md: add a new api sync_io_depth
  2025-04-16  5:32   ` Xiao Ni
@ 2025-04-16  8:19     ` Yu Kuai
  0 siblings, 0 replies; 16+ messages in thread
From: Yu Kuai @ 2025-04-16  8:19 UTC (permalink / raw)
  To: Xiao Ni, Yu Kuai
  Cc: axboe, song, linux-block, linux-kernel, linux-raid, yi.zhang,
	yangerkun, yukuai (C)

Hi,

On 2025/04/16 13:32, Xiao Ni wrote:
> On Sat, Apr 12, 2025 at 3:39 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Currently if sync speed is above speed_min and below speed_max,
>> md_do_sync() will wait for all sync IOs to be done before issuing new
>> sync IO, means sync IO depth is limited to just 1.
>>
>> This limit is too low, in order to prevent sync speed drop conspicuously
>> after fixing is_mddev_idle() in the next patch, add a new api for
>> limiting sync IO depth, the default value is 32.
>>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> ---
>>   drivers/md/md.c | 103 +++++++++++++++++++++++++++++++++++++++---------
>>   drivers/md/md.h |   1 +
>>   2 files changed, 85 insertions(+), 19 deletions(-)
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 438e71e45c16..8966c4afc62a 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -111,32 +111,42 @@ static void md_wakeup_thread_directly(struct md_thread __rcu *thread);
>>   /* Default safemode delay: 200 msec */
>>   #define DEFAULT_SAFEMODE_DELAY ((200 * HZ)/1000 +1)
>>   /*
>> - * Current RAID-1,4,5 parallel reconstruction 'guaranteed speed limit'
>> - * is 1000 KB/sec, so the extra system load does not show up that much.
>> - * Increase it if you want to have more _guaranteed_ speed. Note that
>> - * the RAID driver will use the maximum available bandwidth if the IO
>> - * subsystem is idle. There is also an 'absolute maximum' reconstruction
>> - * speed limit - in case reconstruction slows down your system despite
>> - * idle IO detection.
> 
> These comments are useful: they describe the meaning of those control
> values. Would it be better to keep them?

Sure
> 
>> + * Background sync IO speed control:
>>    *
>> - * you can change it via /proc/sys/dev/raid/speed_limit_min and _max.
>> - * or /sys/block/mdX/md/sync_speed_{min,max}
>> + * - below speed min:
>> + *   no limit;
>> + * - above speed min and below speed max:
>> + *   a) if mddev is idle, then no limit;
>> + *   b) if mddev is busy handling normal IO, then limit inflight sync IO
>> + *   to sync_io_depth;
>> + * - above speed max:
>> + *   sync IO can't be issued;
>> + *
>> + * Following configurations can be changed via /proc/sys/dev/raid/ for system
>> + * or /sys/block/mdX/md/ for one array.
>>    */
>> -
>>   static int sysctl_speed_limit_min = 1000;
>>   static int sysctl_speed_limit_max = 200000;
>> -static inline int speed_min(struct mddev *mddev)
>> +static int sysctl_sync_io_depth = 32;
>> +
>> +static int speed_min(struct mddev *mddev)
>>   {
>>          return mddev->sync_speed_min ?
>>                  mddev->sync_speed_min : sysctl_speed_limit_min;
>>   }
>>
>> -static inline int speed_max(struct mddev *mddev)
>> +static int speed_max(struct mddev *mddev)
>>   {
>>          return mddev->sync_speed_max ?
>>                  mddev->sync_speed_max : sysctl_speed_limit_max;
>>   }
>>
>> +static int sync_io_depth(struct mddev *mddev)
>> +{
>> +       return mddev->sync_io_depth ?
>> +               mddev->sync_io_depth : sysctl_sync_io_depth;
>> +}
>> +
>>   static void rdev_uninit_serial(struct md_rdev *rdev)
>>   {
>>          if (!test_and_clear_bit(CollisionCheck, &rdev->flags))
>> @@ -293,14 +303,21 @@ static const struct ctl_table raid_table[] = {
>>                  .procname       = "speed_limit_min",
>>                  .data           = &sysctl_speed_limit_min,
>>                  .maxlen         = sizeof(int),
>> -               .mode           = S_IRUGO|S_IWUSR,
>> +               .mode           = 0644,
> 
> Would it be better to use the macros here rather than a raw number?

checkpatch will suggest 0644 over S_IRUGO|S_IWUSR.
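
That the two spellings are the same mode can be checked at compile time
in user space. This is a sketch: S_IRUGO is a kernel-only macro, so it
is redefined here from the POSIX permission bits as an assumption about
its kernel definition.

```c
#include <sys/stat.h>

/* S_IRUGO is not provided by user-space headers; define it for
 * illustration as read permission for user, group and other. */
#ifndef S_IRUGO
#define S_IRUGO (S_IRUSR | S_IRGRP | S_IROTH)
#endif

/* Compilation fails if the symbolic and octal forms ever differ. */
_Static_assert((S_IRUGO | S_IWUSR) == 0644,
	       "symbolic and octal modes must match");
```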

Thanks,
Kuai

> 
>>                  .proc_handler   = proc_dointvec,
>>          },
>>          {
>>                  .procname       = "speed_limit_max",
>>                  .data           = &sysctl_speed_limit_max,
>>                  .maxlen         = sizeof(int),
>> -               .mode           = S_IRUGO|S_IWUSR,
>> +               .mode           = 0644,
>> +               .proc_handler   = proc_dointvec,
>> +       },
>> +       {
>> +               .procname       = "sync_io_depth",
>> +               .data           = &sysctl_sync_io_depth,
>> +               .maxlen         = sizeof(int),
>> +               .mode           = 0644,
>>                  .proc_handler   = proc_dointvec,
>>          },
>>   };
>> @@ -5091,7 +5108,7 @@ static ssize_t
>>   sync_min_show(struct mddev *mddev, char *page)
>>   {
>>          return sprintf(page, "%d (%s)\n", speed_min(mddev),
>> -                      mddev->sync_speed_min ? "local": "system");
>> +                      mddev->sync_speed_min ? "local" : "system");
>>   }
>>
>>   static ssize_t
>> @@ -5100,7 +5117,7 @@ sync_min_store(struct mddev *mddev, const char *buf, size_t len)
>>          unsigned int min;
>>          int rv;
>>
>> -       if (strncmp(buf, "system", 6)==0) {
>> +       if (strncmp(buf, "system", 6) == 0) {
>>                  min = 0;
>>          } else {
>>                  rv = kstrtouint(buf, 10, &min);
>> @@ -5120,7 +5137,7 @@ static ssize_t
>>   sync_max_show(struct mddev *mddev, char *page)
>>   {
>>          return sprintf(page, "%d (%s)\n", speed_max(mddev),
>> -                      mddev->sync_speed_max ? "local": "system");
>> +                      mddev->sync_speed_max ? "local" : "system");
>>   }
>>
>>   static ssize_t
>> @@ -5129,7 +5146,7 @@ sync_max_store(struct mddev *mddev, const char *buf, size_t len)
>>          unsigned int max;
>>          int rv;
>>
>> -       if (strncmp(buf, "system", 6)==0) {
>> +       if (strncmp(buf, "system", 6) == 0) {
>>                  max = 0;
>>          } else {
>>                  rv = kstrtouint(buf, 10, &max);
>> @@ -5145,6 +5162,35 @@ sync_max_store(struct mddev *mddev, const char *buf, size_t len)
>>   static struct md_sysfs_entry md_sync_max =
>>   __ATTR(sync_speed_max, S_IRUGO|S_IWUSR, sync_max_show, sync_max_store);
>>
>> +static ssize_t
>> +sync_io_depth_show(struct mddev *mddev, char *page)
>> +{
>> +       return sprintf(page, "%d (%s)\n", sync_io_depth(mddev),
>> +                      mddev->sync_io_depth ? "local" : "system");
>> +}
>> +
>> +static ssize_t
>> +sync_io_depth_store(struct mddev *mddev, const char *buf, size_t len)
>> +{
>> +       unsigned int max;
>> +       int rv;
>> +
>> +       if (strncmp(buf, "system", 6) == 0) {
>> +               max = 0;
>> +       } else {
>> +               rv = kstrtouint(buf, 10, &max);
>> +               if (rv < 0)
>> +                       return rv;
>> +               if (max == 0)
>> +                       return -EINVAL;
>> +       }
>> +       mddev->sync_io_depth = max;
>> +       return len;
>> +}
>> +
>> +static struct md_sysfs_entry md_sync_io_depth =
>> +__ATTR_RW(sync_io_depth);
>> +
>>   static ssize_t
>>   degraded_show(struct mddev *mddev, char *page)
>>   {
>> @@ -5671,6 +5717,7 @@ static struct attribute *md_redundancy_attrs[] = {
>>          &md_mismatches.attr,
>>          &md_sync_min.attr,
>>          &md_sync_max.attr,
>> +       &md_sync_io_depth.attr,
>>          &md_sync_speed.attr,
>>          &md_sync_force_parallel.attr,
>>          &md_sync_completed.attr,
>> @@ -8927,6 +8974,23 @@ static sector_t md_sync_position(struct mddev *mddev, enum sync_action action)
>>          }
>>   }
>>
>> +static bool sync_io_within_limit(struct mddev *mddev)
>> +{
>> +       int io_sectors;
>> +
>> +       /*
>> +        * For raid456, sync IO is stripe(4k) per IO, for other levels, it's
>> +        * RESYNC_PAGES(64k) per IO.
>> +        */
>> +       if (mddev->level == 4 || mddev->level == 5 || mddev->level == 6)
>> +               io_sectors = 8;
>> +       else
>> +               io_sectors = 128;
>> +
>> +       return atomic_read(&mddev->recovery_active) <
>> +               io_sectors * sync_io_depth(mddev);
>> +}
>> +
>>   #define SYNC_MARKS     10
>>   #define        SYNC_MARK_STEP  (3*HZ)
>>   #define UPDATE_FREQUENCY (5*60*HZ)
>> @@ -9195,7 +9259,8 @@ void md_do_sync(struct md_thread *thread)
>>                                  msleep(500);
>>                                  goto repeat;
>>                          }
>> -                       if (!is_mddev_idle(mddev, 0)) {
>> +                       if (!sync_io_within_limit(mddev) &&
>> +                           !is_mddev_idle(mddev, 0)) {
>>                                  /*
>>                                   * Give other IO more of a chance.
>>                                   * The faster the devices, the less we wait.
>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>> index 1cf00a04bcdd..63be622467c6 100644
>> --- a/drivers/md/md.h
>> +++ b/drivers/md/md.h
>> @@ -483,6 +483,7 @@ struct mddev {
>>          /* if zero, use the system-wide default */
>>          int                             sync_speed_min;
>>          int                             sync_speed_max;
>> +       int                             sync_io_depth;
>>
>>          /* resync even though the same disks are shared among md-devices */
>>          int                             parallel_resync;
>> --
>> 2.39.2
>>
> 
> This part looks good to me.
> 
> Acked-by: Xiao Ni <xni@redhat.com>
> 
> 
> 
> .
> 



* Re: [PATCH 3/4] md: fix is_mddev_idle()
  2025-04-16  7:42     ` Yu Kuai
@ 2025-04-16  9:28       ` Yu Kuai
  2025-04-16  9:44         ` Xiao Ni
  0 siblings, 1 reply; 16+ messages in thread
From: Yu Kuai @ 2025-04-16  9:28 UTC (permalink / raw)
  To: Yu Kuai, Xiao Ni, axboe, song
  Cc: linux-block, linux-kernel, linux-raid, yi.zhang, yangerkun,
	yukuai (C)

Hi,

On 2025/04/16 15:42, Yu Kuai wrote:
> Hi,
> 
> On 2025/04/16 14:20, Xiao Ni wrote:
>>> +static bool is_rdev_idle(struct md_rdev *rdev, bool init)
>>> +{
>>> +    unsigned long last_events = rdev->last_events;
>>> +
>>> +    if (!bdev_is_partition(rdev->bdev))
>>> +        return true;
>>
>>
>> For an md array, I think is_rdev_idle() is not useful, because
>> mddev->last_events must increase when upper-layer IO comes in, and idle
>> will then be set to false. For a dm array, mddev->last_events can't work,
>> so is_rdev_idle() is for dm arrays. If a member disk is a partition,
>> is_rdev_idle() always returns true, and is_mddev_idle() always returns
>> true. That looks like a bug. Do we need to check bdev_is_partition() here?
> 
> is_rdev_idle() is not meant to check the current array; it checks the
> other partitions of the member's disk. For example:
> 
> sda1 is a member of array md0, and the user doesn't issue IO to md0 but
> does issue IO to sda2. In this case, is_mddev_idle() still returns false
> for array md0 because is_rdev_idle() returns false.

Perhaps the name is_rdev_holder_idle() is better.

Thanks,
Kuai

> 
> This is just inherited from the old behaviour.
> 
> Thanks,
> Kuai
> 
>>
>> Best Regards
>>
>> Xiao
> 
> .
> 



* Re: [PATCH 3/4] md: fix is_mddev_idle()
  2025-04-16  9:28       ` Yu Kuai
@ 2025-04-16  9:44         ` Xiao Ni
  2025-04-17  1:47           ` Yu Kuai
  0 siblings, 1 reply; 16+ messages in thread
From: Xiao Ni @ 2025-04-16  9:44 UTC (permalink / raw)
  To: Yu Kuai
  Cc: axboe, song, linux-block, linux-kernel, linux-raid, yi.zhang,
	yangerkun, yukuai (C)

On Wed, Apr 16, 2025 at 5:29 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> On 2025/04/16 15:42, Yu Kuai wrote:
> > Hi,
> >
> > On 2025/04/16 14:20, Xiao Ni wrote:
> >>> +static bool is_rdev_idle(struct md_rdev *rdev, bool init)
> >>> +{
> >>> +    unsigned long last_events = rdev->last_events;
> >>> +
> >>> +    if (!bdev_is_partition(rdev->bdev))
> >>> +        return true;
> >>
> >>
> >> For an md array, I think is_rdev_idle() is not useful, because
> >> mddev->last_events must increase when upper-layer IO comes in, and idle
> >> will then be set to false. For a dm array, mddev->last_events can't work,
> >> so is_rdev_idle() is for dm arrays. If a member disk is a partition,
> >> is_rdev_idle() always returns true, and is_mddev_idle() always returns
> >> true. That looks like a bug. Do we need to check bdev_is_partition() here?
> >
> > is_rdev_idle() is not about the current array. For example:
> >
> > sda1 is used by array md0, and the user issues no IO to md0 but does
> > issue IO to sda2. In this case, is_mddev_idle() still fails for
> > array md0 because is_rdev_idle() fails.

Thanks very much for the explanation. It makes sense :)

>
> Perhaps the name is_rdev_holder_idle() is better.

Your suggested name is better. It would also be good to add a comment
above this function.

But how about dm-raid? Can this patch work for dm-raid?

Regards
Xiao

>
> Thanks,
> Kuai
>
> >
> > This is just inherited from the old behaviour.
> >
> > Thanks,
> > Kuai
> >
> >>
> >> Best Regards
> >>
> >> Xiao
> >
> > .
> >
>



* Re: [PATCH 3/4] md: fix is_mddev_idle()
  2025-04-16  9:44         ` Xiao Ni
@ 2025-04-17  1:47           ` Yu Kuai
  0 siblings, 0 replies; 16+ messages in thread
From: Yu Kuai @ 2025-04-17  1:47 UTC (permalink / raw)
  To: Xiao Ni, Yu Kuai
  Cc: axboe, song, linux-block, linux-kernel, linux-raid, yi.zhang,
	yangerkun, yukuai (C)

Hi,

On 2025/04/16 17:44, Xiao Ni wrote:
> On Wed, Apr 16, 2025 at 5:29 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> Hi,
>>
>> On 2025/04/16 15:42, Yu Kuai wrote:
>>> Hi,
>>>
>>> On 2025/04/16 14:20, Xiao Ni wrote:
>>>>> +static bool is_rdev_idle(struct md_rdev *rdev, bool init)
>>>>> +{
>>>>> +    unsigned long last_events = rdev->last_events;
>>>>> +
>>>>> +    if (!bdev_is_partition(rdev->bdev))
>>>>> +        return true;
>>>>
>>>>
>>>> For an md array, I think is_rdev_idle is not useful, because
>>>> mddev->last_events must be increased when upper IOs come in, and idle
>>>> will be set to false. For a dm array, mddev->last_events can't work, so
>>>> is_rdev_idle is for dm arrays. If the member disk is a partition,
>>>> is_rdev_idle always returns true, and is_mddev_idle always returns
>>>> true. That's a bug. Do we need to check bdev_is_partition here?
>>>
>>> is_rdev_idle() is not about the current array. For example:
>>>
>>> sda1 is used by array md0, and the user issues no IO to md0 but does
>>> issue IO to sda2. In this case, is_mddev_idle() still fails for
>>> array md0 because is_rdev_idle() fails.
> 
> Thanks very much for the explanation. It makes sense :)
> 
>>
>> Perhaps the name is_rdev_holder_idle() is better.
> 
> Your suggested name is better. It would also be good to add a comment
> above this function.
> 
> But how about dm-raid? Can this patch work for dm-raid?

is_rdev_holder_idle() works for dm-raid. However, the part that checks
whether normal IO is in flight or completed can't work for dm-raid,
because there is currently no way to get the dm gendisk from the mddev.
I don't think this is a regression, though, since the old buggy
is_mddev_idle() almost always returns false.
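
As a rough illustration of that split, the sketch below models an array-level check in userspace. All type and field names are hypothetical; a NULL `gendisk` stands in for the dm-raid case where no disk can be reached from the mddev, and `holders_idle` stands in for the combined result of the per-member is_rdev_holder_idle() checks. It is a model of the idea described above, not the patch itself.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical counters for normal (non-sync) IO on the array's disk. */
struct model_gendisk {
	unsigned long completed_sectors;
	unsigned int inflight;
};

/* Hypothetical stand-in for struct mddev. */
struct model_mddev {
	struct model_gendisk *gendisk;  /* NULL models dm-raid */
	unsigned long normal_io_events;
	bool holders_idle;              /* combined per-member holder result */
};

static bool is_mddev_idle(struct model_mddev *mddev, bool init)
{
	struct model_gendisk *disk = mddev->gendisk;
	bool idle = mddev->holders_idle;

	/* Without a disk, only the holder checks can contribute. */
	if (disk) {
		unsigned long last = mddev->normal_io_events;

		mddev->normal_io_events = disk->completed_sectors;

		/* Normal IO completed since the last call, or still in flight. */
		if (!init && (mddev->normal_io_events > last || disk->inflight > 0))
			idle = false;
	}

	return idle;
}
```

With `gendisk` set to NULL, the result in this model depends only on the holder checks, which matches the limitation described for dm-raid.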

Thanks,
Kuai

> 
> Regards
> Xiao
> 
>>
>> Thanks,
>> Kuai
>>
>>>
>>> This is just inherited from the old behaviour.
>>>
>>> Thanks,
>>> Kuai
>>>
>>>>
>>>> Best Regards
>>>>
>>>> Xiao
>>>
>>> .
>>>
>>
> 
> 
> .
> 



end of thread (newest message: 2025-04-17  1:47 UTC)

Thread overview: 16+ messages
2025-04-12  7:31 [PATCH 0/4] md: fix is_mddev_idle() Yu Kuai
2025-04-12  7:31 ` [PATCH 1/4] block: export part_in_flight() Yu Kuai
2025-04-14  6:32   ` Christoph Hellwig
2025-04-14  6:48     ` Yu Kuai
2025-04-14 11:39       ` Christoph Hellwig
2025-04-12  7:32 ` [PATCH 2/4] md: add a new api sync_io_depth Yu Kuai
2025-04-16  5:32   ` Xiao Ni
2025-04-16  8:19     ` Yu Kuai
2025-04-12  7:32 ` [PATCH 3/4] md: fix is_mddev_idle() Yu Kuai
2025-04-16  6:20   ` Xiao Ni
2025-04-16  7:42     ` Yu Kuai
2025-04-16  9:28       ` Yu Kuai
2025-04-16  9:44         ` Xiao Ni
2025-04-17  1:47           ` Yu Kuai
2025-04-12  7:32 ` [PATCH 4/4] md: cleanup accounting for issued sync IO Yu Kuai
2025-04-16  6:27   ` Xiao Ni
