From: Yu Kuai <yukuai1@huaweicloud.com>
To: axboe@kernel.dk, song@kernel.org, yukuai3@huawei.com, xni@redhat.com
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-raid@vger.kernel.org, yukuai1@huaweicloud.com,
yi.zhang@huawei.com, yangerkun@huawei.com
Subject: [PATCH 3/4] md: fix is_mddev_idle()
Date: Sat, 12 Apr 2025 15:32:01 +0800
Message-ID: <20250412073202.3085138-4-yukuai1@huaweicloud.com>
In-Reply-To: <20250412073202.3085138-1-yukuai1@huaweicloud.com>
From: Yu Kuai <yukuai3@huawei.com>
If sync_speed is above speed_min, then is_mddev_idle() will be called
for each sync IO to check if the array is idle, and inflight sync_io
will be limited if the array is not idle.
However, when running mkfs.ext4 on a large raid5 array with recovery in
progress, it's found that sync_speed is already above speed_min while
lots of stripes are used for sync IO, causing long delays for mkfs.ext4.
The root cause is the following check in is_mddev_idle():
t1: submit sync IO: events1 = completed IO - issued sync IO
t2: submit next sync IO: events2 = completed IO - issued sync IO
if (events2 - events1 > 64)
As a consequence, the more sync IO is issued, the less likely the check
is to pass. When completed normal IO exceeds issued sync IO, the
condition finally passes and is_mddev_idle() returns false; however,
last_events is updated at that point, so is_mddev_idle() can only
return false once in a while.
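To make the effect easier to see, here is a minimal userspace sketch of
the old per-disk check. It is not kernel code: struct disk_model,
old_check_idle() and count_busy_reports() are hypothetical names used
only for illustration, and the per-round sector counts are made up.

```c
#include <stdbool.h>

/* Hypothetical userspace model of one member disk; not a kernel type. */
struct disk_model {
	long completed;   /* sectors completed, normal and sync IO together */
	long sync_issued; /* sectors of sync IO issued so far */
	long last_events; /* snapshot kept between checks */
};

/* Mirrors the old test: report "not idle" only on a jump of more than 64. */
bool old_check_idle(struct disk_model *d)
{
	long curr_events = d->completed - d->sync_issued;

	if (curr_events - d->last_events > 64) {
		/* The snapshot moves forward, so the next checks pass again. */
		d->last_events = curr_events;
		return false;
	}
	return true;
}

/*
 * Run 'checks' rounds. Each round completes 'normal' sectors of normal IO
 * and issues 'sync_growth' sectors of sync IO that have not completed yet.
 * Returns how many rounds reported the disk as not idle.
 */
int count_busy_reports(int checks, long normal, long sync_growth)
{
	struct disk_model d = {0};
	int busy = 0;

	for (int i = 0; i < checks; i++) {
		d.completed += normal;
		d.sync_issued += sync_growth;
		if (!old_check_idle(&d))
			busy++;
	}
	return busy;
}
```

In this model, a steady stream of normal IO with no growth in issued
sync IO is reported as busy only every few checks, and once issued sync
IO grows at least as fast as completed IO the disk is never reported as
busy at all.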
Fix this problem by changing the check so that the array is treated as
idle only if all of the following hold since the last check:
1) mddev doesn't have normal IO completed;
2) mddev doesn't have normal IO inflight;
3) if any member disk is a partition, the other partitions on that disk
don't have IO completed.
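The three conditions can be sketched in userspace as below. This is only
a rough model under stated assumptions: struct array_model, struct
member_model, member_idle() and array_idle() are hypothetical stand-ins
for the kernel counters, the array-level counter is assumed to see
normal IO only, and the mddev_is_dm() special case from the patch is
left out.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the array-level counters. */
struct array_model {
	unsigned long completed;   /* sectors completed on the array device */
	unsigned int inflight;     /* normal IO currently in flight */
	unsigned long last_events; /* snapshot from the previous check */
};

/* Hypothetical stand-in for one member device. */
struct member_model {
	bool is_partition;
	unsigned long disk_completed; /* sectors completed on the whole disk */
	unsigned long part_completed; /* sectors completed on this partition */
	unsigned long last_events;    /* snapshot from the previous check */
};

/* Condition 3: IO completed on the same disk outside this partition. */
bool member_idle(struct member_model *m, bool init)
{
	unsigned long last = m->last_events;

	if (!m->is_partition)
		return true;

	m->last_events = m->disk_completed - m->part_completed;
	return init || m->last_events <= last;
}

/* Conditions 1 and 2 on the array, then condition 3 on every member. */
bool array_idle(struct array_model *a, struct member_model *members,
		int nr_members, bool init)
{
	unsigned long last = a->last_events;
	bool idle = true;

	a->last_events = a->completed;
	if (!init && (a->last_events > last || a->inflight))
		idle = false;

	/* Visit every member so that each snapshot is refreshed. */
	for (int i = 0; i < nr_members; i++)
		if (!member_idle(&members[i], init))
			idle = false;

	return idle;
}
```

Unlike the old test, any newly completed or inflight normal IO makes the
model report the array as not idle on the very next check, no matter how
much sync IO has been issued.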
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
drivers/md/md.c | 78 ++++++++++++++++++++++++++-----------------------
drivers/md/md.h | 3 +-
2 files changed, 43 insertions(+), 38 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 8966c4afc62a..19da93f8912c 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8619,50 +8619,54 @@ void md_cluster_stop(struct mddev *mddev)
put_cluster_ops(mddev);
}
-static int is_mddev_idle(struct mddev *mddev, int init)
+static bool is_rdev_idle(struct md_rdev *rdev, bool init)
+{
+ unsigned long last_events = rdev->last_events;
+
+ if (!bdev_is_partition(rdev->bdev))
+ return true;
+
+ rdev->last_events = part_stat_read_accum(rdev->bdev->bd_disk->part0,
+ sectors) -
+ part_stat_read_accum(rdev->bdev, sectors);
+
+ if (!init && rdev->last_events > last_events)
+ return false;
+
+ return true;
+}
+
+/*
+ * mddev is idle if following conditions are match since last check:
+ * 1) mddev doesn't have normal IO completed;
+ * 2) mddev doesn't have inflight normal IO;
+ * 3) if any member disk is partition, and other partitions doesn't have IO
+ * completed;
+ *
+ * Noted this checking rely on IO accounting is enabled.
+ */
+static bool is_mddev_idle(struct mddev *mddev, int init)
{
struct md_rdev *rdev;
- int idle;
- int curr_events;
+ bool idle = true;
- idle = 1;
- rcu_read_lock();
- rdev_for_each_rcu(rdev, mddev) {
- struct gendisk *disk = rdev->bdev->bd_disk;
+ if (!mddev_is_dm(mddev)) {
+ unsigned long last_events = mddev->last_events;
- if (!init && !blk_queue_io_stat(disk->queue))
- continue;
+ mddev->last_events = part_stat_read_accum(mddev->gendisk->part0,
+ sectors);
- curr_events = (int)part_stat_read_accum(disk->part0, sectors) -
- atomic_read(&disk->sync_io);
- /* sync IO will cause sync_io to increase before the disk_stats
- * as sync_io is counted when a request starts, and
- * disk_stats is counted when it completes.
- * So resync activity will cause curr_events to be smaller than
- * when there was no such activity.
- * non-sync IO will cause disk_stat to increase without
- * increasing sync_io so curr_events will (eventually)
- * be larger than it was before. Once it becomes
- * substantially larger, the test below will cause
- * the array to appear non-idle, and resync will slow
- * down.
- * If there is a lot of outstanding resync activity when
- * we set last_event to curr_events, then all that activity
- * completing might cause the array to appear non-idle
- * and resync will be slowed down even though there might
- * not have been non-resync activity. This will only
- * happen once though. 'last_events' will soon reflect
- * the state where there is little or no outstanding
- * resync requests, and further resync activity will
- * always make curr_events less than last_events.
- *
- */
- if (init || curr_events - rdev->last_events > 64) {
- rdev->last_events = curr_events;
- idle = 0;
- }
+ if (!init && (mddev->last_events > last_events ||
+ part_in_flight(mddev->gendisk->part0)))
+ idle = false;
}
+
+ rcu_read_lock();
+ rdev_for_each_rcu(rdev, mddev)
+ if (!is_rdev_idle(rdev, init))
+ idle = false;
rcu_read_unlock();
+
return idle;
}
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 63be622467c6..95cf11c4abc6 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -132,7 +132,7 @@ struct md_rdev {
sector_t sectors; /* Device size (in 512bytes sectors) */
struct mddev *mddev; /* RAID array if running */
- int last_events; /* IO event timestamp */
+ unsigned long last_events; /* IO event timestamp */
/*
* If meta_bdev is non-NULL, it means that a separate device is
@@ -519,6 +519,7 @@ struct mddev {
* adding a spare
*/
+ unsigned long last_events; /* IO event timestamp */
atomic_t recovery_active; /* blocks scheduled, but not written */
wait_queue_head_t recovery_wait;
sector_t recovery_cp;
--
2.39.2