From: Su Yue <l@damenly.org>
To: Yu Kuai <yukuai1@huaweicloud.com>
Cc: axboe@kernel.dk, xni@redhat.com, agk@redhat.com,
snitzer@kernel.org, mpatocka@redhat.com, song@kernel.org,
yukuai3@huawei.com, viro@zeniv.linux.org.uk,
akpm@linux-foundation.org, nadav.amit@gmail.com,
ubizjak@gmail.com, cl@linux.com, linux-block@vger.kernel.org,
linux-kernel@vger.kernel.org, dm-devel@lists.linux.dev,
linux-raid@vger.kernel.org, yi.zhang@huawei.com,
yangerkun@huawei.com, johnny.chenyi@huawei.com
Subject: Re: [PATCH v2 4/5] md: fix is_mddev_idle()
Date: Sat, 19 Apr 2025 09:42:28 +0800
Message-ID: <v7r19baz.fsf@damenly.org>
In-Reply-To: <20250418010941.667138-5-yukuai1@huaweicloud.com> (Yu Kuai's message of "Fri, 18 Apr 2025 09:09:40 +0800")
On Fri 18 Apr 2025 at 09:09, Yu Kuai <yukuai1@huaweicloud.com> wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> If sync_speed is above speed_min, then is_mddev_idle() will be called
> for each sync IO to check if the array is idle, and inflight sync_io
> will be limited if the array is not idle.
>
> However, when running mkfs.ext4 on a large raid5 array while recovery
> is in progress, it's found that sync_speed is already above speed_min
> while lots of stripes are used for sync IO, causing a long delay for
> mkfs.ext4.
>
> The root cause is the following check in is_mddev_idle():
>
> t1: submit sync IO: events1 = completed IO - issued sync IO
> t2: submit next sync IO: events2 = completed IO - issued sync IO
> if (events2 - events1 > 64)
>
> As a consequence, the more sync IO is issued, the less likely the check
> is to pass. When completed normal IO exceeds issued sync IO, the
> condition finally passes and is_mddev_idle() returns false. However,
> last_events is updated at that point, hence is_mddev_idle() can only
> return false once in a while.
>
> Fix this problem by changing the check as follows:
>
> 1) mddev doesn't have normal IO completed;
> 2) mddev doesn't have normal IO inflight;
> 3) if any member disk is a partition, all other partitions don't
> have IO completed.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
> drivers/md/md.c | 84 +++++++++++++++++++++++++++----------------------
> drivers/md/md.h | 3 +-
> 2 files changed, 48 insertions(+), 39 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 52cadfce7e8d..dfd85a5d6112 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -8625,50 +8625,58 @@ void md_cluster_stop(struct mddev *mddev)
> put_cluster_ops(mddev);
> }
>
> -static int is_mddev_idle(struct mddev *mddev, int init)
> +static bool is_rdev_holder_idle(struct md_rdev *rdev, bool init)
> {
> + unsigned long last_events = rdev->last_events;
> +
> + if (!bdev_is_partition(rdev->bdev))
> + return true;
> +
> + /*
> + * If rdev is a partition and the user doesn't issue IO to the array,
> + * the array is still not idle if the user issues IO to other partitions.
> + */
> + rdev->last_events = part_stat_read_accum(rdev->bdev->bd_disk->part0, sectors) -
> + part_stat_read_accum(rdev->bdev, sectors);
> +
> + if (!init && rdev->last_events > last_events)
> + return false;
> +
> + return true;
> +}
> +
> +/*
> + * mddev is idle if the following conditions are met since the last check:
> + * 1) mddev doesn't have normal IO completed;
> + * 2) mddev doesn't have inflight normal IO;
> + * 3) if any member disk is a partition, the other partitions don't have IO
> + * completed;
> + *
> + * Note that this check relies on IO accounting being enabled.
> + */
> +static bool is_mddev_idle(struct mddev *mddev, int init)
> +{
> + unsigned long last_events = mddev->last_events;
> + struct gendisk *disk;
> struct md_rdev *rdev;
> - int idle;
> - int curr_events;
> + bool idle = true;
>
> - idle = 1;
> - rcu_read_lock();
> - rdev_for_each_rcu(rdev, mddev) {
> - struct gendisk *disk = rdev->bdev->bd_disk;
> + disk = mddev_is_dm(mddev) ? mddev->dm_gendisk : mddev->gendisk;
> + if (!disk)
> + return true;
>
> - if (!init && !blk_queue_io_stat(disk->queue))
> - continue;
> + mddev->last_events = part_stat_read_accum(disk->part0, sectors);
> + if (!init && (mddev->last_events > last_events ||
> + bdev_count_inflight(disk->part0)))
> + idle = false;
>
Forgot a return or goto here?
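
To make the question concrete, here is a minimal userspace sketch of the
two control flows. It is a model only: struct model_rdev, struct
model_mddev, model_rdev_idle(), model_is_idle() and the early_return flag
are hypothetical stand-ins, not kernel code, and the sector counters are
plain fields instead of part_stat_read_accum(). It only shows that an
early return would skip refreshing the per-rdev snapshots; it does not
say which behaviour the patch intends.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-ins for struct md_rdev / struct mddev. */
struct model_rdev {
	unsigned long last_events;   /* snapshot taken at the previous check */
	unsigned long other_sectors; /* sectors completed on sibling partitions */
};

struct model_mddev {
	unsigned long last_events;   /* snapshot taken at the previous check */
	unsigned long sectors;       /* sectors completed on the array itself */
	unsigned int inflight;       /* normal IO currently in flight */
	struct model_rdev *rdevs;
	int nr_rdevs;
};

/* Mirrors the shape of is_rdev_holder_idle(): always refresh the snapshot. */
static bool model_rdev_idle(struct model_rdev *rdev, bool init)
{
	unsigned long last = rdev->last_events;

	rdev->last_events = rdev->other_sectors;
	return init || rdev->last_events <= last;
}

/*
 * Mirrors the shape of the new is_mddev_idle(). With early_return set, the
 * function returns as soon as the array-level check fails, so the per-rdev
 * snapshots are left untouched for that call.
 */
static bool model_is_idle(struct model_mddev *mddev, bool init, bool early_return)
{
	unsigned long last = mddev->last_events;
	bool idle = true;
	int i;

	mddev->last_events = mddev->sectors;
	if (!init && (mddev->last_events > last || mddev->inflight)) {
		idle = false;
		if (early_return)
			return idle;
	}

	for (i = 0; i < mddev->nr_rdevs; i++)
		if (!model_rdev_idle(&mddev->rdevs[i], init))
			idle = false;

	return idle;
}
```

In this model, if both the array and a sibling partition see IO between
two checks, the early-return variant leaves the rdev snapshot stale, so
the following check reports "not idle" once more even without new IO,
while the fall-through variant reports idle on that following check.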
--
Su
> - curr_events = (int)part_stat_read_accum(disk->part0, sectors) -
> - atomic_read(&disk->sync_io);
> - /* sync IO will cause sync_io to increase before the disk_stats
> - * as sync_io is counted when a request starts, and
> - * disk_stats is counted when it completes.
> - * So resync activity will cause curr_events to be smaller than
> - * when there was no such activity.
> - * non-sync IO will cause disk_stat to increase without
> - * increasing sync_io so curr_events will (eventually)
> - * be larger than it was before. Once it becomes
> - * substantially larger, the test below will cause
> - * the array to appear non-idle, and resync will slow
> - * down.
> - * If there is a lot of outstanding resync activity when
> - * we set last_event to curr_events, then all that activity
> - * completing might cause the array to appear non-idle
> - * and resync will be slowed down even though there might
> - * not have been non-resync activity. This will only
> - * happen once though. 'last_events' will soon reflect
> - * the state where there is little or no outstanding
> - * resync requests, and further resync activity will
> - * always make curr_events less than last_events.
> - *
> - */
> - if (init || curr_events - rdev->last_events > 64) {
> - rdev->last_events = curr_events;
> - idle = 0;
> - }
> - }
> + rcu_read_lock();
> + rdev_for_each_rcu(rdev, mddev)
> + if (!is_rdev_holder_idle(rdev, init))
> + idle = false;
> rcu_read_unlock();
> +
> return idle;
> }
>
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index b57842188f18..1d51c2405d3d 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -132,7 +132,7 @@ struct md_rdev {
>
> sector_t sectors; /* Device size (in 512bytes sectors) */
> struct mddev *mddev; /* RAID array if running */
> - int last_events; /* IO event timestamp */
> + unsigned long last_events; /* IO event timestamp */
>
> /*
> * If meta_bdev is non-NULL, it means that a separate device is
> @@ -520,6 +520,7 @@ struct mddev {
> * adding a spare
> */
>
> + unsigned long last_events; /* IO event timestamp */
> atomic_t recovery_active; /* blocks scheduled, but not written */
> wait_queue_head_t recovery_wait;
> sector_t recovery_cp;
Thread overview: 18+ messages
2025-04-18 1:09 [PATCH v2 0/5] md: fix is_mddev_idle() Yu Kuai
2025-04-18 1:09 ` [PATCH v2 1/5] block: cleanup and export bdev IO inflight APIs Yu Kuai
2025-04-21 11:59 ` Christoph Hellwig
2025-04-21 13:13 ` Yu Kuai
2025-04-22 6:11 ` Christoph Hellwig
2025-04-18 1:09 ` [PATCH v2 2/5] md: record dm-raid gendisk in mddev Yu Kuai
2025-04-22 6:00 ` Xiao Ni
2025-04-18 1:09 ` [PATCH v2 3/5] md: add a new api sync_io_depth Yu Kuai
2025-04-22 6:15 ` Xiao Ni
2025-04-18 1:09 ` [PATCH v2 4/5] md: fix is_mddev_idle() Yu Kuai
2025-04-19 1:42 ` Su Yue [this message]
2025-04-19 2:00 ` Yu Kuai
2025-04-19 5:03 ` Su Yue
2025-04-22 6:35 ` Xiao Ni
2025-04-27 1:37 ` Yu Kuai
2025-04-27 2:45 ` Xiao Ni
2025-04-18 1:09 ` [PATCH v2 5/5] md: cleanup accounting for issued sync IO Yu Kuai
2025-04-22 6:36 ` Xiao Ni