From: Hiroshi Nishida <nishidafmly@gmail.com>
To: Song Liu <song@kernel.org>, Yu Kuai <yukuai@fygo.io>
Cc: Li Nan <magiclinan@didiglobal.com>, Xiao Ni <xiao@kernel.org>,
linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org,
Hiroshi Nishida <nishidafmly@gmail.com>
Subject: [PATCH 5/8] md/raid5: submit a window of stripes during resync/recovery
Date: Wed, 24 Jun 2026 08:54:49 -0700
Message-ID: <20260624155452.211646-6-nishidafmly@gmail.com>
In-Reply-To: <20260624155452.211646-1-nishidafmly@gmail.com>

raid5_sync_request() dispatches one stripe per call: it fetches a single
stripe head, marks it for sync, and returns one stripe's worth of
sectors. When the stripe cache is full, the NOBLOCK fetch fails and the
function falls back to a blocking fetch followed by a one-jiffy throttle
sleep (schedule_timeout_uninterruptible(1)). Because that sleep is taken
per stripe, sustained cache
pressure bounds sync progress to roughly HZ stripes/second regardless of
how fast the member devices are.
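
To put a number on that bound, here is a minimal arithmetic sketch. It is
not kernel code: the helper name is hypothetical, and both the HZ values
and the 4 KiB per-device stripe unit are assumptions chosen for
illustration. It only multiplies "at most HZ sleeps per second" by "one
stripe per sleep".

```c
#include <stdint.h>

/*
 * Hypothetical helper: upper bound on per-device resync progress, in
 * KiB/s, when every dispatched stripe pays a one-jiffy sleep.
 *
 * hz         - jiffies per second (assumed CONFIG_HZ value)
 * stripe_kib - per-device stripe unit in KiB (assumed, e.g. 4)
 */
static uint64_t throttled_kib_per_sec(unsigned int hz, unsigned int stripe_kib)
{
	/* At most hz sleeps fit in one second, and each covers one stripe. */
	return (uint64_t)hz * stripe_kib;
}
```

Under these assumptions, HZ=250 caps progress at about 1000 KiB/s per
device and HZ=1000 at about 4000 KiB/s, independent of device speed.
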
Dispatch up to RAID5_SYNC_WINDOW (32) stripes per call instead. Only the
first stripe of the window keeps the original behaviour (block, then the
one-jiffy throttle if the cache was full); the remaining stripes are
requested with R5_GAS_NOBLOCK and the loop stops as soon as the cache is
full. So at most one throttle sleep is taken per window rather than per
stripe, and when the cache has free slots a single call can queue a batch
instead of one stripe at a time.

With a warm cache the window stays near full: counting raid5_sync_request()
invocations across a rebuild showed it averaging ~30 of the 32 stripes per
call, i.e. roughly 30x fewer calls into the sync path for the same resync.

The return value covers only the stripes actually submitted (submitted *
RAID5_STRIPE_SECTORS(conf) sectors), so
md_do_sync()'s recovery_active accounting stays balanced, and the window
is bounded by both the end of the sync region (max_sector) and
mddev->resync_max, so a user- or cluster-imposed sync ceiling is not
overshot.
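
The window bounds and the return value can be checked in isolation with
a small userspace model. This is a sketch, not the kernel code:
struct cache_model, try_get_stripe() and sync_window_model() are
hypothetical names, and the stripe cache is reduced to a free-slot
counter that is never replenished during a call. Only the loop condition
and the returned sector count follow the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define SYNC_WINDOW 32	/* mirrors RAID5_SYNC_WINDOW in the patch */

/* Hypothetical stand-in for the stripe cache: just a free-slot count. */
struct cache_model {
	unsigned int free_slots;
};

/* Models an R5_GAS_NOBLOCK fetch: succeeds only while a slot is free. */
static bool try_get_stripe(struct cache_model *c)
{
	if (c->free_slots == 0)
		return false;
	c->free_slots--;
	return true;
}

/* Returns the number of sectors one modelled call reports as submitted. */
static uint64_t sync_window_model(struct cache_model *c, uint64_t sector_nr,
				  uint64_t max_sector, uint64_t resync_max,
				  uint64_t stripe_sectors)
{
	unsigned int submitted;
	uint64_t win_sector;

	/*
	 * First stripe: the real code blocks until a stripe is available,
	 * so the model always counts it, even with no free slot.
	 */
	if (c->free_slots > 0)
		c->free_slots--;

	/* Remaining stripes: non-blocking, bounded by both ceilings. */
	win_sector = sector_nr + stripe_sectors;
	for (submitted = 1;
	     submitted < SYNC_WINDOW && win_sector < max_sector &&
	     win_sector < resync_max;
	     submitted++, win_sector += stripe_sectors) {
		if (!try_get_stripe(c))
			break;
	}

	return (uint64_t)submitted * stripe_sectors;
}
```

In this model, with 8-sector stripes and resync_max set to 24, a call
starting at sector 0 reports 24 sectors (stripes at 0, 8 and 16) however
many slots are free, and a call that finds the cache full reports exactly
one stripe.
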
This does not change which data is read or written during resync or
recovery.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Hiroshi Nishida <nishidafmly@gmail.com>
---
drivers/md/raid5.c | 47 +++++++++++++++++++++++++++++++++-------------
drivers/md/raid5.h | 1 +
2 files changed, 35 insertions(+), 13 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 9cb4ed3bd85c..8e9edaaca667 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6563,7 +6563,8 @@ static inline sector_t raid5_sync_request(struct mddev *mddev, sector_t sector_n
struct stripe_head *sh;
sector_t sync_blocks;
bool still_degraded = false;
- int i;
+ int i, submitted;
+ sector_t win_sector;
if (sector_nr >= max_sector) {
/* just being told to finish up .. nothing much to do */
@@ -6620,16 +6621,7 @@ static inline sector_t raid5_sync_request(struct mddev *mddev, sector_t sector_n
if (md_bitmap_enabled(mddev, false))
mddev->bitmap_ops->cond_end_sync(mddev, sector_nr, false);
- sh = raid5_get_active_stripe(conf, NULL, sector_nr,
- R5_GAS_NOBLOCK);
- if (sh == NULL) {
- sh = raid5_get_active_stripe(conf, NULL, sector_nr, 0);
- /* make sure we don't swamp the stripe cache if someone else
- * is trying to get access
- */
- schedule_timeout_uninterruptible(1);
- }
- /* Need to check if array will still be degraded after recovery/resync
+ /* Check once whether array will still be degraded after recovery/resync.
* Note in case of > 1 drive failures it's possible we're rebuilding
* one drive while leaving another faulty drive in array.
*/
@@ -6640,13 +6632,42 @@ static inline sector_t raid5_sync_request(struct mddev *mddev, sector_t sector_n
still_degraded = true;
}
+ /* First stripe: block if stripe cache is full, then throttle. */
+ sh = raid5_get_active_stripe(conf, NULL, sector_nr, R5_GAS_NOBLOCK);
+ if (sh == NULL) {
+ sh = raid5_get_active_stripe(conf, NULL, sector_nr, 0);
+ /* make sure we don't swamp the stripe cache if someone else
+ * is trying to get access
+ */
+ schedule_timeout_uninterruptible(1);
+ }
md_bitmap_start_sync(mddev, sector_nr, &sync_blocks, still_degraded);
set_bit(STRIPE_SYNC_REQUESTED, &sh->state);
set_bit(STRIPE_HANDLE, &sh->state);
-
raid5_release_stripe(sh);
- return RAID5_STRIPE_SECTORS(conf);
+ /* Submit remaining stripes in the window non-blocking. Stop early
+ * if the stripe cache is full: the disk queue is already saturated.
+ * Bound by resync_max so a user- or cluster-imposed sync ceiling is
+ * not overshot.
+ */
+ win_sector = sector_nr + RAID5_STRIPE_SECTORS(conf);
+ for (submitted = 1;
+ submitted < RAID5_SYNC_WINDOW && win_sector < max_sector &&
+ win_sector < mddev->resync_max;
+ submitted++, win_sector += RAID5_STRIPE_SECTORS(conf)) {
+ sh = raid5_get_active_stripe(conf, NULL, win_sector,
+ R5_GAS_NOBLOCK);
+ if (!sh)
+ break;
+ md_bitmap_start_sync(mddev, win_sector, &sync_blocks,
+ still_degraded);
+ set_bit(STRIPE_SYNC_REQUESTED, &sh->state);
+ set_bit(STRIPE_HANDLE, &sh->state);
+ raid5_release_stripe(sh);
+ }
+
+ return submitted * RAID5_STRIPE_SECTORS(conf);
}
static int retry_aligned_read(struct r5conf *conf, struct bio *raid_bio,
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 3efab71ebef7..7aeba1fc7f09 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -497,6 +497,7 @@ struct disk_info {
#define NR_HASH (PAGE_SIZE / sizeof(struct hlist_head))
#define HASH_MASK (NR_HASH - 1)
#define MAX_STRIPE_BATCH 8
+#define RAID5_SYNC_WINDOW 32 /* stripes to pre-submit per sync_request call */
/* NR_STRIPE_HASH_LOCKS must be a power of two, since
* STRIPE_HASH_LOCKS_MASK masks with (NR_STRIPE_HASH_LOCKS - 1).
--
2.43.0