From: Hiroshi Nishida <nishidafmly@gmail.com>
To: Song Liu <song@kernel.org>, Yu Kuai <yukuai@fygo.io>
Cc: Li Nan <magiclinan@didiglobal.com>, Xiao Ni <xiao@kernel.org>,
	linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org,
	Hiroshi Nishida <nishidafmly@gmail.com>
Subject: [PATCH 8/8] md/raid5: reserve stripe cache for user I/O during rebuild
Date: Wed, 24 Jun 2026 08:54:52 -0700
Message-ID: <20260624155452.211646-9-nishidafmly@gmail.com>
In-Reply-To: <20260624155452.211646-1-nishidafmly@gmail.com>

The resync read-ahead window (RAID5_SYNC_WINDOW) can fill the stripe
cache with rebuild stripes and starve concurrent user I/O, producing a
burst-starvation flip-flop between rebuild and application throughput.

Add two yield points to the window-submission loop:

- stop the window immediately if any thread is waiting for a stripe
  (waitqueue_active(&conf->wait_for_stripe)); the check is intentionally
  racy -- a waiter appearing just after is serviced by the next
  sync_request call, so no barrier is needed.

- stop expanding once active_stripes reaches half the cache
  (max_nr_stripes / RAID5_SYNC_HWMARK), but only when
  preread_active_stripes > 0, i.e. user write I/O is actually competing.
  Sync stripes never set STRIPE_PREREAD_ACTIVE, so during a pure rebuild
  the counter stays zero and the window fills freely; rebuild-only
  throughput is unchanged.

This bounds the share of the stripe cache a rebuild may hold while user
I/O is present, so application latency no longer collapses during the
read-ahead bursts, without throttling a rebuild that has the array to
itself.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Hiroshi Nishida <nishidafmly@gmail.com>
---
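The two yield conditions described above can be modeled in userspace
for review. This is a minimal sketch, not kernel code: struct
cache_model, sync_window_should_yield() and simulate_window() are
hypothetical names, the boolean stands in for
waitqueue_active(&conf->wait_for_stripe), and the model assumes each
submitted stripe occupies exactly one cache slot and that nothing else
changes the counters while the window is being filled.

```c
#include <stdbool.h>

#define RAID5_SYNC_WINDOW 32 /* stripes pre-submitted per sync_request call */
#define RAID5_SYNC_HWMARK 2  /* high-water mark divisor added by this patch */

/* Hypothetical stand-in for the r5conf fields the new checks read. */
struct cache_model {
	bool stripe_waiters;        /* models waitqueue_active(&conf->wait_for_stripe) */
	int preread_active_stripes; /* stripes queued for user write I/O */
	int active_stripes;         /* stripes currently in use */
	int max_nr_stripes;         /* stripe cache size */
};

/* True when the window loop should stop before taking another stripe. */
static bool sync_window_should_yield(const struct cache_model *c)
{
	/* Yield point 1: someone is already waiting for a stripe. */
	if (c->stripe_waiters)
		return true;

	/* Yield point 2: user writes compete and the cache is half full. */
	return c->preread_active_stripes > 0 &&
	       c->active_stripes >= c->max_nr_stripes / RAID5_SYNC_HWMARK;
}

/* Simulate one window submission; returns the number of stripes taken. */
static int simulate_window(struct cache_model *c)
{
	int submitted;

	for (submitted = 0; submitted < RAID5_SYNC_WINDOW; submitted++) {
		if (sync_window_should_yield(c))
			break;
		/*
		 * Assumption: a full cache models the non-blocking stripe
		 * lookup (R5_GAS_NOBLOCK) returning NULL.
		 */
		if (c->active_stripes >= c->max_nr_stripes)
			break;
		c->active_stripes++;
	}
	return submitted;
}
```

In this model, a 256-stripe cache with 120 active stripes and one
competing write stops after 8 stripes, at the 128-stripe mark. The same
cache with preread_active_stripes == 0 takes the full 32-stripe window.
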
 drivers/md/raid5.c | 21 +++++++++++++++++++++
 drivers/md/raid5.h |  1 +
 2 files changed, 22 insertions(+)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index ad6230415af3..480f3aa069ef 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6656,6 +6656,27 @@ static inline sector_t raid5_sync_request(struct mddev *mddev, sector_t sector_n
 	     submitted < RAID5_SYNC_WINDOW && win_sector < max_sector &&
 	     win_sector < mddev->resync_max;
 	     submitted++, win_sector += RAID5_STRIPE_SECTORS(conf)) {
+		/*
+		 * Yield to user I/O: stop the read-ahead if anyone is waiting
+		 * for a stripe. The check is intentionally racy -- a waiter
+		 * appearing just after is serviced by the next sync_request
+		 * call, so no barrier is needed.
+		 */
+		if (waitqueue_active(&conf->wait_for_stripe))
+			break;
+		/*
+		 * Reserve cache for user I/O only when it is actually competing.
+		 * preread_active_stripes counts stripes queued for write I/O
+		 * (including the read phase of RMW); sync stripes never set
+		 * STRIPE_PREREAD_ACTIVE, so during a pure rebuild it stays zero
+		 * and the window fills freely. Competing user reads do not bump
+		 * the counter but are caught by the waitqueue_active() check
+		 * above.
+		 */
+		if (atomic_read(&conf->preread_active_stripes) > 0 &&
+		    atomic_read(&conf->active_stripes) >=
+		    conf->max_nr_stripes / RAID5_SYNC_HWMARK)
+			break;
 		sh = raid5_get_active_stripe(conf, NULL, win_sector,
 					     R5_GAS_NOBLOCK);
 		if (!sh)
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 1f37dabd727b..7833cc07597f 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -499,6 +499,7 @@ struct disk_info {
 #define MAX_STRIPE_BATCH 32 /* stripes per handle_active_stripes pass */
 #define STRIPE_BATCH_WORKERS 8 /* stripes-per-worker threshold for spawning */
 #define RAID5_SYNC_WINDOW 32 /* stripes to pre-submit per sync_request call */
+#define RAID5_SYNC_HWMARK 2 /* rebuild uses at most 1/N of stripe cache */

 /* NR_STRIPE_HASH_LOCKS must be a power of two, since
  * STRIPE_HASH_LOCKS_MASK masks with (NR_STRIPE_HASH_LOCKS - 1).
--
2.43.0