From: Hiroshi Nishida <nishidafmly@gmail.com>
To: Song Liu <song@kernel.org>, Yu Kuai <yukuai@fygo.io>
Cc: Li Nan <magiclinan@didiglobal.com>, Xiao Ni <xiao@kernel.org>,
linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org,
Hiroshi Nishida <nishidafmly@gmail.com>
Subject: [PATCH 0/8] md/raid5: scalability and rebuild-path improvements
Date: Wed, 24 Jun 2026 08:54:44 -0700
Message-ID: <20260624155452.211646-1-nishidafmly@gmail.com>

This series collects small, individually low-risk md/raid5 changes for
large, many-core, many-disk arrays. Their common theme is reducing
per-stripe and stripe-cache contention, so the benefit appears mainly
when the raid5 stripe-handling worker threads are in use
(group_thread_cnt > 0); at the default group_thread_cnt = 0 (a single
handling thread) the series is essentially neutral.

 - patches 1-3 remove signed arithmetic from a hot-path divisor, lift an
   arbitrary stripe-cache size cap, and widen a badblock length argument
   that currently truncates large ranges;
 - patch 4 raises NR_STRIPE_HASH_LOCKS (8 -> 32) to spread stripe-hash
   contention on high core-count systems;
 - patches 5 and 8 reduce per-stripe overhead in the resync/recovery
   path and bound the share of the stripe cache a rebuild may hold while
   user I/O is competing;
 - patch 6 allocates each worker group's array on its own NUMA node;
 - patch 7 raises MAX_STRIPE_BATCH (8 -> 32).

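As an illustration of the truncation that patch 3 addresses, the sketch below shows what happens when a 64-bit sector count is passed through a 32-bit signed C int. It is not md code: the helper name as_c_int and the example lengths are hypothetical, and it assumes a 32-bit int and a 64-bit sector_t.

```python
import ctypes

def as_c_int(sectors: int) -> int:
    """Value a callee sees when its parameter is a 32-bit signed C int.

    ctypes integer types do no overflow checking, so the value wraps
    the same way an implicit narrowing conversion would in C.
    """
    return ctypes.c_int32(sectors).value

# Hypothetical range lengths in 512-byte sectors, chosen only to illustrate.
small = 8             # 4 KiB, fits in an int
wrapped = 2**32 + 8   # just over 2 TiB, high bits are lost
negative = 2**31      # exactly 1 TiB, becomes a negative int

for length in (small, wrapped, negative):
    print(f"{length:>12} sectors -> int parameter sees {as_c_int(length)}")
```

With a sector_t parameter the full 64-bit length is preserved, so neither the wrap to a small value nor the wrap to a negative one can occur.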
Measured effect, treatment vs baseline, % change in mean IOPS (N=3),
swept over group_thread_cnt (RAID6 4+2, 22-core host, ramdisk members):

  workload                    gtc=0    gtc=2    gtc=4    gtc=8
  random 4K write (RMW)       +4.2%    +8.1%   +17.4%    +6.5%
  DB mixed 75/25 8K           +0.4%    +4.2%   +10.3%    +4.7%
  high-concurrency 70/30 4K   +3.9%    +1.2%   +10.0%    +0.2%
  OLTP 70/30 16K              -0.3%    +4.7%   +10.1%    +9.3%
  partial-stripe write 8K     +1.1%    +4.8%   +11.2%   +14.2%

At the default single handling thread (group_thread_cnt = 0) the series is
essentially neutral (-0.3% to +4.2%, no regression). As worker threads are
added the gain grows, peaking broadly around group_thread_cnt = 4 at
roughly +10-17% across the whole mix. At group_thread_cnt = 8 only the
partial-stripe write workload gains further (+14.2%) and OLTP roughly
holds (+9.3%); random 4K write and DB mixed fall back to +6.5% and +4.7%,
and the high-concurrency 70/30 case has saturated (+0.2%). (Per-run cv
was <1% except the random-write test, ~5-9%, from a cold first run.)
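
For readers who want to recompute the two statistics quoted above, here is a minimal sketch of percent change in mean IOPS and the per-run coefficient of variation. The function names and the sample values are hypothetical (they are not measurements from this series), and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev

def pct_change_in_mean(baseline, treatment):
    """Percent change of the treatment mean relative to the baseline mean."""
    base = mean(baseline)
    return (mean(treatment) - base) / base * 100.0

def coefficient_of_variation(runs):
    """Sample standard deviation divided by the mean, in percent."""
    return stdev(runs) / mean(runs) * 100.0

# Hypothetical IOPS from N=3 runs each; NOT data from this patch series.
baseline = [100_000, 101_000, 99_000]
treatment = [110_000, 111_100, 108_900]

print(f"change in mean IOPS: {pct_change_in_mean(baseline, treatment):+.1f}%")
print(f"baseline cv:  {coefficient_of_variation(baseline):.2f}%")
print(f"treatment cv: {coefficient_of_variation(treatment):.2f}%")
```

With only three runs per cell the cv is a coarse estimate, which is worth keeping in mind when comparing cells whose difference is a few percent.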
These numbers are on a ramdisk, which removes device latency and so
overstates the CPU-side contention effect relative to a real device;
they show the direction and the group_thread_cnt dependence, not an
absolute speedup. The stripe-hash/batch patches (4, 7) and the cache cap
(2) drive this; patch 6 only matters on multi-socket systems (not
exercised above) and patches 5/8 act on the resync/recovery path rather
than this steady-state workload.

Reproduction (stock mdadm + fio):

  mdadm --create /dev/md0 --level=6 --raid-devices=6 --chunk=512 \
        --assume-clean <6 members>
  echo 16384 > /sys/block/md0/md/stripe_cache_size
  echo N > /sys/block/md0/md/group_thread_cnt    # N = 0,2,4,8
  fio --filename=/dev/md0 --direct=1 --ioengine=libaio --group_reporting \
      --time_based --runtime=15 --name=w <per-workload opts>

Per-workload opts:

  random write   : --rw=randwrite --bs=4k --numjobs=4 --iodepth=32
  DB mixed       : --rw=randrw --rwmixread=75 --bs=8k --numjobs=8 --iodepth=16
  high-concur.   : --rw=randrw --rwmixread=70 --bs=4k --numjobs=16 --iodepth=8
  OLTP           : --rw=randrw --rwmixread=70 --bs=16k --numjobs=6 --iodepth=16
  partial-stripe : --rw=randwrite --bs=8k --numjobs=4 --iodepth=32

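
The sweep over group_thread_cnt and workloads can be enumerated mechanically. The sketch below only builds the command lines from the options listed above and prints them; it does not run anything. The helper name sweep_commands and the ordering of the sweep are assumptions, not a description of the harness actually used.

```python
import shlex

# Common fio arguments, copied from the reproduction recipe.
COMMON = [
    "fio", "--filename=/dev/md0", "--direct=1", "--ioengine=libaio",
    "--group_reporting", "--time_based", "--runtime=15", "--name=w",
]

# Per-workload options, copied from the list in the recipe.
WORKLOADS = {
    "random write": "--rw=randwrite --bs=4k --numjobs=4 --iodepth=32",
    "DB mixed": "--rw=randrw --rwmixread=75 --bs=8k --numjobs=8 --iodepth=16",
    "high-concur.": "--rw=randrw --rwmixread=70 --bs=4k --numjobs=16 --iodepth=8",
    "OLTP": "--rw=randrw --rwmixread=70 --bs=16k --numjobs=6 --iodepth=16",
    "partial-stripe": "--rw=randwrite --bs=8k --numjobs=4 --iodepth=32",
}

GROUP_THREAD_COUNTS = [0, 2, 4, 8]

def sweep_commands():
    """Yield shell command strings: one sysfs write per thread count,
    followed by one fio invocation per workload."""
    for gtc in GROUP_THREAD_COUNTS:
        yield f"echo {gtc} > /sys/block/md0/md/group_thread_cnt"
        for opts in WORKLOADS.values():
            yield shlex.join(COMMON + shlex.split(opts))

if __name__ == "__main__":
    for command in sweep_commands():
        print(command)
```

Repeating the printed sequence three times, once on the baseline kernel and once on the patched one, would give the N=3 samples per cell that the table summarizes.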
Each patch stands on its own; I am happy to drop or defer any that is not
justified on its own merit.
Functional testing on RAID5 and RAID6: create, fail a member, rebuild
onto a spare / re-add, full data read-back verified, and scrub
("check") reporting mismatch_cnt == 0. The series was also exercised
with KASAN and lockdep enabled -- including heavy group_thread_cnt
churn on a multi-node setup to stress the per-NUMA-node worker
allocation and the raid5_quiesce hash-lock-all path -- with no reports.

Hiroshi Nishida (8):
  md: change chunk_sectors and stripe cache counts to unsigned int
  md/raid5: raise stripe cache limit from 32768 to 262144
  md: widen badblock sectors param from int to sector_t
  md/raid5: raise NR_STRIPE_HASH_LOCKS from 8 to 32
  md/raid5: submit a window of stripes during resync/recovery
  md/raid5: allocate worker groups per NUMA node
  md/raid5: raise MAX_STRIPE_BATCH from 8 to 32
  md/raid5: reserve stripe cache for user I/O during rebuild

 drivers/md/md.c    |   4 +-
 drivers/md/md.h    |  10 ++--
 drivers/md/raid5.c | 129 ++++++++++++++++++++++++++++++++-------------
 drivers/md/raid5.h |  33 ++++++++----
 4 files changed, 121 insertions(+), 55 deletions(-)

base-commit: 55b77337bdd088c77461588e5ec094421b89911b
--
2.43.0