Linux RAID subsystem development
* [PATCH 0/8] md/raid5: scalability and rebuild-path improvements
@ 2026-06-24 15:54 Hiroshi Nishida
  2026-06-24 15:54 ` [PATCH 1/8] md: change chunk_sectors and stripe cache counts to unsigned int Hiroshi Nishida
                   ` (7 more replies)
  0 siblings, 8 replies; 19+ messages in thread
From: Hiroshi Nishida @ 2026-06-24 15:54 UTC (permalink / raw)
  To: Song Liu, Yu Kuai
  Cc: Li Nan, Xiao Ni, linux-raid, linux-kernel, Hiroshi Nishida

This series collects small, individually low-risk md/raid5 changes for
large, many-core, many-disk arrays.  Their common theme is reducing
per-stripe and stripe-cache contention, so the benefit appears mainly
when the raid5 stripe-handling worker threads are in use
(group_thread_cnt > 0); at the default group_thread_cnt = 0 (a single
handling thread) the series is essentially neutral.

 - patches 1-3 remove signed arithmetic from a hot-path divisor, raise the
   stripe-cache size cap (32768 -> 262144), and widen a badblock length
   argument that currently truncates large ranges;
 - patch 4 raises NR_STRIPE_HASH_LOCKS (8 -> 32) to spread stripe-hash
   contention on high core-count systems;
 - patches 5 and 8 reduce per-stripe overhead in the resync/recovery
   path and bound the share of the stripe cache a rebuild may hold while
   user I/O is competing;
 - patch 6 allocates each worker group's array on its own NUMA node;
 - patch 7 raises MAX_STRIPE_BATCH (8 -> 32).
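
As a rough illustration of why more hash locks spread contention (patch 4),
the sketch below counts how many of a run of consecutive stripes share one
lock.  It assumes the lock index is simply the stripe number masked by
(number of locks - 1); lock_index and stripes_on_lock are hypothetical
helper names for this sketch, not functions from the series.

```shell
#!/bin/sh
# Illustrative only. ASSUMPTION: the hash-lock index is the stripe number
# masked by (NLOCKS - 1), with NLOCKS a power of two.

# lock_index STRIPE NLOCKS -> index of the lock that would guard STRIPE
lock_index() {
    echo $(( $1 & ($2 - 1) ))
}

# stripes_on_lock LOCK NSTRIPES NLOCKS
#   -> how many of stripes 0..NSTRIPES-1 map to lock LOCK
stripes_on_lock() {
    count=0
    s=0
    while [ "$s" -lt "$2" ]; do
        if [ "$(lock_index "$s" "$3")" -eq "$1" ]; then
            count=$((count + 1))
        fi
        s=$((s + 1))
    done
    echo "$count"
}

echo "64 stripes,  8 locks: $(stripes_on_lock 0 64 8) stripes on lock 0"
echo "64 stripes, 32 locks: $(stripes_on_lock 0 64 32) stripes on lock 0"
```

Under that assumption, 64 consecutive stripes put 8 stripes behind each lock
with 8 locks and 2 with 32 locks, so fewer threads queue on any one lock.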

Measured effect, treatment vs baseline, as % change in mean IOPS (N=3),
swept over group_thread_cnt (gtc; RAID6 4+2, 22-core host, ramdisk members):

  workload                       gtc=0   gtc=2   gtc=4   gtc=8
  random 4K write (RMW)          +4.2%   +8.1%  +17.4%   +6.5%
  DB mixed 75/25 8K              +0.4%   +4.2%  +10.3%   +4.7%
  high-concurrency 70/30 4K      +3.9%   +1.2%  +10.0%   +0.2%
  OLTP 70/30 16K                 -0.3%   +4.7%  +10.1%   +9.3%
  partial-stripe write 8K        +1.1%   +4.8%  +11.2%  +14.2%

At the default single handling thread (group_thread_cnt = 0) the series is
essentially neutral (-0.3% to +4.2%, no meaningful regression).  As worker
threads are added the gain grows, peaking at group_thread_cnt = 4 at
roughly +10-17% across the whole mix.  At gtc = 8 only the partial-stripe
write case keeps gaining (+14.2%); the other workloads fall back, and the
read-heavy high-concurrency case returns to neutral (+0.2%).  (Per-run cv
was <1% except the random-write test, ~5-9%, from a cold first run.)
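
For readers who want to derive the same kind of figures from their own fio
runs, here is a minimal sketch of the arithmetic.  It assumes cv means sample
standard deviation (n-1) divided by the mean; the helper names (mean, cv_pct,
pct_change) and the IOPS values are invented for illustration and are not the
measurements behind the table.

```shell
#!/bin/sh
# Sketch of the summary arithmetic. The IOPS numbers below are made up.

# mean V1 V2 ... -> arithmetic mean, one decimal
mean() {
    printf '%s\n' "$@" | awk '{ s += $1 } END { printf "%.1f\n", s / NR }'
}

# cv_pct V1 V2 ... -> coefficient of variation in percent
# ASSUMPTION: sample standard deviation (divide by n-1) over the mean.
cv_pct() {
    printf '%s\n' "$@" | awk '
        { v[NR] = $1; s += $1 }
        END {
            m = s / NR
            for (i = 1; i <= NR; i++) ss += (v[i] - m) ^ 2
            printf "%.2f\n", 100 * sqrt(ss / (NR - 1)) / m
        }'
}

# pct_change BASE TREAT -> signed percent change of TREAT relative to BASE
pct_change() {
    awk -v b="$1" -v t="$2" 'BEGIN { printf "%+.1f%%\n", 100 * (t - b) / b }'
}

base=$(mean 100000 101000 99000)     # three hypothetical baseline runs
treat=$(mean 110000 111000 109000)   # three hypothetical treatment runs
echo "change: $(pct_change "$base" "$treat")"
echo "baseline cv: $(cv_pct 100000 101000 99000)%"
```

With N=3 the cv is a coarse estimate, so a single cold run can dominate it,
as noted for the random-write test.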

These numbers are on a ramdisk, which removes device latency and so
overstates the CPU-side contention effect relative to a real device;
they show the direction and the group_thread_cnt dependence, not an
absolute speedup.  The stripe-hash/batch patches (4, 7) and the cache cap
(2) drive this; patch 6 only matters on multi-socket systems (not
exercised above) and patches 5/8 act on the resync/recovery path rather
than this steady-state workload.

Reproduction (stock mdadm + fio):
  mdadm --create /dev/md0 --level=6 --raid-devices=6 --chunk=512 \
        --assume-clean <6 members>
  echo 16384 > /sys/block/md0/md/stripe_cache_size
  echo N     > /sys/block/md0/md/group_thread_cnt      # N = 0,2,4,8
  fio --filename=/dev/md0 --direct=1 --ioengine=libaio --group_reporting \
      --time_based --runtime=15 --name=w <per-workload opts>
  where <per-workload opts> is one of:
    random write   : --rw=randwrite              --bs=4k  --numjobs=4  --iodepth=32
    DB mixed       : --rw=randrw --rwmixread=75  --bs=8k  --numjobs=8  --iodepth=16
    high-concur.   : --rw=randrw --rwmixread=70  --bs=4k  --numjobs=16 --iodepth=8
    OLTP           : --rw=randrw --rwmixread=70  --bs=16k --numjobs=6  --iodepth=16
    partial-stripe : --rw=randwrite              --bs=8k  --numjobs=4  --iodepth=32
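
The sweep above can be scripted.  The following is a dry-run sketch that only
prints the commands for each group_thread_cnt value and workload, so it can be
reviewed before anything touches /dev/md0.  The helper names (fio_opts,
print_sweep) and the short workload keys are hypothetical; the options are the
ones listed above.  Each printed fio line would be repeated three times to
match N=3.

```shell
#!/bin/sh
# Dry run: prints the commands, does not execute mdadm, fio or sysfs writes.

# fio_opts KEY -> per-workload fio options (keys are shorthand for this sketch)
fio_opts() {
    case "$1" in
        randwrite) echo "--rw=randwrite --bs=4k --numjobs=4 --iodepth=32" ;;
        dbmixed)   echo "--rw=randrw --rwmixread=75 --bs=8k --numjobs=8 --iodepth=16" ;;
        highconc)  echo "--rw=randrw --rwmixread=70 --bs=4k --numjobs=16 --iodepth=8" ;;
        oltp)      echo "--rw=randrw --rwmixread=70 --bs=16k --numjobs=6 --iodepth=16" ;;
        partial)   echo "--rw=randwrite --bs=8k --numjobs=4 --iodepth=32" ;;
        *)         echo "unknown workload: $1" >&2; return 1 ;;
    esac
}

# print_sweep -> one sysfs write plus five fio commands per group_thread_cnt
print_sweep() {
    for gtc in 0 2 4 8; do
        echo "echo $gtc > /sys/block/md0/md/group_thread_cnt"
        for w in randwrite dbmixed highconc oltp partial; do
            echo "fio --filename=/dev/md0 --direct=1 --ioengine=libaio" \
                 "--group_reporting --time_based --runtime=15 --name=w" \
                 "$(fio_opts "$w")"
        done
    done
}

print_sweep
```

Piping the output through sh (as root, on a scratch array) would run the
sweep; keeping the default as a printout avoids accidental writes to a live
device.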

Each patch stands on its own; I am happy to drop or defer any that is not
justified on its own merit.

Functional testing on RAID5 and RAID6: create, fail a member, rebuild
onto a spare / re-add, full data read-back verified, and scrub
("check") reporting mismatch_cnt == 0.  The series was also exercised
with KASAN and lockdep enabled -- including heavy group_thread_cnt
churn on a multi-node setup to stress the per-NUMA-node worker
allocation and the raid5_quiesce hash-lock-all path -- with no reports.

Hiroshi Nishida (8):
  md: change chunk_sectors and stripe cache counts to unsigned int
  md/raid5: raise stripe cache limit from 32768 to 262144
  md: widen badblock sectors param from int to sector_t
  md/raid5: raise NR_STRIPE_HASH_LOCKS from 8 to 32
  md/raid5: submit a window of stripes during resync/recovery
  md/raid5: allocate worker groups per NUMA node
  md/raid5: raise MAX_STRIPE_BATCH from 8 to 32
  md/raid5: reserve stripe cache for user I/O during rebuild

 drivers/md/md.c    |   4 +-
 drivers/md/md.h    |  10 ++--
 drivers/md/raid5.c | 129 ++++++++++++++++++++++++++++++++-------------
 drivers/md/raid5.h |  33 ++++++++----
 4 files changed, 121 insertions(+), 55 deletions(-)

base-commit: 55b77337bdd088c77461588e5ec094421b89911b

-- 
2.43.0



Thread overview: 19+ messages
2026-06-24 15:54 [PATCH 0/8] md/raid5: scalability and rebuild-path improvements Hiroshi Nishida
2026-06-24 15:54 ` [PATCH 1/8] md: change chunk_sectors and stripe cache counts to unsigned int Hiroshi Nishida
2026-06-24 16:16   ` sashiko-bot
2026-06-24 17:25     ` Hiroshi Nishida
2026-06-24 15:54 ` [PATCH 2/8] md/raid5: raise stripe cache limit from 32768 to 262144 Hiroshi Nishida
2026-06-24 15:54 ` [PATCH 3/8] md: widen badblock sectors param from int to sector_t Hiroshi Nishida
2026-06-24 15:54 ` [PATCH 4/8] md/raid5: raise NR_STRIPE_HASH_LOCKS from 8 to 32 Hiroshi Nishida
2026-06-24 15:54 ` [PATCH 5/8] md/raid5: submit a window of stripes during resync/recovery Hiroshi Nishida
2026-06-24 16:12   ` sashiko-bot
2026-06-24 17:13     ` Hiroshi Nishida
2026-06-24 15:54 ` [PATCH 6/8] md/raid5: allocate worker groups per NUMA node Hiroshi Nishida
2026-06-24 16:07   ` sashiko-bot
2026-06-24 16:53     ` Hiroshi Nishida
2026-06-24 15:54 ` [PATCH 7/8] md/raid5: raise MAX_STRIPE_BATCH from 8 to 32 Hiroshi Nishida
2026-06-24 16:09   ` sashiko-bot
2026-06-24 17:01     ` Hiroshi Nishida
2026-06-24 15:54 ` [PATCH 8/8] md/raid5: reserve stripe cache for user I/O during rebuild Hiroshi Nishida
2026-06-24 16:12   ` sashiko-bot
2026-06-24 17:25     ` Hiroshi Nishida
