[PATCH 0/8] md/raid5: scalability and rebuild-path improvements


From: Hiroshi Nishida <nishidafmly@gmail.com>
To: Song Liu <song@kernel.org>, Yu Kuai <yukuai@fygo.io>
Cc: Li Nan <magiclinan@didiglobal.com>, Xiao Ni <xiao@kernel.org>,
	linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org,
	Hiroshi Nishida <nishidafmly@gmail.com>
Subject: [PATCH 0/8] md/raid5: scalability and rebuild-path improvements
Date: Wed, 24 Jun 2026 08:54:44 -0700
Message-ID: <20260624155452.211646-1-nishidafmly@gmail.com>

This series collects small, individually low-risk md/raid5 changes for
large, many-core, many-disk arrays.  Their common theme is reducing
per-stripe and stripe-cache contention, so the benefit appears mainly
when the raid5 stripe-handling worker threads are in use
(group_thread_cnt > 0); at the default group_thread_cnt = 0 (a single
handling thread) the series is essentially neutral.

 - patches 1-3 remove signed arithmetic from a hot-path divisor, lift an
   arbitrary stripe-cache size cap, and widen a badblock length argument
   that currently truncates large ranges;
 - patch 4 raises NR_STRIPE_HASH_LOCKS (8 -> 32) to spread stripe-hash
   contention on high core-count systems;
 - patches 5 and 8 reduce per-stripe overhead in the resync/recovery
   path and bound the share of the stripe cache a rebuild may hold while
   user I/O is competing;
 - patch 6 allocates each worker group's array on its own NUMA node;
 - patch 7 raises MAX_STRIPE_BATCH (8 -> 32).
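
To illustrate the truncation that patch 3 addresses, here is a minimal
sketch of what happens when a 64-bit sector count is passed through a
32-bit signed int.  It assumes a 32-bit int with two's-complement
wraparound; truncate_to_s32 is a hypothetical helper for illustration,
not a kernel function, and the sample lengths are made up.

```python
def truncate_to_s32(value: int) -> int:
    """Model passing a 64-bit count through a 32-bit signed int."""
    low = value & 0xFFFFFFFF  # only the low 32 bits survive
    # Reinterpret bit 31 as the sign bit (two's complement).
    return low - (1 << 32) if low & 0x80000000 else low


# Illustrative range lengths in 512-byte sectors (made-up values).
for sectors in (1024, 2**31, 2**32 + 8):
    print(f"{sectors:>12} sectors -> int sees {truncate_to_s32(sectors)}")
```

Small lengths pass through unchanged; a length of 2^31 sectors or more
either turns negative or silently loses its high bits.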

Measured effect, treatment vs baseline, % change in mean IOPS (N=3),
swept over group_thread_cnt (RAID6 4+2, 22-core host, ramdisk members):

  workload                       gtc=0   gtc=2   gtc=4   gtc=8
  random 4K write (RMW)          +4.2%   +8.1%  +17.4%   +6.5%
  DB mixed 75/25 8K              +0.4%   +4.2%  +10.3%   +4.7%
  high-concurrency 70/30 4K      +3.9%   +1.2%  +10.0%   +0.2%
  OLTP 70/30 16K                 -0.3%   +4.7%  +10.1%   +9.3%
  partial-stripe write 8K        +1.1%   +4.8%  +11.2%  +14.2%

At the default single handling thread (group_thread_cnt = 0) the series is
essentially neutral (-0.3% to +4.2%, no meaningful regression).  As worker
threads are added the gain grows, peaking for most workloads around
group_thread_cnt = 4 at roughly +10-17% across the whole mix.  At gtc = 8
the partial-stripe write case keeps gaining (+14.2%) and OLTP roughly
holds (+9.3%), while random 4K write falls back to +6.5% and the
read-heavy high-concurrency case has saturated (+0.2%).  (Per-run cv was
<1% except the random-write test, ~5-9%, from a cold first run.)
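
For reference, the two statistics quoted above (% change in mean IOPS
and per-run cv) can be computed as in the sketch below.  It assumes cv
is the sample standard deviation divided by the mean, which the text
does not state; the IOPS values are hypothetical placeholders, not
measurements from this series.

```python
from statistics import mean, stdev


def pct_change(baseline, treatment):
    """Percent change in mean IOPS, treatment vs baseline."""
    base = mean(baseline)
    return (mean(treatment) - base) / base * 100.0


def cv_percent(runs):
    """Coefficient of variation in percent (sample stdev / mean)."""
    return stdev(runs) / mean(runs) * 100.0


# Hypothetical IOPS from N=3 runs each (placeholders, not real results).
baseline = [100_000, 101_000, 99_000]
treatment = [110_000, 111_000, 109_000]
print(f"change:      {pct_change(baseline, treatment):+.1f}%")
print(f"baseline cv: {cv_percent(baseline):.2f}%")
```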

These numbers are on a ramdisk, which removes device latency and so
overstates the CPU-side contention effect relative to a real device;
they show the direction and the group_thread_cnt dependence, not an
absolute speedup.  The stripe-hash/batch patches (4, 7) and the cache cap
(2) drive this; patch 6 only matters on multi-socket systems (not
exercised above) and patches 5/8 act on the resync/recovery path rather
than this steady-state workload.
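
As a rough model of why patch 4 helps, the sketch below counts how many
stripes share each hash lock for 8 vs 32 locks.  It assumes the lock is
picked by masking the stripe number with (number of locks - 1) and that
a stripe unit is 8 sectors; both are assumptions of this sketch, and
lock_index / stripes_per_lock are hypothetical names, not the kernel's.

```python
from collections import Counter

STRIPE_SHIFT = 3  # assumption: 8 sectors (4 KiB) per stripe unit


def lock_index(sector: int, nr_locks: int) -> int:
    """Pick a hash lock by masking the stripe number (nr_locks = 2^k)."""
    return (sector >> STRIPE_SHIFT) & (nr_locks - 1)


def stripes_per_lock(sectors, nr_locks: int) -> Counter:
    """Count how many of the given stripes land on each lock."""
    return Counter(lock_index(s, nr_locks) for s in sectors)


# 1024 consecutive stripes, addressed by their starting sector.
sectors = [i << STRIPE_SHIFT for i in range(1024)]
for nr_locks in (8, 32):
    counts = stripes_per_lock(sectors, nr_locks)
    print(f"{nr_locks:>2} locks: {max(counts.values())} stripes per lock")
```

With four times as many locks, each lock covers a quarter as many
stripes, so threads working on different stripes collide less often.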

Reproduction (stock mdadm + fio):
  mdadm --create /dev/md0 --level=6 --raid-devices=6 --chunk=512 \
        --assume-clean <6 members>
  echo 16384 > /sys/block/md0/md/stripe_cache_size
  echo N     > /sys/block/md0/md/group_thread_cnt      # N = 0,2,4,8
  fio --filename=/dev/md0 --direct=1 --ioengine=libaio --group_reporting \
      --time_based --runtime=15 --name=w <per-workload opts>:
    random write   : --rw=randwrite              --bs=4k  --numjobs=4  --iodepth=32
    DB mixed       : --rw=randrw --rwmixread=75  --bs=8k  --numjobs=8  --iodepth=16
    high-concur.   : --rw=randrw --rwmixread=70  --bs=4k  --numjobs=16 --iodepth=8
    OLTP           : --rw=randrw --rwmixread=70  --bs=16k --numjobs=6  --iodepth=16
    partial-stripe : --rw=randwrite              --bs=8k  --numjobs=4  --iodepth=32
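
The sweep above can be written down as a small dry-run planner.  This
is only a sketch: it prints the commands rather than running them, uses
only the flags listed above, and does not model the N=3 repeats;
fio_argv and sweep_plan are hypothetical helper names.

```python
COMMON = ["fio", "--filename=/dev/md0", "--direct=1", "--ioengine=libaio",
          "--group_reporting", "--time_based", "--runtime=15", "--name=w"]

# Per-workload options, copied from the list above.
WORKLOADS = {
    "random write": ["--rw=randwrite", "--bs=4k", "--numjobs=4",
                     "--iodepth=32"],
    "DB mixed": ["--rw=randrw", "--rwmixread=75", "--bs=8k",
                 "--numjobs=8", "--iodepth=16"],
    "high-concur.": ["--rw=randrw", "--rwmixread=70", "--bs=4k",
                     "--numjobs=16", "--iodepth=8"],
    "OLTP": ["--rw=randrw", "--rwmixread=70", "--bs=16k",
             "--numjobs=6", "--iodepth=16"],
    "partial-stripe": ["--rw=randwrite", "--bs=8k", "--numjobs=4",
                       "--iodepth=32"],
}

GTC_VALUES = (0, 2, 4, 8)
GTC_SYSFS = "/sys/block/md0/md/group_thread_cnt"


def fio_argv(workload: str) -> list:
    """Full fio argument vector for one named workload."""
    return COMMON + WORKLOADS[workload]


def sweep_plan():
    """Yield (gtc, workload, argv) in the order the runs would be issued."""
    for gtc in GTC_VALUES:
        for name in WORKLOADS:
            yield gtc, name, fio_argv(name)


if __name__ == "__main__":
    # Dry run: print what would be executed, do not touch the array.
    for gtc, name, argv in sweep_plan():
        print(f"echo {gtc} > {GTC_SYSFS}; {' '.join(argv)}")
```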

Each patch stands on its own; I am happy to drop or defer any that is not
justified on its own merit.

Functional testing on RAID5 and RAID6: create, fail a member, rebuild
onto a spare / re-add, full data read-back verified, and scrub
("check") reporting mismatch_cnt == 0.  The series was also exercised
with KASAN and lockdep enabled -- including heavy group_thread_cnt
churn on a multi-node setup to stress the per-NUMA-node worker
allocation and the raid5_quiesce hash-lock-all path -- with no reports.

Hiroshi Nishida (8):
  md: change chunk_sectors and stripe cache counts to unsigned int
  md/raid5: raise stripe cache limit from 32768 to 262144
  md: widen badblock sectors param from int to sector_t
  md/raid5: raise NR_STRIPE_HASH_LOCKS from 8 to 32
  md/raid5: submit a window of stripes during resync/recovery
  md/raid5: allocate worker groups per NUMA node
  md/raid5: raise MAX_STRIPE_BATCH from 8 to 32
  md/raid5: reserve stripe cache for user I/O during rebuild

 drivers/md/md.c    |   4 +-
 drivers/md/md.h    |  10 ++--
 drivers/md/raid5.c | 129 ++++++++++++++++++++++++++++++++-------------
 drivers/md/raid5.h |  33 ++++++++----
 4 files changed, 121 insertions(+), 55 deletions(-)

base-commit: 55b77337bdd088c77461588e5ec094421b89911b

-- 
2.43.0
