Linux RAID subsystem development
From: Hiroshi Nishida <nishidafmly@gmail.com>
To: Song Liu <song@kernel.org>, Yu Kuai <yukuai@fygo.io>
Cc: Li Nan <magiclinan@didiglobal.com>, Xiao Ni <xiao@kernel.org>,
	linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org,
	Hiroshi Nishida <nishidafmly@gmail.com>
Subject: [PATCH 0/8] md/raid5: scalability and rebuild-path improvements
Date: Wed, 24 Jun 2026 08:54:44 -0700
Message-ID: <20260624155452.211646-1-nishidafmly@gmail.com>

This series collects small, individually low-risk md/raid5 changes for
large, many-core, many-disk arrays.  Their common theme is reducing
per-stripe and stripe-cache contention, so the benefit appears mainly
when the raid5 stripe-handling worker threads are in use
(group_thread_cnt > 0); at the default group_thread_cnt = 0 (a single
handling thread) the series is essentially neutral.

 - patches 1-3 remove signed arithmetic from a hot-path divisor, raise
   the stripe-cache size cap (32768 -> 262144), and widen a badblock
   length argument that currently truncates large ranges;
 - patch 4 raises NR_STRIPE_HASH_LOCKS (8 -> 32) to spread stripe-hash
   contention on high core-count systems;
 - patches 5 and 8 reduce per-stripe overhead in the resync/recovery
   path and bound the share of the stripe cache a rebuild may hold while
   user I/O is competing;
 - patch 6 allocates each worker group's array on its own NUMA node;
 - patch 7 raises MAX_STRIPE_BATCH (8 -> 32).
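
As an illustration only (not code from the series), the effect of the
larger NR_STRIPE_HASH_LOCKS in patch 4 can be sketched in shell.  The
sketch assumes the lock index is the stripe number masked by
(NR_STRIPE_HASH_LOCKS - 1) and that a stripe covers 8 sectors (4 KiB
pages); lock_index and distinct_locks are hypothetical helper names.

```shell
#!/bin/sh
# Illustrative model only: ASSUMES lock index = (sector >> 3) & (nr_locks - 1),
# i.e. 8 sectors per stripe unit and a power-of-two lock count.

# lock_index SECTOR NR_LOCKS: print the hash-lock slot for a sector.
lock_index() {
    echo $(( ($1 >> 3) & ($2 - 1) ))
}

# distinct_locks NR_LOCKS NR_STRIPES: print how many different lock
# slots are hit by NR_STRIPES consecutive stripes starting at sector 0.
distinct_locks() {
    i=0
    while [ "$i" -lt "$2" ]; do
        lock_index $(( i * 8 )) "$1"
        i=$(( i + 1 ))
    done | sort -un | wc -l | tr -d ' '
}

distinct_locks 8 32     # prints 8
distinct_locks 32 32    # prints 32
```

Under these assumptions, 32 consecutive stripes share 8 locks today and
would each get their own lock with 32.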

Measured effect, treatment vs baseline, % change in mean IOPS (N=3),
swept over group_thread_cnt (RAID6 4+2, 22-core host, ramdisk members):

  workload                       gtc=0   gtc=2   gtc=4   gtc=8
  random 4K write (RMW)          +4.2%   +8.1%  +17.4%   +6.5%
  DB mixed 75/25 8K              +0.4%   +4.2%  +10.3%   +4.7%
  high-concurrency 70/30 4K      +3.9%   +1.2%  +10.0%   +0.2%
  OLTP 70/30 16K                 -0.3%   +4.7%  +10.1%   +9.3%
  partial-stripe write 8K        +1.1%   +4.8%  +11.2%  +14.2%

At the default single handling thread (group_thread_cnt = 0) the series is
neutral to slightly positive (-0.3% to +4.2%, no regression beyond noise).
As worker threads are added the gain grows, peaking around
group_thread_cnt = 4 at roughly +10-17% across the whole mix.  At gtc = 8
only the partial-stripe write workload keeps gaining (+14.2%); OLTP holds
near its peak (+9.3%), and the other workloads fall back, the
high-concurrency case to +0.2%.  (Per-run cv was <1% except the
random-write test, ~5-9%, from a cold first run.)
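
For reference, the two statistics quoted here (percent change in mean
IOPS and per-run coefficient of variation) can be computed as in the
sketch below.  It is illustrative only: pct_change and mean_cv are
hypothetical helper names, and the IOPS values in the examples are made
up, not measurements from this series.

```shell
#!/bin/sh
# pct_change BASE_MEAN TREAT_MEAN: signed percent change, one decimal.
pct_change() {
    awk -v b="$1" -v t="$2" 'BEGIN { printf "%+.1f\n", (t - b) / b * 100 }'
}

# mean_cv V1 V2 ...: print "mean cv%" for the given per-run IOPS values,
# using the sample standard deviation (n - 1 in the denominator).
mean_cv() {
    printf '%s\n' "$@" | awk '
        { s += $1; q += $1 * $1; n++ }
        END {
            m = s / n
            v = (n > 1) ? (q - n * m * m) / (n - 1) : 0
            if (v < 0) v = 0          # guard against rounding error
            printf "%.1f %.2f\n", m, sqrt(v) / m * 100
        }'
}

# Made-up example values:
pct_change 100 110      # prints +10.0
mean_cv 90 100 110      # prints 100.0 10.00
```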

These numbers are on a ramdisk, which removes device latency and so
overstates the CPU-side contention effect relative to a real device;
they show the direction and the group_thread_cnt dependence, not an
absolute speedup.  The stripe-hash/batch patches (4, 7) and the cache cap
(2) drive this; patch 6 only matters on multi-socket systems (not
exercised above) and patches 5/8 act on the resync/recovery path rather
than this steady-state workload.

Reproduction (stock mdadm + fio):
  mdadm --create /dev/md0 --level=6 --raid-devices=6 --chunk=512 \
        --assume-clean <6 members>
  echo 16384 > /sys/block/md0/md/stripe_cache_size
  echo N     > /sys/block/md0/md/group_thread_cnt      # N = 0,2,4,8
  fio --filename=/dev/md0 --direct=1 --ioengine=libaio --group_reporting \
      --time_based --runtime=15 --name=w <per-workload opts>:
    random write   : --rw=randwrite              --bs=4k  --numjobs=4  --iodepth=32
    DB mixed       : --rw=randrw --rwmixread=75  --bs=8k  --numjobs=8  --iodepth=16
    high-concur.   : --rw=randrw --rwmixread=70  --bs=4k  --numjobs=16 --iodepth=8
    OLTP           : --rw=randrw --rwmixread=70  --bs=16k --numjobs=6  --iodepth=16
    partial-stripe : --rw=randwrite              --bs=8k  --numjobs=4  --iodepth=32
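
The sweep above can be scripted along the following lines.  This is a
dry-run sketch that only prints the commands: it assumes /dev/md0 has
already been created as shown, print_sweep and the per-workload job
names are hypothetical (the commands above use --name=w), and the N=3
repeats are left out.

```shell
#!/bin/sh
# Dry run: prints the sysfs write and fio command for every
# (group_thread_cnt, workload) pair instead of executing them.
MD=${MD:-md0}   # assumed md device name

# "job name|fio options"; the options are copied from the list above,
# the job names are made up for this sketch.
WORKLOADS='randwrite-4k|--rw=randwrite --bs=4k --numjobs=4 --iodepth=32
db-mixed-8k|--rw=randrw --rwmixread=75 --bs=8k --numjobs=8 --iodepth=16
high-concur-4k|--rw=randrw --rwmixread=70 --bs=4k --numjobs=16 --iodepth=8
oltp-16k|--rw=randrw --rwmixread=70 --bs=16k --numjobs=6 --iodepth=16
partial-stripe-8k|--rw=randwrite --bs=8k --numjobs=4 --iodepth=32'

print_sweep() {
    for gtc in 0 2 4 8; do
        echo "echo $gtc > /sys/block/$MD/md/group_thread_cnt"
        echo "$WORKLOADS" | while IFS='|' read -r name opts; do
            echo "fio --filename=/dev/$MD --direct=1 --ioengine=libaio" \
                 "--group_reporting --time_based --runtime=15" \
                 "--name=$name $opts"
        done
    done
}

print_sweep
```

Piping the output to sh as root would run the sweep; review the printed
commands first, since they write to the array.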

Each patch stands on its own; I am happy to drop or defer any that is not
justified on its own merit.

Functional testing on RAID5 and RAID6: create, fail a member, rebuild
onto a spare / re-add, full data read-back verified, and scrub
("check") reporting mismatch_cnt == 0.  The series was also exercised
with KASAN and lockdep enabled -- including heavy group_thread_cnt
churn on a multi-node setup to stress the per-NUMA-node worker
allocation and the raid5_quiesce hash-lock-all path -- with no reports.

Hiroshi Nishida (8):
  md: change chunk_sectors and stripe cache counts to unsigned int
  md/raid5: raise stripe cache limit from 32768 to 262144
  md: widen badblock sectors param from int to sector_t
  md/raid5: raise NR_STRIPE_HASH_LOCKS from 8 to 32
  md/raid5: submit a window of stripes during resync/recovery
  md/raid5: allocate worker groups per NUMA node
  md/raid5: raise MAX_STRIPE_BATCH from 8 to 32
  md/raid5: reserve stripe cache for user I/O during rebuild

 drivers/md/md.c    |   4 +-
 drivers/md/md.h    |  10 ++--
 drivers/md/raid5.c | 129 ++++++++++++++++++++++++++++++++-------------
 drivers/md/raid5.h |  33 ++++++++----
 4 files changed, 121 insertions(+), 55 deletions(-)

base-commit: 55b77337bdd088c77461588e5ec094421b89911b

-- 
2.43.0

