[PATCH 0/8] md/raid5: scalability and rebuild-path improvements

Linux RAID subsystem development
 help / color / mirror / Atom feed

* [PATCH 0/8] md/raid5: scalability and rebuild-path improvements
@ 2026-06-24 15:54 Hiroshi Nishida
  2026-06-24 15:54 ` [PATCH 1/8] md: change chunk_sectors and stripe cache counts to unsigned int Hiroshi Nishida
                   ` (7 more replies)
  0 siblings, 8 replies; 19+ messages in thread
From: Hiroshi Nishida @ 2026-06-24 15:54 UTC (permalink / raw)
  To: Song Liu, Yu Kuai
  Cc: Li Nan, Xiao Ni, linux-raid, linux-kernel, Hiroshi Nishida

This series collects small, individually low-risk md/raid5 changes for
large, many-core, many-disk arrays.  Their common theme is reducing
per-stripe and stripe-cache contention, so the benefit appears mainly
when the raid5 stripe-handling worker threads are in use
(group_thread_cnt > 0); at the default group_thread_cnt = 0 (a single
handling thread) the series is essentially neutral.

 - patches 1-3 remove signed arithmetic from a hot-path divisor, lift an
   arbitrary stripe-cache size cap, and widen a badblock length argument
   that currently truncates large ranges;
 - patch 4 raises NR_STRIPE_HASH_LOCKS (8 -> 32) to spread stripe-hash
   contention on high core-count systems;
 - patches 5 and 8 reduce per-stripe overhead in the resync/recovery
   path and bound the share of the stripe cache a rebuild may hold while
   user I/O is competing;
 - patch 6 allocates each worker group's array on its own NUMA node;
 - patch 7 raises MAX_STRIPE_BATCH (8 -> 32).

Measured effect, treatment vs baseline, % change in mean IOPS (N=3),
swept over group_thread_cnt (RAID6 4+2, 22-core host, ramdisk members):

  workload                       gtc=0   gtc=2   gtc=4   gtc=8
  random 4K write (RMW)          +4.2%   +8.1%  +17.4%   +6.5%
  DB mixed 75/25 8K              +0.4%   +4.2%  +10.3%   +4.7%
  high-concurrency 70/30 4K      +3.9%   +1.2%  +10.0%   +0.2%
  OLTP 70/30 16K                 -0.3%   +4.7%  +10.1%   +9.3%
  partial-stripe write 8K        +1.1%   +4.8%  +11.2%  +14.2%

At the default single handling thread (group_thread_cnt = 0) the series is
neutral (no regression).  As worker threads are added the gain grows,
peaking broadly around group_thread_cnt = 4 at roughly +10-17% across the
whole mix; at gtc = 8 the write-heavy workloads keep gaining while the
read-heavy high-concurrency case has saturated.  (Per-run cv was <1%
except the random-write test, ~5-9%, from a cold first run.)

These numbers are on a ramdisk, which removes device latency and so
overstates the CPU-side contention effect relative to a real device;
they show the direction and the group_thread_cnt dependence, not an
absolute speedup.  The stripe-hash/batch patches (4, 7) and the cache cap
(2) drive this; patch 6 only matters on multi-socket systems (not
exercised above) and patches 5/8 act on the resync/recovery path rather
than this steady-state workload.

Reproduction (stock mdadm + fio):
  mdadm --create /dev/md0 --level=6 --raid-devices=6 --chunk=512 \
        --assume-clean <6 members>
  echo 16384 > /sys/block/md0/md/stripe_cache_size
  echo N     > /sys/block/md0/md/group_thread_cnt      # N = 0,2,4,8
  fio --filename=/dev/md0 --direct=1 --ioengine=libaio --group_reporting \
      --time_based --runtime=15 --name=w <per-workload opts>:
    random write   : --rw=randwrite              --bs=4k  --numjobs=4  --iodepth=32
    DB mixed       : --rw=randrw --rwmixread=75  --bs=8k  --numjobs=8  --iodepth=16
    high-concur.   : --rw=randrw --rwmixread=70  --bs=4k  --numjobs=16 --iodepth=8
    OLTP           : --rw=randrw --rwmixread=70  --bs=16k --numjobs=6  --iodepth=16
    partial-stripe : --rw=randwrite              --bs=8k  --numjobs=4  --iodepth=32

Each patch stands on its own; I am happy to drop or defer any that is not
justified on its own merit.

Functional testing on RAID5 and RAID6: create, fail a member, rebuild
onto a spare / re-add, full data read-back verified, and scrub
("check") reporting mismatch_cnt == 0.  The series was also exercised
with KASAN and lockdep enabled -- including heavy group_thread_cnt
churn on a multi-node setup to stress the per-NUMA-node worker
allocation and the raid5_quiesce hash-lock-all path -- with no reports.

Hiroshi Nishida (8):
  md: change chunk_sectors and stripe cache counts to unsigned int
  md/raid5: raise stripe cache limit from 32768 to 262144
  md: widen badblock sectors param from int to sector_t
  md/raid5: raise NR_STRIPE_HASH_LOCKS from 8 to 32
  md/raid5: submit a window of stripes during resync/recovery
  md/raid5: allocate worker groups per NUMA node
  md/raid5: raise MAX_STRIPE_BATCH from 8 to 32
  md/raid5: reserve stripe cache for user I/O during rebuild

 drivers/md/md.c    |   4 +-
 drivers/md/md.h    |  10 ++--
 drivers/md/raid5.c | 129 ++++++++++++++++++++++++++++++++-------------
 drivers/md/raid5.h |  33 ++++++++----
 4 files changed, 121 insertions(+), 55 deletions(-)

base-commit: 55b77337bdd088c77461588e5ec094421b89911b

-- 
2.43.0

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/8] md: change chunk_sectors and stripe cache counts to unsigned int
  2026-06-24 15:54 [PATCH 0/8] md/raid5: scalability and rebuild-path improvements Hiroshi Nishida
@ 2026-06-24 15:54 ` Hiroshi Nishida
  2026-06-24 16:16   ` sashiko-bot
  2026-06-24 15:54 ` [PATCH 2/8] md/raid5: raise stripe cache limit from 32768 to 262144 Hiroshi Nishida
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 19+ messages in thread
From: Hiroshi Nishida @ 2026-06-24 15:54 UTC (permalink / raw)
  To: Song Liu, Yu Kuai
  Cc: Li Nan, Xiao Ni, linux-raid, linux-kernel, Hiroshi Nishida

chunk_sectors, new_chunk_sectors, prev_chunk_sectors, max_nr_stripes,
and min_nr_stripes are never negative. Using signed int is semantically
wrong and prevents the compiler from optimizing division/modulo by
power-of-two chunk sizes to right shifts in the hot I/O path.

Change all struct fields and derived local variables to unsigned int:
  mddev->chunk_sectors
  mddev->new_chunk_sectors
  r5conf->chunk_sectors
  r5conf->prev_chunk_sectors
  r5conf->max_nr_stripes
  r5conf->min_nr_stripes
  Local: sectors_per_chunk, new_chunk, chunk_sectors

The min() in r5c_check_cached_full_stripe() required both operands to
match signedness; this is now satisfied with max_nr_stripes unsigned.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Hiroshi Nishida <nishidafmly@gmail.com>
---
 drivers/md/md.h    |  4 ++--
 drivers/md/raid5.c | 14 +++++++-------
 drivers/md/raid5.h |  8 ++++----
 3 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/md/md.h b/drivers/md/md.h
index d8daf0f75cbb..b9ad26844799 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -437,7 +437,7 @@ struct mddev {
 	int				external;	/* metadata is
 							 * managed externally */
 	char				metadata_type[17]; /* externally set*/
-	int				chunk_sectors;
+	unsigned int			chunk_sectors;
 	time64_t			ctime, utime;
 	int				level, layout;
 	char				clevel[16];
@@ -466,7 +466,7 @@ struct mddev {
 	 */
 	sector_t			reshape_position;
 	int				delta_disks, new_level, new_layout;
-	int				new_chunk_sectors;
+	unsigned int			new_chunk_sectors;
 	int				reshape_backwards;
 
 	struct md_thread __rcu		*thread;	/* management thread */
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 0c5c9fb0606e..28828e083c2b 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2970,7 +2970,7 @@ sector_t raid5_compute_sector(struct r5conf *conf, sector_t r_sector,
 	sector_t new_sector;
 	int algorithm = previous ? conf->prev_algo
 				 : conf->algorithm;
-	int sectors_per_chunk = previous ? conf->prev_chunk_sectors
+	unsigned int sectors_per_chunk = previous ? conf->prev_chunk_sectors
 					 : conf->chunk_sectors;
 	int raid_disks = previous ? conf->previous_raid_disks
 				  : conf->raid_disks;
@@ -3166,7 +3166,7 @@ sector_t raid5_compute_blocknr(struct stripe_head *sh, int i, int previous)
 	int raid_disks = sh->disks;
 	int data_disks = raid_disks - conf->max_degraded;
 	sector_t new_sector = sh->sector, check;
-	int sectors_per_chunk = previous ? conf->prev_chunk_sectors
+	unsigned int sectors_per_chunk = previous ? conf->prev_chunk_sectors
 					 : conf->chunk_sectors;
 	int algorithm = previous ? conf->prev_algo
 				 : conf->algorithm;
@@ -3584,7 +3584,7 @@ static void end_reshape(struct r5conf *conf);
 static void stripe_set_idx(sector_t stripe, struct r5conf *conf, int previous,
 			    struct stripe_head *sh)
 {
-	int sectors_per_chunk =
+	unsigned int sectors_per_chunk =
 		previous ? conf->prev_chunk_sectors : conf->chunk_sectors;
 	int dd_idx;
 	int chunk_offset = sector_div(stripe, sectors_per_chunk);
@@ -6103,7 +6103,7 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,
 static sector_t raid5_bio_lowest_chunk_sector(struct r5conf *conf,
 					      struct bio *bi)
 {
-	int sectors_per_chunk = conf->chunk_sectors;
+	unsigned int sectors_per_chunk = conf->chunk_sectors;
 	int raid_disks = conf->raid_disks;
 	int dd_idx;
 	struct stripe_head sh;
@@ -7930,7 +7930,7 @@ static int raid5_run(struct mddev *mddev)
 		sector_t here_new, here_old;
 		int old_disks;
 		int max_degraded = (mddev->level == 6 ? 2 : 1);
-		int chunk_sectors;
+		unsigned int chunk_sectors;
 		int new_data_disks;
 
 		if (journal_dev) {
@@ -8832,7 +8832,7 @@ static int raid5_check_reshape(struct mddev *mddev)
 	 * to be used by a reshape pass.
 	 */
 	struct r5conf *conf = mddev->private;
-	int new_chunk = mddev->new_chunk_sectors;
+	unsigned int new_chunk = mddev->new_chunk_sectors;
 
 	if (mddev->new_layout >= 0 && !algorithm_valid_raid5(mddev->new_layout))
 		return -EINVAL;
@@ -8866,7 +8866,7 @@ static int raid5_check_reshape(struct mddev *mddev)
 
 static int raid6_check_reshape(struct mddev *mddev)
 {
-	int new_chunk = mddev->new_chunk_sectors;
+	unsigned int new_chunk = mddev->new_chunk_sectors;
 
 	if (mddev->new_layout >= 0 && !algorithm_valid_raid6(mddev->new_layout))
 		return -EINVAL;
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index cb5feae04db2..5cd9d0f36b6e 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -572,12 +572,12 @@ struct r5conf {
 	/* only protect corresponding hash list and inactive_list */
 	spinlock_t		hash_locks[NR_STRIPE_HASH_LOCKS];
 	struct mddev		*mddev;
-	int			chunk_sectors;
+	unsigned int		chunk_sectors;
 	int			level, algorithm, rmw_level;
 	int			max_degraded;
 	int			raid_disks;
-	int			max_nr_stripes;
-	int			min_nr_stripes;
+	unsigned int		max_nr_stripes;
+	unsigned int		min_nr_stripes;
 #if PAGE_SIZE != DEFAULT_STRIPE_SIZE
 	unsigned long	stripe_size;
 	unsigned int	stripe_shift;
@@ -595,7 +595,7 @@ struct r5conf {
 	 */
 	sector_t		reshape_safe;
 	int			previous_raid_disks;
-	int			prev_chunk_sectors;
+	unsigned int		prev_chunk_sectors;
 	int			prev_algo;
 	short			generation; /* increments with every reshape */
 	seqcount_spinlock_t	gen_lock;	/* lock against generation changes */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/8] md: change chunk_sectors and stripe cache counts to unsigned int
  2026-06-24 15:54 ` [PATCH 1/8] md: change chunk_sectors and stripe cache counts to unsigned int Hiroshi Nishida
@ 2026-06-24 16:16   ` sashiko-bot
  2026-06-24 17:25     ` Hiroshi Nishida
  0 siblings, 1 reply; 19+ messages in thread
From: sashiko-bot @ 2026-06-24 16:16 UTC (permalink / raw)
  To: Hiroshi Nishida; +Cc: linux-raid, yukuai

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:

New issues:
- [Medium] The commit's core justification is false. Changing the types to unsigned does not enable compiler right-shift optimizations for variable divisions.
- [Low] Unupdated printf-style format specifiers passing unsigned fields to '%d' cause compiler warnings (-Wformat) and break builds configured with -Werror.

Pre-existing issues:
- [Critical] The macro `round_down` is incorrectly used with a non-power-of-two divisor, causing silent data corruption in bitmap sector calculations.
--

commit 86528a8b22367a5ee16a7cb19a1d56bf58f6aa67
Author: Hiroshi Nishida <nishidafmly@gmail.com>

md: change chunk_sectors and stripe cache counts to unsigned int

This commit changes several structural fields and local variables like
chunk_sectors and max_nr_stripes from signed to unsigned int. The stated
rationale is to prevent negative values and enable right-shift compiler
optimizations for power-of-two division in the hot I/O path.

>     chunk_sectors, new_chunk_sectors, prev_chunk_sectors, max_nr_stripes,
>     and min_nr_stripes are never negative. Using signed int is semantically
>     wrong and prevents the compiler from optimizing division/modulo by
>     power-of-two chunk sizes to right shifts in the hot I/O path.

[Severity: Medium]
Is this optimization rationale accurate? The hot I/O path uses sector_div(),
which expands to do_div(). The do_div() macro explicitly casts the divisor
to a 32-bit unsigned integer, so the division was already operating on an
unsigned value.

Additionally, because sectors_per_chunk is a runtime variable rather than a
compile-time constant, would the compiler be able to optimize it into a
right shift without a conditional branch?

> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index d8daf0f75cbbe..b9ad26844799d 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
[ ... ]
> @@ -437,7 +437,7 @@ struct mddev {
>  	int				external;	/* metadata is
>  							 * managed externally */
>  	char				metadata_type[17]; /* externally set*/
> -	int				chunk_sectors;
> +	unsigned int			chunk_sectors;

[Severity: Low]
Does changing this field to unsigned int cause compiler warnings with
existing format strings?

For example, chunk_size_show() in drivers/md/md.c appears to use %d to
format mddev->chunk_sectors:

chunk_size_show() {
    ...
    return sprintf(page, "%d (%d)\n",
                   mddev->new_chunk_sectors << 9,
                   mddev->chunk_sectors << 9);
}

Will this break builds configured with -Werror?

> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index cb5feae04db27..5cd9d0f36b6e0 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
[ ... ]
> @@ -572,12 +572,12 @@ struct r5conf {
>  	/* only protect corresponding hash list and inactive_list */
>  	spinlock_t		hash_locks[NR_STRIPE_HASH_LOCKS];
>  	struct mddev		*mddev;
> -	int			chunk_sectors;
> +	unsigned int		chunk_sectors;
>  	int			level, algorithm, rmw_level;
>  	int			max_degraded;
>  	int			raid_disks;
> -	int			max_nr_stripes;
> -	int			min_nr_stripes;
> +	unsigned int		max_nr_stripes;
> +	unsigned int		min_nr_stripes;

[Severity: Low]
Will the changes to min_nr_stripes and max_nr_stripes also trigger format
warnings?

Places like raid5_end_read_request() and raid5_set_cache_size() still use
%d to print these fields. For instance, in raid5_set_cache_size():

pr_err("md/raid:%s: Cannot increase cache size, ret=%d, new_size=%d, min_nr_stripes=%d, max_nr_stripes=%d\n",
       ...
       conf->min_nr_stripes,
       conf->max_nr_stripes);

> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 0c5c9fb0606ee..28828e083c2b3 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
[ ... ]
> @@ -2970,7 +2970,7 @@ sector_t raid5_compute_sector(struct r5conf *conf, sector_t r_sector,
>  	sector_t new_sector;
>  	int algorithm = previous ? conf->prev_algo
>  				 : conf->algorithm;
> -	int sectors_per_chunk = previous ? conf->prev_chunk_sectors
> +	unsigned int sectors_per_chunk = previous ? conf->prev_chunk_sectors
>  					 : conf->chunk_sectors;

[Severity: Critical]
This is a pre-existing issue, but could the use of sectors_per_chunk
cause problems when passed to rounding macros?

Looking at raid5_bitmap_sector():

raid5_bitmap_sector() {
    ...
    sectors_per_chunk = conf->chunk_sectors *
        (conf->raid_disks - conf->max_degraded);
    start = round_down(start, sectors_per_chunk);
    end = round_up(end, sectors_per_chunk);
    ...
}

The round_down() and round_up() macros are designed specifically for
power-of-two divisors. If the number of data disks is not a power of two,
sectors_per_chunk will not be a power of two, which would yield incorrect
sector boundaries.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260624155452.211646-1-nishidafmly@gmail.com?part=1

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/8] md: change chunk_sectors and stripe cache counts to unsigned int
  2026-06-24 16:16   ` sashiko-bot
@ 2026-06-24 17:25     ` Hiroshi Nishida
  0 siblings, 0 replies; 19+ messages in thread
From: Hiroshi Nishida @ 2026-06-24 17:25 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: linux-raid, yukuai

> [Medium] unsigned types don't enable right-shift optimization for variable divisions.

You're right -- the hot-path divisions go through
sector_div()/do_div() (already u32),
and sectors_per_chunk is a runtime value, so signedness changes no codegen.
The rationale is wrong and the change has no perf benefit, so I'll
drop this patch
from v2 rather than reword a claim that doesn't hold.

> [Low] unupdated %d specifiers for the now-unsigned fields.
Another reason to drop it -- avoids the %d/%u churn. (They don't warn
in a default
build, only under -Wformat-signedness, but they are inconsistent.)

 > [Critical, pre-existing] round_down/round_up with a non-power-of-two divisor.
Good catch, and independent of this series.
sectors_per_chunk = chunk_sectors * (raid_disks - max_degraded)
isn't a power of two whenever the data-disk count isn't (raid5 4-disk
-> 3 data),
and round_down/round_up are mask-based.
I'll send a separate fix switching to rounddown()/roundup() with a Fixes: tag.

Thanks,

2026年6月24日(水) 9:16 <sashiko-bot@kernel.org>:
>
> Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
>
> New issues:
> - [Medium] The commit's core justification is false. Changing the types to unsigned does not enable compiler right-shift optimizations for variable divisions.
> - [Low] Unupdated printf-style format specifiers passing unsigned fields to '%d' cause compiler warnings (-Wformat) and break builds configured with -Werror.
>
> Pre-existing issues:
> - [Critical] The macro `round_down` is incorrectly used with a non-power-of-two divisor, causing silent data corruption in bitmap sector calculations.
> --
>
> commit 86528a8b22367a5ee16a7cb19a1d56bf58f6aa67
> Author: Hiroshi Nishida <nishidafmly@gmail.com>
>
> md: change chunk_sectors and stripe cache counts to unsigned int
>
> This commit changes several structural fields and local variables like
> chunk_sectors and max_nr_stripes from signed to unsigned int. The stated
> rationale is to prevent negative values and enable right-shift compiler
> optimizations for power-of-two division in the hot I/O path.
>
> >     chunk_sectors, new_chunk_sectors, prev_chunk_sectors, max_nr_stripes,
> >     and min_nr_stripes are never negative. Using signed int is semantically
> >     wrong and prevents the compiler from optimizing division/modulo by
> >     power-of-two chunk sizes to right shifts in the hot I/O path.
>
> [Severity: Medium]
> Is this optimization rationale accurate? The hot I/O path uses sector_div(),
> which expands to do_div(). The do_div() macro explicitly casts the divisor
> to a 32-bit unsigned integer, so the division was already operating on an
> unsigned value.
>
> Additionally, because sectors_per_chunk is a runtime variable rather than a
> compile-time constant, would the compiler be able to optimize it into a
> right shift without a conditional branch?
>
> > diff --git a/drivers/md/md.h b/drivers/md/md.h
> > index d8daf0f75cbbe..b9ad26844799d 100644
> > --- a/drivers/md/md.h
> > +++ b/drivers/md/md.h
> [ ... ]
> > @@ -437,7 +437,7 @@ struct mddev {
> >       int                             external;       /* metadata is
> >                                                        * managed externally */
> >       char                            metadata_type[17]; /* externally set*/
> > -     int                             chunk_sectors;
> > +     unsigned int                    chunk_sectors;
>
> [Severity: Low]
> Does changing this field to unsigned int cause compiler warnings with
> existing format strings?
>
> For example, chunk_size_show() in drivers/md/md.c appears to use %d to
> format mddev->chunk_sectors:
>
> chunk_size_show() {
>     ...
>     return sprintf(page, "%d (%d)\n",
>                    mddev->new_chunk_sectors << 9,
>                    mddev->chunk_sectors << 9);
> }
>
> Will this break builds configured with -Werror?
>
> > diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> > index cb5feae04db27..5cd9d0f36b6e0 100644
> > --- a/drivers/md/raid5.h
> > +++ b/drivers/md/raid5.h
> [ ... ]
> > @@ -572,12 +572,12 @@ struct r5conf {
> >       /* only protect corresponding hash list and inactive_list */
> >       spinlock_t              hash_locks[NR_STRIPE_HASH_LOCKS];
> >       struct mddev            *mddev;
> > -     int                     chunk_sectors;
> > +     unsigned int            chunk_sectors;
> >       int                     level, algorithm, rmw_level;
> >       int                     max_degraded;
> >       int                     raid_disks;
> > -     int                     max_nr_stripes;
> > -     int                     min_nr_stripes;
> > +     unsigned int            max_nr_stripes;
> > +     unsigned int            min_nr_stripes;
>
> [Severity: Low]
> Will the changes to min_nr_stripes and max_nr_stripes also trigger format
> warnings?
>
> Places like raid5_end_read_request() and raid5_set_cache_size() still use
> %d to print these fields. For instance, in raid5_set_cache_size():
>
> pr_err("md/raid:%s: Cannot increase cache size, ret=%d, new_size=%d, min_nr_stripes=%d, max_nr_stripes=%d\n",
>        ...
>        conf->min_nr_stripes,
>        conf->max_nr_stripes);
>
> > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > index 0c5c9fb0606ee..28828e083c2b3 100644
> > --- a/drivers/md/raid5.c
> > +++ b/drivers/md/raid5.c
> [ ... ]
> > @@ -2970,7 +2970,7 @@ sector_t raid5_compute_sector(struct r5conf *conf, sector_t r_sector,
> >       sector_t new_sector;
> >       int algorithm = previous ? conf->prev_algo
> >                                : conf->algorithm;
> > -     int sectors_per_chunk = previous ? conf->prev_chunk_sectors
> > +     unsigned int sectors_per_chunk = previous ? conf->prev_chunk_sectors
> >                                        : conf->chunk_sectors;
>
> [Severity: Critical]
> This is a pre-existing issue, but could the use of sectors_per_chunk
> cause problems when passed to rounding macros?
>
> Looking at raid5_bitmap_sector():
>
> raid5_bitmap_sector() {
>     ...
>     sectors_per_chunk = conf->chunk_sectors *
>         (conf->raid_disks - conf->max_degraded);
>     start = round_down(start, sectors_per_chunk);
>     end = round_up(end, sectors_per_chunk);
>     ...
> }
>
> The round_down() and round_up() macros are designed specifically for
> power-of-two divisors. If the number of data disks is not a power of two,
> sectors_per_chunk will not be a power of two, which would yield incorrect
> sector boundaries.
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260624155452.211646-1-nishidafmly@gmail.com?part=1



-- 
Hiroshi Nishida
nishidafmly@gmail.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 2/8] md/raid5: raise stripe cache limit from 32768 to 262144
  2026-06-24 15:54 [PATCH 0/8] md/raid5: scalability and rebuild-path improvements Hiroshi Nishida
  2026-06-24 15:54 ` [PATCH 1/8] md: change chunk_sectors and stripe cache counts to unsigned int Hiroshi Nishida
@ 2026-06-24 15:54 ` Hiroshi Nishida
  2026-06-24 15:54 ` [PATCH 3/8] md: widen badblock sectors param from int to sector_t Hiroshi Nishida
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Hiroshi Nishida @ 2026-06-24 15:54 UTC (permalink / raw)
  To: Song Liu, Yu Kuai
  Cc: Li Nan, Xiao Ni, linux-raid, linux-kernel, Hiroshi Nishida

Stripe cache memory is approximately max_nr_stripes * num_disks *
stripe_size, where stripe_size is 4KB (PAGE_SIZE).  The old hardcoded
ceiling of 32768 stripes therefore limits a 12-disk array to ~1.5GB of
stripe cache.  On servers with 256GB+ RAM backing wide arrays, a larger
cache reduces stalls on stripe-cache misses under heavy write workloads.

Define RAID5_MAX_NR_STRIPES = 262144 (256K) in raid5.h and use it in
raid5_set_cache_size() instead of the magic 32768 literal.  This allows
up to ~12GB of stripe cache on a 12-disk array.  The default stripe count
(NR_STRIPES = 256) is unchanged, so there is no memory impact unless the
administrator raises stripe_cache_size explicitly.

Also fix two local variables in raid5_cache_count() that were declared
as int but read from unsigned int fields; change to unsigned int to
avoid implicit sign-extension in the subtraction.

Tested: echo 262144 > /sys/block/md0/md/stripe_cache_size accepted;
262145 correctly returns EINVAL.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Hiroshi Nishida <nishidafmly@gmail.com>
---
 drivers/md/raid5.c | 6 +++---
 drivers/md/raid5.h | 6 ++++++
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 28828e083c2b..9cb4ed3bd85c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6923,7 +6923,7 @@ raid5_set_cache_size(struct mddev *mddev, int size)
 	int result = 0;
 	struct r5conf *conf = mddev->private;
 
-	if (size <= 16 || size > 32768)
+	if (size <= 16 || size > RAID5_MAX_NR_STRIPES)
 		return -EINVAL;
 
 	WRITE_ONCE(conf->min_nr_stripes, size);
@@ -7505,8 +7505,8 @@ static unsigned long raid5_cache_count(struct shrinker *shrink,
 				       struct shrink_control *sc)
 {
 	struct r5conf *conf = shrink->private_data;
-	int max_stripes = READ_ONCE(conf->max_nr_stripes);
-	int min_stripes = READ_ONCE(conf->min_nr_stripes);
+	unsigned int max_stripes = READ_ONCE(conf->max_nr_stripes);
+	unsigned int min_stripes = READ_ONCE(conf->min_nr_stripes);
 
 	if (max_stripes < min_stripes)
 		/* unlikely, but not impossible */
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 5cd9d0f36b6e..57349737d393 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -479,6 +479,12 @@ struct disk_info {
  */
 
 #define NR_STRIPES		256
+/* Maximum user-settable stripe cache size, in stripes.  Stripe cache memory is
+ * roughly max_nr_stripes * num_disks * stripe_size (stripe_size = 4KB), so the
+ * old cap of 32768 limited a 12-disk array to ~1.5GB.  Raise to 262144 to allow
+ * larger caches on big-RAM systems; the default (NR_STRIPES) is unchanged.
+ */
+#define RAID5_MAX_NR_STRIPES	262144U
 
 #if PAGE_SIZE == DEFAULT_STRIPE_SIZE
 #define STRIPE_SIZE		PAGE_SIZE
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 3/8] md: widen badblock sectors param from int to sector_t
  2026-06-24 15:54 [PATCH 0/8] md/raid5: scalability and rebuild-path improvements Hiroshi Nishida
  2026-06-24 15:54 ` [PATCH 1/8] md: change chunk_sectors and stripe cache counts to unsigned int Hiroshi Nishida
  2026-06-24 15:54 ` [PATCH 2/8] md/raid5: raise stripe cache limit from 32768 to 262144 Hiroshi Nishida
@ 2026-06-24 15:54 ` Hiroshi Nishida
  2026-06-24 15:54 ` [PATCH 4/8] md/raid5: raise NR_STRIPE_HASH_LOCKS from 8 to 32 Hiroshi Nishida
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Hiroshi Nishida @ 2026-06-24 15:54 UTC (permalink / raw)
  To: Song Liu, Yu Kuai
  Cc: Li Nan, Xiao Ni, linux-raid, linux-kernel, Hiroshi Nishida

The badblocks core API -- badblocks_set(), badblocks_clear() and
badblocks_check() -- and the is_badblock() helper all take the range
length as sector_t.  The md wrappers rdev_set_badblocks(),
rdev_clear_badblocks() and rdev_has_badblock(), however, declared the
same length as int, narrowing sector_t to int and back again in the
middle of an otherwise 64-bit clean path.

Change the sectors parameter to sector_t in these three wrappers so it
matches the core API and is_badblock().  No functional change: current
callers pass per-I/O or per-resync-chunk lengths well within int range.
This just removes a gratuitous truncation point and keeps the type
consistent end to end.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Hiroshi Nishida <nishidafmly@gmail.com>
---
 drivers/md/md.c | 4 ++--
 drivers/md/md.h | 6 +++---
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index d1465bcd86c8..61f40fa41e78 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -10553,7 +10553,7 @@ EXPORT_SYMBOL(md_finish_reshape);
 /* Bad block management */
 
 /* Returns true on success, false on failure */
-bool rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
+bool rdev_set_badblocks(struct md_rdev *rdev, sector_t s, sector_t sectors,
 			int is_new)
 {
 	struct mddev *mddev = rdev->mddev;
@@ -10593,7 +10593,7 @@ bool rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
 }
 EXPORT_SYMBOL_GPL(rdev_set_badblocks);
 
-void rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
+void rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, sector_t sectors,
 			  int is_new)
 {
 	if (is_new)
diff --git a/drivers/md/md.h b/drivers/md/md.h
index b9ad26844799..95835a3286aa 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -311,7 +311,7 @@ static inline int is_badblock(struct md_rdev *rdev, sector_t s, sector_t sectors
 }
 
 static inline int rdev_has_badblock(struct md_rdev *rdev, sector_t s,
-				    int sectors)
+				    sector_t sectors)
 {
 	sector_t first_bad;
 	sector_t bad_sectors;
@@ -319,9 +319,9 @@ static inline int rdev_has_badblock(struct md_rdev *rdev, sector_t s,
 	return is_badblock(rdev, s, sectors, &first_bad, &bad_sectors);
 }
 
-extern bool rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
+extern bool rdev_set_badblocks(struct md_rdev *rdev, sector_t s, sector_t sectors,
 			       int is_new);
-extern void rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
+extern void rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, sector_t sectors,
 				 int is_new);
 struct md_cluster_info;
 struct md_cluster_operations;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 4/8] md/raid5: raise NR_STRIPE_HASH_LOCKS from 8 to 32
  2026-06-24 15:54 [PATCH 0/8] md/raid5: scalability and rebuild-path improvements Hiroshi Nishida
                   ` (2 preceding siblings ...)
  2026-06-24 15:54 ` [PATCH 3/8] md: widen badblock sectors param from int to sector_t Hiroshi Nishida
@ 2026-06-24 15:54 ` Hiroshi Nishida
  2026-06-24 15:54 ` [PATCH 5/8] md/raid5: submit a window of stripes during resync/recovery Hiroshi Nishida
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Hiroshi Nishida @ 2026-06-24 15:54 UTC (permalink / raw)
  To: Song Liu, Yu Kuai
  Cc: Li Nan, Xiao Ni, linux-raid, linux-kernel, Hiroshi Nishida

The stripe cache hash is striped across NR_STRIPE_HASH_LOCKS spinlocks
(see stripe_hash_locks_hash()).  The value has been 8 since the per-hash
locking was introduced, which is small for modern many-core servers:
with only 8 buckets, stripe cache lookup and allocation contend on the
same few locks once the CPU count greatly exceeds the bucket count.

Raise it to 32.  Two constraints bound the choice:

  - STRIPE_HASH_LOCKS_MASK is (NR_STRIPE_HASH_LOCKS - 1) and is used as
    a bitmask in stripe_hash_locks_hash(), so the value must be a power
    of two.

  - raid5_quiesce() acquires every hash lock plus device_lock at once
    via lock_all_device_hash_locks_irq().  That holds
    NR_STRIPE_HASH_LOCKS + 1 locks simultaneously, which must stay below
    MAX_LOCK_DEPTH (48) so the held-lock array does not overflow when
    lockdep is enabled.

32 is the largest power of two that satisfies both: 32 + 1 leaves
headroom under 48, whereas 64 would exceed it.  (The pre-existing
"must remain below 64" comment understates this; MAX_LOCK_DEPTH is the
real ceiling.)  STRIPE_HASH_LOCKS_MASK and all NR_STRIPE_HASH_LOCKS-
sized arrays scale automatically.

This is purely a lock-striping change; it does not affect stripe cache
correctness.  The benefit appears on many-core systems, while on small
systems the extra buckets are harmless.

Tested: loop-device RAID-5 create, write/verify, fail a disk, rebuild
onto a spare and scrub all complete cleanly.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Hiroshi Nishida <nishidafmly@gmail.com>
---
 drivers/md/raid5.h | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 57349737d393..3efab71ebef7 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -498,12 +498,14 @@ struct disk_info {
 #define HASH_MASK		(NR_HASH - 1)
 #define MAX_STRIPE_BATCH	8
 
-/* NOTE NR_STRIPE_HASH_LOCKS must remain below 64.
- * This is because we sometimes take all the spinlocks
- * and creating that much locking depth can cause
- * problems.
+/* NR_STRIPE_HASH_LOCKS must be a power of two, since
+ * STRIPE_HASH_LOCKS_MASK masks with (NR_STRIPE_HASH_LOCKS - 1).
+ * It must also be small enough that taking all of them at once in
+ * lock_all_device_hash_locks_irq(), plus device_lock, keeps the held
+ * lock count below MAX_LOCK_DEPTH (48) with lockdep enabled.  32 is the
+ * largest power of two that satisfies both constraints.
  */
-#define NR_STRIPE_HASH_LOCKS 8
+#define NR_STRIPE_HASH_LOCKS 32
 #define STRIPE_HASH_LOCKS_MASK (NR_STRIPE_HASH_LOCKS - 1)
 
 struct r5worker {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 5/8] md/raid5: submit a window of stripes during resync/recovery
  2026-06-24 15:54 [PATCH 0/8] md/raid5: scalability and rebuild-path improvements Hiroshi Nishida
                   ` (3 preceding siblings ...)
  2026-06-24 15:54 ` [PATCH 4/8] md/raid5: raise NR_STRIPE_HASH_LOCKS from 8 to 32 Hiroshi Nishida
@ 2026-06-24 15:54 ` Hiroshi Nishida
  2026-06-24 16:12   ` sashiko-bot
  2026-06-24 15:54 ` [PATCH 6/8] md/raid5: allocate worker groups per NUMA node Hiroshi Nishida
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 19+ messages in thread
From: Hiroshi Nishida @ 2026-06-24 15:54 UTC (permalink / raw)
  To: Song Liu, Yu Kuai
  Cc: Li Nan, Xiao Ni, linux-raid, linux-kernel, Hiroshi Nishida

raid5_sync_request() dispatches one stripe per call: it fetches a single
stripe head, marks it for sync, and returns one stripe's worth of
sectors.  When the stripe cache is full the NOBLOCK fetch fails and it
re-enters a one-jiffy throttle sleep (schedule_timeout_uninterruptible(1))
before retrying.  Because that sleep is taken per stripe, sustained cache
pressure bounds sync progress to roughly HZ stripes/second regardless of
how fast the member devices are.

Dispatch up to RAID5_SYNC_WINDOW (32) stripes per call instead.  Only the
first stripe of the window keeps the original behaviour (block, then the
one-jiffy throttle if the cache was full); the remaining stripes are
requested with R5_GAS_NOBLOCK and the loop stops as soon as the cache is
full.  So at most one throttle sleep is taken per window rather than per
stripe, and when the cache has free slots a single call can queue a batch
instead of one stripe at a time.

With a warm cache the window stays near full: counting raid5_sync_request()
invocations across a rebuild showed it averaging ~30 of the 32 stripes per
call, i.e. roughly 30x fewer calls into the sync path for the same resync.

The return value reports the number of stripes actually submitted, so
md_do_sync()'s recovery_active accounting stays balanced, and the window
is bounded by both the end of the sync region (max_sector) and
mddev->resync_max, so a user- or cluster-imposed sync ceiling is not
overshot.

This does not change which data is read or written during resync or
recovery.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Hiroshi Nishida <nishidafmly@gmail.com>
---
 drivers/md/raid5.c | 47 +++++++++++++++++++++++++++++++++-------------
 drivers/md/raid5.h |  1 +
 2 files changed, 35 insertions(+), 13 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 9cb4ed3bd85c..8e9edaaca667 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6563,7 +6563,8 @@ static inline sector_t raid5_sync_request(struct mddev *mddev, sector_t sector_n
 	struct stripe_head *sh;
 	sector_t sync_blocks;
 	bool still_degraded = false;
-	int i;
+	int i, submitted;
+	sector_t win_sector;
 
 	if (sector_nr >= max_sector) {
 		/* just being told to finish up .. nothing much to do */
@@ -6620,16 +6621,7 @@ static inline sector_t raid5_sync_request(struct mddev *mddev, sector_t sector_n
 	if (md_bitmap_enabled(mddev, false))
 		mddev->bitmap_ops->cond_end_sync(mddev, sector_nr, false);
 
-	sh = raid5_get_active_stripe(conf, NULL, sector_nr,
-				     R5_GAS_NOBLOCK);
-	if (sh == NULL) {
-		sh = raid5_get_active_stripe(conf, NULL, sector_nr, 0);
-		/* make sure we don't swamp the stripe cache if someone else
-		 * is trying to get access
-		 */
-		schedule_timeout_uninterruptible(1);
-	}
-	/* Need to check if array will still be degraded after recovery/resync
+	/* Check once whether array will still be degraded after recovery/resync.
 	 * Note in case of > 1 drive failures it's possible we're rebuilding
 	 * one drive while leaving another faulty drive in array.
 	 */
@@ -6640,13 +6632,42 @@ static inline sector_t raid5_sync_request(struct mddev *mddev, sector_t sector_n
 			still_degraded = true;
 	}
 
+	/* First stripe: block if stripe cache is full, then throttle. */
+	sh = raid5_get_active_stripe(conf, NULL, sector_nr, R5_GAS_NOBLOCK);
+	if (sh == NULL) {
+		sh = raid5_get_active_stripe(conf, NULL, sector_nr, 0);
+		/* make sure we don't swamp the stripe cache if someone else
+		 * is trying to get access
+		 */
+		schedule_timeout_uninterruptible(1);
+	}
 	md_bitmap_start_sync(mddev, sector_nr, &sync_blocks, still_degraded);
 	set_bit(STRIPE_SYNC_REQUESTED, &sh->state);
 	set_bit(STRIPE_HANDLE, &sh->state);
-
 	raid5_release_stripe(sh);
 
-	return RAID5_STRIPE_SECTORS(conf);
+	/* Submit remaining stripes in the window non-blocking.  Stop early
+	 * if the stripe cache is full: the disk queue is already saturated.
+	 * Bound by resync_max so a user- or cluster-imposed sync ceiling is
+	 * not overshot.
+	 */
+	win_sector = sector_nr + RAID5_STRIPE_SECTORS(conf);
+	for (submitted = 1;
+	     submitted < RAID5_SYNC_WINDOW && win_sector < max_sector &&
+	     win_sector < mddev->resync_max;
+	     submitted++, win_sector += RAID5_STRIPE_SECTORS(conf)) {
+		sh = raid5_get_active_stripe(conf, NULL, win_sector,
+					     R5_GAS_NOBLOCK);
+		if (!sh)
+			break;
+		md_bitmap_start_sync(mddev, win_sector, &sync_blocks,
+				     still_degraded);
+		set_bit(STRIPE_SYNC_REQUESTED, &sh->state);
+		set_bit(STRIPE_HANDLE, &sh->state);
+		raid5_release_stripe(sh);
+	}
+
+	return submitted * RAID5_STRIPE_SECTORS(conf);
 }
 
 static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio,
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 3efab71ebef7..7aeba1fc7f09 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -497,6 +497,7 @@ struct disk_info {
 #define NR_HASH			(PAGE_SIZE / sizeof(struct hlist_head))
 #define HASH_MASK		(NR_HASH - 1)
 #define MAX_STRIPE_BATCH	8
+#define RAID5_SYNC_WINDOW	32	/* stripes to pre-submit per sync_request call */
 
 /* NR_STRIPE_HASH_LOCKS must be a power of two, since
  * STRIPE_HASH_LOCKS_MASK masks with (NR_STRIPE_HASH_LOCKS - 1).
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/8] md/raid5: submit a window of stripes during resync/recovery
  2026-06-24 15:54 ` [PATCH 5/8] md/raid5: submit a window of stripes during resync/recovery Hiroshi Nishida
@ 2026-06-24 16:12   ` sashiko-bot
  2026-06-24 17:13     ` Hiroshi Nishida
  0 siblings, 1 reply; 19+ messages in thread
From: sashiko-bot @ 2026-06-24 16:12 UTC (permalink / raw)
  To: Hiroshi Nishida; +Cc: linux-raid, yukuai

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [High] The batch loop ignores the return value of md_bitmap_start_sync(), forcing clean stripes to be unnecessarily synchronized.
- [High] Time-of-check to time-of-use race on the array's degraded state, leading to silent data corruption via incorrect bitmap updates.
--

commit 01d1e80ee6aa253c48c233e7fc218fbdac9522fa
Author: Hiroshi Nishida <nishidafmly@gmail.com>

md/raid5: submit a window of stripes during resync/recovery

This commit alters raid5_sync_request() to dispatch up to 32 stripes per
call instead of one. It attempts to submit the remaining stripes in the
window with a non-blocking flag to avoid the per-stripe throttle sleep.

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 9cb4ed3bd85c5..8e9edaaca667b 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6620,16 +6621,7 @@ static inline sector_t raid5_sync_request(...)
[ ... ]
> -	sh = raid5_get_active_stripe(conf, NULL, sector_nr,
> -				     R5_GAS_NOBLOCK);
> -	if (sh == NULL) {
> -		sh = raid5_get_active_stripe(conf, NULL, sector_nr, 0);
> -		/* make sure we don't swamp the stripe cache if someone else
> -		 * is trying to get access
> -		 */
> -		schedule_timeout_uninterruptible(1);
> -	}
> -	/* Need to check if array will still be degraded after recovery/resync
> +	/* Check once whether array will still be degraded after recovery/resync.
>  	 * Note in case of > 1 drive failures it's possible we're rebuilding
>  	 * one drive while leaving another faulty drive in array.
>  	 */
> @@ -6640,13 +6632,42 @@ static inline sector_t raid5_sync_request(...)
>  			still_degraded = true;
>  	}
>  
> +	/* First stripe: block if stripe cache is full, then throttle. */
> +	sh = raid5_get_active_stripe(conf, NULL, sector_nr, R5_GAS_NOBLOCK);
> +	if (sh == NULL) {
> +		sh = raid5_get_active_stripe(conf, NULL, sector_nr, 0);
> +		/* make sure we don't swamp the stripe cache if someone else
> +		 * is trying to get access
> +		 */
> +		schedule_timeout_uninterruptible(1);
> +	}

[Severity: High]
Does moving the degraded check before the stripe allocation introduce a
time-of-check to time-of-use race?

If a drive fails during the potential sleep in raid5_get_active_stripe() or
schedule_timeout_uninterruptible(), the stale still_degraded value is used
below to update the bitmap. This could incorrectly clear the NEEDED bit,
potentially leaving parity inconsistent if the system crashes before the sync
completes.

>  	md_bitmap_start_sync(mddev, sector_nr, &sync_blocks, still_degraded);
>  	set_bit(STRIPE_SYNC_REQUESTED, &sh->state);
>  	set_bit(STRIPE_HANDLE, &sh->state);

[ ... ]

> +	win_sector = sector_nr + RAID5_STRIPE_SECTORS(conf);
> +	for (submitted = 1;
> +	     submitted < RAID5_SYNC_WINDOW && win_sector < max_sector &&
> +	     win_sector < mddev->resync_max;
> +	     submitted++, win_sector += RAID5_STRIPE_SECTORS(conf)) {
> +		sh = raid5_get_active_stripe(conf, NULL, win_sector,
> +					     R5_GAS_NOBLOCK);
> +		if (!sh)
> +			break;
> +		md_bitmap_start_sync(mddev, win_sector, &sync_blocks,
> +				     still_degraded);
> +		set_bit(STRIPE_SYNC_REQUESTED, &sh->state);
> +		set_bit(STRIPE_HANDLE, &sh->state);

[Severity: High]
Does this batch loop unnecessarily synchronize clean stripes?

The boolean return value of md_bitmap_start_sync() is ignored here. If the
first stripe needs sync but subsequent stripes in the window are clean, this
loop unconditionally queues every stripe for synchronization regardless of
its clean/dirty state, forcing read-rebuild-write cycles on clean stripes.

> +		raid5_release_stripe(sh);
> +	}
> +
> +	return submitted * RAID5_STRIPE_SECTORS(conf);
>  }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260624155452.211646-1-nishidafmly@gmail.com?part=5

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/8] md/raid5: submit a window of stripes during resync/recovery
  2026-06-24 16:12   ` sashiko-bot
@ 2026-06-24 17:13     ` Hiroshi Nishida
  0 siblings, 0 replies; 19+ messages in thread
From: Hiroshi Nishida @ 2026-06-24 17:13 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: linux-raid, yukuai

> [High] TOCTOU on still_degraded: the degraded check was moved before the
> blocking stripe get + schedule_timeout, so a drive failing during the sleep
> leaves a stale still_degraded that can wrongly clear the NEEDED bit.

Agreed -- this is a real regression I introduced.
I hoisted the still_degraded computation so the window could reuse it,
but that placed it before the blocking get and the
schedule_timeout_uninterruptible(1),
whereas stock computes it after.
In v2 I'll move the still_degraded computation back below the blocking
get of the first stripe (restoring stock's ordering).
The remaining window stripes are fetched with R5_GAS_NOBLOCK and don't
sleep, so they can reuse that
value without reopening the race across a sleep.

  > [High] The batch loop ignores md_bitmap_start_sync()'s return and
force-syncs
  > clean stripes in the window.

Also correct.
Stock makes the skip decision once at the top
(!md_bitmap_start_sync(..., true) && sync_blocks >= ...),
but the window loop doesn't, so a clean bitmap region the window
crosses gets synced unnecessarily.
In v2 I'll have the window honor the bitmap and stop at the boundary of the
dirty region being resynced, rather than extending into already-in-sync stripes.

Both are correctness details the per-stripe path got right and my
batching dropped;
I'll fold the fixes into v2 and re-test the bitmap resync path.

Thanks,

2026年6月24日(水) 9:12 <sashiko-bot@kernel.org>:
>
> Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
> - [High] The batch loop ignores the return value of md_bitmap_start_sync(), forcing clean stripes to be unnecessarily synchronized.
> - [High] Time-of-check to time-of-use race on the array's degraded state, leading to silent data corruption via incorrect bitmap updates.
> --
>
> commit 01d1e80ee6aa253c48c233e7fc218fbdac9522fa
> Author: Hiroshi Nishida <nishidafmly@gmail.com>
>
> md/raid5: submit a window of stripes during resync/recovery
>
> This commit alters raid5_sync_request() to dispatch up to 32 stripes per
> call instead of one. It attempts to submit the remaining stripes in the
> window with a non-blocking flag to avoid the per-stripe throttle sleep.
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 9cb4ed3bd85c5..8e9edaaca667b 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -6620,16 +6621,7 @@ static inline sector_t raid5_sync_request(...)
> [ ... ]
> > -     sh = raid5_get_active_stripe(conf, NULL, sector_nr,
> > -                                  R5_GAS_NOBLOCK);
> > -     if (sh == NULL) {
> > -             sh = raid5_get_active_stripe(conf, NULL, sector_nr, 0);
> > -             /* make sure we don't swamp the stripe cache if someone else
> > -              * is trying to get access
> > -              */
> > -             schedule_timeout_uninterruptible(1);
> > -     }
> > -     /* Need to check if array will still be degraded after recovery/resync
> > +     /* Check once whether array will still be degraded after recovery/resync.
> >        * Note in case of > 1 drive failures it's possible we're rebuilding
> >        * one drive while leaving another faulty drive in array.
> >        */
> > @@ -6640,13 +6632,42 @@ static inline sector_t raid5_sync_request(...)
> >                       still_degraded = true;
> >       }
> >
> > +     /* First stripe: block if stripe cache is full, then throttle. */
> > +     sh = raid5_get_active_stripe(conf, NULL, sector_nr, R5_GAS_NOBLOCK);
> > +     if (sh == NULL) {
> > +             sh = raid5_get_active_stripe(conf, NULL, sector_nr, 0);
> > +             /* make sure we don't swamp the stripe cache if someone else
> > +              * is trying to get access
> > +              */
> > +             schedule_timeout_uninterruptible(1);
> > +     }
>
> [Severity: High]
> Does moving the degraded check before the stripe allocation introduce a
> time-of-check to time-of-use race?
>
> If a drive fails during the potential sleep in raid5_get_active_stripe() or
> schedule_timeout_uninterruptible(), the stale still_degraded value is used
> below to update the bitmap. This could incorrectly clear the NEEDED bit,
> potentially leaving parity inconsistent if the system crashes before the sync
> completes.
>
> >       md_bitmap_start_sync(mddev, sector_nr, &sync_blocks, still_degraded);
> >       set_bit(STRIPE_SYNC_REQUESTED, &sh->state);
> >       set_bit(STRIPE_HANDLE, &sh->state);
>
> [ ... ]
>
> > +     win_sector = sector_nr + RAID5_STRIPE_SECTORS(conf);
> > +     for (submitted = 1;
> > +          submitted < RAID5_SYNC_WINDOW && win_sector < max_sector &&
> > +          win_sector < mddev->resync_max;
> > +          submitted++, win_sector += RAID5_STRIPE_SECTORS(conf)) {
> > +             sh = raid5_get_active_stripe(conf, NULL, win_sector,
> > +                                          R5_GAS_NOBLOCK);
> > +             if (!sh)
> > +                     break;
> > +             md_bitmap_start_sync(mddev, win_sector, &sync_blocks,
> > +                                  still_degraded);
> > +             set_bit(STRIPE_SYNC_REQUESTED, &sh->state);
> > +             set_bit(STRIPE_HANDLE, &sh->state);
>
> [Severity: High]
> Does this batch loop unnecessarily synchronize clean stripes?
>
> The boolean return value of md_bitmap_start_sync() is ignored here. If the
> first stripe needs sync but subsequent stripes in the window are clean, this
> loop unconditionally queues every stripe for synchronization regardless of
> its clean/dirty state, forcing read-rebuild-write cycles on clean stripes.
>
> > +             raid5_release_stripe(sh);
> > +     }
> > +
> > +     return submitted * RAID5_STRIPE_SECTORS(conf);
> >  }
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260624155452.211646-1-nishidafmly@gmail.com?part=5



-- 
Hiroshi Nishida
nishidafmly@gmail.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 6/8] md/raid5: allocate worker groups per NUMA node
  2026-06-24 15:54 [PATCH 0/8] md/raid5: scalability and rebuild-path improvements Hiroshi Nishida
                   ` (4 preceding siblings ...)
  2026-06-24 15:54 ` [PATCH 5/8] md/raid5: submit a window of stripes during resync/recovery Hiroshi Nishida
@ 2026-06-24 15:54 ` Hiroshi Nishida
  2026-06-24 16:07   ` sashiko-bot
  2026-06-24 15:54 ` [PATCH 7/8] md/raid5: raise MAX_STRIPE_BATCH from 8 to 32 Hiroshi Nishida
  2026-06-24 15:54 ` [PATCH 8/8] md/raid5: reserve stripe cache for user I/O during rebuild Hiroshi Nishida
  7 siblings, 1 reply; 19+ messages in thread
From: Hiroshi Nishida @ 2026-06-24 15:54 UTC (permalink / raw)
  To: Song Liu, Yu Kuai
  Cc: Li Nan, Xiao Ni, linux-raid, linux-kernel, Hiroshi Nishida

alloc_thread_groups() previously allocated all r5worker arrays in a
single kcalloc() block, assigning workers for NUMA node N from node 0
memory.  On multi-socket systems this causes remote memory traffic on
every worker->work and worker->temp_inactive_list access.

Replace the single allocation with kzalloc_node(size, GFP_NOIO, i) per
group so each node's workers live in local memory.  Because the workers
are now separate per-node allocations, both free sites --
free_thread_groups() and the reallocation path in
raid5_store_group_thread_cnt() -- are updated to free each group's
allocation individually instead of only group 0's.

Also fix a latent bug: the original kcalloc() had its nmemb and size
arguments swapped (harmless due to commutativity but semantically wrong).

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Hiroshi Nishida <nishidafmly@gmail.com>
---
 drivers/md/raid5.c | 39 ++++++++++++++++++++++++++-------------
 1 file changed, 26 insertions(+), 13 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 8e9edaaca667..c8787ab7b309 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7297,8 +7297,12 @@ raid5_store_group_thread_cnt(struct mddev *mddev, const char *page, size_t len)
 			conf->worker_groups = new_groups;
 			spin_unlock_irq(&conf->device_lock);
 
-			if (old_groups)
-				kfree(old_groups[0].workers);
+			if (old_groups) {
+				int node;
+
+				for (node = 0; node < num_possible_nodes(); node++)
+					kfree(old_groups[node].workers);
+			}
 			kfree(old_groups);
 		}
 	}
@@ -7336,7 +7340,6 @@ static int alloc_thread_groups(struct r5conf *conf, int cnt, int *group_cnt,
 {
 	int i, j, k;
 	ssize_t size;
-	struct r5worker *workers;
 
 	if (cnt == 0) {
 		*group_cnt = 0;
@@ -7344,24 +7347,24 @@ static int alloc_thread_groups(struct r5conf *conf, int cnt, int *group_cnt,
 		return 0;
 	}
 	*group_cnt = num_possible_nodes();
-	size = sizeof(struct r5worker) * cnt;
-	workers = kcalloc(size, *group_cnt, GFP_NOIO);
 	*worker_groups = kzalloc_objs(struct r5worker_group, *group_cnt,
 				      GFP_NOIO);
-	if (!*worker_groups || !workers) {
-		kfree(workers);
-		kfree(*worker_groups);
+	if (!*worker_groups)
 		return -ENOMEM;
-	}
 
+	size = sizeof(struct r5worker) * cnt;
 	for (i = 0; i < *group_cnt; i++) {
-		struct r5worker_group *group;
+		struct r5worker_group *group = &(*worker_groups)[i];
+		struct r5worker *workers;
+
+		workers = kzalloc_node(size, GFP_NOIO, i);
+		if (!workers)
+			goto out_free;
 
-		group = &(*worker_groups)[i];
 		INIT_LIST_HEAD(&group->handle_list);
 		INIT_LIST_HEAD(&group->loprio_list);
 		group->conf = conf;
-		group->workers = workers + i * cnt;
+		group->workers = workers;
 
 		for (j = 0; j < cnt; j++) {
 			struct r5worker *worker = group->workers + j;
@@ -7374,12 +7377,22 @@ static int alloc_thread_groups(struct r5conf *conf, int cnt, int *group_cnt,
 	}
 
 	return 0;
+
+out_free:
+	while (--i >= 0)
+		kfree((*worker_groups)[i].workers);
+	kfree(*worker_groups);
+	*worker_groups = NULL;
+	return -ENOMEM;
 }
 
 static void free_thread_groups(struct r5conf *conf)
 {
+	int i;
+
 	if (conf->worker_groups)
-		kfree(conf->worker_groups[0].workers);
+		for (i = 0; i < conf->group_cnt; i++)
+			kfree(conf->worker_groups[i].workers);
 	kfree(conf->worker_groups);
 	conf->worker_groups = NULL;
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 6/8] md/raid5: allocate worker groups per NUMA node
  2026-06-24 15:54 ` [PATCH 6/8] md/raid5: allocate worker groups per NUMA node Hiroshi Nishida
@ 2026-06-24 16:07   ` sashiko-bot
  2026-06-24 16:53     ` Hiroshi Nishida
  0 siblings, 1 reply; 19+ messages in thread
From: sashiko-bot @ 2026-06-24 16:07 UTC (permalink / raw)
  To: Hiroshi Nishida; +Cc: linux-raid, yukuai

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

New issues:
- [High] The patch incorrectly uses a sequential loop index `i` as the NUMA node ID parameter to `kzalloc_node()`.

Pre-existing issues:
- [High] The `worker_groups` array is sized using `num_possible_nodes()` but is indexed by the raw NUMA node ID, leading to out-of-bounds accesses on systems with sparse NUMA node IDs.
--

commit bce08bf6ce446b00da625ff4028f5782b756e17c
Author: Hiroshi Nishida <nishidafmly@gmail.com>

md/raid5: allocate worker groups per NUMA node

alloc_thread_groups() previously allocated all r5worker arrays in a
single kcalloc() block, assigning workers for NUMA node N from node 0
memory.  On multi-socket systems this causes remote memory traffic on
every worker->work and worker->temp_inactive_list access.

> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 8e9edaaca667b..c8787ab7b309c 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
[ ... ]
> @@ -7336,7 +7340,6 @@ static int alloc_thread_groups(struct r5conf *conf, int cnt, int *group_cnt,
[ ... ]
>  	*group_cnt = num_possible_nodes();
> -	size = sizeof(struct r5worker) * cnt;
> -	workers = kcalloc(size, *group_cnt, GFP_NOIO);
>  	*worker_groups = kzalloc_objs(struct r5worker_group, *group_cnt,
>  				      GFP_NOIO);

[Severity: High]
This isn't a bug introduced by this patch, but could this array sizing cause
an out-of-bounds access on systems with sparse NUMA nodes?

We size the worker_groups array using num_possible_nodes(), which is a count.
On systems with sparse NUMA node IDs (for example, nodes 0 and 8), the array
would only have 2 elements. Later, in raid5_wakeup_stripe_thread(), we index
into this array using cpu_to_group(cpu) which expands to cpu_to_node(cpu).
If a CPU belongs to node 8, we would access conf->worker_groups[8], exceeding
the allocated size and corrupting memory.

> -	if (!*worker_groups || !workers) {
> -		kfree(workers);
> -		kfree(*worker_groups);
> +	if (!*worker_groups)
>  		return -ENOMEM;
> -	}
>  
> +	size = sizeof(struct r5worker) * cnt;
>  	for (i = 0; i < *group_cnt; i++) {
> -		struct r5worker_group *group;
> +		struct r5worker_group *group = &(*worker_groups)[i];
> +		struct r5worker *workers;
> +
> +		workers = kzalloc_node(size, GFP_NOIO, i);

[Severity: High]
Could passing the sequential loop index i directly to kzalloc_node cause a
NULL pointer dereference on systems with sparse NUMA nodes?

Since num_possible_nodes returns a count rather than an upper bound on IDs,
on a system with nodes 0 and 8, i would be evaluated as 1. Passing an
unpopulated node ID to kzalloc_node without validation can lead to a crash
when accessing NODE_DATA(nid). Should this loop use
for_each_node_state(nid, N_MEMORY) or similar to get the actual node ID
instead of using the loop counter?

> +		if (!workers)
> +			goto out_free;

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260624155452.211646-1-nishidafmly@gmail.com?part=6

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 6/8] md/raid5: allocate worker groups per NUMA node
  2026-06-24 16:07   ` sashiko-bot
@ 2026-06-24 16:53     ` Hiroshi Nishida
  0 siblings, 0 replies; 19+ messages in thread
From: Hiroshi Nishida @ 2026-06-24 16:53 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: linux-raid, yukuai

> [High] kzalloc_node(size, GFP_NOIO, i) -- i is a sequential index, not a node ID
> [High, pre-existing] worker_groups sized by num_possible_nodes() but indexed
> by node ID -> OOB on sparse NUMA

Both are valid.
The kzalloc_node(..., i) is wrong on sparse NUMA node IDs -- i runs
0..num_possible_nodes()-1, which only equals the node ID when nodes
are dense -- and it rides on the pre-existing assumption that the
worker_groups array (sized by the node *count*) can be indexed by
cpu_to_node().

The NUMA-locality win here is multi-socket-only and I don't have a
measurement to justify it, so rather than grow this into a sparse-NUMA
correctness fix I'll
drop this patch from v2.
The pre-existing worker_groups sizing/indexing issue is real and
separate; thanks for catching it.

Thanks,


2026年6月24日(水) 9:07 <sashiko-bot@kernel.org>:
>
> Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
>
> New issues:
> - [High] The patch incorrectly uses a sequential loop index `i` as the NUMA node ID parameter to `kzalloc_node()`.
>
> Pre-existing issues:
> - [High] The `worker_groups` array is sized using `num_possible_nodes()` but is indexed by the raw NUMA node ID, leading to out-of-bounds accesses on systems with sparse NUMA node IDs.
> --
>
> commit bce08bf6ce446b00da625ff4028f5782b756e17c
> Author: Hiroshi Nishida <nishidafmly@gmail.com>
>
> md/raid5: allocate worker groups per NUMA node
>
> alloc_thread_groups() previously allocated all r5worker arrays in a
> single kcalloc() block, assigning workers for NUMA node N from node 0
> memory.  On multi-socket systems this causes remote memory traffic on
> every worker->work and worker->temp_inactive_list access.
>
> > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > index 8e9edaaca667b..c8787ab7b309c 100644
> > --- a/drivers/md/raid5.c
> > +++ b/drivers/md/raid5.c
> [ ... ]
> > @@ -7336,7 +7340,6 @@ static int alloc_thread_groups(struct r5conf *conf, int cnt, int *group_cnt,
> [ ... ]
> >       *group_cnt = num_possible_nodes();
> > -     size = sizeof(struct r5worker) * cnt;
> > -     workers = kcalloc(size, *group_cnt, GFP_NOIO);
> >       *worker_groups = kzalloc_objs(struct r5worker_group, *group_cnt,
> >                                     GFP_NOIO);
>
> [Severity: High]
> This isn't a bug introduced by this patch, but could this array sizing cause
> an out-of-bounds access on systems with sparse NUMA nodes?
>
> We size the worker_groups array using num_possible_nodes(), which is a count.
> On systems with sparse NUMA node IDs (for example, nodes 0 and 8), the array
> would only have 2 elements. Later, in raid5_wakeup_stripe_thread(), we index
> into this array using cpu_to_group(cpu) which expands to cpu_to_node(cpu).
> If a CPU belongs to node 8, we would access conf->worker_groups[8], exceeding
> the allocated size and corrupting memory.
>
> > -     if (!*worker_groups || !workers) {
> > -             kfree(workers);
> > -             kfree(*worker_groups);
> > +     if (!*worker_groups)
> >               return -ENOMEM;
> > -     }
> >
> > +     size = sizeof(struct r5worker) * cnt;
> >       for (i = 0; i < *group_cnt; i++) {
> > -             struct r5worker_group *group;
> > +             struct r5worker_group *group = &(*worker_groups)[i];
> > +             struct r5worker *workers;
> > +
> > +             workers = kzalloc_node(size, GFP_NOIO, i);
>
> [Severity: High]
> Could passing the sequential loop index i directly to kzalloc_node cause a
> NULL pointer dereference on systems with sparse NUMA nodes?
>
> Since num_possible_nodes returns a count rather than an upper bound on IDs,
> on a system with nodes 0 and 8, i would be evaluated as 1. Passing an
> unpopulated node ID to kzalloc_node without validation can lead to a crash
> when accessing NODE_DATA(nid). Should this loop use
> for_each_node_state(nid, N_MEMORY) or similar to get the actual node ID
> instead of using the loop counter?
>
> > +             if (!workers)
> > +                     goto out_free;
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260624155452.211646-1-nishidafmly@gmail.com?part=6



-- 
Hiroshi Nishida
nishidafmly@gmail.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 7/8] md/raid5: raise MAX_STRIPE_BATCH from 8 to 32
  2026-06-24 15:54 [PATCH 0/8] md/raid5: scalability and rebuild-path improvements Hiroshi Nishida
                   ` (5 preceding siblings ...)
  2026-06-24 15:54 ` [PATCH 6/8] md/raid5: allocate worker groups per NUMA node Hiroshi Nishida
@ 2026-06-24 15:54 ` Hiroshi Nishida
  2026-06-24 16:09   ` sashiko-bot
  2026-06-24 15:54 ` [PATCH 8/8] md/raid5: reserve stripe cache for user I/O during rebuild Hiroshi Nishida
  7 siblings, 1 reply; 19+ messages in thread
From: Hiroshi Nishida @ 2026-06-24 15:54 UTC (permalink / raw)
  To: Song Liu, Yu Kuai
  Cc: Li Nan, Xiao Ni, linux-raid, linux-kernel, Hiroshi Nishida

handle_active_stripes() dequeues up to MAX_STRIPE_BATCH stripes under
device_lock per pass, then processes them after dropping the lock.
Raising the batch from 8 to 32 amortizes each device_lock acquisition
over more stripes, reducing lock cycles per unit of stripe work.

The worker-spawn heuristic in raid5_wakeup_stripe_thread() also divided
by MAX_STRIPE_BATCH; split it into a separate STRIPE_BATCH_WORKERS=8 so
the spawn sensitivity is preserved unchanged while the batch size grows.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Hiroshi Nishida <nishidafmly@gmail.com>
---
 drivers/md/raid5.c | 2 +-
 drivers/md/raid5.h | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c8787ab7b309..ad6230415af3 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -210,7 +210,7 @@ static void raid5_wakeup_stripe_thread(struct stripe_head *sh)
 	/* at least one worker should run to avoid race */
 	queue_work_on(sh->cpu, raid5_wq, &group->workers[0].work);
 
-	thread_cnt = group->stripes_cnt / MAX_STRIPE_BATCH - 1;
+	thread_cnt = group->stripes_cnt / STRIPE_BATCH_WORKERS - 1;
 	/* wakeup more workers */
 	for (i = 1; i < conf->worker_cnt_per_group && thread_cnt > 0; i++) {
 		if (group->workers[i].working == false) {
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 7aeba1fc7f09..1f37dabd727b 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -496,7 +496,8 @@ struct disk_info {
 #define BYPASS_THRESHOLD	1
 #define NR_HASH			(PAGE_SIZE / sizeof(struct hlist_head))
 #define HASH_MASK		(NR_HASH - 1)
-#define MAX_STRIPE_BATCH	8
+#define MAX_STRIPE_BATCH	32	/* stripes per handle_active_stripes pass */
+#define STRIPE_BATCH_WORKERS	8	/* stripes-per-worker threshold for spawning */
 #define RAID5_SYNC_WINDOW	32	/* stripes to pre-submit per sync_request call */
 
 /* NR_STRIPE_HASH_LOCKS must be a power of two, since
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 7/8] md/raid5: raise MAX_STRIPE_BATCH from 8 to 32
  2026-06-24 15:54 ` [PATCH 7/8] md/raid5: raise MAX_STRIPE_BATCH from 8 to 32 Hiroshi Nishida
@ 2026-06-24 16:09   ` sashiko-bot
  2026-06-24 17:01     ` Hiroshi Nishida
  0 siblings, 1 reply; 19+ messages in thread
From: sashiko-bot @ 2026-06-24 16:09 UTC (permalink / raw)
  To: Hiroshi Nishida; +Cc: yukuai, linux-raid

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] Increasing MAX_STRIPE_BATCH to 32 without proportionally increasing STRIPE_BATCH_WORKERS serializes stripe processing and causes worker starvation.
--

commit b1a0a169dabd8b46b6b9ca40375034cf879934c6
Author: Hiroshi Nishida <nishidafmly@gmail.com>

md/raid5: raise MAX_STRIPE_BATCH from 8 to 32

This commit increases the maximum stripe batch size from 8 to 32 to
amortize device_lock acquisition over more stripes. It also introduces
STRIPE_BATCH_WORKERS to decouple the worker spawning threshold from the
batch size.

> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index c8787ab7b309c..ad6230415af33 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
[ ... ]
> @@ -210,7 +210,7 @@ static void raid5_wakeup_stripe_thread(struct stripe_head *sh)
>  	/* at least one worker should run to avoid race */
>  	queue_work_on(sh->cpu, raid5_wq, &group->workers[0].work);
>  
> -	thread_cnt = group->stripes_cnt / MAX_STRIPE_BATCH - 1;
> +	thread_cnt = group->stripes_cnt / STRIPE_BATCH_WORKERS - 1;
>  	/* wakeup more workers */
>  	for (i = 1; i < conf->worker_cnt_per_group && thread_cnt > 0; i++) {

[Severity: High]
Will increasing the stripe batch size while keeping the worker spawn threshold
small serialize stripe processing and starve the worker threads?

With this change, raid5_wakeup_stripe_thread() still wakes up an additional
worker for every 8 pending stripes (e.g., waking 4 threads for 32 stripes).

However, the first awakened worker executing handle_active_stripes() will
greedily dequeue up to MAX_STRIPE_BATCH (32) stripes under the lock:

drivers/md/raid5.c:handle_active_stripes() {
    ...
    while (batch_size < MAX_STRIPE_BATCH &&
           (sh = __get_priority_stripe(conf, group)) != NULL)
        batch[batch_size++] = sh;
    ...
}

It appears the first thread will steal all 32 pending stripes and process
them sequentially, while the other awakened threads acquire the lock,
find the lists empty, and exit without doing any work.

Could this cause a performance regression by serializing CPU-intensive parity
operations onto a single thread, while also creating unnecessary lock
contention from the starved threads?

> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index 7aeba1fc7f09e..1f37dabd727be 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
[ ... ]
> @@ -496,7 +496,8 @@ struct disk_info {
>  #define BYPASS_THRESHOLD	1
>  #define NR_HASH			(PAGE_SIZE / sizeof(struct hlist_head))
>  #define HASH_MASK		(NR_HASH - 1)
> -#define MAX_STRIPE_BATCH	8
> +#define MAX_STRIPE_BATCH	32	/* stripes per handle_active_stripes pass */
> +#define STRIPE_BATCH_WORKERS	8	/* stripes-per-worker threshold for spawning */
>  #define RAID5_SYNC_WINDOW	32	/* stripes to pre-submit per sync_request call */

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260624155452.211646-1-nishidafmly@gmail.com?part=7

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 7/8] md/raid5: raise MAX_STRIPE_BATCH from 8 to 32
  2026-06-24 16:09   ` sashiko-bot
@ 2026-06-24 17:01     ` Hiroshi Nishida
  0 siblings, 0 replies; 19+ messages in thread
From: Hiroshi Nishida @ 2026-06-24 17:01 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: yukuai, linux-raid

 > [High] Raising MAX_STRIPE_BATCH to 32 while keeping the spawn threshold at 8
 > lets the first worker grab all 32 stripes and starves the others.

You're right.  In stock, MAX_STRIPE_BATCH does double duty -- it caps
the greedy dequeue
in handle_active_stripes() AND is the spawn divisor in
raid5_wakeup_stripe_thread() --
and keeping those equal is what balances the workers woken against the
work each one takes.
My patch grew the dequeue cap to 32 but left the spawn divisor at 8,
so for ~32 pending stripes it wakes 4 workerswhile the first greedily
takes all 32 and
the rest find empty lists -- serializing the batch and adding
pointless lock contention, exactly as you describe.

The lock-amortization benefit was never isolated, and re-coupling the two (spawn
divisor = batch) just trades it for coarser worker granularity, so
I'll drop this
patch from v2 rather than guess at the tuning.

Thanks,

2026年6月24日(水) 9:09 <sashiko-bot@kernel.org>:
>
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> - [High] Increasing MAX_STRIPE_BATCH to 32 without proportionally increasing STRIPE_BATCH_WORKERS serializes stripe processing and causes worker starvation.
> --
>
> commit b1a0a169dabd8b46b6b9ca40375034cf879934c6
> Author: Hiroshi Nishida <nishidafmly@gmail.com>
>
> md/raid5: raise MAX_STRIPE_BATCH from 8 to 32
>
> This commit increases the maximum stripe batch size from 8 to 32 to
> amortize device_lock acquisition over more stripes. It also introduces
> STRIPE_BATCH_WORKERS to decouple the worker spawning threshold from the
> batch size.
>
> > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > index c8787ab7b309c..ad6230415af33 100644
> > --- a/drivers/md/raid5.c
> > +++ b/drivers/md/raid5.c
> [ ... ]
> > @@ -210,7 +210,7 @@ static void raid5_wakeup_stripe_thread(struct stripe_head *sh)
> >       /* at least one worker should run to avoid race */
> >       queue_work_on(sh->cpu, raid5_wq, &group->workers[0].work);
> >
> > -     thread_cnt = group->stripes_cnt / MAX_STRIPE_BATCH - 1;
> > +     thread_cnt = group->stripes_cnt / STRIPE_BATCH_WORKERS - 1;
> >       /* wakeup more workers */
> >       for (i = 1; i < conf->worker_cnt_per_group && thread_cnt > 0; i++) {
>
> [Severity: High]
> Will increasing the stripe batch size while keeping the worker spawn threshold
> small serialize stripe processing and starve the worker threads?
>
> With this change, raid5_wakeup_stripe_thread() still wakes up an additional
> worker for every 8 pending stripes (e.g., waking 4 threads for 32 stripes).
>
> However, the first awakened worker executing handle_active_stripes() will
> greedily dequeue up to MAX_STRIPE_BATCH (32) stripes under the lock:
>
> drivers/md/raid5.c:handle_active_stripes() {
>     ...
>     while (batch_size < MAX_STRIPE_BATCH &&
>            (sh = __get_priority_stripe(conf, group)) != NULL)
>         batch[batch_size++] = sh;
>     ...
> }
>
> It appears the first thread will steal all 32 pending stripes and process
> them sequentially, while the other awakened threads acquire the lock,
> find the lists empty, and exit without doing any work.
>
> Could this cause a performance regression by serializing CPU-intensive parity
> operations onto a single thread, while also creating unnecessary lock
> contention from the starved threads?
>
> > diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> > index 7aeba1fc7f09e..1f37dabd727be 100644
> > --- a/drivers/md/raid5.h
> > +++ b/drivers/md/raid5.h
> [ ... ]
> > @@ -496,7 +496,8 @@ struct disk_info {
> >  #define BYPASS_THRESHOLD     1
> >  #define NR_HASH                      (PAGE_SIZE / sizeof(struct hlist_head))
> >  #define HASH_MASK            (NR_HASH - 1)
> > -#define MAX_STRIPE_BATCH     8
> > +#define MAX_STRIPE_BATCH     32      /* stripes per handle_active_stripes pass */
> > +#define STRIPE_BATCH_WORKERS 8       /* stripes-per-worker threshold for spawning */
> >  #define RAID5_SYNC_WINDOW    32      /* stripes to pre-submit per sync_request call */
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260624155452.211646-1-nishidafmly@gmail.com?part=7



-- 
Hiroshi Nishida
nishidafmly@gmail.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 8/8] md/raid5: reserve stripe cache for user I/O during rebuild
  2026-06-24 15:54 [PATCH 0/8] md/raid5: scalability and rebuild-path improvements Hiroshi Nishida
                   ` (6 preceding siblings ...)
  2026-06-24 15:54 ` [PATCH 7/8] md/raid5: raise MAX_STRIPE_BATCH from 8 to 32 Hiroshi Nishida
@ 2026-06-24 15:54 ` Hiroshi Nishida
  2026-06-24 16:12   ` sashiko-bot
  7 siblings, 1 reply; 19+ messages in thread
From: Hiroshi Nishida @ 2026-06-24 15:54 UTC (permalink / raw)
  To: Song Liu, Yu Kuai
  Cc: Li Nan, Xiao Ni, linux-raid, linux-kernel, Hiroshi Nishida

The resync read-ahead window (RAID5_SYNC_WINDOW) can fill the stripe
cache with rebuild stripes and starve concurrent user I/O, producing a
burst-starvation flip-flop between rebuild and application throughput.

Add two yield points to the window-submission loop:
 - stop the window immediately if any thread is waiting for a stripe
   (waitqueue_active(&conf->wait_for_stripe)); the check is intentionally
   racy -- a waiter appearing just after is serviced by the next
   sync_request call, so no barrier is needed.
 - stop expanding once active_stripes reaches half the cache
   (max_nr_stripes / RAID5_SYNC_HWMARK), but only when
   preread_active_stripes > 0, i.e. user write I/O is actually competing.
   Sync stripes never set STRIPE_PREREAD_ACTIVE, so during a pure rebuild
   the counter stays zero and the window fills freely; rebuild-only
   throughput is unchanged.

This bounds the share of the stripe cache a rebuild may hold while user
I/O is present, so application latency no longer collapses during the
read-ahead bursts, without throttling a rebuild that has the array to
itself.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Hiroshi Nishida <nishidafmly@gmail.com>
---
 drivers/md/raid5.c | 21 +++++++++++++++++++++
 drivers/md/raid5.h |  1 +
 2 files changed, 22 insertions(+)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index ad6230415af3..480f3aa069ef 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6656,6 +6656,27 @@ static inline sector_t raid5_sync_request(struct mddev *mddev, sector_t sector_n
 	     submitted < RAID5_SYNC_WINDOW && win_sector < max_sector &&
 	     win_sector < mddev->resync_max;
 	     submitted++, win_sector += RAID5_STRIPE_SECTORS(conf)) {
+		/*
+		 * Yield to user I/O: stop the read-ahead if anyone is waiting
+		 * for a stripe.  The check is intentionally racy -- a waiter
+		 * appearing just after is serviced by the next sync_request
+		 * call, so no barrier is needed.
+		 */
+		if (waitqueue_active(&conf->wait_for_stripe))
+			break;
+		/*
+		 * Reserve cache for user I/O only when it is actually competing.
+		 * preread_active_stripes counts stripes queued for write I/O
+		 * (including the read phase of RMW); sync stripes never set
+		 * STRIPE_PREREAD_ACTIVE, so during a pure rebuild it stays zero
+		 * and the window fills freely.  Competing user reads do not bump
+		 * the counter but are caught by the waitqueue_active() check
+		 * above.
+		 */
+		if (atomic_read(&conf->preread_active_stripes) > 0 &&
+		    atomic_read(&conf->active_stripes) >=
+		    conf->max_nr_stripes / RAID5_SYNC_HWMARK)
+			break;
 		sh = raid5_get_active_stripe(conf, NULL, win_sector,
 					     R5_GAS_NOBLOCK);
 		if (!sh)
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 1f37dabd727b..7833cc07597f 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -499,6 +499,7 @@ struct disk_info {
 #define MAX_STRIPE_BATCH	32	/* stripes per handle_active_stripes pass */
 #define STRIPE_BATCH_WORKERS	8	/* stripes-per-worker threshold for spawning */
 #define RAID5_SYNC_WINDOW	32	/* stripes to pre-submit per sync_request call */
+#define RAID5_SYNC_HWMARK	2	/* rebuild uses at most 1/N of stripe cache */
 
 /* NR_STRIPE_HASH_LOCKS must be a power of two, since
  * STRIPE_HASH_LOCKS_MASK masks with (NR_STRIPE_HASH_LOCKS - 1).
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 8/8] md/raid5: reserve stripe cache for user I/O during rebuild
  2026-06-24 15:54 ` [PATCH 8/8] md/raid5: reserve stripe cache for user I/O during rebuild Hiroshi Nishida
@ 2026-06-24 16:12   ` sashiko-bot
  2026-06-24 17:25     ` Hiroshi Nishida
  0 siblings, 1 reply; 19+ messages in thread
From: sashiko-bot @ 2026-06-24 16:12 UTC (permalink / raw)
  To: Hiroshi Nishida; +Cc: linux-raid, yukuai

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] The patch fails to bound the rebuild's share of the stripe cache to 50%. Breaking the read-ahead batch loop just causes the outer md_do_sync() loop to allocate stripes one-by-one, relentlessly filling the cache to 100% anyway.
--

commit 29ed9bbd7953c5d6ceeb0158cc53c81fd4d3aaf9
Author: Hiroshi Nishida <nishidafmly@gmail.com>

md/raid5: reserve stripe cache for user I/O during rebuild

This commit attempts to prevent rebuild operations from monopolizing the
stripe cache and starving concurrent user I/O. It adds yield points in the
resync read-ahead loop to stop submitting stripes if there are active
waiters or if the cache is more than half full.

> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -6656,6 +6656,27 @@ static inline sector_t raid5_sync_request(struct mddev *mddev, sector_t sector_n
>  	     submitted < RAID5_SYNC_WINDOW && win_sector < max_sector &&
>  	     win_sector < mddev->resync_max;
>  	     submitted++, win_sector += RAID5_STRIPE_SECTORS(conf)) {

[ ... ]

> +		/*
> +		 * Reserve cache for user I/O only when it is actually competing.
> +		 * preread_active_stripes counts stripes queued for write I/O
> +		 * (including the read phase of RMW); sync stripes never set
> +		 * STRIPE_PREREAD_ACTIVE, so during a pure rebuild it stays zero
> +		 * and the window fills freely.  Competing user reads do not bump
> +		 * the counter but are caught by the waitqueue_active() check
> +		 * above.
> +		 */
> +		if (atomic_read(&conf->preread_active_stripes) > 0 &&
> +		    atomic_read(&conf->active_stripes) >=
> +		    conf->max_nr_stripes / RAID5_SYNC_HWMARK)
> +			break;

[Severity: High]
Does breaking out of the read-ahead loop here actually limit the cache
usage to 50%?

When this breaks, raid5_sync_request() returns control to md_do_sync(),
which operates in a loop and will immediately call raid5_sync_request() 
again for the next sector.

Because raid5_sync_request() unconditionally allocates its first stripe
earlier in the function before entering this batch loop, won't the outer
loop just continue allocating stripes one-by-one until the cache reaches
100% capacity and triggers the fallback blocking path anyway?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260624155452.211646-1-nishidafmly@gmail.com?part=8

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 8/8] md/raid5: reserve stripe cache for user I/O during rebuild
  2026-06-24 16:12   ` sashiko-bot
@ 2026-06-24 17:25     ` Hiroshi Nishida
  0 siblings, 0 replies; 19+ messages in thread
From: Hiroshi Nishida @ 2026-06-24 17:25 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: linux-raid, yukuai

> [High] Breaking the read-ahead loop doesn't bound the rebuild to 50% -- the
> outer md_do_sync() loop just calls raid5_sync_request() again and the
> unconditional first-stripe allocation keeps filling the cache to 100%.

You're right, and it defeats the patch's purpose.
The 50% check only gates the window batch; the first stripe of each
raid5_sync_request() call is allocated
unconditionally before the loop, so md_do_sync() keeps driving one
stripe per call past 50%
until the cache is full and the existing blocking get throttles at ~100% anyway.
The "reserve half the cache for user I/O" claim doesn't hold.

Making it actually bound the rebuild would mean gating the
first-stripe allocation and
having raid5_sync_request() yield without making progress --
effectively reworking the sync throttle, which is more than this
change should take on,
and the user-latency benefit was never something I could measure
cleanly.
So I'll drop this patch from v2.  (It's coupled to the read-ahead
window patch, which I'll keep and fix per the other thread.)

Thanks,

2026年6月24日(水) 9:12 <sashiko-bot@kernel.org>:

>
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> - [High] The patch fails to bound the rebuild's share of the stripe cache to 50%. Breaking the read-ahead batch loop just causes the outer md_do_sync() loop to allocate stripes one-by-one, relentlessly filling the cache to 100% anyway.
> --
>
> commit 29ed9bbd7953c5d6ceeb0158cc53c81fd4d3aaf9
> Author: Hiroshi Nishida <nishidafmly@gmail.com>
>
> md/raid5: reserve stripe cache for user I/O during rebuild
>
> This commit attempts to prevent rebuild operations from monopolizing the
> stripe cache and starving concurrent user I/O. It adds yield points in the
> resync read-ahead loop to stop submitting stripes if there are active
> waiters or if the cache is more than half full.
>
> > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > --- a/drivers/md/raid5.c
> > +++ b/drivers/md/raid5.c
> > @@ -6656,6 +6656,27 @@ static inline sector_t raid5_sync_request(struct mddev *mddev, sector_t sector_n
> >            submitted < RAID5_SYNC_WINDOW && win_sector < max_sector &&
> >            win_sector < mddev->resync_max;
> >            submitted++, win_sector += RAID5_STRIPE_SECTORS(conf)) {
>
> [ ... ]
>
> > +             /*
> > +              * Reserve cache for user I/O only when it is actually competing.
> > +              * preread_active_stripes counts stripes queued for write I/O
> > +              * (including the read phase of RMW); sync stripes never set
> > +              * STRIPE_PREREAD_ACTIVE, so during a pure rebuild it stays zero
> > +              * and the window fills freely.  Competing user reads do not bump
> > +              * the counter but are caught by the waitqueue_active() check
> > +              * above.
> > +              */
> > +             if (atomic_read(&conf->preread_active_stripes) > 0 &&
> > +                 atomic_read(&conf->active_stripes) >=
> > +                 conf->max_nr_stripes / RAID5_SYNC_HWMARK)
> > +                     break;
>
> [Severity: High]
> Does breaking out of the read-ahead loop here actually limit the cache
> usage to 50%?
>
> When this breaks, raid5_sync_request() returns control to md_do_sync(),
> which operates in a loop and will immediately call raid5_sync_request()
> again for the next sector.
>
> Because raid5_sync_request() unconditionally allocates its first stripe
> earlier in the function before entering this batch loop, won't the outer
> loop just continue allocating stripes one-by-one until the cache reaches
> 100% capacity and triggers the fallback blocking path anyway?
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260624155452.211646-1-nishidafmly@gmail.com?part=8



--
Hiroshi Nishida
nishidafmly@gmail.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2026-06-24 17:25 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-24 15:54 [PATCH 0/8] md/raid5: scalability and rebuild-path improvements Hiroshi Nishida
2026-06-24 15:54 ` [PATCH 1/8] md: change chunk_sectors and stripe cache counts to unsigned int Hiroshi Nishida
2026-06-24 16:16   ` sashiko-bot
2026-06-24 17:25     ` Hiroshi Nishida
2026-06-24 15:54 ` [PATCH 2/8] md/raid5: raise stripe cache limit from 32768 to 262144 Hiroshi Nishida
2026-06-24 15:54 ` [PATCH 3/8] md: widen badblock sectors param from int to sector_t Hiroshi Nishida
2026-06-24 15:54 ` [PATCH 4/8] md/raid5: raise NR_STRIPE_HASH_LOCKS from 8 to 32 Hiroshi Nishida
2026-06-24 15:54 ` [PATCH 5/8] md/raid5: submit a window of stripes during resync/recovery Hiroshi Nishida
2026-06-24 16:12   ` sashiko-bot
2026-06-24 17:13     ` Hiroshi Nishida
2026-06-24 15:54 ` [PATCH 6/8] md/raid5: allocate worker groups per NUMA node Hiroshi Nishida
2026-06-24 16:07   ` sashiko-bot
2026-06-24 16:53     ` Hiroshi Nishida
2026-06-24 15:54 ` [PATCH 7/8] md/raid5: raise MAX_STRIPE_BATCH from 8 to 32 Hiroshi Nishida
2026-06-24 16:09   ` sashiko-bot
2026-06-24 17:01     ` Hiroshi Nishida
2026-06-24 15:54 ` [PATCH 8/8] md/raid5: reserve stripe cache for user I/O during rebuild Hiroshi Nishida
2026-06-24 16:12   ` sashiko-bot
2026-06-24 17:25     ` Hiroshi Nishida

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox