linux-kernel.vger.kernel.org archive mirror
* [PATCH v3 00/17] ext4: better scalability for ext4 block allocation
@ 2025-07-14 13:03 Baokun Li
  2025-07-14 13:03 ` [PATCH v3 01/17] ext4: add ext4_try_lock_group() to skip busy groups Baokun Li
                   ` (18 more replies)
  0 siblings, 19 replies; 44+ messages in thread
From: Baokun Li @ 2025-07-14 13:03 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, linux-kernel, ojaswin, julia.lawall,
	yi.zhang, yangerkun, libaokun1, libaokun

Changes since v2:
 * Collect RVB from Jan Kara. (Thanks for your review!)
 * Add patch 2.
 * Patch 4: Switch to READ_ONCE/WRITE_ONCE (clear single-process gain)
        instead of smp_load_acquire/smp_store_release (only a slight
        multi-process gain). (Suggested by Jan Kara)
 * Patch 5: The number of global goals is now the lesser of the CPU count
        and one-fourth of the group count. This prevents setting too many
        goals for small filesystems, which would scatter files across
        block groups. (Suggested by Jan Kara)
 * Patch 5: Directly use kfree() to release s_mb_last_groups instead of
        kvfree(). (Suggested by Julia Lawall)
 * Patch 11: Even without mb_optimize_scan enabled, we now always attempt
        to remove the group from the old order list. (Suggested by Jan Kara)
 * Patch 14-16: Added comments for clarity, refined logic, and removed
        obsolete variables.
 * Update performance test results and indicate raw disk write bandwidth. 

Thanks to Honza for your suggestions!

v2: https://lore.kernel.org/r/20250623073304.3275702-1-libaokun1@huawei.com

Changes since v1:
 * Patch 1: Prioritize checking if a group is busy to avoid unnecessary
       checks and buddy loading. (Thanks to Ojaswin for the suggestion!)
 * Patch 4: Using multiple global goals instead of moving the goal to the
       inode level. (Thanks to Honza for the suggestion!)
 * Collect RVB from Jan Kara and Ojaswin Mujoo. (Thanks for your review!)
 * Add patches 2, 3, and 7-16.
 * Due to a change of test server, the relevant test data has been refreshed.

v1: https://lore.kernel.org/r/20250523085821.1329392-1-libaokun@huaweicloud.com

Since servers have more and more CPUs, and we're running more containers
on them, we've been using will-it-scale to test how well ext4 scales. The
fallocate2 test (append 8KB until 1MB, truncate to 0, repeat), run
concurrently in 64 containers, revealed significant contention in block
allocation/freeing, leading to much lower average fallocate OPS per
container compared to a single container (see below).

containers |    1   |    2   |    4   |    8   |   16   |   32   |   64
-----------|--------|--------|--------|--------|--------|--------|-------
   OPS     | 295287 | 70665  | 33865  | 19387  | 10104  |  5588  |  3588
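
For reference, the core of this workload is roughly the loop below. This is
a simplified, hypothetical stand-in for will-it-scale's fallocate2 test
case, not the benchmark itself:

```c
/*
 * Simplified approximation of the fallocate2 workload: append 8KB with
 * fallocate() until the file reaches 1MB, then truncate it back to zero,
 * and repeat. File name and iteration count are made up for illustration.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "fallocate2-testfile";
	long iters = argc > 2 ? atol(argv[2]) : 100000;
	int fd = open(path, O_CREAT | O_RDWR, 0600);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	for (long i = 0; i < iters; i++) {
		/* append 8KB at a time until the file is 1MB */
		for (off_t size = 0; size < 1024 * 1024; size += 8192) {
			if (fallocate(fd, 0, size, 8192) < 0) {
				perror("fallocate");
				return 1;
			}
		}
		/* free all the blocks again */
		if (ftruncate(fd, 0) < 0) {
			perror("ftruncate");
			return 1;
		}
	}
	close(fd);
	return 0;
}
```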

Under this test scenario, the primary operations are block allocation
(fallocate) and block deallocation (truncate). The main bottlenecks for
these operations are the group lock and s_md_lock. Therefore, this patch
series primarily focuses on optimizing the code related to these two locks.

The following is a brief overview of the patches, see the patches for
more details.

Patch 1: Add ext4_try_lock_group() to skip busy groups to take advantage
of the large number of ext4 groups.

Patch 2: Separate stream goal hits from s_bal_goals in preparation for
the cleanup of s_mb_last_start.

Patches 3-5: Split stream allocation's global goal into multiple goals and
remove the unnecessary and expensive s_md_lock.

Patches 6-7: Minor cleanups.

Patch 8: Convert s_mb_free_pending to atomic_t and rely on memory barriers
for consistency, instead of the expensive s_md_lock.

Patch 9: When inserting free extents, first attempt to merge them with
already inserted extents to reduce s_md_lock contention.

Patch 10: Update bb_avg_fragment_size_order to -1 when a group runs out of
free blocks, eliminating efficiency-impacting "zombie groups".

Patch 11: Fix potential corruption of the largest free orders lists when
the mb_optimize_scan mount option is switched on or off.

Patches 12-17: Convert mb_optimize_scan's existing unordered list traversal
to ordered xarrays, making the traversal linear-like and thereby reducing
contention between block allocation and freeing.

"kvm-xfstests -c ext4/all -g auto" has been executed with no new failures.

Here are some performance test data for your reference:

Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.

|CPU: Kunpeng 920   |          P80           |            P1           |
|Memory: 512GB      |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 2667  | 20049 (+651%)  | 314065 | 316724 (+0.8%) |
|mb_optimize_scan=1 | 2643  | 19342 (+631%)  | 316344 | 328324 (+3.7%) |

|CPU: AMD 9654 * 2  |          P96           |             P1          |
|Memory: 1536GB     |------------------------|-------------------------|
|960GB SSD (1GB/s)  | base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 3450  | 52125 (+1410%) | 205851 | 215136 (+4.5%) |
|mb_optimize_scan=1 | 3209  | 50331 (+1468%) | 207373 | 209431 (+0.9%) |

Tests also evaluated this patch set's impact on fragmentation: a minor
increase in free space fragmentation for multi-process workloads, but a
significant decrease in file fragmentation:

Test Script:
```shell
#!/bin/bash

dir="/tmp/test"
disk="/dev/sda"

mkdir -p $dir

for scan in 0 1 ; do
    mkfs.ext4 -F -E lazy_itable_init=0,lazy_journal_init=0 \
              -O orphan_file $disk 200G
    mount -o mb_optimize_scan=$scan $disk $dir

    fio -directory=$dir -direct=1 -iodepth 128 -thread -ioengine=falloc \
        -rw=write -bs=4k -fallocate=none -numjobs=64 -file_append=1 \
        -size=1G -group_reporting -name=job1 -cpus_allowed_policy=split

    e2freefrag $disk
    e4defrag -c $dir # Without the patch, this could take 5-6 hours.
    filefrag ${dir}/job* | awk '{print $2}' | \
                           awk '{sum+=$1} END {print sum/NR}'
    umount $dir
done
```

Test results:
-------------------------------------------------------------|
                         |       base      |      patched    |
-------------------------|--------|--------|--------|--------|
mb_optimize_scan         | linear |opt_scan| linear |opt_scan|
-------------------------|--------|--------|--------|--------|
bw(MiB/s)                | 217    | 217    | 5718   | 5626   |
-------------------------|-----------------------------------|
Avg. free extent size(KB)| 1943732| 1943732| 1316212| 1171208|
Num. free extent         | 71     | 71     | 105    | 118    |
-------------------------------------------------------------|
Avg. extents per file    | 261967 | 261973 | 588    | 570    |
Avg. size per extent(KB) | 4      | 4      | 1780   | 1837   |
Fragmentation score      | 100    | 100    | 2      | 2      |
-------------------------------------------------------------|

Comments and questions are, as always, welcome.

Thanks,
Baokun

Baokun Li (17):
  ext4: add ext4_try_lock_group() to skip busy groups
  ext4: separate stream goal hits from s_bal_goals for better tracking
  ext4: remove unnecessary s_mb_last_start
  ext4: remove unnecessary s_md_lock on update s_mb_last_group
  ext4: utilize multiple global goals to reduce contention
  ext4: get rid of some obsolete EXT4_MB_HINT flags
  ext4: fix typo in CR_GOAL_LEN_SLOW comment
  ext4: convert sbi->s_mb_free_pending to atomic_t
  ext4: merge freed extent with existing extents before insertion
  ext4: fix zombie groups in average fragment size lists
  ext4: fix largest free orders lists corruption on mb_optimize_scan
    switch
  ext4: factor out __ext4_mb_scan_group()
  ext4: factor out ext4_mb_might_prefetch()
  ext4: factor out ext4_mb_scan_group()
  ext4: convert free groups order lists to xarrays
  ext4: refactor choose group to scan group
  ext4: implement linear-like traversal across order xarrays

 fs/ext4/balloc.c            |   2 +-
 fs/ext4/ext4.h              |  61 +--
 fs/ext4/mballoc.c           | 895 ++++++++++++++++++++----------------
 fs/ext4/mballoc.h           |   9 +-
 include/trace/events/ext4.h |   3 -
 5 files changed, 534 insertions(+), 436 deletions(-)

-- 
2.46.1


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v3 01/17] ext4: add ext4_try_lock_group() to skip busy groups
  2025-07-14 13:03 [PATCH v3 00/17] ext4: better scalability for ext4 block allocation Baokun Li
@ 2025-07-14 13:03 ` Baokun Li
  2025-07-17 10:09   ` Ojaswin Mujoo
  2025-07-17 22:28   ` Andi Kleen
  2025-07-14 13:03 ` [PATCH v3 02/17] ext4: separate stream goal hits from s_bal_goals for better tracking Baokun Li
                   ` (17 subsequent siblings)
  18 siblings, 2 replies; 44+ messages in thread
From: Baokun Li @ 2025-07-14 13:03 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, linux-kernel, ojaswin, julia.lawall,
	yi.zhang, yangerkun, libaokun1, libaokun

When ext4 allocates blocks, it used to just scan the block groups one by
one to find a good one. But when there are a huge number of block groups
(hundreds of thousands or even millions) and only a few of them have free
space (meaning they're mostly full), checking them all takes a really long
time and performance suffers. So, we added the "mb_optimize_scan" mount
option (which is on by default now). It keeps groups on lists ordered by
their free-space characteristics, so when we need a free block, we can
just grab a likely group from the right list. This saves time and makes
block allocation much faster.

But when multiple processes or containers are doing similar things, like
constantly allocating 8k blocks, they all try to use the same block group
from the same list. Even just two processes doing this can cut the
aggregate IOPS in half. For example, one container might do 300,000 IOPS,
but running two at the same time yields a total of only 150,000.

Since we can already scan block groups in a non-linear order, the first
and the last group in the same list are currently equally good candidates
for an allocation. Therefore, add an ext4_try_lock_group() helper function
that skips the current group when it is locked by another process, thereby
avoiding contention with other processes. This helps ext4 make better use
of having multiple block groups.

Also, to make sure we don't skip all the groups that have free space
when allocating blocks, we won't try to skip busy groups anymore when
ac_criteria is CR_ANY_FREE.
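
As a rough userspace analogy (a hypothetical sketch using pthread
spinlocks, not the ext4 code), the allocation path now behaves roughly
like this: try each candidate group without blocking, and only fall back
to a blocking lock on the CR_ANY_FREE-style last pass:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define NR_GROUPS 8

struct group {
	pthread_spinlock_t lock;
	int free_blocks;
};

static struct group groups[NR_GROUPS];

/* Returns the index of the group we allocated from, or -1. */
static int alloc_from_groups(bool last_pass)
{
	for (int i = 0; i < NR_GROUPS; i++) {
		struct group *g = &groups[i];

		if (last_pass) {
			/* CR_ANY_FREE-like pass: never skip, just wait. */
			pthread_spin_lock(&g->lock);
		} else if (pthread_spin_trylock(&g->lock) != 0) {
			/*
			 * Busy group: someone else is allocating here and
			 * another group is just as good, so move on.
			 */
			continue;
		}

		if (g->free_blocks > 0) {
			g->free_blocks--;
			pthread_spin_unlock(&g->lock);
			return i;
		}
		pthread_spin_unlock(&g->lock);
	}
	return -1;
}

int main(void)
{
	for (int i = 0; i < NR_GROUPS; i++) {
		pthread_spin_init(&groups[i].lock, PTHREAD_PROCESS_PRIVATE);
		groups[i].free_blocks = 4;
	}

	/* Try the non-blocking pass first, then the blocking one. */
	int g = alloc_from_groups(false);
	if (g < 0)
		g = alloc_from_groups(true);
	printf("allocated from group %d\n", g);
	return 0;
}
```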

Performance test data follows:

Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.

|CPU: Kunpeng 920   |          P80            |
|Memory: 512GB      |-------------------------|
|960GB SSD (0.5GB/s)| base  |    patched      |
|-------------------|-------|-----------------|
|mb_optimize_scan=0 | 2667  | 4821  (+80.7%)  |
|mb_optimize_scan=1 | 2643  | 4784  (+81.0%)  |

|CPU: AMD 9654 * 2  |          P96            |
|Memory: 1536GB     |-------------------------|
|960GB SSD (1GB/s)  | base  |    patched      |
|-------------------|-------|-----------------|
|mb_optimize_scan=0 | 3450  | 15371 (+345%)   |
|mb_optimize_scan=1 | 3209  | 6101  (+90.0%)  |

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h    | 23 ++++++++++++++---------
 fs/ext4/mballoc.c | 19 ++++++++++++++++---
 2 files changed, 30 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 18373de980f2..9df74123e7e6 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3541,23 +3541,28 @@ static inline int ext4_fs_is_busy(struct ext4_sb_info *sbi)
 	return (atomic_read(&sbi->s_lock_busy) > EXT4_CONTENTION_THRESHOLD);
 }
 
+static inline bool ext4_try_lock_group(struct super_block *sb, ext4_group_t group)
+{
+	if (!spin_trylock(ext4_group_lock_ptr(sb, group)))
+		return false;
+	/*
+	 * We're able to grab the lock right away, so drop the lock
+	 * contention counter.
+	 */
+	atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, -1, 0);
+	return true;
+}
+
 static inline void ext4_lock_group(struct super_block *sb, ext4_group_t group)
 {
-	spinlock_t *lock = ext4_group_lock_ptr(sb, group);
-	if (spin_trylock(lock))
-		/*
-		 * We're able to grab the lock right away, so drop the
-		 * lock contention counter.
-		 */
-		atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, -1, 0);
-	else {
+	if (!ext4_try_lock_group(sb, group)) {
 		/*
 		 * The lock is busy, so bump the contention counter,
 		 * and then wait on the spin lock.
 		 */
 		atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, 1,
 				  EXT4_MAX_CONTENTION);
-		spin_lock(lock);
+		spin_lock(ext4_group_lock_ptr(sb, group));
 	}
 }
 
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 1e98c5be4e0a..336d65c4f6a2 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -896,7 +896,8 @@ static void ext4_mb_choose_next_group_p2_aligned(struct ext4_allocation_context
 				    bb_largest_free_order_node) {
 			if (sbi->s_mb_stats)
 				atomic64_inc(&sbi->s_bal_cX_groups_considered[CR_POWER2_ALIGNED]);
-			if (likely(ext4_mb_good_group(ac, iter->bb_group, CR_POWER2_ALIGNED))) {
+			if (!spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group)) &&
+			    likely(ext4_mb_good_group(ac, iter->bb_group, CR_POWER2_ALIGNED))) {
 				*group = iter->bb_group;
 				ac->ac_flags |= EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED;
 				read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
@@ -932,7 +933,8 @@ ext4_mb_find_good_group_avg_frag_lists(struct ext4_allocation_context *ac, int o
 	list_for_each_entry(iter, frag_list, bb_avg_fragment_size_node) {
 		if (sbi->s_mb_stats)
 			atomic64_inc(&sbi->s_bal_cX_groups_considered[cr]);
-		if (likely(ext4_mb_good_group(ac, iter->bb_group, cr))) {
+		if (!spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group)) &&
+		    likely(ext4_mb_good_group(ac, iter->bb_group, cr))) {
 			grp = iter;
 			break;
 		}
@@ -2899,6 +2901,11 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 							nr, &prefetch_ios);
 			}
 
+			/* prevent unnecessary buddy loading. */
+			if (cr < CR_ANY_FREE &&
+			    spin_is_locked(ext4_group_lock_ptr(sb, group)))
+				continue;
+
 			/* This now checks without needing the buddy page */
 			ret = ext4_mb_good_group_nolock(ac, group, cr);
 			if (ret <= 0) {
@@ -2911,7 +2918,13 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 			if (err)
 				goto out;
 
-			ext4_lock_group(sb, group);
+			/* skip busy group */
+			if (cr >= CR_ANY_FREE) {
+				ext4_lock_group(sb, group);
+			} else if (!ext4_try_lock_group(sb, group)) {
+				ext4_mb_unload_buddy(&e4b);
+				continue;
+			}
 
 			/*
 			 * We need to check again after locking the
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 02/17] ext4: separate stream goal hits from s_bal_goals for better tracking
  2025-07-14 13:03 [PATCH v3 00/17] ext4: better scalability for ext4 block allocation Baokun Li
  2025-07-14 13:03 ` [PATCH v3 01/17] ext4: add ext4_try_lock_group() to skip busy groups Baokun Li
@ 2025-07-14 13:03 ` Baokun Li
  2025-07-17 10:29   ` Ojaswin Mujoo
  2025-07-14 13:03 ` [PATCH v3 03/17] ext4: remove unnecessary s_mb_last_start Baokun Li
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 44+ messages in thread
From: Baokun Li @ 2025-07-14 13:03 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, linux-kernel, ojaswin, julia.lawall,
	yi.zhang, yangerkun, libaokun1, libaokun

In ext4_mb_regular_allocator(), when the call to ext4_mb_find_by_goal()
fails to achieve the inode goal, allocation continues with the stream
allocation global goal. Currently, hits for both are combined in
sbi->s_bal_goals, hindering accurate optimization.

This commit separates global goal hits into sbi->s_bal_stream_goals. Since
stream allocation doesn't use ac->ac_g_ex.fe_start, set fe_start to -1.
This prevents stream allocations from being counted in s_bal_goals. Also
clear EXT4_MB_HINT_TRY_GOAL to avoid calling ext4_mb_find_by_goal again.

After adding `stream_goal_hits`, `/proc/fs/ext4/sdx/mb_stats` will show:

mballoc:
	reqs: 840347
	success: 750992
	groups_scanned: 1230506
	cr_p2_aligned_stats:
		hits: 21531
		groups_considered: 411664
		extents_scanned: 21531
		useless_loops: 0
		bad_suggestions: 6
	cr_goal_fast_stats:
		hits: 111222
		groups_considered: 1806728
		extents_scanned: 467908
		useless_loops: 0
		bad_suggestions: 13
	cr_best_avail_stats:
		hits: 36267
		groups_considered: 1817631
		extents_scanned: 156143
		useless_loops: 0
		bad_suggestions: 204
	cr_goal_slow_stats:
		hits: 106396
		groups_considered: 5671710
		extents_scanned: 22540056
		useless_loops: 123747
	cr_any_free_stats:
		hits: 138071
		groups_considered: 724692
		extents_scanned: 23615593
		useless_loops: 585
	extents_scanned: 46804261
		goal_hits: 1307
		stream_goal_hits: 236317
		len_goal_hits: 155549
		2^n_hits: 21531
		breaks: 225096
		lost: 35062
	buddies_generated: 40/40
	buddies_time_used: 48004
	preallocated: 5962467
	discarded: 4847560

Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/ext4.h    |  1 +
 fs/ext4/mballoc.c | 11 +++++++++--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9df74123e7e6..8750ace12935 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1646,6 +1646,7 @@ struct ext4_sb_info {
 	atomic_t s_bal_cX_ex_scanned[EXT4_MB_NUM_CRS];	/* total extents scanned */
 	atomic_t s_bal_groups_scanned;	/* number of groups scanned */
 	atomic_t s_bal_goals;	/* goal hits */
+	atomic_t s_bal_stream_goals;	/* stream allocation global goal hits */
 	atomic_t s_bal_len_goals;	/* len goal hits */
 	atomic_t s_bal_breaks;	/* too long searches */
 	atomic_t s_bal_2orders;	/* 2^order hits */
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 336d65c4f6a2..f56ac477c464 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2849,8 +2849,9 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 		/* TBD: may be hot point */
 		spin_lock(&sbi->s_md_lock);
 		ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
-		ac->ac_g_ex.fe_start = sbi->s_mb_last_start;
 		spin_unlock(&sbi->s_md_lock);
+		ac->ac_g_ex.fe_start = -1;
+		ac->ac_flags &= ~EXT4_MB_HINT_TRY_GOAL;
 	}
 
 	/*
@@ -3000,8 +3001,12 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 		}
 	}
 
-	if (sbi->s_mb_stats && ac->ac_status == AC_STATUS_FOUND)
+	if (sbi->s_mb_stats && ac->ac_status == AC_STATUS_FOUND) {
 		atomic64_inc(&sbi->s_bal_cX_hits[ac->ac_criteria]);
+		if (ac->ac_flags & EXT4_MB_STREAM_ALLOC &&
+		    ac->ac_b_ex.fe_group == ac->ac_g_ex.fe_group)
+			atomic_inc(&sbi->s_bal_stream_goals);
+	}
 out:
 	if (!err && ac->ac_status != AC_STATUS_FOUND && first_err)
 		err = first_err;
@@ -3194,6 +3199,8 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
 	seq_printf(seq, "\textents_scanned: %u\n",
 		   atomic_read(&sbi->s_bal_ex_scanned));
 	seq_printf(seq, "\t\tgoal_hits: %u\n", atomic_read(&sbi->s_bal_goals));
+	seq_printf(seq, "\t\tstream_goal_hits: %u\n",
+		   atomic_read(&sbi->s_bal_stream_goals));
 	seq_printf(seq, "\t\tlen_goal_hits: %u\n",
 		   atomic_read(&sbi->s_bal_len_goals));
 	seq_printf(seq, "\t\t2^n_hits: %u\n", atomic_read(&sbi->s_bal_2orders));
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 03/17] ext4: remove unnecessary s_mb_last_start
  2025-07-14 13:03 [PATCH v3 00/17] ext4: better scalability for ext4 block allocation Baokun Li
  2025-07-14 13:03 ` [PATCH v3 01/17] ext4: add ext4_try_lock_group() to skip busy groups Baokun Li
  2025-07-14 13:03 ` [PATCH v3 02/17] ext4: separate stream goal hits from s_bal_goals for better tracking Baokun Li
@ 2025-07-14 13:03 ` Baokun Li
  2025-07-17 10:31   ` Ojaswin Mujoo
  2025-07-14 13:03 ` [PATCH v3 04/17] ext4: remove unnecessary s_md_lock on update s_mb_last_group Baokun Li
                   ` (15 subsequent siblings)
  18 siblings, 1 reply; 44+ messages in thread
From: Baokun Li @ 2025-07-14 13:03 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, linux-kernel, ojaswin, julia.lawall,
	yi.zhang, yangerkun, libaokun1, libaokun

Since the stream allocation goal start is now always -1 and is no longer
read from sbi->s_mb_last_start, there is no need to save
ac->ac_f_ex.fe_start there, so remove the now-unnecessary
sbi->s_mb_last_start.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h    | 1 -
 fs/ext4/mballoc.c | 1 -
 2 files changed, 2 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 8750ace12935..b83095541c98 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1631,7 +1631,6 @@ struct ext4_sb_info {
 	unsigned int s_max_dir_size_kb;
 	/* where last allocation was done - for stream allocation */
 	unsigned long s_mb_last_group;
-	unsigned long s_mb_last_start;
 	unsigned int s_mb_prefetch;
 	unsigned int s_mb_prefetch_limit;
 	unsigned int s_mb_best_avail_max_trim_order;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index f56ac477c464..e3a5103e1620 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2171,7 +2171,6 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
 	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
 		spin_lock(&sbi->s_md_lock);
 		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
-		sbi->s_mb_last_start = ac->ac_f_ex.fe_start;
 		spin_unlock(&sbi->s_md_lock);
 	}
 	/*
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 04/17] ext4: remove unnecessary s_md_lock on update s_mb_last_group
  2025-07-14 13:03 [PATCH v3 00/17] ext4: better scalability for ext4 block allocation Baokun Li
                   ` (2 preceding siblings ...)
  2025-07-14 13:03 ` [PATCH v3 03/17] ext4: remove unnecessary s_mb_last_start Baokun Li
@ 2025-07-14 13:03 ` Baokun Li
  2025-07-17 13:36   ` Ojaswin Mujoo
  2025-07-14 13:03 ` [PATCH v3 05/17] ext4: utilize multiple global goals to reduce contention Baokun Li
                   ` (14 subsequent siblings)
  18 siblings, 1 reply; 44+ messages in thread
From: Baokun Li @ 2025-07-14 13:03 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, linux-kernel, ojaswin, julia.lawall,
	yi.zhang, yangerkun, libaokun1, libaokun

After we optimized the block group lock, we found another lock
contention issue when running will-it-scale/fallocate2 with multiple
processes. The fallocate's block allocation and the truncate's block
release were fighting over s_md_lock. The problem is that this lock
protects two totally different things in those two paths: the list of
freed data blocks (s_freed_data_list) when releasing, and where to start
looking for new blocks (s_mb_last_group) when allocating.

Now we only need to track s_mb_last_group and no longer track
s_mb_last_start, so we don't need the s_md_lock to ensure that the
two are consistent. Since s_mb_last_group is merely a hint and doesn't
require strong synchronization, READ_ONCE/WRITE_ONCE is sufficient.

Besides, s_mb_last_group only needs to be an ext4_group_t (i.e.,
unsigned int), so unsigned long is superfluous.
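
In userspace terms, this is roughly the property that C11 relaxed atomics
provide. The sketch below is a hypothetical illustration, not the kernel
primitives: the hint is read and written as single, untorn accesses, with
no ordering implied relative to other memory accesses:

```c
#include <stdatomic.h>
#include <stdio.h>

/* Stand-in for sbi->s_mb_last_group: just a shared allocation hint. */
static _Atomic unsigned int last_group;

/* Roughly what WRITE_ONCE() is used for here: one untorn, unordered store. */
static void remember_last_group(unsigned int group)
{
	atomic_store_explicit(&last_group, group, memory_order_relaxed);
}

/* Roughly what READ_ONCE() is used for here: one untorn, unordered load. */
static unsigned int start_group_hint(void)
{
	return atomic_load_explicit(&last_group, memory_order_relaxed);
}

int main(void)
{
	remember_last_group(42);
	printf("next stream allocation starts at group %u\n",
	       start_group_hint());
	return 0;
}
```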

Performance test data follows:

Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.

|CPU: Kunpeng 920   |          P80           |            P1           |
|Memory: 512GB      |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 4821  | 9636  (+99.8%) | 314065 | 337597 (+7.4%) |
|mb_optimize_scan=1 | 4784  | 4834  (+1.04%) | 316344 | 341440 (+7.9%) |

|CPU: AMD 9654 * 2  |          P96           |             P1          |
|Memory: 1536GB     |------------------------|-------------------------|
|960GB SSD (1GB/s)  | base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 15371 | 22341 (+45.3%) | 205851 | 219707 (+6.7%) |
|mb_optimize_scan=1 | 6101  | 9177  (+50.4%) | 207373 | 215732 (+4.0%) |

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/ext4.h    |  2 +-
 fs/ext4/mballoc.c | 12 +++---------
 2 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b83095541c98..7f5c070de0fb 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1630,7 +1630,7 @@ struct ext4_sb_info {
 	unsigned int s_mb_group_prealloc;
 	unsigned int s_max_dir_size_kb;
 	/* where last allocation was done - for stream allocation */
-	unsigned long s_mb_last_group;
+	ext4_group_t s_mb_last_group;
 	unsigned int s_mb_prefetch;
 	unsigned int s_mb_prefetch_limit;
 	unsigned int s_mb_best_avail_max_trim_order;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index e3a5103e1620..025b759ca643 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2168,11 +2168,8 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
 	ac->ac_buddy_folio = e4b->bd_buddy_folio;
 	folio_get(ac->ac_buddy_folio);
 	/* store last allocated for subsequent stream allocation */
-	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
-		spin_lock(&sbi->s_md_lock);
-		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
-		spin_unlock(&sbi->s_md_lock);
-	}
+	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
+		WRITE_ONCE(sbi->s_mb_last_group, ac->ac_f_ex.fe_group);
 	/*
 	 * As we've just preallocated more space than
 	 * user requested originally, we store allocated
@@ -2845,10 +2842,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 
 	/* if stream allocation is enabled, use global goal */
 	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
-		/* TBD: may be hot point */
-		spin_lock(&sbi->s_md_lock);
-		ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
-		spin_unlock(&sbi->s_md_lock);
+		ac->ac_g_ex.fe_group = READ_ONCE(sbi->s_mb_last_group);
 		ac->ac_g_ex.fe_start = -1;
 		ac->ac_flags &= ~EXT4_MB_HINT_TRY_GOAL;
 	}
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 05/17] ext4: utilize multiple global goals to reduce contention
  2025-07-14 13:03 [PATCH v3 00/17] ext4: better scalability for ext4 block allocation Baokun Li
                   ` (3 preceding siblings ...)
  2025-07-14 13:03 ` [PATCH v3 04/17] ext4: remove unnecessary s_md_lock on update s_mb_last_group Baokun Li
@ 2025-07-14 13:03 ` Baokun Li
  2025-07-14 13:03 ` [PATCH v3 06/17] ext4: get rid of some obsolete EXT4_MB_HINT flags Baokun Li
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 44+ messages in thread
From: Baokun Li @ 2025-07-14 13:03 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, linux-kernel, ojaswin, julia.lawall,
	yi.zhang, yangerkun, libaokun1, libaokun

When allocating data blocks, if the first try (goal allocation) fails and
stream allocation is on, it tries a global goal starting from the last
group we used (s_mb_last_group). This helps cluster large files together
to reduce free space fragmentation, and the data block contiguity also
accelerates write-back to disk.

However, when multiple processes allocate blocks, having just one global
goal means they all fight over the same group. This drastically lowers
the chances of extents merging and leads to much worse file fragmentation.

To mitigate this multi-process contention, we now employ multiple global
goals, with the number of goals being the minimum between the number of
possible CPUs and one-quarter of the filesystem's total block group count.

To ensure a consistent goal for each inode, we select the corresponding
goal by taking the inode number modulo the total number of goals.
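
A minimal sketch of this goal selection, with made-up parameters and
sysconf() standing in for num_possible_cpus():

```c
#include <stdio.h>
#include <unistd.h>

/* Number of global goals: min(possible CPUs, group count / 4, rounded up). */
static unsigned int nr_global_goals(unsigned int groups_count)
{
	long cpus = sysconf(_SC_NPROCESSORS_CONF);
	unsigned int quarter = (groups_count + 3) / 4;

	if (cpus < 1)
		cpus = 1;
	return (unsigned int)cpus < quarter ? (unsigned int)cpus : quarter;
}

/* Each inode consistently maps to the same goal slot. */
static unsigned int goal_index(unsigned long ino, unsigned int nr_goals)
{
	return ino % nr_goals;
}

int main(void)
{
	unsigned int nr = nr_global_goals(128);	/* e.g. a 128-group filesystem */

	printf("goals=%u, inode 12345 uses goal %u\n",
	       nr, goal_index(12345, nr));
	return 0;
}
```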

Performance test data follows:

Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.

|CPU: Kunpeng 920   |          P80           |            P1           |
|Memory: 512GB      |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 9636  | 19628 (+103%)  | 337597 | 320885 (-4.9%) |
|mb_optimize_scan=1 | 4834  | 7129  (+47.4%) | 341440 | 321275 (-5.9%) |

|CPU: AMD 9654 * 2  |          P96           |             P1          |
|Memory: 1536GB     |------------------------|-------------------------|
|960GB SSD (1GB/s)  | base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 22341 | 53760 (+140%)  | 219707 | 213145 (-2.9%) |
|mb_optimize_scan=1 | 9177  | 12716 (+38.5%) | 215732 | 215262 (+0.2%) |

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/ext4.h    |  6 ++++--
 fs/ext4/mballoc.c | 27 +++++++++++++++++++++++----
 2 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 7f5c070de0fb..ad97c693d56a 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1629,14 +1629,16 @@ struct ext4_sb_info {
 	unsigned int s_mb_order2_reqs;
 	unsigned int s_mb_group_prealloc;
 	unsigned int s_max_dir_size_kb;
-	/* where last allocation was done - for stream allocation */
-	ext4_group_t s_mb_last_group;
 	unsigned int s_mb_prefetch;
 	unsigned int s_mb_prefetch_limit;
 	unsigned int s_mb_best_avail_max_trim_order;
 	unsigned int s_sb_update_sec;
 	unsigned int s_sb_update_kb;
 
+	/* where last allocation was done - for stream allocation */
+	ext4_group_t *s_mb_last_groups;
+	unsigned int s_mb_nr_global_goals;
+
 	/* stats for buddy allocator */
 	atomic_t s_bal_reqs;	/* number of reqs with len > 1 */
 	atomic_t s_bal_success;	/* we found long enough chunks */
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 025b759ca643..b6aa24b48543 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2168,8 +2168,12 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
 	ac->ac_buddy_folio = e4b->bd_buddy_folio;
 	folio_get(ac->ac_buddy_folio);
 	/* store last allocated for subsequent stream allocation */
-	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
-		WRITE_ONCE(sbi->s_mb_last_group, ac->ac_f_ex.fe_group);
+	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
+		int hash = ac->ac_inode->i_ino % sbi->s_mb_nr_global_goals;
+
+		WRITE_ONCE(sbi->s_mb_last_groups[hash], ac->ac_f_ex.fe_group);
+	}
+
 	/*
 	 * As we've just preallocated more space than
 	 * user requested originally, we store allocated
@@ -2842,7 +2846,9 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 
 	/* if stream allocation is enabled, use global goal */
 	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
-		ac->ac_g_ex.fe_group = READ_ONCE(sbi->s_mb_last_group);
+		int hash = ac->ac_inode->i_ino % sbi->s_mb_nr_global_goals;
+
+		ac->ac_g_ex.fe_group = READ_ONCE(sbi->s_mb_last_groups[hash]);
 		ac->ac_g_ex.fe_start = -1;
 		ac->ac_flags &= ~EXT4_MB_HINT_TRY_GOAL;
 	}
@@ -3722,10 +3728,19 @@ int ext4_mb_init(struct super_block *sb)
 			sbi->s_mb_group_prealloc, EXT4_NUM_B2C(sbi, sbi->s_stripe));
 	}
 
+	sbi->s_mb_nr_global_goals = umin(num_possible_cpus(),
+					 DIV_ROUND_UP(sbi->s_groups_count, 4));
+	sbi->s_mb_last_groups = kcalloc(sbi->s_mb_nr_global_goals,
+					sizeof(ext4_group_t), GFP_KERNEL);
+	if (sbi->s_mb_last_groups == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
 	sbi->s_locality_groups = alloc_percpu(struct ext4_locality_group);
 	if (sbi->s_locality_groups == NULL) {
 		ret = -ENOMEM;
-		goto out;
+		goto out_free_last_groups;
 	}
 	for_each_possible_cpu(i) {
 		struct ext4_locality_group *lg;
@@ -3750,6 +3765,9 @@ int ext4_mb_init(struct super_block *sb)
 out_free_locality_groups:
 	free_percpu(sbi->s_locality_groups);
 	sbi->s_locality_groups = NULL;
+out_free_last_groups:
+	kfree(sbi->s_mb_last_groups);
+	sbi->s_mb_last_groups = NULL;
 out:
 	kfree(sbi->s_mb_avg_fragment_size);
 	kfree(sbi->s_mb_avg_fragment_size_locks);
@@ -3854,6 +3872,7 @@ void ext4_mb_release(struct super_block *sb)
 	}
 
 	free_percpu(sbi->s_locality_groups);
+	kfree(sbi->s_mb_last_groups);
 }
 
 static inline int ext4_issue_discard(struct super_block *sb,
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 06/17] ext4: get rid of some obsolete EXT4_MB_HINT flags
  2025-07-14 13:03 [PATCH v3 00/17] ext4: better scalability for ext4 block allocation Baokun Li
                   ` (4 preceding siblings ...)
  2025-07-14 13:03 ` [PATCH v3 05/17] ext4: utilize multiple global goals to reduce contention Baokun Li
@ 2025-07-14 13:03 ` Baokun Li
  2025-07-14 13:03 ` [PATCH v3 07/17] ext4: fix typo in CR_GOAL_LEN_SLOW comment Baokun Li
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 44+ messages in thread
From: Baokun Li @ 2025-07-14 13:03 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, linux-kernel, ojaswin, julia.lawall,
	yi.zhang, yangerkun, libaokun1, libaokun

Since nobody has used these EXT4_MB_HINT flags for ages,
let's remove them.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h              | 6 ------
 include/trace/events/ext4.h | 3 ---
 2 files changed, 9 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index ad97c693d56a..4ebc665cf871 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -185,14 +185,8 @@ enum criteria {
 
 /* prefer goal again. length */
 #define EXT4_MB_HINT_MERGE		0x0001
-/* blocks already reserved */
-#define EXT4_MB_HINT_RESERVED		0x0002
-/* metadata is being allocated */
-#define EXT4_MB_HINT_METADATA		0x0004
 /* first blocks in the file */
 #define EXT4_MB_HINT_FIRST		0x0008
-/* search for the best chunk */
-#define EXT4_MB_HINT_BEST		0x0010
 /* data is being allocated */
 #define EXT4_MB_HINT_DATA		0x0020
 /* don't preallocate (for tails) */
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 156908641e68..33b204165cc0 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -23,10 +23,7 @@ struct partial_cluster;
 
 #define show_mballoc_flags(flags) __print_flags(flags, "|",	\
 	{ EXT4_MB_HINT_MERGE,		"HINT_MERGE" },		\
-	{ EXT4_MB_HINT_RESERVED,	"HINT_RESV" },		\
-	{ EXT4_MB_HINT_METADATA,	"HINT_MDATA" },		\
 	{ EXT4_MB_HINT_FIRST,		"HINT_FIRST" },		\
-	{ EXT4_MB_HINT_BEST,		"HINT_BEST" },		\
 	{ EXT4_MB_HINT_DATA,		"HINT_DATA" },		\
 	{ EXT4_MB_HINT_NOPREALLOC,	"HINT_NOPREALLOC" },	\
 	{ EXT4_MB_HINT_GROUP_ALLOC,	"HINT_GRP_ALLOC" },	\
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 07/17] ext4: fix typo in CR_GOAL_LEN_SLOW comment
  2025-07-14 13:03 [PATCH v3 00/17] ext4: better scalability for ext4 block allocation Baokun Li
                   ` (5 preceding siblings ...)
  2025-07-14 13:03 ` [PATCH v3 06/17] ext4: get rid of some obsolete EXT4_MB_HINT flags Baokun Li
@ 2025-07-14 13:03 ` Baokun Li
  2025-07-14 13:03 ` [PATCH v3 08/17] ext4: convert sbi->s_mb_free_pending to atomic_t Baokun Li
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 44+ messages in thread
From: Baokun Li @ 2025-07-14 13:03 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, linux-kernel, ojaswin, julia.lawall,
	yi.zhang, yangerkun, libaokun1, libaokun

Remove the superfluous "find_".

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 4ebc665cf871..0379f2974252 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -157,7 +157,7 @@ enum criteria {
 
 	/*
 	 * Reads each block group sequentially, performing disk IO if
-	 * necessary, to find find_suitable block group. Tries to
+	 * necessary, to find suitable block group. Tries to
 	 * allocate goal length but might trim the request if nothing
 	 * is found after enough tries.
 	 */
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 08/17] ext4: convert sbi->s_mb_free_pending to atomic_t
  2025-07-14 13:03 [PATCH v3 00/17] ext4: better scalability for ext4 block allocation Baokun Li
                   ` (6 preceding siblings ...)
  2025-07-14 13:03 ` [PATCH v3 07/17] ext4: fix typo in CR_GOAL_LEN_SLOW comment Baokun Li
@ 2025-07-14 13:03 ` Baokun Li
  2025-07-14 13:03 ` [PATCH v3 09/17] ext4: merge freed extent with existing extents before insertion Baokun Li
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 44+ messages in thread
From: Baokun Li @ 2025-07-14 13:03 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, linux-kernel, ojaswin, julia.lawall,
	yi.zhang, yangerkun, libaokun1, libaokun

Previously, s_md_lock was used to protect s_mb_free_pending during
modifications, while smp_mb() ensured fresh reads, so s_md_lock only
guaranteed the atomicity of s_mb_free_pending. Thus we optimize this by
converting s_mb_free_pending into an atomic variable, eliminating this
use of s_md_lock and reducing lock contention. This also prepares for
future lockless merging of free extents.

Following this modification, s_md_lock is exclusively responsible for
managing insertions and deletions within s_freed_data_list, along with
operations involving list_splice.
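
A minimal userspace sketch of the same pattern, using C11 atomics rather
than the kernel's atomic_t (function names are made up for illustration):

```c
#include <stdatomic.h>
#include <stdio.h>

/* Stand-in for sbi->s_mb_free_pending after the conversion. */
static atomic_int free_pending;

static void record_freed(int clusters)
{
	/* Before: take s_md_lock, s_mb_free_pending += clusters, unlock. */
	atomic_fetch_add(&free_pending, clusters);
}

static void commit_freed(int clusters)
{
	atomic_fetch_sub(&free_pending, clusters);
}

static int pending(void)
{
	return atomic_load(&free_pending);
}

int main(void)
{
	record_freed(16);
	commit_freed(8);
	printf("pending clusters: %d\n", pending());
	return 0;
}
```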

Performance test data follows:

Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.

|CPU: Kunpeng 920   |          P80           |            P1           |
|Memory: 512GB      |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 19628 | 20043 (+2.1%)  | 320885 | 314331 (-2.0%) |
|mb_optimize_scan=1 | 7129  | 7290  (+2.2%)  | 321275 | 324226 (+0.9%) |

|CPU: AMD 9654 * 2  |          P96           |             P1          |
|Memory: 1536GB     |------------------------|-------------------------|
|960GB SSD (1GB/s)  | base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 53760 | 54999 (+2.3%)  | 213145 | 214380 (+0.5%) |
|mb_optimize_scan=1 | 12716 | 13497 (+6.1%)  | 215262 | 216276 (+0.4%) |

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/balloc.c  | 2 +-
 fs/ext4/ext4.h    | 2 +-
 fs/ext4/mballoc.c | 9 +++------
 3 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index c48fd36b2d74..c9329ed5c094 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -703,7 +703,7 @@ int ext4_should_retry_alloc(struct super_block *sb, int *retries)
 	 * possible we just missed a transaction commit that did so
 	 */
 	smp_mb();
-	if (sbi->s_mb_free_pending == 0) {
+	if (atomic_read(&sbi->s_mb_free_pending) == 0) {
 		if (test_opt(sb, DISCARD)) {
 			atomic_inc(&sbi->s_retry_alloc_pending);
 			flush_work(&sbi->s_discard_work);
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 0379f2974252..52a72af6ec34 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1602,7 +1602,7 @@ struct ext4_sb_info {
 	unsigned short *s_mb_offsets;
 	unsigned int *s_mb_maxs;
 	unsigned int s_group_info_size;
-	unsigned int s_mb_free_pending;
+	atomic_t s_mb_free_pending;
 	struct list_head s_freed_data_list[2];	/* List of blocks to be freed
 						   after commit completed */
 	struct list_head s_discard_list;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index b6aa24b48543..ba3cdacbc9f9 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3687,7 +3687,7 @@ int ext4_mb_init(struct super_block *sb)
 	}
 
 	spin_lock_init(&sbi->s_md_lock);
-	sbi->s_mb_free_pending = 0;
+	atomic_set(&sbi->s_mb_free_pending, 0);
 	INIT_LIST_HEAD(&sbi->s_freed_data_list[0]);
 	INIT_LIST_HEAD(&sbi->s_freed_data_list[1]);
 	INIT_LIST_HEAD(&sbi->s_discard_list);
@@ -3903,10 +3903,7 @@ static void ext4_free_data_in_buddy(struct super_block *sb,
 	/* we expect to find existing buddy because it's pinned */
 	BUG_ON(err != 0);
 
-	spin_lock(&EXT4_SB(sb)->s_md_lock);
-	EXT4_SB(sb)->s_mb_free_pending -= entry->efd_count;
-	spin_unlock(&EXT4_SB(sb)->s_md_lock);
-
+	atomic_sub(entry->efd_count, &EXT4_SB(sb)->s_mb_free_pending);
 	db = e4b.bd_info;
 	/* there are blocks to put in buddy to make them really free */
 	count += entry->efd_count;
@@ -6401,7 +6398,7 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
 
 	spin_lock(&sbi->s_md_lock);
 	list_add_tail(&new_entry->efd_list, &sbi->s_freed_data_list[new_entry->efd_tid & 1]);
-	sbi->s_mb_free_pending += clusters;
+	atomic_add(clusters, &sbi->s_mb_free_pending);
 	spin_unlock(&sbi->s_md_lock);
 }
 
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 09/17] ext4: merge freed extent with existing extents before insertion
  2025-07-14 13:03 [PATCH v3 00/17] ext4: better scalability for ext4 block allocation Baokun Li
                   ` (7 preceding siblings ...)
  2025-07-14 13:03 ` [PATCH v3 08/17] ext4: convert sbi->s_mb_free_pending to atomic_t Baokun Li
@ 2025-07-14 13:03 ` Baokun Li
  2025-07-14 13:03 ` [PATCH v3 10/17] ext4: fix zombie groups in average fragment size lists Baokun Li
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 44+ messages in thread
From: Baokun Li @ 2025-07-14 13:03 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, linux-kernel, ojaswin, julia.lawall,
	yi.zhang, yangerkun, libaokun1, libaokun

Attempt to merge a new ext4_free_data entry with already inserted free
extents before adding it. This strategy drastically cuts down the number
of times the lock is acquired.

For example, if prev, new, and next extents are all mergeable, the existing
code (before this patch) requires acquiring the s_md_lock three times:

  prev merge into new and free prev // hold lock
  next merge into new and free next // hold lock
  insert new // hold lock

After the patch, it only needs to be acquired once:

  new merge into next and free new // no lock
  next merge into prev and free next // hold lock

Performance test data follows:

Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.

|CPU: Kunpeng 920   |          P80           |            P1           |
|Memory: 512GB      |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 20043 | 20097 (+0.2%)  | 314331 | 316141 (+0.5%) |
|mb_optimize_scan=1 | 7290  | 13318 (+87.4%) | 324226 | 325273 (+0.3%) |

|CPU: AMD 9654 * 2  |          P96           |             P1          |
|Memory: 1536GB     |------------------------|-------------------------|
|960GB SSD (1GB/s)  | base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 54999 | 53603 (-2.5%)  | 214380 | 214243 (-0.06%)|
|mb_optimize_scan=1 | 13497 | 20887 (+54.6%) | 216276 | 213632 (-1.2%) |

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/mballoc.c | 113 +++++++++++++++++++++++++++++++---------------
 1 file changed, 76 insertions(+), 37 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index ba3cdacbc9f9..6d98f2a5afc4 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -6307,28 +6307,63 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
  * are contiguous, AND the extents were freed by the same transaction,
  * AND the blocks are associated with the same group.
  */
-static void ext4_try_merge_freed_extent(struct ext4_sb_info *sbi,
-					struct ext4_free_data *entry,
-					struct ext4_free_data *new_entry,
-					struct rb_root *entry_rb_root)
+static inline bool
+ext4_freed_extents_can_be_merged(struct ext4_free_data *entry1,
+				 struct ext4_free_data *entry2)
 {
-	if ((entry->efd_tid != new_entry->efd_tid) ||
-	    (entry->efd_group != new_entry->efd_group))
-		return;
-	if (entry->efd_start_cluster + entry->efd_count ==
-	    new_entry->efd_start_cluster) {
-		new_entry->efd_start_cluster = entry->efd_start_cluster;
-		new_entry->efd_count += entry->efd_count;
-	} else if (new_entry->efd_start_cluster + new_entry->efd_count ==
-		   entry->efd_start_cluster) {
-		new_entry->efd_count += entry->efd_count;
-	} else
-		return;
+	if (entry1->efd_tid != entry2->efd_tid)
+		return false;
+	if (entry1->efd_start_cluster + entry1->efd_count !=
+	    entry2->efd_start_cluster)
+		return false;
+	if (WARN_ON_ONCE(entry1->efd_group != entry2->efd_group))
+		return false;
+	return true;
+}
+
+static inline void
+ext4_merge_freed_extents(struct ext4_sb_info *sbi, struct rb_root *root,
+			 struct ext4_free_data *entry1,
+			 struct ext4_free_data *entry2)
+{
+	entry1->efd_count += entry2->efd_count;
 	spin_lock(&sbi->s_md_lock);
-	list_del(&entry->efd_list);
+	list_del(&entry2->efd_list);
 	spin_unlock(&sbi->s_md_lock);
-	rb_erase(&entry->efd_node, entry_rb_root);
-	kmem_cache_free(ext4_free_data_cachep, entry);
+	rb_erase(&entry2->efd_node, root);
+	kmem_cache_free(ext4_free_data_cachep, entry2);
+}
+
+static inline void
+ext4_try_merge_freed_extent_prev(struct ext4_sb_info *sbi, struct rb_root *root,
+				 struct ext4_free_data *entry)
+{
+	struct ext4_free_data *prev;
+	struct rb_node *node;
+
+	node = rb_prev(&entry->efd_node);
+	if (!node)
+		return;
+
+	prev = rb_entry(node, struct ext4_free_data, efd_node);
+	if (ext4_freed_extents_can_be_merged(prev, entry))
+		ext4_merge_freed_extents(sbi, root, prev, entry);
+}
+
+static inline void
+ext4_try_merge_freed_extent_next(struct ext4_sb_info *sbi, struct rb_root *root,
+				 struct ext4_free_data *entry)
+{
+	struct ext4_free_data *next;
+	struct rb_node *node;
+
+	node = rb_next(&entry->efd_node);
+	if (!node)
+		return;
+
+	next = rb_entry(node, struct ext4_free_data, efd_node);
+	if (ext4_freed_extents_can_be_merged(entry, next))
+		ext4_merge_freed_extents(sbi, root, entry, next);
 }
 
 static noinline_for_stack void
@@ -6338,11 +6373,12 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
 	ext4_group_t group = e4b->bd_group;
 	ext4_grpblk_t cluster;
 	ext4_grpblk_t clusters = new_entry->efd_count;
-	struct ext4_free_data *entry;
+	struct ext4_free_data *entry = NULL;
 	struct ext4_group_info *db = e4b->bd_info;
 	struct super_block *sb = e4b->bd_sb;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
-	struct rb_node **n = &db->bb_free_root.rb_node, *node;
+	struct rb_root *root = &db->bb_free_root;
+	struct rb_node **n = &root->rb_node;
 	struct rb_node *parent = NULL, *new_node;
 
 	BUG_ON(!ext4_handle_valid(handle));
@@ -6378,27 +6414,30 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
 		}
 	}
 
-	rb_link_node(new_node, parent, n);
-	rb_insert_color(new_node, &db->bb_free_root);
-
-	/* Now try to see the extent can be merged to left and right */
-	node = rb_prev(new_node);
-	if (node) {
-		entry = rb_entry(node, struct ext4_free_data, efd_node);
-		ext4_try_merge_freed_extent(sbi, entry, new_entry,
-					    &(db->bb_free_root));
+	atomic_add(clusters, &sbi->s_mb_free_pending);
+	if (!entry)
+		goto insert;
+
+	/* Now try to see the extent can be merged to prev and next */
+	if (ext4_freed_extents_can_be_merged(new_entry, entry)) {
+		entry->efd_start_cluster = cluster;
+		entry->efd_count += new_entry->efd_count;
+		kmem_cache_free(ext4_free_data_cachep, new_entry);
+		ext4_try_merge_freed_extent_prev(sbi, root, entry);
+		return;
 	}
-
-	node = rb_next(new_node);
-	if (node) {
-		entry = rb_entry(node, struct ext4_free_data, efd_node);
-		ext4_try_merge_freed_extent(sbi, entry, new_entry,
-					    &(db->bb_free_root));
+	if (ext4_freed_extents_can_be_merged(entry, new_entry)) {
+		entry->efd_count += new_entry->efd_count;
+		kmem_cache_free(ext4_free_data_cachep, new_entry);
+		ext4_try_merge_freed_extent_next(sbi, root, entry);
+		return;
 	}
+insert:
+	rb_link_node(new_node, parent, n);
+	rb_insert_color(new_node, root);
 
 	spin_lock(&sbi->s_md_lock);
 	list_add_tail(&new_entry->efd_list, &sbi->s_freed_data_list[new_entry->efd_tid & 1]);
-	atomic_add(clusters, &sbi->s_mb_free_pending);
 	spin_unlock(&sbi->s_md_lock);
 }
 
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 10/17] ext4: fix zombie groups in average fragment size lists
  2025-07-14 13:03 [PATCH v3 00/17] ext4: better scalability for ext4 block allocation Baokun Li
                   ` (8 preceding siblings ...)
  2025-07-14 13:03 ` [PATCH v3 09/17] ext4: merge freed extent with existing extents before insertion Baokun Li
@ 2025-07-14 13:03 ` Baokun Li
  2025-07-14 13:03 ` [PATCH v3 11/17] ext4: fix largest free orders lists corruption on mb_optimize_scan switch Baokun Li
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 44+ messages in thread
From: Baokun Li @ 2025-07-14 13:03 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, linux-kernel, ojaswin, julia.lawall,
	yi.zhang, yangerkun, libaokun1, libaokun, stable

Groups with no free blocks shouldn't be in any average fragment size list.
However, when all blocks in a group are allocated (i.e., bb_fragments or
bb_free is 0), we currently skip updating the average fragment size, which
means the group isn't removed from its previous s_mb_avg_fragment_size[old]
list.

This created "zombie" groups that were always skipped during traversal as
they couldn't satisfy any block allocation requests, negatively impacting
traversal efficiency.

Therefore, when a group becomes completely full, bb_avg_fragment_size_order
is now set to -1. If the old order was not -1, a removal operation is
performed; if the new order is not -1, an insertion is performed.

Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@vger.kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/mballoc.c | 36 ++++++++++++++++++------------------
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 6d98f2a5afc4..72b20fc52bbf 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -841,30 +841,30 @@ static void
 mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
-	int new_order;
+	int new, old;
 
-	if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || grp->bb_fragments == 0)
+	if (!test_opt2(sb, MB_OPTIMIZE_SCAN))
 		return;
 
-	new_order = mb_avg_fragment_size_order(sb,
-					grp->bb_free / grp->bb_fragments);
-	if (new_order == grp->bb_avg_fragment_size_order)
+	old = grp->bb_avg_fragment_size_order;
+	new = grp->bb_fragments == 0 ? -1 :
+	      mb_avg_fragment_size_order(sb, grp->bb_free / grp->bb_fragments);
+	if (new == old)
 		return;
 
-	if (grp->bb_avg_fragment_size_order != -1) {
-		write_lock(&sbi->s_mb_avg_fragment_size_locks[
-					grp->bb_avg_fragment_size_order]);
+	if (old >= 0) {
+		write_lock(&sbi->s_mb_avg_fragment_size_locks[old]);
 		list_del(&grp->bb_avg_fragment_size_node);
-		write_unlock(&sbi->s_mb_avg_fragment_size_locks[
-					grp->bb_avg_fragment_size_order]);
-	}
-	grp->bb_avg_fragment_size_order = new_order;
-	write_lock(&sbi->s_mb_avg_fragment_size_locks[
-					grp->bb_avg_fragment_size_order]);
-	list_add_tail(&grp->bb_avg_fragment_size_node,
-		&sbi->s_mb_avg_fragment_size[grp->bb_avg_fragment_size_order]);
-	write_unlock(&sbi->s_mb_avg_fragment_size_locks[
-					grp->bb_avg_fragment_size_order]);
+		write_unlock(&sbi->s_mb_avg_fragment_size_locks[old]);
+	}
+
+	grp->bb_avg_fragment_size_order = new;
+	if (new >= 0) {
+		write_lock(&sbi->s_mb_avg_fragment_size_locks[new]);
+		list_add_tail(&grp->bb_avg_fragment_size_node,
+				&sbi->s_mb_avg_fragment_size[new]);
+		write_unlock(&sbi->s_mb_avg_fragment_size_locks[new]);
+	}
 }
 
 /*
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 11/17] ext4: fix largest free orders lists corruption on mb_optimize_scan switch
  2025-07-14 13:03 [PATCH v3 00/17] ext4: better scalability for ext4 block allocation Baokun Li
                   ` (9 preceding siblings ...)
  2025-07-14 13:03 ` [PATCH v3 10/17] ext4: fix zombie groups in average fragment size lists Baokun Li
@ 2025-07-14 13:03 ` Baokun Li
  2025-07-14 13:03 ` [PATCH v3 12/17] ext4: factor out __ext4_mb_scan_group() Baokun Li
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 44+ messages in thread
From: Baokun Li @ 2025-07-14 13:03 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, linux-kernel, ojaswin, julia.lawall,
	yi.zhang, yangerkun, libaokun1, libaokun, stable

The grp->bb_largest_free_order is updated regardless of whether
mb_optimize_scan is enabled. This can lead to inconsistencies between
grp->bb_largest_free_order and the actual s_mb_largest_free_orders list
index when mb_optimize_scan is repeatedly enabled and disabled via remount.

For example, if mb_optimize_scan is initially enabled, largest free
order is 3, and the group is in s_mb_largest_free_orders[3]. Then,
mb_optimize_scan is disabled via remount, block allocations occur,
updating largest free order to 2. Finally, mb_optimize_scan is re-enabled
via remount, more block allocations update largest free order to 1.

At this point, the group would be removed from s_mb_largest_free_orders[3]
under the protection of s_mb_largest_free_orders_locks[2]. This lock
mismatch can lead to list corruption.

To fix this, whenever grp->bb_largest_free_order changes, we now always
attempt to remove the group from its old order list. However, we only
insert the group into the new order list if `mb_optimize_scan` is enabled.
This approach helps prevent lock inconsistencies and ensures the data in
the order lists remains reliable.

Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@vger.kernel.org
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/mballoc.c | 33 ++++++++++++++-------------------
 1 file changed, 14 insertions(+), 19 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 72b20fc52bbf..fada0d1b3fdb 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1152,33 +1152,28 @@ static void
 mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
-	int i;
+	int new, old = grp->bb_largest_free_order;
 
-	for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--)
-		if (grp->bb_counters[i] > 0)
+	for (new = MB_NUM_ORDERS(sb) - 1; new >= 0; new--)
+		if (grp->bb_counters[new] > 0)
 			break;
+
 	/* No need to move between order lists? */
-	if (!test_opt2(sb, MB_OPTIMIZE_SCAN) ||
-	    i == grp->bb_largest_free_order) {
-		grp->bb_largest_free_order = i;
+	if (new == old)
 		return;
-	}
 
-	if (grp->bb_largest_free_order >= 0) {
-		write_lock(&sbi->s_mb_largest_free_orders_locks[
-					      grp->bb_largest_free_order]);
+	if (old >= 0 && !list_empty(&grp->bb_largest_free_order_node)) {
+		write_lock(&sbi->s_mb_largest_free_orders_locks[old]);
 		list_del_init(&grp->bb_largest_free_order_node);
-		write_unlock(&sbi->s_mb_largest_free_orders_locks[
-					      grp->bb_largest_free_order]);
+		write_unlock(&sbi->s_mb_largest_free_orders_locks[old]);
 	}
-	grp->bb_largest_free_order = i;
-	if (grp->bb_largest_free_order >= 0 && grp->bb_free) {
-		write_lock(&sbi->s_mb_largest_free_orders_locks[
-					      grp->bb_largest_free_order]);
+
+	grp->bb_largest_free_order = new;
+	if (test_opt2(sb, MB_OPTIMIZE_SCAN) && new >= 0 && grp->bb_free) {
+		write_lock(&sbi->s_mb_largest_free_orders_locks[new]);
 		list_add_tail(&grp->bb_largest_free_order_node,
-		      &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);
-		write_unlock(&sbi->s_mb_largest_free_orders_locks[
-					      grp->bb_largest_free_order]);
+			      &sbi->s_mb_largest_free_orders[new]);
+		write_unlock(&sbi->s_mb_largest_free_orders_locks[new]);
 	}
 }
 
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 12/17] ext4: factor out __ext4_mb_scan_group()
  2025-07-14 13:03 [PATCH v3 00/17] ext4: better scalability for ext4 block allocation Baokun Li
                   ` (10 preceding siblings ...)
  2025-07-14 13:03 ` [PATCH v3 11/17] ext4: fix largest free orders lists corruption on mb_optimize_scan switch Baokun Li
@ 2025-07-14 13:03 ` Baokun Li
  2025-07-14 13:03 ` [PATCH v3 13/17] ext4: factor out ext4_mb_might_prefetch() Baokun Li
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 44+ messages in thread
From: Baokun Li @ 2025-07-14 13:03 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, linux-kernel, ojaswin, julia.lawall,
	yi.zhang, yangerkun, libaokun1, libaokun

Extract __ext4_mb_scan_group() to make the code clearer and to
prepare for the later conversion of 'choose group' to 'scan groups'.
No functional changes.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/mballoc.c | 45 +++++++++++++++++++++++++++------------------
 fs/ext4/mballoc.h |  2 ++
 2 files changed, 29 insertions(+), 18 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index fada0d1b3fdb..650eb6366eb0 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2568,6 +2568,30 @@ void ext4_mb_scan_aligned(struct ext4_allocation_context *ac,
 	}
 }
 
+static void __ext4_mb_scan_group(struct ext4_allocation_context *ac)
+{
+	bool is_stripe_aligned;
+	struct ext4_sb_info *sbi;
+	enum criteria cr = ac->ac_criteria;
+
+	ac->ac_groups_scanned++;
+	if (cr == CR_POWER2_ALIGNED)
+		return ext4_mb_simple_scan_group(ac, ac->ac_e4b);
+
+	sbi = EXT4_SB(ac->ac_sb);
+	is_stripe_aligned = false;
+	if ((sbi->s_stripe >= sbi->s_cluster_ratio) &&
+	    !(ac->ac_g_ex.fe_len % EXT4_NUM_B2C(sbi, sbi->s_stripe)))
+		is_stripe_aligned = true;
+
+	if ((cr == CR_GOAL_LEN_FAST || cr == CR_BEST_AVAIL_LEN) &&
+	    is_stripe_aligned)
+		ext4_mb_scan_aligned(ac, ac->ac_e4b);
+
+	if (ac->ac_status == AC_STATUS_CONTINUE)
+		ext4_mb_complex_scan_group(ac, ac->ac_e4b);
+}
+
 /*
  * This is also called BEFORE we load the buddy bitmap.
  * Returns either 1 or 0 indicating that the group is either suitable
@@ -2855,6 +2879,8 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 	 */
 	if (ac->ac_2order)
 		cr = CR_POWER2_ALIGNED;
+
+	ac->ac_e4b = &e4b;
 repeat:
 	for (; cr < EXT4_MB_NUM_CRS && ac->ac_status == AC_STATUS_CONTINUE; cr++) {
 		ac->ac_criteria = cr;
@@ -2932,24 +2958,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 				continue;
 			}
 
-			ac->ac_groups_scanned++;
-			if (cr == CR_POWER2_ALIGNED)
-				ext4_mb_simple_scan_group(ac, &e4b);
-			else {
-				bool is_stripe_aligned =
-					(sbi->s_stripe >=
-					 sbi->s_cluster_ratio) &&
-					!(ac->ac_g_ex.fe_len %
-					  EXT4_NUM_B2C(sbi, sbi->s_stripe));
-
-				if ((cr == CR_GOAL_LEN_FAST ||
-				     cr == CR_BEST_AVAIL_LEN) &&
-				    is_stripe_aligned)
-					ext4_mb_scan_aligned(ac, &e4b);
-
-				if (ac->ac_status == AC_STATUS_CONTINUE)
-					ext4_mb_complex_scan_group(ac, &e4b);
-			}
+			__ext4_mb_scan_group(ac);
 
 			ext4_unlock_group(sb, group);
 			ext4_mb_unload_buddy(&e4b);
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index f8280de3e882..7a60b0103e64 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -204,6 +204,8 @@ struct ext4_allocation_context {
 	__u8 ac_2order;		/* if request is to allocate 2^N blocks and
 				 * N > 0, the field stores N, otherwise 0 */
 	__u8 ac_op;		/* operation, for history only */
+
+	struct ext4_buddy *ac_e4b;
 	struct folio *ac_bitmap_folio;
 	struct folio *ac_buddy_folio;
 	struct ext4_prealloc_space *ac_pa;
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 13/17] ext4: factor out ext4_mb_might_prefetch()
  2025-07-14 13:03 [PATCH v3 00/17] ext4: better scalability for ext4 block allocation Baokun Li
                   ` (11 preceding siblings ...)
  2025-07-14 13:03 ` [PATCH v3 12/17] ext4: factor out __ext4_mb_scan_group() Baokun Li
@ 2025-07-14 13:03 ` Baokun Li
  2025-07-14 13:03 ` [PATCH v3 14/17] ext4: factor out ext4_mb_scan_group() Baokun Li
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 44+ messages in thread
From: Baokun Li @ 2025-07-14 13:03 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, linux-kernel, ojaswin, julia.lawall,
	yi.zhang, yangerkun, libaokun1, libaokun

Extract ext4_mb_might_prefetch() to make the code clearer and to
prepare for the later conversion of 'choose group' to 'scan groups'.
No functional changes.
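
For reference, the batching logic being moved into the helper clamps each
prefetch batch so it does not cross a flex_bg boundary; a minimal
standalone sketch of that arithmetic (all values hypothetical):

#include <stdio.h>

int main(void)
{
	unsigned int log_groups_per_flex = 4;	/* 16 groups per flex_bg */
	unsigned int mb_prefetch = 32;		/* configured batch size */
	unsigned int group = 21;		/* group where scanning starts */
	unsigned int nr;

	nr = 1u << log_groups_per_flex;		/* 16 */
	nr -= group & (nr - 1);			/* 16 - (21 % 16) = 11 */
	if (nr > mb_prefetch)
		nr = mb_prefetch;		/* umin(nr, sbi->s_mb_prefetch) */

	/* prints: prefetch 11 groups, stopping at the flex_bg boundary (32) */
	printf("prefetch %u groups, stopping at the flex_bg boundary (%u)\n",
	       nr, group + nr);
	return 0;
}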

Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/mballoc.c | 62 +++++++++++++++++++++++++++++------------------
 fs/ext4/mballoc.h |  4 +++
 2 files changed, 42 insertions(+), 24 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 650eb6366eb0..52ec59f58c36 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2781,6 +2781,37 @@ ext4_group_t ext4_mb_prefetch(struct super_block *sb, ext4_group_t group,
 	return group;
 }
 
+/*
+ * Batch reads of the block allocation bitmaps to get
+ * multiple READs in flight; limit prefetching at inexpensive
+ * CR, otherwise mballoc can spend a lot of time loading
+ * imperfect groups
+ */
+static void ext4_mb_might_prefetch(struct ext4_allocation_context *ac,
+				   ext4_group_t group)
+{
+	struct ext4_sb_info *sbi;
+
+	if (ac->ac_prefetch_grp != group)
+		return;
+
+	sbi = EXT4_SB(ac->ac_sb);
+	if (ext4_mb_cr_expensive(ac->ac_criteria) ||
+	    ac->ac_prefetch_ios < sbi->s_mb_prefetch_limit) {
+		unsigned int nr = sbi->s_mb_prefetch;
+
+		if (ext4_has_feature_flex_bg(ac->ac_sb)) {
+			nr = 1 << sbi->s_log_groups_per_flex;
+			nr -= group & (nr - 1);
+			nr = umin(nr, sbi->s_mb_prefetch);
+		}
+
+		ac->ac_prefetch_nr = nr;
+		ac->ac_prefetch_grp = ext4_mb_prefetch(ac->ac_sb, group, nr,
+						       &ac->ac_prefetch_ios);
+	}
+}
+
 /*
  * Prefetching reads the block bitmap into the buffer cache; but we
  * need to make sure that the buddy bitmap in the page cache has been
@@ -2817,10 +2848,9 @@ void ext4_mb_prefetch_fini(struct super_block *sb, ext4_group_t group,
 static noinline_for_stack int
 ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 {
-	ext4_group_t prefetch_grp = 0, ngroups, group, i;
+	ext4_group_t ngroups, group, i;
 	enum criteria new_cr, cr = CR_GOAL_LEN_FAST;
 	int err = 0, first_err = 0;
-	unsigned int nr = 0, prefetch_ios = 0;
 	struct ext4_sb_info *sbi;
 	struct super_block *sb;
 	struct ext4_buddy e4b;
@@ -2881,6 +2911,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 		cr = CR_POWER2_ALIGNED;
 
 	ac->ac_e4b = &e4b;
+	ac->ac_prefetch_ios = 0;
 repeat:
 	for (; cr < EXT4_MB_NUM_CRS && ac->ac_status == AC_STATUS_CONTINUE; cr++) {
 		ac->ac_criteria = cr;
@@ -2890,8 +2921,8 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 		 */
 		group = ac->ac_g_ex.fe_group;
 		ac->ac_groups_linear_remaining = sbi->s_mb_max_linear_groups;
-		prefetch_grp = group;
-		nr = 0;
+		ac->ac_prefetch_grp = group;
+		ac->ac_prefetch_nr = 0;
 
 		for (i = 0, new_cr = cr; i < ngroups; i++,
 		     ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups)) {
@@ -2903,24 +2934,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 				goto repeat;
 			}
 
-			/*
-			 * Batch reads of the block allocation bitmaps
-			 * to get multiple READs in flight; limit
-			 * prefetching at inexpensive CR, otherwise mballoc
-			 * can spend a lot of time loading imperfect groups
-			 */
-			if ((prefetch_grp == group) &&
-			    (ext4_mb_cr_expensive(cr) ||
-			     prefetch_ios < sbi->s_mb_prefetch_limit)) {
-				nr = sbi->s_mb_prefetch;
-				if (ext4_has_feature_flex_bg(sb)) {
-					nr = 1 << sbi->s_log_groups_per_flex;
-					nr -= group & (nr - 1);
-					nr = min(nr, sbi->s_mb_prefetch);
-				}
-				prefetch_grp = ext4_mb_prefetch(sb, group,
-							nr, &prefetch_ios);
-			}
+			ext4_mb_might_prefetch(ac, group);
 
 			/* prevent unnecessary buddy loading. */
 			if (cr < CR_ANY_FREE &&
@@ -3018,8 +3032,8 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 		 ac->ac_b_ex.fe_len, ac->ac_o_ex.fe_len, ac->ac_status,
 		 ac->ac_flags, cr, err);
 
-	if (nr)
-		ext4_mb_prefetch_fini(sb, prefetch_grp, nr);
+	if (ac->ac_prefetch_nr)
+		ext4_mb_prefetch_fini(sb, ac->ac_prefetch_grp, ac->ac_prefetch_nr);
 
 	return err;
 }
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 7a60b0103e64..9f66b1d5db67 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -192,6 +192,10 @@ struct ext4_allocation_context {
 	 */
 	ext4_grpblk_t	ac_orig_goal_len;
 
+	ext4_group_t ac_prefetch_grp;
+	unsigned int ac_prefetch_ios;
+	unsigned int ac_prefetch_nr;
+
 	__u32 ac_flags;		/* allocation hints */
 	__u32 ac_groups_linear_remaining;
 	__u16 ac_groups_scanned;
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 14/17] ext4: factor out ext4_mb_scan_group()
  2025-07-14 13:03 [PATCH v3 00/17] ext4: better scalability for ext4 block allocation Baokun Li
                   ` (12 preceding siblings ...)
  2025-07-14 13:03 ` [PATCH v3 13/17] ext4: factor out ext4_mb_might_prefetch() Baokun Li
@ 2025-07-14 13:03 ` Baokun Li
  2025-07-14 13:03 ` [PATCH v3 15/17] ext4: convert free groups order lists to xarrays Baokun Li
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 44+ messages in thread
From: Baokun Li @ 2025-07-14 13:03 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, linux-kernel, ojaswin, julia.lawall,
	yi.zhang, yangerkun, libaokun1, libaokun

Extract ext4_mb_scan_group() to make the code clearer and to
prepare for the later conversion of 'choose group' to 'scan groups'.
No functional changes.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/mballoc.c | 93 +++++++++++++++++++++++++----------------------
 fs/ext4/mballoc.h |  2 +
 2 files changed, 51 insertions(+), 44 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 52ec59f58c36..0c3cbc7e2e85 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2845,12 +2845,56 @@ void ext4_mb_prefetch_fini(struct super_block *sb, ext4_group_t group,
 	}
 }
 
+static int ext4_mb_scan_group(struct ext4_allocation_context *ac,
+			      ext4_group_t group)
+{
+	int ret;
+	struct super_block *sb = ac->ac_sb;
+	enum criteria cr = ac->ac_criteria;
+
+	ext4_mb_might_prefetch(ac, group);
+
+	/* prevent unnecessary buddy loading. */
+	if (cr < CR_ANY_FREE && spin_is_locked(ext4_group_lock_ptr(sb, group)))
+		return 0;
+
+	/* This now checks without needing the buddy page */
+	ret = ext4_mb_good_group_nolock(ac, group, cr);
+	if (ret <= 0) {
+		if (!ac->ac_first_err)
+			ac->ac_first_err = ret;
+		return 0;
+	}
+
+	ret = ext4_mb_load_buddy(sb, group, ac->ac_e4b);
+	if (ret)
+		return ret;
+
+	/* skip busy group */
+	if (cr >= CR_ANY_FREE)
+		ext4_lock_group(sb, group);
+	else if (!ext4_try_lock_group(sb, group))
+		goto out_unload;
+
+	/* We need to check again after locking the block group. */
+	if (unlikely(!ext4_mb_good_group(ac, group, cr)))
+		goto out_unlock;
+
+	__ext4_mb_scan_group(ac);
+
+out_unlock:
+	ext4_unlock_group(sb, group);
+out_unload:
+	ext4_mb_unload_buddy(ac->ac_e4b);
+	return ret;
+}
+
 static noinline_for_stack int
 ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 {
 	ext4_group_t ngroups, group, i;
 	enum criteria new_cr, cr = CR_GOAL_LEN_FAST;
-	int err = 0, first_err = 0;
+	int err = 0;
 	struct ext4_sb_info *sbi;
 	struct super_block *sb;
 	struct ext4_buddy e4b;
@@ -2912,6 +2956,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 
 	ac->ac_e4b = &e4b;
 	ac->ac_prefetch_ios = 0;
+	ac->ac_first_err = 0;
 repeat:
 	for (; cr < EXT4_MB_NUM_CRS && ac->ac_status == AC_STATUS_CONTINUE; cr++) {
 		ac->ac_criteria = cr;
@@ -2926,7 +2971,6 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 
 		for (i = 0, new_cr = cr; i < ngroups; i++,
 		     ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups)) {
-			int ret = 0;
 
 			cond_resched();
 			if (new_cr != cr) {
@@ -2934,49 +2978,10 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 				goto repeat;
 			}
 
-			ext4_mb_might_prefetch(ac, group);
-
-			/* prevent unnecessary buddy loading. */
-			if (cr < CR_ANY_FREE &&
-			    spin_is_locked(ext4_group_lock_ptr(sb, group)))
-				continue;
-
-			/* This now checks without needing the buddy page */
-			ret = ext4_mb_good_group_nolock(ac, group, cr);
-			if (ret <= 0) {
-				if (!first_err)
-					first_err = ret;
-				continue;
-			}
-
-			err = ext4_mb_load_buddy(sb, group, &e4b);
+			err = ext4_mb_scan_group(ac, group);
 			if (err)
 				goto out;
 
-			/* skip busy group */
-			if (cr >= CR_ANY_FREE) {
-				ext4_lock_group(sb, group);
-			} else if (!ext4_try_lock_group(sb, group)) {
-				ext4_mb_unload_buddy(&e4b);
-				continue;
-			}
-
-			/*
-			 * We need to check again after locking the
-			 * block group
-			 */
-			ret = ext4_mb_good_group(ac, group, cr);
-			if (ret == 0) {
-				ext4_unlock_group(sb, group);
-				ext4_mb_unload_buddy(&e4b);
-				continue;
-			}
-
-			__ext4_mb_scan_group(ac);
-
-			ext4_unlock_group(sb, group);
-			ext4_mb_unload_buddy(&e4b);
-
 			if (ac->ac_status != AC_STATUS_CONTINUE)
 				break;
 		}
@@ -3025,8 +3030,8 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 			atomic_inc(&sbi->s_bal_stream_goals);
 	}
 out:
-	if (!err && ac->ac_status != AC_STATUS_FOUND && first_err)
-		err = first_err;
+	if (!err && ac->ac_status != AC_STATUS_FOUND && ac->ac_first_err)
+		err = ac->ac_first_err;
 
 	mb_debug(sb, "Best len %d, origin len %d, ac_status %u, ac_flags 0x%x, cr %d ret %d\n",
 		 ac->ac_b_ex.fe_len, ac->ac_o_ex.fe_len, ac->ac_status,
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 9f66b1d5db67..83886fc9521b 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -196,6 +196,8 @@ struct ext4_allocation_context {
 	unsigned int ac_prefetch_ios;
 	unsigned int ac_prefetch_nr;
 
+	int ac_first_err;
+
 	__u32 ac_flags;		/* allocation hints */
 	__u32 ac_groups_linear_remaining;
 	__u16 ac_groups_scanned;
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 15/17] ext4: convert free groups order lists to xarrays
  2025-07-14 13:03 [PATCH v3 00/17] ext4: better scalability for ext4 block allocation Baokun Li
                   ` (13 preceding siblings ...)
  2025-07-14 13:03 ` [PATCH v3 14/17] ext4: factor out ext4_mb_scan_group() Baokun Li
@ 2025-07-14 13:03 ` Baokun Li
  2025-07-21 11:07   ` Jan Kara
  2025-07-24  3:55   ` Guenter Roeck
  2025-07-14 13:03 ` [PATCH v3 16/17] ext4: refactor choose group to scan group Baokun Li
                   ` (3 subsequent siblings)
  18 siblings, 2 replies; 44+ messages in thread
From: Baokun Li @ 2025-07-14 13:03 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, linux-kernel, ojaswin, julia.lawall,
	yi.zhang, yangerkun, libaokun1, libaokun

While traversing the list we hold its lock, which prevents loading the
buddy and makes direct use of ext4_try_lock_group() impossible; the
traversal can only check spin_is_locked(). This can lead to a bouncing
scenario: the spin_is_locked() check on grp_A passes, but the subsequent
ext4_try_lock_group() fails, forcing the list traversal to repeatedly
restart from grp_A.

In contrast, linear traversal directly uses ext4_try_lock_group(),
avoiding this bouncing. Therefore, we need a lockless, ordered traversal
to achieve linear-like efficiency.

To that end, this commit converts both the average fragment size lists
and the largest free order lists into ordered xarrays.

In an xarray, the index represents the block group number and the value
holds the block group information; a non-empty value indicates the block
group's presence.
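
A tiny standalone sketch of that convention (a plain pointer array stands
in for the xarray; structure and values are simplified and hypothetical):

#include <stddef.h>
#include <stdio.h>

struct group_info {
	unsigned int bb_group;
};

int main(void)
{
	struct group_info g5 = { .bb_group = 5 }, g9 = { .bb_group = 9 };
	struct group_info *order_3[16] = { NULL };	/* one "xarray" per order */

	order_3[g5.bb_group] = &g5;	/* xa_insert(xa, grp->bb_group, grp, ...) */
	order_3[g9.bb_group] = &g9;
	order_3[g5.bb_group] = NULL;	/* xa_erase(xa, grp->bb_group) */

	for (unsigned int i = 0; i < 16; i++)	/* xa_for_each(xa, i, grp) */
		if (order_3[i])
			printf("group %u has largest free order 3\n",
			       order_3[i]->bb_group);
	return 0;
}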

While insertion and deletion remain O(1), looking up the next present
group changes from O(1) to roughly O(log n), which may slightly reduce
single-threaded performance.

Additionally, xarray insertion can fail, typically because of a memory
allocation failure. Since linear traversal remains available as a
fallback, this is not fatal, so insertion failures are only reported with
a debug message.

A helper, ext4_mb_find_good_group_xarray(), is added to find good groups
in a given xarray: it scans from the specified start position up to
ngroups - 1, then wraps around to 0 and continues up to start - 1. This
gives an ordered traversal within the xarray.
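
A standalone sketch of the resulting visit order (hypothetical values:
8 groups, start = 5):

#include <stdio.h>

int main(void)
{
	unsigned int ngroups = 8, start = 5;
	unsigned int begin = start, end = ngroups;

wrap_around:
	for (unsigned int g = begin; g < end; g++)
		printf("%u ", g);	/* visits 5 6 7, then 0 1 2 3 4 */

	if (begin) {
		end = begin;
		begin = 0;
		goto wrap_around;
	}

	printf("\n");
	return 0;
}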

Performance test results are as follows: Single-process operations
on an empty disk show negligible impact, while multi-process workloads
demonstrate a noticeable performance gain.

|CPU: Kunpeng 920   |          P80           |            P1           |
|Memory: 512GB      |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 20097 | 19555 (-2.6%)  | 316141 | 315636 (-0.2%) |
|mb_optimize_scan=1 | 13318 | 15496 (+16.3%) | 325273 | 323569 (-0.5%) |

|CPU: AMD 9654 * 2  |          P96           |             P1          |
|Memory: 1536GB     |------------------------|-------------------------|
|960GB SSD (1GB/s)  | base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 53603 | 53192 (-0.7%)  | 214243 | 212678 (-0.7%) |
|mb_optimize_scan=1 | 20887 | 37636 (+80.1%) | 213632 | 214189 (+0.2%) |

Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/ext4.h    |   8 +-
 fs/ext4/mballoc.c | 254 +++++++++++++++++++++++++---------------------
 2 files changed, 140 insertions(+), 122 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 52a72af6ec34..ea412fdb0b76 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1608,10 +1608,8 @@ struct ext4_sb_info {
 	struct list_head s_discard_list;
 	struct work_struct s_discard_work;
 	atomic_t s_retry_alloc_pending;
-	struct list_head *s_mb_avg_fragment_size;
-	rwlock_t *s_mb_avg_fragment_size_locks;
-	struct list_head *s_mb_largest_free_orders;
-	rwlock_t *s_mb_largest_free_orders_locks;
+	struct xarray *s_mb_avg_fragment_size;
+	struct xarray *s_mb_largest_free_orders;
 
 	/* tunables */
 	unsigned long s_stripe;
@@ -3485,8 +3483,6 @@ struct ext4_group_info {
 	void            *bb_bitmap;
 #endif
 	struct rw_semaphore alloc_sem;
-	struct list_head bb_avg_fragment_size_node;
-	struct list_head bb_largest_free_order_node;
 	ext4_grpblk_t	bb_counters[];	/* Nr of free power-of-two-block
 					 * regions, index is order.
 					 * bb_counters[3] = 5 means
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 0c3cbc7e2e85..a9eb997b8c9b 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -132,25 +132,30 @@
  * If "mb_optimize_scan" mount option is set, we maintain in memory group info
  * structures in two data structures:
  *
- * 1) Array of largest free order lists (sbi->s_mb_largest_free_orders)
+ * 1) Array of largest free order xarrays (sbi->s_mb_largest_free_orders)
  *
- *    Locking: sbi->s_mb_largest_free_orders_locks(array of rw locks)
+ *    Locking: Writers use xa_lock, readers use rcu_read_lock.
  *
- *    This is an array of lists where the index in the array represents the
+ *    This is an array of xarrays where the index in the array represents the
  *    largest free order in the buddy bitmap of the participating group infos of
- *    that list. So, there are exactly MB_NUM_ORDERS(sb) (which means total
- *    number of buddy bitmap orders possible) number of lists. Group-infos are
- *    placed in appropriate lists.
+ *    that xarray. So, there are exactly MB_NUM_ORDERS(sb) (which means total
+ *    number of buddy bitmap orders possible) number of xarrays. Group-infos are
+ *    placed in appropriate xarrays.
  *
- * 2) Average fragment size lists (sbi->s_mb_avg_fragment_size)
+ * 2) Average fragment size xarrays (sbi->s_mb_avg_fragment_size)
  *
- *    Locking: sbi->s_mb_avg_fragment_size_locks(array of rw locks)
+ *    Locking: Writers use xa_lock, readers use rcu_read_lock.
  *
- *    This is an array of lists where in the i-th list there are groups with
+ *    This is an array of xarrays where in the i-th xarray there are groups with
  *    average fragment size >= 2^i and < 2^(i+1). The average fragment size
  *    is computed as ext4_group_info->bb_free / ext4_group_info->bb_fragments.
- *    Note that we don't bother with a special list for completely empty groups
- *    so we only have MB_NUM_ORDERS(sb) lists.
+ *    Note that we don't bother with a special xarray for completely empty
+ *    groups so we only have MB_NUM_ORDERS(sb) xarrays. Group-infos are placed
+ *    in appropriate xarrays.
+ *
+ * In xarray, the index is the block group number, the value is the block group
+ * information, and a non-empty value indicates the block group is present in
+ * the current xarray.
  *
  * When "mb_optimize_scan" mount option is set, mballoc consults the above data
  * structures to decide the order in which groups are to be traversed for
@@ -852,21 +857,75 @@ mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
 	if (new == old)
 		return;
 
-	if (old >= 0) {
-		write_lock(&sbi->s_mb_avg_fragment_size_locks[old]);
-		list_del(&grp->bb_avg_fragment_size_node);
-		write_unlock(&sbi->s_mb_avg_fragment_size_locks[old]);
-	}
+	if (old >= 0)
+		xa_erase(&sbi->s_mb_avg_fragment_size[old], grp->bb_group);
 
 	grp->bb_avg_fragment_size_order = new;
 	if (new >= 0) {
-		write_lock(&sbi->s_mb_avg_fragment_size_locks[new]);
-		list_add_tail(&grp->bb_avg_fragment_size_node,
-				&sbi->s_mb_avg_fragment_size[new]);
-		write_unlock(&sbi->s_mb_avg_fragment_size_locks[new]);
+		/*
+		* Cannot use __GFP_NOFAIL because we hold the group lock.
+		* Although allocation for insertion may fail, it's not fatal
+		* as we have linear traversal to fall back on.
+		*/
+		int err = xa_insert(&sbi->s_mb_avg_fragment_size[new],
+				    grp->bb_group, grp, GFP_ATOMIC);
+		if (err)
+			mb_debug(sb, "insert group: %u to s_mb_avg_fragment_size[%d] failed, err %d",
+				 grp->bb_group, new, err);
 	}
 }
 
+static struct ext4_group_info *
+ext4_mb_find_good_group_xarray(struct ext4_allocation_context *ac,
+			       struct xarray *xa, ext4_group_t start)
+{
+	struct super_block *sb = ac->ac_sb;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	enum criteria cr = ac->ac_criteria;
+	ext4_group_t ngroups = ext4_get_groups_count(sb);
+	unsigned long group = start;
+	ext4_group_t end = ngroups;
+	struct ext4_group_info *grp;
+
+	if (WARN_ON_ONCE(start >= end))
+		return NULL;
+
+wrap_around:
+	xa_for_each_range(xa, group, grp, start, end - 1) {
+		if (sbi->s_mb_stats)
+			atomic64_inc(&sbi->s_bal_cX_groups_considered[cr]);
+
+		if (!spin_is_locked(ext4_group_lock_ptr(sb, group)) &&
+		    likely(ext4_mb_good_group(ac, group, cr)))
+			return grp;
+
+		cond_resched();
+	}
+
+	if (start) {
+		end = start;
+		start = 0;
+		goto wrap_around;
+	}
+
+	return NULL;
+}
+
+/*
+ * Find a suitable group of given order from the largest free orders xarray.
+ */
+static struct ext4_group_info *
+ext4_mb_find_good_group_largest_free_order(struct ext4_allocation_context *ac,
+					   int order, ext4_group_t start)
+{
+	struct xarray *xa = &EXT4_SB(ac->ac_sb)->s_mb_largest_free_orders[order];
+
+	if (xa_empty(xa))
+		return NULL;
+
+	return ext4_mb_find_good_group_xarray(ac, xa, start);
+}
+
 /*
  * Choose next group by traversing largest_free_order lists. Updates *new_cr if
  * cr level needs an update.
@@ -875,7 +934,7 @@ static void ext4_mb_choose_next_group_p2_aligned(struct ext4_allocation_context
 			enum criteria *new_cr, ext4_group_t *group)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
-	struct ext4_group_info *iter;
+	struct ext4_group_info *grp;
 	int i;
 
 	if (ac->ac_status == AC_STATUS_FOUND)
@@ -885,26 +944,12 @@ static void ext4_mb_choose_next_group_p2_aligned(struct ext4_allocation_context
 		atomic_inc(&sbi->s_bal_p2_aligned_bad_suggestions);
 
 	for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
-		if (list_empty(&sbi->s_mb_largest_free_orders[i]))
-			continue;
-		read_lock(&sbi->s_mb_largest_free_orders_locks[i]);
-		if (list_empty(&sbi->s_mb_largest_free_orders[i])) {
-			read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
-			continue;
-		}
-		list_for_each_entry(iter, &sbi->s_mb_largest_free_orders[i],
-				    bb_largest_free_order_node) {
-			if (sbi->s_mb_stats)
-				atomic64_inc(&sbi->s_bal_cX_groups_considered[CR_POWER2_ALIGNED]);
-			if (!spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group)) &&
-			    likely(ext4_mb_good_group(ac, iter->bb_group, CR_POWER2_ALIGNED))) {
-				*group = iter->bb_group;
-				ac->ac_flags |= EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED;
-				read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
-				return;
-			}
+		grp = ext4_mb_find_good_group_largest_free_order(ac, i, *group);
+		if (grp) {
+			*group = grp->bb_group;
+			ac->ac_flags |= EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED;
+			return;
 		}
-		read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
 	}
 
 	/* Increment cr and search again if no group is found */
@@ -912,35 +957,18 @@ static void ext4_mb_choose_next_group_p2_aligned(struct ext4_allocation_context
 }
 
 /*
- * Find a suitable group of given order from the average fragments list.
+ * Find a suitable group of given order from the average fragments xarray.
  */
 static struct ext4_group_info *
-ext4_mb_find_good_group_avg_frag_lists(struct ext4_allocation_context *ac, int order)
+ext4_mb_find_good_group_avg_frag_xarray(struct ext4_allocation_context *ac,
+					int order, ext4_group_t start)
 {
-	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
-	struct list_head *frag_list = &sbi->s_mb_avg_fragment_size[order];
-	rwlock_t *frag_list_lock = &sbi->s_mb_avg_fragment_size_locks[order];
-	struct ext4_group_info *grp = NULL, *iter;
-	enum criteria cr = ac->ac_criteria;
+	struct xarray *xa = &EXT4_SB(ac->ac_sb)->s_mb_avg_fragment_size[order];
 
-	if (list_empty(frag_list))
+	if (xa_empty(xa))
 		return NULL;
-	read_lock(frag_list_lock);
-	if (list_empty(frag_list)) {
-		read_unlock(frag_list_lock);
-		return NULL;
-	}
-	list_for_each_entry(iter, frag_list, bb_avg_fragment_size_node) {
-		if (sbi->s_mb_stats)
-			atomic64_inc(&sbi->s_bal_cX_groups_considered[cr]);
-		if (!spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group)) &&
-		    likely(ext4_mb_good_group(ac, iter->bb_group, cr))) {
-			grp = iter;
-			break;
-		}
-	}
-	read_unlock(frag_list_lock);
-	return grp;
+
+	return ext4_mb_find_good_group_xarray(ac, xa, start);
 }
 
 /*
@@ -961,7 +989,7 @@ static void ext4_mb_choose_next_group_goal_fast(struct ext4_allocation_context *
 
 	for (i = mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len);
 	     i < MB_NUM_ORDERS(ac->ac_sb); i++) {
-		grp = ext4_mb_find_good_group_avg_frag_lists(ac, i);
+		grp = ext4_mb_find_good_group_avg_frag_xarray(ac, i, *group);
 		if (grp) {
 			*group = grp->bb_group;
 			ac->ac_flags |= EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED;
@@ -1057,7 +1085,8 @@ static void ext4_mb_choose_next_group_best_avail(struct ext4_allocation_context
 		frag_order = mb_avg_fragment_size_order(ac->ac_sb,
 							ac->ac_g_ex.fe_len);
 
-		grp = ext4_mb_find_good_group_avg_frag_lists(ac, frag_order);
+		grp = ext4_mb_find_good_group_avg_frag_xarray(ac, frag_order,
+							      *group);
 		if (grp) {
 			*group = grp->bb_group;
 			ac->ac_flags |= EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED;
@@ -1162,18 +1191,25 @@ mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
 	if (new == old)
 		return;
 
-	if (old >= 0 && !list_empty(&grp->bb_largest_free_order_node)) {
-		write_lock(&sbi->s_mb_largest_free_orders_locks[old]);
-		list_del_init(&grp->bb_largest_free_order_node);
-		write_unlock(&sbi->s_mb_largest_free_orders_locks[old]);
+	if (old >= 0) {
+		struct xarray *xa = &sbi->s_mb_largest_free_orders[old];
+
+		if (!xa_empty(xa) && xa_load(xa, grp->bb_group))
+			xa_erase(xa, grp->bb_group);
 	}
 
 	grp->bb_largest_free_order = new;
 	if (test_opt2(sb, MB_OPTIMIZE_SCAN) && new >= 0 && grp->bb_free) {
-		write_lock(&sbi->s_mb_largest_free_orders_locks[new]);
-		list_add_tail(&grp->bb_largest_free_order_node,
-			      &sbi->s_mb_largest_free_orders[new]);
-		write_unlock(&sbi->s_mb_largest_free_orders_locks[new]);
+		/*
+		* Cannot use __GFP_NOFAIL because we hold the group lock.
+		* Although allocation for insertion may fail, it's not fatal
+		* as we have linear traversal to fall back on.
+		*/
+		int err = xa_insert(&sbi->s_mb_largest_free_orders[new],
+				    grp->bb_group, grp, GFP_ATOMIC);
+		if (err)
+			mb_debug(sb, "insert group: %u to s_mb_largest_free_orders[%d] failed, err %d",
+				 grp->bb_group, new, err);
 	}
 }
 
@@ -3269,6 +3305,7 @@ static int ext4_mb_seq_structs_summary_show(struct seq_file *seq, void *v)
 	unsigned long position = ((unsigned long) v);
 	struct ext4_group_info *grp;
 	unsigned int count;
+	unsigned long idx;
 
 	position--;
 	if (position >= MB_NUM_ORDERS(sb)) {
@@ -3277,11 +3314,8 @@ static int ext4_mb_seq_structs_summary_show(struct seq_file *seq, void *v)
 			seq_puts(seq, "avg_fragment_size_lists:\n");
 
 		count = 0;
-		read_lock(&sbi->s_mb_avg_fragment_size_locks[position]);
-		list_for_each_entry(grp, &sbi->s_mb_avg_fragment_size[position],
-				    bb_avg_fragment_size_node)
+		xa_for_each(&sbi->s_mb_avg_fragment_size[position], idx, grp)
 			count++;
-		read_unlock(&sbi->s_mb_avg_fragment_size_locks[position]);
 		seq_printf(seq, "\tlist_order_%u_groups: %u\n",
 					(unsigned int)position, count);
 		return 0;
@@ -3293,11 +3327,8 @@ static int ext4_mb_seq_structs_summary_show(struct seq_file *seq, void *v)
 		seq_puts(seq, "max_free_order_lists:\n");
 	}
 	count = 0;
-	read_lock(&sbi->s_mb_largest_free_orders_locks[position]);
-	list_for_each_entry(grp, &sbi->s_mb_largest_free_orders[position],
-			    bb_largest_free_order_node)
+	xa_for_each(&sbi->s_mb_largest_free_orders[position], idx, grp)
 		count++;
-	read_unlock(&sbi->s_mb_largest_free_orders_locks[position]);
 	seq_printf(seq, "\tlist_order_%u_groups: %u\n",
 		   (unsigned int)position, count);
 
@@ -3417,8 +3448,6 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
 	INIT_LIST_HEAD(&meta_group_info[i]->bb_prealloc_list);
 	init_rwsem(&meta_group_info[i]->alloc_sem);
 	meta_group_info[i]->bb_free_root = RB_ROOT;
-	INIT_LIST_HEAD(&meta_group_info[i]->bb_largest_free_order_node);
-	INIT_LIST_HEAD(&meta_group_info[i]->bb_avg_fragment_size_node);
 	meta_group_info[i]->bb_largest_free_order = -1;  /* uninit */
 	meta_group_info[i]->bb_avg_fragment_size_order = -1;  /* uninit */
 	meta_group_info[i]->bb_group = group;
@@ -3628,6 +3657,20 @@ static void ext4_discard_work(struct work_struct *work)
 		ext4_mb_unload_buddy(&e4b);
 }
 
+static inline void ext4_mb_avg_fragment_size_destory(struct ext4_sb_info *sbi)
+{
+	for (int i = 0; i < MB_NUM_ORDERS(sbi->s_sb); i++)
+		xa_destroy(&sbi->s_mb_avg_fragment_size[i]);
+	kfree(sbi->s_mb_avg_fragment_size);
+}
+
+static inline void ext4_mb_largest_free_orders_destory(struct ext4_sb_info *sbi)
+{
+	for (int i = 0; i < MB_NUM_ORDERS(sbi->s_sb); i++)
+		xa_destroy(&sbi->s_mb_largest_free_orders[i]);
+	kfree(sbi->s_mb_largest_free_orders);
+}
+
 int ext4_mb_init(struct super_block *sb)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -3673,41 +3716,24 @@ int ext4_mb_init(struct super_block *sb)
 	} while (i < MB_NUM_ORDERS(sb));
 
 	sbi->s_mb_avg_fragment_size =
-		kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
+		kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct xarray),
 			GFP_KERNEL);
 	if (!sbi->s_mb_avg_fragment_size) {
 		ret = -ENOMEM;
 		goto out;
 	}
-	sbi->s_mb_avg_fragment_size_locks =
-		kmalloc_array(MB_NUM_ORDERS(sb), sizeof(rwlock_t),
-			GFP_KERNEL);
-	if (!sbi->s_mb_avg_fragment_size_locks) {
-		ret = -ENOMEM;
-		goto out;
-	}
-	for (i = 0; i < MB_NUM_ORDERS(sb); i++) {
-		INIT_LIST_HEAD(&sbi->s_mb_avg_fragment_size[i]);
-		rwlock_init(&sbi->s_mb_avg_fragment_size_locks[i]);
-	}
+	for (i = 0; i < MB_NUM_ORDERS(sb); i++)
+		xa_init(&sbi->s_mb_avg_fragment_size[i]);
+
 	sbi->s_mb_largest_free_orders =
-		kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
+		kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct xarray),
 			GFP_KERNEL);
 	if (!sbi->s_mb_largest_free_orders) {
 		ret = -ENOMEM;
 		goto out;
 	}
-	sbi->s_mb_largest_free_orders_locks =
-		kmalloc_array(MB_NUM_ORDERS(sb), sizeof(rwlock_t),
-			GFP_KERNEL);
-	if (!sbi->s_mb_largest_free_orders_locks) {
-		ret = -ENOMEM;
-		goto out;
-	}
-	for (i = 0; i < MB_NUM_ORDERS(sb); i++) {
-		INIT_LIST_HEAD(&sbi->s_mb_largest_free_orders[i]);
-		rwlock_init(&sbi->s_mb_largest_free_orders_locks[i]);
-	}
+	for (i = 0; i < MB_NUM_ORDERS(sb); i++)
+		xa_init(&sbi->s_mb_largest_free_orders[i]);
 
 	spin_lock_init(&sbi->s_md_lock);
 	atomic_set(&sbi->s_mb_free_pending, 0);
@@ -3792,10 +3818,8 @@ int ext4_mb_init(struct super_block *sb)
 	kfree(sbi->s_mb_last_groups);
 	sbi->s_mb_last_groups = NULL;
 out:
-	kfree(sbi->s_mb_avg_fragment_size);
-	kfree(sbi->s_mb_avg_fragment_size_locks);
-	kfree(sbi->s_mb_largest_free_orders);
-	kfree(sbi->s_mb_largest_free_orders_locks);
+	ext4_mb_avg_fragment_size_destory(sbi);
+	ext4_mb_largest_free_orders_destory(sbi);
 	kfree(sbi->s_mb_offsets);
 	sbi->s_mb_offsets = NULL;
 	kfree(sbi->s_mb_maxs);
@@ -3862,10 +3886,8 @@ void ext4_mb_release(struct super_block *sb)
 		kvfree(group_info);
 		rcu_read_unlock();
 	}
-	kfree(sbi->s_mb_avg_fragment_size);
-	kfree(sbi->s_mb_avg_fragment_size_locks);
-	kfree(sbi->s_mb_largest_free_orders);
-	kfree(sbi->s_mb_largest_free_orders_locks);
+	ext4_mb_avg_fragment_size_destory(sbi);
+	ext4_mb_largest_free_orders_destory(sbi);
 	kfree(sbi->s_mb_offsets);
 	kfree(sbi->s_mb_maxs);
 	iput(sbi->s_buddy_cache);
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 16/17] ext4: refactor choose group to scan group
  2025-07-14 13:03 [PATCH v3 00/17] ext4: better scalability for ext4 block allocation Baokun Li
                   ` (14 preceding siblings ...)
  2025-07-14 13:03 ` [PATCH v3 15/17] ext4: convert free groups order lists to xarrays Baokun Li
@ 2025-07-14 13:03 ` Baokun Li
  2025-07-14 13:03 ` [PATCH v3 17/17] ext4: implement linear-like traversal across order xarrays Baokun Li
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 44+ messages in thread
From: Baokun Li @ 2025-07-14 13:03 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, linux-kernel, ojaswin, julia.lawall,
	yi.zhang, yangerkun, libaokun1, libaokun

This commit converts the `choose group` logic to `scan group` using
previously prepared helper functions. This allows us to leverage xarrays
for ordered non-linear traversal, thereby mitigating the "bouncing" issue
inherent in the `choose group` mechanism.

This also decouples linear and non-linear traversals, leading to cleaner
and more readable code.

Key changes:

 * ext4_mb_choose_next_group() is refactored to ext4_mb_scan_groups().

 * Replaced ext4_mb_good_group() with ext4_mb_scan_group() in non-linear
   traversals, and related functions now return error codes instead of
   group info.

 * Added ext4_mb_scan_groups_linear() to perform a linear scan over a
   given number of groups, starting from a specific group (see the sketch
   after this list).

 * Linear scans now execute up to sbi->s_mb_max_linear_groups times,
   so ac_groups_linear_remaining is removed as it's no longer used.

 * ac->ac_criteria is now used directly instead of passing cr around.
   Also, ac->ac_criteria is incremented directly after the group scan for
   the corresponding criterion fails.

 * Since we now scan groups directly instead of first finding a good
   group and then scanning it, the following variables and flags are no
   longer needed; s_bal_cX_groups_considered is sufficient:

    s_bal_p2_aligned_bad_suggestions
    s_bal_goal_fast_bad_suggestions
    s_bal_best_avail_bad_suggestions
    EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED
    EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED
    EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED
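
A minimal standalone sketch of the linear pass mentioned above
(hypothetical values; it mirrors next_linear_group() and the wrap
behaviour of the new helper):

#include <stdio.h>

static void next_linear_group(unsigned int *group, unsigned int ngroups)
{
	*group = *group + 1 >= ngroups ? 0 : *group + 1;
}

int main(void)
{
	unsigned int ngroups = 8, start = 6, count = 4;	/* s_mb_max_linear_groups */
	unsigned int group = start;

	for (unsigned int i = 0; i < count; i++, next_linear_group(&group, ngroups))
		printf("scan group %u\n", group);	/* 6, 7, 0, 1 */
	return 0;
}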

Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/ext4.h    |  12 --
 fs/ext4/mballoc.c | 292 +++++++++++++++++++++-------------------------
 fs/ext4/mballoc.h |   1 -
 3 files changed, 131 insertions(+), 174 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index ea412fdb0b76..6afd3447bfca 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -207,15 +207,6 @@ enum criteria {
 #define EXT4_MB_USE_RESERVED		0x2000
 /* Do strict check for free blocks while retrying block allocation */
 #define EXT4_MB_STRICT_CHECK		0x4000
-/* Large fragment size list lookup succeeded at least once for
- * CR_POWER2_ALIGNED */
-#define EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED		0x8000
-/* Avg fragment size rb tree lookup succeeded at least once for
- * CR_GOAL_LEN_FAST */
-#define EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED		0x00010000
-/* Avg fragment size rb tree lookup succeeded at least once for
- * CR_BEST_AVAIL_LEN */
-#define EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED		0x00020000
 
 struct ext4_allocation_request {
 	/* target inode for block we're allocating */
@@ -1643,9 +1634,6 @@ struct ext4_sb_info {
 	atomic_t s_bal_len_goals;	/* len goal hits */
 	atomic_t s_bal_breaks;	/* too long searches */
 	atomic_t s_bal_2orders;	/* 2^order hits */
-	atomic_t s_bal_p2_aligned_bad_suggestions;
-	atomic_t s_bal_goal_fast_bad_suggestions;
-	atomic_t s_bal_best_avail_bad_suggestions;
 	atomic64_t s_bal_cX_groups_considered[EXT4_MB_NUM_CRS];
 	atomic64_t s_bal_cX_hits[EXT4_MB_NUM_CRS];
 	atomic64_t s_bal_cX_failed[EXT4_MB_NUM_CRS];		/* cX loop didn't find blocks */
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index a9eb997b8c9b..79b2c6b37fbd 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -425,8 +425,8 @@ static void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap,
 					ext4_group_t group);
 static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac);
 
-static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
-			       ext4_group_t group, enum criteria cr);
+static int ext4_mb_scan_group(struct ext4_allocation_context *ac,
+			      ext4_group_t group);
 
 static int ext4_try_to_trim_range(struct super_block *sb,
 		struct ext4_buddy *e4b, ext4_grpblk_t start,
@@ -875,9 +875,8 @@ mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
 	}
 }
 
-static struct ext4_group_info *
-ext4_mb_find_good_group_xarray(struct ext4_allocation_context *ac,
-			       struct xarray *xa, ext4_group_t start)
+static int ext4_mb_scan_groups_xarray(struct ext4_allocation_context *ac,
+				      struct xarray *xa, ext4_group_t start)
 {
 	struct super_block *sb = ac->ac_sb;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -888,16 +887,18 @@ ext4_mb_find_good_group_xarray(struct ext4_allocation_context *ac,
 	struct ext4_group_info *grp;
 
 	if (WARN_ON_ONCE(start >= end))
-		return NULL;
+		return 0;
 
 wrap_around:
 	xa_for_each_range(xa, group, grp, start, end - 1) {
+		int err;
+
 		if (sbi->s_mb_stats)
 			atomic64_inc(&sbi->s_bal_cX_groups_considered[cr]);
 
-		if (!spin_is_locked(ext4_group_lock_ptr(sb, group)) &&
-		    likely(ext4_mb_good_group(ac, group, cr)))
-			return grp;
+		err = ext4_mb_scan_group(ac, grp->bb_group);
+		if (err || ac->ac_status != AC_STATUS_CONTINUE)
+			return err;
 
 		cond_resched();
 	}
@@ -908,95 +909,82 @@ ext4_mb_find_good_group_xarray(struct ext4_allocation_context *ac,
 		goto wrap_around;
 	}
 
-	return NULL;
+	return 0;
 }
 
 /*
  * Find a suitable group of given order from the largest free orders xarray.
  */
-static struct ext4_group_info *
-ext4_mb_find_good_group_largest_free_order(struct ext4_allocation_context *ac,
-					   int order, ext4_group_t start)
+static int
+ext4_mb_scan_groups_largest_free_order(struct ext4_allocation_context *ac,
+				       int order, ext4_group_t start)
 {
 	struct xarray *xa = &EXT4_SB(ac->ac_sb)->s_mb_largest_free_orders[order];
 
 	if (xa_empty(xa))
-		return NULL;
+		return 0;
 
-	return ext4_mb_find_good_group_xarray(ac, xa, start);
+	return ext4_mb_scan_groups_xarray(ac, xa, start);
 }
 
 /*
  * Choose next group by traversing largest_free_order lists. Updates *new_cr if
  * cr level needs an update.
  */
-static void ext4_mb_choose_next_group_p2_aligned(struct ext4_allocation_context *ac,
-			enum criteria *new_cr, ext4_group_t *group)
+static int ext4_mb_scan_groups_p2_aligned(struct ext4_allocation_context *ac,
+					  ext4_group_t group)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
-	struct ext4_group_info *grp;
 	int i;
-
-	if (ac->ac_status == AC_STATUS_FOUND)
-		return;
-
-	if (unlikely(sbi->s_mb_stats && ac->ac_flags & EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED))
-		atomic_inc(&sbi->s_bal_p2_aligned_bad_suggestions);
+	int ret = 0;
 
 	for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
-		grp = ext4_mb_find_good_group_largest_free_order(ac, i, *group);
-		if (grp) {
-			*group = grp->bb_group;
-			ac->ac_flags |= EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED;
-			return;
-		}
+		ret = ext4_mb_scan_groups_largest_free_order(ac, i, group);
+		if (ret || ac->ac_status != AC_STATUS_CONTINUE)
+			return ret;
 	}
 
+	if (sbi->s_mb_stats)
+		atomic64_inc(&sbi->s_bal_cX_failed[ac->ac_criteria]);
+
 	/* Increment cr and search again if no group is found */
-	*new_cr = CR_GOAL_LEN_FAST;
+	ac->ac_criteria = CR_GOAL_LEN_FAST;
+	return ret;
 }
 
 /*
  * Find a suitable group of given order from the average fragments xarray.
  */
-static struct ext4_group_info *
-ext4_mb_find_good_group_avg_frag_xarray(struct ext4_allocation_context *ac,
-					int order, ext4_group_t start)
+static int ext4_mb_scan_groups_avg_frag_order(struct ext4_allocation_context *ac,
+					      int order, ext4_group_t start)
 {
 	struct xarray *xa = &EXT4_SB(ac->ac_sb)->s_mb_avg_fragment_size[order];
 
 	if (xa_empty(xa))
-		return NULL;
+		return 0;
 
-	return ext4_mb_find_good_group_xarray(ac, xa, start);
+	return ext4_mb_scan_groups_xarray(ac, xa, start);
 }
 
 /*
  * Choose next group by traversing average fragment size list of suitable
  * order. Updates *new_cr if cr level needs an update.
  */
-static void ext4_mb_choose_next_group_goal_fast(struct ext4_allocation_context *ac,
-		enum criteria *new_cr, ext4_group_t *group)
+static int ext4_mb_scan_groups_goal_fast(struct ext4_allocation_context *ac,
+					 ext4_group_t group)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
-	struct ext4_group_info *grp = NULL;
-	int i;
+	int i, ret = 0;
 
-	if (unlikely(ac->ac_flags & EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED)) {
-		if (sbi->s_mb_stats)
-			atomic_inc(&sbi->s_bal_goal_fast_bad_suggestions);
-	}
-
-	for (i = mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len);
-	     i < MB_NUM_ORDERS(ac->ac_sb); i++) {
-		grp = ext4_mb_find_good_group_avg_frag_xarray(ac, i, *group);
-		if (grp) {
-			*group = grp->bb_group;
-			ac->ac_flags |= EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED;
-			return;
-		}
+	i = mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len);
+	for (; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
+		ret = ext4_mb_scan_groups_avg_frag_order(ac, i, group);
+		if (ret || ac->ac_status != AC_STATUS_CONTINUE)
+			return ret;
 	}
 
+	if (sbi->s_mb_stats)
+		atomic64_inc(&sbi->s_bal_cX_failed[ac->ac_criteria]);
 	/*
 	 * CR_BEST_AVAIL_LEN works based on the concept that we have
 	 * a larger normalized goal len request which can be trimmed to
@@ -1006,9 +994,11 @@ static void ext4_mb_choose_next_group_goal_fast(struct ext4_allocation_context *
 	 * See function ext4_mb_normalize_request() (EXT4_MB_HINT_DATA).
 	 */
 	if (ac->ac_flags & EXT4_MB_HINT_DATA)
-		*new_cr = CR_BEST_AVAIL_LEN;
+		ac->ac_criteria = CR_BEST_AVAIL_LEN;
 	else
-		*new_cr = CR_GOAL_LEN_SLOW;
+		ac->ac_criteria = CR_GOAL_LEN_SLOW;
+
+	return ret;
 }
 
 /*
@@ -1020,19 +1010,14 @@ static void ext4_mb_choose_next_group_goal_fast(struct ext4_allocation_context *
  * preallocations. However, we make sure that we don't trim the request too
  * much and fall to CR_GOAL_LEN_SLOW in that case.
  */
-static void ext4_mb_choose_next_group_best_avail(struct ext4_allocation_context *ac,
-		enum criteria *new_cr, ext4_group_t *group)
+static int ext4_mb_scan_groups_best_avail(struct ext4_allocation_context *ac,
+					  ext4_group_t group)
 {
+	int ret = 0;
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
-	struct ext4_group_info *grp = NULL;
 	int i, order, min_order;
 	unsigned long num_stripe_clusters = 0;
 
-	if (unlikely(ac->ac_flags & EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED)) {
-		if (sbi->s_mb_stats)
-			atomic_inc(&sbi->s_bal_best_avail_bad_suggestions);
-	}
-
 	/*
 	 * mb_avg_fragment_size_order() returns order in a way that makes
 	 * retrieving back the length using (1 << order) inaccurate. Hence, use
@@ -1085,18 +1070,18 @@ static void ext4_mb_choose_next_group_best_avail(struct ext4_allocation_context
 		frag_order = mb_avg_fragment_size_order(ac->ac_sb,
 							ac->ac_g_ex.fe_len);
 
-		grp = ext4_mb_find_good_group_avg_frag_xarray(ac, frag_order,
-							      *group);
-		if (grp) {
-			*group = grp->bb_group;
-			ac->ac_flags |= EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED;
-			return;
-		}
+		ret = ext4_mb_scan_groups_avg_frag_order(ac, frag_order, group);
+		if (ret || ac->ac_status != AC_STATUS_CONTINUE)
+			return ret;
 	}
 
 	/* Reset goal length to original goal length before falling into CR_GOAL_LEN_SLOW */
 	ac->ac_g_ex.fe_len = ac->ac_orig_goal_len;
-	*new_cr = CR_GOAL_LEN_SLOW;
+	if (sbi->s_mb_stats)
+		atomic64_inc(&sbi->s_bal_cX_failed[ac->ac_criteria]);
+	ac->ac_criteria = CR_GOAL_LEN_SLOW;
+
+	return ret;
 }
 
 static inline int should_optimize_scan(struct ext4_allocation_context *ac)
@@ -1111,59 +1096,82 @@ static inline int should_optimize_scan(struct ext4_allocation_context *ac)
 }
 
 /*
- * Return next linear group for allocation.
+ * next linear group for allocation.
  */
-static ext4_group_t
-next_linear_group(ext4_group_t group, ext4_group_t ngroups)
+static void next_linear_group(ext4_group_t *group, ext4_group_t ngroups)
 {
 	/*
 	 * Artificially restricted ngroups for non-extent
 	 * files makes group > ngroups possible on first loop.
 	 */
-	return group + 1 >= ngroups ? 0 : group + 1;
+	*group =  *group + 1 >= ngroups ? 0 : *group + 1;
 }
 
-/*
- * ext4_mb_choose_next_group: choose next group for allocation.
- *
- * @ac        Allocation Context
- * @new_cr    This is an output parameter. If the there is no good group
- *            available at current CR level, this field is updated to indicate
- *            the new cr level that should be used.
- * @group     This is an input / output parameter. As an input it indicates the
- *            next group that the allocator intends to use for allocation. As
- *            output, this field indicates the next group that should be used as
- *            determined by the optimization functions.
- * @ngroups   Total number of groups
- */
-static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
-		enum criteria *new_cr, ext4_group_t *group, ext4_group_t ngroups)
+static int ext4_mb_scan_groups_linear(struct ext4_allocation_context *ac,
+		ext4_group_t ngroups, ext4_group_t *start, ext4_group_t count)
 {
-	*new_cr = ac->ac_criteria;
+	int ret, i;
+	enum criteria cr = ac->ac_criteria;
+	struct super_block *sb = ac->ac_sb;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	ext4_group_t group = *start;
 
-	if (!should_optimize_scan(ac)) {
-		*group = next_linear_group(*group, ngroups);
-		return;
+	for (i = 0; i < count; i++, next_linear_group(&group, ngroups)) {
+		ret = ext4_mb_scan_group(ac, group);
+		if (ret || ac->ac_status != AC_STATUS_CONTINUE)
+			return ret;
+		cond_resched();
 	}
 
+	*start = group;
+	if (count == ngroups)
+		ac->ac_criteria++;
+
+	/* Processed all groups and haven't found blocks */
+	if (sbi->s_mb_stats && i == ngroups)
+		atomic64_inc(&sbi->s_bal_cX_failed[cr]);
+
+	return 0;
+}
+
+static int ext4_mb_scan_groups(struct ext4_allocation_context *ac)
+{
+	int ret = 0;
+	ext4_group_t start;
+	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+	ext4_group_t ngroups = ext4_get_groups_count(ac->ac_sb);
+
+	/* non-extent files are limited to low blocks/groups */
+	if (!(ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS)))
+		ngroups = sbi->s_blockfile_groups;
+
+	/* searching for the right group start from the goal value specified */
+	start = ac->ac_g_ex.fe_group;
+	ac->ac_prefetch_grp = start;
+	ac->ac_prefetch_nr = 0;
+
+	if (!should_optimize_scan(ac))
+		return ext4_mb_scan_groups_linear(ac, ngroups, &start, ngroups);
+
 	/*
 	 * Optimized scanning can return non adjacent groups which can cause
 	 * seek overhead for rotational disks. So try few linear groups before
 	 * trying optimized scan.
 	 */
-	if (ac->ac_groups_linear_remaining) {
-		*group = next_linear_group(*group, ngroups);
-		ac->ac_groups_linear_remaining--;
-		return;
-	}
+	if (sbi->s_mb_max_linear_groups)
+		ret = ext4_mb_scan_groups_linear(ac, ngroups, &start,
+						 sbi->s_mb_max_linear_groups);
+	if (ret || ac->ac_status != AC_STATUS_CONTINUE)
+		return ret;
 
-	if (*new_cr == CR_POWER2_ALIGNED) {
-		ext4_mb_choose_next_group_p2_aligned(ac, new_cr, group);
-	} else if (*new_cr == CR_GOAL_LEN_FAST) {
-		ext4_mb_choose_next_group_goal_fast(ac, new_cr, group);
-	} else if (*new_cr == CR_BEST_AVAIL_LEN) {
-		ext4_mb_choose_next_group_best_avail(ac, new_cr, group);
-	} else {
+	switch (ac->ac_criteria) {
+	case CR_POWER2_ALIGNED:
+		return ext4_mb_scan_groups_p2_aligned(ac, start);
+	case CR_GOAL_LEN_FAST:
+		return ext4_mb_scan_groups_goal_fast(ac, start);
+	case CR_BEST_AVAIL_LEN:
+		return ext4_mb_scan_groups_best_avail(ac, start);
+	default:
 		/*
 		 * TODO: For CR_GOAL_LEN_SLOW, we can arrange groups in an
 		 * rb tree sorted by bb_free. But until that happens, we should
@@ -1171,6 +1179,8 @@ static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
 		 */
 		WARN_ON(1);
 	}
+
+	return 0;
 }
 
 /*
@@ -2928,20 +2938,11 @@ static int ext4_mb_scan_group(struct ext4_allocation_context *ac,
 static noinline_for_stack int
 ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 {
-	ext4_group_t ngroups, group, i;
-	enum criteria new_cr, cr = CR_GOAL_LEN_FAST;
+	ext4_group_t i;
 	int err = 0;
-	struct ext4_sb_info *sbi;
-	struct super_block *sb;
+	struct super_block *sb = ac->ac_sb;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	struct ext4_buddy e4b;
-	int lost;
-
-	sb = ac->ac_sb;
-	sbi = EXT4_SB(sb);
-	ngroups = ext4_get_groups_count(sb);
-	/* non-extent files are limited to low blocks/groups */
-	if (!(ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS)))
-		ngroups = sbi->s_blockfile_groups;
 
 	BUG_ON(ac->ac_status == AC_STATUS_FOUND);
 
@@ -2987,48 +2988,21 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 	 * start with CR_GOAL_LEN_FAST, unless it is power of 2
 	 * aligned, in which case let's do that faster approach first.
 	 */
+	ac->ac_criteria = CR_GOAL_LEN_FAST;
 	if (ac->ac_2order)
-		cr = CR_POWER2_ALIGNED;
+		ac->ac_criteria = CR_POWER2_ALIGNED;
 
 	ac->ac_e4b = &e4b;
 	ac->ac_prefetch_ios = 0;
 	ac->ac_first_err = 0;
 repeat:
-	for (; cr < EXT4_MB_NUM_CRS && ac->ac_status == AC_STATUS_CONTINUE; cr++) {
-		ac->ac_criteria = cr;
-		/*
-		 * searching for the right group start
-		 * from the goal value specified
-		 */
-		group = ac->ac_g_ex.fe_group;
-		ac->ac_groups_linear_remaining = sbi->s_mb_max_linear_groups;
-		ac->ac_prefetch_grp = group;
-		ac->ac_prefetch_nr = 0;
-
-		for (i = 0, new_cr = cr; i < ngroups; i++,
-		     ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups)) {
-
-			cond_resched();
-			if (new_cr != cr) {
-				cr = new_cr;
-				goto repeat;
-			}
-
-			err = ext4_mb_scan_group(ac, group);
-			if (err)
-				goto out;
-
-			if (ac->ac_status != AC_STATUS_CONTINUE)
-				break;
-		}
-		/* Processed all groups and haven't found blocks */
-		if (sbi->s_mb_stats && i == ngroups)
-			atomic64_inc(&sbi->s_bal_cX_failed[cr]);
+	while (ac->ac_criteria < EXT4_MB_NUM_CRS) {
+		err = ext4_mb_scan_groups(ac);
+		if (err)
+			goto out;
 
-		if (i == ngroups && ac->ac_criteria == CR_BEST_AVAIL_LEN)
-			/* Reset goal length to original goal length before
-			 * falling into CR_GOAL_LEN_SLOW */
-			ac->ac_g_ex.fe_len = ac->ac_orig_goal_len;
+		if (ac->ac_status != AC_STATUS_CONTINUE)
+			break;
 	}
 
 	if (ac->ac_b_ex.fe_len > 0 && ac->ac_status != AC_STATUS_FOUND &&
@@ -3039,6 +3013,8 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 		 */
 		ext4_mb_try_best_found(ac, &e4b);
 		if (ac->ac_status != AC_STATUS_FOUND) {
+			int lost;
+
 			/*
 			 * Someone more lucky has already allocated it.
 			 * The only thing we can do is just take first
@@ -3054,7 +3030,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 			ac->ac_b_ex.fe_len = 0;
 			ac->ac_status = AC_STATUS_CONTINUE;
 			ac->ac_flags |= EXT4_MB_HINT_FIRST;
-			cr = CR_ANY_FREE;
+			ac->ac_criteria = CR_ANY_FREE;
 			goto repeat;
 		}
 	}
@@ -3071,7 +3047,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 
 	mb_debug(sb, "Best len %d, origin len %d, ac_status %u, ac_flags 0x%x, cr %d ret %d\n",
 		 ac->ac_b_ex.fe_len, ac->ac_o_ex.fe_len, ac->ac_status,
-		 ac->ac_flags, cr, err);
+		 ac->ac_flags, ac->ac_criteria, err);
 
 	if (ac->ac_prefetch_nr)
 		ext4_mb_prefetch_fini(sb, ac->ac_prefetch_grp, ac->ac_prefetch_nr);
@@ -3197,8 +3173,6 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
 		   atomic_read(&sbi->s_bal_cX_ex_scanned[CR_POWER2_ALIGNED]));
 	seq_printf(seq, "\t\tuseless_loops: %llu\n",
 		   atomic64_read(&sbi->s_bal_cX_failed[CR_POWER2_ALIGNED]));
-	seq_printf(seq, "\t\tbad_suggestions: %u\n",
-		   atomic_read(&sbi->s_bal_p2_aligned_bad_suggestions));
 
 	/* CR_GOAL_LEN_FAST stats */
 	seq_puts(seq, "\tcr_goal_fast_stats:\n");
@@ -3211,8 +3185,6 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
 		   atomic_read(&sbi->s_bal_cX_ex_scanned[CR_GOAL_LEN_FAST]));
 	seq_printf(seq, "\t\tuseless_loops: %llu\n",
 		   atomic64_read(&sbi->s_bal_cX_failed[CR_GOAL_LEN_FAST]));
-	seq_printf(seq, "\t\tbad_suggestions: %u\n",
-		   atomic_read(&sbi->s_bal_goal_fast_bad_suggestions));
 
 	/* CR_BEST_AVAIL_LEN stats */
 	seq_puts(seq, "\tcr_best_avail_stats:\n");
@@ -3226,8 +3198,6 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
 		   atomic_read(&sbi->s_bal_cX_ex_scanned[CR_BEST_AVAIL_LEN]));
 	seq_printf(seq, "\t\tuseless_loops: %llu\n",
 		   atomic64_read(&sbi->s_bal_cX_failed[CR_BEST_AVAIL_LEN]));
-	seq_printf(seq, "\t\tbad_suggestions: %u\n",
-		   atomic_read(&sbi->s_bal_best_avail_bad_suggestions));
 
 	/* CR_GOAL_LEN_SLOW stats */
 	seq_puts(seq, "\tcr_goal_slow_stats:\n");
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 83886fc9521b..15a049f05d04 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -199,7 +199,6 @@ struct ext4_allocation_context {
 	int ac_first_err;
 
 	__u32 ac_flags;		/* allocation hints */
-	__u32 ac_groups_linear_remaining;
 	__u16 ac_groups_scanned;
 	__u16 ac_found;
 	__u16 ac_cX_found[EXT4_MB_NUM_CRS];
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 17/17] ext4: implement linear-like traversal across order xarrays
  2025-07-14 13:03 [PATCH v3 00/17] ext4: better scalability for ext4 block allocation Baokun Li
                   ` (15 preceding siblings ...)
  2025-07-14 13:03 ` [PATCH v3 16/17] ext4: refactor choose group to scan group Baokun Li
@ 2025-07-14 13:03 ` Baokun Li
  2025-07-15  1:11 ` [PATCH v3 00/17] ext4: better scalability for ext4 block allocation Zhang Yi
  2025-07-19 21:45 ` Theodore Ts'o
  18 siblings, 0 replies; 44+ messages in thread
From: Baokun Li @ 2025-07-14 13:03 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, linux-kernel, ojaswin, julia.lawall,
	yi.zhang, yangerkun, libaokun1, libaokun

Although we now perform ordered traversal within an xarray, this is
currently limited to a single xarray. Because there are multiple such
xarrays, we cannot guarantee a linear-like traversal in which all groups
to the right of the goal are visited before all groups to its left.

For example, suppose we have 128 block groups, the target group is 64,
the target length corresponds to order 1, and the only groups with free
space are group 16 (largest free order 1) and group 65 (largest free
order 8):

For linear traversal, when no suitable free block is found in group 64,
the scan moves on through the following groups up to group 127, then
wraps around and searches from group 0 up to group 63. This ensures
continuous forward traversal, which matches the unidirectional rotation
of HDD platters.

Additionally, some block group lock contention with block freeing is
unavoidable. The goal having advanced from group 0 to group 64 suggests
that the groups behind it, both those already scanned (which had no
suitable free space and are likely to free blocks later) and those
skipped (which were busy), have since freed some of their used blocks.
Allocating blocks in these groups therefore raises the probability of
competing with other processes.

With non-linear traversal, we first traverse all groups in the order_1
xarray. If only group 16 has free space in this xarray, we first traverse
[64, 128), then traverse [0, 64) to find the available group 16, and then
allocate blocks in group 16. Such a scan cannot guarantee continuous
traversal in one direction, which increases the probability of contention.

So refactor ext4_mb_scan_groups_xarray() to ext4_mb_scan_groups_xa_range()
to only traverse a fixed range of groups, and move the logic for handling
wrap around to the caller. The caller first iterates through all xarrays
in the range [start, ngroups) and then through the range [0, start). This
approach simulates a linear scan, which reduces contention between freeing
blocks and allocating blocks.

Assume we have the following groups, where "|" denotes the xarray traversal
start position:

order_1_groups: AB | CD
order_2_groups: EF | GH

Traversal order:
Before: C > D > A > B > G > H > E > F
After:  C > D > G > H > A > B > E > F
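
The before/after orders above can be reproduced with the tiny,
self-contained C program below, where plain arrays stand in for the
per-order xarrays (illustrative only; nothing here is taken from the
kernel code):

#include <stdio.h>

#define NGROUPS 4

static const char order_1[NGROUPS] = { 'A', 'B', 'C', 'D' };
static const char order_2[NGROUPS] = { 'E', 'F', 'G', 'H' };
static const char *const orders[] = { order_1, order_2 };
static const int start = 2;	/* the "|" position above */

/* before: each xarray handles its own wrap-around internally */
static void scan_before(void)
{
	for (int o = 0; o < 2; o++) {
		for (int g = start; g < NGROUPS; g++)
			putchar(orders[o][g]);
		for (int g = 0; g < start; g++)
			putchar(orders[o][g]);
	}
	putchar('\n');	/* prints CDABGHEF */
}

/* after: scan [start, NGROUPS) in every order first, then [0, start) */
static void scan_after(void)
{
	for (int o = 0; o < 2; o++)
		for (int g = start; g < NGROUPS; g++)
			putchar(orders[o][g]);
	for (int o = 0; o < 2; o++)
		for (int g = 0; g < start; g++)
			putchar(orders[o][g]);
	putchar('\n');	/* prints CDGHABEF */
}

int main(void)
{
	scan_before();
	scan_after();
	return 0;
}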

Performance test data follows:

|CPU: Kunpeng 920   |          P80           |            P1           |
|Memory: 512GB      |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 19555 | 20049 (+2.5%)  | 315636 | 316724 (+0.3%) |
|mb_optimize_scan=1 | 15496 | 19342 (+24.8%) | 323569 | 328324 (+1.4%) |

|CPU: AMD 9654 * 2  |          P96           |             P1          |
|Memory: 1536GB     |------------------------|-------------------------|
|960GB SSD (1GB/s)  | base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 53192 | 52125 (-2.0%)  | 212678 | 215136 (+1.1%) |
|mb_optimize_scan=1 | 37636 | 50331 (+33.7%) | 214189 | 209431 (-2.2%) |

Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/mballoc.c | 68 ++++++++++++++++++++++++++++++++---------------
 1 file changed, 47 insertions(+), 21 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 79b2c6b37fbd..742124c8213b 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -875,21 +875,20 @@ mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
 	}
 }
 
-static int ext4_mb_scan_groups_xarray(struct ext4_allocation_context *ac,
-				      struct xarray *xa, ext4_group_t start)
+static int ext4_mb_scan_groups_xa_range(struct ext4_allocation_context *ac,
+					struct xarray *xa,
+					ext4_group_t start, ext4_group_t end)
 {
 	struct super_block *sb = ac->ac_sb;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	enum criteria cr = ac->ac_criteria;
 	ext4_group_t ngroups = ext4_get_groups_count(sb);
 	unsigned long group = start;
-	ext4_group_t end = ngroups;
 	struct ext4_group_info *grp;
 
-	if (WARN_ON_ONCE(start >= end))
+	if (WARN_ON_ONCE(end > ngroups || start >= end))
 		return 0;
 
-wrap_around:
 	xa_for_each_range(xa, group, grp, start, end - 1) {
 		int err;
 
@@ -903,28 +902,23 @@ static int ext4_mb_scan_groups_xarray(struct ext4_allocation_context *ac,
 		cond_resched();
 	}
 
-	if (start) {
-		end = start;
-		start = 0;
-		goto wrap_around;
-	}
-
 	return 0;
 }
 
 /*
  * Find a suitable group of given order from the largest free orders xarray.
  */
-static int
-ext4_mb_scan_groups_largest_free_order(struct ext4_allocation_context *ac,
-				       int order, ext4_group_t start)
+static inline int
+ext4_mb_scan_groups_largest_free_order_range(struct ext4_allocation_context *ac,
+					     int order, ext4_group_t start,
+					     ext4_group_t end)
 {
 	struct xarray *xa = &EXT4_SB(ac->ac_sb)->s_mb_largest_free_orders[order];
 
 	if (xa_empty(xa))
 		return 0;
 
-	return ext4_mb_scan_groups_xarray(ac, xa, start);
+	return ext4_mb_scan_groups_xa_range(ac, xa, start, end);
 }
 
 /*
@@ -937,12 +931,22 @@ static int ext4_mb_scan_groups_p2_aligned(struct ext4_allocation_context *ac,
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
 	int i;
 	int ret = 0;
+	ext4_group_t start, end;
 
+	start = group;
+	end = ext4_get_groups_count(ac->ac_sb);
+wrap_around:
 	for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
-		ret = ext4_mb_scan_groups_largest_free_order(ac, i, group);
+		ret = ext4_mb_scan_groups_largest_free_order_range(ac, i,
+								   start, end);
 		if (ret || ac->ac_status != AC_STATUS_CONTINUE)
 			return ret;
 	}
+	if (start) {
+		end = start;
+		start = 0;
+		goto wrap_around;
+	}
 
 	if (sbi->s_mb_stats)
 		atomic64_inc(&sbi->s_bal_cX_failed[ac->ac_criteria]);
@@ -955,15 +959,17 @@ static int ext4_mb_scan_groups_p2_aligned(struct ext4_allocation_context *ac,
 /*
  * Find a suitable group of given order from the average fragments xarray.
  */
-static int ext4_mb_scan_groups_avg_frag_order(struct ext4_allocation_context *ac,
-					      int order, ext4_group_t start)
+static int
+ext4_mb_scan_groups_avg_frag_order_range(struct ext4_allocation_context *ac,
+					 int order, ext4_group_t start,
+					 ext4_group_t end)
 {
 	struct xarray *xa = &EXT4_SB(ac->ac_sb)->s_mb_avg_fragment_size[order];
 
 	if (xa_empty(xa))
 		return 0;
 
-	return ext4_mb_scan_groups_xarray(ac, xa, start);
+	return ext4_mb_scan_groups_xa_range(ac, xa, start, end);
 }
 
 /*
@@ -975,13 +981,23 @@ static int ext4_mb_scan_groups_goal_fast(struct ext4_allocation_context *ac,
 {
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
 	int i, ret = 0;
+	ext4_group_t start, end;
 
+	start = group;
+	end = ext4_get_groups_count(ac->ac_sb);
+wrap_around:
 	i = mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len);
 	for (; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
-		ret = ext4_mb_scan_groups_avg_frag_order(ac, i, group);
+		ret = ext4_mb_scan_groups_avg_frag_order_range(ac, i,
+							       start, end);
 		if (ret || ac->ac_status != AC_STATUS_CONTINUE)
 			return ret;
 	}
+	if (start) {
+		end = start;
+		start = 0;
+		goto wrap_around;
+	}
 
 	if (sbi->s_mb_stats)
 		atomic64_inc(&sbi->s_bal_cX_failed[ac->ac_criteria]);
@@ -1017,6 +1033,7 @@ static int ext4_mb_scan_groups_best_avail(struct ext4_allocation_context *ac,
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
 	int i, order, min_order;
 	unsigned long num_stripe_clusters = 0;
+	ext4_group_t start, end;
 
 	/*
 	 * mb_avg_fragment_size_order() returns order in a way that makes
@@ -1048,6 +1065,9 @@ static int ext4_mb_scan_groups_best_avail(struct ext4_allocation_context *ac,
 	if (1 << min_order < ac->ac_o_ex.fe_len)
 		min_order = fls(ac->ac_o_ex.fe_len);
 
+	start = group;
+	end = ext4_get_groups_count(ac->ac_sb);
+wrap_around:
 	for (i = order; i >= min_order; i--) {
 		int frag_order;
 		/*
@@ -1070,10 +1090,16 @@ static int ext4_mb_scan_groups_best_avail(struct ext4_allocation_context *ac,
 		frag_order = mb_avg_fragment_size_order(ac->ac_sb,
 							ac->ac_g_ex.fe_len);
 
-		ret = ext4_mb_scan_groups_avg_frag_order(ac, frag_order, group);
+		ret = ext4_mb_scan_groups_avg_frag_order_range(ac, frag_order,
+							       start, end);
 		if (ret || ac->ac_status != AC_STATUS_CONTINUE)
 			return ret;
 	}
+	if (start) {
+		end = start;
+		start = 0;
+		goto wrap_around;
+	}
 
 	/* Reset goal length to original goal length before falling into CR_GOAL_LEN_SLOW */
 	ac->ac_g_ex.fe_len = ac->ac_orig_goal_len;
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 00/17] ext4: better scalability for ext4 block allocation
  2025-07-14 13:03 [PATCH v3 00/17] ext4: better scalability for ext4 block allocation Baokun Li
                   ` (16 preceding siblings ...)
  2025-07-14 13:03 ` [PATCH v3 17/17] ext4: implement linear-like traversal across order xarrays Baokun Li
@ 2025-07-15  1:11 ` Zhang Yi
  2025-07-19 21:45 ` Theodore Ts'o
  18 siblings, 0 replies; 44+ messages in thread
From: Zhang Yi @ 2025-07-15  1:11 UTC (permalink / raw)
  To: Baokun Li, linux-ext4
  Cc: tytso, adilger.kernel, jack, linux-kernel, ojaswin, julia.lawall,
	yangerkun, libaokun

On 2025/7/14 21:03, Baokun Li wrote:
> Changes since v2:
>  * Collect RVB from Jan Kara. (Thanks for your review!)
>  * Add patch 2.
>  * Patch 4: Switching to READ_ONCE/WRITE_ONCE (great for single-process)
>         over smp_load_acquire/smp_store_release (only slight multi-process
>         gain). (Suggested by Jan Kara)
>  * Patch 5: The number of global goals is now set to the lesser of the CPU
>         count or one-fourth of the group count. This prevents setting too
>         many goals for small filesystems, which lead to file dispersion.
>         (Suggested by Jan Kara)
>  * Patch 5: Directly use kfree() to release s_mb_last_groups instead of
>         kvfree(). (Suggested by Julia Lawall)
>  * Patch 11: Even without mb_optimize_scan enabled, we now always attempt
>         to remove the group from the old order list.(Suggested by Jan Kara)
>  * Patch 14-16: Added comments for clarity, refined logic, and removed
>         obsolete variables.
>  * Update performance test results and indicate raw disk write bandwidth. 
> 
> Thanks to Honza for your suggestions!

This is a nice improvement! Overall, the series looks good to me!

Reviewed-by: Zhang Yi <yi.zhang@huawei.com>

> 
> v2: https://lore.kernel.org/r/20250623073304.3275702-1-libaokun1@huawei.com
> 
> Changes since v1:
>  * Patch 1: Prioritize checking if a group is busy to avoid unnecessary
>        checks and buddy loading. (Thanks to Ojaswin for the suggestion!)
>  * Patch 4: Using multiple global goals instead of moving the goal to the
>        inode level. (Thanks to Honza for the suggestion!)
>  * Collect RVB from Jan Kara and Ojaswin Mujoo.(Thanks for your review!)
>  * Add patch 2,3,7-16.
>  * Due to the change of test server, the relevant test data was refreshed.
> 
> v1: https://lore.kernel.org/r/20250523085821.1329392-1-libaokun@huaweicloud.com
> 
> Since servers have more and more CPUs, and we're running more containers
> on them, we've been using will-it-scale to test how well ext4 scales. The
> fallocate2 test (append 8KB to 1MB, truncate to 0, repeat) run concurrently
> on 64 containers revealed significant contention in block allocation/free,
> leading to much lower average fallocate OPS compared to a single
> container (see below).
> 
>    1   |    2   |    4   |    8   |   16   |   32   |   64
> -------|--------|--------|--------|--------|--------|-------
> 295287 | 70665  | 33865  | 19387  | 10104  |  5588  |  3588
> 
> Under this test scenario, the primary operations are block allocation
> (fallocate) and block deallocation (truncate). The main bottlenecks for
> these operations are the group lock and s_md_lock. Therefore, this patch
> series primarily focuses on optimizing the code related to these two locks.
> 
> The following is a brief overview of the patches, see the patches for
> more details.
> 
> Patch 1: Add ext4_try_lock_group() to skip busy groups to take advantage
> of the large number of ext4 groups.
> 
> Patch 2: Separates stream goal hits from s_bal_goals in preparation for
> cleanup of s_mb_last_start.
> 
> Patches 3-5: Split stream allocation's global goal into multiple goals and
> remove the unnecessary and expensive s_md_lock.
> 
> Patches 6-7: minor cleanups
> 
> Patches 8: Converted s_mb_free_pending to atomic_t and used memory barriers
> for consistency, instead of relying on the expensive s_md_lock.
> 
> Patches 9: When inserting free extents, we now attempt to merge them with
> already inserted extents first, to reduce s_md_lock contention.
> 
> Patches 10: Updates bb_avg_fragment_size_order to -1 when a group is out of
> free blocks, eliminating efficiency-impacting "zombie groups."
> 
> Patches 11: Fix potential largest free orders lists corruption when the
> mb_optimize_scan mount option is switched on or off.
> 
> Patches 12-17: Convert mb_optimize_scan's existing unordered list traversal
> to ordered xarrays, thereby reducing contention between block allocation
> and freeing, similar to linear traversal.
> 
> "kvm-xfstests -c ext4/all -g auto" has been executed with no new failures.
> 
> Here are some performance test data for your reference:
> 
> Test: Running will-it-scale/fallocate2 on CPU-bound containers.
> Observation: Average fallocate operations per container per second.
> 
> |CPU: Kunpeng 920   |          P80           |            P1           |
> |Memory: 512GB      |------------------------|-------------------------|
> |960GB SSD (0.5GB/s)| base  |    patched     | base   |    patched     |
> |-------------------|-------|----------------|--------|----------------|
> |mb_optimize_scan=0 | 2667  | 20049 (+651%)  | 314065 | 316724 (+0.8%) |
> |mb_optimize_scan=1 | 2643  | 19342 (+631%)  | 316344 | 328324 (+3.7%) |
> 
> |CPU: AMD 9654 * 2  |          P96           |             P1          |
> |Memory: 1536GB     |------------------------|-------------------------|
> |960GB SSD (1GB/s)  | base  |    patched     | base   |    patched     |
> |-------------------|-------|----------------|--------|----------------|
> |mb_optimize_scan=0 | 3450  | 52125 (+1410%) | 205851 | 215136 (+4.5%) |
> |mb_optimize_scan=1 | 3209  | 50331 (+1468%) | 207373 | 209431 (+0.9%) |
> 
> Tests also evaluated this patch set's impact on fragmentation: a minor
> increase in free space fragmentation for multi-process workloads, but a
> significant decrease in file fragmentation:
> 
> Test Script:
> ```shell
> #!/bin/bash
> 
> dir="/tmp/test"
> disk="/dev/sda"
> 
> mkdir -p $dir
> 
> for scan in 0 1 ; do
>     mkfs.ext4 -F -E lazy_itable_init=0,lazy_journal_init=0 \
>               -O orphan_file $disk 200G
>     mount -o mb_optimize_scan=$scan $disk $dir
> 
>     fio -directory=$dir -direct=1 -iodepth 128 -thread -ioengine=falloc \
>         -rw=write -bs=4k -fallocate=none -numjobs=64 -file_append=1 \
>         -size=1G -group_reporting -name=job1 -cpus_allowed_policy=split
> 
>     e2freefrag $disk
>     e4defrag -c $dir # Without the patch, this could take 5-6 hours.
>     filefrag ${dir}/job* | awk '{print $2}' | \
>                            awk '{sum+=$1} END {print sum/NR}'
>     umount $dir
> done
> ```
> 
> Test results:
> -------------------------------------------------------------|
>                          |       base      |      patched    |
> -------------------------|--------|--------|--------|--------|
> mb_optimize_scan         | linear |opt_scan| linear |opt_scan|
> -------------------------|--------|--------|--------|--------|
> bw(MiB/s)                | 217    | 217    | 5718   | 5626   |
> -------------------------|-----------------------------------|
> Avg. free extent size(KB)| 1943732| 1943732| 1316212| 1171208|
> Num. free extent         | 71     | 71     | 105    | 118    |
> -------------------------------------------------------------|
> Avg. extents per file    | 261967 | 261973 | 588    | 570    |
> Avg. size per extent(KB) | 4      | 4      | 1780   | 1837   |
> Fragmentation score      | 100    | 100    | 2      | 2      |
> -------------------------------------------------------------|
> 
> Comments and questions are, as always, welcome.
> 
> Thanks,
> Baokun
> 
> Baokun Li (17):
>   ext4: add ext4_try_lock_group() to skip busy groups
>   ext4: separate stream goal hits from s_bal_goals for better tracking
>   ext4: remove unnecessary s_mb_last_start
>   ext4: remove unnecessary s_md_lock on update s_mb_last_group
>   ext4: utilize multiple global goals to reduce contention
>   ext4: get rid of some obsolete EXT4_MB_HINT flags
>   ext4: fix typo in CR_GOAL_LEN_SLOW comment
>   ext4: convert sbi->s_mb_free_pending to atomic_t
>   ext4: merge freed extent with existing extents before insertion
>   ext4: fix zombie groups in average fragment size lists
>   ext4: fix largest free orders lists corruption on mb_optimize_scan
>     switch
>   ext4: factor out __ext4_mb_scan_group()
>   ext4: factor out ext4_mb_might_prefetch()
>   ext4: factor out ext4_mb_scan_group()
>   ext4: convert free groups order lists to xarrays
>   ext4: refactor choose group to scan group
>   ext4: implement linear-like traversal across order xarrays
> 
>  fs/ext4/balloc.c            |   2 +-
>  fs/ext4/ext4.h              |  61 +--
>  fs/ext4/mballoc.c           | 895 ++++++++++++++++++++----------------
>  fs/ext4/mballoc.h           |   9 +-
>  include/trace/events/ext4.h |   3 -
>  5 files changed, 534 insertions(+), 436 deletions(-)
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 01/17] ext4: add ext4_try_lock_group() to skip busy groups
  2025-07-14 13:03 ` [PATCH v3 01/17] ext4: add ext4_try_lock_group() to skip busy groups Baokun Li
@ 2025-07-17 10:09   ` Ojaswin Mujoo
  2025-07-19  0:37     ` Baokun Li
  2025-07-17 22:28   ` Andi Kleen
  1 sibling, 1 reply; 44+ messages in thread
From: Ojaswin Mujoo @ 2025-07-17 10:09 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel,
	julia.lawall, yi.zhang, yangerkun, libaokun

On Mon, Jul 14, 2025 at 09:03:11PM +0800, Baokun Li wrote:
> When ext4 allocates blocks, we used to just go through the block groups
> one by one to find a good one. But when there are tons of block groups
> (like hundreds of thousands or even millions) and not many have free space
> (meaning they're mostly full), it takes a really long time to check them
> all, and performance gets bad. So, we added the "mb_optimize_scan" mount
> option (which is on by default now). It keeps track of some group lists,
> so when we need a free block, we can just grab a likely group from the
> right list. This saves time and makes block allocation much faster.
> 
> But when multiple processes or containers are doing similar things, like
> constantly allocating 8k blocks, they all try to use the same block group
> in the same list. Even just two processes doing this can cut the IOPS in
> half. For example, one container might do 300,000 IOPS, but if you run two
> at the same time, the total is only 150,000.
> 
> Since we can already look at block groups in a non-linear way, the first
> and last groups in the same list are basically the same for finding a block
> right now. Therefore, add an ext4_try_lock_group() helper function to skip
> the current group when it is locked by another process, thereby avoiding
> contention with other processes. This helps ext4 make better use of having
> multiple block groups.
> 
> Also, to make sure we don't skip all the groups that have free space
> when allocating blocks, we won't try to skip busy groups anymore when
> ac_criteria is CR_ANY_FREE.
> 
> Performance test data follows:
> 
> Test: Running will-it-scale/fallocate2 on CPU-bound containers.
> Observation: Average fallocate operations per container per second.
> 
> |CPU: Kunpeng 920   |          P80            |
> |Memory: 512GB      |-------------------------|
> |960GB SSD (0.5GB/s)| base  |    patched      |
> |-------------------|-------|-----------------|
> |mb_optimize_scan=0 | 2667  | 4821  (+80.7%)  |
> |mb_optimize_scan=1 | 2643  | 4784  (+81.0%)  |
> 
> |CPU: AMD 9654 * 2  |          P96            |
> |Memory: 1536GB     |-------------------------|
> |960GB SSD (1GB/s)  | base  |    patched      |
> |-------------------|-------|-----------------|
> |mb_optimize_scan=0 | 3450  | 15371 (+345%)   |
> |mb_optimize_scan=1 | 3209  | 6101  (+90.0%)  |
> 
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> Reviewed-by: Jan Kara <jack@suse.cz>

Hey Baokun, I reviewed some of the patches in v2, but I think that was
at the very last moment, so I'll add the comments in this series; don't
mind the copy-paste :)

The patch itself looks good, thanks for the changes.

Feel free to add:

 Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Regards,
ojaswin


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 02/17] ext4: separate stream goal hits from s_bal_goals for better tracking
  2025-07-14 13:03 ` [PATCH v3 02/17] ext4: separate stream goal hits from s_bal_goals for better tracking Baokun Li
@ 2025-07-17 10:29   ` Ojaswin Mujoo
  2025-07-19  1:37     ` Baokun Li
  0 siblings, 1 reply; 44+ messages in thread
From: Ojaswin Mujoo @ 2025-07-17 10:29 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel,
	julia.lawall, yi.zhang, yangerkun, libaokun

On Mon, Jul 14, 2025 at 09:03:12PM +0800, Baokun Li wrote:
> In ext4_mb_regular_allocator(), after the call to ext4_mb_find_by_goal()
> fails to achieve the inode goal, allocation continues with the stream
> allocation global goal. Currently, hits for both are combined in
> sbi->s_bal_goals, hindering accurate optimization.
> 
> This commit separates global goal hits into sbi->s_bal_stream_goals. Since
> stream allocation doesn't use ac->ac_g_ex.fe_start, set fe_start to -1.
> This prevents stream allocations from being counted in s_bal_goals. Also
> clear EXT4_MB_HINT_TRY_GOAL to avoid calling ext4_mb_find_by_goal again.
> 
> After adding `stream_goal_hits`, `/proc/fs/ext4/sdx/mb_stats` will show:
> 
> mballoc:
> 	reqs: 840347
> 	success: 750992
> 	groups_scanned: 1230506
> 	cr_p2_aligned_stats:
> 		hits: 21531
> 		groups_considered: 411664
> 		extents_scanned: 21531
> 		useless_loops: 0
> 		bad_suggestions: 6
> 	cr_goal_fast_stats:
> 		hits: 111222
> 		groups_considered: 1806728
> 		extents_scanned: 467908
> 		useless_loops: 0
> 		bad_suggestions: 13
> 	cr_best_avail_stats:
> 		hits: 36267
> 		groups_considered: 1817631
> 		extents_scanned: 156143
> 		useless_loops: 0
> 		bad_suggestions: 204
> 	cr_goal_slow_stats:
> 		hits: 106396
> 		groups_considered: 5671710
> 		extents_scanned: 22540056
> 		useless_loops: 123747
> 	cr_any_free_stats:
> 		hits: 138071
> 		groups_considered: 724692
> 		extents_scanned: 23615593
> 		useless_loops: 585
> 	extents_scanned: 46804261
> 		goal_hits: 1307
> 		stream_goal_hits: 236317
> 		len_goal_hits: 155549
> 		2^n_hits: 21531
> 		breaks: 225096
> 		lost: 35062
> 	buddies_generated: 40/40
> 	buddies_time_used: 48004
> 	preallocated: 5962467
> 	discarded: 4847560
> 
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> ---
>  fs/ext4/ext4.h    |  1 +
>  fs/ext4/mballoc.c | 11 +++++++++--
>  2 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 9df74123e7e6..8750ace12935 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1646,6 +1646,7 @@ struct ext4_sb_info {
>  	atomic_t s_bal_cX_ex_scanned[EXT4_MB_NUM_CRS];	/* total extents scanned */
>  	atomic_t s_bal_groups_scanned;	/* number of groups scanned */
>  	atomic_t s_bal_goals;	/* goal hits */
> +	atomic_t s_bal_stream_goals;	/* stream allocation global goal hits */
>  	atomic_t s_bal_len_goals;	/* len goal hits */
>  	atomic_t s_bal_breaks;	/* too long searches */
>  	atomic_t s_bal_2orders;	/* 2^order hits */
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 336d65c4f6a2..f56ac477c464 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -2849,8 +2849,9 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>  		/* TBD: may be hot point */
>  		spin_lock(&sbi->s_md_lock);
>  		ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
> -		ac->ac_g_ex.fe_start = sbi->s_mb_last_start;
>  		spin_unlock(&sbi->s_md_lock);
> +		ac->ac_g_ex.fe_start = -1;
> +		ac->ac_flags &= ~EXT4_MB_HINT_TRY_GOAL;

Hey Baokun, I was a bit late to review this in v2 so I'll add the
comment here:

So this is mostly to account for retries, right? Maybe rather than
disabling goal allocation, a better way to do this is resetting the
original goal group and goal start in the retry logic of
ext4_mb_new_blocks()? Since we drop preallocations before retrying, this
way we might actually find our goal during the retry. It's a slim chance,
but it still feels like the right way to do it.

Thoughts?

Regards,
ojaswin

>  	}
>  
>  	/*
> @@ -3000,8 +3001,12 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>  		}
>  	}
>  
> -	if (sbi->s_mb_stats && ac->ac_status == AC_STATUS_FOUND)
> +	if (sbi->s_mb_stats && ac->ac_status == AC_STATUS_FOUND) {
>  		atomic64_inc(&sbi->s_bal_cX_hits[ac->ac_criteria]);
> +		if (ac->ac_flags & EXT4_MB_STREAM_ALLOC &&
> +		    ac->ac_b_ex.fe_group == ac->ac_g_ex.fe_group)
> +			atomic_inc(&sbi->s_bal_stream_goals);
> +	}
>  out:
>  	if (!err && ac->ac_status != AC_STATUS_FOUND && first_err)
>  		err = first_err;
> @@ -3194,6 +3199,8 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
>  	seq_printf(seq, "\textents_scanned: %u\n",
>  		   atomic_read(&sbi->s_bal_ex_scanned));
>  	seq_printf(seq, "\t\tgoal_hits: %u\n", atomic_read(&sbi->s_bal_goals));
> +	seq_printf(seq, "\t\tstream_goal_hits: %u\n",
> +		   atomic_read(&sbi->s_bal_stream_goals));
>  	seq_printf(seq, "\t\tlen_goal_hits: %u\n",
>  		   atomic_read(&sbi->s_bal_len_goals));
>  	seq_printf(seq, "\t\t2^n_hits: %u\n", atomic_read(&sbi->s_bal_2orders));
> -- 
> 2.46.1
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 03/17] ext4: remove unnecessary s_mb_last_start
  2025-07-14 13:03 ` [PATCH v3 03/17] ext4: remove unnecessary s_mb_last_start Baokun Li
@ 2025-07-17 10:31   ` Ojaswin Mujoo
  0 siblings, 0 replies; 44+ messages in thread
From: Ojaswin Mujoo @ 2025-07-17 10:31 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel,
	julia.lawall, yi.zhang, yangerkun, libaokun

On Mon, Jul 14, 2025 at 09:03:13PM +0800, Baokun Li wrote:
> Since stream allocation does not use ac->ac_f_ex.fe_start, it is set to -1
> by default, so the no longer needed sbi->s_mb_last_start is removed.
> 
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> Reviewed-by: Jan Kara <jack@suse.cz>
> ---
>  fs/ext4/ext4.h    | 1 -
>  fs/ext4/mballoc.c | 1 -
>  2 files changed, 2 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 8750ace12935..b83095541c98 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1631,7 +1631,6 @@ struct ext4_sb_info {
>  	unsigned int s_max_dir_size_kb;
>  	/* where last allocation was done - for stream allocation */
>  	unsigned long s_mb_last_group;
> -	unsigned long s_mb_last_start;
>  	unsigned int s_mb_prefetch;
>  	unsigned int s_mb_prefetch_limit;
>  	unsigned int s_mb_best_avail_max_trim_order;
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index f56ac477c464..e3a5103e1620 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -2171,7 +2171,6 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
>  	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
>  		spin_lock(&sbi->s_md_lock);
>  		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
> -		sbi->s_mb_last_start = ac->ac_f_ex.fe_start;

Looks good,

Feel free to add:

Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Regards,
ojaswin

>  		spin_unlock(&sbi->s_md_lock);
>  	}
>  	/*
> -- 
> 2.46.1
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 04/17] ext4: remove unnecessary s_md_lock on update s_mb_last_group
  2025-07-14 13:03 ` [PATCH v3 04/17] ext4: remove unnecessary s_md_lock on update s_mb_last_group Baokun Li
@ 2025-07-17 13:36   ` Ojaswin Mujoo
  2025-07-19  1:54     ` Baokun Li
  0 siblings, 1 reply; 44+ messages in thread
From: Ojaswin Mujoo @ 2025-07-17 13:36 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel,
	julia.lawall, yi.zhang, yangerkun, libaokun

On Mon, Jul 14, 2025 at 09:03:14PM +0800, Baokun Li wrote:
> After we optimized the block group lock, we found another lock
> contention issue when running will-it-scale/fallocate2 with multiple
> processes. The fallocate's block allocation and the truncate's block
> release were fighting over the s_md_lock. The problem is, this lock
> protects totally different things in those two processes: the list of
> freed data blocks (s_freed_data_list) when releasing, and where to start
> looking for new blocks (mb_last_group) when allocating.
> 
> Now we only need to track s_mb_last_group and no longer need to track
> s_mb_last_start, so we don't need the s_md_lock lock to ensure that the
> two are consistent. Since s_mb_last_group is merely a hint and doesn't
> require strong synchronization, READ_ONCE/WRITE_ONCE is sufficient.

Hi Baokun,

So I just got curious about the difference between smp_load_acquire vs
READ_ONCE on PowerPC, another weakly ordered arch.
Interestingly, I didn't see that big of a single-threaded drop.

The number are as follows (mb_opt_scan=1):

100 threads 
w/ smp_load_acquire    1668 MB/s 
w/ READ_ONCE           1599 MB/s

1 thread pinned to 1 cpu
w/ smp_load_acquire    292 MB/s
w/ READ_ONCE           296 MB/s

Either way, this is much better than the base, which is around 500MB/s,
but I just thought I'd share it here.

Feel free to add:
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Regards,
ojaswin
> 
> Besides, the s_mb_last_group data type only requires ext4_group_t
> (i.e., unsigned int), rendering unsigned long superfluous.
> 
> Performance test data follows:
> 
> Test: Running will-it-scale/fallocate2 on CPU-bound containers.
> Observation: Average fallocate operations per container per second.
> 
> |CPU: Kunpeng 920   |          P80           |            P1           |
> |Memory: 512GB      |------------------------|-------------------------|
> |960GB SSD (0.5GB/s)| base  |    patched     | base   |    patched     |
> |-------------------|-------|----------------|--------|----------------|
> |mb_optimize_scan=0 | 4821  | 9636  (+99.8%) | 314065 | 337597 (+7.4%) |
> |mb_optimize_scan=1 | 4784  | 4834  (+1.04%) | 316344 | 341440 (+7.9%) |
> 
> |CPU: AMD 9654 * 2  |          P96           |             P1          |
> |Memory: 1536GB     |------------------------|-------------------------|
> |960GB SSD (1GB/s)  | base  |    patched     | base   |    patched     |
> |-------------------|-------|----------------|--------|----------------|
> |mb_optimize_scan=0 | 15371 | 22341 (+45.3%) | 205851 | 219707 (+6.7%) |
> |mb_optimize_scan=1 | 6101  | 9177  (+50.4%) | 207373 | 215732 (+4.0%) |
> 
> Suggested-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> ---
>  fs/ext4/ext4.h    |  2 +-
>  fs/ext4/mballoc.c | 12 +++---------
>  2 files changed, 4 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index b83095541c98..7f5c070de0fb 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1630,7 +1630,7 @@ struct ext4_sb_info {
>  	unsigned int s_mb_group_prealloc;
>  	unsigned int s_max_dir_size_kb;
>  	/* where last allocation was done - for stream allocation */
> -	unsigned long s_mb_last_group;
> +	ext4_group_t s_mb_last_group;
>  	unsigned int s_mb_prefetch;
>  	unsigned int s_mb_prefetch_limit;
>  	unsigned int s_mb_best_avail_max_trim_order;
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index e3a5103e1620..025b759ca643 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -2168,11 +2168,8 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
>  	ac->ac_buddy_folio = e4b->bd_buddy_folio;
>  	folio_get(ac->ac_buddy_folio);
>  	/* store last allocated for subsequent stream allocation */
> -	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
> -		spin_lock(&sbi->s_md_lock);
> -		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
> -		spin_unlock(&sbi->s_md_lock);
> -	}
> +	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
> +		WRITE_ONCE(sbi->s_mb_last_group, ac->ac_f_ex.fe_group);
>  	/*
>  	 * As we've just preallocated more space than
>  	 * user requested originally, we store allocated
> @@ -2845,10 +2842,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>  
>  	/* if stream allocation is enabled, use global goal */
>  	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
> -		/* TBD: may be hot point */
> -		spin_lock(&sbi->s_md_lock);
> -		ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
> -		spin_unlock(&sbi->s_md_lock);
> +		ac->ac_g_ex.fe_group = READ_ONCE(sbi->s_mb_last_group);
>  		ac->ac_g_ex.fe_start = -1;
>  		ac->ac_flags &= ~EXT4_MB_HINT_TRY_GOAL;
>  	}
> -- 
> 2.46.1
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 01/17] ext4: add ext4_try_lock_group() to skip busy groups
  2025-07-14 13:03 ` [PATCH v3 01/17] ext4: add ext4_try_lock_group() to skip busy groups Baokun Li
  2025-07-17 10:09   ` Ojaswin Mujoo
@ 2025-07-17 22:28   ` Andi Kleen
  2025-07-18  3:09     ` Theodore Ts'o
  2025-07-19  0:29     ` Baokun Li
  1 sibling, 2 replies; 44+ messages in thread
From: Andi Kleen @ 2025-07-17 22:28 UTC (permalink / raw)
  To: libaokun1, linux-ext4, linux-kernel

Baokun Li <libaokun1@huawei.com> writes:

> When ext4 allocates blocks, we used to just go through the block groups
> one by one to find a good one. But when there are tons of block groups
> (like hundreds of thousands or even millions) and not many have free space
> (meaning they're mostly full), it takes a really long time to check them
> all, and performance gets bad. So, we added the "mb_optimize_scan" mount
> option (which is on by default now). It keeps track of some group lists,
> so when we need a free block, we can just grab a likely group from the
> right list. This saves time and makes block allocation much faster.
>
> But when multiple processes or containers are doing similar things, like
> constantly allocating 8k blocks, they all try to use the same block group
> in the same list. Even just two processes doing this can cut the IOPS in
> half. For example, one container might do 300,000 IOPS, but if you run two
> at the same time, the total is only 150,000.
>
> Since we can already look at block groups in a non-linear way, the first
> and last groups in the same list are basically the same for finding a block
> right now. Therefore, add an ext4_try_lock_group() helper function to skip
> the current group when it is locked by another process, thereby avoiding
> contention with other processes. This helps ext4 make better use of having
> multiple block groups.

It seems this makes block allocation non-deterministic, depending on
the system load. I can see where this could cause problems when
reproducing bugs at least, but perhaps also in other cases.

Perhaps it would be better to just round-robin the groups?
Or at least add a way to turn it off.

-Andi

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 01/17] ext4: add ext4_try_lock_group() to skip busy groups
  2025-07-17 22:28   ` Andi Kleen
@ 2025-07-18  3:09     ` Theodore Ts'o
  2025-07-19  0:29     ` Baokun Li
  1 sibling, 0 replies; 44+ messages in thread
From: Theodore Ts'o @ 2025-07-18  3:09 UTC (permalink / raw)
  To: Andi Kleen; +Cc: libaokun1, linux-ext4, linux-kernel

On Thu, Jul 17, 2025 at 03:28:27PM -0700, Andi Kleen wrote:
> 
> It seems this makes block allocation non deterministic, but depend on
> the system load. I can see where this could cause problems when
> reproducing bugs at least, but perhaps also in other cases.
> 
> Better perhaps just round robin the groups?
> Or at least add a way to turn it off.

Ext4 has never guaranteed deterministic allocation; in particular,
there are times when we use get_random_u32 when selecting the block
group for a new inode, and since block allocation is based on the
block group of the inode, the block allocation isn't deterministic.

In any case, given that many workloads are doing multi-threaded
allocations, in practice, even without these calls to get_random,
things tend not to be deterministic anyway.

       	    	      		    - Ted

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 01/17] ext4: add ext4_try_lock_group() to skip busy groups
  2025-07-17 22:28   ` Andi Kleen
  2025-07-18  3:09     ` Theodore Ts'o
@ 2025-07-19  0:29     ` Baokun Li
  2025-07-22 20:59       ` Andi Kleen
  1 sibling, 1 reply; 44+ messages in thread
From: Baokun Li @ 2025-07-19  0:29 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-ext4, linux-kernel

On 2025/7/18 6:28, Andi Kleen wrote:
> Baokun Li <libaokun1@huawei.com> writes:
>
>> When ext4 allocates blocks, we used to just go through the block groups
>> one by one to find a good one. But when there are tons of block groups
>> (like hundreds of thousands or even millions) and not many have free space
>> (meaning they're mostly full), it takes a really long time to check them
>> all, and performance gets bad. So, we added the "mb_optimize_scan" mount
>> option (which is on by default now). It keeps track of some group lists,
>> so when we need a free block, we can just grab a likely group from the
>> right list. This saves time and makes block allocation much faster.
>>
>> But when multiple processes or containers are doing similar things, like
>> constantly allocating 8k blocks, they all try to use the same block group
>> in the same list. Even just two processes doing this can cut the IOPS in
>> half. For example, one container might do 300,000 IOPS, but if you run two
>> at the same time, the total is only 150,000.
>>
>> Since we can already look at block groups in a non-linear way, the first
>> and last groups in the same list are basically the same for finding a block
>> right now. Therefore, add an ext4_try_lock_group() helper function to skip
>> the current group when it is locked by another process, thereby avoiding
>> contention with other processes. This helps ext4 make better use of having
>> multiple block groups.
> It seems this makes block allocation non deterministic, but depend on
> the system load. I can see where this could cause problems when
> reproducing bugs at least, but perhaps also in other cases.
>
> Better perhaps just round robin the groups?
> Or at least add a way to turn it off.
>
> -Andi
>
As Ted mentioned, Ext4 has never guaranteed deterministic allocation. We
do attempt a predetermined goal in ext4_mb_find_by_goal(), and this part
has no trylock logic, meaning we'll always attempt to scan the target
group once—that's deterministic.

However, if the target attempt fails, the primary goal for subsequent
allocation is to find suitable free space as quickly as possible, so
there's no need to contend with other processes for non-target groups.
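
For illustration, the trylock-and-skip idea boils down to something like
the userspace sketch below (an analogy only, not ext4 code; the helper
names and the two-pass structure here are assumptions made for the
example):

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define NGROUPS 8

/* one lock per "group", like the per-group locks in ext4 */
static pthread_mutex_t group_lock[NGROUPS];

static bool scan_group(int g, bool last_resort)
{
	if (last_resort) {
		pthread_mutex_lock(&group_lock[g]);
	} else if (pthread_mutex_trylock(&group_lock[g]) != 0) {
		printf("group %d busy, skipped\n", g);
		return false;
	}

	/* ... pretend to scan group g for a suitable free extent ... */
	printf("scanned group %d\n", g);
	pthread_mutex_unlock(&group_lock[g]);
	return true;
}

static void *holder(void *arg)
{
	(void)arg;
	/* simulate another process freeing blocks in group 3 */
	pthread_mutex_lock(&group_lock[3]);
	sleep(1);
	pthread_mutex_unlock(&group_lock[3]);
	return NULL;
}

int main(void)
{
	pthread_t t;

	for (int g = 0; g < NGROUPS; g++)
		pthread_mutex_init(&group_lock[g], NULL);

	pthread_create(&t, NULL, holder, NULL);
	usleep(100 * 1000);	/* let the holder grab group 3 */

	/* opportunistic pass: busy groups are simply skipped */
	for (int g = 0; g < NGROUPS; g++)
		scan_group(g, false);

	/* last-resort pass (like CR_ANY_FREE): never skip */
	for (int g = 0; g < NGROUPS; g++)
		scan_group(g, true);

	pthread_join(t, NULL);
	return 0;
}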


Cheers,
Baokun


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 01/17] ext4: add ext4_try_lock_group() to skip busy groups
  2025-07-17 10:09   ` Ojaswin Mujoo
@ 2025-07-19  0:37     ` Baokun Li
  0 siblings, 0 replies; 44+ messages in thread
From: Baokun Li @ 2025-07-19  0:37 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel,
	julia.lawall, yi.zhang, yangerkun, libaokun

On 2025/7/17 18:09, Ojaswin Mujoo wrote:
> On Mon, Jul 14, 2025 at 09:03:11PM +0800, Baokun Li wrote:
>> When ext4 allocates blocks, we used to just go through the block groups
>> one by one to find a good one. But when there are tons of block groups
>> (like hundreds of thousands or even millions) and not many have free space
>> (meaning they're mostly full), it takes a really long time to check them
>> all, and performance gets bad. So, we added the "mb_optimize_scan" mount
>> option (which is on by default now). It keeps track of some group lists,
>> so when we need a free block, we can just grab a likely group from the
>> right list. This saves time and makes block allocation much faster.
>>
>> But when multiple processes or containers are doing similar things, like
>> constantly allocating 8k blocks, they all try to use the same block group
>> in the same list. Even just two processes doing this can cut the IOPS in
>> half. For example, one container might do 300,000 IOPS, but if you run two
>> at the same time, the total is only 150,000.
>>
>> Since we can already look at block groups in a non-linear way, the first
>> and last groups in the same list are basically the same for finding a block
>> right now. Therefore, add an ext4_try_lock_group() helper function to skip
>> the current group when it is locked by another process, thereby avoiding
>> contention with other processes. This helps ext4 make better use of having
>> multiple block groups.
>>
>> Also, to make sure we don't skip all the groups that have free space
>> when allocating blocks, we won't try to skip busy groups anymore when
>> ac_criteria is CR_ANY_FREE.
>>
>> Performance test data follows:
>>
>> Test: Running will-it-scale/fallocate2 on CPU-bound containers.
>> Observation: Average fallocate operations per container per second.
>>
>> |CPU: Kunpeng 920   |          P80            |
>> |Memory: 512GB      |-------------------------|
>> |960GB SSD (0.5GB/s)| base  |    patched      |
>> |-------------------|-------|-----------------|
>> |mb_optimize_scan=0 | 2667  | 4821  (+80.7%)  |
>> |mb_optimize_scan=1 | 2643  | 4784  (+81.0%)  |
>>
>> |CPU: AMD 9654 * 2  |          P96            |
>> |Memory: 1536GB     |-------------------------|
>> |960GB SSD (1GB/s)  | base  |    patched      |
>> |-------------------|-------|-----------------|
>> |mb_optimize_scan=0 | 3450  | 15371 (+345%)   |
>> |mb_optimize_scan=1 | 3209  | 6101  (+90.0%)  |
>>
>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
>> Reviewed-by: Jan Kara <jack@suse.cz>
> Hey Baokun, I reviewed some of the patches in v2 but i think that was
> very last moment so I'll add the comments in this series, dont mind the
> copy paste :)
>
> The patch itself looks good, thanks for the changes.
>
> Feel free to add:
>
>   Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Sorry for missing your review, I've been snowed under with work lately.

Thanks for the review!


Cheers,
Baokun


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 02/17] ext4: separate stream goal hits from s_bal_goals for better tracking
  2025-07-17 10:29   ` Ojaswin Mujoo
@ 2025-07-19  1:37     ` Baokun Li
  0 siblings, 0 replies; 44+ messages in thread
From: Baokun Li @ 2025-07-19  1:37 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel,
	julia.lawall, yi.zhang, yangerkun, libaokun

On 2025/7/17 18:29, Ojaswin Mujoo wrote:
> On Mon, Jul 14, 2025 at 09:03:12PM +0800, Baokun Li wrote:
>> In ext4_mb_regular_allocator(), after the call to ext4_mb_find_by_goal()
>> fails to achieve the inode goal, allocation continues with the stream
>> allocation global goal. Currently, hits for both are combined in
>> sbi->s_bal_goals, hindering accurate optimization.
>>
>> This commit separates global goal hits into sbi->s_bal_stream_goals. Since
>> stream allocation doesn't use ac->ac_g_ex.fe_start, set fe_start to -1.
>> This prevents stream allocations from being counted in s_bal_goals. Also
>> clear EXT4_MB_HINT_TRY_GOAL to avoid calling ext4_mb_find_by_goal again.
>>
>> After adding `stream_goal_hits`, `/proc/fs/ext4/sdx/mb_stats` will show:
>>
>> mballoc:
>> 	reqs: 840347
>> 	success: 750992
>> 	groups_scanned: 1230506
>> 	cr_p2_aligned_stats:
>> 		hits: 21531
>> 		groups_considered: 411664
>> 		extents_scanned: 21531
>> 		useless_loops: 0
>> 		bad_suggestions: 6
>> 	cr_goal_fast_stats:
>> 		hits: 111222
>> 		groups_considered: 1806728
>> 		extents_scanned: 467908
>> 		useless_loops: 0
>> 		bad_suggestions: 13
>> 	cr_best_avail_stats:
>> 		hits: 36267
>> 		groups_considered: 1817631
>> 		extents_scanned: 156143
>> 		useless_loops: 0
>> 		bad_suggestions: 204
>> 	cr_goal_slow_stats:
>> 		hits: 106396
>> 		groups_considered: 5671710
>> 		extents_scanned: 22540056
>> 		useless_loops: 123747
>> 	cr_any_free_stats:
>> 		hits: 138071
>> 		groups_considered: 724692
>> 		extents_scanned: 23615593
>> 		useless_loops: 585
>> 	extents_scanned: 46804261
>> 		goal_hits: 1307
>> 		stream_goal_hits: 236317
>> 		len_goal_hits: 155549
>> 		2^n_hits: 21531
>> 		breaks: 225096
>> 		lost: 35062
>> 	buddies_generated: 40/40
>> 	buddies_time_used: 48004
>> 	preallocated: 5962467
>> 	discarded: 4847560
>>
>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
>> ---
>>   fs/ext4/ext4.h    |  1 +
>>   fs/ext4/mballoc.c | 11 +++++++++--
>>   2 files changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index 9df74123e7e6..8750ace12935 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -1646,6 +1646,7 @@ struct ext4_sb_info {
>>   	atomic_t s_bal_cX_ex_scanned[EXT4_MB_NUM_CRS];	/* total extents scanned */
>>   	atomic_t s_bal_groups_scanned;	/* number of groups scanned */
>>   	atomic_t s_bal_goals;	/* goal hits */
>> +	atomic_t s_bal_stream_goals;	/* stream allocation global goal hits */
>>   	atomic_t s_bal_len_goals;	/* len goal hits */
>>   	atomic_t s_bal_breaks;	/* too long searches */
>>   	atomic_t s_bal_2orders;	/* 2^order hits */
>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>> index 336d65c4f6a2..f56ac477c464 100644
>> --- a/fs/ext4/mballoc.c
>> +++ b/fs/ext4/mballoc.c
>> @@ -2849,8 +2849,9 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>>   		/* TBD: may be hot point */
>>   		spin_lock(&sbi->s_md_lock);
>>   		ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
>> -		ac->ac_g_ex.fe_start = sbi->s_mb_last_start;
>>   		spin_unlock(&sbi->s_md_lock);
>> +		ac->ac_g_ex.fe_start = -1;
>> +		ac->ac_flags &= ~EXT4_MB_HINT_TRY_GOAL;
> Hey Baokun, I was a bit late to review this in v2 so I'll add the
> comment here:
>
> So this is mostly to account for retires right? Maybe rather than
> disabling goal allocation a better way to do this is resetting the
> original goal group and goal start in the retry logic of
> ext4_mb_new_blocks()? Since we drop preallocations before retrying, this
> way we might actually find our goal during the retry. Its a slim chance
> though but still feels like the right way to do it.
>
> Thoughts?

It's true that successfully acquiring the goal on retry is possible, but
the probability is too low; I think attempting the inode goal on retry is
not very meaningful. The lack of trylock logic in ext4_mb_find_by_goal()
also introduces some performance overhead.

Additionally, since preallocations might be dropped before retrying, the
inode's preallocation could also be discarded. Therefore, pa overlap needs
to be re-adjusted.

For data block allocation, we should call ext4_mb_normalize_request() to
regenerate a new ac_g_ex instead of directly resetting the original goal.
ext4_mb_normalize_request() will also determine whether to reset
EXT4_MB_HINT_TRY_GOAL.

For metadata block allocation, EXT4_MB_STREAM_ALLOC is not set, so there's
no need to worry about EXT4_MB_HINT_TRY_GOAL being cleared.

Clearing EXT4_MB_HINT_TRY_GOAL here is only to avoid inode goal allocation
with -1. If you insist that we should attempt the inode goal on retry,
I will send a separate patch to address that.


Cheers,
Baokun


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 04/17] ext4: remove unnecessary s_md_lock on update s_mb_last_group
  2025-07-17 13:36   ` Ojaswin Mujoo
@ 2025-07-19  1:54     ` Baokun Li
  0 siblings, 0 replies; 44+ messages in thread
From: Baokun Li @ 2025-07-19  1:54 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel,
	julia.lawall, yi.zhang, yangerkun, Baokun Li

On 2025/7/17 21:36, Ojaswin Mujoo wrote:
> On Mon, Jul 14, 2025 at 09:03:14PM +0800, Baokun Li wrote:
>> After we optimized the block group lock, we found another lock
>> contention issue when running will-it-scale/fallocate2 with multiple
>> processes. The fallocate's block allocation and the truncate's block
>> release were fighting over the s_md_lock. The problem is, this lock
>> protects totally different things in those two processes: the list of
>> freed data blocks (s_freed_data_list) when releasing, and where to start
>> looking for new blocks (mb_last_group) when allocating.
>>
>> Now we only need to track s_mb_last_group and no longer need to track
>> s_mb_last_start, so we don't need the s_md_lock lock to ensure that the
>> two are consistent. Since s_mb_last_group is merely a hint and doesn't
>> require strong synchronization, READ_ONCE/WRITE_ONCE is sufficient.
> Hi Baokun,
>
> So i just got curious of the difference between smp_load_acquire vs
> READ_ONCE on PowerPC, another weak memory ordering arch.
> Interestingly, I didn't see that big of a single threaded drop.
>
> The number are as follows (mb_opt_scan=1):
>
> 100 threads
> w/ smp_load_acquire    1668 MB/s
> w/ READ_ONCE           1599 MB/s
>
> 1 thread pinned to 1 cpu
> w/ smp_load_acquire    292 MB/s
> w/ READ_ONCE           296 MB/s
>
> Either ways, this is much better than the base which is around 500MB/s
> but just thought I'd share it here

Thank you for providing the test data for PowerPC; it's true that
the results may vary slightly between architectures.

>
> Feel free to add:
> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
>
Thank you for your review!

Cheers,
Baokun

>> Besides, the s_mb_last_group data type only requires ext4_group_t
>> (i.e., unsigned int), rendering unsigned long superfluous.
>>
>> Performance test data follows:
>>
>> Test: Running will-it-scale/fallocate2 on CPU-bound containers.
>> Observation: Average fallocate operations per container per second.
>>
>> |CPU: Kunpeng 920   |          P80           |            P1           |
>> |Memory: 512GB      |------------------------|-------------------------|
>> |960GB SSD (0.5GB/s)| base  |    patched     | base   |    patched     |
>> |-------------------|-------|----------------|--------|----------------|
>> |mb_optimize_scan=0 | 4821  | 9636  (+99.8%) | 314065 | 337597 (+7.4%) |
>> |mb_optimize_scan=1 | 4784  | 4834  (+1.04%) | 316344 | 341440 (+7.9%) |
>>
>> |CPU: AMD 9654 * 2  |          P96           |             P1          |
>> |Memory: 1536GB     |------------------------|-------------------------|
>> |960GB SSD (1GB/s)  | base  |    patched     | base   |    patched     |
>> |-------------------|-------|----------------|--------|----------------|
>> |mb_optimize_scan=0 | 15371 | 22341 (+45.3%) | 205851 | 219707 (+6.7%) |
>> |mb_optimize_scan=1 | 6101  | 9177  (+50.4%) | 207373 | 215732 (+4.0%) |
>>
>> Suggested-by: Jan Kara <jack@suse.cz>
>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
>> ---
>>   fs/ext4/ext4.h    |  2 +-
>>   fs/ext4/mballoc.c | 12 +++---------
>>   2 files changed, 4 insertions(+), 10 deletions(-)
>>
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index b83095541c98..7f5c070de0fb 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -1630,7 +1630,7 @@ struct ext4_sb_info {
>>   	unsigned int s_mb_group_prealloc;
>>   	unsigned int s_max_dir_size_kb;
>>   	/* where last allocation was done - for stream allocation */
>> -	unsigned long s_mb_last_group;
>> +	ext4_group_t s_mb_last_group;
>>   	unsigned int s_mb_prefetch;
>>   	unsigned int s_mb_prefetch_limit;
>>   	unsigned int s_mb_best_avail_max_trim_order;
>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>> index e3a5103e1620..025b759ca643 100644
>> --- a/fs/ext4/mballoc.c
>> +++ b/fs/ext4/mballoc.c
>> @@ -2168,11 +2168,8 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
>>   	ac->ac_buddy_folio = e4b->bd_buddy_folio;
>>   	folio_get(ac->ac_buddy_folio);
>>   	/* store last allocated for subsequent stream allocation */
>> -	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
>> -		spin_lock(&sbi->s_md_lock);
>> -		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
>> -		spin_unlock(&sbi->s_md_lock);
>> -	}
>> +	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
>> +		WRITE_ONCE(sbi->s_mb_last_group, ac->ac_f_ex.fe_group);
>>   	/*
>>   	 * As we've just preallocated more space than
>>   	 * user requested originally, we store allocated
>> @@ -2845,10 +2842,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>>   
>>   	/* if stream allocation is enabled, use global goal */
>>   	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
>> -		/* TBD: may be hot point */
>> -		spin_lock(&sbi->s_md_lock);
>> -		ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
>> -		spin_unlock(&sbi->s_md_lock);
>> +		ac->ac_g_ex.fe_group = READ_ONCE(sbi->s_mb_last_group);
>>   		ac->ac_g_ex.fe_start = -1;
>>   		ac->ac_flags &= ~EXT4_MB_HINT_TRY_GOAL;
>>   	}
>> -- 
>> 2.46.1
>>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 00/17] ext4: better scalability for ext4 block allocation
  2025-07-14 13:03 [PATCH v3 00/17] ext4: better scalability for ext4 block allocation Baokun Li
                   ` (17 preceding siblings ...)
  2025-07-15  1:11 ` [PATCH v3 00/17] ext4: better scalability for ext4 block allocation Zhang Yi
@ 2025-07-19 21:45 ` Theodore Ts'o
  18 siblings, 0 replies; 44+ messages in thread
From: Theodore Ts'o @ 2025-07-19 21:45 UTC (permalink / raw)
  To: linux-ext4, Baokun Li
  Cc: Theodore Ts'o, adilger.kernel, jack, linux-kernel, ojaswin,
	julia.lawall, yi.zhang, yangerkun, libaokun


On Mon, 14 Jul 2025 21:03:10 +0800, Baokun Li wrote:
> Changes since v2:
>  * Collect RVB from Jan Kara. (Thanks for your review!)
>  * Add patch 2.
>  * Patch 4: Switching to READ_ONCE/WRITE_ONCE (great for single-process)
>         over smp_load_acquire/smp_store_release (only slight multi-process
>         gain). (Suggested by Jan Kara)
>  * Patch 5: The number of global goals is now set to the lesser of the CPU
>         count or one-fourth of the group count. This prevents setting too
>         many goals for small filesystems, which lead to file dispersion.
>         (Suggested by Jan Kara)
>  * Patch 5: Directly use kfree() to release s_mb_last_groups instead of
>         kvfree(). (Suggested by Julia Lawall)
>  * Patch 11: Even without mb_optimize_scan enabled, we now always attempt
>         to remove the group from the old order list.(Suggested by Jan Kara)
>  * Patch 14-16: Added comments for clarity, refined logic, and removed
>         obsolete variables.
>  * Update performance test results and indicate raw disk write bandwidth.
> 
> [...]

Applied, thanks!

[01/17] ext4: add ext4_try_lock_group() to skip busy groups
        commit: 68f9a4d4f74ac2f6b8a836600caedb17b1f417e0
[02/17] ext4: separate stream goal hits from s_bal_goals for better tracking
        commit: c6a98dbdff75a960a8976294a56b3366305b4fed
[03/17] ext4: remove unnecessary s_mb_last_start
        commit: 8eb252a81b311d6b2a59176c9ef7e17d731e17e6
[04/17] ext4: remove unnecessary s_md_lock on update s_mb_last_group
        commit: ea906991a494eeaf8b6a4ac82c568071a6b6b52c
[05/17] ext4: utilize multiple global goals to reduce contention
        commit: 174688d2e06ef9e03d5b93ce2386e2e9a5af6e7b
[06/17] ext4: get rid of some obsolete EXT4_MB_HINT flags
        commit: d82c95e546dc57b3cd2d46e38ac216cd08dfab3c
[07/17] ext4: fix typo in CR_GOAL_LEN_SLOW comment
        commit: 1930d818c5ecfd557eae0f581cc9b6392debf9c6
[08/17] ext4: convert sbi->s_mb_free_pending to atomic_t
        commit: 3772fe7b4225f21a1bfe63e4a338702cc3c153de
[09/17] ext4: merge freed extent with existing extents before insertion
        commit: 92ba7b95ef0743c76688fd3d4c644e8ba4fd4cc4
[10/17] ext4: fix zombie groups in average fragment size lists
        commit: 84521ebf83028c0321050b8665e05d5cdef5d0d8
[11/17] ext4: fix largest free orders lists corruption on mb_optimize_scan switch
        commit: bbe11dd13a3ff78ed256b8c66356624284c66f99
[12/17] ext4: factor out __ext4_mb_scan_group()
        commit: 47fb751bf947da35f6669ddf5ab9869f58f991e2
[13/17] ext4: factor out ext4_mb_might_prefetch()
        commit: 12a5b877c314778ddf9a5c603eeb1803a514ab58
[14/17] ext4: factor out ext4_mb_scan_group()
        commit: 6e0275f6e713f55dd3fc23be317ec11f8db1766d
[15/17] ext4: convert free groups order lists to xarrays
        commit: bffe0d5051626a3e6ce4b03e247814af2d595ee2
[16/17] ext4: refactor choose group to scan group
        commit: 56b493f9ac002ee7963eed22eb4131d120d60fd3
[17/17] ext4: implement linear-like traversal across order xarrays
        commit: feffac547fb53d7a3fedd47a50fa91bd2d804d41

Best regards,
-- 
Theodore Ts'o <tytso@mit.edu>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 15/17] ext4: convert free groups order lists to xarrays
  2025-07-14 13:03 ` [PATCH v3 15/17] ext4: convert free groups order lists to xarrays Baokun Li
@ 2025-07-21 11:07   ` Jan Kara
  2025-07-21 12:33     ` Baokun Li
  2025-07-24  3:55   ` Guenter Roeck
  1 sibling, 1 reply; 44+ messages in thread
From: Jan Kara @ 2025-07-21 11:07 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, ojaswin,
	julia.lawall, yi.zhang, yangerkun, libaokun

On Mon 14-07-25 21:03:25, Baokun Li wrote:
> While traversing the list, holding a spin_lock prevents load_buddy, making
> direct use of ext4_try_lock_group impossible. This can lead to a bouncing
> scenario where spin_is_locked(grp_A) succeeds, but ext4_try_lock_group()
> fails, forcing the list traversal to repeatedly restart from grp_A.
> 
> In contrast, linear traversal directly uses ext4_try_lock_group(),
> avoiding this bouncing. Therefore, we need a lockless, ordered traversal
> to achieve linear-like efficiency.
> 
> Therefore, this commit converts both average fragment size lists and
> largest free order lists into ordered xarrays.
> 
> In an xarray, the index represents the block group number and the value
> holds the block group information; a non-empty value indicates the block
> group's presence.
> 
> While insertion and deletion complexity remain O(1), lookup complexity
> changes from O(1) to O(nlogn), which may slightly reduce single-threaded
> performance.
> 
> Additionally, xarray insertions might fail, potentially due to memory
> allocation issues. However, since we have linear traversal as a fallback,
> this isn't a major problem. Therefore, we've only added a warning message
> for insertion failures here.
> 
> A helper function ext4_mb_find_good_group_xarray() is added to find good
> groups in the specified xarray starting at the specified position start,
> and when it reaches ngroups-1, it wraps around to 0 and then to start-1.
> This ensures an ordered traversal within the xarray.
> 
> Performance test results are as follows: Single-process operations
> on an empty disk show negligible impact, while multi-process workloads
> demonstrate a noticeable performance gain.
> 
> |CPU: Kunpeng 920   |          P80           |            P1           |
> |Memory: 512GB      |------------------------|-------------------------|
> |960GB SSD (0.5GB/s)| base  |    patched     | base   |    patched     |
> |-------------------|-------|----------------|--------|----------------|
> |mb_optimize_scan=0 | 20097 | 19555 (-2.6%)  | 316141 | 315636 (-0.2%) |
> |mb_optimize_scan=1 | 13318 | 15496 (+16.3%) | 325273 | 323569 (-0.5%) |
> 
> |CPU: AMD 9654 * 2  |          P96           |             P1          |
> |Memory: 1536GB     |------------------------|-------------------------|
> |960GB SSD (1GB/s)  | base  |    patched     | base   |    patched     |
> |-------------------|-------|----------------|--------|----------------|
> |mb_optimize_scan=0 | 53603 | 53192 (-0.7%)  | 214243 | 212678 (-0.7%) |
> |mb_optimize_scan=1 | 20887 | 37636 (+80.1%) | 213632 | 214189 (+0.2%) |
> 
> Signed-off-by: Baokun Li <libaokun1@huawei.com>

The patch looks good and the results are nice. I've just noticed two typos:

> +static inline void ext4_mb_avg_fragment_size_destory(struct ext4_sb_info *sbi)
						^^^ destroy


> +{
> +	for (int i = 0; i < MB_NUM_ORDERS(sbi->s_sb); i++)
> +		xa_destroy(&sbi->s_mb_avg_fragment_size[i]);
> +	kfree(sbi->s_mb_avg_fragment_size);
> +}
> +
> +static inline void ext4_mb_largest_free_orders_destory(struct ext4_sb_info *sbi)
						  ^^^ destroy

> +{
> +	for (int i = 0; i < MB_NUM_ORDERS(sbi->s_sb); i++)
> +		xa_destroy(&sbi->s_mb_largest_free_orders[i]);
> +	kfree(sbi->s_mb_largest_free_orders);
> +}

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 44+ messages in thread
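
For readers following the discussion, the ordered wrap-around lookup that the
commit message quoted above describes can be sketched roughly as follows. This
is an illustrative sketch only, not the helper merged as
ext4_mb_find_good_group_xarray(): the function name and the "good" predicate
callback are invented here, and it assumes ext4's internal headers for
struct ext4_group_info. Index == block group number, value == that group's
info, as in the patch.

#include <linux/xarray.h>

/*
 * Sketch: scan one xarray of block groups starting at @start, wrap
 * around to 0..start-1, and return the first group the caller's
 * predicate accepts.  Not the mainline helper; illustration only.
 */
static struct ext4_group_info *
find_good_group_sketch(struct xarray *xa, unsigned long start,
		       unsigned long ngroups,
		       bool (*good)(struct ext4_group_info *grp))
{
	struct ext4_group_info *grp;
	unsigned long idx;

	/* first pass: start .. ngroups - 1 */
	xa_for_each_range(xa, idx, grp, start, ngroups - 1)
		if (good(grp))
			return grp;

	/* second pass: wrap around to 0 .. start - 1 */
	if (start) {
		xa_for_each_range(xa, idx, grp, 0, start - 1)
			if (good(grp))
				return grp;
	}

	return NULL;
}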

* Re: [PATCH v3 15/17] ext4: convert free groups order lists to xarrays
  2025-07-21 11:07   ` Jan Kara
@ 2025-07-21 12:33     ` Baokun Li
  2025-07-21 13:45       ` Baokun Li
  0 siblings, 1 reply; 44+ messages in thread
From: Baokun Li @ 2025-07-21 12:33 UTC (permalink / raw)
  To: Jan Kara, tytso
  Cc: linux-ext4, adilger.kernel, linux-kernel, ojaswin, julia.lawall,
	yi.zhang, yangerkun, libaokun

On 2025/7/21 19:07, Jan Kara wrote:
> On Mon 14-07-25 21:03:25, Baokun Li wrote:
>> |CPU: Kunpeng 920   |          P80           |            P1           |
>> |Memory: 512GB      |------------------------|-------------------------|
>> |960GB SSD (0.5GB/s)| base  |    patched     | base   |    patched     |
>> |-------------------|-------|----------------|--------|----------------|
>> |mb_optimize_scan=0 | 20097 | 19555 (-2.6%)  | 316141 | 315636 (-0.2%) |
>> |mb_optimize_scan=1 | 13318 | 15496 (+16.3%) | 325273 | 323569 (-0.5%) |
>>
>> |CPU: AMD 9654 * 2  |          P96           |             P1          |
>> |Memory: 1536GB     |------------------------|-------------------------|
>> |960GB SSD (1GB/s)  | base  |    patched     | base   |    patched     |
>> |-------------------|-------|----------------|--------|----------------|
>> |mb_optimize_scan=0 | 53603 | 53192 (-0.7%)  | 214243 | 212678 (-0.7%) |
>> |mb_optimize_scan=1 | 20887 | 37636 (+80.1%) | 213632 | 214189 (+0.2%) |
>>
>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> The patch looks good and the results are nice. I've just noticed two typos:
>
>> +static inline void ext4_mb_avg_fragment_size_destory(struct ext4_sb_info *sbi)
> 						^^^ destroy
>
>
>> +static inline void ext4_mb_largest_free_orders_destory(struct ext4_sb_info *sbi)
> 						  ^^^ destroy

Hi Jan, thanks for the review! While examining this patch, I also
identified a comment formatting error that I regret overlooking previously.
My apologies for this oversight.

Hey Ted, could you please help apply the following diff to correct the
spelling errors and comment formatting issues? Or would you prefer I send
out a new patch series or a separate cleanup patch?


Thanks,
Baokun

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index a9eb997b8c9b..c61955cba370 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -863,10 +863,10 @@ mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
  	grp->bb_avg_fragment_size_order = new;
  	if (new >= 0) {
  		/*
-		* Cannot use __GFP_NOFAIL because we hold the group lock.
-		* Although allocation for insertion may fails, it's not fatal
-		* as we have linear traversal to fall back on.
-		*/
+		 * Cannot use __GFP_NOFAIL because we hold the group lock.
+		 * Although allocation for insertion may fails, it's not fatal
+		 * as we have linear traversal to fall back on.
+		 */
  		int err = xa_insert(&sbi->s_mb_avg_fragment_size[new],
  				    grp->bb_group, grp, GFP_ATOMIC);
  		if (err)
@@ -1201,10 +1201,10 @@ mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
  	grp->bb_largest_free_order = new;
  	if (test_opt2(sb, MB_OPTIMIZE_SCAN) && new >= 0 && grp->bb_free) {
  		/*
-		* Cannot use __GFP_NOFAIL because we hold the group lock.
-		* Although allocation for insertion may fails, it's not fatal
-		* as we have linear traversal to fall back on.
-		*/
+		 * Cannot use __GFP_NOFAIL because we hold the group lock.
+		 * Although allocation for insertion may fails, it's not fatal
+		 * as we have linear traversal to fall back on.
+		 */
  		int err = xa_insert(&sbi->s_mb_largest_free_orders[new],
  				    grp->bb_group, grp, GFP_ATOMIC);
  		if (err)
@@ -3657,14 +3657,14 @@ static void ext4_discard_work(struct work_struct *work)
  		ext4_mb_unload_buddy(&e4b);
  }
  
-static inline void ext4_mb_avg_fragment_size_destory(struct ext4_sb_info *sbi)
+static inline void ext4_mb_avg_fragment_size_destroy(struct ext4_sb_info *sbi)
  {
  	for (int i = 0; i < MB_NUM_ORDERS(sbi->s_sb); i++)
  		xa_destroy(&sbi->s_mb_avg_fragment_size[i]);
  	kfree(sbi->s_mb_avg_fragment_size);
  }
  
-static inline void ext4_mb_largest_free_orders_destory(struct ext4_sb_info *sbi)
+static inline void ext4_mb_largest_free_orders_destroy(struct ext4_sb_info *sbi)
  {
  	for (int i = 0; i < MB_NUM_ORDERS(sbi->s_sb); i++)
  		xa_destroy(&sbi->s_mb_largest_free_orders[i]);
@@ -3818,8 +3818,8 @@ int ext4_mb_init(struct super_block *sb)
  	kfree(sbi->s_mb_last_groups);
  	sbi->s_mb_last_groups = NULL;
  out:
-	ext4_mb_avg_fragment_size_destory(sbi);
-	ext4_mb_largest_free_orders_destory(sbi);
+	ext4_mb_avg_fragment_size_destroy(sbi);
+	ext4_mb_largest_free_orders_destroy(sbi);
  	kfree(sbi->s_mb_offsets);
  	sbi->s_mb_offsets = NULL;
  	kfree(sbi->s_mb_maxs);
@@ -3886,8 +3886,8 @@ void ext4_mb_release(struct super_block *sb)
  		kvfree(group_info);
  		rcu_read_unlock();
  	}
-	ext4_mb_avg_fragment_size_destory(sbi);
-	ext4_mb_largest_free_orders_destory(sbi);
+	ext4_mb_avg_fragment_size_destroy(sbi);
+	ext4_mb_largest_free_orders_destroy(sbi);
  	kfree(sbi->s_mb_offsets);
  	kfree(sbi->s_mb_maxs);
  	iput(sbi->s_buddy_cache);



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 15/17] ext4: convert free groups order lists to xarrays
  2025-07-21 12:33     ` Baokun Li
@ 2025-07-21 13:45       ` Baokun Li
  2025-07-21 18:01         ` Theodore Ts'o
  0 siblings, 1 reply; 44+ messages in thread
From: Baokun Li @ 2025-07-21 13:45 UTC (permalink / raw)
  To: tytso, Jan Kara
  Cc: linux-ext4, adilger.kernel, linux-kernel, ojaswin, julia.lawall,
	yi.zhang, yangerkun, libaokun

On 2025/7/21 20:33, Baokun Li wrote:
> On 2025/7/21 19:07, Jan Kara wrote:
>> On Mon 14-07-25 21:03:25, Baokun Li wrote:
>>> |CPU: Kunpeng 920   |          P80           |            P1           |
>>> |Memory: 512GB      |------------------------|-------------------------|
>>> |960GB SSD (0.5GB/s)| base  |    patched     | base   |    patched     |
>>> |-------------------|-------|----------------|--------|----------------|
>>> |mb_optimize_scan=0 | 20097 | 19555 (-2.6%)  | 316141 | 315636 (-0.2%) |
>>> |mb_optimize_scan=1 | 13318 | 15496 (+16.3%) | 325273 | 323569 (-0.5%) |
>>>
>>> |CPU: AMD 9654 * 2  |          P96           |             P1          |
>>> |Memory: 1536GB     |------------------------|-------------------------|
>>> |960GB SSD (1GB/s)  | base  |    patched     | base   |    patched     |
>>> |-------------------|-------|----------------|--------|----------------|
>>> |mb_optimize_scan=0 | 53603 | 53192 (-0.7%)  | 214243 | 212678 (-0.7%) |
>>> |mb_optimize_scan=1 | 20887 | 37636 (+80.1%) | 213632 | 214189 (+0.2%) |
>>>
>>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
>> The patch looks good and the results are nice. I've just noticed two 
>> typos:
>>
>>> +static inline void ext4_mb_avg_fragment_size_destory(struct 
>>> ext4_sb_info *sbi)
>>                         ^^^ destroy
>>
>>
>>> +static inline void ext4_mb_largest_free_orders_destory(struct 
>>> ext4_sb_info *sbi)
>>                           ^^^ destroy
> 
> Hi Jan, thanks for the review! While examining this patch, I also
> identified a comment formatting error that I regret overlooking previously.
> My apologies for this oversight.
> 
> Hey Ted, could you please help apply the following diff to correct the
> spelling errors and comment formatting issues? Or would you prefer I send
> out a new patch series or a separate cleanup patch?
> 
> 
Sorry, Thunderbird is automatically converting tabs to spaces in the
code; please try the diff below.


Thanks,
Baokun


diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index a9eb997b8c9b..c61955cba370 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -863,10 +863,10 @@ mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
  	grp->bb_avg_fragment_size_order = new;
  	if (new >= 0) {
  		/*
-		* Cannot use __GFP_NOFAIL because we hold the group lock.
-		* Although allocation for insertion may fails, it's not fatal
-		* as we have linear traversal to fall back on.
-		*/
+		 * Cannot use __GFP_NOFAIL because we hold the group lock.
+		 * Although allocation for insertion may fails, it's not fatal
+		 * as we have linear traversal to fall back on.
+		 */
  		int err = xa_insert(&sbi->s_mb_avg_fragment_size[new],
  				    grp->bb_group, grp, GFP_ATOMIC);
  		if (err)
@@ -1201,10 +1201,10 @@ mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
  	grp->bb_largest_free_order = new;
  	if (test_opt2(sb, MB_OPTIMIZE_SCAN) && new >= 0 && grp->bb_free) {
  		/*
-		* Cannot use __GFP_NOFAIL because we hold the group lock.
-		* Although allocation for insertion may fails, it's not fatal
-		* as we have linear traversal to fall back on.
-		*/
+		 * Cannot use __GFP_NOFAIL because we hold the group lock.
+		 * Although allocation for insertion may fails, it's not fatal
+		 * as we have linear traversal to fall back on.
+		 */
  		int err = xa_insert(&sbi->s_mb_largest_free_orders[new],
  				    grp->bb_group, grp, GFP_ATOMIC);
  		if (err)
@@ -3657,14 +3657,14 @@ static void ext4_discard_work(struct work_struct *work)
  		ext4_mb_unload_buddy(&e4b);
  }

-static inline void ext4_mb_avg_fragment_size_destory(struct ext4_sb_info *sbi)
+static inline void ext4_mb_avg_fragment_size_destroy(struct ext4_sb_info *sbi)
  {
  	for (int i = 0; i < MB_NUM_ORDERS(sbi->s_sb); i++)
  		xa_destroy(&sbi->s_mb_avg_fragment_size[i]);
  	kfree(sbi->s_mb_avg_fragment_size);
  }

-static inline void ext4_mb_largest_free_orders_destory(struct ext4_sb_info *sbi)
+static inline void ext4_mb_largest_free_orders_destroy(struct ext4_sb_info *sbi)
  {
  	for (int i = 0; i < MB_NUM_ORDERS(sbi->s_sb); i++)
  		xa_destroy(&sbi->s_mb_largest_free_orders[i]);
@@ -3818,8 +3818,8 @@ int ext4_mb_init(struct super_block *sb)
  	kfree(sbi->s_mb_last_groups);
  	sbi->s_mb_last_groups = NULL;
  out:
-	ext4_mb_avg_fragment_size_destory(sbi);
-	ext4_mb_largest_free_orders_destory(sbi);
+	ext4_mb_avg_fragment_size_destroy(sbi);
+	ext4_mb_largest_free_orders_destroy(sbi);
  	kfree(sbi->s_mb_offsets);
  	sbi->s_mb_offsets = NULL;
  	kfree(sbi->s_mb_maxs);
@@ -3886,8 +3886,8 @@ void ext4_mb_release(struct super_block *sb)
  		kvfree(group_info);
  		rcu_read_unlock();
  	}
-	ext4_mb_avg_fragment_size_destory(sbi);
-	ext4_mb_largest_free_orders_destory(sbi);
+	ext4_mb_avg_fragment_size_destroy(sbi);
+	ext4_mb_largest_free_orders_destroy(sbi);
  	kfree(sbi->s_mb_offsets);
  	kfree(sbi->s_mb_maxs);
  	iput(sbi->s_buddy_cache);


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 15/17] ext4: convert free groups order lists to xarrays
  2025-07-21 13:45       ` Baokun Li
@ 2025-07-21 18:01         ` Theodore Ts'o
  2025-07-22  5:58           ` Baokun Li
  0 siblings, 1 reply; 44+ messages in thread
From: Theodore Ts'o @ 2025-07-21 18:01 UTC (permalink / raw)
  To: Baokun Li
  Cc: Jan Kara, linux-ext4, adilger.kernel, linux-kernel, ojaswin,
	julia.lawall, yi.zhang, yangerkun, libaokun

Thanks, Baokun!  I've updated the ext4 dev branch with the spelling
fixes integrated into "ext4: convert free groups order lists to
xarrays".

						- Ted

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 15/17] ext4: convert free groups order lists to xarrays
  2025-07-21 18:01         ` Theodore Ts'o
@ 2025-07-22  5:58           ` Baokun Li
  0 siblings, 0 replies; 44+ messages in thread
From: Baokun Li @ 2025-07-22  5:58 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jan Kara, linux-ext4, adilger.kernel, linux-kernel, ojaswin,
	julia.lawall, yi.zhang, yangerkun, libaokun

On 7/22/2025 2:01 AM, Theodore Ts'o wrote:
> Thanks, Baokun!  I've updated the ext4 dev branch with the spelling
> fixes integrated into "ext4: convert free groups order lists to
> xarrays".
>
> 						- Ted
>
Thanks for updating the code!


Regards,
Baokun


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 01/17] ext4: add ext4_try_lock_group() to skip busy groups
  2025-07-19  0:29     ` Baokun Li
@ 2025-07-22 20:59       ` Andi Kleen
  0 siblings, 0 replies; 44+ messages in thread
From: Andi Kleen @ 2025-07-22 20:59 UTC (permalink / raw)
  To: Baokun Li; +Cc: linux-ext4, linux-kernel

> As Ted mentioned, Ext4 has never guaranteed deterministic allocation. We
> do attempt a predetermined goal in ext4_mb_find_by_goal(), and this part
> has no trylock logic, meaning we'll always attempt to scan the target
> group once—that's deterministic.
> 
> However, if the target attempt fails, the primary goal for subsequent
> allocation is to find suitable free space as quickly as possible, so
> there's no need to contend with other processes for non-target groups.

If you want to do it as quickly as possible then trylock is also not a good
strategy. It requires moving the cache line of the lock from EXCLUSIVE
(on the owning CPU) to SHARED and then later back to unlock, which all require
slow communication. On a large system or with contention you will still
observe considerable latencies.

Better to figure out a scheme that doesn't require touching the lock
at all.

-Andi


^ permalink raw reply	[flat|nested] 44+ messages in thread
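
To make the trade-off in this subthread concrete, the trylock-and-skip idea
from patch 01 looks roughly like the sketch below. This is only an
illustration: ext4_lock_group()/ext4_unlock_group() are the existing per-group
lock helpers, ext4_try_lock_group() is the helper added by patch 01 (assumed
here to take the same (sb, group) arguments), and the real allocator loop in
mballoc.c carries far more state than this.

/*
 * Sketch: the goal group is always scanned (blocking lock, so the first
 * attempt stays deterministic); any other group is taken with a trylock
 * and skipped when busy.  As Andi notes, even a failed trylock still
 * pulls the lock's cache line over to this CPU.
 */
static void scan_groups_sketch(struct super_block *sb, ext4_group_t goal,
			       ext4_group_t ngroups)
{
	ext4_group_t i, group;

	for (i = 0; i < ngroups; i++) {
		group = (goal + i) % ngroups;

		if (group == goal)
			ext4_lock_group(sb, group);
		else if (!ext4_try_lock_group(sb, group))
			continue;	/* busy: skip it, don't spin */

		/* ... scan this group's buddy bitmap here ... */

		ext4_unlock_group(sb, group);
	}
}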

* Re: [PATCH v3 15/17] ext4: convert free groups order lists to xarrays
  2025-07-14 13:03 ` [PATCH v3 15/17] ext4: convert free groups order lists to xarrays Baokun Li
  2025-07-21 11:07   ` Jan Kara
@ 2025-07-24  3:55   ` Guenter Roeck
  2025-07-24  4:54     ` Theodore Ts'o
  1 sibling, 1 reply; 44+ messages in thread
From: Guenter Roeck @ 2025-07-24  3:55 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, ojaswin,
	julia.lawall, yi.zhang, yangerkun, libaokun

Hi,

On Mon, Jul 14, 2025 at 09:03:25PM +0800, Baokun Li wrote:
> While traversing the list, holding a spin_lock prevents load_buddy, making
> direct use of ext4_try_lock_group impossible. This can lead to a bouncing
> scenario where spin_is_locked(grp_A) succeeds, but ext4_try_lock_group()
> fails, forcing the list traversal to repeatedly restart from grp_A.
> 

This patch causes crashes for pretty much every architecture when
running unit tests as part of booting.

Example (from x8_64) as well as bisect log attached below.

Guenter

---
...
[    9.353832]         # Subtest: test_new_blocks_simple
[    9.366711] BUG: kernel NULL pointer dereference, address: 0000000000000014
[    9.366931] #PF: supervisor read access in kernel mode
[    9.366993] #PF: error_code(0x0000) - not-present page
[    9.367165] PGD 0 P4D 0
[    9.367305] Oops: Oops: 0000 [#1] SMP PTI
[    9.367686] CPU: 0 UID: 0 PID: 217 Comm: kunit_try_catch Tainted: G                 N  6.16.0-rc7-next-20250722 #1 PREEMPT(voluntary)
[    9.367846] Tainted: [N]=TEST
[    9.367891] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[    9.368063] RIP: 0010:ext4_mb_release+0x26e/0x510
[    9.368374] Code: 28 4a cb ff e8 03 5a cf ff 31 db 48 8d 3c 9b 48 83 c3 01 48 c1 e7 04 48 03 bd 60 05 00 00 e8 c9 a6 48 01 48 8b 85 68 03 00 00 <0f> b6 40 14 83 c0 02 39 d8 7f d6 48 8b bd 60 05 00 00 31 db e8 d9
[    9.368581] RSP: 0000:ffffb33b8041fe40 EFLAGS: 00010286
[    9.368659] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
[    9.368732] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff9a319e36
[    9.368802] RBP: ffff8b89c3502400 R08: 0000000000000001 R09: 0000000000000000
[    9.368872] R10: 0000000000000001 R11: 0000000000000120 R12: ffff8b89c2f49160
[    9.368941] R13: ffff8b89c2f49158 R14: ffff8b89c2f24000 R15: ffff8b89c2f24000
[    9.369042] FS:  0000000000000000(0000) GS:ffff8b8a3381a000(0000) knlGS:0000000000000000
[    9.369127] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    9.369194] CR2: 0000000000000014 CR3: 0000000009a9c000 CR4: 00000000003506f0
[    9.369324] Call Trace:
[    9.369440]  <TASK>
[    9.369637]  mbt_kunit_exit+0x47/0xf0
[    9.369745]  ? __pfx_kunit_generic_run_threadfn_adapter+0x10/0x10
[    9.369813]  kunit_try_run_case_cleanup+0x2f/0x40
[    9.369865]  kunit_generic_run_threadfn_adapter+0x1c/0x40
[    9.369922]  kthread+0x10b/0x230
[    9.369965]  ? __pfx_kthread+0x10/0x10
[    9.370013]  ret_from_fork+0x165/0x1b0
[    9.370057]  ? __pfx_kthread+0x10/0x10
[    9.370099]  ret_from_fork_asm+0x1a/0x30
[    9.370188]  </TASK>
[    9.370250] Modules linked in:
[    9.370428] CR2: 0000000000000014
[    9.370657] ---[ end trace 0000000000000000 ]---
[    9.370791] RIP: 0010:ext4_mb_release+0x26e/0x510
[    9.370847] Code: 28 4a cb ff e8 03 5a cf ff 31 db 48 8d 3c 9b 48 83 c3 01 48 c1 e7 04 48 03 bd 60 05 00 00 e8 c9 a6 48 01 48 8b 85 68 03 00 00 <0f> b6 40 14 83 c0 02 39 d8 7f d6 48 8b bd 60 05 00 00 31 db e8 d9
[    9.370996] RSP: 0000:ffffb33b8041fe40 EFLAGS: 00010286
[    9.371050] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
[    9.371112] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff9a319e36
[    9.371174] RBP: ffff8b89c3502400 R08: 0000000000000001 R09: 0000000000000000
[    9.371235] R10: 0000000000000001 R11: 0000000000000120 R12: ffff8b89c2f49160
[    9.371297] R13: ffff8b89c2f49158 R14: ffff8b89c2f24000 R15: ffff8b89c2f24000
[    9.371358] FS:  0000000000000000(0000) GS:ffff8b8a3381a000(0000) knlGS:0000000000000000
[    9.371428] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    9.371484] CR2: 0000000000000014 CR3: 0000000009a9c000 CR4: 00000000003506f0
[    9.371598] note: kunit_try_catch[217] exited with irqs disabled
[    9.371861]     # test_new_blocks_simple: try faulted: last line seen fs/ext4/mballoc-test.c:452
[    9.372123]     # test_new_blocks_simple: internal error occurred during test case cleanup: -4
[    9.372440]         not ok 1 block_bits=10 cluster_bits=3 blocks_per_group=8192 group_count=4 desc_size=64
[    9.375702] BUG: kernel NULL pointer dereference, address: 0000000000000014
[    9.375782] #PF: supervisor read access in kernel mode
[    9.375832] #PF: error_code(0x0000) - not-present page
[    9.375881] PGD 0 P4D 0 
[    9.375919] Oops: Oops: 0000 [#2] SMP PTI
[    9.375966] CPU: 0 UID: 0 PID: 219 Comm: kunit_try_catch Tainted: G      D          N  6.16.0-rc7-next-20250722 #1 PREEMPT(voluntary) 
[    9.376085] Tainted: [D]=DIE, [N]=TEST
[    9.376123] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[    9.376220] RIP: 0010:ext4_mb_release+0x26e/0x510
[    9.376275] Code: 28 4a cb ff e8 03 5a cf ff 31 db 48 8d 3c 9b 48 83 c3 01 48 c1 e7 04 48 03 bd 60 05 00 00 e8 c9 a6 48 01 48 8b 85 68 03 00 00 <0f> b6 40 14 83 c0 02 39 d8 7f d6 48 8b bd 60 05 00 00 31 db e8 d9
[    9.376425] RSP: 0000:ffffb33b803f7e40 EFLAGS: 00010286
[    9.376482] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
[    9.376546] RDX: 0000000002000008 RSI: ffffffff9a319e36 RDI: ffffffff9a319e36
[    9.376608] RBP: ffff8b89c352a400 R08: 0000000000000000 R09: 0000000000000000
[    9.376669] R10: 0000000000000000 R11: 0000000058d996d7 R12: ffff8b89c2f49cc0
[    9.376730] R13: ffff8b89c2f49cb8 R14: ffff8b89c3524000 R15: ffff8b89c3524000
[    9.376792] FS:  0000000000000000(0000) GS:ffff8b8a3381a000(0000) knlGS:0000000000000000
[    9.376861] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    9.376913] CR2: 0000000000000014 CR3: 0000000009a9c000 CR4: 00000000003506f0
[    9.376975] Call Trace:
[    9.377004]  <TASK>
[    9.377040]  mbt_kunit_exit+0x47/0xf0
[    9.377089]  ? __pfx_kunit_generic_run_threadfn_adapter+0x10/0x10
[    9.377150]  kunit_try_run_case_cleanup+0x2f/0x40
[    9.377207]  kunit_generic_run_threadfn_adapter+0x1c/0x40
[    9.377266]  kthread+0x10b/0x230
[    9.377308]  ? __pfx_kthread+0x10/0x10
[    9.377353]  ret_from_fork+0x165/0x1b0
[    9.377397]  ? __pfx_kthread+0x10/0x10
[    9.377439]  ret_from_fork_asm+0x1a/0x30
[    9.377505]  </TASK>
[    9.377531] Modules linked in:
[    9.377571] CR2: 0000000000000014
[    9.377609] ---[ end trace 0000000000000000 ]---

---
Bisect log:

# bad: [a933d3dc1968fcfb0ab72879ec304b1971ed1b9a] Add linux-next specific files for 20250723
# good: [89be9a83ccf1f88522317ce02f854f30d6115c41] Linux 6.16-rc7
git bisect start 'HEAD' 'v6.16-rc7'
# bad: [a56f8f8967ad980d45049973561b89dcd9e37e5d] Merge branch 'main' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git
git bisect bad a56f8f8967ad980d45049973561b89dcd9e37e5d
# bad: [f6a8dede4030970707e9bae5b3ae76f60df4b75a] Merge branch 'fs-next' of linux-next
git bisect bad f6a8dede4030970707e9bae5b3ae76f60df4b75a
# good: [b863560c5a26fbcf164f5759c98bb5e72e26848d] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc.git
git bisect good b863560c5a26fbcf164f5759c98bb5e72e26848d
# bad: [690056682cc4de56d8de794bc06a3c04bc7f624b] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs.git
git bisect bad 690056682cc4de56d8de794bc06a3c04bc7f624b
# good: [fea76c3eb7455d1e941fba6fdd89ab41ab7797c8] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git
git bisect good fea76c3eb7455d1e941fba6fdd89ab41ab7797c8
# bad: [714a183e8cf1cc1ddddb3318de1694a33f49c694] Merge branch 'dev' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs.git
git bisect bad 714a183e8cf1cc1ddddb3318de1694a33f49c694
# good: [5fb60c0365c4dad347e4958f78976cb733d903f2] f2fs: Pass a folio to __has_merged_page()
git bisect good 5fb60c0365c4dad347e4958f78976cb733d903f2
# bad: [a8a47fa84cc2168b2b3bd645c2c0918eed994fc0] ext4: do not BUG when INLINE_DATA_FL lacks system.data xattr
git bisect bad a8a47fa84cc2168b2b3bd645c2c0918eed994fc0
# good: [a35454ecf8a320c49954fdcdae0e8d3323067632] ext4: use memcpy() instead of strcpy()
git bisect good a35454ecf8a320c49954fdcdae0e8d3323067632
# good: [3772fe7b4225f21a1bfe63e4a338702cc3c153de] ext4: convert sbi->s_mb_free_pending to atomic_t
git bisect good 3772fe7b4225f21a1bfe63e4a338702cc3c153de
# good: [12a5b877c314778ddf9a5c603eeb1803a514ab58] ext4: factor out ext4_mb_might_prefetch()
git bisect good 12a5b877c314778ddf9a5c603eeb1803a514ab58
# bad: [458bfb991155c2e8ba51861d1ef3c81c5a0846f9] ext4: convert free groups order lists to xarrays
git bisect bad 458bfb991155c2e8ba51861d1ef3c81c5a0846f9
# good: [6e0275f6e713f55dd3fc23be317ec11f8db1766d] ext4: factor out ext4_mb_scan_group()
git bisect good 6e0275f6e713f55dd3fc23be317ec11f8db1766d
# first bad commit: [458bfb991155c2e8ba51861d1ef3c81c5a0846f9] ext4: convert free groups order lists to xarrays

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 15/17] ext4: convert free groups order lists to xarrays
  2025-07-24  3:55   ` Guenter Roeck
@ 2025-07-24  4:54     ` Theodore Ts'o
  2025-07-24  5:20       ` Guenter Roeck
  2025-07-24 11:14       ` Zhang Yi
  0 siblings, 2 replies; 44+ messages in thread
From: Theodore Ts'o @ 2025-07-24  4:54 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Baokun Li, linux-ext4, adilger.kernel, jack, linux-kernel,
	ojaswin, julia.lawall, yi.zhang, yangerkun, libaokun

On Wed, Jul 23, 2025 at 08:55:14PM -0700, Guenter Roeck wrote:
> Hi,
> 
> On Mon, Jul 14, 2025 at 09:03:25PM +0800, Baokun Li wrote:
> > While traversing the list, holding a spin_lock prevents load_buddy, making
> > direct use of ext4_try_lock_group impossible. This can lead to a bouncing
> > scenario where spin_is_locked(grp_A) succeeds, but ext4_try_lock_group()
> > fails, forcing the list traversal to repeatedly restart from grp_A.
> > 
> 
> This patch causes crashes for pretty much every architecture when
> running unit tests as part of booting.

I'm assuming that you're using a randconfig that happened to enable
CONFIG_EXT4_KUNIT_TESTS=y.

A simpler reproducer is to have a .kunitconfig containing:

CONFIG_KUNIT=y
CONFIG_KUNIT_TEST=y
CONFIG_KUNIT_EXAMPLE_TEST=y
CONFIG_EXT4_KUNIT_TESTS=y

... and then run "./tools/testing/kunit/kunit.py run".

The first failure is actually with [11/17] ext4: fix largest free
orders lists corruption on mb_optimize_scan switch, which triggers a
failure of test_mb_mark_used.

Baokun, can you take a look please?   Many thanks!

	    	       	    	      - Ted

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 15/17] ext4: convert free groups order lists to xarrays
  2025-07-24  4:54     ` Theodore Ts'o
@ 2025-07-24  5:20       ` Guenter Roeck
  2025-07-24 11:14       ` Zhang Yi
  1 sibling, 0 replies; 44+ messages in thread
From: Guenter Roeck @ 2025-07-24  5:20 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Baokun Li, linux-ext4, adilger.kernel, jack, linux-kernel,
	ojaswin, julia.lawall, yi.zhang, yangerkun, libaokun

On 7/23/25 21:54, Theodore Ts'o wrote:
> On Wed, Jul 23, 2025 at 08:55:14PM -0700, Guenter Roeck wrote:
>> Hi,
>>
>> On Mon, Jul 14, 2025 at 09:03:25PM +0800, Baokun Li wrote:
>>> While traversing the list, holding a spin_lock prevents load_buddy, making
>>> direct use of ext4_try_lock_group impossible. This can lead to a bouncing
>>> scenario where spin_is_locked(grp_A) succeeds, but ext4_try_lock_group()
>>> fails, forcing the list traversal to repeatedly restart from grp_A.
>>>
>>
>> This patch causes crashes for pretty much every architecture when
>> running unit tests as part of booting.
> 
> I'm assuming that you're using a randconfig that happened to enable
> CONFIG_EXT4_KUNIT_TESTS=y.
> 

I enable as many kunit tests as possible, including CONFIG_EXT4_KUNIT_TESTS=y,
on top of various defconfigs. That results in:
	total: 637 pass: 59 fail: 578
with my qemu boot tests, which in a way is quite impressive ;-).

Guenter


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 15/17] ext4: convert free groups order lists to xarrays
  2025-07-24  4:54     ` Theodore Ts'o
  2025-07-24  5:20       ` Guenter Roeck
@ 2025-07-24 11:14       ` Zhang Yi
  2025-07-24 14:30         ` Guenter Roeck
  2025-07-24 14:54         ` Theodore Ts'o
  1 sibling, 2 replies; 44+ messages in thread
From: Zhang Yi @ 2025-07-24 11:14 UTC (permalink / raw)
  To: Theodore Ts'o, Guenter Roeck
  Cc: Baokun Li, linux-ext4, adilger.kernel, jack, linux-kernel,
	ojaswin, julia.lawall, yangerkun, libaokun

On 2025/7/24 12:54, Theodore Ts'o wrote:
> On Wed, Jul 23, 2025 at 08:55:14PM -0700, Guenter Roeck wrote:
>> Hi,
>>
>> On Mon, Jul 14, 2025 at 09:03:25PM +0800, Baokun Li wrote:
>>> While traversing the list, holding a spin_lock prevents load_buddy, making
>>> direct use of ext4_try_lock_group impossible. This can lead to a bouncing
>>> scenario where spin_is_locked(grp_A) succeeds, but ext4_try_lock_group()
>>> fails, forcing the list traversal to repeatedly restart from grp_A.
>>>
>>
>> This patch causes crashes for pretty much every architecture when
>> running unit tests as part of booting.
> 
> I'm assuming that you're using a randconfig that happened to enable
> CONFIG_EXT4_KUNIT_TESTS=y.
> 
> A simpler reproducer is to have a .kunitconfig containing:
> 
> CONFIG_KUNIT=y
> CONFIG_KUNIT_TEST=y
> CONFIG_KUNIT_EXAMPLE_TEST=y
> CONFIG_EXT4_KUNIT_TESTS=y
> 
> ... and then run "./tools/testing/kunit/kunit.py run".
> 
> The first failure is actually with [11/17] ext4: fix largest free
> orders lists corruption on mb_optimize_scan switch, which triggers a
> failure of test_mb_mark_used.
> 
> Baokun, can you take a look please?   Many thanks!
> 

Hi Ted and Guenter,

I'm sorry for this regression, we didn't run these tests. Baokun is
currently on a business trip, so I am helping to look into this issue.
The reason for the failure is that the variable initialization in the
mb unit tests is insufficient, but this series relies on those
variables being initialized.

Could you please try the following diff? I have tested it on my
machine, and the issue does not recur. If everything looks fine, I
will send out the official patch.

Thanks,
Yi.


diff --git a/fs/ext4/mballoc-test.c b/fs/ext4/mballoc-test.c
index d634c12f1984..a9416b20ff64 100644
--- a/fs/ext4/mballoc-test.c
+++ b/fs/ext4/mballoc-test.c
@@ -155,6 +155,7 @@ static struct super_block *mbt_ext4_alloc_super_block(void)
 	bgl_lock_init(sbi->s_blockgroup_lock);

 	sbi->s_es = &fsb->es;
+	sbi->s_sb = sb;
 	sb->s_fs_info = sbi;

 	up_write(&sb->s_umount);
@@ -802,6 +803,8 @@ static void test_mb_mark_used(struct kunit *test)
 	KUNIT_ASSERT_EQ(test, ret, 0);

 	grp->bb_free = EXT4_CLUSTERS_PER_GROUP(sb);
+	grp->bb_largest_free_order = -1;
+	grp->bb_avg_fragment_size_order = -1;
 	mbt_generate_test_ranges(sb, ranges, TEST_RANGE_COUNT);
 	for (i = 0; i < TEST_RANGE_COUNT; i++)
 		test_mb_mark_used_range(test, &e4b, ranges[i].start,
@@ -875,6 +878,8 @@ static void test_mb_free_blocks(struct kunit *test)
 	ext4_unlock_group(sb, TEST_GOAL_GROUP);

 	grp->bb_free = 0;
+	grp->bb_largest_free_order = -1;
+	grp->bb_avg_fragment_size_order = -1;
 	memset(bitmap, 0xff, sb->s_blocksize);

 	mbt_generate_test_ranges(sb, ranges, TEST_RANGE_COUNT);




^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 15/17] ext4: convert free groups order lists to xarrays
  2025-07-24 11:14       ` Zhang Yi
@ 2025-07-24 14:30         ` Guenter Roeck
  2025-07-24 14:54         ` Theodore Ts'o
  1 sibling, 0 replies; 44+ messages in thread
From: Guenter Roeck @ 2025-07-24 14:30 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Theodore Ts'o, Baokun Li, linux-ext4, adilger.kernel, jack,
	linux-kernel, ojaswin, julia.lawall, yangerkun, libaokun

On Thu, Jul 24, 2025 at 07:14:58PM +0800, Zhang Yi wrote:
> On 2025/7/24 12:54, Theodore Ts'o wrote:
> > On Wed, Jul 23, 2025 at 08:55:14PM -0700, Guenter Roeck wrote:
> >> Hi,
> >>
> >> On Mon, Jul 14, 2025 at 09:03:25PM +0800, Baokun Li wrote:
> >>> While traversing the list, holding a spin_lock prevents load_buddy, making
> >>> direct use of ext4_try_lock_group impossible. This can lead to a bouncing
> >>> scenario where spin_is_locked(grp_A) succeeds, but ext4_try_lock_group()
> >>> fails, forcing the list traversal to repeatedly restart from grp_A.
> >>>
> >>
> >> This patch causes crashes for pretty much every architecture when
> >> running unit tests as part of booting.
> > 
> > I'm assuming that you're using a randconfig that happened to enable
> > CONFIG_EXT4_KUNIT_TESTS=y.
> > 
> > A simpler reproducer is to have a .kunitconfig containing:
> > 
> > CONFIG_KUNIT=y
> > CONFIG_KUNIT_TEST=y
> > CONFIG_KUNIT_EXAMPLE_TEST=y
> > CONFIG_EXT4_KUNIT_TESTS=y
> > 
> > ... and then run "./tools/testing/kunit/kunit.py run".
> > 
> > The first failure is actually with [11/17] ext4: fix largest free
> > orders lists corruption on mb_optimize_scan switch, which triggers a
> > failure of test_mb_mark_used.
> > 
> > Baokun, can you take a look please?   Many thanks!
> > 
> 
> Hi Ted and Guenter,
> 
> I'm sorry for this regression, we didn't run these tests. Baokun is
> currently on a business trip, so I am helping to look into this issue.
> The reason for the failure is that the variable initialization in the
> mb unit tests is insufficient, but this series relies on those
> variables being initialized.
> 
> Could you please try the following diff? I have tested it on my
> machine, and the issue does not recur. If everything looks fine, I
> will send out the official patch.
> 

Confirmed to fix the problem. Please feel free to add

Tested-by: Guenter Roeck <linux@roeck-us.net>

Thanks,
Guenter

> Thanks,
> Yi.
> 
> 
> diff --git a/fs/ext4/mballoc-test.c b/fs/ext4/mballoc-test.c
> index d634c12f1984..a9416b20ff64 100644
> --- a/fs/ext4/mballoc-test.c
> +++ b/fs/ext4/mballoc-test.c
> @@ -155,6 +155,7 @@ static struct super_block *mbt_ext4_alloc_super_block(void)
>  	bgl_lock_init(sbi->s_blockgroup_lock);
> 
>  	sbi->s_es = &fsb->es;
> +	sbi->s_sb = sb;
>  	sb->s_fs_info = sbi;
> 
>  	up_write(&sb->s_umount);
> @@ -802,6 +803,8 @@ static void test_mb_mark_used(struct kunit *test)
>  	KUNIT_ASSERT_EQ(test, ret, 0);
> 
>  	grp->bb_free = EXT4_CLUSTERS_PER_GROUP(sb);
> +	grp->bb_largest_free_order = -1;
> +	grp->bb_avg_fragment_size_order = -1;
>  	mbt_generate_test_ranges(sb, ranges, TEST_RANGE_COUNT);
>  	for (i = 0; i < TEST_RANGE_COUNT; i++)
>  		test_mb_mark_used_range(test, &e4b, ranges[i].start,
> @@ -875,6 +878,8 @@ static void test_mb_free_blocks(struct kunit *test)
>  	ext4_unlock_group(sb, TEST_GOAL_GROUP);
> 
>  	grp->bb_free = 0;
> +	grp->bb_largest_free_order = -1;
> +	grp->bb_avg_fragment_size_order = -1;
>  	memset(bitmap, 0xff, sb->s_blocksize);
> 
>  	mbt_generate_test_ranges(sb, ranges, TEST_RANGE_COUNT);
> 
> 
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 15/17] ext4: convert free groups order lists to xarrays
  2025-07-24 11:14       ` Zhang Yi
  2025-07-24 14:30         ` Guenter Roeck
@ 2025-07-24 14:54         ` Theodore Ts'o
  2025-07-25  2:28           ` Zhang Yi
  1 sibling, 1 reply; 44+ messages in thread
From: Theodore Ts'o @ 2025-07-24 14:54 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Guenter Roeck, Baokun Li, linux-ext4, adilger.kernel, jack,
	linux-kernel, ojaswin, julia.lawall, yangerkun, libaokun

On Thu, Jul 24, 2025 at 07:14:58PM +0800, Zhang Yi wrote:
> 
> I'm sorry for this regression, we didn't run these tests.

No worries, I didn't run them either.

> Could you please try the following diff? I have tested it on my
> machine, and the issue does not recur. If everything looks fine, I
> will send out the official patch.

This patch fixes the test bug which was causing the failure of
test_new_blocks_simple.

However, there is still a test failure of test_mb_mark_used in the patch
series starting with bbe11dd13a3f ("ext4: fix largest free orders
lists corruption on mb_optimize_scan switch").  The test failure is
fixed by 458bfb991155 ("ext4: convert free groups order lists to
xarrays").  The reason why this is especially problematic is that the
commit which introduced the problem is marked as "cc: stable", which
means it will get backported to LTS kernels, thus introducing a
potential bug.

One of the advantages of unit tests is that they are lightweight
enough that it is tractable to run them against every commit in the
patch series.  So we should strive to add more unit tests, since it
makes it easier to detect regressions.

Anyway, here's the stack trace starting with "ext4: fix largest free
orders lists corruption on mb_optimize_scan switch".  Could you
investigate this failure?  Many thanks!!

						- Ted

[09:35:46] ==================== test_mb_mark_used  ====================
[09:35:46] [ERROR] Test: test_mb_mark_used: missing subtest result line!
[09:35:46] 
[09:35:46] Pid: 35, comm: kunit_try_catch Tainted: G        W        N  6.16.0-rc4-00031-gbbe11dd13a3f-dirty
[09:35:46] RIP: 0033:mb_set_largest_free_order+0x5c/0xc0
[09:35:46] RSP: 00000000a0883d98  EFLAGS: 00010206
[09:35:46] RAX: 0000000060aeaa28 RBX: 0000000060a2d400 RCX: 0000000000000008
[09:35:46] RDX: 0000000060aea9c0 RSI: 0000000000000000 RDI: 0000000060864000
[09:35:46] RBP: 0000000060aea9c0 R08: 0000000000000000 R09: 0000000060a2d400
[09:35:46] R10: 0000000000000400 R11: 0000000060a9cc00 R12: 0000000000000006
[09:35:46] R13: 0000000000000400 R14: 0000000000000305 R15: 0000000000000000
[09:35:46] Kernel panic - not syncing: Segfault with no mm
[09:35:46] CPU: 0 UID: 0 PID: 35 Comm: kunit_try_catch Tainted: G        W        N  6.16.0-rc4-00031-gbbe11dd13a3f-dirty #36 NONE
[09:35:46] Tainted: [W]=WARN, [N]=TEST
[09:35:46] Stack:
[09:35:46]  60210c60 00000200 60a9e400 00000400
[09:35:46]  40060300280 60864000 60a9cc00 60a2d400
[09:35:46]  00000400 60aea9c0 60a9cc00 60aea9c0
[09:35:46] Call Trace:
[09:35:46]  [<60210c60>] ? ext4_mb_generate_buddy+0x1f0/0x230
[09:35:46]  [<60215c3b>] ? test_mb_mark_used+0x28b/0x4e0
[09:35:46]  [<601df5bc>] ? ext4_get_group_desc+0xbc/0x150
[09:35:46]  [<600bf1c0>] ? ktime_get_ts64+0x0/0x190
[09:35:46]  [<60086370>] ? to_kthread+0x0/0x40
[09:35:46]  [<602b559b>] ? kunit_try_run_case+0x7b/0x100
[09:35:46]  [<60086370>] ? to_kthread+0x0/0x40
[09:35:46]  [<602b7850>] ? kunit_generic_run_threadfn_adapter+0x0/0x30
[09:35:46]  [<602b7862>] ? kunit_generic_run_threadfn_adapter+0x12/0x30
[09:35:46]  [<60086a51>] ? kthread+0xf1/0x250
[09:35:46]  [<6004a541>] ? new_thread_handler+0x41/0x60
[09:35:46] [ERROR] Test: test_mb_mark_used: 0 tests run!
[09:35:46] ============= [NO TESTS RUN] test_mb_mark_used =============

^ permalink raw reply	[flat|nested] 44+ messages in thread
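
As a practical way to follow the per-commit suggestion above, git can rerun
the KUnit suite after each commit of a local series, for example with

	git rebase -x './tools/testing/kunit/kunit.py run' <base-commit>

(an illustrative invocation: <base-commit> is whatever the series is based on,
and the .kunitconfig from the earlier mail is assumed to be in place). The
rebase should stop at the first commit whose tests fail, which is the
per-commit bisectability property being asked for here.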

* Re: [PATCH v3 15/17] ext4: convert free groups order lists to xarrays
  2025-07-24 14:54         ` Theodore Ts'o
@ 2025-07-25  2:28           ` Zhang Yi
  2025-07-26  0:50             ` Baokun Li
  0 siblings, 1 reply; 44+ messages in thread
From: Zhang Yi @ 2025-07-25  2:28 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Guenter Roeck, Baokun Li, linux-ext4, adilger.kernel, jack,
	linux-kernel, ojaswin, julia.lawall, yangerkun, libaokun

On 2025/7/24 22:54, Theodore Ts'o wrote:
> On Thu, Jul 24, 2025 at 07:14:58PM +0800, Zhang Yi wrote:
>>
>> I'm sorry for this regression, we didn't run these tests.
> 
> No worries, I didn't run them either.
> 
>> Could you please try the following diff? I have tested it on my
>> machine, and the issue does not recur. If everything looks fine, I
>> will send out the official patch.
> 
> This patch fixes the test bug which was causing the failure of
> test_new_blocks_simple.
> 

The official patch to fix test_new_blocks_simple for the next
branch:

https://lore.kernel.org/linux-ext4/20250725021550.3177573-1-yi.zhang@huaweicloud.com/

> However, there is still a test failure of test_mb_mark_used in the patch
> series starting with bbe11dd13a3f ("ext4: fix largest free orders
> lists corruption on mb_optimize_scan switch").  The test failure is
> fixed by 458bfb991155 ("ext4: convert free groups order lists to
> xarrays").  The reason why this is especially problematic is that the
> commit which introduced the problem is marked as "cc: stable", which
> means it will get backported to LTS kernels, thus introducing a
> potential bug.
> 

Indeed!

> One of the advantages of unit tests is that they are lightweight
> enough that it is tractable to run them against every commit in the
> patch series.  So we should strive to add more unit tests, since it
> makes it easier to detect regressions.
> 
> Anyway, here's the stack trace starting with "ext4: fix largest free
> orders lists corruption on mb_optimize_scan switch".  Could you
> investigate this failure?  Many thanks!!
> 

Sure! I've sent out the fix that applies to the kernel that has only
merged bbe11dd13a3f ("ext4: fix largest free orders lists corruption
on mb_optimize_scan switch"), but not merged 458bfb991155 ("ext4:
convert free groups order lists to xarrays"). Please give it a try.

https://lore.kernel.org/linux-ext4/20250725021654.3188798-1-yi.zhang@huaweicloud.com/

Best Regards,
Yi.

> 
> [09:35:46] ==================== test_mb_mark_used  ====================
> [09:35:46] [ERROR] Test: test_mb_mark_used: missing subtest result line!
> [09:35:46] 
> [09:35:46] Pid: 35, comm: kunit_try_catch Tainted: G        W        N  6.16.0-rc4-00031-gbbe11dd13a3f-dirty
> [09:35:46] RIP: 0033:mb_set_largest_free_order+0x5c/0xc0
> [09:35:46] RSP: 00000000a0883d98  EFLAGS: 00010206
> [09:35:46] RAX: 0000000060aeaa28 RBX: 0000000060a2d400 RCX: 0000000000000008
> [09:35:46] RDX: 0000000060aea9c0 RSI: 0000000000000000 RDI: 0000000060864000
> [09:35:46] RBP: 0000000060aea9c0 R08: 0000000000000000 R09: 0000000060a2d400
> [09:35:46] R10: 0000000000000400 R11: 0000000060a9cc00 R12: 0000000000000006
> [09:35:46] R13: 0000000000000400 R14: 0000000000000305 R15: 0000000000000000
> [09:35:46] Kernel panic - not syncing: Segfault with no mm
> [09:35:46] CPU: 0 UID: 0 PID: 35 Comm: kunit_try_catch Tainted: G        W        N  6.16.0-rc4-00031-gbbe11dd13a3f-dirty #36 NONE
> [09:35:46] Tainted: [W]=WARN, [N]=TEST
> [09:35:46] Stack:
> [09:35:46]  60210c60 00000200 60a9e400 00000400
> [09:35:46]  40060300280 60864000 60a9cc00 60a2d400
> [09:35:46]  00000400 60aea9c0 60a9cc00 60aea9c0
> [09:35:46] Call Trace:
> [09:35:46]  [<60210c60>] ? ext4_mb_generate_buddy+0x1f0/0x230
> [09:35:46]  [<60215c3b>] ? test_mb_mark_used+0x28b/0x4e0
> [09:35:46]  [<601df5bc>] ? ext4_get_group_desc+0xbc/0x150
> [09:35:46]  [<600bf1c0>] ? ktime_get_ts64+0x0/0x190
> [09:35:46]  [<60086370>] ? to_kthread+0x0/0x40
> [09:35:46]  [<602b559b>] ? kunit_try_run_case+0x7b/0x100
> [09:35:46]  [<60086370>] ? to_kthread+0x0/0x40
> [09:35:46]  [<602b7850>] ? kunit_generic_run_threadfn_adapter+0x0/0x30
> [09:35:46]  [<602b7862>] ? kunit_generic_run_threadfn_adapter+0x12/0x30
> [09:35:46]  [<60086a51>] ? kthread+0xf1/0x250
> [09:35:46]  [<6004a541>] ? new_thread_handler+0x41/0x60
> [09:35:46] [ERROR] Test: test_mb_mark_used: 0 tests run!
> [09:35:46] ============= [NO TESTS RUN] test_mb_mark_used =============
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 15/17] ext4: convert free groups order lists to xarrays
  2025-07-25  2:28           ` Zhang Yi
@ 2025-07-26  0:50             ` Baokun Li
  0 siblings, 0 replies; 44+ messages in thread
From: Baokun Li @ 2025-07-26  0:50 UTC (permalink / raw)
  To: Zhang Yi, Theodore Ts'o
  Cc: Guenter Roeck, linux-ext4, adilger.kernel, jack, linux-kernel,
	ojaswin, julia.lawall, yangerkun, libaokun

On 7/25/2025 10:28 AM, Zhang Yi wrote:
> On 2025/7/24 22:54, Theodore Ts'o wrote:
>> On Thu, Jul 24, 2025 at 07:14:58PM +0800, Zhang Yi wrote:
>>> I'm sorry for this regression, we didn't run these tests.
>> No worries, I didn't run them either.
>>
>>> Could you please try the following diff? I have tested it on my
>>> machine, and the issue does not recur. If everything looks fine, I
>>> will send out the official patch.
>> This patch fixes the test bug which was causing the failure of
>> test_new_blocks_simple.
>>
> The official patch to fix test_new_blocks_simple for the next
> branch:
>
> https://lore.kernel.org/linux-ext4/20250725021550.3177573-1-yi.zhang@huaweicloud.com/
>
>> However, there is still a test failure of test_mb_mark_used in the patch
>> series starting with bbe11dd13a3f ("ext4: fix largest free orders
>> lists corruption on mb_optimize_scan switch").  The test failure is
>> fixed by 458bfb991155 ("ext4: convert free groups order lists to
>> xarrays").  The reason why this is especially problematic is that the
>> commit which introduced the problem is marked as "cc: stable", which
>> means it will get backported to LTS kernels, thus introducing a
>> potential bug.
>>
> Indeed!
>
>> One of the advantages of unit tests is that they are lightweight
>> enough that it is tractable to run them against every commit in the
>> patch series.  So we should strive to add more unit tests, since it
>> makes it easier to detect regressions.
>>
>> Anyway, here's the stack trace starting with "ext4: fix largest free
>> orders lists corruption on mb_optimize_scan switch".  Could you
>> investigate this failure?  Many thanks!!
>>
> Sure! I've sent out the fix that applies to the kernel that has only
> merged bbe11dd13a3f ("ext4: fix largest free orders lists corruption
> on mb_optimize_scan switch"), but not merged 458bfb991155 ("ext4:
> convert free groups order lists to xarrays"). Please give it a try.
>
> https://lore.kernel.org/linux-ext4/20250725021654.3188798-1-yi.zhang@huaweicloud.com/
>
Sorry for the late reply; I haven't had time to look into this this week.
I really appreciate Yi taking the time to help address these issues.
I'm also very sorry for introducing a regression in the ext4 kunit tests.


Thanks,
Baokun

>
>> [09:35:46] ==================== test_mb_mark_used  ====================
>> [09:35:46] [ERROR] Test: test_mb_mark_used: missing subtest result line!
>> [09:35:46]
>> [09:35:46] Pid: 35, comm: kunit_try_catch Tainted: G        W        N  6.16.0-rc4-00031-gbbe11dd13a3f-dirty
>> [09:35:46] RIP: 0033:mb_set_largest_free_order+0x5c/0xc0
>> [09:35:46] RSP: 00000000a0883d98  EFLAGS: 00010206
>> [09:35:46] RAX: 0000000060aeaa28 RBX: 0000000060a2d400 RCX: 0000000000000008
>> [09:35:46] RDX: 0000000060aea9c0 RSI: 0000000000000000 RDI: 0000000060864000
>> [09:35:46] RBP: 0000000060aea9c0 R08: 0000000000000000 R09: 0000000060a2d400
>> [09:35:46] R10: 0000000000000400 R11: 0000000060a9cc00 R12: 0000000000000006
>> [09:35:46] R13: 0000000000000400 R14: 0000000000000305 R15: 0000000000000000
>> [09:35:46] Kernel panic - not syncing: Segfault with no mm
>> [09:35:46] CPU: 0 UID: 0 PID: 35 Comm: kunit_try_catch Tainted: G        W        N  6.16.0-rc4-00031-gbbe11dd13a3f-dirty #36 NONE
>> [09:35:46] Tainted: [W]=WARN, [N]=TEST
>> [09:35:46] Stack:
>> [09:35:46]  60210c60 00000200 60a9e400 00000400
>> [09:35:46]  40060300280 60864000 60a9cc00 60a2d400
>> [09:35:46]  00000400 60aea9c0 60a9cc00 60aea9c0
>> [09:35:46] Call Trace:
>> [09:35:46]  [<60210c60>] ? ext4_mb_generate_buddy+0x1f0/0x230
>> [09:35:46]  [<60215c3b>] ? test_mb_mark_used+0x28b/0x4e0
>> [09:35:46]  [<601df5bc>] ? ext4_get_group_desc+0xbc/0x150
>> [09:35:46]  [<600bf1c0>] ? ktime_get_ts64+0x0/0x190
>> [09:35:46]  [<60086370>] ? to_kthread+0x0/0x40
>> [09:35:46]  [<602b559b>] ? kunit_try_run_case+0x7b/0x100
>> [09:35:46]  [<60086370>] ? to_kthread+0x0/0x40
>> [09:35:46]  [<602b7850>] ? kunit_generic_run_threadfn_adapter+0x0/0x30
>> [09:35:46]  [<602b7862>] ? kunit_generic_run_threadfn_adapter+0x12/0x30
>> [09:35:46]  [<60086a51>] ? kthread+0xf1/0x250
>> [09:35:46]  [<6004a541>] ? new_thread_handler+0x41/0x60
>> [09:35:46] [ERROR] Test: test_mb_mark_used: 0 tests run!
>> [09:35:46] ============= [NO TESTS RUN] test_mb_mark_used =============
>>


^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2025-07-26  0:50 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-07-14 13:03 [PATCH v3 00/17] ext4: better scalability for ext4 block allocation Baokun Li
2025-07-14 13:03 ` [PATCH v3 01/17] ext4: add ext4_try_lock_group() to skip busy groups Baokun Li
2025-07-17 10:09   ` Ojaswin Mujoo
2025-07-19  0:37     ` Baokun Li
2025-07-17 22:28   ` Andi Kleen
2025-07-18  3:09     ` Theodore Ts'o
2025-07-19  0:29     ` Baokun Li
2025-07-22 20:59       ` Andi Kleen
2025-07-14 13:03 ` [PATCH v3 02/17] ext4: separate stream goal hits from s_bal_goals for better tracking Baokun Li
2025-07-17 10:29   ` Ojaswin Mujoo
2025-07-19  1:37     ` Baokun Li
2025-07-14 13:03 ` [PATCH v3 03/17] ext4: remove unnecessary s_mb_last_start Baokun Li
2025-07-17 10:31   ` Ojaswin Mujoo
2025-07-14 13:03 ` [PATCH v3 04/17] ext4: remove unnecessary s_md_lock on update s_mb_last_group Baokun Li
2025-07-17 13:36   ` Ojaswin Mujoo
2025-07-19  1:54     ` Baokun Li
2025-07-14 13:03 ` [PATCH v3 05/17] ext4: utilize multiple global goals to reduce contention Baokun Li
2025-07-14 13:03 ` [PATCH v3 06/17] ext4: get rid of some obsolete EXT4_MB_HINT flags Baokun Li
2025-07-14 13:03 ` [PATCH v3 07/17] ext4: fix typo in CR_GOAL_LEN_SLOW comment Baokun Li
2025-07-14 13:03 ` [PATCH v3 08/17] ext4: convert sbi->s_mb_free_pending to atomic_t Baokun Li
2025-07-14 13:03 ` [PATCH v3 09/17] ext4: merge freed extent with existing extents before insertion Baokun Li
2025-07-14 13:03 ` [PATCH v3 10/17] ext4: fix zombie groups in average fragment size lists Baokun Li
2025-07-14 13:03 ` [PATCH v3 11/17] ext4: fix largest free orders lists corruption on mb_optimize_scan switch Baokun Li
2025-07-14 13:03 ` [PATCH v3 12/17] ext4: factor out __ext4_mb_scan_group() Baokun Li
2025-07-14 13:03 ` [PATCH v3 13/17] ext4: factor out ext4_mb_might_prefetch() Baokun Li
2025-07-14 13:03 ` [PATCH v3 14/17] ext4: factor out ext4_mb_scan_group() Baokun Li
2025-07-14 13:03 ` [PATCH v3 15/17] ext4: convert free groups order lists to xarrays Baokun Li
2025-07-21 11:07   ` Jan Kara
2025-07-21 12:33     ` Baokun Li
2025-07-21 13:45       ` Baokun Li
2025-07-21 18:01         ` Theodore Ts'o
2025-07-22  5:58           ` Baokun Li
2025-07-24  3:55   ` Guenter Roeck
2025-07-24  4:54     ` Theodore Ts'o
2025-07-24  5:20       ` Guenter Roeck
2025-07-24 11:14       ` Zhang Yi
2025-07-24 14:30         ` Guenter Roeck
2025-07-24 14:54         ` Theodore Ts'o
2025-07-25  2:28           ` Zhang Yi
2025-07-26  0:50             ` Baokun Li
2025-07-14 13:03 ` [PATCH v3 16/17] ext4: refactor choose group to scan group Baokun Li
2025-07-14 13:03 ` [PATCH v3 17/17] ext4: implement linear-like traversal across order xarrays Baokun Li
2025-07-15  1:11 ` [PATCH v3 00/17] ext4: better scalability for ext4 block allocation Zhang Yi
2025-07-19 21:45 ` Theodore Ts'o

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).