public inbox for linux-ext4@vger.kernel.org
* [PATCH v2 00/16] ext4: better scalability for ext4 block allocation
@ 2025-06-23  7:32 Baokun Li
  2025-06-23  7:32 ` [PATCH v2 01/16] ext4: add ext4_try_lock_group() to skip busy groups Baokun Li
                   ` (15 more replies)
  0 siblings, 16 replies; 51+ messages in thread
From: Baokun Li @ 2025-06-23  7:32 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, jack, adilger.kernel, ojaswin, linux-kernel, yi.zhang,
	yangerkun, libaokun1

Changes since v1:
 * Patch 1: Prioritize checking if a group is busy to avoid unnecessary
       checks and buddy loading. (Thanks to Ojaswin for the suggestion!)
 * Patch 4: Use multiple global goals instead of moving the goal to the
       inode level. (Thanks to Honza for the suggestion!)
 * Collected Reviewed-by tags from Jan Kara and Ojaswin Mujoo. (Thanks
       for your reviews!)
 * Added patches 2, 3, and 7-16.
 * Refreshed the relevant test data due to a change of test server.

v1: https://lore.kernel.org/r/20250523085821.1329392-1-libaokun@huaweicloud.com

Since servers have more and more CPUs, and we're running more containers
on them, we've been using will-it-scale to test how well ext4 scales. The
fallocate2 test (append 8KB at a time until the file reaches 1MB,
truncate to 0, repeat), run concurrently in 64 containers, revealed
significant contention during block allocation and freeing, leading to
much lower aggregate fallocate OPS compared to a single container (see
the table below).

containers |    1   |    2   |    4   |    8   |   16   |   32   |   64
-----------|--------|--------|--------|--------|--------|--------|-------
ops/sec    | 295287 |  70665 |  33865 |  19387 |  10104 |   5588 |   3588
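
For reference, each container's workload is essentially the loop below.
This is a minimal illustrative sketch, not the actual will-it-scale
fallocate2 source, which differs in details such as timing, task
placement, and result reporting:

```c
/* Hypothetical stand-alone sketch of the fallocate2-style workload. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

int main(void)
{
	int fd = open("testfile", O_CREAT | O_RDWR | O_TRUNC, 0644);

	if (fd < 0)
		return 1;

	for (int iter = 0; iter < 100000; iter++) {
		/* Append 8KB at a time until the file reaches 1MB. */
		for (off_t off = 0; off < 1 << 20; off += 8192)
			if (fallocate(fd, 0, off, 8192) < 0)
				return 1;
		/* Truncate back to 0 so all the blocks are freed again. */
		if (ftruncate(fd, 0) < 0)
			return 1;
	}

	close(fd);
	return 0;
}
```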

Under this test scenario, the primary operations are block allocation
(fallocate) and block deallocation (truncate). The main bottlenecks for
these operations are the group lock and s_md_lock. Therefore, this patch
series primarily focuses on optimizing the code related to these two locks.

The following is a brief overview of the patches, see the patches for
more details.

Patch 1: Add ext4_try_lock_group() to skip busy groups, taking advantage
of the large number of ext4 block groups.

Patches 2-4: Split stream allocation's global goal into multiple goals and
protect them with memory barriers instead of the expensive s_md_lock.

Patches 5-6: Minor cleanups.

Patch 7: Convert s_mb_free_pending to atomic_t and use memory barriers
for consistency, instead of relying on the expensive s_md_lock.

Patch 8: When inserting free extents, attempt to merge them with already
inserted extents first, to reduce s_md_lock contention.

Patch 9: Update bb_avg_fragment_size_order to -1 when a group runs out of
free blocks, eliminating efficiency-impacting "zombie groups".

Patch 10: Fix potential largest free orders list corruption when the
mb_optimize_scan mount option is switched on or off.

Patches 11-16: Convert mb_optimize_scan's existing unordered list traversal
to an ordered xarray, thereby reducing contention between block allocation
and freeing, similar to linear traversal (a rough illustrative sketch of
the idea follows below).
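
The sketch below is purely illustrative; the function and parameter names
(demo_scan_free_groups and friends) are hypothetical, and the real patches
differ (per-order xarrays, allocation criteria handling, and so on). It
only shows why an xarray indexed by group number allows an ordered,
wrap-around traversal much like a linear scan:

```c
#include <linux/xarray.h>
#include <linux/types.h>
#include <linux/errno.h>

/* Illustrative only: walk free groups in ascending order from 'goal'. */
static int demo_scan_free_groups(struct xarray *free_groups,
				 unsigned long goal, unsigned long ngroups,
				 bool (*try_group)(void *grp))
{
	unsigned long i;
	void *grp;

	/* First pass: from the goal group up to the last group. */
	xa_for_each_range(free_groups, i, grp, goal, ngroups - 1)
		if (try_group(grp))
			return 0;

	/* Second pass: wrap around to the groups before the goal. */
	if (goal) {
		xa_for_each_range(free_groups, i, grp, 0, goal - 1)
			if (try_group(grp))
				return 0;
	}

	return -ENOSPC;
}
```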

"kvm-xfstests -c ext4/all -g auto" has been executed with no new failures.

Here are some performance test data for your reference:

Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.

CPU: Kunpeng 920   |          P80            |            P1           |
Memory: 512GB      |-------------------------|-------------------------|
Disk: 960GB SSD    | base  |    patched      | base  |    patched      |
-------------------|-------|-----------------|-------|-----------------|
mb_optimize_scan=0 | 2667  | 20619  (+673.1%)| 314065| 299238 (-4.7%)  |
mb_optimize_scan=1 | 2643  | 20119  (+661.2%)| 316344| 315268 (-0.3%)  |

CPU: AMD 9654 * 2  |          P96            |            P1           |
Memory: 1536GB     |-------------------------|-------------------------|
Disk: 960GB SSD    | base  |    patched      | base  |    patched      |
-------------------|-------|-----------------|-------|-----------------|
mb_optimize_scan=0 | 3450  | 51983 (+1406.7%)| 205851| 207033 (+0.5%)  |
mb_optimize_scan=1 | 3209  | 48486 (+1410.9%)| 207373| 202415 (-2.3%)  |

Tests also evaluated this patch set's impact on fragmentation: a minor
increase in free space fragmentation for multi-process workloads, but a
significant decrease in file fragmentation:

Test Script:
```shell
#!/bin/bash

dir="/tmp/test"
disk="/dev/sda"

mkdir -p $dir

for scan in 0 1 ; do
    mkfs.ext4 -F -E lazy_itable_init=0,lazy_journal_init=0 \
              -O orphan_file $disk 200G
    mount -o mb_optimize_scan=$scan $disk $dir

    fio -directory=$dir -direct=1 -iodepth 128 -thread -ioengine=falloc \
        -rw=write -bs=4k -fallocate=none -numjobs=64 -file_append=1 \
        -size=1G -group_reporting -name=job1 -cpus_allowed_policy=split

    e2freefrag $disk
    e4defrag -c $dir # Without the patch, this could take 5-6 hours.
    filefrag ${dir}/job* | awk '{print $2}' | \
                           awk '{sum+=$1} END {print sum/NR}'
    umount $dir
done
```

Test results:
-------------------------------------------------------------|
                         |       base      |      patched    |
-------------------------|--------|--------|--------|--------|
mb_optimize_scan         | linear |opt_scan| linear |opt_scan|
-------------------------|--------|--------|--------|--------|
bw(MiB/s)                | 217    | 217    | 5718   | 5626   |
-------------------------|-----------------------------------|
Avg. free extent size(KB)| 1943732| 1943732| 1316212| 1171208|
Num. free extent         | 71     | 71     | 105    | 118    |
-------------------------------------------------------------|
Avg. extents per file    | 261967 | 261973 | 588    | 570    |
Avg. size per extent(KB) | 4      | 4      | 1780   | 1837   |
Fragmentation score      | 100    | 100    | 2      | 2      |
-------------------------------------------------------------| 

Comments and questions are, as always, welcome.

Thanks,
Baokun

Baokun Li (16):
  ext4: add ext4_try_lock_group() to skip busy groups
  ext4: remove unnecessary s_mb_last_start
  ext4: remove unnecessary s_md_lock on update s_mb_last_group
  ext4: utilize multiple global goals to reduce contention
  ext4: get rid of some obsolete EXT4_MB_HINT flags
  ext4: fix typo in CR_GOAL_LEN_SLOW comment
  ext4: convert sbi->s_mb_free_pending to atomic_t
  ext4: merge freed extent with existing extents before insertion
  ext4: fix zombie groups in average fragment size lists
  ext4: fix largest free orders lists corruption on mb_optimize_scan
    switch
  ext4: factor out __ext4_mb_scan_group()
  ext4: factor out ext4_mb_might_prefetch()
  ext4: factor out ext4_mb_scan_group()
  ext4: convert free group lists to ordered xarrays
  ext4: refactor choose group to scan group
  ext4: ensure global ordered traversal across all free groups xarrays

 fs/ext4/balloc.c            |   2 +-
 fs/ext4/ext4.h              |  45 +-
 fs/ext4/mballoc.c           | 898 +++++++++++++++++++++---------------
 fs/ext4/mballoc.h           |  18 +-
 include/trace/events/ext4.h |   3 -
 5 files changed, 553 insertions(+), 413 deletions(-)

-- 
2.46.1


^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH v2 01/16] ext4: add ext4_try_lock_group() to skip busy groups
  2025-06-23  7:32 [PATCH v2 00/16] ext4: better scalability for ext4 block allocation Baokun Li
@ 2025-06-23  7:32 ` Baokun Li
  2025-06-27 18:06   ` Jan Kara
  2025-07-14  6:53   ` Ojaswin Mujoo
  2025-06-23  7:32 ` [PATCH v2 02/16] ext4: remove unnecessary s_mb_last_start Baokun Li
                   ` (14 subsequent siblings)
  15 siblings, 2 replies; 51+ messages in thread
From: Baokun Li @ 2025-06-23  7:32 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, jack, adilger.kernel, ojaswin, linux-kernel, yi.zhang,
	yangerkun, libaokun1

When ext4 allocates blocks, we used to just go through the block groups
one by one to find a good one. But when there are tons of block groups
(like hundreds of thousands or even millions) and not many have free space
(meaning they're mostly full), it takes a really long time to check them
all, and performance gets bad. So, we added the "mb_optimize_scan" mount
option (which is on by default now). It maintains lists of groups ordered
by largest free order and by average fragment size, so when we need a free
block we can just grab a likely group from the appropriate list. This
saves time and makes block allocation much faster.

But when multiple processes or containers are doing similar things, like
constantly allocating 8k blocks, they all try to use the same block group
in the same list. Even just two processes doing this can cut the IOPS in
half. For example, one container might do 300,000 IOPS, but if you run two
at the same time, the total is only 150,000.

Since block groups can already be scanned in a non-linear order, the first
and last groups in the same list are effectively equivalent candidates for
an allocation. Therefore, add an ext4_try_lock_group() helper function to
skip the current group when it is locked by another process, thereby
avoiding contention with other processes. This helps ext4 make better use
of having multiple block groups.

Also, to make sure we don't end up skipping every group that still has
free space, busy groups are no longer skipped once ac_criteria reaches
CR_ANY_FREE.

Performance test data follows:

Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.

                   | Kunpeng 920 / 512GB -P80| AMD 9654 / 1536GB -P96  |
 Disk: 960GB SSD   |-------------------------|-------------------------|
                   | base  |    patched      | base  |    patched      |
-------------------|-------|-----------------|-------|-----------------|
mb_optimize_scan=0 | 2667  | 4821  (+80.7%)  | 3450  | 15371 (+345%)   |
mb_optimize_scan=1 | 2643  | 4784  (+81.0%)  | 3209  | 6101  (+90.0%)  |

Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/ext4.h    | 23 ++++++++++++++---------
 fs/ext4/mballoc.c | 19 ++++++++++++++++---
 2 files changed, 30 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 18373de980f2..9df74123e7e6 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3541,23 +3541,28 @@ static inline int ext4_fs_is_busy(struct ext4_sb_info *sbi)
 	return (atomic_read(&sbi->s_lock_busy) > EXT4_CONTENTION_THRESHOLD);
 }
 
+static inline bool ext4_try_lock_group(struct super_block *sb, ext4_group_t group)
+{
+	if (!spin_trylock(ext4_group_lock_ptr(sb, group)))
+		return false;
+	/*
+	 * We're able to grab the lock right away, so drop the lock
+	 * contention counter.
+	 */
+	atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, -1, 0);
+	return true;
+}
+
 static inline void ext4_lock_group(struct super_block *sb, ext4_group_t group)
 {
-	spinlock_t *lock = ext4_group_lock_ptr(sb, group);
-	if (spin_trylock(lock))
-		/*
-		 * We're able to grab the lock right away, so drop the
-		 * lock contention counter.
-		 */
-		atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, -1, 0);
-	else {
+	if (!ext4_try_lock_group(sb, group)) {
 		/*
 		 * The lock is busy, so bump the contention counter,
 		 * and then wait on the spin lock.
 		 */
 		atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, 1,
 				  EXT4_MAX_CONTENTION);
-		spin_lock(lock);
+		spin_lock(ext4_group_lock_ptr(sb, group));
 	}
 }
 
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 1e98c5be4e0a..336d65c4f6a2 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -896,7 +896,8 @@ static void ext4_mb_choose_next_group_p2_aligned(struct ext4_allocation_context
 				    bb_largest_free_order_node) {
 			if (sbi->s_mb_stats)
 				atomic64_inc(&sbi->s_bal_cX_groups_considered[CR_POWER2_ALIGNED]);
-			if (likely(ext4_mb_good_group(ac, iter->bb_group, CR_POWER2_ALIGNED))) {
+			if (!spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group)) &&
+			    likely(ext4_mb_good_group(ac, iter->bb_group, CR_POWER2_ALIGNED))) {
 				*group = iter->bb_group;
 				ac->ac_flags |= EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED;
 				read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
@@ -932,7 +933,8 @@ ext4_mb_find_good_group_avg_frag_lists(struct ext4_allocation_context *ac, int o
 	list_for_each_entry(iter, frag_list, bb_avg_fragment_size_node) {
 		if (sbi->s_mb_stats)
 			atomic64_inc(&sbi->s_bal_cX_groups_considered[cr]);
-		if (likely(ext4_mb_good_group(ac, iter->bb_group, cr))) {
+		if (!spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group)) &&
+		    likely(ext4_mb_good_group(ac, iter->bb_group, cr))) {
 			grp = iter;
 			break;
 		}
@@ -2899,6 +2901,11 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 							nr, &prefetch_ios);
 			}
 
+			/* prevent unnecessary buddy loading. */
+			if (cr < CR_ANY_FREE &&
+			    spin_is_locked(ext4_group_lock_ptr(sb, group)))
+				continue;
+
 			/* This now checks without needing the buddy page */
 			ret = ext4_mb_good_group_nolock(ac, group, cr);
 			if (ret <= 0) {
@@ -2911,7 +2918,13 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 			if (err)
 				goto out;
 
-			ext4_lock_group(sb, group);
+			/* skip busy group */
+			if (cr >= CR_ANY_FREE) {
+				ext4_lock_group(sb, group);
+			} else if (!ext4_try_lock_group(sb, group)) {
+				ext4_mb_unload_buddy(&e4b);
+				continue;
+			}
 
 			/*
 			 * We need to check again after locking the
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 02/16] ext4: remove unnecessary s_mb_last_start
  2025-06-23  7:32 [PATCH v2 00/16] ext4: better scalability for ext4 block allocation Baokun Li
  2025-06-23  7:32 ` [PATCH v2 01/16] ext4: add ext4_try_lock_group() to skip busy groups Baokun Li
@ 2025-06-23  7:32 ` Baokun Li
  2025-06-27 18:15   ` Jan Kara
  2025-06-23  7:32 ` [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group Baokun Li
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 51+ messages in thread
From: Baokun Li @ 2025-06-23  7:32 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, jack, adilger.kernel, ojaswin, linux-kernel, yi.zhang,
	yangerkun, libaokun1

ac->ac_g_ex.fe_start is only used in ext4_mb_find_by_goal(), but stream
allocation only kicks in after ext4_mb_find_by_goal() fails, so there is
no need to update ac->ac_g_ex.fe_start. Remove the now unnecessary
s_mb_last_start.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/ext4.h    | 1 -
 fs/ext4/mballoc.c | 2 --
 2 files changed, 3 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9df74123e7e6..cfb60f8fbb63 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1631,7 +1631,6 @@ struct ext4_sb_info {
 	unsigned int s_max_dir_size_kb;
 	/* where last allocation was done - for stream allocation */
 	unsigned long s_mb_last_group;
-	unsigned long s_mb_last_start;
 	unsigned int s_mb_prefetch;
 	unsigned int s_mb_prefetch_limit;
 	unsigned int s_mb_best_avail_max_trim_order;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 336d65c4f6a2..5cdae3bda072 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2171,7 +2171,6 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
 	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
 		spin_lock(&sbi->s_md_lock);
 		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
-		sbi->s_mb_last_start = ac->ac_f_ex.fe_start;
 		spin_unlock(&sbi->s_md_lock);
 	}
 	/*
@@ -2849,7 +2848,6 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 		/* TBD: may be hot point */
 		spin_lock(&sbi->s_md_lock);
 		ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
-		ac->ac_g_ex.fe_start = sbi->s_mb_last_start;
 		spin_unlock(&sbi->s_md_lock);
 	}
 
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group
  2025-06-23  7:32 [PATCH v2 00/16] ext4: better scalability for ext4 block allocation Baokun Li
  2025-06-23  7:32 ` [PATCH v2 01/16] ext4: add ext4_try_lock_group() to skip busy groups Baokun Li
  2025-06-23  7:32 ` [PATCH v2 02/16] ext4: remove unnecessary s_mb_last_start Baokun Li
@ 2025-06-23  7:32 ` Baokun Li
  2025-06-27 18:19   ` Jan Kara
  2025-07-01  2:57   ` kernel test robot
  2025-06-23  7:32 ` [PATCH v2 04/16] ext4: utilize multiple global goals to reduce contention Baokun Li
                   ` (12 subsequent siblings)
  15 siblings, 2 replies; 51+ messages in thread
From: Baokun Li @ 2025-06-23  7:32 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, jack, adilger.kernel, ojaswin, linux-kernel, yi.zhang,
	yangerkun, libaokun1

After we optimized the block group lock, we found another lock
contention issue when running will-it-scale/fallocate2 with multiple
processes. fallocate's block allocation and truncate's block release were
fighting over s_md_lock. The problem is that this lock protects totally
different things on those two paths: the list of freed data blocks
(s_freed_data_list) when releasing, and where to start looking for new
blocks (s_mb_last_group) when allocating.

Now we only need to track s_mb_last_group and no longer need to track
s_mb_last_start, so s_md_lock is no longer needed to keep the two
consistent; using smp_store_release()/smp_load_acquire() is enough to
ensure that reads of s_mb_last_group are up to date.

Besides, the s_mb_last_group data type only requires ext4_group_t
(i.e., unsigned int), rendering unsigned long superfluous.

Performance test data follows:

Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.

                   | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
 Disk: 960GB SSD   |-------------------------|-------------------------|
                   | base  |    patched      | base  |    patched      |
-------------------|-------|-----------------|-------|-----------------|
mb_optimize_scan=0 | 4821  | 7612  (+57.8%)  | 15371 | 21647 (+40.8%)  |
mb_optimize_scan=1 | 4784  | 7568  (+58.1%)  | 6101  | 9117  (+49.4%)  |

Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/ext4.h    |  2 +-
 fs/ext4/mballoc.c | 17 ++++++-----------
 2 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index cfb60f8fbb63..93f03d8c3dca 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1630,7 +1630,7 @@ struct ext4_sb_info {
 	unsigned int s_mb_group_prealloc;
 	unsigned int s_max_dir_size_kb;
 	/* where last allocation was done - for stream allocation */
-	unsigned long s_mb_last_group;
+	ext4_group_t s_mb_last_group;
 	unsigned int s_mb_prefetch;
 	unsigned int s_mb_prefetch_limit;
 	unsigned int s_mb_best_avail_max_trim_order;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 5cdae3bda072..3f103919868b 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2168,11 +2168,9 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
 	ac->ac_buddy_folio = e4b->bd_buddy_folio;
 	folio_get(ac->ac_buddy_folio);
 	/* store last allocated for subsequent stream allocation */
-	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
-		spin_lock(&sbi->s_md_lock);
-		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
-		spin_unlock(&sbi->s_md_lock);
-	}
+	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
+		/* pairs with smp_load_acquire in ext4_mb_regular_allocator() */
+		smp_store_release(&sbi->s_mb_last_group, ac->ac_f_ex.fe_group);
 	/*
 	 * As we've just preallocated more space than
 	 * user requested originally, we store allocated
@@ -2844,12 +2842,9 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 	}
 
 	/* if stream allocation is enabled, use global goal */
-	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
-		/* TBD: may be hot point */
-		spin_lock(&sbi->s_md_lock);
-		ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
-		spin_unlock(&sbi->s_md_lock);
-	}
+	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
+		/* pairs with smp_store_release in ext4_mb_use_best_found() */
+		ac->ac_g_ex.fe_group = smp_load_acquire(&sbi->s_mb_last_group);
 
 	/*
 	 * Let's just scan groups to find more-less suitable blocks We
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 04/16] ext4: utilize multiple global goals to reduce contention
  2025-06-23  7:32 [PATCH v2 00/16] ext4: better scalability for ext4 block allocation Baokun Li
                   ` (2 preceding siblings ...)
  2025-06-23  7:32 ` [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group Baokun Li
@ 2025-06-23  7:32 ` Baokun Li
  2025-06-27 18:31   ` Jan Kara
  2025-06-23  7:32 ` [PATCH v2 05/16] ext4: get rid of some obsolete EXT4_MB_HINT flags Baokun Li
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 51+ messages in thread
From: Baokun Li @ 2025-06-23  7:32 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, jack, adilger.kernel, ojaswin, linux-kernel, yi.zhang,
	yangerkun, libaokun1

When allocating data blocks, if the first try (goal allocation) fails and
stream allocation is on, it tries a global goal starting from the last
group we used (s_mb_last_group). This helps cluster large files together
to reduce free space fragmentation, and the data block contiguity also
accelerates write-back to disk.

However, when multiple processes allocate blocks, having just one global
goal means they all fight over the same group. This drastically lowers
the chances of extents merging and leads to much worse file fragmentation.

To mitigate this multi-process contention, we now employ multiple global
goals, with the number of goals being the CPU count rounded up to the
nearest power of 2. To ensure a consistent goal for each inode, we select
the corresponding goal by taking the inode number modulo the total number
of goals.

Performance test data follows:

Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.

                   | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
 Disk: 960GB SSD   |-------------------------|-------------------------|
                   | base  |    patched      | base  |    patched      |
-------------------|-------|-----------------|-------|-----------------|
mb_optimize_scan=0 | 7612  | 19699 (+158%)   | 21647 | 53093 (+145%)   |
mb_optimize_scan=1 | 7568  | 9862  (+30.3%)  | 9117  | 14401 (+57.9%)  |

Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/ext4.h    |  2 +-
 fs/ext4/mballoc.c | 31 ++++++++++++++++++++++++-------
 fs/ext4/mballoc.h |  9 +++++++++
 3 files changed, 34 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 93f03d8c3dca..c3f16aba7b79 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1630,7 +1630,7 @@ struct ext4_sb_info {
 	unsigned int s_mb_group_prealloc;
 	unsigned int s_max_dir_size_kb;
 	/* where last allocation was done - for stream allocation */
-	ext4_group_t s_mb_last_group;
+	ext4_group_t *s_mb_last_groups;
 	unsigned int s_mb_prefetch;
 	unsigned int s_mb_prefetch_limit;
 	unsigned int s_mb_best_avail_max_trim_order;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 3f103919868b..216b332a5054 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2168,9 +2168,12 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
 	ac->ac_buddy_folio = e4b->bd_buddy_folio;
 	folio_get(ac->ac_buddy_folio);
 	/* store last allocated for subsequent stream allocation */
-	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
-		/* pairs with smp_load_acquire in ext4_mb_regular_allocator() */
-		smp_store_release(&sbi->s_mb_last_group, ac->ac_f_ex.fe_group);
+	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
+		int hash = ac->ac_inode->i_ino % MB_LAST_GROUPS;
+		/* Pairs with smp_load_acquire in ext4_mb_regular_allocator() */
+		smp_store_release(&sbi->s_mb_last_groups[hash],
+				  ac->ac_f_ex.fe_group);
+	}
 	/*
 	 * As we've just preallocated more space than
 	 * user requested originally, we store allocated
@@ -2842,9 +2845,12 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 	}
 
 	/* if stream allocation is enabled, use global goal */
-	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
-		/* pairs with smp_store_release in ext4_mb_use_best_found() */
-		ac->ac_g_ex.fe_group = smp_load_acquire(&sbi->s_mb_last_group);
+	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
+		int hash = ac->ac_inode->i_ino % MB_LAST_GROUPS;
+		/* Pairs with smp_store_release in ext4_mb_use_best_found() */
+		ac->ac_g_ex.fe_group = smp_load_acquire(
+						&sbi->s_mb_last_groups[hash]);
+	}
 
 	/*
 	 * Let's just scan groups to find more-less suitable blocks We
@@ -3715,10 +3721,17 @@ int ext4_mb_init(struct super_block *sb)
 			sbi->s_mb_group_prealloc, EXT4_NUM_B2C(sbi, sbi->s_stripe));
 	}
 
+	sbi->s_mb_last_groups = kcalloc(MB_LAST_GROUPS, sizeof(ext4_group_t),
+					GFP_KERNEL);
+	if (sbi->s_mb_last_groups == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
 	sbi->s_locality_groups = alloc_percpu(struct ext4_locality_group);
 	if (sbi->s_locality_groups == NULL) {
 		ret = -ENOMEM;
-		goto out;
+		goto out_free_last_groups;
 	}
 	for_each_possible_cpu(i) {
 		struct ext4_locality_group *lg;
@@ -3743,6 +3756,9 @@ int ext4_mb_init(struct super_block *sb)
 out_free_locality_groups:
 	free_percpu(sbi->s_locality_groups);
 	sbi->s_locality_groups = NULL;
+out_free_last_groups:
+	kvfree(sbi->s_mb_last_groups);
+	sbi->s_mb_last_groups = NULL;
 out:
 	kfree(sbi->s_mb_avg_fragment_size);
 	kfree(sbi->s_mb_avg_fragment_size_locks);
@@ -3847,6 +3863,7 @@ void ext4_mb_release(struct super_block *sb)
 	}
 
 	free_percpu(sbi->s_locality_groups);
+	kvfree(sbi->s_mb_last_groups);
 }
 
 static inline int ext4_issue_discard(struct super_block *sb,
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index f8280de3e882..38c37901728d 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -97,6 +97,15 @@
  */
 #define MB_NUM_ORDERS(sb)		((sb)->s_blocksize_bits + 2)
 
+/*
+ * Number of mb last groups
+ */
+#ifdef CONFIG_SMP
+#define MB_LAST_GROUPS roundup_pow_of_two(nr_cpu_ids)
+#else
+#define MB_LAST_GROUPS 1
+#endif
+
 struct ext4_free_data {
 	/* this links the free block information from sb_info */
 	struct list_head		efd_list;
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 05/16] ext4: get rid of some obsolete EXT4_MB_HINT flags
  2025-06-23  7:32 [PATCH v2 00/16] ext4: better scalability for ext4 block allocation Baokun Li
                   ` (3 preceding siblings ...)
  2025-06-23  7:32 ` [PATCH v2 04/16] ext4: utilize multiple global goals to reduce contention Baokun Li
@ 2025-06-23  7:32 ` Baokun Li
  2025-06-23  7:32 ` [PATCH v2 06/16] ext4: fix typo in CR_GOAL_LEN_SLOW comment Baokun Li
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 51+ messages in thread
From: Baokun Li @ 2025-06-23  7:32 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, jack, adilger.kernel, ojaswin, linux-kernel, yi.zhang,
	yangerkun, libaokun1

Since nobody has used these EXT4_MB_HINT flags for ages,
let's remove them.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h              | 6 ------
 include/trace/events/ext4.h | 3 ---
 2 files changed, 9 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index c3f16aba7b79..29b3817f41a5 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -185,14 +185,8 @@ enum criteria {
 
 /* prefer goal again. length */
 #define EXT4_MB_HINT_MERGE		0x0001
-/* blocks already reserved */
-#define EXT4_MB_HINT_RESERVED		0x0002
-/* metadata is being allocated */
-#define EXT4_MB_HINT_METADATA		0x0004
 /* first blocks in the file */
 #define EXT4_MB_HINT_FIRST		0x0008
-/* search for the best chunk */
-#define EXT4_MB_HINT_BEST		0x0010
 /* data is being allocated */
 #define EXT4_MB_HINT_DATA		0x0020
 /* don't preallocate (for tails) */
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 156908641e68..33b204165cc0 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -23,10 +23,7 @@ struct partial_cluster;
 
 #define show_mballoc_flags(flags) __print_flags(flags, "|",	\
 	{ EXT4_MB_HINT_MERGE,		"HINT_MERGE" },		\
-	{ EXT4_MB_HINT_RESERVED,	"HINT_RESV" },		\
-	{ EXT4_MB_HINT_METADATA,	"HINT_MDATA" },		\
 	{ EXT4_MB_HINT_FIRST,		"HINT_FIRST" },		\
-	{ EXT4_MB_HINT_BEST,		"HINT_BEST" },		\
 	{ EXT4_MB_HINT_DATA,		"HINT_DATA" },		\
 	{ EXT4_MB_HINT_NOPREALLOC,	"HINT_NOPREALLOC" },	\
 	{ EXT4_MB_HINT_GROUP_ALLOC,	"HINT_GRP_ALLOC" },	\
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 06/16] ext4: fix typo in CR_GOAL_LEN_SLOW comment
  2025-06-23  7:32 [PATCH v2 00/16] ext4: better scalability for ext4 block allocation Baokun Li
                   ` (4 preceding siblings ...)
  2025-06-23  7:32 ` [PATCH v2 05/16] ext4: get rid of some obsolete EXT4_MB_HINT flags Baokun Li
@ 2025-06-23  7:32 ` Baokun Li
  2025-06-23  7:32 ` [PATCH v2 07/16] ext4: convert sbi->s_mb_free_pending to atomic_t Baokun Li
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 51+ messages in thread
From: Baokun Li @ 2025-06-23  7:32 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, jack, adilger.kernel, ojaswin, linux-kernel, yi.zhang,
	yangerkun, libaokun1

Remove the superfluous "find_".

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 29b3817f41a5..294198c05cdd 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -157,7 +157,7 @@ enum criteria {
 
 	/*
 	 * Reads each block group sequentially, performing disk IO if
-	 * necessary, to find find_suitable block group. Tries to
+	 * necessary, to find suitable block group. Tries to
 	 * allocate goal length but might trim the request if nothing
 	 * is found after enough tries.
 	 */
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 07/16] ext4: convert sbi->s_mb_free_pending to atomic_t
  2025-06-23  7:32 [PATCH v2 00/16] ext4: better scalability for ext4 block allocation Baokun Li
                   ` (5 preceding siblings ...)
  2025-06-23  7:32 ` [PATCH v2 06/16] ext4: fix typo in CR_GOAL_LEN_SLOW comment Baokun Li
@ 2025-06-23  7:32 ` Baokun Li
  2025-06-27 18:33   ` Jan Kara
  2025-06-23  7:32 ` [PATCH v2 08/16] ext4: merge freed extent with existing extents before insertion Baokun Li
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 51+ messages in thread
From: Baokun Li @ 2025-06-23  7:32 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, jack, adilger.kernel, ojaswin, linux-kernel, yi.zhang,
	yangerkun, libaokun1

Previously, s_md_lock was used to protect s_mb_free_pending during
modifications, while smp_mb() ensured fresh reads, so s_md_lock effectively
only guaranteed the atomicity of s_mb_free_pending updates. Optimize this
by converting s_mb_free_pending into an atomic variable, thereby
eliminating this use of s_md_lock and reducing lock contention. This also
prepares for future lockless merging of free extents.

Following this modification, s_md_lock is exclusively responsible for
managing insertions and deletions within s_freed_data_list, along with
operations involving list_splice.

Performance test data follows:

Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.

                   | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
 Disk: 960GB SSD   |-------------------------|-------------------------|
                   | base  |    patched      | base  |    patched      |
-------------------|-------|-----------------|-------|-----------------|
mb_optimize_scan=0 | 19699 | 20982 (+6.5%)   | 53093 | 50629 (-4.6%)   |
mb_optimize_scan=1 | 9862  | 10703 (+8.5%)   | 14401 | 14856 (+3.1%)   |

Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/balloc.c  | 2 +-
 fs/ext4/ext4.h    | 2 +-
 fs/ext4/mballoc.c | 9 +++------
 3 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index c48fd36b2d74..c9329ed5c094 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -703,7 +703,7 @@ int ext4_should_retry_alloc(struct super_block *sb, int *retries)
 	 * possible we just missed a transaction commit that did so
 	 */
 	smp_mb();
-	if (sbi->s_mb_free_pending == 0) {
+	if (atomic_read(&sbi->s_mb_free_pending) == 0) {
 		if (test_opt(sb, DISCARD)) {
 			atomic_inc(&sbi->s_retry_alloc_pending);
 			flush_work(&sbi->s_discard_work);
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 294198c05cdd..003b8d3726e8 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1602,7 +1602,7 @@ struct ext4_sb_info {
 	unsigned short *s_mb_offsets;
 	unsigned int *s_mb_maxs;
 	unsigned int s_group_info_size;
-	unsigned int s_mb_free_pending;
+	atomic_t s_mb_free_pending;
 	struct list_head s_freed_data_list[2];	/* List of blocks to be freed
 						   after commit completed */
 	struct list_head s_discard_list;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 216b332a5054..5410fb3688ee 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3680,7 +3680,7 @@ int ext4_mb_init(struct super_block *sb)
 	}
 
 	spin_lock_init(&sbi->s_md_lock);
-	sbi->s_mb_free_pending = 0;
+	atomic_set(&sbi->s_mb_free_pending, 0);
 	INIT_LIST_HEAD(&sbi->s_freed_data_list[0]);
 	INIT_LIST_HEAD(&sbi->s_freed_data_list[1]);
 	INIT_LIST_HEAD(&sbi->s_discard_list);
@@ -3894,10 +3894,7 @@ static void ext4_free_data_in_buddy(struct super_block *sb,
 	/* we expect to find existing buddy because it's pinned */
 	BUG_ON(err != 0);
 
-	spin_lock(&EXT4_SB(sb)->s_md_lock);
-	EXT4_SB(sb)->s_mb_free_pending -= entry->efd_count;
-	spin_unlock(&EXT4_SB(sb)->s_md_lock);
-
+	atomic_sub(entry->efd_count, &EXT4_SB(sb)->s_mb_free_pending);
 	db = e4b.bd_info;
 	/* there are blocks to put in buddy to make them really free */
 	count += entry->efd_count;
@@ -6392,7 +6389,7 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
 
 	spin_lock(&sbi->s_md_lock);
 	list_add_tail(&new_entry->efd_list, &sbi->s_freed_data_list[new_entry->efd_tid & 1]);
-	sbi->s_mb_free_pending += clusters;
+	atomic_add(clusters, &sbi->s_mb_free_pending);
 	spin_unlock(&sbi->s_md_lock);
 }
 
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 08/16] ext4: merge freed extent with existing extents before insertion
  2025-06-23  7:32 [PATCH v2 00/16] ext4: better scalability for ext4 block allocation Baokun Li
                   ` (6 preceding siblings ...)
  2025-06-23  7:32 ` [PATCH v2 07/16] ext4: convert sbi->s_mb_free_pending to atomic_t Baokun Li
@ 2025-06-23  7:32 ` Baokun Li
  2025-06-27 19:11   ` Jan Kara
  2025-06-23  7:32 ` [PATCH v2 09/16] ext4: fix zombie groups in average fragment size lists Baokun Li
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 51+ messages in thread
From: Baokun Li @ 2025-06-23  7:32 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, jack, adilger.kernel, ojaswin, linux-kernel, yi.zhang,
	yangerkun, libaokun1

Attempt to merge a new ext4_free_data entry with already inserted free
extents before adding it. This strategy drastically cuts down the number
of times s_md_lock has to be taken.

For example, if prev, new, and next extents are all mergeable, the existing
code (before this patch) requires acquiring the s_md_lock three times:

  prev merge into new and free prev // hold lock
  next merge into new and free next // hold lock
  insert new // hold lock

After the patch, it only needs to be acquired once:

  new merge into next and free new // no lock
  next merge into prev and free next // hold lock

Performance test data follows:

Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.

                   | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
 Disk: 960GB SSD   |-------------------------|-------------------------|
                   | base  |    patched      | base  |    patched      |
-------------------|-------|-----------------|-------|-----------------|
mb_optimize_scan=0 | 20982 | 21157 (+0.8%)   | 50629 | 50420 (-0.4%)   |
mb_optimize_scan=1 | 10703 | 12896 (+20.4%)  | 14856 | 17273 (+16.2%)  |

Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/mballoc.c | 113 +++++++++++++++++++++++++++++++---------------
 1 file changed, 76 insertions(+), 37 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 5410fb3688ee..94950b07a577 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -6298,28 +6298,63 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
  * are contiguous, AND the extents were freed by the same transaction,
  * AND the blocks are associated with the same group.
  */
-static void ext4_try_merge_freed_extent(struct ext4_sb_info *sbi,
-					struct ext4_free_data *entry,
-					struct ext4_free_data *new_entry,
-					struct rb_root *entry_rb_root)
+static inline bool
+ext4_freed_extents_can_be_merged(struct ext4_free_data *entry1,
+				 struct ext4_free_data *entry2)
 {
-	if ((entry->efd_tid != new_entry->efd_tid) ||
-	    (entry->efd_group != new_entry->efd_group))
-		return;
-	if (entry->efd_start_cluster + entry->efd_count ==
-	    new_entry->efd_start_cluster) {
-		new_entry->efd_start_cluster = entry->efd_start_cluster;
-		new_entry->efd_count += entry->efd_count;
-	} else if (new_entry->efd_start_cluster + new_entry->efd_count ==
-		   entry->efd_start_cluster) {
-		new_entry->efd_count += entry->efd_count;
-	} else
-		return;
+	if (entry1->efd_tid != entry2->efd_tid)
+		return false;
+	if (entry1->efd_start_cluster + entry1->efd_count !=
+	    entry2->efd_start_cluster)
+		return false;
+	if (WARN_ON_ONCE(entry1->efd_group != entry2->efd_group))
+		return false;
+	return true;
+}
+
+static inline void
+ext4_merge_freed_extents(struct ext4_sb_info *sbi, struct rb_root *root,
+			 struct ext4_free_data *entry1,
+			 struct ext4_free_data *entry2)
+{
+	entry1->efd_count += entry2->efd_count;
 	spin_lock(&sbi->s_md_lock);
-	list_del(&entry->efd_list);
+	list_del(&entry2->efd_list);
 	spin_unlock(&sbi->s_md_lock);
-	rb_erase(&entry->efd_node, entry_rb_root);
-	kmem_cache_free(ext4_free_data_cachep, entry);
+	rb_erase(&entry2->efd_node, root);
+	kmem_cache_free(ext4_free_data_cachep, entry2);
+}
+
+static inline void
+ext4_try_merge_freed_extent_prev(struct ext4_sb_info *sbi, struct rb_root *root,
+				 struct ext4_free_data *entry)
+{
+	struct ext4_free_data *prev;
+	struct rb_node *node;
+
+	node = rb_prev(&entry->efd_node);
+	if (!node)
+		return;
+
+	prev = rb_entry(node, struct ext4_free_data, efd_node);
+	if (ext4_freed_extents_can_be_merged(prev, entry))
+		ext4_merge_freed_extents(sbi, root, prev, entry);
+}
+
+static inline void
+ext4_try_merge_freed_extent_next(struct ext4_sb_info *sbi, struct rb_root *root,
+				 struct ext4_free_data *entry)
+{
+	struct ext4_free_data *next;
+	struct rb_node *node;
+
+	node = rb_next(&entry->efd_node);
+	if (!node)
+		return;
+
+	next = rb_entry(node, struct ext4_free_data, efd_node);
+	if (ext4_freed_extents_can_be_merged(entry, next))
+		ext4_merge_freed_extents(sbi, root, entry, next);
 }
 
 static noinline_for_stack void
@@ -6329,11 +6364,12 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
 	ext4_group_t group = e4b->bd_group;
 	ext4_grpblk_t cluster;
 	ext4_grpblk_t clusters = new_entry->efd_count;
-	struct ext4_free_data *entry;
+	struct ext4_free_data *entry = NULL;
 	struct ext4_group_info *db = e4b->bd_info;
 	struct super_block *sb = e4b->bd_sb;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
-	struct rb_node **n = &db->bb_free_root.rb_node, *node;
+	struct rb_root *root = &db->bb_free_root;
+	struct rb_node **n = &root->rb_node;
 	struct rb_node *parent = NULL, *new_node;
 
 	BUG_ON(!ext4_handle_valid(handle));
@@ -6369,27 +6405,30 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
 		}
 	}
 
-	rb_link_node(new_node, parent, n);
-	rb_insert_color(new_node, &db->bb_free_root);
-
-	/* Now try to see the extent can be merged to left and right */
-	node = rb_prev(new_node);
-	if (node) {
-		entry = rb_entry(node, struct ext4_free_data, efd_node);
-		ext4_try_merge_freed_extent(sbi, entry, new_entry,
-					    &(db->bb_free_root));
+	atomic_add(clusters, &sbi->s_mb_free_pending);
+	if (!entry)
+		goto insert;
+
+	/* Now try to see the extent can be merged to prev and next */
+	if (ext4_freed_extents_can_be_merged(new_entry, entry)) {
+		entry->efd_start_cluster = cluster;
+		entry->efd_count += new_entry->efd_count;
+		kmem_cache_free(ext4_free_data_cachep, new_entry);
+		ext4_try_merge_freed_extent_prev(sbi, root, entry);
+		return;
 	}
-
-	node = rb_next(new_node);
-	if (node) {
-		entry = rb_entry(node, struct ext4_free_data, efd_node);
-		ext4_try_merge_freed_extent(sbi, entry, new_entry,
-					    &(db->bb_free_root));
+	if (ext4_freed_extents_can_be_merged(entry, new_entry)) {
+		entry->efd_count += new_entry->efd_count;
+		kmem_cache_free(ext4_free_data_cachep, new_entry);
+		ext4_try_merge_freed_extent_next(sbi, root, entry);
+		return;
 	}
+insert:
+	rb_link_node(new_node, parent, n);
+	rb_insert_color(new_node, root);
 
 	spin_lock(&sbi->s_md_lock);
 	list_add_tail(&new_entry->efd_list, &sbi->s_freed_data_list[new_entry->efd_tid & 1]);
-	atomic_add(clusters, &sbi->s_mb_free_pending);
 	spin_unlock(&sbi->s_md_lock);
 }
 
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 09/16] ext4: fix zombie groups in average fragment size lists
  2025-06-23  7:32 [PATCH v2 00/16] ext4: better scalability for ext4 block allocation Baokun Li
                   ` (7 preceding siblings ...)
  2025-06-23  7:32 ` [PATCH v2 08/16] ext4: merge freed extent with existing extents before insertion Baokun Li
@ 2025-06-23  7:32 ` Baokun Li
  2025-06-27 19:14   ` Jan Kara
  2025-06-23  7:32 ` [PATCH v2 10/16] ext4: fix largest free orders lists corruption on mb_optimize_scan switch Baokun Li
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 51+ messages in thread
From: Baokun Li @ 2025-06-23  7:32 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, jack, adilger.kernel, ojaswin, linux-kernel, yi.zhang,
	yangerkun, libaokun1, stable

Groups with no free blocks shouldn't be in any average fragment size list.
However, when all blocks in a group are allocated (i.e., bb_fragments or
bb_free is 0), we currently skip updating the average fragment size, which
means the group isn't removed from its previous s_mb_avg_fragment_size[old]
list.

This created "zombie" groups that were always skipped during traversal as
they couldn't satisfy any block allocation requests, negatively impacting
traversal efficiency.

Therefore, when a group runs out of free blocks,
bb_avg_fragment_size_order is now set to -1. If the old order was not -1,
a removal operation is performed; if the new order is not -1, an insertion
is performed.

Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@vger.kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/mballoc.c | 36 ++++++++++++++++++------------------
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 94950b07a577..e6d6c2da3c6e 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -841,30 +841,30 @@ static void
 mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
-	int new_order;
+	int new, old;
 
-	if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || grp->bb_fragments == 0)
+	if (!test_opt2(sb, MB_OPTIMIZE_SCAN))
 		return;
 
-	new_order = mb_avg_fragment_size_order(sb,
-					grp->bb_free / grp->bb_fragments);
-	if (new_order == grp->bb_avg_fragment_size_order)
+	old = grp->bb_avg_fragment_size_order;
+	new = grp->bb_fragments == 0 ? -1 :
+	      mb_avg_fragment_size_order(sb, grp->bb_free / grp->bb_fragments);
+	if (new == old)
 		return;
 
-	if (grp->bb_avg_fragment_size_order != -1) {
-		write_lock(&sbi->s_mb_avg_fragment_size_locks[
-					grp->bb_avg_fragment_size_order]);
+	if (old >= 0) {
+		write_lock(&sbi->s_mb_avg_fragment_size_locks[old]);
 		list_del(&grp->bb_avg_fragment_size_node);
-		write_unlock(&sbi->s_mb_avg_fragment_size_locks[
-					grp->bb_avg_fragment_size_order]);
-	}
-	grp->bb_avg_fragment_size_order = new_order;
-	write_lock(&sbi->s_mb_avg_fragment_size_locks[
-					grp->bb_avg_fragment_size_order]);
-	list_add_tail(&grp->bb_avg_fragment_size_node,
-		&sbi->s_mb_avg_fragment_size[grp->bb_avg_fragment_size_order]);
-	write_unlock(&sbi->s_mb_avg_fragment_size_locks[
-					grp->bb_avg_fragment_size_order]);
+		write_unlock(&sbi->s_mb_avg_fragment_size_locks[old]);
+	}
+
+	grp->bb_avg_fragment_size_order = new;
+	if (new >= 0) {
+		write_lock(&sbi->s_mb_avg_fragment_size_locks[new]);
+		list_add_tail(&grp->bb_avg_fragment_size_node,
+				&sbi->s_mb_avg_fragment_size[new]);
+		write_unlock(&sbi->s_mb_avg_fragment_size_locks[new]);
+	}
 }
 
 /*
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 10/16] ext4: fix largest free orders lists corruption on mb_optimize_scan switch
  2025-06-23  7:32 [PATCH v2 00/16] ext4: better scalability for ext4 block allocation Baokun Li
                   ` (8 preceding siblings ...)
  2025-06-23  7:32 ` [PATCH v2 09/16] ext4: fix zombie groups in average fragment size lists Baokun Li
@ 2025-06-23  7:32 ` Baokun Li
  2025-06-27 19:34   ` Jan Kara
  2025-06-23  7:32 ` [PATCH v2 11/16] ext4: factor out __ext4_mb_scan_group() Baokun Li
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 51+ messages in thread
From: Baokun Li @ 2025-06-23  7:32 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, jack, adilger.kernel, ojaswin, linux-kernel, yi.zhang,
	yangerkun, libaokun1, stable

The grp->bb_largest_free_order is updated regardless of whether
mb_optimize_scan is enabled. This can lead to inconsistencies between
grp->bb_largest_free_order and the actual s_mb_largest_free_orders list
index when mb_optimize_scan is repeatedly enabled and disabled via remount.

For example, suppose mb_optimize_scan is initially enabled, the largest
free order is 3, and the group is on s_mb_largest_free_orders[3]. Then
mb_optimize_scan is disabled via remount and block allocations update the
largest free order to 2. Finally, mb_optimize_scan is re-enabled via
remount and more block allocations update the largest free order to 1.

At this point, the group would be removed from s_mb_largest_free_orders[3]
under the protection of s_mb_largest_free_orders_locks[2]. This lock
mismatch can lead to list corruption.

To fix this, a new field bb_largest_free_order_idx is added to struct
ext4_group_info to explicitly track the list index. bb_largest_free_order
is still updated unconditionally, but bb_largest_free_order_idx is only
updated when mb_optimize_scan is enabled, so that there is no inconsistency
between the lock taken and the data it protects.

Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@vger.kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/ext4.h    |  1 +
 fs/ext4/mballoc.c | 35 ++++++++++++++++-------------------
 2 files changed, 17 insertions(+), 19 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 003b8d3726e8..0e574378c6a3 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3476,6 +3476,7 @@ struct ext4_group_info {
 	int		bb_avg_fragment_size_order;	/* order of average
 							   fragment in BG */
 	ext4_grpblk_t	bb_largest_free_order;/* order of largest frag in BG */
+	ext4_grpblk_t	bb_largest_free_order_idx; /* index of largest frag */
 	ext4_group_t	bb_group;	/* Group number */
 	struct          list_head bb_prealloc_list;
 #ifdef DOUBLE_CHECK
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index e6d6c2da3c6e..dc82124f0905 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1152,33 +1152,29 @@ static void
 mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
-	int i;
+	int new, old = grp->bb_largest_free_order_idx;
 
-	for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--)
-		if (grp->bb_counters[i] > 0)
+	for (new = MB_NUM_ORDERS(sb) - 1; new >= 0; new--)
+		if (grp->bb_counters[new] > 0)
 			break;
+
+	grp->bb_largest_free_order = new;
 	/* No need to move between order lists? */
-	if (!test_opt2(sb, MB_OPTIMIZE_SCAN) ||
-	    i == grp->bb_largest_free_order) {
-		grp->bb_largest_free_order = i;
+	if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || new == old)
 		return;
-	}
 
-	if (grp->bb_largest_free_order >= 0) {
-		write_lock(&sbi->s_mb_largest_free_orders_locks[
-					      grp->bb_largest_free_order]);
+	if (old >= 0) {
+		write_lock(&sbi->s_mb_largest_free_orders_locks[old]);
 		list_del_init(&grp->bb_largest_free_order_node);
-		write_unlock(&sbi->s_mb_largest_free_orders_locks[
-					      grp->bb_largest_free_order]);
+		write_unlock(&sbi->s_mb_largest_free_orders_locks[old]);
 	}
-	grp->bb_largest_free_order = i;
-	if (grp->bb_largest_free_order >= 0 && grp->bb_free) {
-		write_lock(&sbi->s_mb_largest_free_orders_locks[
-					      grp->bb_largest_free_order]);
+
+	grp->bb_largest_free_order_idx = new;
+	if (new >= 0 && grp->bb_free) {
+		write_lock(&sbi->s_mb_largest_free_orders_locks[new]);
 		list_add_tail(&grp->bb_largest_free_order_node,
-		      &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);
-		write_unlock(&sbi->s_mb_largest_free_orders_locks[
-					      grp->bb_largest_free_order]);
+			      &sbi->s_mb_largest_free_orders[new]);
+		write_unlock(&sbi->s_mb_largest_free_orders_locks[new]);
 	}
 }
 
@@ -3391,6 +3387,7 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
 	INIT_LIST_HEAD(&meta_group_info[i]->bb_avg_fragment_size_node);
 	meta_group_info[i]->bb_largest_free_order = -1;  /* uninit */
 	meta_group_info[i]->bb_avg_fragment_size_order = -1;  /* uninit */
+	meta_group_info[i]->bb_largest_free_order_idx = -1;  /* uninit */
 	meta_group_info[i]->bb_group = group;
 
 	mb_group_bb_bitmap_alloc(sb, meta_group_info[i], group);
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 11/16] ext4: factor out __ext4_mb_scan_group()
  2025-06-23  7:32 [PATCH v2 00/16] ext4: better scalability for ext4 block allocation Baokun Li
                   ` (9 preceding siblings ...)
  2025-06-23  7:32 ` [PATCH v2 10/16] ext4: fix largest free orders lists corruption on mb_optimize_scan switch Baokun Li
@ 2025-06-23  7:32 ` Baokun Li
  2025-06-23  7:33 ` [PATCH v2 12/16] ext4: factor out ext4_mb_might_prefetch() Baokun Li
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 51+ messages in thread
From: Baokun Li @ 2025-06-23  7:32 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, jack, adilger.kernel, ojaswin, linux-kernel, yi.zhang,
	yangerkun, libaokun1

Extract __ext4_mb_scan_group() to make the code clearer and to
prepare for the later conversion of 'choose group' to 'scan groups'.
No functional changes.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/mballoc.c | 45 +++++++++++++++++++++++++++------------------
 fs/ext4/mballoc.h |  2 ++
 2 files changed, 29 insertions(+), 18 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index dc82124f0905..db5d8b1e5cce 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2569,6 +2569,30 @@ void ext4_mb_scan_aligned(struct ext4_allocation_context *ac,
 	}
 }
 
+static void __ext4_mb_scan_group(struct ext4_allocation_context *ac)
+{
+	bool is_stripe_aligned;
+	struct ext4_sb_info *sbi;
+	enum criteria cr = ac->ac_criteria;
+
+	ac->ac_groups_scanned++;
+	if (cr == CR_POWER2_ALIGNED)
+		return ext4_mb_simple_scan_group(ac, ac->ac_e4b);
+
+	sbi = EXT4_SB(ac->ac_sb);
+	is_stripe_aligned = false;
+	if ((sbi->s_stripe >= sbi->s_cluster_ratio) &&
+	    !(ac->ac_g_ex.fe_len % EXT4_NUM_B2C(sbi, sbi->s_stripe)))
+		is_stripe_aligned = true;
+
+	if ((cr == CR_GOAL_LEN_FAST || cr == CR_BEST_AVAIL_LEN) &&
+	    is_stripe_aligned)
+		ext4_mb_scan_aligned(ac, ac->ac_e4b);
+
+	if (ac->ac_status == AC_STATUS_CONTINUE)
+		ext4_mb_complex_scan_group(ac, ac->ac_e4b);
+}
+
 /*
  * This is also called BEFORE we load the buddy bitmap.
  * Returns either 1 or 0 indicating that the group is either suitable
@@ -2855,6 +2879,8 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 	 */
 	if (ac->ac_2order)
 		cr = CR_POWER2_ALIGNED;
+
+	ac->ac_e4b = &e4b;
 repeat:
 	for (; cr < EXT4_MB_NUM_CRS && ac->ac_status == AC_STATUS_CONTINUE; cr++) {
 		ac->ac_criteria = cr;
@@ -2932,24 +2958,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 				continue;
 			}
 
-			ac->ac_groups_scanned++;
-			if (cr == CR_POWER2_ALIGNED)
-				ext4_mb_simple_scan_group(ac, &e4b);
-			else {
-				bool is_stripe_aligned =
-					(sbi->s_stripe >=
-					 sbi->s_cluster_ratio) &&
-					!(ac->ac_g_ex.fe_len %
-					  EXT4_NUM_B2C(sbi, sbi->s_stripe));
-
-				if ((cr == CR_GOAL_LEN_FAST ||
-				     cr == CR_BEST_AVAIL_LEN) &&
-				    is_stripe_aligned)
-					ext4_mb_scan_aligned(ac, &e4b);
-
-				if (ac->ac_status == AC_STATUS_CONTINUE)
-					ext4_mb_complex_scan_group(ac, &e4b);
-			}
+			__ext4_mb_scan_group(ac);
 
 			ext4_unlock_group(sb, group);
 			ext4_mb_unload_buddy(&e4b);
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 38c37901728d..d61d690d237c 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -213,6 +213,8 @@ struct ext4_allocation_context {
 	__u8 ac_2order;		/* if request is to allocate 2^N blocks and
 				 * N > 0, the field stores N, otherwise 0 */
 	__u8 ac_op;		/* operation, for history only */
+
+	struct ext4_buddy *ac_e4b;
 	struct folio *ac_bitmap_folio;
 	struct folio *ac_buddy_folio;
 	struct ext4_prealloc_space *ac_pa;
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 12/16] ext4: factor out ext4_mb_might_prefetch()
  2025-06-23  7:32 [PATCH v2 00/16] ext4: better scalability for ext4 block allocation Baokun Li
                   ` (10 preceding siblings ...)
  2025-06-23  7:32 ` [PATCH v2 11/16] ext4: factor out __ext4_mb_scan_group() Baokun Li
@ 2025-06-23  7:33 ` Baokun Li
  2025-06-23  7:33 ` [PATCH v2 13/16] ext4: factor out ext4_mb_scan_group() Baokun Li
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 51+ messages in thread
From: Baokun Li @ 2025-06-23  7:33 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, jack, adilger.kernel, ojaswin, linux-kernel, yi.zhang,
	yangerkun, libaokun1

Extract ext4_mb_might_prefetch() to make the code clearer and to
prepare for the later conversion of 'choose group' to 'scan groups'.
No functional changes.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/mballoc.c | 62 +++++++++++++++++++++++++++++------------------
 fs/ext4/mballoc.h |  4 +++
 2 files changed, 42 insertions(+), 24 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index db5d8b1e5cce..683e7f8faab6 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2782,6 +2782,37 @@ ext4_group_t ext4_mb_prefetch(struct super_block *sb, ext4_group_t group,
 	return group;
 }
 
+/*
+ * Batch reads of the block allocation bitmaps to get
+ * multiple READs in flight; limit prefetching at inexpensive
+ * CR, otherwise mballoc can spend a lot of time loading
+ * imperfect groups
+ */
+static void ext4_mb_might_prefetch(struct ext4_allocation_context *ac,
+				   ext4_group_t group)
+{
+	struct ext4_sb_info *sbi;
+
+	if (ac->ac_prefetch_grp != group)
+		return;
+
+	sbi = EXT4_SB(ac->ac_sb);
+	if (ext4_mb_cr_expensive(ac->ac_criteria) ||
+	    ac->ac_prefetch_ios < sbi->s_mb_prefetch_limit) {
+		unsigned int nr = sbi->s_mb_prefetch;
+
+		if (ext4_has_feature_flex_bg(ac->ac_sb)) {
+			nr = 1 << sbi->s_log_groups_per_flex;
+			nr -= group & (nr - 1);
+			nr = min(nr, sbi->s_mb_prefetch);
+		}
+
+		ac->ac_prefetch_nr = nr;
+		ac->ac_prefetch_grp = ext4_mb_prefetch(ac->ac_sb, group, nr,
+						       &ac->ac_prefetch_ios);
+	}
+}
+
 /*
  * Prefetching reads the block bitmap into the buffer cache; but we
  * need to make sure that the buddy bitmap in the page cache has been
@@ -2818,10 +2849,9 @@ void ext4_mb_prefetch_fini(struct super_block *sb, ext4_group_t group,
 static noinline_for_stack int
 ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 {
-	ext4_group_t prefetch_grp = 0, ngroups, group, i;
+	ext4_group_t ngroups, group, i;
 	enum criteria new_cr, cr = CR_GOAL_LEN_FAST;
 	int err = 0, first_err = 0;
-	unsigned int nr = 0, prefetch_ios = 0;
 	struct ext4_sb_info *sbi;
 	struct super_block *sb;
 	struct ext4_buddy e4b;
@@ -2881,6 +2911,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 		cr = CR_POWER2_ALIGNED;
 
 	ac->ac_e4b = &e4b;
+	ac->ac_prefetch_ios = 0;
 repeat:
 	for (; cr < EXT4_MB_NUM_CRS && ac->ac_status == AC_STATUS_CONTINUE; cr++) {
 		ac->ac_criteria = cr;
@@ -2890,8 +2921,8 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 		 */
 		group = ac->ac_g_ex.fe_group;
 		ac->ac_groups_linear_remaining = sbi->s_mb_max_linear_groups;
-		prefetch_grp = group;
-		nr = 0;
+		ac->ac_prefetch_grp = group;
+		ac->ac_prefetch_nr = 0;
 
 		for (i = 0, new_cr = cr; i < ngroups; i++,
 		     ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups)) {
@@ -2903,24 +2934,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 				goto repeat;
 			}
 
-			/*
-			 * Batch reads of the block allocation bitmaps
-			 * to get multiple READs in flight; limit
-			 * prefetching at inexpensive CR, otherwise mballoc
-			 * can spend a lot of time loading imperfect groups
-			 */
-			if ((prefetch_grp == group) &&
-			    (ext4_mb_cr_expensive(cr) ||
-			     prefetch_ios < sbi->s_mb_prefetch_limit)) {
-				nr = sbi->s_mb_prefetch;
-				if (ext4_has_feature_flex_bg(sb)) {
-					nr = 1 << sbi->s_log_groups_per_flex;
-					nr -= group & (nr - 1);
-					nr = min(nr, sbi->s_mb_prefetch);
-				}
-				prefetch_grp = ext4_mb_prefetch(sb, group,
-							nr, &prefetch_ios);
-			}
+			ext4_mb_might_prefetch(ac, group);
 
 			/* prevent unnecessary buddy loading. */
 			if (cr < CR_ANY_FREE &&
@@ -3014,8 +3028,8 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 		 ac->ac_b_ex.fe_len, ac->ac_o_ex.fe_len, ac->ac_status,
 		 ac->ac_flags, cr, err);
 
-	if (nr)
-		ext4_mb_prefetch_fini(sb, prefetch_grp, nr);
+	if (ac->ac_prefetch_nr)
+		ext4_mb_prefetch_fini(sb, ac->ac_prefetch_grp, ac->ac_prefetch_nr);
 
 	return err;
 }
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index d61d690d237c..772ee0264d33 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -201,6 +201,10 @@ struct ext4_allocation_context {
 	 */
 	ext4_grpblk_t	ac_orig_goal_len;
 
+	ext4_group_t ac_prefetch_grp;
+	unsigned int ac_prefetch_ios;
+	unsigned int ac_prefetch_nr;
+
 	__u32 ac_flags;		/* allocation hints */
 	__u32 ac_groups_linear_remaining;
 	__u16 ac_groups_scanned;
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 13/16] ext4: factor out ext4_mb_scan_group()
  2025-06-23  7:32 [PATCH v2 00/16] ext4: better scalability for ext4 block allocation Baokun Li
                   ` (11 preceding siblings ...)
  2025-06-23  7:33 ` [PATCH v2 12/16] ext4: factor out ext4_mb_might_prefetch() Baokun Li
@ 2025-06-23  7:33 ` Baokun Li
  2025-06-23  7:33 ` [PATCH v2 14/16] ext4: convert free group lists to ordered xarrays Baokun Li
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 51+ messages in thread
From: Baokun Li @ 2025-06-23  7:33 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, jack, adilger.kernel, ojaswin, linux-kernel, yi.zhang,
	yangerkun, libaokun1

Extract ext4_mb_scan_group() to make the code clearer and to
prepare for the later conversion of 'choose group' to 'scan groups'.
No functional changes.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/mballoc.c | 93 +++++++++++++++++++++++++----------------------
 fs/ext4/mballoc.h |  2 +
 2 files changed, 51 insertions(+), 44 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 683e7f8faab6..2c4c2cf3e180 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2846,12 +2846,56 @@ void ext4_mb_prefetch_fini(struct super_block *sb, ext4_group_t group,
 	}
 }
 
+static int ext4_mb_scan_group(struct ext4_allocation_context *ac,
+			      ext4_group_t group)
+{
+	int ret;
+	struct super_block *sb = ac->ac_sb;
+	enum criteria cr = ac->ac_criteria;
+
+	ext4_mb_might_prefetch(ac, group);
+
+	/* prevent unnecessary buddy loading. */
+	if (cr < CR_ANY_FREE && spin_is_locked(ext4_group_lock_ptr(sb, group)))
+		return 0;
+
+	/* This now checks without needing the buddy page */
+	ret = ext4_mb_good_group_nolock(ac, group, cr);
+	if (ret <= 0) {
+		if (!ac->ac_first_err)
+			ac->ac_first_err = ret;
+		return 0;
+	}
+
+	ret = ext4_mb_load_buddy(sb, group, ac->ac_e4b);
+	if (ret)
+		return ret;
+
+	/* skip busy group */
+	if (cr >= CR_ANY_FREE)
+		ext4_lock_group(sb, group);
+	else if (!ext4_try_lock_group(sb, group))
+		goto out_unload;
+
+	/* We need to check again after locking the block group. */
+	if (unlikely(!ext4_mb_good_group(ac, group, cr)))
+		goto out_unlock;
+
+	__ext4_mb_scan_group(ac);
+
+out_unlock:
+	ext4_unlock_group(sb, group);
+out_unload:
+	ext4_mb_unload_buddy(ac->ac_e4b);
+	return ret;
+}
+
 static noinline_for_stack int
 ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 {
 	ext4_group_t ngroups, group, i;
 	enum criteria new_cr, cr = CR_GOAL_LEN_FAST;
-	int err = 0, first_err = 0;
+	int err = 0;
 	struct ext4_sb_info *sbi;
 	struct super_block *sb;
 	struct ext4_buddy e4b;
@@ -2912,6 +2956,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 
 	ac->ac_e4b = &e4b;
 	ac->ac_prefetch_ios = 0;
+	ac->ac_first_err = 0;
 repeat:
 	for (; cr < EXT4_MB_NUM_CRS && ac->ac_status == AC_STATUS_CONTINUE; cr++) {
 		ac->ac_criteria = cr;
@@ -2926,7 +2971,6 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 
 		for (i = 0, new_cr = cr; i < ngroups; i++,
 		     ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups)) {
-			int ret = 0;
 
 			cond_resched();
 			if (new_cr != cr) {
@@ -2934,49 +2978,10 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 				goto repeat;
 			}
 
-			ext4_mb_might_prefetch(ac, group);
-
-			/* prevent unnecessary buddy loading. */
-			if (cr < CR_ANY_FREE &&
-			    spin_is_locked(ext4_group_lock_ptr(sb, group)))
-				continue;
-
-			/* This now checks without needing the buddy page */
-			ret = ext4_mb_good_group_nolock(ac, group, cr);
-			if (ret <= 0) {
-				if (!first_err)
-					first_err = ret;
-				continue;
-			}
-
-			err = ext4_mb_load_buddy(sb, group, &e4b);
+			err = ext4_mb_scan_group(ac, group);
 			if (err)
 				goto out;
 
-			/* skip busy group */
-			if (cr >= CR_ANY_FREE) {
-				ext4_lock_group(sb, group);
-			} else if (!ext4_try_lock_group(sb, group)) {
-				ext4_mb_unload_buddy(&e4b);
-				continue;
-			}
-
-			/*
-			 * We need to check again after locking the
-			 * block group
-			 */
-			ret = ext4_mb_good_group(ac, group, cr);
-			if (ret == 0) {
-				ext4_unlock_group(sb, group);
-				ext4_mb_unload_buddy(&e4b);
-				continue;
-			}
-
-			__ext4_mb_scan_group(ac);
-
-			ext4_unlock_group(sb, group);
-			ext4_mb_unload_buddy(&e4b);
-
 			if (ac->ac_status != AC_STATUS_CONTINUE)
 				break;
 		}
@@ -3021,8 +3026,8 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 	if (sbi->s_mb_stats && ac->ac_status == AC_STATUS_FOUND)
 		atomic64_inc(&sbi->s_bal_cX_hits[ac->ac_criteria]);
 out:
-	if (!err && ac->ac_status != AC_STATUS_FOUND && first_err)
-		err = first_err;
+	if (!err && ac->ac_status != AC_STATUS_FOUND && ac->ac_first_err)
+		err = ac->ac_first_err;
 
 	mb_debug(sb, "Best len %d, origin len %d, ac_status %u, ac_flags 0x%x, cr %d ret %d\n",
 		 ac->ac_b_ex.fe_len, ac->ac_o_ex.fe_len, ac->ac_status,
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 772ee0264d33..721aaea1f83e 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -205,6 +205,8 @@ struct ext4_allocation_context {
 	unsigned int ac_prefetch_ios;
 	unsigned int ac_prefetch_nr;
 
+	int ac_first_err;
+
 	__u32 ac_flags;		/* allocation hints */
 	__u32 ac_groups_linear_remaining;
 	__u16 ac_groups_scanned;
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 14/16] ext4: convert free group lists to ordered xarrays
  2025-06-23  7:32 [PATCH v2 00/16] ext4: better scalability for ext4 block allocation Baokun Li
                   ` (12 preceding siblings ...)
  2025-06-23  7:33 ` [PATCH v2 13/16] ext4: factor out ext4_mb_scan_group() Baokun Li
@ 2025-06-23  7:33 ` Baokun Li
  2025-06-23  7:33 ` [PATCH v2 15/16] ext4: refactor choose group to scan group Baokun Li
  2025-06-23  7:33 ` [PATCH v2 16/16] ext4: ensure global ordered traversal across all free groups xarrays Baokun Li
  15 siblings, 0 replies; 51+ messages in thread
From: Baokun Li @ 2025-06-23  7:33 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, jack, adilger.kernel, ojaswin, linux-kernel, yi.zhang,
	yangerkun, libaokun1

While traversing the list, the list lock is held, which prevents calling
ext4_mb_load_buddy() and therefore rules out using ext4_try_lock_group()
directly. This can lead to a bouncing scenario where the spin_is_locked(grp_A)
check passes (the group looks unlocked), but the subsequent
ext4_try_lock_group() fails, forcing the list traversal to repeatedly restart
from grp_A.

In contrast, linear traversal directly uses ext4_try_lock_group(),
avoiding this bouncing. Therefore, we need a lockless, ordered traversal
to achieve linear-like efficiency.
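
As a simplified illustration of that interaction (not verbatim kernel code;
the identifiers follow the pre-conversion code and restart_list_scan is purely
illustrative):

	/* List walk: under the list lock we may only peek at the group
	 * lock, we cannot take it or load the buddy. */
	read_lock(&frag_list_lock);
	list_for_each_entry(grp, frag_list, bb_avg_fragment_size_node)
		if (!spin_is_locked(ext4_group_lock_ptr(sb, grp->bb_group)))
			break;				/* grp_A looks unlocked */
	read_unlock(&frag_list_lock);

	/* Back in the caller, the group may have been locked meanwhile. */
	if (!ext4_try_lock_group(sb, grp->bb_group))
		goto restart_list_scan;			/* bounce back to grp_A */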

To that end, this commit converts both the average fragment size lists and
largest free order lists into ordered xarrays.

In an xarray, the index represents the block group number and the value
holds the block group information; a non-empty value indicates the block
group's presence.
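
For illustration, a minimal sketch of this indexing scheme (the demo_* names
are ours, not the patch's; the real updates happen in
mb_update_avg_fragment_size() and mb_set_largest_free_order(), and the ext4
types are available because this lives inside fs/ext4):

	#include <linux/xarray.h>

	static DEFINE_XARRAY(demo_groups);	/* one xarray per order in the patch */

	static int demo_mark_present(struct ext4_group_info *grp)
	{
		/* index = block group number, value = group info */
		return xa_insert(&demo_groups, grp->bb_group, grp, GFP_ATOMIC);
	}

	static void demo_mark_absent(struct ext4_group_info *grp)
	{
		xa_erase(&demo_groups, grp->bb_group);
	}

	/* ordered walk over the groups present in [start, end] */
	static struct ext4_group_info *demo_first_present(ext4_group_t start,
							  ext4_group_t end)
	{
		unsigned long group;
		struct ext4_group_info *grp;

		xa_for_each_range(&demo_groups, group, grp, start, end)
			return grp;
		return NULL;
	}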

While insertion and deletion remain O(1), looking up the next present group
changes from O(1) to O(log N), which may slightly reduce single-threaded
performance.

After this, we can convert choose group to scan group, and then we can
implement ordered optimize scan.

Performance test results are as follows: Single-process operations
on an empty disk show negligible impact, while multi-process workloads
demonstrate a noticeable performance gain.

CPU: Kunpeng 920   |          P80            |            P1           |
Memory: 512GB      |-------------------------|-------------------------|
Disk: 960GB SSD    | base  |    patched      | base  |    patched      |
-------------------|-------|-----------------|-------|-----------------|
mb_optimize_scan=0 | 21157 | 20976  (-0.8%)  | 320645| 319396 (-0.4%)  |
mb_optimize_scan=1 | 12896 | 14580  (+13.0%) | 321233| 319237 (-0.6%)  |

CPU: AMD 9654 * 2  |          P96            |            P1           |
Memory: 1536GB     |-------------------------|-------------------------|
Disk: 960GB SSD    | base  |    patched      | base  |    patched      |
-------------------|-------|-----------------|-------|-----------------|
mb_optimize_scan=0 | 50420 | 51713 (+2.5%)   | 206570| 206655 (0.04%)  |
mb_optimize_scan=1 | 17273 | 35527 (+105%)   | 208362| 212574 (+2.0%)  |

Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/ext4.h    |   8 +-
 fs/ext4/mballoc.c | 255 ++++++++++++++++++++++++----------------------
 2 files changed, 137 insertions(+), 126 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 0e574378c6a3..64e1c978a89d 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1608,10 +1608,8 @@ struct ext4_sb_info {
 	struct list_head s_discard_list;
 	struct work_struct s_discard_work;
 	atomic_t s_retry_alloc_pending;
-	struct list_head *s_mb_avg_fragment_size;
-	rwlock_t *s_mb_avg_fragment_size_locks;
-	struct list_head *s_mb_largest_free_orders;
-	rwlock_t *s_mb_largest_free_orders_locks;
+	struct xarray *s_mb_avg_fragment_size;
+	struct xarray *s_mb_largest_free_orders;
 
 	/* tunables */
 	unsigned long s_stripe;
@@ -3483,8 +3481,6 @@ struct ext4_group_info {
 	void            *bb_bitmap;
 #endif
 	struct rw_semaphore alloc_sem;
-	struct list_head bb_avg_fragment_size_node;
-	struct list_head bb_largest_free_order_node;
 	ext4_grpblk_t	bb_counters[];	/* Nr of free power-of-two-block
 					 * regions, index is order.
 					 * bb_counters[3] = 5 means
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 2c4c2cf3e180..45c7717fcbbd 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -132,25 +132,30 @@
  * If "mb_optimize_scan" mount option is set, we maintain in memory group info
  * structures in two data structures:
  *
- * 1) Array of largest free order lists (sbi->s_mb_largest_free_orders)
+ * 1) Array of largest free order xarrays (sbi->s_mb_largest_free_orders)
  *
- *    Locking: sbi->s_mb_largest_free_orders_locks(array of rw locks)
+ *    Locking: Writers use xa_lock, readers use rcu_read_lock.
  *
- *    This is an array of lists where the index in the array represents the
+ *    This is an array of xarrays where the index in the array represents the
  *    largest free order in the buddy bitmap of the participating group infos of
- *    that list. So, there are exactly MB_NUM_ORDERS(sb) (which means total
- *    number of buddy bitmap orders possible) number of lists. Group-infos are
- *    placed in appropriate lists.
+ *    that xarray. So, there are exactly MB_NUM_ORDERS(sb) (which means total
+ *    number of buddy bitmap orders possible) number of xarrays. Group-infos are
+ *    placed in appropriate xarrays.
  *
- * 2) Average fragment size lists (sbi->s_mb_avg_fragment_size)
+ * 2) Average fragment size xarrays (sbi->s_mb_avg_fragment_size)
  *
- *    Locking: sbi->s_mb_avg_fragment_size_locks(array of rw locks)
+ *    Locking: Writers use xa_lock, readers use rcu_read_lock.
  *
- *    This is an array of lists where in the i-th list there are groups with
+ *    This is an array of xarrays where in the i-th xarray there are groups with
  *    average fragment size >= 2^i and < 2^(i+1). The average fragment size
  *    is computed as ext4_group_info->bb_free / ext4_group_info->bb_fragments.
- *    Note that we don't bother with a special list for completely empty groups
- *    so we only have MB_NUM_ORDERS(sb) lists.
+ *    Note that we don't bother with a special xarray for completely empty
+ *    groups so we only have MB_NUM_ORDERS(sb) xarrays. Group-infos are placed
+ *    in appropriate xarrays.
+ *
+ * In xarray, the index is the block group number, the value is the block group
+ * information, and a non-empty value indicates the block group is present in
+ * the current xarray.
  *
  * When "mb_optimize_scan" mount option is set, mballoc consults the above data
  * structures to decide the order in which groups are to be traversed for
@@ -842,6 +847,7 @@ mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	int new, old;
+	int ret;
 
 	if (!test_opt2(sb, MB_OPTIMIZE_SCAN))
 		return;
@@ -852,19 +858,71 @@ mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
 	if (new == old)
 		return;
 
-	if (old >= 0) {
-		write_lock(&sbi->s_mb_avg_fragment_size_locks[old]);
-		list_del(&grp->bb_avg_fragment_size_node);
-		write_unlock(&sbi->s_mb_avg_fragment_size_locks[old]);
-	}
+	if (old >= 0)
+		xa_erase(&sbi->s_mb_avg_fragment_size[old], grp->bb_group);
 
 	grp->bb_avg_fragment_size_order = new;
-	if (new >= 0) {
-		write_lock(&sbi->s_mb_avg_fragment_size_locks[new]);
-		list_add_tail(&grp->bb_avg_fragment_size_node,
-				&sbi->s_mb_avg_fragment_size[new]);
-		write_unlock(&sbi->s_mb_avg_fragment_size_locks[new]);
+	if (new < 0)
+		return;
+
+	ret = xa_insert(&sbi->s_mb_avg_fragment_size[new],
+			grp->bb_group, grp, GFP_ATOMIC);
+	if (!ret)
+		return;
+	ext4_warning(sb, "insert group: %u to s_mb_avg_fragment_size[%d] failed, err %d",
+		     grp->bb_group, new, ret);
+}
+
+static struct ext4_group_info *
+ext4_mb_find_good_group_xarray(struct ext4_allocation_context *ac,
+			       struct xarray *xa, ext4_group_t start)
+{
+	struct super_block *sb = ac->ac_sb;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	enum criteria cr = ac->ac_criteria;
+	ext4_group_t ngroups = ext4_get_groups_count(sb);
+	unsigned long group = start;
+	ext4_group_t end;
+	struct ext4_group_info *grp;
+
+	if (WARN_ON_ONCE(start >= ngroups))
+		return NULL;
+	end = ngroups - 1;
+
+wrap_around:
+	xa_for_each_range(xa, group, grp, start, end) {
+		if (sbi->s_mb_stats)
+			atomic64_inc(&sbi->s_bal_cX_groups_considered[cr]);
+
+		if (!spin_is_locked(ext4_group_lock_ptr(sb, group)) &&
+		    likely(ext4_mb_good_group(ac, group, cr)))
+			return grp;
+
+		cond_resched();
+	}
+
+	if (start) {
+		end = start - 1;
+		start = 0;
+		goto wrap_around;
 	}
+
+	return NULL;
+}
+
+/*
+ * Find a suitable group of given order from the largest free orders xarray.
+ */
+static struct ext4_group_info *
+ext4_mb_find_good_group_largest_free_order(struct ext4_allocation_context *ac,
+					   int order, ext4_group_t start)
+{
+	struct xarray *xa = &EXT4_SB(ac->ac_sb)->s_mb_largest_free_orders[order];
+
+	if (xa_empty(xa))
+		return NULL;
+
+	return ext4_mb_find_good_group_xarray(ac, xa, start);
 }
 
 /*
@@ -875,7 +933,7 @@ static void ext4_mb_choose_next_group_p2_aligned(struct ext4_allocation_context
 			enum criteria *new_cr, ext4_group_t *group)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
-	struct ext4_group_info *iter;
+	struct ext4_group_info *grp;
 	int i;
 
 	if (ac->ac_status == AC_STATUS_FOUND)
@@ -885,26 +943,12 @@ static void ext4_mb_choose_next_group_p2_aligned(struct ext4_allocation_context
 		atomic_inc(&sbi->s_bal_p2_aligned_bad_suggestions);
 
 	for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
-		if (list_empty(&sbi->s_mb_largest_free_orders[i]))
-			continue;
-		read_lock(&sbi->s_mb_largest_free_orders_locks[i]);
-		if (list_empty(&sbi->s_mb_largest_free_orders[i])) {
-			read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
-			continue;
-		}
-		list_for_each_entry(iter, &sbi->s_mb_largest_free_orders[i],
-				    bb_largest_free_order_node) {
-			if (sbi->s_mb_stats)
-				atomic64_inc(&sbi->s_bal_cX_groups_considered[CR_POWER2_ALIGNED]);
-			if (!spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group)) &&
-			    likely(ext4_mb_good_group(ac, iter->bb_group, CR_POWER2_ALIGNED))) {
-				*group = iter->bb_group;
-				ac->ac_flags |= EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED;
-				read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
-				return;
-			}
+		grp = ext4_mb_find_good_group_largest_free_order(ac, i, *group);
+		if (grp) {
+			*group = grp->bb_group;
+			ac->ac_flags |= EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED;
+			return;
 		}
-		read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
 	}
 
 	/* Increment cr and search again if no group is found */
@@ -912,35 +956,18 @@ static void ext4_mb_choose_next_group_p2_aligned(struct ext4_allocation_context
 }
 
 /*
- * Find a suitable group of given order from the average fragments list.
+ * Find a suitable group of given order from the average fragments xarray.
  */
 static struct ext4_group_info *
-ext4_mb_find_good_group_avg_frag_lists(struct ext4_allocation_context *ac, int order)
+ext4_mb_find_good_group_avg_frag_xarray(struct ext4_allocation_context *ac,
+					int order, ext4_group_t start)
 {
-	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
-	struct list_head *frag_list = &sbi->s_mb_avg_fragment_size[order];
-	rwlock_t *frag_list_lock = &sbi->s_mb_avg_fragment_size_locks[order];
-	struct ext4_group_info *grp = NULL, *iter;
-	enum criteria cr = ac->ac_criteria;
+	struct xarray *xa = &EXT4_SB(ac->ac_sb)->s_mb_avg_fragment_size[order];
 
-	if (list_empty(frag_list))
-		return NULL;
-	read_lock(frag_list_lock);
-	if (list_empty(frag_list)) {
-		read_unlock(frag_list_lock);
+	if (xa_empty(xa))
 		return NULL;
-	}
-	list_for_each_entry(iter, frag_list, bb_avg_fragment_size_node) {
-		if (sbi->s_mb_stats)
-			atomic64_inc(&sbi->s_bal_cX_groups_considered[cr]);
-		if (!spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group)) &&
-		    likely(ext4_mb_good_group(ac, iter->bb_group, cr))) {
-			grp = iter;
-			break;
-		}
-	}
-	read_unlock(frag_list_lock);
-	return grp;
+
+	return ext4_mb_find_good_group_xarray(ac, xa, start);
 }
 
 /*
@@ -961,7 +988,7 @@ static void ext4_mb_choose_next_group_goal_fast(struct ext4_allocation_context *
 
 	for (i = mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len);
 	     i < MB_NUM_ORDERS(ac->ac_sb); i++) {
-		grp = ext4_mb_find_good_group_avg_frag_lists(ac, i);
+		grp = ext4_mb_find_good_group_avg_frag_xarray(ac, i, *group);
 		if (grp) {
 			*group = grp->bb_group;
 			ac->ac_flags |= EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED;
@@ -1057,7 +1084,8 @@ static void ext4_mb_choose_next_group_best_avail(struct ext4_allocation_context
 		frag_order = mb_avg_fragment_size_order(ac->ac_sb,
 							ac->ac_g_ex.fe_len);
 
-		grp = ext4_mb_find_good_group_avg_frag_lists(ac, frag_order);
+		grp = ext4_mb_find_good_group_avg_frag_xarray(ac, frag_order,
+							      *group);
 		if (grp) {
 			*group = grp->bb_group;
 			ac->ac_flags |= EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED;
@@ -1153,6 +1181,7 @@ mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	int new, old = grp->bb_largest_free_order_idx;
+	int ret;
 
 	for (new = MB_NUM_ORDERS(sb) - 1; new >= 0; new--)
 		if (grp->bb_counters[new] > 0)
@@ -1163,19 +1192,19 @@ mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
 	if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || new == old)
 		return;
 
-	if (old >= 0) {
-		write_lock(&sbi->s_mb_largest_free_orders_locks[old]);
-		list_del_init(&grp->bb_largest_free_order_node);
-		write_unlock(&sbi->s_mb_largest_free_orders_locks[old]);
-	}
+	if (old >= 0)
+		xa_erase(&sbi->s_mb_largest_free_orders[old], grp->bb_group);
 
 	grp->bb_largest_free_order_idx = new;
-	if (new >= 0 && grp->bb_free) {
-		write_lock(&sbi->s_mb_largest_free_orders_locks[new]);
-		list_add_tail(&grp->bb_largest_free_order_node,
-			      &sbi->s_mb_largest_free_orders[new]);
-		write_unlock(&sbi->s_mb_largest_free_orders_locks[new]);
-	}
+	if (new < 0 || !grp->bb_free)
+		return;
+
+	ret = xa_insert(&sbi->s_mb_largest_free_orders[new],
+			grp->bb_group, grp, GFP_ATOMIC);
+	if (!ret)
+		return;
+	ext4_warning(sb, "insert group: %u to s_mb_largest_free_orders[%d] failed, err %d",
+		     grp->bb_group, new, ret);
 }
 
 static noinline_for_stack
@@ -3263,6 +3292,7 @@ static int ext4_mb_seq_structs_summary_show(struct seq_file *seq, void *v)
 	unsigned long position = ((unsigned long) v);
 	struct ext4_group_info *grp;
 	unsigned int count;
+	unsigned long idx;
 
 	position--;
 	if (position >= MB_NUM_ORDERS(sb)) {
@@ -3271,11 +3301,8 @@ static int ext4_mb_seq_structs_summary_show(struct seq_file *seq, void *v)
 			seq_puts(seq, "avg_fragment_size_lists:\n");
 
 		count = 0;
-		read_lock(&sbi->s_mb_avg_fragment_size_locks[position]);
-		list_for_each_entry(grp, &sbi->s_mb_avg_fragment_size[position],
-				    bb_avg_fragment_size_node)
+		xa_for_each(&sbi->s_mb_avg_fragment_size[position], idx, grp)
 			count++;
-		read_unlock(&sbi->s_mb_avg_fragment_size_locks[position]);
 		seq_printf(seq, "\tlist_order_%u_groups: %u\n",
 					(unsigned int)position, count);
 		return 0;
@@ -3287,11 +3314,8 @@ static int ext4_mb_seq_structs_summary_show(struct seq_file *seq, void *v)
 		seq_puts(seq, "max_free_order_lists:\n");
 	}
 	count = 0;
-	read_lock(&sbi->s_mb_largest_free_orders_locks[position]);
-	list_for_each_entry(grp, &sbi->s_mb_largest_free_orders[position],
-			    bb_largest_free_order_node)
+	xa_for_each(&sbi->s_mb_largest_free_orders[position], idx, grp)
 		count++;
-	read_unlock(&sbi->s_mb_largest_free_orders_locks[position]);
 	seq_printf(seq, "\tlist_order_%u_groups: %u\n",
 		   (unsigned int)position, count);
 
@@ -3411,8 +3435,6 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
 	INIT_LIST_HEAD(&meta_group_info[i]->bb_prealloc_list);
 	init_rwsem(&meta_group_info[i]->alloc_sem);
 	meta_group_info[i]->bb_free_root = RB_ROOT;
-	INIT_LIST_HEAD(&meta_group_info[i]->bb_largest_free_order_node);
-	INIT_LIST_HEAD(&meta_group_info[i]->bb_avg_fragment_size_node);
 	meta_group_info[i]->bb_largest_free_order = -1;  /* uninit */
 	meta_group_info[i]->bb_avg_fragment_size_order = -1;  /* uninit */
 	meta_group_info[i]->bb_largest_free_order_idx = -1;  /* uninit */
@@ -3623,6 +3645,20 @@ static void ext4_discard_work(struct work_struct *work)
 		ext4_mb_unload_buddy(&e4b);
 }
 
+static inline void ext4_mb_avg_fragment_size_destroy(struct ext4_sb_info *sbi)
+{
+	for (int i = 0; i < MB_NUM_ORDERS(sbi->s_sb); i++)
+		xa_destroy(&sbi->s_mb_avg_fragment_size[i]);
+	kfree(sbi->s_mb_avg_fragment_size);
+}
+
+static inline void ext4_mb_largest_free_orders_destroy(struct ext4_sb_info *sbi)
+{
+	for (int i = 0; i < MB_NUM_ORDERS(sbi->s_sb); i++)
+		xa_destroy(&sbi->s_mb_largest_free_orders[i]);
+	kfree(sbi->s_mb_largest_free_orders);
+}
+
 int ext4_mb_init(struct super_block *sb)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -3668,41 +3704,24 @@ int ext4_mb_init(struct super_block *sb)
 	} while (i < MB_NUM_ORDERS(sb));
 
 	sbi->s_mb_avg_fragment_size =
-		kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
+		kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct xarray),
 			GFP_KERNEL);
 	if (!sbi->s_mb_avg_fragment_size) {
 		ret = -ENOMEM;
 		goto out;
 	}
-	sbi->s_mb_avg_fragment_size_locks =
-		kmalloc_array(MB_NUM_ORDERS(sb), sizeof(rwlock_t),
-			GFP_KERNEL);
-	if (!sbi->s_mb_avg_fragment_size_locks) {
-		ret = -ENOMEM;
-		goto out;
-	}
-	for (i = 0; i < MB_NUM_ORDERS(sb); i++) {
-		INIT_LIST_HEAD(&sbi->s_mb_avg_fragment_size[i]);
-		rwlock_init(&sbi->s_mb_avg_fragment_size_locks[i]);
-	}
+	for (i = 0; i < MB_NUM_ORDERS(sb); i++)
+		xa_init(&sbi->s_mb_avg_fragment_size[i]);
+
 	sbi->s_mb_largest_free_orders =
-		kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
+		kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct xarray),
 			GFP_KERNEL);
 	if (!sbi->s_mb_largest_free_orders) {
 		ret = -ENOMEM;
 		goto out;
 	}
-	sbi->s_mb_largest_free_orders_locks =
-		kmalloc_array(MB_NUM_ORDERS(sb), sizeof(rwlock_t),
-			GFP_KERNEL);
-	if (!sbi->s_mb_largest_free_orders_locks) {
-		ret = -ENOMEM;
-		goto out;
-	}
-	for (i = 0; i < MB_NUM_ORDERS(sb); i++) {
-		INIT_LIST_HEAD(&sbi->s_mb_largest_free_orders[i]);
-		rwlock_init(&sbi->s_mb_largest_free_orders_locks[i]);
-	}
+	for (i = 0; i < MB_NUM_ORDERS(sb); i++)
+		xa_init(&sbi->s_mb_largest_free_orders[i]);
 
 	spin_lock_init(&sbi->s_md_lock);
 	atomic_set(&sbi->s_mb_free_pending, 0);
@@ -3785,10 +3804,8 @@ int ext4_mb_init(struct super_block *sb)
 	kvfree(sbi->s_mb_last_groups);
 	sbi->s_mb_last_groups = NULL;
 out:
-	kfree(sbi->s_mb_avg_fragment_size);
-	kfree(sbi->s_mb_avg_fragment_size_locks);
-	kfree(sbi->s_mb_largest_free_orders);
-	kfree(sbi->s_mb_largest_free_orders_locks);
+	ext4_mb_avg_fragment_size_destroy(sbi);
+	ext4_mb_largest_free_orders_destroy(sbi);
 	kfree(sbi->s_mb_offsets);
 	sbi->s_mb_offsets = NULL;
 	kfree(sbi->s_mb_maxs);
@@ -3855,10 +3872,8 @@ void ext4_mb_release(struct super_block *sb)
 		kvfree(group_info);
 		rcu_read_unlock();
 	}
-	kfree(sbi->s_mb_avg_fragment_size);
-	kfree(sbi->s_mb_avg_fragment_size_locks);
-	kfree(sbi->s_mb_largest_free_orders);
-	kfree(sbi->s_mb_largest_free_orders_locks);
+	ext4_mb_avg_fragment_size_destroy(sbi);
+	ext4_mb_largest_free_orders_destroy(sbi);
 	kfree(sbi->s_mb_offsets);
 	kfree(sbi->s_mb_maxs);
 	iput(sbi->s_buddy_cache);
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 15/16] ext4: refactor choose group to scan group
  2025-06-23  7:32 [PATCH v2 00/16] ext4: better scalability for ext4 block allocation Baokun Li
                   ` (13 preceding siblings ...)
  2025-06-23  7:33 ` [PATCH v2 14/16] ext4: convert free group lists to ordered xarrays Baokun Li
@ 2025-06-23  7:33 ` Baokun Li
  2025-06-23  7:33 ` [PATCH v2 16/16] ext4: ensure global ordered traversal across all free groups xarrays Baokun Li
  15 siblings, 0 replies; 51+ messages in thread
From: Baokun Li @ 2025-06-23  7:33 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, jack, adilger.kernel, ojaswin, linux-kernel, yi.zhang,
	yangerkun, libaokun1

This commit converts the `choose group` logic to `scan group` using
previously prepared helper functions. This allows us to leverage xarrays
for ordered non-linear traversal, thereby mitigating the "bouncing" issue
inherent in the `choose group` mechanism.

This also decouples linear and non-linear traversals, leading to cleaner
and more readable code.
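
Roughly, the resulting top-level flow is (condensed pseudo-C of the code in
the diff below, not a verbatim excerpt):

	ac->ac_criteria = ac->ac_2order ? CR_POWER2_ALIGNED : CR_GOAL_LEN_FAST;
	while (ac->ac_criteria < EXT4_MB_NUM_CRS) {
		/* a few linear groups first, then the per-criteria xarrays;
		 * the scan helpers bump ac_criteria when a level is exhausted */
		err = ext4_mb_scan_groups(ac);
		if (err || ac->ac_status != AC_STATUS_CONTINUE)
			break;
	}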

Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/mballoc.c | 310 ++++++++++++++++++++++++----------------------
 fs/ext4/mballoc.h |   1 -
 2 files changed, 159 insertions(+), 152 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 45c7717fcbbd..d8372a649a0c 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -432,6 +432,10 @@ static int ext4_try_to_trim_range(struct super_block *sb,
 		struct ext4_buddy *e4b, ext4_grpblk_t start,
 		ext4_grpblk_t max, ext4_grpblk_t minblocks);
 
+static int ext4_mb_scan_group(struct ext4_allocation_context *ac,
+			      ext4_group_t group);
+static void ext4_mb_might_prefetch(struct ext4_allocation_context *ac,
+				   ext4_group_t group);
 /*
  * The algorithm using this percpu seq counter goes below:
  * 1. We sample the percpu discard_pa_seq counter before trying for block
@@ -873,9 +877,8 @@ mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
 		     grp->bb_group, new, ret);
 }
 
-static struct ext4_group_info *
-ext4_mb_find_good_group_xarray(struct ext4_allocation_context *ac,
-			       struct xarray *xa, ext4_group_t start)
+static int ext4_mb_scan_groups_xarray(struct ext4_allocation_context *ac,
+				      struct xarray *xa, ext4_group_t start)
 {
 	struct super_block *sb = ac->ac_sb;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -886,17 +889,19 @@ ext4_mb_find_good_group_xarray(struct ext4_allocation_context *ac,
 	struct ext4_group_info *grp;
 
 	if (WARN_ON_ONCE(start >= ngroups))
-		return NULL;
+		return 0;
 	end = ngroups - 1;
 
 wrap_around:
 	xa_for_each_range(xa, group, grp, start, end) {
+		int err;
+
 		if (sbi->s_mb_stats)
 			atomic64_inc(&sbi->s_bal_cX_groups_considered[cr]);
 
-		if (!spin_is_locked(ext4_group_lock_ptr(sb, group)) &&
-		    likely(ext4_mb_good_group(ac, group, cr)))
-			return grp;
+		err = ext4_mb_scan_group(ac, grp->bb_group);
+		if (err || ac->ac_status != AC_STATUS_CONTINUE)
+			return err;
 
 		cond_resched();
 	}
@@ -907,95 +912,86 @@ ext4_mb_find_good_group_xarray(struct ext4_allocation_context *ac,
 		goto wrap_around;
 	}
 
-	return NULL;
+	return 0;
 }
 
 /*
  * Find a suitable group of given order from the largest free orders xarray.
  */
-static struct ext4_group_info *
-ext4_mb_find_good_group_largest_free_order(struct ext4_allocation_context *ac,
-					   int order, ext4_group_t start)
+static int
+ext4_mb_scan_groups_largest_free_order(struct ext4_allocation_context *ac,
+				       int order, ext4_group_t start)
 {
 	struct xarray *xa = &EXT4_SB(ac->ac_sb)->s_mb_largest_free_orders[order];
 
 	if (xa_empty(xa))
-		return NULL;
+		return 0;
 
-	return ext4_mb_find_good_group_xarray(ac, xa, start);
+	return ext4_mb_scan_groups_xarray(ac, xa, start);
 }
 
 /*
  * Choose next group by traversing largest_free_order lists. Updates *new_cr if
  * cr level needs an update.
  */
-static void ext4_mb_choose_next_group_p2_aligned(struct ext4_allocation_context *ac,
-			enum criteria *new_cr, ext4_group_t *group)
+static int ext4_mb_scan_groups_p2_aligned(struct ext4_allocation_context *ac,
+					  ext4_group_t group)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
-	struct ext4_group_info *grp;
 	int i;
+	int ret = 0;
 
-	if (ac->ac_status == AC_STATUS_FOUND)
-		return;
-
-	if (unlikely(sbi->s_mb_stats && ac->ac_flags & EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED))
-		atomic_inc(&sbi->s_bal_p2_aligned_bad_suggestions);
-
+	ac->ac_flags |= EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED;
 	for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
-		grp = ext4_mb_find_good_group_largest_free_order(ac, i, *group);
-		if (grp) {
-			*group = grp->bb_group;
-			ac->ac_flags |= EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED;
-			return;
-		}
+		ret = ext4_mb_scan_groups_largest_free_order(ac, i, group);
+		if (ret || ac->ac_status != AC_STATUS_CONTINUE)
+			goto out;
 	}
 
+	if (sbi->s_mb_stats)
+		atomic64_inc(&sbi->s_bal_cX_failed[ac->ac_criteria]);
+
 	/* Increment cr and search again if no group is found */
-	*new_cr = CR_GOAL_LEN_FAST;
+	ac->ac_criteria = CR_GOAL_LEN_FAST;
+out:
+	ac->ac_flags &= ~EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED;
+	return ret;
 }
 
 /*
  * Find a suitable group of given order from the average fragments xarray.
  */
-static struct ext4_group_info *
-ext4_mb_find_good_group_avg_frag_xarray(struct ext4_allocation_context *ac,
-					int order, ext4_group_t start)
+static int ext4_mb_scan_groups_avg_frag_order(struct ext4_allocation_context *ac,
+					      int order, ext4_group_t start)
 {
 	struct xarray *xa = &EXT4_SB(ac->ac_sb)->s_mb_avg_fragment_size[order];
 
 	if (xa_empty(xa))
-		return NULL;
+		return 0;
 
-	return ext4_mb_find_good_group_xarray(ac, xa, start);
+	return ext4_mb_scan_groups_xarray(ac, xa, start);
 }
 
 /*
  * Choose next group by traversing average fragment size list of suitable
  * order. Updates *new_cr if cr level needs an update.
  */
-static void ext4_mb_choose_next_group_goal_fast(struct ext4_allocation_context *ac,
-		enum criteria *new_cr, ext4_group_t *group)
+static int ext4_mb_scan_groups_goal_fast(struct ext4_allocation_context *ac,
+					 ext4_group_t group)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
-	struct ext4_group_info *grp = NULL;
-	int i;
-
-	if (unlikely(ac->ac_flags & EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED)) {
-		if (sbi->s_mb_stats)
-			atomic_inc(&sbi->s_bal_goal_fast_bad_suggestions);
-	}
+	int i, ret = 0;
 
-	for (i = mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len);
-	     i < MB_NUM_ORDERS(ac->ac_sb); i++) {
-		grp = ext4_mb_find_good_group_avg_frag_xarray(ac, i, *group);
-		if (grp) {
-			*group = grp->bb_group;
-			ac->ac_flags |= EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED;
-			return;
-		}
+	ac->ac_flags |= EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED;
+	i = mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len);
+	for (; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
+		ret = ext4_mb_scan_groups_avg_frag_order(ac, i, group);
+		if (ret || ac->ac_status != AC_STATUS_CONTINUE)
+			goto out;
 	}
 
+	if (sbi->s_mb_stats)
+		atomic64_inc(&sbi->s_bal_cX_failed[ac->ac_criteria]);
 	/*
 	 * CR_BEST_AVAIL_LEN works based on the concept that we have
 	 * a larger normalized goal len request which can be trimmed to
@@ -1005,9 +1001,12 @@ static void ext4_mb_choose_next_group_goal_fast(struct ext4_allocation_context *
 	 * See function ext4_mb_normalize_request() (EXT4_MB_HINT_DATA).
 	 */
 	if (ac->ac_flags & EXT4_MB_HINT_DATA)
-		*new_cr = CR_BEST_AVAIL_LEN;
+		ac->ac_criteria = CR_BEST_AVAIL_LEN;
 	else
-		*new_cr = CR_GOAL_LEN_SLOW;
+		ac->ac_criteria = CR_GOAL_LEN_SLOW;
+out:
+	ac->ac_flags &= ~EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED;
+	return ret;
 }
 
 /*
@@ -1019,19 +1018,14 @@ static void ext4_mb_choose_next_group_goal_fast(struct ext4_allocation_context *
  * preallocations. However, we make sure that we don't trim the request too
  * much and fall to CR_GOAL_LEN_SLOW in that case.
  */
-static void ext4_mb_choose_next_group_best_avail(struct ext4_allocation_context *ac,
-		enum criteria *new_cr, ext4_group_t *group)
+static int ext4_mb_scan_groups_best_avail(struct ext4_allocation_context *ac,
+					  ext4_group_t group)
 {
+	int ret = 0;
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
-	struct ext4_group_info *grp = NULL;
 	int i, order, min_order;
 	unsigned long num_stripe_clusters = 0;
 
-	if (unlikely(ac->ac_flags & EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED)) {
-		if (sbi->s_mb_stats)
-			atomic_inc(&sbi->s_bal_best_avail_bad_suggestions);
-	}
-
 	/*
 	 * mb_avg_fragment_size_order() returns order in a way that makes
 	 * retrieving back the length using (1 << order) inaccurate. Hence, use
@@ -1062,6 +1056,7 @@ static void ext4_mb_choose_next_group_best_avail(struct ext4_allocation_context
 	if (1 << min_order < ac->ac_o_ex.fe_len)
 		min_order = fls(ac->ac_o_ex.fe_len);
 
+	ac->ac_flags |= EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED;
 	for (i = order; i >= min_order; i--) {
 		int frag_order;
 		/*
@@ -1084,18 +1079,19 @@ static void ext4_mb_choose_next_group_best_avail(struct ext4_allocation_context
 		frag_order = mb_avg_fragment_size_order(ac->ac_sb,
 							ac->ac_g_ex.fe_len);
 
-		grp = ext4_mb_find_good_group_avg_frag_xarray(ac, frag_order,
-							      *group);
-		if (grp) {
-			*group = grp->bb_group;
-			ac->ac_flags |= EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED;
-			return;
-		}
+		ret = ext4_mb_scan_groups_avg_frag_order(ac, frag_order, group);
+		if (ret || ac->ac_status != AC_STATUS_CONTINUE)
+			goto out;
 	}
 
 	/* Reset goal length to original goal length before falling into CR_GOAL_LEN_SLOW */
 	ac->ac_g_ex.fe_len = ac->ac_orig_goal_len;
-	*new_cr = CR_GOAL_LEN_SLOW;
+	if (sbi->s_mb_stats)
+		atomic64_inc(&sbi->s_bal_cX_failed[ac->ac_criteria]);
+	ac->ac_criteria = CR_GOAL_LEN_SLOW;
+out:
+	ac->ac_flags &= ~EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED;
+	return ret;
 }
 
 static inline int should_optimize_scan(struct ext4_allocation_context *ac)
@@ -1110,59 +1106,87 @@ static inline int should_optimize_scan(struct ext4_allocation_context *ac)
 }
 
 /*
- * Return next linear group for allocation.
+ * Advance *group to the next linear group for allocation.
  */
-static ext4_group_t
-next_linear_group(ext4_group_t group, ext4_group_t ngroups)
+static void next_linear_group(ext4_group_t *group, ext4_group_t ngroups)
 {
 	/*
 	 * Artificially restricted ngroups for non-extent
 	 * files makes group > ngroups possible on first loop.
 	 */
-	return group + 1 >= ngroups ? 0 : group + 1;
+	*group =  *group + 1 >= ngroups ? 0 : *group + 1;
+}
+
+static int ext4_mb_scan_groups_linear(struct ext4_allocation_context *ac,
+		ext4_group_t ngroups, ext4_group_t *start, ext4_group_t count)
+{
+	int ret, i;
+	enum criteria cr = ac->ac_criteria;
+	struct super_block *sb = ac->ac_sb;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	ext4_group_t group = *start;
+
+	for (i = 0; i < count; i++, next_linear_group(&group, ngroups)) {
+		ret = ext4_mb_scan_group(ac, group);
+		if (ret || ac->ac_status != AC_STATUS_CONTINUE)
+			return ret;
+		cond_resched();
+	}
+
+	*start = group;
+	if (count == ngroups)
+		ac->ac_criteria++;
+
+	/* Processed all groups and haven't found blocks */
+	if (sbi->s_mb_stats && i == ngroups)
+		atomic64_inc(&sbi->s_bal_cX_failed[cr]);
+
+	return 0;
 }
 
 /*
- * ext4_mb_choose_next_group: choose next group for allocation.
+ * ext4_mb_scan_groups: scan groups and try to allocate under the current criteria.
  *
  * @ac        Allocation Context
- * @new_cr    This is an output parameter. If the there is no good group
- *            available at current CR level, this field is updated to indicate
- *            the new cr level that should be used.
- * @group     This is an input / output parameter. As an input it indicates the
- *            next group that the allocator intends to use for allocation. As
- *            output, this field indicates the next group that should be used as
- *            determined by the optimization functions.
- * @ngroups   Total number of groups
  */
-static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
-		enum criteria *new_cr, ext4_group_t *group, ext4_group_t ngroups)
+static int ext4_mb_scan_groups(struct ext4_allocation_context *ac)
 {
-	*new_cr = ac->ac_criteria;
+	int ret = 0;
+	ext4_group_t start;
+	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+	ext4_group_t ngroups = ext4_get_groups_count(ac->ac_sb);
 
-	if (!should_optimize_scan(ac)) {
-		*group = next_linear_group(*group, ngroups);
-		return;
-	}
+	/* non-extent files are limited to low blocks/groups */
+	if (!(ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS)))
+		ngroups = sbi->s_blockfile_groups;
+
+	/* searching for the right group start from the goal value specified */
+	start = ac->ac_g_ex.fe_group;
+	ac->ac_prefetch_grp = start;
+	ac->ac_prefetch_nr = 0;
+
+	if (!should_optimize_scan(ac))
+		return ext4_mb_scan_groups_linear(ac, ngroups, &start, ngroups);
 
 	/*
 	 * Optimized scanning can return non adjacent groups which can cause
 	 * seek overhead for rotational disks. So try few linear groups before
 	 * trying optimized scan.
 	 */
-	if (ac->ac_groups_linear_remaining) {
-		*group = next_linear_group(*group, ngroups);
-		ac->ac_groups_linear_remaining--;
-		return;
-	}
+	if (sbi->s_mb_max_linear_groups)
+		ret = ext4_mb_scan_groups_linear(ac, ngroups, &start,
+						 sbi->s_mb_max_linear_groups);
+	if (ret || ac->ac_status != AC_STATUS_CONTINUE)
+		return ret;
 
-	if (*new_cr == CR_POWER2_ALIGNED) {
-		ext4_mb_choose_next_group_p2_aligned(ac, new_cr, group);
-	} else if (*new_cr == CR_GOAL_LEN_FAST) {
-		ext4_mb_choose_next_group_goal_fast(ac, new_cr, group);
-	} else if (*new_cr == CR_BEST_AVAIL_LEN) {
-		ext4_mb_choose_next_group_best_avail(ac, new_cr, group);
-	} else {
+	switch (ac->ac_criteria) {
+	case CR_POWER2_ALIGNED:
+		return ext4_mb_scan_groups_p2_aligned(ac, start);
+	case CR_GOAL_LEN_FAST:
+		return ext4_mb_scan_groups_goal_fast(ac, start);
+	case CR_BEST_AVAIL_LEN:
+		return ext4_mb_scan_groups_best_avail(ac, start);
+	default:
 		/*
 		 * TODO: For CR_GOAL_LEN_SLOW, we can arrange groups in an
 		 * rb tree sorted by bb_free. But until that happens, we should
@@ -1170,6 +1194,8 @@ static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
 		 */
 		WARN_ON(1);
 	}
+
+	return 0;
 }
 
 /*
@@ -2875,6 +2901,18 @@ void ext4_mb_prefetch_fini(struct super_block *sb, ext4_group_t group,
 	}
 }
 
+static inline void ac_inc_bad_suggestions(struct ext4_allocation_context *ac)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+
+	if (ac->ac_flags & EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED)
+		atomic_inc(&sbi->s_bal_p2_aligned_bad_suggestions);
+	else if (ac->ac_flags & EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED)
+		atomic_inc(&sbi->s_bal_goal_fast_bad_suggestions);
+	else if (ac->ac_flags & EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED)
+		atomic_inc(&sbi->s_bal_best_avail_bad_suggestions);
+}
+
 static int ext4_mb_scan_group(struct ext4_allocation_context *ac,
 			      ext4_group_t group)
 {
@@ -2893,7 +2931,8 @@ static int ext4_mb_scan_group(struct ext4_allocation_context *ac,
 	if (ret <= 0) {
 		if (!ac->ac_first_err)
 			ac->ac_first_err = ret;
-		return 0;
+		ret = 0;
+		goto out;
 	}
 
 	ret = ext4_mb_load_buddy(sb, group, ac->ac_e4b);
@@ -2916,26 +2955,20 @@ static int ext4_mb_scan_group(struct ext4_allocation_context *ac,
 	ext4_unlock_group(sb, group);
 out_unload:
 	ext4_mb_unload_buddy(ac->ac_e4b);
+out:
+	if (EXT4_SB(sb)->s_mb_stats && ac->ac_status == AC_STATUS_CONTINUE)
+		ac_inc_bad_suggestions(ac);
 	return ret;
 }
 
 static noinline_for_stack int
 ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 {
-	ext4_group_t ngroups, group, i;
-	enum criteria new_cr, cr = CR_GOAL_LEN_FAST;
+	ext4_group_t i;
 	int err = 0;
-	struct ext4_sb_info *sbi;
-	struct super_block *sb;
+	struct super_block *sb = ac->ac_sb;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	struct ext4_buddy e4b;
-	int lost;
-
-	sb = ac->ac_sb;
-	sbi = EXT4_SB(sb);
-	ngroups = ext4_get_groups_count(sb);
-	/* non-extent files are limited to low blocks/groups */
-	if (!(ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS)))
-		ngroups = sbi->s_blockfile_groups;
 
 	BUG_ON(ac->ac_status == AC_STATUS_FOUND);
 
@@ -2980,48 +3013,21 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 	 * start with CR_GOAL_LEN_FAST, unless it is power of 2
 	 * aligned, in which case let's do that faster approach first.
 	 */
+	ac->ac_criteria = CR_GOAL_LEN_FAST;
 	if (ac->ac_2order)
-		cr = CR_POWER2_ALIGNED;
+		ac->ac_criteria = CR_POWER2_ALIGNED;
 
 	ac->ac_e4b = &e4b;
 	ac->ac_prefetch_ios = 0;
 	ac->ac_first_err = 0;
 repeat:
-	for (; cr < EXT4_MB_NUM_CRS && ac->ac_status == AC_STATUS_CONTINUE; cr++) {
-		ac->ac_criteria = cr;
-		/*
-		 * searching for the right group start
-		 * from the goal value specified
-		 */
-		group = ac->ac_g_ex.fe_group;
-		ac->ac_groups_linear_remaining = sbi->s_mb_max_linear_groups;
-		ac->ac_prefetch_grp = group;
-		ac->ac_prefetch_nr = 0;
-
-		for (i = 0, new_cr = cr; i < ngroups; i++,
-		     ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups)) {
-
-			cond_resched();
-			if (new_cr != cr) {
-				cr = new_cr;
-				goto repeat;
-			}
-
-			err = ext4_mb_scan_group(ac, group);
-			if (err)
-				goto out;
-
-			if (ac->ac_status != AC_STATUS_CONTINUE)
-				break;
-		}
-		/* Processed all groups and haven't found blocks */
-		if (sbi->s_mb_stats && i == ngroups)
-			atomic64_inc(&sbi->s_bal_cX_failed[cr]);
+	while (ac->ac_criteria < EXT4_MB_NUM_CRS) {
+		err = ext4_mb_scan_groups(ac);
+		if (err)
+			goto out;
 
-		if (i == ngroups && ac->ac_criteria == CR_BEST_AVAIL_LEN)
-			/* Reset goal length to original goal length before
-			 * falling into CR_GOAL_LEN_SLOW */
-			ac->ac_g_ex.fe_len = ac->ac_orig_goal_len;
+		if (ac->ac_status != AC_STATUS_CONTINUE)
+			break;
 	}
 
 	if (ac->ac_b_ex.fe_len > 0 && ac->ac_status != AC_STATUS_FOUND &&
@@ -3032,6 +3038,8 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 		 */
 		ext4_mb_try_best_found(ac, &e4b);
 		if (ac->ac_status != AC_STATUS_FOUND) {
+			int lost;
+
 			/*
 			 * Someone more lucky has already allocated it.
 			 * The only thing we can do is just take first
@@ -3047,7 +3055,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 			ac->ac_b_ex.fe_len = 0;
 			ac->ac_status = AC_STATUS_CONTINUE;
 			ac->ac_flags |= EXT4_MB_HINT_FIRST;
-			cr = CR_ANY_FREE;
+			ac->ac_criteria = CR_ANY_FREE;
 			goto repeat;
 		}
 	}
@@ -3060,7 +3068,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 
 	mb_debug(sb, "Best len %d, origin len %d, ac_status %u, ac_flags 0x%x, cr %d ret %d\n",
 		 ac->ac_b_ex.fe_len, ac->ac_o_ex.fe_len, ac->ac_status,
-		 ac->ac_flags, cr, err);
+		 ac->ac_flags, ac->ac_criteria, err);
 
 	if (ac->ac_prefetch_nr)
 		ext4_mb_prefetch_fini(sb, ac->ac_prefetch_grp, ac->ac_prefetch_nr);
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 721aaea1f83e..65713b847385 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -208,7 +208,6 @@ struct ext4_allocation_context {
 	int ac_first_err;
 
 	__u32 ac_flags;		/* allocation hints */
-	__u32 ac_groups_linear_remaining;
 	__u16 ac_groups_scanned;
 	__u16 ac_found;
 	__u16 ac_cX_found[EXT4_MB_NUM_CRS];
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 16/16] ext4: ensure global ordered traversal across all free groups xarrays
  2025-06-23  7:32 [PATCH v2 00/16] ext4: better scalability for ext4 block allocation Baokun Li
                   ` (14 preceding siblings ...)
  2025-06-23  7:33 ` [PATCH v2 15/16] ext4: refactor choose group to scan group Baokun Li
@ 2025-06-23  7:33 ` Baokun Li
  15 siblings, 0 replies; 51+ messages in thread
From: Baokun Li @ 2025-06-23  7:33 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, jack, adilger.kernel, ojaswin, linux-kernel, yi.zhang,
	yangerkun, libaokun1

Although we now perform ordered traversal within an xarray, this is currently
limited to a single xarray: we scan the groups to the right of the start group
(i.e. higher group numbers) first and then those to its left. However, we have
multiple such xarrays, which prevents us from guaranteeing a linear-like
traversal where all groups on the right are visited before all groups on the
left.

Therefore, this change modifies the traversal to first iterate through
all right groups across all xarrays, and then all left groups across all
xarrays. This achieves a linear-like effect, mitigating contention
between block allocation and block freeing paths.
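
As a sketch, the per-criteria helpers now share this two-pass shape
(simplified pseudo-C; scan_xa_range() stands in for the
ext4_mb_scan_groups_*_range() helpers added below):

	start = group;					/* goal group */
	end = ext4_get_groups_count(sb);
wrap_around:
	for (i = first_order; i < nr_orders; i++) {
		ret = scan_xa_range(&xa[i], start, end - 1);	/* inclusive end */
		if (ret || ac->ac_status != AC_STATUS_CONTINUE)
			goto out;
	}
	if (start) {
		/* second pass: groups before the start point, again across
		 * every xarray, so all "right" groups are tried first */
		end = start;
		start = 0;
		goto wrap_around;
	}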

Performance test data follows:

CPU: Kunpeng 920   |          P80            |            P1           |
Memory: 512GB      |-------------------------|-------------------------|
Disk: 960GB SSD    | base  |    patched      | base  |    patched      |
-------------------|-------|-----------------|-------|-----------------|
mb_optimize_scan=0 | 20976 | 20619  (-1.7%)  | 319396| 299238 (-6.3%)  |
mb_optimize_scan=1 | 14580 | 20119  (+37.9%) | 319237| 315268 (-1.2%)  |

CPU: AMD 9654 * 2  |          P96            |            P1           |
Memory: 1536GB     |-------------------------|-------------------------|
Disk: 960GB SSD    | base  |    patched      | base  |    patched      |
-------------------|-------|-----------------|-------|-----------------|
mb_optimize_scan=0 | 51713 | 51983 (+0.5%)   | 206655| 207033 (0.18%)  |
mb_optimize_scan=1 | 35527 | 48486 (+36.4%)  | 212574| 202415 (+4.7%)  |

Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/mballoc.c | 69 ++++++++++++++++++++++++++++++++---------------
 1 file changed, 47 insertions(+), 22 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index d8372a649a0c..d26a0e8e3f7e 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -877,22 +877,20 @@ mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
 		     grp->bb_group, new, ret);
 }
 
-static int ext4_mb_scan_groups_xarray(struct ext4_allocation_context *ac,
-				      struct xarray *xa, ext4_group_t start)
+static int ext4_mb_scan_groups_xa_range(struct ext4_allocation_context *ac,
+					struct xarray *xa,
+					ext4_group_t start, ext4_group_t end)
 {
 	struct super_block *sb = ac->ac_sb;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	enum criteria cr = ac->ac_criteria;
 	ext4_group_t ngroups = ext4_get_groups_count(sb);
 	unsigned long group = start;
-	ext4_group_t end;
 	struct ext4_group_info *grp;
 
-	if (WARN_ON_ONCE(start >= ngroups))
+	if (WARN_ON_ONCE(end >= ngroups || start > end))
 		return 0;
-	end = ngroups - 1;
 
-wrap_around:
 	xa_for_each_range(xa, group, grp, start, end) {
 		int err;
 
@@ -906,28 +904,23 @@ static int ext4_mb_scan_groups_xarray(struct ext4_allocation_context *ac,
 		cond_resched();
 	}
 
-	if (start) {
-		end = start - 1;
-		start = 0;
-		goto wrap_around;
-	}
-
 	return 0;
 }
 
 /*
  * Find a suitable group of given order from the largest free orders xarray.
  */
-static int
-ext4_mb_scan_groups_largest_free_order(struct ext4_allocation_context *ac,
-				       int order, ext4_group_t start)
+static inline int
+ext4_mb_scan_groups_largest_free_order_range(struct ext4_allocation_context *ac,
+					     int order, ext4_group_t start,
+					     ext4_group_t end)
 {
 	struct xarray *xa = &EXT4_SB(ac->ac_sb)->s_mb_largest_free_orders[order];
 
 	if (xa_empty(xa))
 		return 0;
 
-	return ext4_mb_scan_groups_xarray(ac, xa, start);
+	return ext4_mb_scan_groups_xa_range(ac, xa, start, end - 1);
 }
 
 /*
@@ -940,13 +933,23 @@ static int ext4_mb_scan_groups_p2_aligned(struct ext4_allocation_context *ac,
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
 	int i;
 	int ret = 0;
+	ext4_group_t start, end;
 
 	ac->ac_flags |= EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED;
+	start = group;
+	end = ext4_get_groups_count(ac->ac_sb);
+wrap_around:
 	for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
-		ret = ext4_mb_scan_groups_largest_free_order(ac, i, group);
+		ret = ext4_mb_scan_groups_largest_free_order_range(ac, i,
+								   start, end);
 		if (ret || ac->ac_status != AC_STATUS_CONTINUE)
 			goto out;
 	}
+	if (start) {
+		end = start;
+		start = 0;
+		goto wrap_around;
+	}
 
 	if (sbi->s_mb_stats)
 		atomic64_inc(&sbi->s_bal_cX_failed[ac->ac_criteria]);
@@ -961,15 +964,17 @@ static int ext4_mb_scan_groups_p2_aligned(struct ext4_allocation_context *ac,
 /*
  * Find a suitable group of given order from the average fragments xarray.
  */
-static int ext4_mb_scan_groups_avg_frag_order(struct ext4_allocation_context *ac,
-					      int order, ext4_group_t start)
+static int
+ext4_mb_scan_groups_avg_frag_order_range(struct ext4_allocation_context *ac,
+					 int order, ext4_group_t start,
+					 ext4_group_t end)
 {
 	struct xarray *xa = &EXT4_SB(ac->ac_sb)->s_mb_avg_fragment_size[order];
 
 	if (xa_empty(xa))
 		return 0;
 
-	return ext4_mb_scan_groups_xarray(ac, xa, start);
+	return ext4_mb_scan_groups_xa_range(ac, xa, start, end - 1);
 }
 
 /*
@@ -981,14 +986,24 @@ static int ext4_mb_scan_groups_goal_fast(struct ext4_allocation_context *ac,
 {
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
 	int i, ret = 0;
+	ext4_group_t start, end;
 
 	ac->ac_flags |= EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED;
+	start = group;
+	end = ext4_get_groups_count(ac->ac_sb);
+wrap_around:
 	i = mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len);
 	for (; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
-		ret = ext4_mb_scan_groups_avg_frag_order(ac, i, group);
+		ret = ext4_mb_scan_groups_avg_frag_order_range(ac, i,
+							       start, end);
 		if (ret || ac->ac_status != AC_STATUS_CONTINUE)
 			goto out;
 	}
+	if (start) {
+		end = start;
+		start = 0;
+		goto wrap_around;
+	}
 
 	if (sbi->s_mb_stats)
 		atomic64_inc(&sbi->s_bal_cX_failed[ac->ac_criteria]);
@@ -1025,6 +1040,7 @@ static int ext4_mb_scan_groups_best_avail(struct ext4_allocation_context *ac,
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
 	int i, order, min_order;
 	unsigned long num_stripe_clusters = 0;
+	ext4_group_t start, end;
 
 	/*
 	 * mb_avg_fragment_size_order() returns order in a way that makes
@@ -1057,6 +1073,9 @@ static int ext4_mb_scan_groups_best_avail(struct ext4_allocation_context *ac,
 		min_order = fls(ac->ac_o_ex.fe_len);
 
 	ac->ac_flags |= EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED;
+	start = group;
+	end = ext4_get_groups_count(ac->ac_sb);
+wrap_around:
 	for (i = order; i >= min_order; i--) {
 		int frag_order;
 		/*
@@ -1079,10 +1098,16 @@ static int ext4_mb_scan_groups_best_avail(struct ext4_allocation_context *ac,
 		frag_order = mb_avg_fragment_size_order(ac->ac_sb,
 							ac->ac_g_ex.fe_len);
 
-		ret = ext4_mb_scan_groups_avg_frag_order(ac, frag_order, group);
+		ret = ext4_mb_scan_groups_avg_frag_order_range(ac, frag_order,
+							       start, end);
 		if (ret || ac->ac_status != AC_STATUS_CONTINUE)
 			goto out;
 	}
+	if (start) {
+		end = start;
+		start = 0;
+		goto wrap_around;
+	}
 
 	/* Reset goal length to original goal length before falling into CR_GOAL_LEN_SLOW */
 	ac->ac_g_ex.fe_len = ac->ac_orig_goal_len;
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 01/16] ext4: add ext4_try_lock_group() to skip busy groups
  2025-06-23  7:32 ` [PATCH v2 01/16] ext4: add ext4_try_lock_group() to skip busy groups Baokun Li
@ 2025-06-27 18:06   ` Jan Kara
  2025-07-14  6:53   ` Ojaswin Mujoo
  1 sibling, 0 replies; 51+ messages in thread
From: Jan Kara @ 2025-06-27 18:06 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, tytso, jack, adilger.kernel, ojaswin, linux-kernel,
	yi.zhang, yangerkun

On Mon 23-06-25 15:32:49, Baokun Li wrote:
> When ext4 allocates blocks, we used to just go through the block groups
> one by one to find a good one. But when there are tons of block groups
> (like hundreds of thousands or even millions) and not many have free space
> (meaning they're mostly full), it takes a really long time to check them
> all, and performance gets bad. So, we added the "mb_optimize_scan" mount
> option (which is on by default now). It keeps track of some group lists,
> so when we need a free block, we can just grab a likely group from the
> right list. This saves time and makes block allocation much faster.
> 
> But when multiple processes or containers are doing similar things, like
> constantly allocating 8k blocks, they all try to use the same block group
> in the same list. Even just two processes doing this can cut the IOPS in
> half. For example, one container might do 300,000 IOPS, but if you run two
> at the same time, the total is only 150,000.
> 
> Since we can already look at block groups in a non-linear way, the first
> and last groups in the same list are basically the same for finding a block
> right now. Therefore, add an ext4_try_lock_group() helper function to skip
> the current group when it is locked by another process, thereby avoiding
> contention with other processes. This helps ext4 make better use of having
> multiple block groups.
> 
> Also, to make sure we don't skip all the groups that have free space
> when allocating blocks, we won't try to skip busy groups anymore when
> ac_criteria is CR_ANY_FREE.
> 
> Performance test data follows:
> 
> Test: Running will-it-scale/fallocate2 on CPU-bound containers.
> Observation: Average fallocate operations per container per second.
> 
>                    | Kunpeng 920 / 512GB -P80| AMD 9654 / 1536GB -P96  |
>  Disk: 960GB SSD   |-------------------------|-------------------------|
>                    | base  |    patched      | base  |    patched      |
> -------------------|-------|-----------------|-------|-----------------|
> mb_optimize_scan=0 | 2667  | 4821  (+80.7%)  | 3450  | 15371 (+345%)   |
> mb_optimize_scan=1 | 2643  | 4784  (+81.0%)  | 3209  | 6101  (+90.0%)  |
> 
> Signed-off-by: Baokun Li <libaokun1@huawei.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
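
For illustration, the idea described in the changelog above could look roughly like the sketch below; the actual helper and call site in the patch may differ in detail. ext4_group_lock_ptr(), ext4_lock_group() and ext4_unlock_group() are the existing group-lock accessors; example_scan_one_group() is a made-up caller name.

/* Sketch: take the group lock only if it is immediately available. */
static inline bool ext4_try_lock_group(struct super_block *sb, ext4_group_t group)
{
	return spin_trylock(ext4_group_lock_ptr(sb, group));
}

/*
 * Sketch of a caller: skip busy groups in the earlier criteria, but fall
 * back to a blocking lock in the final CR_ANY_FREE pass so we cannot skip
 * the only groups that still have free blocks.
 */
static bool example_scan_one_group(struct ext4_allocation_context *ac,
				   struct super_block *sb, ext4_group_t group)
{
	if (ac->ac_criteria < CR_ANY_FREE) {
		if (!ext4_try_lock_group(sb, group))
			return false;	/* busy, move on to the next group */
	} else {
		ext4_lock_group(sb, group);
	}
	/* ... scan the group's buddy data here ... */
	ext4_unlock_group(sb, group);
	return true;
}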


* Re: [PATCH v2 02/16] ext4: remove unnecessary s_mb_last_start
  2025-06-23  7:32 ` [PATCH v2 02/16] ext4: remove unnecessary s_mb_last_start Baokun Li
@ 2025-06-27 18:15   ` Jan Kara
  2025-06-30  3:32     ` Baokun Li
  0 siblings, 1 reply; 51+ messages in thread
From: Jan Kara @ 2025-06-27 18:15 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, tytso, jack, adilger.kernel, ojaswin, linux-kernel,
	yi.zhang, yangerkun

On Mon 23-06-25 15:32:50, Baokun Li wrote:
> ac->ac_g_ex.fe_start is only used in ext4_mb_find_by_goal(), but STREAM
> ALLOC is activated after ext4_mb_find_by_goal() fails, so there's no need
> to update ac->ac_g_ex.fe_start, remove the unnecessary s_mb_last_start.
> 
> Signed-off-by: Baokun Li <libaokun1@huawei.com>

I'd just note that ac->ac_g_ex.fe_start is also used in
ext4_mb_collect_stats() so this change may impact the statistics gathered
there. OTOH it is questionable whether we even want to account streaming
allocation as a goal hit... Anyway, I'm fine with this, I'd just mention it
in the changelog.

Also one nit below but feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

> @@ -2849,7 +2848,6 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>  		/* TBD: may be hot point */
>  		spin_lock(&sbi->s_md_lock);
>  		ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
> -		ac->ac_g_ex.fe_start = sbi->s_mb_last_start;

Maybe reset ac->ac_g_ex.fe_start to 0 instead of leaving it at some random
value? Just for the sake of defensive programming...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group
  2025-06-23  7:32 ` [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group Baokun Li
@ 2025-06-27 18:19   ` Jan Kara
  2025-06-30  3:48     ` Baokun Li
  2025-07-01  2:57   ` kernel test robot
  1 sibling, 1 reply; 51+ messages in thread
From: Jan Kara @ 2025-06-27 18:19 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, tytso, jack, adilger.kernel, ojaswin, linux-kernel,
	yi.zhang, yangerkun

On Mon 23-06-25 15:32:51, Baokun Li wrote:
> After we optimized the block group lock, we found another lock
> contention issue when running will-it-scale/fallocate2 with multiple
> processes. The fallocate's block allocation and the truncate's block
> release were fighting over the s_md_lock. The problem is, this lock
> protects totally different things in those two processes: the list of
> freed data blocks (s_freed_data_list) when releasing, and where to start
> looking for new blocks (mb_last_group) when allocating.
> 
> Now we only need to track s_mb_last_group and no longer need to track
> s_mb_last_start, so we don't need the s_md_lock lock to ensure that the
> two are consistent, and we can ensure that the s_mb_last_group read is up
> to date by using smp_store_release/smp_load_acquire.
> 
> Besides, the s_mb_last_group data type only requires ext4_group_t
> (i.e., unsigned int), rendering unsigned long superfluous.
> 
> Performance test data follows:
> 
> Test: Running will-it-scale/fallocate2 on CPU-bound containers.
> Observation: Average fallocate operations per container per second.
> 
>                    | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
>  Disk: 960GB SSD   |-------------------------|-------------------------|
>                    | base  |    patched      | base  |    patched      |
> -------------------|-------|-----------------|-------|-----------------|
> mb_optimize_scan=0 | 4821  | 7612  (+57.8%)  | 15371 | 21647 (+40.8%)  |
> mb_optimize_scan=1 | 4784  | 7568  (+58.1%)  | 6101  | 9117  (+49.4%)  |
> 
> Signed-off-by: Baokun Li <libaokun1@huawei.com>

...

> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 5cdae3bda072..3f103919868b 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -2168,11 +2168,9 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
>  	ac->ac_buddy_folio = e4b->bd_buddy_folio;
>  	folio_get(ac->ac_buddy_folio);
>  	/* store last allocated for subsequent stream allocation */
> -	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
> -		spin_lock(&sbi->s_md_lock);
> -		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
> -		spin_unlock(&sbi->s_md_lock);
> -	}
> +	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
> +		/* pairs with smp_load_acquire in ext4_mb_regular_allocator() */
> +		smp_store_release(&sbi->s_mb_last_group, ac->ac_f_ex.fe_group);

Do you really need any kind of barrier (implied by smp_store_release())
here? I mean the store to s_mb_last_group is perfectly fine to be reordered
with other accesses from the thread, isn't it? As such it should be enough
to have WRITE_ONCE() here...

>  	/*
>  	 * As we've just preallocated more space than
>  	 * user requested originally, we store allocated
> @@ -2844,12 +2842,9 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>  	}
>  
>  	/* if stream allocation is enabled, use global goal */
> -	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
> -		/* TBD: may be hot point */
> -		spin_lock(&sbi->s_md_lock);
> -		ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
> -		spin_unlock(&sbi->s_md_lock);
> -	}
> +	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
> +		/* pairs with smp_store_release in ext4_mb_use_best_found() */
> +		ac->ac_g_ex.fe_group = smp_load_acquire(&sbi->s_mb_last_group);

... and READ_ONCE() here.
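
In other words, the suggested variant would amount to roughly this (sketch, not a tested patch):

	/* in ext4_mb_use_best_found() */
	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
		WRITE_ONCE(sbi->s_mb_last_group, ac->ac_f_ex.fe_group);

	/* in ext4_mb_regular_allocator() */
	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
		ac->ac_g_ex.fe_group = READ_ONCE(sbi->s_mb_last_group);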

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2 04/16] ext4: utilize multiple global goals to reduce contention
  2025-06-23  7:32 ` [PATCH v2 04/16] ext4: utilize multiple global goals to reduce contention Baokun Li
@ 2025-06-27 18:31   ` Jan Kara
  2025-06-30  6:50     ` Baokun Li
  0 siblings, 1 reply; 51+ messages in thread
From: Jan Kara @ 2025-06-27 18:31 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, tytso, jack, adilger.kernel, ojaswin, linux-kernel,
	yi.zhang, yangerkun

On Mon 23-06-25 15:32:52, Baokun Li wrote:
> When allocating data blocks, if the first try (goal allocation) fails and
> stream allocation is on, it tries a global goal starting from the last
> group we used (s_mb_last_group). This helps cluster large files together
> to reduce free space fragmentation, and the data block contiguity also
> accelerates write-back to disk.
> 
> However, when multiple processes allocate blocks, having just one global
> goal means they all fight over the same group. This drastically lowers
> the chances of extents merging and leads to much worse file fragmentation.
> 
> To mitigate this multi-process contention, we now employ multiple global
> goals, with the number of goals being the CPU count rounded up to the
> nearest power of 2. To ensure a consistent goal for each inode, we select
> the corresponding goal by taking the inode number modulo the total number
> of goals.
> 
> Performance test data follows:
> 
> Test: Running will-it-scale/fallocate2 on CPU-bound containers.
> Observation: Average fallocate operations per container per second.
> 
>                    | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
>  Disk: 960GB SSD   |-------------------------|-------------------------|
>                    | base  |    patched      | base  |    patched      |
> -------------------|-------|-----------------|-------|-----------------|
> mb_optimize_scan=0 | 7612  | 19699 (+158%)   | 21647 | 53093 (+145%)   |
> mb_optimize_scan=1 | 7568  | 9862  (+30.3%)  | 9117  | 14401 (+57.9%)  |
> 
> Signed-off-by: Baokun Li <libaokun1@huawei.com>

...

> +/*
> + * Number of mb last groups
> + */
> +#ifdef CONFIG_SMP
> +#define MB_LAST_GROUPS roundup_pow_of_two(nr_cpu_ids)
> +#else
> +#define MB_LAST_GROUPS 1
> +#endif
> +

I think this is too aggressive. nr_cpu_ids is easily 4096 or similar for
distribution kernels (it is just a theoretical maximum for the number of
CPUs the kernel can support) which seems like far too much for small
filesystems with say 100 block groups. I'd rather pick the array size like:

min(num_possible_cpus(), sbi->s_groups_count/4)

to

a) don't have too many slots so we still concentrate big allocations in
somewhat limited area of the filesystem (a quarter of block groups here).

b) have at most one slot per CPU the machine hardware can in principle
support.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2 07/16] ext4: convert sbi->s_mb_free_pending to atomic_t
  2025-06-23  7:32 ` [PATCH v2 07/16] ext4: convert sbi->s_mb_free_pending to atomic_t Baokun Li
@ 2025-06-27 18:33   ` Jan Kara
  0 siblings, 0 replies; 51+ messages in thread
From: Jan Kara @ 2025-06-27 18:33 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, tytso, jack, adilger.kernel, ojaswin, linux-kernel,
	yi.zhang, yangerkun

On Mon 23-06-25 15:32:55, Baokun Li wrote:
> Previously, s_md_lock was used to protect s_mb_free_pending during
> modifications, while smp_mb() ensured fresh reads, so s_md_lock just
> guaranteed the atomicity of s_mb_free_pending. Thus we optimized it by
> converting s_mb_free_pending into an atomic variable, thereby eliminating
> s_md_lock and minimizing lock contention. This also prepares for future
> lockless merging of free extents.
> 
> Following this modification, s_md_lock is exclusively responsible for
> managing insertions and deletions within s_freed_data_list, along with
> operations involving list_splice.
> 
> Performance test data follows:
> 
> Test: Running will-it-scale/fallocate2 on CPU-bound containers.
> Observation: Average fallocate operations per container per second.
> 
>                    | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
>  Disk: 960GB SSD   |-------------------------|-------------------------|
>                    | base  |    patched      | base  |    patched      |
> -------------------|-------|-----------------|-------|-----------------|
> mb_optimize_scan=0 | 19699 | 20982 (+6.5%)   | 53093 | 50629 (-4.6%)   |
> mb_optimize_scan=1 | 9862  | 10703 (+8.5%)   | 14401 | 14856 (+3.1%)   |
> 
> Signed-off-by: Baokun Li <libaokun1@huawei.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/balloc.c  | 2 +-
>  fs/ext4/ext4.h    | 2 +-
>  fs/ext4/mballoc.c | 9 +++------
>  3 files changed, 5 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
> index c48fd36b2d74..c9329ed5c094 100644
> --- a/fs/ext4/balloc.c
> +++ b/fs/ext4/balloc.c
> @@ -703,7 +703,7 @@ int ext4_should_retry_alloc(struct super_block *sb, int *retries)
>  	 * possible we just missed a transaction commit that did so
>  	 */
>  	smp_mb();
> -	if (sbi->s_mb_free_pending == 0) {
> +	if (atomic_read(&sbi->s_mb_free_pending) == 0) {
>  		if (test_opt(sb, DISCARD)) {
>  			atomic_inc(&sbi->s_retry_alloc_pending);
>  			flush_work(&sbi->s_discard_work);
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 294198c05cdd..003b8d3726e8 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1602,7 +1602,7 @@ struct ext4_sb_info {
>  	unsigned short *s_mb_offsets;
>  	unsigned int *s_mb_maxs;
>  	unsigned int s_group_info_size;
> -	unsigned int s_mb_free_pending;
> +	atomic_t s_mb_free_pending;
>  	struct list_head s_freed_data_list[2];	/* List of blocks to be freed
>  						   after commit completed */
>  	struct list_head s_discard_list;
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 216b332a5054..5410fb3688ee 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -3680,7 +3680,7 @@ int ext4_mb_init(struct super_block *sb)
>  	}
>  
>  	spin_lock_init(&sbi->s_md_lock);
> -	sbi->s_mb_free_pending = 0;
> +	atomic_set(&sbi->s_mb_free_pending, 0);
>  	INIT_LIST_HEAD(&sbi->s_freed_data_list[0]);
>  	INIT_LIST_HEAD(&sbi->s_freed_data_list[1]);
>  	INIT_LIST_HEAD(&sbi->s_discard_list);
> @@ -3894,10 +3894,7 @@ static void ext4_free_data_in_buddy(struct super_block *sb,
>  	/* we expect to find existing buddy because it's pinned */
>  	BUG_ON(err != 0);
>  
> -	spin_lock(&EXT4_SB(sb)->s_md_lock);
> -	EXT4_SB(sb)->s_mb_free_pending -= entry->efd_count;
> -	spin_unlock(&EXT4_SB(sb)->s_md_lock);
> -
> +	atomic_sub(entry->efd_count, &EXT4_SB(sb)->s_mb_free_pending);
>  	db = e4b.bd_info;
>  	/* there are blocks to put in buddy to make them really free */
>  	count += entry->efd_count;
> @@ -6392,7 +6389,7 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
>  
>  	spin_lock(&sbi->s_md_lock);
>  	list_add_tail(&new_entry->efd_list, &sbi->s_freed_data_list[new_entry->efd_tid & 1]);
> -	sbi->s_mb_free_pending += clusters;
> +	atomic_add(clusters, &sbi->s_mb_free_pending);
>  	spin_unlock(&sbi->s_md_lock);
>  }
>  
> -- 
> 2.46.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2 08/16] ext4: merge freed extent with existing extents before insertion
  2025-06-23  7:32 ` [PATCH v2 08/16] ext4: merge freed extent with existing extents before insertion Baokun Li
@ 2025-06-27 19:11   ` Jan Kara
  0 siblings, 0 replies; 51+ messages in thread
From: Jan Kara @ 2025-06-27 19:11 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, tytso, jack, adilger.kernel, ojaswin, linux-kernel,
	yi.zhang, yangerkun

On Mon 23-06-25 15:32:56, Baokun Li wrote:
> Attempt to merge ext4_free_data with already inserted free extents prior
> to adding new ones. This strategy drastically cuts down the number of
> times locks are held.
> 
> For example, if prev, new, and next extents are all mergeable, the existing
> code (before this patch) requires acquiring the s_md_lock three times:
> 
>   prev merge into new and free prev // hold lock
>   next merge into new and free next // hold lock
>   insert new // hold lock
> 
> After the patch, it only needs to be acquired once:
> 
>   new merge into next and free new // no lock
>   next merge into prev and free next // hold lock
> 
> Performance test data follows:
> 
> Test: Running will-it-scale/fallocate2 on CPU-bound containers.
> Observation: Average fallocate operations per container per second.
> 
>                    | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
>  Disk: 960GB SSD   |-------------------------|-------------------------|
>                    | base  |    patched      | base  |    patched      |
> -------------------|-------|-----------------|-------|-----------------|
> mb_optimize_scan=0 | 20982 | 21157 (+0.8%)   | 50629 | 50420 (-0.4%)   |
> mb_optimize_scan=1 | 10703 | 12896 (+20.4%)  | 14856 | 17273 (+16.2%)  |
> 
> Signed-off-by: Baokun Li <libaokun1@huawei.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/mballoc.c | 113 +++++++++++++++++++++++++++++++---------------
>  1 file changed, 76 insertions(+), 37 deletions(-)
> 
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 5410fb3688ee..94950b07a577 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -6298,28 +6298,63 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
>   * are contiguous, AND the extents were freed by the same transaction,
>   * AND the blocks are associated with the same group.
>   */
> -static void ext4_try_merge_freed_extent(struct ext4_sb_info *sbi,
> -					struct ext4_free_data *entry,
> -					struct ext4_free_data *new_entry,
> -					struct rb_root *entry_rb_root)
> +static inline bool
> +ext4_freed_extents_can_be_merged(struct ext4_free_data *entry1,
> +				 struct ext4_free_data *entry2)
>  {
> -	if ((entry->efd_tid != new_entry->efd_tid) ||
> -	    (entry->efd_group != new_entry->efd_group))
> -		return;
> -	if (entry->efd_start_cluster + entry->efd_count ==
> -	    new_entry->efd_start_cluster) {
> -		new_entry->efd_start_cluster = entry->efd_start_cluster;
> -		new_entry->efd_count += entry->efd_count;
> -	} else if (new_entry->efd_start_cluster + new_entry->efd_count ==
> -		   entry->efd_start_cluster) {
> -		new_entry->efd_count += entry->efd_count;
> -	} else
> -		return;
> +	if (entry1->efd_tid != entry2->efd_tid)
> +		return false;
> +	if (entry1->efd_start_cluster + entry1->efd_count !=
> +	    entry2->efd_start_cluster)
> +		return false;
> +	if (WARN_ON_ONCE(entry1->efd_group != entry2->efd_group))
> +		return false;
> +	return true;
> +}
> +
> +static inline void
> +ext4_merge_freed_extents(struct ext4_sb_info *sbi, struct rb_root *root,
> +			 struct ext4_free_data *entry1,
> +			 struct ext4_free_data *entry2)
> +{
> +	entry1->efd_count += entry2->efd_count;
>  	spin_lock(&sbi->s_md_lock);
> -	list_del(&entry->efd_list);
> +	list_del(&entry2->efd_list);
>  	spin_unlock(&sbi->s_md_lock);
> -	rb_erase(&entry->efd_node, entry_rb_root);
> -	kmem_cache_free(ext4_free_data_cachep, entry);
> +	rb_erase(&entry2->efd_node, root);
> +	kmem_cache_free(ext4_free_data_cachep, entry2);
> +}
> +
> +static inline void
> +ext4_try_merge_freed_extent_prev(struct ext4_sb_info *sbi, struct rb_root *root,
> +				 struct ext4_free_data *entry)
> +{
> +	struct ext4_free_data *prev;
> +	struct rb_node *node;
> +
> +	node = rb_prev(&entry->efd_node);
> +	if (!node)
> +		return;
> +
> +	prev = rb_entry(node, struct ext4_free_data, efd_node);
> +	if (ext4_freed_extents_can_be_merged(prev, entry))
> +		ext4_merge_freed_extents(sbi, root, prev, entry);
> +}
> +
> +static inline void
> +ext4_try_merge_freed_extent_next(struct ext4_sb_info *sbi, struct rb_root *root,
> +				 struct ext4_free_data *entry)
> +{
> +	struct ext4_free_data *next;
> +	struct rb_node *node;
> +
> +	node = rb_next(&entry->efd_node);
> +	if (!node)
> +		return;
> +
> +	next = rb_entry(node, struct ext4_free_data, efd_node);
> +	if (ext4_freed_extents_can_be_merged(entry, next))
> +		ext4_merge_freed_extents(sbi, root, entry, next);
>  }
>  
>  static noinline_for_stack void
> @@ -6329,11 +6364,12 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
>  	ext4_group_t group = e4b->bd_group;
>  	ext4_grpblk_t cluster;
>  	ext4_grpblk_t clusters = new_entry->efd_count;
> -	struct ext4_free_data *entry;
> +	struct ext4_free_data *entry = NULL;
>  	struct ext4_group_info *db = e4b->bd_info;
>  	struct super_block *sb = e4b->bd_sb;
>  	struct ext4_sb_info *sbi = EXT4_SB(sb);
> -	struct rb_node **n = &db->bb_free_root.rb_node, *node;
> +	struct rb_root *root = &db->bb_free_root;
> +	struct rb_node **n = &root->rb_node;
>  	struct rb_node *parent = NULL, *new_node;
>  
>  	BUG_ON(!ext4_handle_valid(handle));
> @@ -6369,27 +6405,30 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
>  		}
>  	}
>  
> -	rb_link_node(new_node, parent, n);
> -	rb_insert_color(new_node, &db->bb_free_root);
> -
> -	/* Now try to see the extent can be merged to left and right */
> -	node = rb_prev(new_node);
> -	if (node) {
> -		entry = rb_entry(node, struct ext4_free_data, efd_node);
> -		ext4_try_merge_freed_extent(sbi, entry, new_entry,
> -					    &(db->bb_free_root));
> +	atomic_add(clusters, &sbi->s_mb_free_pending);
> +	if (!entry)
> +		goto insert;
> +
> +	/* Now try to see the extent can be merged to prev and next */
> +	if (ext4_freed_extents_can_be_merged(new_entry, entry)) {
> +		entry->efd_start_cluster = cluster;
> +		entry->efd_count += new_entry->efd_count;
> +		kmem_cache_free(ext4_free_data_cachep, new_entry);
> +		ext4_try_merge_freed_extent_prev(sbi, root, entry);
> +		return;
>  	}
> -
> -	node = rb_next(new_node);
> -	if (node) {
> -		entry = rb_entry(node, struct ext4_free_data, efd_node);
> -		ext4_try_merge_freed_extent(sbi, entry, new_entry,
> -					    &(db->bb_free_root));
> +	if (ext4_freed_extents_can_be_merged(entry, new_entry)) {
> +		entry->efd_count += new_entry->efd_count;
> +		kmem_cache_free(ext4_free_data_cachep, new_entry);
> +		ext4_try_merge_freed_extent_next(sbi, root, entry);
> +		return;
>  	}
> +insert:
> +	rb_link_node(new_node, parent, n);
> +	rb_insert_color(new_node, root);
>  
>  	spin_lock(&sbi->s_md_lock);
>  	list_add_tail(&new_entry->efd_list, &sbi->s_freed_data_list[new_entry->efd_tid & 1]);
> -	atomic_add(clusters, &sbi->s_mb_free_pending);
>  	spin_unlock(&sbi->s_md_lock);
>  }
>  
> -- 
> 2.46.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2 09/16] ext4: fix zombie groups in average fragment size lists
  2025-06-23  7:32 ` [PATCH v2 09/16] ext4: fix zombie groups in average fragment size lists Baokun Li
@ 2025-06-27 19:14   ` Jan Kara
  2025-06-30  6:53     ` Baokun Li
  0 siblings, 1 reply; 51+ messages in thread
From: Jan Kara @ 2025-06-27 19:14 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, tytso, jack, adilger.kernel, ojaswin, linux-kernel,
	yi.zhang, yangerkun, stable

On Mon 23-06-25 15:32:57, Baokun Li wrote:
> Groups with no free blocks shouldn't be in any average fragment size list.
> However, when all blocks in a group are allocated(i.e., bb_fragments or
> bb_free is 0), we currently skip updating the average fragment size, which
> means the group isn't removed from its previous s_mb_avg_fragment_size[old]
> list.
> 
> This created "zombie" groups that were always skipped during traversal as
> they couldn't satisfy any block allocation requests, negatively impacting
> traversal efficiency.
> 
> Therefore, when a group becomes completely free, bb_avg_fragment_size_order
					     ^^^ full

> is now set to -1. If the old order was not -1, a removal operation is
> performed; if the new order is not -1, an insertion is performed.
> 
> Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
> CC: stable@vger.kernel.org
> Signed-off-by: Baokun Li <libaokun1@huawei.com>

Good catch! The patch looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/mballoc.c | 36 ++++++++++++++++++------------------
>  1 file changed, 18 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 94950b07a577..e6d6c2da3c6e 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -841,30 +841,30 @@ static void
>  mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
>  {
>  	struct ext4_sb_info *sbi = EXT4_SB(sb);
> -	int new_order;
> +	int new, old;
>  
> -	if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || grp->bb_fragments == 0)
> +	if (!test_opt2(sb, MB_OPTIMIZE_SCAN))
>  		return;
>  
> -	new_order = mb_avg_fragment_size_order(sb,
> -					grp->bb_free / grp->bb_fragments);
> -	if (new_order == grp->bb_avg_fragment_size_order)
> +	old = grp->bb_avg_fragment_size_order;
> +	new = grp->bb_fragments == 0 ? -1 :
> +	      mb_avg_fragment_size_order(sb, grp->bb_free / grp->bb_fragments);
> +	if (new == old)
>  		return;
>  
> -	if (grp->bb_avg_fragment_size_order != -1) {
> -		write_lock(&sbi->s_mb_avg_fragment_size_locks[
> -					grp->bb_avg_fragment_size_order]);
> +	if (old >= 0) {
> +		write_lock(&sbi->s_mb_avg_fragment_size_locks[old]);
>  		list_del(&grp->bb_avg_fragment_size_node);
> -		write_unlock(&sbi->s_mb_avg_fragment_size_locks[
> -					grp->bb_avg_fragment_size_order]);
> -	}
> -	grp->bb_avg_fragment_size_order = new_order;
> -	write_lock(&sbi->s_mb_avg_fragment_size_locks[
> -					grp->bb_avg_fragment_size_order]);
> -	list_add_tail(&grp->bb_avg_fragment_size_node,
> -		&sbi->s_mb_avg_fragment_size[grp->bb_avg_fragment_size_order]);
> -	write_unlock(&sbi->s_mb_avg_fragment_size_locks[
> -					grp->bb_avg_fragment_size_order]);
> +		write_unlock(&sbi->s_mb_avg_fragment_size_locks[old]);
> +	}
> +
> +	grp->bb_avg_fragment_size_order = new;
> +	if (new >= 0) {
> +		write_lock(&sbi->s_mb_avg_fragment_size_locks[new]);
> +		list_add_tail(&grp->bb_avg_fragment_size_node,
> +				&sbi->s_mb_avg_fragment_size[new]);
> +		write_unlock(&sbi->s_mb_avg_fragment_size_locks[new]);
> +	}
>  }
>  
>  /*
> -- 
> 2.46.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2 10/16] ext4: fix largest free orders lists corruption on mb_optimize_scan switch
  2025-06-23  7:32 ` [PATCH v2 10/16] ext4: fix largest free orders lists corruption on mb_optimize_scan switch Baokun Li
@ 2025-06-27 19:34   ` Jan Kara
  2025-06-30  7:34     ` Baokun Li
  0 siblings, 1 reply; 51+ messages in thread
From: Jan Kara @ 2025-06-27 19:34 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, tytso, jack, adilger.kernel, ojaswin, linux-kernel,
	yi.zhang, yangerkun, stable

On Mon 23-06-25 15:32:58, Baokun Li wrote:
> The grp->bb_largest_free_order is updated regardless of whether
> mb_optimize_scan is enabled. This can lead to inconsistencies between
> grp->bb_largest_free_order and the actual s_mb_largest_free_orders list
> index when mb_optimize_scan is repeatedly enabled and disabled via remount.
> 
> For example, if mb_optimize_scan is initially enabled, largest free
> order is 3, and the group is in s_mb_largest_free_orders[3]. Then,
> mb_optimize_scan is disabled via remount, block allocations occur,
> updating largest free order to 2. Finally, mb_optimize_scan is re-enabled
> via remount, more block allocations update largest free order to 1.
> 
> At this point, the group would be removed from s_mb_largest_free_orders[3]
> under the protection of s_mb_largest_free_orders_locks[2]. This lock
> mismatch can lead to list corruption.
> 
> To fix this, a new field bb_largest_free_order_idx is added to struct
> ext4_group_info to explicitly track the list index. Then still update
> bb_largest_free_order unconditionally, but only update
> bb_largest_free_order_idx when mb_optimize_scan is enabled, so that there
> is no inconsistency between the lock and the data to be protected.
> 
> Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
> CC: stable@vger.kernel.org
> Signed-off-by: Baokun Li <libaokun1@huawei.com>

Hum, rather than duplicating index like this, couldn't we add to
mb_set_largest_free_order():

	/* Did mb_optimize_scan setting change? */
	if (!test_opt2(sb, MB_OPTIMIZE_SCAN) &&
	    !list_empty(&grp->bb_largest_free_order_node)) {
		write_lock(&sbi->s_mb_largest_free_orders_locks[old]);
		list_del_init(&grp->bb_largest_free_order_node);
		write_unlock(&sbi->s_mb_largest_free_orders_locks[old]);
	}

Also arguably we should reinit bb lists when mb_optimize_scan gets
reenabled because otherwise inconsistent lists could lead to suboptimal
results... But that's less important to fix I guess.

								Honza

> ---
>  fs/ext4/ext4.h    |  1 +
>  fs/ext4/mballoc.c | 35 ++++++++++++++++-------------------
>  2 files changed, 17 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 003b8d3726e8..0e574378c6a3 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -3476,6 +3476,7 @@ struct ext4_group_info {
>  	int		bb_avg_fragment_size_order;	/* order of average
>  							   fragment in BG */
>  	ext4_grpblk_t	bb_largest_free_order;/* order of largest frag in BG */
> +	ext4_grpblk_t	bb_largest_free_order_idx; /* index of largest frag */
>  	ext4_group_t	bb_group;	/* Group number */
>  	struct          list_head bb_prealloc_list;
>  #ifdef DOUBLE_CHECK
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index e6d6c2da3c6e..dc82124f0905 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -1152,33 +1152,29 @@ static void
>  mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
>  {
>  	struct ext4_sb_info *sbi = EXT4_SB(sb);
> -	int i;
> +	int new, old = grp->bb_largest_free_order_idx;
>  
> -	for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--)
> -		if (grp->bb_counters[i] > 0)
> +	for (new = MB_NUM_ORDERS(sb) - 1; new >= 0; new--)
> +		if (grp->bb_counters[new] > 0)
>  			break;
> +
> +	grp->bb_largest_free_order = new;
>  	/* No need to move between order lists? */
> -	if (!test_opt2(sb, MB_OPTIMIZE_SCAN) ||
> -	    i == grp->bb_largest_free_order) {
> -		grp->bb_largest_free_order = i;
> +	if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || new == old)
>  		return;
> -	}
>  
> -	if (grp->bb_largest_free_order >= 0) {
> -		write_lock(&sbi->s_mb_largest_free_orders_locks[
> -					      grp->bb_largest_free_order]);
> +	if (old >= 0) {
> +		write_lock(&sbi->s_mb_largest_free_orders_locks[old]);
>  		list_del_init(&grp->bb_largest_free_order_node);
> -		write_unlock(&sbi->s_mb_largest_free_orders_locks[
> -					      grp->bb_largest_free_order]);
> +		write_unlock(&sbi->s_mb_largest_free_orders_locks[old]);
>  	}
> -	grp->bb_largest_free_order = i;
> -	if (grp->bb_largest_free_order >= 0 && grp->bb_free) {
> -		write_lock(&sbi->s_mb_largest_free_orders_locks[
> -					      grp->bb_largest_free_order]);
> +
> +	grp->bb_largest_free_order_idx = new;
> +	if (new >= 0 && grp->bb_free) {
> +		write_lock(&sbi->s_mb_largest_free_orders_locks[new]);
>  		list_add_tail(&grp->bb_largest_free_order_node,
> -		      &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);
> -		write_unlock(&sbi->s_mb_largest_free_orders_locks[
> -					      grp->bb_largest_free_order]);
> +			      &sbi->s_mb_largest_free_orders[new]);
> +		write_unlock(&sbi->s_mb_largest_free_orders_locks[new]);
>  	}
>  }
>  
> @@ -3391,6 +3387,7 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
>  	INIT_LIST_HEAD(&meta_group_info[i]->bb_avg_fragment_size_node);
>  	meta_group_info[i]->bb_largest_free_order = -1;  /* uninit */
>  	meta_group_info[i]->bb_avg_fragment_size_order = -1;  /* uninit */
> +	meta_group_info[i]->bb_largest_free_order_idx = -1;  /* uninit */
>  	meta_group_info[i]->bb_group = group;
>  
>  	mb_group_bb_bitmap_alloc(sb, meta_group_info[i], group);
> -- 
> 2.46.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2 02/16] ext4: remove unnecessary s_mb_last_start
  2025-06-27 18:15   ` Jan Kara
@ 2025-06-30  3:32     ` Baokun Li
  2025-06-30  7:31       ` Jan Kara
  0 siblings, 1 reply; 51+ messages in thread
From: Baokun Li @ 2025-06-30  3:32 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, tytso, adilger.kernel, ojaswin, linux-kernel,
	yi.zhang, yangerkun, Baokun Li

On 2025/6/28 2:15, Jan Kara wrote:
> On Mon 23-06-25 15:32:50, Baokun Li wrote:
>> ac->ac_g_ex.fe_start is only used in ext4_mb_find_by_goal(), but STREAM
>> ALLOC is activated after ext4_mb_find_by_goal() fails, so there's no need
>> to update ac->ac_g_ex.fe_start, remove the unnecessary s_mb_last_start.
>>
>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> I'd just note that ac->ac_g_ex.fe_start is also used in
> ext4_mb_collect_stats() so this change may impact the statistics gathered
> there. OTOH it is questionable whether we even want to account streaming
> allocation as a goal hit... Anyway, I'm fine with this, I'd just mention it
> in the changelog.
Yes, I missed ext4_mb_collect_stats(). However, instead of explaining
it in the changelog, I think it would be better to move the current
s_bal_goals update to inside or after ext4_mb_find_by_goal().

Then, we could add another variable, such as s_bal_stream_goals, to
represent the hit count for global goals. This kind of statistic would
help us fine-tune the logic for optimizing inode goals and global goals.

What are your thoughts on this?
> Also one nit below but feel free to add:
>
> Reviewed-by: Jan Kara <jack@suse.cz>
Thanks for your review!
>
>> @@ -2849,7 +2848,6 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>>   		/* TBD: may be hot point */
>>   		spin_lock(&sbi->s_md_lock);
>>   		ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
>> -		ac->ac_g_ex.fe_start = sbi->s_mb_last_start;
> Maybe reset ac->ac_g_ex.fe_start to 0 instead of leaving it at some random
> value? Just for the sake of defensive programming...
>
> 								Honza

ac->ac_g_ex.fe_start holds the inode goal's start position, not a random
value. It's unused after ext4_mb_find_by_goal() (if s_bal_stream_goals is
added). Thus, I see no need for further modification. We can always re-add
it if future requirements change.


Thanks,
Baokun



* Re: [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group
  2025-06-27 18:19   ` Jan Kara
@ 2025-06-30  3:48     ` Baokun Li
  2025-06-30  7:47       ` Jan Kara
  0 siblings, 1 reply; 51+ messages in thread
From: Baokun Li @ 2025-06-30  3:48 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, tytso, adilger.kernel, ojaswin, linux-kernel,
	yi.zhang, yangerkun, Baokun Li

On 2025/6/28 2:19, Jan Kara wrote:
> On Mon 23-06-25 15:32:51, Baokun Li wrote:
>> After we optimized the block group lock, we found another lock
>> contention issue when running will-it-scale/fallocate2 with multiple
>> processes. The fallocate's block allocation and the truncate's block
>> release were fighting over the s_md_lock. The problem is, this lock
>> protects totally different things in those two processes: the list of
>> freed data blocks (s_freed_data_list) when releasing, and where to start
>> looking for new blocks (mb_last_group) when allocating.
>>
>> Now we only need to track s_mb_last_group and no longer need to track
>> s_mb_last_start, so we don't need the s_md_lock lock to ensure that the
>> two are consistent, and we can ensure that the s_mb_last_group read is up
>> to date by using smp_store_release/smp_load_acquire.
>>
>> Besides, the s_mb_last_group data type only requires ext4_group_t
>> (i.e., unsigned int), rendering unsigned long superfluous.
>>
>> Performance test data follows:
>>
>> Test: Running will-it-scale/fallocate2 on CPU-bound containers.
>> Observation: Average fallocate operations per container per second.
>>
>>                     | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
>>   Disk: 960GB SSD   |-------------------------|-------------------------|
>>                     | base  |    patched      | base  |    patched      |
>> -------------------|-------|-----------------|-------|-----------------|
>> mb_optimize_scan=0 | 4821  | 7612  (+57.8%)  | 15371 | 21647 (+40.8%)  |
>> mb_optimize_scan=1 | 4784  | 7568  (+58.1%)  | 6101  | 9117  (+49.4%)  |
>>
>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> ...
>
>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>> index 5cdae3bda072..3f103919868b 100644
>> --- a/fs/ext4/mballoc.c
>> +++ b/fs/ext4/mballoc.c
>> @@ -2168,11 +2168,9 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
>>   	ac->ac_buddy_folio = e4b->bd_buddy_folio;
>>   	folio_get(ac->ac_buddy_folio);
>>   	/* store last allocated for subsequent stream allocation */
>> -	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
>> -		spin_lock(&sbi->s_md_lock);
>> -		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
>> -		spin_unlock(&sbi->s_md_lock);
>> -	}
>> +	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
>> +		/* pairs with smp_load_acquire in ext4_mb_regular_allocator() */
>> +		smp_store_release(&sbi->s_mb_last_group, ac->ac_f_ex.fe_group);
> Do you really need any kind of barrier (implied by smp_store_release())
> here? I mean the store to s_mb_last_group is perfectly fine to be reordered
> with other accesses from the thread, isn't it? As such it should be enough
> to have WRITE_ONCE() here...

WRITE_ONCE()/READ_ONCE() primarily prevent compiler reordering and ensure
that variable reads/writes access values directly from L1/L2 cache rather
than registers.

They do not guarantee that other CPUs see the latest values. Reading stale
values could lead to more useless traversals, which might incur higher
overhead than memory barriers. This is why we use memory barriers to ensure
the latest values are read.

If we could guarantee that each goal is used on only one CPU, we could
switch to the cheaper WRITE_ONCE()/READ_ONCE().


Regards,
Baokun

>>   	/*
>>   	 * As we've just preallocated more space than
>>   	 * user requested originally, we store allocated
>> @@ -2844,12 +2842,9 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>>   	}
>>   
>>   	/* if stream allocation is enabled, use global goal */
>> -	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
>> -		/* TBD: may be hot point */
>> -		spin_lock(&sbi->s_md_lock);
>> -		ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
>> -		spin_unlock(&sbi->s_md_lock);
>> -	}
>> +	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
>> +		/* pairs with smp_store_release in ext4_mb_use_best_found() */
>> +		ac->ac_g_ex.fe_group = smp_load_acquire(&sbi->s_mb_last_group);
> ... and READ_ONCE() here.
>
> 								Honza




* Re: [PATCH v2 04/16] ext4: utilize multiple global goals to reduce contention
  2025-06-27 18:31   ` Jan Kara
@ 2025-06-30  6:50     ` Baokun Li
  2025-06-30  8:38       ` Jan Kara
  0 siblings, 1 reply; 51+ messages in thread
From: Baokun Li @ 2025-06-30  6:50 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, tytso, adilger.kernel, ojaswin, linux-kernel,
	yi.zhang, yangerkun, Baokun Li

On 2025/6/28 2:31, Jan Kara wrote:
> On Mon 23-06-25 15:32:52, Baokun Li wrote:
>> When allocating data blocks, if the first try (goal allocation) fails and
>> stream allocation is on, it tries a global goal starting from the last
>> group we used (s_mb_last_group). This helps cluster large files together
>> to reduce free space fragmentation, and the data block contiguity also
>> accelerates write-back to disk.
>>
>> However, when multiple processes allocate blocks, having just one global
>> goal means they all fight over the same group. This drastically lowers
>> the chances of extents merging and leads to much worse file fragmentation.
>>
>> To mitigate this multi-process contention, we now employ multiple global
>> goals, with the number of goals being the CPU count rounded up to the
>> nearest power of 2. To ensure a consistent goal for each inode, we select
>> the corresponding goal by taking the inode number modulo the total number
>> of goals.
>>
>> Performance test data follows:
>>
>> Test: Running will-it-scale/fallocate2 on CPU-bound containers.
>> Observation: Average fallocate operations per container per second.
>>
>>                     | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
>>   Disk: 960GB SSD   |-------------------------|-------------------------|
>>                     | base  |    patched      | base  |    patched      |
>> -------------------|-------|-----------------|-------|-----------------|
>> mb_optimize_scan=0 | 7612  | 19699 (+158%)   | 21647 | 53093 (+145%)   |
>> mb_optimize_scan=1 | 7568  | 9862  (+30.3%)  | 9117  | 14401 (+57.9%)  |
>>
>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> ...
>
>> +/*
>> + * Number of mb last groups
>> + */
>> +#ifdef CONFIG_SMP
>> +#define MB_LAST_GROUPS roundup_pow_of_two(nr_cpu_ids)
>> +#else
>> +#define MB_LAST_GROUPS 1
>> +#endif
>> +
> I think this is too aggressive. nr_cpu_ids is easily 4096 or similar for
> distribution kernels (it is just a theoretical maximum for the number of
> CPUs the kernel can support)

nr_cpu_ids is generally equal to num_possible_cpus(). Only when
CONFIG_FORCE_NR_CPUS is enabled will nr_cpu_ids be set to NR_CPUS,
which represents the maximum number of supported CPUs.

> which seems like far too much for small
> filesystems with say 100 block groups.

It does make sense.

> I'd rather pick the array size like:
>
> min(num_possible_cpus(), sbi->s_groups_count/4)
>
> to
>
> a) don't have too many slots so we still concentrate big allocations in
> somewhat limited area of the filesystem (a quarter of block groups here).
>
> b) have at most one slot per CPU the machine hardware can in principle
> support.
>
> 								Honza

You're right, we should consider the number of block groups when setting
the number of global goals.

However, a server's rootfs can often be quite small, perhaps only tens of
GBs, while having many CPUs. In such cases, sbi->s_groups_count / 4 might
still limit the filesystem's scalability. Furthermore, after supporting
LBS, the number of block groups will sharply decrease.

How about we directly use sbi->s_groups_count (which would effectively be
min(num_possible_cpus(), sbi->s_groups_count)) instead? This would also
avoid zero values.
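
In code form, the scheme being discussed would look roughly like this (illustrative only; the array and field names below are placeholders rather than the exact ones in this series):

	/* sizing, decided once at mount time */
	unsigned int nr_goals = min(num_possible_cpus(), sbi->s_groups_count);

	/* per-inode slot selection: inode number modulo the number of goals */
	unsigned int slot = ac->ac_inode->i_ino % nr_goals;

	ac->ac_g_ex.fe_group = READ_ONCE(sbi->s_mb_last_groups[slot]);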


Cheers,
Baokun



* Re: [PATCH v2 09/16] ext4: fix zombie groups in average fragment size lists
  2025-06-27 19:14   ` Jan Kara
@ 2025-06-30  6:53     ` Baokun Li
  0 siblings, 0 replies; 51+ messages in thread
From: Baokun Li @ 2025-06-30  6:53 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, tytso, adilger.kernel, ojaswin, linux-kernel,
	yi.zhang, yangerkun, stable, Baokun Li

On 2025/6/28 3:14, Jan Kara wrote:
> On Mon 23-06-25 15:32:57, Baokun Li wrote:
>> Groups with no free blocks shouldn't be in any average fragment size list.
>> However, when all blocks in a group are allocated(i.e., bb_fragments or
>> bb_free is 0), we currently skip updating the average fragment size, which
>> means the group isn't removed from its previous s_mb_avg_fragment_size[old]
>> list.
>>
>> This created "zombie" groups that were always skipped during traversal as
>> they couldn't satisfy any block allocation requests, negatively impacting
>> traversal efficiency.
>>
>> Therefore, when a group becomes completely free, bb_avg_fragment_size_order
> 					     ^^^ full

Oh, thank you for pointing out that typo!
I'll correct it in the next version.


Thanks,
Baokun

>> is now set to -1. If the old order was not -1, a removal operation is
>> performed; if the new order is not -1, an insertion is performed.
>>
>> Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
>> CC: stable@vger.kernel.org
>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> Good catch! The patch looks good. Feel free to add:
>
> Reviewed-by: Jan Kara <jack@suse.cz>
>
> 								Honza
>
>> ---
>>   fs/ext4/mballoc.c | 36 ++++++++++++++++++------------------
>>   1 file changed, 18 insertions(+), 18 deletions(-)
>>
>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>> index 94950b07a577..e6d6c2da3c6e 100644
>> --- a/fs/ext4/mballoc.c
>> +++ b/fs/ext4/mballoc.c
>> @@ -841,30 +841,30 @@ static void
>>   mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
>>   {
>>   	struct ext4_sb_info *sbi = EXT4_SB(sb);
>> -	int new_order;
>> +	int new, old;
>>   
>> -	if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || grp->bb_fragments == 0)
>> +	if (!test_opt2(sb, MB_OPTIMIZE_SCAN))
>>   		return;
>>   
>> -	new_order = mb_avg_fragment_size_order(sb,
>> -					grp->bb_free / grp->bb_fragments);
>> -	if (new_order == grp->bb_avg_fragment_size_order)
>> +	old = grp->bb_avg_fragment_size_order;
>> +	new = grp->bb_fragments == 0 ? -1 :
>> +	      mb_avg_fragment_size_order(sb, grp->bb_free / grp->bb_fragments);
>> +	if (new == old)
>>   		return;
>>   
>> -	if (grp->bb_avg_fragment_size_order != -1) {
>> -		write_lock(&sbi->s_mb_avg_fragment_size_locks[
>> -					grp->bb_avg_fragment_size_order]);
>> +	if (old >= 0) {
>> +		write_lock(&sbi->s_mb_avg_fragment_size_locks[old]);
>>   		list_del(&grp->bb_avg_fragment_size_node);
>> -		write_unlock(&sbi->s_mb_avg_fragment_size_locks[
>> -					grp->bb_avg_fragment_size_order]);
>> -	}
>> -	grp->bb_avg_fragment_size_order = new_order;
>> -	write_lock(&sbi->s_mb_avg_fragment_size_locks[
>> -					grp->bb_avg_fragment_size_order]);
>> -	list_add_tail(&grp->bb_avg_fragment_size_node,
>> -		&sbi->s_mb_avg_fragment_size[grp->bb_avg_fragment_size_order]);
>> -	write_unlock(&sbi->s_mb_avg_fragment_size_locks[
>> -					grp->bb_avg_fragment_size_order]);
>> +		write_unlock(&sbi->s_mb_avg_fragment_size_locks[old]);
>> +	}
>> +
>> +	grp->bb_avg_fragment_size_order = new;
>> +	if (new >= 0) {
>> +		write_lock(&sbi->s_mb_avg_fragment_size_locks[new]);
>> +		list_add_tail(&grp->bb_avg_fragment_size_node,
>> +				&sbi->s_mb_avg_fragment_size[new]);
>> +		write_unlock(&sbi->s_mb_avg_fragment_size_locks[new]);
>> +	}
>>   }
>>   
>>   /*
>> -- 
>> 2.46.1
>>



* Re: [PATCH v2 02/16] ext4: remove unnecessary s_mb_last_start
  2025-06-30  3:32     ` Baokun Li
@ 2025-06-30  7:31       ` Jan Kara
  2025-06-30  7:52         ` Baokun Li
  0 siblings, 1 reply; 51+ messages in thread
From: Jan Kara @ 2025-06-30  7:31 UTC (permalink / raw)
  To: Baokun Li
  Cc: Jan Kara, linux-ext4, tytso, adilger.kernel, ojaswin,
	linux-kernel, yi.zhang, yangerkun

On Mon 30-06-25 11:32:16, Baokun Li wrote:
> On 2025/6/28 2:15, Jan Kara wrote:
> > On Mon 23-06-25 15:32:50, Baokun Li wrote:
> > > ac->ac_g_ex.fe_start is only used in ext4_mb_find_by_goal(), but STREAM
> > > ALLOC is activated after ext4_mb_find_by_goal() fails, so there's no need
> > > to update ac->ac_g_ex.fe_start, remove the unnecessary s_mb_last_start.
> > > 
> > > Signed-off-by: Baokun Li <libaokun1@huawei.com>
> > I'd just note that ac->ac_g_ex.fe_start is also used in
> > ext4_mb_collect_stats() so this change may impact the statistics gathered
> > there. OTOH it is questionable whether we even want to account streaming
> > allocation as a goal hit... Anyway, I'm fine with this, I'd just mention it
> > in the changelog.
> Yes, I missed ext4_mb_collect_stats(). However, instead of explaining
> it in the changelog, I think it would be better to move the current
> s_bal_goals update to inside or after ext4_mb_find_by_goal().
> 
> Then, we could add another variable, such as s_bal_stream_goals, to
> represent the hit count for global goals. This kind of statistic would
> help us fine-tune the logic for optimizing inode goals and global goals.
> 
> What are your thoughts on this?

Sure that sounds good to me.

> > > @@ -2849,7 +2848,6 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
> > >   		/* TBD: may be hot point */
> > >   		spin_lock(&sbi->s_md_lock);
> > >   		ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
> > > -		ac->ac_g_ex.fe_start = sbi->s_mb_last_start;
> > Maybe reset ac->ac_g_ex.fe_start to 0 instead of leaving it at some random
> > value? Just for the sake of defensive programming...
> > 
> ac->ac_g_ex.fe_start holds the inode goal's start position, not a random
> value. It's unused after ext4_mb_find_by_goal() (if s_bal_stream_goals is
> added). Thus, I see no need for further modification. We can always re-add
> it if future requirements change.

Yeah, I was imprecise. It is not a random value. But it is not an offset in
the group we are now setting. Therefore I'd still prefer to reset fe_start
to 0 (or some invalid value like -1 to catch unexpected use).
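i.e. something like (sketch):

		ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
		ac->ac_g_ex.fe_start = -1;	/* poison: not a valid in-group offset */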

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2 10/16] ext4: fix largest free orders lists corruption on mb_optimize_scan switch
  2025-06-27 19:34   ` Jan Kara
@ 2025-06-30  7:34     ` Baokun Li
  0 siblings, 0 replies; 51+ messages in thread
From: Baokun Li @ 2025-06-30  7:34 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, tytso, adilger.kernel, ojaswin, linux-kernel,
	yi.zhang, yangerkun, stable, Baokun Li

On 2025/6/28 3:34, Jan Kara wrote:
> On Mon 23-06-25 15:32:58, Baokun Li wrote:
>> The grp->bb_largest_free_order is updated regardless of whether
>> mb_optimize_scan is enabled. This can lead to inconsistencies between
>> grp->bb_largest_free_order and the actual s_mb_largest_free_orders list
>> index when mb_optimize_scan is repeatedly enabled and disabled via remount.
>>
>> For example, if mb_optimize_scan is initially enabled, largest free
>> order is 3, and the group is in s_mb_largest_free_orders[3]. Then,
>> mb_optimize_scan is disabled via remount, block allocations occur,
>> updating largest free order to 2. Finally, mb_optimize_scan is re-enabled
>> via remount, more block allocations update largest free order to 1.
>>
>> At this point, the group would be removed from s_mb_largest_free_orders[3]
>> under the protection of s_mb_largest_free_orders_locks[2]. This lock
>> mismatch can lead to list corruption.
>>
>> To fix this, a new field bb_largest_free_order_idx is added to struct
>> ext4_group_info to explicitly track the list index. Then still update
>> bb_largest_free_order unconditionally, but only update
>> bb_largest_free_order_idx when mb_optimize_scan is enabled, so that there
>> is no inconsistency between the lock and the data to be protected.
>>
>> Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
>> CC: stable@vger.kernel.org
>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> Hum, rather than duplicating index like this, couldn't we add to
> mb_set_largest_free_order():
>
> 	/* Did mb_optimize_scan setting change? */
> 	if (!test_opt2(sb, MB_OPTIMIZE_SCAN) &&
> 	    !list_empty(&grp->bb_largest_free_order_node)) {
> 		write_lock(&sbi->s_mb_largest_free_orders_locks[old]);
> 		list_del_init(&grp->bb_largest_free_order_node);
> 		write_unlock(&sbi->s_mb_largest_free_orders_locks[old]);
> 	}
>
> Also arguably we should reinit bb lists when mb_optimize_scan gets
> reenabled because otherwise inconsistent lists could lead to suboptimal
> results... But that's less important to fix I guess.
>
> 								Honza

Yeah, this looks good. We just need to remove groups that were modified
while mb_optimize_scan=0 from the lists. Groups that remain in the lists
after mb_optimize_scan is re-enabled can be used normally.

As for the groups that were removed, they will be re-added to their
corresponding lists during block freeing or block allocation when
cr >= CR_GOAL_LEN_SLOW. So, I agree that we don't need to explicitly
reinit them.



Cheers,
Baokun

>> ---
>>   fs/ext4/ext4.h    |  1 +
>>   fs/ext4/mballoc.c | 35 ++++++++++++++++-------------------
>>   2 files changed, 17 insertions(+), 19 deletions(-)
>>
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index 003b8d3726e8..0e574378c6a3 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -3476,6 +3476,7 @@ struct ext4_group_info {
>>   	int		bb_avg_fragment_size_order;	/* order of average
>>   							   fragment in BG */
>>   	ext4_grpblk_t	bb_largest_free_order;/* order of largest frag in BG */
>> +	ext4_grpblk_t	bb_largest_free_order_idx; /* index of largest frag */
>>   	ext4_group_t	bb_group;	/* Group number */
>>   	struct          list_head bb_prealloc_list;
>>   #ifdef DOUBLE_CHECK
>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>> index e6d6c2da3c6e..dc82124f0905 100644
>> --- a/fs/ext4/mballoc.c
>> +++ b/fs/ext4/mballoc.c
>> @@ -1152,33 +1152,29 @@ static void
>>   mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
>>   {
>>   	struct ext4_sb_info *sbi = EXT4_SB(sb);
>> -	int i;
>> +	int new, old = grp->bb_largest_free_order_idx;
>>   
>> -	for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--)
>> -		if (grp->bb_counters[i] > 0)
>> +	for (new = MB_NUM_ORDERS(sb) - 1; new >= 0; new--)
>> +		if (grp->bb_counters[new] > 0)
>>   			break;
>> +
>> +	grp->bb_largest_free_order = new;
>>   	/* No need to move between order lists? */
>> -	if (!test_opt2(sb, MB_OPTIMIZE_SCAN) ||
>> -	    i == grp->bb_largest_free_order) {
>> -		grp->bb_largest_free_order = i;
>> +	if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || new == old)
>>   		return;
>> -	}
>>   
>> -	if (grp->bb_largest_free_order >= 0) {
>> -		write_lock(&sbi->s_mb_largest_free_orders_locks[
>> -					      grp->bb_largest_free_order]);
>> +	if (old >= 0) {
>> +		write_lock(&sbi->s_mb_largest_free_orders_locks[old]);
>>   		list_del_init(&grp->bb_largest_free_order_node);
>> -		write_unlock(&sbi->s_mb_largest_free_orders_locks[
>> -					      grp->bb_largest_free_order]);
>> +		write_unlock(&sbi->s_mb_largest_free_orders_locks[old]);
>>   	}
>> -	grp->bb_largest_free_order = i;
>> -	if (grp->bb_largest_free_order >= 0 && grp->bb_free) {
>> -		write_lock(&sbi->s_mb_largest_free_orders_locks[
>> -					      grp->bb_largest_free_order]);
>> +
>> +	grp->bb_largest_free_order_idx = new;
>> +	if (new >= 0 && grp->bb_free) {
>> +		write_lock(&sbi->s_mb_largest_free_orders_locks[new]);
>>   		list_add_tail(&grp->bb_largest_free_order_node,
>> -		      &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);
>> -		write_unlock(&sbi->s_mb_largest_free_orders_locks[
>> -					      grp->bb_largest_free_order]);
>> +			      &sbi->s_mb_largest_free_orders[new]);
>> +		write_unlock(&sbi->s_mb_largest_free_orders_locks[new]);
>>   	}
>>   }
>>   
>> @@ -3391,6 +3387,7 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
>>   	INIT_LIST_HEAD(&meta_group_info[i]->bb_avg_fragment_size_node);
>>   	meta_group_info[i]->bb_largest_free_order = -1;  /* uninit */
>>   	meta_group_info[i]->bb_avg_fragment_size_order = -1;  /* uninit */
>> +	meta_group_info[i]->bb_largest_free_order_idx = -1;  /* uninit */
>>   	meta_group_info[i]->bb_group = group;
>>   
>>   	mb_group_bb_bitmap_alloc(sb, meta_group_info[i], group);
>> -- 
>> 2.46.1
>>



* Re: [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group
  2025-06-30  3:48     ` Baokun Li
@ 2025-06-30  7:47       ` Jan Kara
  2025-06-30  9:21         ` Baokun Li
  0 siblings, 1 reply; 51+ messages in thread
From: Jan Kara @ 2025-06-30  7:47 UTC (permalink / raw)
  To: Baokun Li
  Cc: Jan Kara, linux-ext4, tytso, adilger.kernel, ojaswin,
	linux-kernel, yi.zhang, yangerkun

On Mon 30-06-25 11:48:20, Baokun Li wrote:
> On 2025/6/28 2:19, Jan Kara wrote:
> > On Mon 23-06-25 15:32:51, Baokun Li wrote:
> > > After we optimized the block group lock, we found another lock
> > > contention issue when running will-it-scale/fallocate2 with multiple
> > > processes. The fallocate's block allocation and the truncate's block
> > > release were fighting over the s_md_lock. The problem is, this lock
> > > protects totally different things in those two processes: the list of
> > > freed data blocks (s_freed_data_list) when releasing, and where to start
> > > looking for new blocks (mb_last_group) when allocating.
> > > 
> > > Now we only need to track s_mb_last_group and no longer need to track
> > > s_mb_last_start, so we don't need the s_md_lock lock to ensure that the
> > > two are consistent, and we can ensure that the s_mb_last_group read is up
> > > to date by using smp_store_release/smp_load_acquire.
> > > 
> > > Besides, the s_mb_last_group data type only requires ext4_group_t
> > > (i.e., unsigned int), rendering unsigned long superfluous.
> > > 
> > > Performance test data follows:
> > > 
> > > Test: Running will-it-scale/fallocate2 on CPU-bound containers.
> > > Observation: Average fallocate operations per container per second.
> > > 
> > >                     | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
> > >   Disk: 960GB SSD   |-------------------------|-------------------------|
> > >                     | base  |    patched      | base  |    patched      |
> > > -------------------|-------|-----------------|-------|-----------------|
> > > mb_optimize_scan=0 | 4821  | 7612  (+57.8%)  | 15371 | 21647 (+40.8%)  |
> > > mb_optimize_scan=1 | 4784  | 7568  (+58.1%)  | 6101  | 9117  (+49.4%)  |
> > > 
> > > Signed-off-by: Baokun Li <libaokun1@huawei.com>
> > ...
> > 
> > > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > > index 5cdae3bda072..3f103919868b 100644
> > > --- a/fs/ext4/mballoc.c
> > > +++ b/fs/ext4/mballoc.c
> > > @@ -2168,11 +2168,9 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
> > >   	ac->ac_buddy_folio = e4b->bd_buddy_folio;
> > >   	folio_get(ac->ac_buddy_folio);
> > >   	/* store last allocated for subsequent stream allocation */
> > > -	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
> > > -		spin_lock(&sbi->s_md_lock);
> > > -		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
> > > -		spin_unlock(&sbi->s_md_lock);
> > > -	}
> > > +	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
> > > +		/* pairs with smp_load_acquire in ext4_mb_regular_allocator() */
> > > +		smp_store_release(&sbi->s_mb_last_group, ac->ac_f_ex.fe_group);
> > Do you really need any kind of barrier (implied by smp_store_release())
> > here? I mean the store to s_mb_last_group is perfectly fine to be reordered
> > with other accesses from the thread, isn't it? As such it should be enough
> > to have WRITE_ONCE() here...
> 
> WRITE_ONCE()/READ_ONCE() primarily prevent compiler reordering and ensure
> that variable reads/writes access values directly from L1/L2 cache rather
> than registers.

I agree READ_ONCE() / WRITE_ONCE() are about compiler optimizations - in
particular they force the compiler to read / write the memory location
exactly once instead of reading it potentially multiple times in different
parts of an expression and getting inconsistent values, or possibly writing
the value say byte by byte (yes, that would be insane but not contrary to
the C standard).

> They do not guarantee that other CPUs see the latest values. Reading stale
> values could lead to more useless traversals, which might incur higher
> overhead than memory barriers. This is why we use memory barriers to ensure
> the latest values are read.

But smp_load_acquire() / smp_store_release() have no guarantee about CPU
seeing latest values either. They are just speculation barriers meaning
they prevent the CPU from reordering accesses in the code after
smp_load_acquire() to be performed before the smp_load_acquire() is
executed and similarly with smp_store_release(). So I dare to say that
these barriers have no (positive) impact on the allocation performance and
just complicate the code - but if you have some data that show otherwise,
I'd be happy to be proven wrong.

> If we could guarantee that each goal is used on only one CPU, we could
> switch to the cheaper WRITE_ONCE()/READ_ONCE().

Well, neither READ_ONCE() / WRITE_ONCE() nor smp_load_acquire() /
smp_store_release() can guarantee that.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 02/16] ext4: remove unnecessary s_mb_last_start
  2025-06-30  7:31       ` Jan Kara
@ 2025-06-30  7:52         ` Baokun Li
  2025-07-14  7:00           ` Ojaswin Mujoo
  0 siblings, 1 reply; 51+ messages in thread
From: Baokun Li @ 2025-06-30  7:52 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, tytso, adilger.kernel, ojaswin, linux-kernel,
	yi.zhang, yangerkun, Baokun Li

On 2025/6/30 15:31, Jan Kara wrote:
> On Mon 30-06-25 11:32:16, Baokun Li wrote:
>> On 2025/6/28 2:15, Jan Kara wrote:
>>> On Mon 23-06-25 15:32:50, Baokun Li wrote:
>>>> ac->ac_g_ex.fe_start is only used in ext4_mb_find_by_goal(), but STREAM
>>>> ALLOC is activated after ext4_mb_find_by_goal() fails, so there's no need
>>>> to update ac->ac_g_ex.fe_start, remove the unnecessary s_mb_last_start.
>>>>
>>>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
>>> I'd just note that ac->ac_g_ex.fe_start is also used in
>>> ext4_mb_collect_stats() so this change may impact the statistics gathered
>>> there. OTOH it is questionable whether we even want to account streaming
>>> allocation as a goal hit... Anyway, I'm fine with this, I'd just mention it
>>> in the changelog.
>> Yes, I missed ext4_mb_collect_stats(). However, instead of explaining
>> it in the changelog, I think it would be better to move the current
>> s_bal_goals update to inside or after ext4_mb_find_by_goal().
>>
>> Then, we could add another variable, such as s_bal_stream_goals, to
>> represent the hit count for global goals. This kind of statistic would
>> help us fine-tune the logic for optimizing inode goals and global goals.
>>
>> What are your thoughts on this?
> Sure that sounds good to me.

Ok, I will add a patch to implement that logic in the next version.
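
Roughly, I'm imagining something like the sketch below (the
s_bal_stream_goals name and the exact placement are only the idea from
this discussion, not the final patch):

	/*
	 * Hypothetical sketch: after stream allocation succeeds, count a
	 * hit for the global goal when the finally allocated group matches
	 * the goal group we loaded. s_bal_stream_goals would be a new
	 * atomic_t counter in struct ext4_sb_info.
	 */
	if ((ac->ac_flags & EXT4_MB_STREAM_ALLOC) &&
	    ac->ac_f_ex.fe_group == ac->ac_g_ex.fe_group)
		atomic_inc(&sbi->s_bal_stream_goals);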

>
>>>> @@ -2849,7 +2848,6 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>>>>    		/* TBD: may be hot point */
>>>>    		spin_lock(&sbi->s_md_lock);
>>>>    		ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
>>>> -		ac->ac_g_ex.fe_start = sbi->s_mb_last_start;
>>> Maybe reset ac->ac_g_ex.fe_start to 0 instead of leaving it at some random
>>> value? Just for the sake of defensive programming...
>>>
>> ac->ac_g_ex.fe_start holds the inode goal's start position, not a random
>> value. It's unused after ext4_mb_find_by_goal() (if s_bal_stream_goals is
>> added). Thus, I see no need for further modification. We can always re-add
>> it if future requirements change.
> Yeah, I was imprecise. It is not a random value. But it is not an offset in
> the group we are now setting. Therefore I'd still prefer to reset fe_start
> to 0 (or some invalid value like -1 to catch unexpected use).
>
> 								Honza

When ext4_mb_regular_allocator() fails, it might retry and get called
again. In this scenario, we can't reliably determine if ac_g_ex has
already been modified. Therefore, it might be more appropriate to set
ac_g_ex.fe_start to -1 after ext4_mb_find_by_goal() fails. We can then
skip ext4_mb_find_by_goal() when ac_g_ex.fe_start < 0.
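
As a rough sketch of that idea (not the actual patch; the details may
well differ in the next version):

	/* hypothetical sketch, in ext4_mb_regular_allocator() */
	if (ac->ac_g_ex.fe_start >= 0) {
		err = ext4_mb_find_by_goal(ac, &e4b);
		if (err || ac->ac_status == AC_STATUS_FOUND)
			goto out;
		/* goal failed; invalidate it so a retry skips the goal scan */
		ac->ac_g_ex.fe_start = -1;
	}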


Cheers,
Baokun


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 04/16] ext4: utilize multiple global goals to reduce contention
  2025-06-30  6:50     ` Baokun Li
@ 2025-06-30  8:38       ` Jan Kara
  2025-06-30 10:02         ` Baokun Li
  0 siblings, 1 reply; 51+ messages in thread
From: Jan Kara @ 2025-06-30  8:38 UTC (permalink / raw)
  To: Baokun Li
  Cc: Jan Kara, linux-ext4, tytso, adilger.kernel, ojaswin,
	linux-kernel, yi.zhang, yangerkun

On Mon 30-06-25 14:50:30, Baokun Li wrote:
> On 2025/6/28 2:31, Jan Kara wrote:
> > On Mon 23-06-25 15:32:52, Baokun Li wrote:
> > > When allocating data blocks, if the first try (goal allocation) fails and
> > > stream allocation is on, it tries a global goal starting from the last
> > > group we used (s_mb_last_group). This helps cluster large files together
> > > to reduce free space fragmentation, and the data block contiguity also
> > > accelerates write-back to disk.
> > > 
> > > However, when multiple processes allocate blocks, having just one global
> > > goal means they all fight over the same group. This drastically lowers
> > > the chances of extents merging and leads to much worse file fragmentation.
> > > 
> > > To mitigate this multi-process contention, we now employ multiple global
> > > goals, with the number of goals being the CPU count rounded up to the
> > > nearest power of 2. To ensure a consistent goal for each inode, we select
> > > the corresponding goal by taking the inode number modulo the total number
> > > of goals.
> > > 
> > > Performance test data follows:
> > > 
> > > Test: Running will-it-scale/fallocate2 on CPU-bound containers.
> > > Observation: Average fallocate operations per container per second.
> > > 
> > >                     | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
> > >   Disk: 960GB SSD   |-------------------------|-------------------------|
> > >                     | base  |    patched      | base  |    patched      |
> > > -------------------|-------|-----------------|-------|-----------------|
> > > mb_optimize_scan=0 | 7612  | 19699 (+158%)   | 21647 | 53093 (+145%)   |
> > > mb_optimize_scan=1 | 7568  | 9862  (+30.3%)  | 9117  | 14401 (+57.9%)  |
> > > 
> > > Signed-off-by: Baokun Li <libaokun1@huawei.com>
> > ...
> > 
> > > +/*
> > > + * Number of mb last groups
> > > + */
> > > +#ifdef CONFIG_SMP
> > > +#define MB_LAST_GROUPS roundup_pow_of_two(nr_cpu_ids)
> > > +#else
> > > +#define MB_LAST_GROUPS 1
> > > +#endif
> > > +
> > I think this is too aggressive. nr_cpu_ids is easily 4096 or similar for
> > distribution kernels (it is just a theoretical maximum for the number of
> > CPUs the kernel can support)
> 
> nr_cpu_ids is generally equal to num_possible_cpus(). Only when
> CONFIG_FORCE_NR_CPUS is enabled will nr_cpu_ids be set to NR_CPUS,
> which represents the maximum number of supported CPUs.

Indeed, CONFIG_FORCE_NR_CPUS confused me.

> > which seems like far too much for small
> > filesystems with say 100 block groups.
> 
> It does make sense.
> 
> > I'd rather pick the array size like:
> > 
> > min(num_possible_cpus(), sbi->s_groups_count/4)
> > 
> > to
> > 
> > a) don't have too many slots so we still concentrate big allocations in
> > somewhat limited area of the filesystem (a quarter of block groups here).
> > 
> > b) have at most one slot per CPU the machine hardware can in principle
> > support.
> > 
> > 								Honza
> 
> You're right, we should consider the number of block groups when setting
> the number of global goals.
> 
> However, a server's rootfs can often be quite small, perhaps only tens of
> GBs, while having many CPUs. In such cases, sbi->s_groups_count / 4 might
> still limit the filesystem's scalability.

I would not expect such a root filesystem to be loaded with many big
allocations in parallel :). And with a 4k blocksize, a 32GB filesystem would
already have 64 goals, which doesn't seem *that* limiting?

Also note that as the filesystem fills up and the free space gets
fragmented, the number of groups where a large allocation can succeed will
shrink. Thus, regardless of how many slots for the streaming goal you have,
they will all end up pointing only to those few groups where large
allocations still succeed. So although a large number of slots looks good
for an empty filesystem, the benefit for an aged filesystem diminishes, and
a larger number of slots will make the fs fragment faster.

> Furthermore, after supporting LBS, the number of block groups will
> sharply decrease.

Right. This is going to reduce the scalability of block allocation in
general. Also, as groups grow larger with a larger blocksize, the benefit of
streaming allocation, which just gives a hint about which block group to
use, is going to diminish when the free-block search always starts from 0.
We may need to store an ext4_fsblk_t (effectively combining group+offset in
a single atomic unit) as the streaming goal to mitigate this.
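
To illustrate, something like the sketch below (just an idea; the
s_mb_last_block field is made up and the real representation could
differ, and I'm ignoring bigalloc cluster units for simplicity): store
the goal as a single ext4_fsblk_t and split it back into group + offset
when it is used:

	/* store: remember where the last stream allocation ended up */
	ext4_fsblk_t last = ext4_group_first_block_no(sb, ac->ac_f_ex.fe_group) +
			    ac->ac_f_ex.fe_start;

	WRITE_ONCE(sbi->s_mb_last_block, last);	/* hypothetical field */

	/* load: turn it back into a (group, offset) starting point */
	ext4_group_t group;
	ext4_grpblk_t offset;

	ext4_get_group_no_and_offset(sb, READ_ONCE(sbi->s_mb_last_block),
				     &group, &offset);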

> How about we directly use sbi->s_groups_count (which would effectively be
> min(num_possible_cpus(), sbi->s_groups_count)) instead? This would also
> avoid zero values.

Avoiding zero values is definitely a good point. My concern is that if we
have sb->s_groups_count streaming goals, then practically every group will
become a streaming goal group, and thus we could just remove the streaming
allocation altogether; there would be no benefit.

We could make the streaming goal an ext4_fsblk_t so that the offset of the
last big allocation in the group is also recorded, as I wrote above. That
would tend to pack big allocations in each group together, which is
beneficial for combating fragmentation even with a higher proportion of
groups that are streaming goals (and likely becomes more important as the
blocksize, and thus the group size, grows). We can discuss the proper number
of slots for streaming allocation (I'm not hung up on it being a quarter of
the group count) but I'm convinced sb->s_groups_count is too much :)

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group
  2025-06-30  7:47       ` Jan Kara
@ 2025-06-30  9:21         ` Baokun Li
  2025-06-30 16:32           ` Jan Kara
  0 siblings, 1 reply; 51+ messages in thread
From: Baokun Li @ 2025-06-30  9:21 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, tytso, adilger.kernel, ojaswin, linux-kernel,
	yi.zhang, yangerkun, Baokun Li

On 2025/6/30 15:47, Jan Kara wrote:
> On Mon 30-06-25 11:48:20, Baokun Li wrote:
>> On 2025/6/28 2:19, Jan Kara wrote:
>>> On Mon 23-06-25 15:32:51, Baokun Li wrote:
>>>> After we optimized the block group lock, we found another lock
>>>> contention issue when running will-it-scale/fallocate2 with multiple
>>>> processes. The fallocate's block allocation and the truncate's block
>>>> release were fighting over the s_md_lock. The problem is, this lock
>>>> protects totally different things in those two processes: the list of
>>>> freed data blocks (s_freed_data_list) when releasing, and where to start
>>>> looking for new blocks (mb_last_group) when allocating.
>>>>
>>>> Now we only need to track s_mb_last_group and no longer need to track
>>>> s_mb_last_start, so we don't need the s_md_lock lock to ensure that the
>>>> two are consistent, and we can ensure that the s_mb_last_group read is up
>>>> to date by using smp_store_release/smp_load_acquire.
>>>>
>>>> Besides, the s_mb_last_group data type only requires ext4_group_t
>>>> (i.e., unsigned int), rendering unsigned long superfluous.
>>>>
>>>> Performance test data follows:
>>>>
>>>> Test: Running will-it-scale/fallocate2 on CPU-bound containers.
>>>> Observation: Average fallocate operations per container per second.
>>>>
>>>>                      | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
>>>>    Disk: 960GB SSD   |-------------------------|-------------------------|
>>>>                      | base  |    patched      | base  |    patched      |
>>>> -------------------|-------|-----------------|-------|-----------------|
>>>> mb_optimize_scan=0 | 4821  | 7612  (+57.8%)  | 15371 | 21647 (+40.8%)  |
>>>> mb_optimize_scan=1 | 4784  | 7568  (+58.1%)  | 6101  | 9117  (+49.4%)  |
>>>>
>>>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
>>> ...
>>>
>>>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>>>> index 5cdae3bda072..3f103919868b 100644
>>>> --- a/fs/ext4/mballoc.c
>>>> +++ b/fs/ext4/mballoc.c
>>>> @@ -2168,11 +2168,9 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
>>>>    	ac->ac_buddy_folio = e4b->bd_buddy_folio;
>>>>    	folio_get(ac->ac_buddy_folio);
>>>>    	/* store last allocated for subsequent stream allocation */
>>>> -	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
>>>> -		spin_lock(&sbi->s_md_lock);
>>>> -		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
>>>> -		spin_unlock(&sbi->s_md_lock);
>>>> -	}
>>>> +	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
>>>> +		/* pairs with smp_load_acquire in ext4_mb_regular_allocator() */
>>>> +		smp_store_release(&sbi->s_mb_last_group, ac->ac_f_ex.fe_group);
>>> Do you really need any kind of barrier (implied by smp_store_release())
>>> here? I mean the store to s_mb_last_group is perfectly fine to be reordered
>>> with other accesses from the thread, isn't it? As such it should be enough
>>> to have WRITE_ONCE() here...
>> WRITE_ONCE()/READ_ONCE() primarily prevent compiler reordering and ensure
>> that variable reads/writes access values directly from L1/L2 cache rather
>> than registers.
> I agree READ_ONCE() / WRITE_ONCE() are about compiler optimizations - in
> particular they force the compiler to read / write the memory location
> exactly once instead of reading it potentially multiple times in different
> parts of expression and getting inconsistent values, or possibly writing
> the value say byte by byte (yes, that would be insane but not contrary to
> the C standard).
READ_ONCE() and WRITE_ONCE() rely on the volatile keyword, which serves
two main purposes:

1. It tells the compiler that the variable's value can change unexpectedly,
    preventing the compiler from making incorrect optimizations based on
    assumptions about its stability.

2. It ensures the CPU directly reads from or writes to the variable's
    memory address. This means the value will be fetched from cache (L1/L2)
    if available, or from main memory otherwise, rather than using a stale
    value from a CPU register.
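
A tiny example of point 1 (the names here are made up, purely
illustrative):

	/* Without READ_ONCE(), the compiler may hoist the load out of the
	 * loop and spin on a stale register value forever. */
	while (!READ_ONCE(flag))
		cpu_relax();

	/* Likewise, WRITE_ONCE() forces a single real store to memory. */
	WRITE_ONCE(flag, 1);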
>> They do not guarantee that other CPUs see the latest values. Reading stale
>> values could lead to more useless traversals, which might incur higher
>> overhead than memory barriers. This is why we use memory barriers to ensure
>> the latest values are read.
> But smp_load_acquire() / smp_store_release() have no guarantee about CPU
> seeing latest values either. They are just speculation barriers meaning
> they prevent the CPU from reordering accesses in the code after
> smp_load_acquire() to be performed before the smp_load_acquire() is
> executed and similarly with smp_store_release(). So I dare to say that
> these barriers have no (positive) impact on the allocation performance and
> just complicate the code - but if you have some data that show otherwise,
> I'd be happy to be proven wrong.
smp_load_acquire() / smp_store_release() guarantee that CPUs read the
latest data.

For example, imagine a variable a = 0, with both CPU0 and CPU1 having
a=0 in their caches.

Without a memory barrier:
When CPU0 executes WRITE_ONCE(a, 1), a=1 is written to the store buffer,
an RFO is broadcast, and CPU0 continues other tasks. After receiving ACKs,
a=1 is written to main memory and becomes visible to other CPUs.
Then, if CPU1 executes READ_ONCE(a), it receives the RFO and adds it to
its invalidation queue. However, it might not process it immediately;
instead, it could perform the read first, potentially still reading a=0
from its cache.

With a memory barrier:
When CPU0 executes smp_store_release(&a, 1), a=1 is not only written to
the store buffer, but data in the store buffer is also written to main
memory. An RFO is then broadcast, and CPU0 waits for ACKs from all CPUs.

When CPU1 executes smp_load_acquire(&a), it receives the RFO and adds it
to its invalidation queue. Here, the invalidation queue is flushed, which
invalidates a in CPU1's cache. CPU1 then replies with an ACK, and when it
performs the read, its cache is invalid, so it reads the latest a=1 from
main memory.

This is a general overview. Please let me know if I've missed anything.
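
For context, the pairing in patch 3 looks like this (the writer side is
taken from the diff above; the reader side is paraphrased from the patch
description):

	/* writer, in ext4_mb_use_best_found() */
	smp_store_release(&sbi->s_mb_last_group, ac->ac_f_ex.fe_group);

	/* reader, in ext4_mb_regular_allocator() */
	ac->ac_g_ex.fe_group = smp_load_acquire(&sbi->s_mb_last_group);

The documented guarantee of such a pair is ordering: if the reader
observes the new fe_group value, it also observes every store the writer
made before the release.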


Thanks,
Baokun


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 04/16] ext4: utilize multiple global goals to reduce contention
  2025-06-30  8:38       ` Jan Kara
@ 2025-06-30 10:02         ` Baokun Li
  2025-06-30 17:41           ` Jan Kara
  0 siblings, 1 reply; 51+ messages in thread
From: Baokun Li @ 2025-06-30 10:02 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, tytso, adilger.kernel, ojaswin, linux-kernel,
	yi.zhang, yangerkun, Baokun Li

On 2025/6/30 16:38, Jan Kara wrote:
> On Mon 30-06-25 14:50:30, Baokun Li wrote:
>> On 2025/6/28 2:31, Jan Kara wrote:
>>> On Mon 23-06-25 15:32:52, Baokun Li wrote:
>>>> When allocating data blocks, if the first try (goal allocation) fails and
>>>> stream allocation is on, it tries a global goal starting from the last
>>>> group we used (s_mb_last_group). This helps cluster large files together
>>>> to reduce free space fragmentation, and the data block contiguity also
>>>> accelerates write-back to disk.
>>>>
>>>> However, when multiple processes allocate blocks, having just one global
>>>> goal means they all fight over the same group. This drastically lowers
>>>> the chances of extents merging and leads to much worse file fragmentation.
>>>>
>>>> To mitigate this multi-process contention, we now employ multiple global
>>>> goals, with the number of goals being the CPU count rounded up to the
>>>> nearest power of 2. To ensure a consistent goal for each inode, we select
>>>> the corresponding goal by taking the inode number modulo the total number
>>>> of goals.
>>>>
>>>> Performance test data follows:
>>>>
>>>> Test: Running will-it-scale/fallocate2 on CPU-bound containers.
>>>> Observation: Average fallocate operations per container per second.
>>>>
>>>>                      | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
>>>>    Disk: 960GB SSD   |-------------------------|-------------------------|
>>>>                      | base  |    patched      | base  |    patched      |
>>>> -------------------|-------|-----------------|-------|-----------------|
>>>> mb_optimize_scan=0 | 7612  | 19699 (+158%)   | 21647 | 53093 (+145%)   |
>>>> mb_optimize_scan=1 | 7568  | 9862  (+30.3%)  | 9117  | 14401 (+57.9%)  |
>>>>
>>>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
>>> ...
>>>
>>>> +/*
>>>> + * Number of mb last groups
>>>> + */
>>>> +#ifdef CONFIG_SMP
>>>> +#define MB_LAST_GROUPS roundup_pow_of_two(nr_cpu_ids)
>>>> +#else
>>>> +#define MB_LAST_GROUPS 1
>>>> +#endif
>>>> +
>>> I think this is too aggressive. nr_cpu_ids is easily 4096 or similar for
>>> distribution kernels (it is just a theoretical maximum for the number of
>>> CPUs the kernel can support)
>> nr_cpu_ids is generally equal to num_possible_cpus(). Only when
>> CONFIG_FORCE_NR_CPUS is enabled will nr_cpu_ids be set to NR_CPUS,
>> which represents the maximum number of supported CPUs.
> Indeed, CONFIG_FORCE_NR_CPUS confused me.
>
>>> which seems like far too much for small
>>> filesystems with say 100 block groups.
>> It does make sense.
>>
>>> I'd rather pick the array size like:
>>>
>>> min(num_possible_cpus(), sbi->s_groups_count/4)
>>>
>>> to
>>>
>>> a) don't have too many slots so we still concentrate big allocations in
>>> somewhat limited area of the filesystem (a quarter of block groups here).
>>>
>>> b) have at most one slot per CPU the machine hardware can in principle
>>> support.
>>>
>>> 								Honza
>> You're right, we should consider the number of block groups when setting
>> the number of global goals.
>>
>> However, a server's rootfs can often be quite small, perhaps only tens of
>> GBs, while having many CPUs. In such cases, sbi->s_groups_count / 4 might
>> still limit the filesystem's scalability.
> I would not expect such root filesystem to be loaded by many big
> allocations in parallel :). And with 4k blocksize 32GB filesystem would
> have already 64 goals which doesn't seem *that* limiting?

Docker's default path is on the rootfs. Our rootfs size is typically 70GB,
but we might have 300+ or even 500+ CPUs. This could lead to scalability
issues in certain specific scenarios. However, in general,
sbi->s_groups_count / 4 does appear to be sufficient.

> Also note that as the filesystem is filling up and the free space is getting
> fragmented, the number of groups where large allocation can succeed will
> reduce. Thus regardless of how many slots for streaming goal you have, they
> will all end up pointing only to those several groups where large
> allocations still succeed. So although a large number of slots looks good for
> an empty filesystem, the benefit for aged filesystem is diminishing and
> larger number of slots will make the fs fragment faster.
I don't think so. Although we're now splitting into multiple goals, these
goals all start from zero. This means 'n' goals will cause us to scan all
groups 'n' times. We'll repeatedly search for free space on disk rather
than creating more fragmentation.

This approach can actually help with the case where, with a single goal,
the goal group still has 4K of free space but an 8K allocation request
skips past it, forcing subsequent 4K allocation requests to split larger
free extents.
>
>> Furthermore, after supporting LBS, the number of block groups will
>> sharply decrease.
> Right. This is going to reduce scalability of block allocation in general.
> Also as the groups grow larger with larger blocksize the benefit of
> streaming allocation which just gives a hint about block group to use is
> going to diminish when the free block search will be always starting from
> 0. We will maybe need to store ext4_fsblk_t (effectively combining
> group+offset in a single atomic unit) as a streaming goal to mitigate this.
I don't think that's necessary. We still need to consider block group lock
contention, so the smallest unit should always be the group.
>
>> How about we directly use sbi->s_groups_count (which would effectively be
>> min(num_possible_cpus(), sbi->s_groups_count)) instead? This would also
>> avoid zero values.
> Avoiding zero values is definitely a good point. My concern is that if we
> have sb->s_groups_count streaming goals, then practically each group will
> become a streaming goal group and thus we can just remove the streaming
> allocation altogether, there's no benefit.
Having 'n' goals simply means we scan the groups 'n' times; it's not
related to the number of groups. However, when there are too many goals,
the probability of contention due to identical goals increases.
Nevertheless, this is always better than having a single goal, where they
would always contend for the same one.

Since we now hash based on the inode number, we could later pick the
corresponding inode number based on the CPU ID during inode allocation.
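
For clarity, the goal selection I have in mind is roughly as below (the
array and field names are illustrative only, not necessarily what the
next version will use):

	/* one stream goal per slot; the slot count is decided at mount
	 * time, e.g. min(num_possible_cpus(), sbi->s_groups_count / 4)
	 * as discussed above */
	unsigned int slot = ac->ac_inode->i_ino % sbi->s_mb_nr_last_groups;

	/* read with whichever primitive we settle on */
	ac->ac_g_ex.fe_group = sbi->s_mb_last_groups[slot];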
>
> We could make streaming goal to be ext4_fsblk_t so that also offset of the
> last big allocation in the group is recorded as I wrote above. That would
> > tend to pack big allocations in each group together which is beneficial to
> combat fragmentation even with higher proportion of groups that are streaming
> goals (and likely becomes more important as the blocksize and thus group
> size grow). We can discuss proper number of slots for streaming allocation
> (I'm not hung up on it being quarter of the group count) but I'm convinced
> sb->s_groups_count is too much :)
>
> 								Honza

I think sbi->s_groups_count / 4 is indeed acceptable. However, I don't
believe recording offsets is necessary. As groups become larger,
contention for groups will intensify, and adding offsets would only
make this contention worse.


Regards,
Baokun


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group
  2025-06-30  9:21         ` Baokun Li
@ 2025-06-30 16:32           ` Jan Kara
  2025-07-01  2:39             ` Baokun Li
  0 siblings, 1 reply; 51+ messages in thread
From: Jan Kara @ 2025-06-30 16:32 UTC (permalink / raw)
  To: Baokun Li
  Cc: Jan Kara, linux-ext4, tytso, adilger.kernel, ojaswin,
	linux-kernel, yi.zhang, yangerkun

On Mon 30-06-25 17:21:48, Baokun Li wrote:
> On 2025/6/30 15:47, Jan Kara wrote:
> > On Mon 30-06-25 11:48:20, Baokun Li wrote:
> > > On 2025/6/28 2:19, Jan Kara wrote:
> > > > On Mon 23-06-25 15:32:51, Baokun Li wrote:
> > > > > After we optimized the block group lock, we found another lock
> > > > > contention issue when running will-it-scale/fallocate2 with multiple
> > > > > processes. The fallocate's block allocation and the truncate's block
> > > > > release were fighting over the s_md_lock. The problem is, this lock
> > > > > protects totally different things in those two processes: the list of
> > > > > freed data blocks (s_freed_data_list) when releasing, and where to start
> > > > > looking for new blocks (mb_last_group) when allocating.
> > > > > 
> > > > > Now we only need to track s_mb_last_group and no longer need to track
> > > > > s_mb_last_start, so we don't need the s_md_lock lock to ensure that the
> > > > > two are consistent, and we can ensure that the s_mb_last_group read is up
> > > > > to date by using smp_store_release/smp_load_acquire.
> > > > > 
> > > > > Besides, the s_mb_last_group data type only requires ext4_group_t
> > > > > (i.e., unsigned int), rendering unsigned long superfluous.
> > > > > 
> > > > > Performance test data follows:
> > > > > 
> > > > > Test: Running will-it-scale/fallocate2 on CPU-bound containers.
> > > > > Observation: Average fallocate operations per container per second.
> > > > > 
> > > > >                      | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
> > > > >    Disk: 960GB SSD   |-------------------------|-------------------------|
> > > > >                      | base  |    patched      | base  |    patched      |
> > > > > -------------------|-------|-----------------|-------|-----------------|
> > > > > mb_optimize_scan=0 | 4821  | 7612  (+57.8%)  | 15371 | 21647 (+40.8%)  |
> > > > > mb_optimize_scan=1 | 4784  | 7568  (+58.1%)  | 6101  | 9117  (+49.4%)  |
> > > > > 
> > > > > Signed-off-by: Baokun Li <libaokun1@huawei.com>
> > > > ...
> > > > 
> > > > > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > > > > index 5cdae3bda072..3f103919868b 100644
> > > > > --- a/fs/ext4/mballoc.c
> > > > > +++ b/fs/ext4/mballoc.c
> > > > > @@ -2168,11 +2168,9 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
> > > > >    	ac->ac_buddy_folio = e4b->bd_buddy_folio;
> > > > >    	folio_get(ac->ac_buddy_folio);
> > > > >    	/* store last allocated for subsequent stream allocation */
> > > > > -	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
> > > > > -		spin_lock(&sbi->s_md_lock);
> > > > > -		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
> > > > > -		spin_unlock(&sbi->s_md_lock);
> > > > > -	}
> > > > > +	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
> > > > > +		/* pairs with smp_load_acquire in ext4_mb_regular_allocator() */
> > > > > +		smp_store_release(&sbi->s_mb_last_group, ac->ac_f_ex.fe_group);
> > > > Do you really need any kind of barrier (implied by smp_store_release())
> > > > here? I mean the store to s_mb_last_group is perfectly fine to be reordered
> > > > with other accesses from the thread, isn't it? As such it should be enough
> > > > to have WRITE_ONCE() here...
> > > WRITE_ONCE()/READ_ONCE() primarily prevent compiler reordering and ensure
> > > that variable reads/writes access values directly from L1/L2 cache rather
> > > than registers.
> > I agree READ_ONCE() / WRITE_ONCE() are about compiler optimizations - in
> > particular they force the compiler to read / write the memory location
> > exactly once instead of reading it potentially multiple times in different
> > parts of expression and getting inconsistent values, or possibly writing
> > the value say byte by byte (yes, that would be insane but not contrary to
> > the C standard).
> READ_ONCE() and WRITE_ONCE() rely on the volatile keyword, which serves
> two main purposes:
> 
> 1. It tells the compiler that the variable's value can change unexpectedly,
>    preventing the compiler from making incorrect optimizations based on
>    assumptions about its stability.
> 
> 2. It ensures the CPU directly reads from or writes to the variable's
>    memory address. This means the value will be fetched from cache (L1/L2)
>    if available, or from main memory otherwise, rather than using a stale
>    value from a CPU register.

Yes, we agree on this.

> > > They do not guarantee that other CPUs see the latest values. Reading stale
> > > values could lead to more useless traversals, which might incur higher
> > > overhead than memory barriers. This is why we use memory barriers to ensure
> > > the latest values are read.
> > But smp_load_acquire() / smp_store_release() have no guarantee about CPU
> > seeing latest values either. They are just speculation barriers meaning
> > they prevent the CPU from reordering accesses in the code after
> > smp_load_acquire() to be performed before the smp_load_acquire() is
> > executed and similarly with smp_store_release(). So I dare to say that
> > these barriers have no (positive) impact on the allocation performance and
> > just complicate the code - but if you have some data that show otherwise,
> > I'd be happy to be proven wrong.
> smp_load_acquire() / smp_store_release() guarantee that CPUs read the
> latest data.
> 
> For example, imagine a variable a = 0, with both CPU0 and CPU1 having
> a=0 in their caches.
> 
> Without a memory barrier:
> When CPU0 executes WRITE_ONCE(a, 1), a=1 is written to the store buffer,
> an RFO is broadcast, and CPU0 continues other tasks. After receiving ACKs,
> a=1 is written to main memory and becomes visible to other CPUs.
> Then, if CPU1 executes READ_ONCE(a), it receives the RFO and adds it to
> its invalidation queue. However, it might not process it immediately;
> instead, it could perform the read first, potentially still reading a=0
> from its cache.
> 
> With a memory barrier:
> When CPU0 executes smp_store_release(&a, 1), a=1 is not only written to
> the store buffer, but data in the store buffer is also written to main
> memory. An RFO is then broadcast, and CPU0 waits for ACKs from all CPUs.
> 
> When CPU1 executes smp_load_acquire(&a), it receives the RFO and adds it
> to its invalidation queue. Here, the invalidation queue is flushed, which
> invalidates a in CPU1's cache. CPU1 then replies with an ACK, and when it
> performs the read, its cache is invalid, so it reads the latest a=1 from
> main memory.

Well, here I think you assume way more about the CPU architecture than is
generally true (and I didn't find what you write above guaranteed by either
the x86 or the arm64 CPU documentation). Generally I'm following the
guarantees as defined by Documentation/memory-barriers.txt, and there you
can argue only about the order of effects as observed by different CPUs,
but not really about when content is fetched to / from the CPU caches.

BTW on x86 in particular smp_load_acquire() and smp_store_release() aren't
very different from pure READ_ONCE() / WRITE_ONCE:

arch/x86/include/asm/barrier.h:

#define __smp_store_release(p, v)                                       \
do {                                                                    \
        compiletime_assert_atomic_type(*p);                             \
        barrier();                                                      \
        WRITE_ONCE(*p, v);                                              \
} while (0)

#define __smp_load_acquire(p)                                           \
({                                                                      \
        typeof(*p) ___p1 = READ_ONCE(*p);                               \
        compiletime_assert_atomic_type(*p);                             \
        barrier();                                                      \
        ___p1;                                                          \
})

where barrier() is just a compiler barrier - i.e., preventing the compiler
from reordering accesses around this point. This is because x86 is strongly
ordered and the only reordering the CPU does is moving loads ahead of
earlier stores.
TL;DR: on x86 there's no practical difference between using READ_ONCE() /
WRITE_ONCE() and smp_load_acquire() and smp_store_release() in your code.
So I still think using those will be clearer and I'd be curious if you can
see any performance impacts from using READ_ONCE / WRITE_ONCE instead of
smp_load_acquire() / smp_store_release().

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 04/16] ext4: utilize multiple global goals to reduce contention
  2025-06-30 10:02         ` Baokun Li
@ 2025-06-30 17:41           ` Jan Kara
  2025-07-01  3:32             ` Baokun Li
  0 siblings, 1 reply; 51+ messages in thread
From: Jan Kara @ 2025-06-30 17:41 UTC (permalink / raw)
  To: Baokun Li
  Cc: Jan Kara, linux-ext4, tytso, adilger.kernel, ojaswin,
	linux-kernel, yi.zhang, yangerkun

On Mon 30-06-25 18:02:49, Baokun Li wrote:
> On 2025/6/30 16:38, Jan Kara wrote:
> > We could make streaming goal to be ext4_fsblk_t so that also offset of the
> > last big allocation in the group is recorded as I wrote above. That would
> > tend to pack big allocations in each group together which is beneficial to
> > combat fragmentation even with higher proportion of groups that are streaming
> > goals (and likely becomes more important as the blocksize and thus group
> > size grow). We can discuss proper number of slots for streaming allocation
> > (I'm not hung up on it being quarter of the group count) but I'm convinced
> > sb->s_groups_count is too much :)
> > 
> > 								Honza
> 
> I think sbi->s_groups_count / 4 is indeed acceptable. However, I don't
> believe recording offsets is necessary. As groups become larger,
> contention for groups will intensify, and adding offsets would only
> make this contention worse.

I agree the contention for groups will increase when the group count goes
down. I just thought offsets may help to find free space faster in large
groups (and thus reduce contention) and also reduce free space
fragmentation within a group (by having higher chances of placing large
allocations close together within a group) but maybe that's not the case.
Offsets are definitely not requirement at this point.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group
  2025-06-30 16:32           ` Jan Kara
@ 2025-07-01  2:39             ` Baokun Li
  2025-07-01 12:21               ` Jan Kara
  0 siblings, 1 reply; 51+ messages in thread
From: Baokun Li @ 2025-07-01  2:39 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, tytso, adilger.kernel, ojaswin, linux-kernel,
	yi.zhang, yangerkun, Baokun Li

On 2025/7/1 0:32, Jan Kara wrote:
> On Mon 30-06-25 17:21:48, Baokun Li wrote:
>> On 2025/6/30 15:47, Jan Kara wrote:
>>> On Mon 30-06-25 11:48:20, Baokun Li wrote:
>>>> On 2025/6/28 2:19, Jan Kara wrote:
>>>>> On Mon 23-06-25 15:32:51, Baokun Li wrote:
>>>>>> After we optimized the block group lock, we found another lock
>>>>>> contention issue when running will-it-scale/fallocate2 with multiple
>>>>>> processes. The fallocate's block allocation and the truncate's block
>>>>>> release were fighting over the s_md_lock. The problem is, this lock
>>>>>> protects totally different things in those two processes: the list of
>>>>>> freed data blocks (s_freed_data_list) when releasing, and where to start
>>>>>> looking for new blocks (mb_last_group) when allocating.
>>>>>>
>>>>>> Now we only need to track s_mb_last_group and no longer need to track
>>>>>> s_mb_last_start, so we don't need the s_md_lock lock to ensure that the
>>>>>> two are consistent, and we can ensure that the s_mb_last_group read is up
>>>>>> to date by using smp_store_release/smp_load_acquire.
>>>>>>
>>>>>> Besides, the s_mb_last_group data type only requires ext4_group_t
>>>>>> (i.e., unsigned int), rendering unsigned long superfluous.
>>>>>>
>>>>>> Performance test data follows:
>>>>>>
>>>>>> Test: Running will-it-scale/fallocate2 on CPU-bound containers.
>>>>>> Observation: Average fallocate operations per container per second.
>>>>>>
>>>>>>                       | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
>>>>>>     Disk: 960GB SSD   |-------------------------|-------------------------|
>>>>>>                       | base  |    patched      | base  |    patched      |
>>>>>> -------------------|-------|-----------------|-------|-----------------|
>>>>>> mb_optimize_scan=0 | 4821  | 7612  (+57.8%)  | 15371 | 21647 (+40.8%)  |
>>>>>> mb_optimize_scan=1 | 4784  | 7568  (+58.1%)  | 6101  | 9117  (+49.4%)  |
>>>>>>
>>>>>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
>>>>> ...
>>>>>
>>>>>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>>>>>> index 5cdae3bda072..3f103919868b 100644
>>>>>> --- a/fs/ext4/mballoc.c
>>>>>> +++ b/fs/ext4/mballoc.c
>>>>>> @@ -2168,11 +2168,9 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
>>>>>>     	ac->ac_buddy_folio = e4b->bd_buddy_folio;
>>>>>>     	folio_get(ac->ac_buddy_folio);
>>>>>>     	/* store last allocated for subsequent stream allocation */
>>>>>> -	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
>>>>>> -		spin_lock(&sbi->s_md_lock);
>>>>>> -		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
>>>>>> -		spin_unlock(&sbi->s_md_lock);
>>>>>> -	}
>>>>>> +	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
>>>>>> +		/* pairs with smp_load_acquire in ext4_mb_regular_allocator() */
>>>>>> +		smp_store_release(&sbi->s_mb_last_group, ac->ac_f_ex.fe_group);
>>>>> Do you really need any kind of barrier (implied by smp_store_release())
>>>>> here? I mean the store to s_mb_last_group is perfectly fine to be reordered
>>>>> with other accesses from the thread, isn't it? As such it should be enough
>>>>> to have WRITE_ONCE() here...
>>>> WRITE_ONCE()/READ_ONCE() primarily prevent compiler reordering and ensure
>>>> that variable reads/writes access values directly from L1/L2 cache rather
>>>> than registers.
>>> I agree READ_ONCE() / WRITE_ONCE() are about compiler optimizations - in
>>> particular they force the compiler to read / write the memory location
>>> exactly once instead of reading it potentially multiple times in different
>>> parts of expression and getting inconsistent values, or possibly writing
>>> the value say byte by byte (yes, that would be insane but not contrary to
>>> the C standard).
>> READ_ONCE() and WRITE_ONCE() rely on the volatile keyword, which serves
>> two main purposes:
>>
>> 1. It tells the compiler that the variable's value can change unexpectedly,
>>     preventing the compiler from making incorrect optimizations based on
>>     assumptions about its stability.
>>
>> 2. It ensures the CPU directly reads from or writes to the variable's
>>     memory address. This means the value will be fetched from cache (L1/L2)
>>     if available, or from main memory otherwise, rather than using a stale
>>     value from a CPU register.
> Yes, we agree on this.
>
>>>> They do not guarantee that other CPUs see the latest values. Reading stale
>>>> values could lead to more useless traversals, which might incur higher
>>>> overhead than memory barriers. This is why we use memory barriers to ensure
>>>> the latest values are read.
>>> But smp_load_acquire() / smp_store_release() have no guarantee about CPU
>>> seeing latest values either. They are just speculation barriers meaning
>>> they prevent the CPU from reordering accesses in the code after
>>> smp_load_acquire() to be performed before the smp_load_acquire() is
>>> executed and similarly with smp_store_release(). So I dare to say that
>>> these barriers have no (positive) impact on the allocation performance and
>>> just complicate the code - but if you have some data that show otherwise,
>>> I'd be happy to be proven wrong.
>> smp_load_acquire() / smp_store_release() guarantee that CPUs read the
>> latest data.
>>
>> For example, imagine a variable a = 0, with both CPU0 and CPU1 having
>> a=0 in their caches.
>>
>> Without a memory barrier:
>> When CPU0 executes WRITE_ONCE(a, 1), a=1 is written to the store buffer,
>> an RFO is broadcast, and CPU0 continues other tasks. After receiving ACKs,
>> a=1 is written to main memory and becomes visible to other CPUs.
>> Then, if CPU1 executes READ_ONCE(a), it receives the RFO and adds it to
>> its invalidation queue. However, it might not process it immediately;
>> instead, it could perform the read first, potentially still reading a=0
>> from its cache.
>>
>> With a memory barrier:
>> When CPU0 executes smp_store_release(&a, 1), a=1 is not only written to
>> the store buffer, but data in the store buffer is also written to main
>> memory. An RFO is then broadcast, and CPU0 waits for ACKs from all CPUs.
>>
>> When CPU1 executes smp_load_acquire(&a), it receives the RFO and adds it
>> to its invalidation queue. Here, the invalidation queue is flushed, which
>> invalidates a in CPU1's cache. CPU1 then replies with an ACK, and when it
>> performs the read, its cache is invalid, so it reads the latest a=1 from
>> main memory.
> Well, here I think you assume way more about the CPU architecture than is
> generally true (and I didn't find what you write above guaranteed neither
> by x86 nor by arm64 CPU documentation). Generally I'm following the
> guarantees as defined by Documentation/memory-barriers.txt and there you
> can argue only about order of effects as observed by different CPUs but not
> really about when content is fetched to / from CPU caches.

Explaining why smp_load_acquire() and smp_store_release() guarantee the
latest data is read truly requires delving into their underlying
implementation details.

I suggest you Google "why memory barriers are needed." You might find
introductions to concepts like 'Total Store Order', 'Weak Memory Ordering',
MESI, store buffers, and invalidation queues, along with the stories behind
them.

The Documentation/memory-barriers.txt file does a good job of introducing
memory barrier concepts and guiding their usage (for instance, the
'MULTICOPY ATOMICITY' section covers CPU cache coherence in detail).
However, it skips many of the specific implementation details that are
quite often necessary for a deeper understanding.

>
> BTW on x86 in particular smp_load_acquire() and smp_store_release() aren't
> very different from pure READ_ONCE() / WRITE_ONCE:
>
> arch/x86/include/asm/barrier.h:
>
> #define __smp_store_release(p, v)                                       \
> do {                                                                    \
>          compiletime_assert_atomic_type(*p);                             \
>          barrier();                                                      \
>          WRITE_ONCE(*p, v);                                              \
> } while (0)
>
> #define __smp_load_acquire(p)                                           \
> ({                                                                      \
>          typeof(*p) ___p1 = READ_ONCE(*p);                               \
>          compiletime_assert_atomic_type(*p);                             \
>          barrier();                                                      \
>          ___p1;                                                          \
> })
>
> where barrier() is just a compiler barrier - i.e., preventing the compiler
> from reordering accesses around this point. This is because x86 is strongly
> ordered and the CPU can only reorder loads earlier than previous stores.
> TL;DR; on x86 there's no practical difference between using READ_ONCE() /
> WRITE_ONCE() and smp_load_acquire() and smp_store_release() in your code.
> So I still think using those will be clearer and I'd be curious if you can
> see any performance impacts from using READ_ONCE / WRITE_ONCE instead of
> smp_load_acquire() / smp_store_release().
>
> 								Honza

Yes, x86 is a strongly ordered memory architecture. For x86, we only need
to use READ_ONCE()/WRITE_ONCE() to ensure access to data in the CPU cache,
as x86 guarantees the cache is up-to-date.

However, the Linux kernel doesn't exclusively run on x86 architectures;
we have a large number of arm64 servers. Disregarding performance, it's
inherently unreasonable that x86 consistently sees the latest global goals
during block allocation while arm64 does not.


Cheers,
Baokun


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group
  2025-06-23  7:32 ` [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group Baokun Li
  2025-06-27 18:19   ` Jan Kara
@ 2025-07-01  2:57   ` kernel test robot
  1 sibling, 0 replies; 51+ messages in thread
From: kernel test robot @ 2025-07-01  2:57 UTC (permalink / raw)
  To: Baokun Li
  Cc: oe-lkp, lkp, linux-ext4, tytso, jack, adilger.kernel, ojaswin,
	linux-kernel, yi.zhang, yangerkun, libaokun1, oliver.sang



Hello,

kernel test robot noticed a 31.1% improvement of stress-ng.fsize.ops_per_sec on:


commit: ad0d50f30d3fe376a99fd0e392867c7ca9b619e3 ("[PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group")
url: https://github.com/intel-lab-lkp/linux/commits/Baokun-Li/ext4-add-ext4_try_lock_group-to-skip-busy-groups/20250623-155451
base: https://git.kernel.org/cgit/linux/kernel/git/tytso/ext4.git dev
patch link: https://lore.kernel.org/all/20250623073304.3275702-4-libaokun1@huawei.com/
patch subject: [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group

testcase: stress-ng
config: x86_64-rhel-9.4
compiler: gcc-12
test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake) with 128G memory
parameters:

	nr_threads: 100%
	disk: 1HDD
	testtime: 60s
	fs: ext4
	test: fsize
	cpufreq_governor: performance



Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250701/202507010457.3b3d3c33-lkp@intel.com

=========================================================================================
compiler/cpufreq_governor/disk/fs/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
  gcc-12/performance/1HDD/ext4/x86_64-rhel-9.4/100%/debian-12-x86_64-20240206.cgz/lkp-icl-2sp4/fsize/stress-ng/60s

commit: 
  86f92bf2c0 ("ext4: remove unnecessary s_mb_last_start")
  ad0d50f30d ("ext4: remove unnecessary s_md_lock on update s_mb_last_group")

86f92bf2c059852a ad0d50f30d3fe376a99fd0e3928 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
      5042 ±  4%     -10.1%       4532 ±  2%  meminfo.Dirty
    100194 ± 63%     +92.5%     192828 ± 32%  numa-meminfo.node0.Shmem
      5082 ±  3%     +28.1%       6510 ±  5%  vmstat.system.cs
     71089           -17.1%      58900 ±  2%  perf-c2c.DRAM.remote
     44206           -13.4%      38284 ±  2%  perf-c2c.HITM.remote
    131696            -4.1%     126359 ±  2%  perf-c2c.HITM.total
      0.15 ± 18%      +0.2        0.35 ± 14%  mpstat.cpu.all.iowait%
      0.32 ±  7%      -0.0        0.28 ±  4%  mpstat.cpu.all.irq%
      0.05 ±  4%      +0.0        0.07 ±  3%  mpstat.cpu.all.soft%
      0.50 ± 13%      +0.2        0.69 ± 16%  mpstat.cpu.all.usr%
  14478005 ±  2%     +32.7%   19217687 ±  4%  numa-numastat.node0.local_node
  14540770 ±  2%     +32.6%   19285137 ±  4%  numa-numastat.node0.numa_hit
  14722680           +28.8%   18967713        numa-numastat.node1.local_node
  14793059           +28.7%   19032805        numa-numastat.node1.numa_hit
    918392           -38.4%     565297 ± 18%  sched_debug.cpu.avg_idle.avg
    356474 ±  5%     -92.0%      28413 ± 90%  sched_debug.cpu.avg_idle.min
      2362 ±  2%     +18.8%       2806 ±  4%  sched_debug.cpu.nr_switches.avg
      1027           +35.5%       1391 ±  6%  sched_debug.cpu.nr_switches.min
     25263 ± 63%     +91.0%      48258 ± 31%  numa-vmstat.node0.nr_shmem
  14540796 ±  2%     +32.5%   19271949 ±  4%  numa-vmstat.node0.numa_hit
  14478031 ±  2%     +32.6%   19204499 ±  4%  numa-vmstat.node0.numa_local
  14792432           +28.6%   19020203        numa-vmstat.node1.numa_hit
  14722053           +28.8%   18955111        numa-vmstat.node1.numa_local
      3780           +30.9%       4950 ±  2%  stress-ng.fsize.SIGXFSZ_signals_per_sec
    643887           +31.0%     843807 ±  2%  stress-ng.fsize.ops
     10726           +31.1%      14059 ±  2%  stress-ng.fsize.ops_per_sec
    126167 ±  2%      +8.7%     137085 ±  2%  stress-ng.time.involuntary_context_switches
     21.82 ±  2%     +45.1%      31.66 ±  4%  stress-ng.time.user_time
      5144 ± 15%    +704.0%      41366 ± 20%  stress-ng.time.voluntary_context_switches
      1272 ±  4%     -10.8%       1135 ±  2%  proc-vmstat.nr_dirty
     59459            +8.1%      64288        proc-vmstat.nr_slab_reclaimable
      1272 ±  4%     -10.8%       1134 ±  2%  proc-vmstat.nr_zone_write_pending
  29335922           +30.6%   38319823        proc-vmstat.numa_hit
  29202778           +30.8%   38187281        proc-vmstat.numa_local
  35012787           +31.9%   46166245 ±  2%  proc-vmstat.pgalloc_normal
  34753289           +31.9%   45830460 ±  2%  proc-vmstat.pgfree
    120464            +2.3%     123212        proc-vmstat.pgpgout
      0.35 ±  3%      +0.1        0.41 ±  3%  perf-stat.i.branch-miss-rate%
  48059547           +21.7%   58484853        perf-stat.i.branch-misses
     33.69            -1.8       31.91        perf-stat.i.cache-miss-rate%
 1.227e+08           +13.5%  1.392e+08 ±  7%  perf-stat.i.cache-misses
 3.623e+08           +19.9%  4.342e+08 ±  7%  perf-stat.i.cache-references
      4958 ±  3%     +30.4%       6467 ±  4%  perf-stat.i.context-switches
      6.10            -5.2%       5.79 ±  4%  perf-stat.i.cpi
    208.43           +22.0%     254.30 ±  5%  perf-stat.i.cpu-migrations
      3333           -11.4%       2954 ±  7%  perf-stat.i.cycles-between-cache-misses
      0.33            +0.1        0.39 ±  2%  perf-stat.overall.branch-miss-rate%
     33.87            -1.8       32.04        perf-stat.overall.cache-miss-rate%
      6.16            -5.3%       5.83 ±  4%  perf-stat.overall.cpi
      3360           -11.5%       2973 ±  7%  perf-stat.overall.cycles-between-cache-misses
      0.16            +5.8%       0.17 ±  4%  perf-stat.overall.ipc
  47200442           +21.7%   57451126        perf-stat.ps.branch-misses
 1.206e+08           +13.5%  1.369e+08 ±  7%  perf-stat.ps.cache-misses
 3.563e+08           +19.9%  4.271e+08 ±  7%  perf-stat.ps.cache-references
      4873 ±  3%     +30.3%       6351 ±  4%  perf-stat.ps.context-switches
    204.75           +22.0%     249.75 ±  5%  perf-stat.ps.cpu-migrations
 6.583e+10            +5.7%  6.955e+10 ±  4%  perf-stat.ps.instructions
 4.046e+12            +5.5%  4.267e+12 ±  4%  perf-stat.total.instructions
      0.15 ± 24%     +97.6%       0.31 ± 21%  perf-sched.sch_delay.avg.ms.__cond_resched.ext4_free_blocks.ext4_remove_blocks.ext4_ext_rm_leaf.ext4_ext_remove_space
      0.69 ± 34%     -45.3%       0.38 ± 24%  perf-sched.sch_delay.avg.ms.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_poll
      0.04 ±  2%     -11.0%       0.03 ±  7%  perf-sched.sch_delay.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
      0.09 ± 18%    +104.1%       0.19 ± 38%  perf-sched.sch_delay.avg.ms.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.[unknown]
      0.32 ± 59%    +284.8%       1.24 ± 71%  perf-sched.sch_delay.max.ms.__cond_resched.__alloc_frozen_pages_noprof.alloc_pages_mpol.folio_alloc_noprof.__filemap_get_folio
     16.34 ± 81%     -81.7%       2.99 ± 34%  perf-sched.sch_delay.max.ms.__cond_resched.__ext4_handle_dirty_metadata.ext4_mb_mark_context.ext4_mb_mark_diskspace_used.ext4_mb_new_blocks
      3.51 ± 11%     +56.2%       5.48 ± 38%  perf-sched.sch_delay.max.ms.__cond_resched.__ext4_mark_inode_dirty.ext4_dirty_inode.__mark_inode_dirty.ext4_setattr
      0.06 ±223%   +1443.8%       0.86 ± 97%  perf-sched.sch_delay.max.ms.__cond_resched.__ext4_mark_inode_dirty.ext4_dirty_inode.__mark_inode_dirty.generic_update_time
      0.47 ± 33%    +337.5%       2.05 ± 67%  perf-sched.sch_delay.max.ms.__cond_resched.__ext4_mark_inode_dirty.ext4_ext_insert_extent.ext4_ext_map_blocks.ext4_map_create_blocks
      0.47 ± 64%    +417.9%       2.43 ± 53%  perf-sched.sch_delay.max.ms.__cond_resched.__ext4_mark_inode_dirty.ext4_truncate.ext4_setattr.notify_change
      7.30 ± 60%     -53.7%       3.38 ± 22%  perf-sched.sch_delay.max.ms.__cond_resched.__find_get_block_slow.find_get_block_common.bdev_getblk.ext4_read_block_bitmap_nowait
      2.72 ± 34%     +59.5%       4.33 ± 20%  perf-sched.sch_delay.max.ms.__cond_resched.down_read.ext4_map_blocks.ext4_alloc_file_blocks.isra
      0.08 ±138%    +382.6%       0.37 ± 24%  perf-sched.sch_delay.max.ms.__cond_resched.down_write.do_truncate.do_ftruncate.do_sys_ftruncate
      1.33 ± 90%    +122.5%       2.96 ± 34%  perf-sched.sch_delay.max.ms.__cond_resched.down_write.ext4_alloc_file_blocks.isra.0
      3.04           +93.7%       5.89 ± 82%  perf-sched.sch_delay.max.ms.__cond_resched.down_write.ext4_setattr.notify_change.do_truncate
      3.66 ± 19%     +52.6%       5.59 ± 31%  perf-sched.sch_delay.max.ms.__cond_resched.down_write.ext4_truncate.ext4_setattr.notify_change
      0.41 ± 26%    +169.4%       1.11 ± 78%  perf-sched.sch_delay.max.ms.__cond_resched.ext4_free_blocks.ext4_remove_blocks.ext4_ext_rm_leaf.ext4_ext_remove_space
      6.93 ± 82%     -65.5%       2.39 ± 49%  perf-sched.sch_delay.max.ms.__cond_resched.ext4_mb_regular_allocator.ext4_mb_new_blocks.ext4_ext_map_blocks.ext4_map_create_blocks
      0.23 ± 68%    +357.9%       1.04 ± 82%  perf-sched.sch_delay.max.ms.__cond_resched.kmem_cache_alloc_noprof.ext4_mb_clear_bb.ext4_remove_blocks.ext4_ext_rm_leaf
      0.26 ± 39%    +205.8%       0.78 ± 73%  perf-sched.sch_delay.max.ms.__cond_resched.mutex_lock.ext4_mb_initialize_context.ext4_mb_new_blocks.ext4_ext_map_blocks
      0.11 ± 93%   +1390.4%       1.60 ± 62%  perf-sched.sch_delay.max.ms.io_schedule.bit_wait_io.__wait_on_bit_lock.out_of_line_wait_on_bit_lock
      0.30 ± 74%   +2467.2%       7.58 ± 60%  perf-sched.sch_delay.max.ms.io_schedule.folio_wait_bit_common.__find_get_block_slow.find_get_block_common
      2.66 ± 18%     +29.4%       3.44 ±  7%  perf-sched.sch_delay.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
      2.64 ± 21%    +197.3%       7.84 ± 53%  perf-sched.sch_delay.max.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
     87.11 ±  2%     -15.3%      73.79 ±  4%  perf-sched.total_wait_and_delay.average.ms
     21561 ±  2%     +18.5%      25553 ±  4%  perf-sched.total_wait_and_delay.count.ms
     86.95 ±  2%     -15.4%      73.60 ±  4%  perf-sched.total_wait_time.average.ms
      0.76 ± 54%    -100.0%       0.00        perf-sched.wait_and_delay.avg.ms.__cond_resched.bdev_getblk.ext4_read_block_bitmap_nowait.ext4_read_block_bitmap.ext4_mb_mark_context
      0.61 ± 47%    -100.0%       0.00        perf-sched.wait_and_delay.avg.ms.__cond_resched.ext4_mb_regular_allocator.ext4_mb_new_blocks.ext4_ext_map_blocks.ext4_map_create_blocks
    168.47 ±  2%     -10.4%     150.98 ±  4%  perf-sched.wait_and_delay.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
    125.33 ± 10%     +72.2%     215.83 ±  8%  perf-sched.wait_and_delay.count.__cond_resched.__ext4_handle_dirty_metadata.ext4_do_update_inode.isra.0
    781.33 ±  3%     -74.6%     198.83 ± 15%  perf-sched.wait_and_delay.count.__cond_resched.__ext4_handle_dirty_metadata.ext4_mb_mark_context.ext4_mb_mark_diskspace_used.ext4_mb_new_blocks
    278.67 ± 13%    +310.9%       1145 ± 20%  perf-sched.wait_and_delay.count.__cond_resched.__ext4_mark_inode_dirty.ext4_dirty_inode.__mark_inode_dirty.ext4_setattr
      1116 ±  3%     -81.5%     206.33 ± 13%  perf-sched.wait_and_delay.count.__cond_resched.__find_get_block_slow.find_get_block_common.bdev_getblk.ext4_read_block_bitmap_nowait
    166.33 ±  8%    -100.0%       0.00        perf-sched.wait_and_delay.count.__cond_resched.bdev_getblk.ext4_read_block_bitmap_nowait.ext4_read_block_bitmap.ext4_mb_mark_context
    115.50 ± 46%    +298.7%     460.50 ± 16%  perf-sched.wait_and_delay.count.__cond_resched.down_read.ext4_map_blocks.ext4_alloc_file_blocks.isra
    138.33 ± 16%    +290.7%     540.50 ± 18%  perf-sched.wait_and_delay.count.__cond_resched.down_write.ext4_setattr.notify_change.do_truncate
    310.17 ± 14%    +263.9%       1128 ± 21%  perf-sched.wait_and_delay.count.__cond_resched.down_write.ext4_truncate.ext4_setattr.notify_change
      1274 ±  2%    -100.0%       0.00        perf-sched.wait_and_delay.count.__cond_resched.ext4_mb_regular_allocator.ext4_mb_new_blocks.ext4_ext_map_blocks.ext4_map_create_blocks
      7148 ±  2%     +11.9%       7998 ±  4%  perf-sched.wait_and_delay.count.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
     32.82 ± 80%     -81.8%       5.99 ± 34%  perf-sched.wait_and_delay.max.ms.__cond_resched.__ext4_handle_dirty_metadata.ext4_mb_mark_context.ext4_mb_mark_diskspace_used.ext4_mb_new_blocks
     12.06 ± 22%    +168.4%      32.36 ± 47%  perf-sched.wait_and_delay.max.ms.__cond_resched.__ext4_mark_inode_dirty.ext4_dirty_inode.__mark_inode_dirty.ext4_setattr
     20.55 ± 82%    -100.0%       0.00        perf-sched.wait_and_delay.max.ms.__cond_resched.bdev_getblk.ext4_read_block_bitmap_nowait.ext4_read_block_bitmap.ext4_mb_mark_context
     27.66 ± 20%     +78.9%      49.49 ± 60%  perf-sched.wait_and_delay.max.ms.__cond_resched.ext4_journal_check_start.__ext4_journal_start_sb.ext4_dirty_inode.__mark_inode_dirty
     16.75 ± 64%    -100.0%       0.00        perf-sched.wait_and_delay.max.ms.__cond_resched.ext4_mb_regular_allocator.ext4_mb_new_blocks.ext4_ext_map_blocks.ext4_map_create_blocks
      0.19 ± 29%    +191.5%       0.55 ± 29%  perf-sched.wait_time.avg.ms.__cond_resched.__ext4_mark_inode_dirty.ext4_truncate.ext4_setattr.notify_change
      0.15 ± 24%     +98.1%       0.31 ± 21%  perf-sched.wait_time.avg.ms.__cond_resched.ext4_free_blocks.ext4_remove_blocks.ext4_ext_rm_leaf.ext4_ext_remove_space
    168.44 ±  2%     -10.4%     150.94 ±  4%  perf-sched.wait_time.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
      0.36 ± 40%    +392.9%       1.78 ± 71%  perf-sched.wait_time.max.ms.__cond_resched.__alloc_frozen_pages_noprof.alloc_pages_mpol.folio_alloc_noprof.__filemap_get_folio
     17.42 ± 70%     -82.4%       3.07 ± 34%  perf-sched.wait_time.max.ms.__cond_resched.__ext4_handle_dirty_metadata.ext4_mb_mark_context.ext4_mb_mark_diskspace_used.ext4_mb_new_blocks
     11.49 ± 26%    +180.6%      32.23 ± 48%  perf-sched.wait_time.max.ms.__cond_resched.__ext4_mark_inode_dirty.ext4_dirty_inode.__mark_inode_dirty.ext4_setattr
      0.06 ±223%   +1443.8%       0.86 ± 97%  perf-sched.wait_time.max.ms.__cond_resched.__ext4_mark_inode_dirty.ext4_dirty_inode.__mark_inode_dirty.generic_update_time
      0.47 ± 33%    +411.8%       2.40 ± 56%  perf-sched.wait_time.max.ms.__cond_resched.__ext4_mark_inode_dirty.ext4_ext_insert_extent.ext4_ext_map_blocks.ext4_map_create_blocks
      0.64 ±161%    +244.6%       2.20 ± 61%  perf-sched.wait_time.max.ms.__cond_resched.__ext4_mark_inode_dirty.ext4_setattr.notify_change.do_truncate
      0.47 ± 64%    +968.9%       5.01 ± 83%  perf-sched.wait_time.max.ms.__cond_resched.__ext4_mark_inode_dirty.ext4_truncate.ext4_setattr.notify_change
      0.08 ±138%    +382.6%       0.37 ± 24%  perf-sched.wait_time.max.ms.__cond_resched.down_write.do_truncate.do_ftruncate.do_sys_ftruncate
      0.41 ± 26%    +169.4%       1.11 ± 78%  perf-sched.wait_time.max.ms.__cond_resched.ext4_free_blocks.ext4_remove_blocks.ext4_ext_rm_leaf.ext4_ext_remove_space
     17.67 ± 25%    +110.8%      37.26 ± 35%  perf-sched.wait_time.max.ms.__cond_resched.ext4_journal_check_start.__ext4_journal_start_sb.ext4_dirty_inode.__mark_inode_dirty
      2.23 ± 51%    +360.3%      10.28 ± 71%  perf-sched.wait_time.max.ms.__cond_resched.ext4_journal_check_start.__ext4_journal_start_sb.ext4_ext_remove_space.ext4_ext_truncate
     84.33 ± 14%     -46.9%      44.77 ± 72%  perf-sched.wait_time.max.ms.__cond_resched.ext4_mb_load_buddy_gfp.ext4_process_freed_data.ext4_journal_commit_callback.jbd2_journal_commit_transaction
      0.23 ± 68%    +357.9%       1.04 ± 82%  perf-sched.wait_time.max.ms.__cond_resched.kmem_cache_alloc_noprof.ext4_mb_clear_bb.ext4_remove_blocks.ext4_ext_rm_leaf
      0.26 ± 39%    +205.8%       0.78 ± 73%  perf-sched.wait_time.max.ms.__cond_resched.mutex_lock.ext4_mb_initialize_context.ext4_mb_new_blocks.ext4_ext_map_blocks
    276.82 ± 13%     -22.2%     215.50 ± 13%  perf-sched.wait_time.max.ms.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
      0.30 ± 74%   +9637.4%      28.76 ± 48%  perf-sched.wait_time.max.ms.io_schedule.folio_wait_bit_common.__find_get_block_slow.find_get_block_common
      1.44 ± 79%  +11858.3%     172.80 ±219%  perf-sched.wait_time.max.ms.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt.[unknown]




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 04/16] ext4: utilize multiple global goals to reduce contention
  2025-06-30 17:41           ` Jan Kara
@ 2025-07-01  3:32             ` Baokun Li
  2025-07-01 11:53               ` Jan Kara
  0 siblings, 1 reply; 51+ messages in thread
From: Baokun Li @ 2025-07-01  3:32 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, tytso, adilger.kernel, ojaswin, linux-kernel,
	yi.zhang, yangerkun, Baokun Li

On 2025/7/1 1:41, Jan Kara wrote:
> On Mon 30-06-25 18:02:49, Baokun Li wrote:
>> On 2025/6/30 16:38, Jan Kara wrote:
>>> We could make streaming goal to be ext4_fsblk_t so that also offset of the
>>> last big allocation in the group is recorded as I wrote above. That would
>>> tend to pack big allocations in each group together which is benefitial to
>>> combat fragmentation even with higher proportion of groups that are streaming
>>> goals (and likely becomes more important as the blocksize and thus group
>>> size grow). We can discuss proper number of slots for streaming allocation
>>> (I'm not hung up on it being quarter of the group count) but I'm convinced
>>> sb->s_groups_count is too much :)
>>>
>>> 								Honza
>> I think sbi->s_groups_count / 4 is indeed acceptable. However, I don't
>> believe recording offsets is necessary. As groups become larger,
>> contention for groups will intensify, and adding offsets would only
>> make this contention worse.
> I agree the contention for groups will increase when the group count goes
> down. I just thought offsets may help to find free space faster in large
> groups (and thus reduce contention) and also reduce free space
> fragmentation within a group (by having higher chances of placing large
> allocations close together within a group) but maybe that's not the case.
> Offsets are definitely not requirement at this point.
>
> 								Honza
>
Thinking this over, with LBS support coming, if our block size jumps from
4KB to 64KB, the maximum group size will dramatically increase from 128MB
to 32GB (even with the current 4GB group limit). If free space within a
group gets heavily fragmented, iterating through that single group could
become quite time-consuming.
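
(For reference, the arithmetic behind those numbers, assuming the block
bitmap occupies exactly one block, so a group spans 8 * blocksize blocks:

   group size = 8 * blocksize * blocksize bytes
              = 8 *  4096 *  4096 = 128 MiB  with  4 KiB blocks
              = 8 * 65536 * 65536 =  32 GiB  with 64 KiB blocks)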

Your idea of recording offsets to avoid redundantly scanning
already-checked extents within a group definitely makes sense. But,
borrowing from the idea of optimizing the linear traversal of groups, I
think it might be better to record the offset of the first free extent of
each order, in the same way that bb_counters records the number of free
extents of each order (roughly like the sketch below).
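
A rough sketch of what I mean (purely illustrative; the struct, field and
helper names below are hypothetical and not from any posted patch):

  /*
   * Per-group, per-order bookkeeping in the spirit of bb_counters[]:
   * besides counting the free extents of each order, also remember where
   * the first free extent of that order starts, so a scan can jump there
   * instead of re-walking already-checked extents.
   */
  #define ORDERS 16                          /* hypothetical number of buddy orders */

  struct group_order_hints {
          unsigned int count[ORDERS];        /* free extents of each order (like bb_counters) */
          unsigned int first_off[ORDERS];    /* start of the first free extent of each order */
  };

  /* Record that an order-'o' free extent now starts at offset 'off'. */
  static void note_free_extent(struct group_order_hints *h, int o, unsigned int off)
  {
          h->count[o]++;
          if (h->count[o] == 1 || off < h->first_off[o])
                  h->first_off[o] = off;
  }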


Cheers,
Baokun


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 04/16] ext4: utilize multiple global goals to reduce contention
  2025-07-01  3:32             ` Baokun Li
@ 2025-07-01 11:53               ` Jan Kara
  2025-07-01 12:12                 ` Baokun Li
  0 siblings, 1 reply; 51+ messages in thread
From: Jan Kara @ 2025-07-01 11:53 UTC (permalink / raw)
  To: Baokun Li
  Cc: Jan Kara, linux-ext4, tytso, adilger.kernel, ojaswin,
	linux-kernel, yi.zhang, yangerkun

On Tue 01-07-25 11:32:23, Baokun Li wrote:
> On 2025/7/1 1:41, Jan Kara wrote:
> > On Mon 30-06-25 18:02:49, Baokun Li wrote:
> > > On 2025/6/30 16:38, Jan Kara wrote:
> > > > We could make streaming goal to be ext4_fsblk_t so that also offset of the
> > > > last big allocation in the group is recorded as I wrote above. That would
> > > > tend to pack big allocations in each group together which is benefitial to
> > > > combat fragmentation even with higher proportion of groups that are streaming
> > > > goals (and likely becomes more important as the blocksize and thus group
> > > > size grow). We can discuss proper number of slots for streaming allocation
> > > > (I'm not hung up on it being quarter of the group count) but I'm convinced
> > > > sb->s_groups_count is too much :)
> > > > 
> > > > 								Honza
> > > I think sbi->s_groups_count / 4 is indeed acceptable. However, I don't
> > > believe recording offsets is necessary. As groups become larger,
> > > contention for groups will intensify, and adding offsets would only
> > > make this contention worse.
> > I agree the contention for groups will increase when the group count goes
> > down. I just thought offsets may help to find free space faster in large
> > groups (and thus reduce contention) and also reduce free space
> > fragmentation within a group (by having higher chances of placing large
> > allocations close together within a group) but maybe that's not the case.
> > Offsets are definitely not requirement at this point.
> > 
> > 								Honza
> > 
> Thinking this over, with LBS support coming, if our block size jumps from
> 4KB to 64KB, the maximum group size will dramatically increase from 128MB
> to 32GB (even with the current 4GB group limit). If free space within a
> group gets heavily fragmented, iterating through that single group could
> become quite time-consuming.
> 
> Your idea of recording offsets to prevent redundant scanning of
> already-checked extents within a group definitely makes sense. But with
> reference to the idea of optimizing linear traversal of groups, I think it
> might be better to record the offset of the first occurrence of each order
> in the same way that bb_counters records the number of each order.

Yes, something like that makes sense. But I guess that's material for the
next patch set :)

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 04/16] ext4: utilize multiple global goals to reduce contention
  2025-07-01 11:53               ` Jan Kara
@ 2025-07-01 12:12                 ` Baokun Li
  0 siblings, 0 replies; 51+ messages in thread
From: Baokun Li @ 2025-07-01 12:12 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, tytso, adilger.kernel, ojaswin, linux-kernel,
	yi.zhang, yangerkun, Baokun Li

On 2025/7/1 19:53, Jan Kara wrote:
> On Tue 01-07-25 11:32:23, Baokun Li wrote:
>> On 2025/7/1 1:41, Jan Kara wrote:
>>> On Mon 30-06-25 18:02:49, Baokun Li wrote:
>>>> On 2025/6/30 16:38, Jan Kara wrote:
>>>>> We could make streaming goal to be ext4_fsblk_t so that also offset of the
>>>>> last big allocation in the group is recorded as I wrote above. That would
>>>>> tend to pack big allocations in each group together which is benefitial to
>>>>> combat fragmentation even with higher proportion of groups that are streaming
>>>>> goals (and likely becomes more important as the blocksize and thus group
>>>>> size grow). We can discuss proper number of slots for streaming allocation
>>>>> (I'm not hung up on it being quarter of the group count) but I'm convinced
>>>>> sb->s_groups_count is too much :)
>>>>>
>>>>> 								Honza
>>>> I think sbi->s_groups_count / 4 is indeed acceptable. However, I don't
>>>> believe recording offsets is necessary. As groups become larger,
>>>> contention for groups will intensify, and adding offsets would only
>>>> make this contention worse.
>>> I agree the contention for groups will increase when the group count goes
>>> down. I just thought offsets may help to find free space faster in large
>>> groups (and thus reduce contention) and also reduce free space
>>> fragmentation within a group (by having higher chances of placing large
>>> allocations close together within a group) but maybe that's not the case.
>>> Offsets are definitely not requirement at this point.
>>>
>>> 								Honza
>>>
>> Thinking this over, with LBS support coming, if our block size jumps from
>> 4KB to 64KB, the maximum group size will dramatically increase from 128MB
>> to 32GB (even with the current 4GB group limit). If free space within a
>> group gets heavily fragmented, iterating through that single group could
>> become quite time-consuming.
>>
>> Your idea of recording offsets to prevent redundant scanning of
>> already-checked extents within a group definitely makes sense. But with
>> reference to the idea of optimizing linear traversal of groups, I think it
>> might be better to record the offset of the first occurrence of each order
>> in the same way that bb_counters records the number of each order.
> Yes, something like that makes sense. But I guess that's a material for the
> next patch set :)
>
> 								Honza

Yes, this isn't urgent right now. I plan to implement this idea after
the LBS patch set is complete.

Thank you very much for your review and patient explanations! 😀


Regards,
Baokun


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group
  2025-07-01  2:39             ` Baokun Li
@ 2025-07-01 12:21               ` Jan Kara
  2025-07-01 13:17                 ` Baokun Li
  2025-07-08 13:08                 ` Baokun Li
  0 siblings, 2 replies; 51+ messages in thread
From: Jan Kara @ 2025-07-01 12:21 UTC (permalink / raw)
  To: Baokun Li
  Cc: Jan Kara, linux-ext4, tytso, adilger.kernel, ojaswin,
	linux-kernel, yi.zhang, yangerkun

On Tue 01-07-25 10:39:53, Baokun Li wrote:
> On 2025/7/1 0:32, Jan Kara wrote:
> > On Mon 30-06-25 17:21:48, Baokun Li wrote:
> > > On 2025/6/30 15:47, Jan Kara wrote:
> > > > On Mon 30-06-25 11:48:20, Baokun Li wrote:
> > > > > On 2025/6/28 2:19, Jan Kara wrote:
> > > > > > On Mon 23-06-25 15:32:51, Baokun Li wrote:
> > > > > > > After we optimized the block group lock, we found another lock
> > > > > > > contention issue when running will-it-scale/fallocate2 with multiple
> > > > > > > processes. The fallocate's block allocation and the truncate's block
> > > > > > > release were fighting over the s_md_lock. The problem is, this lock
> > > > > > > protects totally different things in those two processes: the list of
> > > > > > > freed data blocks (s_freed_data_list) when releasing, and where to start
> > > > > > > looking for new blocks (mb_last_group) when allocating.
> > > > > > > 
> > > > > > > Now we only need to track s_mb_last_group and no longer need to track
> > > > > > > s_mb_last_start, so we don't need the s_md_lock lock to ensure that the
> > > > > > > two are consistent, and we can ensure that the s_mb_last_group read is up
> > > > > > > to date by using smp_store_release/smp_load_acquire.
> > > > > > > 
> > > > > > > Besides, the s_mb_last_group data type only requires ext4_group_t
> > > > > > > (i.e., unsigned int), rendering unsigned long superfluous.
> > > > > > > 
> > > > > > > Performance test data follows:
> > > > > > > 
> > > > > > > Test: Running will-it-scale/fallocate2 on CPU-bound containers.
> > > > > > > Observation: Average fallocate operations per container per second.
> > > > > > > 
> > > > > > >                       | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
> > > > > > >     Disk: 960GB SSD   |-------------------------|-------------------------|
> > > > > > >                       | base  |    patched      | base  |    patched      |
> > > > > > > -------------------|-------|-----------------|-------|-----------------|
> > > > > > > mb_optimize_scan=0 | 4821  | 7612  (+57.8%)  | 15371 | 21647 (+40.8%)  |
> > > > > > > mb_optimize_scan=1 | 4784  | 7568  (+58.1%)  | 6101  | 9117  (+49.4%)  |
> > > > > > > 
> > > > > > > Signed-off-by: Baokun Li <libaokun1@huawei.com>
> > > > > > ...
> > > > > > 
> > > > > > > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > > > > > > index 5cdae3bda072..3f103919868b 100644
> > > > > > > --- a/fs/ext4/mballoc.c
> > > > > > > +++ b/fs/ext4/mballoc.c
> > > > > > > @@ -2168,11 +2168,9 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
> > > > > > >     	ac->ac_buddy_folio = e4b->bd_buddy_folio;
> > > > > > >     	folio_get(ac->ac_buddy_folio);
> > > > > > >     	/* store last allocated for subsequent stream allocation */
> > > > > > > -	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
> > > > > > > -		spin_lock(&sbi->s_md_lock);
> > > > > > > -		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
> > > > > > > -		spin_unlock(&sbi->s_md_lock);
> > > > > > > -	}
> > > > > > > +	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
> > > > > > > +		/* pairs with smp_load_acquire in ext4_mb_regular_allocator() */
> > > > > > > +		smp_store_release(&sbi->s_mb_last_group, ac->ac_f_ex.fe_group);
> > > > > > Do you really need any kind of barrier (implied by smp_store_release())
> > > > > > here? I mean the store to s_mb_last_group is perfectly fine to be reordered
> > > > > > with other accesses from the thread, isn't it? As such it should be enough
> > > > > > to have WRITE_ONCE() here...
> > > > > WRITE_ONCE()/READ_ONCE() primarily prevent compiler reordering and ensure
> > > > > that variable reads/writes access values directly from L1/L2 cache rather
> > > > > than registers.
> > > > I agree READ_ONCE() / WRITE_ONCE() are about compiler optimizations - in
> > > > particular they force the compiler to read / write the memory location
> > > > exactly once instead of reading it potentially multiple times in different
> > > > parts of expression and getting inconsistent values, or possibly writing
> > > > the value say byte by byte (yes, that would be insane but not contrary to
> > > > the C standard).
> > > READ_ONCE() and WRITE_ONCE() rely on the volatile keyword, which serves
> > > two main purposes:
> > > 
> > > 1. It tells the compiler that the variable's value can change unexpectedly,
> > >     preventing the compiler from making incorrect optimizations based on
> > >     assumptions about its stability.
> > > 
> > > 2. It ensures the CPU directly reads from or writes to the variable's
> > >     memory address. This means the value will be fetched from cache (L1/L2)
> > >     if available, or from main memory otherwise, rather than using a stale
> > >     value from a CPU register.
> > Yes, we agree on this.
> > 
> > > > > They do not guarantee that other CPUs see the latest values. Reading stale
> > > > > values could lead to more useless traversals, which might incur higher
> > > > > overhead than memory barriers. This is why we use memory barriers to ensure
> > > > > the latest values are read.
> > > > But smp_load_acquire() / smp_store_release() have no guarantee about CPU
> > > > seeing latest values either. They are just speculation barriers meaning
> > > > they prevent the CPU from reordering accesses in the code after
> > > > smp_load_acquire() to be performed before the smp_load_acquire() is
> > > > executed and similarly with smp_store_release(). So I dare to say that
> > > > these barries have no (positive) impact on the allocation performance and
> > > > just complicate the code - but if you have some data that show otherwise,
> > > > I'd be happy to be proven wrong.
> > > smp_load_acquire() / smp_store_release() guarantee that CPUs read the
> > > latest data.
> > > 
> > > For example, imagine a variable a = 0, with both CPU0 and CPU1 having
> > > a=0 in their caches.
> > > 
> > > Without a memory barrier:
> > > When CPU0 executes WRITE_ONCE(a, 1), a=1 is written to the store buffer,
> > > an RFO is broadcast, and CPU0 continues other tasks. After receiving ACKs,
> > > a=1 is written to main memory and becomes visible to other CPUs.
> > > Then, if CPU1 executes READ_ONCE(a), it receives the RFO and adds it to
> > > its invalidation queue. However, it might not process it immediately;
> > > instead, it could perform the read first, potentially still reading a=0
> > > from its cache.
> > > 
> > > With a memory barrier:
> > > When CPU0 executes smp_store_release(&a, 1), a=1 is not only written to
> > > the store buffer, but data in the store buffer is also written to main
> > > memory. An RFO is then broadcast, and CPU0 waits for ACKs from all CPUs.
> > > 
> > > When CPU1 executes smp_load_acquire(a), it receives the RFO and adds it
> > > to its invalidation queue. Here, the invalidation queue is flushed, which
> > > invalidates a in CPU1's cache. CPU1 then replies with an ACK, and when it
> > > performs the read, its cache is invalid, so it reads the latest a=1 from
> > > main memory.
> > Well, here I think you assume way more about the CPU architecture than is
> > generally true (and I didn't find what you write above guaranteed neither
> > by x86 nor by arm64 CPU documentation). Generally I'm following the
> > guarantees as defined by Documentation/memory-barriers.txt and there you
> > can argue only about order of effects as observed by different CPUs but not
> > really about when content is fetched to / from CPU caches.
> 
> Explaining why smp_load_acquire() and smp_store_release() guarantee the
> latest data is read truly requires delving into their underlying
> implementation details.
> 
> I suggest you Google "why memory barriers are needed." You might find
> introductions to concepts like 'Total Store Order', 'Weak Memory Ordering',
> MESI, store buffers, and invalidate queue, along with the stories behind
> them.

Yes, I know these things. Not that I'd really be an expert in them but I'd
call myself familiar enough :). But that is kind of beside the point here.
What I want to point out is that if you have code like:

  some access A
  grp = smp_load_acquire(&sbi->s_mb_last_group)
  some more accesses

then the CPU is fully within its rights to execute them as:

  grp = smp_load_acquire(&sbi->s_mb_last_group)
  some access A
  some more accesses

Now your *particular implementation* of the ARM64 CPU model may never do
that, just as no current x86 CPU does, but some other CPU implementation
may (e.g. an Alpha CPU probably would, as irrelevant as that is these
days :). So using smp_load_acquire() is at best a heuristic that may
happen to give some CPU models a fresher value, but it isn't guaranteed to
help on all architectures and all CPU models Linux supports.

So can you do me a favor please and do a performance comparison of using
READ_ONCE / WRITE_ONCE vs using smp_load_acquire / smp_store_release on
your Arm64 server for streaming goal management? If smp_load_acquire /
smp_store_release indeed bring any performance benefit for your servers, we
can just stick a comment there explaining why they are used. If they bring
no measurable benefit I'd put READ_ONCE / WRITE_ONCE there for code
simplicity. Do you agree?
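
For concreteness, the variant I'm asking to be compared would look roughly
like this (a sketch only, with the surrounding code elided; the
acquire/release version is the one in the hunk quoted above):

  /* writer side, in ext4_mb_use_best_found() */
  if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
          WRITE_ONCE(sbi->s_mb_last_group, ac->ac_f_ex.fe_group);

  /* reader side, in ext4_mb_regular_allocator() */
  ac->ac_g_ex.fe_group = READ_ONCE(sbi->s_mb_last_group);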

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group
  2025-07-01 12:21               ` Jan Kara
@ 2025-07-01 13:17                 ` Baokun Li
  2025-07-08 13:08                 ` Baokun Li
  1 sibling, 0 replies; 51+ messages in thread
From: Baokun Li @ 2025-07-01 13:17 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, tytso, adilger.kernel, ojaswin, linux-kernel,
	yi.zhang, yangerkun, Baokun Li

On 2025/7/1 20:21, Jan Kara wrote:
> On Tue 01-07-25 10:39:53, Baokun Li wrote:
>> On 2025/7/1 0:32, Jan Kara wrote:
>>> On Mon 30-06-25 17:21:48, Baokun Li wrote:
>>>> On 2025/6/30 15:47, Jan Kara wrote:
>>>>> On Mon 30-06-25 11:48:20, Baokun Li wrote:
>>>>>> On 2025/6/28 2:19, Jan Kara wrote:
>>>>>>> On Mon 23-06-25 15:32:51, Baokun Li wrote:
>>>>>>>> After we optimized the block group lock, we found another lock
>>>>>>>> contention issue when running will-it-scale/fallocate2 with multiple
>>>>>>>> processes. The fallocate's block allocation and the truncate's block
>>>>>>>> release were fighting over the s_md_lock. The problem is, this lock
>>>>>>>> protects totally different things in those two processes: the list of
>>>>>>>> freed data blocks (s_freed_data_list) when releasing, and where to start
>>>>>>>> looking for new blocks (mb_last_group) when allocating.
>>>>>>>>
>>>>>>>> Now we only need to track s_mb_last_group and no longer need to track
>>>>>>>> s_mb_last_start, so we don't need the s_md_lock lock to ensure that the
>>>>>>>> two are consistent, and we can ensure that the s_mb_last_group read is up
>>>>>>>> to date by using smp_store_release/smp_load_acquire.
>>>>>>>>
>>>>>>>> Besides, the s_mb_last_group data type only requires ext4_group_t
>>>>>>>> (i.e., unsigned int), rendering unsigned long superfluous.
>>>>>>>>
>>>>>>>> Performance test data follows:
>>>>>>>>
>>>>>>>> Test: Running will-it-scale/fallocate2 on CPU-bound containers.
>>>>>>>> Observation: Average fallocate operations per container per second.
>>>>>>>>
>>>>>>>>                        | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
>>>>>>>>      Disk: 960GB SSD   |-------------------------|-------------------------|
>>>>>>>>                        | base  |    patched      | base  |    patched      |
>>>>>>>> -------------------|-------|-----------------|-------|-----------------|
>>>>>>>> mb_optimize_scan=0 | 4821  | 7612  (+57.8%)  | 15371 | 21647 (+40.8%)  |
>>>>>>>> mb_optimize_scan=1 | 4784  | 7568  (+58.1%)  | 6101  | 9117  (+49.4%)  |
>>>>>>>>
>>>>>>>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
>>>>>>> ...
>>>>>>>
>>>>>>>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>>>>>>>> index 5cdae3bda072..3f103919868b 100644
>>>>>>>> --- a/fs/ext4/mballoc.c
>>>>>>>> +++ b/fs/ext4/mballoc.c
>>>>>>>> @@ -2168,11 +2168,9 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
>>>>>>>>      	ac->ac_buddy_folio = e4b->bd_buddy_folio;
>>>>>>>>      	folio_get(ac->ac_buddy_folio);
>>>>>>>>      	/* store last allocated for subsequent stream allocation */
>>>>>>>> -	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
>>>>>>>> -		spin_lock(&sbi->s_md_lock);
>>>>>>>> -		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
>>>>>>>> -		spin_unlock(&sbi->s_md_lock);
>>>>>>>> -	}
>>>>>>>> +	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
>>>>>>>> +		/* pairs with smp_load_acquire in ext4_mb_regular_allocator() */
>>>>>>>> +		smp_store_release(&sbi->s_mb_last_group, ac->ac_f_ex.fe_group);
>>>>>>> Do you really need any kind of barrier (implied by smp_store_release())
>>>>>>> here? I mean the store to s_mb_last_group is perfectly fine to be reordered
>>>>>>> with other accesses from the thread, isn't it? As such it should be enough
>>>>>>> to have WRITE_ONCE() here...
>>>>>> WRITE_ONCE()/READ_ONCE() primarily prevent compiler reordering and ensure
>>>>>> that variable reads/writes access values directly from L1/L2 cache rather
>>>>>> than registers.
>>>>> I agree READ_ONCE() / WRITE_ONCE() are about compiler optimizations - in
>>>>> particular they force the compiler to read / write the memory location
>>>>> exactly once instead of reading it potentially multiple times in different
>>>>> parts of expression and getting inconsistent values, or possibly writing
>>>>> the value say byte by byte (yes, that would be insane but not contrary to
>>>>> the C standard).
>>>> READ_ONCE() and WRITE_ONCE() rely on the volatile keyword, which serves
>>>> two main purposes:
>>>>
>>>> 1. It tells the compiler that the variable's value can change unexpectedly,
>>>>      preventing the compiler from making incorrect optimizations based on
>>>>      assumptions about its stability.
>>>>
>>>> 2. It ensures the CPU directly reads from or writes to the variable's
>>>>      memory address. This means the value will be fetched from cache (L1/L2)
>>>>      if available, or from main memory otherwise, rather than using a stale
>>>>      value from a CPU register.
>>> Yes, we agree on this.
>>>
>>>>>> They do not guarantee that other CPUs see the latest values. Reading stale
>>>>>> values could lead to more useless traversals, which might incur higher
>>>>>> overhead than memory barriers. This is why we use memory barriers to ensure
>>>>>> the latest values are read.
>>>>> But smp_load_acquire() / smp_store_release() have no guarantee about CPU
>>>>> seeing latest values either. They are just speculation barriers meaning
>>>>> they prevent the CPU from reordering accesses in the code after
>>>>> smp_load_acquire() to be performed before the smp_load_acquire() is
>>>>> executed and similarly with smp_store_release(). So I dare to say that
>>>>> these barries have no (positive) impact on the allocation performance and
>>>>> just complicate the code - but if you have some data that show otherwise,
>>>>> I'd be happy to be proven wrong.
>>>> smp_load_acquire() / smp_store_release() guarantee that CPUs read the
>>>> latest data.
>>>>
>>>> For example, imagine a variable a = 0, with both CPU0 and CPU1 having
>>>> a=0 in their caches.
>>>>
>>>> Without a memory barrier:
>>>> When CPU0 executes WRITE_ONCE(a, 1), a=1 is written to the store buffer,
>>>> an RFO is broadcast, and CPU0 continues other tasks. After receiving ACKs,
>>>> a=1 is written to main memory and becomes visible to other CPUs.
>>>> Then, if CPU1 executes READ_ONCE(a), it receives the RFO and adds it to
>>>> its invalidation queue. However, it might not process it immediately;
>>>> instead, it could perform the read first, potentially still reading a=0
>>>> from its cache.
>>>>
>>>> With a memory barrier:
>>>> When CPU0 executes smp_store_release(&a, 1), a=1 is not only written to
>>>> the store buffer, but data in the store buffer is also written to main
>>>> memory. An RFO is then broadcast, and CPU0 waits for ACKs from all CPUs.
>>>>
>>>> When CPU1 executes smp_load_acquire(a), it receives the RFO and adds it
>>>> to its invalidation queue. Here, the invalidation queue is flushed, which
>>>> invalidates a in CPU1's cache. CPU1 then replies with an ACK, and when it
>>>> performs the read, its cache is invalid, so it reads the latest a=1 from
>>>> main memory.
>>> Well, here I think you assume way more about the CPU architecture than is
>>> generally true (and I didn't find what you write above guaranteed neither
>>> by x86 nor by arm64 CPU documentation). Generally I'm following the
>>> guarantees as defined by Documentation/memory-barriers.txt and there you
>>> can argue only about order of effects as observed by different CPUs but not
>>> really about when content is fetched to / from CPU caches.
>> Explaining why smp_load_acquire() and smp_store_release() guarantee the
>> latest data is read truly requires delving into their underlying
>> implementation details.
>>
>> I suggest you Google "why memory barriers are needed." You might find
>> introductions to concepts like 'Total Store Order', 'Weak Memory Ordering',
>> MESI, store buffers, and invalidate queue, along with the stories behind
>> them.
> Yes, I know these things. Not that I'd be really an expert in them but I'd
> call myself familiar enough :). But that is kind of besides the point here.
> What I want to point out it that if you have code like:
>
>    some access A
>    grp = smp_load_acquire(&sbi->s_mb_last_group)
>    some more accesses
>
> then the CPU is fully within it's right to execute them as:
>
>    grp = smp_load_acquire(&sbi->s_mb_last_group)
>    some access A
>    some more accesses
>
> Now your *particular implementation* of the ARM64 CPU model may never do
> that similarly as no x86 CPU currently does it but some other CPU
> implementation may (e.g. Alpha CPU probably would, as much as that's
> irrevelent these days :). So using smp_load_acquire() is at best a
> heuristics that may happen to help using more fresh value for some CPU
> models but it isn't guaranteed to help for all architectures and all CPU
> models Linux supports.
Yes, it's true that the underlying implementation of
smp_load_acquire() can differ somewhat across various
processor architectures.
>
> So can you do me a favor please and do a performance comparison of using
> READ_ONCE / WRITE_ONCE vs using smp_load_acquire / smp_store_release on
> your Arm64 server for streaming goal management? If smp_load_acquire /
> smp_store_release indeed bring any performance benefit for your servers, we
> can just stick a comment there explaining why they are used. If they bring
> no measurable benefit I'd put READ_ONCE / WRITE_ONCE there for code
> simplicity. Do you agree?
>
> 								Honza

Okay, no problem. I'll get an ARM server from the resource pool to test
the difference between the two. If there's no difference, replacing them
with READ_ONCE/WRITE_ONCE would be acceptable.


Cheers,
Baokun


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group
  2025-07-01 12:21               ` Jan Kara
  2025-07-01 13:17                 ` Baokun Li
@ 2025-07-08 13:08                 ` Baokun Li
  2025-07-10 14:38                   ` Jan Kara
  1 sibling, 1 reply; 51+ messages in thread
From: Baokun Li @ 2025-07-08 13:08 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, tytso, adilger.kernel, ojaswin, linux-kernel,
	yi.zhang, yangerkun, Baokun Li

On 2025/7/1 20:21, Jan Kara wrote:
> On Tue 01-07-25 10:39:53, Baokun Li wrote:
>> On 2025/7/1 0:32, Jan Kara wrote:
>>> On Mon 30-06-25 17:21:48, Baokun Li wrote:
>>>> On 2025/6/30 15:47, Jan Kara wrote:
>>>>> On Mon 30-06-25 11:48:20, Baokun Li wrote:
>>>>>> On 2025/6/28 2:19, Jan Kara wrote:
>>>>>>> On Mon 23-06-25 15:32:51, Baokun Li wrote:
>>>>>>>> After we optimized the block group lock, we found another lock
>>>>>>>> contention issue when running will-it-scale/fallocate2 with multiple
>>>>>>>> processes. The fallocate's block allocation and the truncate's block
>>>>>>>> release were fighting over the s_md_lock. The problem is, this lock
>>>>>>>> protects totally different things in those two processes: the list of
>>>>>>>> freed data blocks (s_freed_data_list) when releasing, and where to start
>>>>>>>> looking for new blocks (mb_last_group) when allocating.
>>>>>>>>
>>>>>>>> Now we only need to track s_mb_last_group and no longer need to track
>>>>>>>> s_mb_last_start, so we don't need the s_md_lock lock to ensure that the
>>>>>>>> two are consistent, and we can ensure that the s_mb_last_group read is up
>>>>>>>> to date by using smp_store_release/smp_load_acquire.
>>>>>>>>
>>>>>>>> Besides, the s_mb_last_group data type only requires ext4_group_t
>>>>>>>> (i.e., unsigned int), rendering unsigned long superfluous.
>>>>>>>>
>>>>>>>> Performance test data follows:
>>>>>>>>
>>>>>>>> Test: Running will-it-scale/fallocate2 on CPU-bound containers.
>>>>>>>> Observation: Average fallocate operations per container per second.
>>>>>>>>
>>>>>>>>                        | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
>>>>>>>>      Disk: 960GB SSD   |-------------------------|-------------------------|
>>>>>>>>                        | base  |    patched      | base  |    patched      |
>>>>>>>> -------------------|-------|-----------------|-------|-----------------|
>>>>>>>> mb_optimize_scan=0 | 4821  | 7612  (+57.8%)  | 15371 | 21647 (+40.8%)  |
>>>>>>>> mb_optimize_scan=1 | 4784  | 7568  (+58.1%)  | 6101  | 9117  (+49.4%)  |
>>>>>>>>
>>>>>>>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
>>>>>>> ...
>>>>>>>
>>>>>>>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>>>>>>>> index 5cdae3bda072..3f103919868b 100644
>>>>>>>> --- a/fs/ext4/mballoc.c
>>>>>>>> +++ b/fs/ext4/mballoc.c
>>>>>>>> @@ -2168,11 +2168,9 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
>>>>>>>>      	ac->ac_buddy_folio = e4b->bd_buddy_folio;
>>>>>>>>      	folio_get(ac->ac_buddy_folio);
>>>>>>>>      	/* store last allocated for subsequent stream allocation */
>>>>>>>> -	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
>>>>>>>> -		spin_lock(&sbi->s_md_lock);
>>>>>>>> -		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
>>>>>>>> -		spin_unlock(&sbi->s_md_lock);
>>>>>>>> -	}
>>>>>>>> +	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
>>>>>>>> +		/* pairs with smp_load_acquire in ext4_mb_regular_allocator() */
>>>>>>>> +		smp_store_release(&sbi->s_mb_last_group, ac->ac_f_ex.fe_group);
>>>>>>> Do you really need any kind of barrier (implied by smp_store_release())
>>>>>>> here? I mean the store to s_mb_last_group is perfectly fine to be reordered
>>>>>>> with other accesses from the thread, isn't it? As such it should be enough
>>>>>>> to have WRITE_ONCE() here...
>>>>>> WRITE_ONCE()/READ_ONCE() primarily prevent compiler reordering and ensure
>>>>>> that variable reads/writes access values directly from L1/L2 cache rather
>>>>>> than registers.
>>>>> I agree READ_ONCE() / WRITE_ONCE() are about compiler optimizations - in
>>>>> particular they force the compiler to read / write the memory location
>>>>> exactly once instead of reading it potentially multiple times in different
>>>>> parts of expression and getting inconsistent values, or possibly writing
>>>>> the value say byte by byte (yes, that would be insane but not contrary to
>>>>> the C standard).
>>>> READ_ONCE() and WRITE_ONCE() rely on the volatile keyword, which serves
>>>> two main purposes:
>>>>
>>>> 1. It tells the compiler that the variable's value can change unexpectedly,
>>>>      preventing the compiler from making incorrect optimizations based on
>>>>      assumptions about its stability.
>>>>
>>>> 2. It ensures the CPU directly reads from or writes to the variable's
>>>>      memory address. This means the value will be fetched from cache (L1/L2)
>>>>      if available, or from main memory otherwise, rather than using a stale
>>>>      value from a CPU register.
>>> Yes, we agree on this.
>>>
>>>>>> They do not guarantee that other CPUs see the latest values. Reading stale
>>>>>> values could lead to more useless traversals, which might incur higher
>>>>>> overhead than memory barriers. This is why we use memory barriers to ensure
>>>>>> the latest values are read.
>>>>> But smp_load_acquire() / smp_store_release() have no guarantee about CPU
>>>>> seeing latest values either. They are just speculation barriers meaning
>>>>> they prevent the CPU from reordering accesses in the code after
>>>>> smp_load_acquire() to be performed before the smp_load_acquire() is
>>>>> executed and similarly with smp_store_release(). So I dare to say that
>>>>> these barries have no (positive) impact on the allocation performance and
>>>>> just complicate the code - but if you have some data that show otherwise,
>>>>> I'd be happy to be proven wrong.
>>>> smp_load_acquire() / smp_store_release() guarantee that CPUs read the
>>>> latest data.
>>>>
>>>> For example, imagine a variable a = 0, with both CPU0 and CPU1 having
>>>> a=0 in their caches.
>>>>
>>>> Without a memory barrier:
>>>> When CPU0 executes WRITE_ONCE(a, 1), a=1 is written to the store buffer,
>>>> an RFO is broadcast, and CPU0 continues other tasks. After receiving ACKs,
>>>> a=1 is written to main memory and becomes visible to other CPUs.
>>>> Then, if CPU1 executes READ_ONCE(a), it receives the RFO and adds it to
>>>> its invalidation queue. However, it might not process it immediately;
>>>> instead, it could perform the read first, potentially still reading a=0
>>>> from its cache.
>>>>
>>>> With a memory barrier:
>>>> When CPU0 executes smp_store_release(&a, 1), a=1 is not only written to
>>>> the store buffer, but data in the store buffer is also written to main
>>>> memory. An RFO is then broadcast, and CPU0 waits for ACKs from all CPUs.
>>>>
>>>> When CPU1 executes smp_load_acquire(a), it receives the RFO and adds it
>>>> to its invalidation queue. Here, the invalidation queue is flushed, which
>>>> invalidates a in CPU1's cache. CPU1 then replies with an ACK, and when it
>>>> performs the read, its cache is invalid, so it reads the latest a=1 from
>>>> main memory.
>>> Well, here I think you assume way more about the CPU architecture than is
>>> generally true (and I didn't find what you write above guaranteed neither
>>> by x86 nor by arm64 CPU documentation). Generally I'm following the
>>> guarantees as defined by Documentation/memory-barriers.txt and there you
>>> can argue only about order of effects as observed by different CPUs but not
>>> really about when content is fetched to / from CPU caches.
>> Explaining why smp_load_acquire() and smp_store_release() guarantee the
>> latest data is read truly requires delving into their underlying
>> implementation details.
>>
>> I suggest you Google "why memory barriers are needed." You might find
>> introductions to concepts like 'Total Store Order', 'Weak Memory Ordering',
>> MESI, store buffers, and invalidate queue, along with the stories behind
>> them.
> Yes, I know these things. Not that I'd be really an expert in them but I'd
> call myself familiar enough :). But that is kind of besides the point here.
> What I want to point out it that if you have code like:
>
>    some access A
>    grp = smp_load_acquire(&sbi->s_mb_last_group)
>    some more accesses
>
> then the CPU is fully within it's right to execute them as:
>
>    grp = smp_load_acquire(&sbi->s_mb_last_group)
>    some access A
>    some more accesses
>
> Now your *particular implementation* of the ARM64 CPU model may never do
> that similarly as no x86 CPU currently does it but some other CPU
> implementation may (e.g. Alpha CPU probably would, as much as that's
> irrevelent these days :). So using smp_load_acquire() is at best a
> heuristics that may happen to help using more fresh value for some CPU
> models but it isn't guaranteed to help for all architectures and all CPU
> models Linux supports.
>
> So can you do me a favor please and do a performance comparison of using
> READ_ONCE / WRITE_ONCE vs using smp_load_acquire / smp_store_release on
> your Arm64 server for streaming goal management? If smp_load_acquire /
> smp_store_release indeed bring any performance benefit for your servers, we
> can just stick a comment there explaining why they are used. If they bring
> no measurable benefit I'd put READ_ONCE / WRITE_ONCE there for code
> simplicity. Do you agree?
>
> 								Honza

Sorry for getting to this so late – I've been totally overloaded
with stuff recently.

Anyway, back to what we were discussing. I managed to test
the performance difference between READ_ONCE / WRITE_ONCE and
smp_load_acquire / smp_store_release on an ARM64 server.
Here's the results:

CPU: Kunpeng 920
Memory: 512GB
Disk: 960GB SSD (~500M/s)

         | mb_optimize_scan  |       0        |       1        |
         |-------------------|----------------|----------------|
         | Num. containers   |  P80  |   P1   |  P80  |   P1   |
--------|-------------------|-------|--------|-------|--------|
         | acquire/release   | 9899  | 290260 | 5005  | 307361 |
  single | [READ|WRITE]_ONCE | 9636  | 337597 | 4834  | 341440 |
  goal   |-------------------|-------|--------|-------|--------|
         |                   | -2.6% | +16.3% | -3.4% | +11.0% |
--------|-------------------|-------|--------|-------|--------|
         | acquire/release   | 19931 | 290348 | 7365  | 311717 |
  multi  | [READ|WRITE]_ONCE | 19628 | 320885 | 7129  | 321275 |
  goal   |-------------------|-------|--------|-------|--------|
         |                   | -1.5% | +10.5% | -3.2% | +3.0%  |

So, my tests show that READ_ONCE / WRITE_ONCE gives us better
single-threaded performance. That's because it skips the mandatory
CPU-to-CPU syncing. This also helps explain why x86 has double the
disk bandwidth (~1000MB/s) of Arm64, but surprisingly, single
containers run much worse on x86.

However, in multi-threaded scenarios, not consistently reading
the latest goal has these implications:

  * ext4_get_group_info() calls increase, as ext4_mb_good_group_nolock()
    is invoked more often on incorrect groups.

  * ext4_mb_load_buddy() calls increase due to repeated group accesses
    leading to more folio_mark_accessed calls.

  * ext4_mb_prefetch() calls increase with more frequent prefetch_grp
    access. (I suspect the current mb_prefetch mechanism has some inherent
    issues we could optimize later.)

At this point, I believe either approach is acceptable.

What are your thoughts?


Cheers,
Baokun


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group
  2025-07-08 13:08                 ` Baokun Li
@ 2025-07-10 14:38                   ` Jan Kara
  2025-07-14  3:01                     ` Theodore Ts'o
  0 siblings, 1 reply; 51+ messages in thread
From: Jan Kara @ 2025-07-10 14:38 UTC (permalink / raw)
  To: Baokun Li
  Cc: Jan Kara, linux-ext4, tytso, adilger.kernel, ojaswin,
	linux-kernel, yi.zhang, yangerkun

On Tue 08-07-25 21:08:00, Baokun Li wrote:
> Sorry for getting to this so late – I've been totally overloaded
> with stuff recently.
> 
> Anyway, back to what we were discussing. I managed to test
> the performance difference between READ_ONCE / WRITE_ONCE and
> smp_load_acquire / smp_store_release on an ARM64 server.
> Here's the results:
> 
> CPU: Kunpeng 920
> Memory: 512GB
> Disk: 960GB SSD (~500M/s)
> 
>         | mb_optimize_scan  |       0        |       1        |
>         |-------------------|----------------|----------------|
>         | Num. containers   |  P80  |   P1   |  P80  |   P1   |
> --------|-------------------|-------|--------|-------|--------|
>         | acquire/release   | 9899  | 290260 | 5005  | 307361 |
>  single | [READ|WRITE]_ONCE | 9636  | 337597 | 4834  | 341440 |
>  goal   |-------------------|-------|--------|-------|--------|
>         |                   | -2.6% | +16.3% | -3.4% | +11.0% |
> --------|-------------------|-------|--------|-------|--------|
>         | acquire/release   | 19931 | 290348 | 7365  | 311717 |
>  muti   | [READ|WRITE]_ONCE | 19628 | 320885 | 7129  | 321275 |
>  goal   |-------------------|-------|--------|-------|--------|
>         |                   | -1.5% | +10.5% | -3.2% | +3.0%  |
> 
> So, my tests show that READ_ONCE / WRITE_ONCE gives us better
> single-threaded performance. That's because it skips the mandatory
> CPU-to-CPU syncing. This also helps explain why x86 has double the
> disk bandwidth (~1000MB/s) of Arm64, but surprisingly, single
> containers run much worse on x86.

Interesting! Thanks for measuring the data!

> However, in multi-threaded scenarios, not consistently reading
> the latest goal has these implications:
> 
>  * ext4_get_group_info() calls increase, as ext4_mb_good_group_nolock()
>    is invoked more often on incorrect groups.
> 
>  * ext4_mb_load_buddy() calls increase due to repeated group accesses
>    leading to more folio_mark_accessed calls.
> 
>  * ext4_mb_prefetch() calls increase with more frequent prefetch_grp
>    access. (I suspect the current mb_prefetch mechanism has some inherent
>    issues we could optimize later.)
> 
> At this point, I believe either approach is acceptable.
> 
> What are your thoughts?

Yes, apparently both approaches have their pros and cons. I'm actually
surprised the impact of additional barriers on ARM is so big for the
single container case. A 10% gain for single container cases looks nice;
OTOH realistic workloads will have more containers, so maybe that's not
worth optimizing for. Ted, do you have any opinion?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group
  2025-07-10 14:38                   ` Jan Kara
@ 2025-07-14  3:01                     ` Theodore Ts'o
  2025-07-14  7:00                       ` Baokun Li
  0 siblings, 1 reply; 51+ messages in thread
From: Theodore Ts'o @ 2025-07-14  3:01 UTC (permalink / raw)
  To: Jan Kara
  Cc: Baokun Li, linux-ext4, adilger.kernel, ojaswin, linux-kernel,
	yi.zhang, yangerkun

On Thu, Jul 10, 2025 at 04:38:33PM +0200, Jan Kara wrote:
> 
> Yes, apparently both approaches have their pros and cons. I'm actually
> surprised the impact of additional barriers on ARM is so big for the
> single container case. 10% gain for single container cases look nice OTOH
> realistical workloads will have more container so maybe that's not worth
> optimizing for. Ted, do you have any opinion?

Let me try to summarize; regardless of whether we use
{READ,WRITE}_ONCE or smp_load_acquire / smp_store_release, both are
significantly better than using the spinlock.  The other thing about
the "single-threaded performance" is that the additional cost of the
CPU-to-CPU syncing is not free.  But that CPU synchronization cost only
applies when the single thread is bouncing between CPUs --- if
we had a single-threaded application which is pinned to a single CPU,
the cost of smp_load_acquire wouldn't be there since the cache line
wouldn't be bouncing back and forth.  Is that correct, or am I missing
something?

In any case, so long as the single-threaded performance doesn't
regress relative to the current spin_lock implementation, I'm inclined
to prefer the smp_load_acquire approach if it improves
multi-threaded allocation performance on ARM64.
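
As a concrete way to test that (an illustrative sketch, not part of any
existing harness; "taskset -c 0 <benchmark>" achieves the same without
writing any code):

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>
  #include <unistd.h>

  /* Pin ourselves (and whatever we exec) to CPU 0, then run the benchmark,
   * so the streaming-goal cache line never has to migrate between CPUs. */
  int main(int argc, char **argv)
  {
          cpu_set_t set;

          if (argc < 2) {
                  fprintf(stderr, "usage: %s <benchmark> [args...]\n", argv[0]);
                  return 1;
          }
          CPU_ZERO(&set);
          CPU_SET(0, &set);
          if (sched_setaffinity(0, sizeof(set), &set)) {
                  perror("sched_setaffinity");
                  return 1;
          }
          execvp(argv[1], &argv[1]);
          perror("execvp");
          return 1;
  }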

Cheers,

							- Ted

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 01/16] ext4: add ext4_try_lock_group() to skip busy groups
  2025-06-23  7:32 ` [PATCH v2 01/16] ext4: add ext4_try_lock_group() to skip busy groups Baokun Li
  2025-06-27 18:06   ` Jan Kara
@ 2025-07-14  6:53   ` Ojaswin Mujoo
  1 sibling, 0 replies; 51+ messages in thread
From: Ojaswin Mujoo @ 2025-07-14  6:53 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, tytso, jack, adilger.kernel, linux-kernel, yi.zhang,
	yangerkun

On Mon, Jun 23, 2025 at 03:32:49PM +0800, Baokun Li wrote:
> When ext4 allocates blocks, we used to just go through the block groups
> one by one to find a good one. But when there are tons of block groups
> (like hundreds of thousands or even millions) and not many have free space
> (meaning they're mostly full), it takes a really long time to check them
> all, and performance gets bad. So, we added the "mb_optimize_scan" mount
> option (which is on by default now). It keeps track of some group lists,
> so when we need a free block, we can just grab a likely group from the
> right list. This saves time and makes block allocation much faster.
> 
> But when multiple processes or containers are doing similar things, like
> constantly allocating 8k blocks, they all try to use the same block group
> in the same list. Even just two processes doing this can cut the IOPS in
> half. For example, one container might do 300,000 IOPS, but if you run two
> at the same time, the total is only 150,000.
> 
> Since we can already look at block groups in a non-linear way, the first
> and last groups in the same list are basically the same for finding a block
> right now. Therefore, add an ext4_try_lock_group() helper function to skip
> the current group when it is locked by another process, thereby avoiding
> contention with other processes. This helps ext4 make better use of having
> multiple block groups.
> 
> Also, to make sure we don't skip all the groups that have free space
> when allocating blocks, we won't try to skip busy groups anymore when
> ac_criteria is CR_ANY_FREE.
> 
> Performance test data follows:
> 
> Test: Running will-it-scale/fallocate2 on CPU-bound containers.
> Observation: Average fallocate operations per container per second.
> 
>                    | Kunpeng 920 / 512GB -P80| AMD 9654 / 1536GB -P96  |
>  Disk: 960GB SSD   |-------------------------|-------------------------|
>                    | base  |    patched      | base  |    patched      |
> -------------------|-------|-----------------|-------|-----------------|
> mb_optimize_scan=0 | 2667  | 4821  (+80.7%)  | 3450  | 15371 (+345%)   |
> mb_optimize_scan=1 | 2643  | 4784  (+81.0%)  | 3209  | 6101  (+90.0%)  |
> 
> Signed-off-by: Baokun Li <libaokun1@huawei.com>

Hey Baokun, sorry I'm a bit late to the review, I've been caught up with a
few things over the last couple of weeks.

The patch itself looks good, thanks for the changes.

Feel free to add:

 Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>


Regards,
ojaswin


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group
  2025-07-14  3:01                     ` Theodore Ts'o
@ 2025-07-14  7:00                       ` Baokun Li
  0 siblings, 0 replies; 51+ messages in thread
From: Baokun Li @ 2025-07-14  7:00 UTC (permalink / raw)
  To: Theodore Ts'o, Jan Kara
  Cc: linux-ext4, adilger.kernel, ojaswin, linux-kernel, yi.zhang,
	yangerkun

Hello!

On 2025/7/14 11:01, Theodore Ts'o wrote:
> On Thu, Jul 10, 2025 at 04:38:33PM +0200, Jan Kara wrote:
>> Yes, apparently both approaches have their pros and cons. I'm actually
>> surprised the impact of additional barriers on ARM is so big for the
>> single container case. 10% gain for single container cases look nice OTOH
>> realistical workloads will have more container so maybe that's not worth
>> optimizing for. Ted, do you have any opinion?
> Let me try to summarize; regardless of whether we use
> {READ,WRITE})_ONCE or smp_load_acquire / smp_store_restore, both are
> signiicantly better than using a the spinlock.  The other thing about
> the "single-threaded perforance" is that there is the aditional cost
> of the CPU-to-CPU syncing is not free.  But CPU synchronization cost
> applies when that the single thread is bouncing between CPU's --- if
> we hada single threaded application which is pinned on a single CPU
> cost of smp_load_acquire would't be there since the cache line
> wouldn't be bouncing back and forth.  Is that correct, or am I missing
> something?
>
> In any case, so long as the single-threaded performance doesn't
> regress relative to the current spin_lock implementation, I'm inclined
> to prefer the use smp_load_acquire approach if it improves
> multi-threaded allocation performance on ARM64.
>
> Cheers,
>
> 							- Ted
>
Using {READ,WRITE}_ONCE yielded a very significant improvement in single
container scenarios (10%-16%). Although there was a slight decrease in
multi-container scenarios (-1% to -3%), subsequent optimizations
compensated for this.

To prevent regressions in single-container performance, we ultimately chose
{READ,WRITE}_ONCE for the v3 release last week.

Thank you for your suggestion!


Cheers,
Baokun


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 02/16] ext4: remove unnecessary s_mb_last_start
  2025-06-30  7:52         ` Baokun Li
@ 2025-07-14  7:00           ` Ojaswin Mujoo
  0 siblings, 0 replies; 51+ messages in thread
From: Ojaswin Mujoo @ 2025-07-14  7:00 UTC (permalink / raw)
  To: Baokun Li
  Cc: Jan Kara, linux-ext4, tytso, adilger.kernel, linux-kernel,
	yi.zhang, yangerkun

On Mon, Jun 30, 2025 at 03:52:58PM +0800, Baokun Li wrote:
> On 2025/6/30 15:31, Jan Kara wrote:
> > On Mon 30-06-25 11:32:16, Baokun Li wrote:
> > > On 2025/6/28 2:15, Jan Kara wrote:
> > > > On Mon 23-06-25 15:32:50, Baokun Li wrote:
> > > > > ac->ac_g_ex.fe_start is only used in ext4_mb_find_by_goal(), but STREAM
> > > > > ALLOC is activated after ext4_mb_find_by_goal() fails, so there's no need
> > > > > to update ac->ac_g_ex.fe_start, remove the unnecessary s_mb_last_start.
> > > > > 
> > > > > Signed-off-by: Baokun Li <libaokun1@huawei.com>
> > > > I'd just note that ac->ac_g_ex.fe_start is also used in
> > > > ext4_mb_collect_stats() so this change may impact the statistics gathered
> > > > there. OTOH it is questionable whether we even want to account streaming
> > > > allocation as a goal hit... Anyway, I'm fine with this, I'd just mention it
> > > > in the changelog.
> > > Yes, I missed ext4_mb_collect_stats(). However, instead of explaining
> > > it in the changelog, I think it would be better to move the current
> > > s_bal_goals update to inside or after ext4_mb_find_by_goal().
> > > 
> > > Then, we could add another variable, such as s_bal_stream_goals, to
> > > represent the hit count for global goals. This kind of statistic would
> > > help us fine-tune the logic for optimizing inode goals and global goals.
> > > 
> > > What are your thoughts on this?
> > Sure that sounds good to me.
> 
> Ok, I will add a patch to implement that logic in the next version.
> 
> > 
> > > > > @@ -2849,7 +2848,6 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
> > > > >    		/* TBD: may be hot point */
> > > > >    		spin_lock(&sbi->s_md_lock);
> > > > >    		ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
> > > > > -		ac->ac_g_ex.fe_start = sbi->s_mb_last_start;
> > > > Maybe reset ac->ac_g_ex.fe_start to 0 instead of leaving it at some random
> > > > value? Just for the sake of defensive programming...
> > > > 
> > > ac->ac_g_ex.fe_start holds the inode goal's start position, not a random
> > > value. It's unused after ext4_mb_find_by_goal() (if s_bal_stream_goals is
> > > added). Thus, I see no need for further modification. We can always re-add
> > > it if future requirements change.
> > Yeah, I was imprecise. It is not a random value. But it is not an offset in
> > the group we are now setting. Therefore I'd still prefer to reset fe_start
> > to 0 (or some invalid value like -1 to catch unexpected use).
> > 
> > 								Honza
> 
> When ext4_mb_regular_allocator() fails, it might retry and get called
> again. In this scenario, we can't reliably determine if ac_g_ex has
> already been modified. Therefore, it might be more appropriate to set
> ac_g_ex.fe_start to -1 after ext4_mb_find_by_goal() fails. We can then
> skip ext4_mb_find_by_goal() when ac_g_ex.fe_start < 0.

Hmm, I'm not sure that giving a one-off special meaning to -1 is the right
approach.

How about resetting the original goal group and goal start in the retry
logic of ext4_mb_new_blocks()? Since we drop preallocations before
retrying, we might actually find our goal during the retry (a slim
chance, but still).
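
Just to make the two options concrete (a rough sketch only, not a
final patch -- names and call sites are assumptions):

/*
 * Option A (the sentinel you proposed): poison fe_start in
 * ext4_mb_regular_allocator() once the goal scan has failed, so that
 * a retry of the allocator skips ext4_mb_find_by_goal().
 */
if (ac->ac_g_ex.fe_start >= 0) {
	err = ext4_mb_find_by_goal(ac, &e4b);
	if (err || ac->ac_status == AC_STATUS_FOUND)
		goto out;
	/* goal scan failed -- don't repeat it on retries */
	ac->ac_g_ex.fe_start = -1;
}

/*
 * Option B (what I'm suggesting here): in the retry loop of
 * ext4_mb_new_blocks(), restore the goal group and start from the
 * original request before calling the allocator again, so the retry
 * can still attempt the goal scan after preallocations were dropped.
 */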

Regards,
ojaswin
> 
> 
> Cheers,
> Baokun
> 


end of thread, other threads:[~2025-07-14  7:02 UTC | newest]

Thread overview: 51+ messages
2025-06-23  7:32 [PATCH v2 00/16] ext4: better scalability for ext4 block allocation Baokun Li
2025-06-23  7:32 ` [PATCH v2 01/16] ext4: add ext4_try_lock_group() to skip busy groups Baokun Li
2025-06-27 18:06   ` Jan Kara
2025-07-14  6:53   ` Ojaswin Mujoo
2025-06-23  7:32 ` [PATCH v2 02/16] ext4: remove unnecessary s_mb_last_start Baokun Li
2025-06-27 18:15   ` Jan Kara
2025-06-30  3:32     ` Baokun Li
2025-06-30  7:31       ` Jan Kara
2025-06-30  7:52         ` Baokun Li
2025-07-14  7:00           ` Ojaswin Mujoo
2025-06-23  7:32 ` [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group Baokun Li
2025-06-27 18:19   ` Jan Kara
2025-06-30  3:48     ` Baokun Li
2025-06-30  7:47       ` Jan Kara
2025-06-30  9:21         ` Baokun Li
2025-06-30 16:32           ` Jan Kara
2025-07-01  2:39             ` Baokun Li
2025-07-01 12:21               ` Jan Kara
2025-07-01 13:17                 ` Baokun Li
2025-07-08 13:08                 ` Baokun Li
2025-07-10 14:38                   ` Jan Kara
2025-07-14  3:01                     ` Theodore Ts'o
2025-07-14  7:00                       ` Baokun Li
2025-07-01  2:57   ` kernel test robot
2025-06-23  7:32 ` [PATCH v2 04/16] ext4: utilize multiple global goals to reduce contention Baokun Li
2025-06-27 18:31   ` Jan Kara
2025-06-30  6:50     ` Baokun Li
2025-06-30  8:38       ` Jan Kara
2025-06-30 10:02         ` Baokun Li
2025-06-30 17:41           ` Jan Kara
2025-07-01  3:32             ` Baokun Li
2025-07-01 11:53               ` Jan Kara
2025-07-01 12:12                 ` Baokun Li
2025-06-23  7:32 ` [PATCH v2 05/16] ext4: get rid of some obsolete EXT4_MB_HINT flags Baokun Li
2025-06-23  7:32 ` [PATCH v2 06/16] ext4: fix typo in CR_GOAL_LEN_SLOW comment Baokun Li
2025-06-23  7:32 ` [PATCH v2 07/16] ext4: convert sbi->s_mb_free_pending to atomic_t Baokun Li
2025-06-27 18:33   ` Jan Kara
2025-06-23  7:32 ` [PATCH v2 08/16] ext4: merge freed extent with existing extents before insertion Baokun Li
2025-06-27 19:11   ` Jan Kara
2025-06-23  7:32 ` [PATCH v2 09/16] ext4: fix zombie groups in average fragment size lists Baokun Li
2025-06-27 19:14   ` Jan Kara
2025-06-30  6:53     ` Baokun Li
2025-06-23  7:32 ` [PATCH v2 10/16] ext4: fix largest free orders lists corruption on mb_optimize_scan switch Baokun Li
2025-06-27 19:34   ` Jan Kara
2025-06-30  7:34     ` Baokun Li
2025-06-23  7:32 ` [PATCH v2 11/16] ext4: factor out __ext4_mb_scan_group() Baokun Li
2025-06-23  7:33 ` [PATCH v2 12/16] ext4: factor out ext4_mb_might_prefetch() Baokun Li
2025-06-23  7:33 ` [PATCH v2 13/16] ext4: factor out ext4_mb_scan_group() Baokun Li
2025-06-23  7:33 ` [PATCH v2 14/16] ext4: convert free group lists to ordered xarrays Baokun Li
2025-06-23  7:33 ` [PATCH v2 15/16] ext4: refactor choose group to scan group Baokun Li
2025-06-23  7:33 ` [PATCH v2 16/16] ext4: ensure global ordered traversal across all free groups xarrays Baokun Li
