public inbox for linux-ext4@vger.kernel.org
* [PATCH 0/4] ext4: better scalability for ext4 block allocation
@ 2025-05-23  8:58 libaokun
  2025-05-23  8:58 ` [PATCH 1/4] ext4: add ext4_try_lock_group() to skip busy groups libaokun
                   ` (4 more replies)
  0 siblings, 5 replies; 22+ messages in thread
From: libaokun @ 2025-05-23  8:58 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, linux-kernel, yi.zhang, yangerkun,
	libaokun1, libaokun

From: Baokun Li <libaokun1@huawei.com>

Since servers have more and more CPUs, and we're running more containers
on them, we've been using will-it-scale to test how well ext4 scales. The
fallocate2 test (append 8KB to 1MB, truncate to 0, repeat) run concurrently
on 64 containers revealed significant contention in block allocation/free,
leading to much lower aggregate fallocate OPS compared to a single
container (see below).

   1   |    2   |    4   |    8   |   16   |   32   |   64
-------|--------|--------|--------|--------|--------|-------
295287 | 70665  | 33865  | 19387  | 10104  |  5588  |  3588
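
Each container's loop boils down to the following (a minimal C sketch of
the workload described above, not the actual will-it-scale source):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

static void fallocate2_loop(const char *path)
{
	/* error handling omitted for brevity */
	int fd = open(path, O_CREAT | O_RDWR, 0644);
	off_t pos = 0;

	for (;;) {
		/* append 8KB to the end of the file */
		if (fallocate(fd, 0, pos, 8192) == 0)
			pos += 8192;
		/* once the file reaches 1MB, truncate to 0 and restart */
		if (pos >= 1024 * 1024) {
			ftruncate(fd, 0);
			pos = 0;
		}
		/* each iteration counts as one fallocate op */
	}
}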

The main bottleneck was ext4_lock_group(), which block allocation and
block freeing fought over. While the block group used when freeing is
fixed and cannot be changed, the block group used for allocation is
selectable. Consequently, the ext4_try_lock_group() helper function was
added to avoid contention on busy groups; see Patch 1 for details.

After we fixed the ext4_lock_group bottleneck, another one showed up:
s_md_lock. This lock protects different data when allocating and freeing
blocks. We got rid of taking s_md_lock in block allocation by making
stream allocation work per inode instead of globally. You can find more
details in Patch 2.

Patches 3 and 4 are just some minor cleanups.

Performance test data follows:

CPU: HUAWEI Kunpeng 920
Memory: 480GB
Disk: 480GB SSD SATA 3.2
Test: Running will-it-scale/fallocate2 on 64 CPU-bound containers.
Observation: Average fallocate operations per container per second.

|--------|--------|--------|--------|--------|--------|--------|--------|
|    -   |    1   |    2   |    4   |    8   |   16   |   32   |   64   |
|--------|--------|--------|--------|--------|--------|--------|--------|
|  base  | 295287 | 70665  | 33865  | 19387  | 10104  |  5588  |  3588  |
|--------|--------|--------|--------|--------|--------|--------|--------|
| linear | 286328 | 123102 | 119542 | 90653  | 60344  | 35302  | 23280  |
|        | -3.0%  | 74.20% | 252.9% | 367.5% | 497.2% | 531.6% | 548.7% |
|--------|--------|--------|--------|--------|--------|--------|--------|
|mb_optim| 292498 | 133305 | 103069 | 61727  | 29702  | 16845  | 10430  |
|ize_scan| -0.9%  | 88.64% | 204.3% | 218.3% | 193.9% | 201.4% | 190.6% |
|--------|--------|--------|--------|--------|--------|--------|--------|

Running "kvm-xfstests -c ext4/all -g auto" showed that 1k/generic/347 often
fails. The test seems to think that a dm-thin device with a virtual size of
5000M but a real size of 500M, after being formatted as ext4, would have
500M free.

But it doesn't: we run out of space after creating about 430 1M files.
Since the block size is 1k, creating that many files turns on dir_index,
and dm-thin waits a minute, sees no free space, and then throws an IO
error. This can cause a directory index block write to fail and abort
the journal.

What's worse is that _dmthin_check_fs doesn't replay the journal, so fsck
finds inconsistencies and the test fails. I think the problem is with the
test itself, and I'll send a patch to fix it soon.

Comments and questions are, as always, welcome.

Thanks,
Baokun


Baokun Li (4):
  ext4: add ext4_try_lock_group() to skip busy groups
  ext4: move mb_last_[group|start] to ext4_inode_info
  ext4: get rid of some obsolete EXT4_MB_HINT flags
  ext4: fix typo in CR_GOAL_LEN_SLOW comment

 fs/ext4/ext4.h              | 38 ++++++++++++++++++-------------------
 fs/ext4/mballoc.c           | 34 +++++++++++++++++++--------------
 fs/ext4/super.c             |  2 ++
 include/trace/events/ext4.h |  3 ---
 4 files changed, 41 insertions(+), 36 deletions(-)

-- 
2.46.1



* [PATCH 1/4] ext4: add ext4_try_lock_group() to skip busy groups
  2025-05-23  8:58 [PATCH 0/4] ext4: better scalability for ext4 block allocation libaokun
@ 2025-05-23  8:58 ` libaokun
  2025-05-28 15:05   ` Ojaswin Mujoo
  2025-05-23  8:58 ` [PATCH 2/4] ext4: move mb_last_[group|start] to ext4_inode_info libaokun
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: libaokun @ 2025-05-23  8:58 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, linux-kernel, yi.zhang, yangerkun,
	libaokun1, libaokun

From: Baokun Li <libaokun1@huawei.com>

When ext4 allocates blocks, we used to just go through the block groups
one by one to find a good one. But when there are tons of block groups
(hundreds of thousands or even millions) and only a few have free space
(meaning they're mostly full), it takes a really long time to check them
all, and performance suffers. So we added the "mb_optimize_scan" mount
option (which is on by default now). It keeps the groups on lists ordered
by largest free order and by average fragment size, so when we need a
free block we can just grab a likely group from the right list. This
saves time and makes block allocation much faster.

But when multiple processes or containers are doing similar things, like
constantly allocating 8k blocks, they all try to use the same block group
in the same list. Even just two processes doing this can cut the IOPS in
half. For example, one container might do 300,000 IOPS, but if you run two
at the same time, the total is only 150,000.

Since we can already scan block groups in a non-linear order, the first
and last groups on the same list are equally good candidates for finding
free blocks right now. Therefore, add an ext4_try_lock_group() helper
function to skip the current group when it is locked by another process,
thereby avoiding contention with other processes. This helps ext4 make
better use of having multiple block groups.

Also, to make sure we don't skip all the groups that have free space
when allocating blocks, we won't try to skip busy groups anymore when
ac_criteria is CR_ANY_FREE.

Performance test data follows:

CPU: HUAWEI Kunpeng 920
Memory: 480GB
Disk: 480GB SSD SATA 3.2
Test: Running will-it-scale/fallocate2 on 64 CPU-bound containers.
Observation: Average fallocate operations per container per second.

                      base    patched
mb_optimize_scan=0    3588    6755 (+88.2%)
mb_optimize_scan=1    3588    4302 (+19.8%)

Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/ext4.h    | 23 ++++++++++++++---------
 fs/ext4/mballoc.c | 14 +++++++++++---
 2 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 5a20e9cd7184..9c665a620a46 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3494,23 +3494,28 @@ static inline int ext4_fs_is_busy(struct ext4_sb_info *sbi)
 	return (atomic_read(&sbi->s_lock_busy) > EXT4_CONTENTION_THRESHOLD);
 }
 
+static inline bool ext4_try_lock_group(struct super_block *sb, ext4_group_t group)
+{
+	if (!spin_trylock(ext4_group_lock_ptr(sb, group)))
+		return false;
+	/*
+	 * We're able to grab the lock right away, so drop the lock
+	 * contention counter.
+	 */
+	atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, -1, 0);
+	return true;
+}
+
 static inline void ext4_lock_group(struct super_block *sb, ext4_group_t group)
 {
-	spinlock_t *lock = ext4_group_lock_ptr(sb, group);
-	if (spin_trylock(lock))
-		/*
-		 * We're able to grab the lock right away, so drop the
-		 * lock contention counter.
-		 */
-		atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, -1, 0);
-	else {
+	if (!ext4_try_lock_group(sb, group)) {
 		/*
 		 * The lock is busy, so bump the contention counter,
 		 * and then wait on the spin lock.
 		 */
 		atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, 1,
 				  EXT4_MAX_CONTENTION);
-		spin_lock(lock);
+		spin_lock(ext4_group_lock_ptr(sb, group));
 	}
 }
 
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 1e98c5be4e0a..5c13d9f8a1cc 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -896,7 +896,8 @@ static void ext4_mb_choose_next_group_p2_aligned(struct ext4_allocation_context
 				    bb_largest_free_order_node) {
 			if (sbi->s_mb_stats)
 				atomic64_inc(&sbi->s_bal_cX_groups_considered[CR_POWER2_ALIGNED]);
-			if (likely(ext4_mb_good_group(ac, iter->bb_group, CR_POWER2_ALIGNED))) {
+			if (likely(ext4_mb_good_group(ac, iter->bb_group, CR_POWER2_ALIGNED)) &&
+			    !spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group))) {
 				*group = iter->bb_group;
 				ac->ac_flags |= EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED;
 				read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
@@ -932,7 +933,8 @@ ext4_mb_find_good_group_avg_frag_lists(struct ext4_allocation_context *ac, int o
 	list_for_each_entry(iter, frag_list, bb_avg_fragment_size_node) {
 		if (sbi->s_mb_stats)
 			atomic64_inc(&sbi->s_bal_cX_groups_considered[cr]);
-		if (likely(ext4_mb_good_group(ac, iter->bb_group, cr))) {
+		if (likely(ext4_mb_good_group(ac, iter->bb_group, cr)) &&
+		    !spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group))) {
 			grp = iter;
 			break;
 		}
@@ -2911,7 +2913,13 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 			if (err)
 				goto out;
 
-			ext4_lock_group(sb, group);
+			/* skip busy group */
+			if (cr >= CR_ANY_FREE) {
+				ext4_lock_group(sb, group);
+			} else if (!ext4_try_lock_group(sb, group)) {
+				ext4_mb_unload_buddy(&e4b);
+				continue;
+			}
 
 			/*
 			 * We need to check again after locking the
-- 
2.46.1



* [PATCH 2/4] ext4: move mb_last_[group|start] to ext4_inode_info
  2025-05-23  8:58 [PATCH 0/4] ext4: better scalability for ext4 block allocation libaokun
  2025-05-23  8:58 ` [PATCH 1/4] ext4: add ext4_try_lock_group() to skip busy groups libaokun
@ 2025-05-23  8:58 ` libaokun
  2025-05-29 12:56   ` Jan Kara
  2025-05-23  8:58 ` [PATCH 3/4] ext4: get rid of some obsolete EXT4_MB_HINT flags libaokun
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: libaokun @ 2025-05-23  8:58 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, linux-kernel, yi.zhang, yangerkun,
	libaokun1, libaokun

From: Baokun Li <libaokun1@huawei.com>

After we optimized the block group lock, we found another lock
contention issue when running will-it-scale/fallocate2 with multiple
processes. The fallocate's block allocation and the truncate's block
release were fighting over the s_md_lock. The problem is, this lock
protects totally different things in those two processes: the list of
freed data blocks (s_freed_data_list) when releasing, and where to start
looking for new blocks (mb_last_[group|start]) when allocating.

Moreover, when allocating data blocks, if the first try (goal allocation)
fails and stream allocation is on, it tries a global goal starting from
the last group we used (s_mb_last_group). This can make things faster by
writing blocks close together on the disk. But when many processes are
allocating, they all fight over s_md_lock and might even try to use the
same group. This makes it harder to merge extents and can make files more
fragmented. If different processes allocate chunks of very different sizes,
the free space on the disk can also get fragmented. A small allocation
might fit in a partially full group, but a big allocation might have
skipped it, leading to the small IO ending up in an emptier group.

So, we're changing stream allocation to work per inode. First, it tries
the goal, then the last group where that inode successfully allocated a
block. This keeps an inode's data closer together. Plus, after moving
mb_last_[group|start] to ext4_inode_info, we don't need s_md_lock during
block allocation anymore because we already have the write lock on
i_data_sem. This gets rid of the contention between allocating and
releasing blocks, which gives a huge performance boost to fallocate2.

Performance test data follows:

CPU: HUAWEI Kunpeng 920
Memory: 480GB
Disk: 480GB SSD SATA 3.2
Test: Running will-it-scale/fallocate2 on 64 CPU-bound containers.
Observation: Average fallocate operations per container per second.

                      base     patched
mb_optimize_scan=0    6755     23280 (+244.6%)
mb_optimize_scan=1    4302     10430 (+142.4%)

Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/ext4.h    |  7 ++++---
 fs/ext4/mballoc.c | 20 +++++++++-----------
 fs/ext4/super.c   |  2 ++
 3 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9c665a620a46..16c14dd09df6 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1171,6 +1171,10 @@ struct ext4_inode_info {
 	__u32 i_csum_seed;
 
 	kprojid_t i_projid;
+
+	/* where last allocation was done - for stream allocation */
+	ext4_group_t i_mb_last_group;
+	ext4_grpblk_t i_mb_last_start;
 };
 
 /*
@@ -1603,9 +1607,6 @@ struct ext4_sb_info {
 	unsigned int s_mb_order2_reqs;
 	unsigned int s_mb_group_prealloc;
 	unsigned int s_max_dir_size_kb;
-	/* where last allocation was done - for stream allocation */
-	unsigned long s_mb_last_group;
-	unsigned long s_mb_last_start;
 	unsigned int s_mb_prefetch;
 	unsigned int s_mb_prefetch_limit;
 	unsigned int s_mb_best_avail_max_trim_order;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 5c13d9f8a1cc..ee9696f9bac8 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2138,7 +2138,6 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
 static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
 					struct ext4_buddy *e4b)
 {
-	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
 	int ret;
 
 	BUG_ON(ac->ac_b_ex.fe_group != e4b->bd_group);
@@ -2169,10 +2168,8 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
 	folio_get(ac->ac_buddy_folio);
 	/* store last allocated for subsequent stream allocation */
 	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
-		spin_lock(&sbi->s_md_lock);
-		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
-		sbi->s_mb_last_start = ac->ac_f_ex.fe_start;
-		spin_unlock(&sbi->s_md_lock);
+		EXT4_I(ac->ac_inode)->i_mb_last_group = ac->ac_f_ex.fe_group;
+		EXT4_I(ac->ac_inode)->i_mb_last_start = ac->ac_f_ex.fe_start;
 	}
 	/*
 	 * As we've just preallocated more space than
@@ -2844,13 +2841,14 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 							   MB_NUM_ORDERS(sb));
 	}
 
-	/* if stream allocation is enabled, use global goal */
+	/* if stream allocation is enabled, use last goal */
 	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
-		/* TBD: may be hot point */
-		spin_lock(&sbi->s_md_lock);
-		ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
-		ac->ac_g_ex.fe_start = sbi->s_mb_last_start;
-		spin_unlock(&sbi->s_md_lock);
+		struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
+
+		if (ei->i_mb_last_group || ei->i_mb_last_start) {
+			ac->ac_g_ex.fe_group = ei->i_mb_last_group;
+			ac->ac_g_ex.fe_start = ei->i_mb_last_start;
+		}
 	}
 
 	/*
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 181934499624..6c49c43bb2cb 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1416,6 +1416,8 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 	INIT_WORK(&ei->i_rsv_conversion_work, ext4_end_io_rsv_work);
 	ext4_fc_init_inode(&ei->vfs_inode);
 	mutex_init(&ei->i_fc_lock);
+	ei->i_mb_last_group = 0;
+	ei->i_mb_last_start = 0;
 	return &ei->vfs_inode;
 }
 
-- 
2.46.1



* [PATCH 3/4] ext4: get rid of some obsolete EXT4_MB_HINT flags
  2025-05-23  8:58 [PATCH 0/4] ext4: better scalability for ext4 block allocation libaokun
  2025-05-23  8:58 ` [PATCH 1/4] ext4: add ext4_try_lock_group() to skip busy groups libaokun
  2025-05-23  8:58 ` [PATCH 2/4] ext4: move mb_last_[group|start] to ext4_inode_info libaokun
@ 2025-05-23  8:58 ` libaokun
  2025-05-28 15:10   ` Ojaswin Mujoo
  2025-05-29 12:57   ` Jan Kara
  2025-05-23  8:58 ` [PATCH 4/4] ext4: fix typo in CR_GOAL_LEN_SLOW comment libaokun
  2025-05-28 14:53 ` [PATCH 0/4] ext4: better scalability for ext4 block allocation Ojaswin Mujoo
  4 siblings, 2 replies; 22+ messages in thread
From: libaokun @ 2025-05-23  8:58 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, linux-kernel, yi.zhang, yangerkun,
	libaokun1, libaokun

From: Baokun Li <libaokun1@huawei.com>

Since nobody has used these EXT4_MB_HINT flags for ages,
let's remove them.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/ext4.h              | 6 ------
 include/trace/events/ext4.h | 3 ---
 2 files changed, 9 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 16c14dd09df6..f6d01702423d 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -185,14 +185,8 @@ enum criteria {
 
 /* prefer goal again. length */
 #define EXT4_MB_HINT_MERGE		0x0001
-/* blocks already reserved */
-#define EXT4_MB_HINT_RESERVED		0x0002
-/* metadata is being allocated */
-#define EXT4_MB_HINT_METADATA		0x0004
 /* first blocks in the file */
 #define EXT4_MB_HINT_FIRST		0x0008
-/* search for the best chunk */
-#define EXT4_MB_HINT_BEST		0x0010
 /* data is being allocated */
 #define EXT4_MB_HINT_DATA		0x0020
 /* don't preallocate (for tails) */
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 156908641e68..33b204165cc0 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -23,10 +23,7 @@ struct partial_cluster;
 
 #define show_mballoc_flags(flags) __print_flags(flags, "|",	\
 	{ EXT4_MB_HINT_MERGE,		"HINT_MERGE" },		\
-	{ EXT4_MB_HINT_RESERVED,	"HINT_RESV" },		\
-	{ EXT4_MB_HINT_METADATA,	"HINT_MDATA" },		\
 	{ EXT4_MB_HINT_FIRST,		"HINT_FIRST" },		\
-	{ EXT4_MB_HINT_BEST,		"HINT_BEST" },		\
 	{ EXT4_MB_HINT_DATA,		"HINT_DATA" },		\
 	{ EXT4_MB_HINT_NOPREALLOC,	"HINT_NOPREALLOC" },	\
 	{ EXT4_MB_HINT_GROUP_ALLOC,	"HINT_GRP_ALLOC" },	\
-- 
2.46.1



* [PATCH 4/4] ext4: fix typo in CR_GOAL_LEN_SLOW comment
  2025-05-23  8:58 [PATCH 0/4] ext4: better scalability for ext4 block allocation libaokun
                   ` (2 preceding siblings ...)
  2025-05-23  8:58 ` [PATCH 3/4] ext4: get rid of some obsolete EXT4_MB_HINT flags libaokun
@ 2025-05-23  8:58 ` libaokun
  2025-05-28 15:11   ` Ojaswin Mujoo
  2025-05-29 12:57   ` Jan Kara
  2025-05-28 14:53 ` [PATCH 0/4] ext4: better scalability for ext4 block allocation Ojaswin Mujoo
  4 siblings, 2 replies; 22+ messages in thread
From: libaokun @ 2025-05-23  8:58 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, linux-kernel, yi.zhang, yangerkun,
	libaokun1, libaokun

From: Baokun Li <libaokun1@huawei.com>

Remove the superfluous "find_".

Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/ext4.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index f6d01702423d..1182700901cc 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -157,7 +157,7 @@ enum criteria {
 
 	/*
 	 * Reads each block group sequentially, performing disk IO if
-	 * necessary, to find find_suitable block group. Tries to
+	 * necessary, to find suitable block group. Tries to
 	 * allocate goal length but might trim the request if nothing
 	 * is found after enough tries.
 	 */
-- 
2.46.1



* Re: [PATCH 0/4] ext4: better scalability for ext4 block allocation
  2025-05-23  8:58 [PATCH 0/4] ext4: better scalability for ext4 block allocation libaokun
                   ` (3 preceding siblings ...)
  2025-05-23  8:58 ` [PATCH 4/4] ext4: fix typo in CR_GOAL_LEN_SLOW comment libaokun
@ 2025-05-28 14:53 ` Ojaswin Mujoo
  2025-05-29 12:24   ` Baokun Li
  4 siblings, 1 reply; 22+ messages in thread
From: Ojaswin Mujoo @ 2025-05-28 14:53 UTC (permalink / raw)
  To: libaokun
  Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, yi.zhang,
	yangerkun, libaokun1

On Fri, May 23, 2025 at 04:58:17PM +0800, libaokun@huaweicloud.com wrote:
> From: Baokun Li <libaokun1@huawei.com>
> 
> Since servers have more and more CPUs, and we're running more containers
> on them, we've been using will-it-scale to test how well ext4 scales. The
> fallocate2 test (append 8KB to 1MB, truncate to 0, repeat) run concurrently
> on 64 containers revealed significant contention in block allocation/free,
> leading to much lower aggregate fallocate OPS compared to a single
> container (see below).
> 
>    1   |    2   |    4   |    8   |   16   |   32   |   64
> -------|--------|--------|--------|--------|--------|-------
> 295287 | 70665  | 33865  | 19387  | 10104  |  5588  |  3588
> 
> The main bottleneck was the ext4_lock_group(), which both block allocation
> and free fought over. While the block group for block free is fixed and
> unoptimizable, the block group for allocation is selectable. Consequently,
> the ext4_try_lock_group() helper function was added to avoid contention on
> busy groups, and you can see more in Patch 1.
> 
> After we fixed the ext4_lock_group bottleneck, another one showed up:
> s_md_lock. This lock protects different data when allocating and freeing
> blocks. We got rid of the s_md_lock call in block allocation by making
> stream allocation work per inode instead of globally. You can find more
> details in Patch 2.
> 
> Patches 3 and 4 are just some minor cleanups.
> 
> Performance test data follows:
> 
> CPU: HUAWEI Kunpeng 920
> Memory: 480GB
> Disk: 480GB SSD SATA 3.2
> Test: Running will-it-scale/fallocate2 on 64 CPU-bound containers.
>  Observation: Average fallocate operations per container per second.

> 
> |--------|--------|--------|--------|--------|--------|--------|--------|
> |    -   |    1   |    2   |    4   |    8   |   16   |   32   |   64   |
> |--------|--------|--------|--------|--------|--------|--------|--------|
> |  base  | 295287 | 70665  | 33865  | 19387  | 10104  |  5588  |  3588  |
> |--------|--------|--------|--------|--------|--------|--------|--------|
> | linear | 286328 | 123102 | 119542 | 90653  | 60344  | 35302  | 23280  |
> |        | -3.0%  | 74.20% | 252.9% | 367.5% | 497.2% | 531.6% | 548.7% |
> |--------|--------|--------|--------|--------|--------|--------|--------|
> |mb_optim| 292498 | 133305 | 103069 | 61727  | 29702  | 16845  | 10430  |
> |ize_scan| -0.9%  | 88.64% | 204.3% | 218.3% | 193.9% | 201.4% | 190.6% |
> |--------|--------|--------|--------|--------|--------|--------|--------|

Hey Baokun, nice improvements! The proposed changes make sense to me,
however I suspect the performance improvements may come at the cost of a
slight increase in fragmentation, which might affect rotational disks
especially. Maybe comparing e2freefrag numbers with and without the
patches might give a better insight into this.

Regardless, the performance benefits are significant and I feel it is
good to have these patches.

I'll give my reviews individually as I'm still going through patch 2.
However, I wanted to check on a couple of things:

1. I believe you ran these in docker. Would you have any script etc. open
   sourced that I can use to run some benchmarks on my end (and also to
   understand your test setup)?

2. I notice we are getting way less throughput in mb_optimize_scan? I
   wonder why that is the case. Do you have some data on that? Are your
   tests starting on an empty FS, maybe in that case linear scan works a 
   bit better since almost all groups are empty. If so, what are the
   numbers like when we start with a fragmented FS?

   - Or maybe it is that the lazyinit thread has not yet initialized all
   the buddies, which means we have fewer BGs in the freefrag list
   or the order list used by faster CRs. Hence, if they are locked we
   are falling more to CR_GOAL_LEN_SLOW. To check if this is the case,
   one hack is to cat /proc/fs/ext4/<disk>/mb_groups (or something along
   the lines) before the benchmark, which forces init of all the group
   buddies thus populating all the lists used by mb_opt_scan. Maybe we
   can check if this gives better results.

3. Also, how much IO are we doing here, are we filling the whole FS?

Regards,
ojaswin


* Re: [PATCH 1/4] ext4: add ext4_try_lock_group() to skip busy groups
  2025-05-23  8:58 ` [PATCH 1/4] ext4: add ext4_try_lock_group() to skip busy groups libaokun
@ 2025-05-28 15:05   ` Ojaswin Mujoo
  2025-05-30  8:20     ` Baokun Li
  0 siblings, 1 reply; 22+ messages in thread
From: Ojaswin Mujoo @ 2025-05-28 15:05 UTC (permalink / raw)
  To: libaokun
  Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, yi.zhang,
	yangerkun, libaokun1

On Fri, May 23, 2025 at 04:58:18PM +0800, libaokun@huaweicloud.com wrote:
> From: Baokun Li <libaokun1@huawei.com>
> 
> When ext4 allocates blocks, we used to just go through the block groups
> one by one to find a good one. But when there are tons of block groups
> (like hundreds of thousands or even millions) and not many have free space
> (meaning they're mostly full), it takes a really long time to check them
> all, and performance gets bad. So, we added the "mb_optimize_scan" mount
> option (which is on by default now). It keeps track of some group lists,
> so when we need a free block, we can just grab a likely group from the
> right list. This saves time and makes block allocation much faster.
> 
> But when multiple processes or containers are doing similar things, like
> constantly allocating 8k blocks, they all try to use the same block group
> in the same list. Even just two processes doing this can cut the IOPS in
> half. For example, one container might do 300,000 IOPS, but if you run two
> at the same time, the total is only 150,000.
> 
> Since we can already look at block groups in a non-linear way, the first
> and last groups in the same list are basically the same for finding a block
> right now. Therefore, add an ext4_try_lock_group() helper function to skip
> the current group when it is locked by another process, thereby avoiding
> contention with other processes. This helps ext4 make better use of having
> multiple block groups.
> 
> Also, to make sure we don't skip all the groups that have free space
> when allocating blocks, we won't try to skip busy groups anymore when
> ac_criteria is CR_ANY_FREE.
> 
> Performance test data follows:
> 
> CPU: HUAWEI Kunpeng 920
> Memory: 480GB
> Disk: 480GB SSD SATA 3.2
> Test: Running will-it-scale/fallocate2 on 64 CPU-bound containers.
> Observation: Average fallocate operations per container per second.
> 
>                       base    patched
> mb_optimize_scan=0    3588    6755 (+88.2%)
> mb_optimize_scan=1    3588    4302 (+19.8%)

The patch looks mostly good. Same observations about mb_optimize_scan=1
improving less. We can continue this discussion in my reply to the cover
letter. That being said, I have some minor suggestions:

> 
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> ---
>  fs/ext4/ext4.h    | 23 ++++++++++++++---------
>  fs/ext4/mballoc.c | 14 +++++++++++---
>  2 files changed, 25 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 5a20e9cd7184..9c665a620a46 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -3494,23 +3494,28 @@ static inline int ext4_fs_is_busy(struct ext4_sb_info *sbi)
>  	return (atomic_read(&sbi->s_lock_busy) > EXT4_CONTENTION_THRESHOLD);
>  }
>  
> +static inline bool ext4_try_lock_group(struct super_block *sb, ext4_group_t group)
> +{
> +	if (!spin_trylock(ext4_group_lock_ptr(sb, group)))
> +		return false;
> +	/*
> +	 * We're able to grab the lock right away, so drop the lock
> +	 * contention counter.
> +	 */
> +	atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, -1, 0);
> +	return true;
> +}
> +
>  static inline void ext4_lock_group(struct super_block *sb, ext4_group_t group)
>  {
> -	spinlock_t *lock = ext4_group_lock_ptr(sb, group);
> -	if (spin_trylock(lock))
> -		/*
> -		 * We're able to grab the lock right away, so drop the
> -		 * lock contention counter.
> -		 */
> -		atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, -1, 0);
> -	else {
> +	if (!ext4_try_lock_group(sb, group)) {
>  		/*
>  		 * The lock is busy, so bump the contention counter,
>  		 * and then wait on the spin lock.
>  		 */
>  		atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, 1,
>  				  EXT4_MAX_CONTENTION);
> -		spin_lock(lock);
> +		spin_lock(ext4_group_lock_ptr(sb, group));
>  	}
>  }
>  
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 1e98c5be4e0a..5c13d9f8a1cc 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -896,7 +896,8 @@ static void ext4_mb_choose_next_group_p2_aligned(struct ext4_allocation_context
>  				    bb_largest_free_order_node) {
>  			if (sbi->s_mb_stats)
>  				atomic64_inc(&sbi->s_bal_cX_groups_considered[CR_POWER2_ALIGNED]);
> -			if (likely(ext4_mb_good_group(ac, iter->bb_group, CR_POWER2_ALIGNED))) {
> +			if (likely(ext4_mb_good_group(ac, iter->bb_group, CR_POWER2_ALIGNED)) &&
> +			    !spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group))) {

Maybe reversing the && order to be (!spin_is_locked() && ext4_mb_good_group()) would be better?
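
i.e. something like (same identifiers as in the hunk above):

			if (!spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group)) &&
			    likely(ext4_mb_good_group(ac, iter->bb_group, CR_POWER2_ALIGNED))) {

so the cheap lock check short-circuits before the more expensive
good-group check.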

>  				*group = iter->bb_group;
>  				ac->ac_flags |= EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED;
>  				read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
> @@ -932,7 +933,8 @@ ext4_mb_find_good_group_avg_frag_lists(struct ext4_allocation_context *ac, int o
>  	list_for_each_entry(iter, frag_list, bb_avg_fragment_size_node) {
>  		if (sbi->s_mb_stats)
>  			atomic64_inc(&sbi->s_bal_cX_groups_considered[cr]);
> -		if (likely(ext4_mb_good_group(ac, iter->bb_group, cr))) {
> +		if (likely(ext4_mb_good_group(ac, iter->bb_group, cr)) &&
> +		    !spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group))) {

same as above
 
>  			grp = iter;
>  			break;
>  		}
> @@ -2911,7 +2913,13 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>  			if (err)
>  				goto out;
>  
> -			ext4_lock_group(sb, group);
> +			/* skip busy group */
> +			if (cr >= CR_ANY_FREE) {
> +				ext4_lock_group(sb, group);
> +			} else if (!ext4_try_lock_group(sb, group)) {
> +				ext4_mb_unload_buddy(&e4b);
> +				continue;
> +			}

This in itself looks good. I am just thinking that now that we are
deciding to skip locked groups, in the code above this one, shall we do
something like:

      
      if (spin_is_locked(group_lock))
        continue;
      
      err = ext4_mb_load_buddy(sb, group, &e4b);
      if (err)
        goto out;

      /* skip busy group */
      if (cr >= CR_ANY_FREE) {
        ext4_lock_group(sb, group);
      } else if (!ext4_try_lock_group(sb, group)) {
        ext4_mb_unload_buddy(&e4b);
        continue;
      }

With this we can even avoid loading the folio as well.

Regards,
ojaswin
>  
>  			/*
>  			 * We need to check again after locking the
> -- 
> 2.46.1
> 


* Re: [PATCH 3/4] ext4: get rid of some obsolete EXT4_MB_HINT flags
  2025-05-23  8:58 ` [PATCH 3/4] ext4: get rid of some obsolete EXT4_MB_HINT flags libaokun
@ 2025-05-28 15:10   ` Ojaswin Mujoo
  2025-05-29 12:57   ` Jan Kara
  1 sibling, 0 replies; 22+ messages in thread
From: Ojaswin Mujoo @ 2025-05-28 15:10 UTC (permalink / raw)
  To: libaokun
  Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, yi.zhang,
	yangerkun, libaokun1

On Fri, May 23, 2025 at 04:58:20PM +0800, libaokun@huaweicloud.com wrote:
> From: Baokun Li <libaokun1@huawei.com>
> 
> Since nobody has used these EXT4_MB_HINT flags for ages,
> let's remove them.
> 
> Signed-off-by: Baokun Li <libaokun1@huawei.com>

Looks good:

Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Regards,
ojaswin


* Re: [PATCH 4/4] ext4: fix typo in CR_GOAL_LEN_SLOW comment
  2025-05-23  8:58 ` [PATCH 4/4] ext4: fix typo in CR_GOAL_LEN_SLOW comment libaokun
@ 2025-05-28 15:11   ` Ojaswin Mujoo
  2025-05-29 12:57   ` Jan Kara
  1 sibling, 0 replies; 22+ messages in thread
From: Ojaswin Mujoo @ 2025-05-28 15:11 UTC (permalink / raw)
  To: libaokun
  Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, yi.zhang,
	yangerkun, libaokun1

On Fri, May 23, 2025 at 04:58:21PM +0800, libaokun@huaweicloud.com wrote:
> From: Baokun Li <libaokun1@huawei.com>
> 
> Remove the superfluous "find_".
> 
> Signed-off-by: Baokun Li <libaokun1@huawei.com>

Looks good:

Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Regards,
ojaswin

> ---
>  fs/ext4/ext4.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index f6d01702423d..1182700901cc 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -157,7 +157,7 @@ enum criteria {
>  
>  	/*
>  	 * Reads each block group sequentially, performing disk IO if
> -	 * necessary, to find find_suitable block group. Tries to
> +	 * necessary, to find suitable block group. Tries to
>  	 * allocate goal length but might trim the request if nothing
>  	 * is found after enough tries.
>  	 */
> -- 
> 2.46.1
> 


* Re: [PATCH 0/4] ext4: better scalability for ext4 block allocation
  2025-05-28 14:53 ` [PATCH 0/4] ext4: better scalability for ext4 block allocation Ojaswin Mujoo
@ 2025-05-29 12:24   ` Baokun Li
  2025-06-10 12:06     ` Ojaswin Mujoo
  0 siblings, 1 reply; 22+ messages in thread
From: Baokun Li @ 2025-05-29 12:24 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, yi.zhang,
	yangerkun, libaokun, Baokun Li

On 2025/5/28 22:53, Ojaswin Mujoo wrote:
> On Fri, May 23, 2025 at 04:58:17PM +0800, libaokun@huaweicloud.com wrote:
>> From: Baokun Li <libaokun1@huawei.com>
>>
>> Since servers have more and more CPUs, and we're running more containers
>> on them, we've been using will-it-scale to test how well ext4 scales. The
>> fallocate2 test (append 8KB to 1MB, truncate to 0, repeat) run concurrently
>> on 64 containers revealed significant contention in block allocation/free,
>> leading to much lower aggregate fallocate OPS compared to a single
>> container (see below).
>>
>>     1   |    2   |    4   |    8   |   16   |   32   |   64
>> -------|--------|--------|--------|--------|--------|-------
>> 295287 | 70665  | 33865  | 19387  | 10104  |  5588  |  3588
>>
>> The main bottleneck was the ext4_lock_group(), which both block allocation
>> and free fought over. While the block group for block free is fixed and
>> unoptimizable, the block group for allocation is selectable. Consequently,
>> the ext4_try_lock_group() helper function was added to avoid contention on
>> busy groups, and you can see more in Patch 1.
>>
>> After we fixed the ext4_lock_group bottleneck, another one showed up:
>> s_md_lock. This lock protects different data when allocating and freeing
>> blocks. We got rid of the s_md_lock call in block allocation by making
>> stream allocation work per inode instead of globally. You can find more
>> details in Patch 2.
>>
>> Patches 3 and 4 are just some minor cleanups.
>>
>> Performance test data follows:
>>
>> CPU: HUAWEI Kunpeng 920
>> Memory: 480GB
>> Disk: 480GB SSD SATA 3.2
>> Test: Running will-it-scale/fallocate2 on 64 CPU-bound containers.
>>   Observation: Average fallocate operations per container per second.
>> |--------|--------|--------|--------|--------|--------|--------|--------|
>> |    -   |    1   |    2   |    4   |    8   |   16   |   32   |   64   |
>> |--------|--------|--------|--------|--------|--------|--------|--------|
>> |  base  | 295287 | 70665  | 33865  | 19387  | 10104  |  5588  |  3588  |
>> |--------|--------|--------|--------|--------|--------|--------|--------|
>> | linear | 286328 | 123102 | 119542 | 90653  | 60344  | 35302  | 23280  |
>> |        | -3.0%  | 74.20% | 252.9% | 367.5% | 497.2% | 531.6% | 548.7% |
>> |--------|--------|--------|--------|--------|--------|--------|--------|
>> |mb_optim| 292498 | 133305 | 103069 | 61727  | 29702  | 16845  | 10430  |
>> |ize_scan| -0.9%  | 88.64% | 204.3% | 218.3% | 193.9% | 201.4% | 190.6% |
>> |--------|--------|--------|--------|--------|--------|--------|--------|
> Hey Baokun, nice improvements! The proposed changes make sense to me,
> however I suspect the performance improvements may come at a cost of
> slight increase in fragmentation, which might affect rotational disks
> especially. Maybe comparing e2freefrag numbers with and without the
> patches might give a better insight into this.
While this approach might slightly increase free space fragmentation on
the disk, it significantly reduces file fragmentation, leading to faster
read speeds on rotational disks.

When multiple processes contend for free blocks within the same block
group, the probability of blocks allocated by the same process being
merged on consecutive allocations is low. This is because other processes
may preempt the free blocks immediately following the current process's
last allocated region.

Normally, we rely on preallocation to avoid files becoming overly
fragmented (even though preallocation itself can cause fragmentation in
free disk space). But since fallocate doesn't support preallocation, the
fragmentation issue is more pronounced. Counterintuitively, skipping busy
groups actually boosts opportunities for file extent merging, which in turn
reduces overall file fragmentation.

Referencing will-it-scale/fallocate2, I tested 64 processes each appending
4KB via fallocate to 64 separate files until they reached 1GB. This test
specifically examines contention in block allocation; unlike fallocate2,
it omits the contention between fallocate and truncate. Preliminary results
are provided below; detailed scripts and full test outcomes are attached in
the email footer.

----------------------------------------------------------
                     |       base      |     patched     |
---------------------|--------|--------|--------|--------|
mb_optimize_scan     | linear |opt_scan| linear |opt_scan|
---------------------|--------|--------|--------|--------|
bw (MiB/s)           | 217    | 219    | 5685   | 5670   |
Avg. free extent (KB)| 1943732| 1943728| 1439608| 1368328|
Avg. extents per file| 261879 | 262039 | 744    | 2084   |
Avg. extent size (KB)| 4      | 4      | 1408   | 503    |
Fragmentation score  | 100    | 100    | 2      | 6      |
----------------------------------------------------------
> Regardless the performance benefits are significant and I feel it is
> good to have these patches.
>
> I'll give my reviews individually as I'm still going through patch 2
> However, I wanted to check on a couple things:
Okay, thank you for your feedback.
>
> 1. I believe you ran these in docker. Would you have any script etc open
>     sourced that I can use to run some benchmarks on my end (and also
> 	 understand your test setup).
Yes, these two patches primarily mitigate contention between block
allocations and between block allocation and release. The testing script
can be referenced from the fio script mentioned earlier in the email
footer. You can also add more truncate calls based on it.
> 2. I notice we are getting way less throughput in mb_optimize_scan? I
>     wonder why that is the case. Do you have some data on that? Are your
>     tests starting on an empty FS, maybe in that case linear scan works a
>     bit better since almost all groups are empty. If so, what are the
>     numbers like when we start with a fragmented FS?
The throughput of mb_optimize_scan is indeed much lower, and we continue
to optimize it, as mb_optimize_scan is the default mount option and
performs better in scenarios with large disks and high space usage.

Disk space used is about 7%; mb_optimize_scan should perform better with
less free space. However, this isn't the critical factor. The poor
throughput here is due to the following reasons.

One reason is that mb_optimize_scan's list traversal is unordered and
always selects the first group.

While traversing the list we hold a spinlock, which prevents loading the
buddy and makes direct use of ext4_lock_group() impossible. This can lead
to a "bouncing" scenario: the spin_is_locked(grp_A) check sees the group
as unlocked, but ext4_try_lock_group() then fails, forcing the list
traversal to repeatedly restart from grp_A.

In contrast, linear traversal directly uses ext4_try_lock_group(),
avoiding this bouncing. Therefore, we need a lockless, ordered traversal
to achieve linear-like efficiency.

Another reason is that opt_scan tends to allocate from groups that have
just received freed blocks, causing it to constantly "jump around"
between certain groups.

This leads to intense contention between allocation and release, and even
between release events. In contrast, linear traversal always moves forward
without revisiting groups, resulting in less contention between allocation
and release.

However, because linear traversal involves more groups in allocation, the
journal becomes a bottleneck. If opt_scan first attempts to traverse block groups
to the right from the target group in all lists, and then from index 0 to
the left in all lists, competition between block groups would be
significantly reduced.

To enable ordered traversal, we attempted to convert list_head to an
ordered xarray. This ordering prevents "bouncing" during retries.
Additionally, traversing all right-side groups before left-side groups
significantly reduced contention. Performance improved from 10430 to 17730.

However, xarray traversal introduces overhead; list_head group selection
was O(1), while xarray becomes O(n log n). This results in a ~10%
performance drop in single-process scenarios, and I'm not entirely sure if
this trade-off is worthwhile. 🤔
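
As a rough illustration, the ordered traversal we prototyped looks
something like the following (a simplified sketch only; s_mb_frag_xa is
a made-up field standing in for the real per-list xarrays):

static struct ext4_group_info *
mb_find_unlocked_group(struct super_block *sb, ext4_group_t goal_group)
{
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	struct ext4_group_info *grp;
	unsigned long idx;

	/* first scan the groups to the right of the goal group... */
	xa_for_each_start(&sbi->s_mb_frag_xa, idx, grp, goal_group) {
		if (!spin_is_locked(ext4_group_lock_ptr(sb, grp->bb_group)))
			return grp;
	}

	/* ...then wrap around and scan from group 0 up to the goal */
	if (goal_group > 0) {
		xa_for_each_range(&sbi->s_mb_frag_xa, idx, grp, 0,
				  goal_group - 1) {
			if (!spin_is_locked(ext4_group_lock_ptr(sb,
							grp->bb_group)))
				return grp;
		}
	}

	return NULL;
}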

Additionally, by attempting to merge before inserting in
ext4_mb_free_metadata(), we can eliminate contention on sbi->s_md_lock
during merges, resulting in roughly a 5% performance gain.
>
>     - Or maybe it is that the lazyinit thread has not yet initialized all
>     the buddies yet which means we have lesser BGs in the freefrag list
>     or the order list used by faster CRs. Hence, if they are locked we
>     are falling more to CR_GOAL_LEN_SLOW. To check if this is the case,
>     one hack is to cat /proc/fs/ext4/<disk>/mb_groups (or something along
>     the lines) before the benchmark, which forces init of all the group
>     buddies thus populating all the lists used by mb_opt_scan. Maybe we
>     can check if this gives better results.
All groups are already initialized at the time of testing, and that's not
where the problem lies.
>
> 3. Also, how much IO are we doing here, are we filling the whole FS?
>
In a single container, create a file, then repeatedly append 8KB using
fallocate until the file reaches 1MB. After that, truncate the file to
0 and continue appending 8KB with fallocate. The 64 containers will
occupy a maximum of 64MB of disk space in total, so they won't fill the
entire file system.


Cheers,
Baokun


======================== test script ========================

#!/bin/bash

dir="/tmp/test"
disk="/dev/sda"
numjobs=64
iodepth=128

mkdir -p $dir

for scan in 0 1 ; do
     mkfs.ext4 -F -E lazy_itable_init=0,lazy_journal_init=0 -O orphan_file $disk
     mount -o mb_optimize_scan=$scan $disk $dir

     fio -directory=$dir -direct=1 -iodepth ${iodepth} -thread -rw=write \
         -ioengine=falloc -bs=4k -fallocate=none -size=1G -numjobs=${numjobs} \
         -group_reporting -name=job1 -cpus_allowed_policy=split -file_append=1

     e2freefrag $disk
     e4defrag -c $dir # ** NOTE ** Without the patch, this could take 5-6 hours.
     filefrag ${dir}/job* | awk '{print $2}' | awk '{sum+=$1} END {print sum/NR}'
     umount $dir
done

======================== test results ========================

---------------------------------------------------------- ## base

------------------------### linear bw=217MiB/s (228MB/s)
------------ e2freefrag /dev/sda
Device: /dev/sda
Blocksize: 4096 bytes
Total blocks: 52428800
Free blocks: 34501259 (65.8%)

Min. free extent: 98172 KB
Max. free extent: 2064256 KB
Avg. free extent: 1943732 KB
Num. free extent: 71

HISTOGRAM OF FREE EXTENT SIZES:
Extent Size Range :  Free extents   Free Blocks  Percent
    64M...  128M-  :             2         49087    0.14%
   512M... 1024M-  :             3        646918    1.88%
     1G...    2G-  :            66      33805254   97.98%

------------ e4defrag -c /tmp/test
e4defrag 1.47.2 (1-Jan-2025)
<Fragmented files> now/best       size/ext
1. /tmp/test/job1.4.0                       262035/1 4 KB
2. /tmp/test/job1.2.0                       262034/1 4 KB
3. /tmp/test/job1.44.0                      262026/1 4 KB
4. /tmp/test/job1.15.0                      262025/1 4 KB
5. /tmp/test/job1.12.0                      262025/1 4 KB

  Total/best extents                             16760234/64
  Average size per extent                        4 KB
  Fragmentation score                            100
  [0-30 no problem: 31-55 a little bit fragmented: 56- needs defrag]
  This directory (/tmp/test) needs defragmentation.
  Done.

------------ filefrag /tmp/test/job* | awk '{print $2}' | awk '{sum+=$1} END {print sum/NR}'
261879

------------------------### opt_scan  bw=219MiB/s (230MB/s)
------------ e2freefrag /dev/sda
Device: /dev/sda
Blocksize: 4096 bytes
Total blocks: 52428800
Free blocks: 34501238 (65.8%)

Min. free extent: 98172 KB
Max. free extent: 2064256 KB
Avg. free extent: 1943728 KB
Num. free extent: 71

HISTOGRAM OF FREE EXTENT SIZES:
Extent Size Range :  Free extents   Free Blocks  Percent
    64M...  128M-  :             2         49087    0.14%
   512M... 1024M-  :             3        646897    1.87%
     1G...    2G-  :            66      33805254   97.98%

------------ e4defrag -c /tmp/test
e4defrag 1.47.2 (1-Jan-2025)
<Fragmented files> now/best       size/ext
1. /tmp/test/job1.57.0                      262084/1 4 KB
2. /tmp/test/job1.35.0                      262081/1 4 KB
3. /tmp/test/job1.45.0                      262080/1 4 KB
4. /tmp/test/job1.25.0                      262078/1 4 KB
5. /tmp/test/job1.11.0                      262077/1 4 KB

  Total/best extents                             16770469/64
  Average size per extent                        4 KB
  Fragmentation score                            100
  [0-30 no problem: 31-55 a little bit fragmented: 56- needs defrag]
  This directory (/tmp/test) needs defragmentation.
  Done.

------------ filefrag /tmp/test/job* | awk '{print $2}' | awk '{sum+=$1} END {print sum/NR}'
262039

==========================================================================================

---------------------------------------------------------- ## patched
------------------------ linear bw=5685MiB/s (5962MB/s)
------------ e2freefrag /dev/sda
Device: /dev/sda
Blocksize: 4096 bytes
Total blocks: 52428800
Free blocks: 34550601 (65.9%)

Min. free extent: 8832 KB
Max. free extent: 2064256 KB
Avg. free extent: 1439608 KB
Num. free extent: 96

HISTOGRAM OF FREE EXTENT SIZES:
Extent Size Range :  Free extents   Free Blocks  Percent
     8M...   16M-  :             2          5267    0.02%
    32M...   64M-  :             9        129695    0.38%
    64M...  128M-  :            17        409917    1.19%
   512M... 1024M-  :             3        716532    2.07%
     1G...    2G-  :            65      33289190   96.35%

------------ e4defrag -c /tmp/test
e4defrag 1.47.2 (1-Jan-2025)
<Fragmented files> now/best       size/ext
1. /tmp/test/job1.18.0                         984/1 1065 KB
2. /tmp/test/job1.37.0                         981/1 1068 KB
3. /tmp/test/job1.36.0                         980/1 1069 KB
4. /tmp/test/job1.27.0                         954/1 1099 KB
5. /tmp/test/job1.30.0                         954/1 1099 KB

  Total/best extents                47629/64
  Average size per extent            1408 KB
  Fragmentation score                2
  [0-30 no problem: 31-55 a little bit fragmented: 56- needs defrag]
  This directory (/tmp/test) does not need defragmentation.
  Done.

------------ filefrag /tmp/test/job* | awk '{print $2}' | awk '{sum+=$1} END {print sum/NR}'
744.203

------------------------ opt_scan  bw=5670MiB/s (5946MB/s)
------------ e2freefrag /dev/sda
Device: /dev/sda
Blocksize: 4096 bytes
Total blocks: 52428800
Free blocks: 34550296 (65.9%)

Min. free extent: 5452 KB
Max. free extent: 2064256 KB
Avg. free extent: 1368328 KB
Num. free extent: 101

HISTOGRAM OF FREE EXTENT SIZES:
Extent Size Range :  Free extents   Free Blocks  Percent
     4M...    8M-  :             4          5935    0.02%
     8M...   16M-  :             3          9929    0.03%
    16M...   32M-  :             4         21775    0.06%
    32M...   64M-  :            13        164831    0.48%
    64M...  128M-  :             9        189227    0.55%
   512M... 1024M-  :             2        457702    1.32%
     1G...    2G-  :            66      33700897   97.54%

------------ e4defrag -c /tmp/test
e4defrag 1.47.2 (1-Jan-2025)
<Fragmented files> now/best       size/ext
1. /tmp/test/job1.43.0                        4539/1 231 KB
2. /tmp/test/job1.5.0                         4446/1 235 KB
3. /tmp/test/job1.14.0                        3851/1 272 KB
4. /tmp/test/job1.3.0                         3682/1 284 KB
5. /tmp/test/job1.50.0                        3597/1 291 KB

  Total/best extents                133415/64
  Average size per extent            503 KB
  Fragmentation score                6
  [0-30 no problem: 31-55 a little bit fragmented: 56- needs defrag]
  This directory (/tmp/test) does not need defragmentation.
  Done.

------------ filefrag /tmp/test/job* | awk '{print $2}' | awk '{sum+=$1} END {print sum/NR}'
2084.61





* Re: [PATCH 2/4] ext4: move mb_last_[group|start] to ext4_inode_info
  2025-05-23  8:58 ` [PATCH 2/4] ext4: move mb_last_[group|start] to ext4_inode_info libaokun
@ 2025-05-29 12:56   ` Jan Kara
  2025-05-30  9:31     ` Baokun Li
  0 siblings, 1 reply; 22+ messages in thread
From: Jan Kara @ 2025-05-29 12:56 UTC (permalink / raw)
  To: libaokun
  Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, yi.zhang,
	yangerkun, libaokun1

On Fri 23-05-25 16:58:19, libaokun@huaweicloud.com wrote:
> From: Baokun Li <libaokun1@huawei.com>
> 
> After we optimized the block group lock, we found another lock
> contention issue when running will-it-scale/fallocate2 with multiple
> processes. The fallocate's block allocation and the truncate's block
> release were fighting over the s_md_lock. The problem is, this lock
> protects totally different things in those two processes: the list of
> freed data blocks (s_freed_data_list) when releasing, and where to start
> looking for new blocks (mb_last_[group|start]) when allocating.
> 
> Moreover, when allocating data blocks, if the first try (goal allocation)
> fails and stream allocation is on, it tries a global goal starting from
> the last group we used (s_mb_last_group). This can make things faster by
> writing blocks close together on the disk. But when many processes are
> allocating, they all fight over s_md_lock and might even try to use the
> same group. This makes it harder to merge extents and can make files more
> fragmented. If different processes allocate chunks of very different sizes,
> the free space on the disk can also get fragmented. A small allocation
> might fit in a partially full group, but a big allocation might have
> skipped it, leading to the small IO ending up in a more empty group.
> 
> So, we're changing stream allocation to work per inode. First, it tries
> the goal, then the last group where that inode successfully allocated a
> block. This keeps an inode's data closer together. Plus, after moving
> mb_last_[group|start] to ext4_inode_info, we don't need s_md_lock during
> block allocation anymore because we already have the write lock on
> i_data_sem. This gets rid of the contention between allocating and
> releasing blocks, which gives a huge performance boost to fallocate2.
> 
> Performance test data follows:
> 
> CPU: HUAWEI Kunpeng 920
> Memory: 480GB
> Disk: 480GB SSD SATA 3.2
> Test: Running will-it-scale/fallocate2 on 64 CPU-bound containers.
> Observation: Average fallocate operations per container per second.
> 
>                       base     patched
> mb_optimize_scan=0    6755     23280 (+244.6%)
> mb_optimize_scan=1    4302     10430 (+142.4%)
> 
> Signed-off-by: Baokun Li <libaokun1@huawei.com>

Good spotting with the s_md_lock contention here. But your changes don't
quite make sense to me. The idea of streaming allocation in mballoc is to
have an area of the filesystem for large files to reduce fragmentation.
When you switch to per-inode, this effect of packing large files together
goes away. Furthermore, for each inode either all allocations will very
likely be streaming or none will (the logic uses file size), so your
per-inode target will either be unused or just be another constantly used
copy of the goal value.

So I can see two sensible solutions here:
a) Drop streaming allocations support altogether.

b) Enhance streaming allocation support to avoid contention between
processes allocating in parallel and freeing. Frankly, there's no strong
reason why reads & writes of the streaming allocation goal need to use a
spinlock AFAICS. We could just store a physical block number and use
atomic64 accessors for it? Also having a single goal value is just causing
more contention on group locks for parallel writers that end up using it
(that's the problem I suspect you were hitting the most). So perhaps we
can keep multiple streaming goal slots in the superblock (scale the count
based on CPU count & filesystem group count) and just pick the slot based
on inode number hash to reduce contention?
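
Roughly something like this (just an illustrative sketch; s_mb_last_goal,
EXT4_MB_GOAL_SLOTS and ext4_mb_goal_slot() are made-up names):

#define EXT4_MB_GOAL_SLOTS	16	/* could scale with nr_cpu_ids etc. */

	/* in struct ext4_sb_info: */
	atomic64_t s_mb_last_goal[EXT4_MB_GOAL_SLOTS];

static inline atomic64_t *ext4_mb_goal_slot(struct ext4_sb_info *sbi,
					    struct inode *inode)
{
	return &sbi->s_mb_last_goal[hash_32(inode->i_ino,
					    ilog2(EXT4_MB_GOAL_SLOTS))];
}

	/* reader, no spinlock needed: */
	ext4_fsblk_t goal = atomic64_read(ext4_mb_goal_slot(sbi, inode));

	/* writer, after a successful stream allocation: */
	atomic64_set(ext4_mb_goal_slot(sbi, inode), block);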

								Honza

> ---
>  fs/ext4/ext4.h    |  7 ++++---
>  fs/ext4/mballoc.c | 20 +++++++++-----------
>  fs/ext4/super.c   |  2 ++
>  3 files changed, 15 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 9c665a620a46..16c14dd09df6 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1171,6 +1171,10 @@ struct ext4_inode_info {
>  	__u32 i_csum_seed;
>  
>  	kprojid_t i_projid;
> +
> +	/* where last allocation was done - for stream allocation */
> +	ext4_group_t i_mb_last_group;
> +	ext4_grpblk_t i_mb_last_start;
>  };
>  
>  /*
> @@ -1603,9 +1607,6 @@ struct ext4_sb_info {
>  	unsigned int s_mb_order2_reqs;
>  	unsigned int s_mb_group_prealloc;
>  	unsigned int s_max_dir_size_kb;
> -	/* where last allocation was done - for stream allocation */
> -	unsigned long s_mb_last_group;
> -	unsigned long s_mb_last_start;
>  	unsigned int s_mb_prefetch;
>  	unsigned int s_mb_prefetch_limit;
>  	unsigned int s_mb_best_avail_max_trim_order;
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 5c13d9f8a1cc..ee9696f9bac8 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -2138,7 +2138,6 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
>  static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
>  					struct ext4_buddy *e4b)
>  {
> -	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
>  	int ret;
>  
>  	BUG_ON(ac->ac_b_ex.fe_group != e4b->bd_group);
> @@ -2169,10 +2168,8 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
>  	folio_get(ac->ac_buddy_folio);
>  	/* store last allocated for subsequent stream allocation */
>  	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
> -		spin_lock(&sbi->s_md_lock);
> -		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
> -		sbi->s_mb_last_start = ac->ac_f_ex.fe_start;
> -		spin_unlock(&sbi->s_md_lock);
> +		EXT4_I(ac->ac_inode)->i_mb_last_group = ac->ac_f_ex.fe_group;
> +		EXT4_I(ac->ac_inode)->i_mb_last_start = ac->ac_f_ex.fe_start;
>  	}
>  	/*
>  	 * As we've just preallocated more space than
> @@ -2844,13 +2841,14 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>  							   MB_NUM_ORDERS(sb));
>  	}
>  
> -	/* if stream allocation is enabled, use global goal */
> +	/* if stream allocation is enabled, use last goal */
>  	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
> -		/* TBD: may be hot point */
> -		spin_lock(&sbi->s_md_lock);
> -		ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
> -		ac->ac_g_ex.fe_start = sbi->s_mb_last_start;
> -		spin_unlock(&sbi->s_md_lock);
> +		struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
> +
> +		if (ei->i_mb_last_group || ei->i_mb_last_start) {
> +			ac->ac_g_ex.fe_group = ei->i_mb_last_group;
> +			ac->ac_g_ex.fe_start = ei->i_mb_last_start;
> +		}
>  	}
>  
>  	/*
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 181934499624..6c49c43bb2cb 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1416,6 +1416,8 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
>  	INIT_WORK(&ei->i_rsv_conversion_work, ext4_end_io_rsv_work);
>  	ext4_fc_init_inode(&ei->vfs_inode);
>  	mutex_init(&ei->i_fc_lock);
> +	ei->i_mb_last_group = 0;
> +	ei->i_mb_last_start = 0;
>  	return &ei->vfs_inode;
>  }
>  
> -- 
> 2.46.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/4] ext4: get rid of some obsolete EXT4_MB_HINT flags
  2025-05-23  8:58 ` [PATCH 3/4] ext4: get rid of some obsolete EXT4_MB_HINT flags libaokun
  2025-05-28 15:10   ` Ojaswin Mujoo
@ 2025-05-29 12:57   ` Jan Kara
  1 sibling, 0 replies; 22+ messages in thread
From: Jan Kara @ 2025-05-29 12:57 UTC (permalink / raw)
  To: libaokun
  Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, yi.zhang,
	yangerkun, libaokun1

On Fri 23-05-25 16:58:20, libaokun@huaweicloud.com wrote:
> From: Baokun Li <libaokun1@huawei.com>
> 
> Since nobody has used these EXT4_MB_HINT flags for ages,
> let's remove them.
> 
> Signed-off-by: Baokun Li <libaokun1@huawei.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/ext4.h              | 6 ------
>  include/trace/events/ext4.h | 3 ---
>  2 files changed, 9 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 16c14dd09df6..f6d01702423d 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -185,14 +185,8 @@ enum criteria {
>  
>  /* prefer goal again. length */
>  #define EXT4_MB_HINT_MERGE		0x0001
> -/* blocks already reserved */
> -#define EXT4_MB_HINT_RESERVED		0x0002
> -/* metadata is being allocated */
> -#define EXT4_MB_HINT_METADATA		0x0004
>  /* first blocks in the file */
>  #define EXT4_MB_HINT_FIRST		0x0008
> -/* search for the best chunk */
> -#define EXT4_MB_HINT_BEST		0x0010
>  /* data is being allocated */
>  #define EXT4_MB_HINT_DATA		0x0020
>  /* don't preallocate (for tails) */
> diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
> index 156908641e68..33b204165cc0 100644
> --- a/include/trace/events/ext4.h
> +++ b/include/trace/events/ext4.h
> @@ -23,10 +23,7 @@ struct partial_cluster;
>  
>  #define show_mballoc_flags(flags) __print_flags(flags, "|",	\
>  	{ EXT4_MB_HINT_MERGE,		"HINT_MERGE" },		\
> -	{ EXT4_MB_HINT_RESERVED,	"HINT_RESV" },		\
> -	{ EXT4_MB_HINT_METADATA,	"HINT_MDATA" },		\
>  	{ EXT4_MB_HINT_FIRST,		"HINT_FIRST" },		\
> -	{ EXT4_MB_HINT_BEST,		"HINT_BEST" },		\
>  	{ EXT4_MB_HINT_DATA,		"HINT_DATA" },		\
>  	{ EXT4_MB_HINT_NOPREALLOC,	"HINT_NOPREALLOC" },	\
>  	{ EXT4_MB_HINT_GROUP_ALLOC,	"HINT_GRP_ALLOC" },	\
> -- 
> 2.46.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] ext4: fix typo in CR_GOAL_LEN_SLOW comment
  2025-05-23  8:58 ` [PATCH 4/4] ext4: fix typo in CR_GOAL_LEN_SLOW comment libaokun
  2025-05-28 15:11   ` Ojaswin Mujoo
@ 2025-05-29 12:57   ` Jan Kara
  1 sibling, 0 replies; 22+ messages in thread
From: Jan Kara @ 2025-05-29 12:57 UTC (permalink / raw)
  To: libaokun
  Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, yi.zhang,
	yangerkun, libaokun1

On Fri 23-05-25 16:58:21, libaokun@huaweicloud.com wrote:
> From: Baokun Li <libaokun1@huawei.com>
> 
> Remove the superfluous "find_".
> 
> Signed-off-by: Baokun Li <libaokun1@huawei.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/ext4.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index f6d01702423d..1182700901cc 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -157,7 +157,7 @@ enum criteria {
>  
>  	/*
>  	 * Reads each block group sequentially, performing disk IO if
> -	 * necessary, to find find_suitable block group. Tries to
> +	 * necessary, to find suitable block group. Tries to
>  	 * allocate goal length but might trim the request if nothing
>  	 * is found after enough tries.
>  	 */
> -- 
> 2.46.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/4] ext4: add ext4_try_lock_group() to skip busy groups
  2025-05-28 15:05   ` Ojaswin Mujoo
@ 2025-05-30  8:20     ` Baokun Li
  2025-06-10 12:07       ` Ojaswin Mujoo
  0 siblings, 1 reply; 22+ messages in thread
From: Baokun Li @ 2025-05-30  8:20 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, yi.zhang,
	yangerkun, libaokun, Baokun Li

On 2025/5/28 23:05, Ojaswin Mujoo wrote:
> On Fri, May 23, 2025 at 04:58:18PM +0800, libaokun@huaweicloud.com wrote:
>> From: Baokun Li <libaokun1@huawei.com>
>>
>> When ext4 allocates blocks, we used to just go through the block groups
>> one by one to find a good one. But when there are tons of block groups
>> (like hundreds of thousands or even millions) and not many have free space
>> (meaning they're mostly full), it takes a really long time to check them
>> all, and performance gets bad. So, we added the "mb_optimize_scan" mount
>> option (which is on by default now). It keeps track of some group lists,
>> so when we need a free block, we can just grab a likely group from the
>> right list. This saves time and makes block allocation much faster.
>>
>> But when multiple processes or containers are doing similar things, like
>> constantly allocating 8k blocks, they all try to use the same block group
>> in the same list. Even just two processes doing this can cut the IOPS in
>> half. For example, one container might do 300,000 IOPS, but if you run two
>> at the same time, the total is only 150,000.
>>
>> Since we can already look at block groups in a non-linear way, the first
>> and last groups in the same list are basically the same for finding a block
>> right now. Therefore, add an ext4_try_lock_group() helper function to skip
>> the current group when it is locked by another process, thereby avoiding
>> contention with other processes. This helps ext4 make better use of having
>> multiple block groups.
>>
>> Also, to make sure we don't skip all the groups that have free space
>> when allocating blocks, we won't try to skip busy groups anymore when
>> ac_criteria is CR_ANY_FREE.
>>
>> Performance test data follows:
>>
>> CPU: HUAWEI Kunpeng 920
>> Memory: 480GB
>> Disk: 480GB SSD SATA 3.2
>> Test: Running will-it-scale/fallocate2 on 64 CPU-bound containers.
>> Observation: Average fallocate operations per container per second.
>>
>>                        base    patched
>> mb_optimize_scan=0    3588    6755 (+88.2%)
>> mb_optimize_scan=1    3588    4302 (+19.8%)
> The patch looks mostly good. Same observations about mb_optimize_scan=1
> improving less. We can continue this discussion in my reply to the cover
> letter. That being said, I have some minor suggestions:
Thanks for the review!
>
>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
>> ---
>>   fs/ext4/ext4.h    | 23 ++++++++++++++---------
>>   fs/ext4/mballoc.c | 14 +++++++++++---
>>   2 files changed, 25 insertions(+), 12 deletions(-)
>>
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index 5a20e9cd7184..9c665a620a46 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -3494,23 +3494,28 @@ static inline int ext4_fs_is_busy(struct ext4_sb_info *sbi)
>>   	return (atomic_read(&sbi->s_lock_busy) > EXT4_CONTENTION_THRESHOLD);
>>   }
>>   
>> +static inline bool ext4_try_lock_group(struct super_block *sb, ext4_group_t group)
>> +{
>> +	if (!spin_trylock(ext4_group_lock_ptr(sb, group)))
>> +		return false;
>> +	/*
>> +	 * We're able to grab the lock right away, so drop the lock
>> +	 * contention counter.
>> +	 */
>> +	atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, -1, 0);
>> +	return true;
>> +}
>> +
>>   static inline void ext4_lock_group(struct super_block *sb, ext4_group_t group)
>>   {
>> -	spinlock_t *lock = ext4_group_lock_ptr(sb, group);
>> -	if (spin_trylock(lock))
>> -		/*
>> -		 * We're able to grab the lock right away, so drop the
>> -		 * lock contention counter.
>> -		 */
>> -		atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, -1, 0);
>> -	else {
>> +	if (!ext4_try_lock_group(sb, group)) {
>>   		/*
>>   		 * The lock is busy, so bump the contention counter,
>>   		 * and then wait on the spin lock.
>>   		 */
>>   		atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, 1,
>>   				  EXT4_MAX_CONTENTION);
>> -		spin_lock(lock);
>> +		spin_lock(ext4_group_lock_ptr(sb, group));
>>   	}
>>   }
>>   
>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>> index 1e98c5be4e0a..5c13d9f8a1cc 100644
>> --- a/fs/ext4/mballoc.c
>> +++ b/fs/ext4/mballoc.c
>> @@ -896,7 +896,8 @@ static void ext4_mb_choose_next_group_p2_aligned(struct ext4_allocation_context
>>   				    bb_largest_free_order_node) {
>>   			if (sbi->s_mb_stats)
>>   				atomic64_inc(&sbi->s_bal_cX_groups_considered[CR_POWER2_ALIGNED]);
>> -			if (likely(ext4_mb_good_group(ac, iter->bb_group, CR_POWER2_ALIGNED))) {
>> +			if (likely(ext4_mb_good_group(ac, iter->bb_group, CR_POWER2_ALIGNED)) &&
>> +			    !spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group))) {
> Maybe reversing the && order to be (!spin_is_locked() && ext4_mb_good_group()) would be better?
Yeah.
>>   				*group = iter->bb_group;
>>   				ac->ac_flags |= EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED;
>>   				read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
>> @@ -932,7 +933,8 @@ ext4_mb_find_good_group_avg_frag_lists(struct ext4_allocation_context *ac, int o
>>   	list_for_each_entry(iter, frag_list, bb_avg_fragment_size_node) {
>>   		if (sbi->s_mb_stats)
>>   			atomic64_inc(&sbi->s_bal_cX_groups_considered[cr]);
>> -		if (likely(ext4_mb_good_group(ac, iter->bb_group, cr))) {
>> +		if (likely(ext4_mb_good_group(ac, iter->bb_group, cr)) &&
>> +		    !spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group))) {
> same as above
Okay.
>   
>>   			grp = iter;
>>   			break;
>>   		}
>> @@ -2911,7 +2913,13 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>>   			if (err)
>>   				goto out;
>>   
>> -			ext4_lock_group(sb, group);
>> +			/* skip busy group */
>> +			if (cr >= CR_ANY_FREE) {
>> +				ext4_lock_group(sb, group);
>> +			} else if (!ext4_try_lock_group(sb, group)) {
>> +				ext4_mb_unload_buddy(&e4b);
>> +				continue;
>> +			}
> This in itself looks good. I am just thinking that now that we are
> deciding to skip locked groups, in the code above this one, shall we do
> something like:
>
>        
>        if (spin_is_locked(group_lock))
>          continue;
>        
>        err = ext4_mb_load_buddy(sb, group, &e4b);
>        if (err)
>          goto out;
>
>        /* skip busy group */
>        if (cr >= CR_ANY_FREE) {
>          ext4_lock_group(sb, group);
>        } else if (!ext4_try_lock_group(sb, group)) {
>          ext4_mb_unload_buddy(&e4b);
>          continue;
>        }
>
> With this we can even avoid loading the folio as well.
I previously assumed that for busy groups, the buddy was already loaded,
so reloading it would incur minimal overhead. However, I was mistaken.

After implementing this change, the proportion of time spent in
ext4_mb_load_buddy() decreased from 3.6% to 1.7%, resulting in
approximately a 2% performance improvement.

Thank you for your suggestion!
I will prevent unnecessary buddy loading in the next version.

Cheers,
Baokun


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/4] ext4: move mb_last_[group|start] to ext4_inode_info
  2025-05-29 12:56   ` Jan Kara
@ 2025-05-30  9:31     ` Baokun Li
  2025-06-02 15:44       ` Jan Kara
  0 siblings, 1 reply; 22+ messages in thread
From: Baokun Li @ 2025-05-30  9:31 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, tytso, adilger.kernel, linux-kernel, yi.zhang,
	yangerkun, libaokun, Baokun Li

Hello Honza,

Thank you for your review!

On 2025/5/29 20:56, Jan Kara wrote:
> On Fri 23-05-25 16:58:19, libaokun@huaweicloud.com wrote:
>> From: Baokun Li <libaokun1@huawei.com>
>>
>> After we optimized the block group lock, we found another lock
>> contention issue when running will-it-scale/fallocate2 with multiple
>> processes. The fallocate's block allocation and the truncate's block
>> release were fighting over the s_md_lock. The problem is, this lock
>> protects totally different things in those two processes: the list of
>> freed data blocks (s_freed_data_list) when releasing, and where to start
>> looking for new blocks (mb_last_[group|start]) when allocating.
>>
>> Moreover, when allocating data blocks, if the first try (goal allocation)
>> fails and stream allocation is on, it tries a global goal starting from
>> the last group we used (s_mb_last_group). This can make things faster by
>> writing blocks close together on the disk. But when many processes are
>> allocating, they all fight over s_md_lock and might even try to use the
>> same group. This makes it harder to merge extents and can make files more
>> fragmented. If different processes allocate chunks of very different sizes,
>> the free space on the disk can also get fragmented. A small allocation
>> might fit in a partially full group, but a big allocation might have
>> skipped it, leading to the small IO ending up in a more empty group.
>>
>> So, we're changing stream allocation to work per inode. First, it tries
>> the goal, then the last group where that inode successfully allocated a
>> block. This keeps an inode's data closer together. Plus, after moving
>> mb_last_[group|start] to ext4_inode_info, we don't need s_md_lock during
>> block allocation anymore because we already have the write lock on
>> i_data_sem. This gets rid of the contention between allocating and
>> releasing blocks, which gives a huge performance boost to fallocate2.
>>
>> Performance test data follows:
>>
>> CPU: HUAWEI Kunpeng 920
>> Memory: 480GB
>> Disk: 480GB SSD SATA 3.2
>> Test: Running will-it-scale/fallocate2 on 64 CPU-bound containers.
>> Observation: Average fallocate operations per container per second.
>>
>>                        base     patched
>> mb_optimize_scan=0    6755     23280 (+244.6%)
>> mb_optimize_scan=1    4302     10430 (+142.4%)
>>
>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> Good spotting with the s_md_lock contention here. But your changes don't
> quite make sense to me. The idea of streaming allocation in mballoc is to
> have an area of the filesystem for large files to reduce fragmentation. When
> you switch to per-inode, this effect of packing large files together goes
> away. Furthermore, for each inode either all allocations will be very likely
> streaming or not streaming (the logic uses file size) so either your
> per-inode target will be unused or just another constantly used copy of
> goal value.
Sorry, I didn't intend to break streaming allocation's semantics.
A precise definition of streaming allocation is not found in the
existing code, documentation, or historical records. Consequently,
my previous understanding of it was somewhat inaccurate.

I previously thought it was used to optimize the efficiency of linear
traversal. For instance, suppose 500 inodes are created in group 0 and each
file is sequentially filled to 1GB. Since each file starts out empty, its
goal would be group 1 (the second group in the inode's flex_bg).

Without a global goal and in the absence of non-linear traversal,
after the first inode is filled, the second inode would need to traverse
groups 1 through 8 to find its first free block.

This inefficiency escalates, eventually requiring the 500th inode to
potentially traverse almost 4000 block groups to find its first available
block.

I initially believed it could be split to the inode level to reduce
traversal time and file fragmentation. However, as you pointed out,
its purpose is to cluster large files, not data blocks within a file.
Given this, splitting it to the inode level no longer makes sense.
> So I can see two sensible solutions here:
> a) Drop streaming allocations support altogether.
As mentioned above, it can also greatly improve the efficiency of linear
traversal, so we can't simply remove it.
>
> b) Enhance streaming allocation support to avoid contention between
> processes allocating in parallel and freeing. Frankly, there's no strong
> reason why reads & writes of the streaming allocation goal need to use a
> spinlock AFAICS.
Yes, since it's just a hint, we don't need a lock at all; we don't even
need fe_start, just the last fe_group.
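
A minimal sketch of what I mean (untested; last_group_hint stands in
for wherever we decide to store the hint):

	/* writer, after a successful stream allocation */
	WRITE_ONCE(*last_group_hint, ac->ac_f_ex.fe_group);

	/* reader, when seeding the allocation goal */
	ac->ac_g_ex.fe_group = READ_ONCE(*last_group_hint);
	ac->ac_g_ex.fe_start = 0;
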
> We could just store a physical block number and use
> atomic64 accessors for it? Also, having a single goal value just causes
> more contention on group locks for parallel writers that end up using it
> (that's the problem I suspect you were hitting the most).
Spot on! We did try a single, lockless atomic64 variable, and just as
you pointed out, all processes started traversing from the very same
group, which just cranked up the contention, dropping OPS to just 6745.
>   So perhaps we
> can keep multiple streaming goal slots in the superblock (scale the count
> based on CPU count & filesystem group count) and just pick the slot based
> on inode number hash to reduce contention?
>
> 								Honza
That's a brilliant idea, actually!

Since most containers are CPU-pinned, this would naturally cluster a single
container's data blocks together. I believe we can also apply this to inode
allocation, so a container's inodes and data are all in a single region,
significantly reducing interference between containers.

My gratitude for your valuable suggestion!
I'm going to try out the CPU bucketing approach.

Regards,
Baokun


>> ---
>>   fs/ext4/ext4.h    |  7 ++++---
>>   fs/ext4/mballoc.c | 20 +++++++++-----------
>>   fs/ext4/super.c   |  2 ++
>>   3 files changed, 15 insertions(+), 14 deletions(-)
>>
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index 9c665a620a46..16c14dd09df6 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -1171,6 +1171,10 @@ struct ext4_inode_info {
>>   	__u32 i_csum_seed;
>>   
>>   	kprojid_t i_projid;
>> +
>> +	/* where last allocation was done - for stream allocation */
>> +	ext4_group_t i_mb_last_group;
>> +	ext4_grpblk_t i_mb_last_start;
>>   };
>>   
>>   /*
>> @@ -1603,9 +1607,6 @@ struct ext4_sb_info {
>>   	unsigned int s_mb_order2_reqs;
>>   	unsigned int s_mb_group_prealloc;
>>   	unsigned int s_max_dir_size_kb;
>> -	/* where last allocation was done - for stream allocation */
>> -	unsigned long s_mb_last_group;
>> -	unsigned long s_mb_last_start;
>>   	unsigned int s_mb_prefetch;
>>   	unsigned int s_mb_prefetch_limit;
>>   	unsigned int s_mb_best_avail_max_trim_order;
>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>> index 5c13d9f8a1cc..ee9696f9bac8 100644
>> --- a/fs/ext4/mballoc.c
>> +++ b/fs/ext4/mballoc.c
>> @@ -2138,7 +2138,6 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
>>   static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
>>   					struct ext4_buddy *e4b)
>>   {
>> -	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
>>   	int ret;
>>   
>>   	BUG_ON(ac->ac_b_ex.fe_group != e4b->bd_group);
>> @@ -2169,10 +2168,8 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
>>   	folio_get(ac->ac_buddy_folio);
>>   	/* store last allocated for subsequent stream allocation */
>>   	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
>> -		spin_lock(&sbi->s_md_lock);
>> -		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
>> -		sbi->s_mb_last_start = ac->ac_f_ex.fe_start;
>> -		spin_unlock(&sbi->s_md_lock);
>> +		EXT4_I(ac->ac_inode)->i_mb_last_group = ac->ac_f_ex.fe_group;
>> +		EXT4_I(ac->ac_inode)->i_mb_last_start = ac->ac_f_ex.fe_start;
>>   	}
>>   	/*
>>   	 * As we've just preallocated more space than
>> @@ -2844,13 +2841,14 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>>   							   MB_NUM_ORDERS(sb));
>>   	}
>>   
>> -	/* if stream allocation is enabled, use global goal */
>> +	/* if stream allocation is enabled, use last goal */
>>   	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
>> -		/* TBD: may be hot point */
>> -		spin_lock(&sbi->s_md_lock);
>> -		ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
>> -		ac->ac_g_ex.fe_start = sbi->s_mb_last_start;
>> -		spin_unlock(&sbi->s_md_lock);
>> +		struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
>> +
>> +		if (ei->i_mb_last_group || ei->i_mb_last_start) {
>> +			ac->ac_g_ex.fe_group = ei->i_mb_last_group;
>> +			ac->ac_g_ex.fe_start = ei->i_mb_last_start;
>> +		}
>>   	}
>>   
>>   	/*
>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>> index 181934499624..6c49c43bb2cb 100644
>> --- a/fs/ext4/super.c
>> +++ b/fs/ext4/super.c
>> @@ -1416,6 +1416,8 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
>>   	INIT_WORK(&ei->i_rsv_conversion_work, ext4_end_io_rsv_work);
>>   	ext4_fc_init_inode(&ei->vfs_inode);
>>   	mutex_init(&ei->i_fc_lock);
>> +	ei->i_mb_last_group = 0;
>> +	ei->i_mb_last_start = 0;
>>   	return &ei->vfs_inode;
>>   }
>>   
>> -- 
>> 2.46.1
>>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/4] ext4: move mb_last_[group|start] to ext4_inode_info
  2025-05-30  9:31     ` Baokun Li
@ 2025-06-02 15:44       ` Jan Kara
  2025-06-04  8:13         ` Baokun Li
  0 siblings, 1 reply; 22+ messages in thread
From: Jan Kara @ 2025-06-02 15:44 UTC (permalink / raw)
  To: Baokun Li
  Cc: Jan Kara, linux-ext4, tytso, adilger.kernel, linux-kernel,
	yi.zhang, yangerkun, libaokun

Hello!

On Fri 30-05-25 17:31:48, Baokun Li wrote:
> On 2025/5/29 20:56, Jan Kara wrote:
> > On Fri 23-05-25 16:58:19, libaokun@huaweicloud.com wrote:
> > > From: Baokun Li <libaokun1@huawei.com>
> > > 
> > > After we optimized the block group lock, we found another lock
> > > contention issue when running will-it-scale/fallocate2 with multiple
> > > processes. The fallocate's block allocation and the truncate's block
> > > release were fighting over the s_md_lock. The problem is, this lock
> > > protects totally different things in those two processes: the list of
> > > freed data blocks (s_freed_data_list) when releasing, and where to start
> > > looking for new blocks (mb_last_[group|start]) when allocating.
> > > 
> > > Moreover, when allocating data blocks, if the first try (goal allocation)
> > > fails and stream allocation is on, it tries a global goal starting from
> > > the last group we used (s_mb_last_group). This can make things faster by
> > > writing blocks close together on the disk. But when many processes are
> > > allocating, they all fight over s_md_lock and might even try to use the
> > > same group. This makes it harder to merge extents and can make files more
> > > fragmented. If different processes allocate chunks of very different sizes,
> > > the free space on the disk can also get fragmented. A small allocation
> > > might fit in a partially full group, but a big allocation might have
> > > skipped it, leading to the small IO ending up in a more empty group.
> > > 
> > > So, we're changing stream allocation to work per inode. First, it tries
> > > the goal, then the last group where that inode successfully allocated a
> > > block. This keeps an inode's data closer together. Plus, after moving
> > > mb_last_[group|start] to ext4_inode_info, we don't need s_md_lock during
> > > block allocation anymore because we already have the write lock on
> > > i_data_sem. This gets rid of the contention between allocating and
> > > releasing blocks, which gives a huge performance boost to fallocate2.
> > > 
> > > Performance test data follows:
> > > 
> > > CPU: HUAWEI Kunpeng 920
> > > Memory: 480GB
> > > Disk: 480GB SSD SATA 3.2
> > > Test: Running will-it-scale/fallocate2 on 64 CPU-bound containers.
> > > Observation: Average fallocate operations per container per second.
> > > 
> > >                        base     patched
> > > mb_optimize_scan=0    6755     23280 (+244.6%)
> > > mb_optimize_scan=1    4302     10430 (+142.4%)
> > > 
> > > Signed-off-by: Baokun Li <libaokun1@huawei.com>
> > Good spotting with the s_md_lock contention here. But your changes don't
> > quite make sense to me. The idea of streaming allocation in mballoc is to
> > have an area of the filesystem for large files to reduce fragmentation. When
> > you switch to per-inode, this effect of packing large files together goes
> > away. Furthermore, for each inode either all allocations will be very likely
> > streaming or not streaming (the logic uses file size) so either your
> > per-inode target will be unused or just another constantly used copy of
> > goal value.
> Sorry, I didn't intend to break streaming allocation's semantics.
> A precise definition of streaming allocation is not found in the
> existing code, documentation, or historical records. Consequently,
> my previous understanding of it was somewhat inaccurate.
> 
> I previously thought it was used to optimize the efficiency of linear
> traversal. For instance, suppose 500 inodes are created in group 0 and each
> file is sequentially filled to 1GB. Since each file starts out empty, its
> goal would be group 1 (the second group in the inode's flex_bg).
> 
> Without a global goal and in the absence of non-linear traversal,
> after the first inode is filled, the second inode would need to traverse
> groups 1 through 8 to find its first free block.
> 
> This inefficiency escalates, eventually requiring the 500th inode to
> potentially traverse almost 4000 block groups to find its first available
> block.

I see. But doesn't ext4_mb_choose_next_group() usually select a group from
which allocation can succeed instead of linearly scanning through all the
groups? The linear scan is just a last resort as far as I remember. Anyway,
I'm not 100% sure what the original motivation for the streaming goal was.
Maybe Andreas would remember since he was involved in the design. What I
wrote is mostly derived from a general understanding of mballoc's operating
principles, but I could be wrong.

> I initially believed it could be split to the inode level to reduce
> traversal time and file fragmentation. However, as you pointed out,
> its purpose is to cluster large files, not data blocks within a file.
> Given this, splitting it to the inode level no longer makes sense.
> > So I can see two sensible solutions here:
> > a) Drop streaming allocations support altogether.
> As mentioned above, it can also greatly improve the efficiency of linear
> traversal, so we can't simply remove it.
> > 
> > b) Enhance streaming allocation support to avoid contention between
> > processes allocating in parallel and freeing. Frankly, there's no strong
> > reason why reads & writes of the streaming allocation goal need to use a
> > spinlock AFAICS.
> Yes, since it's just a hint, we don't need a lock at all; we don't even
> need fe_start, just the last fe_group.
> > We could just store a physical block number and use
> > atomic64 accessors for it? Also, having a single goal value just causes
> > more contention on group locks for parallel writers that end up using it
> > (that's the problem I suspect you were hitting the most).
> Spot on! We did try a single, lockless atomic64 variable, and just as
> you pointed out, all processes started traversing from the very same
> group, which just cranked up the contention, dropping OPS to just 6745.
> >   So perhaps we
> > can keep multiple streaming goal slots in the superblock (scale the count
> > based on CPU count & filesystem group count) and just pick the slot based
> > on inode number hash to reduce contention?
> > 
> > 								Honza
> That's a brilliant idea, actually!
> 
> Since most containers are CPU-pinned, this would naturally cluster a single
> container's data blocks together. I believe we can also apply this to inode
> allocation, so a container's inodes and data are all in a single region,
> significantly reducing interference between containers.
> 
> My gratitude for your valuable suggestion!
> I'm going to try out the CPU bucketing approach.

Cool, let's see how it works out :).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/4] ext4: move mb_last_[group|start] to ext4_inode_info
  2025-06-02 15:44       ` Jan Kara
@ 2025-06-04  8:13         ` Baokun Li
  0 siblings, 0 replies; 22+ messages in thread
From: Baokun Li @ 2025-06-04  8:13 UTC (permalink / raw)
  To: Jan Kara, adilger.kernel
  Cc: linux-ext4, tytso, linux-kernel, yi.zhang, yangerkun, libaokun,
	Baokun Li

On 2025/6/2 23:44, Jan Kara wrote:
> Hello!
>
> On Fri 30-05-25 17:31:48, Baokun Li wrote:
>> On 2025/5/29 20:56, Jan Kara wrote:
>>> On Fri 23-05-25 16:58:19, libaokun@huaweicloud.com wrote:
>>>> From: Baokun Li <libaokun1@huawei.com>
>>>>
>>>> After we optimized the block group lock, we found another lock
>>>> contention issue when running will-it-scale/fallocate2 with multiple
>>>> processes. The fallocate's block allocation and the truncate's block
>>>> release were fighting over the s_md_lock. The problem is, this lock
>>>> protects totally different things in those two processes: the list of
>>>> freed data blocks (s_freed_data_list) when releasing, and where to start
>>>> looking for new blocks (mb_last_[group|start]) when allocating.
>>>>
>>>> Moreover, when allocating data blocks, if the first try (goal allocation)
>>>> fails and stream allocation is on, it tries a global goal starting from
>>>> the last group we used (s_mb_last_group). This can make things faster by
>>>> writing blocks close together on the disk. But when many processes are
>>>> allocating, they all fight over s_md_lock and might even try to use the
>>>> same group. This makes it harder to merge extents and can make files more
>>>> fragmented. If different processes allocate chunks of very different sizes,
>>>> the free space on the disk can also get fragmented. A small allocation
>>>> might fit in a partially full group, but a big allocation might have
>>>> skipped it, leading to the small IO ending up in a more empty group.
>>>>
>>>> So, we're changing stream allocation to work per inode. First, it tries
>>>> the goal, then the last group where that inode successfully allocated a
>>>> block. This keeps an inode's data closer together. Plus, after moving
>>>> mb_last_[group|start] to ext4_inode_info, we don't need s_md_lock during
>>>> block allocation anymore because we already have the write lock on
>>>> i_data_sem. This gets rid of the contention between allocating and
>>>> releasing blocks, which gives a huge performance boost to fallocate2.
>>>>
>>>> Performance test data follows:
>>>>
>>>> CPU: HUAWEI Kunpeng 920
>>>> Memory: 480GB
>>>> Disk: 480GB SSD SATA 3.2
>>>> Test: Running will-it-scale/fallocate2 on 64 CPU-bound containers.
>>>> Observation: Average fallocate operations per container per second.
>>>>
>>>>                         base     patched
>>>> mb_optimize_scan=0    6755     23280 (+244.6%)
>>>> mb_optimize_scan=1    4302     10430 (+142.4%)
>>>>
>>>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
>>> Good spotting with the s_md_lock contention here. But your changes don't
>>> quite make sense to me. The idea of streaming allocation in mballoc is to
>>> have an area of the filesystem for large files to reduce fragmentation. When
>>> you switch to per-inode, this effect of packing large files together goes
>>> away. Furthermore, for each inode either all allocations will be very likely
>>> streaming or not streaming (the logic uses file size) so either your
>>> per-inode target will be unused or just another constantly used copy of
>>> goal value.
>> Sorry, I didn't intend to break streaming allocation's semantics.
>> A precise definition of streaming allocation is not found in the
>> existing code, documentation, or historical records. Consequently,
>> my previous understanding of it was somewhat inaccurate.
>>
>> I previously thought it was used to optimize the efficiency of linear
>> traversal. For instance, suppose 500 inodes are created in group 0 and each
>> file is sequentially filled to 1GB. Since each file starts out empty, its
>> goal would be group 1 (the second group in the inode's flex_bg).
>>
>> Without a global goal and in the absence of non-linear traversal,
>> after the first inode is filled, the second inode would need to traverse
>> groups 1 through 8 to find its first free block.
>>
>> This inefficiency escalates, eventually requiring the 500th inode to
>> potentially traverse almost 4000 block groups to find its first available
>> block.
> I see. But doesn't ext4_mb_choose_next_group() usually select a group from
> which allocation can succeed instead of linearly scanning through all the
> groups? The linear scan is just a last resort as far as I remember.
Yes, that's right. I was referring to how streaming allocation worked
before mb_optimize_scan was introduced, when we only had linear traversal.

Because mb_optimize_scan's traversal is unordered, it doesn't encounter
this problem. But linear traversal is still needed when criteria are
CR_GOAL_LEN_SLOW or CR_ANY_FREE, or when s_mb_max_linear_groups is
specified.
> Anyway,
> I'm not 100% sure what the original motivation for the streaming goal was.
> Maybe Andreas would remember since he was involved in the design. What I
> wrote is mostly derived from a general understanding of mballoc's operating
> principles, but I could be wrong.
Hey Andreas, do you happen to remember what the initial thinking was
behind bringing in the streaming goal (EXT4_MB_STREAM_ALLOC)?
>> I initially believed it could be split to the inode level to reduce
>> traversal time and file fragmentation. However, as you pointed out,
>> its purpose is to cluster large files, not data blocks within a file.
>> Given this, splitting it to the inode level no longer makes sense.
>>> So I can see two sensible solutions here:
>>> a) Drop streaming allocations support altogether.
>> As mentioned above, it can also greatly improve the efficiency of linear
>> traversal, so we can't simply remove it.
>>> b) Enhance streaming allocation support to avoid contention between
>>> processes allocating in parallel and freeing. Frankly, there's no strong
>>> reason why reads & writes of the streaming allocation goal need to use a
>>> spinlock AFAICS.
>> Yes, since it's just a hint, we don't need a lock at all; we don't even
>> need fe_start, just the last fe_group.
>>> We could just store a physical block number and use
>>> atomic64 accessors for it? Also, having a single goal value just causes
>>> more contention on group locks for parallel writers that end up using it
>>> (that's the problem I suspect you were hitting the most).
>> Spot on! We did try a single, lockless atomic64 variable, and just as
>> you pointed out, all processes started traversing from the very same
>> group, which just cranked up the contention, dropping OPS to just 6745.
>>>    So perhaps we
>>> can keep multiple streaming goal slots in the superblock (scale the count
>>> based on CPU count & filesystem group count) and just pick the slot based
>>> on inode number hash to reduce contention?
>>>
>>> 								Honza
>> That's a brilliant idea, actually!
>>
>> Since most containers are CPU-pinned, this would naturally cluster a single
>> container's data blocks together. I believe we can also apply this to inode
>> allocation, so a container's inodes and data are all in a single region,
>> significantly reducing interference between containers.
>>
>> My gratitude for your valuable suggestion!
>> I'm going to try out the CPU bucketing approach.
> Cool, let's see how it works out :).

Initially, I tried defining a per-CPU variable where each process would
use the goal corresponding to its current CPU. In CPU-pinned scenarios,
this resulted in almost no interference between CPUs. However, when
writing to the same file sequentially in a non-CPU-pinned environment,
the data blocks could become highly fragmented.

Therefore, I switched to defining a regular array whose length is the
number of CPUs rounded up to the nearest power of 2. By taking the inode
number modulo the array length, we can get the corresponding goal. This
guarantees that the same file consistently uses the same goal. Its
performance is largely similar to the per-inode i_mb_last_group approach,
but it's more memory-efficient. I'll be switching to this method in the
next version.
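
A rough sketch of the idea (illustrative only; the final field names
may differ):

	/* in struct ext4_sb_info; length is nr_cpu_ids rounded up to 2^n */
	unsigned int s_mb_nr_global_goals;
	ext4_group_t *s_mb_last_groups;

	/* the same inode always hashes to the same goal slot */
	static inline ext4_group_t *
	ext4_mb_goal_slot(struct ext4_sb_info *sbi, struct inode *inode)
	{
		return &sbi->s_mb_last_groups[inode->i_ino &
					      (sbi->s_mb_nr_global_goals - 1)];
	}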

Thanks again for your suggestion!


Cheers,
Baokun


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/4] ext4: better scalability for ext4 block allocation
  2025-05-29 12:24   ` Baokun Li
@ 2025-06-10 12:06     ` Ojaswin Mujoo
  2025-06-10 13:48       ` Baokun Li
  0 siblings, 1 reply; 22+ messages in thread
From: Ojaswin Mujoo @ 2025-06-10 12:06 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, yi.zhang,
	yangerkun, libaokun

On Thu, May 29, 2025 at 08:24:14PM +0800, Baokun Li wrote:
> On 2025/5/28 22:53, Ojaswin Mujoo wrote:
> > On Fri, May 23, 2025 at 04:58:17PM +0800, libaokun@huaweicloud.com wrote:
> > > From: Baokun Li <libaokun1@huawei.com>

<...>

> > > |--------|--------|--------|--------|--------|--------|--------|--------|
> > > |    -   |    1   |    2   |    4   |    8   |   16   |   32   |   64   |
> > > |--------|--------|--------|--------|--------|--------|--------|--------|
> > > |  base  | 295287 | 70665  | 33865  | 19387  | 10104  |  5588  |  3588  |
> > > |--------|--------|--------|--------|--------|--------|--------|--------|
> > > | linear | 286328 | 123102 | 119542 | 90653  | 60344  | 35302  | 23280  |
> > > |        | -3.0%  | 74.20% | 252.9% | 367.5% | 497.2% | 531.6% | 548.7% |
> > > |--------|--------|--------|--------|--------|--------|--------|--------|
> > > |mb_optim| 292498 | 133305 | 103069 | 61727  | 29702  | 16845  | 10430  |
> > > |ize_scan| -0.9%  | 88.64% | 204.3% | 218.3% | 193.9% | 201.4% | 190.6% |
> > > |--------|--------|--------|--------|--------|--------|--------|--------|
> > Hey Baokun, nice improvements! The proposed changes make sense to me,
> > however I suspect the performance improvements may come at a cost of
> > slight increase in fragmentation, which might affect rotational disks
> > especially. Maybe comparing e2freefrag numbers with and without the
> > patches might give a better insight into this.
> While this approach might slightly increase free space fragmentation on
> the disk, it significantly reduces file fragmentation, leading to faster
> read speeds on rotational disks.
> 
> When multiple processes contend for free blocks within the same block
> group, the probability of blocks allocated by the same process being
> merged on consecutive allocations is low. This is because other processes
> may preempt the free blocks immediately following the current process's
> last allocated region.
> 
> Normally, we rely on preallocation to avoid files becoming overly
> fragmented (even though preallocation itself can cause fragmentation in
> free disk space). But since fallocate doesn't support preallocation, the
> fragmentation issue is more pronounced. Counterintuitively, skipping busy
> groups actually boosts opportunities for file extent merging, which in turn
> reduces overall file fragmentation.
> 
> Referencing will-it-scale/fallocate2, I tested 64 processes each appending
> 4KB via fallocate to 64 separate files until they reached 1GB. This test
> specifically examines contention in block allocation, unlike fallocate2,
> it omits the contention between fallocate and truncate. Preliminary results
> are provided below; detailed scripts and full test outcomes are attached in
> the email footer.
> 
> ----------------------------------------------------------
>                      |       base      |      patched    |
> ---------------------|--------|--------|--------|--------|
> mb_optimize_scan     | linear |opt_scan| linear |opt_scan|
> ---------------------|--------|--------|--------|--------|
> bw(MiB/s)            | 217    | 219    | 5685   | 5670   |
> Avg. free extent size| 1943732| 1943728| 1439608| 1368328|
> Avg. extents per file| 261879 | 262039 | 744    | 2084   |
> Avg. size per extent | 4      | 4      | 1408   | 503    |
> Fragmentation score  | 100    | 100    | 2      | 6      |
> ----------------------------------------------------------

Hi Baokun,

Thanks for the info and data, and apologies for being late; I caught a
viral infection last week :/

These numbers look pretty interesting and your explanation of why the
fragmentation is better makes sense. It is definitely a win-win then for 
performance and fragmentation!

> > Regardless the performance benefits are significant and I feel it is
> > good to have these patches.
> > 
> > I'll give my reviews individually as I'm still going through patch 2
> > However, I wanted to check on a couple things:
> Okay, thank you for your feedback.
> > 
> > 1. I believe you ran these in docker. Would you have any script etc open
> >     sourced that I can use to run some benchmarks on my end (and also
> > 	 understand your test setup).
> Yes, these two patches primarily mitigate contention between block
> allocations and between block allocation and release. The testing script
> can be referenced from the fio script mentioned earlier in the email
> footer. You can also add more truncate calls based on it.

Thanks for the scripts.

> > 2. I notice we are getting way less throughput in mb_optimize_scan? I
> >     wonder why that is the case. Do you have some data on that? Are your
> >     tests starting on an empty FS, maybe in that case linear scan works a
> >     bit better since almost all groups are empty. If so, what are the
> >     numbers like when we start with a fragmented FS?
> The throughput of mb_optimize_scan is indeed much lower, and we continue
> to optimize it, as mb_optimize_scan is the default mount option and
> performs better in scenarios with large volume disks and high space usage.
> 
> Disk space used is about 7%; mb_optimize_scan should perform better with
> less free space. However, this isn't the critical factor. The poor
> throughput here is due to the following reasons.
> 
> One reason is that mb_optimize_scan's list traversal is unordered and
> always selects the first group.
> 
> While traversing the list, holding a spin_lock prevents load_buddy, making
> direct use of ext4_lock_group impossible. This can lead to a "bouncing"
> scenario where spin_is_locked(grp_A) succeeds, but ext4_try_lock_group()
> fails, forcing the list traversal to repeatedly restart from grp_A.
> 
> In contrast, linear traversal directly uses ext4_try_lock_group(),
> avoiding this bouncing. Therefore, we need a lockless, ordered traversal
> to achieve linear-like efficiency.

Hmm, right, the non-ordered traversal has led to other issues as well in
the past.

> 
> Another reason is that opt_scan tends to allocate from groups that have
> just received freed blocks, causing it to constantly "jump around"
> between certain groups.
> 
> This leads to intense contention between allocation and release, and even
> between release events. In contrast, linear traversal always moves forward
> without revisiting groups, resulting in less contention between allocation
> and release.

By "just received freed blocks", you mean that blocks got freed in the
group, right? I was under the impression that when we free blocks and the
group's largest order / avg fragment changes, the group is added to the end
of the free fragment list / order list, so it should be the last to be
picked. Is that not the case?

> 
> However, because linear involves more groups in allocation, journal
> becomes a bottleneck. If opt_scan first attempts to traverse block groups
> to the right from the target group in all lists, and then from index 0 to
> the left in all lists, competition between block groups would be
> significantly reduced.
> 
> To enable ordered traversal, we attempted to convert list_head to an
> ordered xarray. This ordering prevents "bouncing" during retries.
> Additionally, traversing all right-side groups before left-side groups
> significantly reduced contention. Performance improved from 10430 to 17730.

Do you maybe have some code for this that you can share?

> 
> However, xarray traversal introduces overhead; list_head group selection
> was O(1), while xarray becomes O(n log n). This results in a ~10%
> performance drop in single-process scenarios, and I'm not entirely sure if
> this trade-off is worthwhile. 🤔
> 
> Additionally, by attempting to merge before inserting in
> ext4_mb_free_metadata(), we can eliminate contention on sbi->s_md_lock
> during merges, resulting in roughly a 5% performance gain.
> > 
> >     - Or maybe it is that the lazyinit thread has not yet initialized all
> >     the buddies yet which means we have lesser BGs in the freefrag list
> >     or the order list used by faster CRs. Hence, if they are locked we
> >     are falling more to CR_GOAL_LEN_SLOW. To check if this is the case,
> >     one hack is to cat /proc/fs/ext4/<disk>/mb_groups (or something along
> >     the lines) before the benchmark, which forces init of all the group
> >     buddies thus populating all the lists used by mb_opt_scan. Maybe we
> >     can check if this gives better results.
> All groups are already initialized at the time of testing, and that's not
> where the problem lies.
> > 
> > 3. Also, how much IO are we doing here, are we filling the whole FS?
> > 
> In a single container, create a file, then repeatedly append 8KB using
> fallocate until the file reaches 1MB. After that, truncate the file to
> 0 and continue appending 8KB with fallocate. The 64 containers will
> occupy a maximum of 64MB of disk space in total, so they won't fill the
> entire file system.

Also, as per your theory, if we do similar experiments in a fragmented
FS we should see opt_scan perform better right? I'd like to test this as
well. I'll try to play with the scripts you have shared.

Thanks again for the detailed response and scripts!

Regards,
ojaswin

> 
> Cheers,
> Baokun
> 
> 
> ======================== test script ========================
> 
> #!/bin/bash
> 
> dir="/tmp/test"
> disk="/dev/sda"
> numjobs=64
> iodepth=128
> 
> mkdir -p $dir
> 
> for scan in 0 1 ; do
>     mkfs.ext4 -F -E lazy_itable_init=0,lazy_journal_init=0 -O orphan_file $disk
>     mount -o mb_optimize_scan=$scan $disk $dir
> 
>     fio -directory=$dir -direct=1 -iodepth ${iodepth} -thread -rw=write \
>         -ioengine=falloc -bs=4k -fallocate=none -size=1G -numjobs=${numjobs} \
>         -group_reporting -name=job1 -cpus_allowed_policy=split -file_append=1
> 
>     e2freefrag $disk
>     e4defrag -c $dir # ** NOTE ** Without the patch, this could take 5-6 hours.
>     filefrag ${dir}/job* | awk '{print $2}' | awk '{sum+=$1} END {print sum/NR}'
>     umount $dir
> done
> 

<snip>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/4] ext4: add ext4_try_lock_group() to skip busy groups
  2025-05-30  8:20     ` Baokun Li
@ 2025-06-10 12:07       ` Ojaswin Mujoo
  0 siblings, 0 replies; 22+ messages in thread
From: Ojaswin Mujoo @ 2025-06-10 12:07 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, yi.zhang,
	yangerkun, libaokun

On Fri, May 30, 2025 at 04:20:44PM +0800, Baokun Li wrote:
> On 2025/5/28 23:05, Ojaswin Mujoo wrote:
> > On Fri, May 23, 2025 at 04:58:18PM +0800, libaokun@huaweicloud.com wrote:
> > > From: Baokun Li <libaokun1@huawei.com>
> > > 
> > > When ext4 allocates blocks, we used to just go through the block groups
> > > one by one to find a good one. But when there are tons of block groups
> > > (like hundreds of thousands or even millions) and not many have free space
> > > (meaning they're mostly full), it takes a really long time to check them
> > > all, and performance gets bad. So, we added the "mb_optimize_scan" mount
> > > option (which is on by default now). It keeps track of some group lists,
> > > so when we need a free block, we can just grab a likely group from the
> > > right list. This saves time and makes block allocation much faster.
> > > 
> > > But when multiple processes or containers are doing similar things, like
> > > constantly allocating 8k blocks, they all try to use the same block group
> > > in the same list. Even just two processes doing this can cut the IOPS in
> > > half. For example, one container might do 300,000 IOPS, but if you run two
> > > at the same time, the total is only 150,000.
> > > 
> > > Since we can already look at block groups in a non-linear way, the first
> > > and last groups in the same list are basically the same for finding a block
> > > right now. Therefore, add an ext4_try_lock_group() helper function to skip
> > > the current group when it is locked by another process, thereby avoiding
> > > contention with other processes. This helps ext4 make better use of having
> > > multiple block groups.
> > > 
> > > Also, to make sure we don't skip all the groups that have free space
> > > when allocating blocks, we won't try to skip busy groups anymore when
> > > ac_criteria is CR_ANY_FREE.
> > > 
> > > Performance test data follows:
> > > 
> > > CPU: HUAWEI Kunpeng 920
> > > Memory: 480GB
> > > Disk: 480GB SSD SATA 3.2
> > > Test: Running will-it-scale/fallocate2 on 64 CPU-bound containers.
> > > Observation: Average fallocate operations per container per second.
> > > 
> > >                        base    patched
> > > mb_optimize_scan=0    3588    6755 (+88.2%)
> > > mb_optimize_scan=1    3588    4302 (+19.8%)
> > The patch looks mostly good. Same observations about mb_optimize_scan=1
> > improving less. We can continue this discussion in my reply to the cover
> > letter. That being said, I have some minor suggestions:
> Thanks for the review!
> > 
> > > Signed-off-by: Baokun Li <libaokun1@huawei.com>
> > > ---
> > >   fs/ext4/ext4.h    | 23 ++++++++++++++---------
> > >   fs/ext4/mballoc.c | 14 +++++++++++---
> > >   2 files changed, 25 insertions(+), 12 deletions(-)
> > > 
> > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > > index 5a20e9cd7184..9c665a620a46 100644
> > > --- a/fs/ext4/ext4.h
> > > +++ b/fs/ext4/ext4.h
> > > @@ -3494,23 +3494,28 @@ static inline int ext4_fs_is_busy(struct ext4_sb_info *sbi)
> > >   	return (atomic_read(&sbi->s_lock_busy) > EXT4_CONTENTION_THRESHOLD);
> > >   }
> > > +static inline bool ext4_try_lock_group(struct super_block *sb, ext4_group_t group)
> > > +{
> > > +	if (!spin_trylock(ext4_group_lock_ptr(sb, group)))
> > > +		return false;
> > > +	/*
> > > +	 * We're able to grab the lock right away, so drop the lock
> > > +	 * contention counter.
> > > +	 */
> > > +	atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, -1, 0);
> > > +	return true;
> > > +}
> > > +
> > >   static inline void ext4_lock_group(struct super_block *sb, ext4_group_t group)
> > >   {
> > > -	spinlock_t *lock = ext4_group_lock_ptr(sb, group);
> > > -	if (spin_trylock(lock))
> > > -		/*
> > > -		 * We're able to grab the lock right away, so drop the
> > > -		 * lock contention counter.
> > > -		 */
> > > -		atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, -1, 0);
> > > -	else {
> > > +	if (!ext4_try_lock_group(sb, group)) {
> > >   		/*
> > >   		 * The lock is busy, so bump the contention counter,
> > >   		 * and then wait on the spin lock.
> > >   		 */
> > >   		atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, 1,
> > >   				  EXT4_MAX_CONTENTION);
> > > -		spin_lock(lock);
> > > +		spin_lock(ext4_group_lock_ptr(sb, group));
> > >   	}
> > >   }
> > > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > > index 1e98c5be4e0a..5c13d9f8a1cc 100644
> > > --- a/fs/ext4/mballoc.c
> > > +++ b/fs/ext4/mballoc.c
> > > @@ -896,7 +896,8 @@ static void ext4_mb_choose_next_group_p2_aligned(struct ext4_allocation_context
> > >   				    bb_largest_free_order_node) {
> > >   			if (sbi->s_mb_stats)
> > >   				atomic64_inc(&sbi->s_bal_cX_groups_considered[CR_POWER2_ALIGNED]);
> > > -			if (likely(ext4_mb_good_group(ac, iter->bb_group, CR_POWER2_ALIGNED))) {
> > > +			if (likely(ext4_mb_good_group(ac, iter->bb_group, CR_POWER2_ALIGNED)) &&
> > > +			    !spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group))) {
> > Maybe reversing the && order to be (!spin_is_locked() && ext4_mb_good_group()) would be better?
> Yeah.
> > >   				*group = iter->bb_group;
> > >   				ac->ac_flags |= EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED;
> > >   				read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
> > > @@ -932,7 +933,8 @@ ext4_mb_find_good_group_avg_frag_lists(struct ext4_allocation_context *ac, int o
> > >   	list_for_each_entry(iter, frag_list, bb_avg_fragment_size_node) {
> > >   		if (sbi->s_mb_stats)
> > >   			atomic64_inc(&sbi->s_bal_cX_groups_considered[cr]);
> > > -		if (likely(ext4_mb_good_group(ac, iter->bb_group, cr))) {
> > > +		if (likely(ext4_mb_good_group(ac, iter->bb_group, cr)) &&
> > > +		    !spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group))) {
> > same as above
> Okay.
> > >   			grp = iter;
> > >   			break;
> > >   		}
> > > @@ -2911,7 +2913,13 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
> > >   			if (err)
> > >   				goto out;
> > > -			ext4_lock_group(sb, group);
> > > +			/* skip busy group */
> > > +			if (cr >= CR_ANY_FREE) {
> > > +				ext4_lock_group(sb, group);
> > > +			} else if (!ext4_try_lock_group(sb, group)) {
> > > +				ext4_mb_unload_buddy(&e4b);
> > > +				continue;
> > > +			}
> > This in itself looks good. I am just thinking that now that we are
> > deciding to skip locked groups, in the code above this one, shall we do
> > something like:
> > 
> >        if (spin_is_locked(group_lock))
> >          continue;
> >        err = ext4_mb_load_buddy(sb, group, &e4b);
> >        if (err)
> >          goto out;
> > 
> >        /* skip busy group */
> >        if (cr >= CR_ANY_FREE) {
> >          ext4_lock_group(sb, group);
> >        } else if (!ext4_try_lock_group(sb, group)) {
> >          ext4_mb_unload_buddy(&e4b);
> >          continue;
> >        }
> > 
> > With this we can even avoid loading the folio as well.
> I previously assumed that for busy groups, the buddy was already loaded,
> so reloading it would incur minimal overhead. However, I was mistaken.
> 
> After implementing this change, the proportion of time spent in
> ext4_mb_load_buddy() decreased from 3.6% to 1.7%, resulting in
> approximately a 2% performance improvement.

Nice :) 

> 
> Thank you for your suggestion!
> I will prevent unnecessary buddy loading in the next version.
> 
> Cheers,
> Baokun
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/4] ext4: better scalability for ext4 block allocation
  2025-06-10 12:06     ` Ojaswin Mujoo
@ 2025-06-10 13:48       ` Baokun Li
  2025-06-11  8:22         ` Ojaswin Mujoo
  0 siblings, 1 reply; 22+ messages in thread
From: Baokun Li @ 2025-06-10 13:48 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, yi.zhang,
	yangerkun, Baokun Li

On 2025/6/10 20:06, Ojaswin Mujoo wrote:
> On Thu, May 29, 2025 at 08:24:14PM +0800, Baokun Li wrote:
>> On 2025/5/28 22:53, Ojaswin Mujoo wrote:
>>> On Fri, May 23, 2025 at 04:58:17PM +0800, libaokun@huaweicloud.com wrote:
>>>> From: Baokun Li <libaokun1@huawei.com>
> <...>
>
>>>> |--------|--------|--------|--------|--------|--------|--------|--------|
>>>> |    -   |    1   |    2   |    4   |    8   |   16   |   32   |   64   |
>>>> |--------|--------|--------|--------|--------|--------|--------|--------|
>>>> |  base  | 295287 | 70665  | 33865  | 19387  | 10104  |  5588  |  3588  |
>>>> |--------|--------|--------|--------|--------|--------|--------|--------|
>>>> | linear | 286328 | 123102 | 119542 | 90653  | 60344  | 35302  | 23280  |
>>>> |        | -3.0%  | 74.20% | 252.9% | 367.5% | 497.2% | 531.6% | 548.7% |
>>>> |--------|--------|--------|--------|--------|--------|--------|--------|
>>>> |mb_optim| 292498 | 133305 | 103069 | 61727  | 29702  | 16845  | 10430  |
>>>> |ize_scan| -0.9%  | 88.64% | 204.3% | 218.3% | 193.9% | 201.4% | 190.6% |
>>>> |--------|--------|--------|--------|--------|--------|--------|--------|
>>> Hey Baokun, nice improvements! The proposed changes make sense to me,
>>> however I suspect the performance improvements may come at a cost of
>>> slight increase in fragmentation, which might affect rotational disks
>>> especially. Maybe comparing e2freefrag numbers with and without the
>>> patches might give a better insight into this.
>> While this approach might slightly increase free space fragmentation on
>> the disk, it significantly reduces file fragmentation, leading to faster
>> read speeds on rotational disks.
>>
>> When multiple processes contend for free blocks within the same block
>> group, the probability of blocks allocated by the same process being
>> merged on consecutive allocations is low. This is because other processes
>> may preempt the free blocks immediately following the current process's
>> last allocated region.
>>
>> Normally, we rely on preallocation to avoid files becoming overly
>> fragmented (even though preallocation itself can cause fragmentation in
>> free disk space). But since fallocate doesn't support preallocation, the
>> fragmentation issue is more pronounced. Counterintuitively, skipping busy
>> groups actually boosts opportunities for file extent merging, which in turn
>> reduces overall file fragmentation.
>>
>> Referencing will-it-scale/fallocate2, I tested 64 processes each appending
>> 4KB via fallocate to 64 separate files until they reached 1GB. This test
>> specifically examines contention in block allocation; unlike fallocate2,
>> it omits the contention between fallocate and truncate. Preliminary results
>> are provided below; detailed scripts and full test outcomes are attached in
>> the email footer.
>>
>> ----------------------------------------------------------
>>                       |       base      |      patched    |
>> ---------------------|--------|--------|--------|--------|
>> mb_optimize_scan     | linear |opt_scan| linear |opt_scan|
>> ---------------------|--------|--------|--------|--------|
>> bw(MiB/s)            | 217    | 219    | 5685   | 5670   |
>> Avg. free extent size| 1943732| 1943728| 1439608| 1368328|
>> Avg. extents per file| 261879 | 262039 | 744    | 2084   |
>> Avg. size per extent | 4      | 4      | 1408   | 503    |
>> Fragmentation score  | 100    | 100    | 2      | 6      |
>> ----------------------------------------------------------
> Hi Baokun,
>
> Thanks for the info and data and apologies for being late, I caught a
> virus last week :/
Hi Ojaswin,

Being sick really takes a toll.
Please get some good rest and take care of yourself.
>
> These numbers look pretty interesting and your explanation of why the
> fragmentation is better makes sense. It is definitely a win-win then for
> performance and fragmentation!
>
>>> Regardless the performance benefits are significant and I feel it is
>>> good to have these patches.
>>>
>>> I'll give my reviews individually as I'm still going through patch 2
>>> However, I wanted to check on a couple things:
>> Okay, thank you for your feedback.
>>> 1. I believe you ran these in docker. Would you have any script etc open
>>>      sourced that I can use to run some benchmarks on my end (and also
>>> 	 understand your test setup).
>> Yes, these two patches primarily mitigate contention between block
>> allocations and between block allocation and release. The testing script
>> can be referenced from the fio script mentioned earlier in the email
>> footer. You can also add more truncate calls based on it.
> Thanks for the scripts.
>
>>> 2. I notice we are getting way less throughput in mb_optimize_scan? I
>>>      wonder why that is the case. Do you have some data on that? Are your
>>>      tests starting on an empty FS, maybe in that case linear scan works a
>>>      bit better since almost all groups are empty. If so, what are the
>>>      numbers like when we start with a fragmented FS?
>> The throughput of mb_optimize_scan is indeed much lower, and we continue
>> to optimize it, as mb_optimize_scan is the default mount option and
>> performs better in scenarios with large volume disks and high space usage.
>>
>> Disk space used is about 7%; mb_optimize_scan should perform better with
>> less free space. However, this isn't the critical factor. The poor
>> throughput here is due to the following reasons.
>>
>> One reason is that mb_optimize_scan's list traversal is unordered and
>> always selects the first group.
>>
>> While traversing the list we hold a spin_lock, which forbids loading the
>> buddy and thus direct use of ext4_lock_group(). This can lead to a
>> "bouncing" scenario: the spin_is_locked() peek finds grp_A unlocked, but
>> the subsequent ext4_try_lock_group() fails, forcing the list traversal
>> to repeatedly restart from grp_A.
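>>
>> Roughly, the problematic scan is shaped like this (simplified sketch of
>> the skip-busy peek applied to the order lists; the list and lock names
>> are those of mainline mballoc):
>>
>>         read_lock(&sbi->s_mb_largest_free_orders_locks[order]);
>>         list_for_each_entry(grp, &sbi->s_mb_largest_free_orders[order],
>>                             bb_largest_free_order_node) {
>>                 /* sleeping is forbidden under the rwlock, so only a
>>                  * racy peek at the group lock is possible here */
>>                 if (!spin_is_locked(ext4_group_lock_ptr(sb, grp->bb_group)))
>>                         goto found;  /* the later trylock may still fail */
>>         }
>>         read_unlock(&sbi->s_mb_largest_free_orders_locks[order]);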
>>
>> In contrast, linear traversal directly uses ext4_try_lock_group(),
>> avoiding this bouncing. Therefore, we need a lockless, ordered traversal
>> to achieve linear-like efficiency.
> Hmm, right, the unordered traversal has led to other issues as well in
> the past.
>
>> Another reason is that opt_scan tends to allocate from groups that have
>> just received freed blocks, causing it to constantly "jump around"
>> between certain groups.
>>
>> This leads to intense contention between allocation and release, and even
>> between release events. In contrast, linear traversal always moves forward
>> without revisiting groups, resulting in less contention between allocation
>> and release.
> By "just received freed blocks", you mean the blocks got freed in the group
> right?
Yes.
> I was under the impression that when we free blocks and the group's
> largest order/ avg fragment changes, the group is added to the end of
> the free fragment list/order list so it should be the last to be picked.
> Is that not the case?
Yes, we are indeed adding the group to the list tail. However, because
we traverse all ordered lists from low to high, a group might end up
"bouncing" repeatedly between order_0 and order_1 here.

For instance, if order_1 only contains group 1, linear traversal would
rarely revisit it after the initial pass. However, after a non-linear
allocation, this group is moved from the order_1 list to the order_0 list.
When blocks are freed, it's moved back to the order_1 list, and then
non-linear traversal prioritizes allocation in this same group again...
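
As a step-by-step trace (hypothetical, with group 1 the only entry on
the order_1 list):

   1. opt_scan picks group 1 from order_1 and allocates from it; its
      largest free order shrinks, so it moves to the order_0 list.
   2. A free in group 1 grows its largest free order back, so it is
      re-added to the tail of the order_1 list.
   3. The next pass over the order lists lands on group 1 again, and
      the cycle repeats.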
>
>> However, because linear traversal involves more groups in allocation, the journal
>> becomes a bottleneck. If opt_scan first attempts to traverse block groups
>> to the right from the target group in all lists, and then from index 0 to
>> the left in all lists, competition between block groups would be
>> significantly reduced.
>>
>> To enable ordered traversal, we attempted to convert list_head to an
>> ordered xarray. This ordering prevents "bouncing" during retries.
>> Additionally, traversing all right-side groups before left-side groups
>> significantly reduced contention. Performance improved from 10430 to 17730.
> Do you maybe have some code you can share of this?
Yes, V2 will include those changes.
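
To give a feel for the direction (an illustrative sketch only, not the V2
code; the xa_order[] array is hypothetical, one xarray per order, indexed
by group number):

        unsigned long grp;
        struct ext4_group_info *gi;

        /* scan from the goal group to the right first ... */
        xa_for_each_start(&xa_order[order], grp, gi, goal_group) {
                if (ext4_try_lock_group(sb, grp))
                        goto found;
        }
        /* ... then wrap around and scan [0, goal_group) */
        if (goal_group)
                xa_for_each_range(&xa_order[order], grp, gi, 0, goal_group - 1) {
                        if (ext4_try_lock_group(sb, grp))
                                goto found;
                }

Because the walk is ordered by group number, a retry after a failed
trylock continues with the next group instead of bouncing back to the
head of an unordered list, and scanning right of the goal first keeps
concurrent allocators spread across different regions.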
>
>> However, xarray traversal introduces overhead; list_head group selection
>> was O(1), while xarray becomes O(n log n). This results in a ~10%
>> performance drop in single-process scenarios, and I'm not entirely sure if
>> this trade-off is worthwhile. 🤔
>>
>> Additionally, by attempting to merge before inserting in
>> ext4_mb_free_metadata(), we can eliminate contention on sbi->s_md_lock
>> during merges, resulting in roughly a 5% performance gain.
>>>      - Or maybe it is that the lazyinit thread has not yet initialized all
>>>      the buddies yet which means we have lesser BGs in the freefrag list
>>>      or the order list used by faster CRs. Hence, if they are locked we
>>>      are falling more to CR_GOAL_LEN_SLOW. To check if this is the case,
>>>      one hack is to cat /proc/fs/ext4/<disk>/mb_groups (or something
>>>      along those lines) before the benchmark, which forces init of all the group
>>>      buddies thus populating all the lists used by mb_opt_scan. Maybe we
>>>      can check if this gives better results.
>> All groups are already initialized at the time of testing, and that's not
>> where the problem lies.
>>> 3. Also, how much IO are we doing here, are we filling the whole FS?
>>>
>> In a single container, create a file, then repeatedly append 8KB using
>> fallocate until the file reaches 1MB. After that, truncate the file to
>> 0 and continue appending 8KB with fallocate. The 64 containers will
>> occupy a maximum of 64MB of disk space in total, so they won't fill the
>> entire file system.
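>>
>> In C terms, each container's loop is essentially the following
>> (illustrative sketch, not the actual will-it-scale harness; error
>> handling elided):
>>
>>         #define _GNU_SOURCE
>>         #include <fcntl.h>
>>         #include <unistd.h>
>>
>>         int main(void)
>>         {
>>                 int fd = open("testfile", O_CREAT | O_RDWR, 0644);
>>                 off_t sz = 0;
>>
>>                 for (;;) {
>>                         if (fallocate(fd, 0, sz, 8192)) /* append 8KB */
>>                                 break;
>>                         sz += 8192;
>>                         if (sz >= 1024 * 1024) {  /* at 1MB, truncate to 0 */
>>                                 ftruncate(fd, 0);
>>                                 sz = 0;
>>                         }
>>                 }
>>                 return 0;
>>         }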
> Also, as per your theory, if we do similar experiments in a fragmented
> FS we should see opt_scan perform better right? I'd like to test this as
> well. I'll try to play with the scripts you have shared.
>
Yes, mb_optimize_scan performs well when free space fragmentation is
severe. We have a 600TB disk array where the write bandwidth is
approximately 5 GB/s when empty. When space utilization reaches 95%,
linear traversal drops bandwidth to 1 GB/s, whereas enabling
mb_optimize_scan restores it to 4.2 GB/s.


Cheers,
Baokun



* Re: [PATCH 0/4] ext4: better scalability for ext4 block allocation
  2025-06-10 13:48       ` Baokun Li
@ 2025-06-11  8:22         ` Ojaswin Mujoo
  2025-06-12 11:30           ` Baokun Li
  0 siblings, 1 reply; 22+ messages in thread
From: Ojaswin Mujoo @ 2025-06-11  8:22 UTC (permalink / raw)
  To: Baokun Li
  Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, yi.zhang,
	yangerkun

On Tue, Jun 10, 2025 at 09:48:30PM +0800, Baokun Li wrote:
> On 2025/6/10 20:06, Ojaswin Mujoo wrote:
> > On Thu, May 29, 2025 at 08:24:14PM +0800, Baokun Li wrote:
> > > On 2025/5/28 22:53, Ojaswin Mujoo wrote:
> > > > On Fri, May 23, 2025 at 04:58:17PM +0800, libaokun@huaweicloud.com wrote:
> > > 
> > > ----------------------------------------------------------
> > >                       |       base      |      patched    |
> > > ---------------------|--------|--------|--------|--------|
> > > mb_optimize_scan     | linear |opt_scan| linear |opt_scan|
> > > ---------------------|--------|--------|--------|--------|
> > > bw(MiB/s)            | 217    | 219    | 5685   | 5670   |
> > > Avg. free extent size| 1943732| 1943728| 1439608| 1368328|
> > > Avg. extents per file| 261879 | 262039 | 744    | 2084   |
> > > Avg. size per extent | 4      | 4      | 1408   | 503    |
> > > Fragmentation score  | 100    | 100    | 2      | 6      |
> > > ----------------------------------------------------------
> > Hi Baokun,
> > 
> > Thanks for the info and data and apologies for being late, I caught a
> > virus last week :/
> Hi Ojaswin,
> 
> Being sick really takes a toll.
> Please get some good rest and take care of yourself.

Thanks Baokun!

<snip>

> > 
> > > Another reason is that opt_scan tends to allocate from groups that have
> > > just received freed blocks, causing it to constantly "jump around"
> > > between certain groups.
> > > 
> > > This leads to intense contention between allocation and release, and even
> > > between release events. In contrast, linear traversal always moves forward
> > > without revisiting groups, resulting in less contention between allocation
> > > and release.
> > By "just received freed blocks", you mean the blocks got freed in the group
> > right?
> Yes.
> > I was under the impression that when we free blocks and the group's
> > largest order/ avg fragment changes, the group is added to the end of
> > the free fragment list/order list so it should be the last to be picked.
> > Is that not the case?
> Yes, we are indeed adding the group to the list tail. However, because
> we traverse all ordered lists from low to high, a group might end up
> "bouncing" repeatedly between order_0 and order_1 here.
> 
> For instance, if order_1 only contains group 1, linear traversal would
> rarely revisit it after the initial pass. However, after a non-linear
> allocation, this group is moved from the order_1 list to the order_0 list.
> When blocks are freed, it's moved back to the order_1 list, and then
> non-linear traversal prioritizes allocation in this same group again...

Right, I get what you mean now. Thanks.

> > 
> > > However, because linear traversal involves more groups in allocation, the journal
> > > becomes a bottleneck. If opt_scan first attempts to traverse block groups
> > > to the right from the target group in all lists, and then from index 0 to
> > > the left in all lists, competition between block groups would be
> > > significantly reduced.
> > > 
> > > To enable ordered traversal, we attempted to convert list_head to an
> > > ordered xarray. This ordering prevents "bouncing" during retries.
> > > Additionally, traversing all right-side groups before left-side groups
> > > significantly reduced contention. Performance improved from 10430 to 17730.
> > Do you maybe have some code you can share of this?
> Yes, V2 will include those changes.

Okay sure

<snip>

> > > In a single container, create a file, then repeatedly append 8KB using
> > > fallocate until the file reaches 1MB. After that, truncate the file to
> > > 0 and continue appending 8KB with fallocate. The 64 containers will
> > > occupy a maximum of 64MB of disk space in total, so they won't fill the
> > > entire file system.
> > Also, as per your theory, if we do similar experiments in a fragmented
> > FS we should see opt_scan perform better right? I'd like to test this as
> > well. I'll try to play with the scripts you have shared.
> > 
> Yes, mb_optimize_scan performs well when free space fragmentation is
> severe. We have a 600TB disk array where the write bandwidth is
> approximately 5 GB/s when empty. When space utilization reaches 95%,
> linear traversal drops bandwidth to 1 GB/s, whereas enabling
> mb_optimize_scan restores it to 4.2 GB/s.

Got it, thanks for confirming. Seems like in a mostly empty FS, linear
traversal does better. Makes me wonder if we should use some heuristic
to switch between linear and opt_scan based allocation; for example,
opt_scan could be used if the FS is 60% full or has a fragmentation
score of X (or something like that..).

But I'm also curious about your improved optimize scan implementation
with ordered traversal, so let's see how that goes first.

> 
> 
> Cheers,
> Baokun
> 


* Re: [PATCH 0/4] ext4: better scalability for ext4 block allocation
  2025-06-11  8:22         ` Ojaswin Mujoo
@ 2025-06-12 11:30           ` Baokun Li
  0 siblings, 0 replies; 22+ messages in thread
From: Baokun Li @ 2025-06-12 11:30 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, tytso, adilger.kernel, jack, linux-kernel, yi.zhang,
	yangerkun, Baokun Li

On 2025/6/11 16:22, Ojaswin Mujoo wrote:
> On Tue, Jun 10, 2025 at 09:48:30PM +0800, Baokun Li wrote:
>> On 2025/6/10 20:06, Ojaswin Mujoo wrote:
>>> On Thu, May 29, 2025 at 08:24:14PM +0800, Baokun Li wrote:
>>>> On 2025/5/28 22:53, Ojaswin Mujoo wrote:
>>>>> On Fri, May 23, 2025 at 04:58:17PM +0800, libaokun@huaweicloud.com wrote:
>>>> ----------------------------------------------------------
>>>>                        |       base      |      patched    |
>>>> ---------------------|--------|--------|--------|--------|
>>>> mb_optimize_scan     | linear |opt_scan| linear |opt_scan|
>>>> ---------------------|--------|--------|--------|--------|
>>>> bw(MiB/s)            | 217    | 219    | 5685   | 5670   |
>>>> Avg. free extent size| 1943732| 1943728| 1439608| 1368328|
>>>> Avg. extents per file| 261879 | 262039 | 744    | 2084   |
>>>> Avg. size per extent | 4      | 4      | 1408   | 503    |
>>>> Fragmentation score  | 100    | 100    | 2      | 6      |
>>>> ----------------------------------------------------------
>>> Hi Baokun,
>>>
>>> Thanks for the info and data and apologies for being late, I caught a
>>> virus last week :/
>> Hi Ojaswin,
>>
>> Being sick really takes a toll.
>> Please get some good rest and take care of yourself.
> Thanks Baokun!
>
> <snip>
>
>>>> Another reason is that opt_scan tends to allocate from groups that have
>>>> just received freed blocks, causing it to constantly "jump around"
>>>> between certain groups.
>>>>
>>>> This leads to intense contention between allocation and release, and even
>>>> between release events. In contrast, linear traversal always moves forward
>>>> without revisiting groups, resulting in less contention between allocation
>>>> and release.
>>> By "just received freed blocks", you mean the blocks got freed in the group
>>> right?
>> Yes.
>>> I was under the impression that when we free blocks and the group's
>>> largest order/ avg fragment changes, the group is added to the end of
>>> the free fragment list/order list so it should be the last to be picked.
>>> Is that not the case?
>> Yes, we are indeed adding the group to the list tail. However, because
>> we traverse all ordered lists from low to high, a group might end up
>> "bouncing" repeatedly between order_0 and order_1 here.
>>
>> For instance, if order_1 only contains group 1, linear traversal would
>> rarely revisit it after the initial pass. However, after a non-linear
>> allocation, this group is moved from the order_1 list to the order_0 list.
>> When blocks are freed, it's moved back to the order_1 list, and then
>> non-linear traversal prioritizes allocation in this same group again...
> Right, I get what you mean now. Thanks.
>
>>>> However, because linear traversal involves more groups in allocation, the journal
>>>> becomes a bottleneck. If opt_scan first attempts to traverse block groups
>>>> to the right from the target group in all lists, and then from index 0 to
>>>> the left in all lists, competition between block groups would be
>>>> significantly reduced.
>>>>
>>>> To enable ordered traversal, we attempted to convert list_head to an
>>>> ordered xarray. This ordering prevents "bouncing" during retries.
>>>> Additionally, traversing all right-side groups before left-side groups
>>>> significantly reduced contention. Performance improved from 10430 to 17730.
>>> Do you maybe have some code you can share of this?
>> Yes, V2 will include those changes.
> Okay sure
>
> <snip>
>
>>>> In a single container, create a file, then repeatedly append 8KB using
>>>> fallocate until the file reaches 1MB. After that, truncate the file to
>>>> 0 and continue appending 8KB with fallocate. The 64 containers will
>>>> occupy a maximum of 64MB of disk space in total, so they won't fill the
>>>> entire file system.
>>> Also, as per your theory, if we do similar experiments in a fragmented
>>> FS we should see opt_scan perform better right? I'd like to test this as
>>> well. I'll try to play with the scripts you have shared.
>>>
>> Yes, mb_optimize_scan performs well when free space fragmentation is
>> severe. We have a 600TB disk array where the write bandwidth is
>> approximately 5 GB/s when empty. When space utilization reaches 95%,
>> linear traversal drops bandwidth to 1 GB/s, whereas enabling
>> mb_optimize_scan restores it to 4.2 GB/s.
> Got it, thanks for confirming. Seems like in a mostly empty FS, linear
> traversal does better. Makes me wonder if we should use some heuristic
> to switch between linear and opt_scan based allocation; for example,
> opt_scan could be used if the FS is 60% full or has a fragmentation
> score of X (or something like that..).
Yes, it would be nice if we could switch at the right point so we get
the best of both allocation strategies. But we need some data to support
choosing the right strategy, and that can come later.
> But I'm also curious about your improved optimize scan implementation
> with ordered traversal, so let's see how that goes first.
>
V2 of the series is currently being tested, and if that goes well, it
could be sent out next week.


Cheers,
Baokun




Thread overview: 22+ messages
2025-05-23  8:58 [PATCH 0/4] ext4: better scalability for ext4 block allocation libaokun
2025-05-23  8:58 ` [PATCH 1/4] ext4: add ext4_try_lock_group() to skip busy groups libaokun
2025-05-28 15:05   ` Ojaswin Mujoo
2025-05-30  8:20     ` Baokun Li
2025-06-10 12:07       ` Ojaswin Mujoo
2025-05-23  8:58 ` [PATCH 2/4] ext4: move mb_last_[group|start] to ext4_inode_info libaokun
2025-05-29 12:56   ` Jan Kara
2025-05-30  9:31     ` Baokun Li
2025-06-02 15:44       ` Jan Kara
2025-06-04  8:13         ` Baokun Li
2025-05-23  8:58 ` [PATCH 3/4] ext4: get rid of some obsolete EXT4_MB_HINT flags libaokun
2025-05-28 15:10   ` Ojaswin Mujoo
2025-05-29 12:57   ` Jan Kara
2025-05-23  8:58 ` [PATCH 4/4] ext4: fix typo in CR_GOAL_LEN_SLOW comment libaokun
2025-05-28 15:11   ` Ojaswin Mujoo
2025-05-29 12:57   ` Jan Kara
2025-05-28 14:53 ` [PATCH 0/4] ext4: better scalability for ext4 block allocation Ojaswin Mujoo
2025-05-29 12:24   ` Baokun Li
2025-06-10 12:06     ` Ojaswin Mujoo
2025-06-10 13:48       ` Baokun Li
2025-06-11  8:22         ` Ojaswin Mujoo
2025-06-12 11:30           ` Baokun Li
