From: Baokun Li <libaokun1@huawei.com>
To: <linux-ext4@vger.kernel.org>
Cc: <tytso@mit.edu>, <jack@suse.cz>, <adilger.kernel@dilger.ca>,
<ojaswin@linux.ibm.com>, <linux-kernel@vger.kernel.org>,
<yi.zhang@huawei.com>, <yangerkun@huawei.com>,
<libaokun1@huawei.com>
Subject: [PATCH v2 01/16] ext4: add ext4_try_lock_group() to skip busy groups
Date: Mon, 23 Jun 2025 15:32:49 +0800
Message-ID: <20250623073304.3275702-2-libaokun1@huawei.com>
In-Reply-To: <20250623073304.3275702-1-libaokun1@huawei.com>

When ext4 allocates blocks, it used to simply scan the block groups
one by one until it found a suitable one. On filesystems with a huge
number of block groups (hundreds of thousands or even millions) where
few groups have free space (i.e. most are nearly full), checking them
all takes a long time and allocation performance suffers. The
"mb_optimize_scan" mount option (now enabled by default) was added to
address this: it maintains lists of block groups ordered by largest
free order and by average fragment size, so that when free blocks are
needed a promising group can be taken directly from the appropriate
list instead of scanning. This makes block allocation much faster.
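
As a rough illustration, picking a group from one of these lists
boils down to the pattern below; this is a simplified sketch of
ext4_mb_choose_next_group_p2_aligned() for a given order i (locking,
statistics and the surrounding order loop are elided):

	struct ext4_group_info *iter;

	/* Take the first group on the list that passes the usual
	 * suitability checks for this allocation criterion. */
	list_for_each_entry(iter, &sbi->s_mb_largest_free_orders[i],
			    bb_largest_free_order_node) {
		if (ext4_mb_good_group(ac, iter->bb_group, CR_POWER2_ALIGNED)) {
			*group = iter->bb_group;
			break;
		}
	}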

However, when multiple processes or containers perform similar
allocations concurrently (for example, repeatedly allocating 8k
blocks), they all pick the same block group from the head of the same
list and contend on its lock. Even two such processes are enough to
cut IOPS in half: one container may achieve 300,000 IOPS, but running
two at the same time yields a combined total of only 150,000.

Since block groups can already be scanned in a non-linear order, the
first and last groups in the same list are currently equally good
candidates for finding free blocks. Therefore, add an
ext4_try_lock_group() helper and use it to skip the current group when
it is locked by another process, thereby avoiding contention with
other processes. This helps ext4 make better use of having multiple
block groups.

Also, to make sure we do not skip all the groups that have free space
when allocating blocks, busy groups are no longer skipped once
ac_criteria reaches CR_ANY_FREE.
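
Putting both pieces together, each candidate group in the regular
allocator is now handled roughly as follows (condensed from the
ext4_mb_regular_allocator() hunks in the diff below):

	/* Cheap early check: skip a group whose lock is currently held,
	 * unless we have reached CR_ANY_FREE and may no longer skip
	 * groups that still have free space. */
	if (cr < CR_ANY_FREE &&
	    spin_is_locked(ext4_group_lock_ptr(sb, group)))
		continue;
	...
	/* After loading the buddy, take the group lock: block on it at
	 * CR_ANY_FREE, otherwise try it once and move on if busy. */
	if (cr >= CR_ANY_FREE) {
		ext4_lock_group(sb, group);
	} else if (!ext4_try_lock_group(sb, group)) {
		ext4_mb_unload_buddy(&e4b);
		continue;
	}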
Performance test data follows:
Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.

                   | Kunpeng 920 / 512GB -P80| AMD 9654 / 1536GB -P96  |
Disk: 960GB SSD    |-------------------------|-------------------------|
                   | base  |    patched      | base  |    patched      |
-------------------|-------|-----------------|-------|-----------------|
mb_optimize_scan=0 | 2667  | 4821 (+80.7%)   | 3450  | 15371 (+345%)   |
mb_optimize_scan=1 | 2643  | 4784 (+81.0%)   | 3209  | 6101 (+90.0%)   |

Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
fs/ext4/ext4.h | 23 ++++++++++++++---------
fs/ext4/mballoc.c | 19 ++++++++++++++++---
2 files changed, 30 insertions(+), 12 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 18373de980f2..9df74123e7e6 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3541,23 +3541,28 @@ static inline int ext4_fs_is_busy(struct ext4_sb_info *sbi)
return (atomic_read(&sbi->s_lock_busy) > EXT4_CONTENTION_THRESHOLD);
}
+static inline bool ext4_try_lock_group(struct super_block *sb, ext4_group_t group)
+{
+ if (!spin_trylock(ext4_group_lock_ptr(sb, group)))
+ return false;
+ /*
+ * We're able to grab the lock right away, so drop the lock
+ * contention counter.
+ */
+ atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, -1, 0);
+ return true;
+}
+
static inline void ext4_lock_group(struct super_block *sb, ext4_group_t group)
{
- spinlock_t *lock = ext4_group_lock_ptr(sb, group);
- if (spin_trylock(lock))
- /*
- * We're able to grab the lock right away, so drop the
- * lock contention counter.
- */
- atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, -1, 0);
- else {
+ if (!ext4_try_lock_group(sb, group)) {
/*
* The lock is busy, so bump the contention counter,
* and then wait on the spin lock.
*/
atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, 1,
EXT4_MAX_CONTENTION);
- spin_lock(lock);
+ spin_lock(ext4_group_lock_ptr(sb, group));
}
}
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 1e98c5be4e0a..336d65c4f6a2 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -896,7 +896,8 @@ static void ext4_mb_choose_next_group_p2_aligned(struct ext4_allocation_context
bb_largest_free_order_node) {
if (sbi->s_mb_stats)
atomic64_inc(&sbi->s_bal_cX_groups_considered[CR_POWER2_ALIGNED]);
- if (likely(ext4_mb_good_group(ac, iter->bb_group, CR_POWER2_ALIGNED))) {
+ if (!spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group)) &&
+ likely(ext4_mb_good_group(ac, iter->bb_group, CR_POWER2_ALIGNED))) {
*group = iter->bb_group;
ac->ac_flags |= EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED;
read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
@@ -932,7 +933,8 @@ ext4_mb_find_good_group_avg_frag_lists(struct ext4_allocation_context *ac, int o
list_for_each_entry(iter, frag_list, bb_avg_fragment_size_node) {
if (sbi->s_mb_stats)
atomic64_inc(&sbi->s_bal_cX_groups_considered[cr]);
- if (likely(ext4_mb_good_group(ac, iter->bb_group, cr))) {
+ if (!spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group)) &&
+ likely(ext4_mb_good_group(ac, iter->bb_group, cr))) {
grp = iter;
break;
}
@@ -2899,6 +2901,11 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
nr, &prefetch_ios);
}
+ /* prevent unnecessary buddy loading. */
+ if (cr < CR_ANY_FREE &&
+ spin_is_locked(ext4_group_lock_ptr(sb, group)))
+ continue;
+
/* This now checks without needing the buddy page */
ret = ext4_mb_good_group_nolock(ac, group, cr);
if (ret <= 0) {
@@ -2911,7 +2918,13 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
if (err)
goto out;
- ext4_lock_group(sb, group);
+ /* skip busy group */
+ if (cr >= CR_ANY_FREE) {
+ ext4_lock_group(sb, group);
+ } else if (!ext4_try_lock_group(sb, group)) {
+ ext4_mb_unload_buddy(&e4b);
+ continue;
+ }
/*
* We need to check again after locking the
--
2.46.1