From: Baokun Li <libaokun1@huawei.com>
To: <linux-ext4@vger.kernel.org>
Cc: <tytso@mit.edu>, <adilger.kernel@dilger.ca>, <jack@suse.cz>,
<linux-kernel@vger.kernel.org>, <ojaswin@linux.ibm.com>,
<julia.lawall@inria.fr>, <yi.zhang@huawei.com>,
<yangerkun@huawei.com>, <libaokun1@huawei.com>,
<libaokun@huaweicloud.com>
Subject: [PATCH v3 05/17] ext4: utilize multiple global goals to reduce contention
Date: Mon, 14 Jul 2025 21:03:15 +0800
Message-ID: <20250714130327.1830534-6-libaokun1@huawei.com>
In-Reply-To: <20250714130327.1830534-1-libaokun1@huawei.com>

When allocating data blocks, if the first try (goal allocation) fails and
stream allocation is enabled, it tries a global goal starting from the
last group we used (s_mb_last_group). This helps cluster large files
together to reduce free space fragmentation, and the resulting data block
contiguity also accelerates write-back to disk.

However, when multiple processes allocate blocks, having just one global
goal means they all contend for the same group. This drastically lowers
the chances of extents merging and leads to much worse file fragmentation.

To mitigate this multi-process contention, we now employ multiple global
goals, with the number of goals being the minimum of the number of
possible CPUs and one-quarter of the filesystem's total block group count
(rounded up). To ensure a consistent goal for each inode, we select the
corresponding goal by taking the inode number modulo the total number of
goals.

Performance test data follows:

Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.

|CPU: Kunpeng 920 | P80 | P1 |
|Memory: 512GB |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 9636 | 19628 (+103%) | 337597 | 320885 (-4.9%) |
|mb_optimize_scan=1 | 4834 | 7129 (+47.4%) | 341440 | 321275 (-5.9%) |

|CPU: AMD 9654 * 2 | P96 | P1 |
|Memory: 1536GB |------------------------|-------------------------|
|960GB SSD (1GB/s) | base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 22341 | 53760 (+140%) | 219707 | 213145 (-2.9%) |
|mb_optimize_scan=1 | 9177 | 12716 (+38.5%) | 215732 | 215262 (+0.2%) |

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/ext4.h    |  6 ++++--
 fs/ext4/mballoc.c | 27 +++++++++++++++++++++++----
 2 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 7f5c070de0fb..ad97c693d56a 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1629,14 +1629,16 @@ struct ext4_sb_info {
 	unsigned int s_mb_order2_reqs;
 	unsigned int s_mb_group_prealloc;
 	unsigned int s_max_dir_size_kb;
-	/* where last allocation was done - for stream allocation */
-	ext4_group_t s_mb_last_group;
 	unsigned int s_mb_prefetch;
 	unsigned int s_mb_prefetch_limit;
 	unsigned int s_mb_best_avail_max_trim_order;
 	unsigned int s_sb_update_sec;
 	unsigned int s_sb_update_kb;
 
+	/* where last allocation was done - for stream allocation */
+	ext4_group_t *s_mb_last_groups;
+	unsigned int s_mb_nr_global_goals;
+
 	/* stats for buddy allocator */
 	atomic_t s_bal_reqs;	/* number of reqs with len > 1 */
 	atomic_t s_bal_success;	/* we found long enough chunks */
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 025b759ca643..b6aa24b48543 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2168,8 +2168,12 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
 	ac->ac_buddy_folio = e4b->bd_buddy_folio;
 	folio_get(ac->ac_buddy_folio);
 	/* store last allocated for subsequent stream allocation */
-	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
-		WRITE_ONCE(sbi->s_mb_last_group, ac->ac_f_ex.fe_group);
+	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
+		int hash = ac->ac_inode->i_ino % sbi->s_mb_nr_global_goals;
+
+		WRITE_ONCE(sbi->s_mb_last_groups[hash], ac->ac_f_ex.fe_group);
+	}
+
 	/*
 	 * As we've just preallocated more space than
 	 * user requested originally, we store allocated
@@ -2842,7 +2846,9 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 
 	/* if stream allocation is enabled, use global goal */
 	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
-		ac->ac_g_ex.fe_group = READ_ONCE(sbi->s_mb_last_group);
+		int hash = ac->ac_inode->i_ino % sbi->s_mb_nr_global_goals;
+
+		ac->ac_g_ex.fe_group = READ_ONCE(sbi->s_mb_last_groups[hash]);
 		ac->ac_g_ex.fe_start = -1;
 		ac->ac_flags &= ~EXT4_MB_HINT_TRY_GOAL;
 	}
@@ -3722,10 +3728,19 @@ int ext4_mb_init(struct super_block *sb)
 			sbi->s_mb_group_prealloc, EXT4_NUM_B2C(sbi, sbi->s_stripe));
 	}
 
+	sbi->s_mb_nr_global_goals = umin(num_possible_cpus(),
+					 DIV_ROUND_UP(sbi->s_groups_count, 4));
+	sbi->s_mb_last_groups = kcalloc(sbi->s_mb_nr_global_goals,
+					sizeof(ext4_group_t), GFP_KERNEL);
+	if (sbi->s_mb_last_groups == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
 	sbi->s_locality_groups = alloc_percpu(struct ext4_locality_group);
 	if (sbi->s_locality_groups == NULL) {
 		ret = -ENOMEM;
-		goto out;
+		goto out_free_last_groups;
 	}
 	for_each_possible_cpu(i) {
 		struct ext4_locality_group *lg;
@@ -3750,6 +3765,9 @@ int ext4_mb_init(struct super_block *sb)
 out_free_locality_groups:
 	free_percpu(sbi->s_locality_groups);
 	sbi->s_locality_groups = NULL;
+out_free_last_groups:
+	kfree(sbi->s_mb_last_groups);
+	sbi->s_mb_last_groups = NULL;
 out:
 	kfree(sbi->s_mb_avg_fragment_size);
 	kfree(sbi->s_mb_avg_fragment_size_locks);
@@ -3854,6 +3872,7 @@ void ext4_mb_release(struct super_block *sb)
 	}
 
 	free_percpu(sbi->s_locality_groups);
+	kfree(sbi->s_mb_last_groups);
 }
 
 static inline int ext4_issue_discard(struct super_block *sb,
--
2.46.1