From: libaokun@huaweicloud.com
To: linux-ext4@vger.kernel.org
Cc: tytso@mit.edu, adilger.kernel@dilger.ca, jack@suse.cz,
	linux-kernel@vger.kernel.org, yi.zhang@huawei.com,
	yangerkun@huawei.com, libaokun1@huawei.com,
	libaokun@huaweicloud.com
Subject: [PATCH 2/4] ext4: move mb_last_[group|start] to ext4_inode_info
Date: Fri, 23 May 2025 16:58:19 +0800
Message-ID: <20250523085821.1329392-3-libaokun@huaweicloud.com>
In-Reply-To: <20250523085821.1329392-1-libaokun@huaweicloud.com>

From: Baokun Li <libaokun1@huawei.com>

After we optimized the block group lock, we found another lock
contention issue when running will-it-scale/fallocate2 with multiple
processes. Block allocation in fallocate and block release in truncate
were contending on s_md_lock. The problem is that this lock protects
unrelated things on those two paths: the list of freed data blocks
(s_freed_data_list) when releasing, and where to start looking for new
blocks (mb_last_[group|start]) when allocating.

Moreover, when allocating data blocks, if the first try (goal allocation)
fails and stream allocation is on, it tries a global goal starting from
the last group we used (s_mb_last_group). This can make things faster by
writing blocks close together on the disk. But when many processes are
allocating, they all contend on s_md_lock and might even try to use the
same group. This makes it harder to merge extents and can make files more
fragmented. If different processes allocate chunks of very different sizes,
the free space on the disk can also get fragmented: a small allocation
might fit in a partially full group, but if a big allocation has already
skipped that group and moved the global goal past it, the small
allocation ends up in an emptier group instead.

So, we're changing stream allocation to work per inode. First, it tries
the goal, then the last group where that inode successfully allocated a
block. This keeps an inode's data closer together. Plus, after moving
mb_last_[group|start] to ext4_inode_info, we don't need s_md_lock during
block allocation anymore because we already hold i_data_sem for write.
This gets rid of the contention between allocating and releasing blocks,
which improves fallocate2 throughput by 142% to 245% in the test below.
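
The per-inode hint can be modelled in userspace roughly as follows. This
is only a sketch of the selection logic, not kernel code: the names
inode_hint, goal, pick_stream_goal() and record_allocation() are
hypothetical stand-ins for ext4_inode_info, ac_g_ex and the two hunks in
mballoc.c, and locking is not modelled (callers are assumed to be
serialized per inode, as i_data_sem does in the kernel).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for ext4_group_t / ext4_grpblk_t. */
typedef unsigned int group_t;
typedef int grpblk_t;

/* Goal extent: where the allocator should start searching. */
struct goal {
	group_t group;
	grpblk_t start;
};

/* Per-inode hint, modelled on i_mb_last_group / i_mb_last_start. */
struct inode_hint {
	group_t last_group;
	grpblk_t last_start;
};

/*
 * Choose the stream-allocation goal for one inode. The caller's goal is
 * kept unless this inode has recorded an earlier allocation. A hint of
 * (0, 0) is treated as "no hint yet", mirroring the check in the patch.
 */
static struct goal pick_stream_goal(const struct inode_hint *ino,
				    struct goal requested)
{
	assert(ino != NULL);
	if (ino->last_group || ino->last_start) {
		requested.group = ino->last_group;
		requested.start = ino->last_start;
	}
	return requested;
}

/* Remember where this inode last allocated; touches no shared state. */
static void record_allocation(struct inode_hint *ino, group_t group,
			      grpblk_t start)
{
	assert(ino != NULL);
	ino->last_group = group;
	ino->last_start = start;
}
```

Because each hint lives in its own inode, two inodes allocating
concurrently never read or write the same hint, which is why no
filesystem-wide lock is needed on this path.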

Performance test data follows:

CPU: HUAWEI Kunpeng 920
Memory: 480GB
Disk: 480GB SSD SATA 3.2
Test: Running will-it-scale/fallocate2 on 64 CPU-bound containers.
Observation: Average fallocate operations per container per second.

                      base     patched
mb_optimize_scan=0    6755     23280 (+244.6%)
mb_optimize_scan=1    4302     10430 (+142.4%)

Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/ext4.h    |  7 ++++---
 fs/ext4/mballoc.c | 20 +++++++++-----------
 fs/ext4/super.c   |  2 ++
 3 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9c665a620a46..16c14dd09df6 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1171,6 +1171,10 @@ struct ext4_inode_info {
 	__u32 i_csum_seed;
 
 	kprojid_t i_projid;
+
+	/* where last allocation was done - for stream allocation */
+	ext4_group_t i_mb_last_group;
+	ext4_grpblk_t i_mb_last_start;
 };
 
 /*
@@ -1603,9 +1607,6 @@ struct ext4_sb_info {
 	unsigned int s_mb_order2_reqs;
 	unsigned int s_mb_group_prealloc;
 	unsigned int s_max_dir_size_kb;
-	/* where last allocation was done - for stream allocation */
-	unsigned long s_mb_last_group;
-	unsigned long s_mb_last_start;
 	unsigned int s_mb_prefetch;
 	unsigned int s_mb_prefetch_limit;
 	unsigned int s_mb_best_avail_max_trim_order;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 5c13d9f8a1cc..ee9696f9bac8 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2138,7 +2138,6 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
 static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
 					struct ext4_buddy *e4b)
 {
-	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
 	int ret;
 
 	BUG_ON(ac->ac_b_ex.fe_group != e4b->bd_group);
@@ -2169,10 +2168,8 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
 	folio_get(ac->ac_buddy_folio);
 	/* store last allocated for subsequent stream allocation */
 	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
-		spin_lock(&sbi->s_md_lock);
-		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
-		sbi->s_mb_last_start = ac->ac_f_ex.fe_start;
-		spin_unlock(&sbi->s_md_lock);
+		EXT4_I(ac->ac_inode)->i_mb_last_group = ac->ac_f_ex.fe_group;
+		EXT4_I(ac->ac_inode)->i_mb_last_start = ac->ac_f_ex.fe_start;
 	}
 	/*
 	 * As we've just preallocated more space than
@@ -2844,13 +2841,14 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 							   MB_NUM_ORDERS(sb));
 	}
 
-	/* if stream allocation is enabled, use global goal */
+	/* if stream allocation is enabled, use last goal */
 	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
-		/* TBD: may be hot point */
-		spin_lock(&sbi->s_md_lock);
-		ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
-		ac->ac_g_ex.fe_start = sbi->s_mb_last_start;
-		spin_unlock(&sbi->s_md_lock);
+		struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
+
+		if (ei->i_mb_last_group || ei->i_mb_last_start) {
+			ac->ac_g_ex.fe_group = ei->i_mb_last_group;
+			ac->ac_g_ex.fe_start = ei->i_mb_last_start;
+		}
 	}
 
 	/*
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 181934499624..6c49c43bb2cb 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1416,6 +1416,8 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 	INIT_WORK(&ei->i_rsv_conversion_work, ext4_end_io_rsv_work);
 	ext4_fc_init_inode(&ei->vfs_inode);
 	mutex_init(&ei->i_fc_lock);
+	ei->i_mb_last_group = 0;
+	ei->i_mb_last_start = 0;
 	return &ei->vfs_inode;
 }
 
-- 
2.46.1

