From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: 
Date: Fri, 30 May 2025 16:20:44 +0800
X-Mailing-List: linux-ext4@vger.kernel.org
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH 1/4] ext4: add ext4_try_lock_group() to skip busy groups
To: Ojaswin Mujoo
CC: , , , , , , , , Baokun Li
References: <20250523085821.1329392-1-libaokun@huaweicloud.com>
 <20250523085821.1329392-2-libaokun@huaweicloud.com>
Content-Language: en-US
From: Baokun Li
In-Reply-To: 
Content-Type: text/plain; charset="UTF-8"; format=flowed
Content-Transfer-Encoding: 8bit

On 2025/5/28 23:05, Ojaswin Mujoo wrote:
> On Fri, May 23, 2025 at 04:58:18PM +0800, libaokun@huaweicloud.com wrote:
>> From: Baokun Li
>>
>> When ext4 allocates blocks, we used to simply walk the block groups one
>> by one looking for a suitable one. But when there are a huge number of
>> block groups (hundreds of thousands or even millions) and only a few of
>> them have free space (i.e. most are nearly full), checking them all takes
>> a long time and performance suffers. So we added the "mb_optimize_scan"
>> mount option (now on by default). It maintains several lists of candidate
>> groups, so when we need free blocks we can pick a promising group straight
>> from the appropriate list. This saves time and makes block allocation
>> much faster.
>>
>> But when multiple processes or containers are doing similar things, like
>> constantly allocating 8k blocks, they all try to use the same block group
>> in the same list. Even just two processes doing this can cut the IOPS in
>> half. For example, one container might do 300,000 IOPS, but running two
>> at the same time yields a total of only 150,000.
>>
>> Since block groups can already be scanned in a non-linear order, the first
>> and last groups in a list are effectively equivalent for finding free
>> blocks. Therefore, add an ext4_try_lock_group() helper function to skip
>> the current group when it is locked by another process, thereby avoiding
>> contention with other processes. This helps ext4 make better use of having
>> multiple block groups.
>>
>> Also, to make sure we don't skip all the groups that have free space
>> when allocating blocks, we no longer try to skip busy groups when
>> ac_criteria is CR_ANY_FREE.
>>
>> Performance test data follows:
>>
>> CPU: HUAWEI Kunpeng 920
>> Memory: 480GB
>> Disk: 480GB SSD SATA 3.2
>> Test: Running will-it-scale/fallocate2 on 64 CPU-bound containers.
>> Observation: Average fallocate operations per container per second.
>>
>>                        base        patched
>> mb_optimize_scan=0     3588        6755 (+88.2%)
>> mb_optimize_scan=1     3588        4302 (+19.8%)
> The patch looks mostly good. Same observations about mb_optimize_scan=1
> improving less. We can continue this discussion in my reply to the cover
> letter. That being said, I have some minor suggestions:

Thanks for the review!

>
>> Signed-off-by: Baokun Li
>> ---
>>  fs/ext4/ext4.h    | 23 ++++++++++++++---------
>>  fs/ext4/mballoc.c | 14 +++++++++++---
>>  2 files changed, 25 insertions(+), 12 deletions(-)
>>
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index 5a20e9cd7184..9c665a620a46 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -3494,23 +3494,28 @@ static inline int ext4_fs_is_busy(struct ext4_sb_info *sbi)
>>  	return (atomic_read(&sbi->s_lock_busy) > EXT4_CONTENTION_THRESHOLD);
>>  }
>>
>> +static inline bool ext4_try_lock_group(struct super_block *sb, ext4_group_t group)
>> +{
>> +	if (!spin_trylock(ext4_group_lock_ptr(sb, group)))
>> +		return false;
>> +	/*
>> +	 * We're able to grab the lock right away, so drop the lock
>> +	 * contention counter.
>> +	 */
>> +	atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, -1, 0);
>> +	return true;
>> +}
>> +
>>  static inline void ext4_lock_group(struct super_block *sb, ext4_group_t group)
>>  {
>> -	spinlock_t *lock = ext4_group_lock_ptr(sb, group);
>> -	if (spin_trylock(lock))
>> -		/*
>> -		 * We're able to grab the lock right away, so drop the
>> -		 * lock contention counter.
>> -		 */
>> -		atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, -1, 0);
>> -	else {
>> +	if (!ext4_try_lock_group(sb, group)) {
>>  		/*
>>  		 * The lock is busy, so bump the contention counter,
>>  		 * and then wait on the spin lock.
>>  		 */
>>  		atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, 1,
>>  				  EXT4_MAX_CONTENTION);
>> -		spin_lock(lock);
>> +		spin_lock(ext4_group_lock_ptr(sb, group));
>>  	}
>>  }
>>
>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>> index 1e98c5be4e0a..5c13d9f8a1cc 100644
>> --- a/fs/ext4/mballoc.c
>> +++ b/fs/ext4/mballoc.c
>> @@ -896,7 +896,8 @@ static void ext4_mb_choose_next_group_p2_aligned(struct ext4_allocation_context
>>  			    bb_largest_free_order_node) {
>>  		if (sbi->s_mb_stats)
>>  			atomic64_inc(&sbi->s_bal_cX_groups_considered[CR_POWER2_ALIGNED]);
>> -		if (likely(ext4_mb_good_group(ac, iter->bb_group, CR_POWER2_ALIGNED))) {
>> +		if (likely(ext4_mb_good_group(ac, iter->bb_group, CR_POWER2_ALIGNED)) &&
>> +		    !spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group))) {
> Maybe reversing the && order to be (!spin_is_locked() &&
> ext4_mb_good_group()) would be better?

Yeah.

>>  			*group = iter->bb_group;
>>  			ac->ac_flags |= EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED;
>>  			read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
>> @@ -932,7 +933,8 @@ ext4_mb_find_good_group_avg_frag_lists(struct ext4_allocation_context *ac, int o
>>  	list_for_each_entry(iter, frag_list, bb_avg_fragment_size_node) {
>>  		if (sbi->s_mb_stats)
>>  			atomic64_inc(&sbi->s_bal_cX_groups_considered[cr]);
>> -		if (likely(ext4_mb_good_group(ac, iter->bb_group, cr))) {
>> +		if (likely(ext4_mb_good_group(ac, iter->bb_group, cr)) &&
>> +		    !spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group))) {
> same as above

Okay.

>
>>  			grp = iter;
>>  			break;
>>  		}
>> @@ -2911,7 +2913,13 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>>  		if (err)
>>  			goto out;
>>
>> -		ext4_lock_group(sb, group);
>> +		/* skip busy group */
>> +		if (cr >= CR_ANY_FREE) {
>> +			ext4_lock_group(sb, group);
>> +		} else if (!ext4_try_lock_group(sb, group)) {
>> +			ext4_mb_unload_buddy(&e4b);
>> +			continue;
>> +		}
> This in itself looks good. I am just thinking that now that we are
> deciding to skip locked groups, in the code above this one, shall we do
> something like:
>
> 	if (spin_is_locked(group_lock))
> 		continue;
>
> 	err = ext4_mb_load_buddy(sb, group, &e4b);
> 	if (err)
> 		goto out;
>
> 	/* skip busy group */
> 	if (cr >= CR_ANY_FREE) {
> 		ext4_lock_group(sb, group);
> 	} else if (!ext4_try_lock_group(sb, group)) {
> 		ext4_mb_unload_buddy(&e4b);
> 		continue;
> 	}
>
> With this we can even avoid loading the folio as well.

I previously assumed that for busy groups the buddy was already loaded, so
reloading it would incur minimal overhead. However, I was mistaken. After
making this change, the share of time spent in ext4_mb_load_buddy()
decreased from 3.6% to 1.7%, giving roughly a 2% performance improvement.

Thank you for your suggestion! I will prevent unnecessary buddy loading in
the next version.

Cheers,
Baokun
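P.S. A rough, untested sketch (not the actual v2 patch) of how the
group-selection step in ext4_mb_regular_allocator() could look once the
early busy check is folded in. It only reuses names visible in the hunks
above (sb, group, cr, e4b, ext4_group_lock_ptr(), ext4_try_lock_group(),
ext4_mb_load_buddy(), ext4_mb_unload_buddy()); gating the early check on
cr < CR_ANY_FREE is an assumption on my part, so that groups are still
never skipped once we reach CR_ANY_FREE:

		/*
		 * Sketch only: if the group lock is visibly held and we are
		 * still allowed to skip busy groups, move on before paying
		 * for the buddy load.  spin_is_locked() is just a racy hint;
		 * ext4_try_lock_group() below remains the real check.
		 * (Gating on cr here is an assumption, not part of the
		 * quoted suggestion.)
		 */
		if (cr < CR_ANY_FREE &&
		    spin_is_locked(ext4_group_lock_ptr(sb, group)))
			continue;

		err = ext4_mb_load_buddy(sb, group, &e4b);
		if (err)
			goto out;

		/* skip busy group */
		if (cr >= CR_ANY_FREE) {
			/* Last-resort pass: wait for the lock rather than skip. */
			ext4_lock_group(sb, group);
		} else if (!ext4_try_lock_group(sb, group)) {
			/* Lost the race after loading the buddy; drop it and move on. */
			ext4_mb_unload_buddy(&e4b);
			continue;
		}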