Re: [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group

From: Baokun Li <libaokun1@huawei.com>
To: Jan Kara <jack@suse.cz>
Cc: <linux-ext4@vger.kernel.org>, <tytso@mit.edu>,
	<adilger.kernel@dilger.ca>, <ojaswin@linux.ibm.com>,
	<linux-kernel@vger.kernel.org>, <yi.zhang@huawei.com>,
	<yangerkun@huawei.com>, Baokun Li <libaokun1@huawei.com>
Subject: Re: [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group
Date: Mon, 30 Jun 2025 17:21:48 +0800
Message-ID: <db4b9d71-c34d-4315-a87d-2edf3bbaff2d@huawei.com>
In-Reply-To: <mfybwoygcycblgaln2j4et4zmyzli2zibcgvixysanugjjhhh5@xyzoc4juy4wv>

On 2025/6/30 15:47, Jan Kara wrote:
> On Mon 30-06-25 11:48:20, Baokun Li wrote:
>> On 2025/6/28 2:19, Jan Kara wrote:
>>> On Mon 23-06-25 15:32:51, Baokun Li wrote:
>>>> After we optimized the block group lock, we found another lock
>>>> contention issue when running will-it-scale/fallocate2 with multiple
>>>> processes. The fallocate's block allocation and the truncate's block
>>>> release were fighting over the s_md_lock. The problem is, this lock
>>>> protects totally different things in those two processes: the list of
>>>> freed data blocks (s_freed_data_list) when releasing, and where to start
>>>> looking for new blocks (s_mb_last_group) when allocating.
>>>>
>>>> Now we only need to track s_mb_last_group and no longer need to track
>>>> s_mb_last_start, so we don't need the s_md_lock lock to ensure that the
>>>> two are consistent, and we can ensure that the s_mb_last_group read is up
>>>> to date by using smp_store_release/smp_load_acquire.
>>>>
>>>> Besides, the s_mb_last_group data type only requires ext4_group_t
>>>> (i.e., unsigned int), rendering unsigned long superfluous.
>>>>
>>>> Performance test data follows:
>>>>
>>>> Test: Running will-it-scale/fallocate2 on CPU-bound containers.
>>>> Observation: Average fallocate operations per container per second.
>>>>
>>>>                    | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
>>>>  Disk: 960GB SSD   |-------------------------|-------------------------|
>>>>                    | base  |    patched      | base  |    patched      |
>>>> -------------------|-------|-----------------|-------|-----------------|
>>>> mb_optimize_scan=0 | 4821  | 7612  (+57.8%)  | 15371 | 21647 (+40.8%)  |
>>>> mb_optimize_scan=1 | 4784  | 7568  (+58.1%)  | 6101  | 9117  (+49.4%)  |
>>>>
>>>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
>>> ...
>>>
>>>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>>>> index 5cdae3bda072..3f103919868b 100644
>>>> --- a/fs/ext4/mballoc.c
>>>> +++ b/fs/ext4/mballoc.c
>>>> @@ -2168,11 +2168,9 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
>>>>    	ac->ac_buddy_folio = e4b->bd_buddy_folio;
>>>>    	folio_get(ac->ac_buddy_folio);
>>>>    	/* store last allocated for subsequent stream allocation */
>>>> -	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
>>>> -		spin_lock(&sbi->s_md_lock);
>>>> -		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
>>>> -		spin_unlock(&sbi->s_md_lock);
>>>> -	}
>>>> +	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
>>>> +		/* pairs with smp_load_acquire in ext4_mb_regular_allocator() */
>>>> +		smp_store_release(&sbi->s_mb_last_group, ac->ac_f_ex.fe_group);
>>> Do you really need any kind of barrier (implied by smp_store_release())
>>> here? I mean the store to s_mb_last_group is perfectly fine to be reordered
>>> with other accesses from the thread, isn't it? As such it should be enough
>>> to have WRITE_ONCE() here...
>> WRITE_ONCE()/READ_ONCE() primarily prevent compiler reordering and ensure
>> that variable reads/writes access values directly from L1/L2 cache rather
>> than registers.
> I agree READ_ONCE() / WRITE_ONCE() are about compiler optimizations - in
> particular they force the compiler to read / write the memory location
> exactly once instead of reading it potentially multiple times in different
> parts of expression and getting inconsistent values, or possibly writing
> the value say byte by byte (yes, that would be insane but not contrary to
> the C standard).

READ_ONCE() and WRITE_ONCE() rely on the volatile keyword, which serves
two main purposes:

1. It tells the compiler that the variable's value can change unexpectedly,
    preventing the compiler from making incorrect optimizations based on
    assumptions about its stability.

2. It forces the compiler to emit a real load or store to the variable's
    memory address for every access. This means the value will be fetched
    from cache (L1/L2) if available, or from main memory otherwise, rather
    than reusing a stale value kept in a CPU register.
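
As a rough illustration of the two points above, here is a minimal
userspace sketch. MY_READ_ONCE()/MY_WRITE_ONCE(), last_group and the two
helper functions are made-up names for this example, not the kernel's
definitions; the macros only approximate READ_ONCE()/WRITE_ONCE() with a
volatile cast, and __typeof__ assumes GCC or Clang.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the kernel macros: a volatile-qualified
 * access makes the compiler emit the load/store instead of reusing a
 * value it already holds in a register. */
#define MY_READ_ONCE(x)      (*(const volatile __typeof__(x) *)&(x))
#define MY_WRITE_ONCE(x, v)  (*(volatile __typeof__(x) *)&(x) = (v))

/* Stands in for sbi->s_mb_last_group in this sketch. */
uint32_t last_group;

/* Writer side: record the group of the last successful allocation. */
void record_last_group(uint32_t group)
{
	MY_WRITE_ONCE(last_group, group);
}

/* Reader side: take one snapshot and use only that snapshot, so the
 * range check and the returned value cannot disagree even if another
 * thread updates last_group in between. */
uint32_t pick_start_group(uint32_t ngroups)
{
	uint32_t start = MY_READ_ONCE(last_group);

	return start < ngroups ? start : 0;
}
```

For example, after record_last_group(7), pick_start_group(16) returns 7,
while an out-of-range value falls back to group 0.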

>> They do not guarantee that other CPUs see the latest values. Reading stale
>> values could lead to more useless traversals, which might incur higher
>> overhead than memory barriers. This is why we use memory barriers to ensure
>> the latest values are read.
> But smp_load_acquire() / smp_store_release() have no guarantee about CPU
> seeing latest values either. They are just speculation barriers meaning
> they prevent the CPU from reordering accesses in the code after
> smp_load_acquire() to be performed before the smp_load_acquire() is
> executed and similarly with smp_store_release(). So I dare to say that
> these barriers have no (positive) impact on the allocation performance and
> just complicate the code - but if you have some data that show otherwise,
> I'd be happy to be proven wrong.

smp_load_acquire() / smp_store_release() guarantee that CPUs read the
latest data.

For example, imagine a variable a = 0, with both CPU0 and CPU1 having
a=0 in their caches.

Without a memory barrier:
When CPU0 executes WRITE_ONCE(a, 1), a=1 is written to the store buffer,
an RFO is broadcast, and CPU0 continues other tasks. After receiving ACKs,
a=1 is written to main memory and becomes visible to other CPUs.
Then, if CPU1 executes READ_ONCE(a), it receives the RFO and adds it to
its invalidation queue. However, it might not process it immediately;
instead, it could perform the read first, potentially still reading a=0
from its cache.

With a memory barrier:
When CPU0 executes smp_store_release(&a, 1), a=1 is not only written to
the store buffer, but data in the store buffer is also written to main
memory. An RFO is then broadcast, and CPU0 waits for ACKs from all CPUs.

When CPU1 executes smp_load_acquire(&a), it receives the RFO and adds it
to its invalidation queue. Here, the invalidation queue is flushed, which
invalidates a in CPU1's cache. CPU1 then replies with an ACK, and when it
performs the read, its cache is invalid, so it reads the latest a=1 from
main memory.
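
To make the store/load pairing concrete, here is a minimal userspace
sketch that uses C11 atomics as stand-ins for smp_store_release() and
smp_load_acquire(). The names payload, ready, producer() and consumer()
are made up for illustration and are not part of the patch. In the C11
model, what this pairing provides is ordering: a consumer that observes
ready == true is also guaranteed to observe the earlier plain write to
payload.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical example data, published once by the producer. */
static int payload;
static atomic_bool ready;

/* Writer: roughly analogous to smp_store_release(&ready, true). */
void producer(int value)
{
	payload = value;	/* ordinary store */
	/* Release: the store to payload cannot be reordered after this. */
	atomic_store_explicit(&ready, true, memory_order_release);
}

/* Reader: roughly analogous to smp_load_acquire(&ready). */
bool consumer(int *out)
{
	/* Acquire: the load of payload cannot be reordered before this. */
	if (!atomic_load_explicit(&ready, memory_order_acquire))
		return false;	/* nothing published yet */

	*out = payload;
	return true;
}
```

Called from a single thread, consumer() returns false until producer(42)
has run, and afterwards returns true with *out set to 42.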

This is a general overview. Please let me know if I've missed anything.


Thanks,
Baokun
