public inbox for linux-ext4@vger.kernel.org
* [PATCH 0/4] ext4: better scalability for ext4 block allocation
@ 2025-05-23  8:58 libaokun
  2025-05-23  8:58 ` [PATCH 1/4] ext4: add ext4_try_lock_group() to skip busy groups libaokun
                   ` (4 more replies)
  0 siblings, 5 replies; 22+ messages in thread
From: libaokun @ 2025-05-23  8:58 UTC
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, linux-kernel, yi.zhang, yangerkun,
	libaokun1, libaokun

From: Baokun Li <libaokun1@huawei.com>

Since servers have more and more CPUs, and we're running more containers
on them, we've been using will-it-scale to test how well ext4 scales. The
fallocate2 test (append 8KB to 1MB, truncate to 0, repeat) run concurrently
on 64 containers revealed significant contention in block allocation/free,
leading to much lower aggregate fallocate OPS compared to a single
container (see below).

   1   |    2   |    4   |    8   |   16   |   32   |   64
-------|--------|--------|--------|--------|--------|-------
295287 | 70665  | 33865  | 19387  | 10104  |  5588  |  3588

The main bottleneck was ext4_lock_group(), which block allocation and block
freeing both contended for. The block group for a free is fixed and cannot
be optimized, but the block group for an allocation is selectable.
Consequently, the ext4_try_lock_group() helper was added so that allocation
can skip busy groups instead of waiting on them; see Patch 1 for details.
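
To illustrate the idea of skipping busy groups, here is a minimal
user-space sketch. It is not the code from Patch 1: the demo_* names, the
struct layout, and the use of a C11 atomic_flag as a stand-in for the
group lock are all assumptions made for the example.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a block group: a lock plus a free-block count. */
struct demo_group {
	atomic_flag lock;
	unsigned int free_blocks;
};

/* Returns 1 if the lock was taken, 0 if someone else already holds it. */
static int demo_trylock(struct demo_group *grp)
{
	/* test_and_set returns the previous value: false means we took it */
	return !atomic_flag_test_and_set_explicit(&grp->lock,
						  memory_order_acquire);
}

static void demo_unlock(struct demo_group *grp)
{
	atomic_flag_clear_explicit(&grp->lock, memory_order_release);
}

/*
 * Scan all groups once, starting at 'start', and skip any group whose lock
 * is currently held rather than waiting for it. Returns the index of the
 * group the blocks were taken from, or -1 if no unlocked group had room.
 */
static int demo_alloc_skip_busy(struct demo_group *groups, size_t ngroups,
				size_t start, unsigned int want)
{
	for (size_t i = 0; i < ngroups; i++) {
		size_t g = (start + i) % ngroups;

		if (!demo_trylock(&groups[g]))
			continue;	/* busy: try the next group */

		if (groups[g].free_blocks >= want) {
			groups[g].free_blocks -= want;
			demo_unlock(&groups[g]);
			return (int)g;
		}
		demo_unlock(&groups[g]);
	}
	return -1;
}
```

In this sketch a contended group is simply passed over, so a caller never
spins on a lock another allocator holds. The trade-off is that the blocks
may come from a group other than the preferred one.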

After we fixed the ext4_lock_group bottleneck, another one showed up:
s_md_lock. This lock protects different data when allocating and freeing
blocks. We got rid of the s_md_lock call in block allocation by making
stream allocation work per inode instead of globally. You can find more
details in Patch 2.
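
As a rough picture of the per-inode approach, the sketch below keeps the
last stream-allocation position in per-file state, so reading or updating
it needs no filesystem-wide lock. It is a simplified illustration, not the
code from Patch 2: the demo_* names and fields are hypothetical, and the
synchronization a real implementation would need for concurrent writers to
the same file is left out.

```c
#include <stdint.h>

/* Hypothetical per-file state: position of the last stream allocation. */
struct demo_inode {
	uint32_t last_group;
	uint32_t last_start;
};

/* Goal handed to the allocator for the next stream allocation. */
struct demo_goal {
	uint32_t group;
	uint32_t start;
};

/* Read the goal from the file's own state; no global lock is involved. */
static struct demo_goal demo_stream_goal(const struct demo_inode *inode)
{
	struct demo_goal goal = { inode->last_group, inode->last_start };

	return goal;
}

/* Remember where this file's most recent allocation landed. */
static void demo_record_alloc(struct demo_inode *inode, uint32_t group,
			      uint32_t start)
{
	inode->last_group = group;
	inode->last_start = start;
}
```

Because each file tracks its own position, allocations for different files
no longer touch a shared goal, which is the contention point described
above.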

Patches 3 and 4 are just some minor cleanups.

Performance test data follows:

CPU: HUAWEI Kunpeng 920
Memory: 480GB
Disk: 480GB SSD SATA 3.2
Test: Running will-it-scale/fallocate2 on 64 CPU-bound containers.
Observation: Average fallocate operations per container per second.

|--------|--------|--------|--------|--------|--------|--------|--------|
|    -   |    1   |    2   |    4   |    8   |   16   |   32   |   64   |
|--------|--------|--------|--------|--------|--------|--------|--------|
|  base  | 295287 | 70665  | 33865  | 19387  | 10104  |  5588  |  3588  |
|--------|--------|--------|--------|--------|--------|--------|--------|
| linear | 286328 | 123102 | 119542 | 90653  | 60344  | 35302  | 23280  |
|        | -3.0%  | 74.2%  | 252.9% | 367.5% | 497.2% | 531.6% | 548.7% |
|--------|--------|--------|--------|--------|--------|--------|--------|
|mb_optim| 292498 | 133305 | 103069 | 61727  | 29702  | 16845  | 10430  |
|ize_scan| -0.9%  | 88.6%  | 204.3% | 218.3% | 193.9% | 201.4% | 190.6% |
|--------|--------|--------|--------|--------|--------|--------|--------|
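
The percentage rows are the relative change against the base row in the
same column. A small helper makes the arithmetic explicit; the function
name is made up for this example.

```c
/* Relative change of 'value' against 'base', in percent. */
static double demo_pct_change(double base, double value)
{
	return (value - base) / base * 100.0;
}
```

For the 2-container column of the linear row, demo_pct_change(70665, 123102)
comes to about 74.2, matching the table.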

Running "kvm-xfstests -c ext4/all -g auto" showed that 1k/generic/347 often
fails. The test seems to think that a dm-thin device with a virtual size of
5000M but a real size of 500M, after being formatted as ext4, would have
500M free.

But it doesn't. We run out of space after creating about 430 1M files.
Since the block size is 1k, creating that many files turns on dir_index.
dm-thin waits a minute, sees no free space, and then returns an I/O error.
This can cause a directory index block write to fail and abort the journal.

What's worse, _dmthin_check_fs doesn't replay the journal, so fsck finds
inconsistencies and the test fails. I think the problem is with the test
itself, and I'll send a patch to fix it soon.

Comments and questions are, as always, welcome.

Thanks,
Baokun


Baokun Li (4):
  ext4: add ext4_try_lock_group() to skip busy groups
  ext4: move mb_last_[group|start] to ext4_inode_info
  ext4: get rid of some obsolete EXT4_MB_HINT flags
  ext4: fix typo in CR_GOAL_LEN_SLOW comment

 fs/ext4/ext4.h              | 38 ++++++++++++++++++-------------------
 fs/ext4/mballoc.c           | 34 +++++++++++++++++++--------------
 fs/ext4/super.c             |  2 ++
 include/trace/events/ext4.h |  3 ---
 4 files changed, 41 insertions(+), 36 deletions(-)

-- 
2.46.1



end of thread, other threads:[~2025-06-12 11:30 UTC | newest]

Thread overview: 22+ messages
2025-05-23  8:58 [PATCH 0/4] ext4: better scalability for ext4 block allocation libaokun
2025-05-23  8:58 ` [PATCH 1/4] ext4: add ext4_try_lock_group() to skip busy groups libaokun
2025-05-28 15:05   ` Ojaswin Mujoo
2025-05-30  8:20     ` Baokun Li
2025-06-10 12:07       ` Ojaswin Mujoo
2025-05-23  8:58 ` [PATCH 2/4] ext4: move mb_last_[group|start] to ext4_inode_info libaokun
2025-05-29 12:56   ` Jan Kara
2025-05-30  9:31     ` Baokun Li
2025-06-02 15:44       ` Jan Kara
2025-06-04  8:13         ` Baokun Li
2025-05-23  8:58 ` [PATCH 3/4] ext4: get rid of some obsolete EXT4_MB_HINT flags libaokun
2025-05-28 15:10   ` Ojaswin Mujoo
2025-05-29 12:57   ` Jan Kara
2025-05-23  8:58 ` [PATCH 4/4] ext4: fix typo in CR_GOAL_LEN_SLOW comment libaokun
2025-05-28 15:11   ` Ojaswin Mujoo
2025-05-29 12:57   ` Jan Kara
2025-05-28 14:53 ` [PATCH 0/4] ext4: better scalability for ext4 block allocation Ojaswin Mujoo
2025-05-29 12:24   ` Baokun Li
2025-06-10 12:06     ` Ojaswin Mujoo
2025-06-10 13:48       ` Baokun Li
2025-06-11  8:22         ` Ojaswin Mujoo
2025-06-12 11:30           ` Baokun Li
