From: Baokun Li
Date: Thu, 29 May 2025 20:24:14 +0800
Subject: Re: [PATCH 0/4] ext4: better scalability for ext4 block allocation
To: Ojaswin Mujoo
CC: Baokun Li
X-Mailing-List: linux-ext4@vger.kernel.org
References: <20250523085821.1329392-1-libaokun@huaweicloud.com>

On 2025/5/28 22:53, Ojaswin Mujoo wrote:
> On Fri, May 23, 2025 at 04:58:17PM +0800, libaokun@huaweicloud.com wrote:
>> From: Baokun Li
>>
>> Since servers have more and more CPUs, and we're running more containers
>> on them, we've been using will-it-scale to test how well ext4 scales. The
>> fallocate2 test (append 8KB to 1MB, truncate to 0, repeat) run concurrently
>> on 64 containers revealed significant contention in block allocation/free,
>> leading to much lower aggregate fallocate OPS compared to a single
>> container (see below).
>>
>>    1   |    2   |    4   |    8   |   16   |   32   |   64
>> -------|--------|--------|--------|--------|--------|-------
>> 295287 | 70665  | 33865  | 19387  | 10104  |  5588  |  3588
>>
>> The main bottleneck was the ext4_lock_group(), which both block allocation
>> and free fought over. While the block group for block free is fixed and
>> unoptimizable, the block group for allocation is selectable. Consequently,
>> the ext4_try_lock_group() helper function was added to avoid contention on
>> busy groups, and you can see more in Patch 1.
>>
>> After we fixed the ext4_lock_group bottleneck, another one showed up:
>> s_md_lock. This lock protects different data when allocating and freeing
>> blocks.
>> We got rid of the s_md_lock call in block allocation by making
>> stream allocation work per inode instead of globally. You can find more
>> details in Patch 2.
>>
>> Patches 3 and 4 are just some minor cleanups.
>>
>> Performance test data follows:
>>
>> CPU: HUAWEI Kunpeng 920
>> Memory: 480GB
>> Disk: 480GB SSD SATA 3.2
>> Test: Running will-it-scale/fallocate2 on 64 CPU-bound containers.
>> Observation: Average fallocate operations per container per second.
>> |--------|--------|--------|--------|--------|--------|--------|--------|
>> |    -   |    1   |    2   |    4   |    8   |   16   |   32   |   64   |
>> |--------|--------|--------|--------|--------|--------|--------|--------|
>> |  base  | 295287 | 70665  | 33865  | 19387  | 10104  |  5588  |  3588  |
>> |--------|--------|--------|--------|--------|--------|--------|--------|
>> | linear | 286328 | 123102 | 119542 | 90653  | 60344  | 35302  | 23280  |
>> |        | -3.0%  | 74.20% | 252.9% | 367.5% | 497.2% | 531.6% | 548.7% |
>> |--------|--------|--------|--------|--------|--------|--------|--------|
>> |mb_optim| 292498 | 133305 | 103069 | 61727  | 29702  | 16845  | 10430  |
>> |ize_scan| -0.9%  | 88.64% | 204.3% | 218.3% | 193.9% | 201.4% | 190.6% |
>> |--------|--------|--------|--------|--------|--------|--------|--------|
> Hey Baokun, nice improvements! The proposed changes make sense to me,
> however I suspect the performance improvements may come at a cost of
> slight increase in fragmentation, which might affect rotational disks
> especially. Maybe comparing e2freefrag numbers with and without the
> patches might give a better insight into this.

While this approach might slightly increase free space fragmentation on
the disk, it significantly reduces file fragmentation, leading to faster
read speeds on rotational disks.

When multiple processes contend for free blocks within the same block
group, the probability of blocks allocated by the same process being
merged on consecutive allocations is low.
This is because other processes may preempt the free blocks immediately
following the current process's last allocated region.

Normally, we rely on preallocation to avoid files becoming overly
fragmented (even though preallocation itself can cause fragmentation in
free disk space). But since fallocate doesn't support preallocation, the
fragmentation issue is more pronounced. Counterintuitively, skipping busy
groups actually boosts opportunities for file extent merging, which in
turn reduces overall file fragmentation.

Referencing will-it-scale/fallocate2, I tested 64 processes each appending
4KB via fallocate to 64 separate files until they reached 1GB. This test
specifically examines contention in block allocation; unlike fallocate2,
it omits the contention between fallocate and truncate. Preliminary
results are provided below; detailed scripts and full test outcomes are
included at the end of this email.

----------------------------------------------------------
                     |       base      |      patched    |
---------------------|--------|--------|--------|--------|
mb_optimize_scan     | linear |opt_scan| linear |opt_scan|
---------------------|--------|--------|--------|--------|
bw(MiB/s)            | 217    | 219    | 5685   | 5670   |
Avg. free extent size| 1943732| 1943728| 1439608| 1368328|
Avg. extents per file| 261879 | 262039 | 744    | 2084   |
Avg. size per extent | 4      | 4      | 1408   | 503    |
Fragmentation score  | 100    | 100    | 2      | 6      |
----------------------------------------------------------

> Regardless the performance benefits are significant and I feel it is
> good to have these patches.
>
> I'll give my reviews individually as I'm still going through patch 2
> However, I wanted to check on a couple things:

Okay, thank you for your feedback.

>
> 1. I believe you ran these in docker. Would you have any script etc open
>    sourced that I can use to run some benchmarks on my end (and also
>    understand your test setup).
Yes, these two patches primarily mitigate contention between block
allocations and between block allocation and release. The testing script
can be referenced from the fio script mentioned earlier in the email
footer. You can also add more truncate calls based on it.

> 2. I notice we are getting way less throughput in mb_optimize_scan? I
>    wonder why that is the case. Do you have some data on that? Are your
>    tests starting on an empty FS, maybe in that case linear scan works a
>    bit better since almost all groups are empty. If so, what are the
>    numbers like when we start with a fragmented FS?

The throughput of mb_optimize_scan is indeed much lower, and we continue
to optimize it, as mb_optimize_scan is the default mount option and
performs better in scenarios with large volume disks and high space usage.

Disk space used is about 7%; mb_optimize_scan should perform better with
less free space. However, this isn't the critical factor. The poor
throughput here is due to the following reasons.

One reason is that mb_optimize_scan's list traversal is unordered and
always selects the first group.

While traversing the list, holding a spin_lock prevents load_buddy, making
direct use of ext4_lock_group impossible. This can lead to a "bouncing"
scenario where spin_is_locked(grp_A) succeeds, but ext4_try_lock_group()
fails, forcing the list traversal to repeatedly restart from grp_A.

In contrast, linear traversal directly uses ext4_try_lock_group(),
avoiding this bouncing. Therefore, we need a lockless, ordered traversal
to achieve linear-like efficiency.

Another reason is that opt_scan tends to allocate from groups that have
just received freed blocks, causing it to constantly "jump around"
between certain groups.

This leads to intense contention between allocation and release, and even
between release events. In contrast, linear traversal always moves forward
without revisiting groups, resulting in less contention between allocation
and release.
However, because linear involves more groups in allocation, journal
becomes a bottleneck. If opt_scan first attempts to traverse block groups
to the right from the target group in all lists, and then from index 0 to
the left in all lists, competition between block groups would be
significantly reduced.

To enable ordered traversal, we attempted to convert list_head to an
ordered xarray. This ordering prevents "bouncing" during retries.
Additionally, traversing all right-side groups before left-side groups
significantly reduced contention. Performance improved from 10430 to 17730.

However, xarray traversal introduces overhead; list_head group selection
was O(1), while xarray becomes O(n log n). This results in a ~10%
performance drop in single-process scenarios, and I'm not entirely sure if
this trade-off is worthwhile. 🤔

Additionally, by attempting to merge before inserting in
ext4_mb_free_metadata(), we can eliminate contention on sbi->s_md_lock
during merges, resulting in roughly a 5% performance gain.

>
>    - Or maybe it is that the lazyinit thread has not yet initialized all
>      the buddies yet which means we have lesser BGs in the freefrag list
>      or the order list used by faster CRs. Hence, if they are locked we
>      are falling more to CR_GOAL_LEN_SLOW. To check if this is the case,
>      one hack is to cat /proc/fs/ext4//mb_groups (or something along
>      the lines) before the benchmark, which forces init of all the group
>      buddies thus populating all the lists used by mb_opt_scan. Maybe we
>      can check if this gives better results.

All groups are already initialized at the time of testing, and that's not
where the problem lies.

>
> 3. Also, how much IO are we doing here, are we filling the whole FS?
>

In a single container, create a file, then repeatedly append 8KB using
fallocate until the file reaches 1MB. After that, truncate the file to 0
and continue appending 8KB with fallocate.
The 64 containers will occupy a maximum of 64MB of disk space in total,
so they won't fill the entire file system.


Cheers,
Baokun


======================== test script ========================

#!/bin/bash

dir="/tmp/test"
disk="/dev/sda"
numjobs=64
iodepth=128

mkdir -p $dir

for scan in 0 1 ; do
    mkfs.ext4 -F -E lazy_itable_init=0,lazy_journal_init=0 -O orphan_file $disk
    mount -o mb_optimize_scan=$scan $disk $dir

    fio -directory=$dir -direct=1 -iodepth ${iodepth} -thread -rw=write -ioengine=falloc -bs=4k -fallocate=none \
        -size=1G -numjobs=${numjobs} -group_reporting -name=job1 -cpus_allowed_policy=split -file_append=1

    e2freefrag $disk
    e4defrag -c $dir # ** NOTE ** Without the patch, this could take 5-6 hours.
    filefrag ${dir}/job* | awk '{print $2}' | awk '{sum+=$1} END {print sum/NR}'
    umount $dir
done

======================== test results ========================

----------------------------------------------------------
## base

------------------------
### linear bw=217MiB/s (228MB/s)
------------ e2freefrag /dev/sda
Device: /dev/sda
Blocksize: 4096 bytes
Total blocks: 52428800
Free blocks: 34501259 (65.8%)

Min. free extent: 98172 KB
Max. free extent: 2064256 KB
Avg. free extent: 1943732 KB
Num. free extent: 71

HISTOGRAM OF FREE EXTENT SIZES:
Extent Size Range :  Free extents   Free Blocks  Percent
   64M...  128M-  :             2         49087    0.14%
  512M... 1024M-  :             3        646918    1.88%
    1G...    2G-  :            66      33805254   97.98%
------------ e4defrag -c /tmp/test
e4defrag 1.47.2 (1-Jan-2025)
                                            now/best       size/ext
1. /tmp/test/job1.4.0                       262035/1              4 KB
2. /tmp/test/job1.2.0                       262034/1              4 KB
3. /tmp/test/job1.44.0                      262026/1              4 KB
4. /tmp/test/job1.15.0                      262025/1              4 KB
5. /tmp/test/job1.12.0                      262025/1              4 KB

 Total/best extents                             16760234/64
 Average size per extent                        4 KB
 Fragmentation score                            100
 [0-30 no problem: 31-55 a little bit fragmented: 56- needs defrag]
 This directory (/tmp/test) needs defragmentation.
 Done.
------------ filefrag /tmp/test/job* | awk '{print $2}' | awk '{sum+=$1} END {print sum/NR}'
261879

------------------------
### opt_scan  bw=219MiB/s (230MB/s)
------------ e2freefrag /dev/sda
Device: /dev/sda
Blocksize: 4096 bytes
Total blocks: 52428800
Free blocks: 34501238 (65.8%)

Min. free extent: 98172 KB
Max. free extent: 2064256 KB
Avg. free extent: 1943728 KB
Num. free extent: 71

HISTOGRAM OF FREE EXTENT SIZES:
Extent Size Range :  Free extents   Free Blocks  Percent
   64M...  128M-  :             2         49087    0.14%
  512M... 1024M-  :             3        646897    1.87%
    1G...    2G-  :            66      33805254   97.98%
------------ e4defrag -c /tmp/test
e4defrag 1.47.2 (1-Jan-2025)
                                            now/best       size/ext
1. /tmp/test/job1.57.0                      262084/1              4 KB
2. /tmp/test/job1.35.0                      262081/1              4 KB
3. /tmp/test/job1.45.0                      262080/1              4 KB
4. /tmp/test/job1.25.0                      262078/1              4 KB
5. /tmp/test/job1.11.0                      262077/1              4 KB

 Total/best extents                             16770469/64
 Average size per extent                        4 KB
 Fragmentation score                            100
 [0-30 no problem: 31-55 a little bit fragmented: 56- needs defrag]
 This directory (/tmp/test) needs defragmentation.
 Done.
------------ filefrag /tmp/test/job* | awk '{print $2}' | awk '{sum+=$1} END {print sum/NR}'
262039

==========================================================================================

----------------------------------------------------------
## patched

------------------------
### linear bw=5685MiB/s (5962MB/s)
------------ e2freefrag /dev/sda
Device: /dev/sda
Blocksize: 4096 bytes
Total blocks: 52428800
Free blocks: 34550601 (65.9%)

Min. free extent: 8832 KB
Max. free extent: 2064256 KB
Avg. free extent: 1439608 KB
Num. free extent: 96

HISTOGRAM OF FREE EXTENT SIZES:
Extent Size Range :  Free extents   Free Blocks  Percent
    8M...   16M-  :             2          5267    0.02%
   32M...   64M-  :             9        129695    0.38%
   64M...  128M-  :            17        409917    1.19%
  512M... 1024M-  :             3        716532    2.07%
    1G...    2G-  :            65      33289190   96.35%
------------ e4defrag -c /tmp/test
e4defrag 1.47.2 (1-Jan-2025)
                                            now/best       size/ext
1. /tmp/test/job1.18.0                         984/1           1065 KB
2. /tmp/test/job1.37.0                         981/1           1068 KB
3. /tmp/test/job1.36.0                         980/1           1069 KB
4. /tmp/test/job1.27.0                         954/1           1099 KB
5. /tmp/test/job1.30.0                         954/1           1099 KB

 Total/best extents                             47629/64
 Average size per extent                        1408 KB
 Fragmentation score                            2
 [0-30 no problem: 31-55 a little bit fragmented: 56- needs defrag]
 This directory (/tmp/test) does not need defragmentation.
 Done.
------------ filefrag /tmp/test/job* | awk '{print $2}' | awk '{sum+=$1} END {print sum/NR}'
744.203

------------------------
### opt_scan  bw=5670MiB/s (5946MB/s)
------------ e2freefrag /dev/sda
Device: /dev/sda
Blocksize: 4096 bytes
Total blocks: 52428800
Free blocks: 34550296 (65.9%)

Min. free extent: 5452 KB
Max. free extent: 2064256 KB
Avg. free extent: 1368328 KB
Num. free extent: 101

HISTOGRAM OF FREE EXTENT SIZES:
Extent Size Range :  Free extents   Free Blocks  Percent
    4M...    8M-  :             4          5935    0.02%
    8M...   16M-  :             3          9929    0.03%
   16M...   32M-  :             4         21775    0.06%
   32M...   64M-  :            13        164831    0.48%
   64M...  128M-  :             9        189227    0.55%
  512M... 1024M-  :             2        457702    1.32%
    1G...    2G-  :            66      33700897   97.54%
------------ e4defrag -c /tmp/test
e4defrag 1.47.2 (1-Jan-2025)
                                            now/best       size/ext
1. /tmp/test/job1.43.0                        4539/1            231 KB
2. /tmp/test/job1.5.0                         4446/1            235 KB
3. /tmp/test/job1.14.0                        3851/1            272 KB
4. /tmp/test/job1.3.0                         3682/1            284 KB
5. /tmp/test/job1.50.0                        3597/1            291 KB

 Total/best extents                             133415/64
 Average size per extent                        503 KB
 Fragmentation score                            6
 [0-30 no problem: 31-55 a little bit fragmented: 56- needs defrag]
 This directory (/tmp/test) does not need defragmentation.
 Done.
------------ filefrag /tmp/test/job* | awk '{print $2}' | awk '{sum+=$1} END {print sum/NR}'
2084.61
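
======================== cross-check (sketch) ========================

As a rough sanity check on the results above, the "Average size per
extent" figures can be reproduced from the "Total/best extents" counts.
This is only a sketch: it assumes e4defrag derives the figure as total
data size divided by total extent count, truncated to whole KB, and the
helper name avg_extent_kb is hypothetical (it is not part of the test
script).

```shell
#!/bin/bash
# Hypothetical helper, not part of the original test script.
# Assumption: "Average size per extent" = total data size / total extents,
# truncated to whole KB.
avg_extent_kb() {
    local files=$1 file_kb=$2 total_extents=$3
    echo $(( files * file_kb / total_extents ))
}

# 64 fio jobs, each appending to one file until it reaches 1 GiB
# (1048576 KB). The extent totals are the "Total/best extents" values
# from the e4defrag output above.
for run in base/linear:16760234 base/opt_scan:16770469 \
           patched/linear:47629 patched/opt_scan:133415; do
    printf '%-17s %s KB\n' "${run%%:*}" \
        "$(avg_extent_kb 64 1048576 "${run##*:}")"
done
```

Under that assumption the four runs come out at 4, 4, 1408 and 503 KB,
which agrees with the e4defrag output for each configuration.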