From: bugzilla-daemon@kernel.org
To: linux-ext4@vger.kernel.org
Subject: [Bug 217965] ext4(?) regression since 6.5.0 on sata hdd
Date: Mon, 18 Aug 2025 13:48:04 +0000

https://bugzilla.kernel.org/show_bug.cgi?id=217965

mingyu.he (mingyu.he@shopee.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                CC |                            |mingyu.he@shopee.com

--- Comment #72 from mingyu.he (mingyu.he@shopee.com) ---
I have found the root cause of this problem.
In the previous discussion we did not find the exact root cause of this
problem, but Ojaswin Mujoo made an effective fallback solution. Although this
is somewhat an "old problem" by now, I think it is still worth reporting the
root cause, since it disturbed a lot of users back then. I will describe the
root cause first and then give an exact reproducer program for testing.

** Root Cause **

For newbies like me, we should know this first: RAID cuts all the logical
blocks into intervals, and the length of one interval is the stripe. You can
find it with `tune2fs -l /dev/sdX` or see it in the `mount` options. The
layout looks like this:

[stripe][stripe][stripe][stripe][stripe][stripe]
[ group size  ][ group size  ][ group size  ]

The function 'ext4_mb_scan_aligned' tries to find the start point of a RAID
interval inside a group, and then calculates the remaining block length of
that group. If the remaining length (which must be free and contiguous) is
not enough for a stripe, it returns, and the allocator moves on to the next
group:

[ stripe ][ stripe ]
                  |here, remainder < stripe
[    group size    ]

The core problem is the action of finding the next group. The code in
'ext4_mb_choose_next_group' changed with the series '[PATCH v2 00/12]
multiblock allocator improvements':

https://lore.kernel.org/all/cover.1685449706.git.ojaswin@linux.ibm.com/

The author changed the fragment-order RB tree into lists for better
performance. However, 'ext4_mb_find_good_group_avg_frag_lists' then returns
the same group every time, which makes the upper layer pass the same group to
'ext4_mb_scan_aligned', so the aligned scan always fails. The search only
ends when the loop variable 'i' (in ext4_mb_regular_allocator) reaches
ngroups.

Here are the stats I collected in 'ext4_mb_scan_aligned' ->
'mb_find_extent'. I set the stripe to 30000 to reproduce the problem more
easily.

```
Attaching 5 probes...
Tracing ext4_mb_regular_allocator... Hit Ctrl-C to stop.
find_ex, tid=22280, group_id=175542, block(i)=9744, needed(stripe)=30000, ret(max)=23024
find_ex, tid=22280, group_id=175543, block(i)=6976, needed(stripe)=30000, ret(max)=25792
find_ex, tid=22280, group_id=175544, block(i)=4208, needed(stripe)=30000, ret(max)=28560
find_ex, tid=22280, group_id=127, block(i)=8464, needed(stripe)=30000, ret(max)=24304
find_ex, tid=22280, group_id=127, block(i)=8464, needed(stripe)=30000, ret(max)=24304
find_ex, tid=22280, group_id=127, block(i)=8464, needed(stripe)=30000, ret(max)=24304
find_ex, tid=22280, group_id=127, block(i)=8464, needed(stripe)=30000, ret(max)=24304
find_ex, tid=22280, group_id=127, block(i)=8464, needed(stripe)=30000, ret(max)=24304
... countless same lines
```

The first few lines are the linear-choice section; the later ones are the
avg_frag_list selections.

Also, if you are in a multi-threaded environment, every thread's chooser will
probably find the same group, but ext4_mb_regular_allocator needs to take
that group's spin lock, which drives CPU usage up. (Thanks to Baokun Li, who
optimized the spin lock in the recent series 'ext4: better scalability for
ext4 block allocation'.)
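To see why that lock matters, here is a toy user-space sketch of my own (not
kernel code; the chooser, group count and iteration count are all invented;
build with gcc and -lpthread). When every writer thread is steered to the
same group, they all serialize on one spinlock instead of spreading across
the groups:

```
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NGROUPS  8
#define SCANS    (1 << 20)

static pthread_spinlock_t group_lock[NGROUPS];
static long scans[NGROUPS];

/* Like the broken list-head chooser: every thread gets the same group. */
static int choose_group(void) { return 0; }

static void *writer(void *arg)
{
    (void)arg;
    for (long i = 0; i < SCANS; i++) {
        int g = choose_group();
        pthread_spin_lock(&group_lock[g]);  /* all threads pile up here */
        scans[g]++;                         /* stand-in for the aligned scan */
        pthread_spin_unlock(&group_lock[g]);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (int g = 0; g < NGROUPS; g++)
        pthread_spin_init(&group_lock[g], PTHREAD_PROCESS_PRIVATE);
    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, writer, NULL);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    for (int g = 0; g < NGROUPS; g++)
        if (scans[g])
            printf("group %d was scanned %ld times\n", g, scans[g]);
    return 0;
}
```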
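Going back to the aligned scan itself, a small sketch (again my own
user-space model, using the constants from my test) shows why a 30000-block
stripe almost never fits: even in a completely free 32768-block group, the
first stripe-aligned block usually leaves fewer than 30000 blocks before the
group boundary, which is exactly the "remaining length is not enough" case
above:

```
#include <stdio.h>

#define GROUP_BLOCKS 32768L   /* default blocks per group */
#define STRIPE       30000L   /* stripe used in my trace */

int main(void)
{
    long fit = 0, total = 100;
    for (long g = 0; g < total; g++) {
        long start = g * GROUP_BLOCKS;
        /* first stripe-aligned block at or after the group start */
        long first = (start + STRIPE - 1) / STRIPE * STRIPE;
        if (first + STRIPE <= start + GROUP_BLOCKS)
            fit++;
        else if (g < 8)
            printf("group %ld: aligned start %ld leaves only %ld blocks\n",
                   g, first, start + GROUP_BLOCKS - first);
    }
    printf("%ld of %ld fully-free groups could host an aligned stripe\n",
           fit, total);
    return 0;
}
```

So even before the chooser bug, a large stripe already makes most groups
useless for the aligned scan; the bug then turns "try the next group" into
"retry the same group".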
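The chooser regression can be modeled the same way (a toy model with invented
helper names, not the kernel functions): with a list-head chooser that never
skips the group that just failed, the allocator retries one group until 'i'
reaches ngroups, exactly like the repeated group_id=127 lines above, whereas
an advancing chooser at least visits fresh groups:

```
#include <stdio.h>

#define NGROUPS 8

/* The aligned scan is stubbed to always fail, matching the
 * ret(max) < needed(stripe) lines in the trace. */
static int aligned_scan_ok(int group) { (void)group; return 0; }

/* 6.5-style: the head of the fragment-order list never changes,
 * so every retry gets the same group. */
static int choose_list_head(int prev) { (void)prev; return 3; }

/* 5.15-style: the tree walk moves past a group that just failed. */
static int choose_tree_next(int prev) { return (prev + 1) % NGROUPS; }

static void regular_allocator(const char *name, int (*choose)(int))
{
    int group = 0;
    for (int i = 0; i < NGROUPS; i++) { /* loop var 'i' as in
                                           ext4_mb_regular_allocator */
        group = choose(group);
        printf("%s: i=%d scanning group %d\n", name, i, group);
        if (aligned_scan_ok(group))
            return;
    }
    printf("%s: gave up only after i reached ngroups\n", name);
}

int main(void)
{
    regular_allocator("list-head", choose_list_head);
    regular_allocator("tree-walk", choose_tree_next);
    return 0;
}
```

With ngroups in the hundreds of thousands on a big disk, those wasted
retries, each taking the same group lock, are what showed up as kworker
spinning at 100% CPU.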
In the 5.15 kernel, by contrast, the RB tree works fine:

```
find_ex, tid=17292, group_id=603069, block(i)=25008, needed(strip)=30000, ret(max)=7760
...linear search
find_ex, tid=17292, group_id=603072, block(i)=16704, needed(strip)=30000, ret(max)=16064
find_ex, tid=17292, group_id=278911, block(i)=24352, needed(strip)=30000, ret(max)=8416
find_ex, tid=17292, group_id=279167, block(i)=5744, needed(strip)=30000, ret(max)=27024
...tree search
find_ex, tid=17292, group_id=280447, block(i)=2704, needed(strip)=30000, ret(max)=30064
```

And the latest version (6.17.0-rc2) also works fine (Baokun Li made some
optimizations to the allocator and the fragment-order lists). Note that to
observe the new data structure at work, I commented out the fallback logic.

```
find_ex, tid=31612, group_id=1423, block(i)=9984, needed(stripe)=32752, ret(max)=22784
find_ex, tid=31612, group_id=1425, block(i)=9952, needed(stripe)=32752, ret(max)=22816
find_ex, tid=31612, group_id=1426, block(i)=9936, needed(stripe)=32752, ret(max)=22832
find_ex, tid=31612, group_id=1427, block(i)=9920, needed(stripe)=32752, ret(max)=22848
find_ex, tid=31612, group_id=1428, block(i)=9904, needed(stripe)=32752, ret(max)=22864
find_ex, tid=31612, group_id=1429, block(i)=9888, needed(stripe)=32752, ret(max)=22880
find_ex, tid=31612, group_id=1430, block(i)=9872, needed(stripe)=32752, ret(max)=22896
find_ex, tid=31612, group_id=1431, block(i)=9856, needed(stripe)=32752, ret(max)=22912
find_ex, tid=31612, group_id=1432, block(i)=9840, needed(stripe)=32752, ret(max)=22928
find_ex, tid=31612, group_id=1433, block(i)=9824, needed(stripe)=32752, ret(max)=22944
find_ex, tid=31612, group_id=1434, block(i)=9808, needed(stripe)=32752, ret(max)=22960
find_ex, tid=31612, group_id=1435, block(i)=9792, needed(stripe)=32752, ret(max)=22976
```

** Reproduce Method **

Use a kernel before 6.8 (or delete the fallback logic), with ext4 on RAID.

# A higher stripe gives a higher probability. Note that the effective
# request stays below 32768 (the default number of blocks in a group),
# because the request length gets cut there.
mount -o remount,stripe=35000 /dev/sdX

# Request an allocation of 35000 blocks; it must be aligned to the stripe.
./test_C_program 35000

The C program (I deleted the error checking to keep it simple):

```
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>

#define BLOCK_SIZE 4096

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "Usage: %s <blocks>\n", argv[0]);
        return 1;
    }

    int blocks = atoi(argv[1]);
    off_t FILE_SIZE = (off_t)blocks * BLOCK_SIZE;

    /* Preallocate the whole file, then overwrite it with one write. */
    int fd = open("source_file.txt", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    fallocate(fd, 0, 0, FILE_SIZE);

    char *data = malloc(FILE_SIZE);
    memset(data, 'A', FILE_SIZE);
    write(fd, data, FILE_SIZE);

    rename("source_file.txt", "target_file.txt");
    printf("rename succeeded\n");

    free(data);
    close(fd);
    return 0;
}
```

If your stripe is relatively small, here is the method from
carlos@fisica.ufpr.br:

https://marc.info/?l=linux-raid&m=170327844709957&w=2

This method may not be exact, but it is very simple for reproducing with a
small stripe. However, on my machine I had to run a parallel version of it.
Here it is:
mkdir 1 2 3 4 5
xzcat linux-6.16.tar.xz | tar x -C ./1 -f - &
xzcat linux-6.16.tar.xz | tar x -C ./2 -f - &
xzcat linux-6.16.tar.xz | tar x -C ./3 -f - &
xzcat linux-6.16.tar.xz | tar x -C ./4 -f - &
xzcat linux-6.16.tar.xz | tar x -C ./5 -f - &

Wait for kworker to flush the dirty pages. You can use `top` to watch kworker
sit at 100% CPU for a long time.

Best Regards,
Mingyu He

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.