* [PATCH 0/1] Fix for recent bugzilla reports related to long halts during block allocation
@ 2023-12-15 11:19 Ojaswin Mujoo
  2023-12-15 11:19 ` [PATCH 1/1] ext4: fallback to complex scan if aligned scan doesn't work Ojaswin Mujoo
  2024-01-09  2:53 ` [PATCH 0/1] Fix for recent bugzilla reports related to long halts during block allocation Theodore Ts'o
  0 siblings, 2 replies; 8+ messages in thread
From: Ojaswin Mujoo @ 2023-12-15 11:19 UTC
  To: linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, linux-kernel, Jan Kara, glandvador, bugzilla

This patch intends to fix the recent bugzilla [1] report where the
kworker flush thread seemed to be taking 100% CPU utilization and was
slowing down the whole system. The backtrace indicated that we were
stuck in the mballoc allocation path. The issue was only seen on kernel
6.5+ and when ext4 was mounted with -o stripe (or the stripe option was
implicitly added due to the mkfs flags used).

Although I was not able to fully replicate this issue, from the perf
probe logs collected I have a possible root cause which I have explained
in the patch commit message.

Now, the one thing I'm still skeptical about is why this was only seen
in kernel 6.5+. We added a new mballoc criteria in kernel 6.5 but I was
not able to find a satisfactory explanation as to why that would have
any effect here. Further, the issue still persisted when I asked one of
the reporters to disable it using the sysfs file and rerun the test.
Maybe there are some more factors at play?

Anyways, I would appreciate it if the people experiencing this issue
can help test this patch and see if it fixes the regression.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=217965

Regards,
ojaswin

Ojaswin Mujoo (1):
  ext4: fallback to complex scan if aligned scan doesn't work

 fs/ext4/mballoc.c | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

-- 
2.39.3
* [PATCH 1/1] ext4: fallback to complex scan if aligned scan doesn't work
  2023-12-15 11:19 [PATCH 0/1] Fix for recent bugzilla reports related to long halts during block allocation Ojaswin Mujoo
@ 2023-12-15 11:19 ` Ojaswin Mujoo
  2024-01-04 15:27   ` Jan Kara
  2024-01-09  2:53 ` [PATCH 0/1] Fix for recent bugzilla reports related to long halts during block allocation Theodore Ts'o
  1 sibling, 1 reply; 8+ messages in thread
From: Ojaswin Mujoo @ 2023-12-15 11:19 UTC
  To: linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, linux-kernel, Jan Kara, glandvador, bugzilla

Currently in case the goal length is a multiple of stripe size we use
ext4_mb_scan_aligned() to find the stripe size aligned physical blocks.
In case we are not able to find any, we again go back to calling
ext4_mb_choose_next_group() to search for a different suitable block
group. However, since the linear search always begins from the start,
most of the time we end up with the same BG and the cycle continues.

With large filesystems, the CPU can be stuck in this loop for hours
which can slow down the whole system. Hence, until we figure out a
better way to continue the search (rather than starting from the
beginning) in ext4_mb_choose_next_group(), let's just fall back to
ext4_mb_complex_scan_group() in case the aligned scan fails, as it is
much more likely to find the needed blocks.
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
 fs/ext4/mballoc.c | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index d72b5e3c92ec..63f12ec02485 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2895,14 +2895,19 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 			ac->ac_groups_scanned++;
 			if (cr == CR_POWER2_ALIGNED)
 				ext4_mb_simple_scan_group(ac, &e4b);
-			else if ((cr == CR_GOAL_LEN_FAST ||
-				 cr == CR_BEST_AVAIL_LEN) &&
-				 sbi->s_stripe &&
-				 !(ac->ac_g_ex.fe_len %
-				 EXT4_B2C(sbi, sbi->s_stripe)))
-				ext4_mb_scan_aligned(ac, &e4b);
-			else
-				ext4_mb_complex_scan_group(ac, &e4b);
+			else {
+				bool is_stripe_aligned = sbi->s_stripe &&
+					!(ac->ac_g_ex.fe_len %
+					  EXT4_B2C(sbi, sbi->s_stripe));
+
+				if ((cr == CR_GOAL_LEN_FAST ||
+				     cr == CR_BEST_AVAIL_LEN) &&
+				    is_stripe_aligned)
+					ext4_mb_scan_aligned(ac, &e4b);
+
+				if (ac->ac_status == AC_STATUS_CONTINUE)
+					ext4_mb_complex_scan_group(ac, &e4b);
+			}
 
 			ext4_unlock_group(sb, group);
 			ext4_mb_unload_buddy(&e4b);
-- 
2.39.3
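The control flow introduced by the hunk above can be sketched in plain
userspace C. This is only an illustrative model, not kernel code: the
names scan_one_group(), goal_is_stripe_aligned(), the two enums and the
aligned_scan_succeeds flag are all made up for the sketch, the check on
the cr pass is left out, and lengths are plain integers rather than
cluster counts.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for this sketch; none of them exist in mballoc.c. */
enum scan_ran { SCAN_NONE = 0, SCAN_ALIGNED = 1, SCAN_COMPLEX = 2 };
enum alloc_status { STATUS_CONTINUE, STATUS_FOUND };

/* Same shape as the is_stripe_aligned expression in the patch. */
static bool goal_is_stripe_aligned(unsigned int stripe, unsigned int goal_len)
{
	return stripe != 0 && (goal_len % stripe) == 0;
}

/*
 * Model one visit to a block group. aligned_scan_succeeds stands in for
 * "the aligned scan found an extent". Returns a bitmask of the scans
 * that ran during this visit.
 */
static unsigned int scan_one_group(unsigned int stripe, unsigned int goal_len,
				   bool aligned_scan_succeeds)
{
	enum alloc_status status = STATUS_CONTINUE;
	unsigned int ran = SCAN_NONE;

	assert(goal_len > 0);

	if (goal_is_stripe_aligned(stripe, goal_len)) {
		ran |= SCAN_ALIGNED;
		if (aligned_scan_succeeds)
			status = STATUS_FOUND;
	}

	/* The fallback: runs only while nothing has been found yet. */
	if (status == STATUS_CONTINUE)
		ran |= SCAN_COMPLEX;

	return ran;
}
```

In this model a failed aligned scan is always followed by the complex
scan within the same group visit, which is the behaviour the commit
message asks for. Before the patch the two scans were mutually
exclusive branches.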
* Re: [PATCH 1/1] ext4: fallback to complex scan if aligned scan doesn't work
  2023-12-15 11:19 ` [PATCH 1/1] ext4: fallback to complex scan if aligned scan doesn't work Ojaswin Mujoo
@ 2024-01-04 15:27   ` Jan Kara
  2024-01-09  9:40     ` Ojaswin Mujoo
  0 siblings, 1 reply; 8+ messages in thread
From: Jan Kara @ 2024-01-04 15:27 UTC
  To: Ojaswin Mujoo
  Cc: linux-ext4, Theodore Ts'o, Ritesh Harjani, linux-kernel, Jan Kara, glandvador, bugzilla

On Fri 15-12-23 16:49:50, Ojaswin Mujoo wrote:
> Currently in case the goal length is a multiple of stripe size we use
> ext4_mb_scan_aligned() to find the stripe size aligned physical blocks.
> In case we are not able to find any, we again go back to calling
> ext4_mb_choose_next_group() to search for a different suitable block
> group. However, since the linear search always begins from the start,
> most of the time we end up with the same BG and the cycle continues.
> 
> With large filesystems, the CPU can be stuck in this loop for hours
> which can slow down the whole system. Hence, until we figure out a
> better way to continue the search (rather than starting from beginning)
> in ext4_mb_choose_next_group(), let's just fall back to
> ext4_mb_complex_scan_group() in case aligned scan fails, as it is much
> more likely to find the needed blocks.
> 
> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

If I understand the difference right, the problem is that while
ext4_mb_choose_next_group() guarantees a large enough free space extent
for the CR_GOAL_LEN_FAST or CR_BEST_AVAIL_LEN passes, it does not
guarantee a large enough *aligned* free space extent. Thus for
non-aligned allocations we can fail only due to a race with another
allocating process, but with aligned allocations we can consistently
fail in ext4_mb_scan_aligned() and thus livelock in the allocation loop.
If my understanding is correct, feel free to add: Reviewed-by: Jan Kara <jack@suse.cz> Honza > --- > fs/ext4/mballoc.c | 21 +++++++++++++-------- > 1 file changed, 13 insertions(+), 8 deletions(-) > > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c > index d72b5e3c92ec..63f12ec02485 100644 > --- a/fs/ext4/mballoc.c > +++ b/fs/ext4/mballoc.c > @@ -2895,14 +2895,19 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac) > ac->ac_groups_scanned++; > if (cr == CR_POWER2_ALIGNED) > ext4_mb_simple_scan_group(ac, &e4b); > - else if ((cr == CR_GOAL_LEN_FAST || > - cr == CR_BEST_AVAIL_LEN) && > - sbi->s_stripe && > - !(ac->ac_g_ex.fe_len % > - EXT4_B2C(sbi, sbi->s_stripe))) > - ext4_mb_scan_aligned(ac, &e4b); > - else > - ext4_mb_complex_scan_group(ac, &e4b); > + else { > + bool is_stripe_aligned = sbi->s_stripe && > + !(ac->ac_g_ex.fe_len % > + EXT4_B2C(sbi, sbi->s_stripe)); > + > + if ((cr == CR_GOAL_LEN_FAST || > + cr == CR_BEST_AVAIL_LEN) && > + is_stripe_aligned) > + ext4_mb_scan_aligned(ac, &e4b); > + > + if (ac->ac_status == AC_STATUS_CONTINUE) > + ext4_mb_complex_scan_group(ac, &e4b); > + } > > ext4_unlock_group(sb, group); > ext4_mb_unload_buddy(&e4b); > -- > 2.39.3 > -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 8+ messages in thread
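The distinction drawn in the review (a group can hold a large enough
free extent without holding a large enough *aligned* one) can be shown
with a toy bitmap. This is a sketch under stated assumptions, not how
mballoc represents groups: GROUP_BLOCKS, build_example_group(),
find_aligned_run() and find_any_run() are hypothetical names, and a
"group" here is just an array of 16 in-use flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define GROUP_BLOCKS 16 /* hypothetical toy group size */

/* True if blocks [start, start + len) are all free (used[i] == false). */
static bool run_is_free(const bool *used, size_t nblocks, size_t start, size_t len)
{
	if (start + len > nblocks)
		return false;
	for (size_t i = 0; i < len; i++)
		if (used[start + i])
			return false;
	return true;
}

/* First free run of len blocks starting on a multiple of stripe, or -1. */
static long find_aligned_run(const bool *used, size_t nblocks, size_t stripe, size_t len)
{
	assert(len > 0);
	if (stripe == 0)
		return -1;
	for (size_t start = 0; start + len <= nblocks; start += stripe)
		if (run_is_free(used, nblocks, start, len))
			return (long)start;
	return -1;
}

/* First free run of len blocks at any offset, or -1. */
static long find_any_run(const bool *used, size_t nblocks, size_t len)
{
	assert(len > 0);
	for (size_t start = 0; start + len <= nblocks; start++)
		if (run_is_free(used, nblocks, start, len))
			return (long)start;
	return -1;
}

/* Blocks 1..6 are free, everything else is in use. */
static void build_example_group(bool used[GROUP_BLOCKS])
{
	for (size_t i = 0; i < GROUP_BLOCKS; i++)
		used[i] = !(i >= 1 && i <= 6);
}
```

With a stripe of 4 and a goal of 4 blocks, the toy group has a free run
of 6 blocks, so a "large enough extent" check passes, yet neither
aligned start (0 or 4) begins a fully free run of 4. An aligned search
on this group fails every time it is retried, while the unaligned
search succeeds at block 1.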
* Re: [PATCH 1/1] ext4: fallback to complex scan if aligned scan doesn't work 2024-01-04 15:27 ` Jan Kara @ 2024-01-09 9:40 ` Ojaswin Mujoo 0 siblings, 0 replies; 8+ messages in thread From: Ojaswin Mujoo @ 2024-01-09 9:40 UTC (permalink / raw) To: Jan Kara Cc: linux-ext4, Theodore Ts'o, Ritesh Harjani, linux-kernel, glandvador, bugzilla On Thu, Jan 04, 2024 at 04:27:17PM +0100, Jan Kara wrote: > On Fri 15-12-23 16:49:50, Ojaswin Mujoo wrote: > > Currently in case the goal length is a multiple of stripe size we use > > ext4_mb_scan_aligned() to find the stripe size aligned physical blocks. > > In case we are not able to find any, we again go back to calling > > ext4_mb_choose_next_group() to search for a different suitable block > > group. However, since the linear search always begins from the start, > > most of the times we end up with the same BG and the cycle continues. > > > > With large fliesystems, the CPU can be stuck in this loop for hours > > which can slow down the whole system. Hence, until we figure out a > > better way to continue the search (rather than starting from beginning) > > in ext4_mb_choose_next_group(), lets just fallback to > > ext4_mb_complex_scan_group() in case aligned scan fails, as it is much > > more likely to find the needed blocks. > > > > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> > > If I understand the difference right, the problem is that while > ext4_mb_choose_next_group() guarantees large enough free space extent for > the CR_GOAL_LEN_FAST or CR_BEST_AVAIL_LEN passes, it does not guaranteed > large enough *aligned* free space extent. Thus for non-aligned allocations > we can fail only due to a race with another allocating process but with > aligned allocations we can consistently fail in ext4_mb_scan_aligned() and > thus livelock in the allocation loop. > > If my understanding is correct, feel free to add: > > Reviewed-by: Jan Kara <jack@suse.cz> > > Honza Hey Jan, Yes you are correct, thanks for the review. 
As you said, it's theoretically possible to livelock in non-stripe
scenarios as well, but the probability of getting stuck for any
significant amount of time is very low. I'm not sure if that is enough
to justify adding some logic to optimize the search for such scenarios,
as that might need more involved code changes.

Regards,
ojaswin

> 
> > 
> > ---
> >  fs/ext4/mballoc.c | 21 +++++++++++++--------
> >  1 file changed, 13 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > index d72b5e3c92ec..63f12ec02485 100644
> > --- a/fs/ext4/mballoc.c
> > +++ b/fs/ext4/mballoc.c
> > @@ -2895,14 +2895,19 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
> >  			ac->ac_groups_scanned++;
> >  			if (cr == CR_POWER2_ALIGNED)
> >  				ext4_mb_simple_scan_group(ac, &e4b);
> > -			else if ((cr == CR_GOAL_LEN_FAST ||
> > -				 cr == CR_BEST_AVAIL_LEN) &&
> > -				 sbi->s_stripe &&
> > -				 !(ac->ac_g_ex.fe_len %
> > -				 EXT4_B2C(sbi, sbi->s_stripe)))
> > -				ext4_mb_scan_aligned(ac, &e4b);
> > -			else
> > -				ext4_mb_complex_scan_group(ac, &e4b);
> > +			else {
> > +				bool is_stripe_aligned = sbi->s_stripe &&
> > +					!(ac->ac_g_ex.fe_len %
> > +					  EXT4_B2C(sbi, sbi->s_stripe));
> > +
> > +				if ((cr == CR_GOAL_LEN_FAST ||
> > +				     cr == CR_BEST_AVAIL_LEN) &&
> > +				    is_stripe_aligned)
> > +					ext4_mb_scan_aligned(ac, &e4b);
> > +
> > +				if (ac->ac_status == AC_STATUS_CONTINUE)
> > +					ext4_mb_complex_scan_group(ac, &e4b);
> > +			}
> >  
> >  			ext4_unlock_group(sb, group);
> >  			ext4_mb_unload_buddy(&e4b);
> > -- 
> > 2.39.3
> > 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
* Re: [PATCH 0/1] Fix for recent bugzilla reports related to long halts during block allocation 2023-12-15 11:19 [PATCH 0/1] Fix for recent bugzilla reports related to long halts during block allocation Ojaswin Mujoo 2023-12-15 11:19 ` [PATCH 1/1] ext4: fallback to complex scan if aligned scan doesn't work Ojaswin Mujoo @ 2024-01-09 2:53 ` Theodore Ts'o 2024-03-20 16:52 ` Frederick Lawler 1 sibling, 1 reply; 8+ messages in thread From: Theodore Ts'o @ 2024-01-09 2:53 UTC (permalink / raw) To: linux-ext4, Ojaswin Mujoo Cc: Theodore Ts'o, Ritesh Harjani, linux-kernel, Jan Kara, glandvador, bugzilla On Fri, 15 Dec 2023 16:49:49 +0530, Ojaswin Mujoo wrote: > This patch intends to fix the recent bugzilla [1] report where the > kworker flush thread seemed to be taking 100% CPU utilizationa and was > slowing down the whole system. The backtrace indicated that we were > stuck in mballoc allocation path. The issue was only seen kernel 6.5+ > and when ext4 was mounted with -o stripe (or stripe option was > implicitly added due us mkfs flags used). > > [...] Applied, thanks! [1/1] ext4: fallback to complex scan if aligned scan doesn't work commit: a26b6faf7f1c9c1ba6edb3fea9d1390201f2ed50 Best regards, -- Theodore Ts'o <tytso@mit.edu> ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 0/1] Fix for recent bugzilla reports related to long halts during block allocation 2024-01-09 2:53 ` [PATCH 0/1] Fix for recent bugzilla reports related to long halts during block allocation Theodore Ts'o @ 2024-03-20 16:52 ` Frederick Lawler 2024-03-22 8:31 ` Ojaswin Mujoo 0 siblings, 1 reply; 8+ messages in thread From: Frederick Lawler @ 2024-03-20 16:52 UTC (permalink / raw) To: Theodore Ts'o Cc: linux-ext4, Ojaswin Mujoo, Ritesh Harjani, linux-kernel, Jan Kara, glandvador, bugzilla, kernel-team Hi Theodore and Ojaswin, On Mon, Jan 08, 2024 at 09:53:18PM -0500, Theodore Ts'o wrote: > > On Fri, 15 Dec 2023 16:49:49 +0530, Ojaswin Mujoo wrote: > > This patch intends to fix the recent bugzilla [1] report where the > > kworker flush thread seemed to be taking 100% CPU utilizationa and was > > slowing down the whole system. The backtrace indicated that we were > > stuck in mballoc allocation path. The issue was only seen kernel 6.5+ > > and when ext4 was mounted with -o stripe (or stripe option was > > implicitly added due us mkfs flags used). > > > > [...] > > Applied, thanks! I backported this patch to at least 6.6 and tested on our fleet of software RAID 0 NVME SSD nodes. This change worked very nicely for us. We're interested in backporting this to at least 6.6. I tried looking at xfstests, and didn't really see a good match (user error?) to validate the fix via that. So I'm a little unclear what the path forward here is. Although we experienced this issue in 6.1, I didn't backport to 6.1 and test to verify this also works there, however, setting stripe to 0 did in the 6.1 case. Best, Fred > > [1/1] ext4: fallback to complex scan if aligned scan doesn't work > commit: a26b6faf7f1c9c1ba6edb3fea9d1390201f2ed50 > > Best regards, > -- > Theodore Ts'o <tytso@mit.edu> ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 0/1] Fix for recent bugzilla reports related to long halts during block allocation 2024-03-20 16:52 ` Frederick Lawler @ 2024-03-22 8:31 ` Ojaswin Mujoo 2024-03-25 18:12 ` Frederick Lawler 0 siblings, 1 reply; 8+ messages in thread From: Ojaswin Mujoo @ 2024-03-22 8:31 UTC (permalink / raw) To: Frederick Lawler Cc: Theodore Ts'o, linux-ext4, Ritesh Harjani, linux-kernel, Jan Kara, glandvador, bugzilla, kernel-team On Wed, Mar 20, 2024 at 11:52:58AM -0500, Frederick Lawler wrote: > Hi Theodore and Ojaswin, > > On Mon, Jan 08, 2024 at 09:53:18PM -0500, Theodore Ts'o wrote: > > > > On Fri, 15 Dec 2023 16:49:49 +0530, Ojaswin Mujoo wrote: > > > This patch intends to fix the recent bugzilla [1] report where the > > > kworker flush thread seemed to be taking 100% CPU utilizationa and was > > > slowing down the whole system. The backtrace indicated that we were > > > stuck in mballoc allocation path. The issue was only seen kernel 6.5+ > > > and when ext4 was mounted with -o stripe (or stripe option was > > > implicitly added due us mkfs flags used). > > > > > > [...] > > > > Applied, thanks! > > I backported this patch to at least 6.6 and tested on our fleet of > software RAID 0 NVME SSD nodes. This change worked very nicely > for us. We're interested in backporting this to at least 6.6. > > I tried looking at xfstests, and didn't really see a good match > (user error?) to validate the fix via that. So I'm a little unclear what > the path forward here is. > > Although we experienced this issue in 6.1, I didn't backport to 6.1 and > test to verify this also works there, however, setting stripe to 0 did in > the 6.1 case. > > Best, > Fred Hi Fred, If I understand correctly, you are looking for a test case which you could use to confirm if the issue exists and if the backport is solving it, right? Actually, I was never able to replicate this at my end so I had to rely on people hitting the bug to confirm if it works. 
I did set out to write a testcase that could help us reliably replicate this issue but it needs a very specially crafted FS that is a bit difficult to achieve from user space. I was using debugfs to create an FS that could hit it but I kept running into issues where it won't mount etc. Maybe there's a better way to craft such an FS that I'm not aware of. One more option is that maybe we can have KUnit test for this in the mballoc code but I'd need to read some more about the kunit infrastructure to see if it's possible/feasible. Regards, ojaswin > > > > > [1/1] ext4: fallback to complex scan if aligned scan doesn't work > > commit: a26b6faf7f1c9c1ba6edb3fea9d1390201f2ed50 > > > > Best regards, > > -- > > Theodore Ts'o <tytso@mit.edu> ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 0/1] Fix for recent bugzilla reports related to long halts during block allocation 2024-03-22 8:31 ` Ojaswin Mujoo @ 2024-03-25 18:12 ` Frederick Lawler 0 siblings, 0 replies; 8+ messages in thread From: Frederick Lawler @ 2024-03-25 18:12 UTC (permalink / raw) To: Ojaswin Mujoo Cc: Theodore Ts'o, linux-ext4, Ritesh Harjani, linux-kernel, Jan Kara, glandvador, bugzilla, kernel-team On Fri, Mar 22, 2024 at 02:01:17PM +0530, Ojaswin Mujoo wrote: > On Wed, Mar 20, 2024 at 11:52:58AM -0500, Frederick Lawler wrote: > > Hi Theodore and Ojaswin, > > > > On Mon, Jan 08, 2024 at 09:53:18PM -0500, Theodore Ts'o wrote: > > > > > > On Fri, 15 Dec 2023 16:49:49 +0530, Ojaswin Mujoo wrote: > > > > This patch intends to fix the recent bugzilla [1] report where the > > > > kworker flush thread seemed to be taking 100% CPU utilizationa and was > > > > slowing down the whole system. The backtrace indicated that we were > > > > stuck in mballoc allocation path. The issue was only seen kernel 6.5+ > > > > and when ext4 was mounted with -o stripe (or stripe option was > > > > implicitly added due us mkfs flags used). > > > > > > > > [...] > > > > > > Applied, thanks! > > > > I backported this patch to at least 6.6 and tested on our fleet of > > software RAID 0 NVME SSD nodes. This change worked very nicely > > for us. We're interested in backporting this to at least 6.6. > > > > I tried looking at xfstests, and didn't really see a good match > > (user error?) to validate the fix via that. So I'm a little unclear what > > the path forward here is. > > > > Although we experienced this issue in 6.1, I didn't backport to 6.1 and > > test to verify this also works there, however, setting stripe to 0 did in > > the 6.1 case. > > > > Best, > > Fred > > Hi Fred, > > If I understand correctly, you are looking for a test case which you > could use to confirm if the issue exists and if the backport is solving > it, right? Not quite. 
I made an assumption that having a test was a requirement for
backporting the patch. I know some other file systems prefer a few
loops of kdevops to backport patches, and was curious if that's a
similar flow for ext4. I only backported the patch to 6.6 and ensured
that our affected nodes perform as expected with it.

> 
> Actually, I was never able to replicate this at my end so I had to rely
> on people hitting the bug to confirm if it works. I did set out to write
> a testcase that could help us reliably replicate this issue but it needs
> a very specially crafted FS that is a bit difficult to achieve from user
> space. I was using debugfs to create an FS that could hit it but I kept
> running into issues where it won't mount etc. Maybe there's a better
> way to craft such an FS that I'm not aware of.
> 
> One more option is that maybe we can have KUnit test for this in the
> mballoc code but I'd need to read some more about the kunit
> infrastructure to see if it's possible/feasible.
> 

I think kunit is an interesting idea. One thing to keep in mind is that
mocking is going to be the real problem with that approach, and more
mocking may mean more brittle tests.

> Regards,
> ojaswin
> >
> > >
> > > [1/1] ext4: fallback to complex scan if aligned scan doesn't work
> > >       commit: a26b6faf7f1c9c1ba6edb3fea9d1390201f2ed50
> > >
> > > Best regards,
> > > -- 
> > > Theodore Ts'o <tytso@mit.edu>