* [PATCH 0/2] ext4: Fix performance regression with mballoc
@ 2022-08-23 20:15 Jan Kara
2022-08-23 20:15 ` [PATCH 1/2] ext4: Make mballoc try target group first even with mb_optimize_scan Jan Kara
` (2 more replies)
0 siblings, 3 replies; 20+ messages in thread
From: Jan Kara @ 2022-08-23 20:15 UTC (permalink / raw)
To: Ted Tso
Cc: linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo, Stefan Wahren,
Harshad Shirwadkar, Jan Kara
Hello,
So I have implemented mballoc improvements to avoid spreading allocations
even with mb_optimize_scan=1. It fixes the performance regression I was able
to reproduce with reaim on my test machine:
                   mb_optimize_scan=0     mb_optimize_scan=1     patched
Hmean  disk-1         2076.12 (   0.00%)     2099.37 (   1.12%)     2032.52 (  -2.10%)
Hmean  disk-41       92481.20 (   0.00%)    83787.47 *  -9.40%*    90308.37 (  -2.35%)
Hmean  disk-81      155073.39 (   0.00%)   135527.05 * -12.60%*   154285.71 (  -0.51%)
Hmean  disk-121     185109.64 (   0.00%)   166284.93 * -10.17%*   185298.62 (   0.10%)
Hmean  disk-161     229890.53 (   0.00%)   207563.39 *  -9.71%*   232883.32 *   1.30%*
Hmean  disk-201     223333.33 (   0.00%)   203235.59 *  -9.00%*   221446.93 (  -0.84%)
Hmean  disk-241     235735.25 (   0.00%)   217705.51 *  -7.65%*   239483.27 *   1.59%*
Hmean  disk-281     266772.15 (   0.00%)   241132.72 *  -9.61%*   263108.62 (  -1.37%)
Hmean  disk-321     265435.50 (   0.00%)   245412.84 *  -7.54%*   267277.27 (   0.69%)
Stefan, can you please test whether these patches fix the problem for you as
well? Comments & review welcome.
Honza
^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/2] ext4: Make mballoc try target group first even with mb_optimize_scan
  2022-08-23 20:15 [PATCH 0/2] ext4: Fix performance regression with mballoc Jan Kara
@ 2022-08-23 20:15 ` Jan Kara
  2022-08-23 20:15 ` [PATCH 2/2] ext4: Avoid unnecessary spreading of allocations among groups Jan Kara
  2022-08-24 10:17 ` [PATCH 0/2] ext4: Fix performance regression with mballoc Stefan Wahren
  2 siblings, 0 replies; 20+ messages in thread

From: Jan Kara @ 2022-08-23 20:15 UTC (permalink / raw)
To: Ted Tso
Cc: linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo, Stefan Wahren,
	Harshad Shirwadkar, Jan Kara, stable

One of the side-effects of mb_optimize_scan was that the optimized
functions to select the next group to try were called even before we
tried the goal group. As a result we no longer allocate files close to
their corresponding inodes, and we do not try to expand the currently
allocated extent in the same group. This results in a reaim regression
with the workfile.disk workload of up to 8% with many clients on my
test machine:

                   baseline               mb_optimize_scan
Hmean  disk-1         2114.16 (   0.00%)     2099.37 (  -0.70%)
Hmean  disk-41       87794.43 (   0.00%)    83787.47 *  -4.56%*
Hmean  disk-81      148170.73 (   0.00%)   135527.05 *  -8.53%*
Hmean  disk-121     177506.11 (   0.00%)   166284.93 *  -6.32%*
Hmean  disk-161     220951.51 (   0.00%)   207563.39 *  -6.06%*
Hmean  disk-201     208722.74 (   0.00%)   203235.59 (  -2.63%)
Hmean  disk-241     222051.60 (   0.00%)   217705.51 (  -1.96%)
Hmean  disk-281     252244.17 (   0.00%)   241132.72 *  -4.41%*
Hmean  disk-321     255844.84 (   0.00%)   245412.84 *  -4.08%*

Also this is causing a huge regression (time increased by a factor of 5
or so) when untarring an archive with lots of small files on some eMMC
storage cards. Fix the problem by making sure we try the goal group
first.
Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@vger.kernel.org
Reported-by: Stefan Wahren <stefan.wahren@i2se.com>
Link: https://lore.kernel.org/all/20220727105123.ckwrhbilzrxqpt24@quack3/
Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/mballoc.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index bd8f8b5c3d30..41e1cfecac3b 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1049,8 +1049,10 @@ static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
 {
 	*new_cr = ac->ac_criteria;
 
-	if (!should_optimize_scan(ac) || ac->ac_groups_linear_remaining)
+	if (!should_optimize_scan(ac) || ac->ac_groups_linear_remaining) {
+		*group = next_linear_group(ac, *group, ngroups);
 		return;
+	}
 
 	if (*new_cr == 0) {
 		ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
@@ -2636,7 +2638,7 @@ static noinline_for_stack int
 ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 {
 	ext4_group_t prefetch_grp = 0, ngroups, group, i;
-	int cr = -1;
+	int cr = -1, new_cr;
 	int err = 0, first_err = 0;
 	unsigned int nr = 0, prefetch_ios = 0;
 	struct ext4_sb_info *sbi;
@@ -2711,13 +2713,11 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 	ac->ac_groups_linear_remaining = sbi->s_mb_max_linear_groups;
 	prefetch_grp = group;
 
-	for (i = 0; i < ngroups; group = next_linear_group(ac, group, ngroups),
-	     i++) {
-		int ret = 0, new_cr;
+	for (i = 0, new_cr = cr; i < ngroups; i++,
+	     ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups)) {
+		int ret = 0;
 
 		cond_resched();
-
-		ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups);
 		if (new_cr != cr) {
 			cr = new_cr;
 			goto repeat;
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 20+ messages in thread
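[Editorial aside: the control-flow change in this patch is easier to see outside the kernel. Below is a toy Python model of the patched loop, purely illustrative — the "jump to the emptiest group" fallback is a stand-in for the kernel's cr0/cr1 order-list logic, and all names and numbers are made up, not the kernel's API.]

```python
# Toy model of the patched allocator loop: the goal group is examined
# FIRST, then groups are walked linearly while a linear-scan budget lasts,
# and only afterwards does the "optimized" chooser take over.

def choose_next_group(group, ngroups, linear_remaining, free_blocks):
    """Patched behaviour: linear successor while the linear-scan budget
    lasts; afterwards a jump to the emptiest group stands in for the
    kernel's order-list based chooser."""
    if linear_remaining > 0:
        return (group + 1) % ngroups, linear_remaining - 1
    best = max(range(ngroups), key=lambda g: free_blocks[g])
    return best, 0

def groups_visited(goal, ngroups, linear_budget, nvisit, free_blocks):
    """Order in which groups are examined; the goal group always comes first."""
    visited = []
    group, remaining = goal, linear_budget
    for _ in range(nvisit):
        visited.append(group)   # group is examined before choosing the next one
        group, remaining = choose_next_group(group, ngroups, remaining,
                                             free_blocks)
    return visited

free = [60, 0, 0, 0, 0, 20, 5, 5]   # made-up free block counts per group
print(groups_visited(goal=5, ngroups=8, linear_budget=2, nvisit=4,
                     free_blocks=free))
# -> [5, 6, 7, 0]: goal group 5 first, two linear steps, then the jump
```

The pre-patch bug, in these terms, was that `choose_next_group()` ran before the first `visited.append(group)`, so the goal group was never the first group examined.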
* [PATCH 2/2] ext4: Avoid unnecessary spreading of allocations among groups
  2022-08-23 20:15 [PATCH 0/2] ext4: Fix performance regression with mballoc Jan Kara
  2022-08-23 20:15 ` [PATCH 1/2] ext4: Make mballoc try target group first even with mb_optimize_scan Jan Kara
@ 2022-08-23 20:15 ` Jan Kara
  2022-08-24 10:17 ` [PATCH 0/2] ext4: Fix performance regression with mballoc Stefan Wahren
  2 siblings, 0 replies; 20+ messages in thread

From: Jan Kara @ 2022-08-23 20:15 UTC (permalink / raw)
To: Ted Tso
Cc: linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo, Stefan Wahren,
	Harshad Shirwadkar, Jan Kara

mb_set_largest_free_order() updates the lists containing groups with the
largest chunk of free space of a given order. The way it updates them
leads to always moving the group to the tail of its list. Thus
allocations looking for free space of a given order effectively end up
cycling through all groups (and, due to the initialization order, from
last to first). This spreads allocations among block groups, which
reduces performance for rotating disks or low-end flash media. Change
mb_set_largest_free_order() to only update the lists if the order of the
largest free chunk in the group changed.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/mballoc.c | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 41e1cfecac3b..6251b4a6cc63 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1077,23 +1077,25 @@ mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	int i;
 
-	if (test_opt2(sb, MB_OPTIMIZE_SCAN) && grp->bb_largest_free_order >= 0) {
+	for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--)
+		if (grp->bb_counters[i] > 0)
+			break;
+	/* No need to move between order lists? */
+	if (!test_opt2(sb, MB_OPTIMIZE_SCAN) ||
+	    i == grp->bb_largest_free_order) {
+		grp->bb_largest_free_order = i;
+		return;
+	}
+
+	if (grp->bb_largest_free_order >= 0) {
 		write_lock(&sbi->s_mb_largest_free_orders_locks[
 					      grp->bb_largest_free_order]);
 		list_del_init(&grp->bb_largest_free_order_node);
 		write_unlock(&sbi->s_mb_largest_free_orders_locks[
 					      grp->bb_largest_free_order]);
 	}
-	grp->bb_largest_free_order = -1; /* uninit */
-
-	for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--) {
-		if (grp->bb_counters[i] > 0) {
-			grp->bb_largest_free_order = i;
-			break;
-		}
-	}
-	if (test_opt2(sb, MB_OPTIMIZE_SCAN) &&
-	    grp->bb_largest_free_order >= 0 && grp->bb_free) {
+	grp->bb_largest_free_order = i;
+	if (grp->bb_largest_free_order >= 0 && grp->bb_free) {
 		write_lock(&sbi->s_mb_largest_free_orders_locks[
 					      grp->bb_largest_free_order]);
 		list_add_tail(&grp->bb_largest_free_order_node,
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 20+ messages in thread
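[Editorial aside: the idea of this patch — relink a group in the per-order lists only when its largest-free-chunk order actually changes — can be sketched as a toy Python model. Illustrative only: plain lists and dicts stand in for the kernel's locked `s_mb_largest_free_orders` lists, and the counter values are made up.]

```python
# Toy model of the patched mb_set_largest_free_order(): repeated updates
# at the same order no longer rotate the group to the tail of its list.
from collections import defaultdict

order_lists = defaultdict(list)   # order -> list of group numbers (tail = newest)
largest_order = {}                # group -> current largest free order (-1 = none)

def largest_free_order(counters):
    """Highest order with a nonzero free-chunk counter, or -1."""
    for order in range(len(counters) - 1, -1, -1):
        if counters[order] > 0:
            return order
    return -1

def set_largest_free_order(group, counters):
    new = largest_free_order(counters)
    old = largest_order.get(group, -1)
    if new == old:          # no order change: leave the list position alone
        return
    if old >= 0:
        order_lists[old].remove(group)      # list_del_init() equivalent
    if new >= 0:
        order_lists[new].append(group)      # list_add_tail() equivalent
    largest_order[group] = new

set_largest_free_order(0, [0, 4, 2, 0])   # group 0: largest order 2
set_largest_free_order(1, [1, 0, 3, 0])   # group 1: largest order 2
set_largest_free_order(0, [0, 2, 1, 0])   # still order 2: position unchanged
print(order_lists[2])
# -> [0, 1]: group 0 keeps its head position instead of rotating to the tail
```

Pre-patch, the third call would have unlinked and re-appended group 0, yielding `[1, 0]` — exactly the tail-rotation the commit message describes.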
* Re: [PATCH 0/2] ext4: Fix performance regression with mballoc
  2022-08-23 20:15 [PATCH 0/2] ext4: Fix performance regression with mballoc Jan Kara
  2022-08-23 20:15 ` [PATCH 1/2] ext4: Make mballoc try target group first even with mb_optimize_scan Jan Kara
  2022-08-23 20:15 ` [PATCH 2/2] ext4: Avoid unnecessary spreading of allocations among groups Jan Kara
@ 2022-08-24 10:17 ` Stefan Wahren
  2022-08-24 10:40   ` Jan Kara
  2 siblings, 1 reply; 20+ messages in thread

From: Stefan Wahren @ 2022-08-24 10:17 UTC (permalink / raw)
To: Jan Kara, Ted Tso
Cc: linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo, Harshad Shirwadkar

Hi Jan,

Am 23.08.22 um 22:15 schrieb Jan Kara:
> Hello,
>
> So I have implemented mballoc improvements to avoid spreading allocations
> even with mb_optimize_scan=1. It fixes the performance regression I was able
> to reproduce with reaim on my test machine:
>
> [...]
>
> Stefan, can you please test whether these patches fix the problem for you as
> well? Comments & review welcome.

i tested the whole series against 5.19 and 6.0.0-rc2. In both cases the
update process succeeded, which is an improvement, but the download +
unpack duration ( ~ 7 minutes ) is not as good as with
mb_optimize_scan=0 ( ~ 1 minute ).

Unfortunately i don't have much time this week and next week i'm in
holidays.

Just a question: my tests always had MBCACHE=y. Is it possible that
mb_optimize_scan is counterproductive for MBCACHE in this case? I'm
asking because before the download the update script removes the files
from the previous update process, which already causes a high load.

Best regards

>
> Honza

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH 0/2] ext4: Fix performance regression with mballoc
  2022-08-24 10:17 ` [PATCH 0/2] ext4: Fix performance regression with mballoc Stefan Wahren
@ 2022-08-24 10:40   ` Jan Kara
  2022-08-24 14:13     ` Jan Kara
  2022-08-24 21:24     ` Stefan Wahren
  0 siblings, 2 replies; 20+ messages in thread

From: Jan Kara @ 2022-08-24 10:40 UTC (permalink / raw)
To: Stefan Wahren
Cc: Jan Kara, Ted Tso, linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo,
	Harshad Shirwadkar

Hi Stefan!

On Wed 24-08-22 12:17:14, Stefan Wahren wrote:
> Am 23.08.22 um 22:15 schrieb Jan Kara:
> > [...]
> > Stefan, can you please test whether these patches fix the problem for you as
> > well? Comments & review welcome.
>
> i tested the whole series against 5.19 and 6.0.0-rc2. In both cases the
> update process succeed which is a improvement, but the download + unpack
> duration ( ~ 7 minutes ) is not as good as with mb_optimize_scan=0 ( ~ 1
> minute ).

OK, thanks for testing! I'll try to check specifically untar whether I
can still see some differences in the IO pattern on my test machine.

> Unfortuntately i don't have much time this week and next week i'm in
> holidays.

No problem.

> Just a question, my tests always had MBCACHE=y . Is it possible that the
> mb_optimize_scan is counterproductive for MBCACHE in this case?

MBCACHE (despite the similar name) is actually related to extended
attributes, so it likely has no impact on your workload.

> I'm asking because before the download the update script removes the files
> from the previous update process which already cause a high load.

Do you mean that already the removal step is noticeably slower with
mb_optimize_scan=1? The removal will be modifying directory blocks, inode
table blocks, block & inode bitmaps, and group descriptors. So if block
allocations are more spread out (due to mb_optimize_scan=1 used during
the untar), the removal may also take somewhat longer.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH 0/2] ext4: Fix performance regression with mballoc
  2022-08-24 10:40 ` Jan Kara
@ 2022-08-24 14:13   ` Jan Kara
  2022-08-25 17:49     ` Ojaswin Mujoo
  0 siblings, 1 reply; 20+ messages in thread

From: Jan Kara @ 2022-08-24 14:13 UTC (permalink / raw)
To: Stefan Wahren
Cc: Jan Kara, Ted Tso, linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo,
	Harshad Shirwadkar

On Wed 24-08-22 12:40:10, Jan Kara wrote:
> On Wed 24-08-22 12:17:14, Stefan Wahren wrote:
> > [...]
> > i tested the whole series against 5.19 and 6.0.0-rc2. In both cases the
> > update process succeed which is a improvement, but the download + unpack
> > duration ( ~ 7 minutes ) is not as good as with mb_optimize_scan=0 ( ~ 1
> > minute ).
>
> OK, thanks for testing! I'll try to check specifically untar whether I can
> still see some differences in the IO pattern on my test machine.

I have created the same tar archive as you've referenced (files with the
same number of blocks) and looked at where blocks get allocated with
mb_optimize_scan=0 and with mb_optimize_scan=1 + my patches. The
resulting IO pattern looks practically the same on my test machine. In
particular, in both cases files get allocated in only 6 groups; if I
look at the number of erase blocks that are expected to be touched by
file data (for various erase block sizes from 512k to 4MB), I get
practically the same numbers for both cases.

Ojaswin, I think you've also mentioned you were able to reproduce the
issue in your setup? Are you still able to reproduce it with the patched
kernel? Can you help debugging while Stefan is away?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH 0/2] ext4: Fix performance regression with mballoc
  2022-08-24 14:13 ` Jan Kara
@ 2022-08-25 17:49   ` Ojaswin Mujoo
  2022-08-26  9:07     ` Jan Kara
  0 siblings, 1 reply; 20+ messages in thread

From: Ojaswin Mujoo @ 2022-08-25 17:49 UTC (permalink / raw)
To: Jan Kara
Cc: Stefan Wahren, Ted Tso, linux-ext4, Thorsten Leemhuis,
	Harshad Shirwadkar

On Wed, Aug 24, 2022 at 04:13:38PM +0200, Jan Kara wrote:
> [...]
> I have created the same tar archive as you've referenced (files with same
> number of blocks) and looked at where blocks get allocated with
> mb_optimize_scan=0 and with mb_optimize_scan=1 + my patches. And the
> resulting IO pattern looks practically the same on my test machine.
>
> Ojaswin, I think you've also mentioned you were able to reproduce the issue
> in your setup? Are you still able to reproduce it with the patched kernel?
> Can you help debugging while Stefan is away?
>
> Honza

Hi Jan,

So I ran some more tests on the v6.0-rc2 kernel with and without your
patches, and here are the details:

Workload:
	time tar -xf rpi-firmware.tar -C ./test
	time sync

System details:
- Rpi 3b+ w/ 8G memory card (~4G free)
- tar is ~120MB compressed

And here is the output of the time command for the various tests. Since
some of them take a while to complete, I ran them only 2-3 times each,
so the numbers might vary, but they are indicative of the issue.

v6.0-rc2 (Without patches)

mb_optimize_scan = 0

  **tar**
  real	1m39.574s
  user	0m10.311s
  sys	0m2.761s

  **sync**
  real	0m22.269s
  user	0m0.001s
  sys	0m0.005s

mb_optimize_scan = 1

  **tar**
  real	21m25.288s
  user	0m9.607s
  sys	0m3.026s

  **sync**
  real	1m23.402s
  user	0m0.005s
  sys	0m0.000s

v6.0-rc2 (With patches)

mb_optimize_scan = 0

  * similar to unpatched (~1 to 2 mins) *

mb_optimize_scan = 1

  **tar**
  real	5m7.858s
  user	0m11.008s
  sys	0m2.739s

  **sync**
  real	6m7.308s
  user	0m0.003s
  sys	0m0.001s

At this point, I'm pretty confident that it is the untar operation that
accounts for most of the regression, and that no other download/delete
operations in rpi-update are behind the delay. Further, it does seem
like your patches improve the performance but, from my tests, we are
still not close to the mb_optimize_scan=0 numbers.

I'm going to spend some more time collecting perf data and the block
groups the allocations land in during the untar, to see if we can get a
better idea from that data. Let me know if you'd want me to collect
anything else.

PS: One question: to find the block groups being used, I'm planning to
take the dumpe2fs output before and after the untar and then look at
the groups where the free block counts changed (since there is not much
else running on the Pi, I assume this should give us a rough idea of
the allocation pattern of the untar). I just wanted to check if there's
a better way to get this data.

Regards,
Ojaswin

> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread
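[Editorial aside: the dumpe2fs-diff idea from Ojaswin's postscript can be scripted. A rough sketch — the embedded dumpe2fs excerpts below are shortened, made-up samples (real dumpe2fs output carries more fields per group), and the parsing is deliberately loose.]

```python
import re

def free_blocks_per_group(dump):
    """Map group number -> free block count from dumpe2fs-style text."""
    groups = {}
    current = None
    for line in dump.splitlines():
        m = re.match(r'Group (\d+):', line)
        if m:
            current = int(m.group(1))
        m = re.search(r'(\d+) free blocks', line)
        if m and current is not None:
            groups[current] = int(m.group(1))
    return groups

# Hypothetical before/after snapshots (dumpe2fs run before and after untar)
before = """\
Group 0: (Blocks 0-32767)
  1200 free blocks, 512 free inodes
Group 1: (Blocks 32768-65535)
  32000 free blocks, 8192 free inodes
"""
after = """\
Group 0: (Blocks 0-32767)
  1200 free blocks, 512 free inodes
Group 1: (Blocks 32768-65535)
  20480 free blocks, 8000 free inodes
"""

b, a = free_blocks_per_group(before), free_blocks_per_group(after)
changed = {g: b[g] - a[g] for g in b if b[g] != a.get(g, b[g])}
print(changed)   # groups that received allocations during the untar
# -> {1: 11520}
```

The spread of keys in `changed` is the quantity of interest: with mb_optimize_scan=0 the allocations should concentrate in a handful of groups, while the regression shows up as deltas scattered across many groups.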
* Re: [PATCH 0/2] ext4: Fix performance regression with mballoc
  2022-08-25 17:49 ` Ojaswin Mujoo
@ 2022-08-26  9:07   ` Jan Kara
  0 siblings, 0 replies; 20+ messages in thread

From: Jan Kara @ 2022-08-26 9:07 UTC (permalink / raw)
To: Ojaswin Mujoo
Cc: Jan Kara, Stefan Wahren, Ted Tso, linux-ext4, Thorsten Leemhuis,
	Harshad Shirwadkar

On Thu 25-08-22 23:19:48, Ojaswin Mujoo wrote:
> [...]
> So I ran some more tests on v6.0-rc2 kernel with and without your patches and
> here are the details:
>
> Workload:-
> time tar -xf rpi-firmware.tar -C ./test
> time sync
>
> System details:
> - Rpi 3b+ w/ 8G memory card (~4G free)
> - tar is ~120MB compressed

Hum, maybe the difference is that I've tried with a somewhat larger
(20G) and otherwise empty filesystem...

> And here is the output of time command for various tests.
> [...]
> At this point, I'm pretty confident that it is the untar operation that is
> having most of the regression and no other download/delete operations in
> rpi-update are behind the delay. Further, it does seem like your patches
> improve the performance but, from my tests, we are still not close to the
> mb_optimize_scan=0 numbers.

Yes, thanks for the tests!

> I'm going to spend some more time trying to collect the perfs and which block
> group the allocations are happening during the untar to see if we can get a
> better idea from that data. Let me know if you'd want me to collect anything
> else.
>
> PS: One question, to find the blocks groups being used I'm planning to take
> the dumpe2fs output before and after untar and then see the groups where free
> blocks changed (since there is nothing much running on Pi i assume this
> should give us a rough idea of allocation pattern of untar), just wanted to
> check if there's a better way to get this data.

I have used 'find <target-dir> -exec filefrag -v {} \;' to get block
numbers of files. That gets you better insight than plain dumpe2fs
numbers...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread
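[Editorial aside: the `filefrag -v` approach Jan suggests can be turned into a per-block-group histogram. A hedged sketch, assuming 4k blocks and 32768 blocks per group (the ext4 default for 4k blocks); the extent lines below are made-up samples standing in for real `filefrag -v` output, whose exact column layout varies between e2fsprogs versions.]

```python
import re

BLOCKS_PER_GROUP = 32768   # assumption: 4k blocks, default flex-less layout

sample = """\
File size of ./a.bin is 1048576 (256 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..     255:      34816..     35071:    256:             last,eof
File size of ./b.bin is 524288 (128 blocks of 4096 bytes)
   0:        0..     127:      98304..     98431:    128:             last,eof
"""

def blocks_by_group(filefrag_output):
    """Count how many allocated blocks fall into each block group."""
    histogram = {}
    for line in filefrag_output.splitlines():
        # extent rows look like: "  0:  0..  255:  34816..  35071: 256: ..."
        m = re.match(r'\s*\d+:\s*\d+\.\.\s*\d+:\s*(\d+)\.\.\s*(\d+):\s*(\d+):',
                     line)
        if not m:
            continue
        phys_start, phys_end = int(m.group(1)), int(m.group(2))
        for group in range(phys_start // BLOCKS_PER_GROUP,
                           phys_end // BLOCKS_PER_GROUP + 1):
            # clip the extent to this group's block range
            lo = max(phys_start, group * BLOCKS_PER_GROUP)
            hi = min(phys_end, (group + 1) * BLOCKS_PER_GROUP - 1)
            histogram[group] = histogram.get(group, 0) + (hi - lo + 1)
    return histogram

print(blocks_by_group(sample))
# -> {1: 256, 3: 128}
```

Feeding it the concatenated output of `find <target-dir> -exec filefrag -v {} \;` gives, per group, how many blocks the untar placed there — exactly the "where did the blocks land" view that plain dumpe2fs deltas only approximate.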
* Re: [PATCH 0/2] ext4: Fix performance regression with mballoc
  2022-08-24 10:40 ` Jan Kara
  2022-08-24 14:13   ` Jan Kara
@ 2022-08-24 21:24   ` Stefan Wahren
  2022-08-25  9:18     ` Jan Kara
  1 sibling, 1 reply; 20+ messages in thread

From: Stefan Wahren @ 2022-08-24 21:24 UTC (permalink / raw)
To: Jan Kara
Cc: Ted Tso, linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo,
	Harshad Shirwadkar

Hi Jan,

Am 24.08.22 um 12:40 schrieb Jan Kara:
> Hi Stefan!
>
> On Wed 24-08-22 12:17:14, Stefan Wahren wrote:
>> [...]
>> i tested the whole series against 5.19 and 6.0.0-rc2. In both cases the
>> update process succeed which is a improvement, but the download + unpack
>> duration ( ~ 7 minutes ) is not as good as with mb_optimize_scan=0 ( ~ 1
>> minute ).
> OK, thanks for testing! I'll try to check specifically untar whether I can
> still see some differences in the IO pattern on my test machine.

i made two iostat output logs during the complete download phase with
5.19 and your series applied. iostat was running via an ssh connection
and rpi-update via the serial console.

First with mb_optimize_scan=0

https://github.com/lategoodbye/mb_optimize_scan_regress/blob/main/5.19_SDCIT_patch_nooptimize_download_success.iostat.log

Second with mb_optimize_scan=1

https://github.com/lategoodbye/mb_optimize_scan_regress/blob/main/5.19_SDCIT_patch_optimize_download_success.iostat.log

Maybe this helps.

>> Just a question, my tests always had MBCACHE=y . Is it possible that the
>> mb_optimize_scan is counterproductive for MBCACHE in this case?
> MBCACHE (despite similar name) is actually related to extended attributes
> so it likely has no impact on your workload.
>
>> I'm asking because before the download the update script removes the files
>> from the previous update process which already cause a high load.
> Do you mean already the removal step is noticeably slower with
> mb_optimize_scan=1? The removal will be modifying directory blocks, inode
> table blocks, block & inode bitmaps, and group descriptors. So if block
> allocations are more spread (due to mb_optimize_scan=1 used during the
> untar), the removal may also take somewhat longer.

Not sure about this; maybe we should concentrate on the download /
untar phase.

> Honza

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH 0/2] ext4: Fix performance regression with mballoc 2022-08-24 21:24 ` Stefan Wahren @ 2022-08-25 9:18 ` Jan Kara 2022-08-25 15:48 ` Stefan Wahren 2022-08-25 16:57 ` Stefan Wahren 0 siblings, 2 replies; 20+ messages in thread From: Jan Kara @ 2022-08-25 9:18 UTC (permalink / raw) To: Stefan Wahren Cc: Jan Kara, Ted Tso, linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo, Harshad Shirwadkar Hi Stefan! On Wed 24-08-22 23:24:43, Stefan Wahren wrote: > Am 24.08.22 um 12:40 schrieb Jan Kara: > > Hi Stefan! > > > > On Wed 24-08-22 12:17:14, Stefan Wahren wrote: > > > Am 23.08.22 um 22:15 schrieb Jan Kara: > > > > Hello, > > > > > > > > So I have implemented mballoc improvements to avoid spreading allocations > > > > even with mb_optimize_scan=1. It fixes the performance regression I was able > > > > to reproduce with reaim on my test machine: > > > > > > > > mb_optimize_scan=0 mb_optimize_scan=1 patched > > > > Hmean disk-1 2076.12 ( 0.00%) 2099.37 ( 1.12%) 2032.52 ( -2.10%) > > > > Hmean disk-41 92481.20 ( 0.00%) 83787.47 * -9.40%* 90308.37 ( -2.35%) > > > > Hmean disk-81 155073.39 ( 0.00%) 135527.05 * -12.60%* 154285.71 ( -0.51%) > > > > Hmean disk-121 185109.64 ( 0.00%) 166284.93 * -10.17%* 185298.62 ( 0.10%) > > > > Hmean disk-161 229890.53 ( 0.00%) 207563.39 * -9.71%* 232883.32 * 1.30%* > > > > Hmean disk-201 223333.33 ( 0.00%) 203235.59 * -9.00%* 221446.93 ( -0.84%) > > > > Hmean disk-241 235735.25 ( 0.00%) 217705.51 * -7.65%* 239483.27 * 1.59%* > > > > Hmean disk-281 266772.15 ( 0.00%) 241132.72 * -9.61%* 263108.62 ( -1.37%) > > > > Hmean disk-321 265435.50 ( 0.00%) 245412.84 * -7.54%* 267277.27 ( 0.69%) > > > > > > > > Stefan, can you please test whether these patches fix the problem for you as > > > > well? Comments & review welcome. > > > i tested the whole series against 5.19 and 6.0.0-rc2. 
In both cases the > > > update process succeed which is a improvement, but the download + unpack > > > duration ( ~ 7 minutes ) is not as good as with mb_optimize_scan=0 ( ~ 1 > > > minute ). > > OK, thanks for testing! I'll try to check specifically untar whether I can > > still see some differences in the IO pattern on my test machine. > > i made two iostat output logs during the complete download phase with 5.19 > and your series applied. iostat was running via ssh connection and > rpi-update via serial console. > > First with mb_optimize_scan=0 > > https://github.com/lategoodbye/mb_optimize_scan_regress/blob/main/5.19_SDCIT_patch_nooptimize_download_success.iostat.log > > Second with mb_optimize_scan=1 > > https://github.com/lategoodbye/mb_optimize_scan_regress/blob/main/5.19_SDCIT_patch_optimize_download_success.iostat.log > > Maybe this helps Thanks for the data! So this is interesting. In both iostat logs, there is initial phase where no IO happens. I guess that's expected. It is significantly longer in the mb_optimize_scan=0 but I suppose that is just caused by a difference in when iostat was actually started. Then in mb_optimize_scan=0 there is 155 seconds where the eMMC card is 100% utilized and then iostat ends. During this time ~63MB is written altogether. Request sizes vary a lot, average is 60KB. In mb_optimize_scan=1 case there is 715 seconds recorded where eMMC card is 100% utilized. During this time ~133MB is written, average request size is 40KB. If I look just at first 155 seconds of the trace (assuming iostat was in both cases terminated before writing was fully done), we have written ~53MB and average request size is 56KB. So with mb_optimize_scan=1 we are indeed still somewhat slower but based on the trace it is not clear why the download+unpack should take 7 minutes instead of 1 minute. There must be some other effect we are missing. 
Perhaps if you just download the archive manually, call sync(1), and measure how long it takes to (untar the archive + sync) in mb_optimize_scan=0/1 we can see whether plain untar is indeed making the difference or there's something else influencing the result as well (I have checked and rpi-update does a lot of other deleting & copying as the part of the update)? Thanks. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 20+ messages in thread
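(Editorial note: the experiment suggested above can be scripted. A rough sketch using Python's tarfile module and os.sync(); the archive path and destination directory are placeholders, and the remount command in the comment is just an assumed way of toggling the option:)

```python
# Sketch of the timing experiment suggested above: extract an archive,
# force writeback with sync, and measure the total wall time.
# Archive path and destination below are placeholders.
import os
import tarfile
import time

def timed_untar_sync(archive_path, dest_dir):
    start = time.monotonic()
    with tarfile.open(archive_path) as tf:
        tf.extractall(dest_dir)
    os.sync()  # include writeback time, as the experiment requires
    return time.monotonic() - start

# Repeat once per mount with mb_optimize_scan=0 and =1, e.g.:
#   mount -o remount,mb_optimize_scan=0 /dev/mmcblk1p2
#   elapsed = timed_untar_sync("/tmp/firmware.tar.gz", "/mnt/test")
```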
* Re: [PATCH 0/2] ext4: Fix performance regression with mballoc 2022-08-25 9:18 ` Jan Kara @ 2022-08-25 15:48 ` Stefan Wahren 2022-08-26 9:52 ` Jan Kara 2022-08-25 16:57 ` Stefan Wahren 1 sibling, 1 reply; 20+ messages in thread From: Stefan Wahren @ 2022-08-25 15:48 UTC (permalink / raw) To: Jan Kara Cc: Ted Tso, linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo, Harshad Shirwadkar Hi Jan, Am 25.08.22 um 11:18 schrieb Jan Kara: > Hi Stefan! > > On Wed 24-08-22 23:24:43, Stefan Wahren wrote: >> Am 24.08.22 um 12:40 schrieb Jan Kara: >>> Hi Stefan! >>> >>> On Wed 24-08-22 12:17:14, Stefan Wahren wrote: >>>> Am 23.08.22 um 22:15 schrieb Jan Kara: >>>>> Hello, >>>>> >>>>> So I have implemented mballoc improvements to avoid spreading allocations >>>>> even with mb_optimize_scan=1. It fixes the performance regression I was able >>>>> to reproduce with reaim on my test machine: >>>>> >>>>> mb_optimize_scan=0 mb_optimize_scan=1 patched >>>>> Hmean disk-1 2076.12 ( 0.00%) 2099.37 ( 1.12%) 2032.52 ( -2.10%) >>>>> Hmean disk-41 92481.20 ( 0.00%) 83787.47 * -9.40%* 90308.37 ( -2.35%) >>>>> Hmean disk-81 155073.39 ( 0.00%) 135527.05 * -12.60%* 154285.71 ( -0.51%) >>>>> Hmean disk-121 185109.64 ( 0.00%) 166284.93 * -10.17%* 185298.62 ( 0.10%) >>>>> Hmean disk-161 229890.53 ( 0.00%) 207563.39 * -9.71%* 232883.32 * 1.30%* >>>>> Hmean disk-201 223333.33 ( 0.00%) 203235.59 * -9.00%* 221446.93 ( -0.84%) >>>>> Hmean disk-241 235735.25 ( 0.00%) 217705.51 * -7.65%* 239483.27 * 1.59%* >>>>> Hmean disk-281 266772.15 ( 0.00%) 241132.72 * -9.61%* 263108.62 ( -1.37%) >>>>> Hmean disk-321 265435.50 ( 0.00%) 245412.84 * -7.54%* 267277.27 ( 0.69%) >>>>> >>>>> Stefan, can you please test whether these patches fix the problem for you as >>>>> well? Comments & review welcome. >>>> i tested the whole series against 5.19 and 6.0.0-rc2. 
In both cases the >>>> update process succeed which is a improvement, but the download + unpack >>>> duration ( ~ 7 minutes ) is not as good as with mb_optimize_scan=0 ( ~ 1 >>>> minute ). >>> OK, thanks for testing! I'll try to check specifically untar whether I can >>> still see some differences in the IO pattern on my test machine. >> i made two iostat output logs during the complete download phase with 5.19 >> and your series applied. iostat was running via ssh connection and >> rpi-update via serial console. >> >> First with mb_optimize_scan=0 >> >> https://github.com/lategoodbye/mb_optimize_scan_regress/blob/main/5.19_SDCIT_patch_nooptimize_download_success.iostat.log >> >> Second with mb_optimize_scan=1 >> >> https://github.com/lategoodbye/mb_optimize_scan_regress/blob/main/5.19_SDCIT_patch_optimize_download_success.iostat.log >> >> Maybe this helps > Thanks for the data! So this is interesting. In both iostat logs, there is > initial phase where no IO happens. I guess that's expected. It is > significantly longer in the mb_optimize_scan=0 but I suppose that is just > caused by a difference in when iostat was actually started. Then in > mb_optimize_scan=0 there is 155 seconds where the eMMC card is 100% > utilized and then iostat ends. During this time ~63MB is written > altogether. Request sizes vary a lot, average is 60KB. > > In mb_optimize_scan=1 case there is 715 seconds recorded where eMMC card is > 100% utilized. During this time ~133MB is written, average request size is > 40KB. If I look just at first 155 seconds of the trace (assuming iostat was > in both cases terminated before writing was fully done), we have written > ~53MB and average request size is 56KB. > > So with mb_optimize_scan=1 we are indeed still somewhat slower but based on > the trace it is not clear why the download+unpack should take 7 minutes > instead of 1 minute. There must be some other effect we are missing. 
> > Perhaps if you just download the archive manually, call sync(1), and measure > how long it takes to (untar the archive + sync) in mb_optimize_scan=0/1 we > can see whether plain untar is indeed making the difference or there's > something else influencing the result as well (I have checked and > rpi-update does a lot of other deleting & copying as the part of the > update)? Thanks. I will provide those iostats. Btw, I untarred the firmware archive (mb_optimize_scan=1 and your patch) and got the following:

cat /proc/fs/ext4/mmcblk1p2/mb_structs_summary

optimize_scan: 1
max_free_order_lists:
list_order_0_groups: 5
list_order_1_groups: 0
list_order_2_groups: 0
list_order_3_groups: 0
list_order_4_groups: 1
list_order_5_groups: 0
list_order_6_groups: 1
list_order_7_groups: 1
list_order_8_groups: 10
list_order_9_groups: 1
list_order_10_groups: 2
list_order_11_groups: 0
list_order_12_groups: 2
list_order_13_groups: 55
fragment_size_tree:
tree_min: 1
tree_max: 31249
tree_nodes: 79

Is this expected? > > Honza ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/2] ext4: Fix performance regression with mballoc 2022-08-25 15:48 ` Stefan Wahren @ 2022-08-26 9:52 ` Jan Kara 0 siblings, 0 replies; 20+ messages in thread From: Jan Kara @ 2022-08-26 9:52 UTC (permalink / raw) To: Stefan Wahren Cc: Jan Kara, Ted Tso, linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo, Harshad Shirwadkar Hi Stefan! On Thu 25-08-22 17:48:32, Stefan Wahren wrote: > Am 25.08.22 um 11:18 schrieb Jan Kara: > > On Wed 24-08-22 23:24:43, Stefan Wahren wrote: > > > Am 24.08.22 um 12:40 schrieb Jan Kara: > > > > On Wed 24-08-22 12:17:14, Stefan Wahren wrote: > > > > > Am 23.08.22 um 22:15 schrieb Jan Kara: > > > > > > So I have implemented mballoc improvements to avoid spreading allocations > > > > > > even with mb_optimize_scan=1. It fixes the performance regression I was able > > > > > > to reproduce with reaim on my test machine: > > > > > > > > > > > > mb_optimize_scan=0 mb_optimize_scan=1 patched > > > > > > Hmean disk-1 2076.12 ( 0.00%) 2099.37 ( 1.12%) 2032.52 ( -2.10%) > > > > > > Hmean disk-41 92481.20 ( 0.00%) 83787.47 * -9.40%* 90308.37 ( -2.35%) > > > > > > Hmean disk-81 155073.39 ( 0.00%) 135527.05 * -12.60%* 154285.71 ( -0.51%) > > > > > > Hmean disk-121 185109.64 ( 0.00%) 166284.93 * -10.17%* 185298.62 ( 0.10%) > > > > > > Hmean disk-161 229890.53 ( 0.00%) 207563.39 * -9.71%* 232883.32 * 1.30%* > > > > > > Hmean disk-201 223333.33 ( 0.00%) 203235.59 * -9.00%* 221446.93 ( -0.84%) > > > > > > Hmean disk-241 235735.25 ( 0.00%) 217705.51 * -7.65%* 239483.27 * 1.59%* > > > > > > Hmean disk-281 266772.15 ( 0.00%) 241132.72 * -9.61%* 263108.62 ( -1.37%) > > > > > > Hmean disk-321 265435.50 ( 0.00%) 245412.84 * -7.54%* 267277.27 ( 0.69%) > > > > > > > > > > > > Stefan, can you please test whether these patches fix the problem for you as > > > > > > well? Comments & review welcome. > > > > > i tested the whole series against 5.19 and 6.0.0-rc2. 
In both cases the > > > > > update process succeed which is a improvement, but the download + unpack > > > > > duration ( ~ 7 minutes ) is not as good as with mb_optimize_scan=0 ( ~ 1 > > > > > minute ). > > > > OK, thanks for testing! I'll try to check specifically untar whether I can > > > > still see some differences in the IO pattern on my test machine. > > > i made two iostat output logs during the complete download phase with 5.19 > > > and your series applied. iostat was running via ssh connection and > > > rpi-update via serial console. > > > > > > First with mb_optimize_scan=0 > > > > > > https://github.com/lategoodbye/mb_optimize_scan_regress/blob/main/5.19_SDCIT_patch_nooptimize_download_success.iostat.log > > > > > > Second with mb_optimize_scan=1 > > > > > > https://github.com/lategoodbye/mb_optimize_scan_regress/blob/main/5.19_SDCIT_patch_optimize_download_success.iostat.log > > > > > > Maybe this helps > > Thanks for the data! So this is interesting. In both iostat logs, there is > > initial phase where no IO happens. I guess that's expected. It is > > significantly longer in the mb_optimize_scan=0 but I suppose that is just > > caused by a difference in when iostat was actually started. Then in > > mb_optimize_scan=0 there is 155 seconds where the eMMC card is 100% > > utilized and then iostat ends. During this time ~63MB is written > > altogether. Request sizes vary a lot, average is 60KB. > > > > In mb_optimize_scan=1 case there is 715 seconds recorded where eMMC card is > > 100% utilized. During this time ~133MB is written, average request size is > > 40KB. If I look just at first 155 seconds of the trace (assuming iostat was > > in both cases terminated before writing was fully done), we have written > > ~53MB and average request size is 56KB. > > > > So with mb_optimize_scan=1 we are indeed still somewhat slower but based on > > the trace it is not clear why the download+unpack should take 7 minutes > > instead of 1 minute. 
There must be some other effect we are missing. > > > > Perhaps if you just download the archive manually, call sync(1), and measure > > how long it takes to (untar the archive + sync) in mb_optimize_scan=0/1 we > > can see whether plain untar is indeed making the difference or there's > > something else influencing the result as well (I have checked and > > rpi-update does a lot of other deleting & copying as the part of the > > update)? Thanks. > > I will provide those iostats. > > Btw i untar the firmware archive (mb_optimized_scan=1 and your patch) and > got following: > > cat /proc/fs/ext4/mmcblk1p2/mb_structs_summary > > > optimize_scan: 1 > max_free_order_lists: > list_order_0_groups: 5 > list_order_1_groups: 0 > list_order_2_groups: 0 > list_order_3_groups: 0 > list_order_4_groups: 1 > list_order_5_groups: 0 > list_order_6_groups: 1 > list_order_7_groups: 1 > list_order_8_groups: 10 > list_order_9_groups: 1 > list_order_10_groups: 2 > list_order_11_groups: 0 > list_order_12_groups: 2 > list_order_13_groups: 55 > fragment_size_tree: > tree_min: 1 > tree_max: 31249 > > tree_nodes: 79 > > Is this expected? Yes, I don't see anything out of ordinary in this for a used filesystem. It tells us there are 55 groups with big chunks of free space, there are some groups which have only small chunks of free space but that's expected when the filesystem is reasonably filled... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 20+ messages in thread
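(Editorial note: the list_order_N buckets in the dump discussed above count groups by the order of their largest free extent. A small parser for that output, a sketch assuming the exact field layout shown in the quoted dump:)

```python
# Sketch: parse the mb_structs_summary dump quoted above into a
# {order: group_count} map, so the free-space distribution can be
# inspected programmatically.

def parse_mb_summary(text):
    orders = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("list_order_"):
            key, value = line.split(":")
            order = int(key[len("list_order_"):].split("_")[0])
            orders[order] = int(value)
    return orders

sample = """
list_order_0_groups: 5
list_order_8_groups: 10
list_order_13_groups: 55
"""
orders = parse_mb_summary(sample)
print(orders[13])  # 55 groups still have big (order-13) free chunks
```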
* Re: [PATCH 0/2] ext4: Fix performance regression with mballoc 2022-08-25 9:18 ` Jan Kara 2022-08-25 15:48 ` Stefan Wahren @ 2022-08-25 16:57 ` Stefan Wahren 2022-08-26 10:15 ` Jan Kara 1 sibling, 1 reply; 20+ messages in thread From: Stefan Wahren @ 2022-08-25 16:57 UTC (permalink / raw) To: Jan Kara Cc: Ted Tso, linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo, Harshad Shirwadkar Hi Jan, Am 25.08.22 um 11:18 schrieb Jan Kara: > Hi Stefan! > > On Wed 24-08-22 23:24:43, Stefan Wahren wrote: >> Am 24.08.22 um 12:40 schrieb Jan Kara: >>> Hi Stefan! >>> >>> On Wed 24-08-22 12:17:14, Stefan Wahren wrote: >>>> Am 23.08.22 um 22:15 schrieb Jan Kara: >>>>> Hello, >>>>> >>>>> So I have implemented mballoc improvements to avoid spreading allocations >>>>> even with mb_optimize_scan=1. It fixes the performance regression I was able >>>>> to reproduce with reaim on my test machine: >>>>> >>>>> mb_optimize_scan=0 mb_optimize_scan=1 patched >>>>> Hmean disk-1 2076.12 ( 0.00%) 2099.37 ( 1.12%) 2032.52 ( -2.10%) >>>>> Hmean disk-41 92481.20 ( 0.00%) 83787.47 * -9.40%* 90308.37 ( -2.35%) >>>>> Hmean disk-81 155073.39 ( 0.00%) 135527.05 * -12.60%* 154285.71 ( -0.51%) >>>>> Hmean disk-121 185109.64 ( 0.00%) 166284.93 * -10.17%* 185298.62 ( 0.10%) >>>>> Hmean disk-161 229890.53 ( 0.00%) 207563.39 * -9.71%* 232883.32 * 1.30%* >>>>> Hmean disk-201 223333.33 ( 0.00%) 203235.59 * -9.00%* 221446.93 ( -0.84%) >>>>> Hmean disk-241 235735.25 ( 0.00%) 217705.51 * -7.65%* 239483.27 * 1.59%* >>>>> Hmean disk-281 266772.15 ( 0.00%) 241132.72 * -9.61%* 263108.62 ( -1.37%) >>>>> Hmean disk-321 265435.50 ( 0.00%) 245412.84 * -7.54%* 267277.27 ( 0.69%) >>>>> >>>>> Stefan, can you please test whether these patches fix the problem for you as >>>>> well? Comments & review welcome. >>>> i tested the whole series against 5.19 and 6.0.0-rc2. 
In both cases the >>>> update process succeed which is a improvement, but the download + unpack >>>> duration ( ~ 7 minutes ) is not as good as with mb_optimize_scan=0 ( ~ 1 >>>> minute ). >>> OK, thanks for testing! I'll try to check specifically untar whether I can >>> still see some differences in the IO pattern on my test machine. >> i made two iostat output logs during the complete download phase with 5.19 >> and your series applied. iostat was running via ssh connection and >> rpi-update via serial console. >> >> First with mb_optimize_scan=0 >> >> https://github.com/lategoodbye/mb_optimize_scan_regress/blob/main/5.19_SDCIT_patch_nooptimize_download_success.iostat.log >> >> Second with mb_optimize_scan=1 >> >> https://github.com/lategoodbye/mb_optimize_scan_regress/blob/main/5.19_SDCIT_patch_optimize_download_success.iostat.log >> >> Maybe this helps > Thanks for the data! So this is interesting. In both iostat logs, there is > initial phase where no IO happens. I guess that's expected. It is > significantly longer in the mb_optimize_scan=0 but I suppose that is just > caused by a difference in when iostat was actually started. Then in > mb_optimize_scan=0 there is 155 seconds where the eMMC card is 100% > utilized and then iostat ends. During this time ~63MB is written > altogether. Request sizes vary a lot, average is 60KB. > > In mb_optimize_scan=1 case there is 715 seconds recorded where eMMC card is > 100% utilized. During this time ~133MB is written, average request size is > 40KB. If I look just at first 155 seconds of the trace (assuming iostat was > in both cases terminated before writing was fully done), we have written > ~53MB and average request size is 56KB. > > So with mb_optimize_scan=1 we are indeed still somewhat slower but based on > the trace it is not clear why the download+unpack should take 7 minutes > instead of 1 minute. There must be some other effect we are missing. 
> > Perhaps if you just download the archive manually, call sync(1), and measure > how long it takes to (untar the archive + sync) in mb_optimize_scan=0/1 we > can see whether plain untar is indeed making the difference or there's > something else influencing the result as well (I have checked and > rpi-update does a lot of other deleting & copying as the part of the > update)? Thanks. mb_optimize_scan=0 -> almost 5 minutes mb_optimize_scan=1 -> almost 18 minutes https://github.com/lategoodbye/mb_optimize_scan_regress/commit/3f3fe8f87881687bb654051942923a6b78f16dec > > Honza ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/2] ext4: Fix performance regression with mballoc 2022-08-25 16:57 ` Stefan Wahren @ 2022-08-26 10:15 ` Jan Kara 2022-08-27 14:36 ` Ojaswin Mujoo 0 siblings, 1 reply; 20+ messages in thread From: Jan Kara @ 2022-08-26 10:15 UTC (permalink / raw) To: Stefan Wahren Cc: Jan Kara, Ted Tso, linux-ext4, Thorsten Leemhuis, Ojaswin Mujoo, Harshad Shirwadkar Hi Stefan, On Thu 25-08-22 18:57:08, Stefan Wahren wrote: > > Perhaps if you just download the archive manually, call sync(1), and measure > > how long it takes to (untar the archive + sync) in mb_optimize_scan=0/1 we > > can see whether plain untar is indeed making the difference or there's > > something else influencing the result as well (I have checked and > > rpi-update does a lot of other deleting & copying as the part of the > > update)? Thanks. > > mb_optimize_scan=0 -> almost 5 minutes > > mb_optimize_scan=1 -> almost 18 minutes > > https://github.com/lategoodbye/mb_optimize_scan_regress/commit/3f3fe8f87881687bb654051942923a6b78f16dec Thanks! So now the iostat data indeed looks substantially different.

                        nooptimize    optimize
Total written           183.6 MB      190.5 MB
Time (recorded)         283 s         1040 s
Avg write request size  79 KB         41 KB

So indeed with mb_optimize_scan=1 we do submit substantially smaller requests on average. So far I'm not sure why that is. Since Ojaswin can reproduce as well, let's see what he can see from block location info. Thanks again for help with debugging this and enjoy your vacation! Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/2] ext4: Fix performance regression with mballoc 2022-08-26 10:15 ` Jan Kara @ 2022-08-27 14:36 ` Ojaswin Mujoo 2022-08-29 9:04 ` Jan Kara 2022-09-04 10:01 ` Stefan Wahren 0 siblings, 2 replies; 20+ messages in thread From: Ojaswin Mujoo @ 2022-08-27 14:36 UTC (permalink / raw) To: Jan Kara Cc: Stefan Wahren, Ted Tso, linux-ext4, Thorsten Leemhuis, Harshad Shirwadkar On Fri, Aug 26, 2022 at 12:15:22PM +0200, Jan Kara wrote: > Hi Stefan, > > On Thu 25-08-22 18:57:08, Stefan Wahren wrote: > > > Perhaps if you just download the archive manually, call sync(1), and measure > > > how long it takes to (untar the archive + sync) in mb_optimize_scan=0/1 we > > > can see whether plain untar is indeed making the difference or there's > > > something else influencing the result as well (I have checked and > > > rpi-update does a lot of other deleting & copying as the part of the > > > update)? Thanks. > > > > mb_optimize_scan=0 -> almost 5 minutes > > > > mb_optimize_scan=1 -> almost 18 minutes > > > > https://github.com/lategoodbye/mb_optimize_scan_regress/commit/3f3fe8f87881687bb654051942923a6b78f16dec > > Thanks! So now the iostat data indeed looks substantially different. > > nooptimize optimize > Total written 183.6 MB 190.5 MB > Time (recorded) 283 s 1040 s > Avg write request size 79 KB 41 KB > > So indeed with mb_optimize_scan=1 we do submit substantially smaller > requests on average. So far I'm not sure why that is. Since Ojaswin can > reproduce as well, let's see what he can see from block location info. > Thanks again for help with debugging this and enjoy your vacation! > Hi Jan and Stefan, Apologies for the delay, I was on leave yesterday and couldn't find time to get to this. So I was able to collect the block numbers using the method you suggested. I converted the blocks numbers to BG numbers and plotted that data to visualze the allocation spread. 
You can find them here:

mb-opt=0, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-0-patched.png
mb-opt=1, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-patched.png
mb-opt=1, unpatched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-unpatched.png

Observations:
* Before the patch, mb_optimize_scan=1 allocations were way more spread out, across 40 different BGs.
* With the patch, we still allocate in 36 different BGs, but the majority happens in just 1 or 2 BGs.
* With mb_optimize_scan=0, we only allocate in just 7 unique BGs, which could explain why this is faster.

Also, one strange thing I'm seeing is that the perfs don't really show any particular function causing the regression, which is surprising considering mb_optimize_scan=1 takes almost 10 times more time. All the perfs can be found here (raw files and perf report/diff --stdio): https://github.com/OjaswinM/mbopt-bug/tree/master/perfs

Lastly, FWIW I'm not able to replicate the regression when using loop devices, and mb_optimize_scan=1 performs similarly to mb_optimize_scan=0 (without the patches as well). Not sure if this is related to the issue or just some side effect of using loop devices. Will post here if I have any updates on this. Regards, Ojaswin ^ permalink raw reply [flat|nested] 20+ messages in thread
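(Editorial note: the block-number-to-BG conversion used for these plots is plain integer division by the group size. A sketch, assuming the common geometry of 32768 blocks per group with a 4 KiB block size — the real values come from dumpe2fs — and an invented trace:)

```python
# Sketch: map physical block numbers to block-group numbers and count
# how many distinct groups an allocation trace touched. 32768 blocks
# per group assumes a 4 KiB block size; check dumpe2fs for real values.

BLOCKS_PER_GROUP = 32768

def group_spread(block_numbers):
    counts = {}
    for block in block_numbers:
        bg = block // BLOCKS_PER_GROUP
        counts[bg] = counts.get(bg, 0) + 1
    return counts

# Invented trace: most blocks in two target groups, plus a long tail
trace = [40000, 40100, 70000, 70200, 500000, 900000]
spread = group_spread(trace)
print(len(spread))  # 4 distinct BGs: two busy ones plus a tail
```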
* Re: [PATCH 0/2] ext4: Fix performance regression with mballoc 2022-08-27 14:36 ` Ojaswin Mujoo @ 2022-08-29 9:04 ` Jan Kara 2022-08-31 7:36 ` Ojaswin Mujoo 2022-09-04 10:01 ` Stefan Wahren 1 sibling, 1 reply; 20+ messages in thread From: Jan Kara @ 2022-08-29 9:04 UTC (permalink / raw) To: Ojaswin Mujoo Cc: Jan Kara, Stefan Wahren, Ted Tso, linux-ext4, Thorsten Leemhuis, Harshad Shirwadkar On Sat 27-08-22 20:06:00, Ojaswin Mujoo wrote: > On Fri, Aug 26, 2022 at 12:15:22PM +0200, Jan Kara wrote: > > Hi Stefan, > > > > On Thu 25-08-22 18:57:08, Stefan Wahren wrote: > > > > Perhaps if you just download the archive manually, call sync(1), and measure > > > > how long it takes to (untar the archive + sync) in mb_optimize_scan=0/1 we > > > > can see whether plain untar is indeed making the difference or there's > > > > something else influencing the result as well (I have checked and > > > > rpi-update does a lot of other deleting & copying as the part of the > > > > update)? Thanks. > > > > > > mb_optimize_scan=0 -> almost 5 minutes > > > > > > mb_optimize_scan=1 -> almost 18 minutes > > > > > > https://github.com/lategoodbye/mb_optimize_scan_regress/commit/3f3fe8f87881687bb654051942923a6b78f16dec > > > > Thanks! So now the iostat data indeed looks substantially different. > > > > nooptimize optimize > > Total written 183.6 MB 190.5 MB > > Time (recorded) 283 s 1040 s > > Avg write request size 79 KB 41 KB > > > > So indeed with mb_optimize_scan=1 we do submit substantially smaller > > requests on average. So far I'm not sure why that is. Since Ojaswin can > > reproduce as well, let's see what he can see from block location info. > > Thanks again for help with debugging this and enjoy your vacation! > > > > Hi Jan and Stefan, > > Apologies for the delay, I was on leave yesterday and couldn't find time to get to this. > > So I was able to collect the block numbers using the method you suggested. 
I converted the > blocks numbers to BG numbers and plotted that data to visualze the allocation spread. You can > find them here: > > mb-opt=0, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-0-patched.png > mb-opt=1, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-patched.png > mb-opt=1, unpatched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-unpatched.png > > Observations: > * Before the patched mb_optimize_scan=1 allocations were way more spread out in > 40 different BGs. > * With the patch, we still allocate in 36 different BGs but majority happen in > just 1 or 2 BGs. > * With mb_optimize_scan=0, we only allocate in just 7 unique BGs, which could > explain why this is faster. Thanks for testing Ojaswin! Based on iostats from Stefan, I'm relatively confident the spread between block groups is responsible for the performance regression. Iostats show pretty clearly that the write throughput is determined by the average write request size which is directly related to the number of block groups we allocate from. Your stats for patched kernel show that there are two block groups which get big part of allocations (these are likely the target block groups) but then remaining ~1/3 is spread a lot. I'm not yet sure why that is... I guess I will fiddle some more with my test VM and try to reproduce these allocation differences (on my test server the allocation pattern on patched kernel is very similar with mb_optimize_scan=0/1). > Also, one strange thing I'm seeing is that the perfs don't really show any > particular function causing the regression, which is surprising considering > mb_optimize_scan=1 almost takes 10 times more time. Well, the time is not spent by CPU. We spend more time waiting for IO which is not visible in perf profiles. 
You could plot something like offcpu flame graphs, there the difference would be visible but I don't expect you'd see anything more than just we spend more time in functions waiting for writeback to complete. > Lastly, FWIW I'm not able to replicate the regression when using loop devices > and mb_optmize_scan=1 performs similar to mb-opmtimize_scan=0 (without patches > as well). Not sure if this is related to the issue or just some side effect of > using loop devices. This is because eMMC devices seem to be very sensitive to IO pattern (and write request size). For loop devices, we don't care about request size much so that's why mb_optimize_scan makes no big difference. But can you still see the difference in the allocation pattern with the loop device? Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/2] ext4: Fix performance regression with mballoc 2022-08-29 9:04 ` Jan Kara @ 2022-08-31 7:36 ` Ojaswin Mujoo 0 siblings, 0 replies; 20+ messages in thread From: Ojaswin Mujoo @ 2022-08-31 7:36 UTC (permalink / raw) To: Jan Kara Cc: Stefan Wahren, Ted Tso, linux-ext4, Thorsten Leemhuis, Harshad Shirwadkar On Mon, Aug 29, 2022 at 11:04:34AM +0200, Jan Kara wrote: > On Sat 27-08-22 20:06:00, Ojaswin Mujoo wrote: > > On Fri, Aug 26, 2022 at 12:15:22PM +0200, Jan Kara wrote: > > > Hi Stefan, > > > > > > On Thu 25-08-22 18:57:08, Stefan Wahren wrote: > > > > > Perhaps if you just download the archive manually, call sync(1), and measure > > > > > how long it takes to (untar the archive + sync) in mb_optimize_scan=0/1 we > > > > > can see whether plain untar is indeed making the difference or there's > > > > > something else influencing the result as well (I have checked and > > > > > rpi-update does a lot of other deleting & copying as the part of the > > > > > update)? Thanks. > > > > > > > > mb_optimize_scan=0 -> almost 5 minutes > > > > > > > > mb_optimize_scan=1 -> almost 18 minutes > > > > > > > > https://github.com/lategoodbye/mb_optimize_scan_regress/commit/3f3fe8f87881687bb654051942923a6b78f16dec > > > > > > Thanks! So now the iostat data indeed looks substantially different. > > > > > > nooptimize optimize > > > Total written 183.6 MB 190.5 MB > > > Time (recorded) 283 s 1040 s > > > Avg write request size 79 KB 41 KB > > > > > > So indeed with mb_optimize_scan=1 we do submit substantially smaller > > > requests on average. So far I'm not sure why that is. Since Ojaswin can > > > reproduce as well, let's see what he can see from block location info. > > > Thanks again for help with debugging this and enjoy your vacation! > > > > > > > Hi Jan and Stefan, > > > > Apologies for the delay, I was on leave yesterday and couldn't find time to get to this. > > > > So I was able to collect the block numbers using the method you suggested. 
I converted the > > blocks numbers to BG numbers and plotted that data to visualze the allocation spread. You can > > find them here: > > > > mb-opt=0, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-0-patched.png > > mb-opt=1, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-patched.png > > mb-opt=1, unpatched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-unpatched.png > > > > Observations: > > * Before the patched mb_optimize_scan=1 allocations were way more spread out in > > 40 different BGs. > > * With the patch, we still allocate in 36 different BGs but majority happen in > > just 1 or 2 BGs. > > * With mb_optimize_scan=0, we only allocate in just 7 unique BGs, which could > > explain why this is faster. > > Thanks for testing Ojaswin! Based on iostats from Stefan, I'm relatively > confident the spread between block groups is responsible for the > performance regression. Iostats show pretty clearly that the write > throughput is determined by the average write request size which is > directly related to the number of block groups we allocate from. > > Your stats for patched kernel show that there are two block groups which > get big part of allocations (these are likely the target block groups) but > then remaining ~1/3 is spread a lot. I'm not yet sure why that is... I > guess I will fiddle some more with my test VM and try to reproduce these > allocation differences (on my test server the allocation pattern on patched > kernel is very similar with mb_optimize_scan=0/1). > > > Also, one strange thing I'm seeing is that the perfs don't really show any > > particular function causing the regression, which is surprising considering > > mb_optimize_scan=1 almost takes 10 times more time. > > Well, the time is not spent by CPU. We spend more time waiting for IO which > is not visible in perf profiles. 
You could plot something like offcpu flame > graphs, there the difference would be visible but I don't expect you'd see > anything more than just we spend more time in functions waiting for > writeback to complete. Ahh I see, that makes sense. > > > Lastly, FWIW I'm not able to replicate the regression when using loop devices > > and mb_optmize_scan=1 performs similar to mb-opmtimize_scan=0 (without patches > > as well). Not sure if this is related to the issue or just some side effect of > > using loop devices. > > This is because eMMC devices seem to be very sensitive to IO pattern (and > write request size). For loop devices, we don't care about request size > much so that's why mb_optimize_scan makes no big difference. But can you > still see the difference in the allocation pattern with the loop device? So I tested with loop devices and yes, I do see a similar allocation pattern, although the regression is not there. Thanks, Ojaswin ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/2] ext4: Fix performance regression with mballoc 2022-08-27 14:36 ` Ojaswin Mujoo 2022-08-29 9:04 ` Jan Kara @ 2022-09-04 10:01 ` Stefan Wahren 2022-09-04 22:32 ` Andreas Dilger 1 sibling, 1 reply; 20+ messages in thread From: Stefan Wahren @ 2022-09-04 10:01 UTC (permalink / raw) To: Ojaswin Mujoo, Jan Kara Cc: Ted Tso, linux-ext4, Thorsten Leemhuis, Harshad Shirwadkar Hi, Am 27.08.22 um 16:36 schrieb Ojaswin Mujoo: > On Fri, Aug 26, 2022 at 12:15:22PM +0200, Jan Kara wrote: >> Hi Stefan, >> >> On Thu 25-08-22 18:57:08, Stefan Wahren wrote: >>>> Perhaps if you just download the archive manually, call sync(1), and measure >>>> how long it takes to (untar the archive + sync) in mb_optimize_scan=0/1 we >>>> can see whether plain untar is indeed making the difference or there's >>>> something else influencing the result as well (I have checked and >>>> rpi-update does a lot of other deleting & copying as the part of the >>>> update)? Thanks. >>> mb_optimize_scan=0 -> almost 5 minutes >>> >>> mb_optimize_scan=1 -> almost 18 minutes >>> >>> https://github.com/lategoodbye/mb_optimize_scan_regress/commit/3f3fe8f87881687bb654051942923a6b78f16dec >> Thanks! So now the iostat data indeed looks substantially different. >> >> nooptimize optimize >> Total written 183.6 MB 190.5 MB >> Time (recorded) 283 s 1040 s >> Avg write request size 79 KB 41 KB >> >> So indeed with mb_optimize_scan=1 we do submit substantially smaller >> requests on average. So far I'm not sure why that is. Since Ojaswin can >> reproduce as well, let's see what he can see from block location info. >> Thanks again for help with debugging this and enjoy your vacation! >> > Hi Jan and Stefan, > > Apologies for the delay, I was on leave yesterday and couldn't find time to get to this. > > So I was able to collect the block numbers using the method you suggested. I converted the > blocks numbers to BG numbers and plotted that data to visualze the allocation spread. 
You can > find them here: > > mb-opt=0, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-0-patched.png > mb-opt=1, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-patched.png > mb-opt=1, unpatched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-unpatched.png > > Observations: > * Before the patched mb_optimize_scan=1 allocations were way more spread out in > 40 different BGs. > * With the patch, we still allocate in 36 different BGs but majority happen in > just 1 or 2 BGs. > * With mb_optimize_scan=0, we only allocate in just 7 unique BGs, which could > explain why this is faster. Thanks, this is very helpful for my understanding. So it seems to me that with mb_optimize_scan disabled we get more sequential write behavior, and with mb_optimize_scan enabled more random write behavior. From my understanding, writing small blocks at random addresses of the SD card flash causes a lot of overhead, because the SD card controller needs to erase huge blocks (up to 1 MB) before it's able to program the flash pages. This would explain why this series doesn't fix the performance issue: the total number of BGs is still much higher. Is this new block allocation pattern a side effect of the optimization, or desired? > Also, one strange thing I'm seeing is that the perfs don't really show any > particular function causing the regression, which is surprising considering > mb_optimize_scan=1 almost takes 10 times more time. > > All the perfs can be found here (raw files and perf report/diff --stdio ): > https://github.com/OjaswinM/mbopt-bug/tree/master/perfs > > Lastly, FWIW I'm not able to replicate the regression when using loop devices > and mb_optmize_scan=1 performs similar to mb-opmtimize_scan=0 (without patches > as well). Not sure if this is related to the issue or just some side effect of > using loop devices.
The regression actually happen in the sd card, but it's triggered external by the different IO pattern. For a loop device there is no difference between sequential and random write. > > Will post here if I have any updates on this. > > Regards, > Ojaswin ^ permalink raw reply [flat|nested] 20+ messages in thread
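[Editorial note: the block-to-BG conversion Ojaswin describes is easy to reproduce. The sketch below is illustrative only and assumes the default ext4 geometry of 32768 blocks per group; the real value comes from the superblock (e.g. `dumpe2fs -h`).]

```python
from collections import Counter

def block_to_bg(block_no: int, blocks_per_group: int = 32768) -> int:
    """Map an ext4 physical block number to its block group (BG) number."""
    return block_no // blocks_per_group

def allocation_spread(block_numbers, blocks_per_group: int = 32768) -> Counter:
    """Count how many allocated blocks fall into each block group."""
    return Counter(block_to_bg(b, blocks_per_group) for b in block_numbers)

# Blocks clustered in few BGs indicate a sequential-friendly layout;
# many unique BGs indicate the spread-out pattern seen with mb_optimize_scan=1.
spread = allocation_spread([10, 20, 32770, 32771, 98305])
print(len(spread))  # number of unique BGs touched -> 3
```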
* Re: [PATCH 0/2] ext4: Fix performance regression with mballoc
  2022-09-04 10:01 ` Stefan Wahren
@ 2022-09-04 22:32 ` Andreas Dilger
  2022-09-05 10:15 ` Jan Kara
  0 siblings, 1 reply; 20+ messages in thread
From: Andreas Dilger @ 2022-09-04 22:32 UTC (permalink / raw)
To: Stefan Wahren
Cc: Ojaswin Mujoo, Jan Kara, Ted Tso, linux-ext4, Thorsten Leemhuis,
    Harshad Shirwadkar

On Sep 4, 2022, at 00:01, Stefan Wahren <stefan.wahren@i2se.com> wrote:
>
>> Am 27.08.22 um 16:36 schrieb Ojaswin Mujoo:
>>> On Fri, Aug 26, 2022 at 12:15:22PM +0200, Jan Kara wrote:
>>> Hi Stefan,
>>>
>>> On Thu 25-08-22 18:57:08, Stefan Wahren wrote:
>>>>> Perhaps if you just download the archive manually, call sync(1), and measure
>>>>> how long it takes to (untar the archive + sync) in mb_optimize_scan=0/1 we
>>>>> can see whether plain untar is indeed making the difference or there's
>>>>> something else influencing the result as well (I have checked and
>>>>> rpi-update does a lot of other deleting & copying as part of the
>>>>> update)? Thanks.
>>>> mb_optimize_scan=0 -> almost 5 minutes
>>>>
>>>> mb_optimize_scan=1 -> almost 18 minutes
>>>>
>>>> https://github.com/lategoodbye/mb_optimize_scan_regress/commit/3f3fe8f87881687bb654051942923a6b78f16dec
>>> Thanks! So now the iostat data indeed looks substantially different.
>>>
>>>                         nooptimize  optimize
>>> Total written           183.6 MB    190.5 MB
>>> Time (recorded)         283 s       1040 s
>>> Avg write request size  79 KB       41 KB
>>>
>>> So indeed with mb_optimize_scan=1 we do submit substantially smaller
>>> requests on average. So far I'm not sure why that is. Since Ojaswin can
>>> reproduce as well, let's see what he can see from block location info.
>>> Thanks again for help with debugging this and enjoy your vacation!
>>>
>> Hi Jan and Stefan,
>>
>> Apologies for the delay, I was on leave yesterday and couldn't find time
>> to get to this.
>>
>> So I was able to collect the block numbers using the method you suggested.
>> I converted the block numbers to BG numbers and plotted that data to
>> visualize the allocation spread. You can find them here:
>>
>> mb-opt=0, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-0-patched.png
>> mb-opt=1, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-patched.png
>> mb-opt=1, unpatched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-unpatched.png
>>
>> Observations:
>> * Before the patch, mb_optimize_scan=1 allocations were way more spread
>>   out, in 40 different BGs.
>> * With the patch, we still allocate in 36 different BGs, but the majority
>>   happen in just 1 or 2 BGs.
>> * With mb_optimize_scan=0, we only allocate in 7 unique BGs, which could
>>   explain why this is faster.
>
> Thanks, this is very helpful for my understanding. So it seems to me that
> with mb_optimize_scan disabled we have a more sequential write behavior,
> and with mb_optimize_scan enabled a more random write behavior.
>
> From my understanding, writing small blocks at random addresses of the SD
> card flash causes a lot of overhead, because the SD card controller needs
> to erase huge blocks (up to 1 MB) before it is able to program the flash
> pages. This would explain why this series doesn't fix the performance
> issue: the total number of BGs touched is still much higher.
>
> Is this new block allocation pattern a side effect of the optimization or
> desired?

The goal of mb_optimize_scan is to avoid a large amount of linear scanning
for very large filesystems when there are many block groups that are full
or fragmented.

It seems that for empty filesystems the new list management is not very
ideal. It probably makes sense to have a hybrid, with some small amount of
linear scanning (e.g. a meta block group worth), and then use the new list
to find a new group if that doesn't find any group with free space.

Cheers, Andreas

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH 0/2] ext4: Fix performance regression with mballoc
  2022-09-04 22:32 ` Andreas Dilger
@ 2022-09-05 10:15 ` Jan Kara
  0 siblings, 0 replies; 20+ messages in thread
From: Jan Kara @ 2022-09-05 10:15 UTC (permalink / raw)
To: Andreas Dilger
Cc: Stefan Wahren, Ojaswin Mujoo, Jan Kara, Ted Tso, linux-ext4,
    Thorsten Leemhuis, Harshad Shirwadkar

On Sun 04-09-22 12:32:59, Andreas Dilger wrote:
> On Sep 4, 2022, at 00:01, Stefan Wahren <stefan.wahren@i2se.com> wrote:
> >
> >> Am 27.08.22 um 16:36 schrieb Ojaswin Mujoo:
> >>> On Fri, Aug 26, 2022 at 12:15:22PM +0200, Jan Kara wrote:
> >>> Hi Stefan,
> >>>
> >>> On Thu 25-08-22 18:57:08, Stefan Wahren wrote:
> >>>>> Perhaps if you just download the archive manually, call sync(1), and measure
> >>>>> how long it takes to (untar the archive + sync) in mb_optimize_scan=0/1 we
> >>>>> can see whether plain untar is indeed making the difference or there's
> >>>>> something else influencing the result as well (I have checked and
> >>>>> rpi-update does a lot of other deleting & copying as part of the
> >>>>> update)? Thanks.
> >>>> mb_optimize_scan=0 -> almost 5 minutes
> >>>>
> >>>> mb_optimize_scan=1 -> almost 18 minutes
> >>>>
> >>>> https://github.com/lategoodbye/mb_optimize_scan_regress/commit/3f3fe8f87881687bb654051942923a6b78f16dec
> >>> Thanks! So now the iostat data indeed looks substantially different.
> >>>
> >>>                         nooptimize  optimize
> >>> Total written           183.6 MB    190.5 MB
> >>> Time (recorded)         283 s       1040 s
> >>> Avg write request size  79 KB       41 KB
> >>>
> >>> So indeed with mb_optimize_scan=1 we do submit substantially smaller
> >>> requests on average. So far I'm not sure why that is. Since Ojaswin can
> >>> reproduce as well, let's see what he can see from block location info.
> >>> Thanks again for help with debugging this and enjoy your vacation!
> >>>
> >> Hi Jan and Stefan,
> >>
> >> Apologies for the delay, I was on leave yesterday and couldn't find time
> >> to get to this.
> >>
> >> So I was able to collect the block numbers using the method you suggested.
> >> I converted the block numbers to BG numbers and plotted that data to
> >> visualize the allocation spread. You can find them here:
> >>
> >> mb-opt=0, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-0-patched.png
> >> mb-opt=1, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-patched.png
> >> mb-opt=1, unpatched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-unpatched.png
> >>
> >> Observations:
> >> * Before the patch, mb_optimize_scan=1 allocations were way more spread
> >>   out, in 40 different BGs.
> >> * With the patch, we still allocate in 36 different BGs, but the majority
> >>   happen in just 1 or 2 BGs.
> >> * With mb_optimize_scan=0, we only allocate in 7 unique BGs, which could
> >>   explain why this is faster.
> >
> > Thanks, this is very helpful for my understanding. So it seems to me that
> > with mb_optimize_scan disabled we have a more sequential write behavior,
> > and with mb_optimize_scan enabled a more random write behavior.
> >
> > From my understanding, writing small blocks at random addresses of the SD
> > card flash causes a lot of overhead, because the SD card controller needs
> > to erase huge blocks (up to 1 MB) before it is able to program the flash
> > pages. This would explain why this series doesn't fix the performance
> > issue: the total number of BGs touched is still much higher.
> >
> > Is this new block allocation pattern a side effect of the optimization or
> > desired?
>
> The goal of mb_optimize_scan is to avoid a large amount of linear scanning
> for very large filesystems when there are many block groups that are full
> or fragmented.
>
> It seems that for empty filesystems the new list management is not very
> ideal.

The filesystems here are actually about half full and not too fragmented.

> It probably makes sense to have a hybrid, with some small amount of
> linear scanning (e.g. a meta block group worth), and then use the new list
> to find a new group if that doesn't find any group with free space.

There is a heuristic in the current mballoc code to scan a few block groups
linearly before using the data structures to decide about the next block
group, but it gets used only for rotational devices. I don't know of an
easy way to detect other types of storage, like eMMC cards, that also
benefit from better data allocation locality.

I have come up with two more patches on top of my current attempt which
improve allocation locality, and at least for the untar case causing issues
on eMMC they get close to the mb_optimize_scan=0 locality. I want to check
whether the higher locality does not hurt performance for highly parallel
workloads though. Then I'll post them for review and discussion.

Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread
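[Editorial note: the rotational heuristic Jan mentions keys off what the device advertises to the block layer; the same flag is visible from userspace in sysfs as `/sys/block/<dev>/queue/rotational`. A small sketch follows; the `sysfs_base` parameter exists only to make the helper testable and is not part of any real interface.]

```python
from pathlib import Path

def is_rotational(device: str, sysfs_base: str = "/sys/block") -> bool:
    """Return True if the kernel reports the block device as rotational
    (a spinning disk). SD cards, eMMC, and SSDs typically report 0, which
    is why a heuristic gated on this flag skips them even though they too
    benefit from allocation locality."""
    flag_file = Path(sysfs_base) / device / "queue" / "rotational"
    return flag_file.read_text().strip() == "1"
```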
end of thread, other threads:[~2022-09-05 10:17 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-23 20:15 [PATCH 0/2] ext4: Fix performance regression with mballoc Jan Kara
2022-08-23 20:15 ` [PATCH 1/2] ext4: Make mballoc try target group first even with mb_optimize_scan Jan Kara
2022-08-23 20:15 ` [PATCH 2/2] ext4: Avoid unnecessary spreading of allocations among groups Jan Kara
2022-08-24 10:17 ` [PATCH 0/2] ext4: Fix performance regression with mballoc Stefan Wahren
2022-08-24 10:40 ` Jan Kara
2022-08-24 14:13 ` Jan Kara
2022-08-25 17:49 ` Ojaswin Mujoo
2022-08-26  9:07 ` Jan Kara
2022-08-24 21:24 ` Stefan Wahren
2022-08-25  9:18 ` Jan Kara
2022-08-25 15:48 ` Stefan Wahren
2022-08-26  9:52 ` Jan Kara
2022-08-25 16:57 ` Stefan Wahren
2022-08-26 10:15 ` Jan Kara
2022-08-27 14:36 ` Ojaswin Mujoo
2022-08-29  9:04 ` Jan Kara
2022-08-31  7:36 ` Ojaswin Mujoo
2022-09-04 10:01 ` Stefan Wahren
2022-09-04 22:32 ` Andreas Dilger
2022-09-05 10:15 ` Jan Kara
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox