public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
* xfsprogs: repair: Higher memory consumption when disable prefetch
@ 2023-11-08 15:56 Per Förlin
  2023-11-08 22:05 ` Dave Chinner
  0 siblings, 1 reply; 3+ messages in thread
From: Per Förlin @ 2023-11-08 15:56 UTC (permalink / raw)
  To: linux-xfs@vger.kernel.org

Hi Linux XFS community,

Please bear with me, I'm new to XFS :)

I'm comparing how EXT4 and XFS behave on systems with a relatively
small RAM-to-storage ratio. The current focus is on filesystem repair memory consumption.

I have been running some tests using the max_mem_specified option.
The "-m" (max_mem_specified) parameter does not guarantee success, but it certainly helps
reduce the memory load; compared to EXT4 this is an improvement.

My question concerns the relation between "-P" (disable prefetch) and "-m" (max_mem_specified).

There is a difference in xfs_repair memory consumption between the following commands:
1. xfs_repair -P -m 500
2. xfs_repair -m 500

1) Exceeds the max_mem_specified limit
2) Stays below the max_mem_specified limit

I expected disabling prefetch to reduce the memory load, but the result is the opposite.
Commands 1) and 2) were executed on the same system.

My speculation: does prefetch facilitate a more accurate calculation
of memory consumption?

Here follows output with -P and without -P from the same system.
I have extracted the part that actually differs.
The full logs are available at the bottom of this email.

# -P -m 500 #
Phase 3 - for each AG...
...
Active entries = 12336
Hash table size = 1549
Hits = 1
Misses = 224301
Hit ratio =  0.00
MRU 0 entries =  12335 ( 99%)
MRU 1 entries =      0 (  0%)
MRU 2 entries =      0 (  0%)
MRU 3 entries =      0 (  0%)
MRU 4 entries =      0 (  0%)
MRU 5 entries =      0 (  0%)
MRU 6 entries =      0 (  0%)
MRU 7 entries =      0 (  0%)

# -m 500 #
Phase 3 - for each AG...
...
Active entries = 12388
Hash table size = 1549
Hits = 220459
Misses = 235388
Hit ratio = 48.36
MRU 0 entries =      2 (  0%)
MRU 1 entries =      0 (  0%)
MRU 2 entries =   1362 ( 10%)
MRU 3 entries =     68 (  0%)
MRU 4 entries =     10 (  0%)
MRU 5 entries =   6097 ( 49%)
MRU 6 entries =   4752 ( 38%)
MRU 7 entries =     96 (  0%)


Tested on versions 6.1.1 and 6.5.0; same result.

BR
Per Forlin


-------------------------------------------------------------------------
Here follow the full logs for both memory consumption and xfs_repair output.


## Full log of xfs_repair with "-P" that exceeds the max limit and crashes the system

# xfs_repair -vvv -P -n -m 500 /dev/sda1 

Phase 1 - find and verify superblock...
bhash_option_used=0
max_mem_specified=500
verbose=3
[main:1166] perfn
        - max_mem = 512000, icount = 6445568, imem = 25178, dblock = 488378385, dmem = 238466
        - block cache size set to 12392 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 1068680 tail block 1068680
        - scan filesystem freespace and inode maps...
        - found root inode chunk
libxfs_bcache: 0x5597cf9220
Max supported entries = 12392
Max utilized entries = 582
Active entries = 582
Hash table size = 1549
Hits = 1
Misses = 582
Hit ratio =  0.17
MRU 0 entries =    581 ( 99%)
MRU 1 entries =      0 (  0%)
MRU 2 entries =      0 (  0%)
MRU 3 entries =      0 (  0%)
MRU 4 entries =      0 (  0%)
MRU 5 entries =      0 (  0%)
MRU 6 entries =      0 (  0%)
MRU 7 entries =      0 (  0%)
MRU 8 entries =      0 (  0%)
MRU 9 entries =      0 (  0%)
MRU 10 entries =      0 (  0%)
MRU 11 entries =      0 (  0%)
MRU 12 entries =      0 (  0%)
MRU 13 entries =      0 (  0%)
MRU 14 entries =      0 (  0%)
MRU 15 entries =      0 (  0%)
Dirty MRU 16 entries =      0 (  0%)
Hash buckets with   0 entries   1078 (  0%)
Hash buckets with   1 entries    375 ( 64%)
Hash buckets with   2 entries     84 ( 28%)
Hash buckets with   3 entries      9 (  4%)
Hash buckets with   4 entries      3 (  2%)
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
libxfs_bcache: 0x5597cf9220
Max supported entries = 12392
Max utilized entries = 12392
Active entries = 12336
Hash table size = 1549
Hits = 1
Misses = 224301
Hit ratio =  0.00
MRU 0 entries =  12335 ( 99%)
MRU 1 entries =      0 (  0%)
MRU 2 entries =      0 (  0%)
MRU 3 entries =      0 (  0%)
MRU 4 entries =      0 (  0%)
MRU 5 entries =      0 (  0%)
MRU 6 entries =      0 (  0%)
MRU 7 entries =      0 (  0%)
MRU 8 entries =      0 (  0%)
MRU 9 entries =      0 (  0%)
MRU 10 entries =      0 (  0%)
MRU 11 entries =      0 (  0%)
MRU 12 entries =      0 (  0%)
MRU 13 entries =      0 (  0%)
MRU 14 entries =      0 (  0%)
MRU 15 entries =      0 (  0%)
Dirty MRU 16 entries =      0 (  0%)
Hash buckets with   1 entries      2 (  0%)
Hash buckets with   2 entries     17 (  0%)
Hash buckets with   3 entries     42 (  1%)
Hash buckets with   4 entries     87 (  2%)
Hash buckets with   5 entries    153 (  6%)
Hash buckets with   6 entries    198 (  9%)
Hash buckets with   7 entries    225 ( 12%)
Hash buckets with   8 entries    208 ( 13%)
Hash buckets with   9 entries    184 ( 13%)
Hash buckets with  10 entries    162 ( 13%)
Hash buckets with  11 entries     99 (  8%)
Hash buckets with  12 entries     79 (  7%)
Hash buckets with  13 entries     38 (  4%)
Hash buckets with  14 entries     25 (  2%)
Hash buckets with  15 entries     15 (  1%)
Hash buckets with  16 entries      9 (  1%)
Hash buckets with  17 entries      2 (  0%)
Hash buckets with  18 entries      2 (  0%)
Hash buckets with  19 entries      2 (  0%)
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
!! OOM system crash !!

# Memory log:
# while  :; do grep Private_D /proc/$(pidof xfs_repair)/smaps_rollup ; grep MemAvail /proc/meminfo ; sleep 10; done
Private_Dirty:     11772 kB
MemAvailable:     625436 kB
Private_Dirty:    135020 kB
MemAvailable:     501736 kB
Private_Dirty:    239860 kB
MemAvailable:     396432 kB
Private_Dirty:    269312 kB
MemAvailable:     366948 kB
Private_Dirty:    290976 kB
MemAvailable:     344756 kB
Private_Dirty:    304520 kB
MemAvailable:     330392 kB
Private_Dirty:    331152 kB
MemAvailable:     304184 kB
Private_Dirty:    361924 kB
MemAvailable:     272400 kB
Private_Dirty:    382204 kB
MemAvailable:     252476 kB
Private_Dirty:    407184 kB
MemAvailable:     227008 kB
Private_Dirty:    422432 kB
MemAvailable:     211160 kB
Private_Dirty:    437428 kB
MemAvailable:     197144 kB
Private_Dirty:    460960 kB
MemAvailable:     175692 kB
Private_Dirty:    467128 kB
MemAvailable:     168156 kB
Private_Dirty:    483184 kB
MemAvailable:     153280 kB
Private_Dirty:    507128 kB
MemAvailable:     131140 kB
Private_Dirty:    540896 kB
MemAvailable:      97488 kB
Private_Dirty:    575480 kB
MemAvailable:      67268 kB
Private_Dirty:    604580 kB
MemAvailable:      36484 kB
Private_Dirty:    614316 kB
MemAvailable:      31668 kB
Private_Dirty:    645888 kB
MemAvailable:      24232 kB
Private_Dirty:    659140 kB
MemAvailable:      21444 kB
!! Runs out of memory at this point !!


## Full log of the xfs_repair run without "-P" that stays within the max limit and finishes successfully

root@ax-b8a44f27a3b4:~# xfs_repair -vvv -n -m 500 /dev/sda1 
Phase 1 - find and verify superblock...
bhash_option_used=0
max_mem_specified=500
verbose=3
[main:1166] perfn
        - max_mem = 512000, icount = 6445568, imem = 25178, dblock = 488378385, dmem = 238466
        - block cache size set to 12392 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 1068680 tail block 1068680
        - scan filesystem freespace and inode maps...
        - found root inode chunk
libxfs_bcache: 0x55aa821220
Max supported entries = 12392
Max utilized entries = 582
Active entries = 582
Hash table size = 1549
Hits = 1
Misses = 582
Hit ratio =  0.17
MRU 0 entries =    581 ( 99%)
MRU 1 entries =      0 (  0%)
MRU 2 entries =      0 (  0%)
MRU 3 entries =      0 (  0%)
MRU 4 entries =      0 (  0%)
MRU 5 entries =      0 (  0%)
MRU 6 entries =      0 (  0%)
MRU 7 entries =      0 (  0%)
MRU 8 entries =      0 (  0%)
MRU 9 entries =      0 (  0%)
MRU 10 entries =      0 (  0%)
MRU 11 entries =      0 (  0%)
MRU 12 entries =      0 (  0%)
MRU 13 entries =      0 (  0%)
MRU 14 entries =      0 (  0%)
MRU 15 entries =      0 (  0%)
Dirty MRU 16 entries =      0 (  0%)
Hash buckets with   0 entries   1078 (  0%)
Hash buckets with   1 entries    375 ( 64%)
Hash buckets with   2 entries     84 ( 28%)
Hash buckets with   3 entries      9 (  4%)
Hash buckets with   4 entries      3 (  2%)
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
libxfs_bcache: 0x55aa821220
Max supported entries = 12392
Max utilized entries = 12392
Active entries = 12388
Hash table size = 1549
Hits = 220459
Misses = 235388
Hit ratio = 48.36
MRU 0 entries =      2 (  0%)
MRU 1 entries =      0 (  0%)
MRU 2 entries =   1362 ( 10%)
MRU 3 entries =     68 (  0%)
MRU 4 entries =     10 (  0%)
MRU 5 entries =   6097 ( 49%)
MRU 6 entries =   4752 ( 38%)
MRU 7 entries =     96 (  0%)
MRU 8 entries =      0 (  0%)
MRU 9 entries =      0 (  0%)
MRU 10 entries =      0 (  0%)
MRU 11 entries =      0 (  0%)
MRU 12 entries =      0 (  0%)
MRU 13 entries =      0 (  0%)
MRU 14 entries =      0 (  0%)
MRU 15 entries =      0 (  0%)
Dirty MRU 16 entries =      0 (  0%)
Hash buckets with   1 entries      2 (  0%)
Hash buckets with   2 entries      8 (  0%)
Hash buckets with   3 entries     35 (  0%)
Hash buckets with   4 entries     88 (  2%)
Hash buckets with   5 entries    123 (  4%)
Hash buckets with   6 entries    180 (  8%)
Hash buckets with   7 entries    243 ( 13%)
Hash buckets with   8 entries    249 ( 16%)
Hash buckets with   9 entries    224 ( 16%)
Hash buckets with  10 entries    151 ( 12%)
Hash buckets with  11 entries    109 (  9%)
Hash buckets with  12 entries     50 (  4%)
Hash buckets with  13 entries     51 (  5%)
Hash buckets with  14 entries     17 (  1%)
Hash buckets with  15 entries      8 (  0%)
Hash buckets with  16 entries      9 (  1%)
Hash buckets with  17 entries      1 (  0%)
Hash buckets with  18 entries      1 (  0%)
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
libxfs_bcache: 0x55aa821220
Max supported entries = 12392
Max utilized entries = 12392
Active entries = 12369
Hash table size = 1549
Hits = 445862
Misses = 484224
Hit ratio = 47.94
MRU 0 entries =      5 (  0%)
MRU 1 entries =      0 (  0%)
MRU 2 entries =   1498 ( 12%)
MRU 3 entries =     73 (  0%)
MRU 4 entries =     17 (  0%)
MRU 5 entries =   6401 ( 51%)
MRU 6 entries =   4374 ( 35%)
MRU 7 entries =      0 (  0%)
MRU 8 entries =      0 (  0%)
MRU 9 entries =      0 (  0%)
MRU 10 entries =      0 (  0%)
MRU 11 entries =      0 (  0%)
MRU 12 entries =      0 (  0%)
MRU 13 entries =      0 (  0%)
MRU 14 entries =      0 (  0%)
MRU 15 entries =      0 (  0%)
Dirty MRU 16 entries =      0 (  0%)
Hash buckets with   1 entries      1 (  0%)
Hash buckets with   2 entries     13 (  0%)
Hash buckets with   3 entries     35 (  0%)
Hash buckets with   4 entries     93 (  3%)
Hash buckets with   5 entries    126 (  5%)
Hash buckets with   6 entries    184 (  8%)
Hash buckets with   7 entries    235 ( 13%)
Hash buckets with   8 entries    239 ( 15%)
Hash buckets with   9 entries    214 ( 15%)
Hash buckets with  10 entries    155 ( 12%)
Hash buckets with  11 entries    115 ( 10%)
Hash buckets with  12 entries     63 (  6%)
Hash buckets with  13 entries     30 (  3%)
Hash buckets with  14 entries     22 (  2%)
Hash buckets with  15 entries     12 (  1%)
Hash buckets with  16 entries      7 (  0%)
Hash buckets with  17 entries      3 (  0%)
Hash buckets with  18 entries      2 (  0%)
No modify flag set, skipping phase 5
libxfs_bcache: 0x55aa821220
Max supported entries = 12392
Max utilized entries = 12392
Active entries = 12369
Hash table size = 1549
Hits = 445862
Misses = 484224
Hit ratio = 47.94
MRU 0 entries =      5 (  0%)
MRU 1 entries =      0 (  0%)
MRU 2 entries =   1498 ( 12%)
MRU 3 entries =     73 (  0%)
MRU 4 entries =     17 (  0%)
MRU 5 entries =   6401 ( 51%)
MRU 6 entries =   4374 ( 35%)
MRU 7 entries =      0 (  0%)
MRU 8 entries =      0 (  0%)
MRU 9 entries =      0 (  0%)
MRU 10 entries =      0 (  0%)
MRU 11 entries =      0 (  0%)
MRU 12 entries =      0 (  0%)
MRU 13 entries =      0 (  0%)
MRU 14 entries =      0 (  0%)
MRU 15 entries =      0 (  0%)
Dirty MRU 16 entries =      0 (  0%)
Hash buckets with   1 entries      1 (  0%)
Hash buckets with   2 entries     13 (  0%)
Hash buckets with   3 entries     35 (  0%)
Hash buckets with   4 entries     93 (  3%)
Hash buckets with   5 entries    126 (  5%)
Hash buckets with   6 entries    184 (  8%)
Hash buckets with   7 entries    235 ( 13%)
Hash buckets with   8 entries    239 ( 15%)
Hash buckets with   9 entries    214 ( 15%)
Hash buckets with  10 entries    155 ( 12%)
Hash buckets with  11 entries    115 ( 10%)
Hash buckets with  12 entries     63 (  6%)
Hash buckets with  13 entries     30 (  3%)
Hash buckets with  14 entries     22 (  2%)
Hash buckets with  15 entries     12 (  1%)
Hash buckets with  16 entries      7 (  0%)
Hash buckets with  17 entries      3 (  0%)
Hash buckets with  18 entries      2 (  0%)
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
libxfs_bcache: 0x55aa821220
Max supported entries = 12392
Max utilized entries = 12392
Active entries = 12357
Hash table size = 1549
Hits = 3043575
Misses = 717152
Hit ratio = 80.93
MRU 0 entries =   1505 ( 12%)
MRU 1 entries =      0 (  0%)
MRU 2 entries =      3 (  0%)
MRU 3 entries =     58 (  0%)
MRU 4 entries =      9 (  0%)
MRU 5 entries =   5981 ( 48%)
MRU 6 entries =   4696 ( 38%)
MRU 7 entries =     96 (  0%)
MRU 8 entries =      0 (  0%)
MRU 9 entries =      0 (  0%)
MRU 10 entries =      8 (  0%)
MRU 11 entries =      0 (  0%)
MRU 12 entries =      0 (  0%)
MRU 13 entries =      0 (  0%)
MRU 14 entries =      0 (  0%)
MRU 15 entries =      0 (  0%)
Dirty MRU 16 entries =      0 (  0%)
Hash buckets with   1 entries      3 (  0%)
Hash buckets with   2 entries     10 (  0%)
Hash buckets with   3 entries     39 (  0%)
Hash buckets with   4 entries     79 (  2%)
Hash buckets with   5 entries    131 (  5%)
Hash buckets with   6 entries    182 (  8%)
Hash buckets with   7 entries    240 ( 13%)
Hash buckets with   8 entries    256 ( 16%)
Hash buckets with   9 entries    209 ( 15%)
Hash buckets with  10 entries    151 ( 12%)
Hash buckets with  11 entries    115 ( 10%)
Hash buckets with  12 entries     54 (  5%)
Hash buckets with  13 entries     38 (  3%)
Hash buckets with  14 entries     21 (  2%)
Hash buckets with  15 entries      8 (  0%)
Hash buckets with  16 entries      7 (  0%)
Hash buckets with  17 entries      6 (  0%)
Phase 7 - verify link counts...
libxfs_bcache: 0x55aa821220
Max supported entries = 12392
Max utilized entries = 12392
Active entries = 12357
Hash table size = 1549
Hits = 3043575
Misses = 717152
Hit ratio = 80.93
MRU 0 entries =   1505 ( 12%)
MRU 1 entries =      0 (  0%)
MRU 2 entries =      3 (  0%)
MRU 3 entries =     58 (  0%)
MRU 4 entries =      9 (  0%)
MRU 5 entries =   5981 ( 48%)
MRU 6 entries =   4696 ( 38%)
MRU 7 entries =     96 (  0%)
MRU 8 entries =      0 (  0%)
MRU 9 entries =      0 (  0%)
MRU 10 entries =      8 (  0%)
MRU 11 entries =      0 (  0%)
MRU 12 entries =      0 (  0%)
MRU 13 entries =      0 (  0%)
MRU 14 entries =      0 (  0%)
MRU 15 entries =      0 (  0%)
Dirty MRU 16 entries =      0 (  0%)
Hash buckets with   1 entries      3 (  0%)
Hash buckets with   2 entries     10 (  0%)
Hash buckets with   3 entries     39 (  0%)
Hash buckets with   4 entries     79 (  2%)
Hash buckets with   5 entries    131 (  5%)
Hash buckets with   6 entries    182 (  8%)
Hash buckets with   7 entries    240 ( 13%)
Hash buckets with   8 entries    256 ( 16%)
Hash buckets with   9 entries    209 ( 15%)
Hash buckets with  10 entries    151 ( 12%)
Hash buckets with  11 entries    115 ( 10%)
Hash buckets with  12 entries     54 (  5%)
Hash buckets with  13 entries     38 (  3%)
Hash buckets with  14 entries     21 (  2%)
Hash buckets with  15 entries      8 (  0%)
Hash buckets with  16 entries      7 (  0%)
Hash buckets with  17 entries      6 (  0%)
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Wed Nov  8 13:05:25 2023

Phase		Start		End		Duration
Phase 1:	11/08 12:59:25	11/08 12:59:25	
Phase 2:	11/08 12:59:25	11/08 12:59:33	8 seconds
Phase 3:	11/08 12:59:33	11/08 13:01:29	1 minute, 56 seconds
Phase 4:	11/08 13:01:29	11/08 13:03:22	1 minute, 53 seconds
Phase 5:	Skipped
Phase 6:	11/08 13:03:22	11/08 13:05:25	2 minutes, 3 seconds
Phase 7:	11/08 13:05:25	11/08 13:05:25	

Total run time: 6 minutes


# Memory log:
# while  :; do grep Private_D /proc/$(pidof xfs_repair)/smaps_rollup ; grep MemAvail /proc/meminfo ; sleep 10; done
Private_Dirty:     26172 kB
MemAvailable:     613712 kB
Private_Dirty:    235704 kB
MemAvailable:     403512 kB
Private_Dirty:    247580 kB
MemAvailable:     393164 kB
Private_Dirty:    258268 kB
MemAvailable:     381100 kB
Private_Dirty:    265832 kB
MemAvailable:     374548 kB
Private_Dirty:    272652 kB
MemAvailable:     366484 kB
Private_Dirty:    282496 kB
MemAvailable:     356484 kB
Private_Dirty:    286664 kB
MemAvailable:     354624 kB
Private_Dirty:    291684 kB
MemAvailable:     349820 kB
Private_Dirty:    308204 kB
MemAvailable:     332716 kB
Private_Dirty:    310520 kB
MemAvailable:     330180 kB
Private_Dirty:    312348 kB
MemAvailable:     327424 kB
Private_Dirty:    315280 kB
MemAvailable:     324828 kB
Private_Dirty:    332864 kB
MemAvailable:     307064 kB
Private_Dirty:    348504 kB
MemAvailable:     292304 kB
Private_Dirty:    362752 kB
MemAvailable:     276240 kB
Private_Dirty:    380060 kB
MemAvailable:     260568 kB
Private_Dirty:    396068 kB
MemAvailable:     244392 kB
Private_Dirty:    406540 kB
MemAvailable:     232548 kB
Private_Dirty:    417648 kB
MemAvailable:     221476 kB
Private_Dirty:    434708 kB
MemAvailable:     205192 kB
Private_Dirty:    443988 kB
MemAvailable:     194844 kB
Private_Dirty:    452880 kB
MemAvailable:     185504 kB
Private_Dirty:    462060 kB
MemAvailable:     178244 kB
Private_Dirty:    414464 kB
MemAvailable:     228676 kB
Private_Dirty:    420344 kB
MemAvailable:     223304 kB
Private_Dirty:    422584 kB
MemAvailable:     220700 kB
Private_Dirty:    423904 kB
MemAvailable:     218104 kB
Private_Dirty:    424172 kB
MemAvailable:     218120 kB
Private_Dirty:    424548 kB
MemAvailable:     216860 kB
Private_Dirty:    425080 kB
MemAvailable:     217236 kB
Private_Dirty:    425184 kB
MemAvailable:     217032 kB
Private_Dirty:    425464 kB
MemAvailable:     216512 kB
Private_Dirty:    427392 kB
MemAvailable:     214480 kB
Private_Dirty:    427768 kB
MemAvailable:     214732 kB
Private_Dirty:    428592 kB
MemAvailable:     215032 kB
Private_Dirty:    428580 kB
MemAvailable:     214204 kB
Finished successfully!
grep: /proc//smaps_rollup: No such file or directory
MemAvailable:     643000 kB

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: xfsprogs: repair: Higher memory consumption when disable prefetch
  2023-11-08 15:56 xfsprogs: repair: Higher memory consumption when disable prefetch Per Förlin
@ 2023-11-08 22:05 ` Dave Chinner
  2023-11-08 22:54   ` Darrick J. Wong
  0 siblings, 1 reply; 3+ messages in thread
From: Dave Chinner @ 2023-11-08 22:05 UTC (permalink / raw)
  To: Per Förlin; +Cc: linux-xfs@vger.kernel.org

On Wed, Nov 08, 2023 at 03:56:00PM +0000, Per Förlin wrote:
> Hi Linux XFS community,
> 
> Please bear with me, I'm new to XFS :)
> 
> I'm comparing how EXT4 and XFS behave on systems with a relatively
> small RAM-to-storage ratio. The current focus is on filesystem repair memory consumption.
> 
> I have been running some tests using the max_mem_specified option.
> The "-m" (max_mem_specified) parameter does not guarantee success, but it certainly helps
> reduce the memory load; compared to EXT4 this is an improvement.
> 
> My question concerns the relation between "-P" (disable prefetch) and "-m" (max_mem_specified).
> 
> There is a difference in xfs_repair memory consumption between the following commands:
> 1. xfs_repair -P -m 500
> 2. xfs_repair -m 500
>
> 1) Exceeds the max_mem_specified limit
> 2) Stays below the max_mem_specified limit

Purely co-incidental, IMO.

As the man page says:

	-m maxmem

	      Specifies the approximate maximum amount of memory, in
	      megabytes, to use for xfs_repair.  xfs_repair has its
	      own  internal  block cache  which  will  scale  out
	      up to the lesser of the process’s virtual address
	      limit or about 75% of the system’s physical RAM.  This
	      option overrides these limits.

	      NOTE: These memory limits are only approximate and may
	      use more than the specified limit.

IOWs, behaviour is expected - the max_mem figure is just a starting
point guideline, and it only affects the size of the IO cache that
repair holds.  We still need lots of memory to index free space,
used space, inodes, hold directory information, etc, so memory usage
on any filesystem with enough metadata in it to fill the internal
buffer cache will always go over this number....
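As a rough sketch of that split (the constants here are inferred from the
logged numbers, not lifted from the xfsprogs source):

```python
# Reconstruct the phase 1 sizing line from the logs below:
#   max_mem = 512000, icount = 6445568, imem = 25178, dblock = 488378385, dmem = 238466
# The shift constants are assumptions that happen to reproduce those numbers
# (roughly 4 bytes of inode tracking and 0.5 bytes of block tracking per
# object, expressed in KB); they are not the actual xfsprogs formula.
def fixed_overhead_kb(icount, dblock):
    imem = icount >> 8     # inode index estimate, KB
    dmem = dblock >> 11    # block usage index estimate, KB
    return imem, dmem

max_mem_kb = 500 * 1024                     # -m 500
imem, dmem = fixed_overhead_kb(6445568, 488378385)
cache_budget_kb = max_mem_kb - imem - dmem  # what's left for the block cache
print(imem, dmem, cache_budget_kb)          # 25178 238466 248356
```

i.e. only the ~248MB remainder is managed as the block cache (presumably
where the "12392 entries" figure comes from); everything else repair
allocates sits on top of the -m figure.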

> I expected disabling prefetch to reduce the memory load, but the result is the opposite.
> Commands 1) and 2) were executed on the same system.

> My speculation: does prefetch facilitate a more accurate calculation
> of memory consumption?

No, prefetching changes the way processing of the metadata occurs.
It also vastly changes the way IO is done and the buffer cache is
populated.

e.g. prefetching looks at metadata density and issues
large IOs if the density is high enough and then chops them up into
individual metadata buffers in memory at prefetch IO completion.
This avoids doing lots of small IOs, greatly improving IO throughput
and keeping the processing pipeline busy.
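A simplified sketch of that density heuristic (illustration only, not the
libxfs prefetch code; the region size and threshold are made-up values):

```python
# Density-based prefetch planning: given sorted block numbers that a phase
# will need, issue one large read per dense region and chop it into
# per-block buffers; sparse regions fall back to single-block IO.
REGION = 64       # blocks per candidate large IO (arbitrary)
DENSITY = 0.5     # issue a large IO if >= 50% of the region is wanted

def plan_prefetch(wanted_blocks):
    ios = []                                  # (start, length) of each IO
    by_region = {}
    for b in wanted_blocks:
        by_region.setdefault(b // REGION, []).append(b)
    for region, blocks in sorted(by_region.items()):
        if len(blocks) >= REGION * DENSITY:
            ios.append((region * REGION, REGION))   # one big sequential read
        else:
            ios.extend((b, 1) for b in blocks)      # individual small reads
    return ios

dense = plan_prefetch(range(0, 48))      # 48 of 64 blocks wanted
sparse = plan_prefetch(range(0, 64, 8))  # every 8th block wanted
print(len(dense), len(sparse))           # 1 large IO vs 8 small IOs
```

The dense case turns 48 small reads into one sequential read; that is the
throughput win on slow devices, paid for with the extra memory needed to
hold the large IO until it is chopped up.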

This comes at the cost of increased CPU overhead and non-buffer
cache memory footprint, but for slow IO devices this can improve IO
throughput (and hence repair times) by a factor of up to 100x. Have
a look at the difference in IO patterns when you enable/disable
prefetching...

When prefetching is turned off, the processing issues individual IO
itself and doesn't do density-based scan optimisation. In some cases
this is faster (e.g. high speed SSDs) because it is more CPU
efficient, but it results in different IO patterns and buffer access
patterns.

The end result is that buffers have a very different lifetime when
prefetching is turned on compared to when it is off, and so there's
a very different buffer cache memory footprint between the two
options.

> Here follows output with -P and without -P from the same system.
> I have extracted the part that actually differs.
> The full logs are available at the bottom of this email.
> 
> # -P -m 500 #
> Phase 3 - for each AG...
> ...
> Active entries = 12336
> Hash table size = 1549
> Hits = 1
> Misses = 224301
> Hit ratio =  0.00
> MRU 0 entries =  12335 ( 99%)
> MRU 1 entries =      0 (  0%)
> MRU 2 entries =      0 (  0%)
> MRU 3 entries =      0 (  0%)
> MRU 4 entries =      0 (  0%)
> MRU 5 entries =      0 (  0%)
> MRU 6 entries =      0 (  0%)
> MRU 7 entries =      0 (  0%)

Without prefetching, we have a single use for all buffers and the
metadata accessed is 20x larger than the size of the buffer cache
(220k buffers vs 12k cache entries).  This just shows the
non-prefetch case streaming buffers through the cache in
processing access order.

i.e. The MRU list indicates that nothing is being kept for long
periods or being accessed out of order, as all buffers are on list 0
(most recently used), i.e. nothing is aging out, which means
buffers are being used and reclaimed in the same order they are
being instantiated.  If anything was being accessed out of order, we
would see buffers move down the aging lists....
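A toy model of those aging lists (not the libxfs_cache implementation, just
enough to show the effect): buffers enter list 0, an aging pass demotes
whole lists, and a hit promotes a buffer back to list 0.

```python
from collections import OrderedDict

class ToyMRUCache:
    """Toy multi-level MRU cache, loosely mirroring the libxfs_bcache
    statistics in this thread; illustration only."""
    def __init__(self, capacity, nlists=8):
        self.capacity = capacity
        self.lists = [OrderedDict() for _ in range(nlists)]
        self.hits = 0
        self.misses = 0

    def size(self):
        return sum(len(l) for l in self.lists)

    def age(self):
        # demote every list one level; the deepest list absorbs overflow
        for i in range(len(self.lists) - 1, 0, -1):
            self.lists[i].update(self.lists[i - 1])
            self.lists[i - 1].clear()

    def access(self, block):
        for lst in self.lists:
            if block in lst:
                del lst[block]
                self.hits += 1
                self.lists[0][block] = True      # hit: promote to MRU 0
                return
        self.misses += 1
        if self.size() >= self.capacity:
            for lst in reversed(self.lists):     # reclaim the oldest entry
                if lst:
                    lst.popitem(last=False)
                    break
        self.lists[0][block] = True

# Streaming single-use access (the -P case): nothing is re-referenced,
# buffers are reclaimed in insertion order before any aging pass matters,
# so everything sits on MRU 0 and the hit count stays at zero.
c = ToyMRUCache(capacity=100)
for b in range(1000):
    c.access(b)
print(c.hits, len(c.lists[0]))      # 0 100

# Re-reference after several aging passes (the prefetch case): entries
# spread across the aging lists, like the MRU 5/6 population above.
c2 = ToyMRUCache(capacity=200)
for b in range(100):
    c2.access(b)
for _ in range(5):
    c2.age()
for b in range(50):                 # processing finally reaches these
    c2.access(b)
print(c2.hits, len(c2.lists[0]), len(c2.lists[5]))   # 50 50 50
```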

> # -m 500 #
> Phase 3 - for each AG...
> ...
> Active entries = 12388
> Hash table size = 1549
> Hits = 220459
> Misses = 235388
> Hit ratio = 48.36

And there's the difference - two accesses per buffer for the
prefetch case. One for the IO dispatch to bring it into memory (the
miss) and one for processing (the hit).
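The counters quoted above bear that out; the shortfall below an exact 50%
is the buffers that had to be read twice:

```python
# Phase 3 counters from the non-P run quoted above.
hits, misses = 220459, 235388
ratio = 100.0 * hits / (hits + misses)
print(round(ratio, 2))          # 48.36, matching "Hit ratio = 48.36"

# One prefetch miss plus one processing hit per buffer would give exactly
# 50%; the excess misses are buffers fetched from disk a second time.
refetched = misses - hits
print(refetched)                # 14929
```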

> MRU 0 entries =      2 (  0%)
> MRU 1 entries =      0 (  0%)
> MRU 2 entries =   1362 ( 10%)
> MRU 3 entries =     68 (  0%)
> MRU 4 entries =     10 (  0%)
> MRU 5 entries =   6097 ( 49%)
> MRU 6 entries =   4752 ( 38%)
> MRU 7 entries =     96 (  0%)

And the MRU list shows how the buffer accesses are not uniform - we
are seeing buffers of all different ages in the cache. This shows
that buffers are being aged 5-6 times before they are getting used,
which means the cache size is almost too small for prefetch to work
effectively....

Actually, the cache is too small - cache misses are significantly
larger than cache hits, meaning some buffers are being fetched from
disk twice because the prefetched buffers are aging out before the
processing thread gets to them. Give xfs_repair ~5GB of RAM, and it
should only need to do a single IO pass in phase 3 and then phase 4
and 6 will hit the buffers in the cache and hence not need to do any
IO at all...
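A back-of-envelope check on that figure, using the logged numbers (my
arithmetic, treat it as a rough upper-bound estimate):

```python
# How big would the cache need to be to hold the whole phase 3 working set?
# Scale the cache budget that -m 500 produced by the ratio of buffers
# touched to cache slots. Rough estimate only; misses slightly overcount
# distinct buffers because of re-fetches.
cache_slots = 12392                        # "block cache size set to 12392 entries"
cache_budget_kb = 512000 - 25178 - 238466  # max_mem minus imem/dmem from the log
buffers_touched = 235388                   # phase 3 misses
needed_gb = cache_budget_kb * (buffers_touched / cache_slots) / (1024 * 1024)
print(round(needed_gb, 1))                 # ~4.5, in line with the ~5GB suggestion
```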

So to me, this is prefetch working as it should - it's bringing
buffers into cache in the optimal IO pattern rather than the
application level access pattern. The difference in memory footprint
compared to no prefetching is largely co-incidental and really not
something we are concerned about in any way...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: xfsprogs: repair: Higher memory consumption when disable prefetch
  2023-11-08 22:05 ` Dave Chinner
@ 2023-11-08 22:54   ` Darrick J. Wong
  0 siblings, 0 replies; 3+ messages in thread
From: Darrick J. Wong @ 2023-11-08 22:54 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Per Förlin, linux-xfs@vger.kernel.org

On Thu, Nov 09, 2023 at 09:05:52AM +1100, Dave Chinner wrote:
> On Wed, Nov 08, 2023 at 03:56:00PM +0000, Per Förlin wrote:
> > Hi Linux XFS community,
> > 
> > Please bear with me, I'm new to XFS :)
> > 
> > I'm comparing how EXT4 and XFS behave on systems with a relatively
> > small RAM-to-storage ratio. The current focus is on filesystem repair memory consumption.
> > 
> > I have been running some tests using the max_mem_specified option.
> > The "-m" (max_mem_specified) parameter does not guarantee success, but it certainly helps
> > reduce the memory load; compared to EXT4 this is an improvement.
> > 
> > My question concerns the relation between "-P" (disable prefetch) and "-m" (max_mem_specified).
> > 
> > There is a difference in xfs_repair memory consumption between the following commands:
> > 1. xfs_repair -P -m 500
> > 2. xfs_repair -m 500
> >
> > 1) Exceeds the max_mem_specified limit
> > 2) Stays below the max_mem_specified limit
> 
> Purely co-incidental, IMO.
> 
> As the man page says:
> 
> 	-m maxmem
> 
> 	      Specifies the approximate maximum amount of memory, in
> 	      megabytes, to use for xfs_repair.  xfs_repair has its
> 	      own  internal  block cache  which  will  scale  out
> 	      up to the lesser of the process’s virtual address
> 	      limit or about 75% of the system’s physical RAM.  This
> 	      option overrides these limits.
> 
> 	      NOTE: These memory limits are only approximate and may
> 	      use more than the specified limit.
> 
> IOWs, behaviour is expected - the max_mem figure is just a starting
> point guideline, and it only affects the size of the IO cache that
> repair holds.  We still need lots of memory to index free space,
> used space, inodes, hold directory information, etc, so memory usage
> on any filesystem with enough metadata in it to fill the internal
> buffer cache will always go over this number....
> 
> > I expected disabling prefetch to reduce the memory load, but the result is the opposite.
> > Commands 1) and 2) were executed on the same system.
> 
> > My speculation: does prefetch facilitate a more accurate calculation
> > of memory consumption?
> 
> No, prefetching changes the way processing of the metadata occurs.
> It also vastly changes the way IO is done and the buffer cache is
> populated.
> 
> e.g. prefetching looks at metadata density and issues
> large IOs if the density is high enough and then chops them up into
> individual metadata buffers in memory at prefetch IO completion.
> This avoids doing lots of small IOs, greatly improving IO throughput
> and keeping the processing pipeline busy.
> 
> This comes at the cost of increased CPU overhead and non-buffer
> cache memory footprint, but for slow IO devices this can improve IO
> throughput (and hence repair times) by a factor of up to 100x. Have
> a look at the difference in IO patterns when you enable/disable
> prefetching...
> 
> When prefetching is turned off, the processing issues individual IO
> itself and doesn't do density-based scan optimisation. In some cases
> this is faster (e.g. high speed SSDs) because it is more CPU
> efficient, but it results in different IO patterns and buffer access
> patterns.
> 
> The end result is that buffers have a very different lifetime when
> prefetching is turned on compared to when it is off, and so there's
> a very different buffer cache memory footprint between the two
> options.
> 
> > Here follows output with -P and without -P from the same system.
> > I have extracted the part that actually differs.
> > The full logs are available at the bottom of this email.
> > 
> > # -P -m 500 #
> > Phase 3 - for each AG...
> > ...
> > Active entries = 12336
> > Hash table size = 1549
> > Hits = 1
> > Misses = 224301
> > Hit ratio =  0.00
> > MRU 0 entries =  12335 ( 99%)
> > MRU 1 entries =      0 (  0%)
> > MRU 2 entries =      0 (  0%)
> > MRU 3 entries =      0 (  0%)
> > MRU 4 entries =      0 (  0%)
> > MRU 5 entries =      0 (  0%)
> > MRU 6 entries =      0 (  0%)
> > MRU 7 entries =      0 (  0%)
> 
> Without prefetching, we have a single use for all buffers and the
> metadata accessed is 20x larger than the size of the buffer cache
> (220k buffers vs 12k cache entries).  This just shows the
> non-prefetch case streaming buffers through the cache in
> processing access order.
> 
> i.e. The MRU list indicates that nothing is being kept for long
> periods or being accessed out of order, as all buffers are on list 0
> (most recently used), i.e. nothing is aging out, which means
> buffers are being used and reclaimed in the same order they are
> being instantiated.  If anything was being accessed out of order, we
> would see buffers move down the aging lists....
> 
> > # -m 500 #
> > Phase 3 - for each AG...
> > ...
> > Active entries = 12388
> > Hash table size = 1549
> > Hits = 220459
> > Misses = 235388
> > Hit ratio = 48.36
> 
> And there's the difference - two accesses per buffer for the
> prefetch case. One for the IO dispatch to bring it into memory (the
> miss) and one for processing (the hit).
> 
> > MRU 0 entries =      2 (  0%)
> > MRU 1 entries =      0 (  0%)
> > MRU 2 entries =   1362 ( 10%)
> > MRU 3 entries =     68 (  0%)
> > MRU 4 entries =     10 (  0%)
> > MRU 5 entries =   6097 ( 49%)
> > MRU 6 entries =   4752 ( 38%)
> > MRU 7 entries =     96 (  0%)
> 
> And the MRU list shows how the buffer accesses are not uniform - we
> are seeing buffers of all different ages in the cache. This shows
> that buffers are being aged 5-6 times before they are getting used,
> which means the cache size is almost too small for prefetch to work
> effectively....
> 
> Actually, the cache is too small - cache misses are significantly
> larger than cache hits, meaning some buffers are being fetched from
> disk twice because the prefetched buffers are aging out before the
> processing thread gets to them. Give xfs_repair ~5GB of RAM, and it
> should only need to do a single IO pass in phase 3 and then phase 4
> and 6 will hit the buffers in the cache and hence not need to do any
> IO at all...
> 
> So to me, this is prefetch working as it should - it's bringing
> buffers into cache in the optimal IO pattern rather than the
> application level access pattern. The difference in memory footprint
> compared to no prefetching is largely co-incidental and really not
> something we are concerned about in any way...

/me notes that if you turn on the fancy new features (rmap, reflink, or
parent pointers) then repair will consume even more memory.  None of
that can be precomputed before scanning the fs, so the -m "limits" are
even less precise.

(Also, large metadata-heavy filesystems aren't well supported on systems
with limited DRAM.)

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2023-11-08 22:54 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-11-08 15:56 xfsprogs: repair: Higher memory consumption when disable prefetch Per Förlin
2023-11-08 22:05 ` Dave Chinner
2023-11-08 22:54   ` Darrick J. Wong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox