Re: [PATCH 1/2] mkfs: enable new features by default

From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: aalbersh@kernel.org, linux-xfs@vger.kernel.org
Subject: Re: [PATCH 1/2] mkfs: enable new features by default
Date: Wed, 10 Dec 2025 15:49:28 -0800
Message-ID: <20251210234928.GE7725@frogsfrogsfrogs>
In-Reply-To: <aTih1FDXt8fMrIb4@dread.disaster.area>

On Wed, Dec 10, 2025 at 09:25:24AM +1100, Dave Chinner wrote:
> On Tue, Dec 09, 2025 at 08:16:08AM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Since the LTS is coming up, enable parent pointers and exchange-range by
> > default for all users.  Also fix up an out of date comment.
> > 
> > I created a really stupid benchmarking script that does:
> > 
> > #!/bin/bash
> > 
> > # pptr overhead benchmark
> > 
> > umount /opt /mnt
> > rmmod xfs
> > for i in 1 0; do
> > 	umount /opt
> > 	mkfs.xfs -f /dev/sdb -n parent=$i | grep -i parent=
> > 	mount /dev/sdb /opt
> > 	mkdir -p /opt/foo
> > 	for ((i=0;i<5;i++)); do
> > 		time fsstress -n 100000 -p 4 -z -f creat=1 -d /opt/foo -s 1
> > 	done
> > done
> 
> Hmmm. fsstress is an interesting choice here...

<flush all the old benchmarks and conclusions>

I have an old 40-core Xeon E5-2660V3 with a pair of 1.5T Intel nvme ssds
and 128G of RAM running 6.18.0.  For this sample, I tried to keep the
memory usage well below the amount of DRAM so that I could measure the
pure overhead of writing parent pointers out to disk and not anything
else.  I also omitted ls'ing and chmod'ing the directory tree because
neither of those operations touches parent pointers.  I also left the
logbsize at the default (32k) because that's what most users get.

Here I'm using the following benchmark program, compiled from various
suggestions from dchinner over the years:

#!/bin/bash -x

iter=8
feature="-n parent"
filesz=0
subdirs=10000
files_per_iter=100000
writesz=16384

mkdirme() {
        set +x
        local i

        for ((i=0;i<agcount;i++)); do
                mkdir -p /nvme/$i
                dirs+=(-d /nvme/$i)
        done
        set -x
}

bulkme() {
        set +x
        local i

        for ((i=0;i<agcount;i++)); do
                xfs_io -c "bulkstat -a $i -q" /nvme &
        done
        wait
        set -x
}

rmdirme() {
        set +x
        local dir
        for dir in "${dirs[@]}"; do
                rm -r -f "${dir}" &
        done
        wait
        set -x
}

benchme() {
        agcount="$(xfs_info /nvme/ | grep agcount= | sed -e 's/^.*agcount=//g' -e 's/,.*$//g')"
        dirs=()
        mkdirme

        #time ~djwong/cdev/work/fstests/build-x86_64/ltp/fsstress -n 400000 -p 40 -z -f creat=1,mkdir=1,rmdir=1,unlink=1 -d /nvme/ -s 1
        time fs_mark -w "${writesz}" -D "${subdirs}" -S 0 -n "${files_per_iter}" -s "${filesz}" -L "${iter}" "${dirs[@]}"

        time bulkme
        time rmdirme
}

for p in 0 1; do
        umount /dev/nvme1n1 /nvme /mnt
        #mkfs.xfs -f -l logdev=/dev/nvme0n1,size=1g /dev/nvme1n1 -n parent=$p || break
        mkfs.xfs -f -l logdev=/dev/nvme0n1,size=1g /dev/nvme1n1 $feature=$p || break
        mount /dev/nvme1n1 /nvme/ -o logdev=/dev/nvme0n1 || break
        benchme
        umount /dev/nvme1n1 /nvme /mnt
done

I get this mkfs output:
# mkfs.xfs -f -l logdev=/dev/nvme0n1,size=1g /dev/nvme1n1
meta-data=/dev/nvme1n1           isize=512    agcount=40, agsize=9767586 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
         =                       exchange=0   metadir=0
data     =                       bsize=4096   blocks=390703440, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=0
log      =/dev/nvme0n1           bsize=4096   blocks=262144, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
         =                       rgcount=0    rgsize=0 extents
         =                       zoned=0      start=0 reserved=0
# grep nvme1n1 /proc/mounts
/dev/nvme1n1 /nvme xfs rw,relatime,inode64,logbufs=8,logbsize=32k,logdev=/dev/nvme0n1,noquota 0 0

and this output from fs_mark with parent=0:

#  fs_mark  -D  10000  -S  0  -n  100000  -s  0  -L  8  -d  /nvme/0  -d  /nvme/1  -d  /nvme/2  -d  /nvme/3  -d  /nvme/4  -d  /nvme/5  -d  /nvme/6  -d  /nvme/7  -d  /nvme/8  -d  /nvme/9  -d  /nvme/10  -d  /nvme/11  -d  /nvme/12  -d  /nvme/13  -d  /nvme/14  -d  /nvme/15  -d  /nvme/16  -d  /nvme/17  -d  /nvme/18  -d  /nvme/19  -d  /nvme/20  -d  /nvme/21  -d  /nvme/22  -d  /nvme/23  -d  /nvme/24  -d  /nvme/25  -d  /nvme/26  -d  /nvme/27  -d  /nvme/28  -d  /nvme/29  -d  /nvme/30  -d  /nvme/31  -d  /nvme/32  -d  /nvme/33  -d  /nvme/34  -d  /nvme/35  -d  /nvme/36  -d  /nvme/37  -d  /nvme/38  -d  /nvme/39 
#       Version 3.3, 40 thread(s) starting at Wed Dec 10 14:22:07 2025
#       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#       Directories:  Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
#       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#       Files info: size 0 bytes, written with an IO size of 16384 bytes per write
#       App overhead is time in microseconds spent in the test not doing file writing related system calls.

FSUse%        Count         Size    Files/sec     App Overhead
     2      4000000            0     566680.9         31398816
     2      8000000            0     665535.6         30037368
     2     12000000            0     537227.6         31726557
     2     16000000            0     538133.9         32411165
     2     20000000            0     619369.6         30790676
     2     24000000            0     600018.2         31583349
     2     28000000            0     607209.8         31193980
     3     32000000            0     533240.7         32277102

real    0m57.573s
user    3m53.578s
sys     19m44.440s
+ bulkme
+ set +x

real    0m1.122s
user    0m0.955s
sys     0m39.306s
+ rmdirme
+ set +x

real    0m59.649s
user    0m41.196s
sys     13m9.566s

I limited this to 8 iterations so I could post some preliminary results
after a few minutes.  Now let's try again with parent=1:

+ fs_mark -D 10000 -S 0 -n 100000 -s 0 -L 8 -d /nvme/0 -d /nvme/1 -d /nvme/2 -d /nvme/3 -d /nvme/4 -d /nvme/5 -d /nvme/6 -d /nvme/7 -d /nvme/8 -d /nvme/9 -d /nvme/10 -d /nvme/11 -d /nvme/12 -d /nvme/13 -d /nvme/14 -d /nvme/15 -d /nvme/16 -d /nvme/17 -d /nvme/18 -d /nvme/19 -d /nvme/20 -d /nvme/21 -d /nvme/22 -d /nvme/23 -d /nvme/24 -d /nvme/25 -d /nvme/26 -d /nvme/27 -d /nvme/28 -d /nvme/29 -d /nvme/30 -d /nvme/31 -d /nvme/32 -d /nvme/33 -d /nvme/34 -d /nvme/35 -d /nvme/36 -d /nvme/37 -d /nvme/38 -d /nvme/39

#  fs_mark  -D  10000  -S  0  -n  100000  -s  0  -L  8  -d  /nvme/0  -d  /nvme/1  -d  /nvme/2  -d  /nvme/3  -d  /nvme/4  -d  /nvme/5  -d  /nvme/6  -d  /nvme/7  -d  /nvme/8  -d  /nvme/9  -d  /nvme/10  -d  /nvme/11  -d  /nvme/12  -d  /nvme/13  -d  /nvme/14  -d  /nvme/15  -d  /nvme/16  -d  /nvme/17  -d  /nvme/18  -d  /nvme/19  -d  /nvme/20  -d  /nvme/21  -d  /nvme/22  -d  /nvme/23  -d  /nvme/24  -d  /nvme/25  -d  /nvme/26  -d  /nvme/27  -d  /nvme/28  -d  /nvme/29  -d  /nvme/30  -d  /nvme/31  -d  /nvme/32  -d  /nvme/33  -d  /nvme/34  -d  /nvme/35  -d  /nvme/36  -d  /nvme/37  -d  /nvme/38  -d  /nvme/39 
#       Version 3.3, 40 thread(s) starting at Wed Dec 10 14:24:44 2025
#       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#       Directories:  Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
#       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#       Files info: size 0 bytes, written with an IO size of 16384 bytes per write
#       App overhead is time in microseconds spent in the test not doing file writing related system calls.

FSUse%        Count         Size    Files/sec     App Overhead
     2      4000000            0     543929.1         31344175
     2      8000000            0     523736.2         31180565
     2     12000000            0     522184.1         31700380
     2     16000000            0     513468.0         32112498
     2     20000000            0     543993.1         31910496
     2     24000000            0     562760.1         32061910
     2     28000000            0     524039.8         31825520
     3     32000000            0     526028.8         31889193

real    1m2.934s
user    3m53.508s
sys     25m14.810s
+ bulkme
+ set +x

real    0m1.158s
user    0m0.882s
sys     0m39.847s
+ rmdirme
+ set +x

real    1m12.505s
user    0m47.489s
sys     20m33.844s


fs_mark itself shows a decrease in file creation/sec of about 9%, an
increase in wall clock time of about 9%, and an increase in kernel time
of about 28%.  That's to be expected, since parent pointer updates cause
directory entry creation and deletion to hold more ILOCKs and for
longer.
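
As a rough check of the arithmetic above, here is a minimal sketch that
turns the time(1) figures into percentages.  The helper names to_secs
and pct_change are hypothetical and not part of the benchmark script;
the inputs are the fs_mark wall clock and kernel times quoted above.

```shell
#!/bin/bash

# Convert bash `time` output such as "19m44.440s" into seconds.
# (to_secs and pct_change are hypothetical helper names.)
to_secs() {
        local t="${1%s}"        # strip the trailing "s"
        local min="${t%%m*}"    # text before the "m"
        local sec="${t#*m}"     # text after the "m"
        awk -v m="$min" -v s="$sec" 'BEGIN { printf "%.3f\n", m * 60 + s }'
}

# Percentage change from $1 (parent=0) to $2 (parent=1), both in seconds.
pct_change() {
        awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (b - a) / a * 100 }'
}

# fs_mark wall clock time, then kernel time, parent=0 vs. parent=1
pct_change "$(to_secs 0m57.573s)" "$(to_secs 1m2.934s)"
pct_change "$(to_secs 19m44.440s)" "$(to_secs 25m14.810s)"
```

The same two helpers apply unchanged to the bulkme and rmdirme timings.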

Parallel bulkstat (aka bulkme) shows an increase in wall time of 3% and
system time of 1%, which is not surprising since that's just walking the
inode btree and cores, no parent pointers involved.

Similarly, deleting all the files created by fs_mark shows an increase
in wall time of about 21% and an increase in system time of about 56%.
I concede that parent pointers have a fair amount of overhead for the
worst case of creating a large directory tree or deleting it.

I reran this with logbsize=256k and while I saw a slight increase in
performance across the board, the overhead of pptrs is about the same
percentagewise.
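
The exact command line for the logbsize=256k rerun is not shown in this
message.  A sketch of what the mount step might look like, assuming the
same devices and mount point as the script above and only adding the
logbsize mount option, is below.  It prints the command rather than
running it; mount_cmd is a hypothetical helper name.

```shell
#!/bin/bash

# Dry run: print the mount command instead of executing it.
# mount_cmd is a hypothetical helper; the devices and mount point are the
# ones used by the script earlier in this message, and reusing them for
# the rerun is an assumption.
mount_cmd() {
        local logbsize="$1"
        echo "mount /dev/nvme1n1 /nvme/ -o logdev=/dev/nvme0n1,logbsize=${logbsize}"
}

mount_cmd 256k
```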

If I then re-run the benchmark with a file size of 1M and tell it to
create fewer files, then I get the following for parent=0:

#  fs_mark  -D  1000  -S  0  -n  200  -s  1048576  -L  8  -d  /nvme/0  -d  /nvme/1  -d  /nvme/2  -d  /nvme/3  -d  /nvme/4  -d  /nvme/5  -d  /nvme/6  -d  /nvme/7  -d  /nvme/8  -d  /nvme/9  -d  /nvme/10  -d  /nvme/11  -d  /nvme/12  -d  /nvme/13  -d  /nvme/14  -d  /nvme/15  -d  /nvme/16  -d  /nvme/17  -d  /nvme/18  -d  /nvme/19  -d  /nvme/20  -d  /nvme/21  -d  /nvme/22  -d  /nvme/23  -d  /nvme/24  -d  /nvme/25  -d  /nvme/26  -d  /nvme/27  -d  /nvme/28  -d  /nvme/29  -d  /nvme/30  -d  /nvme/31  -d  /nvme/32  -d  /nvme/33  -d  /nvme/34  -d  /nvme/35  -d  /nvme/36  -d  /nvme/37  -d  /nvme/38  -d  /nvme/39 
#       Version 3.3, 40 thread(s) starting at Wed Dec 10 15:03:11 2025
#       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#       Directories:  Time based hash between directories across 1000 subdirectories with 180 seconds per subdirectory.
#       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#       Files info: size 1048576 bytes, written with an IO size of 16384 bytes per write
#       App overhead is time in microseconds spent in the test not doing file writing related system calls.

FSUse%        Count         Size    Files/sec     App Overhead
     2         8000      1048576       1493.4           198379
     2        16000      1048576       1327.0           255655
     3        24000      1048576       1355.8           255105
     4        32000      1048576       1352.3           253094
     4        40000      1048576       1836.9           262258
     5        48000      1048576       1337.6           246991
     5        56000      1048576       1328.4           240303
     6        64000      1048576       1165.9           237211

real    0m50.384s
user    0m7.640s
sys     1m43.187s
+ bulkme
+ set +x

real    0m0.023s
user    0m0.061s
sys     0m0.167s
+ rmdirme
+ set +x

real    0m0.675s
user    0m0.107s
sys     0m15.644s

and for parent=1:

#  fs_mark  -D  1000  -S  0  -n  200  -s  1048576  -L  8  -d  /nvme/0  -d  /nvme/1  -d  /nvme/2  -d  /nvme/3  -d  /nvme/4  -d  /nvme/5  -d  /nvme/6  -d  /nvme/7  -d  /nvme/8  -d  /nvme/9  -d  /nvme/10  -d  /nvme/11  -d  /nvme/12  -d  /nvme/13  -d  /nvme/14  -d  /nvme/15  -d  /nvme/16  -d  /nvme/17  -d  /nvme/18  -d  /nvme/19  -d  /nvme/20  -d  /nvme/21  -d  /nvme/22  -d  /nvme/23  -d  /nvme/24  -d  /nvme/25  -d  /nvme/26  -d  /nvme/27  -d  /nvme/28  -d  /nvme/29  -d  /nvme/30  -d  /nvme/31  -d  /nvme/32  -d  /nvme/33  -d  /nvme/34  -d  /nvme/35  -d  /nvme/36  -d  /nvme/37  -d  /nvme/38  -d  /nvme/39 
#       Version 3.3, 40 thread(s) starting at Wed Dec 10 15:04:41 2025
#       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#       Directories:  Time based hash between directories across 1000 subdirectories with 180 seconds per subdirectory.
#       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#       Files info: size 1048576 bytes, written with an IO size of 16384 bytes per write
#       App overhead is time in microseconds spent in the test not doing file writing related system calls.

FSUse%        Count         Size    Files/sec     App Overhead
     2         8000      1048576       1963.9           254007
     2        16000      1048576       1716.4           227074
     3        24000      1048576       1052.5           264987
     4        32000      1048576       1793.6           242288
     4        40000      1048576       1364.2           249738
     5        48000      1048576       1081.2           250394
     5        56000      1048576       1342.0           260667
     6        64000      1048576       1356.9           242324

real    0m49.256s
user    0m7.621s
sys     1m44.847s
+ bulkme
+ set +x

real    0m0.021s
user    0m0.060s
sys     0m0.176s
+ rmdirme
+ set +x

real    0m0.537s
user    0m0.108s
sys     0m15.453s

Here we see that the fs_mark creates/sec goes up by 4%, wall time
decreases by 2%, and the kernel time increases by 2% or so.  The rmdir
wall time decreases by about 20% (though both runs finish in well under
a second) and the kernel time by ~1%, which is quite small.  So for a
more common case of populating a directory tree full of big files with
data in them, the overhead isn't all that noticeable.
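
The creates/sec comparisons here are between the averages of the
Files/sec column across the eight iterations.  A minimal sketch of that
averaging step, assuming the fs_mark output has been saved or piped in
as text; mean_files_per_sec is a hypothetical helper name, and the
sample input is the first two parent=0 rows of the 1M-file run above.

```shell
#!/bin/bash

# Average the Files/sec column (4th field) of fs_mark result rows read
# from stdin.  Result rows are taken to be the lines with exactly five
# fields whose first field is an integer; headers and comments are skipped.
# (mean_files_per_sec is a hypothetical helper name.)
mean_files_per_sec() {
        awk 'NF == 5 && $1 ~ /^[0-9]+$/ && $4 ~ /^[0-9.]+$/ { sum += $4; n++ }
             END { if (n) printf "%.1f\n", sum / n }'
}

mean_files_per_sec <<'EOF'
FSUse%        Count         Size    Files/sec     App Overhead
     2         8000      1048576       1493.4           198379
     2        16000      1048576       1327.0           255655
EOF
```

Feeding it the full table for each run gives the two means to compare.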

I then decided to simulate my maildir spool, which has 670,000 files
consuming 12GB for an average file size of 17936 bytes.  I reduced the
file size to 16K, increased the number of files per iteration, and set
the write buffer size to something not aligned to a block, and got this
for parent=0:

#  fs_mark  -w  778  -D  1000  -S  0  -n  6000  -s  16384  -L  8  -d  /nvme/0  -d  /nvme/1  -d  /nvme/2  -d  /nvme/3  -d  /nvme/4  -d  /nvme/5  -d  /nvme/6  -d  /nvme/7  -d  /nvme/8  -d  /nvme/9  -d  /nvme/10  -d  /nvme/11  -d  /nvme/12  -d  /nvme/13  -d  /nvme/14  -d  /nvme/15  -d  /nvme/16  -d  /nvme/17  -d  /nvme/18  -d  /nvme/19  -d  /nvme/20  -d  /nvme/21  -d  /nvme/22  -d  /nvme/23  -d  /nvme/24  -d  /nvme/25  -d  /nvme/26  -d  /nvme/27  -d  /nvme/28  -d  /nvme/29  -d  /nvme/30  -d  /nvme/31  -d  /nvme/32  -d  /nvme/33  -d  /nvme/34  -d  /nvme/35  -d  /nvme/36  -d  /nvme/37  -d  /nvme/38  -d  /nvme/39 
#       Version 3.3, 40 thread(s) starting at Wed Dec 10 15:21:38 2025
#       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#       Directories:  Time based hash between directories across 1000 subdirectories with 180 seconds per subdirectory.
#       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#       Files info: size 16384 bytes, written with an IO size of 778 bytes per write
#       App overhead is time in microseconds spent in the test not doing file writing related system calls.

FSUse%        Count         Size    Files/sec     App Overhead
     2       240000        16384      40085.3          2492281
     2       480000        16384      37026.7          2780077
     2       720000        16384      28445.5          2591461
     3       960000        16384      28888.6          2595817
     3      1200000        16384      25160.8          2903882
     3      1440000        16384      29372.1          2600018
     3      1680000        16384      26443.9          2732790
     4      1920000        16384      26307.1          2758750

real    1m11.633s
user    0m46.156s
sys     3m24.543s
+ bulkme
+ set +x

real    0m0.091s
user    0m0.111s
sys     0m2.461s
+ rmdirme
+ set +x

real    0m9.364s
user    0m2.245s
sys     0m47.221s

and this for parent=1:

#  fs_mark  -w  778  -D  1000  -S  0  -n  6000  -s  16384  -L  8  -d  /nvme/0  -d  /nvme/1  -d  /nvme/2  -d  /nvme/3  -d  /nvme/4  -d  /nvme/5  -d  /nvme/6  -d  /nvme/7  -d  /nvme/8  -d  /nvme/9  -d  /nvme/10  -d  /nvme/11  -d  /nvme/12  -d  /nvme/13  -d  /nvme/14  -d  /nvme/15  -d  /nvme/16  -d  /nvme/17  -d  /nvme/18  -d  /nvme/19  -d  /nvme/20  -d  /nvme/21  -d  /nvme/22  -d  /nvme/23  -d  /nvme/24  -d  /nvme/25  -d  /nvme/26  -d  /nvme/27  -d  /nvme/28  -d  /nvme/29  -d  /nvme/30  -d  /nvme/31  -d  /nvme/32  -d  /nvme/33  -d  /nvme/34  -d  /nvme/35  -d  /nvme/36  -d  /nvme/37  -d  /nvme/38  -d  /nvme/39 
#       Version 3.3, 40 thread(s) starting at Wed Dec 10 15:23:38 2025
#       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#       Directories:  Time based hash between directories across 1000 subdirectories with 180 seconds per subdirectory.
#       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#       Files info: size 16384 bytes, written with an IO size of 778 bytes per write
#       App overhead is time in microseconds spent in the test not doing file writing related system calls.

FSUse%        Count         Size    Files/sec     App Overhead
     2       240000        16384      39340.1          2627066
     2       480000        16384      27727.2          2925494
     2       720000        16384      28305.4          2597191
     2       960000        16384      24891.6          2834421
     3      1200000        16384      27964.8          2810556
     3      1440000        16384      27204.6          2776783
     3      1680000        16384      25745.2          2779197
     3      1920000        16384      24674.9          2752721

real    1m14.422s
user    0m46.607s
sys     3m38.777s
+ bulkme
+ set +x

real    0m0.081s
user    0m0.123s
sys     0m2.408s
+ rmdirme
+ set +x

real    0m9.306s
user    0m2.570s
sys     1m10.598s

fs_mark shows a 7% decrease in creates/sec, a 4% increase in wall time,
and a 7% increase in kernel time.  bulkstat is as usual not that
different, and deletion shows an increase in kernel time of 50%.

Conclusion: There are noticeable overheads to enabling parent pointers,
but counterbalancing that, we can now repair an entire filesystem,
directory tree and all.

--D
