public inbox for linux-xfs@vger.kernel.org
From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: aalbersh@kernel.org, linux-xfs@vger.kernel.org
Subject: Re: [PATCH 1/2] mkfs: enable new features by default
Date: Wed, 10 Dec 2025 15:49:28 -0800
Message-ID: <20251210234928.GE7725@frogsfrogsfrogs>
In-Reply-To: <aTih1FDXt8fMrIb4@dread.disaster.area>

On Wed, Dec 10, 2025 at 09:25:24AM +1100, Dave Chinner wrote:
> On Tue, Dec 09, 2025 at 08:16:08AM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Since the LTS is coming up, enable parent pointers and exchange-range by
> > default for all users.  Also fix up an out of date comment.
> > 
> > I created a really stupid benchmarking script that does:
> > 
> > #!/bin/bash
> > 
> > # pptr overhead benchmark
> > 
> > umount /opt /mnt
> > rmmod xfs
> > for i in 1 0; do
> > 	umount /opt
> > 	mkfs.xfs -f /dev/sdb -n parent=$i | grep -i parent=
> > 	mount /dev/sdb /opt
> > 	mkdir -p /opt/foo
> > 	for ((i=0;i<5;i++)); do
> > 		time fsstress -n 100000 -p 4 -z -f creat=1 -d /opt/foo -s 1
> > 	done
> > done
> 
> Hmmm. fsstress is an interesting choice here...

<flush all the old benchmarks and conclusions>

I have an old 40-core Xeon E5-2660V3 with a pair of 1.5T Intel nvme ssds
and 128G of RAM running 6.18.0.  For this sample, I tried to keep the
memory usage well below the amount of DRAM so that I could measure the
pure overhead of writing parent pointers out to disk and not anything
else.  I also omitted ls'ing and chmod'ing the directory tree because
neither of those operations touches parent pointers.  I also left the
logbsize at the default (32k) because that's what most users get.

Here I'm using the following benchmark program, compiled from various
suggestions from dchinner over the years:

#!/bin/bash -x

iter=8
feature="-n parent"
filesz=0
subdirs=10000
files_per_iter=100000
writesz=16384

mkdirme() {
        set +x
        local i

        for ((i=0;i<agcount;i++)); do
                mkdir -p /nvme/$i
                dirs+=(-d /nvme/$i)
        done
        set -x
}

bulkme() {
        set +x
        local i

        for ((i=0;i<agcount;i++)); do
                xfs_io -c "bulkstat -a $i -q" /nvme &
        done
        wait
        set -x
}

rmdirme() {
        set +x
        local i
        # dirs[] interleaves "-d" flags with paths for fs_mark's benefit,
        # so walk the paths directly instead of iterating the array
        for ((i=0;i<agcount;i++)); do
                rm -r -f /nvme/$i &
        done
        wait
        set -x
}

benchme() {
        agcount="$(xfs_info /nvme/ | grep agcount= | sed -e 's/^.*agcount=//g' -e 's/,.*$//g')"
        dirs=()
        mkdirme

        #time ~djwong/cdev/work/fstests/build-x86_64/ltp/fsstress -n 400000 -p 40 -z -f creat=1,mkdir=1,rmdir=1,unlink=1 -d /nvme/ -s 1
        time fs_mark -w "${writesz}" -D "${subdirs}" -S 0 -n "${files_per_iter}" -s "${filesz}" -L "${iter}" "${dirs[@]}"

        time bulkme
        time rmdirme
}

for p in 0 1; do
        umount /dev/nvme1n1 /nvme /mnt
        #mkfs.xfs -f -l logdev=/dev/nvme0n1,size=1g /dev/nvme1n1 -n parent=$p || break
        mkfs.xfs -f -l logdev=/dev/nvme0n1,size=1g /dev/nvme1n1 $feature=$p || break
        mount /dev/nvme1n1 /nvme/ -o logdev=/dev/nvme0n1 || break
        benchme
        umount /dev/nvme1n1 /nvme /mnt
done

I get this mkfs output:
# mkfs.xfs -f -l logdev=/dev/nvme0n1,size=1g /dev/nvme1n1
meta-data=/dev/nvme1n1           isize=512    agcount=40, agsize=9767586 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
         =                       exchange=0   metadir=0
data     =                       bsize=4096   blocks=390703440, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=0
log      =/dev/nvme0n1           bsize=4096   blocks=262144, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
         =                       rgcount=0    rgsize=0 extents
         =                       zoned=0      start=0 reserved=0
# grep nvme1n1 /proc/mounts
/dev/nvme1n1 /nvme xfs rw,relatime,inode64,logbufs=8,logbsize=32k,logdev=/dev/nvme0n1,noquota 0 0

and this output from fs_mark with parent=0:

#  fs_mark  -D  10000  -S  0  -n  100000  -s  0  -L  8  -d  /nvme/0  -d  /nvme/1  -d  /nvme/2  -d  /nvme/3  -d  /nvme/4  -d  /nvme/5  -d  /nvme/6  -d  /nvme/7  -d  /nvme/8  -d  /nvme/9  -d  /nvme/10  -d  /nvme/11  -d  /nvme/12  -d  /nvme/13  -d  /nvme/14  -d  /nvme/15  -d  /nvme/16  -d  /nvme/17  -d  /nvme/18  -d  /nvme/19  -d  /nvme/20  -d  /nvme/21  -d  /nvme/22  -d  /nvme/23  -d  /nvme/24  -d  /nvme/25  -d  /nvme/26  -d  /nvme/27  -d  /nvme/28  -d  /nvme/29  -d  /nvme/30  -d  /nvme/31  -d  /nvme/32  -d  /nvme/33  -d  /nvme/34  -d  /nvme/35  -d  /nvme/36  -d  /nvme/37  -d  /nvme/38  -d  /nvme/39 
#       Version 3.3, 40 thread(s) starting at Wed Dec 10 14:22:07 2025
#       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#       Directories:  Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
#       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#       Files info: size 0 bytes, written with an IO size of 16384 bytes per write
#       App overhead is time in microseconds spent in the test not doing file writing related system calls.

FSUse%        Count         Size    Files/sec     App Overhead
     2      4000000            0     566680.9         31398816
     2      8000000            0     665535.6         30037368
     2     12000000            0     537227.6         31726557
     2     16000000            0     538133.9         32411165
     2     20000000            0     619369.6         30790676
     2     24000000            0     600018.2         31583349
     2     28000000            0     607209.8         31193980
     3     32000000            0     533240.7         32277102

real    0m57.573s
user    3m53.578s
sys     19m44.440s
+ bulkme
+ set +x

real    0m1.122s
user    0m0.955s
sys     0m39.306s
+ rmdirme
+ set +x

real    0m59.649s
user    0m41.196s
sys     13m9.566s

I limited this to 8 iterations so I could post some preliminary results
after a few minutes.  Now let's try again with parent=1:

#  fs_mark  -D  10000  -S  0  -n  100000  -s  0  -L  8  -d  /nvme/0  -d  /nvme/1  -d  /nvme/2  -d  /nvme/3  -d  /nvme/4  -d  /nvme/5  -d  /nvme/6  -d  /nvme/7  -d  /nvme/8  -d  /nvme/9  -d  /nvme/10  -d  /nvme/11  -d  /nvme/12  -d  /nvme/13  -d  /nvme/14  -d  /nvme/15  -d  /nvme/16  -d  /nvme/17  -d  /nvme/18  -d  /nvme/19  -d  /nvme/20  -d  /nvme/21  -d  /nvme/22  -d  /nvme/23  -d  /nvme/24  -d  /nvme/25  -d  /nvme/26  -d  /nvme/27  -d  /nvme/28  -d  /nvme/29  -d  /nvme/30  -d  /nvme/31  -d  /nvme/32  -d  /nvme/33  -d  /nvme/34  -d  /nvme/35  -d  /nvme/36  -d  /nvme/37  -d  /nvme/38  -d  /nvme/39 
#       Version 3.3, 40 thread(s) starting at Wed Dec 10 14:24:44 2025
#       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#       Directories:  Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
#       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#       Files info: size 0 bytes, written with an IO size of 16384 bytes per write
#       App overhead is time in microseconds spent in the test not doing file writing related system calls.

FSUse%        Count         Size    Files/sec     App Overhead
     2      4000000            0     543929.1         31344175
     2      8000000            0     523736.2         31180565
     2     12000000            0     522184.1         31700380
     2     16000000            0     513468.0         32112498
     2     20000000            0     543993.1         31910496
     2     24000000            0     562760.1         32061910
     2     28000000            0     524039.8         31825520
     3     32000000            0     526028.8         31889193

real    1m2.934s
user    3m53.508s
sys     25m14.810s
+ bulkme
+ set +x

real    0m1.158s
user    0m0.882s
sys     0m39.847s
+ rmdirme
+ set +x

real    1m12.505s
user    0m47.489s
sys     20m33.844s


fs_mark itself shows a decrease in file creation/sec of about 9%, an
increase in wall clock time of about 9%, and an increase in kernel time
of about 28%.  That's to be expected, since parent pointer updates cause
directory entry creation and deletion to take more ILOCKs and to hold
them for longer.

Parallel bulkstat (aka bulkme) shows an increase in wall time of 3% and
in system time of 1%, which is not surprising since that's just walking
the inode btrees and inode cores; no parent pointers are involved.

Similarly, deleting all the files created by fs_mark shows an increase
in wall time of about 21% and an increase in system time of about 56%.
I concede that parent pointers have a fair amount of overhead for the
worst case of creating or deleting a large directory tree.
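(For reference, all of these percentages are just (new - old) / old over
the `time` figures; a quick awk one-liner recomputes them from the
numbers pasted above:)

```shell
# (new - old) / old, as a percentage
delta() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f%%\n", (b - a) * 100 / a }'; }

delta 57.573 62.934      # fs_mark wall time, parent=0 -> parent=1: 9.3%
delta 1184.440 1514.810  # fs_mark sys time (19m44.440s vs 25m14.810s): 27.9%
delta 59.649 72.505      # rmdirme wall time: 21.6%
delta 789.566 1233.844   # rmdirme sys time (13m9.566s vs 20m33.844s): 56.3%
```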

I reran this with logbsize=256k and, while I saw a slight increase in
performance across the board, the overhead of parent pointers is about
the same percentage-wise.

If I then re-run the benchmark with a file size of 1M and tell it to
create fewer files, then I get the following for parent=0:

#  fs_mark  -D  1000  -S  0  -n  200  -s  1048576  -L  8  -d  /nvme/0  -d  /nvme/1  -d  /nvme/2  -d  /nvme/3  -d  /nvme/4  -d  /nvme/5  -d  /nvme/6  -d  /nvme/7  -d  /nvme/8  -d  /nvme/9  -d  /nvme/10  -d  /nvme/11  -d  /nvme/12  -d  /nvme/13  -d  /nvme/14  -d  /nvme/15  -d  /nvme/16  -d  /nvme/17  -d  /nvme/18  -d  /nvme/19  -d  /nvme/20  -d  /nvme/21  -d  /nvme/22  -d  /nvme/23  -d  /nvme/24  -d  /nvme/25  -d  /nvme/26  -d  /nvme/27  -d  /nvme/28  -d  /nvme/29  -d  /nvme/30  -d  /nvme/31  -d  /nvme/32  -d  /nvme/33  -d  /nvme/34  -d  /nvme/35  -d  /nvme/36  -d  /nvme/37  -d  /nvme/38  -d  /nvme/39 
#       Version 3.3, 40 thread(s) starting at Wed Dec 10 15:03:11 2025
#       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#       Directories:  Time based hash between directories across 1000 subdirectories with 180 seconds per subdirectory.
#       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#       Files info: size 1048576 bytes, written with an IO size of 16384 bytes per write
#       App overhead is time in microseconds spent in the test not doing file writing related system calls.

FSUse%        Count         Size    Files/sec     App Overhead
     2         8000      1048576       1493.4           198379
     2        16000      1048576       1327.0           255655
     3        24000      1048576       1355.8           255105
     4        32000      1048576       1352.3           253094
     4        40000      1048576       1836.9           262258
     5        48000      1048576       1337.6           246991
     5        56000      1048576       1328.4           240303
     6        64000      1048576       1165.9           237211

real    0m50.384s
user    0m7.640s
sys     1m43.187s
+ bulkme
+ set +x

real    0m0.023s
user    0m0.061s
sys     0m0.167s
+ rmdirme
+ set +x

real    0m0.675s
user    0m0.107s
sys     0m15.644s

and for parent=1:

#  fs_mark  -D  1000  -S  0  -n  200  -s  1048576  -L  8  -d  /nvme/0  -d  /nvme/1  -d  /nvme/2  -d  /nvme/3  -d  /nvme/4  -d  /nvme/5  -d  /nvme/6  -d  /nvme/7  -d  /nvme/8  -d  /nvme/9  -d  /nvme/10  -d  /nvme/11  -d  /nvme/12  -d  /nvme/13  -d  /nvme/14  -d  /nvme/15  -d  /nvme/16  -d  /nvme/17  -d  /nvme/18  -d  /nvme/19  -d  /nvme/20  -d  /nvme/21  -d  /nvme/22  -d  /nvme/23  -d  /nvme/24  -d  /nvme/25  -d  /nvme/26  -d  /nvme/27  -d  /nvme/28  -d  /nvme/29  -d  /nvme/30  -d  /nvme/31  -d  /nvme/32  -d  /nvme/33  -d  /nvme/34  -d  /nvme/35  -d  /nvme/36  -d  /nvme/37  -d  /nvme/38  -d  /nvme/39 
#       Version 3.3, 40 thread(s) starting at Wed Dec 10 15:04:41 2025
#       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#       Directories:  Time based hash between directories across 1000 subdirectories with 180 seconds per subdirectory.
#       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#       Files info: size 1048576 bytes, written with an IO size of 16384 bytes per write
#       App overhead is time in microseconds spent in the test not doing file writing related system calls.

FSUse%        Count         Size    Files/sec     App Overhead
     2         8000      1048576       1963.9           254007
     2        16000      1048576       1716.4           227074
     3        24000      1048576       1052.5           264987
     4        32000      1048576       1793.6           242288
     4        40000      1048576       1364.2           249738
     5        48000      1048576       1081.2           250394
     5        56000      1048576       1342.0           260667
     6        64000      1048576       1356.9           242324

real    0m49.256s
user    0m7.621s
sys     1m44.847s
+ bulkme
+ set +x

real    0m0.021s
user    0m0.060s
sys     0m0.176s
+ rmdirme
+ set +x

real    0m0.537s
user    0m0.108s
sys     0m15.453s

Here we see that the fs_mark creates/sec goes up by 4%, wall time
decreases by 3%, and the kernel time increases by 2% or so.  The rmdir
wall time decreases by 2% and the kernel time by ~1%, which is quite
small.  So for a more common case of populating a directory tree full of
big files with data in them, the overhead isn't all that noticeable.

I then decided to simulate my maildir spool, which has 670,000 files
consuming 12GB, for an average file size of 17936 bytes.  I reduced the
file size to 16K, increased the number of files per iteration, and set
the write buffer size to something not aligned to a block, and got this
for parent=0:

#  fs_mark  -w  778  -D  1000  -S  0  -n  6000  -s  16384  -L  8  -d  /nvme/0  -d  /nvme/1  -d  /nvme/2  -d  /nvme/3  -d  /nvme/4  -d  /nvme/5  -d  /nvme/6  -d  /nvme/7  -d  /nvme/8  -d  /nvme/9  -d  /nvme/10  -d  /nvme/11  -d  /nvme/12  -d  /nvme/13  -d  /nvme/14  -d  /nvme/15  -d  /nvme/16  -d  /nvme/17  -d  /nvme/18  -d  /nvme/19  -d  /nvme/20  -d  /nvme/21  -d  /nvme/22  -d  /nvme/23  -d  /nvme/24  -d  /nvme/25  -d  /nvme/26  -d  /nvme/27  -d  /nvme/28  -d  /nvme/29  -d  /nvme/30  -d  /nvme/31  -d  /nvme/32  -d  /nvme/33  -d  /nvme/34  -d  /nvme/35  -d  /nvme/36  -d  /nvme/37  -d  /nvme/38  -d  /nvme/39 
#       Version 3.3, 40 thread(s) starting at Wed Dec 10 15:21:38 2025
#       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#       Directories:  Time based hash between directories across 1000 subdirectories with 180 seconds per subdirectory.
#       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#       Files info: size 16384 bytes, written with an IO size of 778 bytes per write
#       App overhead is time in microseconds spent in the test not doing file writing related system calls.

FSUse%        Count         Size    Files/sec     App Overhead
     2       240000        16384      40085.3          2492281
     2       480000        16384      37026.7          2780077
     2       720000        16384      28445.5          2591461
     3       960000        16384      28888.6          2595817
     3      1200000        16384      25160.8          2903882
     3      1440000        16384      29372.1          2600018
     3      1680000        16384      26443.9          2732790
     4      1920000        16384      26307.1          2758750

real    1m11.633s
user    0m46.156s
sys     3m24.543s
+ bulkme
+ set +x

real    0m0.091s
user    0m0.111s
sys     0m2.461s
+ rmdirme
+ set +x

real    0m9.364s
user    0m2.245s
sys     0m47.221s

and this for parent=1

#  fs_mark  -w  778  -D  1000  -S  0  -n  6000  -s  16384  -L  8  -d  /nvme/0  -d  /nvme/1  -d  /nvme/2  -d  /nvme/3  -d  /nvme/4  -d  /nvme/5  -d  /nvme/6  -d  /nvme/7  -d  /nvme/8  -d  /nvme/9  -d  /nvme/10  -d  /nvme/11  -d  /nvme/12  -d  /nvme/13  -d  /nvme/14  -d  /nvme/15  -d  /nvme/16  -d  /nvme/17  -d  /nvme/18  -d  /nvme/19  -d  /nvme/20  -d  /nvme/21  -d  /nvme/22  -d  /nvme/23  -d  /nvme/24  -d  /nvme/25  -d  /nvme/26  -d  /nvme/27  -d  /nvme/28  -d  /nvme/29  -d  /nvme/30  -d  /nvme/31  -d  /nvme/32  -d  /nvme/33  -d  /nvme/34  -d  /nvme/35  -d  /nvme/36  -d  /nvme/37  -d  /nvme/38  -d  /nvme/39 
#       Version 3.3, 40 thread(s) starting at Wed Dec 10 15:23:38 2025
#       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#       Directories:  Time based hash between directories across 1000 subdirectories with 180 seconds per subdirectory.
#       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#       Files info: size 16384 bytes, written with an IO size of 778 bytes per write
#       App overhead is time in microseconds spent in the test not doing file writing related system calls.

FSUse%        Count         Size    Files/sec     App Overhead
     2       240000        16384      39340.1          2627066
     2       480000        16384      27727.2          2925494
     2       720000        16384      28305.4          2597191
     2       960000        16384      24891.6          2834421
     3      1200000        16384      27964.8          2810556
     3      1440000        16384      27204.6          2776783
     3      1680000        16384      25745.2          2779197
     3      1920000        16384      24674.9          2752721

real    1m14.422s
user    0m46.607s
sys     3m38.777s
+ bulkme
+ set +x

real    0m0.081s
user    0m0.123s
sys     0m2.408s
+ rmdirme
+ set +x

real    0m9.306s
user    0m2.570s
sys     1m10.598s

fs_mark shows a 7% decrease in creates/sec, a 4% increase in wall time,
and a 7% increase in kernel time.  bulkstat is, as usual, not that
different, and deletion shows an increase in kernel time of 50%.

Conclusion: There are noticeable overheads to enabling parent pointers,
but counterbalancing that, we can now repair an entire filesystem,
directory tree and all.
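(For instance, on a kernel built with online repair, something like the
following; -n makes it a dry run that only reports problems:)

```shell
# Scrub a mounted filesystem's metadata online; drop -n to let it
# actually repair what it finds.
xfs_scrub -n /nvme
```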

--D
