From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: aalbersh@kernel.org, linux-xfs@vger.kernel.org
Subject: Re: [PATCH 1/2] mkfs: enable new features by default
Date: Wed, 10 Dec 2025 15:49:28 -0800
Message-ID: <20251210234928.GE7725@frogsfrogsfrogs>
In-Reply-To: <aTih1FDXt8fMrIb4@dread.disaster.area>
On Wed, Dec 10, 2025 at 09:25:24AM +1100, Dave Chinner wrote:
> On Tue, Dec 09, 2025 at 08:16:08AM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Since the LTS is coming up, enable parent pointers and exchange-range by
> > default for all users. Also fix up an out of date comment.
> >
> > I created a really stupid benchmarking script that does:
> >
> > #!/bin/bash
> >
> > # pptr overhead benchmark
> >
> > umount /opt /mnt
> > rmmod xfs
> > for i in 1 0; do
> > umount /opt
> > mkfs.xfs -f /dev/sdb -n parent=$i | grep -i parent=
> > mount /dev/sdb /opt
> > mkdir -p /opt/foo
> > for ((i=0;i<5;i++)); do
> > time fsstress -n 100000 -p 4 -z -f creat=1 -d /opt/foo -s 1
> > done
> > done
>
> Hmmm. fsstress is an interesting choice here...
<flush all the old benchmarks and conclusions>
I have an old 40-core Xeon E5-2660 v3 box with a pair of 1.5T Intel NVMe
SSDs and 128G of RAM, running 6.18.0. For this sample I tried to keep
memory usage well below the amount of DRAM so that I could measure the
pure overhead of writing parent pointers out to disk and nothing else.
I omitted ls'ing and chmod'ing the directory tree because neither of
those operations touches parent pointers, and I left logbsize at the
default (32k) because that's what most users get.
Here I'm using the following benchmark program, compiled from various
suggestions from dchinner over the years:
#!/bin/bash -x

# Benchmark knobs, fed to fs_mark below.
iter=8
feature="-n parent"
filesz=0
subdirs=10000
files_per_iter=100000
writesize=16384

# Create one working directory per AG and build the fs_mark -d argument list.
mkdirme() {
	set +x
	local i
	for ((i=0;i<agcount;i++)); do
		mkdir -p /nvme/$i
		dirs+=(-d /nvme/$i)
	done
	set -x
}

# Bulkstat every AG in parallel.
bulkme() {
	set +x
	local i
	for ((i=0;i<agcount;i++)); do
		xfs_io -c "bulkstat -a $i -q" /nvme &
	done
	wait
	set -x
}

# Delete the populated directories in parallel.
rmdirme() {
	set +x
	local dir
	for dir in "${dirs[@]}"; do
		rm -r -f "${dir}" &
	done
	wait
	set -x
}

benchme() {
	# Spread the load across however many AGs mkfs created.
	agcount="$(xfs_info /nvme/ | grep agcount= | sed -e 's/^.*agcount=//g' -e 's/,.*$//g')"
	dirs=()
	mkdirme

	#time ~djwong/cdev/work/fstests/build-x86_64/ltp/fsstress -n 400000 -p 40 -z -f creat=1,mkdir=1,rmdir=1,unlink=1 -d /nvme/ -s 1
	time fs_mark -w "${writesize}" -D "${subdirs}" -S 0 -n "${files_per_iter}" -s "${filesz}" -L "${iter}" "${dirs[@]}"
	time bulkme
	time rmdirme
}

# Run the whole thing twice: feature disabled, then enabled.
for p in 0 1; do
	umount /dev/nvme1n1 /nvme /mnt
	#mkfs.xfs -f -l logdev=/dev/nvme0n1,size=1g /dev/nvme1n1 -n parent=$p || break
	mkfs.xfs -f -l logdev=/dev/nvme0n1,size=1g /dev/nvme1n1 $feature=$p || break
	mount /dev/nvme1n1 /nvme/ -o logdev=/dev/nvme0n1 || break
	benchme
	umount /dev/nvme1n1 /nvme /mnt
done
I get this mkfs output:
# mkfs.xfs -f -l logdev=/dev/nvme0n1,size=1g /dev/nvme1n1
meta-data=/dev/nvme1n1           isize=512    agcount=40, agsize=9767586 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
         =                       exchange=0   metadir=0
data     =                       bsize=4096   blocks=390703440, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=0
log      =/dev/nvme0n1           bsize=4096   blocks=262144, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
         =                       rgcount=0    rgsize=0 extents
         =                       zoned=0      start=0 reserved=0
# grep nvme1n1 /proc/mounts
/dev/nvme1n1 /nvme xfs rw,relatime,inode64,logbufs=8,logbsize=32k,logdev=/dev/nvme0n1,noquota 0 0
and this output from fs_mark with parent=0:
# fs_mark -D 10000 -S 0 -n 100000 -s 0 -L 8 -d /nvme/0 -d /nvme/1 -d /nvme/2 -d /nvme/3 -d /nvme/4 -d /nvme/5 -d /nvme/6 -d /nvme/7 -d /nvme/8 -d /nvme/9 -d /nvme/10 -d /nvme/11 -d /nvme/12 -d /nvme/13 -d /nvme/14 -d /nvme/15 -d /nvme/16 -d /nvme/17 -d /nvme/18 -d /nvme/19 -d /nvme/20 -d /nvme/21 -d /nvme/22 -d /nvme/23 -d /nvme/24 -d /nvme/25 -d /nvme/26 -d /nvme/27 -d /nvme/28 -d /nvme/29 -d /nvme/30 -d /nvme/31 -d /nvme/32 -d /nvme/33 -d /nvme/34 -d /nvme/35 -d /nvme/36 -d /nvme/37 -d /nvme/38 -d /nvme/39
# Version 3.3, 40 thread(s) starting at Wed Dec 10 14:22:07 2025
# Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
# Directories: Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
# File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
# Files info: size 0 bytes, written with an IO size of 16384 bytes per write
# App overhead is time in microseconds spent in the test not doing file writing related system calls.
FSUse%        Count         Size    Files/sec     App Overhead
     2      4000000            0     566680.9         31398816
     2      8000000            0     665535.6         30037368
     2     12000000            0     537227.6         31726557
     2     16000000            0     538133.9         32411165
     2     20000000            0     619369.6         30790676
     2     24000000            0     600018.2         31583349
     2     28000000            0     607209.8         31193980
     3     32000000            0     533240.7         32277102
real 0m57.573s
user 3m53.578s
sys 19m44.440s
+ bulkme
+ set +x
real 0m1.122s
user 0m0.955s
sys 0m39.306s
+ rmdirme
+ set +x
real 0m59.649s
user 0m41.196s
sys 13m9.566s
I limited this to 8 iterations so I could post some preliminary results
after a few minutes. Now let's try again with parent=1:
# fs_mark -D 10000 -S 0 -n 100000 -s 0 -L 8 -d /nvme/0 -d /nvme/1 -d /nvme/2 -d /nvme/3 -d /nvme/4 -d /nvme/5 -d /nvme/6 -d /nvme/7 -d /nvme/8 -d /nvme/9 -d /nvme/10 -d /nvme/11 -d /nvme/12 -d /nvme/13 -d /nvme/14 -d /nvme/15 -d /nvme/16 -d /nvme/17 -d /nvme/18 -d /nvme/19 -d /nvme/20 -d /nvme/21 -d /nvme/22 -d /nvme/23 -d /nvme/24 -d /nvme/25 -d /nvme/26 -d /nvme/27 -d /nvme/28 -d /nvme/29 -d /nvme/30 -d /nvme/31 -d /nvme/32 -d /nvme/33 -d /nvme/34 -d /nvme/35 -d /nvme/36 -d /nvme/37 -d /nvme/38 -d /nvme/39
# Version 3.3, 40 thread(s) starting at Wed Dec 10 14:24:44 2025
# Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
# Directories: Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
# File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
# Files info: size 0 bytes, written with an IO size of 16384 bytes per write
# App overhead is time in microseconds spent in the test not doing file writing related system calls.
FSUse%        Count         Size    Files/sec     App Overhead
     2      4000000            0     543929.1         31344175
     2      8000000            0     523736.2         31180565
     2     12000000            0     522184.1         31700380
     2     16000000            0     513468.0         32112498
     2     20000000            0     543993.1         31910496
     2     24000000            0     562760.1         32061910
     2     28000000            0     524039.8         31825520
     3     32000000            0     526028.8         31889193
real 1m2.934s
user 3m53.508s
sys 25m14.810s
+ bulkme
+ set +x
real 0m1.158s
user 0m0.882s
sys 0m39.847s
+ rmdirme
+ set +x
real 1m12.505s
user 0m47.489s
sys 20m33.844s
fs_mark itself shows a decrease in file creates/sec of about 9%, an
increase in wall clock time of about 9%, and an increase in kernel time
of about 28%. That's to be expected, since parent pointer updates cause
directory entry creation and deletion to take more ILOCKs and to hold
them for longer.
Parallel bulkstat (aka bulkme) shows an increase in wall time of 3% and
in system time of 1%, which is not surprising since bulkstat only walks
the inode btrees and inode cores; no parent pointers are involved.
Similarly, deleting all the files created by fs_mark shows an increase
in wall time of about 21% and an increase in system time of about 56%.
I concede that parent pointers have a fair amount of overhead for the
worst case of creating or deleting a large directory tree.
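For reference, here's the rough arithmetic behind those percentages,
computed straight from the timings above (the files/sec figures are the
means of the eight samples in each table); just a bc sketch:

# create phase (fs_mark), parent=0 -> parent=1
echo "scale=3; 532517 / 583427" | bc	# mean files/sec ratio: ~9% drop
echo "scale=3; 62.934 / 57.573" | bc	# wall clock ratio: ~9% up
echo "scale=3; 1514.8 / 1184.4" | bc	# system time ratio: ~28% up
# delete phase (rmdirme), parent=0 -> parent=1
echo "scale=3; 72.505 / 59.649" | bc	# wall clock ratio: ~21% up
echo "scale=3; 1233.8 / 789.6" | bc	# system time ratio: ~56% up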
I reran this with logbsize=256k, and while I saw a slight increase in
performance across the board, the pptr overhead is about the same in
percentage terms.
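(That rerun only changes the mount options, something like this; a
sketch, not the exact command line I used:)

mount /dev/nvme1n1 /nvme/ -o logdev=/dev/nvme0n1,logbsize=256k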
If I then re-run the benchmark with a file size of 1M and far fewer
files, I get the results below.
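In terms of the script variables above, that run corresponds roughly to
the following (a sketch reconstructed from the fs_mark command line
below, not the exact edit I made):

filesz=1048576        # -s 1048576, i.e. 1M files
subdirs=1000          # -D 1000
files_per_iter=200    # -n 200

Here's parent=0: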
# fs_mark -D 1000 -S 0 -n 200 -s 1048576 -L 8 -d /nvme/0 -d /nvme/1 -d /nvme/2 -d /nvme/3 -d /nvme/4 -d /nvme/5 -d /nvme/6 -d /nvme/7 -d /nvme/8 -d /nvme/9 -d /nvme/10 -d /nvme/11 -d /nvme/12 -d /nvme/13 -d /nvme/14 -d /nvme/15 -d /nvme/16 -d /nvme/17 -d /nvme/18 -d /nvme/19 -d /nvme/20 -d /nvme/21 -d /nvme/22 -d /nvme/23 -d /nvme/24 -d /nvme/25 -d /nvme/26 -d /nvme/27 -d /nvme/28 -d /nvme/29 -d /nvme/30 -d /nvme/31 -d /nvme/32 -d /nvme/33 -d /nvme/34 -d /nvme/35 -d /nvme/36 -d /nvme/37 -d /nvme/38 -d /nvme/39
# Version 3.3, 40 thread(s) starting at Wed Dec 10 15:03:11 2025
# Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
# Directories: Time based hash between directories across 1000 subdirectories with 180 seconds per subdirectory.
# File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
# Files info: size 1048576 bytes, written with an IO size of 16384 bytes per write
# App overhead is time in microseconds spent in the test not doing file writing related system calls.
FSUse%        Count         Size    Files/sec     App Overhead
     2         8000      1048576       1493.4           198379
     2        16000      1048576       1327.0           255655
     3        24000      1048576       1355.8           255105
     4        32000      1048576       1352.3           253094
     4        40000      1048576       1836.9           262258
     5        48000      1048576       1337.6           246991
     5        56000      1048576       1328.4           240303
     6        64000      1048576       1165.9           237211
real 0m50.384s
user 0m7.640s
sys 1m43.187s
+ bulkme
+ set +x
real 0m0.023s
user 0m0.061s
sys 0m0.167s
+ rmdirme
+ set +x
real 0m0.675s
user 0m0.107s
sys 0m15.644s
and for parent=1:
# fs_mark -D 1000 -S 0 -n 200 -s 1048576 -L 8 -d /nvme/0 -d /nvme/1 -d /nvme/2 -d /nvme/3 -d /nvme/4 -d /nvme/5 -d /nvme/6 -d /nvme/7 -d /nvme/8 -d /nvme/9 -d /nvme/10 -d /nvme/11 -d /nvme/12 -d /nvme/13 -d /nvme/14 -d /nvme/15 -d /nvme/16 -d /nvme/17 -d /nvme/18 -d /nvme/19 -d /nvme/20 -d /nvme/21 -d /nvme/22 -d /nvme/23 -d /nvme/24 -d /nvme/25 -d /nvme/26 -d /nvme/27 -d /nvme/28 -d /nvme/29 -d /nvme/30 -d /nvme/31 -d /nvme/32 -d /nvme/33 -d /nvme/34 -d /nvme/35 -d /nvme/36 -d /nvme/37 -d /nvme/38 -d /nvme/39
# Version 3.3, 40 thread(s) starting at Wed Dec 10 15:04:41 2025
# Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
# Directories: Time based hash between directories across 1000 subdirectories with 180 seconds per subdirectory.
# File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
# Files info: size 1048576 bytes, written with an IO size of 16384 bytes per write
# App overhead is time in microseconds spent in the test not doing file writing related system calls.
FSUse%        Count         Size    Files/sec     App Overhead
     2         8000      1048576       1963.9           254007
     2        16000      1048576       1716.4           227074
     3        24000      1048576       1052.5           264987
     4        32000      1048576       1793.6           242288
     4        40000      1048576       1364.2           249738
     5        48000      1048576       1081.2           250394
     5        56000      1048576       1342.0           260667
     6        64000      1048576       1356.9           242324
real 0m49.256s
user 0m7.621s
sys 1m44.847s
+ bulkme
+ set +x
real 0m0.021s
user 0m0.060s
sys 0m0.176s
+ rmdirme
+ set +x
real 0m0.537s
user 0m0.108s
sys 0m15.453s
Here fs_mark creates/sec actually goes up by 4%, wall time decreases by
3%, and kernel time increases by 2% or so. The rmdir wall time decreases
by ~20% and the kernel time by ~1%, though at these sub-second runtimes
that's mostly noise. So for the more common case of populating a
directory tree with big files full of data, the overhead isn't all that
noticeable.
I then decided to simulate my maildir spool, which has 670,000 files
consuming 12GB, for an average file size of 17936 bytes. I reduced the
file size to 16K, increased the number of files per iteration, and set
the write buffer size to something not aligned to a block.
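In script-variable terms that's roughly the following (again a sketch
derived from the fs_mark flags below):

filesz=16384          # -s 16384, i.e. 16K files
subdirs=1000          # -D 1000
files_per_iter=6000   # -n 6000
writesize=778         # -w 778, a deliberately unaligned write size

Parent=0 gives: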
# fs_mark -w 778 -D 1000 -S 0 -n 6000 -s 16384 -L 8 -d /nvme/0 -d /nvme/1 -d /nvme/2 -d /nvme/3 -d /nvme/4 -d /nvme/5 -d /nvme/6 -d /nvme/7 -d /nvme/8 -d /nvme/9 -d /nvme/10 -d /nvme/11 -d /nvme/12 -d /nvme/13 -d /nvme/14 -d /nvme/15 -d /nvme/16 -d /nvme/17 -d /nvme/18 -d /nvme/19 -d /nvme/20 -d /nvme/21 -d /nvme/22 -d /nvme/23 -d /nvme/24 -d /nvme/25 -d /nvme/26 -d /nvme/27 -d /nvme/28 -d /nvme/29 -d /nvme/30 -d /nvme/31 -d /nvme/32 -d /nvme/33 -d /nvme/34 -d /nvme/35 -d /nvme/36 -d /nvme/37 -d /nvme/38 -d /nvme/39
# Version 3.3, 40 thread(s) starting at Wed Dec 10 15:21:38 2025
# Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
# Directories: Time based hash between directories across 1000 subdirectories with 180 seconds per subdirectory.
# File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
# Files info: size 16384 bytes, written with an IO size of 778 bytes per write
# App overhead is time in microseconds spent in the test not doing file writing related system calls.
FSUse%        Count         Size    Files/sec     App Overhead
     2       240000        16384      40085.3          2492281
     2       480000        16384      37026.7          2780077
     2       720000        16384      28445.5          2591461
     3       960000        16384      28888.6          2595817
     3      1200000        16384      25160.8          2903882
     3      1440000        16384      29372.1          2600018
     3      1680000        16384      26443.9          2732790
     4      1920000        16384      26307.1          2758750
real 1m11.633s
user 0m46.156s
sys 3m24.543s
+ bulkme
+ set +x
real 0m0.091s
user 0m0.111s
sys 0m2.461s
+ rmdirme
+ set +x
real 0m9.364s
user 0m2.245s
sys 0m47.221s
and this for parent=1:
# fs_mark -w 778 -D 1000 -S 0 -n 6000 -s 16384 -L 8 -d /nvme/0 -d /nvme/1 -d /nvme/2 -d /nvme/3 -d /nvme/4 -d /nvme/5 -d /nvme/6 -d /nvme/7 -d /nvme/8 -d /nvme/9 -d /nvme/10 -d /nvme/11 -d /nvme/12 -d /nvme/13 -d /nvme/14 -d /nvme/15 -d /nvme/16 -d /nvme/17 -d /nvme/18 -d /nvme/19 -d /nvme/20 -d /nvme/21 -d /nvme/22 -d /nvme/23 -d /nvme/24 -d /nvme/25 -d /nvme/26 -d /nvme/27 -d /nvme/28 -d /nvme/29 -d /nvme/30 -d /nvme/31 -d /nvme/32 -d /nvme/33 -d /nvme/34 -d /nvme/35 -d /nvme/36 -d /nvme/37 -d /nvme/38 -d /nvme/39
# Version 3.3, 40 thread(s) starting at Wed Dec 10 15:23:38 2025
# Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
# Directories: Time based hash between directories across 1000 subdirectories with 180 seconds per subdirectory.
# File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
# Files info: size 16384 bytes, written with an IO size of 778 bytes per write
# App overhead is time in microseconds spent in the test not doing file writing related system calls.
FSUse%        Count         Size    Files/sec     App Overhead
     2       240000        16384      39340.1          2627066
     2       480000        16384      27727.2          2925494
     2       720000        16384      28305.4          2597191
     2       960000        16384      24891.6          2834421
     3      1200000        16384      27964.8          2810556
     3      1440000        16384      27204.6          2776783
     3      1680000        16384      25745.2          2779197
     3      1920000        16384      24674.9          2752721
real 1m14.422s
user 0m46.607s
sys 3m38.777s
+ bulkme
+ set +x
real 0m0.081s
user 0m0.123s
sys 0m2.408s
+ rmdirme
+ set +x
real 0m9.306s
user 0m2.570s
sys 1m10.598s
fs_mark shows a 7% decrease in creates/sec, a 4% increase in wall time,
and a 7% increase in kernel time. bulkstat is, as usual, not that
different, and deletion shows a 50% increase in kernel time.
Conclusion: There are noticeable overheads to enabling parent pointers,
but counterbalancing that, we can now repair an entire filesystem,
directory tree and all.
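For anyone who wants to poke at the repair side: it's driven by the
online fsck tooling. A minimal sketch, assuming an xfsprogs and kernel
built with online scrub/repair support:

# dry-run check of the mounted fs; with parent pointers, scrub can walk
# from any inode back to its parents when reconstructing the directory tree
xfs_scrub -n /nvme
# same, but let it attempt repairs where the kernel supports them
xfs_scrub /nvme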
--D