* [PATCHSET 2/2] xfsprogs: enable new stable features for 6.18
@ 2025-12-02 1:27 Darrick J. Wong
2025-12-02 1:28 ` [PATCH 1/2] mkfs: enable new features by default Darrick J. Wong
2025-12-02 1:28 ` [PATCH 2/2] mkfs: add 2025 LTS config file Darrick J. Wong
0 siblings, 2 replies; 13+ messages in thread
From: Darrick J. Wong @ 2025-12-02 1:27 UTC (permalink / raw)
To: aalbersh, djwong; +Cc: linux-xfs
Hi all,
Enable by default some new features that seem stable now.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.
--D
xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=default-features
---
Commits in this patchset:
* mkfs: enable new features by default
* mkfs: add 2025 LTS config file
---
mkfs/Makefile | 3 ++-
mkfs/lts_6.18.conf | 19 +++++++++++++++++++
mkfs/xfs_mkfs.c | 5 +++--
3 files changed, 24 insertions(+), 3 deletions(-)
create mode 100644 mkfs/lts_6.18.conf
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 1/2] mkfs: enable new features by default
2025-12-02 1:27 [PATCHSET 2/2] xfsprogs: enable new stable features for 6.18 Darrick J. Wong
@ 2025-12-02 1:28 ` Darrick J. Wong
2025-12-02 7:38 ` Christoph Hellwig
2025-12-02 1:28 ` [PATCH 2/2] mkfs: add 2025 LTS config file Darrick J. Wong
1 sibling, 1 reply; 13+ messages in thread
From: Darrick J. Wong @ 2025-12-02 1:28 UTC (permalink / raw)
To: aalbersh, djwong; +Cc: linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Since the LTS is coming up, enable parent pointers and exchange-range by
default for all users. Also fix up an out of date comment.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
mkfs/xfs_mkfs.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index 8f5a6fa5676453..8db51217016eb0 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -1044,7 +1044,7 @@ struct sb_feat_args {
bool inode_align; /* XFS_SB_VERSION_ALIGNBIT */
bool nci; /* XFS_SB_VERSION_BORGBIT */
bool lazy_sb_counters; /* XFS_SB_VERSION2_LAZYSBCOUNTBIT */
- bool parent_pointers; /* XFS_SB_VERSION2_PARENTBIT */
+ bool parent_pointers; /* XFS_SB_FEAT_INCOMPAT_PARENT */
bool projid32bit; /* XFS_SB_VERSION2_PROJID32BIT */
bool crcs_enabled; /* XFS_SB_VERSION2_CRCBIT */
bool dirftype; /* XFS_SB_VERSION2_FTYPE */
@@ -5984,11 +5984,12 @@ main(
.rmapbt = true,
.reflink = true,
.inobtcnt = true,
- .parent_pointers = false,
+ .parent_pointers = true,
.nodalign = false,
.nortalign = false,
.bigtime = true,
.nrext64 = true,
+ .exchrange = true,
/*
* When we decide to enable a new feature by default,
* please remember to update the mkfs conf files.
* [PATCH 2/2] mkfs: add 2025 LTS config file
2025-12-02 1:27 [PATCHSET 2/2] xfsprogs: enable new stable features for 6.18 Darrick J. Wong
2025-12-02 1:28 ` [PATCH 1/2] mkfs: enable new features by default Darrick J. Wong
@ 2025-12-02 1:28 ` Darrick J. Wong
1 sibling, 0 replies; 13+ messages in thread
From: Darrick J. Wong @ 2025-12-02 1:28 UTC (permalink / raw)
To: aalbersh, djwong; +Cc: linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Add a new configuration file with the defaults as of 6.18 LTS.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
mkfs/Makefile | 3 ++-
mkfs/lts_6.18.conf | 19 +++++++++++++++++++
2 files changed, 21 insertions(+), 1 deletion(-)
create mode 100644 mkfs/lts_6.18.conf
diff --git a/mkfs/Makefile b/mkfs/Makefile
index 04905bd5101ccb..fb1473324cde7c 100644
--- a/mkfs/Makefile
+++ b/mkfs/Makefile
@@ -18,7 +18,8 @@ CFGFILES = \
lts_5.15.conf \
lts_6.1.conf \
lts_6.6.conf \
- lts_6.12.conf
+ lts_6.12.conf \
+ lts_6.18.conf
LLDLIBS += $(LIBXFS) $(LIBXCMD) $(LIBFROG) $(LIBRT) $(LIBBLKID) \
$(LIBUUID) $(LIBINIH) $(LIBURCU) $(LIBPTHREAD)
diff --git a/mkfs/lts_6.18.conf b/mkfs/lts_6.18.conf
new file mode 100644
index 00000000000000..2dbec51e586fa1
--- /dev/null
+++ b/mkfs/lts_6.18.conf
@@ -0,0 +1,19 @@
+# V5 features that were the mkfs defaults when the upstream Linux 6.18 LTS
+# kernel was released at the end of 2025.
+
+[metadata]
+bigtime=1
+crc=1
+finobt=1
+inobtcount=1
+metadir=0
+reflink=1
+rmapbt=1
+
+[inode]
+sparse=1
+nrext64=1
+exchange=1
+
+[naming]
+parent=1
* Re: [PATCH 1/2] mkfs: enable new features by default
2025-12-02 1:28 ` [PATCH 1/2] mkfs: enable new features by default Darrick J. Wong
@ 2025-12-02 7:38 ` Christoph Hellwig
2025-12-03 0:53 ` Darrick J. Wong
0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2025-12-02 7:38 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: aalbersh, linux-xfs
On Mon, Dec 01, 2025 at 05:28:16PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Since the LTS is coming up, enable parent pointers and exchange-range by
> default for all users. Also fix up an out of date comment.
Do you have any numbers that show the overhead or non-overhead of
enabling rmap? It will increase the amount of metadata written quite
a bit.
* Re: [PATCH 1/2] mkfs: enable new features by default
2025-12-02 7:38 ` Christoph Hellwig
@ 2025-12-03 0:53 ` Darrick J. Wong
2025-12-03 6:31 ` Christoph Hellwig
0 siblings, 1 reply; 13+ messages in thread
From: Darrick J. Wong @ 2025-12-03 0:53 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: aalbersh, linux-xfs
On Mon, Dec 01, 2025 at 11:38:46PM -0800, Christoph Hellwig wrote:
> On Mon, Dec 01, 2025 at 05:28:16PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Since the LTS is coming up, enable parent pointers and exchange-range by
> > default for all users. Also fix up an out of date comment.
>
> Do you have any numbers that show the overhead or non-overhead of
> enabling rmap? It will increase the amount of metadata written quite
> a bit.
I'm assuming you're interested in the overhead of *parent pointers* and
not rmap since we turned on rmap by default back in 2023?
I created a really stupid benchmarking script that does:
#!/bin/bash
umount /opt
mkfs.xfs -f /dev/sdb -n parent=$1
mount /dev/sdb /opt
mkdir -p /opt/foo
for ((i=0;i<10;i++)); do
time fsstress -n 400000 -p 4 -z -f creat=1,mkdir=1,mknod=1,rmdir=1,unlink=1,link=1,rename=1 -d /opt/foo -s 1
done
# ./dumb.sh 0
meta-data=/dev/sdb isize=512 agcount=4, agsize=1298176 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=1
= reflink=1 bigtime=1 inobtcount=1 nrext64=1
= exchange=1 metadir=0
data = bsize=4096 blocks=5192704, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1, parent=0
log =internal log bsize=4096 blocks=16384, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
= rgcount=0 rgsize=0 extents
= zoned=0 start=0 reserved=0
Discarding blocks...Done.
real 0m18.807s
user 0m2.169s
sys 0m54.013s
real 0m13.845s
user 0m2.005s
sys 0m34.048s
real 0m14.019s
user 0m1.931s
sys 0m36.086s
real 0m14.435s
user 0m2.105s
sys 0m35.845s
real 0m14.823s
user 0m1.920s
sys 0m35.528s
real 0m14.181s
user 0m2.013s
sys 0m35.775s
real 0m14.281s
user 0m1.865s
sys 0m36.240s
real 0m13.638s
user 0m1.933s
sys 0m35.642s
real 0m13.553s
user 0m1.904s
sys 0m35.084s
real 0m13.963s
user 0m1.979s
sys 0m35.724s
# ./dumb.sh 1
meta-data=/dev/sdb isize=512 agcount=4, agsize=1298176 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=1
= reflink=1 bigtime=1 inobtcount=1 nrext64=1
= exchange=1 metadir=0
data = bsize=4096 blocks=5192704, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1, parent=1
log =internal log bsize=4096 blocks=16384, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
= rgcount=0 rgsize=0 extents
= zoned=0 start=0 reserved=0
Discarding blocks...Done.
real 0m20.654s
user 0m2.374s
sys 1m4.441s
real 0m14.255s
user 0m1.990s
sys 0m36.749s
real 0m14.553s
user 0m1.931s
sys 0m36.606s
real 0m13.855s
user 0m1.767s
sys 0m36.467s
real 0m14.606s
user 0m2.073s
sys 0m37.255s
real 0m13.706s
user 0m1.942s
sys 0m36.294s
real 0m14.177s
user 0m2.017s
sys 0m36.528s
real 0m15.310s
user 0m2.164s
sys 0m37.720s
real 0m14.099s
user 0m2.013s
sys 0m37.062s
real 0m14.067s
user 0m2.068s
sys 0m36.552s
As you can see, there's a noticeable increase in the runtime of the
first fsstress invocation, but for the subsequent runs there's not much
of a difference. I think the parent pointer log items usually complete
in a single log checkpoint and are usually omitted from the log. In the
common case of a single parent and an inline xattr area, the overhead is
basically zero because we're just writing to the attr fork's if_data and
not messing with xattr blocks.
If I remove the link=1 component from the fsstress -f argument so that
parent pointers are always operating out of the inline xattr area, then
the first parent=0 runtime is:
real 0m18.920s
user 0m2.559s
sys 1m0.991s
and the first parent=1 is:
real 0m20.458s
user 0m2.533s
sys 1m6.301s
I see more or less the same timings for the nine subsequent runs for
each parent= setting. I think it's safe to say the overhead ranges
between negligible and 10% on a cold new filesystem.
--D
* Re: [PATCH 1/2] mkfs: enable new features by default
2025-12-03 0:53 ` Darrick J. Wong
@ 2025-12-03 6:31 ` Christoph Hellwig
2025-12-04 18:48 ` Darrick J. Wong
0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2025-12-03 6:31 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Christoph Hellwig, aalbersh, linux-xfs
On Tue, Dec 02, 2025 at 04:53:45PM -0800, Darrick J. Wong wrote:
> On Mon, Dec 01, 2025 at 11:38:46PM -0800, Christoph Hellwig wrote:
> > On Mon, Dec 01, 2025 at 05:28:16PM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > >
> > > Since the LTS is coming up, enable parent pointers and exchange-range by
> > > default for all users. Also fix up an out of date comment.
> >
> > Do you have any numbers that show the overhead or non-overhead of
> > enabling rmap? It will increase the amount of metadata written quite
> > a bit.
>
> I'm assuming you're interested in the overhead of *parent pointers* and
> not rmap since we turned on rmap by default back in 2023?
Yes, sorry.
> I see more or less the same timings for the nine subsequent runs for
> each parent= setting. I think it's safe to say the overhead ranges
> between negligible and 10% on a cold new filesystem.
Should we document this clearly? Because this means at least some
workloads are going to see a performance decrease.
* Re: [PATCH 1/2] mkfs: enable new features by default
2025-12-03 6:31 ` Christoph Hellwig
@ 2025-12-04 18:48 ` Darrick J. Wong
0 siblings, 0 replies; 13+ messages in thread
From: Darrick J. Wong @ 2025-12-04 18:48 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: aalbersh, linux-xfs
On Tue, Dec 02, 2025 at 10:31:22PM -0800, Christoph Hellwig wrote:
> On Tue, Dec 02, 2025 at 04:53:45PM -0800, Darrick J. Wong wrote:
> > On Mon, Dec 01, 2025 at 11:38:46PM -0800, Christoph Hellwig wrote:
> > > On Mon, Dec 01, 2025 at 05:28:16PM -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > >
> > > > Since the LTS is coming up, enable parent pointers and exchange-range by
> > > > default for all users. Also fix up an out of date comment.
> > >
> > > Do you have any numbers that show the overhead or non-overhead of
> > > enabling rmap? It will increase the amount of metadata written quite
> > > a bit.
> >
> > I'm assuming you're interested in the overhead of *parent pointers* and
> > not rmap since we turned on rmap by default back in 2023?
>
> Yes, sorry.
>
> > I see more or less the same timings for the nine subsequent runs for
> > each parent= setting. I think it's safe to say the overhead ranges
> > between negligible and 10% on a cold new filesystem.
>
> Should we document this clearly? Because this means at least some
> workloads are going to see a performance decrease.
Yep. But first -- all those results are inaccurate because I forgot
that fsstress quietly ignores everything after the first op=freq
component of the optarg, so all that benchmark was doing was creating
millions of files in a single directory and never deleting anything.
That's why the subsequent runs were much faster -- most of those files
were already created.
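In the meantime, fsstress does accept one op=freq pair per -f flag, so
the intended mix can be expressed by repeating the flag. A rough sketch
(it echoes the command rather than running it, since the fsstress binary
location varies by setup):

```shell
# Build one -f flag per operation; fsstress parses a single op=freq per -f.
args=""
for op in creat mkdir mknod rmdir unlink link rename; do
	args="$args -f $op=1"
done
cmd="fsstress -n 400000 -p 4 -z$args -d /opt/foo -s 1"
echo "$cmd"
```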
So I'll send a patch to fstests to fix that behavior. With that, the
benchmark that I alleged I was running produces these numbers when
creating a directory tree of only empty files:
naming =version 2 bsize=4096 ascii-ci=0, ftype=1, parent=1
real 0m12.742s
user 0m28.074s
sys 0m10.839s
real 0m13.469s
user 0m25.827s
sys 0m11.816s
real 0m11.352s
user 0m22.602s
sys 0m11.275s
naming =version 2 bsize=4096 ascii-ci=0, ftype=1, parent=0
real 0m12.782s
user 0m28.892s
sys 0m8.897s
real 0m13.591s
user 0m25.371s
sys 0m9.601s
real 0m10.012s
user 0m20.849s
sys 0m9.018s
Almost no difference here! If I add in write=1 then there's a 5%
decrease going to parent=1:
naming =version 2 bsize=4096 ascii-ci=0, ftype=1, parent=1
real 0m15.020s
user 0m22.358s
sys 0m14.827s
real 0m17.196s
user 0m22.888s
sys 0m15.586s
real 0m16.668s
user 0m21.709s
sys 0m15.425s
naming =version 2 bsize=4096 ascii-ci=0, ftype=1, parent=0
real 0m14.808s
user 0m22.266s
sys 0m12.843s
real 0m16.323s
user 0m22.409s
sys 0m13.695s
real 0m15.562s
user 0m21.740s
sys 0m12.927s
--D
* [PATCH 1/2] mkfs: enable new features by default
2025-12-09 16:16 [PATCHSET V2] xfsprogs: enable new stable features for 6.18 Darrick J. Wong
@ 2025-12-09 16:16 ` Darrick J. Wong
2025-12-09 16:22 ` Christoph Hellwig
2025-12-09 22:25 ` Dave Chinner
0 siblings, 2 replies; 13+ messages in thread
From: Darrick J. Wong @ 2025-12-09 16:16 UTC (permalink / raw)
To: djwong, aalbersh; +Cc: linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Since the LTS is coming up, enable parent pointers and exchange-range by
default for all users. Also fix up an out of date comment.
I created a really stupid benchmarking script that does:
#!/bin/bash
# pptr overhead benchmark
umount /opt /mnt
rmmod xfs
for i in 1 0; do
umount /opt
mkfs.xfs -f /dev/sdb -n parent=$i | grep -i parent=
mount /dev/sdb /opt
mkdir -p /opt/foo
for ((i=0;i<5;i++)); do
time fsstress -n 100000 -p 4 -z -f creat=1 -d /opt/foo -s 1
done
done
This is the result of creating an enormous number of empty files in a
single directory:
# ./dumb.sh
naming =version 2 bsize=4096 ascii-ci=0, ftype=1, parent=0
real 0m18.807s
user 0m2.169s
sys 0m54.013s
naming =version 2 bsize=4096 ascii-ci=0, ftype=1, parent=1
real 0m20.654s
user 0m2.374s
sys 1m4.441s
As you can see, there's a 10% increase in runtime here. If I make the
workload a bit more representative by changing the -f argument to
include a directory tree workout:
-f creat=1,mkdir=1,mknod=1,rmdir=1,unlink=1,link=1,rename=1
naming =version 2 bsize=4096 ascii-ci=0, ftype=1, parent=1
real 0m12.742s
user 0m28.074s
sys 0m10.839s
naming =version 2 bsize=4096 ascii-ci=0, ftype=1, parent=0
real 0m12.782s
user 0m28.892s
sys 0m8.897s
Almost no difference here. If I then actually write to the regular
files by adding:
-f write=1
naming =version 2 bsize=4096 ascii-ci=0, ftype=1, parent=1
real 0m16.668s
user 0m21.709s
sys 0m15.425s
naming =version 2 bsize=4096 ascii-ci=0, ftype=1, parent=0
real 0m15.562s
user 0m21.740s
sys 0m12.927s
So that's about a 2% difference.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
mkfs/xfs_mkfs.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index 8f5a6fa5676453..8db51217016eb0 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -1044,7 +1044,7 @@ struct sb_feat_args {
bool inode_align; /* XFS_SB_VERSION_ALIGNBIT */
bool nci; /* XFS_SB_VERSION_BORGBIT */
bool lazy_sb_counters; /* XFS_SB_VERSION2_LAZYSBCOUNTBIT */
- bool parent_pointers; /* XFS_SB_VERSION2_PARENTBIT */
+ bool parent_pointers; /* XFS_SB_FEAT_INCOMPAT_PARENT */
bool projid32bit; /* XFS_SB_VERSION2_PROJID32BIT */
bool crcs_enabled; /* XFS_SB_VERSION2_CRCBIT */
bool dirftype; /* XFS_SB_VERSION2_FTYPE */
@@ -5984,11 +5984,12 @@ main(
.rmapbt = true,
.reflink = true,
.inobtcnt = true,
- .parent_pointers = false,
+ .parent_pointers = true,
.nodalign = false,
.nortalign = false,
.bigtime = true,
.nrext64 = true,
+ .exchrange = true,
/*
* When we decide to enable a new feature by default,
* please remember to update the mkfs conf files.
* Re: [PATCH 1/2] mkfs: enable new features by default
2025-12-09 16:16 ` [PATCH 1/2] mkfs: enable new features by default Darrick J. Wong
@ 2025-12-09 16:22 ` Christoph Hellwig
2025-12-09 22:25 ` Dave Chinner
1 sibling, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2025-12-09 16:22 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: aalbersh, linux-xfs
On Tue, Dec 09, 2025 at 08:16:08AM -0800, Darrick J. Wong wrote:
> Almost no difference here. If I then actually write to the regular
> files by adding:
>
> -f write=1
...
> So that's about a 2% difference.
Let's hope no one complains given that the parent pointers are useful:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH 1/2] mkfs: enable new features by default
2025-12-09 16:16 ` [PATCH 1/2] mkfs: enable new features by default Darrick J. Wong
2025-12-09 16:22 ` Christoph Hellwig
@ 2025-12-09 22:25 ` Dave Chinner
2025-12-10 23:49 ` Darrick J. Wong
1 sibling, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2025-12-09 22:25 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: aalbersh, linux-xfs
On Tue, Dec 09, 2025 at 08:16:08AM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Since the LTS is coming up, enable parent pointers and exchange-range by
> default for all users. Also fix up an out of date comment.
>
> I created a really stupid benchmarking script that does:
>
> #!/bin/bash
>
> # pptr overhead benchmark
>
> umount /opt /mnt
> rmmod xfs
> for i in 1 0; do
> umount /opt
> mkfs.xfs -f /dev/sdb -n parent=$i | grep -i parent=
> mount /dev/sdb /opt
> mkdir -p /opt/foo
> for ((i=0;i<5;i++)); do
> time fsstress -n 100000 -p 4 -z -f creat=1 -d /opt/foo -s 1
> done
> done
Hmmm. fsstress is an interesting choice here...
> This is the result of creating an enormous number of empty files in a
> single directory:
>
> # ./dumb.sh
> naming =version 2 bsize=4096 ascii-ci=0, ftype=1, parent=0
> real 0m18.807s
> user 0m2.169s
> sys 0m54.013s
>
> naming =version 2 bsize=4096 ascii-ci=0, ftype=1, parent=1
> real 0m20.654s
> user 0m2.374s
> sys 1m4.441s
Yeah, that's only creating 20,000 files/sec. That's a lot less than
I'd expect a single thread to be able to do - why is the kernel
burning all 4 CPUs on this workload?
i.e. I'd expect a pure create workload to run at about 40,000
files/s with sleeping contention on the i_rwsem, but this is much
slower than I'd expect and contention is on a spinning lock...
Also, parent pointers add about 20% more system time overhead (54s
sys time to 64.4s sys time). Where does this come from? Do you have
kernel profiles? Is it PP overhead, a change in the contention
point, or just worse contention on the same resource?
> As you can see, there's a 10% increase in runtime here. If I make the
> workload a bit more representative by changing the -f argument to
> include a directory tree workout:
>
> -f creat=1,mkdir=1,mknod=1,rmdir=1,unlink=1,link=1,rename=1
>
>
> naming =version 2 bsize=4096 ascii-ci=0, ftype=1, parent=1
> real 0m12.742s
> user 0m28.074s
> sys 0m10.839s
>
> naming =version 2 bsize=4096 ascii-ci=0, ftype=1, parent=0
> real 0m12.782s
> user 0m28.892s
> sys 0m8.897s
Again, that's way slower than I'd expect a 4p metadata workload to
run through 400k modification ops. i.e. it's running at about 35k
ops/s, and I'd be expecting the baseline to be upwards of 100k
ops/s.
Ah, look at the amount of time spent in userspace - 28-29s vs 9-11s
spent in the kernel filesystem code.
Ok, performance is limited by the userspace code, not the kernel
code. I would expect a decent fs benchmark to be at most 10%
userspace CPU time, with >90% of the time being spent in the kernel
doing filesystem operations.
IOWs, there is way too much userspace overhead in this workload to
draw useful conclusions about the impact of the kernel side changes.
System time went up from 9s to 11s when parent pointers are turned
on - a 20% increase in CPU overhead - but that additional overhead
isn't reflected in the wall time results because the CPU overhead is
dominated by the userspace program, not the kernel code that is
being "measured".
> Almost no difference here.
Ah, no. Again, system time went up by ~20%, even though elapsed time
was unchanged. That implies there is some amount of sleeping
contention occurring between processes doing work, and the
additional CPU overhead of the PP code simply resulted in less sleep
time.
Again, this is not noticeable because the workload is dominated by
userspace CPU overhead, not the kernel/filesystem operation
overhead...
> If I then actually write to the regular
> files by adding:
>
> -f write=1
>
> naming =version 2 bsize=4096 ascii-ci=0, ftype=1, parent=1
> real 0m16.668s
> user 0m21.709s
> sys 0m15.425s
>
> naming =version 2 bsize=4096 ascii-ci=0, ftype=1, parent=0
> real 0m15.562s
> user 0m21.740s
> sys 0m12.927s
>
> So that's about a 2% difference.
Same here - system time went up by 25%, even though wall time didn't
change. Also, 15.5s to 16.6s increase in wall time is actually
a 7% difference in runtime, not 2%.
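Quick check with the exact numbers quoted above:

```shell
# Relative increase in percent, from the third-run wall times above.
pct() { awk -v b="$1" -v n="$2" 'BEGIN { printf "%.1f", (n - b) / b * 100 }'; }
pct 15.562 16.668	# parent=0 -> parent=1 wall time, prints 7.1
```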
----
Overall, I don't think the benchmarking documented here is
sufficient to justify the conclusion that "parent pointers have
little real world overhead so we can turn them on by default".
I would at least like to see the "will-it-scale" impact on a 64p
machine with a hundred GB of RAM and IO subsystem at least capable
of a million IOPS and a filesystem optimised for max performance
(e.g. highly parallel fsmark based workloads). This will push the
filesystem and CPU usage to their actual limits and directly expose
additional overhead and new contention points in the results.
This is also much more representative of the sorts of high
performance, high end deployments that we expect XFS to be deployed
on, and where performance impact actually matters to users.
i.e. we need to know what the impact of the change is on the high
end as well as low end VM/desktop configs before any conclusion can
be drawn w.r.t. changing the parent pointer default setting....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 1/2] mkfs: enable new features by default
2025-12-09 22:25 ` Dave Chinner
@ 2025-12-10 23:49 ` Darrick J. Wong
2025-12-15 23:59 ` Dave Chinner
0 siblings, 1 reply; 13+ messages in thread
From: Darrick J. Wong @ 2025-12-10 23:49 UTC (permalink / raw)
To: Dave Chinner; +Cc: aalbersh, linux-xfs
On Wed, Dec 10, 2025 at 09:25:24AM +1100, Dave Chinner wrote:
> On Tue, Dec 09, 2025 at 08:16:08AM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Since the LTS is coming up, enable parent pointers and exchange-range by
> > default for all users. Also fix up an out of date comment.
> >
> > I created a really stupid benchmarking script that does:
> >
> > #!/bin/bash
> >
> > # pptr overhead benchmark
> >
> > umount /opt /mnt
> > rmmod xfs
> > for i in 1 0; do
> > umount /opt
> > mkfs.xfs -f /dev/sdb -n parent=$i | grep -i parent=
> > mount /dev/sdb /opt
> > mkdir -p /opt/foo
> > for ((i=0;i<5;i++)); do
> > time fsstress -n 100000 -p 4 -z -f creat=1 -d /opt/foo -s 1
> > done
> > done
>
> Hmmm. fsstress is an interesting choice here...
<flush all the old benchmarks and conclusions>
I have an old 40-core Xeon E5-2660V3 with a pair of 1.5T Intel nvme ssds
and 128G of RAM running 6.18.0. For this sample, I tried to keep the
memory usage well below the amount of DRAM so that I could measure the
pure overhead of writing parent pointers out to disk and not anything
else. I also omit ls'ing and chmod'ing the directory tree because
neither of those operations touch parent pointers. I also left the
logbsize at the defaults (32k) because that's what most users get.
Here I'm using the following benchmark program, compiled from various
suggestions from dchinner over the years:
#!/bin/bash -x
iter=8
feature="-n parent"
filesz=0
subdirs=10000
files_per_iter=100000
writesz=16384
mkdirme() {
set +x
local i
for ((i=0;i<agcount;i++)); do
mkdir -p /nvme/$i
dirs+=(-d /nvme/$i)
done
set -x
}
bulkme() {
set +x
local i
for ((i=0;i<agcount;i++)); do
xfs_io -c "bulkstat -a $i -q" /nvme &
done
wait
set -x
}
rmdirme() {
set +x
local i
for dir in "${dirs[@]}"; do
rm -r -f "${dir}" &
done
wait
set -x
}
benchme() {
agcount="$(xfs_info /nvme/ | grep agcount= | sed -e 's/^.*agcount=//g' -e 's/,.*$//g')"
dirs=()
mkdirme
#time ~djwong/cdev/work/fstests/build-x86_64/ltp/fsstress -n 400000 -p 40 -z -f creat=1,mkdir=1,rmdir=1,unlink=1 -d /nvme/ -s 1
time fs_mark -w "${writesz}" -D "${subdirs}" -S 0 -n "${files_per_iter}" -s "${filesz}" -L "${iter}" "${dirs[@]}"
time bulkme
time rmdirme
}
for p in 0 1; do
umount /dev/nvme1n1 /nvme /mnt
#mkfs.xfs -f -l logdev=/dev/nvme0n1,size=1g /dev/nvme1n1 -n parent=$p || break
mkfs.xfs -f -l logdev=/dev/nvme0n1,size=1g /dev/nvme1n1 $feature=$p || break
mount /dev/nvme1n1 /nvme/ -o logdev=/dev/nvme0n1 || break
benchme
umount /dev/nvme1n1 /nvme /mnt
done
I get this mkfs output:
# mkfs.xfs -f -l logdev=/dev/nvme0n1,size=1g /dev/nvme1n1
meta-data=/dev/nvme1n1 isize=512 agcount=40, agsize=9767586 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=1
= reflink=1 bigtime=1 inobtcount=1 nrext64=1
= exchange=0 metadir=0
data = bsize=4096 blocks=390703440, imaxpct=5
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1, parent=0
log =/dev/nvme0n1 bsize=4096 blocks=262144, version=2
= sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
= rgcount=0 rgsize=0 extents
= zoned=0 start=0 reserved=0
# grep nvme1n1 /proc/mounts
/dev/nvme1n1 /nvme xfs rw,relatime,inode64,logbufs=8,logbsize=32k,logdev=/dev/nvme0n1,noquota 0 0
and this output from fsmark with parent=0:
# fs_mark -D 10000 -S 0 -n 100000 -s 0 -L 8 -d /nvme/0 -d /nvme/1 -d /nvme/2 -d /nvme/3 -d /nvme/4 -d /nvme/5 -d /nvme/6 -d /nvme/7 -d /nvme/8 -d /nvme/9 -d /nvme/10 -d /nvme/11 -d /nvme/12 -d /nvme/13 -d /nvme/14 -d /nvme/15 -d /nvme/16 -d /nvme/17 -d /nvme/18 -d /nvme/19 -d /nvme/20 -d /nvme/21 -d /nvme/22 -d /nvme/23 -d /nvme/24 -d /nvme/25 -d /nvme/26 -d /nvme/27 -d /nvme/28 -d /nvme/29 -d /nvme/30 -d /nvme/31 -d /nvme/32 -d /nvme/33 -d /nvme/34 -d /nvme/35 -d /nvme/36 -d /nvme/37 -d /nvme/38 -d /nvme/39
# Version 3.3, 40 thread(s) starting at Wed Dec 10 14:22:07 2025
# Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
# Directories: Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
# File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
# Files info: size 0 bytes, written with an IO size of 16384 bytes per write
# App overhead is time in microseconds spent in the test not doing file writing related system calls.
FSUse% Count Size Files/sec App Overhead
2 4000000 0 566680.9 31398816
2 8000000 0 665535.6 30037368
2 12000000 0 537227.6 31726557
2 16000000 0 538133.9 32411165
2 20000000 0 619369.6 30790676
2 24000000 0 600018.2 31583349
2 28000000 0 607209.8 31193980
3 32000000 0 533240.7 32277102
real 0m57.573s
user 3m53.578s
sys 19m44.440s
+ bulkme
+ set +x
real 0m1.122s
user 0m0.955s
sys 0m39.306s
+ rmdirme
+ set +x
real 0m59.649s
user 0m41.196s
sys 13m9.566s
I limited this to 8 iterations so I could post some preliminary results
after a few minutes. Now let's try again with parent=1:
+ fs_mark -D 10000 -S 0 -n 100000 -s 0 -L 8 -d /nvme/0 -d /nvme/1 -d /nvme/2 -d /nvme/3 -d /nvme/4 -d /nvme/5 -d /nvme/6 -d /nvme/7 -d /nvme/8 -d /nvme/9 -d /nvme/10 -d /nvme/11 -d /nvme/12 -d /nvme/13 -d /nvme/14 -d /nvme/15 -d /nvme/16 -d /nvme/17 -d /nvme/18 -d /nvme/19 -d /nvme/20 -d /nvme/21 -d /nvme/22 -d /nvme/23 -d /nvme/24 -d /nvme/25 -d /nvme/26 -d /nvme/27 -d /nvme/28 -d /nvme/29 -d /nvme/30 -d /nvme/31 -d /nvme/32 -d /nvme/33 -d /nvme/34 -d /nvme/35 -d /nvme/36 -d /nvme/37 -d /nvme/38 -d /nvme/39
# fs_mark -D 10000 -S 0 -n 100000 -s 0 -L 8 -d /nvme/0 -d /nvme/1 -d /nvme/2 -d /nvme/3 -d /nvme/4 -d /nvme/5 -d /nvme/6 -d /nvme/7 -d /nvme/8 -d /nvme/9 -d /nvme/10 -d /nvme/11 -d /nvme/12 -d /nvme/13 -d /nvme/14 -d /nvme/15 -d /nvme/16 -d /nvme/17 -d /nvme/18 -d /nvme/19 -d /nvme/20 -d /nvme/21 -d /nvme/22 -d /nvme/23 -d /nvme/24 -d /nvme/25 -d /nvme/26 -d /nvme/27 -d /nvme/28 -d /nvme/29 -d /nvme/30 -d /nvme/31 -d /nvme/32 -d /nvme/33 -d /nvme/34 -d /nvme/35 -d /nvme/36 -d /nvme/37 -d /nvme/38 -d /nvme/39
# Version 3.3, 40 thread(s) starting at Wed Dec 10 14:24:44 2025
# Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
# Directories: Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
# File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
# Files info: size 0 bytes, written with an IO size of 16384 bytes per write
# App overhead is time in microseconds spent in the test not doing file writing related system calls.
FSUse% Count Size Files/sec App Overhead
2 4000000 0 543929.1 31344175
2 8000000 0 523736.2 31180565
2 12000000 0 522184.1 31700380
2 16000000 0 513468.0 32112498
2 20000000 0 543993.1 31910496
2 24000000 0 562760.1 32061910
2 28000000 0 524039.8 31825520
3 32000000 0 526028.8 31889193
real 1m2.934s
user 3m53.508s
sys 25m14.810s
+ bulkme
+ set +x
real 0m1.158s
user 0m0.882s
sys 0m39.847s
+ rmdirme
+ set +x
real 1m12.505s
user 0m47.489s
sys 20m33.844s
fs_mark itself shows a decrease in file creation rate of about 9%, an
increase in wall clock time of about 9%, and an increase in kernel time
of about 28%. That's to be expected, since parent pointer updates cause
directory entry creation and deletion to take more ILOCKs and to hold
them longer.
Parallel bulkstat (aka bulkme) shows an increase in wall time of 3% and
system time of 1%, which is not surprising since that's just walking the
inode btree and inode cores; no parent pointers involved.
Similarly, deleting all the files created by fs_mark shows an increase
in wall time of about 21% and an increase in system time of about 56%.
I concede that parent pointers have a fair amount of overhead for the
worst case of creating or deleting a large directory tree.
I reran this with logbsize=256k and while I saw a slight increase in
performance across the board, the overhead of pptrs is about the same
percentagewise.
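For reference, those percentages fall straight out of the quoted
timings (m:s values converted to seconds by hand):

```shell
# Relative increase in percent: pct <base> <new>
pct() { awk -v b="$1" -v n="$2" 'BEGIN { printf "%.0f", (n - b) / b * 100 }'; }
echo "fs_mark wall: $(pct 57.573 62.934)%"     # 0m57.573s -> 1m2.934s
echo "fs_mark sys:  $(pct 1184.440 1514.810)%" # 19m44.440s -> 25m14.810s
echo "rmdirme wall: $(pct 59.649 72.505)%"     # 0m59.649s -> 1m12.505s
echo "rmdirme sys:  $(pct 789.566 1233.844)%"  # 13m9.566s -> 20m33.844s
```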
If I then re-run the benchmark with a file size of 1M and tell it to
create fewer files, then I get the following for parent=0:
# fs_mark -D 1000 -S 0 -n 200 -s 1048576 -L 8 -d /nvme/0 -d /nvme/1 -d /nvme/2 -d /nvme/3 -d /nvme/4 -d /nvme/5 -d /nvme/6 -d /nvme/7 -d /nvme/8 -d /nvme/9 -d /nvme/10 -d /nvme/11 -d /nvme/12 -d /nvme/13 -d /nvme/14 -d /nvme/15 -d /nvme/16 -d /nvme/17 -d /nvme/18 -d /nvme/19 -d /nvme/20 -d /nvme/21 -d /nvme/22 -d /nvme/23 -d /nvme/24 -d /nvme/25 -d /nvme/26 -d /nvme/27 -d /nvme/28 -d /nvme/29 -d /nvme/30 -d /nvme/31 -d /nvme/32 -d /nvme/33 -d /nvme/34 -d /nvme/35 -d /nvme/36 -d /nvme/37 -d /nvme/38 -d /nvme/39
# Version 3.3, 40 thread(s) starting at Wed Dec 10 15:03:11 2025
# Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
# Directories: Time based hash between directories across 1000 subdirectories with 180 seconds per subdirectory.
# File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
# Files info: size 1048576 bytes, written with an IO size of 16384 bytes per write
# App overhead is time in microseconds spent in the test not doing file writing related system calls.
FSUse% Count Size Files/sec App Overhead
2 8000 1048576 1493.4 198379
2 16000 1048576 1327.0 255655
3 24000 1048576 1355.8 255105
4 32000 1048576 1352.3 253094
4 40000 1048576 1836.9 262258
5 48000 1048576 1337.6 246991
5 56000 1048576 1328.4 240303
6 64000 1048576 1165.9 237211
real 0m50.384s
user 0m7.640s
sys 1m43.187s
+ bulkme
+ set +x
real 0m0.023s
user 0m0.061s
sys 0m0.167s
+ rmdirme
+ set +x
real 0m0.675s
user 0m0.107s
sys 0m15.644s
and for parent=1:
# fs_mark -D 1000 -S 0 -n 200 -s 1048576 -L 8 -d /nvme/0 -d /nvme/1 -d /nvme/2 -d /nvme/3 -d /nvme/4 -d /nvme/5 -d /nvme/6 -d /nvme/7 -d /nvme/8 -d /nvme/9 -d /nvme/10 -d /nvme/11 -d /nvme/12 -d /nvme/13 -d /nvme/14 -d /nvme/15 -d /nvme/16 -d /nvme/17 -d /nvme/18 -d /nvme/19 -d /nvme/20 -d /nvme/21 -d /nvme/22 -d /nvme/23 -d /nvme/24 -d /nvme/25 -d /nvme/26 -d /nvme/27 -d /nvme/28 -d /nvme/29 -d /nvme/30 -d /nvme/31 -d /nvme/32 -d /nvme/33 -d /nvme/34 -d /nvme/35 -d /nvme/36 -d /nvme/37 -d /nvme/38 -d /nvme/39
# Version 3.3, 40 thread(s) starting at Wed Dec 10 15:04:41 2025
# Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
# Directories: Time based hash between directories across 1000 subdirectories with 180 seconds per subdirectory.
# File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
# Files info: size 1048576 bytes, written with an IO size of 16384 bytes per write
# App overhead is time in microseconds spent in the test not doing file writing related system calls.
FSUse% Count Size Files/sec App Overhead
2 8000 1048576 1963.9 254007
2 16000 1048576 1716.4 227074
3 24000 1048576 1052.5 264987
4 32000 1048576 1793.6 242288
4 40000 1048576 1364.2 249738
5 48000 1048576 1081.2 250394
5 56000 1048576 1342.0 260667
6 64000 1048576 1356.9 242324
real 0m49.256s
user 0m7.621s
sys 1m44.847s
+ bulkme
+ set +x
real 0m0.021s
user 0m0.060s
sys 0m0.176s
+ rmdirme
+ set +x
real 0m0.537s
user 0m0.108s
sys 0m15.453s
Here we see that the fs_mark creates/sec goes up by 4%, wall time
decreases by 3%, and the kernel time increases by 2% or so. The rmdir
wall time decreases by 2% and the kernel time by ~1%, which is quite
small. So for a more common case of populating a directory tree full of
big files with data in them, the overhead isn't all that noticeable.
I then decided to simulate my maildir spool, which has 670,000 files
consuming 12GB for an average file size of 17936 bytes. I reduced the
file size to 16K, increased the number of files per iteration, and set
the write buffer size to something not aligned to a block, and got this
for parent=0:
# fs_mark -w 778 -D 1000 -S 0 -n 6000 -s 16384 -L 8 -d /nvme/0 -d /nvme/1 -d /nvme/2 -d /nvme/3 -d /nvme/4 -d /nvme/5 -d /nvme/6 -d /nvme/7 -d /nvme/8 -d /nvme/9 -d /nvme/10 -d /nvme/11 -d /nvme/12 -d /nvme/13 -d /nvme/14 -d /nvme/15 -d /nvme/16 -d /nvme/17 -d /nvme/18 -d /nvme/19 -d /nvme/20 -d /nvme/21 -d /nvme/22 -d /nvme/23 -d /nvme/24 -d /nvme/25 -d /nvme/26 -d /nvme/27 -d /nvme/28 -d /nvme/29 -d /nvme/30 -d /nvme/31 -d /nvme/32 -d /nvme/33 -d /nvme/34 -d /nvme/35 -d /nvme/36 -d /nvme/37 -d /nvme/38 -d /nvme/39
# Version 3.3, 40 thread(s) starting at Wed Dec 10 15:21:38 2025
# Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
# Directories: Time based hash between directories across 1000 subdirectories with 180 seconds per subdirectory.
# File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
# Files info: size 16384 bytes, written with an IO size of 778 bytes per write
# App overhead is time in microseconds spent in the test not doing file writing related system calls.
FSUse% Count Size Files/sec App Overhead
2 240000 16384 40085.3 2492281
2 480000 16384 37026.7 2780077
2 720000 16384 28445.5 2591461
3 960000 16384 28888.6 2595817
3 1200000 16384 25160.8 2903882
3 1440000 16384 29372.1 2600018
3 1680000 16384 26443.9 2732790
4 1920000 16384 26307.1 2758750
real 1m11.633s
user 0m46.156s
sys 3m24.543s
+ bulkme
+ set +x
real 0m0.091s
user 0m0.111s
sys 0m2.461s
+ rmdirme
+ set +x
real 0m9.364s
user 0m2.245s
sys 0m47.221s
and this for parent=1
# fs_mark -w 778 -D 1000 -S 0 -n 6000 -s 16384 -L 8 -d /nvme/0 -d /nvme/1 -d /nvme/2 -d /nvme/3 -d /nvme/4 -d /nvme/5 -d /nvme/6 -d /nvme/7 -d /nvme/8 -d /nvme/9 -d /nvme/10 -d /nvme/11 -d /nvme/12 -d /nvme/13 -d /nvme/14 -d /nvme/15 -d /nvme/16 -d /nvme/17 -d /nvme/18 -d /nvme/19 -d /nvme/20 -d /nvme/21 -d /nvme/22 -d /nvme/23 -d /nvme/24 -d /nvme/25 -d /nvme/26 -d /nvme/27 -d /nvme/28 -d /nvme/29 -d /nvme/30 -d /nvme/31 -d /nvme/32 -d /nvme/33 -d /nvme/34 -d /nvme/35 -d /nvme/36 -d /nvme/37 -d /nvme/38 -d /nvme/39
# Version 3.3, 40 thread(s) starting at Wed Dec 10 15:23:38 2025
# Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
# Directories: Time based hash between directories across 1000 subdirectories with 180 seconds per subdirectory.
# File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
# Files info: size 16384 bytes, written with an IO size of 778 bytes per write
# App overhead is time in microseconds spent in the test not doing file writing related system calls.
FSUse% Count Size Files/sec App Overhead
2 240000 16384 39340.1 2627066
2 480000 16384 27727.2 2925494
2 720000 16384 28305.4 2597191
2 960000 16384 24891.6 2834421
3 1200000 16384 27964.8 2810556
3 1440000 16384 27204.6 2776783
3 1680000 16384 25745.2 2779197
3 1920000 16384 24674.9 2752721
real 1m14.422s
user 0m46.607s
sys 3m38.777s
+ bulkme
+ set +x
real 0m0.081s
user 0m0.123s
sys 0m2.408s
+ rmdirme
+ set +x
real 0m9.306s
user 0m2.570s
sys 1m10.598s
fs_mark shows a 7% decrease in creates/sec, a 4% increase in wall time,
and a 7% increase in kernel time. bulkstat is, as usual, not that
different, and deletion shows an increase in kernel time of 50%.
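Those percentages fall straight out of the time(1) output above; a
quick helper (not part of the benchmark script) to double-check the
two kernel-time deltas:

```shell
# convert a time(1) "XmY.YYs" reading to seconds, then percent change
to_s() { echo "$1" | awk -Fm '{ sub(/s$/, "", $2); print $1 * 60 + $2 }'; }
pct()  { awk -v a="$(to_s "$1")" -v b="$(to_s "$2")" \
             'BEGIN { printf "%.1f\n", (b - a) * 100 / a }'; }
pct 3m24.543s 3m38.777s   # fs_mark sys time, parent=0 -> 1: prints 7.0
pct 0m47.221s 1m10.598s   # rmdir sys time, parent=0 -> 1: prints 49.5
```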
Conclusion: There are noticeable overheads to enabling parent pointers,
but counterbalancing that, we can now repair an entire filesystem,
directory tree and all.
--D
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/2] mkfs: enable new features by default
2025-12-10 23:49 ` Darrick J. Wong
@ 2025-12-15 23:59 ` Dave Chinner
2025-12-16 23:07 ` Darrick J. Wong
0 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2025-12-15 23:59 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: aalbersh, linux-xfs
On Wed, Dec 10, 2025 at 03:49:28PM -0800, Darrick J. Wong wrote:
> On Wed, Dec 10, 2025 at 09:25:24AM +1100, Dave Chinner wrote:
> > On Tue, Dec 09, 2025 at 08:16:08AM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > >
> > > Since the LTS is coming up, enable parent pointers and exchange-range by
> > > default for all users. Also fix up an out of date comment.
> > >
> > > I created a really stupid benchmarking script that does:
> > >
> > > #!/bin/bash
> > >
> > > # pptr overhead benchmark
> > >
> > > umount /opt /mnt
> > > rmmod xfs
> > > for i in 1 0; do
> > > umount /opt
> > > mkfs.xfs -f /dev/sdb -n parent=$i | grep -i parent=
> > > mount /dev/sdb /opt
> > > mkdir -p /opt/foo
> > > for ((i=0;i<5;i++)); do
> > > time fsstress -n 100000 -p 4 -z -f creat=1 -d /opt/foo -s 1
> > > done
> > > done
> >
> > Hmmm. fsstress is an interesting choice here...
>
> <flush all the old benchmarks and conclusions>
>
> I have an old 40-core Xeon E5-2660V3 with a pair of 1.5T Intel nvme ssds
> and 128G of RAM running 6.18.0. For this sample, I tried to keep the
> memory usage well below the amount of DRAM so that I could measure the
> pure overhead of writing parent pointers out to disk and not anything
> else. I also omit ls'ing and chmod'ing the directory tree because
> neither of those operations touch parent pointers. I also left the
> logbsize at the defaults (32k) because that's what most users get.
ok.
.....
> benchme() {
> agcount="$(xfs_info /nvme/ | grep agcount= | sed -e 's/^.*agcount=//g' -e 's/,.*$//g')"
> dirs=()
> mkdirme
>
> #time ~djwong/cdev/work/fstests/build-x86_64/ltp/fsstress -n 400000 -p 40 -z -f creat=1,mkdir=1,rmdir=1,unlink=1 -d /nvme/ -s 1
> time fs_mark -w "${writesz}" -D "${subdirs}" -S 0 -n "${files_per_iter}" -s "${filesz}" -L "${iter}" "${dirs[@]}"
>
> time bulkme
> time rmdirme
Ok, so this is testing cache-hot bulkstat and rm, so it's not
exercising the cold-read path and hence does not need to read and
initialise parent pointers for unlinking. Can you drop caches
between the bulkstat and the unlink phases so we exercise cold-cache
parent pointer instantiation overhead somewhere?
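Something like this between the two phases would do it (a sketch; the
sysctl write needs root):

```shell
# drop clean pagecache, dentries and inodes so the unlink phase starts cold
drop_caches() {
	sync	# write back dirty data first; drop_caches only evicts clean objects
	echo 3 > /proc/sys/vm/drop_caches	# 3 = pagecache + reclaimable slab
}
```

i.e. call it after "time bulkme" and before "time rmdirme".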
> }
>
> for p in 0 1; do
> umount /dev/nvme1n1 /nvme /mnt
> #mkfs.xfs -f -l logdev=/dev/nvme0n1,size=1g /dev/nvme1n1 -n parent=$p || break
> mkfs.xfs -f -l logdev=/dev/nvme0n1,size=1g /dev/nvme1n1 $feature=$p || break
> mount /dev/nvme1n1 /nvme/ -o logdev=/dev/nvme0n1 || break
> benchme
> umount /dev/nvme1n1 /nvme /mnt
> done
>
> I get this mkfs output:
> # mkfs.xfs -f -l logdev=/dev/nvme0n1,size=1g /dev/nvme1n1
> meta-data=/dev/nvme1n1 isize=512 agcount=40, agsize=9767586 blks
> = sectsz=4096 attr=2, projid32bit=1
> = crc=1 finobt=1, sparse=1, rmapbt=1
> = reflink=1 bigtime=1 inobtcount=1 nrext64=1
> = exchange=0 metadir=0
> data = bsize=4096 blocks=390703440, imaxpct=5
> = sunit=0 swidth=0 blks
> naming =version 2 bsize=4096 ascii-ci=0, ftype=1, parent=0
> log =/dev/nvme0n1 bsize=4096 blocks=262144, version=2
> = sectsz=4096 sunit=1 blks, lazy-count=1
> realtime =none extsz=4096 blocks=0, rtextents=0
> = rgcount=0 rgsize=0 extents
> = zoned=0 start=0 reserved=0
> # grep nvme1n1 /proc/mounts
> /dev/nvme1n1 /nvme xfs rw,relatime,inode64,logbufs=8,logbsize=32k,logdev=/dev/nvme0n1,noquota 0 0
>
> and this output from fsmark with parent=0:
....
a table-based summary would have made this easier to read
parent real user sys
create 0 0m57.573s 3m53.578s 19m44.440s
create 1 1m2.934s 3m53.508s 25m14.810s
bulk 0 0m1.122s 0m0.955s 0m39.306s
bulk 1 0m1.158s 0m0.882s 0m39.847s
unlink 0 0m59.649s 0m41.196s 13m9.566s
unlink 1 1m12.505s 0m47.489s 20m33.844s
> fs_mark itself shows a decrease in file creation/sec of about 9%, an
> increase in wall clock time of about 9%, and an increase in kernel time
> of about 28%. That's to be expected, since parent pointer updates cause
> directory entry creation and deletion to hold more ILOCKs and for
> longer.
ILOCK isn't an issue with this test - the whole point of the
segmented directory structure is that each thread operates in its
own directory, so there is no ILOCK contention at all. i.e. the
entire difference is the CPU overhead of adding the xattr fork
and creating the parent pointer xattr.
I suspect that the create side overhead is probably acceptable,
because we also typically add security xattrs at create time and
these will be slightly faster as the xattr fork is already
prepared...
> Parallel bulkstat (aka bulkme) shows an increase in wall time of 3% and
> system time of 1%, which is not surprising since that's just walking the
> inode btree and inode cores, no parent pointers involved.
I was more interested in the cold cache behaviour - hot cache is
generally uninteresting as the XFS inode cache scales pretty much
perfectly in this case. Reading the inodes from disk, OTOH, adds a
whole heap of instantiation and lock contention overhead and changes
the picture significantly. I'm interested to know what the impact of
having PPs is in that case....
> Similarly, deleting all the files created by fs_mark shows an increase
> in wall time of about 21% and an increase in system time of about 56%.
> I concede that parent pointers have a fair amount of overhead for the
> worst case of creating a large directory tree or deleting it.
Ok, so an increase in unlink CPU overhead of 56% is pretty bad. On
single threaded workloads, that's going to equate to a ~50%
reduction in performance for operations that perform unlinks in CPU
bound loops (e.g. rm -rf on hot caches). Note that the above test is
not CPU bound - it's only running at about 50% CPU utilisation
because of some other contention point in the fs (possibly log space
or pinned/stale directory buffers requiring a log force to clear).
However, results like this make me think that PP unlink hasn't been
optimised for the common case: removing the last parent pointer
(i.e. nlink 1 -> 0) when the inode is being placed on the unlinked
list in syscall context. This is the common case in the absence of
hard links, and it puts the PP xattr removal directly in application
task context.
In this case, it seems to me that we don't actually need
to remove the parent pointer xattr. When the inode is inactivated by
background inodegc after last close, the xattr fork is truncated and
that will remove all xattrs including the stale remaining PP without
needing to make a specific PP transaction.
Doing this would remove the PP overhead completely from the final
unlink syscall path. It would only add minimal extra overhead on
the inodegc side as (in the common case) we have to remove security
xattrs in inodegc.
Hence I think we really need to try to mitigate this common case
overhead before we make PP the default for everyone. The perf
decrease
> If I then re-run the benchmark with a file size of 1M and tell it to
> create fewer files, then I get the following for parent=0:
These are largely meaningless as the create benchmark is throttling
hard on disk bandwidth (1.5-2GB/s) in the write() path, not limited
by PP overhead.
The variance in runtime comes from the data IO path behaviour, and
the lack of sync() operations after the create means that writeback
is likely still running when the unlink phase runs. Hence it's
pretty difficult to conclude anything about parent pointers
themselves because of the other large sources of variance in this
workload.
> I then decided to simulate my maildir spool, which has 670,000 files
> consuming 12GB for an average file size of 17936 bytes. I reduced the
> file size to 16K, increased the number of files per iteration, and set
> the write buffer size to something not aligned to a block, and got this
> for parent=0:
Same again, but this time the writeback thread will be seeing
delalloc latencies w.r.t. AGF locks vs incoming directory and inode
chunk allocation operations. That can be seen by:
>
> # fs_mark -w 778 -D 1000 -S 0 -n 6000 -s 16384 -L 8 -d /nvme/0 -d /nvme/1 -d /nvme/2 -d /nvme/3 -d /nvme/4 -d /nvme/5 -d /nvme/6 -d /nvme/7 -d /nvme/8 -d /nvme/9 -d /nvme/10 -d /nvme/11 -d /nvme/12 -d /nvme/13 -d /nvme/14 -d /nvme/15 -d /nvme/16 -d /nvme/17 -d /nvme/18 -d /nvme/19 -d /nvme/20 -d /nvme/21 -d /nvme/22 -d /nvme/23 -d /nvme/24 -d /nvme/25 -d /nvme/26 -d /nvme/27 -d /nvme/28 -d /nvme/29 -d /nvme/30 -d /nvme/31 -d /nvme/32 -d /nvme/33 -d /nvme/34 -d /nvme/35 -d /nvme/36 -d /nvme/37 -d /nvme/38 -d /nvme/39
> # Version 3.3, 40 thread(s) starting at Wed Dec 10 15:21:38 2025
> # Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
> # Directories: Time based hash between directories across 1000 subdirectories with 180 seconds per subdirectory.
> # File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
> # Files info: size 16384 bytes, written with an IO size of 778 bytes per write
> # App overhead is time in microseconds spent in the test not doing file writing related system calls.
>
> FSUse% Count Size Files/sec App Overhead
> 2 240000 16384 40085.3 2492281
> 2 480000 16384 37026.7 2780077
> 2 720000 16384 28445.5 2591461
> 3 960000 16384 28888.6 2595817
> 3 1200000 16384 25160.8 2903882
> 3 1440000 16384 29372.1 2600018
> 3 1680000 16384 26443.9 2732790
> 4 1920000 16384 26307.1 2758750
>
> real 1m11.633s
> user 0m46.156s
> sys 3m24.543s
.. creates only managing ~270% CPU utilisation for a 40-way
operation.
IOWs, parent pointer overhead is noise compared to the losses caused
by data writeback locking/throttling interactions, so nothing can
really be concluded from the data here.
> Conclusion: There are noticeable overheads to enabling parent pointers,
> but counterbalancing that, we can now repair an entire filesystem,
> directory tree and all.
True, but I think that the unlink overhead is significant enough
that we need to address that before enabling PP by default for
everyone.
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/2] mkfs: enable new features by default
2025-12-15 23:59 ` Dave Chinner
@ 2025-12-16 23:07 ` Darrick J. Wong
0 siblings, 0 replies; 13+ messages in thread
From: Darrick J. Wong @ 2025-12-16 23:07 UTC (permalink / raw)
To: Dave Chinner; +Cc: aalbersh, linux-xfs
On Tue, Dec 16, 2025 at 10:59:42AM +1100, Dave Chinner wrote:
> On Wed, Dec 10, 2025 at 03:49:28PM -0800, Darrick J. Wong wrote:
> > On Wed, Dec 10, 2025 at 09:25:24AM +1100, Dave Chinner wrote:
> > > On Tue, Dec 09, 2025 at 08:16:08AM -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > >
> > > > Since the LTS is coming up, enable parent pointers and exchange-range by
> > > > default for all users. Also fix up an out of date comment.
> > > >
> > > > I created a really stupid benchmarking script that does:
> > > >
> > > > #!/bin/bash
> > > >
> > > > # pptr overhead benchmark
> > > >
> > > > umount /opt /mnt
> > > > rmmod xfs
> > > > for i in 1 0; do
> > > > umount /opt
> > > > mkfs.xfs -f /dev/sdb -n parent=$i | grep -i parent=
> > > > mount /dev/sdb /opt
> > > > mkdir -p /opt/foo
> > > > for ((i=0;i<5;i++)); do
> > > > time fsstress -n 100000 -p 4 -z -f creat=1 -d /opt/foo -s 1
> > > > done
> > > > done
> > >
> > > Hmmm. fsstress is an interesting choice here...
> >
> > <flush all the old benchmarks and conclusions>
> >
> > I have an old 40-core Xeon E5-2660V3 with a pair of 1.5T Intel nvme ssds
> > and 128G of RAM running 6.18.0. For this sample, I tried to keep the
> > memory usage well below the amount of DRAM so that I could measure the
> > pure overhead of writing parent pointers out to disk and not anything
> > else. I also omit ls'ing and chmod'ing the directory tree because
> > neither of those operations touch parent pointers. I also left the
> > logbsize at the defaults (32k) because that's what most users get.
>
> ok.
>
> .....
>
> > benchme() {
> > agcount="$(xfs_info /nvme/ | grep agcount= | sed -e 's/^.*agcount=//g' -e 's/,.*$//g')"
> > dirs=()
> > mkdirme
> >
> > #time ~djwong/cdev/work/fstests/build-x86_64/ltp/fsstress -n 400000 -p 40 -z -f creat=1,mkdir=1,rmdir=1,unlink=1 -d /nvme/ -s 1
> > time fs_mark -w "${writesz}" -D "${subdirs}" -S 0 -n "${files_per_iter}" -s "${filesz}" -L "${iter}" "${dirs[@]}"
> >
> > time bulkme
> > time rmdirme
>
> Ok, so this is testing cache-hot bulkstat and rm, so it's not
> exercising the cold-read path and hence does not need to read and
> initialise parent pointers for unlinking. Can you drop caches
> between the bulkstat and the unlink phases so we exercise cold-cache
> parent pointer instantiation overhead somewhere?
>
> > }
> >
> > for p in 0 1; do
> > umount /dev/nvme1n1 /nvme /mnt
> > #mkfs.xfs -f -l logdev=/dev/nvme0n1,size=1g /dev/nvme1n1 -n parent=$p || break
> > mkfs.xfs -f -l logdev=/dev/nvme0n1,size=1g /dev/nvme1n1 $feature=$p || break
> > mount /dev/nvme1n1 /nvme/ -o logdev=/dev/nvme0n1 || break
> > benchme
> > umount /dev/nvme1n1 /nvme /mnt
> > done
> >
> > I get this mkfs output:
> > # mkfs.xfs -f -l logdev=/dev/nvme0n1,size=1g /dev/nvme1n1
> > meta-data=/dev/nvme1n1 isize=512 agcount=40, agsize=9767586 blks
> > = sectsz=4096 attr=2, projid32bit=1
> > = crc=1 finobt=1, sparse=1, rmapbt=1
> > = reflink=1 bigtime=1 inobtcount=1 nrext64=1
> > = exchange=0 metadir=0
> > data = bsize=4096 blocks=390703440, imaxpct=5
> > = sunit=0 swidth=0 blks
> > naming =version 2 bsize=4096 ascii-ci=0, ftype=1, parent=0
> > log =/dev/nvme0n1 bsize=4096 blocks=262144, version=2
> > = sectsz=4096 sunit=1 blks, lazy-count=1
> > realtime =none extsz=4096 blocks=0, rtextents=0
> > = rgcount=0 rgsize=0 extents
> > = zoned=0 start=0 reserved=0
> > # grep nvme1n1 /proc/mounts
> > /dev/nvme1n1 /nvme xfs rw,relatime,inode64,logbufs=8,logbsize=32k,logdev=/dev/nvme0n1,noquota 0 0
> >
> > and this output from fsmark with parent=0:
>
> ....
>
> a table-based summary would have made this easier to read
>
> parent real user sys
> create 0 0m57.573s 3m53.578s 19m44.440s
> create 1 1m2.934s 3m53.508s 25m14.810s
>
> bulk 0 0m1.122s 0m0.955s 0m39.306s
> bulk 1 0m1.158s 0m0.882s 0m39.847s
>
> unlink 0 0m59.649s 0m41.196s 13m9.566s
> unlink 1 1m12.505s 0m47.489s 20m33.844s
>
> > fs_mark itself shows a decrease in file creation/sec of about 9%, an
> > increase in wall clock time of about 9%, and an increase in kernel time
> > of about 28%. That's to be expected, since parent pointer updates cause
> > directory entry creation and deletion to hold more ILOCKs and for
> > longer.
>
> ILOCK isn't an issue with this test - the whole point of the
> segmented directory structure is that each thread operates in its
> own directory, so there is no ILOCK contention at all. i.e. the
> entire difference is the CPU overhead of adding the xattr fork
> and creating the parent pointer xattr.
>
> I suspect that the create side overhead is probably acceptable,
> because we also typically add security xattrs at create time and
> these will be slightly faster as the xattr fork is already
> prepared...
>
> > Parallel bulkstat (aka bulkme) shows an increase in wall time of 3% and
> > system time of 1%, which is not surprising since that's just walking the
> > inode btree and inode cores, no parent pointers involved.
>
> I was more interested in the cold cache behaviour - hot cache is
> generally uninteresting as the XFS inode cache scales pretty much
> perfectly in this case. Reading the inodes from disk, OTOH, adds a
> whole heap of instantiation and lock contention overhead and changes
> the picture significantly. I'm interested to know what the impact of
> having PPs is in that case....
>
> > Similarly, deleting all the files created by fs_mark shows an increase
> > in wall time of about 21% and an increase in system time of about 56%.
> > I concede that parent pointers have a fair amount of overhead for the
> > worst case of creating a large directory tree or deleting it.
>
> Ok, so an increase in unlink CPU overhead of 56% is pretty bad. On
> single threaded workloads, that's going to equate to a ~50%
> reduction in performance for operations that perform unlinks in CPU
> bound loops (e.g. rm -rf on hot caches). Note that the above test is
> not CPU bound - it's only running at about 50% CPU utilisation
> because of some other contention point in the fs (possibly log space
> or pinned/stale directory buffers requiring a log force to clear).
>
> However, results like this make me think that PP unlink hasn't been
> optimised for the common case: removing the last parent pointer
> (i.e. nlink 1 -> 0) when the inode is being placed on the unlinked
> list in syscall context. This is the common case in the absence of
> hard links, and it puts the PP xattr removal directly in application
> task context.
>
> In this case, it seems to me that we don't actually need
> to remove the parent pointer xattr. When the inode is inactivated by
> background inodegc after last close, the xattr fork is truncated and
> that will remove all xattrs including the stale remaining PP without
> needing to make a specific PP transaction.
>
> Doing this would remove the PP overhead completely from the final
> unlink syscall path. It would only add minimal extra overhead on
> the inodegc side as (in the common case) we have to remove security
> xattrs in inodegc.
At some point hch suggested that the parent pointer code could shortcut
the entire xattr intent machinery if the child file has shortform
xattrs. For this fs_mark benchmark where we're creating a lot of empty
files, doing so actually /does/ cut the creation overhead from ~30% to
~3%, and the deletion overhead to nearly zero.
diff --git a/fs/xfs/libxfs/xfs_attr_leaf.h b/fs/xfs/libxfs/xfs_attr_leaf.h
index 589f810eedc0d8..c59e5ef47ed95d 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.h
+++ b/fs/xfs/libxfs/xfs_attr_leaf.h
@@ -49,6 +49,7 @@ void xfs_attr_shortform_create(struct xfs_da_args *args);
void xfs_attr_shortform_add(struct xfs_da_args *args, int forkoff);
int xfs_attr_shortform_getvalue(struct xfs_da_args *args);
int xfs_attr_shortform_to_leaf(struct xfs_da_args *args);
+int xfs_attr_try_sf_addname(struct xfs_inode *dp, struct xfs_da_args *args);
int xfs_attr_sf_removename(struct xfs_da_args *args);
struct xfs_attr_sf_entry *xfs_attr_sf_findname(struct xfs_da_args *args);
int xfs_attr_shortform_allfit(struct xfs_buf *bp, struct xfs_inode *dp);
diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
index 8c04acd30d489c..89cc913a2b4345 100644
--- a/fs/xfs/libxfs/xfs_attr.c
+++ b/fs/xfs/libxfs/xfs_attr.c
@@ -349,7 +349,7 @@ xfs_attr_set_resv(
* xfs_attr_shortform_addname() will convert to leaf format and return -ENOSPC.
* to use.
*/
-STATIC int
+int
xfs_attr_try_sf_addname(
struct xfs_inode *dp,
struct xfs_da_args *args)
diff --git a/fs/xfs/libxfs/xfs_parent.c b/fs/xfs/libxfs/xfs_parent.c
index 69366c44a70159..048f822951103c 100644
--- a/fs/xfs/libxfs/xfs_parent.c
+++ b/fs/xfs/libxfs/xfs_parent.c
@@ -29,6 +29,7 @@
#include "xfs_trans_space.h"
#include "xfs_attr_item.h"
#include "xfs_health.h"
+#include "xfs_attr_leaf.h"
struct kmem_cache *xfs_parent_args_cache;
@@ -202,6 +203,16 @@ xfs_parent_addname(
xfs_inode_to_parent_rec(&ppargs->rec, dp);
xfs_parent_da_args_init(&ppargs->args, tp, &ppargs->rec, child,
child->i_ino, parent_name);
+
+ if (xfs_inode_has_attr_fork(child) &&
+ xfs_attr_is_shortform(child)) {
+ ppargs->args.op_flags |= XFS_DA_OP_ADDNAME;
+
+ error = xfs_attr_try_sf_addname(child, &ppargs->args);
+ if (error != -ENOSPC)
+ return error;
+ }
+
xfs_attr_defer_add(&ppargs->args, XFS_ATTR_DEFER_SET);
return 0;
}
@@ -224,6 +235,10 @@ xfs_parent_removename(
xfs_inode_to_parent_rec(&ppargs->rec, dp);
xfs_parent_da_args_init(&ppargs->args, tp, &ppargs->rec, child,
child->i_ino, parent_name);
+
+ if (xfs_attr_is_shortform(child))
+ return xfs_attr_sf_removename(&ppargs->args);
+
xfs_attr_defer_add(&ppargs->args, XFS_ATTR_DEFER_REMOVE);
return 0;
}
@@ -250,6 +265,27 @@ xfs_parent_replacename(
child->i_ino, old_name);
xfs_inode_to_parent_rec(&ppargs->new_rec, new_dp);
+
+ if (xfs_attr_is_shortform(child)) {
+ ppargs->args.op_flags |= XFS_DA_OP_ADDNAME | XFS_DA_OP_REPLACE;
+
+ error = xfs_attr_sf_removename(&ppargs->args);
+ if (error)
+ return error;
+
+ xfs_parent_da_args_init(&ppargs->args, tp, &ppargs->new_rec,
+ child, child->i_ino, new_name);
+ ppargs->args.op_flags |= XFS_DA_OP_ADDNAME;
+
+ error = xfs_attr_try_sf_addname(child, &ppargs->args);
+ if (error == -ENOSPC) {
+ xfs_attr_defer_add(&ppargs->args, XFS_ATTR_DEFER_SET);
+ return 0;
+ }
+
+ return error;
+ }
+
ppargs->args.new_name = new_name->name;
ppargs->args.new_namelen = new_name->len;
ppargs->args.new_value = &ppargs->new_rec;
> Hence I think we really need to try to mitigate this common case
> overhead before we make PP the default for everyone. The perf
> decrease
>
>
> > If I then re-run the benchmark with a file size of 1M and tell it to
> > create fewer files, then I get the following for parent=0:
>
> These are largely meaningless as the create benchmark is throttling
> hard on disk bandwidth (1.5-2GB/s) in the write() path, not limited
> by PP overhead.
>
> The variance in runtime comes from the data IO path behaviour, and
> the lack of sync() operations after the create means that writeback
> is likely still running when the unlink phase runs. Hence it's
> pretty difficult to conclude anything about parent pointers
> themselves because of the other large sources of variance in this
> workload.
They're not meaningless numbers, Dave. Writing data into user files is
always going to take up a large portion of the time spent creating a
real directory tree. Anyone unpacking a tarball onto a filesystem can
run into disk throttling on write bandwidth, which just reduces the
relative overhead of the pptr updates further.
The only times it becomes painful are in this microbenchmarking case
where someone is trying to create millions of empty files, and when
deleting a directory tree.
Anyway, we now have a patch, and I'll rerun the benchmark if this
survives overnight testing.
--D
> > I then decided to simulate my maildir spool, which has 670,000 files
> > consuming 12GB for an average file size of 17936 bytes. I reduced the
> > file size to 16K, increased the number of files per iteration, and set
> > the write buffer size to something not aligned to a block, and got this
> > for parent=0:
>
> Same again, but this time the writeback thread will be seeing
> delalloc latencies w.r.t. AGF locks vs incoming directory and inode
> chunk allocation operations. That can be seen by:
>
> >
> > # fs_mark -w 778 -D 1000 -S 0 -n 6000 -s 16384 -L 8 -d /nvme/0 -d /nvme/1 -d /nvme/2 -d /nvme/3 -d /nvme/4 -d /nvme/5 -d /nvme/6 -d /nvme/7 -d /nvme/8 -d /nvme/9 -d /nvme/10 -d /nvme/11 -d /nvme/12 -d /nvme/13 -d /nvme/14 -d /nvme/15 -d /nvme/16 -d /nvme/17 -d /nvme/18 -d /nvme/19 -d /nvme/20 -d /nvme/21 -d /nvme/22 -d /nvme/23 -d /nvme/24 -d /nvme/25 -d /nvme/26 -d /nvme/27 -d /nvme/28 -d /nvme/29 -d /nvme/30 -d /nvme/31 -d /nvme/32 -d /nvme/33 -d /nvme/34 -d /nvme/35 -d /nvme/36 -d /nvme/37 -d /nvme/38 -d /nvme/39
> > # Version 3.3, 40 thread(s) starting at Wed Dec 10 15:21:38 2025
> > # Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
> > # Directories: Time based hash between directories across 1000 subdirectories with 180 seconds per subdirectory.
> > # File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
> > # Files info: size 16384 bytes, written with an IO size of 778 bytes per write
> > # App overhead is time in microseconds spent in the test not doing file writing related system calls.
> >
> > FSUse% Count Size Files/sec App Overhead
> > 2 240000 16384 40085.3 2492281
> > 2 480000 16384 37026.7 2780077
> > 2 720000 16384 28445.5 2591461
> > 3 960000 16384 28888.6 2595817
> > 3 1200000 16384 25160.8 2903882
> > 3 1440000 16384 29372.1 2600018
> > 3 1680000 16384 26443.9 2732790
> > 4 1920000 16384 26307.1 2758750
> >
> > real 1m11.633s
> > user 0m46.156s
> > sys 3m24.543s
>
> .. creates only managing ~270% CPU utilisation for a 40-way
> operation.
>
> IOWs, parent pointer overhead is noise compared to the losses caused
> by data writeback locking/throttling interactions, so nothing can
> really be concluded from the data here.
>
> > Conclusion: There are noticeable overheads to enabling parent pointers,
> > but counterbalancing that, we can now repair an entire filesystem,
> > directory tree and all.
>
> True, but I think that the unlink overhead is significant enough
> that we need to address that before enabling PP by default for
> everyone.
>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
^ permalink raw reply related [flat|nested] 13+ messages in thread
end of thread, other threads:[~2025-12-16 23:07 UTC | newest]
Thread overview: 13+ messages
2025-12-02 1:27 [PATCHSET 2/2] xfsprogs: enable new stable features for 6.18 Darrick J. Wong
2025-12-02 1:28 ` [PATCH 1/2] mkfs: enable new features by default Darrick J. Wong
2025-12-02 7:38 ` Christoph Hellwig
2025-12-03 0:53 ` Darrick J. Wong
2025-12-03 6:31 ` Christoph Hellwig
2025-12-04 18:48 ` Darrick J. Wong
2025-12-02 1:28 ` [PATCH 2/2] mkfs: add 2025 LTS config file Darrick J. Wong
-- strict thread matches above, loose matches on Subject: below --
2025-12-09 16:16 [PATCHSET V2] xfsprogs: enable new stable features for 6.18 Darrick J. Wong
2025-12-09 16:16 ` [PATCH 1/2] mkfs: enable new features by default Darrick J. Wong
2025-12-09 16:22 ` Christoph Hellwig
2025-12-09 22:25 ` Dave Chinner
2025-12-10 23:49 ` Darrick J. Wong
2025-12-15 23:59 ` Dave Chinner
2025-12-16 23:07 ` Darrick J. Wong