public inbox for linux-xfs@vger.kernel.org
* [PATCHSET 2/2] xfsprogs: enable new stable features for 6.18
@ 2025-12-02  1:27 Darrick J. Wong
  2025-12-02  1:28 ` [PATCH 1/2] mkfs: enable new features by default Darrick J. Wong
  2025-12-02  1:28 ` [PATCH 2/2] mkfs: add 2025 LTS config file Darrick J. Wong
  0 siblings, 2 replies; 13+ messages in thread
From: Darrick J. Wong @ 2025-12-02  1:27 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

Hi all,

Enable by default some new features that seem stable now.

If you're going to start using this code, I strongly recommend pulling
from my git tree, which is linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=default-features
---
Commits in this patchset:
 * mkfs: enable new features by default
 * mkfs: add 2025 LTS config file
---
 mkfs/Makefile      |    3 ++-
 mkfs/lts_6.18.conf |   19 +++++++++++++++++++
 mkfs/xfs_mkfs.c    |    5 +++--
 3 files changed, 24 insertions(+), 3 deletions(-)
 create mode 100644 mkfs/lts_6.18.conf


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/2] mkfs: enable new features by default
  2025-12-02  1:27 [PATCHSET 2/2] xfsprogs: enable new stable features for 6.18 Darrick J. Wong
@ 2025-12-02  1:28 ` Darrick J. Wong
  2025-12-02  7:38   ` Christoph Hellwig
  2025-12-02  1:28 ` [PATCH 2/2] mkfs: add 2025 LTS config file Darrick J. Wong
  1 sibling, 1 reply; 13+ messages in thread
From: Darrick J. Wong @ 2025-12-02  1:28 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Since the LTS is coming up, enable parent pointers and exchange-range by
default for all users.  Also fix up an out of date comment.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 mkfs/xfs_mkfs.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)


diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index 8f5a6fa5676453..8db51217016eb0 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -1044,7 +1044,7 @@ struct sb_feat_args {
 	bool	inode_align;		/* XFS_SB_VERSION_ALIGNBIT */
 	bool	nci;			/* XFS_SB_VERSION_BORGBIT */
 	bool	lazy_sb_counters;	/* XFS_SB_VERSION2_LAZYSBCOUNTBIT */
-	bool	parent_pointers;	/* XFS_SB_VERSION2_PARENTBIT */
+	bool	parent_pointers;	/* XFS_SB_FEAT_INCOMPAT_PARENT */
 	bool	projid32bit;		/* XFS_SB_VERSION2_PROJID32BIT */
 	bool	crcs_enabled;		/* XFS_SB_VERSION2_CRCBIT */
 	bool	dirftype;		/* XFS_SB_VERSION2_FTYPE */
@@ -5984,11 +5984,12 @@ main(
 			.rmapbt = true,
 			.reflink = true,
 			.inobtcnt = true,
-			.parent_pointers = false,
+			.parent_pointers = true,
 			.nodalign = false,
 			.nortalign = false,
 			.bigtime = true,
 			.nrext64 = true,
+			.exchrange = true,
 			/*
 			 * When we decide to enable a new feature by default,
 			 * please remember to update the mkfs conf files.
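
For anyone who wants to opt back out of the new defaults at mkfs time,
both features should still be switchable on the command line; parent
pointers sit under the naming (-n) options and, assuming I have the
section right, exchange-range under the inode (-i) options:

```shell
# Sketch only: format with the new defaults switched back off.
# /dev/sdX is a placeholder device; -n parent controls parent pointers
# and -i exchange is assumed to control exchange-range, mirroring the
# [naming] and [inode] sections of the mkfs config files.
mkfs.xfs -f -n parent=0 -i exchange=0 /dev/sdX
```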


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/2] mkfs: add 2025 LTS config file
  2025-12-02  1:27 [PATCHSET 2/2] xfsprogs: enable new stable features for 6.18 Darrick J. Wong
  2025-12-02  1:28 ` [PATCH 1/2] mkfs: enable new features by default Darrick J. Wong
@ 2025-12-02  1:28 ` Darrick J. Wong
  1 sibling, 0 replies; 13+ messages in thread
From: Darrick J. Wong @ 2025-12-02  1:28 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add a new configuration file with the defaults as of 6.18 LTS.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 mkfs/Makefile      |    3 ++-
 mkfs/lts_6.18.conf |   19 +++++++++++++++++++
 2 files changed, 21 insertions(+), 1 deletion(-)
 create mode 100644 mkfs/lts_6.18.conf


diff --git a/mkfs/Makefile b/mkfs/Makefile
index 04905bd5101ccb..fb1473324cde7c 100644
--- a/mkfs/Makefile
+++ b/mkfs/Makefile
@@ -18,7 +18,8 @@ CFGFILES = \
 	lts_5.15.conf \
 	lts_6.1.conf \
 	lts_6.6.conf \
-	lts_6.12.conf
+	lts_6.12.conf \
+	lts_6.18.conf
 
 LLDLIBS += $(LIBXFS) $(LIBXCMD) $(LIBFROG) $(LIBRT) $(LIBBLKID) \
 	$(LIBUUID) $(LIBINIH) $(LIBURCU) $(LIBPTHREAD)
diff --git a/mkfs/lts_6.18.conf b/mkfs/lts_6.18.conf
new file mode 100644
index 00000000000000..2dbec51e586fa1
--- /dev/null
+++ b/mkfs/lts_6.18.conf
@@ -0,0 +1,19 @@
+# V5 features that were the mkfs defaults when the upstream Linux 6.18 LTS
+# kernel was released at the end of 2025.
+
+[metadata]
+bigtime=1
+crc=1
+finobt=1
+inobtcount=1
+metadir=0
+reflink=1
+rmapbt=1
+
+[inode]
+sparse=1
+nrext64=1
+exchange=1
+
+[naming]
+parent=1
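
The intended use of these LTS files is to hand them back to mkfs so
that a new filesystem sticks to this feature set; assuming the usual
-c options= syntax and install path (check mkfs.xfs(8) for both):

```shell
# Sketch only: pin mkfs to the 6.18 LTS feature set.  The install
# directory is an assumption; pass an absolute path to the conf file
# if your distro puts it elsewhere.  /dev/sdX is a placeholder device.
mkfs.xfs -c options=/usr/share/xfs/mkfs/lts_6.18.conf -f /dev/sdX
```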


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] mkfs: enable new features by default
  2025-12-02  1:28 ` [PATCH 1/2] mkfs: enable new features by default Darrick J. Wong
@ 2025-12-02  7:38   ` Christoph Hellwig
  2025-12-03  0:53     ` Darrick J. Wong
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2025-12-02  7:38 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: aalbersh, linux-xfs

On Mon, Dec 01, 2025 at 05:28:16PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Since the LTS is coming up, enable parent pointers and exchange-range by
> default for all users.  Also fix up an out of date comment.

Do you have any numbers that show the overhead or non-overhead of
enabling rmap?  It will increase the amount of metadata written quite
a bit.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] mkfs: enable new features by default
  2025-12-02  7:38   ` Christoph Hellwig
@ 2025-12-03  0:53     ` Darrick J. Wong
  2025-12-03  6:31       ` Christoph Hellwig
  0 siblings, 1 reply; 13+ messages in thread
From: Darrick J. Wong @ 2025-12-03  0:53 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: aalbersh, linux-xfs

On Mon, Dec 01, 2025 at 11:38:46PM -0800, Christoph Hellwig wrote:
> On Mon, Dec 01, 2025 at 05:28:16PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Since the LTS is coming up, enable parent pointers and exchange-range by
> > default for all users.  Also fix up an out of date comment.
> 
> Do you have any numbers that show the overhead or non-overhead of
> enabling rmap?  It will increase the amount of metadata written quite
> a bit.

I'm assuming you're interested in the overhead of *parent pointers* and
not rmap since we turned on rmap by default back in 2023?

I created a really stupid benchmarking script that does:

#!/bin/bash

umount /opt
mkfs.xfs -f /dev/sdb -n parent=$1
mount /dev/sdb /opt
mkdir -p /opt/foo
for ((i=0;i<10;i++)); do
	time fsstress -n 400000 -p 4 -z -f creat=1,mkdir=1,mknod=1,rmdir=1,unlink=1,link=1,rename=1 -d /opt/foo -s 1
done

# ./dumb.sh 0
meta-data=/dev/sdb               isize=512    agcount=4, agsize=1298176 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
         =                       exchange=1   metadir=0
data     =                       bsize=4096   blocks=5192704, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=0
log      =internal log           bsize=4096   blocks=16384, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
         =                       rgcount=0    rgsize=0 extents
         =                       zoned=0      start=0 reserved=0
Discarding blocks...Done.
real    0m18.807s
user    0m2.169s
sys     0m54.013s

real    0m13.845s
user    0m2.005s
sys     0m34.048s

real    0m14.019s
user    0m1.931s
sys     0m36.086s

real    0m14.435s
user    0m2.105s
sys     0m35.845s

real    0m14.823s
user    0m1.920s
sys     0m35.528s

real    0m14.181s
user    0m2.013s
sys     0m35.775s

real    0m14.281s
user    0m1.865s
sys     0m36.240s

real    0m13.638s
user    0m1.933s
sys     0m35.642s

real    0m13.553s
user    0m1.904s
sys     0m35.084s

real    0m13.963s
user    0m1.979s
sys     0m35.724s

# ./dumb.sh 1
meta-data=/dev/sdb               isize=512    agcount=4, agsize=1298176 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
         =                       exchange=1   metadir=0
data     =                       bsize=4096   blocks=5192704, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=1
log      =internal log           bsize=4096   blocks=16384, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
         =                       rgcount=0    rgsize=0 extents
         =                       zoned=0      start=0 reserved=0
Discarding blocks...Done.
real    0m20.654s
user    0m2.374s
sys     1m4.441s

real    0m14.255s
user    0m1.990s
sys     0m36.749s

real    0m14.553s
user    0m1.931s
sys     0m36.606s

real    0m13.855s
user    0m1.767s
sys     0m36.467s

real    0m14.606s
user    0m2.073s
sys     0m37.255s

real    0m13.706s
user    0m1.942s
sys     0m36.294s

real    0m14.177s
user    0m2.017s
sys     0m36.528s

real    0m15.310s
user    0m2.164s
sys     0m37.720s

real    0m14.099s
user    0m2.013s
sys     0m37.062s

real    0m14.067s
user    0m2.068s
sys     0m36.552s

As you can see, there's a noticeable increase in the runtime of the
first fsstress invocation, but for the subsequent runs there's not much
of a difference.  I think the parent pointer log items usually complete
in a single log checkpoint and are usually omitted from the log.  In the
common case of a single parent and an inline xattr area, the overhead is
basically zero because we're just writing to the attr fork's if_data and
not messing with xattr blocks.
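
For concreteness, the first-run delta quoted above works out to just
under 10% (my arithmetic on the two quoted wall times, nothing more):

```python
# Overhead of the first fsstress run, from the times quoted above:
# parent=0 took 18.807s of wall time, parent=1 took 20.654s.
base = 18.807   # parent=0 first-run wall time, seconds
pptr = 20.654   # parent=1 first-run wall time, seconds
overhead = (pptr - base) / base * 100
print(f"first-run wall time overhead: {overhead:.1f}%")  # about 9.8%
```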

If I remove the link=1 component from the fsstress -f argument so that
parent pointers always operate out of the immediate area, then the
first parent=0 runtime is:

real    0m18.920s
user    0m2.559s
sys     1m0.991s

and the first parent=1 is:

real    0m20.458s
user    0m2.533s
sys     1m6.301s

I see more or less the same timings for the nine subsequent runs for
each parent= setting.  I think it's safe to say the overhead ranges
between negligible and 10% on a cold new filesystem.

--D

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] mkfs: enable new features by default
  2025-12-03  0:53     ` Darrick J. Wong
@ 2025-12-03  6:31       ` Christoph Hellwig
  2025-12-04 18:48         ` Darrick J. Wong
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2025-12-03  6:31 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, aalbersh, linux-xfs

On Tue, Dec 02, 2025 at 04:53:45PM -0800, Darrick J. Wong wrote:
> On Mon, Dec 01, 2025 at 11:38:46PM -0800, Christoph Hellwig wrote:
> > On Mon, Dec 01, 2025 at 05:28:16PM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Since the LTS is coming up, enable parent pointers and exchange-range by
> > > default for all users.  Also fix up an out of date comment.
> > 
> > Do you have any numbers that show the overhead or non-overhead of
> > enabling rmap?  It will increase the amount of metadata written quite
> > a bit.
> 
> I'm assuming you're interested in the overhead of *parent pointers* and
> not rmap since we turned on rmap by default back in 2023?

Yes, sorry.

> I see more or less the same timings for the nine subsequent runs for
> each parent= setting.  I think it's safe to say the overhead ranges
> between negligible and 10% on a cold new filesystem.

Should we document this clearly?  Because this means at least some
workloads are going to see a performance decrease.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] mkfs: enable new features by default
  2025-12-03  6:31       ` Christoph Hellwig
@ 2025-12-04 18:48         ` Darrick J. Wong
  0 siblings, 0 replies; 13+ messages in thread
From: Darrick J. Wong @ 2025-12-04 18:48 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: aalbersh, linux-xfs

On Tue, Dec 02, 2025 at 10:31:22PM -0800, Christoph Hellwig wrote:
> On Tue, Dec 02, 2025 at 04:53:45PM -0800, Darrick J. Wong wrote:
> > On Mon, Dec 01, 2025 at 11:38:46PM -0800, Christoph Hellwig wrote:
> > > On Mon, Dec 01, 2025 at 05:28:16PM -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > 
> > > > Since the LTS is coming up, enable parent pointers and exchange-range by
> > > > default for all users.  Also fix up an out of date comment.
> > > 
> > > Do you have any numbers that show the overhead or non-overhead of
> > > enabling rmap?  It will increase the amount of metadata written quite
> > > a bit.
> > 
> > I'm assuming you're interested in the overhead of *parent pointers* and
> > not rmap since we turned on rmap by default back in 2023?
> 
> Yes, sorry.
> 
> > I see more or less the same timings for the nine subsequent runs for
> > each parent= setting.  I think it's safe to say the overhead ranges
> > between negligible and 10% on a cold new filesystem.
> 
> Should we document this cleary?  Because this means at least some
> workloads are going to see a performance decrease.

Yep.  But first -- all those results are inaccurate because I forgot
that fsstress quietly ignores everything after the first op=freq
component of the optarg, so all that benchmark was doing was creating
millions of files in a single directory and never deleting anything.
That's why the subsequent runs were much faster -- most of those files
were already created.
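
If repeated -f flags are honored (an assumption on my part; each flag
carrying a single op=freq pair), the benchmark I meant to run would
look like:

```shell
# Sketch only: one -f flag per operation, instead of the comma-joined
# list that fsstress silently truncates after the first op=freq.
fsstress -n 400000 -p 4 -z \
	-f creat=1 -f mkdir=1 -f mknod=1 -f rmdir=1 \
	-f unlink=1 -f link=1 -f rename=1 \
	-d /opt/foo -s 1
```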

So I'll send a patch to fstests to fix that behavior.  With that, the
benchmark that I alleged I was running produces these numbers when
creating a directory tree of only empty files:

naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=1
real    0m12.742s
user    0m28.074s
sys     0m10.839s

real    0m13.469s
user    0m25.827s
sys     0m11.816s

real    0m11.352s
user    0m22.602s
sys     0m11.275s

naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=0
real    0m12.782s
user    0m28.892s
sys     0m8.897s

real    0m13.591s
user    0m25.371s
sys     0m9.601s

real    0m10.012s
user    0m20.849s
sys     0m9.018s

Almost no difference here!  If I add in write=1 then there's a 5%
decrease going to parent=1:

naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=1
real    0m15.020s
user    0m22.358s
sys     0m14.827s

real    0m17.196s
user    0m22.888s
sys     0m15.586s

real    0m16.668s
user    0m21.709s
sys     0m15.425s

naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=0
real    0m14.808s
user    0m22.266s
sys     0m12.843s

real    0m16.323s
user    0m22.409s
sys     0m13.695s

real    0m15.562s
user    0m21.740s
sys     0m12.927s

--D

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/2] mkfs: enable new features by default
  2025-12-09 16:16 [PATCHSET V2] xfsprogs: enable new stable features for 6.18 Darrick J. Wong
@ 2025-12-09 16:16 ` Darrick J. Wong
  2025-12-09 16:22   ` Christoph Hellwig
  2025-12-09 22:25   ` Dave Chinner
  0 siblings, 2 replies; 13+ messages in thread
From: Darrick J. Wong @ 2025-12-09 16:16 UTC (permalink / raw)
  To: djwong, aalbersh; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Since the LTS is coming up, enable parent pointers and exchange-range by
default for all users.  Also fix up an out of date comment.

I created a really stupid benchmarking script that does:

#!/bin/bash

# pptr overhead benchmark

umount /opt /mnt
rmmod xfs
for i in 1 0; do
	umount /opt
	mkfs.xfs -f /dev/sdb -n parent=$i | grep -i parent=
	mount /dev/sdb /opt
	mkdir -p /opt/foo
	for ((i=0;i<5;i++)); do
		time fsstress -n 100000 -p 4 -z -f creat=1 -d /opt/foo -s 1
	done
done

This is the result of creating an enormous number of empty files in a
single directory:

# ./dumb.sh
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=0
real    0m18.807s
user    0m2.169s
sys     0m54.013s

naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=1
real    0m20.654s
user    0m2.374s
sys     1m4.441s

As you can see, there's a 10% increase in runtime here.  If I make the
workload a bit more representative by changing the -f argument to
include a directory tree workout:

-f creat=1,mkdir=1,mknod=1,rmdir=1,unlink=1,link=1,rename=1


naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=1
real    0m12.742s
user    0m28.074s
sys     0m10.839s

naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=0
real    0m12.782s
user    0m28.892s
sys     0m8.897s

Almost no difference here.  If I then actually write to the regular
files by adding:

-f write=1

naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=1
real    0m16.668s
user    0m21.709s
sys     0m15.425s

naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=0
real    0m15.562s
user    0m21.740s
sys     0m12.927s

So that's about a 2% difference.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 mkfs/xfs_mkfs.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)


diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index 8f5a6fa5676453..8db51217016eb0 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -1044,7 +1044,7 @@ struct sb_feat_args {
 	bool	inode_align;		/* XFS_SB_VERSION_ALIGNBIT */
 	bool	nci;			/* XFS_SB_VERSION_BORGBIT */
 	bool	lazy_sb_counters;	/* XFS_SB_VERSION2_LAZYSBCOUNTBIT */
-	bool	parent_pointers;	/* XFS_SB_VERSION2_PARENTBIT */
+	bool	parent_pointers;	/* XFS_SB_FEAT_INCOMPAT_PARENT */
 	bool	projid32bit;		/* XFS_SB_VERSION2_PROJID32BIT */
 	bool	crcs_enabled;		/* XFS_SB_VERSION2_CRCBIT */
 	bool	dirftype;		/* XFS_SB_VERSION2_FTYPE */
@@ -5984,11 +5984,12 @@ main(
 			.rmapbt = true,
 			.reflink = true,
 			.inobtcnt = true,
-			.parent_pointers = false,
+			.parent_pointers = true,
 			.nodalign = false,
 			.nortalign = false,
 			.bigtime = true,
 			.nrext64 = true,
+			.exchrange = true,
 			/*
 			 * When we decide to enable a new feature by default,
 			 * please remember to update the mkfs conf files.


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] mkfs: enable new features by default
  2025-12-09 16:16 ` [PATCH 1/2] mkfs: enable new features by default Darrick J. Wong
@ 2025-12-09 16:22   ` Christoph Hellwig
  2025-12-09 22:25   ` Dave Chinner
  1 sibling, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2025-12-09 16:22 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: aalbersh, linux-xfs

On Tue, Dec 09, 2025 at 08:16:08AM -0800, Darrick J. Wong wrote:
> Almost no difference here.  If I then actually write to the regular
> files by adding:
> 
> -f write=1

..

> So that's about a 2% difference.

Let's hope no one complains given that parent pointers are useful:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] mkfs: enable new features by default
  2025-12-09 16:16 ` [PATCH 1/2] mkfs: enable new features by default Darrick J. Wong
  2025-12-09 16:22   ` Christoph Hellwig
@ 2025-12-09 22:25   ` Dave Chinner
  2025-12-10 23:49     ` Darrick J. Wong
  1 sibling, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2025-12-09 22:25 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: aalbersh, linux-xfs

On Tue, Dec 09, 2025 at 08:16:08AM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Since the LTS is coming up, enable parent pointers and exchange-range by
> default for all users.  Also fix up an out of date comment.
> 
> I created a really stupid benchmarking script that does:
> 
> #!/bin/bash
> 
> # pptr overhead benchmark
> 
> umount /opt /mnt
> rmmod xfs
> for i in 1 0; do
> 	umount /opt
> 	mkfs.xfs -f /dev/sdb -n parent=$i | grep -i parent=
> 	mount /dev/sdb /opt
> 	mkdir -p /opt/foo
> 	for ((i=0;i<5;i++)); do
> 		time fsstress -n 100000 -p 4 -z -f creat=1 -d /opt/foo -s 1
> 	done
> done

Hmmm. fsstress is an interesting choice here...

> This is the result of creating an enormous number of empty files in a
> single directory:
> 
> # ./dumb.sh
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=0
> real    0m18.807s
> user    0m2.169s
> sys     0m54.013s

> 
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=1
> real    0m20.654s
> user    0m2.374s
> sys     1m4.441s

Yeah, that's only creating 20,000 files/sec. That's a lot less than
I'd expect a single thread to be able to do - why is the kernel
burning all 4 CPUs on this workload?

i.e. I'd expect a pure create workload to run at about 40,000
files/s with sleeping contention on the i_rwsem, but this is much
slower than I'd expect and contention is on a spinning lock...

Also, parent pointers add about 20% more system time overhead (54s
sys time to 64.4s sys time). Where does this come from? Do you have
kernel profiles? Is it PP overhead, a change in the contention
point, or just worse contention on the same resource?
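
One cheap way to answer that would be a system-wide profile taken
while the benchmark runs, along these lines (a sketch; the output file
name is arbitrary and perf must match the running kernel):

```shell
# Sketch: record a system-wide call-graph profile across the benchmark,
# then show the top symbols by CPU time.
perf record -a -g -o pptr.perf.data -- ./dumb.sh
perf report -i pptr.perf.data --sort symbol | head -40
```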

> As you can see, there's a 10% increase in runtime here.  If I make the
> workload a bit more representative by changing the -f argument to
> include a directory tree workout:
> 
> -f creat=1,mkdir=1,mknod=1,rmdir=1,unlink=1,link=1,rename=1
> 
> 
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=1
> real    0m12.742s
> user    0m28.074s
> sys     0m10.839s
> 
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=0
> real    0m12.782s
> user    0m28.892s
> sys     0m8.897s

Again, that's way slower than I'd expect a 4p metadata workload to
run through 400k modification ops. i.e. it's running at about 35k
ops/s, and I'd be expecting the baseline to be upwards of 100k
ops/s.

Ah, look at the amount of time spent in userspace - 28-29s vs 9-11s
spent in the kernel filesystem code.

Ok, performance is limited by the userspace code, not the kernel
code. I would expect a decent fs benchmark to be at most 10%
userspace CPU time, with >90% of the time being spent in the kernel
doing filesystem operations.

IOWs, there is way too much userspace overhead in this workload to
draw useful conclusions about the impact of the kernel side changes.

System time went up from 9s to 11s when parent pointers are turned
on - a 20% increase in CPU overhead - but that additional overhead
isn't reflected in the wall time results because the CPU overhead is
dominated by the userspace program, not the kernel code that is
being "measured".

> Almost no difference here.

Ah, no. Again, system time went up by ~20%, even though elapsed time
was unchanged. That implies there is some amount of sleeping
contention occurring between processes doing work, and the
additional CPU overhead of the PP code simply resulted in less sleep
time.

Again, this is not noticeable because the workload is dominated by
userspace CPU overhead, not the kernel/filesystem operation
overhead...


> If I then actually write to the regular
> files by adding:
> 
> -f write=1
> 
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=1
> real    0m16.668s
> user    0m21.709s
> sys     0m15.425s
> 
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=0
> real    0m15.562s
> user    0m21.740s
> sys     0m12.927s
> 
> So that's about a 2% difference.

Same here - system time went up by 25%, even though wall time didn't
change. Also, 15.5s to 16.6s increase in wall time is actually
a 7% difference in runtime, not 2%.
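
The quoted third-run wall times bear that out:

```python
# Wall time increase computed from the times quoted above:
# parent=0 third run took 15.562s, parent=1 took 16.668s.
base = 15.562
pptr = 16.668
pct = (pptr - base) / base * 100
print(f"wall time increase: {pct:.1f}%")  # about 7.1%
```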

----

Overall, I don't think the benchmarking documented here is
sufficient to justify the conclusion that "parent pointers have
little real world overhead so we can turn them on by default".

I would at least like to see the "will-it-scale" impact on a 64p
machine with a hundred GB of RAM, an IO subsystem capable of at least
a million IOPS, and a filesystem optimised for max performance
(e.g. highly parallel fsmark based workloads). This will push the
filesystem and CPU usage to their actual limits and directly expose
additional overhead and new contention points in the results.

This is also much more representative of the sorts of high
performance, high end deployments that we expect XFS to be deployed
on, and where performance impact actually matters to users.

i.e. we need to know what the impact of the change is on the high
end as well as low end VM/desktop configs before any conclusion can
be drawn w.r.t. changing the parent pointer default setting....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] mkfs: enable new features by default
  2025-12-09 22:25   ` Dave Chinner
@ 2025-12-10 23:49     ` Darrick J. Wong
  2025-12-15 23:59       ` Dave Chinner
  0 siblings, 1 reply; 13+ messages in thread
From: Darrick J. Wong @ 2025-12-10 23:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: aalbersh, linux-xfs

On Wed, Dec 10, 2025 at 09:25:24AM +1100, Dave Chinner wrote:
> On Tue, Dec 09, 2025 at 08:16:08AM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Since the LTS is coming up, enable parent pointers and exchange-range by
> > default for all users.  Also fix up an out of date comment.
> > 
> > I created a really stupid benchmarking script that does:
> > 
> > #!/bin/bash
> > 
> > # pptr overhead benchmark
> > 
> > umount /opt /mnt
> > rmmod xfs
> > for i in 1 0; do
> > 	umount /opt
> > 	mkfs.xfs -f /dev/sdb -n parent=$i | grep -i parent=
> > 	mount /dev/sdb /opt
> > 	mkdir -p /opt/foo
> > 	for ((i=0;i<5;i++)); do
> > 		time fsstress -n 100000 -p 4 -z -f creat=1 -d /opt/foo -s 1
> > 	done
> > done
> 
> Hmmm. fsstress is an interesting choice here...

<flush all the old benchmarks and conclusions>

I have an old 40-core Xeon E5-2660V3 with a pair of 1.5T Intel nvme ssds
and 128G of RAM running 6.18.0.  For this sample, I tried to keep the
memory usage well below the amount of DRAM so that I could measure the
pure overhead of writing parent pointers out to disk and not anything
else.  I also omit ls'ing and chmod'ing the directory tree because
neither of those operations touch parent pointers.  I also left the
logbsize at the defaults (32k) because that's what most users get.

Here I'm using the following benchmark program, compiled from various
suggestions from dchinner over the years:

#!/bin/bash -x

iter=8
feature="-n parent"
filesz=0
subdirs=10000
files_per_iter=100000
writesize=16384

mkdirme() {
        set +x
        local i

        for ((i=0;i<agcount;i++)); do
                mkdir -p /nvme/$i
                dirs+=(-d /nvme/$i)
        done
        set -x
}

bulkme() {
        set +x
        local i

        for ((i=0;i<agcount;i++)); do
                xfs_io -c "bulkstat -a $i -q" /nvme &
        done
        wait
        set -x
}

rmdirme() {
        set +x
        local i
        for dir in "${dirs[@]}"; do
                rm -r -f "${dir}" &
        done
        wait
        set -x
}

benchme() {
        agcount="$(xfs_info /nvme/ | grep agcount= | sed -e 's/^.*agcount=//g' -e 's/,.*$//g')"
        dirs=()
        mkdirme

        #time ~djwong/cdev/work/fstests/build-x86_64/ltp/fsstress -n 400000 -p 40 -z -f creat=1,mkdir=1,rmdir=1,unlink=1 -d /nvme/ -s 1
        time fs_mark -w "${writesize}" -D "${subdirs}" -S 0 -n "${files_per_iter}" -s "${filesz}" -L "${iter}" "${dirs[@]}"

        time bulkme
        time rmdirme
}

for p in 0 1; do
        umount /dev/nvme1n1 /nvme /mnt
        #mkfs.xfs -f -l logdev=/dev/nvme0n1,size=1g /dev/nvme1n1 -n parent=$p || break
        mkfs.xfs -f -l logdev=/dev/nvme0n1,size=1g /dev/nvme1n1 $feature=$p || break
        mount /dev/nvme1n1 /nvme/ -o logdev=/dev/nvme0n1 || break
        benchme
        umount /dev/nvme1n1 /nvme /mnt
done

I get this mkfs output:
# mkfs.xfs -f -l logdev=/dev/nvme0n1,size=1g /dev/nvme1n1
meta-data=/dev/nvme1n1           isize=512    agcount=40, agsize=9767586 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
         =                       exchange=0   metadir=0
data     =                       bsize=4096   blocks=390703440, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=0
log      =/dev/nvme0n1           bsize=4096   blocks=262144, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
         =                       rgcount=0    rgsize=0 extents
         =                       zoned=0      start=0 reserved=0
# grep nvme1n1 /proc/mounts
/dev/nvme1n1 /nvme xfs rw,relatime,inode64,logbufs=8,logbsize=32k,logdev=/dev/nvme0n1,noquota 0 0

and this output from fsmark with parent=0:

#  fs_mark  -D  10000  -S  0  -n  100000  -s  0  -L  8  -d  /nvme/0  -d  /nvme/1  -d  /nvme/2  -d  /nvme/3  -d  /nvme/4  -d  /nvme/5  -d  /nvme/6  -d  /nvme/7  -d  /nvme/8  -d  /nvme/9  -d  /nvme/10  -d  /nvme/11  -d  /nvme/12  -d  /nvme/13  -d  /nvme/14  -d  /nvme/15  -d  /nvme/16  -d  /nvme/17  -d  /nvme/18  -d  /nvme/19  -d  /nvme/20  -d  /nvme/21  -d  /nvme/22  -d  /nvme/23  -d  /nvme/24  -d  /nvme/25  -d  /nvme/26  -d  /nvme/27  -d  /nvme/28  -d  /nvme/29  -d  /nvme/30  -d  /nvme/31  -d  /nvme/32  -d  /nvme/33  -d  /nvme/34  -d  /nvme/35  -d  /nvme/36  -d  /nvme/37  -d  /nvme/38  -d  /nvme/39 
#       Version 3.3, 40 thread(s) starting at Wed Dec 10 14:22:07 2025
#       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#       Directories:  Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
#       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#       Files info: size 0 bytes, written with an IO size of 16384 bytes per write
#       App overhead is time in microseconds spent in the test not doing file writing related system calls.

FSUse%        Count         Size    Files/sec     App Overhead
     2      4000000            0     566680.9         31398816
     2      8000000            0     665535.6         30037368
     2     12000000            0     537227.6         31726557
     2     16000000            0     538133.9         32411165
     2     20000000            0     619369.6         30790676
     2     24000000            0     600018.2         31583349
     2     28000000            0     607209.8         31193980
     3     32000000            0     533240.7         32277102

real    0m57.573s
user    3m53.578s
sys     19m44.440s
+ bulkme
+ set +x

real    0m1.122s
user    0m0.955s
sys     0m39.306s
+ rmdirme
+ set +x

real    0m59.649s
user    0m41.196s
sys     13m9.566s

I limited this to 8 iterations so I could post some preliminary results
after a few minutes.  Now let's try again with parent=1:

+ fs_mark -D 10000 -S 0 -n 100000 -s 0 -L 8 -d /nvme/0 -d /nvme/1 -d /nvme/2 -d /nvme/3 -d /nvme/4 -d /nvme/5 -d /nvme/6 -d /nvme/7 -d /nvme/8 -d /nvme/9 -d /nvme/10 -d /nvme/11 -d /nvme/12 -d /nvme/13 -d /nvme/14 -d /nvme/15 -d /nvme/16 -d /nvme/17 -d /nvme/18 -d /nvme/19 -d /nvme/20 -d /nvme/21 -d /nvme/22 -d /nvme/23 -d /nvme/24 -d /nvme/25 -d /nvme/26 -d /nvme/27 -d /nvme/28 -d /nvme/29 -d /nvme/30 -d /nvme/31 -d /nvme/32 -d /nvme/33 -d /nvme/34 -d /nvme/35 -d /nvme/36 -d /nvme/37 -d /nvme/38 -d /nvme/39

#  fs_mark  -D  10000  -S  0  -n  100000  -s  0  -L  8  -d  /nvme/0  -d  /nvme/1  -d  /nvme/2  -d  /nvme/3  -d  /nvme/4  -d  /nvme/5  -d  /nvme/6  -d  /nvme/7  -d  /nvme/8  -d  /nvme/9  -d  /nvme/10  -d  /nvme/11  -d  /nvme/12  -d  /nvme/13  -d  /nvme/14  -d  /nvme/15  -d  /nvme/16  -d  /nvme/17  -d  /nvme/18  -d  /nvme/19  -d  /nvme/20  -d  /nvme/21  -d  /nvme/22  -d  /nvme/23  -d  /nvme/24  -d  /nvme/25  -d  /nvme/26  -d  /nvme/27  -d  /nvme/28  -d  /nvme/29  -d  /nvme/30  -d  /nvme/31  -d  /nvme/32  -d  /nvme/33  -d  /nvme/34  -d  /nvme/35  -d  /nvme/36  -d  /nvme/37  -d  /nvme/38  -d  /nvme/39 
#       Version 3.3, 40 thread(s) starting at Wed Dec 10 14:24:44 2025
#       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#       Directories:  Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
#       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#       Files info: size 0 bytes, written with an IO size of 16384 bytes per write
#       App overhead is time in microseconds spent in the test not doing file writing related system calls.

FSUse%        Count         Size    Files/sec     App Overhead
     2      4000000            0     543929.1         31344175
     2      8000000            0     523736.2         31180565
     2     12000000            0     522184.1         31700380
     2     16000000            0     513468.0         32112498
     2     20000000            0     543993.1         31910496
     2     24000000            0     562760.1         32061910
     2     28000000            0     524039.8         31825520
     3     32000000            0     526028.8         31889193

real    1m2.934s
user    3m53.508s
sys     25m14.810s
+ bulkme
+ set +x

real    0m1.158s
user    0m0.882s
sys     0m39.847s
+ rmdirme
+ set +x

real    1m12.505s
user    0m47.489s
sys     20m33.844s


fs_mark itself shows a decrease in file creation/sec of about 9%, an
increase in wall clock time of about 9%, and an increase in kernel time
of about 28%.  That's to be expected, since parent pointer updates cause
directory entry creation and deletion to hold more ILOCKs and for
longer.

Parallel bulkstat (aka bulkme) shows an increase in wall time of 3% and
system time of 1%, which is not surprising since that's just walking the
inode btrees and reading inode cores, with no parent pointers involved.

Similarly, deleting all the files created by fs_mark shows an increase
in wall time of about 21% and an increase in system time of about 56%.
I concede that parent pointers have a fair amount of overhead for the
worst case of creating or deleting a large directory tree.

I reran this with logbsize=256k and while I saw a slight increase in
performance across the board, the overhead of pptrs is about the same
percentagewise.
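(For anyone who wants to check my arithmetic, this is roughly how I
computed the percentages above from the time(1) output; the pct helper
is just illustrative, and the seconds are transcribed from the runs
above:)

```shell
# Percent change between a parent=0 and parent=1 measurement, both in
# seconds.  pct is a throwaway helper, not part of any benchmark script.
pct() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", (b - a) / a * 100 }'; }

create_sys=$(pct 1184.440 1514.810)   # create sys:  19m44.440s -> 25m14.810s
create_wall=$(pct 57.573 62.934)      # create wall: 0m57.573s  -> 1m2.934s
unlink_sys=$(pct 789.566 1233.844)    # unlink sys:  13m9.566s  -> 20m33.844s

echo "create sys +${create_sys}% wall +${create_wall}%, unlink sys +${unlink_sys}%"
```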

If I then re-run the benchmark with a file size of 1M and tell it to
create fewer files, then I get the following for parent=0:

#  fs_mark  -D  1000  -S  0  -n  200  -s  1048576  -L  8  -d  /nvme/0  -d  /nvme/1  -d  /nvme/2  -d  /nvme/3  -d  /nvme/4  -d  /nvme/5  -d  /nvme/6  -d  /nvme/7  -d  /nvme/8  -d  /nvme/9  -d  /nvme/10  -d  /nvme/11  -d  /nvme/12  -d  /nvme/13  -d  /nvme/14  -d  /nvme/15  -d  /nvme/16  -d  /nvme/17  -d  /nvme/18  -d  /nvme/19  -d  /nvme/20  -d  /nvme/21  -d  /nvme/22  -d  /nvme/23  -d  /nvme/24  -d  /nvme/25  -d  /nvme/26  -d  /nvme/27  -d  /nvme/28  -d  /nvme/29  -d  /nvme/30  -d  /nvme/31  -d  /nvme/32  -d  /nvme/33  -d  /nvme/34  -d  /nvme/35  -d  /nvme/36  -d  /nvme/37  -d  /nvme/38  -d  /nvme/39 
#       Version 3.3, 40 thread(s) starting at Wed Dec 10 15:03:11 2025
#       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#       Directories:  Time based hash between directories across 1000 subdirectories with 180 seconds per subdirectory.
#       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#       Files info: size 1048576 bytes, written with an IO size of 16384 bytes per write
#       App overhead is time in microseconds spent in the test not doing file writing related system calls.

FSUse%        Count         Size    Files/sec     App Overhead
     2         8000      1048576       1493.4           198379
     2        16000      1048576       1327.0           255655
     3        24000      1048576       1355.8           255105
     4        32000      1048576       1352.3           253094
     4        40000      1048576       1836.9           262258
     5        48000      1048576       1337.6           246991
     5        56000      1048576       1328.4           240303
     6        64000      1048576       1165.9           237211

real    0m50.384s
user    0m7.640s
sys     1m43.187s
+ bulkme
+ set +x

real    0m0.023s
user    0m0.061s
sys     0m0.167s
+ rmdirme
+ set +x

real    0m0.675s
user    0m0.107s
sys     0m15.644s

and for parent=1:

#  fs_mark  -D  1000  -S  0  -n  200  -s  1048576  -L  8  -d  /nvme/0  -d  /nvme/1  -d  /nvme/2  -d  /nvme/3  -d  /nvme/4  -d  /nvme/5  -d  /nvme/6  -d  /nvme/7  -d  /nvme/8  -d  /nvme/9  -d  /nvme/10  -d  /nvme/11  -d  /nvme/12  -d  /nvme/13  -d  /nvme/14  -d  /nvme/15  -d  /nvme/16  -d  /nvme/17  -d  /nvme/18  -d  /nvme/19  -d  /nvme/20  -d  /nvme/21  -d  /nvme/22  -d  /nvme/23  -d  /nvme/24  -d  /nvme/25  -d  /nvme/26  -d  /nvme/27  -d  /nvme/28  -d  /nvme/29  -d  /nvme/30  -d  /nvme/31  -d  /nvme/32  -d  /nvme/33  -d  /nvme/34  -d  /nvme/35  -d  /nvme/36  -d  /nvme/37  -d  /nvme/38  -d  /nvme/39 
#       Version 3.3, 40 thread(s) starting at Wed Dec 10 15:04:41 2025
#       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#       Directories:  Time based hash between directories across 1000 subdirectories with 180 seconds per subdirectory.
#       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#       Files info: size 1048576 bytes, written with an IO size of 16384 bytes per write
#       App overhead is time in microseconds spent in the test not doing file writing related system calls.

FSUse%        Count         Size    Files/sec     App Overhead
     2         8000      1048576       1963.9           254007
     2        16000      1048576       1716.4           227074
     3        24000      1048576       1052.5           264987
     4        32000      1048576       1793.6           242288
     4        40000      1048576       1364.2           249738
     5        48000      1048576       1081.2           250394
     5        56000      1048576       1342.0           260667
     6        64000      1048576       1356.9           242324

real    0m49.256s
user    0m7.621s
sys     1m44.847s
+ bulkme
+ set +x

real    0m0.021s
user    0m0.060s
sys     0m0.176s
+ rmdirme
+ set +x

real    0m0.537s
user    0m0.108s
sys     0m15.453s

Here we see that the fs_mark creates/sec goes up by 4%, wall time
decreases by 3%, and the kernel time increases by 2% or so.  The rmdir
wall time decreases by 2% and the kernel time by ~1%, which is quite
small.  So for a more common case of populating a directory tree full of
big files with data in them, the overhead isn't all that noticeable.

I then decided to simulate my maildir spool, which has 670,000 files
consuming 12GB for an average file size of 17936 bytes.  I reduced the
file size to 16K, increased the number of files per iteration, and set
the write buffer size to something not aligned to a block, and got this
for parent=0:

#  fs_mark  -w  778  -D  1000  -S  0  -n  6000  -s  16384  -L  8  -d  /nvme/0  -d  /nvme/1  -d  /nvme/2  -d  /nvme/3  -d  /nvme/4  -d  /nvme/5  -d  /nvme/6  -d  /nvme/7  -d  /nvme/8  -d  /nvme/9  -d  /nvme/10  -d  /nvme/11  -d  /nvme/12  -d  /nvme/13  -d  /nvme/14  -d  /nvme/15  -d  /nvme/16  -d  /nvme/17  -d  /nvme/18  -d  /nvme/19  -d  /nvme/20  -d  /nvme/21  -d  /nvme/22  -d  /nvme/23  -d  /nvme/24  -d  /nvme/25  -d  /nvme/26  -d  /nvme/27  -d  /nvme/28  -d  /nvme/29  -d  /nvme/30  -d  /nvme/31  -d  /nvme/32  -d  /nvme/33  -d  /nvme/34  -d  /nvme/35  -d  /nvme/36  -d  /nvme/37  -d  /nvme/38  -d  /nvme/39 
#       Version 3.3, 40 thread(s) starting at Wed Dec 10 15:21:38 2025
#       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#       Directories:  Time based hash between directories across 1000 subdirectories with 180 seconds per subdirectory.
#       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#       Files info: size 16384 bytes, written with an IO size of 778 bytes per write
#       App overhead is time in microseconds spent in the test not doing file writing related system calls.

FSUse%        Count         Size    Files/sec     App Overhead
     2       240000        16384      40085.3          2492281
     2       480000        16384      37026.7          2780077
     2       720000        16384      28445.5          2591461
     3       960000        16384      28888.6          2595817
     3      1200000        16384      25160.8          2903882
     3      1440000        16384      29372.1          2600018
     3      1680000        16384      26443.9          2732790
     4      1920000        16384      26307.1          2758750

real    1m11.633s
user    0m46.156s
sys     3m24.543s
+ bulkme
+ set +x

real    0m0.091s
user    0m0.111s
sys     0m2.461s
+ rmdirme
+ set +x

real    0m9.364s
user    0m2.245s
sys     0m47.221s

and this for parent=1

#  fs_mark  -w  778  -D  1000  -S  0  -n  6000  -s  16384  -L  8  -d  /nvme/0  -d  /nvme/1  -d  /nvme/2  -d  /nvme/3  -d  /nvme/4  -d  /nvme/5  -d  /nvme/6  -d  /nvme/7  -d  /nvme/8  -d  /nvme/9  -d  /nvme/10  -d  /nvme/11  -d  /nvme/12  -d  /nvme/13  -d  /nvme/14  -d  /nvme/15  -d  /nvme/16  -d  /nvme/17  -d  /nvme/18  -d  /nvme/19  -d  /nvme/20  -d  /nvme/21  -d  /nvme/22  -d  /nvme/23  -d  /nvme/24  -d  /nvme/25  -d  /nvme/26  -d  /nvme/27  -d  /nvme/28  -d  /nvme/29  -d  /nvme/30  -d  /nvme/31  -d  /nvme/32  -d  /nvme/33  -d  /nvme/34  -d  /nvme/35  -d  /nvme/36  -d  /nvme/37  -d  /nvme/38  -d  /nvme/39 
#       Version 3.3, 40 thread(s) starting at Wed Dec 10 15:23:38 2025
#       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#       Directories:  Time based hash between directories across 1000 subdirectories with 180 seconds per subdirectory.
#       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#       Files info: size 16384 bytes, written with an IO size of 778 bytes per write
#       App overhead is time in microseconds spent in the test not doing file writing related system calls.

FSUse%        Count         Size    Files/sec     App Overhead
     2       240000        16384      39340.1          2627066
     2       480000        16384      27727.2          2925494
     2       720000        16384      28305.4          2597191
     2       960000        16384      24891.6          2834421
     3      1200000        16384      27964.8          2810556
     3      1440000        16384      27204.6          2776783
     3      1680000        16384      25745.2          2779197
     3      1920000        16384      24674.9          2752721

real    1m14.422s
user    0m46.607s
sys     3m38.777s
+ bulkme
+ set +x

real    0m0.081s
user    0m0.123s
sys     0m2.408s
+ rmdirme
+ set +x

real    0m9.306s
user    0m2.570s
sys     1m10.598s

fs_mark shows a 7% decrease in creates/sec, a 4% increase in wall time,
a 7% increase in kernel time.  bulkstat is as usual not that different,
and deletion shows an increase in kernel time of 50%.

Conclusion: There are noticeable overheads to enabling parent pointers,
but counterbalancing that, we can now repair an entire filesystem,
directory tree and all.

--D

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] mkfs: enable new features by default
  2025-12-10 23:49     ` Darrick J. Wong
@ 2025-12-15 23:59       ` Dave Chinner
  2025-12-16 23:07         ` Darrick J. Wong
  0 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2025-12-15 23:59 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: aalbersh, linux-xfs

On Wed, Dec 10, 2025 at 03:49:28PM -0800, Darrick J. Wong wrote:
> On Wed, Dec 10, 2025 at 09:25:24AM +1100, Dave Chinner wrote:
> > On Tue, Dec 09, 2025 at 08:16:08AM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Since the LTS is coming up, enable parent pointers and exchange-range by
> > > default for all users.  Also fix up an out of date comment.
> > > 
> > > I created a really stupid benchmarking script that does:
> > > 
> > > #!/bin/bash
> > > 
> > > # pptr overhead benchmark
> > > 
> > > umount /opt /mnt
> > > rmmod xfs
> > > for i in 1 0; do
> > > 	umount /opt
> > > 	mkfs.xfs -f /dev/sdb -n parent=$i | grep -i parent=
> > > 	mount /dev/sdb /opt
> > > 	mkdir -p /opt/foo
> > > 	for ((i=0;i<5;i++)); do
> > > 		time fsstress -n 100000 -p 4 -z -f creat=1 -d /opt/foo -s 1
> > > 	done
> > > done
> > 
> > Hmmm. fsstress is an interesting choice here...
> 
> <flush all the old benchmarks and conclusions>
> 
> I have an old 40-core Xeon E5-2660V3 with a pair of 1.5T Intel nvme ssds
> and 128G of RAM running 6.18.0.  For this sample, I tried to keep the
> memory usage well below the amount of DRAM so that I could measure the
> pure overhead of writing parent pointers out to disk and not anything
> else.  I also omit ls'ing and chmod'ing the directory tree because
> neither of those operations touch parent pointers.  I also left the
> logbsize at the defaults (32k) because that's what most users get.

ok.

.....

> benchme() {
>         agcount="$(xfs_info /nvme/ | grep agcount= | sed -e 's/^.*agcount=//g' -e 's/,.*$//g')"
>         dirs=()
>         mkdirme
> 
>         #time ~djwong/cdev/work/fstests/build-x86_64/ltp/fsstress -n 400000 -p 40 -z -f creat=1,mkdir=1,rmdir=1,unlink=1 -d /nvme/ -s 1
>         time fs_mark -w "${writesz}" -D "${subdirs}" -S 0 -n "${files_per_iter}" -s "${filesz}" -L "${iter}" "${dirs[@]}"
> 
>         time bulkme
>         time rmdirme

Ok, so this is testing cache-hot bulkstat and rm, so it's not
exercising the cold-read path and hence isn't reading and
initialising parent pointers for unlinking. Can you drop caches
between the bulkstat and the unlink phases so we exercise cold cache
parent pointer instantiation overhead somewhere?
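e.g. something like this between the two phases (a hypothetical helper,
needs root; drop_caches is not part of the posted benchmark script):

```shell
# Drop clean pagecache plus dentry/inode caches so the following phase
# starts cold.  sync first so dirty data doesn't pin cached inodes.
drop_caches() {
	sync
	echo 3 > /proc/sys/vm/drop_caches
}
```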

> }
> 
> for p in 0 1; do
>         umount /dev/nvme1n1 /nvme /mnt
>         #mkfs.xfs -f -l logdev=/dev/nvme0n1,size=1g /dev/nvme1n1 -n parent=$p || break
>         mkfs.xfs -f -l logdev=/dev/nvme0n1,size=1g /dev/nvme1n1 $feature=$p || break
>         mount /dev/nvme1n1 /nvme/ -o logdev=/dev/nvme0n1 || break
>         benchme
>         umount /dev/nvme1n1 /nvme /mnt
> done
> 
> I get this mkfs output:
> # mkfs.xfs -f -l logdev=/dev/nvme0n1,size=1g /dev/nvme1n1
> meta-data=/dev/nvme1n1           isize=512    agcount=40, agsize=9767586 blks
>          =                       sectsz=4096  attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=1
>          =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
>          =                       exchange=0   metadir=0
> data     =                       bsize=4096   blocks=390703440, imaxpct=5
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=0
> log      =/dev/nvme0n1           bsize=4096   blocks=262144, version=2
>          =                       sectsz=4096  sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
>          =                       rgcount=0    rgsize=0 extents
>          =                       zoned=0      start=0 reserved=0
> # grep nvme1n1 /proc/mounts
> /dev/nvme1n1 /nvme xfs rw,relatime,inode64,logbufs=8,logbsize=32k,logdev=/dev/nvme0n1,noquota 0 0
> 
> and this output from fsmark with parent=0:

....

a table-based summary would have made this easier to read

	parent		real		user		sys
create	0		0m57.573s	3m53.578s	19m44.440s
create	1		1m2.934s	3m53.508s	25m14.810s

bulk	0		0m1.122s	0m0.955s	0m39.306s
bulk	1		0m1.158s	0m0.882s	0m39.847s

unlink	0		0m59.649s	0m41.196s	13m9.566s
unlink	1		1m12.505s	0m47.489s	20m33.844s

> fs_mark itself shows a decrease in file creation/sec of about 9%, an
> increase in wall clock time of about 9%, and an increase in kernel time
> of about 28%.  That's to be expected, since parent pointer updates cause
> directory entry creation and deletion to hold more ILOCKs and for
> longer.

ILOCK isn't an issue with this test - the whole point of the
segmented directory structure is that each thread operates in its
own directory, so there is no ILOCK contention at all. i.e. the
entire difference is the CPU overhead of adding the xattr fork
and creating the parent pointer xattr.

I suspect that the create side overhead is probably acceptable,
because we also typically add security xattrs at create time and
these will be slightly faster as the xattr fork is already
prepared...

> Parallel bulkstat (aka bulkme) shows an increase in wall time of 3% and
> system time of 1%, which is not surprising since that's just walking the
> inode btree and cores, no parent pointers involved.

I was more interested in the cold cache behaviour - hot cache is
generally uninteresting as the XFS inode cache scales pretty much
perfectly in this case. Reading the inodes from disk, OTOH, adds a
whole heap of instantiation and lock contention overhead and changes
the picture significantly. I'm interested to know what the impact of
having PPs is in that case....

> Similarly, deleting all the files created by fs_mark shows an increase
> in wall time of about ~21% and an increase in system time of about 56%.
> I concede that parent pointers has a fair amount of overhead for the
> worst case of creating a large directory tree or deleting it.

Ok, so an increase in unlink CPU overhead of 56% is pretty bad. On
single threaded workloads, that's going to equate to a ~50%
reduction in performance for operations that perform unlinks in CPU
bound loops (e.g. rm -rf on hot caches). Note that the above test is
not CPU bound - it's only running at about 50% CPU utilisation
because of some other contention point in the fs (possibly log space
or pinned/stale directory buffers requiring a log force to clear).

However, results like this make me think that PP unlink hasn't been
optimised for the common case: removing the last parent pointer
(i.e. nlink 1 -> 0) when the inode is being placed on the unlinked
list in syscall context. This is the common case in the absence of
hard links, and it puts the PP xattr removal directly in application
task context.

In this case, it seems to me that we don't actually need
to remove the parent pointer xattr. When the inode is inactivated by
background inodegc after last close, the xattr fork is truncated and
that will remove all xattrs including the stale remaining PP without
needing to make a specific PP transaction.

Doing this would remove the PP overhead completely from the final
unlink syscall path. It would only add minimal extra overhead on
the inodegc side as (in the common case) we have to remove security
xattrs in inodegc. 

Hence I think we really need to try to mitigate this common case
overhead before we make PP the default for everyone.


> If I then re-run the benchmark with a file size of 1M and tell it to
> create fewer files, then I get the following for parent=0:

These are largely meaningless as the create benchmark is throttling
hard on disk bandwidth (1.5-2GB/s) in the write() path, not limited
by PP overhead.

The variance in runtime comes from the data IO path behaviour, and
the lack of sync() operations after the create means that writeback
is likely still running when the unlink phase runs. Hence it's
pretty difficult to conclude anything about parent pointers
themselves because of the other large sources of variance in this
workload.
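Back-of-the-envelope, from the numbers quoted above (64000 x 1MiB files
over the measured wall times; bw is a throwaway helper, and wall-time
throughput will understate the true rate since writeback keeps running
after fs_mark exits):

```shell
# GB/s for 64000 files of 1MiB each written over t seconds of wall time.
bw() { awk -v files=64000 -v sz=1048576 -v t="$1" \
	'BEGIN { printf "%.2f\n", files * sz / t / 1e9 }'; }

bw0=$(bw 50.384)   # parent=0 create phase
bw1=$(bw 49.256)   # parent=1 create phase
echo "parent=0: ${bw0} GB/s  parent=1: ${bw1} GB/s"
```

i.e. both runs sit within a few percent of each other and near the
devices' write bandwidth, consistent with the create phase being
IO-bound either way.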

> I then decided to simulate my maildir spool, which has 670,000 files
> consuming 12GB for an average file size of 17936 bytes.  I reduced the
> file size to 16K, increased the number of files per iteration, and set
> the write buffer size to something not aligned to a block, and got this
> for parent=0:

Same again, but this time the writeback thread will be seeing
delalloc latencies w.r.t. AGF locks vs incoming directory and inode
chunk allocation operations. That can be seen by:

> 
> #  fs_mark  -w  778  -D  1000  -S  0  -n  6000  -s  16384  -L  8  -d  /nvme/0  -d  /nvme/1  -d  /nvme/2  -d  /nvme/3  -d  /nvme/4  -d  /nvme/5  -d  /nvme/6  -d  /nvme/7  -d  /nvme/8  -d  /nvme/9  -d  /nvme/10  -d  /nvme/11  -d  /nvme/12  -d  /nvme/13  -d  /nvme/14  -d  /nvme/15  -d  /nvme/16  -d  /nvme/17  -d  /nvme/18  -d  /nvme/19  -d  /nvme/20  -d  /nvme/21  -d  /nvme/22  -d  /nvme/23  -d  /nvme/24  -d  /nvme/25  -d  /nvme/26  -d  /nvme/27  -d  /nvme/28  -d  /nvme/29  -d  /nvme/30  -d  /nvme/31  -d  /nvme/32  -d  /nvme/33  -d  /nvme/34  -d  /nvme/35  -d  /nvme/36  -d  /nvme/37  -d  /nvme/38  -d  /nvme/39 
> #       Version 3.3, 40 thread(s) starting at Wed Dec 10 15:21:38 2025
> #       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
> #       Directories:  Time based hash between directories across 1000 subdirectories with 180 seconds per subdirectory.
> #       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
> #       Files info: size 16384 bytes, written with an IO size of 778 bytes per write
> #       App overhead is time in microseconds spent in the test not doing file writing related system calls.
> 
> FSUse%        Count         Size    Files/sec     App Overhead
>      2       240000        16384      40085.3          2492281
>      2       480000        16384      37026.7          2780077
>      2       720000        16384      28445.5          2591461
>      3       960000        16384      28888.6          2595817
>      3      1200000        16384      25160.8          2903882
>      3      1440000        16384      29372.1          2600018
>      3      1680000        16384      26443.9          2732790
>      4      1920000        16384      26307.1          2758750
> 
> real    1m11.633s
> user    0m46.156s
> sys     3m24.543s

.. creates only managing ~270% CPU utilisation for a 40-way
operation.

IOWs, parent pointer overhead is noise compared to the losses caused
by data writeback locking/throttling interactions, so nothing can
really be concluded from the results here.

> Conclusion: There are noticeable overheads to enabling parent pointers,
> but counterbalancing that, we can now repair an entire filesystem,
> directory tree and all.

True, but I think that the unlink overhead is significant enough
that we need to address that before enabling PP by default for
everyone.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] mkfs: enable new features by default
  2025-12-15 23:59       ` Dave Chinner
@ 2025-12-16 23:07         ` Darrick J. Wong
  0 siblings, 0 replies; 13+ messages in thread
From: Darrick J. Wong @ 2025-12-16 23:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: aalbersh, linux-xfs

On Tue, Dec 16, 2025 at 10:59:42AM +1100, Dave Chinner wrote:
> On Wed, Dec 10, 2025 at 03:49:28PM -0800, Darrick J. Wong wrote:
> > On Wed, Dec 10, 2025 at 09:25:24AM +1100, Dave Chinner wrote:
> > > On Tue, Dec 09, 2025 at 08:16:08AM -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > 
> > > > Since the LTS is coming up, enable parent pointers and exchange-range by
> > > > default for all users.  Also fix up an out of date comment.
> > > > 
> > > > I created a really stupid benchmarking script that does:
> > > > 
> > > > #!/bin/bash
> > > > 
> > > > # pptr overhead benchmark
> > > > 
> > > > umount /opt /mnt
> > > > rmmod xfs
> > > > for i in 1 0; do
> > > > 	umount /opt
> > > > 	mkfs.xfs -f /dev/sdb -n parent=$i | grep -i parent=
> > > > 	mount /dev/sdb /opt
> > > > 	mkdir -p /opt/foo
> > > > 	for ((i=0;i<5;i++)); do
> > > > 		time fsstress -n 100000 -p 4 -z -f creat=1 -d /opt/foo -s 1
> > > > 	done
> > > > done
> > > 
> > > Hmmm. fsstress is an interesting choice here...
> > 
> > <flush all the old benchmarks and conclusions>
> > 
> > I have an old 40-core Xeon E5-2660V3 with a pair of 1.5T Intel nvme ssds
> > and 128G of RAM running 6.18.0.  For this sample, I tried to keep the
> > memory usage well below the amount of DRAM so that I could measure the
> > pure overhead of writing parent pointers out to disk and not anything
> > else.  I also omit ls'ing and chmod'ing the directory tree because
> > neither of those operations touch parent pointers.  I also left the
> > logbsize at the defaults (32k) because that's what most users get.
> 
> ok.
> 
> .....
> 
> > benchme() {
> >         agcount="$(xfs_info /nvme/ | grep agcount= | sed -e 's/^.*agcount=//g' -e 's/,.*$//g')"
> >         dirs=()
> >         mkdirme
> > 
> >         #time ~djwong/cdev/work/fstests/build-x86_64/ltp/fsstress -n 400000 -p 40 -z -f creat=1,mkdir=1,rmdir=1,unlink=1 -d /nvme/ -s 1
> >         time fs_mark -w "${writesz}" -D "${subdirs}" -S 0 -n "${files_per_iter}" -s "${filesz}" -L "${iter}" "${dirs[@]}"
> > 
> >         time bulkme
> >         time rmdirme
> 
> Ok, so this is testing cache-hot bulkstat and rm, so it's not
> exercising the cold-read path and hence isn't reading and
> initialising parent pointers for unlinking. Can you drop caches
> between the bulkstat and the unlink phases so we exercise cold cache
> parent pointer instantiation overhead somewhere?
> 
> > }
> > 
> > for p in 0 1; do
> >         umount /dev/nvme1n1 /nvme /mnt
> >         #mkfs.xfs -f -l logdev=/dev/nvme0n1,size=1g /dev/nvme1n1 -n parent=$p || break
> >         mkfs.xfs -f -l logdev=/dev/nvme0n1,size=1g /dev/nvme1n1 $feature=$p || break
> >         mount /dev/nvme1n1 /nvme/ -o logdev=/dev/nvme0n1 || break
> >         benchme
> >         umount /dev/nvme1n1 /nvme /mnt
> > done
> > 
> > I get this mkfs output:
> > # mkfs.xfs -f -l logdev=/dev/nvme0n1,size=1g /dev/nvme1n1
> > meta-data=/dev/nvme1n1           isize=512    agcount=40, agsize=9767586 blks
> >          =                       sectsz=4096  attr=2, projid32bit=1
> >          =                       crc=1        finobt=1, sparse=1, rmapbt=1
> >          =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
> >          =                       exchange=0   metadir=0
> > data     =                       bsize=4096   blocks=390703440, imaxpct=5
> >          =                       sunit=0      swidth=0 blks
> > naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=0
> > log      =/dev/nvme0n1           bsize=4096   blocks=262144, version=2
> >          =                       sectsz=4096  sunit=1 blks, lazy-count=1
> > realtime =none                   extsz=4096   blocks=0, rtextents=0
> >          =                       rgcount=0    rgsize=0 extents
> >          =                       zoned=0      start=0 reserved=0
> > # grep nvme1n1 /proc/mounts
> > /dev/nvme1n1 /nvme xfs rw,relatime,inode64,logbufs=8,logbsize=32k,logdev=/dev/nvme0n1,noquota 0 0
> > 
> > and this output from fsmark with parent=0:
> 
> ....
> 
> a table-based summary would have made this easier to read
> 
> 	parent		real		user		sys
> create	0		0m57.573s	3m53.578s	19m44.440s
> create	1		1m2.934s	3m53.508s	25m14.810s
> 
> bulk	0		0m1.122s	0m0.955s	0m39.306s
> bulk	1		0m1.158s	0m0.882s	0m39.847s
> 
> unlink	0		0m59.649s	0m41.196s	13m9.566s
> unlink	1		1m12.505s	0m47.489s	20m33.844s
> 
> > fs_mark itself shows a decrease in file creation/sec of about 9%, an
> > increase in wall clock time of about 9%, and an increase in kernel time
> > of about 28%.  That's to be expected, since parent pointer updates cause
> > directory entry creation and deletion to hold more ILOCKs and for
> > longer.
> 
> ILOCK isn't an issue with this test - the whole point of the
> segmented directory structure is that each thread operates in its
> own directory, so there is no ILOCK contention at all. i.e. the
> entire difference is the CPU overhead of adding the xattr fork
> and creating the parent pointer xattr.
> 
> I suspect that the create side overhead is probably acceptable,
> because we also typically add security xattrs at create time and
> these will be slightly faster as the xattr fork is already
> prepared...
> 
> > Parallel bulkstat (aka bulkme) shows an increase in wall time of 3% and
> > system time of 1%, which is not surprising since that's just walking the
> > inode btrees and reading inode cores, with no parent pointers involved.
> 
> I was more interested in the cold cache behaviour - hot cache is
> generally uninteresting as the XFS inode cache scales pretty much
> perfectly in this case. Reading the inodes from disk, OTOH, adds a
> whole heap of instantiation and lock contention overhead and changes
> the picture significantly. I'm interested to know what the impact of
> having PPs is in that case....
> 
> > Similarly, deleting all the files created by fs_mark shows an increase
> > in wall time of about 21% and an increase in system time of about 56%.
> > I concede that parent pointers have a fair amount of overhead for the
> > worst case of creating or deleting a large directory tree.
> 
> Ok, so an increase in unlink CPU overhead of 56% is pretty bad. On
> single threaded workloads, that's going to equate to a ~50%
> reduction in performance for operations that perform unlinks in CPU
> bound loops (e.g. rm -rf on hot caches). Note that the above test is
> not CPU bound - it's only running at about 50% CPU utilisation
> because of some other contention point in the fs (possibly log space
> or pinned/stale directory buffers requiring a log force to clear).
> 
> However, results like this make me think that PP unlink hasn't been
> optimised for the common case: removing the last parent pointer
> (i.e. nlink 1 -> 0) when the inode is being placed on the unlinked
> list in syscall context. This is the common case in the absence of
> hard links, and it puts the PP xattr removal directly in application
> task context.
> 
> In this case, it seems to me that we don't actually need
> to remove the parent pointer xattr. When the inode is inactivated by
> background inodegc after last close, the xattr fork is truncated and
> that will remove all xattrs including the stale remaining PP without
> needing to make a specific PP transaction.
> 
> Doing this would remove the PP overhead completely from the final
> unlink syscall path. It would only add minimal extra overhead on
> the inodegc side as (in the common case) we have to remove security
> xattrs in inodegc. 

At some point hch suggested that the parent pointer code could shortcut
the entire xattr intent machinery if the child file has shortform
xattrs.  For this fsmark benchmark where we're creating a lot of empty
files, doing so actually /does/ cut the creation overhead from ~30% to
~3%; and the deletion overhead to nearly zero.

diff --git a/fs/xfs/libxfs/xfs_attr_leaf.h b/fs/xfs/libxfs/xfs_attr_leaf.h
index 589f810eedc0d8..c59e5ef47ed95d 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.h
+++ b/fs/xfs/libxfs/xfs_attr_leaf.h
@@ -49,6 +49,7 @@ void	xfs_attr_shortform_create(struct xfs_da_args *args);
 void	xfs_attr_shortform_add(struct xfs_da_args *args, int forkoff);
 int	xfs_attr_shortform_getvalue(struct xfs_da_args *args);
 int	xfs_attr_shortform_to_leaf(struct xfs_da_args *args);
+int	xfs_attr_try_sf_addname(struct xfs_inode *dp, struct xfs_da_args *args);
 int	xfs_attr_sf_removename(struct xfs_da_args *args);
 struct xfs_attr_sf_entry *xfs_attr_sf_findname(struct xfs_da_args *args);
 int	xfs_attr_shortform_allfit(struct xfs_buf *bp, struct xfs_inode *dp);
diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
index 8c04acd30d489c..89cc913a2b4345 100644
--- a/fs/xfs/libxfs/xfs_attr.c
+++ b/fs/xfs/libxfs/xfs_attr.c
@@ -349,7 +349,7 @@ xfs_attr_set_resv(
  * xfs_attr_shortform_addname() will convert to leaf format and return -ENOSPC.
  * to use.
  */
-STATIC int
+int
 xfs_attr_try_sf_addname(
 	struct xfs_inode	*dp,
 	struct xfs_da_args	*args)
diff --git a/fs/xfs/libxfs/xfs_parent.c b/fs/xfs/libxfs/xfs_parent.c
index 69366c44a70159..048f822951103c 100644
--- a/fs/xfs/libxfs/xfs_parent.c
+++ b/fs/xfs/libxfs/xfs_parent.c
@@ -29,6 +29,7 @@
 #include "xfs_trans_space.h"
 #include "xfs_attr_item.h"
 #include "xfs_health.h"
+#include "xfs_attr_leaf.h"
 
 struct kmem_cache		*xfs_parent_args_cache;
 
@@ -202,6 +203,16 @@ xfs_parent_addname(
 	xfs_inode_to_parent_rec(&ppargs->rec, dp);
 	xfs_parent_da_args_init(&ppargs->args, tp, &ppargs->rec, child,
 			child->i_ino, parent_name);
+
+	if (xfs_inode_has_attr_fork(child) &&
+	    xfs_attr_is_shortform(child)) {
+		ppargs->args.op_flags |= XFS_DA_OP_ADDNAME;
+
+		error = xfs_attr_try_sf_addname(child, &ppargs->args);
+		if (error != -ENOSPC)
+			return error;
+	}
+
 	xfs_attr_defer_add(&ppargs->args, XFS_ATTR_DEFER_SET);
 	return 0;
 }
@@ -224,6 +235,10 @@ xfs_parent_removename(
 	xfs_inode_to_parent_rec(&ppargs->rec, dp);
 	xfs_parent_da_args_init(&ppargs->args, tp, &ppargs->rec, child,
 			child->i_ino, parent_name);
+
+	if (xfs_attr_is_shortform(child))
+		return xfs_attr_sf_removename(&ppargs->args);
+
 	xfs_attr_defer_add(&ppargs->args, XFS_ATTR_DEFER_REMOVE);
 	return 0;
 }
@@ -250,6 +265,27 @@ xfs_parent_replacename(
 			child->i_ino, old_name);
 
 	xfs_inode_to_parent_rec(&ppargs->new_rec, new_dp);
+
+	if (xfs_attr_is_shortform(child)) {
+		ppargs->args.op_flags |= XFS_DA_OP_ADDNAME | XFS_DA_OP_REPLACE;
+
+		error = xfs_attr_sf_removename(&ppargs->args);
+		if (error)
+			return error;
+
+		xfs_parent_da_args_init(&ppargs->args, tp, &ppargs->new_rec,
+				child, child->i_ino, new_name);
+		ppargs->args.op_flags |= XFS_DA_OP_ADDNAME;
+
+		error = xfs_attr_try_sf_addname(child, &ppargs->args);
+		if (error == -ENOSPC) {
+			xfs_attr_defer_add(&ppargs->args, XFS_ATTR_DEFER_SET);
+			return 0;
+		}
+
+		return error;
+	}
+
 	ppargs->args.new_name = new_name->name;
 	ppargs->args.new_namelen = new_name->len;
 	ppargs->args.new_value = &ppargs->new_rec;

> Hence I think we really need to try to mitigate this common case
> overhead before we make PP the default for everyone. The perf
> decrease
> 
> 
> > If I then re-run the benchmark with a file size of 1M and tell it to
> > create fewer files, then I get the following for parent=0:
> 
> These are largely meaningless as the create benchmark is throttling
> hard on disk bandwidth (1.5-2GB/s) in the write() path, not limited
> by PP overhead.
> 
> The variance in runtime comes from the data IO path behaviour, and
> the lack of sync() operations after the create means that writeback
> is likely still running when the unlink phase runs. Hence it's
> pretty difficult to conclude anything about parent pointers
> themselves because of the other large sources of variance in this
> workload.

They're not meaningless numbers, Dave.  Writing data into user files is
always going to take up a large portion of the time spent creating a
real directory tree.  Anyone unpacking a tarball onto a filesystem can
run into disk throttling on write bandwidth, which just reduces the
relative overhead of the pptr updates further.

The only times it becomes painful are in this microbenchmarking case
where someone is trying to create millions of empty files, and when
deleting a directory tree.

Anyway, we now have a patch, and I'll rerun the benchmark if this
survives overnight testing.

--D

> > I then decided to simulate my maildir spool, which has 670,000 files
> > consuming 12GB for an average file size of 17936 bytes.  I reduced the
> > file size to 16K, increase the number of files per iteration, and set
> > the write buffer size to something not aligned to a block, and got this
> > for parent=0:
> 
> Same again, but this time the writeback thread will be seeing
> delalloc latencies w.r.t. AGF locks vs incoming directory and inode
> chunk allocation operations. That can be seen by:
> 
> > 
> > #  fs_mark  -w  778  -D  1000  -S  0  -n  6000  -s  16384  -L  8  -d  /nvme/0  -d  /nvme/1  -d  /nvme/2  -d  /nvme/3  -d  /nvme/4  -d  /nvme/5  -d  /nvme/6  -d  /nvme/7  -d  /nvme/8  -d  /nvme/9  -d  /nvme/10  -d  /nvme/11  -d  /nvme/12  -d  /nvme/13  -d  /nvme/14  -d  /nvme/15  -d  /nvme/16  -d  /nvme/17  -d  /nvme/18  -d  /nvme/19  -d  /nvme/20  -d  /nvme/21  -d  /nvme/22  -d  /nvme/23  -d  /nvme/24  -d  /nvme/25  -d  /nvme/26  -d  /nvme/27  -d  /nvme/28  -d  /nvme/29  -d  /nvme/30  -d  /nvme/31  -d  /nvme/32  -d  /nvme/33  -d  /nvme/34  -d  /nvme/35  -d  /nvme/36  -d  /nvme/37  -d  /nvme/38  -d  /nvme/39 
> > #       Version 3.3, 40 thread(s) starting at Wed Dec 10 15:21:38 2025
> > #       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
> > #       Directories:  Time based hash between directories across 1000 subdirectories with 180 seconds per subdirectory.
> > #       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
> > #       Files info: size 16384 bytes, written with an IO size of 778 bytes per write
> > #       App overhead is time in microseconds spent in the test not doing file writing related system calls.
> > 
> > FSUse%        Count         Size    Files/sec     App Overhead
> >      2       240000        16384      40085.3          2492281
> >      2       480000        16384      37026.7          2780077
> >      2       720000        16384      28445.5          2591461
> >      3       960000        16384      28888.6          2595817
> >      3      1200000        16384      25160.8          2903882
> >      3      1440000        16384      29372.1          2600018
> >      3      1680000        16384      26443.9          2732790
> >      4      1920000        16384      26307.1          2758750
> > 
> > real    1m11.633s
> > user    0m46.156s
> > sys     3m24.543s
> 
> .. creates only managing ~270% CPU utilisation for a 40-way
> operation.
> 
> IOWs, parent pointer overhead is noise compared to the losses caused
> by data writeback locking/throttling interactions, so nothing can
> really be concluded from the data here.
> 
> > Conclusion: There are noticeable overheads to enabling parent pointers,
> > but counterbalancing that, we can now repair an entire filesystem,
> > directory tree and all.
> 
> True, but I think that the unlink overhead is significant enough
> that we need to address that before enabling PP by default for
> everyone.
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


Thread overview: 13+ messages
2025-12-02  1:27 [PATCHSET 2/2] xfsprogs: enable new stable features for 6.18 Darrick J. Wong
2025-12-02  1:28 ` [PATCH 1/2] mkfs: enable new features by default Darrick J. Wong
2025-12-02  7:38   ` Christoph Hellwig
2025-12-03  0:53     ` Darrick J. Wong
2025-12-03  6:31       ` Christoph Hellwig
2025-12-04 18:48         ` Darrick J. Wong
2025-12-02  1:28 ` [PATCH 2/2] mkfs: add 2025 LTS config file Darrick J. Wong
  -- strict thread matches above, loose matches on Subject: below --
2025-12-09 16:16 [PATCHSET V2] xfsprogs: enable new stable features for 6.18 Darrick J. Wong
2025-12-09 16:16 ` [PATCH 1/2] mkfs: enable new features by default Darrick J. Wong
2025-12-09 16:22   ` Christoph Hellwig
2025-12-09 22:25   ` Dave Chinner
2025-12-10 23:49     ` Darrick J. Wong
2025-12-15 23:59       ` Dave Chinner
2025-12-16 23:07         ` Darrick J. Wong
