Date: Tue, 16 Dec 2025 10:59:42 +1100
From: Dave Chinner
To: "Darrick J. Wong"
Cc: aalbersh@kernel.org, linux-xfs@vger.kernel.org
Subject: Re: [PATCH 1/2] mkfs: enable new features by default
References: <176529676119.3974899.4941979844964370861.stgit@frogsfrogsfrogs>
 <176529676146.3974899.6119777261763784206.stgit@frogsfrogsfrogs>
 <20251210234928.GE7725@frogsfrogsfrogs>
In-Reply-To: <20251210234928.GE7725@frogsfrogsfrogs>

On Wed, Dec 10, 2025 at 03:49:28PM -0800, Darrick J. Wong wrote:
> On Wed, Dec 10, 2025 at 09:25:24AM +1100, Dave Chinner wrote:
> > On Tue, Dec 09, 2025 at 08:16:08AM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong
> > >
> > > Since the LTS is coming up, enable parent pointers and exchange-range by
> > > default for all users. Also fix up an out of date comment.
> > >
> > > I created a really stupid benchmarking script that does:
> > >
> > > #!/bin/bash
> > >
> > > # pptr overhead benchmark
> > >
> > > umount /opt /mnt
> > > rmmod xfs
> > > for i in 1 0; do
> > > 	umount /opt
> > > 	mkfs.xfs -f /dev/sdb -n parent=$i | grep -i parent=
> > > 	mount /dev/sdb /opt
> > > 	mkdir -p /opt/foo
> > > 	for ((i=0;i<5;i++)); do
> > > 		time fsstress -n 100000 -p 4 -z -f creat=1 -d /opt/foo -s 1
> > > 	done
> > > done
> >
> > Hmmm. fsstress is an interesting choice here...
> >
>
> I have an old 40-core Xeon E5-2660V3 with a pair of 1.5T Intel nvme ssds
> and 128G of RAM running 6.18.0. For this sample, I tried to keep the
> memory usage well below the amount of DRAM so that I could measure the
> pure overhead of writing parent pointers out to disk and not anything
> else. I also omit ls'ing and chmod'ing the directory tree because
> neither of those operations touch parent pointers. I also left the
> logbsize at the defaults (32k) because that's what most users get.

ok.

.....
> benchme() {
> 	agcount="$(xfs_info /nvme/ | grep agcount= | sed -e 's/^.*agcount=//g' -e 's/,.*$//g')"
> 	dirs=()
> 	mkdirme
>
> 	#time ~djwong/cdev/work/fstests/build-x86_64/ltp/fsstress -n 400000 -p 40 -z -f creat=1,mkdir=1,rmdir=1,unlink=1 -d /nvme/ -s 1
> 	time fs_mark -w "${writesz}" -D "${subdirs}" -S 0 -n "${files_per_iter}" -s "${filesz}" -L "${iter}" "${dirs[@]}"
>
> 	time bulkme
> 	time rmdirme

Ok, so this is testing cache-hot bulkstat and rm, so it's not exercising
the cold-read path and hence does not need to read and initialise parent
pointers for unlinking. Can you drop caches between the bulkstat and the
unlink phases so we exercise cold cache parent pointer instantiation
overhead somewhere?

> }
>
> for p in 0 1; do
> 	umount /dev/nvme1n1 /nvme /mnt
> 	#mkfs.xfs -f -l logdev=/dev/nvme0n1,size=1g /dev/nvme1n1 -n parent=$p || break
> 	mkfs.xfs -f -l logdev=/dev/nvme0n1,size=1g /dev/nvme1n1 $feature=$p || break
> 	mount /dev/nvme1n1 /nvme/ -o logdev=/dev/nvme0n1 || break
> 	benchme
> 	umount /dev/nvme1n1 /nvme /mnt
> done
>
> I get this mkfs output:
> # mkfs.xfs -f -l logdev=/dev/nvme0n1,size=1g /dev/nvme1n1
> meta-data=/dev/nvme1n1          isize=512    agcount=40, agsize=9767586 blks
>          =                      sectsz=4096  attr=2, projid32bit=1
>          =                      crc=1        finobt=1, sparse=1, rmapbt=1
>          =                      reflink=1    bigtime=1 inobtcount=1 nrext64=1
>          =                      exchange=0   metadir=0
> data     =                      bsize=4096   blocks=390703440, imaxpct=5
>          =                      sunit=0      swidth=0 blks
> naming   =version 2             bsize=4096   ascii-ci=0, ftype=1, parent=0
> log      =/dev/nvme0n1          bsize=4096   blocks=262144, version=2
>          =                      sectsz=4096  sunit=1 blks, lazy-count=1
> realtime =none                  extsz=4096   blocks=0, rtextents=0
>          =                      rgcount=0    rgsize=0 extents
>          =                      zoned=0      start=0 reserved=0
> # grep nvme1n1 /proc/mounts
> /dev/nvme1n1 /nvme xfs rw,relatime,inode64,logbufs=8,logbsize=32k,logdev=/dev/nvme0n1,noquota 0 0
>
> and this output from fsmark with parent=0:
....

A table-based summary would have made this easier to read:

		parent	real		user		sys
	create	0	0m57.573s	3m53.578s	19m44.440s
	create	1	1m2.934s	3m53.508s	25m14.810s
	bulk	0	0m1.122s	0m0.955s	0m39.306s
	bulk	1	0m1.158s	0m0.882s	0m39.847s
	unlink	0	0m59.649s	0m41.196s	13m9.566s
	unlink	1	1m12.505s	0m47.489s	20m33.844s

> fs_mark itself shows a decrease in file creation/sec of about 9%, an
> increase in wall clock time of about 9%, and an increase in kernel time
> of about 28%. That's to be expected, since parent pointer updates cause
> directory entry creation and deletion to hold more ILOCKs and for
> longer.

ILOCK isn't an issue with this test - the whole point of the segmented
directory structure is that each thread operates in its own directory,
so there is no ILOCK contention at all. i.e. the entire difference is
the CPU overhead of adding the xattr fork and creating the parent
pointer xattr.

I suspect that the create side overhead is probably acceptable, because
we also typically add security xattrs at create time and these will be
slightly faster as the xattr fork is already prepared...

> Parallel bulkstat (aka bulkme) shows an increase in wall time of 3% and
> system time of 1%, which is not surprising since that's just walking the
> inode btree and cores, no parent pointers involved.

I was more interested in the cold cache behaviour - hot cache is
generally uninteresting as the XFS inode cache scales pretty much
perfectly in this case. Reading the inodes from disk, OTOH, adds a
whole heap of instantiation and lock contention overhead and changes
the picture significantly. I'm interested to know what the impact of
having PPs is in that case....
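i.e. something like this in benchme() (untested sketch - the coldcache
helper is just a name I've made up here, everything else is unchanged
from your script):

	# write back dirty data/metadata, then drop clean pagecache,
	# dentries and inodes so the next phase has to re-read inodes
	# (and any parent pointer xattrs) from disk
	coldcache() {
		sync
		echo 3 > /proc/sys/vm/drop_caches
	}

	time fs_mark -w "${writesz}" -D "${subdirs}" -S 0 \
		-n "${files_per_iter}" -s "${filesz}" -L "${iter}" "${dirs[@]}"

	coldcache
	time bulkme		# cold cache bulkstat

	coldcache
	time rmdirme		# cold cache unlink w/ PP instantiation

That would give us cold cache inode instantiation numbers for both the
bulkstat and unlink phases, with and without parent pointers.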
> Similarly, deleting all the files created by fs_mark shows an increase
> in wall time of about ~21% and an increase in system time of about 56%.
> I concede that parent pointers has a fair amount of overhead for the
> worst case of creating a large directory tree or deleting it.

Ok, so an increase in unlink CPU overhead of 56% is pretty bad. On
single threaded workloads, that's going to equate to a ~50% reduction
in performance for operations that perform unlinks in CPU bound loops
(e.g. rm -rf on hot caches).

Note that the above test is not CPU bound - it's only running at about
50% CPU utilisation because of some other contention point in the fs
(possibly log space or pinned/stale directory buffers requiring a log
force to clear).
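The single threaded hot cache case is easy enough to check directly
with something like this (untested sketch - the directory name and
file count are arbitrary):

	# fill a single directory, warm the dentry/inode cache, then time
	# a single threaded unlink loop while everything is still cached
	mkdir /nvme/unlinktest
	(cd /nvme/unlinktest && seq -f "f%.0f" 0 99999 | xargs touch)
	ls -lR /nvme/unlinktest > /dev/null

	perf stat -- rm -rf /nvme/unlinktest

Run that against a parent=0 and a parent=1 filesystem and compare the
task-clock and elapsed time numbers - that should show the pure CPU
cost of the extra PP removal work without any of the other contention
in play.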
However, results like this make me think that PP unlink hasn't been
optimised for the common case: removing the last parent pointer (i.e.
nlink 1 -> 0) when the inode is being placed on the unlinked list in
syscall context. This is the common case in the absence of hard links,
and it puts the PP xattr removal directly in application task context.

In this case, it seems to me that we don't actually need to remove the
parent pointer xattr. When the inode is inactivated by background
inodegc after last close, the xattr fork is truncated and that will
remove all xattrs including the stale remaining PP without needing to
make a specific PP transaction.

Doing this would remove the PP overhead completely from the final
unlink syscall path. It would only add minimal extra overhead on the
inodegc side as (in the common case) we have to remove security xattrs
in inodegc.

Hence I think we really need to try to mitigate this common case
overhead before we make PP the default for everyone.

> If I then re-run the benchmark with a file size of 1M and tell it to
> create fewer files, then I get the following for parent=0:

These are largely meaningless as the create benchmark is throttling
hard on disk bandwidth (1.5-2GB/s) in the write() path, not limited by
PP overhead. The variance in runtime comes from the data IO path
behaviour, and the lack of sync() operations after the create means
that writeback is likely still running when the unlink phase runs.
Hence it's pretty difficult to conclude anything about parent pointers
themselves because of the other large variances in this workload.

> I then decided to simulate my maildir spool, which has 670,000 files
> consuming 12GB for an average file size of 17936 bytes. I reduced the
> file size to 16K, increase the number of files per iteration, and set
> the write buffer size to something not aligned to a block, and got this
> for parent=0:

Same again, but this time the writeback thread will be seeing delalloc
latencies w.r.t. AGF locks vs incoming directory and inode chunk
allocation operations. That can be seen by:

> # fs_mark -w 778 -D 1000 -S 0 -n 6000 -s 16384 -L 8 -d /nvme/0 -d /nvme/1 -d /nvme/2 -d /nvme/3 -d /nvme/4 -d /nvme/5 -d /nvme/6 -d /nvme/7 -d /nvme/8 -d /nvme/9 -d /nvme/10 -d /nvme/11 -d /nvme/12 -d /nvme/13 -d /nvme/14 -d /nvme/15 -d /nvme/16 -d /nvme/17 -d /nvme/18 -d /nvme/19 -d /nvme/20 -d /nvme/21 -d /nvme/22 -d /nvme/23 -d /nvme/24 -d /nvme/25 -d /nvme/26 -d /nvme/27 -d /nvme/28 -d /nvme/29 -d /nvme/30 -d /nvme/31 -d /nvme/32 -d /nvme/33 -d /nvme/34 -d /nvme/35 -d /nvme/36 -d /nvme/37 -d /nvme/38 -d /nvme/39
> # Version 3.3, 40 thread(s) starting at Wed Dec 10 15:21:38 2025
> # Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
> # Directories: Time based hash between directories across 1000 subdirectories with 180 seconds per subdirectory.
> # File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
> # Files info: size 16384 bytes, written with an IO size of 778 bytes per write
> # App overhead is time in microseconds spent in the test not doing file writing related system calls.
>
> FSUse%        Count         Size    Files/sec     App Overhead
>      2       240000        16384      40085.3          2492281
>      2       480000        16384      37026.7          2780077
>      2       720000        16384      28445.5          2591461
>      3       960000        16384      28888.6          2595817
>      3      1200000        16384      25160.8          2903882
>      3      1440000        16384      29372.1          2600018
>      3      1680000        16384      26443.9          2732790
>      4      1920000        16384      26307.1          2758750
>
> real    1m11.633s
> user    0m46.156s
> sys     3m24.543s

.... the creates are only managing ~270% CPU utilisation for a 40-way
operation. IOWs, parent pointer overhead is noise compared to the
losses caused by data writeback locking/throttling interactions, so
nothing can really be concluded from here.

> Conclusion: There are noticeable overheads to enabling parent pointers,
> but counterbalancing that, we can now repair an entire filesystem,
> directory tree and all.

True, but I think that the unlink overhead is significant enough that
we need to address that before enabling PP by default for everyone.

-Dave.
-- 
Dave Chinner
david@fromorbit.com