public inbox for linux-xfs@vger.kernel.org
* [PATCHSET] xfs: random fixes for 6.10
@ 2024-06-12 17:46 Darrick J. Wong
  2024-06-12 17:46 ` [PATCH 1/5] xfs: don't treat append-only files as having preallocations Darrick J. Wong
                   ` (4 more replies)
  0 siblings, 5 replies; 24+ messages in thread
From: Darrick J. Wong @ 2024-06-12 17:46 UTC
  To: hch, djwong, chandanbabu; +Cc: linux-xfs

Hi all,

Here are some bugfixes for 6.10.  The first two patches are from hch,
and fix some longstanding delalloc leaks that only came to light now
that we've enabled delalloc for realtime.

The next two fixes are from me -- one fixes a bug when we run out
of space for cow preallocations when alwayscow is turned on (xfs/205),
and the other corrects overzealous inode validation that causes log
recovery failure with generic/388.

The last patch is a debugging patch to ensure that transactions never
commit corrupt inodes, buffers, or dquots.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been lightly tested with fstests.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=random-fixes-6.10
---
Commits in this patchset:
 * xfs: don't treat append-only files as having preallocations
 * xfs: fix freeing speculative preallocations for preallocated files
 * xfs: restrict when we try to align cow fork delalloc to cowextsz hints
 * xfs: allow unlinked symlinks and dirs with zero size
 * xfs: verify buffer, inode, and dquot items every tx commit
---
 fs/xfs/libxfs/xfs_bmap.c      |   14 +++++++++++---
 fs/xfs/libxfs/xfs_bmap.h      |    2 +-
 fs/xfs/libxfs/xfs_inode_buf.c |   23 ++++++++++++++++++-----
 fs/xfs/xfs_bmap_util.c        |   37 +++++++++++++++++++++++--------------
 fs/xfs/xfs_bmap_util.h        |    2 +-
 fs/xfs/xfs_buf_item.c         |   32 ++++++++++++++++++++++++++++++++
 fs/xfs/xfs_dquot_item.c       |   31 +++++++++++++++++++++++++++++++
 fs/xfs/xfs_icache.c           |    4 ++--
 fs/xfs/xfs_inode.c            |   14 ++++----------
 fs/xfs/xfs_inode_item.c       |   32 ++++++++++++++++++++++++++++++++
 fs/xfs/xfs_iomap.c            |   14 ++++++++++++--
 11 files changed, 167 insertions(+), 38 deletions(-)



* [PATCH 1/5] xfs: don't treat append-only files as having preallocations
  2024-06-12 17:46 [PATCHSET] xfs: random fixes for 6.10 Darrick J. Wong
@ 2024-06-12 17:46 ` Darrick J. Wong
  2024-06-13  6:03   ` Dave Chinner
  2024-06-12 17:47 ` [PATCH 2/5] xfs: fix freeing speculative preallocations for preallocated files Darrick J. Wong
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2024-06-12 17:46 UTC
  To: hch, djwong, chandanbabu; +Cc: linux-xfs

From: Christoph Hellwig <hch@lst.de>

The XFS XFS_DIFLAG_APPEND flag maps to the VFS S_APPEND flag, which forbids
writes that don't append at the current EOF.

But the commit originally adding XFS_DIFLAG_APPEND support (commit
a23321e766d in the xfs xfs-import repository) also checked it to skip
releasing speculative preallocations, which doesn't make any sense.

Another commit (dd9f438e3290 in the xfs-import repository) later extended
that flag to also report these speculative preallocations, which should
not exist, in getbmap.

Remove these checks, as nothing in XFS_DIFLAG_APPEND implies that
preallocations beyond EOF should exist.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_bmap_util.c |    9 ++++-----
 fs/xfs/xfs_icache.c    |    2 +-
 2 files changed, 5 insertions(+), 6 deletions(-)


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index ac2e77ebb54c..eb8056b1c906 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -331,8 +331,7 @@ xfs_getbmap(
 		}
 
 		if (xfs_get_extsz_hint(ip) ||
-		    (ip->i_diflags &
-		     (XFS_DIFLAG_PREALLOC | XFS_DIFLAG_APPEND)))
+		    (ip->i_diflags & XFS_DIFLAG_PREALLOC))
 			max_len = mp->m_super->s_maxbytes;
 		else
 			max_len = XFS_ISIZE(ip);
@@ -526,10 +525,10 @@ xfs_can_free_eofblocks(
 		return false;
 
 	/*
-	 * Do not free real preallocated or append-only files unless the file
-	 * has delalloc blocks and we are forced to remove them.
+	 * Do not free real extents in preallocated files unless the file has
+	 * delalloc blocks and we are forced to remove them.
 	 */
-	if (ip->i_diflags & (XFS_DIFLAG_PREALLOC | XFS_DIFLAG_APPEND))
+	if (ip->i_diflags & XFS_DIFLAG_PREALLOC)
 		if (!force || ip->i_delayed_blks == 0)
 			return false;
 
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 0953163a2d84..41b8a5c4dd69 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1158,7 +1158,7 @@ xfs_inode_free_eofblocks(
 	if (xfs_can_free_eofblocks(ip, false))
 		return xfs_free_eofblocks(ip);
 
-	/* inode could be preallocated or append-only */
+	/* inode could be preallocated */
 	trace_xfs_inode_free_eofblocks_invalid(ip);
 	xfs_inode_clear_eofblocks_tag(ip);
 	return 0;



* [PATCH 2/5] xfs: fix freeing speculative preallocations for preallocated files
  2024-06-12 17:46 [PATCHSET] xfs: random fixes for 6.10 Darrick J. Wong
  2024-06-12 17:46 ` [PATCH 1/5] xfs: don't treat append-only files as having preallocations Darrick J. Wong
@ 2024-06-12 17:47 ` Darrick J. Wong
  2024-06-12 17:47 ` [PATCH 3/5] xfs: restrict when we try to align cow fork delalloc to cowextsz hints Darrick J. Wong
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 24+ messages in thread
From: Darrick J. Wong @ 2024-06-12 17:47 UTC
  To: hch, djwong, chandanbabu; +Cc: linux-xfs

From: Christoph Hellwig <hch@lst.de>

xfs_can_free_eofblocks returns false for files that have persistent
preallocations unless the force flag is passed and there are delayed
blocks.  This means it won't free delalloc reservations for files
with persistent preallocations unless the force flag is set, and it
will also free the persistent preallocations if the force flag is
set and the file happens to have delayed allocations.

Both of these are bad, so do away with the force flag and, for files
with the XFS_DIFLAG_PREALLOC flag set, always free only the post-EOF
delayed allocations.
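
As a rough illustration of the policy change described above, the old and
new handling of post-EOF blocks might be modelled as follows.  This is a
standalone userspace sketch, not kernel code: struct model_inode, enum
eof_action and both policy functions are hypothetical names, and the other
checks performed by xfs_can_free_eofblocks are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What happens to blocks beyond EOF (model only). */
enum eof_action {
	EOF_KEEP_ALL,		/* leave post-EOF blocks alone */
	EOF_FREE_DELALLOC_ONLY,	/* punch delalloc, keep real extents */
	EOF_FREE_ALL,		/* free everything past EOF */
};

/* Hypothetical stand-in for the two inode fields the patch looks at. */
struct model_inode {
	bool		prealloc_flag;	/* models XFS_DIFLAG_PREALLOC */
	unsigned long	delayed_blks;	/* models ip->i_delayed_blks */
};

/* Behaviour before this patch: the caller-supplied force flag decides. */
static enum eof_action
old_eof_policy(const struct model_inode *ip, bool force)
{
	assert(ip != NULL);
	if (ip->prealloc_flag && (!force || ip->delayed_blks == 0))
		return EOF_KEEP_ALL;
	return EOF_FREE_ALL;
}

/* Behaviour after this patch: no force flag, only delalloc is dropped. */
static enum eof_action
new_eof_policy(const struct model_inode *ip)
{
	assert(ip != NULL);
	if (ip->prealloc_flag)
		return ip->delayed_blks ? EOF_FREE_DELALLOC_ONLY : EOF_KEEP_ALL;
	return EOF_FREE_ALL;
}
```

In this model, the old policy with force clear never touches a preallocated
file, so its delalloc reservations stay behind, and with force set it frees
the persistent preallocation along with the delalloc blocks.  The new policy
only ever drops the delalloc part for such files.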

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_bmap_util.c |   34 ++++++++++++++++++++++------------
 fs/xfs/xfs_bmap_util.h |    2 +-
 fs/xfs/xfs_icache.c    |    2 +-
 fs/xfs/xfs_inode.c     |   14 ++++----------
 4 files changed, 28 insertions(+), 24 deletions(-)


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index eb8056b1c906..3d6896e9e540 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -485,13 +485,11 @@ xfs_bmap_punch_delalloc_range(
 
 /*
  * Test whether it is appropriate to check an inode for and free post EOF
- * blocks. The 'force' parameter determines whether we should also consider
- * regular files that are marked preallocated or append-only.
+ * blocks.
  */
 bool
 xfs_can_free_eofblocks(
-	struct xfs_inode	*ip,
-	bool			force)
+	struct xfs_inode	*ip)
 {
 	struct xfs_bmbt_irec	imap;
 	struct xfs_mount	*mp = ip->i_mount;
@@ -524,13 +522,9 @@ xfs_can_free_eofblocks(
 	if (xfs_need_iread_extents(&ip->i_df))
 		return false;
 
-	/*
-	 * Do not free real extents in preallocated files unless the file has
-	 * delalloc blocks and we are forced to remove them.
-	 */
-	if (ip->i_diflags & XFS_DIFLAG_PREALLOC)
-		if (!force || ip->i_delayed_blks == 0)
-			return false;
+	/* Only free real extents for inodes with persistent preallocations. */
+	if ((ip->i_diflags & XFS_DIFLAG_PREALLOC) && !ip->i_delayed_blks)
+		return false;
 
 	/*
 	 * Do not try to free post-EOF blocks if EOF is beyond the end of the
@@ -583,6 +577,22 @@ xfs_free_eofblocks(
 	/* Wait on dio to ensure i_size has settled. */
 	inode_dio_wait(VFS_I(ip));
 
+	/*
+	 * For preallocated files only free delayed allocations.
+	 *
+	 * Note that this means we also leave speculative preallocations in
+	 * place for preallocated files.
+	 */
+	if (ip->i_diflags & XFS_DIFLAG_PREALLOC) {
+		if (ip->i_delayed_blks) {
+			xfs_bmap_punch_delalloc_range(ip,
+				round_up(XFS_ISIZE(ip), mp->m_sb.sb_blocksize),
+				LLONG_MAX);
+		}
+		xfs_inode_clear_eofblocks_tag(ip);
+		return 0;
+	}
+
 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
 	if (error) {
 		ASSERT(xfs_is_shutdown(mp));
@@ -890,7 +900,7 @@ xfs_prepare_shift(
 	 * Trim eofblocks to avoid shifting uninitialized post-eof preallocation
 	 * into the accessible region of the file.
 	 */
-	if (xfs_can_free_eofblocks(ip, true)) {
+	if (xfs_can_free_eofblocks(ip)) {
 		error = xfs_free_eofblocks(ip);
 		if (error)
 			return error;
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index 51f84d8ff372..eb0895bfb9da 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -63,7 +63,7 @@ int	xfs_insert_file_space(struct xfs_inode *, xfs_off_t offset,
 				xfs_off_t len);
 
 /* EOF block manipulation functions */
-bool	xfs_can_free_eofblocks(struct xfs_inode *ip, bool force);
+bool	xfs_can_free_eofblocks(struct xfs_inode *ip);
 int	xfs_free_eofblocks(struct xfs_inode *ip);
 
 int	xfs_swap_extents(struct xfs_inode *ip, struct xfs_inode *tip,
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 41b8a5c4dd69..0f07ec842b70 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1155,7 +1155,7 @@ xfs_inode_free_eofblocks(
 	}
 	*lockflags |= XFS_IOLOCK_EXCL;
 
-	if (xfs_can_free_eofblocks(ip, false))
+	if (xfs_can_free_eofblocks(ip))
 		return xfs_free_eofblocks(ip);
 
 	/* inode could be preallocated */
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 58fb7a5062e1..b699fa6ee3b6 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1595,7 +1595,7 @@ xfs_release(
 	if (!xfs_ilock_nowait(ip, XFS_IOLOCK_EXCL))
 		return 0;
 
-	if (xfs_can_free_eofblocks(ip, false)) {
+	if (xfs_can_free_eofblocks(ip)) {
 		/*
 		 * Check if the inode is being opened, written and closed
 		 * frequently and we have delayed allocation blocks outstanding
@@ -1856,15 +1856,13 @@ xfs_inode_needs_inactive(
 
 	/*
 	 * This file isn't being freed, so check if there are post-eof blocks
-	 * to free.  @force is true because we are evicting an inode from the
-	 * cache.  Post-eof blocks must be freed, lest we end up with broken
-	 * free space accounting.
+	 * to free.
 	 *
 	 * Note: don't bother with iolock here since lockdep complains about
 	 * acquiring it in reclaim context. We have the only reference to the
 	 * inode at this point anyways.
 	 */
-	return xfs_can_free_eofblocks(ip, true);
+	return xfs_can_free_eofblocks(ip);
 }
 
 /*
@@ -1947,15 +1945,11 @@ xfs_inactive(
 
 	if (VFS_I(ip)->i_nlink != 0) {
 		/*
-		 * force is true because we are evicting an inode from the
-		 * cache. Post-eof blocks must be freed, lest we end up with
-		 * broken free space accounting.
-		 *
 		 * Note: don't bother with iolock here since lockdep complains
 		 * about acquiring it in reclaim context. We have the only
 		 * reference to the inode at this point anyways.
 		 */
-		if (xfs_can_free_eofblocks(ip, true))
+		if (xfs_can_free_eofblocks(ip))
 			error = xfs_free_eofblocks(ip);
 
 		goto out;



* [PATCH 3/5] xfs: restrict when we try to align cow fork delalloc to cowextsz hints
  2024-06-12 17:46 [PATCHSET] xfs: random fixes for 6.10 Darrick J. Wong
  2024-06-12 17:46 ` [PATCH 1/5] xfs: don't treat append-only files as having preallocations Darrick J. Wong
  2024-06-12 17:47 ` [PATCH 2/5] xfs: fix freeing speculative preallocations for preallocated files Darrick J. Wong
@ 2024-06-12 17:47 ` Darrick J. Wong
  2024-06-13  5:06   ` Christoph Hellwig
  2024-06-12 17:47 ` [PATCH 4/5] xfs: allow unlinked symlinks and dirs with zero size Darrick J. Wong
  2024-06-12 17:47 ` [PATCH 5/5] xfs: verify buffer, inode, and dquot items every tx commit Darrick J. Wong
  4 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2024-06-12 17:47 UTC
  To: hch, djwong, chandanbabu; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

xfs/205 produces the following failure when always_cow is enabled:

  --- a/tests/xfs/205.out	2024-02-28 16:20:24.437887970 -0800
  +++ b/tests/xfs/205.out.bad	2024-06-03 21:13:40.584000000 -0700
  @@ -1,4 +1,5 @@
   QA output created by 205
   *** one file
  +   !!! disk full (expected)
   *** one file, a few bytes at a time
   *** done

This is the result of overly aggressive attempts to align cow fork
delalloc reservations to the CoW extent size hint.  Looking at the trace
data, we're trying to append a single fsblock to the "fred" file.
Trying to create a speculative post-eof reservation fails because
there's not enough space.

We then set @prealloc_blocks to zero and try again, but the cowextsz
alignment code triggers, which expands our request for a 1-fsblock
reservation into a 39-block reservation.  There's not enough space for
that, so the whole write fails with ENOSPC even though there's
sufficient space in the filesystem to allocate the single block that we
need to land the write.

There are two things wrong here -- first, we shouldn't be attempting
speculative preallocations beyond what was requested when we're low on
space.  Second, if we've already computed a posteof preallocation, we
shouldn't bother trying to align that to the cowextsize hint.

Fix both of these problems by adding a flag that only enables the
expansion of the delalloc reservation to the cowextsize if we're doing a
non-extending write, and only if we're not doing an ENOSPC retry.
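
The retry behaviour described above can be sketched with a small userspace
model.  Everything here is an assumption made for illustration: struct
model_fs, aligned_len(), model_reserve() and model_write_begin() are
hypothetical names, the free space accounting is reduced to a single counter,
and the alignment is a plain round-down/round-up rather than the real extent
size alignment code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct model_fs {
	uint64_t	free_blocks;
};

/* Expand [off, off + len) so both ends are multiples of extsz. */
static uint64_t
aligned_len(uint64_t off, uint64_t len, uint64_t extsz)
{
	uint64_t	start = off - (off % extsz);
	uint64_t	end = off + len;

	if (end % extsz)
		end += extsz - (end % extsz);
	return end - start;
}

/* Model of one delalloc reservation attempt. */
static int
model_reserve(struct model_fs *fs, bool cow_fork, uint64_t off, uint64_t len,
	      uint64_t prealloc, bool use_cowextszhint, uint64_t cowextsz)
{
	uint64_t	want = len + prealloc;

	assert(fs != NULL);
	if (cow_fork && use_cowextszhint && cowextsz > 1)
		want = aligned_len(off, want, cowextsz);
	if (want > fs->free_blocks)
		return -ENOSPC;
	fs->free_blocks -= want;
	return 0;
}

/* Model of the caller: align only without prealloc, never on the retry. */
static int
model_write_begin(struct model_fs *fs, bool cow_fork, uint64_t off,
		  uint64_t len, uint64_t prealloc, uint64_t cowextsz)
{
	bool	use_cowextszhint = cow_fork && prealloc == 0;
	int	error;

	for (;;) {
		error = model_reserve(fs, cow_fork, off, len, prealloc,
				      use_cowextszhint, cowextsz);
		if (error != -ENOSPC)
			return error;
		if (!prealloc && !use_cowextszhint)
			return error;
		/* retry once with the bare request */
		prealloc = 0;
		use_cowextszhint = false;
	}
}
```

In this model a one-block CoW write on a nearly full filesystem first fails
with the aligned request, then succeeds on the retry because the retry asks
only for the block that is actually needed.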

I probably should have caught this six years ago when 6ca30729c206d was
being reviewed, but oh well.  Update the comments to reflect what the
code does now.

Fixes: 6ca30729c206d ("xfs: bmap code cleanup")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_bmap.c |   14 +++++++++++---
 fs/xfs/libxfs/xfs_bmap.h |    2 +-
 fs/xfs/xfs_iomap.c       |   14 ++++++++++++--
 3 files changed, 24 insertions(+), 6 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index c101cf266bc4..0dc4ff2fe751 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4050,7 +4050,8 @@ xfs_bmapi_reserve_delalloc(
 	xfs_filblks_t		prealloc,
 	struct xfs_bmbt_irec	*got,
 	struct xfs_iext_cursor	*icur,
-	int			eof)
+	int			eof,
+	bool			use_cowextszhint)
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_ifork	*ifp = xfs_ifork_ptr(ip, whichfork);
@@ -4070,8 +4071,15 @@ xfs_bmapi_reserve_delalloc(
 	if (prealloc && alen >= len)
 		prealloc = alen - len;
 
-	/* Figure out the extent size, adjust alen */
-	if (whichfork == XFS_COW_FORK) {
+	/*
+	 * If the caller wants us to do so, try to expand the range of the
+	 * delalloc reservation up and down so that it's aligned with the CoW
+	 * extent size hint.  Unlike the data fork, the CoW cancellation
+	 * functions will free all the reservations at inactivation, so we
+	 * don't require that every delalloc reservation have a dirty
+	 * pagecache.
+	 */
+	if (whichfork == XFS_COW_FORK && use_cowextszhint) {
 		struct xfs_bmbt_irec	prev;
 		xfs_extlen_t		extsz = xfs_get_cowextsz_hint(ip);
 
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 667b0c2b33d1..aa9814649c5b 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -222,7 +222,7 @@ int	xfs_bmap_split_extent(struct xfs_trans *tp, struct xfs_inode *ip,
 int	xfs_bmapi_reserve_delalloc(struct xfs_inode *ip, int whichfork,
 		xfs_fileoff_t off, xfs_filblks_t len, xfs_filblks_t prealloc,
 		struct xfs_bmbt_irec *got, struct xfs_iext_cursor *cur,
-		int eof);
+		int eof, bool use_cowextszhint);
 int	xfs_bmapi_convert_delalloc(struct xfs_inode *ip, int whichfork,
 		xfs_off_t offset, struct iomap *iomap, unsigned int *seq);
 int	xfs_bmap_add_extent_unwritten_real(struct xfs_trans *tp,
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 378342673925..a7d74f871773 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -979,6 +979,7 @@ xfs_buffered_write_iomap_begin(
 	int			error = 0;
 	unsigned int		lockmode = XFS_ILOCK_EXCL;
 	u64			seq;
+	bool			use_cowextszhint = false;
 
 	if (xfs_is_shutdown(mp))
 		return -EIO;
@@ -1148,12 +1149,20 @@ xfs_buffered_write_iomap_begin(
 		}
 	}
 
+	/*
+	 * If we're targeting the COW fork but aren't creating a speculative
+	 * posteof preallocation, try to expand the reservation to align with
+	 * the cow extent size hint if there's sufficient free space.
+	 */
+	if (allocfork == XFS_COW_FORK && !prealloc_blocks)
+		use_cowextszhint = true;
 retry:
 	error = xfs_bmapi_reserve_delalloc(ip, allocfork, offset_fsb,
 			end_fsb - offset_fsb, prealloc_blocks,
 			allocfork == XFS_DATA_FORK ? &imap : &cmap,
 			allocfork == XFS_DATA_FORK ? &icur : &ccur,
-			allocfork == XFS_DATA_FORK ? eof : cow_eof);
+			allocfork == XFS_DATA_FORK ? eof : cow_eof,
+			use_cowextszhint);
 	switch (error) {
 	case 0:
 		break;
@@ -1161,7 +1170,8 @@ xfs_buffered_write_iomap_begin(
 	case -EDQUOT:
 		/* retry without any preallocation */
 		trace_xfs_delalloc_enospc(ip, offset, count);
-		if (prealloc_blocks) {
+		if (prealloc_blocks || use_cowextszhint) {
+			use_cowextszhint = false;
 			prealloc_blocks = 0;
 			goto retry;
 		}



* [PATCH 4/5] xfs: allow unlinked symlinks and dirs with zero size
  2024-06-12 17:46 [PATCHSET] xfs: random fixes for 6.10 Darrick J. Wong
                   ` (2 preceding siblings ...)
  2024-06-12 17:47 ` [PATCH 3/5] xfs: restrict when we try to align cow fork delalloc to cowextsz hints Darrick J. Wong
@ 2024-06-12 17:47 ` Darrick J. Wong
  2024-06-13  4:57   ` Christoph Hellwig
  2024-06-12 17:47 ` [PATCH 5/5] xfs: verify buffer, inode, and dquot items every tx commit Darrick J. Wong
  4 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2024-06-12 17:47 UTC
  To: hch, djwong, chandanbabu; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

For a very very long time, inode inactivation has set the inode size to
zero before unmapping the extents associated with the data fork.
Unfortunately, commit 3c6f46eacd876 changed the inode verifier to
prohibit zero-length symlinks and directories.  If an inode happens to
get logged in this state and the system crashes before freeing the
inode, log recovery will also fail on the broken inode.

Therefore, allow zero-size symlinks and directories as long as the link
count is zero; nobody will be able to open these files by handle so
there isn't any risk of data exposure.
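
The relaxed rule can be summarised with a small standalone model.  This is
only a sketch under stated assumptions: struct model_dinode, enum
model_ftype and model_zero_size_check() are hypothetical names, and the
single nlink field stands in for whichever link count field applies to the
inode version.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum model_ftype {
	MODEL_FT_REG,
	MODEL_FT_DIR,
	MODEL_FT_SYMLINK,
};

struct model_dinode {
	enum model_ftype	ftype;
	uint64_t		size;	/* models di_size */
	uint32_t		nlink;	/* models the ondisk link count */
};

/* Returns true if the size/link count combination passes the check. */
static bool
model_zero_size_check(const struct model_dinode *dip)
{
	assert(dip != NULL);

	/* Only directories and symlinks are subject to the rule. */
	if (dip->ftype != MODEL_FT_DIR && dip->ftype != MODEL_FT_SYMLINK)
		return true;
	if (dip->size != 0)
		return true;

	/* Zero size is tolerated only for unlinked inodes. */
	return dip->nlink == 0;
}
```

A zero-size directory that still has links is rejected in this model, while
the same directory with a link count of zero (that is, one in the middle of
inactivation) is accepted.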

Fixes: 3c6f46eacd876 ("xfs: sanity check directory inode di_size")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_inode_buf.c |   23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index e7a7bfbe75b4..513b50da6215 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -379,10 +379,13 @@ xfs_dinode_verify_fork(
 		/*
 		 * A directory small enough to fit in the inode must be stored
 		 * in local format.  The directory sf <-> extents conversion
-		 * code updates the directory size accordingly.
+		 * code updates the directory size accordingly.  Directories
+		 * being truncated have zero size and are not subject to this
+		 * check.
 		 */
 		if (S_ISDIR(mode)) {
-			if (be64_to_cpu(dip->di_size) <= fork_size &&
+			if (dip->di_size &&
+			    be64_to_cpu(dip->di_size) <= fork_size &&
 			    fork_format != XFS_DINODE_FMT_LOCAL)
 				return __this_address;
 		}
@@ -528,9 +531,19 @@ xfs_dinode_verify(
 	if (mode && xfs_mode_to_ftype(mode) == XFS_DIR3_FT_UNKNOWN)
 		return __this_address;
 
-	/* No zero-length symlinks/dirs. */
-	if ((S_ISLNK(mode) || S_ISDIR(mode)) && di_size == 0)
-		return __this_address;
+	/*
+	 * No zero-length symlinks/dirs unless they're unlinked and hence being
+	 * inactivated.
+	 */
+	if ((S_ISLNK(mode) || S_ISDIR(mode)) && di_size == 0) {
+		if (dip->di_version > 1) {
+			if (dip->di_nlink)
+				return __this_address;
+		} else {
+			if (dip->di_onlink)
+				return __this_address;
+		}
+	}
 
 	fa = xfs_dinode_verify_nrext64(mp, dip);
 	if (fa)



* [PATCH 5/5] xfs: verify buffer, inode, and dquot items every tx commit
  2024-06-12 17:46 [PATCHSET] xfs: random fixes for 6.10 Darrick J. Wong
                   ` (3 preceding siblings ...)
  2024-06-12 17:47 ` [PATCH 4/5] xfs: allow unlinked symlinks and dirs with zero size Darrick J. Wong
@ 2024-06-12 17:47 ` Darrick J. Wong
  2024-06-13  5:07   ` Christoph Hellwig
                     ` (2 more replies)
  4 siblings, 3 replies; 24+ messages in thread
From: Darrick J. Wong @ 2024-06-12 17:47 UTC
  To: hch, djwong, chandanbabu; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

generic/388 has an annoying tendency to fail like this during log
recovery:

XFS (sda4): Unmounting Filesystem 435fe39b-82b6-46ef-be56-819499585130
XFS (sda4): Mounting V5 Filesystem 435fe39b-82b6-46ef-be56-819499585130
XFS (sda4): Starting recovery (logdev: internal)
00000000: 49 4e 81 b6 03 02 00 00 00 00 00 07 00 00 00 07  IN..............
00000010: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 10  ................
00000020: 35 9a 8b c1 3e 6e 81 00 35 9a 8b c1 3f dc b7 00  5...>n..5...?...
00000030: 35 9a 8b c1 3f dc b7 00 00 00 00 00 00 3c 86 4f  5...?........<.O
00000040: 00 00 00 00 00 00 02 f3 00 00 00 00 00 00 00 00  ................
00000050: 00 00 1f 01 00 00 00 00 00 00 00 02 b2 74 c9 0b  .............t..
00000060: ff ff ff ff d7 45 73 10 00 00 00 00 00 00 00 2d  .....Es........-
00000070: 00 00 07 92 00 01 fe 30 00 00 00 00 00 00 00 1a  .......0........
00000080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000090: 35 9a 8b c1 3b 55 0c 00 00 00 00 00 04 27 b2 d1  5...;U.......'..
000000a0: 43 5f e3 9b 82 b6 46 ef be 56 81 94 99 58 51 30  C_....F..V...XQ0
XFS (sda4): Internal error Bad dinode after recovery at line 539 of file fs/xfs/xfs_inode_item_recover.c.  Caller xlog_recover_items_pass2+0x4e/0xc0 [xfs]
CPU: 0 PID: 2189311 Comm: mount Not tainted 6.9.0-rc4-djwx #rc4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-builder-01.us.oracle.com-4.el7.1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x4f/0x60
 xfs_corruption_error+0x90/0xa0
 xlog_recover_inode_commit_pass2+0x5f1/0xb00
 xlog_recover_items_pass2+0x4e/0xc0
 xlog_recover_commit_trans+0x2db/0x350
 xlog_recovery_process_trans+0xab/0xe0
 xlog_recover_process_data+0xa7/0x130
 xlog_do_recovery_pass+0x398/0x840
 xlog_do_log_recovery+0x62/0xc0
 xlog_do_recover+0x34/0x1d0
 xlog_recover+0xe9/0x1a0
 xfs_log_mount+0xff/0x260
 xfs_mountfs+0x5d9/0xb60
 xfs_fs_fill_super+0x76b/0xa30
 get_tree_bdev+0x124/0x1d0
 vfs_get_tree+0x17/0xa0
 path_mount+0x72b/0xa90
 __x64_sys_mount+0x112/0x150
 do_syscall_64+0x49/0x100
 entry_SYSCALL_64_after_hwframe+0x4b/0x53
 </TASK>
XFS (sda4): Corruption detected. Unmount and run xfs_repair
XFS (sda4): Metadata corruption detected at xfs_dinode_verify.part.0+0x739/0x920 [xfs], inode 0x427b2d1
XFS (sda4): Filesystem has been shut down due to log error (0x2).
XFS (sda4): Please unmount the filesystem and rectify the problem(s).
XFS (sda4): log mount/recovery failed: error -117
XFS (sda4): log mount failed

This is inode log item recovery failing the dinode verifier after
replaying the contents of the inode log item into the ondisk inode.
Looking back into what the kernel was doing at the time of the fs
shutdown, a thread was in the middle of running a series of
transactions, each of which committed changes to the inode.

At some point in the middle of that chain, an invalid (at least
according to the verifier) change was committed.  Had the filesystem not
shut down in the middle of the chain, a subsequent transaction would
have corrected the invalid state and nobody would have noticed.  But
that's not what happened here.  Instead, the invalid inode state was
committed to the ondisk log, so log recovery tripped over it.

The actual defect here was an overzealous inode verifier, which was
fixed in a separate patch.  This patch adds some transaction precommit
functions for CONFIG_XFS_DEBUG=y mode so that we can detect these kinds
of transient errors at transaction commit time, where it's much easier
to find the root cause.
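
The general idea of checking every logged item at commit time can be
sketched in plain C.  This is a simplified userspace model and not the
kernel interface: struct model_item, struct model_fs,
model_verify_nonnegative() and model_trans_commit() are hypothetical names,
and unlike the precommit hooks in the patch the model reports failure
through its return value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_item;

/* Verifier: returns NULL if the item is fine, else a short description. */
typedef const char *(*model_verify_fn)(const struct model_item *item);

struct model_item {
	int		value;
	model_verify_fn	verify;		/* may be NULL: nothing to check */
};

struct model_fs {
	bool		shut_down;
	const char	*first_failure;
};

static const char *
model_verify_nonnegative(const struct model_item *item)
{
	return item->value < 0 ? "negative value" : NULL;
}

/*
 * Run every item's verifier before "committing".  On the first failure,
 * record it and shut the model filesystem down, so the bad state is caught
 * at the point where it was created rather than at recovery time.
 */
static bool
model_trans_commit(struct model_fs *fs, const struct model_item *items,
		   size_t nr)
{
	assert(fs != NULL && (items != NULL || nr == 0));

	for (size_t i = 0; i < nr; i++) {
		const char	*fa;

		if (!items[i].verify)
			continue;
		fa = items[i].verify(&items[i]);
		if (fa) {
			fs->first_failure = fa;
			fs->shut_down = true;
			return false;
		}
	}
	return true;
}
```

The point of the design is that a transient bad state is reported while the
offending transaction is still on the stack, instead of surfacing later as
an unexplained log recovery failure.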

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_buf_item.c   |   32 ++++++++++++++++++++++++++++++++
 fs/xfs/xfs_dquot_item.c |   31 +++++++++++++++++++++++++++++++
 fs/xfs/xfs_inode_item.c |   32 ++++++++++++++++++++++++++++++++
 3 files changed, 95 insertions(+)


diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 43031842341a..44f0078babda 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -22,6 +22,7 @@
 #include "xfs_trace.h"
 #include "xfs_log.h"
 #include "xfs_log_priv.h"
+#include "xfs_error.h"
 
 
 struct kmem_cache	*xfs_buf_item_cache;
@@ -781,8 +782,39 @@ xfs_buf_item_committed(
 	return lsn;
 }
 
+#ifdef DEBUG
+static int
+xfs_buf_item_precommit(
+	struct xfs_trans	*tp,
+	struct xfs_log_item	*lip)
+{
+	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
+	struct xfs_buf		*bp = bip->bli_buf;
+	struct xfs_mount	*mp = bp->b_mount;
+	xfs_failaddr_t		fa;
+
+	if (!bp->b_ops || !bp->b_ops->verify_struct)
+		return 0;
+	if (bip->bli_flags & XFS_BLI_STALE)
+		return 0;
+
+	fa = bp->b_ops->verify_struct(bp);
+	if (fa) {
+		xfs_buf_verifier_error(bp, -EFSCORRUPTED, bp->b_ops->name,
+				bp->b_addr, BBTOB(bp->b_length), fa);
+		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
+		ASSERT(fa == NULL);
+	}
+
+	return 0;
+}
+#else
+# define xfs_buf_item_precommit	NULL
+#endif
+
 static const struct xfs_item_ops xfs_buf_item_ops = {
 	.iop_size	= xfs_buf_item_size,
+	.iop_precommit	= xfs_buf_item_precommit,
 	.iop_format	= xfs_buf_item_format,
 	.iop_pin	= xfs_buf_item_pin,
 	.iop_unpin	= xfs_buf_item_unpin,
diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c
index 6a1aae799cf1..dfb00354d457 100644
--- a/fs/xfs/xfs_dquot_item.c
+++ b/fs/xfs/xfs_dquot_item.c
@@ -17,6 +17,7 @@
 #include "xfs_trans_priv.h"
 #include "xfs_qm.h"
 #include "xfs_log.h"
+#include "xfs_error.h"
 
 static inline struct xfs_dq_logitem *DQUOT_ITEM(struct xfs_log_item *lip)
 {
@@ -193,8 +194,38 @@ xfs_qm_dquot_logitem_committing(
 	return xfs_qm_dquot_logitem_release(lip);
 }
 
+#ifdef DEBUG
+static int
+xfs_qm_dquot_logitem_precommit(
+	struct xfs_trans	*tp,
+	struct xfs_log_item	*lip)
+{
+	struct xfs_dquot	*dqp = DQUOT_ITEM(lip)->qli_dquot;
+	struct xfs_mount	*mp = dqp->q_mount;
+	struct xfs_disk_dquot	ddq;
+	xfs_failaddr_t		fa;
+
+	xfs_dquot_to_disk(&ddq, dqp);
+	fa = xfs_dquot_verify(mp, &ddq, dqp->q_id);
+	if (fa) {
+		XFS_CORRUPTION_ERROR("Bad dquot during logging",
+				XFS_ERRLEVEL_LOW, mp, &ddq, sizeof(ddq));
+		xfs_alert(mp,
+ "Metadata corruption detected at %pS, dquot 0x%x",
+				fa, dqp->q_id);
+		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
+		ASSERT(fa == NULL);
+	}
+
+	return 0;
+}
+#else
+# define xfs_qm_dquot_logitem_precommit	NULL
+#endif
+
 static const struct xfs_item_ops xfs_dquot_item_ops = {
 	.iop_size	= xfs_qm_dquot_logitem_size,
+	.iop_precommit	= xfs_qm_dquot_logitem_precommit,
 	.iop_format	= xfs_qm_dquot_logitem_format,
 	.iop_pin	= xfs_qm_dquot_logitem_pin,
 	.iop_unpin	= xfs_qm_dquot_logitem_unpin,
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index f28d653300d1..0d97ae015114 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -37,6 +37,36 @@ xfs_inode_item_sort(
 	return INODE_ITEM(lip)->ili_inode->i_ino;
 }
 
+#ifdef DEBUG
+static void
+xfs_inode_item_precommit_check(
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_dinode	*dip;
+	xfs_failaddr_t		fa;
+
+	dip = kzalloc(mp->m_sb.sb_inodesize, GFP_KERNEL | GFP_NOFS);
+	if (!dip) {
+		ASSERT(dip != NULL);
+		return;
+	}
+
+	xfs_inode_to_disk(ip, dip, 0);
+	xfs_dinode_calc_crc(mp, dip);
+	fa = xfs_dinode_verify(mp, ip->i_ino, dip);
+	if (fa) {
+		xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__, dip,
+				sizeof(*dip), fa);
+		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
+		ASSERT(fa == NULL);
+	}
+	kfree(dip);
+}
+#else
+# define xfs_inode_item_precommit_check(ip)	((void)0)
+#endif
+
 /*
  * Prior to finally logging the inode, we have to ensure that all the
  * per-modification inode state changes are applied. This includes VFS inode
@@ -169,6 +199,8 @@ xfs_inode_item_precommit(
 	iip->ili_fields |= (flags | iip->ili_last_fields);
 	spin_unlock(&iip->ili_lock);
 
+	xfs_inode_item_precommit_check(ip);
+
 	/*
 	 * We are done with the log item transaction dirty state, so clear it so
 	 * that it doesn't pollute future transactions.



* Re: [PATCH 4/5] xfs: allow unlinked symlinks and dirs with zero size
  2024-06-12 17:47 ` [PATCH 4/5] xfs: allow unlinked symlinks and dirs with zero size Darrick J. Wong
@ 2024-06-13  4:57   ` Christoph Hellwig
  0 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2024-06-13  4:57 UTC
  To: Darrick J. Wong; +Cc: hch, chandanbabu, linux-xfs

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 3/5] xfs: restrict when we try to align cow fork delalloc to cowextsz hints
  2024-06-12 17:47 ` [PATCH 3/5] xfs: restrict when we try to align cow fork delalloc to cowextsz hints Darrick J. Wong
@ 2024-06-13  5:06   ` Christoph Hellwig
  2024-06-14  4:13     ` Darrick J. Wong
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2024-06-13  5:06 UTC
  To: Darrick J. Wong; +Cc: hch, chandanbabu, linux-xfs

On Wed, Jun 12, 2024 at 10:47:19AM -0700, Darrick J. Wong wrote:
>  	xfs_filblks_t		prealloc,
>  	struct xfs_bmbt_irec	*got,
>  	struct xfs_iext_cursor	*icur,
> -	int			eof)
> +	int			eof,
> +	bool			use_cowextszhint)

Looking at the caller below I don't think we need the use_cowextszhint
flag here, we can just locally check for prealloc being non-0 in
the branch below:

> +	/*
> +	 * If the caller wants us to do so, try to expand the range of the
> +	 * delalloc reservation up and down so that it's aligned with the CoW
> +	 * extent size hint.  Unlike the data fork, the CoW cancellation
> +	 * functions will free all the reservations at inactivation, so we
> +	 * don't require that every delalloc reservation have a dirty
> +	 * pagecache.
> +	 */
> +	if (whichfork == XFS_COW_FORK && use_cowextszhint) {

Which keeps all the logic and the comments in one single place.



* Re: [PATCH 5/5] xfs: verify buffer, inode, and dquot items every tx commit
  2024-06-12 17:47 ` [PATCH 5/5] xfs: verify buffer, inode, and dquot items every tx commit Darrick J. Wong
@ 2024-06-13  5:07   ` Christoph Hellwig
  2024-06-13  7:04   ` Dave Chinner
  2024-06-18  0:18   ` [PATCH v1.1 " Darrick J. Wong
  2 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2024-06-13  5:07 UTC
  To: Darrick J. Wong; +Cc: hch, chandanbabu, linux-xfs

On Wed, Jun 12, 2024 at 10:47:50AM -0700, Darrick J. Wong wrote:
> +	struct xfs_mount	*mp = ip->i_mount;
> +	struct xfs_dinode	*dip;
> +	xfs_failaddr_t		fa;
> +
> +	dip = kzalloc(mp->m_sb.sb_inodesize, GFP_KERNEL | GFP_NOFS);
> +	if (!dip) {
> +		ASSERT(dip != NULL);
> +		return;
> +	}
> +
> +	xfs_inode_to_disk(ip, dip, 0);
> +	xfs_dinode_calc_crc(mp, dip);
> +	fa = xfs_dinode_verify(mp, ip->i_ino, dip);
> +	if (fa) {
> +		xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__, dip,
> +				sizeof(*dip), fa);
> +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> +		ASSERT(fa == NULL);
> +	}
> +	kfree(dip);

Doing another malloc per committed inode feels awfully expensive.

Overall this feels like the wrong tradeoff, at least for generic
debug builds.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/5] xfs: don't treat append-only files as having preallocations
  2024-06-12 17:46 ` [PATCH 1/5] xfs: don't treat append-only files as having preallocations Darrick J. Wong
@ 2024-06-13  6:03   ` Dave Chinner
  2024-06-13  8:28     ` Christoph Hellwig
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2024-06-13  6:03 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: hch, chandanbabu, linux-xfs

On Wed, Jun 12, 2024 at 10:46:48AM -0700, Darrick J. Wong wrote:
> From: Christoph Hellwig <hch@lst.de>
> 
> The XFS XFS_DIFLAG_APPEND maps to the VFS S_APPEND flag, which forbids
> writes that don't append at the current EOF.
> 
> But the commit originally adding XFS_DIFLAG_APPEND support (commit
> a23321e766d in xfs xfs-import repository) also checked it to skip
> releasing speculative preallocations, which doesn't make any sense.

I disagree, there was a very good reason for this behaviour:
preventing append-only log files from getting excessively fragmented
because speculative prealloc would get removed on close().

i.e. applications that slowly log messages to append only files
with the pattern open(O_APPEND); write(a single line to the log);
close(); caused worst case file fragmentation because the close()
always removed the speculative prealloc beyond EOF.
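
The open/write/close pattern described above can be sketched in userspace
C roughly as follows.  This only illustrates the workload; it is not taken
from any real logging daemon, and log_one_line() is a hypothetical helper
name.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Append a single line to a log file with one open/write/close cycle.
 * Returns 0 on success, -1 on failure.  Hypothetical helper, for
 * illustration only.
 */
static int log_one_line(const char *path, const char *line)
{
	size_t len;
	ssize_t ret;
	int fd;

	assert(path != NULL && line != NULL);
	len = strlen(line);

	fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
	if (fd < 0)
		return -1;

	ret = write(fd, line, len);

	/* The final close() is the point the discussion above is about. */
	if (close(fd) < 0)
		return -1;

	return ret == (ssize_t)len ? 0 : -1;
}
```

Every call goes through a full open/close cycle, so a filesystem that trims
post-EOF preallocation at close may have to allocate again for each line.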

The fix for this pessimistic XFS behaviour is for the application
to use chattr +a (like they would for ext3/4), which sets
XFS_DIFLAG_APPEND and so avoids the removal of speculative delalloc
when the file is closed. Hence the fragmentation problems went away.

Note that the fragmentation issue didn't affect the log writes - it
badly affected log reads because it turned them into worst case
random read workloads instead of sequential reads.

As such, I think the justification for this change is wrong and that
it removes a longstanding feature that prevents severe fragmentation
of append only log files. I think we should be leaving this code as
it currently stands.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/5] xfs: verify buffer, inode, and dquot items every tx commit
  2024-06-12 17:47 ` [PATCH 5/5] xfs: verify buffer, inode, and dquot items every tx commit Darrick J. Wong
  2024-06-13  5:07   ` Christoph Hellwig
@ 2024-06-13  7:04   ` Dave Chinner
  2024-06-14  3:49     ` Darrick J. Wong
  2024-06-18  0:18   ` [PATCH v1.1 " Darrick J. Wong
  2 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2024-06-13  7:04 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: hch, chandanbabu, linux-xfs

On Wed, Jun 12, 2024 at 10:47:50AM -0700, Darrick J. Wong wrote:
> The actual defect here was an overzealous inode verifier, which was
> fixed in a separate patch.  This patch adds some transaction precommit
> functions for CONFIG_XFS_DEBUG=y mode so that we can detect these kinds
> of transient errors at transaction commit time, where it's much easier
> to find the root cause.

Ok, I can see the value in this for very strict integrity checking,
but I don't think that the CONFIG_XFS_DEBUG context is right
for this level of checking.

Think of the difference between using xfs_assert_ilocked() with
CONFIG_XFS_DEBUG vs using CONFIG_PROVE_LOCKING to enable lockdep.
Lockdep checks a lot more about lock usage than our debug build
asserts and so may find deep, subtle issues that our asserts won't
find. However, that extra capability comes at a huge cost for
relatively little extra gain, and so most of the time people work
without CONFIG_PROVE_LOCKING enabled. It gets a test run here or
there, and then again when the code development is done, but it's not
used all the time on every little change that is developed and tested.

In comparison, I can't remember the last time I did any testing with
CONFIG_XFS_DEBUG disabled. Even all my performance regression
testing is run with CONFIG_XFS_DEBUG=y, and a change like this one
would make any sort of load testing on debug kernels far too costly,
and so all that testing would get done with debugging turned off.
That's a significant loss, IMO, because we'd lose more validation
from people turning CONFIG_XFS_DEBUG off than we'd gain from the
rare occasions this new commit verifier infrastructure would catch
a real bug.

Hence I think this should be pushed into a separate debug config
sub-option. Make it something we can easily turn on with
KASAN and lockdep when we do our periodic costly extensive validation
test runs.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/5] xfs: don't treat append-only files as having preallocations
  2024-06-13  6:03   ` Dave Chinner
@ 2024-06-13  8:28     ` Christoph Hellwig
  2024-06-17  5:03       ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2024-06-13  8:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, hch, chandanbabu, linux-xfs

On Thu, Jun 13, 2024 at 04:03:53PM +1000, Dave Chinner wrote:
> I disagree, there was a very good reason for this behaviour:
> preventing append-only log files from getting excessively fragmented
> because speculative prealloc would get removed on close().

Where is that very clear intent documented?  Not in the original
commit message (which is very sparse) and nowhere in any documentation
I can find.

> i.e. applications that slowly log messages to append only files
> with the pattern open(O_APPEND); write(a single line to the log);
> close(); caused worst case file fragmentation because the close()
> always removed the speculative prealloc beyond EOF.

That case should be covered by the XFS_IDIRTY_RELEASE, at least
except for O_SYNC workloads. 

> 
> The fix for this pessimistic XFS behaviour is for the application
> to use chattr +a (like they would for ext3/4), which sets
> XFS_DIFLAG_APPEND and so avoids the removal of speculative delalloc
> when the file is closed. Hence the fragmentation problems went away.

For ext4 the EXT4_APPEND_FL flag does not cause any difference
in allocation behavior.  For the historic ext2 driver it apparently
did just that, with an XXX comment marking this as a bug, but for ext3
it also never did, looking back quite a bit in history.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/5] xfs: verify buffer, inode, and dquot items every tx commit
  2024-06-13  7:04   ` Dave Chinner
@ 2024-06-14  3:49     ` Darrick J. Wong
  2024-06-14  4:42       ` Christoph Hellwig
  0 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2024-06-14  3:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: hch, chandanbabu, linux-xfs

On Thu, Jun 13, 2024 at 05:04:47PM +1000, Dave Chinner wrote:
> On Wed, Jun 12, 2024 at 10:47:50AM -0700, Darrick J. Wong wrote:
> > The actual defect here was an overzealous inode verifier, which was
> > fixed in a separate patch.  This patch adds some transaction precommit
> > functions for CONFIG_XFS_DEBUG=y mode so that we can detect these kinds
> > of transient errors at transaction commit time, where it's much easier
> > to find the root cause.
> 
> Ok, I can see the value in this for very strict integrity checking,
> but I don't think that the CONFIG_XFS_DEBUG context is right
> for this level of checking.
> 
> Think of the difference between using xfs_assert_ilocked() with
> CONFIG_XFS_DEBUG vs using CONFIG_PROVE_LOCKING to enable lockdep.
> Lockdep checks a lot more about lock usage than our debug build
> asserts and so may find deep, subtle issues that our asserts won't
> find. However, that extra capability comes at a huge cost for
> relatively little extra gain, and so most of the time people work
> without CONFIG_PROVE_LOCKING enabled. It gets a test run here or
> there, and then again when the code development is done, but it's not
> used all the time on every little change that is developed and tested.
> 
> In comparison, I can't remember the last time I did any testing with
> CONFIG_XFS_DEBUG disabled. Even all my performance regression
> testing is run with CONFIG_XFS_DEBUG=y, and a change like this one
> would make any sort of load testing on debug kernels far too costly,
> and so all that testing would get done with debugging turned off.
> That's a significant loss, IMO, because we'd lose more validation
> from people turning CONFIG_XFS_DEBUG off than we'd gain from the
> rare occasions this new commit verifier infrastructure would catch
> a real bug.
> 
> Hence I think this should be pushed into a separate debug config
> sub-option. Make it something we can easily turn on with
> KASAN and lockdep when we do our periodic costly extensive validation
> test runs.

Do you want a CONFIG_XFS_DEBUG_EXPENSIVE=y guard, then?  Some of the
bmbt scanning debug things might qualify for that too.
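
A compile-time guard of the kind being discussed might look roughly like
the sketch below.  It is a userspace toy, not the patch itself:
CONFIG_XFS_DEBUG_EXPENSIVE is only the name floated in this thread, and
struct fake_item, FAKE_MAGIC, fake_item_verify() and
fake_precommit_check() are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical: in a kernel build this symbol would come from Kconfig.
 * Uncomment it to compile the expensive check in.
 */
/* #define CONFIG_XFS_DEBUG_EXPENSIVE 1 */

#define FAKE_MAGIC	0x1234U	/* toy value standing in for a magic number */

struct fake_item {
	unsigned int	magic;
};

/* Toy verifier: an item counts as valid if its magic number matches. */
static bool fake_item_verify(const struct fake_item *item)
{
	assert(item != NULL);
	return item->magic == FAKE_MAGIC;
}

#ifdef CONFIG_XFS_DEBUG_EXPENSIVE
/* Expensive mode: run the verifier on every commit. */
static bool fake_precommit_check(const struct fake_item *item)
{
	return fake_item_verify(item);
}
#else
/* Default mode: the check compiles away and always passes. */
static bool fake_precommit_check(const struct fake_item *item)
{
	(void)item;
	return true;
}
#endif
```

With the symbol left undefined the precommit hook costs nothing, so
ordinary debug builds keep their current speed and only the periodic
heavyweight runs pay for the verifier.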

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/5] xfs: restrict when we try to align cow fork delalloc to cowextsz hints
  2024-06-13  5:06   ` Christoph Hellwig
@ 2024-06-14  4:13     ` Darrick J. Wong
  2024-06-14  4:41       ` Christoph Hellwig
  0 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2024-06-14  4:13 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: chandanbabu, linux-xfs

On Thu, Jun 13, 2024 at 07:06:13AM +0200, Christoph Hellwig wrote:
> On Wed, Jun 12, 2024 at 10:47:19AM -0700, Darrick J. Wong wrote:
> >  	xfs_filblks_t		prealloc,
> >  	struct xfs_bmbt_irec	*got,
> >  	struct xfs_iext_cursor	*icur,
> > -	int			eof)
> > +	int			eof,
> > +	bool			use_cowextszhint)
> 
> Looking at the caller below I don't think we need the use_cowextszhint
> flag here; we can just locally check for prealloc being non-0 in
> the branch below:

That won't work, because xfs_buffered_write_iomap_begin only sets
@prealloc to nonzero if it thinks this is an extending write.  For the
cow fork, we create delalloc reservations that are aligned to the
cowextsize value for overwrites below eof.
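
As a rough illustration of what "aligned to the cowextsize value" means
for a block range, here is a simplified userspace model.  It ignores
neighbouring extents, maximum extent length and overflow, all of which
the real kernel code has to deal with; struct blk_range and
align_to_hint() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

struct blk_range {
	uint64_t	off;	/* first block of the range */
	uint64_t	len;	/* number of blocks */
};

/*
 * Expand [off, off + len) down and up so both ends land on a multiple
 * of extsz.  A zero hint or a zero length leaves the range untouched.
 */
static struct blk_range align_to_hint(uint64_t off, uint64_t len,
				      uint64_t extsz)
{
	struct blk_range r = { off, len };
	uint64_t end;

	if (extsz == 0 || len == 0)
		return r;

	end = off + len;
	r.off = off - (off % extsz);			/* round start down */
	end = ((end + extsz - 1) / extsz) * extsz;	/* round end up */
	r.len = end - r.off;

	assert(r.off <= off && r.len >= len);
	return r;
}
```

For example, a 3-block overwrite at block 5 with a 16-block hint grows
into a reservation covering blocks 0 through 15, even though the write
is well below eof.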

--D

> > +	/*
> > +	 * If the caller wants us to do so, try to expand the range of the
> > +	 * delalloc reservation up and down so that it's aligned with the CoW
> > +	 * extent size hint.  Unlike the data fork, the CoW cancellation
> > +	 * functions will free all the reservations at inactivation, so we
> > +	 * don't require that every delalloc reservation have a dirty
> > +	 * pagecache.
> > +	 */
> > +	if (whichfork == XFS_COW_FORK && use_cowextszhint) {
> 
> Which keeps all the logic and the comments in one single place.
> 
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/5] xfs: restrict when we try to align cow fork delalloc to cowextsz hints
  2024-06-14  4:13     ` Darrick J. Wong
@ 2024-06-14  4:41       ` Christoph Hellwig
  2024-06-14  5:27         ` Darrick J. Wong
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2024-06-14  4:41 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, chandanbabu, linux-xfs

On Thu, Jun 13, 2024 at 09:13:10PM -0700, Darrick J. Wong wrote:
> > Looking at the caller below I don't think we need the use_cowextszhint
> > flag here; we can just locally check for prealloc being non-0 in
> > the branch below:
> 
> That won't work, because xfs_buffered_write_iomap_begin only sets
> @prealloc to nonzero if it thinks this is an extending write.  For the
> cow fork, we create delalloc reservations that are aligned to the
> cowextsize value for overwrites below eof.

Yeah.  For that we'd need to move the retry loop into
xfs_bmapi_reserve_delalloc - which honestly feels like the more logical
place for it anyway.  As in the untested version below, also note
my XXX comment about a comment being added:

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index c101cf266bc4db..58ff21cb84e0f5 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4059,19 +4059,33 @@ xfs_bmapi_reserve_delalloc(
 	uint64_t		fdblocks;
 	int			error;
 	xfs_fileoff_t		aoff = off;
+	bool			use_cowextszhint =
+		whichfork == XFS_COW_FORK && !prealloc;
 
 	/*
 	 * Cap the alloc length. Keep track of prealloc so we know whether to
 	 * tag the inode before we return.
 	 */
+retry:
 	alen = XFS_FILBLKS_MIN(len + prealloc, XFS_MAX_BMBT_EXTLEN);
 	if (!eof)
 		alen = XFS_FILBLKS_MIN(alen, got->br_startoff - aoff);
 	if (prealloc && alen >= len)
 		prealloc = alen - len;
 
-	/* Figure out the extent size, adjust alen */
-	if (whichfork == XFS_COW_FORK) {
+	/*
+	 * If we're targeting the COW fork but aren't creating a speculative
+	 * posteof preallocation, try to expand the reservation to align with
+	 * the cow extent size hint if there's sufficient free space.
+	 *
+	 * Unlike the data fork, the CoW cancellation functions will free all
+	 * the reservations at inactivation, so we don't require that every
+	 * delalloc reservation have a dirty pagecache.
+	 *
+	 * XXX(hch): I can't see where we actually require dirty pagecache
+	 * for speculative data fork preallocations.  What am I missing?
+	 */
+	if (use_cowextszhint) {
 		struct xfs_bmbt_irec	prev;
 		xfs_extlen_t		extsz = xfs_get_cowextsz_hint(ip);
 
@@ -4090,7 +4104,7 @@ xfs_bmapi_reserve_delalloc(
 	 */
 	error = xfs_quota_reserve_blkres(ip, alen);
 	if (error)
-		return error;
+		goto out;
 
 	/*
 	 * Split changing sb for alen and indlen since they could be coming
@@ -4140,6 +4154,16 @@ xfs_bmapi_reserve_delalloc(
 out_unreserve_quota:
 	if (XFS_IS_QUOTA_ON(mp))
 		xfs_quota_unreserve_blkres(ip, alen);
+out:
+	if (error == -ENOSPC || error == -EDQUOT) {
+		trace_xfs_delalloc_enospc(ip, off, len);
+		if (prealloc || use_cowextszhint) {
+			/* retry without any preallocation */
+			prealloc = 0;
+			use_cowextszhint = false;
+			goto retry;
+		}
+	}
 	return error;
 }
 
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 3783426739258c..34cce017fe7ce1 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1148,27 +1148,13 @@ xfs_buffered_write_iomap_begin(
 		}
 	}
 
-retry:
 	error = xfs_bmapi_reserve_delalloc(ip, allocfork, offset_fsb,
 			end_fsb - offset_fsb, prealloc_blocks,
 			allocfork == XFS_DATA_FORK ? &imap : &cmap,
 			allocfork == XFS_DATA_FORK ? &icur : &ccur,
 			allocfork == XFS_DATA_FORK ? eof : cow_eof);
-	switch (error) {
-	case 0:
-		break;
-	case -ENOSPC:
-	case -EDQUOT:
-		/* retry without any preallocation */
-		trace_xfs_delalloc_enospc(ip, offset, count);
-		if (prealloc_blocks) {
-			prealloc_blocks = 0;
-			goto retry;
-		}
-		fallthrough;
-	default:
+	if (error)
 		goto out_unlock;
-	}
 
 	if (allocfork == XFS_COW_FORK) {
 		trace_xfs_iomap_alloc(ip, offset, count, allocfork, &cmap);
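
The control flow of the retry in the diff above, stripped of everything
XFS-specific, can be modelled in userspace C like this.  It is only a
sketch of the "try with the extra reservation, fall back to the bare
minimum on ENOSPC" idea; struct space_pool, pool_reserve() and
reserve_with_fallback() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct space_pool {
	uint64_t	free_blocks;
};

/* Take blocks out of the pool, or fail with -ENOSPC. */
static int pool_reserve(struct space_pool *pool, uint64_t blocks)
{
	if (blocks > pool->free_blocks)
		return -ENOSPC;
	pool->free_blocks -= blocks;
	return 0;
}

/*
 * Reserve len blocks plus an optional prealloc.  If the larger request
 * does not fit, retry once without the prealloc.  On success the number
 * of blocks actually reserved is stored in *reserved.
 */
static int reserve_with_fallback(struct space_pool *pool, uint64_t len,
				 uint64_t prealloc, uint64_t *reserved)
{
	int error;

	assert(pool != NULL && reserved != NULL);
retry:
	error = pool_reserve(pool, len + prealloc);
	if (error == -ENOSPC && prealloc) {
		/* retry without any preallocation */
		prealloc = 0;
		goto retry;
	}
	if (!error)
		*reserved = len + prealloc;
	return error;
}
```

Keeping the loop next to the reservation call means the caller only sees
the final result, which is the same motivation given above for moving
the retry into xfs_bmapi_reserve_delalloc.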

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/5] xfs: verify buffer, inode, and dquot items every tx commit
  2024-06-14  3:49     ` Darrick J. Wong
@ 2024-06-14  4:42       ` Christoph Hellwig
  2024-06-14  5:23         ` Darrick J. Wong
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2024-06-14  4:42 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, hch, chandanbabu, linux-xfs

On Thu, Jun 13, 2024 at 08:49:49PM -0700, Darrick J. Wong wrote:
> > Hence I think this should be pushed into a separate debug config
> > sub-option. Make it something we can easily turn on with
> > KASAN and lockdep when we do our periodic costly extensive validation
> > test runs.
> 
> Do you want a CONFIG_XFS_DEBUG_EXPENSIVE=y guard, then?  Some of the
> bmbt scanning debug things might qualify for that too.

Or EXPENSIVE_VALIDATION.  Another option would be a runtime selection,
but that feels like a bit too much to bother.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/5] xfs: verify buffer, inode, and dquot items every tx commit
  2024-06-14  4:42       ` Christoph Hellwig
@ 2024-06-14  5:23         ` Darrick J. Wong
  0 siblings, 0 replies; 24+ messages in thread
From: Darrick J. Wong @ 2024-06-14  5:23 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Dave Chinner, chandanbabu, linux-xfs

On Fri, Jun 14, 2024 at 06:42:38AM +0200, Christoph Hellwig wrote:
> On Thu, Jun 13, 2024 at 08:49:49PM -0700, Darrick J. Wong wrote:
> > > Hence I think this should be pushed into a separate debug config
> > > sub-option. Make it something we can easily turn on with
> > > KASAN and lockdep when we do our periodic costly extensive validation
> > > test runs.
> > 
> > Do you want a CONFIG_XFS_DEBUG_EXPENSIVE=y guard, then?  Some of the
> > bmbt scanning debug things might qualify for that too.
> 
> Or EXPENSIVE_VALIDATION.  Another option would be a runtime selection,
> but that feels like a bit too much to bother.

Yeah, probably.  FWIW I haven't seen any increase in fstests runtime
since I added this debug patch.  I suspect that the 512b allocations
for the inode easily come out of that slab and don't slow us down much.

--D

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/5] xfs: restrict when we try to align cow fork delalloc to cowextsz hints
  2024-06-14  4:41       ` Christoph Hellwig
@ 2024-06-14  5:27         ` Darrick J. Wong
  2024-06-14  5:30           ` Christoph Hellwig
  0 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2024-06-14  5:27 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: chandanbabu, linux-xfs

On Fri, Jun 14, 2024 at 06:41:55AM +0200, Christoph Hellwig wrote:
> On Thu, Jun 13, 2024 at 09:13:10PM -0700, Darrick J. Wong wrote:
> > > Looking at the caller below I don't think we need the use_cowextszhint
> > > flag here; we can just locally check for prealloc being non-0 in
> > > the branch below:
> > 
> > That won't work, because xfs_buffered_write_iomap_begin only sets
> > @prealloc to nonzero if it thinks this is an extending write.  For the
> > cow fork, we create delalloc reservations that are aligned to the
> > cowextsize value for overwrites below eof.
> 
> Yeah.  For that we'd need to move the retry loop into
> xfs_bmapi_reserve_delalloc - which honestly feels like the more logical
> place for it anyway.  As in the untested version below, also note
> my XXX comment about a comment being added:
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index c101cf266bc4db..58ff21cb84e0f5 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -4059,19 +4059,33 @@ xfs_bmapi_reserve_delalloc(
>  	uint64_t		fdblocks;
>  	int			error;
>  	xfs_fileoff_t		aoff = off;
> +	bool			use_cowextszhint =
> +		whichfork == XFS_COW_FORK && !prealloc;
>  
>  	/*
>  	 * Cap the alloc length. Keep track of prealloc so we know whether to
>  	 * tag the inode before we return.
>  	 */
> +retry:
>  	alen = XFS_FILBLKS_MIN(len + prealloc, XFS_MAX_BMBT_EXTLEN);
>  	if (!eof)
>  		alen = XFS_FILBLKS_MIN(alen, got->br_startoff - aoff);
>  	if (prealloc && alen >= len)
>  		prealloc = alen - len;
>  
> -	/* Figure out the extent size, adjust alen */
> -	if (whichfork == XFS_COW_FORK) {
> +	/*
> +	 * If we're targeting the COW fork but aren't creating a speculative
> +	 * posteof preallocation, try to expand the reservation to align with
> +	 * the cow extent size hint if there's sufficient free space.
> +	 *
> +	 * Unlike the data fork, the CoW cancellation functions will free all
> +	 * the reservations at inactivation, so we don't require that every
> +	 * delalloc reservation have a dirty pagecache.
> +	 *
> +	 * XXX(hch): I can't see where we actually require dirty pagecache
> +	 * for speculative data fork preallocations.  What am I missing?

IIRC a delalloc reservation in the data fork that isn't backing a dirty
page will just sit there in the data fork and never get reclaimed.
There's no writeback to turn it into an unwritten -> written extent.
The blockgc functions won't (can't?) walk the pagecache to find clean
regions that could be torn down.  xfs destroy_inode just asserts on any
reservations that it finds.

--D

> +	 */
> +	if (use_cowextszhint) {
>  		struct xfs_bmbt_irec	prev;
>  		xfs_extlen_t		extsz = xfs_get_cowextsz_hint(ip);
>  
> @@ -4090,7 +4104,7 @@ xfs_bmapi_reserve_delalloc(
>  	 */
>  	error = xfs_quota_reserve_blkres(ip, alen);
>  	if (error)
> -		return error;
> +		goto out;
>  
>  	/*
>  	 * Split changing sb for alen and indlen since they could be coming
> @@ -4140,6 +4154,16 @@ xfs_bmapi_reserve_delalloc(
>  out_unreserve_quota:
>  	if (XFS_IS_QUOTA_ON(mp))
>  		xfs_quota_unreserve_blkres(ip, alen);
> +out:
> +	if (error == -ENOSPC || error == -EDQUOT) {
> +		trace_xfs_delalloc_enospc(ip, off, len);
> +		if (prealloc || use_cowextszhint) {
> +			/* retry without any preallocation */
> +			prealloc = 0;
> +			use_cowextszhint = false;
> +			goto retry;
> +		}
> +	}
>  	return error;
>  }
>  
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 3783426739258c..34cce017fe7ce1 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1148,27 +1148,13 @@ xfs_buffered_write_iomap_begin(
>  		}
>  	}
>  
> -retry:
>  	error = xfs_bmapi_reserve_delalloc(ip, allocfork, offset_fsb,
>  			end_fsb - offset_fsb, prealloc_blocks,
>  			allocfork == XFS_DATA_FORK ? &imap : &cmap,
>  			allocfork == XFS_DATA_FORK ? &icur : &ccur,
>  			allocfork == XFS_DATA_FORK ? eof : cow_eof);
> -	switch (error) {
> -	case 0:
> -		break;
> -	case -ENOSPC:
> -	case -EDQUOT:
> -		/* retry without any preallocation */
> -		trace_xfs_delalloc_enospc(ip, offset, count);
> -		if (prealloc_blocks) {
> -			prealloc_blocks = 0;
> -			goto retry;
> -		}
> -		fallthrough;
> -	default:
> +	if (error)
>  		goto out_unlock;
> -	}
>  
>  	if (allocfork == XFS_COW_FORK) {
>  		trace_xfs_iomap_alloc(ip, offset, count, allocfork, &cmap);

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/5] xfs: restrict when we try to align cow fork delalloc to cowextsz hints
  2024-06-14  5:27         ` Darrick J. Wong
@ 2024-06-14  5:30           ` Christoph Hellwig
  0 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2024-06-14  5:30 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, chandanbabu, linux-xfs

On Thu, Jun 13, 2024 at 10:27:05PM -0700, Darrick J. Wong wrote:
> > +	 * Unlike the data fork, the CoW cancellation functions will free all
> > +	 * the reservations at inactivation, so we don't require that every
> > +	 * delalloc reservation have a dirty pagecache.
> > +	 *
> > +	 * XXX(hch): I can't see where we actually require dirty pagecache
> > +	 * for speculative data fork preallocations.  What am I missing?
> 
> IIRC a delalloc reservation in the data fork that isn't backing a dirty
> page will just sit there in the data fork and never get reclaimed.
> There's no writeback to turn it into an unwritten -> written extent.
> The blockgc functions won't (can't?) walk the pagecache to find clean
> regions that could be torn down.  xfs destroy_inode just asserts on any
> reservations that it finds.

blockgc doesn't walk the page cache at all.  It just calls
xfs_free_eofblocks which simply drops all extents after i_size.

If it didn't do that we'd be in trouble because there never is any dirty
page cache past roundup(i_size, PAGE_SIZE).
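
To make the roundup(i_size, PAGE_SIZE) bound concrete, here is a small
userspace sketch of the arithmetic.  round_up_u64() and
dirty_pagecache_limit() are hypothetical helpers, not kernel functions,
and the page size is passed in rather than assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Round value up to the next multiple of a nonzero multiple. */
static uint64_t round_up_u64(uint64_t value, uint64_t multiple)
{
	assert(multiple != 0);
	return ((value + multiple - 1) / multiple) * multiple;
}

/*
 * First byte offset at or beyond which no dirty page cache can exist
 * for a file of the given size, per the bound stated above.
 */
static uint64_t dirty_pagecache_limit(uint64_t i_size, uint64_t page_size)
{
	return round_up_u64(i_size, page_size);
}
```

With 4096-byte pages, a 5000-byte file has a limit of 8192, so any
delalloc reservation starting at or beyond that offset cannot be backed
by dirty page cache.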


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/5] xfs: don't treat append-only files as having preallocations
  2024-06-13  8:28     ` Christoph Hellwig
@ 2024-06-17  5:03       ` Dave Chinner
  2024-06-17  6:46         ` Christoph Hellwig
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2024-06-17  5:03 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Darrick J. Wong, chandanbabu, linux-xfs

On Thu, Jun 13, 2024 at 10:28:55AM +0200, Christoph Hellwig wrote:
> On Thu, Jun 13, 2024 at 04:03:53PM +1000, Dave Chinner wrote:
> > I disagree, there was a very good reason for this behaviour:
> > preventing append-only log files from getting excessively fragmented
> > because speculative prealloc would get removed on close().
> 
> Where is that very clear intent documented?  Not in the original
> commit message (which is very sparse) and no where in any documentation
> I can find.

We've lost all the internal SGI bug databases, so there's little to
no evidence I can point at. But at the time, it was a well known
problem amongst Irix XFS engineers that append-only log files would
regularly get horribly fragmented.

There'd been several escalations over that behaviour over the years
w.r.t. large remote servers (think of facilities that "don't trust
the logs on client machines because they might be compromised"). In
general, the fixes for these applications tended to require the
logging server application to use F_RESVSP to do the append-only log
file initialisation.  That got XFS_DIFLAG_PREALLOC set on the files,
so then anything allocated by appending writes beyond EOF was left
alone. That small change was largely sufficient to mitigate worst
case log file fragmentation on Irix-XFS.

So when adding a flag on disk for Linux-XFS to say "this is an
append only file" it made lots of sense to make it behave like
XFS_DIFLAG_PREALLOC had already been set on the inode without
requiring the application to do anything to set that up.

I'll note that the patches sent to the list by Ethan Benson to
originally implement XFS_DIFLAG_APPEND (and others) are not exactly
what was committed in this commit:

https://marc.info/?l=linux-xfs&m=106360278223548&w=2

The last version posted on the list was this:

https://marc.info/?l=linux-xfs&m=106109662212214&w=2

but the version committed had lots of things renamed, sysctls for
sync and nodump inheritance and other bits and pieces including
the EOF freeing changes to skip if DIFLAG_APPEND was set.

It is clear that there was internal SGI discussion, modification and
review of the original proposed patch set, and none of that internal
discussion is on open mailing lists. We might have the historical
XFS code and Linux mailing list archives, but that doesn't always
tell us what institutional knowledge was behind subtle changes to
publicly proposed patches like this....

> > i.e. applications that slowly log messages to append only files
> > with the pattern open(O_APPEND); write(a single line to the log);
> > close(); caused worst case file fragmentation because the close()
> > always removed the speculative prealloc beyond EOF.
> 
> That case should be covered by the XFS_IDIRTY_RELEASE, at least
> except for O_SYNC workloads. 

Ah, so I fixed the problem independently 7 or 8 years later to fix
Linux NFS server performance issues. Ok, that makes removing the
flag less bad, but I still don't see the harm in keeping it there
given that behaviour has existed for the past 20 years....

> > The fix for this pessimistic XFS behaviour is for the application
> > to use chattr +a (like they would for ext3/4), which sets
> > XFS_DIFLAG_APPEND and so avoids the removal of speculative delalloc
> > when the file is closed. Hence the fragmentation problems went away.
> 
> For ext4 the EXT4_APPEND_FL flag does not cause any difference
> in allocation behavior.

Sure, but ext4 doesn't have speculative preallocation beyond EOF to
prevent fragmentation, either.

> For the historic ext2 driver it apparently
> did just that, with an XXX comment marking this as a bug, but for ext3
> it also never did, looking back quite a bit in history.

Ditto - when the filesystem isn't allocating anything beyond EOF,
there's little point in trying to remove blocks beyond EOF that
can't exist on final close()...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/5] xfs: don't treat append-only files as having preallocations
  2024-06-17  5:03       ` Dave Chinner
@ 2024-06-17  6:46         ` Christoph Hellwig
  2024-06-17 23:28           ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2024-06-17  6:46 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, Darrick J. Wong, chandanbabu, linux-xfs

On Mon, Jun 17, 2024 at 03:03:28PM +1000, Dave Chinner wrote:
> > That case should be covered by the XFS_IDIRTY_RELEASE, at least
> > except for O_SYNC workloads. 
> 
> Ah, so I fixed the problem independently 7 or 8 years later to fix
> Linux NFS server performance issues. Ok, that makes removing the
> flag less bad, but I still don't see the harm in keeping it there
> given that behaviour has existed for the past 20 years....

I'm really kinda worried about these unaccounted preallocations lingering
around basically forever.  Note that in current mainline there actually
is a path removing them more or less accidentally when there are
delalloc blocks in a can_free_eofblocks path with force == true,
but that's going away with the next patch.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/5] xfs: don't treat append-only files as having preallocations
  2024-06-17  6:46         ` Christoph Hellwig
@ 2024-06-17 23:28           ` Dave Chinner
  0 siblings, 0 replies; 24+ messages in thread
From: Dave Chinner @ 2024-06-17 23:28 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Darrick J. Wong, chandanbabu, linux-xfs

On Mon, Jun 17, 2024 at 08:46:03AM +0200, Christoph Hellwig wrote:
> On Mon, Jun 17, 2024 at 03:03:28PM +1000, Dave Chinner wrote:
> > > That case should be covered by the XFS_IDIRTY_RELEASE, at least
> > > except for O_SYNC workloads. 
> > 
> > Ah, so I fixed the problem independently 7 or 8 years later to fix
> > Linux NFS server performance issues. Ok, that makes removing the
> > flag less bad, but I still don't see the harm in keeping it there
> > given that behaviour has existed for the past 20 years....
> 
> I'm really kinda worried about these unaccounted preallocations lingering
> around basically forever.

How are they "unaccounted"? They are accounted to the inode, they
are visible in statx and so du reports them.

Maybe you meant "unreclaimable"?

But that's not true, either, because a truncate to the same size or
a hole punch from EOF to -1 will remove the post-EOF blocks. But
that's what the blockgc ioctls are supposed to be doing for these
files, so....

> Note that in current mainline there actually
> is a path removing them more or less accidentally when there are
> delalloc blocks in a can_free_eofblocks path with force == true,
> but that's going away with the next patch.

... fix the blockgc walk to ignore DIFLAG_APPEND when doing its
passes. The files are not marked with DIFLAG_PREALLOC, so blockgc
should trim them, just like it does with all other files that have
had post-eof prealloc that is currently unused.

In short: Don't remove the optimisation that prevents worst case
fragmentation in known workloads. Instead, fix the garbage
collection to do the right thing when space is low and we are
optimising for allocation success rather than optimal file layout.
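
The "truncate to the same size" approach mentioned earlier in this
message can be sketched from userspace as below.  Whether the call
actually releases post-EOF blocks depends on the filesystem; on
filesystems without such blocks it is effectively a no-op.
truncate_to_current_size() is a hypothetical helper name.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Truncate a file to the size it already has.  The file contents and
 * length are unchanged.  Returns 0 on success, -1 on failure.
 */
static int truncate_to_current_size(const char *path)
{
	struct stat st;
	int fd, ret;

	assert(path != NULL);

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;

	ret = fstat(fd, &st);
	if (ret == 0)
		ret = ftruncate(fd, st.st_size);

	if (close(fd) < 0)
		ret = -1;
	return ret;
}
```

Because the size does not change, this is safe to run against a file
that is only ever appended to, as long as nothing is extending it at
the same moment.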

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v1.1 5/5] xfs: verify buffer, inode, and dquot items every tx commit
  2024-06-12 17:47 ` [PATCH 5/5] xfs: verify buffer, inode, and dquot items every tx commit Darrick J. Wong
  2024-06-13  5:07   ` Christoph Hellwig
  2024-06-13  7:04   ` Dave Chinner
@ 2024-06-18  0:18   ` Darrick J. Wong
  2024-06-18  6:38     ` Christoph Hellwig
  2 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2024-06-18  0:18 UTC (permalink / raw)
  To: hch, chandanbabu; +Cc: linux-xfs, Dave Chinner

From: Darrick J. Wong <djwong@kernel.org>

generic/388 has an annoying tendency to fail like this during log
recovery:

XFS (sda4): Unmounting Filesystem 435fe39b-82b6-46ef-be56-819499585130
XFS (sda4): Mounting V5 Filesystem 435fe39b-82b6-46ef-be56-819499585130
XFS (sda4): Starting recovery (logdev: internal)
00000000: 49 4e 81 b6 03 02 00 00 00 00 00 07 00 00 00 07  IN..............
00000010: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 10  ................
00000020: 35 9a 8b c1 3e 6e 81 00 35 9a 8b c1 3f dc b7 00  5...>n..5...?...
00000030: 35 9a 8b c1 3f dc b7 00 00 00 00 00 00 3c 86 4f  5...?........<.O
00000040: 00 00 00 00 00 00 02 f3 00 00 00 00 00 00 00 00  ................
00000050: 00 00 1f 01 00 00 00 00 00 00 00 02 b2 74 c9 0b  .............t..
00000060: ff ff ff ff d7 45 73 10 00 00 00 00 00 00 00 2d  .....Es........-
00000070: 00 00 07 92 00 01 fe 30 00 00 00 00 00 00 00 1a  .......0........
00000080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000090: 35 9a 8b c1 3b 55 0c 00 00 00 00 00 04 27 b2 d1  5...;U.......'..
000000a0: 43 5f e3 9b 82 b6 46 ef be 56 81 94 99 58 51 30  C_....F..V...XQ0
XFS (sda4): Internal error Bad dinode after recovery at line 539 of file fs/xfs/xfs_inode_item_recover.c.  Caller xlog_recover_items_pass2+0x4e/0xc0 [xfs]
CPU: 0 PID: 2189311 Comm: mount Not tainted 6.9.0-rc4-djwx #rc4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-builder-01.us.oracle.com-4.el7.1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x4f/0x60
 xfs_corruption_error+0x90/0xa0
 xlog_recover_inode_commit_pass2+0x5f1/0xb00
 xlog_recover_items_pass2+0x4e/0xc0
 xlog_recover_commit_trans+0x2db/0x350
 xlog_recovery_process_trans+0xab/0xe0
 xlog_recover_process_data+0xa7/0x130
 xlog_do_recovery_pass+0x398/0x840
 xlog_do_log_recovery+0x62/0xc0
 xlog_do_recover+0x34/0x1d0
 xlog_recover+0xe9/0x1a0
 xfs_log_mount+0xff/0x260
 xfs_mountfs+0x5d9/0xb60
 xfs_fs_fill_super+0x76b/0xa30
 get_tree_bdev+0x124/0x1d0
 vfs_get_tree+0x17/0xa0
 path_mount+0x72b/0xa90
 __x64_sys_mount+0x112/0x150
 do_syscall_64+0x49/0x100
 entry_SYSCALL_64_after_hwframe+0x4b/0x53
 </TASK>
XFS (sda4): Corruption detected. Unmount and run xfs_repair
XFS (sda4): Metadata corruption detected at xfs_dinode_verify.part.0+0x739/0x920 [xfs], inode 0x427b2d1
XFS (sda4): Filesystem has been shut down due to log error (0x2).
XFS (sda4): Please unmount the filesystem and rectify the problem(s).
XFS (sda4): log mount/recovery failed: error -117
XFS (sda4): log mount failed

This is inode log item recovery failing the dinode verifier after
replaying the contents of the inode log item into the ondisk inode.
Looking back into what the kernel was doing at the time of the fs
shutdown, a thread was in the middle of running a series of
transactions, each of which committed changes to the inode.

At some point in the middle of that chain, an invalid (at least
according to the verifier) change was committed.  Had the filesystem not
shut down in the middle of the chain, a subsequent transaction would
have corrected the invalid state and nobody would have noticed.  But
that's not what happened here.  Instead, the invalid inode state was
committed to the ondisk log, so log recovery tripped over it.

The actual defect here was an overzealous inode verifier, which was
fixed in a separate patch.  This patch adds some transaction precommit
functions for CONFIG_XFS_DEBUG_EXPENSIVE=y mode so that we can detect
these kinds of transient errors at transaction commit time, where it's
much easier to find the root cause.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
v1.1: hide behind a kconfig switch
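
The general shape of the change (an optional per-item precommit hook that
is compiled away unless an expensive-debug switch is set) can be sketched
in plain userspace C as below.  This is only a toy model under stated
assumptions: every toy_* name is hypothetical, the "zero size is invalid"
rule is arbitrary, and the toy commit returns an error, whereas the hooks
in this patch shut down the filesystem and return 0.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the Kconfig switch; remove to compile the check away. */
#define TOY_DEBUG_EXPENSIVE 1

struct toy_item;

struct toy_item_ops {
	/* Optional hook; NULL means "no precommit check for this type". */
	int (*precommit)(struct toy_item *item);
};

struct toy_item {
	const struct toy_item_ops	*ops;
	unsigned long long		size;
};

/* Toy verifier: returns NULL if the item looks sane, else a reason. */
static const char *
toy_verify(const struct toy_item *item)
{
	/* Arbitrary rule for illustration only. */
	return item->size == 0 ? "zero size" : NULL;
}

#ifdef TOY_DEBUG_EXPENSIVE
static int
toy_precommit(struct toy_item *item)
{
	/* Catch a transient invalid state before it would reach the log. */
	return toy_verify(item) ? -1 : 0;
}
#else
# define toy_precommit	NULL
#endif

static const struct toy_item_ops toy_ops = {
	.precommit	= toy_precommit,
};

/* Toy commit: run each item's precommit hook, if it has one. */
static int
toy_commit(struct toy_item **items, size_t nr)
{
	size_t	i;

	assert(items != NULL || nr == 0);

	for (i = 0; i < nr; i++) {
		struct toy_item	*item = items[i];

		if (item->ops->precommit && item->ops->precommit(item))
			return -1;
	}
	return 0;
}
```

With the switch defined, committing an item whose size is 0 fails at
commit time instead of being noticed later; without it, the ops table
carries a NULL hook and the commit loop skips the check entirely.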
---
 fs/xfs/Kconfig          |   12 ++++++++++++
 fs/xfs/xfs.h            |    4 ++++
 fs/xfs/xfs_buf_item.c   |   32 ++++++++++++++++++++++++++++++++
 fs/xfs/xfs_dquot_item.c |   31 +++++++++++++++++++++++++++++++
 fs/xfs/xfs_inode_item.c |   32 ++++++++++++++++++++++++++++++++
 5 files changed, 111 insertions(+)

diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index c38db1bf4764..53898f6be7f2 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -217,6 +217,18 @@ config XFS_DEBUG
 
 	  Say N unless you are an XFS developer, or you play one on TV.
 
+config XFS_DEBUG_EXPENSIVE
+	bool "XFS expensive debugging checks"
+	depends on XFS_FS && XFS_DEBUG
+	help
+	  Say Y here to get an XFS build with expensive debugging checks
+	  enabled.  These checks may affect performance significantly.
+
+	  Note that the resulting code will be HUGER and SLOWER, and probably
+	  not useful unless you are debugging a particular problem.
+
+	  Say N unless you are an XFS developer, or you play one on TV.
+
 config XFS_ASSERT_FATAL
 	bool "XFS fatal asserts"
 	default y
diff --git a/fs/xfs/xfs.h b/fs/xfs/xfs.h
index f6ffb4f248f7..9355ccad9503 100644
--- a/fs/xfs/xfs.h
+++ b/fs/xfs/xfs.h
@@ -10,6 +10,10 @@
 #define DEBUG 1
 #endif
 
+#ifdef CONFIG_XFS_DEBUG_EXPENSIVE
+#define DEBUG_EXPENSIVE 1
+#endif
+
 #ifdef CONFIG_XFS_ASSERT_FATAL
 #define XFS_ASSERT_FATAL 1
 #endif
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 43031842341a..47549cfa61cd 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -22,6 +22,7 @@
 #include "xfs_trace.h"
 #include "xfs_log.h"
 #include "xfs_log_priv.h"
+#include "xfs_error.h"
 
 
 struct kmem_cache	*xfs_buf_item_cache;
@@ -781,8 +782,39 @@ xfs_buf_item_committed(
 	return lsn;
 }
 
+#ifdef DEBUG_EXPENSIVE
+static int
+xfs_buf_item_precommit(
+	struct xfs_trans	*tp,
+	struct xfs_log_item	*lip)
+{
+	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
+	struct xfs_buf		*bp = bip->bli_buf;
+	struct xfs_mount	*mp = bp->b_mount;
+	xfs_failaddr_t		fa;
+
+	if (!bp->b_ops || !bp->b_ops->verify_struct)
+		return 0;
+	if (bip->bli_flags & XFS_BLI_STALE)
+		return 0;
+
+	fa = bp->b_ops->verify_struct(bp);
+	if (fa) {
+		xfs_buf_verifier_error(bp, -EFSCORRUPTED, bp->b_ops->name,
+				bp->b_addr, BBTOB(bp->b_length), fa);
+		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
+		ASSERT(fa == NULL);
+	}
+
+	return 0;
+}
+#else
+# define xfs_buf_item_precommit	NULL
+#endif
+
 static const struct xfs_item_ops xfs_buf_item_ops = {
 	.iop_size	= xfs_buf_item_size,
+	.iop_precommit	= xfs_buf_item_precommit,
 	.iop_format	= xfs_buf_item_format,
 	.iop_pin	= xfs_buf_item_pin,
 	.iop_unpin	= xfs_buf_item_unpin,
diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c
index 6a1aae799cf1..7d19091215b0 100644
--- a/fs/xfs/xfs_dquot_item.c
+++ b/fs/xfs/xfs_dquot_item.c
@@ -17,6 +17,7 @@
 #include "xfs_trans_priv.h"
 #include "xfs_qm.h"
 #include "xfs_log.h"
+#include "xfs_error.h"
 
 static inline struct xfs_dq_logitem *DQUOT_ITEM(struct xfs_log_item *lip)
 {
@@ -193,8 +194,38 @@ xfs_qm_dquot_logitem_committing(
 	return xfs_qm_dquot_logitem_release(lip);
 }
 
+#ifdef DEBUG_EXPENSIVE
+static int
+xfs_qm_dquot_logitem_precommit(
+	struct xfs_trans	*tp,
+	struct xfs_log_item	*lip)
+{
+	struct xfs_dquot	*dqp = DQUOT_ITEM(lip)->qli_dquot;
+	struct xfs_mount	*mp = dqp->q_mount;
+	struct xfs_disk_dquot	ddq = { };
+	xfs_failaddr_t		fa;
+
+	xfs_dquot_to_disk(&ddq, dqp);
+	fa = xfs_dquot_verify(mp, &ddq, dqp->q_id);
+	if (fa) {
+		XFS_CORRUPTION_ERROR("Bad dquot during logging",
+				XFS_ERRLEVEL_LOW, mp, &ddq, sizeof(ddq));
+		xfs_alert(mp,
+ "Metadata corruption detected at %pS, dquot 0x%x",
+				fa, dqp->q_id);
+		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
+		ASSERT(fa == NULL);
+	}
+
+	return 0;
+}
+#else
+# define xfs_qm_dquot_logitem_precommit	NULL
+#endif
+
 static const struct xfs_item_ops xfs_dquot_item_ops = {
 	.iop_size	= xfs_qm_dquot_logitem_size,
+	.iop_precommit	= xfs_qm_dquot_logitem_precommit,
 	.iop_format	= xfs_qm_dquot_logitem_format,
 	.iop_pin	= xfs_qm_dquot_logitem_pin,
 	.iop_unpin	= xfs_qm_dquot_logitem_unpin,
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index f28d653300d1..ef05cbbe116c 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -37,6 +37,36 @@ xfs_inode_item_sort(
 	return INODE_ITEM(lip)->ili_inode->i_ino;
 }
 
+#ifdef DEBUG_EXPENSIVE
+static void
+xfs_inode_item_precommit_check(
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_dinode	*dip;
+	xfs_failaddr_t		fa;
+
+	dip = kzalloc(mp->m_sb.sb_inodesize, GFP_KERNEL | GFP_NOFS);
+	if (!dip) {
+		ASSERT(dip != NULL);
+		return;
+	}
+
+	xfs_inode_to_disk(ip, dip, 0);
+	xfs_dinode_calc_crc(mp, dip);
+	fa = xfs_dinode_verify(mp, ip->i_ino, dip);
+	if (fa) {
+		xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__, dip,
+				sizeof(*dip), fa);
+		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
+		ASSERT(fa == NULL);
+	}
+	kfree(dip);
+}
+#else
+# define xfs_inode_item_precommit_check(ip)	((void)0)
+#endif
+
 /*
  * Prior to finally logging the inode, we have to ensure that all the
  * per-modification inode state changes are applied. This includes VFS inode
@@ -169,6 +199,8 @@ xfs_inode_item_precommit(
 	iip->ili_fields |= (flags | iip->ili_last_fields);
 	spin_unlock(&iip->ili_lock);
 
+	xfs_inode_item_precommit_check(ip);
+
 	/*
 	 * We are done with the log item transaction dirty state, so clear it so
 	 * that it doesn't pollute future transactions.


* Re: [PATCH v1.1 5/5] xfs: verify buffer, inode, and dquot items every tx commit
  2024-06-18  0:18   ` [PATCH v1.1 " Darrick J. Wong
@ 2024-06-18  6:38     ` Christoph Hellwig
  0 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2024-06-18  6:38 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: hch, chandanbabu, linux-xfs, Dave Chinner

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


end of thread, other threads:[~2024-06-18  6:38 UTC | newest]

Thread overview: 24+ messages
2024-06-12 17:46 [PATCHSET] xfs: random fixes for 6.10 Darrick J. Wong
2024-06-12 17:46 ` [PATCH 1/5] xfs: don't treat append-only files as having preallocations Darrick J. Wong
2024-06-13  6:03   ` Dave Chinner
2024-06-13  8:28     ` Christoph Hellwig
2024-06-17  5:03       ` Dave Chinner
2024-06-17  6:46         ` Christoph Hellwig
2024-06-17 23:28           ` Dave Chinner
2024-06-12 17:47 ` [PATCH 2/5] xfs: fix freeing speculative preallocations for preallocated files Darrick J. Wong
2024-06-12 17:47 ` [PATCH 3/5] xfs: restrict when we try to align cow fork delalloc to cowextsz hints Darrick J. Wong
2024-06-13  5:06   ` Christoph Hellwig
2024-06-14  4:13     ` Darrick J. Wong
2024-06-14  4:41       ` Christoph Hellwig
2024-06-14  5:27         ` Darrick J. Wong
2024-06-14  5:30           ` Christoph Hellwig
2024-06-12 17:47 ` [PATCH 4/5] xfs: allow unlinked symlinks and dirs with zero size Darrick J. Wong
2024-06-13  4:57   ` Christoph Hellwig
2024-06-12 17:47 ` [PATCH 5/5] xfs: verify buffer, inode, and dquot items every tx commit Darrick J. Wong
2024-06-13  5:07   ` Christoph Hellwig
2024-06-13  7:04   ` Dave Chinner
2024-06-14  3:49     ` Darrick J. Wong
2024-06-14  4:42       ` Christoph Hellwig
2024-06-14  5:23         ` Darrick J. Wong
2024-06-18  0:18   ` [PATCH v1.1 " Darrick J. Wong
2024-06-18  6:38     ` Christoph Hellwig
