post-EOF block handling revamp v2

public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed

* post-EOF block handling revamp v2
@ 2024-08-08 15:27 Christoph Hellwig
  2024-08-08 15:27 ` [PATCH 1/9] xfs: remove the i_mode check in xfs_release Christoph Hellwig
                   ` (8 more replies)
  0 siblings, 9 replies; 17+ messages in thread
From: Christoph Hellwig @ 2024-08-08 15:27 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: Darrick J. Wong, Dave Chinner, linux-xfs

Hi all,

this series reworks handling of post-EOF blocks, primarily in ->release.
This takes over the work originally started by Dave in:

    https://marc.info/?l=linux-xfs&m=154951612101291&w=2

and lingering in Darricks's tree:

    https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=reduce-eofblocks-gc-on-close

for years to ensure ->release doesn't too eagerly kill post-EOF block
speculative preallocation and then goes on to not let preallocation
on inodes with the append only flag set linger forever.

I'll also post a rebased version of Dave's patches from back then.

The first patch has already been sent standalone and as part of Darrick's
fixes series, but as the rest of the series depends on it I'm sending it
here again - third time's a charm.

Changes since v1:
 - fix a few commit log and comment typos

Diffstat:
 xfs_bmap_util.c |   34 ++++++++---------------
 xfs_file.c      |   73 +++++++++++++++++++++++++++++++++++++++++++++++--
 xfs_icache.c    |    2 -
 xfs_inode.c     |   82 --------------------------------------------------------
 xfs_inode.h     |    5 +--
 5 files changed, 85 insertions(+), 111 deletions(-)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/9] xfs: remove the i_mode check in xfs_release
  2024-08-08 15:27 post-EOF block handling revamp v2 Christoph Hellwig
@ 2024-08-08 15:27 ` Christoph Hellwig
  2024-08-08 15:27 ` [PATCH 2/9] xfs: refactor f_op->release handling Christoph Hellwig
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2024-08-08 15:27 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: Darrick J. Wong, Dave Chinner, linux-xfs

xfs_release is only called from xfs_file_release, which is wired up as
the f_op->release handler for regular files only.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_inode.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 7dc6f326936cad..c7249257155881 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1086,9 +1086,6 @@ xfs_release(
 	xfs_mount_t	*mp = ip->i_mount;
 	int		error = 0;
 
-	if (!S_ISREG(VFS_I(ip)->i_mode) || (VFS_I(ip)->i_mode == 0))
-		return 0;
-
 	/* If this is a read-only mount, don't do this (would generate I/O) */
 	if (xfs_is_readonly(mp))
 		return 0;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 2/9] xfs: refactor f_op->release handling
  2024-08-08 15:27 post-EOF block handling revamp v2 Christoph Hellwig
  2024-08-08 15:27 ` [PATCH 1/9] xfs: remove the i_mode check in xfs_release Christoph Hellwig
@ 2024-08-08 15:27 ` Christoph Hellwig
  2024-08-08 15:27 ` [PATCH 3/9] xfs: don't bother returning errors from xfs_file_release Christoph Hellwig
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2024-08-08 15:27 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: Darrick J. Wong, Dave Chinner, linux-xfs

Currently f_op->release is split in not very obvious ways.  Fix that by
folding xfs_release into xfs_file_release.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_file.c  | 71 +++++++++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_inode.c | 79 ----------------------------------------------
 fs/xfs/xfs_inode.h |  1 -
 3 files changed, 68 insertions(+), 83 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 4cdc54dc96862e..11732fe1c657c9 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1177,10 +1177,75 @@ xfs_dir_open(
 
 STATIC int
 xfs_file_release(
-	struct inode	*inode,
-	struct file	*filp)
+	struct inode		*inode,
+	struct file		*file)
 {
-	return xfs_release(XFS_I(inode));
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_mount	*mp = ip->i_mount;
+	int			error;
+
+	/* If this is a read-only mount, don't generate I/O */
+	if (xfs_is_readonly(mp))
+		return 0;
+
+	/*
+	 * If we previously truncated this file and removed old data in the
+	 * process, we want to initiate "early" writeout on the last close.
+	 * This is an attempt to combat the notorious NULL files problem which
+	 * is particularly noticeable from a truncate down, buffered (re-)write
+	 * (delalloc), followed by a crash.  What we are effectively doing here
+	 * is significantly reducing the time window where we'd otherwise be
+	 * exposed to that problem.
+	 */
+	if (!xfs_is_shutdown(mp) &&
+	    xfs_iflags_test_and_clear(ip, XFS_ITRUNCATED)) {
+		xfs_iflags_clear(ip, XFS_IDIRTY_RELEASE);
+		if (ip->i_delayed_blks > 0) {
+			error = filemap_flush(inode->i_mapping);
+			if (error)
+				return error;
+		}
+	}
+
+	/*
+	 * XFS aggressively preallocates post-EOF space to generate contiguous
+	 * allocations for writers that append to the end of the file and we
+	 * try to free these when an open file context is released.
+	 *
+	 * There is no point in freeing blocks here for open but unlinked files
+	 * as they will be taken care of by the inactivation path soon.
+	 *
+	 * If we can't get the iolock just skip truncating the blocks past EOF
+	 * because we could deadlock with the mmap_lock otherwise. We'll get
+	 * another chance to drop them once the last reference to the inode is
+	 * dropped, so we'll never leak blocks permanently.
+	 */
+	if (inode->i_nlink && xfs_ilock_nowait(ip, XFS_IOLOCK_EXCL)) {
+		if (xfs_can_free_eofblocks(ip) &&
+		    !xfs_iflags_test(ip, XFS_IDIRTY_RELEASE)) {
+			/*
+			 * Check if the inode is being opened, written and
+			 * closed frequently and we have delayed allocation
+			 * blocks outstanding (e.g. streaming writes from the
+			 * NFS server), truncating the blocks past EOF will
+			 * cause fragmentation to occur.
+			 *
+			 * In this case don't do the truncation, but we have to
+			 * be careful how we detect this case. Blocks beyond EOF
+			 * show up as i_delayed_blks even when the inode is
+			 * clean, so we need to truncate them away first before
+			 * checking for a dirty release. Hence on the first
+			 * dirty close we will still remove the speculative
+			 * allocation, but after that we will leave it in place.
+			 */
+			error = xfs_free_eofblocks(ip);
+			if (!error && ip->i_delayed_blks)
+				xfs_iflags_set(ip, XFS_IDIRTY_RELEASE);
+		}
+		xfs_iunlock(ip, XFS_IOLOCK_EXCL);
+	}
+
+	return error;
 }
 
 STATIC int
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index c7249257155881..a283312033e562 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1079,85 +1079,6 @@ xfs_itruncate_extents_flags(
 	return error;
 }
 
-int
-xfs_release(
-	xfs_inode_t	*ip)
-{
-	xfs_mount_t	*mp = ip->i_mount;
-	int		error = 0;
-
-	/* If this is a read-only mount, don't do this (would generate I/O) */
-	if (xfs_is_readonly(mp))
-		return 0;
-
-	if (!xfs_is_shutdown(mp)) {
-		int truncated;
-
-		/*
-		 * If we previously truncated this file and removed old data
-		 * in the process, we want to initiate "early" writeout on
-		 * the last close.  This is an attempt to combat the notorious
-		 * NULL files problem which is particularly noticeable from a
-		 * truncate down, buffered (re-)write (delalloc), followed by
-		 * a crash.  What we are effectively doing here is
-		 * significantly reducing the time window where we'd otherwise
-		 * be exposed to that problem.
-		 */
-		truncated = xfs_iflags_test_and_clear(ip, XFS_ITRUNCATED);
-		if (truncated) {
-			xfs_iflags_clear(ip, XFS_IDIRTY_RELEASE);
-			if (ip->i_delayed_blks > 0) {
-				error = filemap_flush(VFS_I(ip)->i_mapping);
-				if (error)
-					return error;
-			}
-		}
-	}
-
-	if (VFS_I(ip)->i_nlink == 0)
-		return 0;
-
-	/*
-	 * If we can't get the iolock just skip truncating the blocks past EOF
-	 * because we could deadlock with the mmap_lock otherwise. We'll get
-	 * another chance to drop them once the last reference to the inode is
-	 * dropped, so we'll never leak blocks permanently.
-	 */
-	if (!xfs_ilock_nowait(ip, XFS_IOLOCK_EXCL))
-		return 0;
-
-	if (xfs_can_free_eofblocks(ip)) {
-		/*
-		 * Check if the inode is being opened, written and closed
-		 * frequently and we have delayed allocation blocks outstanding
-		 * (e.g. streaming writes from the NFS server), truncating the
-		 * blocks past EOF will cause fragmentation to occur.
-		 *
-		 * In this case don't do the truncation, but we have to be
-		 * careful how we detect this case. Blocks beyond EOF show up as
-		 * i_delayed_blks even when the inode is clean, so we need to
-		 * truncate them away first before checking for a dirty release.
-		 * Hence on the first dirty close we will still remove the
-		 * speculative allocation, but after that we will leave it in
-		 * place.
-		 */
-		if (xfs_iflags_test(ip, XFS_IDIRTY_RELEASE))
-			goto out_unlock;
-
-		error = xfs_free_eofblocks(ip);
-		if (error)
-			goto out_unlock;
-
-		/* delalloc blocks after truncation means it really is dirty */
-		if (ip->i_delayed_blks)
-			xfs_iflags_set(ip, XFS_IDIRTY_RELEASE);
-	}
-
-out_unlock:
-	xfs_iunlock(ip, XFS_IOLOCK_EXCL);
-	return error;
-}
-
 /*
  * Mark all the buffers attached to this directory stale.  In theory we should
  * never be freeing a directory with any blocks at all, but this covers the
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 51defdebef30ed..6ec83fab66266a 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -512,7 +512,6 @@ enum layout_break_reason {
 #define XFS_INHERIT_GID(pip)	\
 	(xfs_has_grpid((pip)->i_mount) || (VFS_I(pip)->i_mode & S_ISGID))
 
-int		xfs_release(struct xfs_inode *ip);
 int		xfs_inactive(struct xfs_inode *ip);
 int		xfs_lookup(struct xfs_inode *dp, const struct xfs_name *name,
 			   struct xfs_inode **ipp, struct xfs_name *ci_name);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 3/9] xfs: don't bother returning errors from xfs_file_release
  2024-08-08 15:27 post-EOF block handling revamp v2 Christoph Hellwig
  2024-08-08 15:27 ` [PATCH 1/9] xfs: remove the i_mode check in xfs_release Christoph Hellwig
  2024-08-08 15:27 ` [PATCH 2/9] xfs: refactor f_op->release handling Christoph Hellwig
@ 2024-08-08 15:27 ` Christoph Hellwig
  2024-08-08 15:27 ` [PATCH 4/9] xfs: skip all of xfs_file_release when shut down Christoph Hellwig
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2024-08-08 15:27 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: Darrick J. Wong, Dave Chinner, linux-xfs

While ->release returns int, the only caller ignores the return value.
As we're only doing cleanup work there isn't much of a point in
return a value to start with, so just document the situation instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_file.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 11732fe1c657c9..17dfbaca1c581c 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1175,6 +1175,10 @@ xfs_dir_open(
 	return error;
 }
 
+/*
+ * Don't bother propagating errors.  We're just doing cleanup, and the caller
+ * ignores the return value anyway.
+ */
 STATIC int
 xfs_file_release(
 	struct inode		*inode,
@@ -1182,7 +1186,6 @@ xfs_file_release(
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
-	int			error;
 
 	/* If this is a read-only mount, don't generate I/O */
 	if (xfs_is_readonly(mp))
@@ -1200,11 +1203,8 @@ xfs_file_release(
 	if (!xfs_is_shutdown(mp) &&
 	    xfs_iflags_test_and_clear(ip, XFS_ITRUNCATED)) {
 		xfs_iflags_clear(ip, XFS_IDIRTY_RELEASE);
-		if (ip->i_delayed_blks > 0) {
-			error = filemap_flush(inode->i_mapping);
-			if (error)
-				return error;
-		}
+		if (ip->i_delayed_blks > 0)
+			filemap_flush(inode->i_mapping);
 	}
 
 	/*
@@ -1238,14 +1238,14 @@ xfs_file_release(
 			 * dirty close we will still remove the speculative
 			 * allocation, but after that we will leave it in place.
 			 */
-			error = xfs_free_eofblocks(ip);
-			if (!error && ip->i_delayed_blks)
+			xfs_free_eofblocks(ip);
+			if (ip->i_delayed_blks)
 				xfs_iflags_set(ip, XFS_IDIRTY_RELEASE);
 		}
 		xfs_iunlock(ip, XFS_IOLOCK_EXCL);
 	}
 
-	return error;
+	return 0;
 }
 
 STATIC int
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 4/9] xfs: skip all of xfs_file_release when shut down
  2024-08-08 15:27 post-EOF block handling revamp v2 Christoph Hellwig
                   ` (2 preceding siblings ...)
  2024-08-08 15:27 ` [PATCH 3/9] xfs: don't bother returning errors from xfs_file_release Christoph Hellwig
@ 2024-08-08 15:27 ` Christoph Hellwig
  2024-08-08 22:25   ` Dave Chinner
  2024-08-08 15:27 ` [PATCH 5/9] xfs: don't free post-EOF blocks on read close Christoph Hellwig
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2024-08-08 15:27 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: Darrick J. Wong, Dave Chinner, linux-xfs

There is no point in trying to free post-EOF blocks when the file system
is shutdown, as it will just error out ASAP.  Instead return instantly
when xfs_file_shutdown is called on a shut down file system.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_file.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 17dfbaca1c581c..dae8dd1223550d 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1187,8 +1187,11 @@ xfs_file_release(
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
 
-	/* If this is a read-only mount, don't generate I/O */
-	if (xfs_is_readonly(mp))
+	/*
+	 * If this is a read-only mount or the file system has been shut down,
+	 * don't generate I/O.
+	 */
+	if (xfs_is_readonly(mp) || xfs_is_shutdown(mp))
 		return 0;
 
 	/*
@@ -1200,8 +1203,7 @@ xfs_file_release(
 	 * is significantly reducing the time window where we'd otherwise be
 	 * exposed to that problem.
 	 */
-	if (!xfs_is_shutdown(mp) &&
-	    xfs_iflags_test_and_clear(ip, XFS_ITRUNCATED)) {
+	if (xfs_iflags_test_and_clear(ip, XFS_ITRUNCATED)) {
 		xfs_iflags_clear(ip, XFS_IDIRTY_RELEASE);
 		if (ip->i_delayed_blks > 0)
 			filemap_flush(inode->i_mapping);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 5/9] xfs: don't free post-EOF blocks on read close
  2024-08-08 15:27 post-EOF block handling revamp v2 Christoph Hellwig
                   ` (3 preceding siblings ...)
  2024-08-08 15:27 ` [PATCH 4/9] xfs: skip all of xfs_file_release when shut down Christoph Hellwig
@ 2024-08-08 15:27 ` Christoph Hellwig
  2024-08-08 15:27 ` [PATCH 6/9] xfs: only free posteof blocks on first close Christoph Hellwig
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2024-08-08 15:27 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: Darrick J. Wong, Dave Chinner, linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When we have a workload that does open/read/close in parallel with other
allocation, the file becomes rapidly fragmented. This is due to close()
calling xfs_file_release() and removing the speculative preallocation
beyond EOF.

Add a check for a writable context to xfs_file_release to skip the
post-EOF block freeing (an the similarly pointless flushing on truncate
down).

Before:

Test 1: sync write fragmentation counts

/mnt/scratch/file.0: 919
/mnt/scratch/file.1: 916
/mnt/scratch/file.2: 919
/mnt/scratch/file.3: 920
/mnt/scratch/file.4: 920
/mnt/scratch/file.5: 921
/mnt/scratch/file.6: 916
/mnt/scratch/file.7: 918

After:

Test 1: sync write fragmentation counts

/mnt/scratch/file.0: 24
/mnt/scratch/file.1: 24
/mnt/scratch/file.2: 11
/mnt/scratch/file.3: 24
/mnt/scratch/file.4: 3
/mnt/scratch/file.5: 24
/mnt/scratch/file.6: 24
/mnt/scratch/file.7: 23

Signed-off-by: Dave Chinner <dchinner@redhat.com>
[darrick: wordsmithing, fix commit message]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[hch: ported to the new ->release code structure]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_file.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index dae8dd1223550d..60424e64230743 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1217,12 +1217,18 @@ xfs_file_release(
 	 * There is no point in freeing blocks here for open but unlinked files
 	 * as they will be taken care of by the inactivation path soon.
 	 *
+	 * When releasing a read-only context, don't flush data or trim post-EOF
+	 * blocks.  This avoids open/read/close workloads from removing EOF
+	 * blocks that other writers depend upon to reduce fragmentation.
+	 *
 	 * If we can't get the iolock just skip truncating the blocks past EOF
 	 * because we could deadlock with the mmap_lock otherwise. We'll get
 	 * another chance to drop them once the last reference to the inode is
 	 * dropped, so we'll never leak blocks permanently.
 	 */
-	if (inode->i_nlink && xfs_ilock_nowait(ip, XFS_IOLOCK_EXCL)) {
+	if (inode->i_nlink &&
+	    (file->f_mode & FMODE_WRITE) &&
+	    xfs_ilock_nowait(ip, XFS_IOLOCK_EXCL)) {
 		if (xfs_can_free_eofblocks(ip) &&
 		    !xfs_iflags_test(ip, XFS_IDIRTY_RELEASE)) {
 			/*
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 6/9] xfs: only free posteof blocks on first close
  2024-08-08 15:27 post-EOF block handling revamp v2 Christoph Hellwig
                   ` (4 preceding siblings ...)
  2024-08-08 15:27 ` [PATCH 5/9] xfs: don't free post-EOF blocks on read close Christoph Hellwig
@ 2024-08-08 15:27 ` Christoph Hellwig
  2024-08-08 22:36   ` Dave Chinner
  2024-08-08 15:27 ` [PATCH 7/9] xfs: check XFS_EOFBLOCKS_RELEASED earlier in xfs_release_eofblocks Christoph Hellwig
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2024-08-08 15:27 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: Darrick J. Wong, Dave Chinner, linux-xfs

From: "Darrick J. Wong" <djwong@kernel.org>

Certain workloads fragment files on XFS very badly, such as a software
package that creates a number of threads, each of which repeatedly run
the sequence: open a file, perform a synchronous write, and close the
file, which defeats the speculative preallocation mechanism.  We work
around this problem by only deleting posteof blocks the /first/ time a
file is closed to preserve the behavior that unpacking a tarball lays
out files one after the other with no gaps.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[hch: rebased, updated comment, renamed the flag]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_file.c  | 32 +++++++++++---------------------
 fs/xfs/xfs_inode.h |  4 ++--
 2 files changed, 13 insertions(+), 23 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 60424e64230743..30b553ac8f56bb 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1204,15 +1204,21 @@ xfs_file_release(
 	 * exposed to that problem.
 	 */
 	if (xfs_iflags_test_and_clear(ip, XFS_ITRUNCATED)) {
-		xfs_iflags_clear(ip, XFS_IDIRTY_RELEASE);
+		xfs_iflags_clear(ip, XFS_EOFBLOCKS_RELEASED);
 		if (ip->i_delayed_blks > 0)
 			filemap_flush(inode->i_mapping);
 	}
 
 	/*
 	 * XFS aggressively preallocates post-EOF space to generate contiguous
-	 * allocations for writers that append to the end of the file and we
-	 * try to free these when an open file context is released.
+	 * allocations for writers that append to the end of the file.
+	 *
+	 * To support workloads that close and reopen the file frequently, these
+	 * preallocations usually persist after a close unless it is the first
+	 * close for the inode.  This is a tradeoff to generate tightly packed
+	 * data layouts for unpacking tarballs or similar archives that write
+	 * one file after another without going back to it while keeping the
+	 * preallocation for files that have recurring open/write/close cycles.
 	 *
 	 * There is no point in freeing blocks here for open but unlinked files
 	 * as they will be taken care of by the inactivation path soon.
@@ -1230,25 +1236,9 @@ xfs_file_release(
 	    (file->f_mode & FMODE_WRITE) &&
 	    xfs_ilock_nowait(ip, XFS_IOLOCK_EXCL)) {
 		if (xfs_can_free_eofblocks(ip) &&
-		    !xfs_iflags_test(ip, XFS_IDIRTY_RELEASE)) {
-			/*
-			 * Check if the inode is being opened, written and
-			 * closed frequently and we have delayed allocation
-			 * blocks outstanding (e.g. streaming writes from the
-			 * NFS server), truncating the blocks past EOF will
-			 * cause fragmentation to occur.
-			 *
-			 * In this case don't do the truncation, but we have to
-			 * be careful how we detect this case. Blocks beyond EOF
-			 * show up as i_delayed_blks even when the inode is
-			 * clean, so we need to truncate them away first before
-			 * checking for a dirty release. Hence on the first
-			 * dirty close we will still remove the speculative
-			 * allocation, but after that we will leave it in place.
-			 */
+		    !xfs_iflags_test(ip, XFS_EOFBLOCKS_RELEASED)) {
 			xfs_free_eofblocks(ip);
-			if (ip->i_delayed_blks)
-				xfs_iflags_set(ip, XFS_IDIRTY_RELEASE);
+			xfs_iflags_set(ip, XFS_EOFBLOCKS_RELEASED);
 		}
 		xfs_iunlock(ip, XFS_IOLOCK_EXCL);
 	}
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 6ec83fab66266a..2763a9ffa643db 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -335,7 +335,7 @@ static inline bool xfs_inode_has_bigrtalloc(struct xfs_inode *ip)
 #define XFS_INEW		(1 << 3) /* inode has just been allocated */
 #define XFS_IPRESERVE_DM_FIELDS	(1 << 4) /* has legacy DMAPI fields set */
 #define XFS_ITRUNCATED		(1 << 5) /* truncated down so flush-on-close */
-#define XFS_IDIRTY_RELEASE	(1 << 6) /* dirty release already seen */
+#define XFS_EOFBLOCKS_RELEASED	(1 << 6) /* eofblocks were freed in ->release */
 #define XFS_IFLUSHING		(1 << 7) /* inode is being flushed */
 #define __XFS_IPINNED_BIT	8	 /* wakeup key for zero pin count */
 #define XFS_IPINNED		(1 << __XFS_IPINNED_BIT)
@@ -382,7 +382,7 @@ static inline bool xfs_inode_has_bigrtalloc(struct xfs_inode *ip)
  */
 #define XFS_IRECLAIM_RESET_FLAGS	\
 	(XFS_IRECLAIMABLE | XFS_IRECLAIM | \
-	 XFS_IDIRTY_RELEASE | XFS_ITRUNCATED | XFS_NEED_INACTIVE | \
+	 XFS_EOFBLOCKS_RELEASED | XFS_ITRUNCATED | XFS_NEED_INACTIVE | \
 	 XFS_INACTIVATING | XFS_IQUOTAUNCHECKED)
 
 /*
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 7/9] xfs: check XFS_EOFBLOCKS_RELEASED earlier in xfs_release_eofblocks
  2024-08-08 15:27 post-EOF block handling revamp v2 Christoph Hellwig
                   ` (5 preceding siblings ...)
  2024-08-08 15:27 ` [PATCH 6/9] xfs: only free posteof blocks on first close Christoph Hellwig
@ 2024-08-08 15:27 ` Christoph Hellwig
  2024-08-08 23:03   ` Dave Chinner
  2024-08-08 15:27 ` [PATCH 8/9] xfs: simplify extent lookup in xfs_can_free_eofblocks Christoph Hellwig
  2024-08-08 15:27 ` [PATCH 9/9] xfs: reclaim speculative preallocations for append only files Christoph Hellwig
  8 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2024-08-08 15:27 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: Darrick J. Wong, Dave Chinner, linux-xfs

If the XFS_EOFBLOCKS_RELEASED flag is set, we are not going to free the
eofblocks, so don't bother locking the inode or performing the checks in
xfs_can_free_eofblocks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_file.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 30b553ac8f56bb..f1593690ba88d2 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1234,9 +1234,9 @@ xfs_file_release(
 	 */
 	if (inode->i_nlink &&
 	    (file->f_mode & FMODE_WRITE) &&
+	    !xfs_iflags_test(ip, XFS_EOFBLOCKS_RELEASED) &&
 	    xfs_ilock_nowait(ip, XFS_IOLOCK_EXCL)) {
-		if (xfs_can_free_eofblocks(ip) &&
-		    !xfs_iflags_test(ip, XFS_EOFBLOCKS_RELEASED)) {
+		if (xfs_can_free_eofblocks(ip)) {
 			xfs_free_eofblocks(ip);
 			xfs_iflags_set(ip, XFS_EOFBLOCKS_RELEASED);
 		}
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 8/9] xfs: simplify extent lookup in xfs_can_free_eofblocks
  2024-08-08 15:27 post-EOF block handling revamp v2 Christoph Hellwig
                   ` (6 preceding siblings ...)
  2024-08-08 15:27 ` [PATCH 7/9] xfs: check XFS_EOFBLOCKS_RELEASED earlier in xfs_release_eofblocks Christoph Hellwig
@ 2024-08-08 15:27 ` Christoph Hellwig
  2024-08-08 15:27 ` [PATCH 9/9] xfs: reclaim speculative preallocations for append only files Christoph Hellwig
  8 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2024-08-08 15:27 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: Darrick J. Wong, Dave Chinner, linux-xfs

xfs_can_free_eofblocks just cares if there is an extent beyond EOF.
Replace the call to xfs_bmapi_read with a xfs_iext_lookup_extent
as we've already checked that extents are read in earlier.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_bmap_util.c | 22 +++++++---------------
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index fe2e2c93097550..9c42cfb62cf2dc 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -492,12 +492,12 @@ bool
 xfs_can_free_eofblocks(
 	struct xfs_inode	*ip)
 {
-	struct xfs_bmbt_irec	imap;
 	struct xfs_mount	*mp = ip->i_mount;
+	bool			found_blocks = false;
 	xfs_fileoff_t		end_fsb;
 	xfs_fileoff_t		last_fsb;
-	int			nimaps = 1;
-	int			error;
+	struct xfs_bmbt_irec	imap;
+	struct xfs_iext_cursor	icur;
 
 	/*
 	 * Caller must either hold the exclusive io lock; or be inactivating
@@ -544,21 +544,13 @@ xfs_can_free_eofblocks(
 		return false;
 
 	/*
-	 * Look up the mapping for the first block past EOF.  If we can't find
-	 * it, there's nothing to free.
+	 * Check if there is an post-EOF extent to free.
 	 */
 	xfs_ilock(ip, XFS_ILOCK_SHARED);
-	error = xfs_bmapi_read(ip, end_fsb, last_fsb - end_fsb, &imap, &nimaps,
-			0);
+	if (xfs_iext_lookup_extent(ip, &ip->i_df, end_fsb, &icur, &imap))
+		found_blocks = true;
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
-	if (error || nimaps == 0)
-		return false;
-
-	/*
-	 * If there's a real mapping there or there are delayed allocation
-	 * reservations, then we have post-EOF blocks to try to free.
-	 */
-	return imap.br_startblock != HOLESTARTBLOCK || ip->i_delayed_blks;
+	return found_blocks;
 }
 
 /*
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 9/9] xfs: reclaim speculative preallocations for append only files
  2024-08-08 15:27 post-EOF block handling revamp v2 Christoph Hellwig
                   ` (7 preceding siblings ...)
  2024-08-08 15:27 ` [PATCH 8/9] xfs: simplify extent lookup in xfs_can_free_eofblocks Christoph Hellwig
@ 2024-08-08 15:27 ` Christoph Hellwig
  8 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2024-08-08 15:27 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: Darrick J. Wong, Dave Chinner, linux-xfs

The XFS XFS_DIFLAG_APPEND maps to the VFS S_APPEND flag, which forbids
writes that don't append at the current EOF.

But the commit originally adding XFS_DIFLAG_APPEND support (commit
a23321e766d in xfs xfs-import repository) also checked it to skip
releasing speculative preallocations, which doesn't make any sense.

Another commit (dd9f438e3290 in the xfs-import repository) later extended
that flag to also report these speculation preallocations which should
not exist in getbmap.

Remove these checks as nothing XFS_DIFLAG_APPEND implies that
preallocations beyond EOF should exist, but explicitly check for
XFS_DIFLAG_APPEND in xfs_file_release to bypass the algorithm that
discard preallocations on the first close as append only files aren't
expected to be written to only once.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_bmap_util.c | 12 +++++-------
 fs/xfs/xfs_file.c      |  4 ++++
 fs/xfs/xfs_icache.c    |  2 +-
 3 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 9c42cfb62cf2dc..0f1e3289255c2e 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -331,8 +331,7 @@ xfs_getbmap(
 		}
 
 		if (xfs_get_extsz_hint(ip) ||
-		    (ip->i_diflags &
-		     (XFS_DIFLAG_PREALLOC | XFS_DIFLAG_APPEND)))
+		    (ip->i_diflags & XFS_DIFLAG_PREALLOC))
 			max_len = mp->m_super->s_maxbytes;
 		else
 			max_len = XFS_ISIZE(ip);
@@ -524,12 +523,11 @@ xfs_can_free_eofblocks(
 		return false;
 
 	/*
-	 * Only free real extents for inodes with persistent preallocations or
-	 * the append-only flag.
+	 * Do not free real extents in preallocated files unless the file has
+	 * delalloc blocks and we are forced to remove them.
 	 */
-	if (ip->i_diflags & (XFS_DIFLAG_PREALLOC | XFS_DIFLAG_APPEND))
-		if (ip->i_delayed_blks == 0)
-			return false;
+	if ((ip->i_diflags & XFS_DIFLAG_PREALLOC) && !ip->i_delayed_blks)
+		return false;
 
 	/*
 	 * Do not try to free post-EOF blocks if EOF is beyond the end of the
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index f1593690ba88d2..f244b8e8056f66 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1220,6 +1220,9 @@ xfs_file_release(
 	 * one file after another without going back to it while keeping the
 	 * preallocation for files that have recurring open/write/close cycles.
 	 *
+	 * This heuristic is skipped for inodes with the append-only flag as
+	 * that flag is rather pointless for inodes written only once.
+	 *
 	 * There is no point in freeing blocks here for open but unlinked files
 	 * as they will be taken care of by the inactivation path soon.
 	 *
@@ -1234,6 +1237,7 @@ xfs_file_release(
 	 */
 	if (inode->i_nlink &&
 	    (file->f_mode & FMODE_WRITE) &&
+	    !(ip->i_diflags & XFS_DIFLAG_APPEND) &&
 	    !xfs_iflags_test(ip, XFS_EOFBLOCKS_RELEASED) &&
 	    xfs_ilock_nowait(ip, XFS_IOLOCK_EXCL)) {
 		if (xfs_can_free_eofblocks(ip)) {
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index cf629302d48e74..e995e2f6152dbd 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1159,7 +1159,7 @@ xfs_inode_free_eofblocks(
 	if (xfs_can_free_eofblocks(ip))
 		return xfs_free_eofblocks(ip);
 
-	/* inode could be preallocated or append-only */
+	/* inode could be preallocated */
 	trace_xfs_inode_free_eofblocks_invalid(ip);
 	xfs_inode_clear_eofblocks_tag(ip);
 	return 0;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/9] xfs: skip all of xfs_file_release when shut down
  2024-08-08 15:27 ` [PATCH 4/9] xfs: skip all of xfs_file_release when shut down Christoph Hellwig
@ 2024-08-08 22:25   ` Dave Chinner
  0 siblings, 0 replies; 17+ messages in thread
From: Dave Chinner @ 2024-08-08 22:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chandan Babu R, Darrick J. Wong, Dave Chinner, linux-xfs

On Thu, Aug 08, 2024 at 08:27:30AM -0700, Christoph Hellwig wrote:
> There is no point in trying to free post-EOF blocks when the file system
> is shutdown, as it will just error out ASAP.  Instead return instantly
> when xfs_file_shutdown is called on a shut down file system.
                ^^^^^^^^
       xfs_file_release()

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 6/9] xfs: only free posteof blocks on first close
  2024-08-08 15:27 ` [PATCH 6/9] xfs: only free posteof blocks on first close Christoph Hellwig
@ 2024-08-08 22:36   ` Dave Chinner
  2024-08-11  8:44     ` Christoph Hellwig
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2024-08-08 22:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chandan Babu R, Darrick J. Wong, Dave Chinner, linux-xfs

On Thu, Aug 08, 2024 at 08:27:32AM -0700, Christoph Hellwig wrote:
> From: "Darrick J. Wong" <djwong@kernel.org>
> 
> Certain workloads fragment files on XFS very badly, such as a software
> package that creates a number of threads, each of which repeatedly run
> the sequence: open a file, perform a synchronous write, and close the
> file, which defeats the speculative preallocation mechanism.  We work
> around this problem by only deleting posteof blocks the /first/ time a
> file is closed to preserve the behavior that unpacking a tarball lays
> out files one after the other with no gaps.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> [hch: rebased, updated comment, renamed the flag]
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_file.c  | 32 +++++++++++---------------------
>  fs/xfs/xfs_inode.h |  4 ++--
>  2 files changed, 13 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 60424e64230743..30b553ac8f56bb 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1204,15 +1204,21 @@ xfs_file_release(
>  	 * exposed to that problem.
>  	 */
>  	if (xfs_iflags_test_and_clear(ip, XFS_ITRUNCATED)) {
> -		xfs_iflags_clear(ip, XFS_IDIRTY_RELEASE);
> +		xfs_iflags_clear(ip, XFS_EOFBLOCKS_RELEASED);
>  		if (ip->i_delayed_blks > 0)
>  			filemap_flush(inode->i_mapping);
>  	}

This should probably be open coded to minimise lock cycles and lock
contention on the flags lock when concurrent open/sync write/close 
cycles are run on the file (as recently reported by Mateusz). i.e:

	if (ip->i_flags & XFS_ITRUNCATED) {
		spin_lock(&ip->i_flags_lock);
		if (ip->i_flags & XFS_ITRUNCATED)
			ip->i_flags &= ~(XFS_ITRUNCATED | XFS_EOFBLOCKS_RELEASED);
		spin_unlock(&ip->i_flags_lock);
		if (ip->i_delayed_blks > 0)
			filemap_flush(inode->i_mapping);
	}

....
> @@ -1230,25 +1236,9 @@ xfs_file_release(
>  	    (file->f_mode & FMODE_WRITE) &&
>  	    xfs_ilock_nowait(ip, XFS_IOLOCK_EXCL)) {
>  		if (xfs_can_free_eofblocks(ip) &&
> -		    !xfs_iflags_test(ip, XFS_IDIRTY_RELEASE)) {
> -			/*
> -			 * Check if the inode is being opened, written and
> -			 * closed frequently and we have delayed allocation
> -			 * blocks outstanding (e.g. streaming writes from the
> -			 * NFS server), truncating the blocks past EOF will
> -			 * cause fragmentation to occur.
> -			 *
> -			 * In this case don't do the truncation, but we have to
> -			 * be careful how we detect this case. Blocks beyond EOF
> -			 * show up as i_delayed_blks even when the inode is
> -			 * clean, so we need to truncate them away first before
> -			 * checking for a dirty release. Hence on the first
> -			 * dirty close we will still remove the speculative
> -			 * allocation, but after that we will leave it in place.
> -			 */
> +		    !xfs_iflags_test(ip, XFS_EOFBLOCKS_RELEASED)) {
>  			xfs_free_eofblocks(ip);
> -			if (ip->i_delayed_blks)
> -				xfs_iflags_set(ip, XFS_IDIRTY_RELEASE);
> +			xfs_iflags_set(ip, XFS_EOFBLOCKS_RELEASED);

		     !xfs_iflags_test_and_set(ip, XFS_EOFBLOCKS_RELEASED)
			xfs_free_eofblocks(ip);

This also avoids an extra lock cycle to set the flag....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 7/9] xfs: check XFS_EOFBLOCKS_RELEASED earlier in xfs_release_eofblocks
  2024-08-08 15:27 ` [PATCH 7/9] xfs: check XFS_EOFBLOCKS_RELEASED earlier in xfs_release_eofblocks Christoph Hellwig
@ 2024-08-08 23:03   ` Dave Chinner
  2024-08-11  8:59     ` Christoph Hellwig
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2024-08-08 23:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chandan Babu R, Darrick J. Wong, Dave Chinner, linux-xfs

On Thu, Aug 08, 2024 at 08:27:33AM -0700, Christoph Hellwig wrote:
> If the XFS_EOFBLOCKS_RELEASED flag is set, we are not going to free the
> eofblocks, so don't bother locking the inode or performing the checks in
> xfs_can_free_eofblocks.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_file.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 30b553ac8f56bb..f1593690ba88d2 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1234,9 +1234,9 @@ xfs_file_release(
>  	 */
>  	if (inode->i_nlink &&
>  	    (file->f_mode & FMODE_WRITE) &&
> +	    !xfs_iflags_test(ip, XFS_EOFBLOCKS_RELEASED) &&
>  	    xfs_ilock_nowait(ip, XFS_IOLOCK_EXCL)) {
> -		if (xfs_can_free_eofblocks(ip) &&
> -		    !xfs_iflags_test(ip, XFS_EOFBLOCKS_RELEASED)) {
> +		if (xfs_can_free_eofblocks(ip)) {
>  			xfs_free_eofblocks(ip);
>  			xfs_iflags_set(ip, XFS_EOFBLOCKS_RELEASED);
>  		}

The test and set here is racy. A long time can pass between the test
and the setting of the flag, so maybe this should be optimised to
something like:

	if (inode->i_nlink &&
	    (file->f_mode & FMODE_WRITE) &&
	    (!(ip->i_flags & XFS_EOFBLOCKS_RELEASED)) &&
	    xfs_ilock_nowait(ip, XFS_IOLOCK_EXCL)) {
		if (xfs_can_free_eofblocks(ip) &&
		    !xfs_iflags_test_and_set(ip, XFS_EOFBLOCKS_RELEASED))
			xfs_free_eofblocks(ip);
		xfs_iunlock(ip, XFS_IOLOCK_EXCL);
	}

I do wonder, though - why do we need to hold the IOLOCK to call
xfs_can_free_eofblocks()? The only thing that really needs
serialisation is the xfS_bmapi_read() call, and that's done under
the ILOCK not the IOLOCK. Sure, xfs_free_eofblocks() needs the
IOLOCK because it's effectively a truncate w.r.t. extending writes,
but races with extending writes while checking if we need to do that
operation aren't really a big deal. Worst case is we take the
lock and free the EOF blocks beyond the writes we raced with.

What am I missing here?

i.e. it seems to me that the logic here could be:

	if (inode->i_nlink &&
	    (file->f_mode & FMODE_WRITE) &&
	    (!(ip->i_flags & XFS_EOFBLOCKS_RELEASED)) &&
	    xfs_can_free_eofblocks(ip) &&
	    xfs_ilock_nowait(ip, XFS_IOLOCK_EXCL) &&
	    !xfs_iflags_test_and_set(ip, XFS_EOFBLOCKS_RELEASED)) {
		xfs_free_eofblocks(ip);
		xfs_iunlock(ip, XFS_IOLOCK_EXCL);
	}

And so avoids attempting to take or taking locks in all the cases
where locks can be avoided.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 6/9] xfs: only free posteof blocks on first close
  2024-08-08 22:36   ` Dave Chinner
@ 2024-08-11  8:44     ` Christoph Hellwig
  0 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2024-08-11  8:44 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Chandan Babu R, Darrick J. Wong, Dave Chinner,
	linux-xfs

On Fri, Aug 09, 2024 at 08:36:33AM +1000, Dave Chinner wrote:
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 60424e64230743..30b553ac8f56bb 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -1204,15 +1204,21 @@ xfs_file_release(
> >  	 * exposed to that problem.
> >  	 */
> >  	if (xfs_iflags_test_and_clear(ip, XFS_ITRUNCATED)) {
> > -		xfs_iflags_clear(ip, XFS_IDIRTY_RELEASE);
> > +		xfs_iflags_clear(ip, XFS_EOFBLOCKS_RELEASED);
> >  		if (ip->i_delayed_blks > 0)
> >  			filemap_flush(inode->i_mapping);
> >  	}
> 
> This should probably be open coded to minimise lock cycles and lock
> contention on the flags lock when concurrent open/sync write/close 
> cycles are run on the file (as recently reported by Mateusz). i.e:

Let's do that as a separate patch on top of the series, as that
is unrelated.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 7/9] xfs: check XFS_EOFBLOCKS_RELEASED earlier in xfs_release_eofblocks
  2024-08-08 23:03   ` Dave Chinner
@ 2024-08-11  8:59     ` Christoph Hellwig
  2024-08-11 23:48       ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2024-08-11  8:59 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Chandan Babu R, Darrick J. Wong, Dave Chinner,
	linux-xfs

On Fri, Aug 09, 2024 at 09:03:24AM +1000, Dave Chinner wrote:
> The test and set here is racy. A long time can pass between the test
> and the setting of the flag,

The race window is much tighter due to the iolock, but if we really
care about the race here, the right fix for that is to keep a second
check for the XFS_EOFBLOCKS_RELEASED flag inside the iolock.

> so maybe this should be optimised to
> something like:
> 
> 	if (inode->i_nlink &&
> 	    (file->f_mode & FMODE_WRITE) &&
> 	    (!(ip->i_flags & XFS_EOFBLOCKS_RELEASED)) &&
> 	    xfs_ilock_nowait(ip, XFS_IOLOCK_EXCL)) {
> 		if (xfs_can_free_eofblocks(ip) &&
> 		    !xfs_iflags_test_and_set(ip, XFS_EOFBLOCKS_RELEASED))
> 			xfs_free_eofblocks(ip);
> 		xfs_iunlock(ip, XFS_IOLOCK_EXCL);
> 	}

All these direct i_flags access actually are racy too (at least in
theory).  We'd probably be better off moving those over to the atomic
bitops and only using i_lock for any coordination beyond the actual
flags.  I'd rather not get into that here for now, even if it is a
worthwhile project for later.

> I do wonder, though - why do we need to hold the IOLOCK to call
> xfs_can_free_eofblocks()? The only thing that really needs
> serialisation is the xfS_bmapi_read() call, and that's done under
> the ILOCK not the IOLOCK. Sure, xfs_free_eofblocks() needs the
> IOLOCK because it's effectively a truncate w.r.t. extending writes,
> but races with extending writes while checking if we need to do that
> operation aren't really a big deal. Worst case is we take the
> lock and free the EOF blocks beyond the writes we raced with.
> 
> What am I missing here?

I think the prime part of the story is that xfs_can_free_eofblocks was
split out of xfs_free_eofblocks, which requires the iolock.  But I'm
not sure if some of the checks are a little racy without the iolock,
although I doubt it matter in practice as they are all optimizations.
I'd need to take a deeper look at this, so maybe it's worth a follow
on together with the changes in i_flags handling.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 7/9] xfs: check XFS_EOFBLOCKS_RELEASED earlier in xfs_release_eofblocks
  2024-08-11  8:59     ` Christoph Hellwig
@ 2024-08-11 23:48       ` Dave Chinner
  0 siblings, 0 replies; 17+ messages in thread
From: Dave Chinner @ 2024-08-11 23:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chandan Babu R, Darrick J. Wong, Dave Chinner, linux-xfs

On Sun, Aug 11, 2024 at 10:59:52AM +0200, Christoph Hellwig wrote:
> On Fri, Aug 09, 2024 at 09:03:24AM +1000, Dave Chinner wrote:
> > The test and set here is racy. A long time can pass between the test
> > and the setting of the flag,
> 
> The race window is much tighter due to the iolock, but if we really
> care about the race here, the right fix for that is to keep a second
> check for the XFS_EOFBLOCKS_RELEASED flag inside the iolock.

Right, that's exactly what the code I proposed below does.

> > so maybe this should be optimised to
> > something like:
> > 
> > 	if (inode->i_nlink &&
> > 	    (file->f_mode & FMODE_WRITE) &&
> > 	    (!(ip->i_flags & XFS_EOFBLOCKS_RELEASED)) &&
> > 	    xfs_ilock_nowait(ip, XFS_IOLOCK_EXCL)) {
> > 		if (xfs_can_free_eofblocks(ip) &&
> > 		    !xfs_iflags_test_and_set(ip, XFS_EOFBLOCKS_RELEASED))
> > 			xfs_free_eofblocks(ip);
> > 		xfs_iunlock(ip, XFS_IOLOCK_EXCL);
> > 	}
> 
> All these direct i_flags access actually are racy too (at least in
> theory).

Yes, but we really don't care about racing against the bit being
set. The flag never gets cleared unless a truncate down occurs, so
we don't really have to care about racing with that case - there
will be no eofblocks to free.

If the test races with another release call setting the flag (i.e.
we see it clear) then we are going to go the slow way and then do
exactly the right thing according to the current bit state once we
hold the IO lock and the i_flags_lock.

> We'd probably be better off moving those over to the atomic
> bitops and only using i_lock for any coordination beyond the actual
> flags.  I'd rather not get into that here for now, even if it is a
> worthwhile project for later.

That doesn't solve the exclusive cacheline access problem Mateusz
reported. It allows us to isolate the bitop updates, but in this
case here the atomic test-and-set op still requires exclusive
cacheline access.

Hence we'd still need test-test-and-set optimisations here to avoid
the exclusive cacheline contention when the bit is already set...

> > I do wonder, though - why do we need to hold the IOLOCK to call
> > xfs_can_free_eofblocks()? The only thing that really needs
> > serialisation is the xfS_bmapi_read() call, and that's done under
> > the ILOCK not the IOLOCK. Sure, xfs_free_eofblocks() needs the
> > IOLOCK because it's effectively a truncate w.r.t. extending writes,
> > but races with extending writes while checking if we need to do that
> > operation aren't really a big deal. Worst case is we take the
> > lock and free the EOF blocks beyond the writes we raced with.
> > 
> > What am I missing here?
> 
> I think the prime part of the story is that xfs_can_free_eofblocks was
> split out of xfs_free_eofblocks, which requires the iolock.  But I'm
> not sure if some of the checks are a little racy without the iolock,

Ok. I think the checks are racy even with the iolock - most of the
checks are for inode metadata that is modified under the ilock (e.g.
i_diflags, i_delayed_blks) or the ip->i_flags_lock (e.g.
VFS_I(ip)->i_size for serialisation with updates via
xfs_dio_write_end_io()). Hence I don't think that holding the IO
lock here makes any difference here at all...

> although I doubt it matter in practice as they are all optimizations.
> I'd need to take a deeper look at this, so maybe it's worth a follow
> on together with the changes in i_flags handling.

*nod*

-Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 7/9] xfs: check XFS_EOFBLOCKS_RELEASED earlier in xfs_release_eofblocks
  2024-08-13  7:39 post-EOF block handling revamp v3 Christoph Hellwig
@ 2024-08-13  7:39 ` Christoph Hellwig
  0 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2024-08-13  7:39 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: Darrick J. Wong, Dave Chinner, linux-xfs

If the XFS_EOFBLOCKS_RELEASED flag is set, we are not going to free the
eofblocks, so don't bother locking the inode or performing the checks in
xfs_can_free_eofblocks.  Also switch to a test_and_set operation once
the iolock has been acquire so that only the caller that sets it actually
frees the post-EOF blocks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_file.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 30b553ac8f56bb..986448d1ff3c0c 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1234,12 +1234,11 @@ xfs_file_release(
 	 */
 	if (inode->i_nlink &&
 	    (file->f_mode & FMODE_WRITE) &&
+	    !xfs_iflags_test(ip, XFS_EOFBLOCKS_RELEASED) &&
 	    xfs_ilock_nowait(ip, XFS_IOLOCK_EXCL)) {
 		if (xfs_can_free_eofblocks(ip) &&
-		    !xfs_iflags_test(ip, XFS_EOFBLOCKS_RELEASED)) {
+		    !xfs_iflags_test_and_set(ip, XFS_EOFBLOCKS_RELEASED))
 			xfs_free_eofblocks(ip);
-			xfs_iflags_set(ip, XFS_EOFBLOCKS_RELEASED);
-		}
 		xfs_iunlock(ip, XFS_IOLOCK_EXCL);
 	}
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2024-08-13  7:40 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-08-08 15:27 post-EOF block handling revamp v2 Christoph Hellwig
2024-08-08 15:27 ` [PATCH 1/9] xfs: remove the i_mode check in xfs_release Christoph Hellwig
2024-08-08 15:27 ` [PATCH 2/9] xfs: refactor f_op->release handling Christoph Hellwig
2024-08-08 15:27 ` [PATCH 3/9] xfs: don't bother returning errors from xfs_file_release Christoph Hellwig
2024-08-08 15:27 ` [PATCH 4/9] xfs: skip all of xfs_file_release when shut down Christoph Hellwig
2024-08-08 22:25   ` Dave Chinner
2024-08-08 15:27 ` [PATCH 5/9] xfs: don't free post-EOF blocks on read close Christoph Hellwig
2024-08-08 15:27 ` [PATCH 6/9] xfs: only free posteof blocks on first close Christoph Hellwig
2024-08-08 22:36   ` Dave Chinner
2024-08-11  8:44     ` Christoph Hellwig
2024-08-08 15:27 ` [PATCH 7/9] xfs: check XFS_EOFBLOCKS_RELEASED earlier in xfs_release_eofblocks Christoph Hellwig
2024-08-08 23:03   ` Dave Chinner
2024-08-11  8:59     ` Christoph Hellwig
2024-08-11 23:48       ` Dave Chinner
2024-08-08 15:27 ` [PATCH 8/9] xfs: simplify extent lookup in xfs_can_free_eofblocks Christoph Hellwig
2024-08-08 15:27 ` [PATCH 9/9] xfs: reclaim speculative preallocations for append only files Christoph Hellwig
  -- strict thread matches above, loose matches on Subject: below --
2024-08-13  7:39 post-EOF block handling revamp v3 Christoph Hellwig
2024-08-13  7:39 ` [PATCH 7/9] xfs: check XFS_EOFBLOCKS_RELEASED earlier in xfs_release_eofblocks Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox