* fix stale delalloc punching for COW I/O v2
From: Christoph Hellwig @ 2024-09-10  4:39 UTC
  To: Chandan Babu R, Christian Brauner, Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel

Hi all,

this is another fallout from the zoned XFS work, which stresses the XFS
COW I/O path very heavily.  It affects normal I/O to reflinked files as
well, but is very hard to hit there.

The main problem here is that we only punch out delalloc reservations
from the data fork, but COW I/O places delalloc extents into the COW
fork, which means that they won't get punched out for short writes.
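
As a preview of where the series ends up, the final punch helper (taken
verbatim from the last patch) picks the fork based on IOMAP_F_SHARED,
which the later patches arrange to be set for all COW fork allocations:

	static void
	xfs_buffered_write_delalloc_punch(
		struct inode		*inode,
		loff_t			offset,
		loff_t			length,
		struct iomap		*iomap)
	{
		xfs_bmap_punch_delalloc_range(XFS_I(inode),
				(iomap->flags & IOMAP_F_SHARED) ?
					XFS_COW_FORK : XFS_DATA_FORK,
				offset, offset + length);
	}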

Changes since v1:
 - move the already reviewed iomap prep changes to the beginning in case
   Christian wants to take them ASAP
 - take the invalidate_lock for post-EOF zeroing so that we have a
   consistent locking pattern for zeroing.

Diffstat:
 fs/iomap/buffered-io.c |  139 +++++++++++++++++++++++--------------------------
 fs/xfs/xfs_aops.c      |    4 -
 fs/xfs/xfs_bmap_util.c |   10 ++-
 fs/xfs/xfs_bmap_util.h |    2 
 fs/xfs/xfs_file.c      |  139 ++++++++++++++++++++++++++++---------------------
 fs/xfs/xfs_iomap.c     |   59 +++++++++-----------
 include/linux/iomap.h  |   11 ++-
 7 files changed, 192 insertions(+), 172 deletions(-)

* [PATCH 01/12] iomap: handle a post-direct I/O invalidate race in iomap_write_delalloc_release
From: Christoph Hellwig @ 2024-09-10  4:39 UTC
  To: Chandan Babu R, Christian Brauner, Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel

When a direct I/O completion invalidates the page cache it holds neither
the i_rwsem nor the invalidate_lock, so it can race with
iomap_write_delalloc_release.  If the search for the end of the region
that contains data returns the start offset, we hit such a race and just
need to look for the end of the newly created hole instead.
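
To illustrate the race, here is a rough userspace model of the seek loop
(the seek helpers and the block-granular cache are simplified stand-ins
for mapping_seek_hole_data and the page cache, not kernel interfaces):

	#include <stdio.h>
	#include <stdbool.h>

	#define NBLKS	8

	static bool has_data[NBLKS];	/* page cache contents, per block */
	static bool race_pending = true; /* fire one "post-DIO invalidation" */

	static long seek_data(long from)
	{
		for (; from < NBLKS; from++)
			if (has_data[from])
				return from;
		return -1;		/* models -ENXIO */
	}

	static long seek_hole(long from)
	{
		/* emulate an invalidation racing in between the two seeks */
		if (race_pending) {
			has_data[from] = false;
			race_pending = false;
		}
		for (; from < NBLKS; from++)
			if (!has_data[from])
				return from;
		return NBLKS;
	}

	int main(void)
	{
		long start = 0, data_end;

		has_data[2] = has_data[3] = has_data[5] = true;
		while (start < NBLKS) {
			start = seek_data(start);
			if (start < 0)
				break;
			data_end = seek_hole(start);
			if (data_end == start)
				continue; /* raced: rescan for data here */
			printf("data range [%ld, %ld)\n", start, data_end);
			start = data_end;
		}
		return 0;
	}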

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/iomap/buffered-io.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index f420c53d86acc5..69a931de1979b9 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1241,7 +1241,15 @@ static int iomap_write_delalloc_release(struct inode *inode,
 			error = data_end;
 			goto out_unlock;
 		}
-		WARN_ON_ONCE(data_end <= start_byte);
+
+		/*
+		 * If we race with post-direct I/O invalidation of the page cache,
+		 * there might be no data left at start_byte.
+		 */
+		if (data_end == start_byte)
+			continue;
+
+		WARN_ON_ONCE(data_end < start_byte);
 		WARN_ON_ONCE(data_end > scan_end_byte);
 
 		error = iomap_write_delalloc_scan(inode, &punch_start_byte,
-- 
2.45.2


* [PATCH 02/12] iomap: improve shared block detection in iomap_unshare_iter
From: Christoph Hellwig @ 2024-09-10  4:39 UTC
  To: Chandan Babu R, Christian Brauner, Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel

Currently iomap_unshare_iter relies on the IOMAP_F_SHARED flag to detect
blocks to unshare.  This is reasonable, but IOMAP_F_SHARED is also useful
for the file system to do internal bookkeeping for out of place writes.
XFS used to do that, until it was removed in commit 72a048c1056a
("xfs: only set IOMAP_F_SHARED when providing a srcmap to a write")
because unshare would incorrectly unshare such blocks.

Add an extra safeguard by checking the explicitly provided srcmap for
valid data instead of the fallback to the iomap, as that easily catches
the case where we'd just copy from the same place we'd write to.  This
allows reinstating IOMAP_F_SHARED for all XFS writes that go to the
COW fork.
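
For reference, the iomap_iter_srcmap helper that this deliberately
bypasses falls back to the write iomap when no separate source map was
provided, so it cannot make this distinction:

	static inline const struct iomap *
	iomap_iter_srcmap(const struct iomap_iter *i)
	{
		if (i->srcmap.type != IOMAP_HOLE)
			return &i->srcmap;
		return &i->iomap;	/* fallback: no separate source map */
	}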

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/iomap/buffered-io.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 69a931de1979b9..737a005082e035 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1337,16 +1337,25 @@ EXPORT_SYMBOL_GPL(iomap_file_buffered_write_punch_delalloc);
 static loff_t iomap_unshare_iter(struct iomap_iter *iter)
 {
 	struct iomap *iomap = &iter->iomap;
-	const struct iomap *srcmap = iomap_iter_srcmap(iter);
 	loff_t pos = iter->pos;
 	loff_t length = iomap_length(iter);
 	loff_t written = 0;
 
-	/* don't bother with blocks that are not shared to start with */
+	/* Don't bother with blocks that are not shared to start with. */
 	if (!(iomap->flags & IOMAP_F_SHARED))
 		return length;
-	/* don't bother with holes or unwritten extents */
-	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
+
+	/*
+	 * Don't bother with holes or unwritten extents.
+	 *
+	 * Note that we use srcmap directly instead of iomap_iter_srcmap as
+	 * unsharing requires providing a separate source map, and the presence
+	 * of one is a good indicator that unsharing is needed, unlike
+	 * IOMAP_F_SHARED which can be set for any data that goes into the COW
+	 * fork for XFS.
+	 */
+	if (iter->srcmap.type == IOMAP_HOLE ||
+	    iter->srcmap.type == IOMAP_UNWRITTEN)
 		return length;
 
 	do {
-- 
2.45.2


* [PATCH 03/12] iomap: pass flags to iomap_file_buffered_write_punch_delalloc
From: Christoph Hellwig @ 2024-09-10  4:39 UTC
  To: Chandan Babu R, Christian Brauner, Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel

To fix short write error handling, we'll need to figure out what operation
iomap_file_buffered_write_punch_delalloc is called for.  Pass the flags
argument on to it, and reorder the argument list to match that of
->iomap_end so that the compiler only has to add the new punch argument
to the end of it instead of reshuffling the registers.
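
For reference, the ->iomap_end prototype that the new argument order
mirrors (from struct iomap_ops):

	int (*iomap_end)(struct inode *inode, loff_t pos, loff_t length,
			ssize_t written, unsigned flags,
			struct iomap *iomap);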

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/iomap/buffered-io.c | 11 +++++------
 fs/xfs/xfs_iomap.c     |  5 +++--
 include/linux/iomap.h  | 10 ++++++----
 3 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 737a005082e035..ac4666fede4c18 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -23,7 +23,6 @@
 
 #define IOEND_BATCH_SIZE	4096
 
-typedef int (*iomap_punch_t)(struct inode *inode, loff_t offset, loff_t length);
 /*
  * Structure allocated for each folio to track per-block uptodate, dirty state
  * and I/O completions.
@@ -1199,8 +1198,8 @@ static int iomap_write_delalloc_scan(struct inode *inode,
  * require sprinkling this code with magic "+ 1" and "- 1" arithmetic and expose
  * the code to subtle off-by-one bugs....
  */
-static int iomap_write_delalloc_release(struct inode *inode,
-		loff_t start_byte, loff_t end_byte, iomap_punch_t punch)
+static int iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
+		loff_t end_byte, unsigned flags, iomap_punch_t punch)
 {
 	loff_t punch_start_byte = start_byte;
 	loff_t scan_end_byte = min(i_size_read(inode), end_byte);
@@ -1300,8 +1299,8 @@ static int iomap_write_delalloc_release(struct inode *inode,
  *         internal filesystem allocation lock
  */
 int iomap_file_buffered_write_punch_delalloc(struct inode *inode,
-		struct iomap *iomap, loff_t pos, loff_t length,
-		ssize_t written, iomap_punch_t punch)
+		loff_t pos, loff_t length, ssize_t written, unsigned flags,
+		struct iomap *iomap, iomap_punch_t punch)
 {
 	loff_t			start_byte;
 	loff_t			end_byte;
@@ -1329,7 +1328,7 @@ int iomap_file_buffered_write_punch_delalloc(struct inode *inode,
 	if (start_byte >= end_byte)
 		return 0;
 
-	return iomap_write_delalloc_release(inode, start_byte, end_byte,
+	return iomap_write_delalloc_release(inode, start_byte, end_byte, flags,
 					punch);
 }
 EXPORT_SYMBOL_GPL(iomap_file_buffered_write_punch_delalloc);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 72c981e3dc9211..47b5c83588259e 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1231,8 +1231,9 @@ xfs_buffered_write_iomap_end(
 	struct xfs_mount	*mp = XFS_M(inode->i_sb);
 	int			error;
 
-	error = iomap_file_buffered_write_punch_delalloc(inode, iomap, offset,
-			length, written, &xfs_buffered_write_delalloc_punch);
+	error = iomap_file_buffered_write_punch_delalloc(inode, offset, length,
+			written, flags, iomap,
+			&xfs_buffered_write_delalloc_punch);
 	if (error && !xfs_is_shutdown(mp)) {
 		xfs_alert(mp, "%s: unable to clean up ino 0x%llx",
 			__func__, XFS_I(inode)->i_ino);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 6fc1c858013d1e..83da37d64d1144 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -258,10 +258,6 @@ static inline const struct iomap *iomap_iter_srcmap(const struct iomap_iter *i)
 
 ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
 		const struct iomap_ops *ops);
-int iomap_file_buffered_write_punch_delalloc(struct inode *inode,
-		struct iomap *iomap, loff_t pos, loff_t length, ssize_t written,
-		int (*punch)(struct inode *inode, loff_t pos, loff_t length));
-
 int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops);
 void iomap_readahead(struct readahead_control *, const struct iomap_ops *ops);
 bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
@@ -277,6 +273,12 @@ int iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
 		const struct iomap_ops *ops);
 vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf,
 			const struct iomap_ops *ops);
+
+typedef int (*iomap_punch_t)(struct inode *inode, loff_t offset, loff_t length);
+int iomap_file_buffered_write_punch_delalloc(struct inode *inode, loff_t pos,
+		loff_t length, ssize_t written, unsigned flag,
+		struct iomap *iomap, iomap_punch_t punch);
+
 int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		u64 start, u64 len, const struct iomap_ops *ops);
 loff_t iomap_seek_hole(struct inode *inode, loff_t offset,
-- 
2.45.2


* [PATCH 04/12] iomap: pass the iomap to the punch callback
From: Christoph Hellwig @ 2024-09-10  4:39 UTC
  To: Chandan Babu R, Christian Brauner, Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel

XFS will need to look at the flags in the iomap structure, so pass it
down all the way to the callback.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/iomap/buffered-io.c | 25 +++++++++++++------------
 fs/xfs/xfs_iomap.c     |  3 ++-
 include/linux/iomap.h  |  3 ++-
 3 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index ac4666fede4c18..a0bc7c3654cc36 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1047,7 +1047,7 @@ EXPORT_SYMBOL_GPL(iomap_file_buffered_write);
 
 static int iomap_write_delalloc_ifs_punch(struct inode *inode,
 		struct folio *folio, loff_t start_byte, loff_t end_byte,
-		iomap_punch_t punch)
+		struct iomap *iomap, iomap_punch_t punch)
 {
 	unsigned int first_blk, last_blk, i;
 	loff_t last_byte;
@@ -1072,7 +1072,7 @@ static int iomap_write_delalloc_ifs_punch(struct inode *inode,
 	for (i = first_blk; i <= last_blk; i++) {
 		if (!ifs_block_is_dirty(folio, ifs, i)) {
 			ret = punch(inode, folio_pos(folio) + (i << blkbits),
-				    1 << blkbits);
+				    1 << blkbits, iomap);
 			if (ret)
 				return ret;
 		}
@@ -1084,7 +1084,7 @@ static int iomap_write_delalloc_ifs_punch(struct inode *inode,
 
 static int iomap_write_delalloc_punch(struct inode *inode, struct folio *folio,
 		loff_t *punch_start_byte, loff_t start_byte, loff_t end_byte,
-		iomap_punch_t punch)
+		struct iomap *iomap, iomap_punch_t punch)
 {
 	int ret = 0;
 
@@ -1094,14 +1094,14 @@ static int iomap_write_delalloc_punch(struct inode *inode, struct folio *folio,
 	/* if dirty, punch up to offset */
 	if (start_byte > *punch_start_byte) {
 		ret = punch(inode, *punch_start_byte,
-				start_byte - *punch_start_byte);
+				start_byte - *punch_start_byte, iomap);
 		if (ret)
 			return ret;
 	}
 
 	/* Punch non-dirty blocks within folio */
-	ret = iomap_write_delalloc_ifs_punch(inode, folio, start_byte,
-			end_byte, punch);
+	ret = iomap_write_delalloc_ifs_punch(inode, folio, start_byte, end_byte,
+			iomap, punch);
 	if (ret)
 		return ret;
 
@@ -1134,7 +1134,7 @@ static int iomap_write_delalloc_punch(struct inode *inode, struct folio *folio,
  */
 static int iomap_write_delalloc_scan(struct inode *inode,
 		loff_t *punch_start_byte, loff_t start_byte, loff_t end_byte,
-		iomap_punch_t punch)
+		struct iomap *iomap, iomap_punch_t punch)
 {
 	while (start_byte < end_byte) {
 		struct folio	*folio;
@@ -1150,7 +1150,7 @@ static int iomap_write_delalloc_scan(struct inode *inode,
 		}
 
 		ret = iomap_write_delalloc_punch(inode, folio, punch_start_byte,
-						 start_byte, end_byte, punch);
+				start_byte, end_byte, iomap, punch);
 		if (ret) {
 			folio_unlock(folio);
 			folio_put(folio);
@@ -1199,7 +1199,8 @@ static int iomap_write_delalloc_scan(struct inode *inode,
  * the code to subtle off-by-one bugs....
  */
 static int iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
-		loff_t end_byte, unsigned flags, iomap_punch_t punch)
+		loff_t end_byte, unsigned flags, struct iomap *iomap,
+		iomap_punch_t punch)
 {
 	loff_t punch_start_byte = start_byte;
 	loff_t scan_end_byte = min(i_size_read(inode), end_byte);
@@ -1252,7 +1253,7 @@ static int iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
 		WARN_ON_ONCE(data_end > scan_end_byte);
 
 		error = iomap_write_delalloc_scan(inode, &punch_start_byte,
-				start_byte, data_end, punch);
+				start_byte, data_end, iomap, punch);
 		if (error)
 			goto out_unlock;
 
@@ -1262,7 +1263,7 @@ static int iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
 
 	if (punch_start_byte < end_byte)
 		error = punch(inode, punch_start_byte,
-				end_byte - punch_start_byte);
+				end_byte - punch_start_byte, iomap);
 out_unlock:
 	filemap_invalidate_unlock(inode->i_mapping);
 	return error;
@@ -1329,7 +1330,7 @@ int iomap_file_buffered_write_punch_delalloc(struct inode *inode,
 		return 0;
 
 	return iomap_write_delalloc_release(inode, start_byte, end_byte, flags,
-					punch);
+					iomap, punch);
 }
 EXPORT_SYMBOL_GPL(iomap_file_buffered_write_punch_delalloc);
 
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 47b5c83588259e..695e5bee776f94 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1212,7 +1212,8 @@ static int
 xfs_buffered_write_delalloc_punch(
 	struct inode		*inode,
 	loff_t			offset,
-	loff_t			length)
+	loff_t			length,
+	struct iomap		*iomap)
 {
 	xfs_bmap_punch_delalloc_range(XFS_I(inode), offset, offset + length);
 	return 0;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 83da37d64d1144..a931190f6d858b 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -274,7 +274,8 @@ int iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
 vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf,
 			const struct iomap_ops *ops);
 
-typedef int (*iomap_punch_t)(struct inode *inode, loff_t offset, loff_t length);
+typedef int (*iomap_punch_t)(struct inode *inode, loff_t offset, loff_t length,
+		struct iomap *iomap);
 int iomap_file_buffered_write_punch_delalloc(struct inode *inode, loff_t pos,
 		loff_t length, ssize_t written, unsigned flag,
 		struct iomap *iomap, iomap_punch_t punch);
-- 
2.45.2


* [PATCH 05/12] iomap: remove the iomap_file_buffered_write_punch_delalloc return value
From: Christoph Hellwig @ 2024-09-10  4:39 UTC
  To: Chandan Babu R, Christian Brauner, Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel

iomap_file_buffered_write_punch_delalloc can only return errors if either
the ->punch callback returned an error, or if someone changed the API of
mapping_seek_hole_data to return a negative error code other than
-ENXIO.

As the only instance of ->punch never returns an error, and such an error
would be fatal anyway, remove the entire error propagation and don't
return an error code from iomap_file_buffered_write_punch_delalloc.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/iomap/buffered-io.c | 82 +++++++++++++++---------------------------
 fs/xfs/xfs_iomap.c     | 17 ++-------
 include/linux/iomap.h  |  4 +--
 3 files changed, 33 insertions(+), 70 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index a0bc7c3654cc36..52f285ae4bddcb 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1045,7 +1045,7 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
 }
 EXPORT_SYMBOL_GPL(iomap_file_buffered_write);
 
-static int iomap_write_delalloc_ifs_punch(struct inode *inode,
+static void iomap_write_delalloc_ifs_punch(struct inode *inode,
 		struct folio *folio, loff_t start_byte, loff_t end_byte,
 		struct iomap *iomap, iomap_punch_t punch)
 {
@@ -1053,7 +1053,6 @@ static int iomap_write_delalloc_ifs_punch(struct inode *inode,
 	loff_t last_byte;
 	u8 blkbits = inode->i_blkbits;
 	struct iomap_folio_state *ifs;
-	int ret = 0;
 
 	/*
 	 * When we have per-block dirty tracking, there can be
@@ -1063,47 +1062,35 @@ static int iomap_write_delalloc_ifs_punch(struct inode *inode,
 	 */
 	ifs = folio->private;
 	if (!ifs)
-		return ret;
+		return;
 
 	last_byte = min_t(loff_t, end_byte - 1,
 			folio_pos(folio) + folio_size(folio) - 1);
 	first_blk = offset_in_folio(folio, start_byte) >> blkbits;
 	last_blk = offset_in_folio(folio, last_byte) >> blkbits;
 	for (i = first_blk; i <= last_blk; i++) {
-		if (!ifs_block_is_dirty(folio, ifs, i)) {
-			ret = punch(inode, folio_pos(folio) + (i << blkbits),
+		if (!ifs_block_is_dirty(folio, ifs, i))
+			punch(inode, folio_pos(folio) + (i << blkbits),
 				    1 << blkbits, iomap);
-			if (ret)
-				return ret;
-		}
 	}
-
-	return ret;
 }
 
-
-static int iomap_write_delalloc_punch(struct inode *inode, struct folio *folio,
+static void iomap_write_delalloc_punch(struct inode *inode, struct folio *folio,
 		loff_t *punch_start_byte, loff_t start_byte, loff_t end_byte,
 		struct iomap *iomap, iomap_punch_t punch)
 {
-	int ret = 0;
-
 	if (!folio_test_dirty(folio))
-		return ret;
+		return;
 
 	/* if dirty, punch up to offset */
 	if (start_byte > *punch_start_byte) {
-		ret = punch(inode, *punch_start_byte,
-				start_byte - *punch_start_byte, iomap);
-		if (ret)
-			return ret;
+		punch(inode, *punch_start_byte, start_byte - *punch_start_byte,
+				iomap);
 	}
 
 	/* Punch non-dirty blocks within folio */
-	ret = iomap_write_delalloc_ifs_punch(inode, folio, start_byte, end_byte,
+	iomap_write_delalloc_ifs_punch(inode, folio, start_byte, end_byte,
 			iomap, punch);
-	if (ret)
-		return ret;
 
 	/*
 	 * Make sure the next punch start is correctly bound to
@@ -1111,8 +1098,6 @@ static int iomap_write_delalloc_punch(struct inode *inode, struct folio *folio,
 	 */
 	*punch_start_byte = min_t(loff_t, end_byte,
 				folio_pos(folio) + folio_size(folio));
-
-	return ret;
 }
 
 /*
@@ -1132,13 +1117,12 @@ static int iomap_write_delalloc_punch(struct inode *inode, struct folio *folio,
  * This function uses [start_byte, end_byte) intervals (i.e. open ended) to
  * simplify range iterations.
  */
-static int iomap_write_delalloc_scan(struct inode *inode,
+static void iomap_write_delalloc_scan(struct inode *inode,
 		loff_t *punch_start_byte, loff_t start_byte, loff_t end_byte,
 		struct iomap *iomap, iomap_punch_t punch)
 {
 	while (start_byte < end_byte) {
 		struct folio	*folio;
-		int ret;
 
 		/* grab locked page */
 		folio = filemap_lock_folio(inode->i_mapping,
@@ -1149,20 +1133,14 @@ static int iomap_write_delalloc_scan(struct inode *inode,
 			continue;
 		}
 
-		ret = iomap_write_delalloc_punch(inode, folio, punch_start_byte,
+		iomap_write_delalloc_punch(inode, folio, punch_start_byte,
 				start_byte, end_byte, iomap, punch);
-		if (ret) {
-			folio_unlock(folio);
-			folio_put(folio);
-			return ret;
-		}
 
 		/* move offset to start of next folio in range */
 		start_byte = folio_next_index(folio) << PAGE_SHIFT;
 		folio_unlock(folio);
 		folio_put(folio);
 	}
-	return 0;
 }
 
 /*
@@ -1198,13 +1176,12 @@ static int iomap_write_delalloc_scan(struct inode *inode,
  * require sprinkling this code with magic "+ 1" and "- 1" arithmetic and expose
  * the code to subtle off-by-one bugs....
  */
-static int iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
+static void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
 		loff_t end_byte, unsigned flags, struct iomap *iomap,
 		iomap_punch_t punch)
 {
 	loff_t punch_start_byte = start_byte;
 	loff_t scan_end_byte = min(i_size_read(inode), end_byte);
-	int error = 0;
 
 	/*
 	 * Lock the mapping to avoid races with page faults re-instantiating
@@ -1221,13 +1198,15 @@ static int iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
 		/*
 		 * If there is no more data to scan, all that is left is to
 		 * punch out the remaining range.
+		 *
+		 * Note that mapping_seek_hole_data is only supposed to return
+		 * either an offset or -ENXIO, so WARN on any other error as
+		 * that would be an API change without updating the callers.
 		 */
 		if (start_byte == -ENXIO || start_byte == scan_end_byte)
 			break;
-		if (start_byte < 0) {
-			error = start_byte;
+		if (WARN_ON_ONCE(start_byte < 0))
 			goto out_unlock;
-		}
 		WARN_ON_ONCE(start_byte < punch_start_byte);
 		WARN_ON_ONCE(start_byte > scan_end_byte);
 
@@ -1237,10 +1216,8 @@ static int iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
 		 */
 		data_end = mapping_seek_hole_data(inode->i_mapping, start_byte,
 				scan_end_byte, SEEK_HOLE);
-		if (data_end < 0) {
-			error = data_end;
+		if (WARN_ON_ONCE(data_end < 0))
 			goto out_unlock;
-		}
 
 		/*
 		 * If we race with post-direct I/O invalidation of the page cache,
@@ -1252,21 +1229,18 @@ static int iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
 		WARN_ON_ONCE(data_end < start_byte);
 		WARN_ON_ONCE(data_end > scan_end_byte);
 
-		error = iomap_write_delalloc_scan(inode, &punch_start_byte,
-				start_byte, data_end, iomap, punch);
-		if (error)
-			goto out_unlock;
+		iomap_write_delalloc_scan(inode, &punch_start_byte, start_byte,
+				data_end, iomap, punch);
 
 		/* The next data search starts at the end of this one. */
 		start_byte = data_end;
 	}
 
 	if (punch_start_byte < end_byte)
-		error = punch(inode, punch_start_byte,
-				end_byte - punch_start_byte, iomap);
+		punch(inode, punch_start_byte, end_byte - punch_start_byte,
+				iomap);
 out_unlock:
 	filemap_invalidate_unlock(inode->i_mapping);
-	return error;
 }
 
 /*
@@ -1299,7 +1273,7 @@ static int iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
  *       ->punch
  *         internal filesystem allocation lock
  */
-int iomap_file_buffered_write_punch_delalloc(struct inode *inode,
+void iomap_file_buffered_write_punch_delalloc(struct inode *inode,
 		loff_t pos, loff_t length, ssize_t written, unsigned flags,
 		struct iomap *iomap, iomap_punch_t punch)
 {
@@ -1308,11 +1282,11 @@ int iomap_file_buffered_write_punch_delalloc(struct inode *inode,
 	unsigned int		blocksize = i_blocksize(inode);
 
 	if (iomap->type != IOMAP_DELALLOC)
-		return 0;
+		return;
 
 	/* If we didn't reserve the blocks, we're not allowed to punch them. */
 	if (!(iomap->flags & IOMAP_F_NEW))
-		return 0;
+		return;
 
 	/*
 	 * start_byte refers to the first unused block after a short write. If
@@ -1327,10 +1301,10 @@ int iomap_file_buffered_write_punch_delalloc(struct inode *inode,
 
 	/* Nothing to do if we've written the entire delalloc extent */
 	if (start_byte >= end_byte)
-		return 0;
+		return;
 
-	return iomap_write_delalloc_release(inode, start_byte, end_byte, flags,
-					iomap, punch);
+	iomap_write_delalloc_release(inode, start_byte, end_byte, flags, iomap,
+			punch);
 }
 EXPORT_SYMBOL_GPL(iomap_file_buffered_write_punch_delalloc);
 
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 695e5bee776f94..1e11f48814c0d0 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1208,7 +1208,7 @@ xfs_buffered_write_iomap_begin(
 	return error;
 }
 
-static int
+static void
 xfs_buffered_write_delalloc_punch(
 	struct inode		*inode,
 	loff_t			offset,
@@ -1216,7 +1216,6 @@ xfs_buffered_write_delalloc_punch(
 	struct iomap		*iomap)
 {
 	xfs_bmap_punch_delalloc_range(XFS_I(inode), offset, offset + length);
-	return 0;
 }
 
 static int
@@ -1228,18 +1227,8 @@ xfs_buffered_write_iomap_end(
 	unsigned		flags,
 	struct iomap		*iomap)
 {
-
-	struct xfs_mount	*mp = XFS_M(inode->i_sb);
-	int			error;
-
-	error = iomap_file_buffered_write_punch_delalloc(inode, offset, length,
-			written, flags, iomap,
-			&xfs_buffered_write_delalloc_punch);
-	if (error && !xfs_is_shutdown(mp)) {
-		xfs_alert(mp, "%s: unable to clean up ino 0x%llx",
-			__func__, XFS_I(inode)->i_ino);
-		return error;
-	}
+	iomap_file_buffered_write_punch_delalloc(inode, offset, length, written,
+			flags, iomap, &xfs_buffered_write_delalloc_punch);
 	return 0;
 }
 
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index a931190f6d858b..78a48af4d2c0a1 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -274,9 +274,9 @@ int iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
 vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf,
 			const struct iomap_ops *ops);
 
-typedef int (*iomap_punch_t)(struct inode *inode, loff_t offset, loff_t length,
+typedef void (*iomap_punch_t)(struct inode *inode, loff_t offset, loff_t length,
 		struct iomap *iomap);
-int iomap_file_buffered_write_punch_delalloc(struct inode *inode, loff_t pos,
+void iomap_file_buffered_write_punch_delalloc(struct inode *inode, loff_t pos,
 		loff_t length, ssize_t written, unsigned flag,
 		struct iomap *iomap, iomap_punch_t punch);
 
-- 
2.45.2


* [PATCH 06/12] xfs: factor out a xfs_file_write_zero_eof helper
From: Christoph Hellwig @ 2024-09-10  4:39 UTC
  To: Chandan Babu R, Christian Brauner, Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel

Split a helper out of xfs_file_write_checks that deals with just the
post-EOF zeroing, to keep the code readable.
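
The retry protocol between the helper and its caller is easiest to see
in isolation; a minimal userspace model of the restart loop, with all
the XFS locking and zeroing replaced by illustrative stand-ins:

	#include <stdio.h>

	static int iolock_excl;		/* stand-in for the iolock state */
	static int drained_dio;
	static long isize = 100, ki_pos = 150;	/* write starts past EOF */

	/* models xfs_file_write_zero_eof():
	 * 0 = done, 1 = caller must redo all checks */
	static int write_zero_eof(void)
	{
		if (ki_pos <= isize)
			return 0;	/* no gap between EOF and the write */

		if (!drained_dio) {
			if (!iolock_excl) {
				/* upgrade invalidates all checks so far */
				iolock_excl = 1;
				printf("upgraded iolock to exclusive\n");
			}
			drained_dio = 1; /* inode_dio_wait() stand-in */
			return 1;	/* tell the caller to restart */
		}

		printf("zeroing [%ld, %ld)\n", isize, ki_pos);
		return 0;
	}

	int main(void)
	{
		int ret;

		do {			/* the "goto restart" loop */
			ret = write_zero_eof();
		} while (ret == 1);
		return ret;
	}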

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_file.c | 133 ++++++++++++++++++++++++++--------------------
 1 file changed, 75 insertions(+), 58 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 0d258c21b9897f..a30fda1985e6af 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -347,10 +347,70 @@ xfs_file_splice_read(
 	return ret;
 }
 
+static ssize_t
+xfs_file_write_zero_eof(
+	struct kiocb		*iocb,
+	struct iov_iter		*from,
+	unsigned int		*iolock,
+	size_t			count,
+	bool			*drained_dio)
+{
+	struct xfs_inode	*ip = XFS_I(iocb->ki_filp->f_mapping->host);
+	loff_t			isize;
+
+	/*
+	 * We need to serialise against EOF updates that occur in IO completions
+	 * here. We want to make sure that nobody is changing the size while
+	 * we do this check until we have placed an IO barrier (i.e. hold
+	 * XFS_IOLOCK_EXCL) that prevents new IO from being dispatched.  The
+	 * spinlock effectively forms a memory barrier once we have
+	 * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value and
+	 * hence be able to correctly determine if we need to run zeroing.
+	 */
+	spin_lock(&ip->i_flags_lock);
+	isize = i_size_read(VFS_I(ip));
+	if (iocb->ki_pos <= isize) {
+		spin_unlock(&ip->i_flags_lock);
+		return 0;
+	}
+	spin_unlock(&ip->i_flags_lock);
+
+	if (iocb->ki_flags & IOCB_NOWAIT)
+		return -EAGAIN;
+
+	if (!*drained_dio) {
+		/*
+		 * If zeroing is needed and we are currently holding the iolock
+		 * shared, we need to update it to exclusive which implies
+		 * having to redo all checks before.
+		 */
+		if (*iolock == XFS_IOLOCK_SHARED) {
+			xfs_iunlock(ip, *iolock);
+			*iolock = XFS_IOLOCK_EXCL;
+			xfs_ilock(ip, *iolock);
+			iov_iter_reexpand(from, count);
+		}
+
+		/*
+		 * We now have an IO submission barrier in place, but AIO can do
+		 * EOF updates during IO completion and hence we now need to
+		 * wait for all of them to drain.  Non-AIO DIO will have drained
+		 * before we are given the XFS_IOLOCK_EXCL, and so for most
+		 * cases this wait is a no-op.
+		 */
+		inode_dio_wait(VFS_I(ip));
+		*drained_dio = true;
+		return 1;
+	}
+
+	trace_xfs_zero_eof(ip, isize, iocb->ki_pos - isize);
+	return xfs_zero_range(ip, isize, iocb->ki_pos - isize, NULL);
+}
+
 /*
  * Common pre-write limit and setup checks.
  *
- * Called with the iolocked held either shared and exclusive according to
+ * Called with the iolock held either shared and exclusive according to
  * @iolock, and returns with it held.  Might upgrade the iolock to exclusive
  * if called for a direct write beyond i_size.
  */
@@ -360,13 +420,10 @@ xfs_file_write_checks(
 	struct iov_iter		*from,
 	unsigned int		*iolock)
 {
-	struct file		*file = iocb->ki_filp;
-	struct inode		*inode = file->f_mapping->host;
-	struct xfs_inode	*ip = XFS_I(inode);
-	ssize_t			error = 0;
+	struct inode		*inode = iocb->ki_filp->f_mapping->host;
 	size_t			count = iov_iter_count(from);
 	bool			drained_dio = false;
-	loff_t			isize;
+	ssize_t			error;
 
 restart:
 	error = generic_write_checks(iocb, from);
@@ -389,7 +446,7 @@ xfs_file_write_checks(
 	 * exclusively.
 	 */
 	if (*iolock == XFS_IOLOCK_SHARED && !IS_NOSEC(inode)) {
-		xfs_iunlock(ip, *iolock);
+		xfs_iunlock(XFS_I(inode), *iolock);
 		*iolock = XFS_IOLOCK_EXCL;
 		error = xfs_ilock_iocb(iocb, *iolock);
 		if (error) {
@@ -400,64 +457,24 @@ xfs_file_write_checks(
 	}
 
 	/*
-	 * If the offset is beyond the size of the file, we need to zero any
+	 * If the offset is beyond the size of the file, we need to zero all
 	 * blocks that fall between the existing EOF and the start of this
-	 * write.  If zeroing is needed and we are currently holding the iolock
-	 * shared, we need to update it to exclusive which implies having to
-	 * redo all checks before.
-	 *
-	 * We need to serialise against EOF updates that occur in IO completions
-	 * here. We want to make sure that nobody is changing the size while we
-	 * do this check until we have placed an IO barrier (i.e.  hold the
-	 * XFS_IOLOCK_EXCL) that prevents new IO from being dispatched.  The
-	 * spinlock effectively forms a memory barrier once we have the
-	 * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value and
-	 * hence be able to correctly determine if we need to run zeroing.
+	 * write.
 	 *
-	 * We can do an unlocked check here safely as IO completion can only
-	 * extend EOF. Truncate is locked out at this point, so the EOF can
-	 * not move backwards, only forwards. Hence we only need to take the
-	 * slow path and spin locks when we are at or beyond the current EOF.
+	 * We can do an unlocked check for i_size here safely as I/O completion
+	 * can only extend EOF.  Truncate is locked out at this point, so the
+	 * EOF can not move backwards, only forwards. Hence we only need to take
+	 * the slow path when we are at or beyond the current EOF.
 	 */
-	if (iocb->ki_pos <= i_size_read(inode))
-		goto out;
-
-	spin_lock(&ip->i_flags_lock);
-	isize = i_size_read(inode);
-	if (iocb->ki_pos > isize) {
-		spin_unlock(&ip->i_flags_lock);
-
-		if (iocb->ki_flags & IOCB_NOWAIT)
-			return -EAGAIN;
-
-		if (!drained_dio) {
-			if (*iolock == XFS_IOLOCK_SHARED) {
-				xfs_iunlock(ip, *iolock);
-				*iolock = XFS_IOLOCK_EXCL;
-				xfs_ilock(ip, *iolock);
-				iov_iter_reexpand(from, count);
-			}
-			/*
-			 * We now have an IO submission barrier in place, but
-			 * AIO can do EOF updates during IO completion and hence
-			 * we now need to wait for all of them to drain. Non-AIO
-			 * DIO will have drained before we are given the
-			 * XFS_IOLOCK_EXCL, and so for most cases this wait is a
-			 * no-op.
-			 */
-			inode_dio_wait(inode);
-			drained_dio = true;
+	if (iocb->ki_pos > i_size_read(inode)) {
+		error = xfs_file_write_zero_eof(iocb, from, iolock, count,
+				&drained_dio);
+		if (error == 1)
 			goto restart;
-		}
-
-		trace_xfs_zero_eof(ip, isize, iocb->ki_pos - isize);
-		error = xfs_zero_range(ip, isize, iocb->ki_pos - isize, NULL);
 		if (error)
 			return error;
-	} else
-		spin_unlock(&ip->i_flags_lock);
+	}
 
-out:
 	return kiocb_modified(iocb);
 }
 
-- 
2.45.2


* [PATCH 07/12] xfs: take XFS_MMAPLOCK_EXCL in xfs_file_write_zero_eof
From: Christoph Hellwig @ 2024-09-10  4:39 UTC
  To: Chandan Babu R, Christian Brauner, Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel

xfs_file_write_zero_eof is the only caller of xfs_zero_range that does
not take XFS_MMAPLOCK_EXCL (aka the invalidate lock).  Currently that
is actually the right thing, as an error in the iomap zeroing code will
also take the invalidate_lock to clean up, but to fix that deadlock we
need a consistent locking pattern first.

The only extra thing that XFS_MMAPLOCK_EXCL will lock out is read
pagefaults, which isn't really needed here, but also isn't actively
harmful.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_file.c  | 8 +++++++-
 fs/xfs/xfs_iomap.c | 2 ++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index a30fda1985e6af..37dc26f51ace65 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -357,6 +357,7 @@ xfs_file_write_zero_eof(
 {
 	struct xfs_inode	*ip = XFS_I(iocb->ki_filp->f_mapping->host);
 	loff_t			isize;
+	int			error;
 
 	/*
 	 * We need to serialise against EOF updates that occur in IO completions
@@ -404,7 +405,12 @@ xfs_file_write_zero_eof(
 	}
 
 	trace_xfs_zero_eof(ip, isize, iocb->ki_pos - isize);
-	return xfs_zero_range(ip, isize, iocb->ki_pos - isize, NULL);
+
+	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
+	error = xfs_zero_range(ip, isize, iocb->ki_pos - isize, NULL);
+	xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
+
+	return error;
 }
 
 /*
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 1e11f48814c0d0..3c98d82c0ad0dc 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1435,6 +1435,8 @@ xfs_zero_range(
 {
 	struct inode		*inode = VFS_I(ip);
 
+	xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);
+
 	if (IS_DAX(inode))
 		return dax_zero_range(inode, pos, len, did_zero,
 				      &xfs_dax_write_iomap_ops);
-- 
2.45.2


* [PATCH 08/12] iomap: zeroing already holds invalidate_lock
From: Christoph Hellwig @ 2024-09-10  4:39 UTC
  To: Chandan Babu R, Christian Brauner, Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel

All callers of iomap_zero_range already hold the invalidate_lock, so we
can't take it again in iomap_file_buffered_write_punch_delalloc.

Use the passed-in flags argument to detect if we're called from a zeroing
operation, and don't take the lock again in that case.
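
The caller-side pattern this relies on was established by the previous
patch, where XFS_MMAPLOCK_EXCL maps to the same invalidate_lock:

	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
	error = xfs_zero_range(ip, isize, iocb->ki_pos - isize, NULL);
	xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);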

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap/buffered-io.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 52f285ae4bddcb..3d7e69a542518a 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1188,8 +1188,13 @@ static void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
 	 * folios and dirtying them via ->page_mkwrite whilst we walk the
 	 * cache and perform delalloc extent removal. Failing to do this can
 	 * leave dirty pages with no space reservation in the cache.
+	 *
+	 * For zeroing operations the callers already hold invalidate_lock.
 	 */
-	filemap_invalidate_lock(inode->i_mapping);
+	if (flags & IOMAP_ZERO)
+		rwsem_assert_held_write(&inode->i_mapping->invalidate_lock);
+	else
+		filemap_invalidate_lock(inode->i_mapping);
 	while (start_byte < scan_end_byte) {
 		loff_t		data_end;
 
@@ -1240,7 +1245,8 @@ static void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
 		punch(inode, punch_start_byte, end_byte - punch_start_byte,
 				iomap);
 out_unlock:
-	filemap_invalidate_unlock(inode->i_mapping);
+	if (!(flags & IOMAP_ZERO))
+		filemap_invalidate_unlock(inode->i_mapping);
 }
 
 /*
-- 
2.45.2


* [PATCH 09/12] xfs: support the COW fork in xfs_bmap_punch_delalloc_range
From: Christoph Hellwig @ 2024-09-10  4:39 UTC
  To: Chandan Babu R, Christian Brauner, Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel

xfs_buffered_write_iomap_begin can also create delalloc reservations
that need cleaning up.  Prepare for that by adding support for the COW
fork to xfs_bmap_punch_delalloc_range.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_aops.c      |  4 ++--
 fs/xfs/xfs_bmap_util.c | 10 +++++++---
 fs/xfs/xfs_bmap_util.h |  2 +-
 fs/xfs/xfs_iomap.c     |  3 ++-
 4 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 6dead20338e24c..559a3a57709748 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -116,7 +116,7 @@ xfs_end_ioend(
 	if (unlikely(error)) {
 		if (ioend->io_flags & IOMAP_F_SHARED) {
 			xfs_reflink_cancel_cow_range(ip, offset, size, true);
-			xfs_bmap_punch_delalloc_range(ip, offset,
+			xfs_bmap_punch_delalloc_range(ip, XFS_DATA_FORK, offset,
 					offset + size);
 		}
 		goto done;
@@ -456,7 +456,7 @@ xfs_discard_folio(
 	 * byte of the next folio. Hence the end offset is only dependent on the
 	 * folio itself and not the start offset that is passed in.
 	 */
-	xfs_bmap_punch_delalloc_range(ip, pos,
+	xfs_bmap_punch_delalloc_range(ip, XFS_DATA_FORK, pos,
 				folio_pos(folio) + folio_size(folio));
 }
 
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 087867be0cf92d..c5165aed0e5102 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -442,11 +442,12 @@ xfs_getbmap(
 void
 xfs_bmap_punch_delalloc_range(
 	struct xfs_inode	*ip,
+	int			whichfork,
 	xfs_off_t		start_byte,
 	xfs_off_t		end_byte)
 {
 	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_ifork	*ifp = &ip->i_df;
+	struct xfs_ifork	*ifp = xfs_ifork_ptr(ip, whichfork);
 	xfs_fileoff_t		start_fsb = XFS_B_TO_FSBT(mp, start_byte);
 	xfs_fileoff_t		end_fsb = XFS_B_TO_FSB(mp, end_byte);
 	struct xfs_bmbt_irec	got, del;
@@ -474,11 +475,14 @@ xfs_bmap_punch_delalloc_range(
 			continue;
 		}
 
-		xfs_bmap_del_extent_delay(ip, XFS_DATA_FORK, &icur, &got, &del);
+		xfs_bmap_del_extent_delay(ip, whichfork, &icur, &got, &del);
 		if (!xfs_iext_get_extent(ifp, &icur, &got))
 			break;
 	}
 
+	if (whichfork == XFS_COW_FORK && !ifp->if_bytes)
+		xfs_inode_clear_cowblocks_tag(ip);
+
 out_unlock:
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 }
@@ -580,7 +584,7 @@ xfs_free_eofblocks(
 	 */
 	if (ip->i_diflags & (XFS_DIFLAG_PREALLOC | XFS_DIFLAG_APPEND)) {
 		if (ip->i_delayed_blks) {
-			xfs_bmap_punch_delalloc_range(ip,
+			xfs_bmap_punch_delalloc_range(ip, XFS_DATA_FORK,
 				round_up(XFS_ISIZE(ip), mp->m_sb.sb_blocksize),
 				LLONG_MAX);
 		}
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index eb0895bfb9dae4..b29760d36e1ab1 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -30,7 +30,7 @@ xfs_bmap_rtalloc(struct xfs_bmalloca *ap)
 }
 #endif /* CONFIG_XFS_RT */
 
-void	xfs_bmap_punch_delalloc_range(struct xfs_inode *ip,
+void	xfs_bmap_punch_delalloc_range(struct xfs_inode *ip, int whichfork,
 		xfs_off_t start_byte, xfs_off_t end_byte);
 
 struct kgetbmap {
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 3c98d82c0ad0dc..1d068cac69b4cc 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1215,7 +1215,8 @@ xfs_buffered_write_delalloc_punch(
 	loff_t			length,
 	struct iomap		*iomap)
 {
-	xfs_bmap_punch_delalloc_range(XFS_I(inode), offset, offset + length);
+	xfs_bmap_punch_delalloc_range(XFS_I(inode), XFS_DATA_FORK, offset,
+			offset + length);
 }
 
 static int
-- 
2.45.2


* [PATCH 10/12] xfs: share more code in xfs_buffered_write_iomap_begin
From: Christoph Hellwig @ 2024-09-10  4:39 UTC
  To: Chandan Babu R, Christian Brauner, Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel

Introduce a local iomap_flags variable so that the code allocating new
delalloc blocks in the data fork can fall through to the found_imap
label and reuse the code to unlock and fill the iomap.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_iomap.c | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 1d068cac69b4cc..12510380a495f8 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -975,6 +975,7 @@ xfs_buffered_write_iomap_begin(
 	int			allocfork = XFS_DATA_FORK;
 	int			error = 0;
 	unsigned int		lockmode = XFS_ILOCK_EXCL;
+	unsigned int		iomap_flags = 0;
 	u64			seq;
 
 	if (xfs_is_shutdown(mp))
@@ -1145,6 +1146,11 @@ xfs_buffered_write_iomap_begin(
 		}
 	}
 
+	/*
+	 * Flag newly allocated delalloc blocks with IOMAP_F_NEW so we punch
+	 * them out if the write happens to fail.
+	 */
+	iomap_flags |= IOMAP_F_NEW;
 	if (allocfork == XFS_COW_FORK) {
 		error = xfs_bmapi_reserve_delalloc(ip, allocfork, offset_fsb,
 				end_fsb - offset_fsb, prealloc_blocks, &cmap,
@@ -1162,19 +1168,11 @@ xfs_buffered_write_iomap_begin(
 	if (error)
 		goto out_unlock;
 
-	/*
-	 * Flag newly allocated delalloc blocks with IOMAP_F_NEW so we punch
-	 * them out if the write happens to fail.
-	 */
-	seq = xfs_iomap_inode_sequence(ip, IOMAP_F_NEW);
-	xfs_iunlock(ip, lockmode);
 	trace_xfs_iomap_alloc(ip, offset, count, allocfork, &imap);
-	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, IOMAP_F_NEW, seq);
-
 found_imap:
-	seq = xfs_iomap_inode_sequence(ip, 0);
+	seq = xfs_iomap_inode_sequence(ip, iomap_flags);
 	xfs_iunlock(ip, lockmode);
-	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq);
+	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, iomap_flags, seq);
 
 convert_delay:
 	xfs_iunlock(ip, lockmode);
-- 
2.45.2


* [PATCH 11/12] xfs: set IOMAP_F_SHARED for all COW fork allocations
From: Christoph Hellwig @ 2024-09-10  4:39 UTC
  To: Chandan Babu R, Christian Brauner, Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel

Change xfs_buffered_write_iomap_begin to always set IOMAP_F_SHARED for
COW fork allocations, even if they don't overlap existing data fork
extents.  This will allow the iomap_end callback to detect if it has to
punch stale delalloc blocks from the COW fork instead of the data fork.
It also means we sample the sequence counter for both the data and the
COW fork when writing to the COW fork, which ensures we properly
revalidate when only COW fork changes happen.

This is essentially a revert of commit 72a048c1056a ("xfs: only set
IOMAP_F_SHARED when providing a srcmap to a write"). This is fine because
the problem that the commit fixed has now been dealt with in iomap by
only looking at the actual srcmap and not the fallback to the write
iomap.

Note that the direct I/O path was never changed and has always set
IOMAP_F_SHARED for all COW fork allocations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_iomap.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 12510380a495f8..844368b592f94f 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1186,20 +1186,19 @@ xfs_buffered_write_iomap_begin(
 	return 0;
 
 found_cow:
-	seq = xfs_iomap_inode_sequence(ip, 0);
 	if (imap.br_startoff <= offset_fsb) {
-		error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0, seq);
+		error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0,
+				xfs_iomap_inode_sequence(ip, 0));
 		if (error)
 			goto out_unlock;
-		seq = xfs_iomap_inode_sequence(ip, IOMAP_F_SHARED);
-		xfs_iunlock(ip, lockmode);
-		return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags,
-					 IOMAP_F_SHARED, seq);
+	} else {
+		xfs_trim_extent(&cmap, offset_fsb, imap.br_startoff - offset_fsb);
 	}
 
-	xfs_trim_extent(&cmap, offset_fsb, imap.br_startoff - offset_fsb);
+	iomap_flags = IOMAP_F_SHARED;
+	seq = xfs_iomap_inode_sequence(ip, iomap_flags);
 	xfs_iunlock(ip, lockmode);
-	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, 0, seq);
+	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, iomap_flags, seq);
 
 out_unlock:
 	xfs_iunlock(ip, lockmode);
-- 
2.45.2


* [PATCH 12/12] xfs: punch delalloc extents from the COW fork for COW writes
From: Christoph Hellwig @ 2024-09-10  4:39 UTC
  To: Chandan Babu R, Christian Brauner, Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel

When ->iomap_end is called on a short write to the COW fork it needs to
punch stale delalloc data from the COW fork and not the data fork.

Ensure that IOMAP_F_NEW is set for new COW fork allocations in
xfs_buffered_write_iomap_begin, and then use the IOMAP_F_SHARED flag
in xfs_buffered_write_delalloc_punch to decide which fork to punch.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_iomap.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 844368b592f94f..5b071f36874318 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1195,7 +1195,7 @@ xfs_buffered_write_iomap_begin(
 		xfs_trim_extent(&cmap, offset_fsb, imap.br_startoff - offset_fsb);
 	}
 
-	iomap_flags = IOMAP_F_SHARED;
+	iomap_flags |= IOMAP_F_SHARED;
 	seq = xfs_iomap_inode_sequence(ip, iomap_flags);
 	xfs_iunlock(ip, lockmode);
 	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, iomap_flags, seq);
@@ -1212,8 +1212,10 @@ xfs_buffered_write_delalloc_punch(
 	loff_t			length,
 	struct iomap		*iomap)
 {
-	xfs_bmap_punch_delalloc_range(XFS_I(inode), XFS_DATA_FORK, offset,
-			offset + length);
+	xfs_bmap_punch_delalloc_range(XFS_I(inode),
+			(iomap->flags & IOMAP_F_SHARED) ?
+				XFS_COW_FORK : XFS_DATA_FORK,
+			offset, offset + length);
 }
 
 static int
-- 
2.45.2


* Re: fix stale delalloc punching for COW I/O v2
From: Christian Brauner @ 2024-09-10  9:21 UTC
  To: Christoph Hellwig
  Cc: Chandan Babu R, Darrick J. Wong, linux-xfs, linux-fsdevel

>  - move the already reviewed iomap prep changes to the beginning in case
>    Christian wants to take them ASAP

I picked up patches 1-5.  It was a bit hairy as I didn't have your merge
base, and I went and looked at https://git.infradead.org/?p=users/hch/xfs.git;a=summary
but couldn't find the branch this was created from there.  It'd be nice
if you could just point me to where to look for any prerequisites next
time, or just base it on some vfs.git branch.  Thanks!

* Re: (subset) fix stale delalloc punching for COW I/O v2
From: Christian Brauner @ 2024-09-10  9:22 UTC
  To: Christoph Hellwig
  Cc: Christian Brauner, linux-xfs, linux-fsdevel, Chandan Babu R,
	Darrick J. Wong

On Tue, 10 Sep 2024 07:39:02 +0300, Christoph Hellwig wrote:
> this is another fallout from the zoned XFS work, which stresses the XFS
> COW I/O path very heavily.  It affects normal I/O to reflinked files as
> well, but is very hard to hit there.
> 
> The main problem here is that we only punch out delalloc reservations
> from the data fork, but COW I/O places delalloc extents into the COW
> fork, which means that they won't get punched out for short writes.
> 
> [...]

Christoph, could you double-check that things look good to you? I had to
resolve a minor merge conflict.

---

Applied to the vfs.blocksize branch of the vfs/vfs.git tree.
Patches in the vfs.blocksize branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.blocksize

[01/12] iomap: handle a post-direct I/O invalidate race in iomap_write_delalloc_release
        https://git.kernel.org/vfs/vfs/c/7a9d43eace88
[02/12] iomap: improve shared block detection in iomap_unshare_iter
        https://git.kernel.org/vfs/vfs/c/b53fdb215d13
[03/12] iomap: pass flags to iomap_file_buffered_write_punch_delalloc
        https://git.kernel.org/vfs/vfs/c/11596dc3dfae
[04/12] iomap: pass the iomap to the punch callback
        https://git.kernel.org/vfs/vfs/c/492f53758fad
[05/12] iomap: remove the iomap_file_buffered_write_punch_delalloc return value
        https://git.kernel.org/vfs/vfs/c/4bceb9ba05ac

* Re: fix stale delalloc punching for COW I/O v2
From: Christoph Hellwig @ 2024-09-10 15:17 UTC
  To: Christian Brauner
  Cc: Christoph Hellwig, Chandan Babu R, Darrick J. Wong, linux-xfs,
	linux-fsdevel

On Tue, Sep 10, 2024 at 11:21:16AM +0200, Christian Brauner wrote:
> >  - move the already reviewed iomap prep changes to the beginning in case
> >    Christian wants to take them ASAP
> 
> I picked up patches 1-5. It was a bit hairy as I didn't have your merge
> base and I went and looked at https://git.infradead.org/?p=users/hch/xfs.git;a=summary
> but couldn't find the branch this was created from there. It'd be nice
> if you could just point me where to look for any prerequisites next time
> or just base it on some vfs.git branch. Thanks!

This was based on the xfs for-next branch.  For XFS that's kinda obvious,
but not for the VFS tree, sorry.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: (subset) fix stale delalloc punching for COW I/O v2
  2024-09-10  9:22 ` (subset) " Christian Brauner
@ 2024-09-10 15:17   ` Christoph Hellwig
  0 siblings, 0 replies; 28+ messages in thread
From: Christoph Hellwig @ 2024-09-10 15:17 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, Chandan Babu R,
	Darrick J. Wong

On Tue, Sep 10, 2024 at 11:22:20AM +0200, Christian Brauner wrote:
> Christoph, could you double-check that things look good to you? I had to
> resolve a minor merge conflict.

I compared the end result and the touched areas are exactly the same
in your tree as locally, so it must be perfect.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 06/12] xfs: factor out a xfs_file_write_zero_eof helper
  2024-09-10  4:39 ` [PATCH 06/12] xfs: factor out a xfs_file_write_zero_eof helper Christoph Hellwig
@ 2024-09-17 21:14   ` Darrick J. Wong
  2024-09-18  5:09     ` Christoph Hellwig
  0 siblings, 1 reply; 28+ messages in thread
From: Darrick J. Wong @ 2024-09-17 21:14 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chandan Babu R, Christian Brauner, linux-xfs, linux-fsdevel

On Tue, Sep 10, 2024 at 07:39:08AM +0300, Christoph Hellwig wrote:
> Split a helper from xfs_file_write_checks that just deals with the
> post-EOF zeroing to keep the code readable.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/xfs_file.c | 133 ++++++++++++++++++++++++++--------------------
>  1 file changed, 75 insertions(+), 58 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 0d258c21b9897f..a30fda1985e6af 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -347,10 +347,70 @@ xfs_file_splice_read(
>  	return ret;
>  }
>  
> +static ssize_t
> +xfs_file_write_zero_eof(
> +	struct kiocb		*iocb,
> +	struct iov_iter		*from,
> +	unsigned int		*iolock,
> +	size_t			count,
> +	bool			*drained_dio)
> +{
> +	struct xfs_inode	*ip = XFS_I(iocb->ki_filp->f_mapping->host);
> +	loff_t			isize;
> +
> +	/*
> +	 * We need to serialise against EOF updates that occur in IO completions
> +	 * here. We want to make sure that nobody is changing the size while
> +	 * we do this check until we have placed an IO barrier (i.e. hold
> +	 * XFS_IOLOCK_EXCL) that prevents new IO from being dispatched.  The
> +	 * spinlock effectively forms a memory barrier once we have
> +	 * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value and
> +	 * hence be able to correctly determine if we need to run zeroing.
> +	 */
> +	spin_lock(&ip->i_flags_lock);
> +	isize = i_size_read(VFS_I(ip));
> +	if (iocb->ki_pos <= isize) {
> +		spin_unlock(&ip->i_flags_lock);
> +		return 0;
> +	}
> +	spin_unlock(&ip->i_flags_lock);
> +
> +	if (iocb->ki_flags & IOCB_NOWAIT)
> +		return -EAGAIN;
> +
> +	if (!*drained_dio) {
> +		/*
> +		 * If zeroing is needed and we are currently holding the iolock
> +		 * shared, we need to update it to exclusive which implies
> +		 * having to redo all checks before.
> +		 */
> +		if (*iolock == XFS_IOLOCK_SHARED) {
> +			xfs_iunlock(ip, *iolock);
> +			*iolock = XFS_IOLOCK_EXCL;
> +			xfs_ilock(ip, *iolock);
> +			iov_iter_reexpand(from, count);
> +		}
> +
> +		/*
> +		 * We now have an IO submission barrier in place, but AIO can do
> +		 * EOF updates during IO completion and hence we now need to
> +		 * wait for all of them to drain.  Non-AIO DIO will have drained
> +		 * before we are given the XFS_IOLOCK_EXCL, and so for most
> +		 * cases this wait is a no-op.
> +		 */
> +		inode_dio_wait(VFS_I(ip));
> +		*drained_dio = true;
> +		return 1;

I gotta say, I'm not a big fan of the "return 1 to loop again" behavior.
Can you add a comment at the top stating that this is a possible return
value and why it gets returned?

--D

> +	}
> +
> +	trace_xfs_zero_eof(ip, isize, iocb->ki_pos - isize);
> +	return xfs_zero_range(ip, isize, iocb->ki_pos - isize, NULL);
> +}
> +
>  /*
>   * Common pre-write limit and setup checks.
>   *
> - * Called with the iolocked held either shared and exclusive according to
> + * Called with the iolock held either shared and exclusive according to
>   * @iolock, and returns with it held.  Might upgrade the iolock to exclusive
>   * if called for a direct write beyond i_size.
>   */
> @@ -360,13 +420,10 @@ xfs_file_write_checks(
>  	struct iov_iter		*from,
>  	unsigned int		*iolock)
>  {
> -	struct file		*file = iocb->ki_filp;
> -	struct inode		*inode = file->f_mapping->host;
> -	struct xfs_inode	*ip = XFS_I(inode);
> -	ssize_t			error = 0;
> +	struct inode		*inode = iocb->ki_filp->f_mapping->host;
>  	size_t			count = iov_iter_count(from);
>  	bool			drained_dio = false;
> -	loff_t			isize;
> +	ssize_t			error;
>  
>  restart:
>  	error = generic_write_checks(iocb, from);
> @@ -389,7 +446,7 @@ xfs_file_write_checks(
>  	 * exclusively.
>  	 */
>  	if (*iolock == XFS_IOLOCK_SHARED && !IS_NOSEC(inode)) {
> -		xfs_iunlock(ip, *iolock);
> +		xfs_iunlock(XFS_I(inode), *iolock);
>  		*iolock = XFS_IOLOCK_EXCL;
>  		error = xfs_ilock_iocb(iocb, *iolock);
>  		if (error) {
> @@ -400,64 +457,24 @@ xfs_file_write_checks(
>  	}
>  
>  	/*
> -	 * If the offset is beyond the size of the file, we need to zero any
> +	 * If the offset is beyond the size of the file, we need to zero all
>  	 * blocks that fall between the existing EOF and the start of this
> -	 * write.  If zeroing is needed and we are currently holding the iolock
> -	 * shared, we need to update it to exclusive which implies having to
> -	 * redo all checks before.
> -	 *
> -	 * We need to serialise against EOF updates that occur in IO completions
> -	 * here. We want to make sure that nobody is changing the size while we
> -	 * do this check until we have placed an IO barrier (i.e.  hold the
> -	 * XFS_IOLOCK_EXCL) that prevents new IO from being dispatched.  The
> -	 * spinlock effectively forms a memory barrier once we have the
> -	 * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value and
> -	 * hence be able to correctly determine if we need to run zeroing.
> +	 * write.
>  	 *
> -	 * We can do an unlocked check here safely as IO completion can only
> -	 * extend EOF. Truncate is locked out at this point, so the EOF can
> -	 * not move backwards, only forwards. Hence we only need to take the
> -	 * slow path and spin locks when we are at or beyond the current EOF.
> +	 * We can do an unlocked check for i_size here safely as I/O completion
> +	 * can only extend EOF.  Truncate is locked out at this point, so the
> +	 * EOF can not move backwards, only forwards. Hence we only need to take
> +	 * the slow path when we are at or beyond the current EOF.
>  	 */
> -	if (iocb->ki_pos <= i_size_read(inode))
> -		goto out;
> -
> -	spin_lock(&ip->i_flags_lock);
> -	isize = i_size_read(inode);
> -	if (iocb->ki_pos > isize) {
> -		spin_unlock(&ip->i_flags_lock);
> -
> -		if (iocb->ki_flags & IOCB_NOWAIT)
> -			return -EAGAIN;
> -
> -		if (!drained_dio) {
> -			if (*iolock == XFS_IOLOCK_SHARED) {
> -				xfs_iunlock(ip, *iolock);
> -				*iolock = XFS_IOLOCK_EXCL;
> -				xfs_ilock(ip, *iolock);
> -				iov_iter_reexpand(from, count);
> -			}
> -			/*
> -			 * We now have an IO submission barrier in place, but
> -			 * AIO can do EOF updates during IO completion and hence
> -			 * we now need to wait for all of them to drain. Non-AIO
> -			 * DIO will have drained before we are given the
> -			 * XFS_IOLOCK_EXCL, and so for most cases this wait is a
> -			 * no-op.
> -			 */
> -			inode_dio_wait(inode);
> -			drained_dio = true;
> +	if (iocb->ki_pos > i_size_read(inode)) {
> +		error = xfs_file_write_zero_eof(iocb, from, iolock, count,
> +				&drained_dio);
> +		if (error == 1)
>  			goto restart;
> -		}
> -
> -		trace_xfs_zero_eof(ip, isize, iocb->ki_pos - isize);
> -		error = xfs_zero_range(ip, isize, iocb->ki_pos - isize, NULL);
>  		if (error)
>  			return error;
> -	} else
> -		spin_unlock(&ip->i_flags_lock);
> +	}
>  
> -out:
>  	return kiocb_modified(iocb);
>  }
>  
> -- 
> 2.45.2
> 
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 07/12] xfs: take XFS_MMAPLOCK_EXCL xfs_file_write_zero_eof
  2024-09-10  4:39 ` [PATCH 07/12] xfs: take XFS_MMAPLOCK_EXCL xfs_file_write_zero_eof Christoph Hellwig
@ 2024-09-17 21:24   ` Darrick J. Wong
  2024-09-18  5:10     ` Christoph Hellwig
  0 siblings, 1 reply; 28+ messages in thread
From: Darrick J. Wong @ 2024-09-17 21:24 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chandan Babu R, Christian Brauner, linux-xfs, linux-fsdevel

On Tue, Sep 10, 2024 at 07:39:09AM +0300, Christoph Hellwig wrote:
> xfs_file_write_zero_eof is the only caller of xfs_zero_range that does
> not take XFS_MMAPLOCK_EXCL (aka the invalidate lock).  Currently that
> is acrually the right thing, as an error in the iomap zeroing code will

     actually

> also take the invalidate_lock to clean up, but to fix that deadlock we
> need a consistent locking pattern first.
> 
> The only extra thing that XFS_MMAPLOCK_EXCL will lock out are read
> pagefaults, which isn't really needed here, but also not actively
> harmful.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/xfs_file.c  | 8 +++++++-
>  fs/xfs/xfs_iomap.c | 2 ++
>  2 files changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index a30fda1985e6af..37dc26f51ace65 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -357,6 +357,7 @@ xfs_file_write_zero_eof(
>  {
>  	struct xfs_inode	*ip = XFS_I(iocb->ki_filp->f_mapping->host);
>  	loff_t			isize;
> +	int			error;
>  
>  	/*
>  	 * We need to serialise against EOF updates that occur in IO completions
> @@ -404,7 +405,12 @@ xfs_file_write_zero_eof(
>  	}
>  
>  	trace_xfs_zero_eof(ip, isize, iocb->ki_pos - isize);
> -	return xfs_zero_range(ip, isize, iocb->ki_pos - isize, NULL);
> +
> +	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
> +	error = xfs_zero_range(ip, isize, iocb->ki_pos - isize, NULL);
> +	xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);

Ah, ok, so we're taking the invalidate_lock so that we can't have page
faults that might add folios (or dirty existing ones) in the mapping.
We're the only ones who can access the page cache, and we're doing that
so that we can zero the folios between the old EOF and the start of the
write region.

Is that right?  Then
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D


> +
> +	return error;
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 1e11f48814c0d0..3c98d82c0ad0dc 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1435,6 +1435,8 @@ xfs_zero_range(
>  {
>  	struct inode		*inode = VFS_I(ip);
>  
> +	xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);
> +
>  	if (IS_DAX(inode))
>  		return dax_zero_range(inode, pos, len, did_zero,
>  				      &xfs_dax_write_iomap_ops);
> -- 
> 2.45.2
> 
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread
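
To make concrete the locking rationale Darrick spells out above, here is a
minimal user-space model; the pthread rwlock stands in for the mapping's
invalidate_lock (XFS_MMAPLOCK_EXCL), and every name in it is a hypothetical
stand-in rather than the actual XFS code:

/*
 * Sketch: hold the "invalidate lock" exclusively while zeroing so that
 * a concurrent read fault cannot observe (or instantiate) stale data
 * in the range being zeroed.  Hypothetical names, user-space model.
 */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

static pthread_rwlock_t invalidate_lock = PTHREAD_RWLOCK_INITIALIZER;
static char page_cache[4096];		/* stand-in for cached file data */

/* Models xfs_zero_range(): zero [pos, pos + len) in the cache. */
static int zero_range(size_t pos, size_t len)
{
	memset(page_cache + pos, 0, len);
	return 0;
}

/* Models the hunk above: take the lock, zero, drop the lock. */
static int write_zero_eof(size_t isize, size_t ki_pos)
{
	int error;

	pthread_rwlock_wrlock(&invalidate_lock);	/* XFS_MMAPLOCK_EXCL */
	error = zero_range(isize, ki_pos - isize);
	pthread_rwlock_unlock(&invalidate_lock);
	return error;
}

/* Models a racing read fault: blocked until the zeroing finishes. */
static void read_fault(size_t pos)
{
	pthread_rwlock_rdlock(&invalidate_lock);
	printf("byte %zu = %d\n", pos, page_cache[pos]);
	pthread_rwlock_unlock(&invalidate_lock);
}

int main(void)
{
	write_zero_eof(100, 200);	/* zero the old-EOF..write-start gap */
	read_fault(150);
	return 0;
}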

* Re: [PATCH 08/12] iomap: zeroing already holds invalidate_lock
  2024-09-10  4:39 ` [PATCH 08/12] iomap: zeroing already holds invalidate_lock Christoph Hellwig
@ 2024-09-17 21:29   ` Darrick J. Wong
  2024-09-18  5:15     ` Christoph Hellwig
  0 siblings, 1 reply; 28+ messages in thread
From: Darrick J. Wong @ 2024-09-17 21:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chandan Babu R, Christian Brauner, linux-xfs, linux-fsdevel

On Tue, Sep 10, 2024 at 07:39:10AM +0300, Christoph Hellwig wrote:
> All callers of iomap_zero_range already hold invalidate_lock, so we can't
> take it again in iomap_file_buffered_write_punch_delalloc.
> 
> Use the passed in flags argument to detect if we're called from a zeroing
> operation and don't take the lock again in this case.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/iomap/buffered-io.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 52f285ae4bddcb..3d7e69a542518a 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1188,8 +1188,13 @@ static void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
>  	 * folios and dirtying them via ->page_mkwrite whilst we walk the
>  	 * cache and perform delalloc extent removal. Failing to do this can
>  	 * leave dirty pages with no space reservation in the cache.
> +	 *
> +	 * For zeroing operations the callers already hold invalidate_lock.
>  	 */
> -	filemap_invalidate_lock(inode->i_mapping);
> +	if (flags & IOMAP_ZERO)
> +		rwsem_assert_held_write(&inode->i_mapping->invalidate_lock);

Does the other iomap_zero_range user (gfs2) take the invalidate lock?
AFAICT it doesn't.  Shouldn't we annotate iomap_zero_range to say that
callers have to hold i_rwsem and the invalidate_lock?

--D

> +	else
> +		filemap_invalidate_lock(inode->i_mapping);
>  	while (start_byte < scan_end_byte) {
>  		loff_t		data_end;
>  
> @@ -1240,7 +1245,8 @@ static void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
>  		punch(inode, punch_start_byte, end_byte - punch_start_byte,
>  				iomap);
>  out_unlock:
> -	filemap_invalidate_unlock(inode->i_mapping);
> +	if (!(flags & IOMAP_ZERO))
> +		filemap_invalidate_unlock(inode->i_mapping);
>  }
>  
>  /*
> -- 
> 2.45.2
> 
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread
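
The conditional locking in the hunk above can be modeled as a small
compilable sketch; the flag value, lock, and function names below are
hypothetical stand-ins for the iomap code, not its actual API:

/*
 * Sketch of the patch 08/12 rule: zeroing callers (modeled by the
 * IOMAP_ZERO-like flag) already hold the invalidate lock, so the
 * release path must not take it again; all other callers need it
 * taken here.  Hypothetical names, user-space model.
 */
#include <pthread.h>
#include <stdio.h>

#define SKETCH_ZERO	(1U << 0)

static pthread_rwlock_t invalidate_lock = PTHREAD_RWLOCK_INITIALIZER;

static void delalloc_release(unsigned int flags)
{
	if (!(flags & SKETCH_ZERO))
		pthread_rwlock_wrlock(&invalidate_lock);

	/* walk the page cache and punch stale delalloc extents here */
	printf("punching, flags=%#x\n", flags);

	if (!(flags & SKETCH_ZERO))
		pthread_rwlock_unlock(&invalidate_lock);
}

int main(void)
{
	delalloc_release(0);		/* takes and drops the lock itself */

	pthread_rwlock_wrlock(&invalidate_lock);	/* zeroing caller */
	delalloc_release(SKETCH_ZERO);	/* must only punch, not relock */
	pthread_rwlock_unlock(&invalidate_lock);
	return 0;
}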

* Re: [PATCH 06/12] xfs: factor out a xfs_file_write_zero_eof helper
  2024-09-17 21:14   ` Darrick J. Wong
@ 2024-09-18  5:09     ` Christoph Hellwig
  2024-09-18 15:30       ` Darrick J. Wong
  2024-09-20  0:25       ` Dave Chinner
  0 siblings, 2 replies; 28+ messages in thread
From: Christoph Hellwig @ 2024-09-18  5:09 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Chandan Babu R, Christian Brauner, linux-xfs,
	linux-fsdevel

On Tue, Sep 17, 2024 at 02:14:19PM -0700, Darrick J. Wong wrote:
> I gotta say, I'm not a big fan of the "return 1 to loop again" behavior.
> Can you add a comment at the top stating that this is a possible return
> value and why it gets returned?

Sure.  If you have a better idea I'm all ears, too.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 07/12] xfs: take XFS_MMAPLOCK_EXCL xfs_file_write_zero_eof
  2024-09-17 21:24   ` Darrick J. Wong
@ 2024-09-18  5:10     ` Christoph Hellwig
  0 siblings, 0 replies; 28+ messages in thread
From: Christoph Hellwig @ 2024-09-18  5:10 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Chandan Babu R, Christian Brauner, linux-xfs,
	linux-fsdevel

On Tue, Sep 17, 2024 at 02:24:27PM -0700, Darrick J. Wong wrote:
> Ah, ok, so we're taking the invalidate_lock so that we can't have page
> faults that might add folios (or dirty existing ones) in the mapping.
> We're the only ones who can access the page cache, and we're doing that
> so that we can zero the folios between the old EOF and the start of the
> write region.
> 
> Is that right?  Then

Yes.

We might eventually also relax this a bit to only take the lock if
we actually zero anything, but for that we need to do the work to
turn the iomap iterators inside out first.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 08/12] iomap: zeroing already holds invalidate_lock
  2024-09-17 21:29   ` Darrick J. Wong
@ 2024-09-18  5:15     ` Christoph Hellwig
  2024-09-18 15:32       ` Darrick J. Wong
  2024-09-20  0:42       ` Dave Chinner
  0 siblings, 2 replies; 28+ messages in thread
From: Christoph Hellwig @ 2024-09-18  5:15 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Chandan Babu R, Christian Brauner, linux-xfs,
	linux-fsdevel

On Tue, Sep 17, 2024 at 02:29:35PM -0700, Darrick J. Wong wrote:
> On Tue, Sep 10, 2024 at 07:39:10AM +0300, Christoph Hellwig wrote:
> > All callers of iomap_zero_range already hold invalidate_lock, so we can't
> > take it again in iomap_file_buffered_write_punch_delalloc.
> > 
> > Use the passed in flags argument to detect if we're called from a zeroing
> > operation and don't take the lock again in this case.
> > 
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > ---
> >  fs/iomap/buffered-io.c | 10 ++++++++--
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index 52f285ae4bddcb..3d7e69a542518a 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -1188,8 +1188,13 @@ static void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
> >  	 * folios and dirtying them via ->page_mkwrite whilst we walk the
> >  	 * cache and perform delalloc extent removal. Failing to do this can
> >  	 * leave dirty pages with no space reservation in the cache.
> > +	 *
> > +	 * For zeroing operations the callers already hold invalidate_lock.
> >  	 */
> > -	filemap_invalidate_lock(inode->i_mapping);
> > +	if (flags & IOMAP_ZERO)
> > +		rwsem_assert_held_write(&inode->i_mapping->invalidate_lock);
> 
> Does the other iomap_zero_range user (gfs2) take the invalidate lock?
> AFAICT it doesn't.  Shouldn't we annotate iomap_zero_range to say that
> callers have to hold i_rwsem and the invalidate_lock?

gfs2 does not hold invalidate_lock over iomap_zero_range.  But
it also does not use iomap_file_buffered_write_punch_delalloc at
all, which is what requires the lock (and asserts that it is held).


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 06/12] xfs: factor out a xfs_file_write_zero_eof helper
  2024-09-18  5:09     ` Christoph Hellwig
@ 2024-09-18 15:30       ` Darrick J. Wong
  2024-09-20  0:25       ` Dave Chinner
  1 sibling, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2024-09-18 15:30 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chandan Babu R, Christian Brauner, linux-xfs, linux-fsdevel

On Wed, Sep 18, 2024 at 07:09:36AM +0200, Christoph Hellwig wrote:
> On Tue, Sep 17, 2024 at 02:14:19PM -0700, Darrick J. Wong wrote:
> > I gotta say, I'm not a big fan of the "return 1 to loop again" behavior.
> > Can you add a comment at the top stating that this is a possible return
> > value and why it gets returned?
> 
> Sure.  If you have a better idea I'm all ears, too.

Sadly, I don't.  The closest I can think of is that iterator functions
return 1 for "here's an object", 0 for "no more objects", or negative
errno.  But this isn't iterating objects, it's (at best) "iterating" the
"not yet ready to do zeroing" states, and that's a stretch.

--D

^ permalink raw reply	[flat|nested] 28+ messages in thread
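
The iterator convention Darrick contrasts this with can be written down in
a tiny compilable sketch; all names here are made up for illustration:

/*
 * Sketch of the convention: an iterator returns 1 for "here's an
 * object, call me again", 0 for "no more objects", or a negative
 * errno on failure.  Hypothetical names, not kernel code.
 */
#include <errno.h>
#include <stdio.h>

struct cursor {
	int next;
	int count;
};

static int next_object(struct cursor *cur)
{
	if (cur->count < 0)
		return -EINVAL;		/* negative errno on failure */
	if (cur->next >= cur->count)
		return 0;		/* no more objects */
	cur->next++;
	return 1;			/* here's an object, call again */
}

int main(void)
{
	struct cursor cur = { .next = 0, .count = 3 };
	int ret;

	while ((ret = next_object(&cur)) == 1)
		printf("object %d\n", cur.next);
	return ret < 0 ? 1 : 0;
}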

* Re: [PATCH 08/12] iomap: zeroing already holds invalidate_lock
  2024-09-18  5:15     ` Christoph Hellwig
@ 2024-09-18 15:32       ` Darrick J. Wong
  2024-09-20  0:42       ` Dave Chinner
  1 sibling, 0 replies; 28+ messages in thread
From: Darrick J. Wong @ 2024-09-18 15:32 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chandan Babu R, Christian Brauner, linux-xfs, linux-fsdevel

On Wed, Sep 18, 2024 at 07:15:23AM +0200, Christoph Hellwig wrote:
> On Tue, Sep 17, 2024 at 02:29:35PM -0700, Darrick J. Wong wrote:
> > On Tue, Sep 10, 2024 at 07:39:10AM +0300, Christoph Hellwig wrote:
> > > All callers of iomap_zero_range already hold invalidate_lock, so we can't
> > > take it again in iomap_file_buffered_write_punch_delalloc.
> > > 
> > > Use the passed in flags argument to detect if we're called from a zeroing
> > > operation and don't take the lock again in this case.
> > > 
> > > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > > ---
> > >  fs/iomap/buffered-io.c | 10 ++++++++--
> > >  1 file changed, 8 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > index 52f285ae4bddcb..3d7e69a542518a 100644
> > > --- a/fs/iomap/buffered-io.c
> > > +++ b/fs/iomap/buffered-io.c
> > > @@ -1188,8 +1188,13 @@ static void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
> > >  	 * folios and dirtying them via ->page_mkwrite whilst we walk the
> > >  	 * cache and perform delalloc extent removal. Failing to do this can
> > >  	 * leave dirty pages with no space reservation in the cache.
> > > +	 *
> > > +	 * For zeroing operations the callers already hold invalidate_lock.
> > >  	 */
> > > -	filemap_invalidate_lock(inode->i_mapping);
> > > +	if (flags & IOMAP_ZERO)
> > > +		rwsem_assert_held_write(&inode->i_mapping->invalidate_lock);
> > 
> > Does the other iomap_zero_range user (gfs2) take the invalidate lock?
> > AFAICT it doesn't.  Shouldn't we annotate iomap_zero_range to say that
> > callers have to hold i_rwsem and the invalidate_lock?
> 
> gfs2 does not hold invalidate_lock over iomap_zero_range.  But
> it also does not use iomap_file_buffered_write_punch_delalloc at
> all, which is what requires the lock (and asserts that it is held).

Aha, that's why it works.  Silly me, forgetting that gfs2 doesn't do
delalloc.  It was quite relaxing to let everything page out of my brain
these past two weeks... :)

--D

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 06/12] xfs: factor out a xfs_file_write_zero_eof helper
  2024-09-18  5:09     ` Christoph Hellwig
  2024-09-18 15:30       ` Darrick J. Wong
@ 2024-09-20  0:25       ` Dave Chinner
  2024-09-20 11:31         ` Christoph Hellwig
  1 sibling, 1 reply; 28+ messages in thread
From: Dave Chinner @ 2024-09-20  0:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, Chandan Babu R, Christian Brauner, linux-xfs,
	linux-fsdevel

On Wed, Sep 18, 2024 at 07:09:36AM +0200, Christoph Hellwig wrote:
> On Tue, Sep 17, 2024 at 02:14:19PM -0700, Darrick J. Wong wrote:
> > I gotta say, I'm not a big fan of the "return 1 to loop again" behavior.
> > Can you add a comment at the top stating that this is a possible return
> > value and why it gets returned?
> 
> Sure.  If you have a better idea I'm all ears, too.

I would have thought a -EBUSY error would have been appropriate.
i.e. there was an extending write in progress (busy doing IO) so we
couldn't perform the zeroing operation, and hence the write needs to
be restarted now that the IO has been drained...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 08/12] iomap: zeroing already holds invalidate_lock
  2024-09-18  5:15     ` Christoph Hellwig
  2024-09-18 15:32       ` Darrick J. Wong
@ 2024-09-20  0:42       ` Dave Chinner
  1 sibling, 0 replies; 28+ messages in thread
From: Dave Chinner @ 2024-09-20  0:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, Chandan Babu R, Christian Brauner, linux-xfs,
	linux-fsdevel

On Wed, Sep 18, 2024 at 07:15:23AM +0200, Christoph Hellwig wrote:
> On Tue, Sep 17, 2024 at 02:29:35PM -0700, Darrick J. Wong wrote:
> > On Tue, Sep 10, 2024 at 07:39:10AM +0300, Christoph Hellwig wrote:
> > > All callers of iomap_zero_range already hold invalidate_lock, so we can't
> > > take it again in iomap_file_buffered_write_punch_delalloc.
> > > 
> > > Use the passed in flags argument to detect if we're called from a zeroing
> > > operation and don't take the lock again in this case.
> > > 
> > > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > > ---
> > >  fs/iomap/buffered-io.c | 10 ++++++++--
> > >  1 file changed, 8 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > index 52f285ae4bddcb..3d7e69a542518a 100644
> > > --- a/fs/iomap/buffered-io.c
> > > +++ b/fs/iomap/buffered-io.c
> > > @@ -1188,8 +1188,13 @@ static void iomap_write_delalloc_release(struct inode *inode, loff_t start_byte,
> > >  	 * folios and dirtying them via ->page_mkwrite whilst we walk the
> > >  	 * cache and perform delalloc extent removal. Failing to do this can
> > >  	 * leave dirty pages with no space reservation in the cache.
> > > +	 *
> > > +	 * For zeroing operations the callers already hold invalidate_lock.
> > >  	 */
> > > -	filemap_invalidate_lock(inode->i_mapping);
> > > +	if (flags & IOMAP_ZERO)
> > > +		rwsem_assert_held_write(&inode->i_mapping->invalidate_lock);
> > 
> > Does the other iomap_zero_range user (gfs2) take the invalidate lock?
> > AFAICT it doesn't.  Shouldn't we annotate iomap_zero_range to say that
> > callers have to hold i_rwsem and the invalidate_lock?
> 
> gfs2 does not hold invalidate_lock over iomap_zero_range.  But
> it also does not use iomap_file_buffered_write_punch_delalloc at
> all, which is what requires the lock (and asserts that it is held).

Not a fan of this dichotomy.  It means that filesystems that don't
support IOMAP_DELALLOC don't need to hold the invalidate lock to
zero, but filesystems that do support IOMAP_DELALLOC do need to hold
it.

I'd kinda prefer there be one locking rule for all callers; it makes
it much easier to determine if the callers are doing the right thing
without needing to know if the filesystem is IOMAP_DELALLOC capable
or not. At minimum, it needs to be clearly documented.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 06/12] xfs: factor out a xfs_file_write_zero_eof helper
  2024-09-20  0:25       ` Dave Chinner
@ 2024-09-20 11:31         ` Christoph Hellwig
  0 siblings, 0 replies; 28+ messages in thread
From: Christoph Hellwig @ 2024-09-20 11:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Darrick J. Wong, Chandan Babu R,
	Christian Brauner, linux-xfs, linux-fsdevel

On Fri, Sep 20, 2024 at 10:25:54AM +1000, Dave Chinner wrote:
> > Sure.  If you have a better idea I'm all ears, too.
> 
> I would have thought a -EBUSY error would have been appropriate.
> i.e. there was an extending write in progress (busy doing IO) so we
> couldn't perform the zeroing operation and hence the write needs to
> be restarted now the IO has been drained...

I can't say that I'm a huge fan of overloading errno values when there
are quite a few calls that could return basically arbitrary errors in
the chain.  See the "fix a DEBUG-only assert failure in xfs/538"
series for when this kind of errno overloading causes problems later
on in unexpected ways.

^ permalink raw reply	[flat|nested] 28+ messages in thread
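
The hazard Christoph points at can be demonstrated with a short sketch,
using hypothetical names: once -EBUSY carries the private meaning
"drained, restart the checks", a genuine -EBUSY returned from deeper in
the call chain gets retried exactly the same way:

/*
 * Sketch of the errno-overloading hazard.  Hypothetical names; the
 * retry cap only exists so the demonstration terminates.
 */
#include <errno.h>
#include <stdio.h>

static int deep_callee(void)
{
	return -EBUSY;		/* a genuine, unrelated -EBUSY */
}

static int zero_eof(int *drained)
{
	if (!*drained) {
		*drained = 1;
		return -EBUSY;	/* overloaded meaning: "restart" */
	}
	return deep_callee();	/* same value, different meaning */
}

int main(void)
{
	int drained = 0, tries = 0, error;

restart:
	error = zero_eof(&drained);
	if (error == -EBUSY && ++tries < 5)
		goto restart;	/* also retries the genuine error */

	printf("tries=%d error=%d\n", tries, error);
	return error ? 1 : 0;
}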

end of thread, other threads:[~2024-09-20 11:31 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-09-10  4:39 fix stale delalloc punching for COW I/O v2 Christoph Hellwig
2024-09-10  4:39 ` [PATCH 01/12] iomap: handle a post-direct I/O invalidate race in iomap_write_delalloc_release Christoph Hellwig
2024-09-10  4:39 ` [PATCH 02/12] iomap: improve shared block detection in iomap_unshare_iter Christoph Hellwig
2024-09-10  4:39 ` [PATCH 03/12] iomap: pass flags to iomap_file_buffered_write_punch_delalloc Christoph Hellwig
2024-09-10  4:39 ` [PATCH 04/12] iomap: pass the iomap to the punch callback Christoph Hellwig
2024-09-10  4:39 ` [PATCH 05/12] iomap: remove the iomap_file_buffered_write_punch_delalloc return value Christoph Hellwig
2024-09-10  4:39 ` [PATCH 06/12] xfs: factor out a xfs_file_write_zero_eof helper Christoph Hellwig
2024-09-17 21:14   ` Darrick J. Wong
2024-09-18  5:09     ` Christoph Hellwig
2024-09-18 15:30       ` Darrick J. Wong
2024-09-20  0:25       ` Dave Chinner
2024-09-20 11:31         ` Christoph Hellwig
2024-09-10  4:39 ` [PATCH 07/12] xfs: take XFS_MMAPLOCK_EXCL xfs_file_write_zero_eof Christoph Hellwig
2024-09-17 21:24   ` Darrick J. Wong
2024-09-18  5:10     ` Christoph Hellwig
2024-09-10  4:39 ` [PATCH 08/12] iomap: zeroing already holds invalidate_lock Christoph Hellwig
2024-09-17 21:29   ` Darrick J. Wong
2024-09-18  5:15     ` Christoph Hellwig
2024-09-18 15:32       ` Darrick J. Wong
2024-09-20  0:42       ` Dave Chinner
2024-09-10  4:39 ` [PATCH 09/12] xfs: support the COW fork in xfs_bmap_punch_delalloc_range Christoph Hellwig
2024-09-10  4:39 ` [PATCH 10/12] xfs: share more code in xfs_buffered_write_iomap_begin Christoph Hellwig
2024-09-10  4:39 ` [PATCH 11/12] xfs: set IOMAP_F_SHARED for all COW fork allocations Christoph Hellwig
2024-09-10  4:39 ` [PATCH 12/12] xfs: punch delalloc extents from the COW fork for COW writes Christoph Hellwig
2024-09-10  9:21 ` fix stale delalloc punching for COW I/O v2 Christian Brauner
2024-09-10 15:17   ` Christoph Hellwig
2024-09-10  9:22 ` (subset) " Christian Brauner
2024-09-10 15:17   ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).