Linux XFS filesystem development
* [PATCH v8 0/2] add FALLOC_FL_WRITE_ZEROES support to xfs
@ 2026-06-25 11:45 Pankaj Raghav
  2026-06-25 11:45 ` [PATCH v8 1/2] xfs: add an allocation mode to xfs_alloc_file_space() Pankaj Raghav
  2026-06-25 11:45 ` [PATCH v8 2/2] xfs: add support for FALLOC_FL_WRITE_ZEROES Pankaj Raghav
  0 siblings, 2 replies; 6+ messages in thread
From: Pankaj Raghav @ 2026-06-25 11:45 UTC (permalink / raw)
  To: linux-xfs
  Cc: bfoster, lukas, Darrick J . Wong, p.raghav, dgc, gost.dev,
	Zhang Yi, pankaj.raghav, andres, kundan.kumar, hch, cem, hch

The benefits of FALLOC_FL_WRITE_ZEROES were already discussed as part
of Zhang Yi's initial patches[1]. Postgres developer Andres also
mentioned that they would like to use this feature in Postgres[2].
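
As a rough illustration of the userspace side, the sketch below shows how an
application might ask for written, zeroed extents and fall back to plain
writes when the filesystem or device does not support the flag. The helper
names prealloc_zeroed() and zero_by_write() are hypothetical, and the fallback
#define assumes the value used by the upstream <linux/falloc.h>; treat it as a
sketch under those assumptions rather than a recommended implementation.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Assumption: same value as in <linux/falloc.h> on kernels with the flag. */
#ifndef FALLOC_FL_WRITE_ZEROES
#define FALLOC_FL_WRITE_ZEROES 0x80
#endif

/* Hypothetical fallback: fill [off, off + len) with zeroes using pwrite(). */
static int zero_by_write(int fd, off_t off, off_t len)
{
	char buf[4096];

	memset(buf, 0, sizeof(buf));
	while (len > 0) {
		size_t chunk = len < (off_t)sizeof(buf) ? (size_t)len : sizeof(buf);
		ssize_t n = pwrite(fd, buf, chunk, off);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -errno;
		}
		if (n == 0)
			return -EIO;
		off += n;
		len -= n;
	}
	return 0;
}

/*
 * Hypothetical helper: returns 0 on success or a negative errno.
 * *used_flag is set to 1 only if FALLOC_FL_WRITE_ZEROES did the work.
 */
static int prealloc_zeroed(int fd, off_t off, off_t len, int *used_flag)
{
	assert(off >= 0 && len > 0);

	*used_flag = 0;
	if (fallocate(fd, FALLOC_FL_WRITE_ZEROES, off, len) == 0) {
		*used_flag = 1;
		return 0;
	}

	/* Flag unknown or unsupported here: zero the range the slow way. */
	if (errno != EOPNOTSUPP && errno != EINVAL && errno != ENOSYS)
		return -errno;
	return zero_by_write(fd, off, len);
}
```

Either path leaves the range reading back as zeroes; only the fast path avoids
the later unwritten-to-written conversions.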

I tested the changes with fsstress and fsx based on the xfstests patch I
sent recently to test this flag[4]. generic/363 helped me debug the
crash I noticed when I did the initial implementation[3].

Dave initially suggested creating a common helper based on
xfs_iomap_convert_unwritten(), but as can be seen in the previous
version, a lot of the code had to be rewritten. The changes had more in
common with xfs_alloc_file_space(), so this version reuses
xfs_alloc_file_space() for write zeroes.

Thanks to Christoph for all the review comments and design suggestions
that were made both offline and online for this series.

The stress tests generic/363, generic/127 and xfs/131 are passing. I have
started a full xfstests run for this series.

I will be sending a new generic test case that covers the boundary block
corner cases.

Changes since v7:
- Pass offset and len to xfs_alloc_file_space (Based on Sashiko's feedback).
- Add a lot of comments to prove correctness based on Zhang's feedback.
- Add Darrick's comment about xfs_alloc_file_space description.

Changes since v6:
- Pass only offset that needs to be zeroed to alloc_file_space (Christoph).
- Add RVB from Christoph.
- Change the call order. Call xfs_falloc_setsize() and then call
  xfs_alloc_file_space().
- Remove the prep patch to allow xfs_set_filesize to take 64-bit len.

Changes since v5:
- Add a prep patch to allow xfs_set_filesize to take 64-bit len
  (Sashiko)

Changes since v4:
- Introduce an enum for allocation mode in xfs_alloc_file_space (Christoph)
- Use xfs_set_filesize instead of updating the on-disk size in the
  function.

Changes since v3:
- Introduce xfs_bmap_alloc_or_convert_range() in xfs_iomap.c to make
  review easier (Christoph)
- Add extsz hint and rt support in xfs_bmap_alloc_or_convert_range()

Changes since v2:
- Add allow_write_zeroes to xfs_global so that we can enable this
  feature independent of the HW underneath.

Changes since v1 [5.1][5.2]:
- Added a new function xfs_bmap_alloc_or_convert_range() based on Dave's
  feedback.
- Changed the xfs_falloc_write_zeroes to use
  xfs_bmap_alloc_or_convert_range() instead of doing prealloc and
  convert approach.

[1] https://lore.kernel.org/linux-fsdevel/20250619111806.3546162-1-yi.zhang@huaweicloud.com/
[2] https://lore.kernel.org/linux-fsdevel/20260217055103.GA6174@lst.de/T/#m7935b9bab32bb5ff372507f84803b8753ad1c814
[3] https://lore.kernel.org/linux-xfs/6i2jvzn3lyugjlbgmjzpped3gogzyqv5mpe2uqaifz4vjpaega@pomzoq7ley77/
[4] https://lore.kernel.org/linux-xfs/20260312195308.738189-1-p.raghav@samsung.com/
[5.1] https://lore.kernel.org/linux-xfs/20260309180708.427553-2-lukas@herbolt.com/
[5.2] https://lore.kernel.org/linux-xfs/abC1LvRElctaHPe5@dread/

Pankaj Raghav (2):
  xfs: add an allocation mode to xfs_alloc_file_space()
  xfs: add support for FALLOC_FL_WRITE_ZEROES

 fs/xfs/xfs_bmap_util.c | 42 +++++++++++++++++---
 fs/xfs/xfs_bmap_util.h |  7 +++-
 fs/xfs/xfs_file.c      | 87 ++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 125 insertions(+), 11 deletions(-)


base-commit: 6e24acc45ab58d39a0162b4d5f3fd001d07d868e
-- 
2.51.2



* [PATCH v8 1/2] xfs: add an allocation mode to xfs_alloc_file_space()
  2026-06-25 11:45 [PATCH v8 0/2] add FALLOC_FL_WRITE_ZEROES support to xfs Pankaj Raghav
@ 2026-06-25 11:45 ` Pankaj Raghav
  2026-06-25 17:01   ` Darrick J. Wong
  2026-06-25 11:45 ` [PATCH v8 2/2] xfs: add support for FALLOC_FL_WRITE_ZEROES Pankaj Raghav
  1 sibling, 1 reply; 6+ messages in thread
From: Pankaj Raghav @ 2026-06-25 11:45 UTC (permalink / raw)
  To: linux-xfs
  Cc: bfoster, lukas, Darrick J . Wong, p.raghav, dgc, gost.dev,
	Zhang Yi, pankaj.raghav, andres, kundan.kumar, hch, cem, hch

xfs_alloc_file_space() hardcodes XFS_BMAPI_PREALLOC to preallocate
unwritten extents across a range.

In preparation for FALLOC_FL_WRITE_ZEROES, add an explicit allocation
mode argument, enum xfs_alloc_file_space_mode, and derive the xfs_bmapi
flags from it. The only mode for now is XFS_ALLOC_FILE_SPACE_PREALLOC,
which preallocates unwritten extents and marks the inode as preallocated
exactly as before, so there is no functional change.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/xfs/xfs_bmap_util.c | 25 +++++++++++++++++++++----
 fs/xfs/xfs_bmap_util.h |  6 +++++-
 fs/xfs/xfs_file.c      |  9 ++++++---
 3 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 3b9f262f8e91..e5424d010a69 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -642,11 +642,19 @@ xfs_free_eofblocks(
 	return error;
 }
 
+/*
+ * Allocate space for a file according to @mode:
+ *
+ * XFS_ALLOC_FILE_SPACE_PREALLOC:
+ * Preallocate unwritten extents over holes across the range and mark the inode
+ * as preallocated.
+ */
 int
 xfs_alloc_file_space(
 	struct xfs_inode	*ip,
 	xfs_off_t		offset,
-	xfs_off_t		len)
+	xfs_off_t		len,
+	enum xfs_alloc_file_space_mode mode)
 {
 	xfs_mount_t		*mp = ip->i_mount;
 	xfs_off_t		count;
@@ -657,6 +665,7 @@ xfs_alloc_file_space(
 	int			rt;
 	xfs_trans_t		*tp;
 	xfs_bmbt_irec_t		imaps[1], *imapp;
+	uint32_t		bmapi_flags, nr_exts;
 	int			error;
 
 	if (xfs_is_always_cow_inode(ip))
@@ -674,6 +683,15 @@ xfs_alloc_file_space(
 	if (len <= 0)
 		return -EINVAL;
 
+	switch (mode) {
+	case XFS_ALLOC_FILE_SPACE_PREALLOC:
+		bmapi_flags = XFS_BMAPI_PREALLOC;
+		nr_exts = XFS_IEXT_ADD_NOSPLIT_CNT;
+		break;
+	default:
+		return -EINVAL;
+	}
+
 	rt = XFS_IS_REALTIME_INODE(ip);
 	extsz = xfs_get_extsz_hint(ip);
 
@@ -733,8 +751,7 @@ xfs_alloc_file_space(
 		if (error)
 			break;
 
-		error = xfs_iext_count_extend(tp, ip, XFS_DATA_FORK,
-				XFS_IEXT_ADD_NOSPLIT_CNT);
+		error = xfs_iext_count_extend(tp, ip, XFS_DATA_FORK, nr_exts);
 		if (error)
 			goto error;
 
@@ -748,7 +765,7 @@ xfs_alloc_file_space(
 		 * will eventually reach the requested range.
 		 */
 		error = xfs_bmapi_write(tp, ip, startoffset_fsb,
-				allocatesize_fsb, XFS_BMAPI_PREALLOC, 0, imapp,
+				allocatesize_fsb, bmapi_flags, 0, imapp,
 				&nimaps);
 		if (error) {
 			if (error != -ENOSR)
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index c477b3361630..232b4c48247e 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -55,8 +55,12 @@ int	xfs_bmap_last_extent(struct xfs_trans *tp, struct xfs_inode *ip,
 			     int *is_empty);
 
 /* preallocation and hole punch interface */
+enum xfs_alloc_file_space_mode {
+	XFS_ALLOC_FILE_SPACE_PREALLOC,
+};
+
 int	xfs_alloc_file_space(struct xfs_inode *ip, xfs_off_t offset,
-		xfs_off_t len);
+		xfs_off_t len, enum xfs_alloc_file_space_mode mode);
 int	xfs_free_file_space(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t len, struct xfs_zone_alloc_ctx *ac);
 int	xfs_collapse_file_space(struct xfs_inode *, xfs_off_t offset,
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 845a97c9b063..e90ea6ebdc8e 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1406,7 +1406,8 @@ xfs_falloc_zero_range(
 		len = round_up(offset + len, blksize) -
 			round_down(offset, blksize);
 		offset = round_down(offset, blksize);
-		error = xfs_alloc_file_space(ip, offset, len);
+		error = xfs_alloc_file_space(ip, offset, len,
+				XFS_ALLOC_FILE_SPACE_PREALLOC);
 	}
 	if (error)
 		return error;
@@ -1432,7 +1433,8 @@ xfs_falloc_unshare_range(
 	if (error)
 		return error;
 
-	error = xfs_alloc_file_space(XFS_I(inode), offset, len);
+	error = xfs_alloc_file_space(XFS_I(inode), offset, len,
+			XFS_ALLOC_FILE_SPACE_PREALLOC);
 	if (error)
 		return error;
 	return xfs_falloc_setsize(file, new_size);
@@ -1460,7 +1462,8 @@ xfs_falloc_allocate_range(
 	if (error)
 		return error;
 
-	error = xfs_alloc_file_space(XFS_I(inode), offset, len);
+	error = xfs_alloc_file_space(XFS_I(inode), offset, len,
+			XFS_ALLOC_FILE_SPACE_PREALLOC);
 	if (error)
 		return error;
 	return xfs_falloc_setsize(file, new_size);
-- 
2.51.2



* [PATCH v8 2/2] xfs: add support for FALLOC_FL_WRITE_ZEROES
  2026-06-25 11:45 [PATCH v8 0/2] add FALLOC_FL_WRITE_ZEROES support to xfs Pankaj Raghav
  2026-06-25 11:45 ` [PATCH v8 1/2] xfs: add an allocation mode to xfs_alloc_file_space() Pankaj Raghav
@ 2026-06-25 11:45 ` Pankaj Raghav
  2026-06-25 17:20   ` Darrick J. Wong
  1 sibling, 1 reply; 6+ messages in thread
From: Pankaj Raghav @ 2026-06-25 11:45 UTC (permalink / raw)
  To: linux-xfs
  Cc: bfoster, lukas, Darrick J . Wong, p.raghav, dgc, gost.dev,
	Zhang Yi, pankaj.raghav, andres, kundan.kumar, hch, cem, hch

If the underlying block device supports the unmap write zeroes
operation, this flag allows users to quickly preallocate a file with
written extents that contain zeroes. This is beneficial for subsequent
overwrites as it prevents the need for unwritten-to-written extent
conversions, thereby significantly reducing metadata updates and journal
I/O overhead, improving overwrite performance.

Punch the range first so it becomes a hole, update the size via
xfs_falloc_setsize() while it is still a hole (so its xfs_zero_range()
skips it and avoids rezeroing), then convert it to written
zeroed extents. A crash between the size update and the conversion is
safe, as a hole within i_size reads back as zeroes.

Co-developed-by: Lukas Herbolt <lukas@herbolt.com>
Signed-off-by: Lukas Herbolt <lukas@herbolt.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/xfs/xfs_bmap_util.c | 19 ++++++++--
 fs/xfs/xfs_bmap_util.h |  1 +
 fs/xfs/xfs_file.c      | 78 +++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 94 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index e5424d010a69..855602cb35e8 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -643,11 +643,18 @@ xfs_free_eofblocks(
 }
 
 /*
- * Allocate space for a file according to @mode:
+ * Allocate space or convert extents for a file according to @mode:
  *
  * XFS_ALLOC_FILE_SPACE_PREALLOC:
  * Preallocate unwritten extents over holes across the range and mark the inode
  * as preallocated.
+ *
+ * XFS_ALLOC_FILE_SPACE_WRITE_ZEROES:
+ * Allocate written extents over holes and convert unwritten extents in the
+ * range to written extents, initialising both to contain zeroes.
+ *
+ * This function does not update the file size; callers that extend the file
+ * are responsible for updating it once the extents are allocated.
  */
 int
 xfs_alloc_file_space(
@@ -688,6 +695,10 @@ xfs_alloc_file_space(
 		bmapi_flags = XFS_BMAPI_PREALLOC;
 		nr_exts = XFS_IEXT_ADD_NOSPLIT_CNT;
 		break;
+	case XFS_ALLOC_FILE_SPACE_WRITE_ZEROES:
+		bmapi_flags = XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO;
+		nr_exts = XFS_IEXT_WRITE_UNWRITTEN_CNT;
+		break;
 	default:
 		return -EINVAL;
 	}
@@ -776,8 +787,10 @@ xfs_alloc_file_space(
 			allocatesize_fsb -= imapp->br_blockcount;
 		}
 
-		ip->i_diflags |= XFS_DIFLAG_PREALLOC;
-		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+		if (mode == XFS_ALLOC_FILE_SPACE_PREALLOC) {
+			ip->i_diflags |= XFS_DIFLAG_PREALLOC;
+			xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+		}
 
 		error = xfs_trans_commit(tp);
 		xfs_iunlock(ip, XFS_ILOCK_EXCL);
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index 232b4c48247e..e3d506ca9610 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -57,6 +57,7 @@ int	xfs_bmap_last_extent(struct xfs_trans *tp, struct xfs_inode *ip,
 /* preallocation and hole punch interface */
 enum xfs_alloc_file_space_mode {
 	XFS_ALLOC_FILE_SPACE_PREALLOC,
+	XFS_ALLOC_FILE_SPACE_WRITE_ZEROES,
 };
 
 int	xfs_alloc_file_space(struct xfs_inode *ip, xfs_off_t offset,
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e90ea6ebdc8e..5dcb03e6de12 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1368,6 +1368,79 @@ xfs_falloc_force_zero(
 	return XFS_TEST_ERROR(ip->i_mount, XFS_ERRTAG_FORCE_ZERO_RANGE);
 }
 
+static int
+xfs_falloc_write_zeroes(
+	struct file		*file,
+	int			mode,
+	loff_t			offset,
+	loff_t			len,
+	struct xfs_zone_alloc_ctx *ac)
+{
+	struct inode		*inode = file_inode(file);
+	struct xfs_inode	*ip = XFS_I(inode);
+	loff_t			new_size = 0;
+	int			error;
+
+	if (xfs_is_always_cow_inode(ip) ||
+	    !bdev_write_zeroes_unmap_sectors(xfs_inode_buftarg(ip)->bt_bdev))
+		return -EOPNOTSUPP;
+
+	error = xfs_falloc_newsize(file, mode, offset, len, &new_size);
+	if (error)
+		return error;
+
+	/*
+	 *
+	 *    |----------|----------|----------|----------|----------|
+	 *    ^     ^    ^                     ^     ^    ^
+	 *    |     |    |                     |     |    |
+	 *    |   offset |                     |    end   |
+	 *    |          |                     |          |
+	 * offset_rd   offset_ru              end_rd    end_ru
+	 *
+	 * xfs_free_file_space() punches the aligned interior offset_ru -> end_rd
+	 * to holes and byte-zeroes the in-range parts of the partial edge blocks,
+	 * offset -> offset_ru and end_rd -> end.  xfs_zero_range() only touches
+	 * already-written blocks here; it skips holes and unwritten extents, so
+	 * unallocated/unwritten edge blocks are left for the allocation below.
+	 */
+	error = xfs_free_file_space(ip, offset, len, ac);
+	if (error)
+		return error;
+
+	/*
+	 * Publish the new size while the punched range is still a hole, then
+	 * fill it with written zeroes.  Like the other fallocate modes we use
+	 * xfs_falloc_setsize(), but it must run *before* we convert the range
+	 * to written extents: xfs_setattr_size() zeroes [old EOF, new size) via
+	 * xfs_zero_range(), which skips holes, so there is nothing to re-zero.
+	 * It will also writeback partial EOF block before the on-disk size is
+	 * logged.
+	 * Note: extending the size before allocating means a failure below
+	 * leaves the file larger with unallocated holes in the new range.
+	 * That is safe as holes within i_size read back as zeroes and expose
+	 * no stale data while the error is propagated to the caller.
+	 */
+	error = xfs_falloc_setsize(file, new_size);
+	if (error)
+		return error;
+
+	/*
+	 * Allocate written, zeroed extents across the range.  xfs_alloc_file_space()
+	 * rounds outward to block granularity:
+	 *  - holes (the punched interior and any unallocated edge block) are
+	 *    allocated and zeroed;
+	 *  - unwritten extents (including unwritten edge blocks) are converted to
+	 *    written and zeroed;
+	 *  - Already written edge blocks are skipped. The out-of-range bytes of
+	 *    a written edge block keep their data (offset_rd -> offset and
+	 *    end -> end_ru); their in-range bytes (offset -> offset_ru and
+	 *    end_rd -> end) were already zeroed by xfs_free_file_space().
+	 */
+	return xfs_alloc_file_space(ip, offset, len,
+			XFS_ALLOC_FILE_SPACE_WRITE_ZEROES);
+}
+
 /*
  * Punch a hole and prealloc the range.  We use a hole punch rather than
  * unwritten extent conversion for two reasons:
@@ -1473,7 +1546,7 @@ xfs_falloc_allocate_range(
 		(FALLOC_FL_ALLOCATE_RANGE | FALLOC_FL_KEEP_SIZE |	\
 		 FALLOC_FL_PUNCH_HOLE |	FALLOC_FL_COLLAPSE_RANGE |	\
 		 FALLOC_FL_ZERO_RANGE |	FALLOC_FL_INSERT_RANGE |	\
-		 FALLOC_FL_UNSHARE_RANGE)
+		 FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_WRITE_ZEROES)
 
 STATIC long
 __xfs_file_fallocate(
@@ -1525,6 +1598,9 @@ __xfs_file_fallocate(
 	case FALLOC_FL_ALLOCATE_RANGE:
 		error = xfs_falloc_allocate_range(file, mode, offset, len);
 		break;
+	case FALLOC_FL_WRITE_ZEROES:
+		error = xfs_falloc_write_zeroes(file, mode, offset, len, ac);
+		break;
 	default:
 		error = -EOPNOTSUPP;
 		break;
-- 
2.51.2



* Re: [PATCH v8 1/2] xfs: add an allocation mode to xfs_alloc_file_space()
  2026-06-25 11:45 ` [PATCH v8 1/2] xfs: add an allocation mode to xfs_alloc_file_space() Pankaj Raghav
@ 2026-06-25 17:01   ` Darrick J. Wong
  0 siblings, 0 replies; 6+ messages in thread
From: Darrick J. Wong @ 2026-06-25 17:01 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: linux-xfs, bfoster, lukas, dgc, gost.dev, Zhang Yi, pankaj.raghav,
	andres, kundan.kumar, hch, cem, hch

On Thu, Jun 25, 2026 at 01:45:49PM +0200, Pankaj Raghav wrote:
> xfs_alloc_file_space() hardcodes XFS_BMAPI_PREALLOC to preallocate
> unwritten extents across a range.
> 
> In preparation for FALLOC_FL_WRITE_ZEROES, add an explicit allocation
> mode argument, enum xfs_alloc_file_space_mode, and derive the xfs_bmapi
> flags from it. The only mode for now is XFS_ALLOC_FILE_SPACE_PREALLOC,
> which preallocates unwritten extents and marks the inode as preallocated
> exactly as before, so there is no functional change.
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>

LGTM,
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_bmap_util.c | 25 +++++++++++++++++++++----
>  fs/xfs/xfs_bmap_util.h |  6 +++++-
>  fs/xfs/xfs_file.c      |  9 ++++++---
>  3 files changed, 32 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 3b9f262f8e91..e5424d010a69 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -642,11 +642,19 @@ xfs_free_eofblocks(
>  	return error;
>  }
>  
> +/*
> + * Allocate space for a file according to @mode:
> + *
> + * XFS_ALLOC_FILE_SPACE_PREALLOC:
> + * Preallocate unwritten extents over holes across the range and mark the inode
> + * as preallocated.
> + */
>  int
>  xfs_alloc_file_space(
>  	struct xfs_inode	*ip,
>  	xfs_off_t		offset,
> -	xfs_off_t		len)
> +	xfs_off_t		len,
> +	enum xfs_alloc_file_space_mode mode)
>  {
>  	xfs_mount_t		*mp = ip->i_mount;
>  	xfs_off_t		count;
> @@ -657,6 +665,7 @@ xfs_alloc_file_space(
>  	int			rt;
>  	xfs_trans_t		*tp;
>  	xfs_bmbt_irec_t		imaps[1], *imapp;
> +	uint32_t		bmapi_flags, nr_exts;
>  	int			error;
>  
>  	if (xfs_is_always_cow_inode(ip))
> @@ -674,6 +683,15 @@ xfs_alloc_file_space(
>  	if (len <= 0)
>  		return -EINVAL;
>  
> +	switch (mode) {
> +	case XFS_ALLOC_FILE_SPACE_PREALLOC:
> +		bmapi_flags = XFS_BMAPI_PREALLOC;
> +		nr_exts = XFS_IEXT_ADD_NOSPLIT_CNT;
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
>  	rt = XFS_IS_REALTIME_INODE(ip);
>  	extsz = xfs_get_extsz_hint(ip);
>  
> @@ -733,8 +751,7 @@ xfs_alloc_file_space(
>  		if (error)
>  			break;
>  
> -		error = xfs_iext_count_extend(tp, ip, XFS_DATA_FORK,
> -				XFS_IEXT_ADD_NOSPLIT_CNT);
> +		error = xfs_iext_count_extend(tp, ip, XFS_DATA_FORK, nr_exts);
>  		if (error)
>  			goto error;
>  
> @@ -748,7 +765,7 @@ xfs_alloc_file_space(
>  		 * will eventually reach the requested range.
>  		 */
>  		error = xfs_bmapi_write(tp, ip, startoffset_fsb,
> -				allocatesize_fsb, XFS_BMAPI_PREALLOC, 0, imapp,
> +				allocatesize_fsb, bmapi_flags, 0, imapp,
>  				&nimaps);
>  		if (error) {
>  			if (error != -ENOSR)
> diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
> index c477b3361630..232b4c48247e 100644
> --- a/fs/xfs/xfs_bmap_util.h
> +++ b/fs/xfs/xfs_bmap_util.h
> @@ -55,8 +55,12 @@ int	xfs_bmap_last_extent(struct xfs_trans *tp, struct xfs_inode *ip,
>  			     int *is_empty);
>  
>  /* preallocation and hole punch interface */
> +enum xfs_alloc_file_space_mode {
> +	XFS_ALLOC_FILE_SPACE_PREALLOC,
> +};
> +
>  int	xfs_alloc_file_space(struct xfs_inode *ip, xfs_off_t offset,
> -		xfs_off_t len);
> +		xfs_off_t len, enum xfs_alloc_file_space_mode mode);
>  int	xfs_free_file_space(struct xfs_inode *ip, xfs_off_t offset,
>  		xfs_off_t len, struct xfs_zone_alloc_ctx *ac);
>  int	xfs_collapse_file_space(struct xfs_inode *, xfs_off_t offset,
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 845a97c9b063..e90ea6ebdc8e 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1406,7 +1406,8 @@ xfs_falloc_zero_range(
>  		len = round_up(offset + len, blksize) -
>  			round_down(offset, blksize);
>  		offset = round_down(offset, blksize);
> -		error = xfs_alloc_file_space(ip, offset, len);
> +		error = xfs_alloc_file_space(ip, offset, len,
> +				XFS_ALLOC_FILE_SPACE_PREALLOC);
>  	}
>  	if (error)
>  		return error;
> @@ -1432,7 +1433,8 @@ xfs_falloc_unshare_range(
>  	if (error)
>  		return error;
>  
> -	error = xfs_alloc_file_space(XFS_I(inode), offset, len);
> +	error = xfs_alloc_file_space(XFS_I(inode), offset, len,
> +			XFS_ALLOC_FILE_SPACE_PREALLOC);
>  	if (error)
>  		return error;
>  	return xfs_falloc_setsize(file, new_size);
> @@ -1460,7 +1462,8 @@ xfs_falloc_allocate_range(
>  	if (error)
>  		return error;
>  
> -	error = xfs_alloc_file_space(XFS_I(inode), offset, len);
> +	error = xfs_alloc_file_space(XFS_I(inode), offset, len,
> +			XFS_ALLOC_FILE_SPACE_PREALLOC);
>  	if (error)
>  		return error;
>  	return xfs_falloc_setsize(file, new_size);
> -- 
> 2.51.2
> 
> 


* Re: [PATCH v8 2/2] xfs: add support for FALLOC_FL_WRITE_ZEROES
  2026-06-25 11:45 ` [PATCH v8 2/2] xfs: add support for FALLOC_FL_WRITE_ZEROES Pankaj Raghav
@ 2026-06-25 17:20   ` Darrick J. Wong
  2026-06-26 16:04     ` Pankaj Raghav
  0 siblings, 1 reply; 6+ messages in thread
From: Darrick J. Wong @ 2026-06-25 17:20 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: linux-xfs, bfoster, lukas, dgc, gost.dev, Zhang Yi, pankaj.raghav,
	andres, kundan.kumar, hch, cem, hch

On Thu, Jun 25, 2026 at 01:45:50PM +0200, Pankaj Raghav wrote:
> If the underlying block device supports the unmap write zeroes
> operation, this flag allows users to quickly preallocate a file with
> written extents that contain zeroes. This is beneficial for subsequent
> overwrites as it prevents the need for unwritten-to-written extent
> conversions, thereby significantly reducing metadata updates and journal
> I/O overhead, improving overwrite performance.
> 
> Punch the range first so it becomes a hole, update the size via
> xfs_falloc_setsize() while it is still a hole (so its xfs_zero_range()
> skips it and avoids rezeroing), then convert it to written
> zeroed extents. A crash between the size update and the conversion is
> safe, as a hole within i_size reads back as zeroes.
> 
> Co-developed-by: Lukas Herbolt <lukas@herbolt.com>
> Signed-off-by: Lukas Herbolt <lukas@herbolt.com>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  fs/xfs/xfs_bmap_util.c | 19 ++++++++--
>  fs/xfs/xfs_bmap_util.h |  1 +
>  fs/xfs/xfs_file.c      | 78 +++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 94 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index e5424d010a69..855602cb35e8 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -643,11 +643,18 @@ xfs_free_eofblocks(
>  }
>  
>  /*
> - * Allocate space for a file according to @mode:
> + * Allocate space or convert extents for a file according to @mode:
>   *
>   * XFS_ALLOC_FILE_SPACE_PREALLOC:
>   * Preallocate unwritten extents over holes across the range and mark the inode
>   * as preallocated.
> + *
> + * XFS_ALLOC_FILE_SPACE_WRITE_ZEROES:
> + * Allocate written extents over holes and convert unwritten extents in the
> + * range to written extents, initialising both to contain zeroes.
> + *
> + * This function does not update the file size; callers that extend the file
> + * are responsible for updating it once the extents are allocated.
>   */
>  int
>  xfs_alloc_file_space(
> @@ -688,6 +695,10 @@ xfs_alloc_file_space(
>  		bmapi_flags = XFS_BMAPI_PREALLOC;
>  		nr_exts = XFS_IEXT_ADD_NOSPLIT_CNT;
>  		break;
> +	case XFS_ALLOC_FILE_SPACE_WRITE_ZEROES:
> +		bmapi_flags = XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO;
> +		nr_exts = XFS_IEXT_WRITE_UNWRITTEN_CNT;
> +		break;
>  	default:
>  		return -EINVAL;
>  	}
> @@ -776,8 +787,10 @@ xfs_alloc_file_space(
>  			allocatesize_fsb -= imapp->br_blockcount;
>  		}
>  
> -		ip->i_diflags |= XFS_DIFLAG_PREALLOC;
> -		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> +		if (mode == XFS_ALLOC_FILE_SPACE_PREALLOC) {
> +			ip->i_diflags |= XFS_DIFLAG_PREALLOC;
> +			xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> +		}
>  
>  		error = xfs_trans_commit(tp);
>  		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
> index 232b4c48247e..e3d506ca9610 100644
> --- a/fs/xfs/xfs_bmap_util.h
> +++ b/fs/xfs/xfs_bmap_util.h
> @@ -57,6 +57,7 @@ int	xfs_bmap_last_extent(struct xfs_trans *tp, struct xfs_inode *ip,
>  /* preallocation and hole punch interface */
>  enum xfs_alloc_file_space_mode {
>  	XFS_ALLOC_FILE_SPACE_PREALLOC,
> +	XFS_ALLOC_FILE_SPACE_WRITE_ZEROES,
>  };
>  
>  int	xfs_alloc_file_space(struct xfs_inode *ip, xfs_off_t offset,
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index e90ea6ebdc8e..5dcb03e6de12 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1368,6 +1368,79 @@ xfs_falloc_force_zero(
>  	return XFS_TEST_ERROR(ip->i_mount, XFS_ERRTAG_FORCE_ZERO_RANGE);
>  }
>  
> +static int
> +xfs_falloc_write_zeroes(
> +	struct file		*file,
> +	int			mode,
> +	loff_t			offset,
> +	loff_t			len,
> +	struct xfs_zone_alloc_ctx *ac)
> +{
> +	struct inode		*inode = file_inode(file);
> +	struct xfs_inode	*ip = XFS_I(inode);
> +	loff_t			new_size = 0;
> +	int			error;
> +
> +	if (xfs_is_always_cow_inode(ip) ||
> +	    !bdev_write_zeroes_unmap_sectors(xfs_inode_buftarg(ip)->bt_bdev))
> +		return -EOPNOTSUPP;
> +
> +	error = xfs_falloc_newsize(file, mode, offset, len, &new_size);
> +	if (error)
> +		return error;
> +
> +	/*
> +	 *
> +	 *    |----------|----------|----------|----------|----------|
> +	 *    ^     ^    ^                     ^     ^    ^
> +	 *    |     |    |                     |     |    |
> +	 *    |   offset |                     |    end   |
> +	 *    |          |                     |          |
> +	 * offset_rd   offset_ru              end_rd    end_ru

Do "_rd" and "_ru" mean "round down" and "round up"?  And is that to the
fsblock size, or the allocation unit size?

> +	 *
> +	 * xfs_free_file_space() punches the aligned interior offset_ru -> end_rd
> +	 * to holes and byte-zeroes the in-range parts of the partial edge blocks,

xfs_free_file_space rounds inward to allocation unit granularity and
punches out that range; and then it writes zeroes to non-hole space
that doesn't get unmapped.

> +	 * offset -> offset_ru and end_rd -> end.  xfs_zero_range() only touches
> +	 * already-written blocks here; it skips holes and unwritten extents, so
> +	 * unallocated/unwritten edge blocks are left for the allocation below.
> +	 */
> +	error = xfs_free_file_space(ip, offset, len, ac);
> +	if (error)
> +		return error;
> +
> +	/*
> +	 * Publish the new size while the punched range is still a hole, then
> +	 * fill it with written zeroes.  Like the other fallocate modes we use
> +	 * xfs_falloc_setsize(), but it must run *before* we convert the range
> +	 * to written extents: xfs_setattr_size() zeroes [old EOF, new size) via
> +	 * xfs_zero_range(), which skips holes, so there is nothing to re-zero.
> +	 * It will also writeback partial EOF block before the on-disk size is
> +	 * logged.
> +	 * Note: extending the size before allocating means a failure below
> +	 * leaves the file larger with unallocated holes in the new range.
> +	 * That is safe as holes within i_size read back as zeroes and expose
> +	 * no stale data while the error is propagated to the caller.
> +	 */
> +	error = xfs_falloc_setsize(file, new_size);
> +	if (error)
> +		return error;

Hrm ok so now that we've punched out some blocks and zeroed the rest,
now we adjust the file size, which should only entail committing the new
file size to disk...

> +
> +	/*
> +	 * Allocate written, zeroed extents across the range.  xfs_alloc_file_space()
> +	 * rounds outward to block granularity:
> +	 *  - holes (the punched interior and any unallocated edge block) are
> +	 *    allocated and zeroed;
> +	 *  - unwritten extents (including unwritten edge blocks) are converted to
> +	 *    written and zeroed;
> +	 *  - Already written edge blocks are skipped. The out-of-range bytes of
> +	 *    a written edge block keep their data (offset_rd -> offset and
> +	 *    end -> end_ru); their in-range bytes (offset -> offset_ru and
> +	 *    end_rd -> end) were already zeroed by xfs_free_file_space().
> +	 */
> +	return xfs_alloc_file_space(ip, offset, len,
> +			XFS_ALLOC_FILE_SPACE_WRITE_ZEROES);

...and now we can just do an accelerated "write zeroes to disk" which is
conveniently always within EOF now.  I /think/ this looks ok to me now,
though I'm curious how extensively the new fallocate mode has been
tested with fsx and unaligned file ranges?  And rt volumes with rt
extent size > 1 fsblock.
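
The ordering described above (punch, then set the size, then write zeroes) can
be modelled with a toy userspace sketch. This is not the kernel
implementation: struct toy_file and the toy_*() helpers are hypothetical, the
block size is fixed at 4096 bytes, and byte-level zeroing of partial edge
blocks, transactions and locking are all left out. It only tracks the
per-block extent state.

```c
#include <assert.h>
#include <stddef.h>

enum blk_state { BLK_HOLE, BLK_UNWRITTEN, BLK_WRITTEN };

#define NBLKS	8
#define BLKSZ	4096UL

/* Hypothetical model: one extent state per block plus the file size. */
struct toy_file {
	enum blk_state	state[NBLKS];
	unsigned long	size;		/* i_size in bytes */
};

/* Step 1: punch only the whole blocks inside [off, end). */
static void toy_punch(struct toy_file *f, unsigned long off, unsigned long end)
{
	size_t first = (off + BLKSZ - 1) / BLKSZ;	/* offset_ru */
	size_t last = end / BLKSZ;			/* end_rd */

	for (size_t b = first; b < last && b < NBLKS; b++)
		f->state[b] = BLK_HOLE;
}

/* Step 2: publish the new size while the interior is still a hole. */
static void toy_setsize(struct toy_file *f, unsigned long new_size)
{
	if (new_size > f->size)
		f->size = new_size;
}

/*
 * Step 3: every block touched by [off, end) ends up written. Holes and
 * unwritten blocks are changed; already written blocks are skipped.
 * Returns the number of blocks whose state changed.
 */
static size_t toy_write_zeroes(struct toy_file *f, unsigned long off,
			       unsigned long end)
{
	size_t first = off / BLKSZ;			/* offset_rd */
	size_t last = (end + BLKSZ - 1) / BLKSZ;	/* end_ru */
	size_t changed = 0;

	for (size_t b = first; b < last && b < NBLKS; b++) {
		if (f->state[b] != BLK_WRITTEN) {
			f->state[b] = BLK_WRITTEN;
			changed++;
		}
	}
	return changed;
}

static size_t toy_falloc_write_zeroes(struct toy_file *f, unsigned long off,
				      unsigned long len)
{
	unsigned long end = off + len;

	assert(len > 0 && end <= NBLKS * BLKSZ);
	toy_punch(f, off, end);
	toy_setsize(f, end);
	return toy_write_zeroes(f, off, end);
}
```

In this model, a failure between steps 2 and 3 would leave holes inside the
new size, which is the state the patch comment argues is safe because holes
read back as zeroes.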

--D

> +}
> +
>  /*
>   * Punch a hole and prealloc the range.  We use a hole punch rather than
>   * unwritten extent conversion for two reasons:
> @@ -1473,7 +1546,7 @@ xfs_falloc_allocate_range(
>  		(FALLOC_FL_ALLOCATE_RANGE | FALLOC_FL_KEEP_SIZE |	\
>  		 FALLOC_FL_PUNCH_HOLE |	FALLOC_FL_COLLAPSE_RANGE |	\
>  		 FALLOC_FL_ZERO_RANGE |	FALLOC_FL_INSERT_RANGE |	\
> -		 FALLOC_FL_UNSHARE_RANGE)
> +		 FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_WRITE_ZEROES)
>  
>  STATIC long
>  __xfs_file_fallocate(
> @@ -1525,6 +1598,9 @@ __xfs_file_fallocate(
>  	case FALLOC_FL_ALLOCATE_RANGE:
>  		error = xfs_falloc_allocate_range(file, mode, offset, len);
>  		break;
> +	case FALLOC_FL_WRITE_ZEROES:
> +		error = xfs_falloc_write_zeroes(file, mode, offset, len, ac);
> +		break;
>  	default:
>  		error = -EOPNOTSUPP;
>  		break;
> -- 
> 2.51.2
> 
> 


* Re: [PATCH v8 2/2] xfs: add support for FALLOC_FL_WRITE_ZEROES
  2026-06-25 17:20   ` Darrick J. Wong
@ 2026-06-26 16:04     ` Pankaj Raghav
  0 siblings, 0 replies; 6+ messages in thread
From: Pankaj Raghav @ 2026-06-26 16:04 UTC (permalink / raw)
  To: Darrick J. Wong, Pankaj Raghav
  Cc: linux-xfs, bfoster, lukas, dgc, gost.dev, Zhang Yi, andres,
	kundan.kumar, hch, cem, hch

>> +	error = xfs_falloc_newsize(file, mode, offset, len, &new_size);
>> +	if (error)
>> +		return error;
>> +
>> +	/*
>> +	 *
>> +	 *    |----------|----------|----------|----------|----------|
>> +	 *    ^     ^    ^                     ^     ^    ^
>> +	 *    |     |    |                     |     |    |
>> +	 *    |   offset |                     |    end   |
>> +	 *    |          |                     |          |
>> +	 * offset_rd   offset_ru              end_rd    end_ru
> 
> Do "_rd" and "_ru" mean "round down" and "round up"?  And is that to the
> fsblock size, or the allocation unit size?
> 
Hmm, until now I was thinking of the fs block size, but looking at the function again,
we change the rounding if it is a realtime file:

	startoffset_fsb = XFS_B_TO_FSB(mp, offset);
	endoffset_fsb = XFS_B_TO_FSBT(mp, offset + len);

	/* We can only free complete realtime extents. */
	if (xfs_inode_has_bigrtalloc(ip)) {
		startoffset_fsb = xfs_fileoff_roundup_rtx(mp, startoffset_fsb);
		endoffset_fsb = xfs_fileoff_rounddown_rtx(mp, endoffset_fsb);
	}
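
To make the effect of the rounding unit concrete, here is a small userspace
sketch (not kernel code) of the two roundings discussed in this thread: the
inward rounding that bounds the punched interior, and the outward rounding
that bounds the allocation. The names struct span, punch_interior() and
alloc_cover() are hypothetical, and "unit" is an assumed parameter standing
for whichever granularity applies (fs block size, or the rt extent size on a
realtime file with larger rt extents).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical half-open byte range [start, end). */
struct span {
	uint64_t start;
	uint64_t end;
};

static uint64_t round_down_u64(uint64_t x, uint64_t unit)
{
	return x - (x % unit);
}

static uint64_t round_up_u64(uint64_t x, uint64_t unit)
{
	uint64_t rem = x % unit;

	return rem ? x + (unit - rem) : x;
}

/* Inward rounding: the whole units that can be punched (may be empty). */
static struct span punch_interior(uint64_t offset, uint64_t len, uint64_t unit)
{
	struct span s;

	assert(unit > 0 && len > 0);
	s.start = round_up_u64(offset, unit);		/* offset_ru */
	s.end = round_down_u64(offset + len, unit);	/* end_rd */
	if (s.end < s.start)
		s.end = s.start;			/* nothing to punch */
	return s;
}

/* Outward rounding: every unit touched by the byte range. */
static struct span alloc_cover(uint64_t offset, uint64_t len, uint64_t unit)
{
	struct span s;

	assert(unit > 0 && len > 0);
	s.start = round_down_u64(offset, unit);		/* offset_rd */
	s.end = round_up_u64(offset + len, unit);	/* end_ru */
	return s;
}
```

For offset = 5000 and len = 10000 with a 4096-byte unit, the interior is
[8192, 12288) and the cover is [4096, 16384). With an 8192-byte unit the
interior is empty ([8192, 8192)) while the cover grows to [0, 16384), so
nothing is punched and the whole range is handled as edge blocks.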

>> +	 *
>> +	 * xfs_free_file_space() punches the aligned interior offset_ru -> end_rd
>> +	 * to holes and byte-zeroes the in-range parts of the partial edge blocks,
> 
> xfs_free_file_space rounds inward to allocation unit granularity and
> punches out that range; and then it writes zeroes to non-hole space
> that doesn't get unmapped.
> 
>> +	 * offset -> offset_ru and end_rd -> end.  xfs_zero_range() only touches
>> +	 * already-written blocks here; it skips holes and unwritten extents, so
>> +	 * unallocated/unwritten edge blocks are left for the allocation below.
>> +	 */
>> +	error = xfs_free_file_space(ip, offset, len, ac);
>> +	if (error)
>> +		return error;
>> +
>> +	/*
>> +	 * Publish the new size while the punched range is still a hole, then
>> +	 * fill it with written zeroes.  Like the other fallocate modes we use
>> +	 * xfs_falloc_setsize(), but it must run *before* we convert the range
>> +	 * to written extents: xfs_setattr_size() zeroes [old EOF, new size) via
>> +	 * xfs_zero_range(), which skips holes, so there is nothing to re-zero.
>> +	 * It will also writeback partial EOF block before the on-disk size is
>> +	 * logged.
>> +	 * Note: extending the size before allocating means a failure below
>> +	 * leaves the file larger with unallocated holes in the new range.
>> +	 * That is safe as holes within i_size read back as zeroes and expose
>> +	 * no stale data while the error is propagated to the caller.
>> +	 */
>> +	error = xfs_falloc_setsize(file, new_size);
>> +	if (error)
>> +		return error;
> 
> Hrm ok so now that we've punched out some blocks and zeroed the rest,
> now we adjust the file size, which should only entail committing the new
> file size to disk...
> 
>> +
>> +	/*
>> +	 * Allocate written, zeroed extents across the range.  xfs_alloc_file_space()
>> +	 * rounds outward to block granularity:
>> +	 *  - holes (the punched interior and any unallocated edge block) are
>> +	 *    allocated and zeroed;
>> +	 *  - unwritten extents (including unwritten edge blocks) are converted to
>> +	 *    written and zeroed;
>> +	 *  - Already written edge blocks are skipped. The out-of-range bytes of
>> +	 *    a written edge block keep their data (offset_rd -> offset and
>> +	 *    end -> end_ru); their in-range bytes (offset -> offset_ru and
>> +	 *    end_rd -> end) were already zeroed by xfs_free_file_space().
>> +	 */
>> +	return xfs_alloc_file_space(ip, offset, len,
>> +			XFS_ALLOC_FILE_SPACE_WRITE_ZEROES);
> 
> ...and now we can just do an accelerated "write zeroes to disk" which is
> conveniently always within EOF now.  I /think/ this looks ok to me now,
> though I'm curious how extensively the new fallocate mode has been
> tested with fsx and unaligned file ranges?  And rt volumes with rt
> extent size > 1 fsblock.
> 

I tested it extensively with fsx using fsblock-sized allocation units, and I added some tests
locally (which I will send upstream soon) for unaligned edges. I found some of the corner cases
thanks to an fsx test (generic/363). But I didn't run it for all the profiles. I will also test
it with `-r extsize=8k -b size=4k`.

Thanks for the review, Darrick.

--
Pankaj


