* [PATCH v5 0/2] add FALLOC_FL_WRITE_ZEROES support to xfs
@ 2026-06-04 10:14 Pankaj Raghav
2026-06-04 10:14 ` [PATCH v5 1/2] xfs: add an allocation mode to xfs_alloc_file_space() Pankaj Raghav
2026-06-04 10:14 ` [PATCH v5 2/2] xfs: add support for FALLOC_FL_WRITE_ZEROES Pankaj Raghav
0 siblings, 2 replies; 5+ messages in thread
From: Pankaj Raghav @ 2026-06-04 10:14 UTC (permalink / raw)
To: linux-xfs
Cc: bfoster, lukas, Darrick J . Wong, p.raghav, dgc, gost.dev,
pankaj.raghav, andres, kundan.kumar, hch, cem, hch
The benefits of FALLOC_FL_WRITE_ZEROES were already discussed as part
of Zhang Yi's initial patches[1]. Postgres developer Andres also
mentioned they would like to use this feature in Postgres [2].
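For readers who have not used the flag from userspace, a minimal sketch of
a caller follows. It is only an illustration: prealloc_zeroed() is a
hypothetical helper, the fallback to posix_fallocate() is one possible
policy rather than anything this series mandates, and the 0x80 value is an
assumption for headers that predate the flag.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/* Assumption: value as in recent uapi headers; older headers lack the flag. */
#ifndef FALLOC_FL_WRITE_ZEROES
#define FALLOC_FL_WRITE_ZEROES 0x80
#endif

/*
 * Hypothetical helper: ask for zeroed, written extents over [off, off + len).
 *
 * Returns 1 if FALLOC_FL_WRITE_ZEROES succeeded, 0 if the filesystem or
 * device rejected the flag and plain preallocation was used instead, and
 * -1 with errno set on any other failure.
 */
static int prealloc_zeroed(int fd, off_t off, off_t len)
{
	int rc;

	if (fallocate(fd, FALLOC_FL_WRITE_ZEROES, off, len) == 0)
		return 1;

	/* Only fall back when the flag itself is not supported. */
	if (errno != EOPNOTSUPP && errno != EINVAL)
		return -1;

	/* posix_fallocate() returns the error number instead of setting errno. */
	rc = posix_fallocate(fd, off, len);
	if (rc != 0) {
		errno = rc;
		return -1;
	}
	return 0;
}
```

A caller such as a database extending a data file could use the return
value to decide whether later overwrites still pay for
unwritten-to-written conversion.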
I tested the changes with fsstress and fsx based on the xfstests patch I
sent recently to test this flag[4]. generic/363 helped me debug the
crash I noticed when I did the initial implementation[3].
Dave initially suggested creating a common helper based on
xfs_iomap_convert_unwritten(), but as can be seen in the previous
version, much of the code had to be rewritten. The changes had more in
common with xfs_alloc_file_space(), so this version reuses
xfs_alloc_file_space() for write zeroes.
Thanks to Christoph for all the review comments and design suggestions
that were made both offline and online for this series.
The stress tests generic/363, generic/127 and xfs/131 are passing. I have
started a full xfstests run for this series.
Changes since v4:
- Introduce an enum for allocation mode in xfs_alloc_file_space (Christoph)
- Use xfs_setfilesize() instead of updating the on-disk size in the
function.
Changes since v3:
- Introduce xfs_bmap_alloc_or_convert_range() in xfs_iomap.c to make
review easier (Christoph)
- Add extsz hint and rt support in xfs_bmap_alloc_or_convert_range()
Changes since v2:
- Add allow_write_zeroes to xfs_global so that we can enable this
feature independent of the HW underneath.
Changes since v1 [5.1 5.2]:
- Added a new function xfs_bmap_alloc_or_convert_range() based on Dave's
feedback.
- Changed xfs_falloc_write_zeroes() to use
xfs_bmap_alloc_or_convert_range() instead of the prealloc-and-convert
approach.
[1] https://lore.kernel.org/linux-fsdevel/20250619111806.3546162-1-yi.zhang@huaweicloud.com/
[2] https://lore.kernel.org/linux-fsdevel/20260217055103.GA6174@lst.de/T/#m7935b9bab32bb5ff372507f84803b8753ad1c814
[3] https://lore.kernel.org/linux-xfs/6i2jvzn3lyugjlbgmjzpped3gogzyqv5mpe2uqaifz4vjpaega@pomzoq7ley77/
[4] https://lore.kernel.org/linux-xfs/20260312195308.738189-1-p.raghav@samsung.com/
[5.1] https://lore.kernel.org/linux-xfs/20260309180708.427553-2-lukas@herbolt.com/
[5.2] https://lore.kernel.org/linux-xfs/abC1LvRElctaHPe5@dread/
Pankaj Raghav (2):
xfs: add an allocation mode to xfs_alloc_file_space()
xfs: add support for FALLOC_FL_WRITE_ZEROES
fs/xfs/xfs_bmap_util.c | 42 +++++++++++++++++++----
fs/xfs/xfs_bmap_util.h | 7 +++-
fs/xfs/xfs_file.c | 75 +++++++++++++++++++++++++++++++++++++++---
3 files changed, 113 insertions(+), 11 deletions(-)
base-commit: 184888e159ef82423987f348202d74a0b0dc4138
--
2.51.2
* [PATCH v5 1/2] xfs: add an allocation mode to xfs_alloc_file_space()
2026-06-04 10:14 [PATCH v5 0/2] add FALLOC_FL_WRITE_ZEROES support to xfs Pankaj Raghav
@ 2026-06-04 10:14 ` Pankaj Raghav
2026-06-04 10:14 ` [PATCH v5 2/2] xfs: add support for FALLOC_FL_WRITE_ZEROES Pankaj Raghav
1 sibling, 0 replies; 5+ messages in thread
From: Pankaj Raghav @ 2026-06-04 10:14 UTC (permalink / raw)
To: linux-xfs
Cc: bfoster, lukas, Darrick J . Wong, p.raghav, dgc, gost.dev,
pankaj.raghav, andres, kundan.kumar, hch, cem, hch
xfs_alloc_file_space() hardcodes XFS_BMAPI_PREALLOC to preallocate
unwritten extents across a range.
In preparation for FALLOC_FL_WRITE_ZEROES, add an explicit allocation
mode argument, enum xfs_alloc_file_space_mode, and derive the xfs_bmapi
flags from it. The only mode for now is XFS_ALLOC_FILE_SPACE_PREALLOC,
which preallocates unwritten extents and marks the inode as preallocated
exactly as before, so there is no functional change.
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
fs/xfs/xfs_bmap_util.c | 25 +++++++++++++++++++++----
fs/xfs/xfs_bmap_util.h | 6 +++++-
fs/xfs/xfs_file.c | 9 ++++++---
3 files changed, 32 insertions(+), 8 deletions(-)
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 0ab00615f1ad..7466267f6c60 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -642,11 +642,19 @@ xfs_free_eofblocks(
return error;
}
+/*
+ * Allocate space for a file according to @mode:
+ *
+ * XFS_ALLOC_FILE_SPACE_PREALLOC:
+ * Preallocate unwritten extents across the range and mark the inode as
+ * preallocated.
+ */
int
xfs_alloc_file_space(
struct xfs_inode *ip,
xfs_off_t offset,
- xfs_off_t len)
+ xfs_off_t len,
+ enum xfs_alloc_file_space_mode mode)
{
xfs_mount_t *mp = ip->i_mount;
xfs_off_t count;
@@ -657,6 +665,7 @@ xfs_alloc_file_space(
int rt;
xfs_trans_t *tp;
xfs_bmbt_irec_t imaps[1], *imapp;
+ uint32_t bmapi_flags, nr_exts;
int error;
if (xfs_is_always_cow_inode(ip))
@@ -674,6 +683,15 @@ xfs_alloc_file_space(
if (len <= 0)
return -EINVAL;
+ switch (mode) {
+ case XFS_ALLOC_FILE_SPACE_PREALLOC:
+ bmapi_flags = XFS_BMAPI_PREALLOC;
+ nr_exts = XFS_IEXT_ADD_NOSPLIT_CNT;
+ break;
+ default:
+ return -EINVAL;
+ }
+
rt = XFS_IS_REALTIME_INODE(ip);
extsz = xfs_get_extsz_hint(ip);
@@ -733,8 +751,7 @@ xfs_alloc_file_space(
if (error)
break;
- error = xfs_iext_count_extend(tp, ip, XFS_DATA_FORK,
- XFS_IEXT_ADD_NOSPLIT_CNT);
+ error = xfs_iext_count_extend(tp, ip, XFS_DATA_FORK, nr_exts);
if (error)
goto error;
@@ -748,7 +765,7 @@ xfs_alloc_file_space(
* will eventually reach the requested range.
*/
error = xfs_bmapi_write(tp, ip, startoffset_fsb,
- allocatesize_fsb, XFS_BMAPI_PREALLOC, 0, imapp,
+ allocatesize_fsb, bmapi_flags, 0, imapp,
&nimaps);
if (error) {
if (error != -ENOSR)
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index c477b3361630..232b4c48247e 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -55,8 +55,12 @@ int xfs_bmap_last_extent(struct xfs_trans *tp, struct xfs_inode *ip,
int *is_empty);
/* preallocation and hole punch interface */
+enum xfs_alloc_file_space_mode {
+ XFS_ALLOC_FILE_SPACE_PREALLOC,
+};
+
int xfs_alloc_file_space(struct xfs_inode *ip, xfs_off_t offset,
- xfs_off_t len);
+ xfs_off_t len, enum xfs_alloc_file_space_mode mode);
int xfs_free_file_space(struct xfs_inode *ip, xfs_off_t offset,
xfs_off_t len, struct xfs_zone_alloc_ctx *ac);
int xfs_collapse_file_space(struct xfs_inode *, xfs_off_t offset,
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 845a97c9b063..e90ea6ebdc8e 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1406,7 +1406,8 @@ xfs_falloc_zero_range(
len = round_up(offset + len, blksize) -
round_down(offset, blksize);
offset = round_down(offset, blksize);
- error = xfs_alloc_file_space(ip, offset, len);
+ error = xfs_alloc_file_space(ip, offset, len,
+ XFS_ALLOC_FILE_SPACE_PREALLOC);
}
if (error)
return error;
@@ -1432,7 +1433,8 @@ xfs_falloc_unshare_range(
if (error)
return error;
- error = xfs_alloc_file_space(XFS_I(inode), offset, len);
+ error = xfs_alloc_file_space(XFS_I(inode), offset, len,
+ XFS_ALLOC_FILE_SPACE_PREALLOC);
if (error)
return error;
return xfs_falloc_setsize(file, new_size);
@@ -1460,7 +1462,8 @@ xfs_falloc_allocate_range(
if (error)
return error;
- error = xfs_alloc_file_space(XFS_I(inode), offset, len);
+ error = xfs_alloc_file_space(XFS_I(inode), offset, len,
+ XFS_ALLOC_FILE_SPACE_PREALLOC);
if (error)
return error;
return xfs_falloc_setsize(file, new_size);
--
2.51.2
* [PATCH v5 2/2] xfs: add support for FALLOC_FL_WRITE_ZEROES
2026-06-04 10:14 [PATCH v5 0/2] add FALLOC_FL_WRITE_ZEROES support to xfs Pankaj Raghav
2026-06-04 10:14 ` [PATCH v5 1/2] xfs: add an allocation mode to xfs_alloc_file_space() Pankaj Raghav
@ 2026-06-04 10:14 ` Pankaj Raghav
2026-06-08 10:20 ` Pankaj Raghav
1 sibling, 1 reply; 5+ messages in thread
From: Pankaj Raghav @ 2026-06-04 10:14 UTC (permalink / raw)
To: linux-xfs
Cc: bfoster, lukas, Darrick J . Wong, p.raghav, dgc, gost.dev,
pankaj.raghav, andres, kundan.kumar, hch, cem, hch
If the underlying block device supports the unmap write zeroes
operation, this flag allows users to quickly preallocate a file with
written extents that contain zeroes. This is beneficial for subsequent
overwrites as it prevents the need for unwritten-to-written extent
conversions, thereby significantly reducing metadata updates and journal
I/O overhead, improving overwrite performance.
Co-developed-by: Lukas Herbolt <lukas@herbolt.com>
Signed-off-by: Lukas Herbolt <lukas@herbolt.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
fs/xfs/xfs_bmap_util.c | 19 ++++++++++--
fs/xfs/xfs_bmap_util.h | 1 +
fs/xfs/xfs_file.c | 66 +++++++++++++++++++++++++++++++++++++++++-
3 files changed, 82 insertions(+), 4 deletions(-)
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 7466267f6c60..d175a1057f13 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -643,11 +643,18 @@ xfs_free_eofblocks(
}
/*
- * Allocate space for a file according to @mode:
+ * Allocate space or convert extents for a file according to @mode:
*
* XFS_ALLOC_FILE_SPACE_PREALLOC:
* Preallocate unwritten extents across the range and mark the inode as
* preallocated.
+ *
+ * XFS_ALLOC_FILE_SPACE_WRITE_ZEROES:
+ * Allocate written extents over holes and convert unwritten extents in the
+ * range to written extents, initialising both to contain zeroes.
+ *
+ * This function does not update the file size; callers that extend the file
+ * are responsible for updating it once the extents are allocated.
*/
int
xfs_alloc_file_space(
@@ -688,6 +695,10 @@ xfs_alloc_file_space(
bmapi_flags = XFS_BMAPI_PREALLOC;
nr_exts = XFS_IEXT_ADD_NOSPLIT_CNT;
break;
+ case XFS_ALLOC_FILE_SPACE_WRITE_ZEROES:
+ bmapi_flags = XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO;
+ nr_exts = XFS_IEXT_WRITE_UNWRITTEN_CNT;
+ break;
default:
return -EINVAL;
}
@@ -776,8 +787,10 @@ xfs_alloc_file_space(
allocatesize_fsb -= imapp->br_blockcount;
}
- ip->i_diflags |= XFS_DIFLAG_PREALLOC;
- xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+ if (mode == XFS_ALLOC_FILE_SPACE_PREALLOC) {
+ ip->i_diflags |= XFS_DIFLAG_PREALLOC;
+ xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+ }
error = xfs_trans_commit(tp);
xfs_iunlock(ip, XFS_ILOCK_EXCL);
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index 232b4c48247e..e3d506ca9610 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -57,6 +57,7 @@ int xfs_bmap_last_extent(struct xfs_trans *tp, struct xfs_inode *ip,
/* preallocation and hole punch interface */
enum xfs_alloc_file_space_mode {
XFS_ALLOC_FILE_SPACE_PREALLOC,
+ XFS_ALLOC_FILE_SPACE_WRITE_ZEROES,
};
int xfs_alloc_file_space(struct xfs_inode *ip, xfs_off_t offset,
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e90ea6ebdc8e..37623baaaed6 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1368,6 +1368,67 @@ xfs_falloc_force_zero(
return XFS_TEST_ERROR(ip->i_mount, XFS_ERRTAG_FORCE_ZERO_RANGE);
}
+static int
+xfs_falloc_write_zeroes(
+ struct file *file,
+ int mode,
+ loff_t offset,
+ loff_t len,
+ struct xfs_zone_alloc_ctx *ac)
+{
+ struct inode *inode = file_inode(file);
+ struct xfs_inode *ip = XFS_I(inode);
+ loff_t new_size = 0;
+ loff_t old_size = XFS_ISIZE(ip);
+ int error;
+ unsigned int blksize = i_blocksize(inode);
+ loff_t offset_aligned = round_down(offset, blksize);
+ bool did_zero;
+
+ if (xfs_is_always_cow_inode(ip) ||
+ !bdev_write_zeroes_unmap_sectors(xfs_inode_buftarg(ip)->bt_bdev))
+ return -EOPNOTSUPP;
+
+ error = xfs_falloc_newsize(file, mode, offset, len, &new_size);
+ if (error)
+ return error;
+
+ error = xfs_free_file_space(ip, offset, len, ac);
+ if (error)
+ return error;
+
+ /*
+ * Zero the tail of the old EOF block and any space up to the new
+ * offset.
+ * In the usual truncate path, xfs_falloc_setsize takes care of
+ * zeroing those blocks.
+ */
+ if (offset_aligned > old_size) {
+ trace_xfs_zero_eof(ip, old_size, offset_aligned - old_size);
+ error = xfs_zero_range(ip, old_size, offset_aligned - old_size,
+ NULL, &did_zero);
+ if (error)
+ return error;
+
+ }
+
+ error = xfs_alloc_file_space(ip, offset, len,
+ XFS_ALLOC_FILE_SPACE_WRITE_ZEROES);
+ if (error)
+ return error;
+
+ /*
+ * xfs_falloc_setsize() would re-zero the written extents via
+ * iomap_zero_range(). Use xfs_setfilesize() instead.
+ * Update in-core i_size first as xfs_setfilesize() clamps the on-disk
+ * size to it.
+ */
+ if (new_size > i_size_read(inode))
+ i_size_write(inode, new_size);
+
+ return xfs_setfilesize(ip, offset, len);
+}
+
/*
* Punch a hole and prealloc the range. We use a hole punch rather than
* unwritten extent conversion for two reasons:
@@ -1473,7 +1534,7 @@ xfs_falloc_allocate_range(
(FALLOC_FL_ALLOCATE_RANGE | FALLOC_FL_KEEP_SIZE | \
FALLOC_FL_PUNCH_HOLE | FALLOC_FL_COLLAPSE_RANGE | \
FALLOC_FL_ZERO_RANGE | FALLOC_FL_INSERT_RANGE | \
- FALLOC_FL_UNSHARE_RANGE)
+ FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_WRITE_ZEROES)
STATIC long
__xfs_file_fallocate(
@@ -1525,6 +1586,9 @@ __xfs_file_fallocate(
case FALLOC_FL_ALLOCATE_RANGE:
error = xfs_falloc_allocate_range(file, mode, offset, len);
break;
+ case FALLOC_FL_WRITE_ZEROES:
+ error = xfs_falloc_write_zeroes(file, mode, offset, len, ac);
+ break;
default:
error = -EOPNOTSUPP;
break;
--
2.51.2
* Re: [PATCH v5 2/2] xfs: add support for FALLOC_FL_WRITE_ZEROES
2026-06-04 10:14 ` [PATCH v5 2/2] xfs: add support for FALLOC_FL_WRITE_ZEROES Pankaj Raghav
@ 2026-06-08 10:20 ` Pankaj Raghav
2026-06-10 5:56 ` Christoph Hellwig
0 siblings, 1 reply; 5+ messages in thread
From: Pankaj Raghav @ 2026-06-08 10:20 UTC (permalink / raw)
To: Pankaj Raghav, linux-xfs, dgc
Cc: bfoster, lukas, Darrick J . Wong, gost.dev, andres, kundan.kumar,
hch, cem, hch, pankaj.raghav
> +
> + error = xfs_alloc_file_space(ip, offset, len,
> + XFS_ALLOC_FILE_SPACE_WRITE_ZEROES);
> + if (error)
> + return error;
> +
> + /*
> + * xfs_falloc_setsize() would re-zero the written extents via
> + * iomap_zero_range(). Use xfs_setfilesize() instead.
> + * Update in-core i_size first as xfs_setfilesize() clamps the on-disk
> + * size to it.
> + */
> + if (new_size > i_size_read(inode))
> + i_size_write(inode, new_size);
> +
> + return xfs_setfilesize(ip, offset, len);
Sashiko reported:
On 32-bit systems where size_t is 32 bits, lengths exceeding 4GB will be
truncated, which might cause the on-disk inode size to be permanently updated
to a severely incorrect smaller size.
So a simple fix would be the following as we already store the value of offset + len locally:
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 37623baaaed6..86fae2190c24 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1426,7 +1426,7 @@ xfs_falloc_write_zeroes(
if (new_size > i_size_read(inode))
i_size_write(inode, new_size);
- return xfs_setfilesize(ip, offset, len);
+ return xfs_setfilesize(ip, new_size, 0);
}
I will probably update xfs_setfilesize() to take a 64-bit len in a separate series.
I will also wait to see whether others have comments before sending the next version.
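To make the reported truncation concrete, here is a small standalone model.
It is not the kernel code: both functions are hypothetical, and uint32_t
stands in for size_t as it would be on a 32-bit system. It only shows why
passing the precomputed end as the offset with a zero length sidesteps the
narrow length argument.

```c
#include <stdint.h>

/*
 * Model of an end-of-range computation whose length parameter is only
 * 32 bits wide. Bits above bit 31 of len are silently dropped.
 */
static int64_t end_with_32bit_len(int64_t offset, int64_t len)
{
	uint32_t narrow_len = (uint32_t)len;

	return offset + (int64_t)narrow_len;
}

/*
 * Model of the proposed call shape: the end is computed in 64 bits by the
 * caller and passed as the offset, so the narrow length is always 0.
 */
static int64_t end_with_precomputed_size(int64_t offset, int64_t len)
{
	int64_t new_size = offset + len;
	uint32_t narrow_len = 0;

	return new_size + (int64_t)narrow_len;
}
```

With offset 0 and a 5 GiB length, the first model yields 1 GiB while the
second yields the full 5 GiB. For lengths below 4 GiB the two agree.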
@Dave: You were against the initial design[1]. Let me know your thoughts on the current version.
[1] https://lore.kernel.org/linux-xfs/abCzhDSVmFx4PtWI@dread/
--
Pankaj
* Re: [PATCH v5 2/2] xfs: add support for FALLOC_FL_WRITE_ZEROES
2026-06-08 10:20 ` Pankaj Raghav
@ 2026-06-10 5:56 ` Christoph Hellwig
0 siblings, 0 replies; 5+ messages in thread
From: Christoph Hellwig @ 2026-06-10 5:56 UTC (permalink / raw)
To: Pankaj Raghav
Cc: Pankaj Raghav, linux-xfs, dgc, bfoster, lukas, Darrick J . Wong,
gost.dev, andres
On Mon, Jun 08, 2026 at 12:20:29PM +0200, Pankaj Raghav wrote:
> Sashiko reported:
> On 32-bit systems where size_t is 32 bits, lengths exceeding 4GB will be
> truncated, which might cause the on-disk inode size to be permanently updated
> to a severely incorrect smaller size.
>
> So a simple fix would be the following as we already store the value of offset + len locally:
xfs_setfilesize only cares about offset and len vs the final size
for tracing, so this works fine. Not sure if we care about the exact
tracing here. The other option would be to extend the size of the len
argument.