* [PATCH v5 0/2] add FALLOC_FL_WRITE_ZEROES support to xfs
@ 2026-06-04 10:14 Pankaj Raghav
2026-06-04 10:14 ` [PATCH v5 1/2] xfs: add an allocation mode to xfs_alloc_file_space() Pankaj Raghav
2026-06-04 10:14 ` [PATCH v5 2/2] xfs: add support for FALLOC_FL_WRITE_ZEROES Pankaj Raghav
0 siblings, 2 replies; 4+ messages in thread
From: Pankaj Raghav @ 2026-06-04 10:14 UTC
To: linux-xfs
Cc: bfoster, lukas, Darrick J . Wong, p.raghav, dgc, gost.dev,
pankaj.raghav, andres, kundan.kumar, hch, cem, hch
The benefits of FALLOC_FL_WRITE_ZEROES were already discussed as part
of Zhang Yi's initial patches[1]. Postgres developer Andres also
mentioned that they would like to use this feature in Postgres [2].
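For illustration, a minimal userspace sketch of how an application might
request written, zeroed extents and fall back when the flag is not
available. The helper name zero_fill_range() is hypothetical, the
fallback define of 0x80 is an assumption for toolchains whose headers
predate the flag, and the set of errno values treated as "not supported"
is likewise an assumption rather than a documented contract.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: fallback value for headers that do not define the flag yet. */
#ifndef FALLOC_FL_WRITE_ZEROES
#define FALLOC_FL_WRITE_ZEROES 0x80
#endif

/*
 * Hypothetical helper: make [offset, offset + len) written and zeroed.
 * Returns 1 if fallocate() did the work, 0 if the manual fallback was
 * used, and -1 on error with errno set.
 */
static int zero_fill_range(int fd, off_t offset, off_t len)
{
	static const char zeroes[4096];
	off_t done = 0;

	assert(offset >= 0 && len > 0);

	if (fallocate(fd, FALLOC_FL_WRITE_ZEROES, offset, len) == 0)
		return 1;
	/* Assumption: these errno values mean "flag not usable here". */
	if (errno != EOPNOTSUPP && errno != EINVAL && errno != ENOSYS)
		return -1;

	/* Fallback: write the zeroes ourselves, one chunk at a time. */
	while (done < len) {
		size_t chunk = sizeof(zeroes);
		ssize_t ret;

		if ((off_t)chunk > len - done)
			chunk = (size_t)(len - done);
		ret = pwrite(fd, zeroes, chunk, offset + done);
		if (ret < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (ret == 0) {
			errno = EIO;
			return -1;
		}
		done += ret;
	}
	return 0;
}
```

A caller would typically invoke this once when creating a file that will
later be overwritten in place, and can use the return value to log which
path was taken.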
I tested the changes with fsstress and fsx based on the xfstests patch I
sent recently to test this flag[4]. generic/363 helped me debug the
crash I noticed when I did the initial implementation[3].
Dave initially suggested creating a common helper based on
xfs_iomap_convert_unwritten(), but as can be seen in the previous
version, a lot of the code had to be rewritten. The changes had more in
common with xfs_alloc_file_space(), so this version reuses
xfs_alloc_file_space() for write zeroes.
Thanks to Christoph for all the review comments and design suggestions
that were made both offline and online for this series.
The stress tests generic/363, generic/127 and xfs/131 are passing. I have
started the full xfstests suite for this series.
Changes since v4:
- Introduce an enum for allocation mode in xfs_alloc_file_space (Christoph)
- Use xfs_setfilesize() instead of updating the on-disk size in the
function.
Changes since v3:
- Introduce xfs_bmap_alloc_or_convert_range() in xfs_iomap.c to make
review easier (Christoph)
- Add extsz hint and rt support in xfs_bmap_alloc_or_convert_range()
Changes since v2:
- Add allow_write_zeroes to xfs_global so that we can enable this
feature independent of the HW underneath.
Changes since v1 [5.1 5.2]:
- Added a new function xfs_bmap_alloc_or_convert_range() based on Dave's
feedback.
- Changed xfs_falloc_write_zeroes() to use
xfs_bmap_alloc_or_convert_range() instead of the prealloc-then-convert
approach.
[1] https://lore.kernel.org/linux-fsdevel/20250619111806.3546162-1-yi.zhang@huaweicloud.com/
[2] https://lore.kernel.org/linux-fsdevel/20260217055103.GA6174@lst.de/T/#m7935b9bab32bb5ff372507f84803b8753ad1c814
[3] https://lore.kernel.org/linux-xfs/6i2jvzn3lyugjlbgmjzpped3gogzyqv5mpe2uqaifz4vjpaega@pomzoq7ley77/
[4] https://lore.kernel.org/linux-xfs/20260312195308.738189-1-p.raghav@samsung.com/
[5.1] https://lore.kernel.org/linux-xfs/20260309180708.427553-2-lukas@herbolt.com/
[5.2] https://lore.kernel.org/linux-xfs/abC1LvRElctaHPe5@dread/
Pankaj Raghav (2):
xfs: add an allocation mode to xfs_alloc_file_space()
xfs: add support for FALLOC_FL_WRITE_ZEROES
fs/xfs/xfs_bmap_util.c | 42 +++++++++++++++++++----
fs/xfs/xfs_bmap_util.h | 7 +++-
fs/xfs/xfs_file.c | 75 +++++++++++++++++++++++++++++++++++++++---
3 files changed, 113 insertions(+), 11 deletions(-)
base-commit: 184888e159ef82423987f348202d74a0b0dc4138
--
2.51.2
* [PATCH v5 1/2] xfs: add an allocation mode to xfs_alloc_file_space()
2026-06-04 10:14 [PATCH v5 0/2] add FALLOC_FL_WRITE_ZEROES support to xfs Pankaj Raghav
@ 2026-06-04 10:14 ` Pankaj Raghav
2026-06-04 10:14 ` [PATCH v5 2/2] xfs: add support for FALLOC_FL_WRITE_ZEROES Pankaj Raghav
1 sibling, 0 replies; 4+ messages in thread
From: Pankaj Raghav @ 2026-06-04 10:14 UTC
To: linux-xfs
Cc: bfoster, lukas, Darrick J . Wong, p.raghav, dgc, gost.dev,
pankaj.raghav, andres, kundan.kumar, hch, cem, hch
xfs_alloc_file_space() hardcodes XFS_BMAPI_PREALLOC to preallocate
unwritten extents across a range.
In preparation for FALLOC_FL_WRITE_ZEROES, add an explicit allocation
mode argument, enum xfs_alloc_file_space_mode, and derive the xfs_bmapi
flags from it. The only mode for now is XFS_ALLOC_FILE_SPACE_PREALLOC,
which preallocates unwritten extents and marks the inode as preallocated
exactly as before, so there is no functional change.
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
fs/xfs/xfs_bmap_util.c | 25 +++++++++++++++++++++----
fs/xfs/xfs_bmap_util.h | 6 +++++-
fs/xfs/xfs_file.c | 9 ++++++---
3 files changed, 32 insertions(+), 8 deletions(-)
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 0ab00615f1ad..7466267f6c60 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -642,11 +642,19 @@ xfs_free_eofblocks(
return error;
}
+/*
+ * Allocate space for a file according to @mode:
+ *
+ * XFS_ALLOC_FILE_SPACE_PREALLOC:
+ * Preallocate unwritten extents across the range and mark the inode as
+ * preallocated.
+ */
int
xfs_alloc_file_space(
struct xfs_inode *ip,
xfs_off_t offset,
- xfs_off_t len)
+ xfs_off_t len,
+ enum xfs_alloc_file_space_mode mode)
{
xfs_mount_t *mp = ip->i_mount;
xfs_off_t count;
@@ -657,6 +665,7 @@ xfs_alloc_file_space(
int rt;
xfs_trans_t *tp;
xfs_bmbt_irec_t imaps[1], *imapp;
+ uint32_t bmapi_flags, nr_exts;
int error;
if (xfs_is_always_cow_inode(ip))
@@ -674,6 +683,15 @@ xfs_alloc_file_space(
if (len <= 0)
return -EINVAL;
+ switch (mode) {
+ case XFS_ALLOC_FILE_SPACE_PREALLOC:
+ bmapi_flags = XFS_BMAPI_PREALLOC;
+ nr_exts = XFS_IEXT_ADD_NOSPLIT_CNT;
+ break;
+ default:
+ return -EINVAL;
+ }
+
rt = XFS_IS_REALTIME_INODE(ip);
extsz = xfs_get_extsz_hint(ip);
@@ -733,8 +751,7 @@ xfs_alloc_file_space(
if (error)
break;
- error = xfs_iext_count_extend(tp, ip, XFS_DATA_FORK,
- XFS_IEXT_ADD_NOSPLIT_CNT);
+ error = xfs_iext_count_extend(tp, ip, XFS_DATA_FORK, nr_exts);
if (error)
goto error;
@@ -748,7 +765,7 @@ xfs_alloc_file_space(
* will eventually reach the requested range.
*/
error = xfs_bmapi_write(tp, ip, startoffset_fsb,
- allocatesize_fsb, XFS_BMAPI_PREALLOC, 0, imapp,
+ allocatesize_fsb, bmapi_flags, 0, imapp,
&nimaps);
if (error) {
if (error != -ENOSR)
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index c477b3361630..232b4c48247e 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -55,8 +55,12 @@ int xfs_bmap_last_extent(struct xfs_trans *tp, struct xfs_inode *ip,
int *is_empty);
/* preallocation and hole punch interface */
+enum xfs_alloc_file_space_mode {
+ XFS_ALLOC_FILE_SPACE_PREALLOC,
+};
+
int xfs_alloc_file_space(struct xfs_inode *ip, xfs_off_t offset,
- xfs_off_t len);
+ xfs_off_t len, enum xfs_alloc_file_space_mode mode);
int xfs_free_file_space(struct xfs_inode *ip, xfs_off_t offset,
xfs_off_t len, struct xfs_zone_alloc_ctx *ac);
int xfs_collapse_file_space(struct xfs_inode *, xfs_off_t offset,
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 845a97c9b063..e90ea6ebdc8e 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1406,7 +1406,8 @@ xfs_falloc_zero_range(
len = round_up(offset + len, blksize) -
round_down(offset, blksize);
offset = round_down(offset, blksize);
- error = xfs_alloc_file_space(ip, offset, len);
+ error = xfs_alloc_file_space(ip, offset, len,
+ XFS_ALLOC_FILE_SPACE_PREALLOC);
}
if (error)
return error;
@@ -1432,7 +1433,8 @@ xfs_falloc_unshare_range(
if (error)
return error;
- error = xfs_alloc_file_space(XFS_I(inode), offset, len);
+ error = xfs_alloc_file_space(XFS_I(inode), offset, len,
+ XFS_ALLOC_FILE_SPACE_PREALLOC);
if (error)
return error;
return xfs_falloc_setsize(file, new_size);
@@ -1460,7 +1462,8 @@ xfs_falloc_allocate_range(
if (error)
return error;
- error = xfs_alloc_file_space(XFS_I(inode), offset, len);
+ error = xfs_alloc_file_space(XFS_I(inode), offset, len,
+ XFS_ALLOC_FILE_SPACE_PREALLOC);
if (error)
return error;
return xfs_falloc_setsize(file, new_size);
--
2.51.2
* [PATCH v5 2/2] xfs: add support for FALLOC_FL_WRITE_ZEROES
2026-06-04 10:14 [PATCH v5 0/2] add FALLOC_FL_WRITE_ZEROES support to xfs Pankaj Raghav
2026-06-04 10:14 ` [PATCH v5 1/2] xfs: add an allocation mode to xfs_alloc_file_space() Pankaj Raghav
@ 2026-06-04 10:14 ` Pankaj Raghav
2026-06-08 10:20 ` Pankaj Raghav
1 sibling, 1 reply; 4+ messages in thread
From: Pankaj Raghav @ 2026-06-04 10:14 UTC
To: linux-xfs
Cc: bfoster, lukas, Darrick J . Wong, p.raghav, dgc, gost.dev,
pankaj.raghav, andres, kundan.kumar, hch, cem, hch
If the underlying block device supports the unmap write zeroes
operation, this flag allows users to quickly preallocate a file with
written extents that contain zeroes. This benefits subsequent
overwrites because it avoids unwritten-to-written extent conversions,
which reduces metadata updates and journal I/O overhead and improves
overwrite performance.
Co-developed-by: Lukas Herbolt <lukas@herbolt.com>
Signed-off-by: Lukas Herbolt <lukas@herbolt.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
fs/xfs/xfs_bmap_util.c | 19 ++++++++++--
fs/xfs/xfs_bmap_util.h | 1 +
fs/xfs/xfs_file.c | 66 +++++++++++++++++++++++++++++++++++++++++-
3 files changed, 82 insertions(+), 4 deletions(-)
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 7466267f6c60..d175a1057f13 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -643,11 +643,18 @@ xfs_free_eofblocks(
}
/*
- * Allocate space for a file according to @mode:
+ * Allocate space or convert extents for a file according to @mode:
*
* XFS_ALLOC_FILE_SPACE_PREALLOC:
* Preallocate unwritten extents across the range and mark the inode as
* preallocated.
+ *
+ * XFS_ALLOC_FILE_SPACE_WRITE_ZEROES:
+ * Allocate written extents over holes and convert unwritten extents in the
+ * range to written extents, initialising both to contain zeroes.
+ *
+ * This function does not update the file size; callers that extend the file
+ * are responsible for updating it once the extents are allocated.
*/
int
xfs_alloc_file_space(
@@ -688,6 +695,10 @@ xfs_alloc_file_space(
bmapi_flags = XFS_BMAPI_PREALLOC;
nr_exts = XFS_IEXT_ADD_NOSPLIT_CNT;
break;
+ case XFS_ALLOC_FILE_SPACE_WRITE_ZEROES:
+ bmapi_flags = XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO;
+ nr_exts = XFS_IEXT_WRITE_UNWRITTEN_CNT;
+ break;
default:
return -EINVAL;
}
@@ -776,8 +787,10 @@ xfs_alloc_file_space(
allocatesize_fsb -= imapp->br_blockcount;
}
- ip->i_diflags |= XFS_DIFLAG_PREALLOC;
- xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+ if (mode == XFS_ALLOC_FILE_SPACE_PREALLOC) {
+ ip->i_diflags |= XFS_DIFLAG_PREALLOC;
+ xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+ }
error = xfs_trans_commit(tp);
xfs_iunlock(ip, XFS_ILOCK_EXCL);
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index 232b4c48247e..e3d506ca9610 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -57,6 +57,7 @@ int xfs_bmap_last_extent(struct xfs_trans *tp, struct xfs_inode *ip,
/* preallocation and hole punch interface */
enum xfs_alloc_file_space_mode {
XFS_ALLOC_FILE_SPACE_PREALLOC,
+ XFS_ALLOC_FILE_SPACE_WRITE_ZEROES,
};
int xfs_alloc_file_space(struct xfs_inode *ip, xfs_off_t offset,
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e90ea6ebdc8e..37623baaaed6 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1368,6 +1368,67 @@ xfs_falloc_force_zero(
return XFS_TEST_ERROR(ip->i_mount, XFS_ERRTAG_FORCE_ZERO_RANGE);
}
+static int
+xfs_falloc_write_zeroes(
+ struct file *file,
+ int mode,
+ loff_t offset,
+ loff_t len,
+ struct xfs_zone_alloc_ctx *ac)
+{
+ struct inode *inode = file_inode(file);
+ struct xfs_inode *ip = XFS_I(inode);
+ loff_t new_size = 0;
+ loff_t old_size = XFS_ISIZE(ip);
+ int error;
+ unsigned int blksize = i_blocksize(inode);
+ loff_t offset_aligned = round_down(offset, blksize);
+ bool did_zero;
+
+ if (xfs_is_always_cow_inode(ip) ||
+ !bdev_write_zeroes_unmap_sectors(xfs_inode_buftarg(ip)->bt_bdev))
+ return -EOPNOTSUPP;
+
+ error = xfs_falloc_newsize(file, mode, offset, len, &new_size);
+ if (error)
+ return error;
+
+ error = xfs_free_file_space(ip, offset, len, ac);
+ if (error)
+ return error;
+
+ /*
+ * Zero the tail of the old EOF block and any space up to the new
+ * offset.
+ * In the usual truncate path, xfs_falloc_setsize takes care of
+ * zeroing those blocks.
+ */
+ if (offset_aligned > old_size) {
+ trace_xfs_zero_eof(ip, old_size, offset_aligned - old_size);
+ error = xfs_zero_range(ip, old_size, offset_aligned - old_size,
+ NULL, &did_zero);
+ if (error)
+ return error;
+
+ }
+
+ error = xfs_alloc_file_space(ip, offset, len,
+ XFS_ALLOC_FILE_SPACE_WRITE_ZEROES);
+ if (error)
+ return error;
+
+ /*
+ * xfs_falloc_setsize() would re-zero the written extents via
+ * iomap_zero_range(). Use xfs_setfilesize() instead.
+ * Update in-core i_size first as xfs_setfilesize() clamps the on-disk
+ * size to it.
+ */
+ if (new_size > i_size_read(inode))
+ i_size_write(inode, new_size);
+
+ return xfs_setfilesize(ip, offset, len);
+}
+
/*
* Punch a hole and prealloc the range. We use a hole punch rather than
* unwritten extent conversion for two reasons:
@@ -1473,7 +1534,7 @@ xfs_falloc_allocate_range(
(FALLOC_FL_ALLOCATE_RANGE | FALLOC_FL_KEEP_SIZE | \
FALLOC_FL_PUNCH_HOLE | FALLOC_FL_COLLAPSE_RANGE | \
FALLOC_FL_ZERO_RANGE | FALLOC_FL_INSERT_RANGE | \
- FALLOC_FL_UNSHARE_RANGE)
+ FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_WRITE_ZEROES)
STATIC long
__xfs_file_fallocate(
@@ -1525,6 +1586,9 @@ __xfs_file_fallocate(
case FALLOC_FL_ALLOCATE_RANGE:
error = xfs_falloc_allocate_range(file, mode, offset, len);
break;
+ case FALLOC_FL_WRITE_ZEROES:
+ error = xfs_falloc_write_zeroes(file, mode, offset, len, ac);
+ break;
default:
error = -EOPNOTSUPP;
break;
--
2.51.2
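As a reading aid for xfs_falloc_write_zeroes() above, here is a small
userspace model of two of its decisions: how much of the gap past the old
EOF gets zeroed, and why the in-core size is raised before the on-disk
size is updated. The function names are hypothetical, and the clamp in
new_disk_size() is an assumption that follows the patch comment
("xfs_setfilesize() clamps the on-disk size to it"); this is not kernel
code.

```c
#include <assert.h>
#include <stdint.h>

/* Round @x down to a multiple of @blksize, which must be a power of two. */
static int64_t blk_round_down(int64_t x, int64_t blksize)
{
	assert(x >= 0 && blksize > 0 && (blksize & (blksize - 1)) == 0);
	return x & ~(blksize - 1);
}

/*
 * Model of the EOF-gap check: returns how many bytes, starting at
 * @old_size, would be handed to the zeroing helper. 0 means no call.
 */
static int64_t eof_gap_to_zero(int64_t old_size, int64_t offset,
			       int64_t blksize)
{
	int64_t offset_aligned = blk_round_down(offset, blksize);

	return offset_aligned > old_size ? offset_aligned - old_size : 0;
}

/*
 * Assumed model of the size clamp: a requested on-disk size is limited
 * to the in-core size, and the on-disk size never moves backwards.
 */
static int64_t new_disk_size(int64_t disk_size, int64_t incore_size,
			     int64_t requested)
{
	if (requested > incore_size || requested < 0)
		requested = incore_size;
	return requested > disk_size ? requested : disk_size;
}
```

With a 4096-byte block size, an old size of 1000 and a request starting at
offset 10000, the model zeroes 7192 bytes (from 1000 up to 8192). If the
in-core size were still 4096 when 8192 is requested, the modelled on-disk
size would stay at 4096, which is why the in-core update comes first.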
* Re: [PATCH v5 2/2] xfs: add support for FALLOC_FL_WRITE_ZEROES
2026-06-04 10:14 ` [PATCH v5 2/2] xfs: add support for FALLOC_FL_WRITE_ZEROES Pankaj Raghav
@ 2026-06-08 10:20 ` Pankaj Raghav
0 siblings, 0 replies; 4+ messages in thread
From: Pankaj Raghav @ 2026-06-08 10:20 UTC
To: Pankaj Raghav, linux-xfs, dgc
Cc: bfoster, lukas, Darrick J . Wong, gost.dev, andres, kundan.kumar,
hch, cem, hch, pankaj.raghav
> +
> + error = xfs_alloc_file_space(ip, offset, len,
> + XFS_ALLOC_FILE_SPACE_WRITE_ZEROES);
> + if (error)
> + return error;
> +
> + /*
> + * xfs_falloc_setsize() would re-zero the written extents via
> + * iomap_zero_range(). Use xfs_setfilesize() instead.
> + * Update in-core i_size first as xfs_setfilesize() clamps the on-disk
> + * size to it.
> + */
> + if (new_size > i_size_read(inode))
> + i_size_write(inode, new_size);
> +
> + return xfs_setfilesize(ip, offset, len);
Sashiko reported:
On 32-bit systems where size_t is 32 bits, lengths exceeding 4GB will be
truncated, which might cause the on-disk inode size to be permanently updated
to a severely incorrect smaller size.
A simple fix would be the following, since we already store the value of offset + len locally:
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 37623baaaed6..86fae2190c24 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1426,7 +1426,7 @@ xfs_falloc_write_zeroes(
if (new_size > i_size_read(inode))
i_size_write(inode, new_size);
- return xfs_setfilesize(ip, offset, len);
+ return xfs_setfilesize(ip, new_size, 0);
}
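The truncation in the report can be illustrated with a small userspace
model. It assumes, as the report implies, that the length parameter is a
size_t and therefore 32 bits wide on 32-bit systems; uint32_t stands in
for that type here, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of passing (offset, len) when the length parameter is a 32-bit
 * size_t: the upper bits of @len are lost before the end is computed.
 */
static int64_t eof_with_32bit_len(int64_t offset, int64_t len)
{
	uint32_t size;

	assert(offset >= 0 && len >= 0);
	size = (uint32_t)len;	/* what a 32-bit size_t would keep */
	return offset + (int64_t)size;
}

/*
 * Shape of the proposed fix: pass the precomputed 64-bit end as the
 * offset and 0 as the length, so nothing can be truncated.
 */
static int64_t eof_with_precomputed_end(int64_t new_size)
{
	uint32_t size = 0;

	assert(new_size >= 0);
	return new_size + (int64_t)size;
}
```

For a 5 GiB request at offset 0, the first form computes an end of 1 GiB
(5 GiB modulo 4 GiB), while the second keeps the full 5 GiB.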
I will probably update xfs_setfilesize() to take a 64-bit value for len in a separate series.
I will also wait to see whether others have comments before sending the next version.
@Dave: You were against the initial design[1]. Let me know your thoughts on the current version.
[1] https://lore.kernel.org/linux-xfs/abCzhDSVmFx4PtWI@dread/
--
Pankaj