From: "Darrick J. Wong" <djwong@kernel.org>
To: Pankaj Raghav <p.raghav@samsung.com>
Cc: linux-xfs@vger.kernel.org, bfoster@redhat.com, lukas@herbolt.com,
dgc@kernel.org, gost.dev@samsung.com,
Zhang Yi <yi.zhang@huaweicloud.com>,
pankaj.raghav@linux.dev, andres@anarazel.de,
kundan.kumar@samsung.com, hch@lst.de, cem@kernel.org,
hch@infradead.org
Subject: Re: [PATCH v8 2/2] xfs: add support for FALLOC_FL_WRITE_ZEROES
Date: Thu, 25 Jun 2026 10:20:06 -0700
Message-ID: <20260625172006.GC6078@frogsfrogsfrogs>
In-Reply-To: <20260625114550.4109104-3-p.raghav@samsung.com>

On Thu, Jun 25, 2026 at 01:45:50PM +0200, Pankaj Raghav wrote:
> If the underlying block device supports the unmap write zeroes
> operation, this flag allows users to quickly preallocate a file with
> written extents that contain zeroes. This is beneficial for subsequent
> overwrites as it prevents the need for unwritten-to-written extent
> conversions, thereby significantly reducing metadata updates and journal
> I/O overhead, improving overwrite performance.
>
> Punch the range first so it becomes a hole, update the size via
> xfs_falloc_setsize() while it is still a hole (so its xfs_zero_range()
> skips it and avoids rezeroing), then convert it to written
> zeroed extents. A crash between the size update and the conversion is
> safe, as a hole within i_size reads back as zeroes.
>
> Co-developed-by: Lukas Herbolt <lukas@herbolt.com>
> Signed-off-by: Lukas Herbolt <lukas@herbolt.com>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
> fs/xfs/xfs_bmap_util.c | 19 ++++++++--
> fs/xfs/xfs_bmap_util.h | 1 +
> fs/xfs/xfs_file.c | 78 +++++++++++++++++++++++++++++++++++++++++-
> 3 files changed, 94 insertions(+), 4 deletions(-)
>
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index e5424d010a69..855602cb35e8 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -643,11 +643,18 @@ xfs_free_eofblocks(
> }
>
> /*
> - * Allocate space for a file according to @mode:
> + * Allocate space or convert extents for a file according to @mode:
> *
> * XFS_ALLOC_FILE_SPACE_PREALLOC:
> * Preallocate unwritten extents over holes across the range and mark the inode
> * as preallocated.
> + *
> + * XFS_ALLOC_FILE_SPACE_WRITE_ZEROES:
> + * Allocate written extents over holes and convert unwritten extents in the
> + * range to written extents, initialising both to contain zeroes.
> + *
> + * This function does not update the file size; callers that extend the file
> + * are responsible for updating it once the extents are allocated.
> */
> int
> xfs_alloc_file_space(
> @@ -688,6 +695,10 @@ xfs_alloc_file_space(
> bmapi_flags = XFS_BMAPI_PREALLOC;
> nr_exts = XFS_IEXT_ADD_NOSPLIT_CNT;
> break;
> + case XFS_ALLOC_FILE_SPACE_WRITE_ZEROES:
> + bmapi_flags = XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO;
> + nr_exts = XFS_IEXT_WRITE_UNWRITTEN_CNT;
> + break;
> default:
> return -EINVAL;
> }
> @@ -776,8 +787,10 @@ xfs_alloc_file_space(
> allocatesize_fsb -= imapp->br_blockcount;
> }
>
> - ip->i_diflags |= XFS_DIFLAG_PREALLOC;
> - xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> + if (mode == XFS_ALLOC_FILE_SPACE_PREALLOC) {
> + ip->i_diflags |= XFS_DIFLAG_PREALLOC;
> + xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> + }
>
> error = xfs_trans_commit(tp);
> xfs_iunlock(ip, XFS_ILOCK_EXCL);
> diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
> index 232b4c48247e..e3d506ca9610 100644
> --- a/fs/xfs/xfs_bmap_util.h
> +++ b/fs/xfs/xfs_bmap_util.h
> @@ -57,6 +57,7 @@ int xfs_bmap_last_extent(struct xfs_trans *tp, struct xfs_inode *ip,
> /* preallocation and hole punch interface */
> enum xfs_alloc_file_space_mode {
> XFS_ALLOC_FILE_SPACE_PREALLOC,
> + XFS_ALLOC_FILE_SPACE_WRITE_ZEROES,
> };
>
> int xfs_alloc_file_space(struct xfs_inode *ip, xfs_off_t offset,
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index e90ea6ebdc8e..5dcb03e6de12 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1368,6 +1368,79 @@ xfs_falloc_force_zero(
> return XFS_TEST_ERROR(ip->i_mount, XFS_ERRTAG_FORCE_ZERO_RANGE);
> }
>
> +static int
> +xfs_falloc_write_zeroes(
> + struct file *file,
> + int mode,
> + loff_t offset,
> + loff_t len,
> + struct xfs_zone_alloc_ctx *ac)
> +{
> + struct inode *inode = file_inode(file);
> + struct xfs_inode *ip = XFS_I(inode);
> + loff_t new_size = 0;
> + int error;
> +
> + if (xfs_is_always_cow_inode(ip) ||
> + !bdev_write_zeroes_unmap_sectors(xfs_inode_buftarg(ip)->bt_bdev))
> + return -EOPNOTSUPP;
> +
> + error = xfs_falloc_newsize(file, mode, offset, len, &new_size);
> + if (error)
> + return error;
> +
> + /*
> + *
> + * |----------|----------|----------|----------|----------|
> + *            ^    ^     ^          ^    ^     ^
> + *            |    |     |          |    |     |
> + *            |  offset  |          |   end    |
> + *            |          |          |          |
> + *        offset_rd  offset_ru   end_rd     end_ru

Do "_rd" and "_ru" mean "round down" and "round up"? And is that to the
fsblock size, or the allocation unit size?

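Assuming they do mean round down and round up, a minimal sketch of how the
four boundaries in the diagram would fall out for a given granularity. The
wz_* names and the unit parameter are made up for illustration; which unit
applies (fsblock or allocation unit) is exactly the open question above.

```c
#include <stdint.h>

/* Hypothetical holder for the four boundaries named in the diagram. */
struct wz_bounds {
	uint64_t offset_rd;	/* offset rounded down to unit */
	uint64_t offset_ru;	/* offset rounded up to unit */
	uint64_t end_rd;	/* end rounded down to unit */
	uint64_t end_ru;	/* end rounded up to unit */
};

/* Round x down to a multiple of unit; unit need not be a power of two. */
static uint64_t wz_round_down(uint64_t x, uint64_t unit)
{
	return x - (x % unit);
}

/* Round x up to a multiple of unit. */
static uint64_t wz_round_up(uint64_t x, uint64_t unit)
{
	return wz_round_down(x + unit - 1, unit);
}

/* Compute all four boundaries for the byte range [offset, offset + len). */
static struct wz_bounds wz_compute_bounds(uint64_t offset, uint64_t len,
					  uint64_t unit)
{
	uint64_t end = offset + len;
	struct wz_bounds b = {
		.offset_rd = wz_round_down(offset, unit),
		.offset_ru = wz_round_up(offset, unit),
		.end_rd = wz_round_down(end, unit),
		.end_ru = wz_round_up(end, unit),
	};

	return b;
}
```

With offset = 5000, len = 10000 and unit = 4096, this yields 4096, 8192,
12288 and 16384 respectively. The modulo form is used so that the sketch also
covers a unit that is not a power of two.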
> + *
> + * xfs_free_file_space() punches the aligned interior offset_ru -> end_rd
> + * to holes and byte-zeroes the in-range parts of the partial edge blocks,

xfs_free_file_space() rounds inward to allocation unit granularity and
punches out that range, then writes zeroes to any non-hole space that
doesn't get unmapped.

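To make that concrete, here is a toy userspace model of the punch step over a
per-unit state array. None of this is XFS code: the toy_* names are invented,
and it assumes (following the quoted comment) that edge zeroing only touches
units that already hold written data.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy mapping state for one allocation unit; not an XFS type. */
enum toy_state { TOY_HOLE, TOY_UNWRITTEN, TOY_WRITTEN };

struct toy_unit {
	enum toy_state state;
	int edge_zeroed;	/* in-range bytes of a partial unit were zeroed */
};

/* Zero the in-range part of a partial unit, but only if it is written. */
static void toy_zero_edge(struct toy_unit *map, size_t nunits, uint64_t idx)
{
	if (idx < nunits && map[idx].state == TOY_WRITTEN)
		map[idx].edge_zeroed = 1;
}

/* Model punching the byte range [offset, offset + len). */
static void toy_punch(struct toy_unit *map, size_t nunits, uint64_t unit,
		      uint64_t offset, uint64_t len)
{
	uint64_t end = offset + len;
	uint64_t first_full = (offset + unit - 1) / unit; /* offset_ru / unit */
	uint64_t last_full = end / unit;		  /* end_rd / unit */
	uint64_t i;

	/* Units that lie fully inside the range are unmapped. */
	for (i = first_full; i < last_full && i < nunits; i++) {
		map[i].state = TOY_HOLE;
		map[i].edge_zeroed = 0;
	}

	/* Partial head and tail units stay mapped; zero their in-range bytes. */
	if (offset % unit)
		toy_zero_edge(map, nunits, offset / unit);
	if (end % unit)
		toy_zero_edge(map, nunits, end / unit);
}
```

For five 4096-byte units and the range 5000 -> 15000, only unit 2 becomes a
hole; units 1 and 3 are edge units and are zeroed in place if they were
written, and left alone if they were holes or unwritten.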
> + * offset -> offset_ru and end_rd -> end. xfs_zero_range() only touches
> + * already-written blocks here; it skips holes and unwritten extents, so
> + * unallocated/unwritten edge blocks are left for the allocation below.
> + */
> + error = xfs_free_file_space(ip, offset, len, ac);
> + if (error)
> + return error;
> +
> + /*
> + * Publish the new size while the punched range is still a hole, then
> + * fill it with written zeroes. Like the other fallocate modes we use
> + * xfs_falloc_setsize(), but it must run *before* we convert the range
> + * to written extents: xfs_setattr_size() zeroes [old EOF, new size) via
> + * xfs_zero_range(), which skips holes, so there is nothing to re-zero.
> + * It will also write back the partial EOF block before the on-disk size is
> + * logged.
> + * Note: extending the size before allocating means a failure below
> + * leaves the file larger with unallocated holes in the new range.
> + * That is safe as holes within i_size read back as zeroes and expose
> + * no stale data while the error is propagated to the caller.
> + */
> + error = xfs_falloc_setsize(file, new_size);
> + if (error)
> + return error;

Hrm, ok, so now that we've punched out some blocks and zeroed the rest,
we adjust the file size, which should only entail committing the new
file size to disk...

> +
> + /*
> + * Allocate written, zeroed extents across the range. xfs_alloc_file_space()
> + * rounds outward to block granularity:
> + * - holes (the punched interior and any unallocated edge block) are
> + * allocated and zeroed;
> + * - unwritten extents (including unwritten edge blocks) are converted to
> + * written and zeroed;
> + * - already written edge blocks are skipped. The out-of-range bytes of
> + * a written edge block keep their data (offset_rd -> offset and
> + * end -> end_ru); their in-range bytes (offset -> offset_ru and
> + * end_rd -> end) were already zeroed by xfs_free_file_space().
> + */
> + return xfs_alloc_file_space(ip, offset, len,
> + XFS_ALLOC_FILE_SPACE_WRITE_ZEROES);

...and now we can just do an accelerated "write zeroes to disk", which is
conveniently always within EOF. I /think/ this looks ok to me now,
though I'm curious how extensively the new fallocate mode has been
tested with fsx and unaligned file ranges, and on rt volumes with an rt
extent size > 1 fsblock?

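For anyone checking unaligned ranges by hand, a toy model of the end state
this last step should produce, as described in the quoted comment: the range
is rounded outward, holes and unwritten units become written zeroes, and
already-written units are skipped. The wz_* names are invented for this
sketch and it is not derived from the actual xfs_alloc_file_space() code.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy mapping state for one allocation unit; not an XFS type. */
enum wz_state { WZ_HOLE, WZ_UNWRITTEN, WZ_WRITTEN };

/*
 * Model the allocation step over the byte range [offset, offset + len).
 * Returns how many units were newly turned into written, zeroed units.
 */
static size_t wz_alloc_write_zeroes(enum wz_state *map, size_t nunits,
				    uint64_t unit, uint64_t offset,
				    uint64_t len)
{
	uint64_t first = offset / unit;			    /* offset_rd / unit */
	uint64_t last = (offset + len + unit - 1) / unit;   /* end_ru / unit */
	size_t converted = 0;
	uint64_t i;

	for (i = first; i < last && i < nunits; i++) {
		/* Written units keep their data; edges were zeroed earlier. */
		if (map[i] == WZ_WRITTEN)
			continue;

		/* Holes get allocated, unwritten units converted; both zeroed. */
		map[i] = WZ_WRITTEN;
		converted++;
	}

	return converted;
}
```

With five 4096-byte units in the states written, written, hole, unwritten,
written and the range 5000 -> 15000, units 1 to 3 all end up written and two
of them are newly converted; units 0 and 4 lie outside the rounded range and
are untouched.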
--D
> +}
> +
> /*
> * Punch a hole and prealloc the range. We use a hole punch rather than
> * unwritten extent conversion for two reasons:
> @@ -1473,7 +1546,7 @@ xfs_falloc_allocate_range(
> (FALLOC_FL_ALLOCATE_RANGE | FALLOC_FL_KEEP_SIZE | \
> FALLOC_FL_PUNCH_HOLE | FALLOC_FL_COLLAPSE_RANGE | \
> FALLOC_FL_ZERO_RANGE | FALLOC_FL_INSERT_RANGE | \
> - FALLOC_FL_UNSHARE_RANGE)
> + FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_WRITE_ZEROES)
>
> STATIC long
> __xfs_file_fallocate(
> @@ -1525,6 +1598,9 @@ __xfs_file_fallocate(
> case FALLOC_FL_ALLOCATE_RANGE:
> error = xfs_falloc_allocate_range(file, mode, offset, len);
> break;
> + case FALLOC_FL_WRITE_ZEROES:
> + error = xfs_falloc_write_zeroes(file, mode, offset, len, ac);
> + break;
> default:
> error = -EOPNOTSUPP;
> break;
> --
> 2.51.2
>
>