public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <dgc@kernel.org>
To: Lukas Herbolt <lukas@herbolt.com>
Cc: linux-xfs@vger.kernel.org, cem@kernel.org, hch@infradead.org,
	djwong@kernel.org, p.raghav@samsung.com
Subject: Re: [PATCH v11 2/2] xfs: add FALLOC_FL_WRITE_ZEROES to XFS code base
Date: Wed, 11 Mar 2026 11:12:52 +1100	[thread overview]
Message-ID: <abCzhDSVmFx4PtWI@dread> (raw)
In-Reply-To: <20260309181235.428151-2-lukas@herbolt.com>

On Mon, Mar 09, 2026 at 07:12:36PM +0100, Lukas Herbolt wrote:
> Add support for FALLOC_FL_WRITE_ZEROES if the underlying device
> enable the unmap write zeroes operation.
> 
> Co-developed-by: Pankaj Raghav <p.raghav@samsung.com>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Signed-off-by: Lukas Herbolt <lukas@herbolt.com>
> 
> ---
>  v11 changes:
> 	- split into 2 patches separating the bmapi_flags addition
> 	- 2 step allocation, to avoid zeroing beyond EOF
> 
>  fs/xfs/xfs_file.c | 41 +++++++++++++++++++++++++++++------------
>  1 file changed, 29 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index fd049a1fc9c6..f8c1611e3267 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1293,29 +1293,45 @@ xfs_falloc_zero_range(
>  	unsigned int		blksize = i_blocksize(inode);
>  	loff_t			new_size = 0;
>  	int			error;
> +	bool                    need_convert = false;
>  
>  	trace_xfs_zero_file_space(ip);
>  
> +	if (mode & FALLOC_FL_WRITE_ZEROES) {
> +		if (xfs_is_always_cow_inode(ip) ||
> +		    !bdev_write_zeroes_unmap_sectors(
> +			    xfs_inode_buftarg(ip)->bt_bdev))
> +			return -EOPNOTSUPP;
> +		need_convert = true;
> +	}
> +
>  	error = xfs_falloc_newsize(file, mode, offset, len, &new_size);
>  	if (error)
>  		return error;
>  
>  	if (xfs_falloc_force_zero(ip, ac)) {
>  		error = xfs_zero_range(ip, offset, len, ac, NULL);
> -	} else {
> -		error = xfs_free_file_space(ip, offset, len, ac);
> -		if (error)
> -			return error;
> -
> -		len = round_up(offset + len, blksize) -
> -			round_down(offset, blksize);
> -		offset = round_down(offset, blksize);
> -		error = xfs_alloc_file_space(ip, offset, len,
> -				XFS_BMAPI_PREALLOC);
> +		goto set_filesize;
>  	}
> +	error = xfs_free_file_space(ip, offset, len, ac);
>  	if (error)
>  		return error;
> -	return xfs_falloc_setsize(file, new_size);
> +
> +	len = round_up(offset + len, blksize) - round_down(offset, blksize);
> +	offset = round_down(offset, blksize);
> +	error = xfs_alloc_file_space(ip, offset, len, XFS_BMAPI_PREALLOC);

xfs_alloc_file_space() already rounds the offset down and the range
end upwards to block size boundaries (XFS_B_TO_FSBT() rounds down,
XFS_B_TO_FSB rounds up). There is no need to do it here.

> +
> +set_filesize:
> +	if (error)
> +		return error;
> +
> +	error = xfs_falloc_setsize(file, new_size);
> +	if (error)
> +		return error;
> +	if (need_convert)
> +		error = xfs_alloc_file_space(ip, offset, len,
> +				XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO);
> +	return error;

Honestly, this "prealloc, set file size, then convert and zero" thing
seems a bit ... clunky. 

I mean, this is trying to work around an issue with
xfs_falloc_setsize() operating on written extents beyond EOF to set
EOF. But why are we even using a generic truncate operation to set
the EOF when we have already done most of the complex truncate work
(exclusion, page cache invalidation, extent removal, etc) already?

My logic:

xfs_bmapi_write(off, len, XFS_BMAPI_ZERO) will allocate all the
holes and delalloc extents over the given range as written extents
that are initialised to contain zeroes.

xfs_bmapi_write(off, len, XFS_BMAPI_CONVERT) will convert all the
unwritten extents over a given range to written extents, but will
not touch the data in those extents.

xfs_bmapi_write(off, len, XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO) will
allocate holes and delalloc extents and convert unwritten extents
all to written extents, and it will write zeros to all of those
extents.

This latter case is exactly what we want, and if gives us zeroes on
disk to the end of the extent beyond the new EOF. i.e. we've
completed all the data IO, and so now all we need
to do is update the file size transactionally, just like we do in
IO completion.

Indeed, xfs_iomap_write_unwritten() integrates this size update
into post-IO unwritten extent conversion. i.e we already have code
that does exactly what we need, except for tweaking the
xfs_bmapi_write() flags for the zeroing behaviour we need.

So, a helper function in fs/xfs/xfs_bmap_util.c such as this:

/*
 * This function is used to allocate written extents over holes
 * and/or convert unwritten extents to written extents based on the
 * @flags passed to it.
 *
 * If @flags is zero, if will allocate written extents for holes and
 * delalloc extents across the range.
 *
 * If XFS_BMAPI_CONVERT is specified in @flags, then it will also do
 * conversion of unwritten extents in the range to written extents.
 *
 * If XFS_BMAPI_ZERO is specified in @flags, then both newly
 * allocated extents and converted unwritten extents will be
 * initialised to contain zeroes.
 *
 * If @update_isize is true, then if the range we are operating on
 * extends beyond the current EOF, extend i_size to offset+len
 * incrementally as extents in the range are allocated/converted.
 */
int
xfs_bmap_alloc_or_convert_range(
	struct xfs_inode	*ip,
	xfs_off_t		offset,
	xfs_off_t		len,
	uint32_t		flags,
	bool			update_isize)
{
	< guts of xfs_iomap_write_unwritten >
}

Then xfs_iomap_write_unwritten() can be factored to a one-liner
(or replaced altogether), and this new fallocate op pretty much
becomes:

	error = xfs_bmap_alloc_or_convert_range(ip, offset, len,
			XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO,
			new_size ? true : false);

The name of the function sucks but I can't think of anything better,
so if you've got a better idea then change it. :)

-Dave.
-- 
Dave Chinner
dgc@kernel.org

  parent reply	other threads:[~2026-03-11  0:12 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-09 18:12 [PATCH v11 2/2] xfs: add FALLOC_FL_WRITE_ZEROES to XFS code base Lukas Herbolt
2026-03-10  0:44 ` Darrick J. Wong
2026-03-10 10:10   ` Pankaj Raghav (Samsung)
2026-03-10 11:22     ` Lukas Herbolt
2026-03-10 15:02       ` Darrick J. Wong
2026-03-10 10:20   ` Lukas Herbolt
2026-03-10 14:57     ` Darrick J. Wong
2026-03-11  0:12 ` Dave Chinner [this message]
2026-03-12 21:36   ` Pankaj Raghav (Samsung)
2026-03-15 23:49     ` Dave Chinner
2026-03-16  7:23       ` Pankaj Raghav
2026-03-16  5:03   ` Lukas Herbolt
2026-03-17 12:20     ` Pankaj Raghav (Samsung)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=abCzhDSVmFx4PtWI@dread \
    --to=dgc@kernel.org \
    --cc=cem@kernel.org \
    --cc=djwong@kernel.org \
    --cc=hch@infradead.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=lukas@herbolt.com \
    --cc=p.raghav@samsung.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox