From: "Darrick J. Wong" <djwong@kernel.org>
To: Christoph Hellwig <hch@lst.de>
Cc: Carlos Maiolino <cem@kernel.org>,
Hans Holmberg <hans.holmberg@wdc.com>,
linux-xfs@vger.kernel.org
Subject: Re: [PATCH 27/45] xfs: implement buffered writes to zoned RT devices
Date: Wed, 19 Feb 2025 13:47:27 -0800 [thread overview]
Message-ID: <20250219214727.GV21808@frogsfrogsfrogs> (raw)
In-Reply-To: <20250218081153.3889537-28-hch@lst.de>
On Tue, Feb 18, 2025 at 09:10:30AM +0100, Christoph Hellwig wrote:
> Implement buffered writes including page faults and block zeroing for
> zoned RT devices. Buffered writes to zoned RT devices are split into
> three phases:
>
> 1) a reservation for the worst case data block usage is taken before
> acquiring the iolock. When there are enough free blocks but not
> enough available ones, garbage collection is kicked off to free the
> space before continuing with the write. If there isn't enough
> freeable space, the block reservation is reduced and a short write
> will happen as expected by normal Linux write semantics.
> 2) with the iolock held, the generic iomap buffered write code is
> called, which through the iomap_begin operation usually just inserts
> delalloc extents for the range in a single iteration. Only for
> overwrites of existing data that are not block aligned, or for zeroing
> operations, is the existing extent mapping read to fill out the srcmap
> and to figure out if zeroing is required.
> 3) the ->map_blocks callback to the generic iomap writeback code
> calls into the zoned space allocator to actually allocate on-disk
> space for the range before kicking off the writeback.
>
> Note that because all writes are out of place, truncate or hole punches
> that are not aligned to block size boundaries need to allocate space.
> For block zeroing from truncate, ->setattr is called with the iolock
> (aka i_rwsem) already held, so a hacky deviation from the above
> scheme is needed. In this case the space reservation is taken with
> the iolock held, but is required not to block and can dip into the
> reserved block pool. This can lead to -ENOSPC when truncating a
> file, which is unfortunate. But fixing the calling conventions in
> the VFS is probably much easier with code requiring it already in
> mainline.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
Hrrmrmrmh. I still don't like how this sprinkles allocation contexts
all over the codebase and then requires us to pass them through iomap
back to xfs' ->iomap_begin methods, and it's confusing to track the
XC_FREE_RTEXTENTS allocations out here vs. in xfs_trans_alloc_inode.
After thinking about this for two weeks, I haven't come up with a
clearly better way to avoid the situation where the zone evacuator has
to be able to take the IOLOCK and remap a bunch of files before it can
give any space back to the freecounter, which means that writer threads
cannot take the IOLOCK and then wait for a freecounter reservation.
That said, I remembered today that there are a few places like
xfs_file_buffered_write where we do things like:
write_retry:
	iolock = XFS_IOLOCK_EXCL;
	ret = xfs_ilock_iocb(iocb, iolock);
	if (ret)
		return ret;

	ret = iomap_file_buffered_write(iocb, from,
			&xfs_buffered_write_iomap_ops, NULL);
	if (ret == -EDQUOT && !cleared_space) {
		xfs_iunlock(ip, iolock);
		xfs_blockgc_free_quota(ip, XFS_ICWALK_FLAG_SYNC);
		cleared_space = true;
		goto write_retry;
	}
At first I wondered if we could do something similar to what we do for
non-zoned files?  e.g. do the XC_FREE_RTEXTENTS/AVAILABLE reservations
only in xfs_trans_reserve, and then have it return a special error code
(e.g. -ENOBUFS) if there's not enough rt space available and this is a
zoned fs; and then add more of these loops.
I don't want to go adding open-coded retry loops all over the place;
that would be pretty horrid.  But what if xfs_zone_alloc_ctx were
instead a general freecounter reservation context?  Then we could hide
all this "try to reserve space, push a garbage collector once if we
can't, and try again once" logic in an xfs_reserve_space() function,
and pass that reservation through the iomap functions to ->iomap_begin.
But a subtlety here is that the code under iomap_file_buffered_write
might not actually need to create any delalloc reservations, in which
case a worst-case reservation could fail with ENOSPC even though we
don't actually need to allocate a single byte.
Having said that, this is about the 5th time I've read through this
patch and I don't see anything obviously wrong now, so my last question
is: zoned files cannot have preallocations, and that's why we don't do
this for FALLOC_FL_ALLOCATE_RANGE, right?
If the answer is 'yes' then
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
--D
> ---
> fs/xfs/xfs_aops.c | 167 ++++++++++++++++++++++++++++++++--
> fs/xfs/xfs_bmap_util.c | 21 +++--
> fs/xfs/xfs_bmap_util.h | 12 ++-
> fs/xfs/xfs_file.c | 202 +++++++++++++++++++++++++++++++++++++----
> fs/xfs/xfs_iomap.c | 186 ++++++++++++++++++++++++++++++++++++-
> fs/xfs/xfs_iomap.h | 6 +-
> fs/xfs/xfs_iops.c | 31 ++++++-
> fs/xfs/xfs_reflink.c | 2 +-
> fs/xfs/xfs_trace.h | 1 +
> fs/xfs/xfs_zone_gc.c | 32 +++++++
> 10 files changed, 613 insertions(+), 47 deletions(-)
>
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 5077d52a775d..f7f70bb4e19d 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1,7 +1,7 @@
> // SPDX-License-Identifier: GPL-2.0
> /*
> * Copyright (c) 2000-2005 Silicon Graphics, Inc.
> - * Copyright (c) 2016-2018 Christoph Hellwig.
> + * Copyright (c) 2016-2025 Christoph Hellwig.
> * All Rights Reserved.
> */
> #include "xfs.h"
> @@ -20,6 +20,8 @@
> #include "xfs_errortag.h"
> #include "xfs_error.h"
> #include "xfs_icache.h"
> +#include "xfs_zone_alloc.h"
> +#include "xfs_rtgroup.h"
>
> struct xfs_writepage_ctx {
> struct iomap_writepage_ctx ctx;
> @@ -77,6 +79,26 @@ xfs_setfilesize(
> return xfs_trans_commit(tp);
> }
>
> +static void
> +xfs_ioend_put_open_zones(
> + struct iomap_ioend *ioend)
> +{
> + struct iomap_ioend *tmp;
> +
> + /*
> + * Put the open zone for all ioends merged into this one (if any).
> + */
> + list_for_each_entry(tmp, &ioend->io_list, io_list)
> + xfs_open_zone_put(tmp->io_private);
> +
> + /*
> + * The main ioend might not have an open zone if the submission failed
> + * before xfs_zone_alloc_and_submit got called.
> + */
> + if (ioend->io_private)
> + xfs_open_zone_put(ioend->io_private);
> +}
> +
> /*
> * IO write completion.
> */
> @@ -86,6 +108,7 @@ xfs_end_ioend(
> {
> struct xfs_inode *ip = XFS_I(ioend->io_inode);
> struct xfs_mount *mp = ip->i_mount;
> + bool is_zoned = xfs_is_zoned_inode(ip);
> xfs_off_t offset = ioend->io_offset;
> size_t size = ioend->io_size;
> unsigned int nofs_flag;
> @@ -116,9 +139,10 @@ xfs_end_ioend(
> error = blk_status_to_errno(ioend->io_bio.bi_status);
> if (unlikely(error)) {
> if (ioend->io_flags & IOMAP_IOEND_SHARED) {
> + ASSERT(!is_zoned);
> xfs_reflink_cancel_cow_range(ip, offset, size, true);
> xfs_bmap_punch_delalloc_range(ip, XFS_DATA_FORK, offset,
> - offset + size);
> + offset + size, NULL);
> }
> goto done;
> }
> @@ -126,7 +150,10 @@ xfs_end_ioend(
> /*
> * Success: commit the COW or unwritten blocks if needed.
> */
> - if (ioend->io_flags & IOMAP_IOEND_SHARED)
> + if (is_zoned)
> + error = xfs_zoned_end_io(ip, offset, size, ioend->io_sector,
> + ioend->io_private, NULLFSBLOCK);
> + else if (ioend->io_flags & IOMAP_IOEND_SHARED)
> error = xfs_reflink_end_cow(ip, offset, size);
> else if (ioend->io_flags & IOMAP_IOEND_UNWRITTEN)
> error = xfs_iomap_write_unwritten(ip, offset, size, false);
> @@ -134,6 +161,8 @@ xfs_end_ioend(
> if (!error && xfs_ioend_is_append(ioend))
> error = xfs_setfilesize(ip, offset, size);
> done:
> + if (is_zoned)
> + xfs_ioend_put_open_zones(ioend);
> iomap_finish_ioends(ioend, error);
> memalloc_nofs_restore(nofs_flag);
> }
> @@ -176,17 +205,27 @@ xfs_end_io(
> }
> }
>
> -STATIC void
> +static void
> xfs_end_bio(
> struct bio *bio)
> {
> struct iomap_ioend *ioend = iomap_ioend_from_bio(bio);
> struct xfs_inode *ip = XFS_I(ioend->io_inode);
> + struct xfs_mount *mp = ip->i_mount;
> unsigned long flags;
>
> + /*
> +	 * For appends, record the actually written block number and set the
> + * boundary flag if needed.
> + */
> + if (IS_ENABLED(CONFIG_XFS_RT) && bio_is_zone_append(bio)) {
> + ioend->io_sector = bio->bi_iter.bi_sector;
> + xfs_mark_rtg_boundary(ioend);
> + }
> +
> spin_lock_irqsave(&ip->i_ioend_lock, flags);
> if (list_empty(&ip->i_ioend_list))
> - WARN_ON_ONCE(!queue_work(ip->i_mount->m_unwritten_workqueue,
> + WARN_ON_ONCE(!queue_work(mp->m_unwritten_workqueue,
> &ip->i_ioend_work));
> list_add_tail(&ioend->io_list, &ip->i_ioend_list);
> spin_unlock_irqrestore(&ip->i_ioend_lock, flags);
> @@ -463,7 +502,7 @@ xfs_discard_folio(
> * folio itself and not the start offset that is passed in.
> */
> xfs_bmap_punch_delalloc_range(ip, XFS_DATA_FORK, pos,
> - folio_pos(folio) + folio_size(folio));
> + folio_pos(folio) + folio_size(folio), NULL);
> }
>
> static const struct iomap_writeback_ops xfs_writeback_ops = {
> @@ -472,15 +511,125 @@ static const struct iomap_writeback_ops xfs_writeback_ops = {
> .discard_folio = xfs_discard_folio,
> };
>
> +struct xfs_zoned_writepage_ctx {
> + struct iomap_writepage_ctx ctx;
> + struct xfs_open_zone *open_zone;
> +};
> +
> +static inline struct xfs_zoned_writepage_ctx *
> +XFS_ZWPC(struct iomap_writepage_ctx *ctx)
> +{
> + return container_of(ctx, struct xfs_zoned_writepage_ctx, ctx);
> +}
> +
> +static int
> +xfs_zoned_map_blocks(
> + struct iomap_writepage_ctx *wpc,
> + struct inode *inode,
> + loff_t offset,
> + unsigned int len)
> +{
> + struct xfs_inode *ip = XFS_I(inode);
> + struct xfs_mount *mp = ip->i_mount;
> + xfs_fileoff_t offset_fsb = XFS_B_TO_FSBT(mp, offset);
> + xfs_fileoff_t end_fsb = XFS_B_TO_FSB(mp, offset + len);
> + xfs_filblks_t count_fsb;
> + struct xfs_bmbt_irec imap, del;
> + struct xfs_iext_cursor icur;
> +
> + if (xfs_is_shutdown(mp))
> + return -EIO;
> +
> + XFS_ERRORTAG_DELAY(mp, XFS_ERRTAG_WB_DELAY_MS);
> +
> + /*
> + * All dirty data must be covered by delalloc extents. But truncate can
> + * remove delalloc extents underneath us or reduce their size.
> + * Returning a hole tells iomap to not write back any data from this
> + * range, which is the right thing to do in that case.
> + *
> + * Otherwise just tell iomap to treat ranges previously covered by a
> + * delalloc extent as mapped. The actual block allocation will be done
> + * just before submitting the bio.
> + *
> + * This implies we never map outside folios that are locked or marked
> +	 * as under writeback, and thus there is no need to check the fork
> + * count here.
> + */
> + xfs_ilock(ip, XFS_ILOCK_EXCL);
> + if (!xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb, &icur, &imap))
> + imap.br_startoff = end_fsb; /* fake a hole past EOF */
> + if (imap.br_startoff > offset_fsb) {
> + imap.br_blockcount = imap.br_startoff - offset_fsb;
> + imap.br_startoff = offset_fsb;
> + imap.br_startblock = HOLESTARTBLOCK;
> + imap.br_state = XFS_EXT_NORM;
> + xfs_iunlock(ip, XFS_ILOCK_EXCL);
> + xfs_bmbt_to_iomap(ip, &wpc->iomap, &imap, 0, 0, 0);
> + return 0;
> + }
> + end_fsb = min(end_fsb, imap.br_startoff + imap.br_blockcount);
> + count_fsb = end_fsb - offset_fsb;
> +
> + del = imap;
> + xfs_trim_extent(&del, offset_fsb, count_fsb);
> + xfs_bmap_del_extent_delay(ip, XFS_COW_FORK, &icur, &imap, &del,
> + XFS_BMAPI_REMAP);
> + xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +
> + wpc->iomap.type = IOMAP_MAPPED;
> + wpc->iomap.flags = IOMAP_F_DIRTY;
> + wpc->iomap.bdev = mp->m_rtdev_targp->bt_bdev;
> + wpc->iomap.offset = offset;
> + wpc->iomap.length = XFS_FSB_TO_B(mp, count_fsb);
> + wpc->iomap.flags = IOMAP_F_ANON_WRITE;
> +
> + trace_xfs_zoned_map_blocks(ip, offset, wpc->iomap.length);
> + return 0;
> +}
> +
> +static int
> +xfs_zoned_submit_ioend(
> + struct iomap_writepage_ctx *wpc,
> + int status)
> +{
> + wpc->ioend->io_bio.bi_end_io = xfs_end_bio;
> + if (status)
> + return status;
> + xfs_zone_alloc_and_submit(wpc->ioend, &XFS_ZWPC(wpc)->open_zone);
> + return 0;
> +}
> +
> +static const struct iomap_writeback_ops xfs_zoned_writeback_ops = {
> + .map_blocks = xfs_zoned_map_blocks,
> + .submit_ioend = xfs_zoned_submit_ioend,
> + .discard_folio = xfs_discard_folio,
> +};
> +
> STATIC int
> xfs_vm_writepages(
> struct address_space *mapping,
> struct writeback_control *wbc)
> {
> - struct xfs_writepage_ctx wpc = { };
> + struct xfs_inode *ip = XFS_I(mapping->host);
> +
> + xfs_iflags_clear(ip, XFS_ITRUNCATED);
>
> - xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
> - return iomap_writepages(mapping, wbc, &wpc.ctx, &xfs_writeback_ops);
> + if (xfs_is_zoned_inode(ip)) {
> + struct xfs_zoned_writepage_ctx xc = { };
> + int error;
> +
> + error = iomap_writepages(mapping, wbc, &xc.ctx,
> + &xfs_zoned_writeback_ops);
> + if (xc.open_zone)
> + xfs_open_zone_put(xc.open_zone);
> + return error;
> + } else {
> + struct xfs_writepage_ctx wpc = { };
> +
> + return iomap_writepages(mapping, wbc, &wpc.ctx,
> + &xfs_writeback_ops);
> + }
> }
>
> STATIC int
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index c623688e457c..c87422de2d77 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -30,6 +30,7 @@
> #include "xfs_reflink.h"
> #include "xfs_rtbitmap.h"
> #include "xfs_rtgroup.h"
> +#include "xfs_zone_alloc.h"
>
> /* Kernel only BMAP related definitions and functions */
>
> @@ -436,7 +437,8 @@ xfs_bmap_punch_delalloc_range(
> struct xfs_inode *ip,
> int whichfork,
> xfs_off_t start_byte,
> - xfs_off_t end_byte)
> + xfs_off_t end_byte,
> + struct xfs_zone_alloc_ctx *ac)
> {
> struct xfs_mount *mp = ip->i_mount;
> struct xfs_ifork *ifp = xfs_ifork_ptr(ip, whichfork);
> @@ -467,7 +469,10 @@ xfs_bmap_punch_delalloc_range(
> continue;
> }
>
> - xfs_bmap_del_extent_delay(ip, whichfork, &icur, &got, &del, 0);
> + xfs_bmap_del_extent_delay(ip, whichfork, &icur, &got, &del,
> + ac ? XFS_BMAPI_REMAP : 0);
> + if (xfs_is_zoned_inode(ip) && ac)
> + ac->reserved_blocks += del.br_blockcount;
> if (!xfs_iext_get_extent(ifp, &icur, &got))
> break;
> }
> @@ -582,7 +587,7 @@ xfs_free_eofblocks(
> if (ip->i_delayed_blks) {
> xfs_bmap_punch_delalloc_range(ip, XFS_DATA_FORK,
> round_up(XFS_ISIZE(ip), mp->m_sb.sb_blocksize),
> - LLONG_MAX);
> + LLONG_MAX, NULL);
> }
> xfs_inode_clear_eofblocks_tag(ip);
> return 0;
> @@ -825,7 +830,8 @@ int
> xfs_free_file_space(
> struct xfs_inode *ip,
> xfs_off_t offset,
> - xfs_off_t len)
> + xfs_off_t len,
> + struct xfs_zone_alloc_ctx *ac)
> {
> struct xfs_mount *mp = ip->i_mount;
> xfs_fileoff_t startoffset_fsb;
> @@ -880,7 +886,7 @@ xfs_free_file_space(
> return 0;
> if (offset + len > XFS_ISIZE(ip))
> len = XFS_ISIZE(ip) - offset;
> - error = xfs_zero_range(ip, offset, len, NULL);
> + error = xfs_zero_range(ip, offset, len, ac, NULL);
> if (error)
> return error;
>
> @@ -968,7 +974,8 @@ int
> xfs_collapse_file_space(
> struct xfs_inode *ip,
> xfs_off_t offset,
> - xfs_off_t len)
> + xfs_off_t len,
> + struct xfs_zone_alloc_ctx *ac)
> {
> struct xfs_mount *mp = ip->i_mount;
> struct xfs_trans *tp;
> @@ -981,7 +988,7 @@ xfs_collapse_file_space(
>
> trace_xfs_collapse_file_space(ip);
>
> - error = xfs_free_file_space(ip, offset, len);
> + error = xfs_free_file_space(ip, offset, len, ac);
> if (error)
> return error;
>
> diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
> index b29760d36e1a..c477b3361630 100644
> --- a/fs/xfs/xfs_bmap_util.h
> +++ b/fs/xfs/xfs_bmap_util.h
> @@ -15,6 +15,7 @@ struct xfs_inode;
> struct xfs_mount;
> struct xfs_trans;
> struct xfs_bmalloca;
> +struct xfs_zone_alloc_ctx;
>
> #ifdef CONFIG_XFS_RT
> int xfs_bmap_rtalloc(struct xfs_bmalloca *ap);
> @@ -31,7 +32,8 @@ xfs_bmap_rtalloc(struct xfs_bmalloca *ap)
> #endif /* CONFIG_XFS_RT */
>
> void xfs_bmap_punch_delalloc_range(struct xfs_inode *ip, int whichfork,
> - xfs_off_t start_byte, xfs_off_t end_byte);
> + xfs_off_t start_byte, xfs_off_t end_byte,
> + struct xfs_zone_alloc_ctx *ac);
>
> struct kgetbmap {
> __s64 bmv_offset; /* file offset of segment in blocks */
> @@ -54,13 +56,13 @@ int xfs_bmap_last_extent(struct xfs_trans *tp, struct xfs_inode *ip,
>
> /* preallocation and hole punch interface */
> int xfs_alloc_file_space(struct xfs_inode *ip, xfs_off_t offset,
> - xfs_off_t len);
> + xfs_off_t len);
> int xfs_free_file_space(struct xfs_inode *ip, xfs_off_t offset,
> - xfs_off_t len);
> + xfs_off_t len, struct xfs_zone_alloc_ctx *ac);
> int xfs_collapse_file_space(struct xfs_inode *, xfs_off_t offset,
> - xfs_off_t len);
> + xfs_off_t len, struct xfs_zone_alloc_ctx *ac);
> int xfs_insert_file_space(struct xfs_inode *, xfs_off_t offset,
> - xfs_off_t len);
> + xfs_off_t len);
>
> /* EOF block manipulation functions */
> bool xfs_can_free_eofblocks(struct xfs_inode *ip);
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index dc27a6c36bf7..b6dc136970b7 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -25,6 +25,7 @@
> #include "xfs_iomap.h"
> #include "xfs_reflink.h"
> #include "xfs_file.h"
> +#include "xfs_zone_alloc.h"
>
> #include <linux/dax.h>
> #include <linux/falloc.h>
> @@ -360,7 +361,8 @@ xfs_file_write_zero_eof(
> struct iov_iter *from,
> unsigned int *iolock,
> size_t count,
> - bool *drained_dio)
> + bool *drained_dio,
> + struct xfs_zone_alloc_ctx *ac)
> {
> struct xfs_inode *ip = XFS_I(iocb->ki_filp->f_mapping->host);
> loff_t isize;
> @@ -414,7 +416,7 @@ xfs_file_write_zero_eof(
> trace_xfs_zero_eof(ip, isize, iocb->ki_pos - isize);
>
> xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
> - error = xfs_zero_range(ip, isize, iocb->ki_pos - isize, NULL);
> + error = xfs_zero_range(ip, isize, iocb->ki_pos - isize, ac, NULL);
> xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
>
> return error;
> @@ -431,7 +433,8 @@ STATIC ssize_t
> xfs_file_write_checks(
> struct kiocb *iocb,
> struct iov_iter *from,
> - unsigned int *iolock)
> + unsigned int *iolock,
> + struct xfs_zone_alloc_ctx *ac)
> {
> struct inode *inode = iocb->ki_filp->f_mapping->host;
> size_t count = iov_iter_count(from);
> @@ -481,7 +484,7 @@ xfs_file_write_checks(
> */
> if (iocb->ki_pos > i_size_read(inode)) {
> error = xfs_file_write_zero_eof(iocb, from, iolock, count,
> - &drained_dio);
> + &drained_dio, ac);
> if (error == 1)
> goto restart;
> if (error)
> @@ -491,6 +494,48 @@ xfs_file_write_checks(
> return kiocb_modified(iocb);
> }
>
> +static ssize_t
> +xfs_zoned_write_space_reserve(
> + struct xfs_inode *ip,
> + struct kiocb *iocb,
> + struct iov_iter *from,
> + unsigned int flags,
> + struct xfs_zone_alloc_ctx *ac)
> +{
> + loff_t count = iov_iter_count(from);
> + int error;
> +
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + flags |= XFS_ZR_NOWAIT;
> +
> + /*
> + * Check the rlimit and LFS boundary first so that we don't over-reserve
> + * by possibly a lot.
> + *
> + * The generic write path will redo this check later, and it might have
> + * changed by then. If it got expanded we'll stick to our earlier
> + * smaller limit, and if it is decreased the new smaller limit will be
> + * used and our extra space reservation will be returned after finishing
> + * the write.
> + */
> + error = generic_write_check_limits(iocb->ki_filp, iocb->ki_pos, &count);
> + if (error)
> + return error;
> +
> + /*
> + * Sloppily round up count to file system blocks.
> + *
> + * This will often reserve an extra block, but that avoids having to look
> + * at the start offset, which isn't stable for O_APPEND until taking the
> + * iolock. Also we need to reserve a block each for zeroing the old
> + * EOF block and the new start block if they are unaligned.
> + *
> + * Any remaining block will be returned after the write.
> + */
> + return xfs_zoned_space_reserve(ip,
> + XFS_B_TO_FSB(ip->i_mount, count) + 1 + 2, flags, ac);
> +}
> +
> static int
> xfs_dio_write_end_io(
> struct kiocb *iocb,
> @@ -597,7 +642,7 @@ xfs_file_dio_write_aligned(
> ret = xfs_ilock_iocb_for_write(iocb, &iolock);
> if (ret)
> return ret;
> - ret = xfs_file_write_checks(iocb, from, &iolock);
> + ret = xfs_file_write_checks(iocb, from, &iolock, NULL);
> if (ret)
> goto out_unlock;
>
> @@ -675,7 +720,7 @@ xfs_file_dio_write_unaligned(
> goto out_unlock;
> }
>
> - ret = xfs_file_write_checks(iocb, from, &iolock);
> + ret = xfs_file_write_checks(iocb, from, &iolock, NULL);
> if (ret)
> goto out_unlock;
>
> @@ -749,7 +794,7 @@ xfs_file_dax_write(
> ret = xfs_ilock_iocb(iocb, iolock);
> if (ret)
> return ret;
> - ret = xfs_file_write_checks(iocb, from, &iolock);
> + ret = xfs_file_write_checks(iocb, from, &iolock, NULL);
> if (ret)
> goto out;
>
> @@ -793,7 +838,7 @@ xfs_file_buffered_write(
> if (ret)
> return ret;
>
> - ret = xfs_file_write_checks(iocb, from, &iolock);
> + ret = xfs_file_write_checks(iocb, from, &iolock, NULL);
> if (ret)
> goto out;
>
> @@ -840,6 +885,67 @@ xfs_file_buffered_write(
> return ret;
> }
>
> +STATIC ssize_t
> +xfs_file_buffered_write_zoned(
> + struct kiocb *iocb,
> + struct iov_iter *from)
> +{
> + struct xfs_inode *ip = XFS_I(iocb->ki_filp->f_mapping->host);
> + struct xfs_mount *mp = ip->i_mount;
> + unsigned int iolock = XFS_IOLOCK_EXCL;
> + bool cleared_space = false;
> + struct xfs_zone_alloc_ctx ac = { };
> + ssize_t ret;
> +
> + ret = xfs_zoned_write_space_reserve(ip, iocb, from, XFS_ZR_GREEDY, &ac);
> + if (ret < 0)
> + return ret;
> +
> + ret = xfs_ilock_iocb(iocb, iolock);
> + if (ret)
> + goto out_unreserve;
> +
> + ret = xfs_file_write_checks(iocb, from, &iolock, &ac);
> + if (ret)
> + goto out_unlock;
> +
> + /*
> + * Truncate the iter to the length that we were actually able to
> + * allocate blocks for. This needs to happen after
> + * xfs_file_write_checks, because that assigns ki_pos for O_APPEND
> + * writes.
> + */
> + iov_iter_truncate(from,
> + XFS_FSB_TO_B(mp, ac.reserved_blocks) -
> + (iocb->ki_pos & mp->m_blockmask));
> + if (!iov_iter_count(from))
> + goto out_unlock;
> +
> +retry:
> + trace_xfs_file_buffered_write(iocb, from);
> + ret = iomap_file_buffered_write(iocb, from,
> + &xfs_buffered_write_iomap_ops, &ac);
> + if (ret == -ENOSPC && !cleared_space) {
> + /*
> + * Kick off writeback to convert delalloc space and release the
> + * usually too pessimistic indirect block reservations.
> + */
> + xfs_flush_inodes(mp);
> + cleared_space = true;
> + goto retry;
> + }
> +
> +out_unlock:
> + xfs_iunlock(ip, iolock);
> +out_unreserve:
> + xfs_zoned_space_unreserve(ip, &ac);
> + if (ret > 0) {
> + XFS_STATS_ADD(mp, xs_write_bytes, ret);
> + ret = generic_write_sync(iocb, ret);
> + }
> + return ret;
> +}
> +
> STATIC ssize_t
> xfs_file_write_iter(
> struct kiocb *iocb,
> @@ -887,6 +993,8 @@ xfs_file_write_iter(
> return ret;
> }
>
> + if (xfs_is_zoned_inode(ip))
> + return xfs_file_buffered_write_zoned(iocb, from);
> return xfs_file_buffered_write(iocb, from);
> }
>
> @@ -941,7 +1049,8 @@ static int
> xfs_falloc_collapse_range(
> struct file *file,
> loff_t offset,
> - loff_t len)
> + loff_t len,
> + struct xfs_zone_alloc_ctx *ac)
> {
> struct inode *inode = file_inode(file);
> loff_t new_size = i_size_read(inode) - len;
> @@ -957,7 +1066,7 @@ xfs_falloc_collapse_range(
> if (offset + len >= i_size_read(inode))
> return -EINVAL;
>
> - error = xfs_collapse_file_space(XFS_I(inode), offset, len);
> + error = xfs_collapse_file_space(XFS_I(inode), offset, len, ac);
> if (error)
> return error;
> return xfs_falloc_setsize(file, new_size);
> @@ -1013,7 +1122,8 @@ xfs_falloc_zero_range(
> struct file *file,
> int mode,
> loff_t offset,
> - loff_t len)
> + loff_t len,
> + struct xfs_zone_alloc_ctx *ac)
> {
> struct inode *inode = file_inode(file);
> unsigned int blksize = i_blocksize(inode);
> @@ -1026,7 +1136,7 @@ xfs_falloc_zero_range(
> if (error)
> return error;
>
> - error = xfs_free_file_space(XFS_I(inode), offset, len);
> + error = xfs_free_file_space(XFS_I(inode), offset, len, ac);
> if (error)
> return error;
>
> @@ -1107,12 +1217,29 @@ xfs_file_fallocate(
> struct xfs_inode *ip = XFS_I(inode);
> long error;
> uint iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
> + struct xfs_zone_alloc_ctx ac = { };
>
> if (!S_ISREG(inode->i_mode))
> return -EINVAL;
> if (mode & ~XFS_FALLOC_FL_SUPPORTED)
> return -EOPNOTSUPP;
>
> + /*
> + * For zoned file systems, zeroing the first and last block of a hole
> + * punch requires allocating a new block to rewrite the remaining data
> +	 * and new zeroes out of place.  Get a reservation for those before
> + * taking the iolock. Dip into the reserved pool because we are
> + * expected to be able to punch a hole even on a completely full
> + * file system.
> + */
> + if (xfs_is_zoned_inode(ip) &&
> + (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE |
> + FALLOC_FL_COLLAPSE_RANGE))) {
> + error = xfs_zoned_space_reserve(ip, 2, XFS_ZR_RESERVED, &ac);
> + if (error)
> + return error;
> + }
> +
> xfs_ilock(ip, iolock);
> error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP);
> if (error)
> @@ -1133,16 +1260,16 @@ xfs_file_fallocate(
>
> switch (mode & FALLOC_FL_MODE_MASK) {
> case FALLOC_FL_PUNCH_HOLE:
> - error = xfs_free_file_space(ip, offset, len);
> + error = xfs_free_file_space(ip, offset, len, &ac);
> break;
> case FALLOC_FL_COLLAPSE_RANGE:
> - error = xfs_falloc_collapse_range(file, offset, len);
> + error = xfs_falloc_collapse_range(file, offset, len, &ac);
> break;
> case FALLOC_FL_INSERT_RANGE:
> error = xfs_falloc_insert_range(file, offset, len);
> break;
> case FALLOC_FL_ZERO_RANGE:
> - error = xfs_falloc_zero_range(file, mode, offset, len);
> + error = xfs_falloc_zero_range(file, mode, offset, len, &ac);
> break;
> case FALLOC_FL_UNSHARE_RANGE:
> error = xfs_falloc_unshare_range(file, mode, offset, len);
> @@ -1160,6 +1287,8 @@ xfs_file_fallocate(
>
> out_unlock:
> xfs_iunlock(ip, iolock);
> + if (xfs_is_zoned_inode(ip))
> + xfs_zoned_space_unreserve(ip, &ac);
> return error;
> }
>
> @@ -1488,9 +1617,10 @@ xfs_dax_read_fault(
> * i_lock (XFS - extent map serialisation)
> */
> static vm_fault_t
> -xfs_write_fault(
> +__xfs_write_fault(
> struct vm_fault *vmf,
> - unsigned int order)
> + unsigned int order,
> + struct xfs_zone_alloc_ctx *ac)
> {
> struct inode *inode = file_inode(vmf->vma->vm_file);
> struct xfs_inode *ip = XFS_I(inode);
> @@ -1528,13 +1658,49 @@ xfs_write_fault(
> ret = xfs_dax_fault_locked(vmf, order, true);
> else
> ret = iomap_page_mkwrite(vmf, &xfs_buffered_write_iomap_ops,
> - NULL);
> + ac);
> xfs_iunlock(ip, lock_mode);
>
> sb_end_pagefault(inode->i_sb);
> return ret;
> }
>
> +static vm_fault_t
> +xfs_write_fault_zoned(
> + struct vm_fault *vmf,
> + unsigned int order)
> +{
> + struct xfs_inode *ip = XFS_I(file_inode(vmf->vma->vm_file));
> + unsigned int len = folio_size(page_folio(vmf->page));
> + struct xfs_zone_alloc_ctx ac = { };
> + int error;
> + vm_fault_t ret;
> +
> + /*
> + * This could over-allocate as it doesn't check for truncation.
> + *
> +	 * But as the overallocation is limited to less than a folio and will
> +	 * be released instantly, that's just fine.
> + */
> + error = xfs_zoned_space_reserve(ip, XFS_B_TO_FSB(ip->i_mount, len), 0,
> + &ac);
> + if (error < 0)
> + return vmf_fs_error(error);
> + ret = __xfs_write_fault(vmf, order, &ac);
> + xfs_zoned_space_unreserve(ip, &ac);
> + return ret;
> +}
> +
> +static vm_fault_t
> +xfs_write_fault(
> + struct vm_fault *vmf,
> + unsigned int order)
> +{
> + if (xfs_is_zoned_inode(XFS_I(file_inode(vmf->vma->vm_file))))
> + return xfs_write_fault_zoned(vmf, order);
> + return __xfs_write_fault(vmf, order, NULL);
> +}
> +
> static inline bool
> xfs_is_write_fault(
> struct vm_fault *vmf)
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 45c339565d88..0e64a0ce1622 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -31,6 +31,7 @@
> #include "xfs_health.h"
> #include "xfs_rtbitmap.h"
> #include "xfs_icache.h"
> +#include "xfs_zone_alloc.h"
>
> #define XFS_ALLOC_ALIGN(mp, off) \
> (((off) >> mp->m_allocsize_log) << mp->m_allocsize_log)
> @@ -1268,6 +1269,176 @@ xfs_bmapi_reserve_delalloc(
> return error;
> }
>
> +static int
> +xfs_zoned_buffered_write_iomap_begin(
> + struct inode *inode,
> + loff_t offset,
> + loff_t count,
> + unsigned flags,
> + struct iomap *iomap,
> + struct iomap *srcmap)
> +{
> + struct iomap_iter *iter =
> + container_of(iomap, struct iomap_iter, iomap);
> + struct xfs_zone_alloc_ctx *ac = iter->private;
> + struct xfs_inode *ip = XFS_I(inode);
> + struct xfs_mount *mp = ip->i_mount;
> + xfs_fileoff_t offset_fsb = XFS_B_TO_FSBT(mp, offset);
> + xfs_fileoff_t end_fsb = xfs_iomap_end_fsb(mp, offset, count);
> + u16 iomap_flags = IOMAP_F_SHARED;
> + unsigned int lockmode = XFS_ILOCK_EXCL;
> + xfs_filblks_t count_fsb;
> + xfs_extlen_t indlen;
> + struct xfs_bmbt_irec got;
> + struct xfs_iext_cursor icur;
> + int error = 0;
> +
> + ASSERT(!xfs_get_extsz_hint(ip));
> + ASSERT(!(flags & IOMAP_UNSHARE));
> + ASSERT(ac);
> +
> + if (xfs_is_shutdown(mp))
> + return -EIO;
> +
> + error = xfs_qm_dqattach(ip);
> + if (error)
> + return error;
> +
> + error = xfs_ilock_for_iomap(ip, flags, &lockmode);
> + if (error)
> + return error;
> +
> + if (XFS_IS_CORRUPT(mp, !xfs_ifork_has_extents(&ip->i_df)) ||
> + XFS_TEST_ERROR(false, mp, XFS_ERRTAG_BMAPIFORMAT)) {
> + xfs_bmap_mark_sick(ip, XFS_DATA_FORK);
> + error = -EFSCORRUPTED;
> + goto out_unlock;
> + }
> +
> + XFS_STATS_INC(mp, xs_blk_mapw);
> +
> + error = xfs_iread_extents(NULL, ip, XFS_DATA_FORK);
> + if (error)
> + goto out_unlock;
> +
> + /*
> + * For zeroing operations check if there is any data to zero first.
> + *
> + * For regular writes we always need to allocate new blocks, but need to
> + * provide the source mapping when the range is unaligned to support
> + * read-modify-write of the whole block in the page cache.
> + *
> + * In either case we need to limit the reported range to the boundaries
> + * of the source map in the data fork.
> + */
> + if (!IS_ALIGNED(offset, mp->m_sb.sb_blocksize) ||
> + !IS_ALIGNED(offset + count, mp->m_sb.sb_blocksize) ||
> + (flags & IOMAP_ZERO)) {
> + struct xfs_bmbt_irec smap;
> + struct xfs_iext_cursor scur;
> +
> + if (!xfs_iext_lookup_extent(ip, &ip->i_df, offset_fsb, &scur,
> + &smap))
> + smap.br_startoff = end_fsb; /* fake hole until EOF */
> + if (smap.br_startoff > offset_fsb) {
> + /*
> + * We never need to allocate blocks for zeroing a hole.
> + */
> + if (flags & IOMAP_ZERO) {
> + xfs_hole_to_iomap(ip, iomap, offset_fsb,
> + smap.br_startoff);
> + goto out_unlock;
> + }
> + end_fsb = min(end_fsb, smap.br_startoff);
> + } else {
> + end_fsb = min(end_fsb,
> + smap.br_startoff + smap.br_blockcount);
> + xfs_trim_extent(&smap, offset_fsb,
> + end_fsb - offset_fsb);
> + error = xfs_bmbt_to_iomap(ip, srcmap, &smap, flags, 0,
> + xfs_iomap_inode_sequence(ip, 0));
> + if (error)
> + goto out_unlock;
> + }
> + }
> +
> + if (!ip->i_cowfp)
> + xfs_ifork_init_cow(ip);
> +
> + if (!xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb, &icur, &got))
> + got.br_startoff = end_fsb;
> + if (got.br_startoff <= offset_fsb) {
> + trace_xfs_reflink_cow_found(ip, &got);
> + goto done;
> + }
> +
> + /*
> + * Cap the maximum length to keep the chunks of work done here somewhat
> + * symmetric with the work writeback does.
> + */
> + end_fsb = min(end_fsb, got.br_startoff);
> + count_fsb = min3(end_fsb - offset_fsb, XFS_MAX_BMBT_EXTLEN,
> + XFS_B_TO_FSB(mp, 1024 * PAGE_SIZE));
> +
> + /*
> + * The block reservation is supposed to cover all blocks that the
> +	 * operation could possibly write, but there is a nasty corner case
> + * where blocks could be stolen from underneath us:
> + *
> + * 1) while this thread iterates over a larger buffered write,
> + * 2) another thread is causing a write fault that calls into
> + * ->page_mkwrite in range this thread writes to, using up the
> + * delalloc reservation created by a previous call to this function.
> + * 3) another thread does direct I/O on the range that the write fault
> + * happened on, which causes writeback of the dirty data.
> +	 * 4) this then sets the stale flag, which cuts the current iomap
> + * iteration short, causing the new call to ->iomap_begin that gets
> + * us here again, but now without a sufficient reservation.
> + *
> + * This is a very unusual I/O pattern, and nothing but generic/095 is
> + * known to hit it. There's not really much we can do here, so turn this
> + * into a short write.
> + */
> + if (count_fsb > ac->reserved_blocks) {
> + xfs_warn_ratelimited(mp,
> +"Short write on ino 0x%llx comm %.20s due to three-way race with write fault and direct I/O",
> + ip->i_ino, current->comm);
> + count_fsb = ac->reserved_blocks;
> + if (!count_fsb) {
> + error = -EIO;
> + goto out_unlock;
> + }
> + }
> +
> + error = xfs_quota_reserve_blkres(ip, count_fsb);
> + if (error)
> + goto out_unlock;
> +
> + indlen = xfs_bmap_worst_indlen(ip, count_fsb);
> + error = xfs_dec_fdblocks(mp, indlen, false);
> + if (error)
> + goto out_unlock;
> + ip->i_delayed_blks += count_fsb;
> + xfs_mod_delalloc(ip, count_fsb, indlen);
> +
> + got.br_startoff = offset_fsb;
> + got.br_startblock = nullstartblock(indlen);
> + got.br_blockcount = count_fsb;
> + got.br_state = XFS_EXT_NORM;
> + xfs_bmap_add_extent_hole_delay(ip, XFS_COW_FORK, &icur, &got);
> + ac->reserved_blocks -= count_fsb;
> + iomap_flags |= IOMAP_F_NEW;
> +
> + trace_xfs_iomap_alloc(ip, offset, XFS_FSB_TO_B(mp, count_fsb),
> + XFS_COW_FORK, &got);
> +done:
> + error = xfs_bmbt_to_iomap(ip, iomap, &got, flags, iomap_flags,
> + xfs_iomap_inode_sequence(ip, IOMAP_F_SHARED));
> +out_unlock:
> + xfs_iunlock(ip, lockmode);
> + return error;
> +}
> +
> static int
> xfs_buffered_write_iomap_begin(
> struct inode *inode,
> @@ -1294,6 +1465,10 @@ xfs_buffered_write_iomap_begin(
> if (xfs_is_shutdown(mp))
> return -EIO;
>
> + if (xfs_is_zoned_inode(ip))
> + return xfs_zoned_buffered_write_iomap_begin(inode, offset,
> + count, flags, iomap, srcmap);
> +
> /* we can't use delayed allocations when using extent size hints */
> if (xfs_get_extsz_hint(ip))
> return xfs_direct_write_iomap_begin(inode, offset, count,
> @@ -1526,10 +1701,13 @@ xfs_buffered_write_delalloc_punch(
> loff_t length,
> struct iomap *iomap)
> {
> + struct iomap_iter *iter =
> + container_of(iomap, struct iomap_iter, iomap);
> +
> xfs_bmap_punch_delalloc_range(XFS_I(inode),
> (iomap->flags & IOMAP_F_SHARED) ?
> XFS_COW_FORK : XFS_DATA_FORK,
> - offset, offset + length);
> + offset, offset + length, iter->private);
> }
>
> static int
> @@ -1766,6 +1944,7 @@ xfs_zero_range(
> struct xfs_inode *ip,
> loff_t pos,
> loff_t len,
> + struct xfs_zone_alloc_ctx *ac,
> bool *did_zero)
> {
> struct inode *inode = VFS_I(ip);
> @@ -1776,13 +1955,14 @@ xfs_zero_range(
> return dax_zero_range(inode, pos, len, did_zero,
> &xfs_dax_write_iomap_ops);
> return iomap_zero_range(inode, pos, len, did_zero,
> - &xfs_buffered_write_iomap_ops, NULL);
> + &xfs_buffered_write_iomap_ops, ac);
> }
>
> int
> xfs_truncate_page(
> struct xfs_inode *ip,
> loff_t pos,
> + struct xfs_zone_alloc_ctx *ac,
> bool *did_zero)
> {
> struct inode *inode = VFS_I(ip);
> @@ -1791,5 +1971,5 @@ xfs_truncate_page(
> return dax_truncate_page(inode, pos, did_zero,
> &xfs_dax_write_iomap_ops);
> return iomap_truncate_page(inode, pos, did_zero,
> - &xfs_buffered_write_iomap_ops, NULL);
> + &xfs_buffered_write_iomap_ops, ac);
> }
> diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
> index 8347268af727..bc8a00cad854 100644
> --- a/fs/xfs/xfs_iomap.h
> +++ b/fs/xfs/xfs_iomap.h
> @@ -10,6 +10,7 @@
>
> struct xfs_inode;
> struct xfs_bmbt_irec;
> +struct xfs_zone_alloc_ctx;
>
> int xfs_iomap_write_direct(struct xfs_inode *ip, xfs_fileoff_t offset_fsb,
> xfs_fileoff_t count_fsb, unsigned int flags,
> @@ -24,8 +25,9 @@ int xfs_bmbt_to_iomap(struct xfs_inode *ip, struct iomap *iomap,
> u16 iomap_flags, u64 sequence_cookie);
>
> int xfs_zero_range(struct xfs_inode *ip, loff_t pos, loff_t len,
> - bool *did_zero);
> -int xfs_truncate_page(struct xfs_inode *ip, loff_t pos, bool *did_zero);
> + struct xfs_zone_alloc_ctx *ac, bool *did_zero);
> +int xfs_truncate_page(struct xfs_inode *ip, loff_t pos,
> + struct xfs_zone_alloc_ctx *ac, bool *did_zero);
>
> static inline xfs_filblks_t
> xfs_aligned_fsb_count(
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 40289fe6f5b2..444193f543ef 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -29,6 +29,7 @@
> #include "xfs_xattr.h"
> #include "xfs_file.h"
> #include "xfs_bmap.h"
> +#include "xfs_zone_alloc.h"
>
> #include <linux/posix_acl.h>
> #include <linux/security.h>
> @@ -854,6 +855,7 @@ xfs_setattr_size(
> uint lock_flags = 0;
> uint resblks = 0;
> bool did_zeroing = false;
> + struct xfs_zone_alloc_ctx ac = { };
>
> xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);
> ASSERT(S_ISREG(inode->i_mode));
> @@ -889,6 +891,28 @@ xfs_setattr_size(
> */
> inode_dio_wait(inode);
>
> + /*
> + * Normally xfs_zoned_space_reserve is supposed to be called outside the
> + * IOLOCK. For truncate we can't do that since ->setattr is called with
> + * it already held by the VFS. So for now chicken out and try to
> + * allocate space under it.
> + *
> + * To avoid deadlocks this means we can't block waiting for space, which
> + * can lead to spurious -ENOSPC if there are no directly available
> + * blocks. We mitigate this a bit by allowing zeroing to dip into the
> + * reserved pool, but eventually the VFS calling convention needs to
> + * change.
> + */
> + if (xfs_is_zoned_inode(ip)) {
> + error = xfs_zoned_space_reserve(ip, 1,
> + XFS_ZR_NOWAIT | XFS_ZR_RESERVED, &ac);
> + if (error) {
> + if (error == -EAGAIN)
> + return -ENOSPC;
> + return error;
> + }
> + }
> +
> /*
> * File data changes must be complete before we start the transaction to
> * modify the inode. This needs to be done before joining the inode to
> @@ -902,11 +926,14 @@ xfs_setattr_size(
> if (newsize > oldsize) {
> trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
> error = xfs_zero_range(ip, oldsize, newsize - oldsize,
> - &did_zeroing);
> + &ac, &did_zeroing);
> } else {
> - error = xfs_truncate_page(ip, newsize, &did_zeroing);
> + error = xfs_truncate_page(ip, newsize, &ac, &did_zeroing);
> }
>
> + if (xfs_is_zoned_inode(ip))
> + xfs_zoned_space_unreserve(ip, &ac);
> +
> if (error)
> return error;
>
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index b977930c4ebc..cc3b4df88110 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1532,7 +1532,7 @@ xfs_reflink_zero_posteof(
> return 0;
>
> trace_xfs_zero_eof(ip, isize, pos - isize);
> - return xfs_zero_range(ip, isize, pos - isize, NULL);
> + return xfs_zero_range(ip, isize, pos - isize, NULL, NULL);
> }
>
> /*
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 7de1ed0ca13a..86443395b1a3 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -1742,6 +1742,7 @@ DEFINE_SIMPLE_IO_EVENT(xfs_end_io_direct_write);
> DEFINE_SIMPLE_IO_EVENT(xfs_end_io_direct_write_unwritten);
> DEFINE_SIMPLE_IO_EVENT(xfs_end_io_direct_write_append);
> DEFINE_SIMPLE_IO_EVENT(xfs_file_splice_read);
> +DEFINE_SIMPLE_IO_EVENT(xfs_zoned_map_blocks);
>
> DECLARE_EVENT_CLASS(xfs_itrunc_class,
> TP_PROTO(struct xfs_inode *ip, xfs_fsize_t new_size),
> diff --git a/fs/xfs/xfs_zone_gc.c b/fs/xfs/xfs_zone_gc.c
> index 36cc167522c8..0e1c39f2aaba 100644
> --- a/fs/xfs/xfs_zone_gc.c
> +++ b/fs/xfs/xfs_zone_gc.c
> @@ -21,6 +21,34 @@
> #include "xfs_zones.h"
> #include "xfs_trace.h"
>
> +/*
> + * Implement Garbage Collection (GC) of partially used zones.
> + *
> + * To support the purely sequential writes in each zone, zoned XFS needs to be
> + * able to move data remaining in a zone out of it to reset the zone to prepare
> + * for writing to it again.
> + *
> + * This is done by the GC thread implemented in this file. To support that, a
> + * number of zones (XFS_GC_ZONES) are reserved from the user visible capacity to
> + * write the garbage collected data into.
> + *
> + * Whenever the available space is below the chosen threshold, the GC thread
> + * looks for potential non-empty but not fully used zones that are worth
> + * reclaiming. Once found, the rmap for the victim zone is queried, and after
> + * a bit of sorting to reduce fragmentation, the still live extents are read
> + * into memory and written to the GC target zone, and the bmap btree of the
> + * files is updated to point to the new location. To avoid taking the IOLOCK
> + * and MMAPLOCK for the entire GC process and thus affecting the latency of
> + * user reads and writes to the files, the GC writes are speculative and the
> + * I/O completion checks that no other writes happened for the affected regions
> + * before remapping.
> + *
> + * Once a zone does not contain any valid data, be that through GC or user
> + * block removal, it is queued for a zone reset. The reset operation
> + * carefully ensures that the RT device cache is flushed and all transactions
> + * referencing the rmap have been committed to disk.
> + */
> +
> /*
> * Size of each GC scratch pad. This is also the upper bound for each
> * GC I/O, which helps to keep latency down.
> @@ -808,6 +836,10 @@ xfs_zone_gc_finish_chunk(
> /*
> * Cycle through the iolock and wait for direct I/O and layouts to
> * ensure no one is reading from the old mapping before it goes away.
> + *
> + * Note that xfs_zoned_end_io() below checks that no other writer raced
> + * with us to update the mapping by checking that the old startblock
> + * didn't change.
> */
> xfs_ilock(ip, iolock);
> error = xfs_break_layouts(VFS_I(ip), &iolock, BREAK_UNMAP);
> --
> 2.45.2
>
>