From: "Darrick J. Wong" <djwong@kernel.org>
To: Christoph Hellwig <hch@lst.de>
Cc: Carlos Maiolino <cem@kernel.org>,
Hans Holmberg <hans.holmberg@wdc.com>,
linux-xfs@vger.kernel.org
Subject: Re: [PATCH 25/43] xfs: implement buffered writes to zoned RT devices
Date: Thu, 13 Feb 2025 15:05:58 -0800 [thread overview]
Message-ID: <20250213230558.GX21808@frogsfrogsfrogs> (raw)
In-Reply-To: <20250213053943.GA18867@lst.de>
On Thu, Feb 13, 2025 at 06:39:43AM +0100, Christoph Hellwig wrote:
> On Tue, Feb 11, 2025 at 04:54:05PM -0800, Darrick J. Wong wrote:
> > Why do we reserve space *before* taking the iolock? The non-zoned write
> > path reserves space through the transaction while holding the iolock.
> > Regular write attempts will drop the iolock on ENOSPC to push the
> > garbage collectors; why is this any different?
> >
> > Is it because of the way the garbage collector works backwards from the
> > rmap to figure out what space to move? And since the gc starts with a
> > zone and only then takes IOLOCKs, so must we?
>
> Sorta. GC stats based on a zone/rtgroup and then moves data elewhere.
> It only takes the iolock in the I/O completion path thanks the
> opportunistic write trick that only commit the write if the previous
> location hasn't changed. If we now have the same inode waiting for
> space with the iolock held we very clearly deadlock there.
> generic/747 was written to test this.
>
> > > (aka i_rwsem) already held, so a hacky deviation from the above
> > > scheme is needed. In this case the space reservations is called with
> > > the iolock held, but is required not to block and can dip into the
> > > reserved block pool. This can lead to -ENOSPC when truncating a
> > > file, which is unfortunate. But fixing the calling conventions in
> > > the VFS is probably much easier with code requiring it already in
> > > mainline.
> >
> > It /would/ be a lot more flexible if the rest of the filesystem could
> > provide its own locking like we do for read/write paths.
>
> I've look into fixing this in the VFS. Not entirely trivial but I'll
> hopefully get to it in the next few merge windows. The big problem
> is that currently truncate as a data operation is overlayed onto
> ->setattr which also does tons of other things, so that will have to
> be sorted first.
<nod> That's going to be a long treewide change, I suspect.
> > > struct xfs_writepage_ctx wpc = { };
> > > + int error;
> > >
> > > - xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
> > > + xfs_iflags_clear(ip, XFS_ITRUNCATED);
> > > + if (xfs_is_zoned_inode(ip)) {
> > > + struct xfs_zoned_writepage_ctx xc = { };
> >
> > I wonder if we could reduce stack usage here by declaring an onstack
> > union of the two writepage_ctx objects?
>
> I'll check if the compiler doesn't already do the right thing, but
> if not just using and if / else should do the work in a cleaner way.
>
> > > + /*
> > > + * For zoned file systems, zeroing the first and last block of a hole
> > > + * punch requires allocating a new block to rewrite the remaining data
> > > + * and new zeroes out of place. Get a reservations for those before
> > > + * taking the iolock. Dip into the reserved pool because we are
> > > + * expected to be able to punch a hole even on a completely full
> > > + * file system.
> > > + */
> > > + if (xfs_is_zoned_inode(ip) &&
> > > + (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE |
> > > + FALLOC_FL_COLLAPSE_RANGE))) {
> >
> > Are we expected to be able to zero-range/collapse-range even on a
> > completely full filesystem as well?
>
> I think there's a few xfstests that do at least.
>
> > > + * 3) another thread does direct I/O on the range that the write fault
> > > + * happened on, which causes writeback of the dirty data.
> > > + * 4) this then set the stale flag, which cuts the current iomap
> > > + * iteration short, causing the new call to ->iomap_begin that gets
> > > + * us here again, but now without a sufficient reservation.
> > > + *
> > > + * This is a very unusual I/O pattern, and nothing but generic/095 is
> > > + * known to hit it. There's not really much we can do here, so turn this
> > > + * into a short write.
> >
> > So... buffered write racing with a directio write /and/ a write fault
> > all collide? Heh.
>
> Yeah, funky xfstests :) Note that we might be able to fix this by
> only adding a delalloc reservation at the same time as dirtying the
> folios. In the current code that would be very nasy, but with the
> infrastructure for the folio based bufferd write locking discussed
> a while i might be more feasible.
Hmm, let me think about that.
> > > + * Normally xfs_zoned_space_reserve is supposed to be called outside the
> > > + * IOLOCK. For truncate we can't do that since ->setattr is called with
> > > + * it already held by the VFS. So for now chicken out and try to
> > > + * allocate space under it.
> > > + *
> > > + * To avoid deadlocks this means we can't block waiting for space, which
> > > + * can lead to spurious -ENOSPC if there are no directly available
> > > + * blocks. We mitigate this a bit by allowing zeroing to dip into the
> > > + * reserved pool, but eventually the VFS calling convention needs to
> > > + * change.
> >
> > I wonder, why isn't this a problem for COWing a shared EOF block when we
> > increase the file size slightly?
>
> The problem being that we need a space allocation for a write? Well,
> we'll need a space allocation for every write in an out of tree write
> file system. But -ENOSPC (or SIGBUS for the lovers of shared mmap
> based I/O :)) on writes is very much expected. On truncate a lot less
> so. In fact not dipping into the reserved pool here for example breaks
> rocksdb workloads.
It occurs to me that regular truncate-down on a shared block can also
return ENOSPC if the filesystem is completely out of space. Though I
guess in that case, at least df will say that nearly zero space is
available, whereas for zoned storage we might have plenty of free space
that simply isn't available right now... right?
--D
next prev parent reply other threads:[~2025-02-13 23:05 UTC|newest]
Thread overview: 106+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-02-06 6:44 support for zoned devices RFCv2 Christoph Hellwig
2025-02-06 6:44 ` [PATCH 01/43] xfs: factor out a xfs_rt_check_size helper Christoph Hellwig
2025-02-06 6:44 ` [PATCH 02/43] xfs: add a rtg_blocks helper Christoph Hellwig
2025-02-06 6:44 ` [PATCH 03/43] xfs: move xfs_bmapi_reserve_delalloc to xfs_iomap.c Christoph Hellwig
2025-02-06 6:44 ` [PATCH 04/43] xfs: skip always_cow inodes in xfs_reflink_trim_around_shared Christoph Hellwig
2025-02-06 20:13 ` Darrick J. Wong
2025-02-07 4:15 ` Christoph Hellwig
2025-02-07 4:22 ` Darrick J. Wong
2025-02-06 6:44 ` [PATCH 05/43] xfs: refine the unaligned check for always COW inodes in xfs_file_dio_write Christoph Hellwig
2025-02-06 6:44 ` [PATCH 06/43] xfs: generalize the freespace and reserved blocks handling Christoph Hellwig
2025-02-06 21:29 ` Darrick J. Wong
2025-02-07 4:21 ` Christoph Hellwig
2025-02-07 4:28 ` Darrick J. Wong
2025-02-06 6:44 ` [PATCH 07/43] xfs: preserve RT reservations across remounts Christoph Hellwig
2025-02-06 20:14 ` Darrick J. Wong
2025-02-06 6:44 ` [PATCH 08/43] xfs: reflow xfs_dec_freecounter Christoph Hellwig
2025-02-06 20:14 ` Darrick J. Wong
2025-02-06 6:44 ` [PATCH 09/43] xfs: trace in-memory freecounters Christoph Hellwig
2025-02-06 20:15 ` Darrick J. Wong
2025-02-06 6:44 ` [PATCH 10/43] xfs: make metabtree reservations global Christoph Hellwig
2025-02-06 20:50 ` Darrick J. Wong
2025-02-07 4:18 ` Christoph Hellwig
2025-02-07 4:24 ` Darrick J. Wong
2025-02-06 6:44 ` [PATCH 11/43] xfs: reduce metafile reservations Christoph Hellwig
2025-02-06 20:52 ` Darrick J. Wong
2025-02-07 4:19 ` Christoph Hellwig
2025-02-07 4:26 ` Darrick J. Wong
2025-02-06 6:44 ` [PATCH 12/43] xfs: support XFS_BMAPI_REMAP in xfs_bmap_del_extent_delay Christoph Hellwig
2025-02-06 20:54 ` Darrick J. Wong
2025-02-07 4:19 ` Christoph Hellwig
2025-02-06 6:44 ` [PATCH 13/43] xfs: add a xfs_rtrmap_highest_rgbno helper Christoph Hellwig
2025-02-06 20:55 ` Darrick J. Wong
2025-02-06 6:44 ` [PATCH 14/43] xfs: define the zoned on-disk format Christoph Hellwig
2025-02-06 21:00 ` Darrick J. Wong
2025-02-06 6:44 ` [PATCH 15/43] xfs: allow internal RT devices for zoned mode Christoph Hellwig
2025-02-06 6:44 ` [PATCH 16/43] xfs: export zoned geometry via XFS_FSOP_GEOM Christoph Hellwig
2025-02-06 21:03 ` Darrick J. Wong
2025-02-07 4:20 ` Christoph Hellwig
2025-02-06 6:44 ` [PATCH 17/43] xfs: disable sb_frextents for zoned file systems Christoph Hellwig
2025-02-06 6:44 ` [PATCH 18/43] xfs: disable FITRIM for zoned RT devices Christoph Hellwig
2025-02-06 6:44 ` [PATCH 19/43] xfs: don't call xfs_can_free_eofblocks from ->release for zoned inodes Christoph Hellwig
2025-02-06 6:44 ` [PATCH 20/43] xfs: skip zoned RT inodes in xfs_inodegc_want_queue_rt_file Christoph Hellwig
2025-02-06 21:04 ` Darrick J. Wong
2025-02-06 6:44 ` [PATCH 21/43] xfs: parse and validate hardware zone information Christoph Hellwig
2025-02-06 21:05 ` Darrick J. Wong
2025-02-06 6:44 ` [PATCH 22/43] xfs: add the zoned space allocator Christoph Hellwig
2025-02-07 17:39 ` Darrick J. Wong
2025-02-13 5:14 ` Christoph Hellwig
2025-02-13 8:35 ` Christoph Hellwig
2025-02-17 9:20 ` Christoph Hellwig
2025-02-13 22:08 ` Darrick J. Wong
2025-02-06 6:44 ` [PATCH 23/43] xfs: add support for zoned space reservations Christoph Hellwig
2025-02-07 17:52 ` Darrick J. Wong
2025-02-13 5:17 ` Christoph Hellwig
2025-02-13 22:09 ` Darrick J. Wong
2025-02-14 6:20 ` Christoph Hellwig
2025-02-14 18:25 ` Darrick J. Wong
2025-02-06 6:44 ` [PATCH 24/43] xfs: implement zoned garbage collection Christoph Hellwig
2025-02-07 18:33 ` Darrick J. Wong
2025-02-13 5:22 ` Christoph Hellwig
2025-02-13 22:10 ` Darrick J. Wong
2025-02-06 6:44 ` [PATCH 25/43] xfs: implement buffered writes to zoned RT devices Christoph Hellwig
2025-02-12 0:54 ` Darrick J. Wong
2025-02-13 5:39 ` Christoph Hellwig
2025-02-13 23:05 ` Darrick J. Wong [this message]
2025-02-14 6:22 ` Christoph Hellwig
2025-02-14 18:29 ` Darrick J. Wong
2025-02-06 6:44 ` [PATCH 26/43] xfs: implement direct " Christoph Hellwig
2025-02-06 6:44 ` [PATCH 27/43] xfs: wire up zoned block freeing in xfs_rtextent_free_finish_item Christoph Hellwig
2025-02-06 6:44 ` [PATCH 28/43] xfs: hide reserved RT blocks from statfs Christoph Hellwig
2025-02-07 4:32 ` Darrick J. Wong
2025-02-06 6:44 ` [PATCH 29/43] xfs: support growfs on zoned file systems Christoph Hellwig
2025-02-06 6:44 ` [PATCH 30/43] xfs: allow COW forks on zoned file systems in xchk_bmap Christoph Hellwig
2025-02-06 6:44 ` [PATCH 31/43] xfs: support xchk_xref_is_used_rt_space on zoned file systems Christoph Hellwig
2025-02-06 6:44 ` [PATCH 32/43] xfs: support xrep_require_rtext_inuse " Christoph Hellwig
2025-02-06 6:44 ` [PATCH 33/43] xfs: enable fsmap reporting for internal RT devices Christoph Hellwig
2025-02-07 4:39 ` Darrick J. Wong
2025-02-07 4:55 ` Christoph Hellwig
2025-02-06 6:44 ` [PATCH 34/43] xfs: disable reflink for zoned file systems Christoph Hellwig
2025-02-07 4:31 ` Darrick J. Wong
2025-02-07 4:54 ` Christoph Hellwig
2025-02-07 5:05 ` Darrick J. Wong
2025-02-06 6:44 ` [PATCH 35/43] xfs: disable rt quotas " Christoph Hellwig
2025-02-07 4:31 ` Darrick J. Wong
2025-02-06 6:44 ` [PATCH 36/43] xfs: enable the zoned RT device feature Christoph Hellwig
2025-02-06 6:44 ` [PATCH 37/43] xfs: support zone gaps Christoph Hellwig
2025-02-06 6:44 ` [PATCH 38/43] xfs: add a max_open_zones mount option Christoph Hellwig
2025-02-07 4:29 ` Darrick J. Wong
2025-02-06 6:44 ` [PATCH 39/43] xfs: support write life time based data placement Christoph Hellwig
2025-02-12 0:27 ` Darrick J. Wong
2025-02-12 13:29 ` Hans Holmberg
2025-02-13 4:42 ` Darrick J. Wong
2025-02-13 13:03 ` Hans Holmberg
2025-02-14 6:41 ` hch
2025-02-14 12:20 ` Hans Holmberg
2025-02-14 16:21 ` Darrick J. Wong
2025-02-06 6:44 ` [PATCH 40/43] xfs: wire up the show_stats super operation Christoph Hellwig
2025-02-06 6:44 ` [PATCH 41/43] xfs: export zone stats in /proc/*/mountstats Christoph Hellwig
2025-02-07 1:02 ` Darrick J. Wong
2025-02-07 4:25 ` Christoph Hellwig
2025-02-06 6:44 ` [PATCH 42/43] xfs: contain more sysfs code in xfs_sysfs.c Christoph Hellwig
2025-02-07 0:54 ` Darrick J. Wong
2025-02-06 6:44 ` [PATCH 43/43] xfs: export max_open_zones in sysfs Christoph Hellwig
2025-02-07 0:52 ` Darrick J. Wong
2025-02-07 4:23 ` Christoph Hellwig
2025-02-07 4:27 ` Darrick J. Wong
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250213230558.GX21808@frogsfrogsfrogs \
--to=djwong@kernel.org \
--cc=cem@kernel.org \
--cc=hans.holmberg@wdc.com \
--cc=hch@lst.de \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox