From: Joanne Koong <joannelkoong@gmail.com>
To: Christoph Hellwig <hch@infradead.org>
Cc: miklos@szeredi.hu, djwong@kernel.org, brauner@kernel.org,
linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org,
bernd.schubert@fastmail.fm, kernel-team@meta.com
Subject: Re: [PATCH v1 2/8] iomap: add IOMAP_IN_MEM iomap type
Date: Tue, 10 Jun 2025 13:13:09 -0700 [thread overview]
Message-ID: <CAJnrk1b6eB71BmE_aOS77O-=77L_r5pim6GZYg45tUQnWChHUg@mail.gmail.com> (raw)
In-Reply-To: <aEgyu86jWSz0Gpia@infradead.org>
On Tue, Jun 10, 2025 at 6:27 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> So I looked into something else, what if we just use ->read_folio
> despite it not seeming ideal initially? After going through with
> it I think it's actually less bad than I thought. This passes
> -g auto on xfs with 4k blocks, and has three regressions with 1k
> blocks: two look like the seek hole testers being upset that we can't
> easily create detectable sub-block holes now, and one is
> generic/563 thinking the cgroup accounting is off, probably because
> we read more data now or something like that.
>
> ---
> From c5d3cf651c815d3327199c74eac43149fc958098 Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig <hch@lst.de>
> Date: Tue, 10 Jun 2025 09:39:57 +0200
> Subject: iomap: use ->read_folio instead of iomap_read_folio_sync
>
> iomap_file_buffered_write has its own private read path for reading
> in folios that are only partially overwritten, which not only adds
> extra code, but also extra problems when e.g. we want reads to go
> through a file system method to support checksums or RAID, or even
> support non-block based file systems.
>
> Switch to using ->read_folio instead, which has a few up- and downsides.
>
> ->read_folio always reads the entire folio and not just the start and
> the tail that are not being overwritten. Historically this was seen as a
> downside as it reads more data than needed. But with modern file systems
> and modern storage devices this is probably a benefit. If the folio is
> stored contiguously on disk, the single read will be more efficient than
> two small reads on almost all current hardware. If the folio is backed by
> two blocks, at least we pipeline the two reads instead of doing two
> synchronous ones. And if the file system fragmented the folio so badly
> that we'll now need to do more than two reads we're still at least
> pipelining it, although that should basically never happen with modern
> file systems.
If the filesystem wants granular folio reads, it can also just do that
itself by calling an iomap helper (eg what iomap_adjust_read_range()
is doing right now) in its ->read_folio() implementation, correct?
For fuse at least, we definitely want granular reads, since reads may
be extremely expensive (eg it may be a network fetch) and there's
non-trivial memcpy overhead incurred by fuse needing to copy read
buffer data from userspace back to the kernel.
>
> ->read_folio unlocks the folio on completion. This adds extra atomics
> to the write fast path, but the actual signaling by doing a lock_page
> after ->read_folio is not any slower than the completion wakeup. We
> just have to recheck the mapping in this case to lock out truncates
> and other mapping manipulations.
>
> ->read_folio starts another, nested, iomap iteration, with an extra
> lookup of the extent at the current file position. For in-place update
> file systems this is extra work, although if they use a good data
> structure like the xfs iext btree there is very little overhead in
> another lookup. For file system that write out of place this actually
> implements the desired semantics as they don't care about the existing
> data for the write iteration at all, although untangling this and
> removing the srcmap member in the iomap_iter will require additional
> work to turn the block zeroing and unshare helpers upside down.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> fs/iomap/buffered-io.c | 116 ++++++++++++++++-------------------------
> 1 file changed, 45 insertions(+), 71 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 3729391a18f3..52b4040208dd 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -667,30 +667,34 @@ iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
> pos + len - 1);
> }
>
> -static int iomap_read_folio_sync(loff_t block_start, struct folio *folio,
> - size_t poff, size_t plen, const struct iomap *iomap)
> +/*
> + * Now that we have a locked folio, check that the iomap we have cached is not
> + * stale before we do anything.
> + *
> + * The extent mapping can change due to concurrent IO in flight, e.g. the
> + * IOMAP_UNWRITTEN state can change and memory reclaim could have reclaimed a
> + * previously partially written page at this index after IO completion before
> + * this write reaches this file offset, and hence we could do the wrong thing
> + * here (zero a page range incorrectly or fail to zero) and corrupt data.
> + */
> +static bool iomap_validate(struct iomap_iter *iter)
> {
> - struct bio_vec bvec;
> - struct bio bio;
> + const struct iomap_folio_ops *folio_ops = iter->iomap.folio_ops;
>
> - bio_init(&bio, iomap->bdev, &bvec, 1, REQ_OP_READ);
> - bio.bi_iter.bi_sector = iomap_sector(iomap, block_start);
> - bio_add_folio_nofail(&bio, folio, plen, poff);
> - return submit_bio_wait(&bio);
> + if (folio_ops && folio_ops->iomap_valid &&
> + !folio_ops->iomap_valid(iter->inode, &iter->iomap)) {
> + iter->iomap.flags |= IOMAP_F_STALE;
> + return false;
> + }
> +
> + return true;
> }
>
> -static int __iomap_write_begin(const struct iomap_iter *iter, size_t len,
> +static int __iomap_write_begin(struct iomap_iter *iter, size_t len,
> struct folio *folio)
> {
> - const struct iomap *srcmap = iomap_iter_srcmap(iter);
> + struct inode *inode = iter->inode;
> struct iomap_folio_state *ifs;
> - loff_t pos = iter->pos;
> - loff_t block_size = i_blocksize(iter->inode);
> - loff_t block_start = round_down(pos, block_size);
> - loff_t block_end = round_up(pos + len, block_size);
> - unsigned int nr_blocks = i_blocks_per_folio(iter->inode, folio);
> - size_t from = offset_in_folio(folio, pos), to = from + len;
> - size_t poff, plen;
>
> /*
> * If the write or zeroing completely overlaps the current folio, then
> @@ -699,45 +703,29 @@ static int __iomap_write_begin(const struct iomap_iter *iter, size_t len,
> * For the unshare case, we must read in the ondisk contents because we
> * are not changing pagecache contents.
> */
> - if (!(iter->flags & IOMAP_UNSHARE) && pos <= folio_pos(folio) &&
> - pos + len >= folio_pos(folio) + folio_size(folio))
> + if (!(iter->flags & IOMAP_UNSHARE) &&
> + iter->pos <= folio_pos(folio) &&
> + iter->pos + len >= folio_pos(folio) + folio_size(folio))
> return 0;
>
> - ifs = ifs_alloc(iter->inode, folio, iter->flags);
> - if ((iter->flags & IOMAP_NOWAIT) && !ifs && nr_blocks > 1)
> + ifs = ifs_alloc(inode, folio, iter->flags);
> + if ((iter->flags & IOMAP_NOWAIT) && !ifs &&
> + i_blocks_per_folio(inode, folio) > 1)
> return -EAGAIN;
>
> - if (folio_test_uptodate(folio))
> - return 0;
> -
> - do {
> - iomap_adjust_read_range(iter->inode, folio, &block_start,
> - block_end - block_start, &poff, &plen);
> - if (plen == 0)
> - break;
> + if (!folio_test_uptodate(folio)) {
> + inode->i_mapping->a_ops->read_folio(NULL, folio);
>
> - if (!(iter->flags & IOMAP_UNSHARE) &&
> - (from <= poff || from >= poff + plen) &&
> - (to <= poff || to >= poff + plen))
> - continue;
> -
> - if (iomap_block_needs_zeroing(iter, block_start)) {
> - if (WARN_ON_ONCE(iter->flags & IOMAP_UNSHARE))
> - return -EIO;
> - folio_zero_segments(folio, poff, from, to, poff + plen);
> - } else {
> - int status;
> -
> - if (iter->flags & IOMAP_NOWAIT)
> - return -EAGAIN;
> -
> - status = iomap_read_folio_sync(block_start, folio,
> - poff, plen, srcmap);
> - if (status)
> - return status;
> - }
> - iomap_set_range_uptodate(folio, poff, plen);
> - } while ((block_start += plen) < block_end);
> + /*
> + * ->read_folio unlocks the folio. Relock and revalidate the
> + * folio.
> + */
> + folio_lock(folio);
> + if (unlikely(folio->mapping != inode->i_mapping))
> + return 1;
> + if (unlikely(!iomap_validate(iter)))
> + return 1;
Does this now basically mean that every caller that uses iomap for
writes will have to implement ->iomap_valid and up the sequence
counter anytime there's a write or truncate, in case the folio changes
during the lock drop? Or were we already supposed to be doing this?
> + }
>
> return 0;
> }
> @@ -803,7 +791,6 @@ static int iomap_write_begin_inline(const struct iomap_iter *iter,
> static int iomap_write_begin(struct iomap_iter *iter, struct folio **foliop,
> size_t *poffset, u64 *plen)
> {
> - const struct iomap_folio_ops *folio_ops = iter->iomap.folio_ops;
> const struct iomap *srcmap = iomap_iter_srcmap(iter);
> loff_t pos = iter->pos;
> u64 len = min_t(u64, SIZE_MAX, iomap_length(iter));
> @@ -818,28 +805,14 @@ static int iomap_write_begin(struct iomap_iter *iter, struct folio **foliop,
> if (fatal_signal_pending(current))
> return -EINTR;
>
> +lookup_again:
> folio = __iomap_get_folio(iter, len);
> if (IS_ERR(folio))
> return PTR_ERR(folio);
>
> - /*
> - * Now we have a locked folio, before we do anything with it we need to
> - * check that the iomap we have cached is not stale. The inode extent
> - * mapping can change due to concurrent IO in flight (e.g.
> - * IOMAP_UNWRITTEN state can change and memory reclaim could have
> - * reclaimed a previously partially written page at this index after IO
> - * completion before this write reaches this file offset) and hence we
> - * could do the wrong thing here (zero a page range incorrectly or fail
> - * to zero) and corrupt data.
> - */
> - if (folio_ops && folio_ops->iomap_valid) {
> - bool iomap_valid = folio_ops->iomap_valid(iter->inode,
> - &iter->iomap);
> - if (!iomap_valid) {
> - iter->iomap.flags |= IOMAP_F_STALE;
> - status = 0;
> - goto out_unlock;
> - }
> + if (unlikely(!iomap_validate(iter))) {
> + status = 0;
> + goto out_unlock;
> }
>
> pos = iomap_trim_folio_range(iter, folio, poffset, &len);
> @@ -860,7 +833,8 @@ static int iomap_write_begin(struct iomap_iter *iter, struct folio **foliop,
>
> out_unlock:
> __iomap_put_folio(iter, 0, folio);
> -
> + if (status == 1)
> + goto lookup_again;
> return status;
> }
>
> --
> 2.47.2
>