From: Joanne Koong <joannelkoong@gmail.com>
To: Christoph Hellwig <hch@infradead.org>
Cc: miklos@szeredi.hu, djwong@kernel.org, brauner@kernel.org,
	 linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org,
	 bernd.schubert@fastmail.fm, kernel-team@meta.com
Subject: Re: [PATCH v1 2/8] iomap: add IOMAP_IN_MEM iomap type
Date: Tue, 10 Jun 2025 13:13:09 -0700
Message-ID: <CAJnrk1b6eB71BmE_aOS77O-=77L_r5pim6GZYg45tUQnWChHUg@mail.gmail.com>
In-Reply-To: <aEgyu86jWSz0Gpia@infradead.org>

On Tue, Jun 10, 2025 at 6:27 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> So I looked into something else, what if we just use ->read_folio
> despite it not seeming ideal initially?  After going through with
> it I think it's actually less bad than I thought.  This passes
> -g auto on xfs with 4k blocks, and has three regressions with 1k
> blocks: two look like the seek hole testers being upset that we can't
> easily create detectable sub-block holes now, and one is because
> generic/563 thinks the cgroup accounting is off, probably because
> we read more data now or something like that.
>
> ---
> From c5d3cf651c815d3327199c74eac43149fc958098 Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig <hch@lst.de>
> Date: Tue, 10 Jun 2025 09:39:57 +0200
> Subject: iomap: use ->read_folio instead of iomap_read_folio_sync
>
> iomap_file_buffered_write has its own private read path for reading
> in folios that are only partially overwritten, which not only adds
> extra code, but also extra problems when e.g. we want reads to go
> through a file system method to support checksums or RAID, or even
> support non-block based file systems.
>
> Switch to using ->read_folio instead, which has a few up- and downsides.
>
> ->read_folio always reads the entire folio and not just the start and
> the tail that are not being overwritten.  Historically this was seen as a
> downside as it reads more data than needed.  But with modern file systems
> and modern storage devices this is probably a benefit.  If the folio is
> stored contiguously on disk, the single read will be more efficient than
> two small reads on almost all current hardware. If the folio is backed by
> two blocks, at least we pipeline the two reads instead of doing two
> synchronous ones.  And if the file system fragmented the folio so badly
> that we'll now need to do more than two reads we're still at least
> pipelining it, although that should basically never happen with modern
> file systems.

If the filesystem wants granular folio reads, it can also just do that
itself by calling an iomap helper (e.g. what iomap_adjust_read_range()
is doing right now) in its ->read_folio() implementation, correct?
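
Roughly what I have in mind is the sketch below.  This is hand-wavy
pseudocode, not working code: it assumes iomap_adjust_read_range() and
iomap_set_range_uptodate() were exported as helpers (both are static in
fs/iomap/buffered-io.c today), and myfs_read_range() is a made-up
stand-in for however the filesystem actually fetches the data:

static int myfs_read_folio(struct file *file, struct folio *folio)
{
	struct inode *inode = folio->mapping->host;
	loff_t pos = folio_pos(folio);
	loff_t end = pos + folio_size(folio);
	size_t poff, plen;
	int error = 0;

	do {
		/* skip ranges the ifs bitmap already marks uptodate */
		iomap_adjust_read_range(inode, folio, &pos, end - pos,
				&poff, &plen);
		if (plen == 0)
			break;

		/* read just this sub-range, e.g. one FUSE_READ request */
		error = myfs_read_range(inode, folio, poff, plen);
		if (error)
			break;

		iomap_set_range_uptodate(folio, poff, plen);
	} while ((pos += plen) < end);

	folio_unlock(folio);
	return error;
}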

For fuse at least, we definitely want granular reads, since reads may
be extremely expensive (e.g. it may be a network fetch) and there's
non-trivial memcpy overhead incurred by fuse needing to copy read
buffer data from userspace back to the kernel.

>
> ->read_folio unlocks the folio on completion.  This adds extra atomics
> to the write fast path, but the actual signaling by doing a folio_lock
> after ->read_folio is not any slower than the completion wakeup.  We
> just have to recheck the mapping in this case to lock out truncates
> and other mapping manipulations.
>
> ->read_folio starts another, nested, iomap iteration, with an extra
> lookup of the extent at the current file position.  For in-place update
> file systems this is extra work, although if they use a good data
> structure like the xfs iext btree there is very little overhead in
> another lookup.  For file systems that write out of place this actually
> implements the desired semantics as they don't care about the existing
> data for the write iteration at all, although untangling this and
> removing the srcmap member in the iomap_iter will require additional
> work to turn the block zeroing and unshare helpers upside down.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/iomap/buffered-io.c | 116 ++++++++++++++++-------------------------
>  1 file changed, 45 insertions(+), 71 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 3729391a18f3..52b4040208dd 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -667,30 +667,34 @@ iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
>                                          pos + len - 1);
>  }
>
> -static int iomap_read_folio_sync(loff_t block_start, struct folio *folio,
> -               size_t poff, size_t plen, const struct iomap *iomap)
> +/*
> + * Now that we have a locked folio, check that the iomap we have cached is not
> + * stale before we do anything.
> + *
> + * The extent mapping can change due to concurrent IO in flight, e.g. the
> + * IOMAP_UNWRITTEN state can change and memory reclaim could have reclaimed a
> + * previously partially written page at this index after IO completion before
> + * this write reaches this file offset, and hence we could do the wrong thing
> + * here (zero a page range incorrectly or fail to zero) and corrupt data.
> + */
> +static bool iomap_validate(struct iomap_iter *iter)
>  {
> -       struct bio_vec bvec;
> -       struct bio bio;
> +       const struct iomap_folio_ops *folio_ops = iter->iomap.folio_ops;
>
> -       bio_init(&bio, iomap->bdev, &bvec, 1, REQ_OP_READ);
> -       bio.bi_iter.bi_sector = iomap_sector(iomap, block_start);
> -       bio_add_folio_nofail(&bio, folio, plen, poff);
> -       return submit_bio_wait(&bio);
> +       if (folio_ops && folio_ops->iomap_valid &&
> +           !folio_ops->iomap_valid(iter->inode, &iter->iomap)) {
> +               iter->iomap.flags |= IOMAP_F_STALE;
> +               return false;
> +       }
> +
> +       return true;
>  }
>
> -static int __iomap_write_begin(const struct iomap_iter *iter, size_t len,
> +static int __iomap_write_begin(struct iomap_iter *iter, size_t len,
>                 struct folio *folio)
>  {
> -       const struct iomap *srcmap = iomap_iter_srcmap(iter);
> +       struct inode *inode = iter->inode;
>         struct iomap_folio_state *ifs;
> -       loff_t pos = iter->pos;
> -       loff_t block_size = i_blocksize(iter->inode);
> -       loff_t block_start = round_down(pos, block_size);
> -       loff_t block_end = round_up(pos + len, block_size);
> -       unsigned int nr_blocks = i_blocks_per_folio(iter->inode, folio);
> -       size_t from = offset_in_folio(folio, pos), to = from + len;
> -       size_t poff, plen;
>
>         /*
>          * If the write or zeroing completely overlaps the current folio, then
> @@ -699,45 +703,29 @@ static int __iomap_write_begin(const struct iomap_iter *iter, size_t len,
>          * For the unshare case, we must read in the ondisk contents because we
>          * are not changing pagecache contents.
>          */
> -       if (!(iter->flags & IOMAP_UNSHARE) && pos <= folio_pos(folio) &&
> -           pos + len >= folio_pos(folio) + folio_size(folio))
> +       if (!(iter->flags & IOMAP_UNSHARE) &&
> +           iter->pos <= folio_pos(folio) &&
> +           iter->pos + len >= folio_pos(folio) + folio_size(folio))
>                 return 0;
>
> -       ifs = ifs_alloc(iter->inode, folio, iter->flags);
> -       if ((iter->flags & IOMAP_NOWAIT) && !ifs && nr_blocks > 1)
> +       ifs = ifs_alloc(inode, folio, iter->flags);
> +       if ((iter->flags & IOMAP_NOWAIT) && !ifs &&
> +           i_blocks_per_folio(inode, folio) > 1)
>                 return -EAGAIN;
>
> -       if (folio_test_uptodate(folio))
> -               return 0;
> -
> -       do {
> -               iomap_adjust_read_range(iter->inode, folio, &block_start,
> -                               block_end - block_start, &poff, &plen);
> -               if (plen == 0)
> -                       break;
> +       if (!folio_test_uptodate(folio)) {
> +               inode->i_mapping->a_ops->read_folio(NULL, folio);
>
> -               if (!(iter->flags & IOMAP_UNSHARE) &&
> -                   (from <= poff || from >= poff + plen) &&
> -                   (to <= poff || to >= poff + plen))
> -                       continue;
> -
> -               if (iomap_block_needs_zeroing(iter, block_start)) {
> -                       if (WARN_ON_ONCE(iter->flags & IOMAP_UNSHARE))
> -                               return -EIO;
> -                       folio_zero_segments(folio, poff, from, to, poff + plen);
> -               } else {
> -                       int status;
> -
> -                       if (iter->flags & IOMAP_NOWAIT)
> -                               return -EAGAIN;
> -
> -                       status = iomap_read_folio_sync(block_start, folio,
> -                                       poff, plen, srcmap);
> -                       if (status)
> -                               return status;
> -               }
> -               iomap_set_range_uptodate(folio, poff, plen);
> -       } while ((block_start += plen) < block_end);
> +               /*
> +                * ->read_folio unlocks the folio.  Relock and revalidate the
> +                * folio.
> +                */
> +               folio_lock(folio);
> +               if (unlikely(folio->mapping != inode->i_mapping))
> +                       return 1;
> +               if (unlikely(!iomap_validate(iter)))
> +                       return 1;

Does this now basically mean that every caller that uses iomap for
writes will have to implement ->iomap_valid and bump the sequence
counter any time there's a write or truncate, in case the mapping
changes while the folio lock is dropped? Or were we already supposed
to be doing this?
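
For reference, the scheme I'm picturing is something like what XFS does
with iomap->validity_cookie -- a sketch only, where MYFS_I() and
mapping_seq are made-up stand-ins for a per-inode sequence counter that
the fs bumps whenever the extent map changes:

static bool myfs_iomap_valid(struct inode *inode, const struct iomap *iomap)
{
	/* stale if the extent map changed since ->iomap_begin sampled it */
	return iomap->validity_cookie ==
		READ_ONCE(MYFS_I(inode)->mapping_seq);
}

static const struct iomap_folio_ops myfs_iomap_folio_ops = {
	.iomap_valid	= myfs_iomap_valid,
};

with ->iomap_begin sampling the counter at mapping time:

	iomap->validity_cookie = READ_ONCE(MYFS_I(inode)->mapping_seq);
	iomap->folio_ops = &myfs_iomap_folio_ops;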

> +       }
>
>         return 0;
>  }
> @@ -803,7 +791,6 @@ static int iomap_write_begin_inline(const struct iomap_iter *iter,
>  static int iomap_write_begin(struct iomap_iter *iter, struct folio **foliop,
>                 size_t *poffset, u64 *plen)
>  {
> -       const struct iomap_folio_ops *folio_ops = iter->iomap.folio_ops;
>         const struct iomap *srcmap = iomap_iter_srcmap(iter);
>         loff_t pos = iter->pos;
>         u64 len = min_t(u64, SIZE_MAX, iomap_length(iter));
> @@ -818,28 +805,14 @@ static int iomap_write_begin(struct iomap_iter *iter, struct folio **foliop,
>         if (fatal_signal_pending(current))
>                 return -EINTR;
>
> +lookup_again:
>         folio = __iomap_get_folio(iter, len);
>         if (IS_ERR(folio))
>                 return PTR_ERR(folio);
>
> -       /*
> -        * Now we have a locked folio, before we do anything with it we need to
> -        * check that the iomap we have cached is not stale. The inode extent
> -        * mapping can change due to concurrent IO in flight (e.g.
> -        * IOMAP_UNWRITTEN state can change and memory reclaim could have
> -        * reclaimed a previously partially written page at this index after IO
> -        * completion before this write reaches this file offset) and hence we
> -        * could do the wrong thing here (zero a page range incorrectly or fail
> -        * to zero) and corrupt data.
> -        */
> -       if (folio_ops && folio_ops->iomap_valid) {
> -               bool iomap_valid = folio_ops->iomap_valid(iter->inode,
> -                                                        &iter->iomap);
> -               if (!iomap_valid) {
> -                       iter->iomap.flags |= IOMAP_F_STALE;
> -                       status = 0;
> -                       goto out_unlock;
> -               }
> +       if (unlikely(!iomap_validate(iter))) {
> +               status = 0;
> +               goto out_unlock;
>         }
>
>         pos = iomap_trim_folio_range(iter, folio, poffset, &len);
> @@ -860,7 +833,8 @@ static int iomap_write_begin(struct iomap_iter *iter, struct folio **foliop,
>
>  out_unlock:
>         __iomap_put_folio(iter, 0, folio);
> -
> +       if (status == 1)
> +               goto lookup_again;
>         return status;
>  }
>
> --
> 2.47.2
>

