From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH] [RFC] iomap: zeroing needs to be pagecache aware
Date: Wed, 30 Nov 2022 18:08:04 -0800 [thread overview]
Message-ID: <Y4gMhHsGriqPhNsR@magnolia> (raw)
In-Reply-To: <20221201005214.3836105-1-david@fromorbit.com>
On Thu, Dec 01, 2022 at 11:52:14AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Unwritten extents can have page cache data over the range being
> zeroed so we can't just skip them entirely. Fix this by checking for
> an existing dirty folio over the unwritten range we are zeroing
> and only performing zeroing if the folio is already dirty.
Hm, I'll look at this tomorrow morning when I'm less bleary. From a
cursory glance it looks ok though.
> XXX: how do we detect a iomap containing a cow mapping over a hole
> in iomap_zero_iter()? The XFS code implies this case also needs to
> zero the page cache if there is data present, so trigger for page
> cache lookup only in iomap_zero_iter() needs to handle this case as
> well.
I've been wondering for a while if we ought to rename iomap_iter.iomap
to write_iomap and iomap_iter.srcmap to read_iomap, and change all the
->iomap_begin and ->iomap_end functions as needed. I think that would
make it more clear to iomap users which one they're supposed to use.
Right now we overload iomap_iter.iomap for reads and for writes if
srcmap is a hole (or SHARED isn't set on iomap) and it's getting
confusing to keep track of all that.
I guess the hard part of all that is that writes to the pagecache don't
touch storage; and writeback doesn't care about the source mapping since
it's only using block granularity.
--D
> Before:
>
> $ time sudo ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
>
> real 0m14.103s
> user 0m0.015s
> sys 0m0.020s
>
> $ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
> % time seconds usecs/call calls errors syscall
> ------ ----------- ----------- --------- --------- ----------------
> 85.90 0.847616 16 50000 ftruncate
> 14.01 0.138229 2 50000 pwrite64
> ....
>
> After:
>
> $ time sudo ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
>
> real 0m0.144s
> user 0m0.021s
> sys 0m0.012s
>
> $ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
> % time seconds usecs/call calls errors syscall
> ------ ----------- ----------- --------- --------- ----------------
> 53.86 0.505964 10 50000 ftruncate
> 46.12 0.433251 8 50000 pwrite64
> ....
>
> Yup, we get back all the performance.
>
> As for the "mmap write beyond EOF" data exposure aspect
> documented here:
>
> https://lore.kernel.org/linux-xfs/20221104182358.2007475-1-bfoster@redhat.com/
>
> With this command:
>
> $ sudo xfs_io -tfc "falloc 0 1k" -c "pwrite 0 1k" \
> -c "mmap 0 4k" -c "mwrite 3k 1k" -c "pwrite 32k 4k" \
> -c fsync -c "pread -v 3k 32" /mnt/scratch/foo
>
> Before:
>
> wrote 1024/1024 bytes at offset 0
> 1 KiB, 1 ops; 0.0000 sec (34.877 MiB/sec and 35714.2857 ops/sec)
> wrote 4096/4096 bytes at offset 32768
> 4 KiB, 1 ops; 0.0000 sec (229.779 MiB/sec and 58823.5294 ops/sec)
> 00000c00: 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 XXXXXXXXXXXXXXXX
> 00000c10: 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 XXXXXXXXXXXXXXXX
> read 32/32 bytes at offset 3072
> 32.000000 bytes, 1 ops; 0.0000 sec (568.182 KiB/sec and 18181.8182 ops/sec
>
> After:
>
> wrote 1024/1024 bytes at offset 0
> 1 KiB, 1 ops; 0.0000 sec (40.690 MiB/sec and 41666.6667 ops/sec)
> wrote 4096/4096 bytes at offset 32768
> 4 KiB, 1 ops; 0.0000 sec (150.240 MiB/sec and 38461.5385 ops/sec)
> 00000c00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> 00000c10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> read 32/32 bytes at offset 3072
> 32.000000 bytes, 1 ops; 0.0000 sec (558.036 KiB/sec and 17857.1429 ops/sec)
>
> We see that this post-eof unwritten extent dirty page zeroing is
> working correctly.
>
> This has passed through most of fstests on a couple of test VMs
> without issues at the moment, so I think this approach to fixing the
> issue is going to be solid once we've worked out how to detect the
> COW-hole mapping case.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> fs/iomap/buffered-io.c | 50 +++++++++++++++++++++++++++++++++++-------
> fs/xfs/xfs_iops.c | 12 +---------
> 2 files changed, 43 insertions(+), 19 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 356193e44cf0..0969e4525507 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -585,14 +585,14 @@ static int iomap_write_begin_inline(const struct iomap_iter *iter,
> }
>
> static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
> - size_t len, struct folio **foliop)
> + size_t len, struct folio **foliop, unsigned fgp)
> {
> const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
> const struct iomap *srcmap = iomap_iter_srcmap(iter);
> struct folio *folio;
> - unsigned fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE | FGP_NOFS;
> int status = 0;
>
> + fgp |= FGP_LOCK | FGP_WRITE | FGP_STABLE | FGP_NOFS;
> if (iter->flags & IOMAP_NOWAIT)
> fgp |= FGP_NOWAIT;
>
> @@ -615,7 +615,12 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
> folio = __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
> fgp, mapping_gfp_mask(iter->inode->i_mapping));
> if (!folio) {
> - status = (iter->flags & IOMAP_NOWAIT) ? -EAGAIN : -ENOMEM;
> + if (!(fgp & FGP_CREAT))
> + status = -ENODATA;
> + else if (iter->flags & IOMAP_NOWAIT)
> + status = -EAGAIN;
> + else
> + status = -ENOMEM;
> goto out_no_page;
> }
>
> @@ -791,7 +796,7 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
> break;
> }
>
> - status = iomap_write_begin(iter, pos, bytes, &folio);
> + status = iomap_write_begin(iter, pos, bytes, &folio, FGP_CREAT);
> if (unlikely(status))
> break;
> if (iter->iomap.flags & IOMAP_F_STALE)
> @@ -1101,7 +1106,7 @@ static loff_t iomap_unshare_iter(struct iomap_iter *iter)
> unsigned long bytes = min_t(loff_t, PAGE_SIZE - offset, length);
> struct folio *folio;
>
> - status = iomap_write_begin(iter, pos, bytes, &folio);
> + status = iomap_write_begin(iter, pos, bytes, &folio, FGP_CREAT);
> if (unlikely(status))
> return status;
> if (iter->iomap.flags & IOMAP_F_STALE)
> @@ -1147,10 +1152,14 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
> loff_t pos = iter->pos;
> loff_t length = iomap_length(iter);
> loff_t written = 0;
> + unsigned fgp = FGP_CREAT;
>
> /* already zeroed? we're done. */
> - if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> + if (srcmap->type == IOMAP_HOLE)
> return length;
> + /* only do page cache lookups over unwritten extents */
> + if (srcmap->type == IOMAP_UNWRITTEN)
> + fgp = 0;
>
> do {
> struct folio *folio;
> @@ -1158,9 +1167,20 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
> size_t offset;
> size_t bytes = min_t(u64, SIZE_MAX, length);
>
> - status = iomap_write_begin(iter, pos, bytes, &folio);
> - if (status)
> + status = iomap_write_begin(iter, pos, bytes, &folio, fgp);
> + if (status) {
> + if (status == -ENODATA) {
> + /*
> + * No folio was found, so skip to the start of
> + * the next potential entry in the page cache
> + * and continue from there.
> + */
> + if (bytes > PAGE_SIZE - offset_in_page(pos))
> + bytes = PAGE_SIZE - offset_in_page(pos);
> + goto loop_continue;
> + }
> return status;
> + }
> if (iter->iomap.flags & IOMAP_F_STALE)
> break;
>
> @@ -1168,6 +1188,19 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
> if (bytes > folio_size(folio) - offset)
> bytes = folio_size(folio) - offset;
>
> + /*
> + * If the folio over an unwritten extent is clean, then we
> + * aren't going to touch the data in it at all. We don't want to
> + * mark it dirty or change the uptodate state of data in the
> + * page, so we just unlock it and skip to the next range over
> + * the unwritten extent we need to check.
> + */
> + if (srcmap->type == IOMAP_UNWRITTEN &&
> + !folio_test_dirty(folio)) {
> + folio_unlock(folio);
> + goto loop_continue;
> + }
> +
> folio_zero_range(folio, offset, bytes);
> folio_mark_accessed(folio);
>
> @@ -1175,6 +1208,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
> if (WARN_ON_ONCE(bytes == 0))
> return -EIO;
>
> +loop_continue:
> pos += bytes;
> length -= bytes;
> written += bytes;
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 2e10e1c66ad6..d7c2f8c49f94 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -839,17 +839,7 @@ xfs_setattr_size(
> trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
> error = xfs_zero_range(ip, oldsize, newsize - oldsize,
> &did_zeroing);
> - } else {
> - /*
> - * iomap won't detect a dirty page over an unwritten block (or a
> - * cow block over a hole) and subsequently skips zeroing the
> - * newly post-EOF portion of the page. Flush the new EOF to
> - * convert the block before the pagecache truncate.
> - */
> - error = filemap_write_and_wait_range(inode->i_mapping, newsize,
> - newsize);
> - if (error)
> - return error;
> + } else if (newsize != oldsize) {
> error = xfs_truncate_page(ip, newsize, &did_zeroing);
> }
>
> --
> 2.38.1
>
next prev parent reply other threads:[~2022-12-01 2:08 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-12-01 0:52 [PATCH] [RFC] iomap: zeroing needs to be pagecache aware Dave Chinner
2022-12-01 2:08 ` Darrick J. Wong [this message]
2022-12-01 2:43 ` Dave Chinner
2022-12-01 3:59 ` Darrick J. Wong
2022-12-23 16:15 ` Christoph Hellwig
2022-12-13 15:18 ` Brian Foster
2022-12-23 16:15 ` Christoph Hellwig
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Y4gMhHsGriqPhNsR@magnolia \
--to=djwong@kernel.org \
--cc=david@fromorbit.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox