From: "Darrick J. Wong" <djwong@kernel.org>
To: Brian Foster <bfoster@redhat.com>
Cc: linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org
Subject: Re: [PATCH] iomap: elide zero range flush from partial eof zeroing
Date: Thu, 24 Oct 2024 10:08:17 -0700 [thread overview]
Message-ID: <20241024170817.GK21853@frogsfrogsfrogs> (raw)
In-Reply-To: <20241023143029.11275-1-bfoster@redhat.com>
On Wed, Oct 23, 2024 at 10:30:29AM -0400, Brian Foster wrote:
> iomap zero range performs a pagecache flush upon seeing unwritten
> extents with dirty pagecache in order to determine accurate
> subranges that require direct zeroing. This is to support an
> optimization where clean, unwritten ranges are skipped as they are
> already zero on-disk.
>
> Certain use cases for zero range are more sensitive to flush latency
> than others. The kernel test robot recently reported a regression in
> the following stress-ng workload on XFS:
>
> stress-ng --timeout 60 --times --verify --metrics --no-rand-seed --metamix 64
>
> This workload involves a series of small, strided, file-extending
> writes. On XFS, this produces a pattern of allocating post-eof
> speculative preallocation, converting preallocation to unwritten on
> zero range calls, dirtying pagecache over the converted mapping, and
> then repeating the sequence again from the updated EOF. This
> basically produces a sequence of pagecache flushes on the partial
> EOF block zeroing use case of zero range.
>
> To mitigate this problem, special case the EOF block zeroing use
> case to prefer zeroing over a pagecache flush when the EOF folio is
> already dirty. This brings most of the performance back by avoiding
> flushes on write and truncate extension operations, while preserving
> the ability for iomap to flush and properly process larger ranges.
>
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>
> Hi iomap maintainers,
>
> This is an incremental optimization for the regression reported by the
> test robot here[1]. I'm not totally convinced this is necessary as an
> immediate fix, but the discussion on that thread was enough to suggest
> it could be. I don't really love the factoring, but I had to play a bit
> of whack-a-mole between fstests and stress-ng to restore performance and
> still maintain behavior expectations for some of the tests.
>
> On a positive note, exploring this gave me what I think is a better idea
> for dealing with zero range overall, so I'm working on a followup to
> this that reworks it by splitting zero range across block alignment
> boundaries (similar to how something like truncate page range works, for
> example). This simplifies things by isolating the dirty range check to a
> single folio on an unaligned start offset, which lets the _iter() call
> do a skip or zero (i.e. no more flush_and_stale()), and then
> unconditionally flush the aligned portion to end-of-range. The latter
> flush should be a no-op for every use case I've seen so far, so this
> might entirely avoid the need for anything more complex for zero range.
>
> In summary, I'm posting this as an optional and more "stable-worthy"
> patch for reference and for the maintainers to consider as they like. I
> think it's reasonable to include if we are concerned about this
> particular stress-ng test and are OK with it as a transient solution.
> But if it were up to me, I'd probably sit on it for a bit to determine
> if a more practical user/workload is affected by this, particularly
> knowing that I'm trying to rework it. This could always be applied as a
> stable fix if really needed, but I just don't think the slightly more
> invasive rework is appropriate for -rc.
>
> Thoughts, reviews, flames appreciated.
>
> Brian
>
> [1] https://lore.kernel.org/linux-xfs/202410141536.1167190b-oliver.sang@intel.com/
>
> fs/iomap/buffered-io.c | 20 +++++++++++++++++---
> 1 file changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index aa587b2142e2..8fd25b14d120 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1372,6 +1372,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
> loff_t pos = iter->pos;
> loff_t length = iomap_length(iter);
> loff_t written = 0;
> + bool eof_zero = false;
>
> /*
> * We must zero subranges of unwritten mappings that might be dirty in
> @@ -1391,12 +1392,23 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
> * triggers writeback time post-eof zeroing.
> */
> if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) {
> - if (*range_dirty) {
> + /* range is clean and already zeroed, nothing to do */
> + if (!*range_dirty)
> + return length;
> +
> + /* flush for anything other than partial eof zeroing */
> + if (pos != i_size_read(iter->inode) ||
> + (pos % i_blocksize(iter->inode)) == 0) {
> *range_dirty = false;
> return iomap_zero_iter_flush_and_stale(iter);
> }
> - /* range is clean and already zeroed, nothing to do */
> - return length;
> + /*
> + * Special case partial EOF zeroing. Since we know the EOF
> + * folio is dirty, prefer in-memory zeroing for it. This avoids
> + * excessive flush latency on frequent file size extending
> + * operations.
> + */
> + eof_zero = true;
> }
>
> do {
> @@ -1415,6 +1427,8 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
> offset = offset_in_folio(folio, pos);
> if (bytes > folio_size(folio) - offset)
> bytes = folio_size(folio) - offset;
> + if (eof_zero && length > bytes)
> + length = bytes;
What does this do? I think this causes the loop to break after putting
the folio that caches @pos? And then I guess we go around the loop in
iomap_zero_range again if there were more bytes to zero after this
folio?
--D
>
> folio_zero_range(folio, offset, bytes);
> folio_mark_accessed(folio);
> --
> 2.46.2
>
>