From: Brian Foster <bfoster@redhat.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org
Subject: Re: [PATCH] iomap: elide zero range flush from partial eof zeroing
Date: Thu, 24 Oct 2024 15:42:41 -0400 [thread overview]
Message-ID: <ZxqjMSg4c0UivDYU@bfoster> (raw)
In-Reply-To: <ZxqGujaIJmnHjgZd@bfoster>

On Thu, Oct 24, 2024 at 01:41:14PM -0400, Brian Foster wrote:
> On Thu, Oct 24, 2024 at 10:08:17AM -0700, Darrick J. Wong wrote:
> > On Wed, Oct 23, 2024 at 10:30:29AM -0400, Brian Foster wrote:
> > > iomap zero range performs a pagecache flush upon seeing unwritten
> > > extents with dirty pagecache in order to determine accurate
> > > subranges that require direct zeroing. This is to support an
> > > optimization where clean, unwritten ranges are skipped as they are
> > > already zero on-disk.
> > >
> > > Certain use cases for zero range are more sensitive to flush latency
> > > than others. The kernel test robot recently reported a regression in
> > > the following stress-ng workload on XFS:
> > >
> > > stress-ng --timeout 60 --times --verify --metrics --no-rand-seed --metamix 64
> > >
> > > This workload involves a series of small, strided, file-extending
> > > writes. On XFS, this produces a pattern of allocating post-eof
> > > speculative preallocation, converting preallocation to unwritten on
> > > zero range calls, dirtying pagecache over the converted mapping, and
> > > then repeating the sequence again from the updated EOF. This
> > > basically produces a sequence of pagecache flushes on the partial
> > > EOF block zeroing use case of zero range.
> > >
> > > To mitigate this problem, special case the EOF block zeroing use
> > > case to prefer zeroing over a pagecache flush when the EOF folio is
> > > already dirty. This brings most of the performance back by avoiding
> > > flushes on write and truncate extension operations, while preserving
> > > the ability for iomap to flush and properly process larger ranges.
> > >
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >
> > > Hi iomap maintainers,
> > >
> > > This is an incremental optimization for the regression reported by the
> > > test robot here[1]. I'm not totally convinced this is necessary as an
> > > immediate fix, but the discussion on that thread was enough to suggest
> > > it could be. I don't really love the factoring, but I had to play a bit
> > > of whack-a-mole between fstests and stress-ng to restore performance and
> > > still maintain behavior expectations for some of the tests.
> > >
> > > On a positive note, exploring this gave me what I think is a better idea
> > > for dealing with zero range overall, so I'm working on a followup to
> > > this that reworks it by splitting zero range across block alignment
> > > boundaries (similar to how something like truncate page range works, for
> > > example). This simplifies things by isolating the dirty range check to a
> > > single folio on an unaligned start offset, which lets the _iter() call
> > > do a skip or zero (i.e. no more flush_and_stale()), and then
> > > unconditionally flush the aligned portion to end-of-range. The latter
> > > flush should be a no-op for every use case I've seen so far, so this
> > > might entirely avoid the need for anything more complex for zero range.
> > >
> > > In summary, I'm posting this as an optional and more "stable-worthy"
> > > patch for reference and for the maintainers to consider as they like. I
> > > think it's reasonable to include if we are concerned about this
> > > particular stress-ng test and are OK with it as a transient solution.
> > > But if it were up to me, I'd probably sit on it for a bit to determine
> > > if a more practical user/workload is affected by this, particularly
> > > knowing that I'm trying to rework it. This could always be applied as a
> > > stable fix if really needed, but I just don't think the slightly more
> > > invasive rework is appropriate for -rc.
> > >
> > > Thoughts, reviews, flames appreciated.
> > >
> > > Brian
> > >
> > > [1] https://lore.kernel.org/linux-xfs/202410141536.1167190b-oliver.sang@intel.com/
> > >
> > > fs/iomap/buffered-io.c | 20 +++++++++++++++++---
> > > 1 file changed, 17 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > index aa587b2142e2..8fd25b14d120 100644
> > > --- a/fs/iomap/buffered-io.c
> > > +++ b/fs/iomap/buffered-io.c
> > > @@ -1372,6 +1372,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
> > > loff_t pos = iter->pos;
> > > loff_t length = iomap_length(iter);
> > > loff_t written = 0;
> > > + bool eof_zero = false;
> > >
> > > /*
> > > * We must zero subranges of unwritten mappings that might be dirty in
> > > @@ -1391,12 +1392,23 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
> > > * triggers writeback time post-eof zeroing.
> > > */
> > > if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) {
> > > - if (*range_dirty) {
> > > + /* range is clean and already zeroed, nothing to do */
> > > + if (!*range_dirty)
> > > + return length;
> > > +
> > > + /* flush for anything other than partial eof zeroing */
> > > + if (pos != i_size_read(iter->inode) ||
> > > + (pos % i_blocksize(iter->inode)) == 0) {
> > > *range_dirty = false;
> > > return iomap_zero_iter_flush_and_stale(iter);
> > > }
> > > - /* range is clean and already zeroed, nothing to do */
> > > - return length;
> > > + /*
> > > + * Special case partial EOF zeroing. Since we know the EOF
> > > + * folio is dirty, prefer in-memory zeroing for it. This avoids
> > > + * excessive flush latency on frequent file size extending
> > > + * operations.
> > > + */
> > > + eof_zero = true;
> > > }
> > >
> > > do {
> > > @@ -1415,6 +1427,8 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
> > > offset = offset_in_folio(folio, pos);
> > > if (bytes > folio_size(folio) - offset)
> > > bytes = folio_size(folio) - offset;
> > > + if (eof_zero && length > bytes)
> > > + length = bytes;
> >
> > What does this do? I think this causes the loop to break after putting
> > the folio that caches @pos? And then I guess we go around the loop in
> > iomap_zero_range again if there were more bytes to zero after this
> > folio?
> >
>
> Yeah.. it's basically just saying that if we fell into folio zeroing due
> to the special case logic above, only process through the end of this
> particular folio and jump back out to process the rest of the range as
> normal. The idea was just to prevent going off and doing a bunch of
> unexpected zeroing across an unwritten mapping just because we had an
> unaligned range that starts with a dirty folio.
>
> FWIW, the reworked variant I have of this currently looks like the
> appended diff. The caveat is this can still flush if a large folio
> happens to overlap the two subranges, but as is it seems to placate the
> stress-ng test. In theory, I think having something like an
> iomap_zero_folio(folio, start_pos, end_pos) that zeroed up through
> min(end_pos, folio_end_pos) for the unaligned part would mitigate that,
> but I'm not quite sure of a clean way to do that; particularly if we
> have a large folio made up of multiple mappings. I'm also still
> undecided on whether to unconditionally flush the rest or try to
> preserve the flush_and_stale() approach as well.
>
> Brian
>
> --- 8< ---
>
...
And here's another variant that preserves the flush_and_stale()
behavior. This one is compile-tested only:

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index aa587b2142e2..37a27c344078 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1365,40 +1365,12 @@ static inline int iomap_zero_iter_flush_and_stale(struct iomap_iter *i)
return filemap_write_and_wait_range(mapping, i->pos, end);
}
-static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
- bool *range_dirty)
+static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
{
- const struct iomap *srcmap = iomap_iter_srcmap(iter);
loff_t pos = iter->pos;
loff_t length = iomap_length(iter);
loff_t written = 0;
- /*
- * We must zero subranges of unwritten mappings that might be dirty in
- * pagecache from previous writes. We only know whether the entire range
- * was clean or not, however, and dirty folios may have been written
- * back or reclaimed at any point after mapping lookup.
- *
- * The easiest way to deal with this is to flush pagecache to trigger
- * any pending unwritten conversions and then grab the updated extents
- * from the fs. The flush may change the current mapping, so mark it
- * stale for the iterator to remap it for the next pass to handle
- * properly.
- *
- * Note that holes are treated the same as unwritten because zero range
- * is (ab)used for partial folio zeroing in some cases. Hole backed
- * post-eof ranges can be dirtied via mapped write and the flush
- * triggers writeback time post-eof zeroing.
- */
- if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) {
- if (*range_dirty) {
- *range_dirty = false;
- return iomap_zero_iter_flush_and_stale(iter);
- }
- /* range is clean and already zeroed, nothing to do */
- return length;
- }
-
do {
struct folio *folio;
int status;
@@ -1434,38 +1406,69 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
return written;
}
+static inline void
+iomap_iter_init(struct iomap_iter *iter, struct inode *inode, loff_t pos,
+ loff_t len)
+{
+ memset(iter, 0, sizeof(*iter));
+ iter->inode = inode;
+ iter->pos = pos;
+ iter->len = len;
+ iter->flags = IOMAP_ZERO;
+}
+
int
iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
const struct iomap_ops *ops)
{
- struct iomap_iter iter = {
- .inode = inode,
- .pos = pos,
- .len = len,
- .flags = IOMAP_ZERO,
- };
+ struct iomap_iter iter;
+ struct address_space *mapping = inode->i_mapping;
+ unsigned int blocksize = i_blocksize(inode);
+ unsigned int off = pos & (blocksize - 1);
+ loff_t plen = min_t(loff_t, len, blocksize - off);
+ bool dirty;
int ret;
- bool range_dirty;
+
+ iomap_iter_init(&iter, inode, pos, len);
/*
- * Zero range wants to skip pre-zeroed (i.e. unwritten) mappings, but
- * pagecache must be flushed to ensure stale data from previous
- * buffered writes is not exposed. A flush is only required for certain
- * types of mappings, but checking pagecache after mapping lookup is
- * racy with writeback and reclaim.
+ * Zero range wants to skip mappings that are already zero on disk, but
+ * the only way to handle unwritten mappings covered by dirty pagecache
+ * is to flush and reprocess the converted mappings after I/O
+ * completion.
*
- * Therefore, check the entire range first and pass along whether any
- * part of it is dirty. If so and an underlying mapping warrants it,
- * flush the cache at that point. This trades off the occasional false
- * positive (and spurious flush, if the dirty data and mapping don't
- * happen to overlap) for simplicity in handling a relatively uncommon
- * situation.
+ * The partial EOF zeroing use case is performance sensitive, so split
+ * and handle an unaligned start of the range separately. The dirty
+ * check tells the iter function whether it can skip or zero the folio
+ * without needing to flush. Larger ranges tend to have already been
+ * flushed by the filesystem, so flush the rest here as a safety measure
+ * and process as normal.
*/
- range_dirty = filemap_range_needs_writeback(inode->i_mapping,
- pos, pos + len - 1);
+ if (off &&
+ filemap_range_needs_writeback(mapping, pos, pos + plen - 1)) {
+ iter.len = plen;
+ while ((ret = iomap_iter(&iter, ops)) > 0)
+ iter.processed = iomap_zero_iter(&iter, did_zero);
+ iomap_iter_init(&iter, inode, iter.pos, len - (iter.pos - pos));
+ if (ret || !iter.len)
+ return ret;
+ }
+
+ dirty = filemap_range_needs_writeback(mapping, iter.pos,
+ iter.pos + iter.len - 1);
+ while ((ret = iomap_iter(&iter, ops)) > 0) {
+ const struct iomap *s = iomap_iter_srcmap(&iter);
+ if (!(s->type == IOMAP_HOLE || s->type == IOMAP_UNWRITTEN)) {
+ iter.processed = iomap_zero_iter(&iter, did_zero);
+ continue;
+ }
+ iter.processed = iomap_length(&iter);
+ if (dirty) {
+ dirty = false;
+ iter.processed = iomap_zero_iter_flush_and_stale(&iter);
+ }
+ }
- while ((ret = iomap_iter(&iter, ops)) > 0)
- iter.processed = iomap_zero_iter(&iter, did_zero, &range_dirty);
return ret;
}
EXPORT_SYMBOL_GPL(iomap_zero_range);