* [PATCH v2 0/2] iomap: avoid flushes for partial eof zeroing

From: Brian Foster
Date: 2024-10-31 14:04 UTC
To: linux-fsdevel

Hi all,

Here's v2 of the performance improvement for zero range. This is the
same general idea as v1, but reworked to lift the special handling of
zeroed mappings into the caller and open-code the two approaches from
there.

The idea is that for partial eof zeroing, we can check whether the
folio for the block is already dirty in pagecache, and if so, zero it
directly. Otherwise, fall back to the existing behavior for the
remainder of the range.

This brings stress-ng metamix performance back in my local tests and
survives fstests without any observed regressions. Thoughts, reviews,
flames appreciated.

Brian

v2:
- Added patch 1 to lift zeroed mapping handling code into the caller.
- Split unaligned start range handling out at the top level.
- Retain existing conditional flush behavior (vs. unconditional flush)
  for the remaining range.
v1: https://lore.kernel.org/linux-fsdevel/20241023143029.11275-1-bfoster@redhat.com/

Brian Foster (2):
  iomap: lift zeroed mapping handling into iomap_zero_range()
  iomap: elide flush from partial eof zero range

 fs/iomap/buffered-io.c | 99 ++++++++++++++++++++++++------------------
 1 file changed, 56 insertions(+), 43 deletions(-)

-- 
2.46.2
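In code terms, the series boils down to the following control flow for
iomap_zero_range() (a condensed, commented sketch paraphrased from the
diffs below; helper names are taken from the patches and error handling
is elided, so see the actual patches for the real code):

	unsigned int blocksize = i_blocksize(inode);
	unsigned int off = pos & (blocksize - 1);
	loff_t plen = min_t(loff_t, len, blocksize - off);

	/*
	 * Fast path: the range starts partway into a block (the partial
	 * eof case) and that subrange is already dirty in pagecache.
	 * Zero the folio directly via the normal buffered zeroing
	 * iterator; no flush required.
	 */
	if (off && filemap_range_needs_writeback(mapping, pos, pos + plen - 1)) {
		/* run iomap_zero_iter() over just the partial block */
	}

	/*
	 * Fallback for the rest of the range: skip holes and unwritten
	 * mappings while pagecache is clean, and flush at most once if
	 * dirty pagecache overlaps a mapping that might convert to
	 * written at writeback completion, then reprocess the updated
	 * mappings.
	 */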
* [PATCH v2 1/2] iomap: lift zeroed mapping handling into iomap_zero_range()

From: Brian Foster
Date: 2024-10-31 14:04 UTC
To: linux-fsdevel

In preparation for special handling of subranges, lift the zeroed
mapping logic from the iterator into the caller. Since this puts the
pagecache dirty check and flushing in the same place, streamline the
comments a bit as well.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/iomap/buffered-io.c | 63 ++++++++++++++----------------------------
 1 file changed, 21 insertions(+), 42 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index aa587b2142e2..60386cb7b9ef 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1365,40 +1365,12 @@ static inline int iomap_zero_iter_flush_and_stale(struct iomap_iter *i)
 	return filemap_write_and_wait_range(mapping, i->pos, end);
 }
 
-static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
-		bool *range_dirty)
+static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 {
-	const struct iomap *srcmap = iomap_iter_srcmap(iter);
 	loff_t pos = iter->pos;
 	loff_t length = iomap_length(iter);
 	loff_t written = 0;
 
-	/*
-	 * We must zero subranges of unwritten mappings that might be dirty in
-	 * pagecache from previous writes. We only know whether the entire range
-	 * was clean or not, however, and dirty folios may have been written
-	 * back or reclaimed at any point after mapping lookup.
-	 *
-	 * The easiest way to deal with this is to flush pagecache to trigger
-	 * any pending unwritten conversions and then grab the updated extents
-	 * from the fs. The flush may change the current mapping, so mark it
-	 * stale for the iterator to remap it for the next pass to handle
-	 * properly.
-	 *
-	 * Note that holes are treated the same as unwritten because zero range
-	 * is (ab)used for partial folio zeroing in some cases. Hole backed
-	 * post-eof ranges can be dirtied via mapped write and the flush
-	 * triggers writeback time post-eof zeroing.
-	 */
-	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) {
-		if (*range_dirty) {
-			*range_dirty = false;
-			return iomap_zero_iter_flush_and_stale(iter);
-		}
-		/* range is clean and already zeroed, nothing to do */
-		return length;
-	}
-
 	do {
 		struct folio *folio;
 		int status;
 
@@ -1448,24 +1420,31 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 	bool range_dirty;
 
 	/*
-	 * Zero range wants to skip pre-zeroed (i.e. unwritten) mappings, but
-	 * pagecache must be flushed to ensure stale data from previous
-	 * buffered writes is not exposed. A flush is only required for certain
-	 * types of mappings, but checking pagecache after mapping lookup is
-	 * racy with writeback and reclaim.
+	 * Zero range can skip mappings that are zero on disk so long as
+	 * pagecache is clean. If pagecache was dirty prior to zero range, the
+	 * mapping converts on writeback completion and must be zeroed.
 	 *
-	 * Therefore, check the entire range first and pass along whether any
-	 * part of it is dirty. If so and an underlying mapping warrants it,
-	 * flush the cache at that point. This trades off the occasional false
-	 * positive (and spurious flush, if the dirty data and mapping don't
-	 * happen to overlap) for simplicity in handling a relatively uncommon
-	 * situation.
+	 * The simplest way to deal with this is to flush pagecache and process
+	 * the updated mappings. To avoid an unconditional flush, check dirty
+	 * state and defer the flush until a combination of dirty pagecache and
+	 * at least one mapping that might convert on writeback is seen.
 	 */
 	range_dirty = filemap_range_needs_writeback(inode->i_mapping,
 			pos, pos + len - 1);
+	while ((ret = iomap_iter(&iter, ops)) > 0) {
+		const struct iomap *s = iomap_iter_srcmap(&iter);
+		if (s->type == IOMAP_HOLE || s->type == IOMAP_UNWRITTEN) {
+			loff_t p = iomap_length(&iter);
+			if (range_dirty) {
+				range_dirty = false;
+				p = iomap_zero_iter_flush_and_stale(&iter);
+			}
+			iter.processed = p;
+			continue;
+		}
 
-	while ((ret = iomap_iter(&iter, ops)) > 0)
-		iter.processed = iomap_zero_iter(&iter, did_zero, &range_dirty);
+		iter.processed = iomap_zero_iter(&iter, did_zero);
+	}
 	return ret;
 }
 EXPORT_SYMBOL_GPL(iomap_zero_range);
-- 
2.46.2
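For context on the flush path this patch preserves: the
iomap_zero_iter_flush_and_stale() helper is only partially visible in
the hunk context above. Reconstructed from those context lines (so the
exact body here is approximate), it looks roughly like:

	static inline int iomap_zero_iter_flush_and_stale(struct iomap_iter *i)
	{
		struct address_space *mapping = i->inode->i_mapping;
		loff_t end = i->pos + i->len - 1;

		/*
		 * Mark the current mapping stale so the iterator remaps
		 * the range on the next pass, then write back and wait
		 * so any pending unwritten conversions complete before
		 * the mapping is looked up again.
		 */
		i->iomap.flags |= IOMAP_F_STALE;
		return filemap_write_and_wait_range(mapping, i->pos, end);
	}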
* Re: [PATCH v2 1/2] iomap: lift zeroed mapping handling into iomap_zero_range()

From: Darrick J. Wong
Date: 2024-11-06 0:14 UTC
To: Brian Foster
Cc: linux-fsdevel

On Thu, Oct 31, 2024 at 10:04:47AM -0400, Brian Foster wrote:
> In preparation for special handling of subranges, lift the zeroed
> mapping logic from the iterator into the caller. Since this puts the
> pagecache dirty check and flushing in the same place, streamline the
> comments a bit as well.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---

[...]

> 	range_dirty = filemap_range_needs_writeback(inode->i_mapping,
> 			pos, pos + len - 1);
> +	while ((ret = iomap_iter(&iter, ops)) > 0) {
> +		const struct iomap *s = iomap_iter_srcmap(&iter);

Needs a blank line after the declaration, but other than picking nits
this looks ok to me.

--D

> +		if (s->type == IOMAP_HOLE || s->type == IOMAP_UNWRITTEN) {

[...]
* [PATCH v2 2/2] iomap: elide flush from partial eof zero range

From: Brian Foster
Date: 2024-10-31 14:04 UTC
To: linux-fsdevel

iomap zero range flushes pagecache in certain situations to
determine which parts of the range might require zeroing if dirty
data is present in pagecache. The kernel test robot recently
reported a regression associated with this flushing in the
following stress-ng workload on XFS:

stress-ng --timeout 60 --times --verify --metrics --no-rand-seed --metamix 64

This workload involves repeated small, strided, extending writes. On
XFS, this produces a pattern of post-eof speculative preallocation,
conversion of preallocation from delalloc to unwritten, dirtying
pagecache over newly unwritten blocks, and then rinse and repeat
from the new EOF. This leads to repetitive flushing of the EOF folio
via the zero range call XFS uses for writes that start beyond
current EOF.

To mitigate this problem, special case EOF block zeroing to prefer
zeroing the folio over a flush when the EOF folio is already dirty.
To do this, split out and open code handling of an unaligned start
offset. This brings most of the performance back by avoiding flushes
on zero range calls via write and truncate extension operations. The
flush doesn't occur in these situations because the entire range is
post-eof and therefore the folio that overlaps EOF is the only one
in the range.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/iomap/buffered-io.c | 42 ++++++++++++++++++++++++++++++++++++++----
 1 file changed, 38 insertions(+), 4 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 60386cb7b9ef..343a2fa29bec 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -227,6 +227,18 @@ static void ifs_free(struct folio *folio)
 	kfree(ifs);
 }
 
+/* helper to reset an iter for reuse */
+static inline void
+iomap_iter_init(struct iomap_iter *iter, struct inode *inode, loff_t pos,
+		loff_t len, unsigned flags)
+{
+	memset(iter, 0, sizeof(*iter));
+	iter->inode = inode;
+	iter->pos = pos;
+	iter->len = len;
+	iter->flags = flags;
+}
+
 /*
  * Calculate the range inside the folio that we actually need to read.
  */
@@ -1416,6 +1428,10 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 		.len		= len,
 		.flags		= IOMAP_ZERO,
 	};
+	struct address_space *mapping = inode->i_mapping;
+	unsigned int blocksize = i_blocksize(inode);
+	unsigned int off = pos & (blocksize - 1);
+	loff_t plen = min_t(loff_t, len, blocksize - off);
 	int ret;
 	bool range_dirty;
 
@@ -1425,12 +1441,30 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 	 * mapping converts on writeback completion and must be zeroed.
 	 *
 	 * The simplest way to deal with this is to flush pagecache and process
-	 * the updated mappings. To avoid an unconditional flush, check dirty
-	 * state and defer the flush until a combination of dirty pagecache and
-	 * at least one mapping that might convert on writeback is seen.
+	 * the updated mappings. First, special case the partial eof zeroing
+	 * use case since it is more performance sensitive. Zero the start of
+	 * the range if unaligned and already dirty in pagecache.
+	 */
+	if (off &&
+	    filemap_range_needs_writeback(mapping, pos, pos + plen - 1)) {
+		iter.len = plen;
+		while ((ret = iomap_iter(&iter, ops)) > 0)
+			iter.processed = iomap_zero_iter(&iter, did_zero);
+
+		/* reset iterator for the rest of the range */
+		iomap_iter_init(&iter, inode, iter.pos,
+				len - (iter.pos - pos), IOMAP_ZERO);
+		if (ret || !iter.len)
+			return ret;
+	}
+
+	/*
+	 * To avoid an unconditional flush, check dirty state and defer the
+	 * flush until a combination of dirty pagecache and at least one
+	 * mapping that might convert on writeback is seen.
 	 */
 	range_dirty = filemap_range_needs_writeback(inode->i_mapping,
-			pos, pos + len - 1);
+			iter.pos, iter.pos + iter.len - 1);
 	while ((ret = iomap_iter(&iter, ops)) > 0) {
 		const struct iomap *s = iomap_iter_srcmap(&iter);
 		if (s->type == IOMAP_HOLE || s->type == IOMAP_UNWRITTEN) {
-- 
2.46.2
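To make the unaligned-start math in this patch concrete, consider a
hypothetical filesystem with a 4k block size where an extending write
leaves i_size at 10000 and zero range is then called with pos = 10000,
len = 10000 (values invented for illustration):

	off  = pos & (blocksize - 1)     = 10000 & 4095     = 1808
	plen = min(len, blocksize - off) = min(10000, 2288) = 2288

The special-cased first pass covers [10000, 12288), i.e. the dirty
tail of the EOF block, and zeroes it in the folio directly. The
iterator is then reset and the remaining block-aligned range
[12288, 20000) falls through to the existing conditional flush logic.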
* Re: [PATCH v2 2/2] iomap: elide flush from partial eof zero range

From: Darrick J. Wong
Date: 2024-11-06 0:11 UTC
To: Brian Foster
Cc: linux-fsdevel

On Thu, Oct 31, 2024 at 10:04:48AM -0400, Brian Foster wrote:
> iomap zero range flushes pagecache in certain situations to
> determine which parts of the range might require zeroing if dirty
> data is present in pagecache. The kernel test robot recently
> reported a regression associated with this flushing in the
> following stress-ng workload on XFS:
> 
> stress-ng --timeout 60 --times --verify --metrics --no-rand-seed --metamix 64

[...]

> Signed-off-by: Brian Foster <bfoster@redhat.com>

Cc: <stable@vger.kernel.org> # v6.12-rc1
Fixes: 7d9b474ee4cc37 ("iomap: make zero range flush conditional on unwritten mappings")

perhaps?

> +/* helper to reset an iter for reuse */
> +static inline void
> +iomap_iter_init(struct iomap_iter *iter, struct inode *inode, loff_t pos,
> +		loff_t len, unsigned flags)

Nit: maybe call this iomap_iter_reset() ?

Also I wonder if it's really safe to zero iomap_iter::private?
Won't doing that leave a minor logic bomb?

> +{
> +	memset(iter, 0, sizeof(*iter));
> +	iter->inode = inode;
> +	iter->pos = pos;
> +	iter->len = len;
> +	iter->flags = flags;
> +}

[...]

> +		/* reset iterator for the rest of the range */
> +		iomap_iter_init(&iter, inode, iter.pos,
> +				len - (iter.pos - pos), IOMAP_ZERO);

Nit: maybe one more tab ^ here?

Also from the previous thread: can you reset the original iter instead
of declaring a second one by zeroing the mappings/processed fields,
re-expanding iter::len, and resetting iter::flags?

I guess we'll still do the flush if the start of the zeroing range
aligns with an fsblock? I guess if you're going to do a lot of small
extensions then once per fsblock isn't too bad?

--D

[...]
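The iomap_iter::private concern here is that a filesystem may attach
state there for the duration of an operation, and the memset() in
iomap_iter_init() silently wipes it between the two passes. One way to
sidestep that would be a reset that touches only per-iteration state,
e.g. (a hypothetical sketch for illustration, not code from either
version of the patch):

	static inline void
	iomap_iter_reset(struct iomap_iter *iter, loff_t pos, loff_t len)
	{
		/*
		 * Clear only per-iteration state; iter->inode,
		 * iter->flags and notably iter->private survive for
		 * the next pass.
		 */
		iter->pos = pos;
		iter->len = len;
		iter->processed = 0;
		memset(&iter->iomap, 0, sizeof(iter->iomap));
		memset(&iter->srcmap, 0, sizeof(iter->srcmap));
	}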
* Re: [PATCH v2 2/2] iomap: elide flush from partial eof zero range

From: Brian Foster
Date: 2024-11-06 14:13 UTC
To: Darrick J. Wong
Cc: linux-fsdevel

On Tue, Nov 05, 2024 at 04:11:30PM -0800, Darrick J. Wong wrote:
> On Thu, Oct 31, 2024 at 10:04:48AM -0400, Brian Foster wrote:
> > iomap zero range flushes pagecache in certain situations to
> > determine which parts of the range might require zeroing if dirty
> > data is present in pagecache. [...]
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> 
> Cc: <stable@vger.kernel.org> # v6.12-rc1
> Fixes: 7d9b474ee4cc37 ("iomap: make zero range flush conditional on unwritten mappings")
> 
> perhaps?

Hmm.. I'm reluctant just because I was never super convinced that this
was all that important. A test robot called it out and it just seemed
easy enough to improve.

> > +/* helper to reset an iter for reuse */
> > +static inline void
> > +iomap_iter_init(struct iomap_iter *iter, struct inode *inode, loff_t pos,
> > +		loff_t len, unsigned flags)
> 
> Nit: maybe call this iomap_iter_reset() ?

Sure, I like that.

> Also I wonder if it's really safe to zero iomap_iter::private?
> Won't doing that leave a minor logic bomb?

Indeed, good catch.

[...]

> > +		/* reset iterator for the rest of the range */
> > +		iomap_iter_init(&iter, inode, iter.pos,
> > +				len - (iter.pos - pos), IOMAP_ZERO);
> 
> Nit: maybe one more tab ^ here?
> 
> Also from the previous thread: can you reset the original iter instead
> of declaring a second one by zeroing the mappings/processed fields,
> re-expanding iter::len, and resetting iter::flags?

I'm not sure what you mean by "declaring a second one." I think you're
asking whether we could just zero out the fields that need it rather
than reinit the whole thing...?

Context: I originally had this open coded and created the helper to
clean up the code. I opted to memset the whole thing to avoid creating
a dependency that would have to be updated if the iter code ever
changed, but the ->private thing kind of shows how that problem goes
both ways.

Hmmmmm.. what do you think about just fixing up the iteration path to
reset these fields instead? We already clear them in
iomap_iter_advance() when another iteration is expected. On a first
pass, I don't see anywhere the terminal case would care if they were
reset there as well. I'll have to double check and test of course, but
issues notwithstanding, I suspect that would allow the original logic
of just tacking the remaining length onto iter.len and continuing on.
Hm?

> I guess we'll still do the flush if the start of the zeroing range
> aligns with an fsblock? I guess if you're going to do a lot of small
> extensions then once per fsblock isn't too bad?

Yeah.. we wouldn't be partially zeroing the EOF block in that case, so
we'd fall back to the default behavior. Did you have another
case/workload in mind you were concerned about?

BTW, and just in case you missed the analysis in the original report
thread [1], the performance hit here could also be partially attributed
to commit 5ce5674187c34 ("xfs: convert delayed extents to unwritten
when zeroing post eof blocks"). I'm skeptical that an unconditional
physical block allocation per write extension is always a good idea
over something more heuristic-based, but as often with XFS, I'm a bit
too apathetic about the obstruction^Wreview process to dig into that
one..

I do have another minimal iomap patch to warn about the post-eof zero
range angle. I'll tack that onto the next version of this series for
discussion.

Thanks for the feedback.

Brian

[1] https://lore.kernel.org/linux-xfs/ZxkE93Vz3ZQaAFO1@bfoster/
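If iomap_iter_advance() also cleared the mapping and processed state in
its terminal case as Brian suggests above, the caller-side reset in
patch 2 could shrink to something like the following (a rough, untested
sketch of the idea, contingent on that iterator change actually being
made):

	if (off &&
	    filemap_range_needs_writeback(mapping, pos, pos + plen - 1)) {
		iter.len = plen;
		while ((ret = iomap_iter(&iter, ops)) > 0)
			iter.processed = iomap_zero_iter(&iter, did_zero);
		if (ret)
			return ret;

		/* no full reinit needed: extend the range and continue */
		iter.len = len - (iter.pos - pos);
		if (!iter.len)
			return ret;
	}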