linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v2 0/2] iomap: flush dirty cache over unwritten mappings on zero range
@ 2024-08-28 18:19 Brian Foster
  2024-08-28 18:19 ` [PATCH v2 1/2] iomap: fix handling of dirty folios over unwritten extents Brian Foster
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Brian Foster @ 2024-08-28 18:19 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs, djwong, josef, david

Hi all,

Here's v2 of the iomap zero range flush fixes. No real changes here
other than a comment update to better explain a subtle corner case. The
latest version of corresponding test support is posted here [1].
Thoughts, reviews, flames appreciated.

Brian

[1] https://lore.kernel.org/fstests/20240828181534.41054-1-bfoster@redhat.com/

v2:
- Update comment in patch 2 to explain hole case.
v1: https://lore.kernel.org/linux-fsdevel/20240822145910.188974-1-bfoster@redhat.com/
- Alternative approach, flush instead of revalidate.
rfc: https://lore.kernel.org/linux-fsdevel/20240718130212.23905-1-bfoster@redhat.com/

Brian Foster (2):
  iomap: fix handling of dirty folios over unwritten extents
  iomap: make zero range flush conditional on unwritten mappings

 fs/iomap/buffered-io.c | 57 +++++++++++++++++++++++++++++++++++++++---
 fs/xfs/xfs_iops.c      | 10 --------
 2 files changed, 53 insertions(+), 14 deletions(-)

-- 
2.45.0



* [PATCH v2 1/2] iomap: fix handling of dirty folios over unwritten extents
  2024-08-28 18:19 [PATCH v2 0/2] iomap: flush dirty cache over unwritten mappings on zero range Brian Foster
@ 2024-08-28 18:19 ` Brian Foster
  2024-08-28 22:22   ` Darrick J. Wong
  2024-08-28 18:19 ` [PATCH v2 2/2] iomap: make zero range flush conditional on unwritten mappings Brian Foster
  2024-08-28 20:44 ` [PATCH v2 0/2] iomap: flush dirty cache over unwritten mappings on zero range Josef Bacik
  2 siblings, 1 reply; 13+ messages in thread
From: Brian Foster @ 2024-08-28 18:19 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs, djwong, josef, david

The iomap zero range implementation doesn't properly handle dirty
pagecache over unwritten mappings. It skips such mappings as if they
were pre-zeroed. If some part of an unwritten mapping is dirty in
pagecache from a previous write, the data in cache should be zeroed
as well. Instead, the data is left in cache and creates a stale data
exposure problem if writeback occurs sometime after the zero range.

Most callers are unaffected by this because the higher level
filesystem contexts that call zero range typically perform a filemap
flush of the target range for other reasons. A couple of contexts
that don't otherwise need to flush are file size extension via
write and truncate in XFS. The former path is currently susceptible
to the stale data exposure problem and the latter performs a flush
specifically to work around it.

This is clearly inconsistent and incomplete. As a first step toward
correcting behavior, lift the XFS workaround to iomap_zero_range()
and unconditionally flush the range before the zero range operation
proceeds. While this appears to be a bit of a big hammer, nearly all
users already do this from the calling context, save for the couple
of exceptions noted above. Future patches will optimize or elide
this flush while maintaining functional correctness.
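
As a rough illustration, the susceptible size extension path can be
hit with a sequence like the following (hypothetical sketch, error
checking omitted, assuming a 1k FSB, 4k page size fs; the fstests
case posted with the cover letter is the authoritative reproducer):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[2048];
		char *p;
		int fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC,
			      0644);

		memset(buf, 0xab, sizeof(buf));
		pwrite(fd, buf, 2048, 0);	/* i_size = 2k */
		/* post-eof blocks [2k, 4k) become unwritten */
		fallocate(fd, FALLOC_FL_KEEP_SIZE, 2048, 2048);
		/* mapped write dirties the whole EOF folio */
		p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
			 fd, 0);
		memset(p, 0xcd, 4096);
		/*
		 * Size-extending write: zero range runs over [2k, 8k),
		 * sees the unwritten mapping at [2k, 4k) and skips it
		 * as pre-zeroed, leaving the dirty 0xcd data in cache.
		 */
		pwrite(fd, buf, 1024, 8192);
		fsync(fd);	/* writeback persists the stale data */
		return 0;
	}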

Fixes: ae259a9c8593 ("fs: introduce iomap infrastructure")
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/iomap/buffered-io.c | 10 ++++++++++
 fs/xfs/xfs_iops.c      | 10 ----------
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index f420c53d86ac..3e846f43ff48 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1451,6 +1451,16 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 	};
 	int ret;
 
+	/*
+	 * Zero range wants to skip pre-zeroed (i.e. unwritten) mappings, but
+	 * pagecache must be flushed to ensure stale data from previous
+	 * buffered writes is not exposed.
+	 */
+	ret = filemap_write_and_wait_range(inode->i_mapping,
+			pos, pos + len - 1);
+	if (ret)
+		return ret;
+
 	while ((ret = iomap_iter(&iter, ops)) > 0)
 		iter.processed = iomap_zero_iter(&iter, did_zero);
 	return ret;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 1cdc8034f54d..ddd3697e6ecd 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -870,16 +870,6 @@ xfs_setattr_size(
 		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
 				&did_zeroing);
 	} else {
-		/*
-		 * iomap won't detect a dirty page over an unwritten block (or a
-		 * cow block over a hole) and subsequently skips zeroing the
-		 * newly post-EOF portion of the page. Flush the new EOF to
-		 * convert the block before the pagecache truncate.
-		 */
-		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
-						     newsize);
-		if (error)
-			return error;
 		error = xfs_truncate_page(ip, newsize, &did_zeroing);
 	}
 
-- 
2.45.0



* [PATCH v2 2/2] iomap: make zero range flush conditional on unwritten mappings
  2024-08-28 18:19 [PATCH v2 0/2] iomap: flush dirty cache over unwritten mappings on zero range Brian Foster
  2024-08-28 18:19 ` [PATCH v2 1/2] iomap: fix handling of dirty folios over unwritten extents Brian Foster
@ 2024-08-28 18:19 ` Brian Foster
  2024-08-28 22:44   ` Darrick J. Wong
  2024-08-28 20:44 ` [PATCH v2 0/2] iomap: flush dirty cache over unwritten mappings on zero range Josef Bacik
  2 siblings, 1 reply; 13+ messages in thread
From: Brian Foster @ 2024-08-28 18:19 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs, djwong, josef, david

iomap_zero_range() flushes pagecache to mitigate consistency
problems with dirty pagecache and unwritten mappings. The flush is
unconditional over the entire range because checking pagecache state
after mapping lookup is racy with writeback and reclaim. There are
ways around this using iomap's mapping revalidation mechanism, but
this is not supported by all iomap based filesystems and so is not a
generic solution.
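
(For reference, the revalidation mechanism is the validity cookie
scheme: the filesystem stamps a cookie into the iomap at mapping
lookup time and supplies a folio_ops->iomap_valid() hook that the
core calls under folio lock. Paraphrasing the existing
iomap_write_begin() logic, roughly:

	if (folio_ops && folio_ops->iomap_valid &&
	    !folio_ops->iomap_valid(iter->inode, &iter->iomap)) {
		iter->iomap.flags |= IOMAP_F_STALE;
		/* ... back out and remap the range ... */
	}

At the time of writing only XFS wires this up.)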

There is another way around this limitation that is good enough to
filter the flush for most cases in practice. If we check for dirty
pagecache over the target range (instead of flushing unconditionally),
we can keep track of whether the range was dirty before lookup and
defer the flush until/unless we see a combination of dirty cache
backed by an unwritten mapping. We don't necessarily know whether
the dirty cache was backed by the unwritten mapping or some other
(written) part of the range, but the implication of a false
positive here is a spurious flush and thus relatively harmless.

Note that we also flush for hole mappings because iomap_zero_range()
is used for partial folio zeroing in some cases. For example, if a
folio straddles EOF on a sub-page FSB size fs, the post-eof portion
is hole-backed and dirtied via a mapped write, and i_size then
increases before writeback can occur (writeback otherwise zeroes
the post-eof portion of the EOF folio), the folio becomes
inconsistent with disk until it is reclaimed. A flush in this case
executes the partial zeroing from writeback, and iomap knows that
there is otherwise no I/O to submit for hole-backed mappings.
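
Concretely, the hole case corresponds to a sequence like the
following (hypothetical sketch assuming a 1k FSB, 4k page size fs,
with file setup and error checking omitted):

	pwrite(fd, buf, 2048, 0);	/* i_size = 2k, [2k, 4k) hole */
	p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	memset(p, 0xcd, 4096);		/* dirties the post-eof portion */
	ftruncate(fd, 8192);		/* i_size grows before writeback */

Without the flush, zero range skips the hole at [2k, 4k) and reads
of that now in-EOF range return the 0xcd data from cache until the
folio is reclaimed, after which they return zeroes from the hole.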

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/iomap/buffered-io.c | 57 +++++++++++++++++++++++++++++++++++-------
 1 file changed, 48 insertions(+), 9 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 3e846f43ff48..a6e897e6e303 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1393,16 +1393,47 @@ iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
 }
 EXPORT_SYMBOL_GPL(iomap_file_unshare);
 
-static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
+/*
+ * Flush the remaining range of the iter and mark the current mapping stale.
+ * This is used when zero range sees an unwritten mapping that may have had
+ * dirty pagecache over it.
+ */
+static inline int iomap_zero_iter_flush_and_stale(struct iomap_iter *i)
+{
+	struct address_space *mapping = i->inode->i_mapping;
+	loff_t end = i->pos + i->len - 1;
+
+	i->iomap.flags |= IOMAP_F_STALE;
+	return filemap_write_and_wait_range(mapping, i->pos, end);
+}
+
+static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
+		bool *range_dirty)
 {
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
 	loff_t pos = iter->pos;
 	loff_t length = iomap_length(iter);
 	loff_t written = 0;
 
-	/* already zeroed?  we're done. */
-	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
+	/*
+	 * We can skip pre-zeroed mappings so long as either the mapping was
+	 * clean before we started or we've flushed at least once since.
+	 * Otherwise we don't know whether the current mapping had dirty
+	 * pagecache, so flush it now, stale the current mapping, and proceed
+	 * from there.
+	 *
+	 * The hole case is intentionally included because this is (ab)used to
+	 * handle partial folio zeroing in some cases. Hole backed post-eof
+	 * ranges can be dirtied via mapped write and the flush triggers
+	 * writeback time post-eof zeroing.
+	 */
+	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) {
+		if (*range_dirty) {
+			*range_dirty = false;
+			return iomap_zero_iter_flush_and_stale(iter);
+		}
 		return length;
+	}
 
 	do {
 		struct folio *folio;
@@ -1450,19 +1481,27 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 		.flags		= IOMAP_ZERO,
 	};
 	int ret;
+	bool range_dirty;
 
 	/*
 	 * Zero range wants to skip pre-zeroed (i.e. unwritten) mappings, but
 	 * pagecache must be flushed to ensure stale data from previous
-	 * buffered writes is not exposed.
+	 * buffered writes is not exposed. A flush is only required for certain
+	 * types of mappings, but checking pagecache after mapping lookup is
+	 * racy with writeback and reclaim.
+	 *
+	 * Therefore, check the entire range first and pass along whether any
+	 * part of it is dirty. If so and an underlying mapping warrants it,
+	 * flush the cache at that point. This trades off the occasional false
+	 * positive (and spurious flush, if the dirty data and mapping don't
+	 * happen to overlap) for simplicity in handling a relatively uncommon
+	 * situation.
 	 */
-	ret = filemap_write_and_wait_range(inode->i_mapping,
-			pos, pos + len - 1);
-	if (ret)
-		return ret;
+	range_dirty = filemap_range_needs_writeback(inode->i_mapping,
+					pos, pos + len - 1);
 
 	while ((ret = iomap_iter(&iter, ops)) > 0)
-		iter.processed = iomap_zero_iter(&iter, did_zero);
+		iter.processed = iomap_zero_iter(&iter, did_zero, &range_dirty);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(iomap_zero_range);
-- 
2.45.0



* Re: [PATCH v2 0/2] iomap: flush dirty cache over unwritten mappings on zero range
  2024-08-28 18:19 [PATCH v2 0/2] iomap: flush dirty cache over unwritten mappings on zero range Brian Foster
  2024-08-28 18:19 ` [PATCH v2 1/2] iomap: fix handling of dirty folios over unwritten extents Brian Foster
  2024-08-28 18:19 ` [PATCH v2 2/2] iomap: make zero range flush conditional on unwritten mappings Brian Foster
@ 2024-08-28 20:44 ` Josef Bacik
  2 siblings, 0 replies; 13+ messages in thread
From: Josef Bacik @ 2024-08-28 20:44 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs, djwong, david

On Wed, Aug 28, 2024 at 02:19:09PM -0400, Brian Foster wrote:
> Hi all,
> 
> Here's v2 of the iomap zero range flush fixes. No real changes here
> other than a comment update to better explain a subtle corner case. The
> latest version of corresponding test support is posted here [1].
> Thoughts, reviews, flames appreciated.
> 
> Brian
> 

Took me a second to grok what you were doing in the second patch, mostly because
I'm not as familiar with the iomap code, so with that caveat you can add

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef


* Re: [PATCH v2 1/2] iomap: fix handling of dirty folios over unwritten extents
  2024-08-28 18:19 ` [PATCH v2 1/2] iomap: fix handling of dirty folios over unwritten extents Brian Foster
@ 2024-08-28 22:22   ` Darrick J. Wong
  2024-08-29  5:43     ` Christoph Hellwig
  0 siblings, 1 reply; 13+ messages in thread
From: Darrick J. Wong @ 2024-08-28 22:22 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs, josef, david

On Wed, Aug 28, 2024 at 02:19:10PM -0400, Brian Foster wrote:
> The iomap zero range implementation doesn't properly handle dirty
> pagecache over unwritten mappings. It skips such mappings as if they
> were pre-zeroed. If some part of an unwritten mapping is dirty in
> pagecache from a previous write, the data in cache should be zeroed
> as well. Instead, the data is left in cache and creates a stale data
> exposure problem if writeback occurs sometime after the zero range.
> 
> Most callers are unaffected by this because the higher level
> filesystem contexts that call zero range typically perform a filemap
> flush of the target range for other reasons. A couple of contexts
> that don't otherwise need to flush are file size extension via
> write and truncate in XFS. The former path is currently susceptible
> to the stale data exposure problem and the latter performs a flush
> specifically to work around it.
> 
> This is clearly inconsistent and incomplete. As a first step toward
> correcting behavior, lift the XFS workaround to iomap_zero_range()
> and unconditionally flush the range before the zero range operation
> proceeds. While this appears to be a bit of a big hammer, nearly all
> users already do this from the calling context, save for the couple
> of exceptions noted above. Future patches will optimize or elide
> this flush while maintaining functional correctness.
> 
> Fixes: ae259a9c8593 ("fs: introduce iomap infrastructure")
> Signed-off-by: Brian Foster <bfoster@redhat.com>

I wonder why gfs2 (aka the other iomap_zero_range user) doesn't have a
truncate-down flush hammer, but maybe it doesn't support unwritten
extents?  I didn't find anything obvious when I looked, so

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

(but you might want to see if Andreas has any loud objections to this)

--D

> ---
>  fs/iomap/buffered-io.c | 10 ++++++++++
>  fs/xfs/xfs_iops.c      | 10 ----------
>  2 files changed, 10 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index f420c53d86ac..3e846f43ff48 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1451,6 +1451,16 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
>  	};
>  	int ret;
>  
> +	/*
> +	 * Zero range wants to skip pre-zeroed (i.e. unwritten) mappings, but
> +	 * pagecache must be flushed to ensure stale data from previous
> +	 * buffered writes is not exposed.
> +	 */
> +	ret = filemap_write_and_wait_range(inode->i_mapping,
> +			pos, pos + len - 1);
> +	if (ret)
> +		return ret;
> +
>  	while ((ret = iomap_iter(&iter, ops)) > 0)
>  		iter.processed = iomap_zero_iter(&iter, did_zero);
>  	return ret;
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 1cdc8034f54d..ddd3697e6ecd 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -870,16 +870,6 @@ xfs_setattr_size(
>  		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
>  				&did_zeroing);
>  	} else {
> -		/*
> -		 * iomap won't detect a dirty page over an unwritten block (or a
> -		 * cow block over a hole) and subsequently skips zeroing the
> -		 * newly post-EOF portion of the page. Flush the new EOF to
> -		 * convert the block before the pagecache truncate.
> -		 */
> -		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
> -						     newsize);
> -		if (error)
> -			return error;
>  		error = xfs_truncate_page(ip, newsize, &did_zeroing);
>  	}
>  
> -- 
> 2.45.0
> 
> 


* Re: [PATCH v2 2/2] iomap: make zero range flush conditional on unwritten mappings
  2024-08-28 18:19 ` [PATCH v2 2/2] iomap: make zero range flush conditional on unwritten mappings Brian Foster
@ 2024-08-28 22:44   ` Darrick J. Wong
  2024-08-29  0:26     ` Dave Chinner
  2024-08-29 15:03     ` Brian Foster
  0 siblings, 2 replies; 13+ messages in thread
From: Darrick J. Wong @ 2024-08-28 22:44 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs, josef, david

On Wed, Aug 28, 2024 at 02:19:11PM -0400, Brian Foster wrote:
> iomap_zero_range() flushes pagecache to mitigate consistency
> problems with dirty pagecache and unwritten mappings. The flush is
> unconditional over the entire range because checking pagecache state
> after mapping lookup is racy with writeback and reclaim. There are
> ways around this using iomap's mapping revalidation mechanism, but
> this is not supported by all iomap based filesystems and so is not a
> generic solution.
> 
> There is another way around this limitation that is good enough to
> filter the flush for most cases in practice. If we check for dirty
> pagecache over the target range (instead of flushing unconditionally),
> we can keep track of whether the range was dirty before lookup and
> defer the flush until/unless we see a combination of dirty cache
> backed by an unwritten mapping. We don't necessarily know whether
> the dirty cache was backed by the unwritten mapping or some other
> (written) part of the range, but the implication of a false
> positive here is a spurious flush and thus relatively harmless.
> 
> Note that we also flush for hole mappings because iomap_zero_range()
> is used for partial folio zeroing in some cases. For example, if a
> folio straddles EOF on a sub-page FSB size fs, the post-eof portion
> is hole-backed and dirtied via a mapped write, and i_size then
> increases before writeback can occur (writeback otherwise zeroes
> the post-eof portion of the EOF folio), the folio becomes
> inconsistent with disk until it is reclaimed. A flush in this case
> executes the partial zeroing from writeback, and iomap knows that
> there is otherwise no I/O to submit for hole-backed mappings.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/iomap/buffered-io.c | 57 +++++++++++++++++++++++++++++++++++-------
>  1 file changed, 48 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 3e846f43ff48..a6e897e6e303 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1393,16 +1393,47 @@ iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
>  }
>  EXPORT_SYMBOL_GPL(iomap_file_unshare);
>  
> -static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
> +/*
> + * Flush the remaining range of the iter and mark the current mapping stale.
> + * This is used when zero range sees an unwritten mapping that may have had
> + * dirty pagecache over it.
> + */
> +static inline int iomap_zero_iter_flush_and_stale(struct iomap_iter *i)
> +{
> +	struct address_space *mapping = i->inode->i_mapping;
> +	loff_t end = i->pos + i->len - 1;
> +
> +	i->iomap.flags |= IOMAP_F_STALE;
> +	return filemap_write_and_wait_range(mapping, i->pos, end);
> +}
> +
> +static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
> +		bool *range_dirty)
>  {
>  	const struct iomap *srcmap = iomap_iter_srcmap(iter);
>  	loff_t pos = iter->pos;
>  	loff_t length = iomap_length(iter);
>  	loff_t written = 0;
>  
> -	/* already zeroed?  we're done. */
> -	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> +	/*
> +	 * We can skip pre-zeroed mappings so long as either the mapping was
> +	 * clean before we started or we've flushed at least once since.
> +	 * Otherwise we don't know whether the current mapping had dirty
> +	 * pagecache, so flush it now, stale the current mapping, and proceed
> +	 * from there.
> +	 *
> +	 * The hole case is intentionally included because this is (ab)used to
> +	 * handle partial folio zeroing in some cases. Hole backed post-eof
> +	 * ranges can be dirtied via mapped write and the flush triggers
> +	 * writeback time post-eof zeroing.
> +	 */
> +	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) {
> +		if (*range_dirty) {
> +			*range_dirty = false;
> +			return iomap_zero_iter_flush_and_stale(iter);
> +		}
>  		return length;
> +	}
>  
>  	do {
>  		struct folio *folio;
> @@ -1450,19 +1481,27 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
>  		.flags		= IOMAP_ZERO,
>  	};
>  	int ret;
> +	bool range_dirty;
>  
>  	/*
>  	 * Zero range wants to skip pre-zeroed (i.e. unwritten) mappings, but
>  	 * pagecache must be flushed to ensure stale data from previous
> -	 * buffered writes is not exposed.
> +	 * buffered writes is not exposed. A flush is only required for certain
> +	 * types of mappings, but checking pagecache after mapping lookup is
> +	 * racy with writeback and reclaim.
> +	 *
> +	 * Therefore, check the entire range first and pass along whether any
> +	 * part of it is dirty. If so and an underlying mapping warrants it,
> +	 * flush the cache at that point. This trades off the occasional false
> +	 * positive (and spurious flush, if the dirty data and mapping don't
> +	 * happen to overlap) for simplicity in handling a relatively uncommon
> +	 * situation.
>  	 */
> -	ret = filemap_write_and_wait_range(inode->i_mapping,
> -			pos, pos + len - 1);
> -	if (ret)
> -		return ret;
> +	range_dirty = filemap_range_needs_writeback(inode->i_mapping,
> +					pos, pos + len - 1);
>  
>  	while ((ret = iomap_iter(&iter, ops)) > 0)
> -		iter.processed = iomap_zero_iter(&iter, did_zero);
> +		iter.processed = iomap_zero_iter(&iter, did_zero, &range_dirty);

Style nit: Could we do this flush-and-stale from the loop body instead
of passing pointers around?  e.g.

static inline bool iomap_zero_need_flush(const struct iomap_iter *i)
{
	const struct iomap *srcmap = iomap_iter_srcmap(i);

	return srcmap->type == IOMAP_HOLE ||
	       srcmap->type == IOMAP_UNWRITTEN;
}

static inline int iomap_zero_iter_flush(struct iomap_iter *i)
{
	struct address_space *mapping = i->inode->i_mapping;
	loff_t end = i->pos + i->len - 1;

	i->iomap.flags |= IOMAP_F_STALE;
	return filemap_write_and_wait_range(mapping, i->pos, end);
}

and then:

	range_dirty = filemap_range_needs_writeback(...);

	while ((ret = iomap_iter(&iter, ops)) > 0) {
		if (range_dirty && iomap_zero_need_flush(&iter)) {
			/*
			 * Zero range wants to skip pre-zeroed (i.e.
			 * unwritten) mappings, but...
			 */
			range_dirty = false;
			iter.processed = iomap_zero_iter_flush(&iter);
		} else {
			iter.processed = iomap_zero_iter(&iter, did_zero);
		}
	}

The logic looks correct and sensible. :)

--D

>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(iomap_zero_range);
> -- 
> 2.45.0
> 
> 


* Re: [PATCH v2 2/2] iomap: make zero range flush conditional on unwritten mappings
  2024-08-28 22:44   ` Darrick J. Wong
@ 2024-08-29  0:26     ` Dave Chinner
  2024-08-29 15:04       ` Brian Foster
  2024-08-29 15:03     ` Brian Foster
  1 sibling, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2024-08-29  0:26 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-fsdevel, linux-xfs, josef

On Wed, Aug 28, 2024 at 03:44:20PM -0700, Darrick J. Wong wrote:
> On Wed, Aug 28, 2024 at 02:19:11PM -0400, Brian Foster wrote:
> > @@ -1450,19 +1481,27 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> >  		.flags		= IOMAP_ZERO,
> >  	};
> >  	int ret;
> > +	bool range_dirty;
> >  
> >  	/*
> >  	 * Zero range wants to skip pre-zeroed (i.e. unwritten) mappings, but
> >  	 * pagecache must be flushed to ensure stale data from previous
> > -	 * buffered writes is not exposed.
> > +	 * buffered writes is not exposed. A flush is only required for certain
> > +	 * types of mappings, but checking pagecache after mapping lookup is
> > +	 * racy with writeback and reclaim.
> > +	 *
> > +	 * Therefore, check the entire range first and pass along whether any
> > +	 * part of it is dirty. If so and an underlying mapping warrants it,
> > +	 * flush the cache at that point. This trades off the occasional false
> > +	 * positive (and spurious flush, if the dirty data and mapping don't
> > +	 * happen to overlap) for simplicity in handling a relatively uncommon
> > +	 * situation.
> >  	 */
> > -	ret = filemap_write_and_wait_range(inode->i_mapping,
> > -			pos, pos + len - 1);
> > -	if (ret)
> > -		return ret;
> > +	range_dirty = filemap_range_needs_writeback(inode->i_mapping,
> > +					pos, pos + len - 1);
> >  
> >  	while ((ret = iomap_iter(&iter, ops)) > 0)
> > -		iter.processed = iomap_zero_iter(&iter, did_zero);
> > +		iter.processed = iomap_zero_iter(&iter, did_zero, &range_dirty);
> 
> Style nit: Could we do this flush-and-stale from the loop body instead
> of passing pointers around?  e.g.
> 
> static inline bool iomap_zero_need_flush(const struct iomap_iter *i)
> {
> 	const struct iomap *srcmap = iomap_iter_srcmap(i);
> 
> 	return srcmap->type == IOMAP_HOLE ||
> 	       srcmap->type == IOMAP_UNWRITTEN;
> }
> 
> static inline int iomap_zero_iter_flush(struct iomap_iter *i)
> {
> 	struct address_space *mapping = i->inode->i_mapping;
> 	loff_t end = i->pos + i->len - 1;
> 
> 	i->iomap.flags |= IOMAP_F_STALE;
> 	return filemap_write_and_wait_range(mapping, i->pos, end);
> }
> 
> and then:
> 
> 	range_dirty = filemap_range_needs_writeback(...);
> 
> 	while ((ret = iomap_iter(&iter, ops)) > 0) {
> 		if (range_dirty && iomap_zero_need_flush(&iter)) {
> 			/*
> 			 * Zero range wants to skip pre-zeroed (i.e.
> 			 * unwritten) mappings, but...
> 			 */
> 			range_dirty = false;
> 			iter.processed = iomap_zero_iter_flush(&iter);
> 		} else {
> 			iter.processed = iomap_zero_iter(&iter, did_zero);
> 		}
> 	}
> 
> The logic looks correct and sensible. :)

Yeah, I think this is better.

However, the one thing that both versions have in common is that
they don't explain -why- the iomap needs to be marked stale.
So, something like:

"When we flush the dirty data over the range, the extent state for
the range will change. We need to know that new state before
performing any zeroing operations on the range.  Hence we mark the
iomap stale so that the iterator will remap this range and the next
iteration pass will see the new extent state and perform the correct
zeroing operation for the range."

-Dave.

-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v2 1/2] iomap: fix handling of dirty folios over unwritten extents
  2024-08-28 22:22   ` Darrick J. Wong
@ 2024-08-29  5:43     ` Christoph Hellwig
  0 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2024-08-29  5:43 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-fsdevel, linux-xfs, josef, david

On Wed, Aug 28, 2024 at 03:22:22PM -0700, Darrick J. Wong wrote:
> I wonder why gfs2 (aka the other iomap_zero_range user) doesn't have a
> truncate-down flush hammer, but maybe it doesn't support unwritten
> extents?  I didn't find anything obvious when I looked, so

gfs2 does not support unwritten extents.



* Re: [PATCH v2 2/2] iomap: make zero range flush conditional on unwritten mappings
  2024-08-28 22:44   ` Darrick J. Wong
  2024-08-29  0:26     ` Dave Chinner
@ 2024-08-29 15:03     ` Brian Foster
  2024-08-29 17:29       ` Brian Foster
  2024-08-29 21:34       ` Darrick J. Wong
  1 sibling, 2 replies; 13+ messages in thread
From: Brian Foster @ 2024-08-29 15:03 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs, josef, david

On Wed, Aug 28, 2024 at 03:44:20PM -0700, Darrick J. Wong wrote:
> On Wed, Aug 28, 2024 at 02:19:11PM -0400, Brian Foster wrote:
> > iomap_zero_range() flushes pagecache to mitigate consistency
> > problems with dirty pagecache and unwritten mappings. The flush is
> > unconditional over the entire range because checking pagecache state
> > after mapping lookup is racy with writeback and reclaim. There are
> > ways around this using iomap's mapping revalidation mechanism, but
> > this is not supported by all iomap based filesystems and so is not a
> > generic solution.
> > 
> > There is another way around this limitation that is good enough to
> > filter the flush for most cases in practice. If we check for dirty
> > pagecache over the target range (instead of flushing unconditionally),
> > we can keep track of whether the range was dirty before lookup and
> > defer the flush until/unless we see a combination of dirty cache
> > backed by an unwritten mapping. We don't necessarily know whether
> > the dirty cache was backed by the unwritten mapping or some other
> > (written) part of the range, but the implication of a false
> > positive here is a spurious flush and thus relatively harmless.
> > 
> > Note that we also flush for hole mappings because iomap_zero_range()
> > is used for partial folio zeroing in some cases. For example, if a
> > folio straddles EOF on a sub-page FSB size fs, the post-eof portion
> > is hole-backed and dirtied via a mapped write, and i_size then
> > increases before writeback can occur (writeback otherwise zeroes
> > the post-eof portion of the EOF folio), the folio becomes
> > inconsistent with disk until it is reclaimed. A flush in this case
> > executes the partial zeroing from writeback, and iomap knows that
> > there is otherwise no I/O to submit for hole-backed mappings.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/iomap/buffered-io.c | 57 +++++++++++++++++++++++++++++++++++-------
> >  1 file changed, 48 insertions(+), 9 deletions(-)
> > 
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index 3e846f43ff48..a6e897e6e303 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -1393,16 +1393,47 @@ iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
> >  }
> >  EXPORT_SYMBOL_GPL(iomap_file_unshare);
> >  
> > -static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
> > +/*
> > + * Flush the remaining range of the iter and mark the current mapping stale.
> > + * This is used when zero range sees an unwritten mapping that may have had
> > + * dirty pagecache over it.
> > + */
> > +static inline int iomap_zero_iter_flush_and_stale(struct iomap_iter *i)
> > +{
> > +	struct address_space *mapping = i->inode->i_mapping;
> > +	loff_t end = i->pos + i->len - 1;
> > +
> > +	i->iomap.flags |= IOMAP_F_STALE;
> > +	return filemap_write_and_wait_range(mapping, i->pos, end);
> > +}
> > +
> > +static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
> > +		bool *range_dirty)
> >  {
> >  	const struct iomap *srcmap = iomap_iter_srcmap(iter);
> >  	loff_t pos = iter->pos;
> >  	loff_t length = iomap_length(iter);
> >  	loff_t written = 0;
> >  
> > -	/* already zeroed?  we're done. */
> > -	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> > +	/*
> > +	 * We can skip pre-zeroed mappings so long as either the mapping was
> > +	 * clean before we started or we've flushed at least once since.
> > +	 * Otherwise we don't know whether the current mapping had dirty
> > +	 * pagecache, so flush it now, stale the current mapping, and proceed
> > +	 * from there.
> > +	 *
> > +	 * The hole case is intentionally included because this is (ab)used to
> > +	 * handle partial folio zeroing in some cases. Hole backed post-eof
> > +	 * ranges can be dirtied via mapped write and the flush triggers
> > +	 * writeback time post-eof zeroing.
> > +	 */
> > +	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) {
> > +		if (*range_dirty) {
> > +			*range_dirty = false;
> > +			return iomap_zero_iter_flush_and_stale(iter);
> > +		}
> >  		return length;
> > +	}
> >  
> >  	do {
> >  		struct folio *folio;
> > @@ -1450,19 +1481,27 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> >  		.flags		= IOMAP_ZERO,
> >  	};
> >  	int ret;
> > +	bool range_dirty;
> >  
> >  	/*
> >  	 * Zero range wants to skip pre-zeroed (i.e. unwritten) mappings, but
> >  	 * pagecache must be flushed to ensure stale data from previous
> > -	 * buffered writes is not exposed.
> > +	 * buffered writes is not exposed. A flush is only required for certain
> > +	 * types of mappings, but checking pagecache after mapping lookup is
> > +	 * racy with writeback and reclaim.
> > +	 *
> > +	 * Therefore, check the entire range first and pass along whether any
> > +	 * part of it is dirty. If so and an underlying mapping warrants it,
> > +	 * flush the cache at that point. This trades off the occasional false
> > +	 * positive (and spurious flush, if the dirty data and mapping don't
> > +	 * happen to overlap) for simplicity in handling a relatively uncommon
> > +	 * situation.
> >  	 */
> > -	ret = filemap_write_and_wait_range(inode->i_mapping,
> > -			pos, pos + len - 1);
> > -	if (ret)
> > -		return ret;
> > +	range_dirty = filemap_range_needs_writeback(inode->i_mapping,
> > +					pos, pos + len - 1);
> >  
> >  	while ((ret = iomap_iter(&iter, ops)) > 0)
> > -		iter.processed = iomap_zero_iter(&iter, did_zero);
> > +		iter.processed = iomap_zero_iter(&iter, did_zero, &range_dirty);
> 
> Style nit: Could we do this flush-and-stale from the loop body instead
> of passing pointers around?  e.g.
> 

So FWIW, I had multiple other variations of this that used an
IOMAP_DIRTY_CACHE flag on the iomap to track dirty pagecache for
arbitrary operations. The flag could be set and cleared at the
appropriate points as expected (for ops that care).

To me, that's how I'd prefer to avoid just passing a pointer, but I
intentionally factored that out to avoid using a flag for something that
(for now) could be simplified to a local variable. OTOH, it is something
that might be useful for the iomap seek data/hole implementations down
the road.

I've played with that a bit, but also have been trying to avoid getting
too much into that rabbit hole for zero range. My thought was I'd
reintroduce it and replace the range_dirty thing if/when it proved
useful for multiple operations.
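
Roughly, the idea was along these lines (hypothetical sketch;
IOMAP_F_DIRTY_CACHE is not an existing flag and this is not the
posted code):

	/*
	 * In the iomap_iter() path: sample pagecache state before the
	 * ->iomap_begin() lookup, then tag the mapping it returns.
	 */
	dirty = filemap_range_needs_writeback(iter->inode->i_mapping,
			iter->pos, iter->pos + iter->len - 1);
	ret = ops->iomap_begin(...);
	if (!ret && dirty)
		iter->iomap.flags |= IOMAP_F_DIRTY_CACHE;

	/* ... and in iomap_zero_iter(), over a hole/unwritten mapping */
	if (iter->iomap.flags & IOMAP_F_DIRTY_CACHE)
		return iomap_zero_iter_flush_and_stale(iter);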

> static inline bool iomap_zero_need_flush(const struct iomap_iter *i)
> {
> 	const struct iomap *srcmap = iomap_iter_srcmap(i);
> 
> 	return srcmap->type == IOMAP_HOLE ||
> 	       srcmap->type == IOMAP_UNWRITTEN;
> }

The factoring looks mostly reasonable, but a couple of things bug me
that I'd like to see if we can resolve.

One is that this doesn't really indicate whether a flush is needed,
because the dirty cache state is a critical part of that logic. I
suppose we could rename it (to what?), but it also seems a little odd to
have a helper just for mapping type checks.

> 
> static inline int iomap_zero_iter_flush(struct iomap_iter *i)
> {
> 	struct address_space *mapping = i->inode->i_mapping;
> 	loff_t end = i->pos + i->len - 1;
> 
> 	i->iomap.flags |= IOMAP_F_STALE;
> 	return filemap_write_and_wait_range(mapping, i->pos, end);
> }
> 
> and then:
> 
> 	range_dirty = filemap_range_needs_writeback(...);
> 
> 	while ((ret = iomap_iter(&iter, ops)) > 0) {
> 		if (range_dirty && iomap_zero_need_flush(&iter)) {
> 			/*
> 			 * Zero range wants to skip pre-zeroed (i.e.
> 			 * unwritten) mappings, but...
> 			 */
> 			range_dirty = false;
> 			iter.processed = iomap_zero_iter_flush(&iter);
> 		} else {
> 			iter.processed = iomap_zero_iter(&iter, did_zero);
> 		}

The other is that the optimization logic is now split across multiple
functions. I.e., iomap_zero_iter() has a landmine if ever called without
doing the flush_and_stale() part first (a consideration if
truncate_page() were ever open coded, for example).

I wonder if a compromise might be to factor out the whole optimization
into a separate helper rather than just the flush part (first via a prep
patch), then the higher level loop ends up looking almost the same:

	while ((ret = iomap_iter(&iter, ops)) > 0) {
		/* special handling for already zeroed mappings */
		if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
			iter.processed = iomap_zero_mapping_iter(&iter, &range_dirty);
		else
			iter.processed = iomap_zero_iter(&iter, did_zero);
	}

That doesn't avoid passing the range_dirty pointer, but we just end up
passing that instead of did_zero. Also as noted above, it could still be
made to go away if the range_dirty check gets pushed down into the
iomap_iter() path for more general use.

Anyways those are just my thoughts. I'm of the mind that whatever
factoring we do here may have to change if Dave's batched folio
lookup/iteration idea pans out for fs' with validation support, so at
the end of the day I'll change this to look exactly like you wrote it if
it means the zeroing problem gets fixed. Thoughts or preference?

Brian

> 	}
> 
> The logic looks correct and sensible. :)
> 
> --D
> 
> >  	return ret;
> >  }
> >  EXPORT_SYMBOL_GPL(iomap_zero_range);
> > -- 
> > 2.45.0
> > 
> > 
> 



* Re: [PATCH v2 2/2] iomap: make zero range flush conditional on unwritten mappings
  2024-08-29  0:26     ` Dave Chinner
@ 2024-08-29 15:04       ` Brian Foster
  0 siblings, 0 replies; 13+ messages in thread
From: Brian Foster @ 2024-08-29 15:04 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-fsdevel, linux-xfs, josef

On Thu, Aug 29, 2024 at 10:26:06AM +1000, Dave Chinner wrote:
> On Wed, Aug 28, 2024 at 03:44:20PM -0700, Darrick J. Wong wrote:
> > On Wed, Aug 28, 2024 at 02:19:11PM -0400, Brian Foster wrote:
> > > @@ -1450,19 +1481,27 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> > >  		.flags		= IOMAP_ZERO,
> > >  	};
> > >  	int ret;
> > > +	bool range_dirty;
> > >  
> > >  	/*
> > >  	 * Zero range wants to skip pre-zeroed (i.e. unwritten) mappings, but
> > >  	 * pagecache must be flushed to ensure stale data from previous
> > > -	 * buffered writes is not exposed.
> > > +	 * buffered writes is not exposed. A flush is only required for certain
> > > +	 * types of mappings, but checking pagecache after mapping lookup is
> > > +	 * racy with writeback and reclaim.
> > > +	 *
> > > +	 * Therefore, check the entire range first and pass along whether any
> > > +	 * part of it is dirty. If so and an underlying mapping warrants it,
> > > +	 * flush the cache at that point. This trades off the occasional false
> > > +	 * positive (and spurious flush, if the dirty data and mapping don't
> > > +	 * happen to overlap) for simplicity in handling a relatively uncommon
> > > +	 * situation.
> > >  	 */
> > > -	ret = filemap_write_and_wait_range(inode->i_mapping,
> > > -			pos, pos + len - 1);
> > > -	if (ret)
> > > -		return ret;
> > > +	range_dirty = filemap_range_needs_writeback(inode->i_mapping,
> > > +					pos, pos + len - 1);
> > >  
> > >  	while ((ret = iomap_iter(&iter, ops)) > 0)
> > > -		iter.processed = iomap_zero_iter(&iter, did_zero);
> > > +		iter.processed = iomap_zero_iter(&iter, did_zero, &range_dirty);
> > 
> > Style nit: Could we do this flush-and-stale from the loop body instead
> > of passing pointers around?  e.g.
> > 
> > static inline bool iomap_zero_need_flush(const struct iomap_iter *i)
> > {
> > 	const struct iomap *srcmap = iomap_iter_srcmap(i);
> > 
> > 	return srcmap->type == IOMAP_HOLE ||
> > 	       srcmap->type == IOMAP_UNWRITTEN;
> > }
> > 
> > static inline int iomap_zero_iter_flush(struct iomap_iter *i)
> > {
> > 	struct address_space *mapping = i->inode->i_mapping;
> > 	loff_t end = i->pos + i->len - 1;
> > 
> > 	i->iomap.flags |= IOMAP_F_STALE;
> > 	return filemap_write_and_wait_range(mapping, i->pos, end);
> > }
> > 
> > and then:
> > 
> > 	range_dirty = filemap_range_needs_writeback(...);
> > 
> > 	while ((ret = iomap_iter(&iter, ops)) > 0) {
> > 		if (range_dirty && iomap_zero_need_flush(&iter)) {
> > 			/*
> > 			 * Zero range wants to skip pre-zeroed (i.e.
> > 			 * unwritten) mappings, but...
> > 			 */
> > 			range_dirty = false;
> > 			iter.processed = iomap_zero_iter_flush(&iter);
> > 		} else {
> > 			iter.processed = iomap_zero_iter(&iter, did_zero);
> > 		}
> > 	}
> > 
> > The logic looks correct and sensible. :)
> 
> Yeah, I think this is better.
> 
> However, the one thing that both versions have in common is that
> they don't explain -why- the iomap needs to be marked stale.
> So, something like:
> 
> "When we flush the dirty data over the range, the extent state for
> the range will change. We need to know that new state before
> performing any zeroing operations on the range.  Hence we mark the
> iomap stale so that the iterator will remap this range and the next
> iteration pass will see the new extent state and perform the correct
> zeroing operation for the range."
> 

Sure, I'll update the comments however the factoring ultimately turns
out. Thanks.

Brian

> -Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com
> 



* Re: [PATCH v2 2/2] iomap: make zero range flush conditional on unwritten mappings
  2024-08-29 15:03     ` Brian Foster
@ 2024-08-29 17:29       ` Brian Foster
  2024-08-29 21:34       ` Darrick J. Wong
  1 sibling, 0 replies; 13+ messages in thread
From: Brian Foster @ 2024-08-29 17:29 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs, josef, david

On Thu, Aug 29, 2024 at 11:03:57AM -0400, Brian Foster wrote:
> On Wed, Aug 28, 2024 at 03:44:20PM -0700, Darrick J. Wong wrote:
> > On Wed, Aug 28, 2024 at 02:19:11PM -0400, Brian Foster wrote:
> > > iomap_zero_range() flushes pagecache to mitigate consistency
> > > problems with dirty pagecache and unwritten mappings. The flush is
> > > unconditional over the entire range because checking pagecache state
> > > after mapping lookup is racy with writeback and reclaim. There are
> > > ways around this using iomap's mapping revalidation mechanism, but
> > > this is not supported by all iomap based filesystems and so is not a
> > > generic solution.
> > > 
> > > There is another way around this limitation that is good enough to
> > > filter the flush for most cases in practice. If we check for dirty
> > > pagecache over the target range (instead of flushing unconditionally),
> > > we can keep track of whether the range was dirty before lookup and
> > > defer the flush until/unless we see a combination of dirty cache
> > > backed by an unwritten mapping. We don't necessarily know whether
> > > the dirty cache was backed by the unwritten mapping or some other
> > > (written) part of the range, but the implication of a false
> > > positive here is a spurious flush and thus relatively harmless.
> > > 
> > > Note that we also flush for hole mappings because iomap_zero_range()
> > > is used for partial folio zeroing in some cases. For example, if a
> > > folio straddles EOF on a sub-page FSB size fs, the post-eof portion
> > > is hole-backed and dirtied via a mapped write, and i_size then
> > > increases before writeback can occur (writeback otherwise zeroes
> > > the post-eof portion of the EOF folio), the folio becomes
> > > inconsistent with disk until it is reclaimed. A flush in this case
> > > executes the partial zeroing from writeback, and iomap knows that
> > > there is otherwise no I/O to submit for hole-backed mappings.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >  fs/iomap/buffered-io.c | 57 +++++++++++++++++++++++++++++++++++-------
> > >  1 file changed, 48 insertions(+), 9 deletions(-)
> > > 
> > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > index 3e846f43ff48..a6e897e6e303 100644
> > > --- a/fs/iomap/buffered-io.c
> > > +++ b/fs/iomap/buffered-io.c
> > > @@ -1393,16 +1393,47 @@ iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
> > >  }
> > >  EXPORT_SYMBOL_GPL(iomap_file_unshare);
> > >  
> > > -static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
> > > +/*
> > > + * Flush the remaining range of the iter and mark the current mapping stale.
> > > + * This is used when zero range sees an unwritten mapping that may have had
> > > + * dirty pagecache over it.
> > > + */
> > > +static inline int iomap_zero_iter_flush_and_stale(struct iomap_iter *i)
> > > +{
> > > +	struct address_space *mapping = i->inode->i_mapping;
> > > +	loff_t end = i->pos + i->len - 1;
> > > +
> > > +	i->iomap.flags |= IOMAP_F_STALE;
> > > +	return filemap_write_and_wait_range(mapping, i->pos, end);
> > > +}
> > > +
> > > +static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
> > > +		bool *range_dirty)
> > >  {
> > >  	const struct iomap *srcmap = iomap_iter_srcmap(iter);
> > >  	loff_t pos = iter->pos;
> > >  	loff_t length = iomap_length(iter);
> > >  	loff_t written = 0;
> > >  
> > > -	/* already zeroed?  we're done. */
> > > -	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> > > +	/*
> > > +	 * We can skip pre-zeroed mappings so long as either the mapping was
> > > +	 * clean before we started or we've flushed at least once since.
> > > +	 * Otherwise we don't know whether the current mapping had dirty
> > > +	 * pagecache, so flush it now, stale the current mapping, and proceed
> > > +	 * from there.
> > > +	 *
> > > +	 * The hole case is intentionally included because this is (ab)used to
> > > +	 * handle partial folio zeroing in some cases. Hole backed post-eof
> > > +	 * ranges can be dirtied via mapped write and the flush triggers
> > > +	 * writeback time post-eof zeroing.
> > > +	 */
> > > +	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) {
> > > +		if (*range_dirty) {
> > > +			*range_dirty = false;
> > > +			return iomap_zero_iter_flush_and_stale(iter);
> > > +		}
> > >  		return length;
> > > +	}
> > >  
> > >  	do {
> > >  		struct folio *folio;
> > > @@ -1450,19 +1481,27 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> > >  		.flags		= IOMAP_ZERO,
> > >  	};
> > >  	int ret;
> > > +	bool range_dirty;
> > >  
> > >  	/*
> > >  	 * Zero range wants to skip pre-zeroed (i.e. unwritten) mappings, but
> > >  	 * pagecache must be flushed to ensure stale data from previous
> > > -	 * buffered writes is not exposed.
> > > +	 * buffered writes is not exposed. A flush is only required for certain
> > > +	 * types of mappings, but checking pagecache after mapping lookup is
> > > +	 * racy with writeback and reclaim.
> > > +	 *
> > > +	 * Therefore, check the entire range first and pass along whether any
> > > +	 * part of it is dirty. If so and an underlying mapping warrants it,
> > > +	 * flush the cache at that point. This trades off the occasional false
> > > +	 * positive (and spurious flush, if the dirty data and mapping don't
> > > +	 * happen to overlap) for simplicity in handling a relatively uncommon
> > > +	 * situation.
> > >  	 */
> > > -	ret = filemap_write_and_wait_range(inode->i_mapping,
> > > -			pos, pos + len - 1);
> > > -	if (ret)
> > > -		return ret;
> > > +	range_dirty = filemap_range_needs_writeback(inode->i_mapping,
> > > +					pos, pos + len - 1);
> > >  
> > >  	while ((ret = iomap_iter(&iter, ops)) > 0)
> > > -		iter.processed = iomap_zero_iter(&iter, did_zero);
> > > +		iter.processed = iomap_zero_iter(&iter, did_zero, &range_dirty);
> > 
> > Style nit: Could we do this flush-and-stale from the loop body instead
> > of passing pointers around?  e.g.
> > 
> 
> So FWIW, I had multiple other variations of this that used an
> IOMAP_DIRTY_CACHE flag on the iomap to track dirty pagecache for
> arbitrary operations. The flag could be set and cleared at the
> appropriate points as expected (for ops that care).
> 
> To me, that's how I'd prefer to avoid just passing a pointer, but I
> intentionally factored that out to avoid using a flag for something that
> (for now) could be simplified to a local variable. OTOH, it is something
> that might be useful for the iomap seek data/hole implementations down
> the road.
> 
> I've played with that a bit, but also have been trying to avoid getting
> too much into that rabbit hole for zero range. My thought was I'd
> reintroduce it and replace the range_dirty thing if/when it proved
> useful for multiple operations.
> 
> > static inline bool iomap_zero_need_flush(const struct iomap_iter *i)
> > {
> > 	const struct iomap *srcmap = iomap_iter_srcmap(i);
> > 
> > 	return srcmap->type == IOMAP_HOLE ||
> > 	       srcmap->type == IOMAP_UNWRITTEN;
> > }
> 
> The factoring looks mostly reasonable, but a couple things bug me that
> I'd like to see if we can resolve..
> 
> One is that this doesn't really indicate whether a flush is needed,
> because the dirty cache state is a critical part of that logic. I
> suppose we could rename it (to what?), but it also seems a little odd to
> have a helper just for mapping type checks.
> 
> > 
> > static inline int iomap_zero_iter_flush(struct iomap_iter *i)
> > {
> > 	struct address_space *mapping = i->inode->i_mapping;
> > 	loff_t end = i->pos + i->len - 1;
> > 
> > 	i->iomap.flags |= IOMAP_F_STALE;
> > 	return filemap_write_and_wait_range(mapping, i->pos, end);
> > }
> > 
> > and then:
> > 
> > 	range_dirty = filemap_range_needs_writeback(...);
> > 
> > 	while ((ret = iomap_iter(&iter, ops)) > 0) {
> > 		if (range_dirty && iomap_zero_need_flush(&iter)) {
> > 			/*
> > 			 * Zero range wants to skip pre-zeroed (i.e.
> > 			 * unwritten) mappings, but...
> > 			 */
> > 			range_dirty = false;
> > 			iter.processed = iomap_zero_iter_flush(&iter);
> > 		} else {
> > 			iter.processed = iomap_zero_iter(&iter, did_zero);
> > 		}
> 
> The other is that the optimization logic is now split across multiple
> functions. I.e., iomap_zero_iter() has a landmine if ever called without
> doing the flush_and_stale() part first (a consideration if
> truncate_page() were ever open coded, for example).
> 
> I wonder if a compromise might be to factor out the whole optimization
> into a separate helper rather than just the flush part (first via a prep
> patch), then the higher level loop ends up looking almost the same:
> 
> 	while ((ret = iomap_iter(&iter, ops)) > 0) {
> 		/* special handling for already zeroed mappings */
> 		if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> 			iter.processed = iomap_zero_mapping_iter(&iter, &range_dirty);
> 		else
> 			iter.processed = iomap_zero_iter(&iter, did_zero);
> 		}
> 
> That doesn't avoid passing the range_dirty pointer, but we just end up
> passing that instead of did_zero. Also as noted above, it could still be
> made to go away if the range_dirty check gets pushed down into the
> iomap_iter() path for more general use.
> 

FWIW, here's another variation of iomap_zero_range() that seems a bit
closer to yours:

	range_dirty = filemap_range_needs_writeback(inode->i_mapping,
					pos, pos + len - 1);

	while ((ret = iomap_iter(&iter, ops)) > 0) {
		const struct iomap *srcmap = iomap_iter_srcmap(&iter);

		if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) {
			iter.processed = iomap_length(&iter);
			if (range_dirty) {
				range_dirty = false;
				iter.processed = iomap_zero_iter_flush_and_stale(&iter);
			}
			continue;
		}

		iter.processed = iomap_zero_iter(&iter, did_zero);
	}

This avoids passing around range_dirty, but keeps the optimization logic
together. Hm?

Brian

> Anyways those are just my thoughts. I'm of the mind that whatever
> factoring we do here may have to change if Dave's batched folio
> lookup/iteration idea pans out for fs' with validation support, so at
> the end of the day I'll change this to look exactly like you wrote it if
> it means the zeroing problem gets fixed. Thoughts or preference?
> 
> Brian
> 
> > 	}
> > 
> > The logic looks correct and sensible. :)
> > 
> > --D
> > 
> > >  	return ret;
> > >  }
> > >  EXPORT_SYMBOL_GPL(iomap_zero_range);
> > > -- 
> > > 2.45.0
> > > 
> > > 
> > 
> 
> 



* Re: [PATCH v2 2/2] iomap: make zero range flush conditional on unwritten mappings
  2024-08-29 15:03     ` Brian Foster
  2024-08-29 17:29       ` Brian Foster
@ 2024-08-29 21:34       ` Darrick J. Wong
  2024-08-30 11:58         ` Brian Foster
  1 sibling, 1 reply; 13+ messages in thread
From: Darrick J. Wong @ 2024-08-29 21:34 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs, josef, david

On Thu, Aug 29, 2024 at 11:03:57AM -0400, Brian Foster wrote:
> On Wed, Aug 28, 2024 at 03:44:20PM -0700, Darrick J. Wong wrote:
> > On Wed, Aug 28, 2024 at 02:19:11PM -0400, Brian Foster wrote:
> > > iomap_zero_range() flushes pagecache to mitigate consistency
> > > problems with dirty pagecache and unwritten mappings. The flush is
> > > unconditional over the entire range because checking pagecache state
> > > after mapping lookup is racy with writeback and reclaim. There are
> > > ways around this using iomap's mapping revalidation mechanism, but
> > > this is not supported by all iomap based filesystems and so is not a
> > > generic solution.
> > > 
> > > There is another way around this limitation that is good enough to
> > > filter the flush for most cases in practice. If we check for dirty
> > > pagecache over the target range (instead of flushing unconditionally),
> > > we can keep track of whether the range was dirty before lookup and
> > > defer the flush until/unless we see a combination of dirty cache
> > > backed by an unwritten mapping. We don't necessarily know whether
> > > the dirty cache was backed by the unwritten mapping or some other
> > > (written) part of the range, but the implication of a false
> > > positive here is a spurious flush and thus relatively harmless.
> > > 
> > > Note that we also flush for hole mappings because iomap_zero_range()
> > > is used for partial folio zeroing in some cases. For example, if a
> > > folio straddles EOF on a sub-page FSB size fs, the post-eof portion
> > > is hole-backed and dirtied via a mapped write, and i_size then
> > > increases before writeback can occur (writeback otherwise zeroes
> > > the post-eof portion of the EOF folio), the folio becomes
> > > inconsistent with disk until it is reclaimed. A flush in this case
> > > executes the partial zeroing from writeback, and iomap knows that
> > > there is otherwise no I/O to submit for hole-backed mappings.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >  fs/iomap/buffered-io.c | 57 +++++++++++++++++++++++++++++++++++-------
> > >  1 file changed, 48 insertions(+), 9 deletions(-)
> > > 
> > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > index 3e846f43ff48..a6e897e6e303 100644
> > > --- a/fs/iomap/buffered-io.c
> > > +++ b/fs/iomap/buffered-io.c
> > > @@ -1393,16 +1393,47 @@ iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
> > >  }
> > >  EXPORT_SYMBOL_GPL(iomap_file_unshare);
> > >  
> > > -static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
> > > +/*
> > > + * Flush the remaining range of the iter and mark the current mapping stale.
> > > + * This is used when zero range sees an unwritten mapping that may have had
> > > + * dirty pagecache over it.
> > > + */
> > > +static inline int iomap_zero_iter_flush_and_stale(struct iomap_iter *i)
> > > +{
> > > +	struct address_space *mapping = i->inode->i_mapping;
> > > +	loff_t end = i->pos + i->len - 1;
> > > +
> > > +	i->iomap.flags |= IOMAP_F_STALE;
> > > +	return filemap_write_and_wait_range(mapping, i->pos, end);
> > > +}
> > > +
> > > +static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
> > > +		bool *range_dirty)
> > >  {
> > >  	const struct iomap *srcmap = iomap_iter_srcmap(iter);
> > >  	loff_t pos = iter->pos;
> > >  	loff_t length = iomap_length(iter);
> > >  	loff_t written = 0;
> > >  
> > > -	/* already zeroed?  we're done. */
> > > -	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> > > +	/*
> > > +	 * We can skip pre-zeroed mappings so long as either the mapping was
> > > +	 * clean before we started or we've flushed at least once since.
> > > +	 * Otherwise we don't know whether the current mapping had dirty
> > > +	 * pagecache, so flush it now, stale the current mapping, and proceed
> > > +	 * from there.
> > > +	 *
> > > +	 * The hole case is intentionally included because this is (ab)used to
> > > +	 * handle partial folio zeroing in some cases. Hole backed post-eof
> > > +	 * ranges can be dirtied via mapped write and the flush triggers
> > > +	 * writeback time post-eof zeroing.
> > > +	 */
> > > +	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) {
> > > +		if (*range_dirty) {
> > > +			*range_dirty = false;
> > > +			return iomap_zero_iter_flush_and_stale(iter);
> > > +		}
> > >  		return length;
> > > +	}
> > >  
> > >  	do {
> > >  		struct folio *folio;
> > > @@ -1450,19 +1481,27 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> > >  		.flags		= IOMAP_ZERO,
> > >  	};
> > >  	int ret;
> > > +	bool range_dirty;
> > >  
> > >  	/*
> > >  	 * Zero range wants to skip pre-zeroed (i.e. unwritten) mappings, but
> > >  	 * pagecache must be flushed to ensure stale data from previous
> > > -	 * buffered writes is not exposed.
> > > +	 * buffered writes is not exposed. A flush is only required for certain
> > > +	 * types of mappings, but checking pagecache after mapping lookup is
> > > +	 * racy with writeback and reclaim.
> > > +	 *
> > > +	 * Therefore, check the entire range first and pass along whether any
> > > +	 * part of it is dirty. If so and an underlying mapping warrants it,
> > > +	 * flush the cache at that point. This trades off the occasional false
> > > +	 * positive (and spurious flush, if the dirty data and mapping don't
> > > +	 * happen to overlap) for simplicity in handling a relatively uncommon
> > > +	 * situation.
> > >  	 */
> > > -	ret = filemap_write_and_wait_range(inode->i_mapping,
> > > -			pos, pos + len - 1);
> > > -	if (ret)
> > > -		return ret;
> > > +	range_dirty = filemap_range_needs_writeback(inode->i_mapping,
> > > +					pos, pos + len - 1);
> > >  
> > >  	while ((ret = iomap_iter(&iter, ops)) > 0)
> > > -		iter.processed = iomap_zero_iter(&iter, did_zero);
> > > +		iter.processed = iomap_zero_iter(&iter, did_zero, &range_dirty);
> > 
> > Style nit: Could we do this flush-and-stale from the loop body instead
> > of passing pointers around?  e.g.
> > 
> 
> So FWIW, I had multiple other variations of this that used an
> IOMAP_DIRTY_CACHE flag on the iomap to track dirty pagecache for
> arbitrary operations. The flag could be set and cleared at the
> appropriate points as expected (for ops that care).
> 
> To me, that's how I'd prefer to avoid just passing a pointer, but I
> intentionally factored that out to avoid using a flag for something that
> (for now) could be simplified to a local variable. OTOH, it is something
> that might be useful for the iomap seek data/hole implementations down
> the road.

<nod> We can always adjust again when we get there; for now a local
variable sounds fine.
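
(If we do eventually grow an iomap flag for this, I'd imagine something
vaguely like the following hypothetical sketch -- made-up flag bit and
helper name, nothing proposed for this series:

	/* hypothetical: record pre-lookup pagecache state on the iomap */
	#define IOMAP_F_DIRTY_CACHE	(1U << 14)	/* made-up bit */

	static void iomap_iter_mark_dirty_cache(struct iomap_iter *iter)
	{
		if (filemap_range_needs_writeback(iter->inode->i_mapping,
				iter->pos, iter->pos + iter->len - 1))
			iter->iomap.flags |= IOMAP_F_DIRTY_CACHE;
	}

but as you say, that can wait until another operation needs it.)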

> I've played with that a bit, but also have been trying to avoid getting
> too much into that rabbit hole for zero range. My thought was I'd
> reintroduce it and replace the range_dirty thing if/when it proved
> useful for multiple operations.
> 
> > static inline bool iomap_zero_need_flush(const struct iomap_iter *i)
> > {
> > 	const struct iomap *srcmap = iomap_iter_srcmap(i);
> > 
> > 	return srcmap->type == IOMAP_HOLE ||
> > 	       srcmap->type == IOMAP_UNWRITTEN;
> > }
> 
> The factoring looks mostly reasonable, but a couple things bug me that
> I'd like to see if we can resolve..
> 
> One is that this doesn't really indicate whether a flush is needed,
> because the dirty cache state is a critical part of that logic. I
> suppose we could rename it (to what?), but it also seems a little odd to
> have a helper just for mapping type checks.

I thought about passing range_dirty into iomap_zero_need_flush since
it's a static inline function, but that just seemed unnecessary.
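
(i.e., something like this purely illustrative variant; the caller would
still have to clear range_dirty itself after the flush:

	static inline bool iomap_zero_need_flush(const struct iomap_iter *i,
			bool range_dirty)
	{
		const struct iomap *srcmap = iomap_iter_srcmap(i);

		return range_dirty &&
		       (srcmap->type == IOMAP_HOLE ||
			srcmap->type == IOMAP_UNWRITTEN);
	}

so it didn't really buy anything.)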

> > static inline int iomap_zero_iter_flush(struct iomap_iter *i)
> > {
> > 	struct address_space *mapping = i->inode->i_mapping;
> > 	loff_t end = i->pos + i->len - 1;
> > 
> > 	i->iomap.flags |= IOMAP_F_STALE;
> > 	return filemap_write_and_wait_range(mapping, i->pos, end);
> > }
> > 
> > and then:
> > 
> > 	range_dirty = filemap_range_needs_writeback(...);
> > 
> > 	while ((ret = iomap_iter(&iter, ops)) > 0) {
> > 		if (range_dirty && iomap_zero_need_flush(&iter)) {
> > 			/*
> > 			 * Zero range wants to skip pre-zeroed (i.e.
> > 			 * unwritten) mappings, but...
> > 			 */
> > 			range_dirty = false;
> > 			iter.processed = iomap_zero_iter_flush(&iter);
> > 		} else {
> > 			iter.processed = iomap_zero_iter(&iter, did_zero);
> > 		}
> 
> The other is that the optimization logic is now split across multiple
> functions. I.e., iomap_zero_iter() has a landmine if ever called without
> doing the flush_and_stale() part first (a consideration if
> iomap_truncate_page() were ever open coded, for example).

_zero_iter is a static function, let's hope nobody does that.  Though
you're right, experience tells me that someone will try this
eventually.

That said, I see the merit of having one complete loop body function
that knows how to handle all iomap types, since the others do that.

> I wonder if a compromise might be to factor out the whole optimization
> into a separate helper rather than just the flush part (first via a prep
> patch), then the higher level loop ends up looking almost the same:
> 
> 	while ((ret = iomap_iter(&iter, ops)) > 0) {
> 		/* special handling for already zeroed mappings */
> 		if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> 			iter.processed = iomap_zero_mapping_iter(&iter, &range_dirty);
> 		else
> 			iter.processed = iomap_zero_iter(&iter, did_zero);
> 		}
> 
> That doesn't avoid passing the range_dirty pointer, but we just end up
> passing that instead of did_zero. Also as noted above, it could still be
> made to go away if the range_dirty check gets pushed down into the
> iomap_iter() path for more general use.
> 
> Anyways those are just my thoughts. I'm of the mind that whatever
> factoring we do here may have to change if Dave's batched folio
> lookup/iteration idea pans out for fs' with validation support, so at
> the end of the day I'll change this to look exactly like you wrote it if
> it means the zeroing problem gets fixed. Thoughts or preference?

I'm ok with your original version now.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D


> Brian
> 
> > 	}
> > 
> > The logic looks correct and sensible. :)
> > 
> > --D
> > 
> > >  	return ret;
> > >  }
> > >  EXPORT_SYMBOL_GPL(iomap_zero_range);
> > > -- 
> > > 2.45.0
> > > 
> > > 
> > 
> 
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 2/2] iomap: make zero range flush conditional on unwritten mappings
  2024-08-29 21:34       ` Darrick J. Wong
@ 2024-08-30 11:58         ` Brian Foster
  0 siblings, 0 replies; 13+ messages in thread
From: Brian Foster @ 2024-08-30 11:58 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs, josef, david

On Thu, Aug 29, 2024 at 02:34:02PM -0700, Darrick J. Wong wrote:
> On Thu, Aug 29, 2024 at 11:03:57AM -0400, Brian Foster wrote:
> > On Wed, Aug 28, 2024 at 03:44:20PM -0700, Darrick J. Wong wrote:
> > > On Wed, Aug 28, 2024 at 02:19:11PM -0400, Brian Foster wrote:
> > > > iomap_zero_range() flushes pagecache to mitigate consistency
> > > > problems with dirty pagecache and unwritten mappings. The flush is
> > > > unconditional over the entire range because checking pagecache state
> > > > after mapping lookup is racy with writeback and reclaim. There are
> > > > ways around this using iomap's mapping revalidation mechanism, but
> > > > this is not supported by all iomap based filesystems and so is not a
> > > > generic solution.
> > > > 
> > > > There is another way around this limitation that is good enough to
> > > > filter the flush for most cases in practice. If we check for dirty
> > > > pagecache over the target range (instead of unconditionally flush),
> > > > we can keep track of whether the range was dirty before lookup and
> > > > defer the flush until/unless we see a combination of dirty cache
> > > > backed by an unwritten mapping. We don't necessarily know whether
> > > > the dirty cache was backed by the unwritten mapping or some other
> > > > (written) part of the range, but the implication of a false
> > > > positive here is a spurious flush and thus relatively harmless.
> > > > 
> > > > Note that we also flush for hole mappings because iomap_zero_range()
> > > > is used for partial folio zeroing in some cases. For example, if a
> > > > folio straddles EOF on a sub-page FSB size fs, the post-eof portion
> > > > is hole-backed and dirtied/written via mapped write, and then i_size
> > > > increases before writeback can occur (which otherwise zeroes the
> > > > post-eof portion of the EOF folio), then the folio becomes
> > > > inconsistent with disk until reclaimed. A flush in this case
> > > > executes partial zeroing from writeback, and iomap knows that there
> > > > is otherwise no I/O to submit for hole backed mappings.
> > > > 
> > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > ---
> > > >  fs/iomap/buffered-io.c | 57 +++++++++++++++++++++++++++++++++++-------
> > > >  1 file changed, 48 insertions(+), 9 deletions(-)
> > > > 
> > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > > index 3e846f43ff48..a6e897e6e303 100644
> > > > --- a/fs/iomap/buffered-io.c
> > > > +++ b/fs/iomap/buffered-io.c
> > > > @@ -1393,16 +1393,47 @@ iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(iomap_file_unshare);
> > > >  
> > > > -static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
> > > > +/*
> > > > + * Flush the remaining range of the iter and mark the current mapping stale.
> > > > + * This is used when zero range sees an unwritten mapping that may have had
> > > > + * dirty pagecache over it.
> > > > + */
> > > > +static inline int iomap_zero_iter_flush_and_stale(struct iomap_iter *i)
> > > > +{
> > > > +	struct address_space *mapping = i->inode->i_mapping;
> > > > +	loff_t end = i->pos + i->len - 1;
> > > > +
> > > > +	i->iomap.flags |= IOMAP_F_STALE;
> > > > +	return filemap_write_and_wait_range(mapping, i->pos, end);
> > > > +}
> > > > +
> > > > +static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
> > > > +		bool *range_dirty)
> > > >  {
> > > >  	const struct iomap *srcmap = iomap_iter_srcmap(iter);
> > > >  	loff_t pos = iter->pos;
> > > >  	loff_t length = iomap_length(iter);
> > > >  	loff_t written = 0;
> > > >  
> > > > -	/* already zeroed?  we're done. */
> > > > -	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> > > > +	/*
> > > > +	 * We can skip pre-zeroed mappings so long as either the mapping was
> > > > +	 * clean before we started or we've flushed at least once since.
> > > > +	 * Otherwise we don't know whether the current mapping had dirty
> > > > +	 * pagecache, so flush it now, stale the current mapping, and proceed
> > > > +	 * from there.
> > > > +	 *
> > > > +	 * The hole case is intentionally included because this is (ab)used to
> > > > +	 * handle partial folio zeroing in some cases. Hole backed post-eof
> > > > +	 * ranges can be dirtied via mapped write and the flush triggers
> > > > +	 * writeback time post-eof zeroing.
> > > > +	 */
> > > > +	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) {
> > > > +		if (*range_dirty) {
> > > > +			*range_dirty = false;
> > > > +			return iomap_zero_iter_flush_and_stale(iter);
> > > > +		}
> > > >  		return length;
> > > > +	}
> > > >  
> > > >  	do {
> > > >  		struct folio *folio;
> > > > @@ -1450,19 +1481,27 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> > > >  		.flags		= IOMAP_ZERO,
> > > >  	};
> > > >  	int ret;
> > > > +	bool range_dirty;
> > > >  
> > > >  	/*
> > > >  	 * Zero range wants to skip pre-zeroed (i.e. unwritten) mappings, but
> > > >  	 * pagecache must be flushed to ensure stale data from previous
> > > > -	 * buffered writes is not exposed.
> > > > +	 * buffered writes is not exposed. A flush is only required for certain
> > > > +	 * types of mappings, but checking pagecache after mapping lookup is
> > > > +	 * racy with writeback and reclaim.
> > > > +	 *
> > > > +	 * Therefore, check the entire range first and pass along whether any
> > > > +	 * part of it is dirty. If so and an underlying mapping warrants it,
> > > > +	 * flush the cache at that point. This trades off the occasional false
> > > > +	 * positive (and spurious flush, if the dirty data and mapping don't
> > > > +	 * happen to overlap) for simplicity in handling a relatively uncommon
> > > > +	 * situation.
> > > >  	 */
> > > > -	ret = filemap_write_and_wait_range(inode->i_mapping,
> > > > -			pos, pos + len - 1);
> > > > -	if (ret)
> > > > -		return ret;
> > > > +	range_dirty = filemap_range_needs_writeback(inode->i_mapping,
> > > > +					pos, pos + len - 1);
> > > >  
> > > >  	while ((ret = iomap_iter(&iter, ops)) > 0)
> > > > -		iter.processed = iomap_zero_iter(&iter, did_zero);
> > > > +		iter.processed = iomap_zero_iter(&iter, did_zero, &range_dirty);
> > > 
> > > Style nit: Could we do this flush-and-stale from the loop body instead
> > > of passing pointers around?  e.g.
> > > 
> > 
> > So FWIW, I had multiple other variations of this that used an
> > IOMAP_DIRTY_CACHE flag on the iomap to track dirty pagecache for
> > arbitrary operations. The flag could be set and cleared at the
> > appropriate points as expected (for ops that care).
> > 
> > To me, that's how I'd prefer to avoid just passing a pointer, but I
> > intentionally factored that out to avoid using a flag for something that
> > (for now) could be simplified to a local variable. OTOH, it is something
> > that might be useful for the iomap seek data/hole implementations down
> > the road.
> 
> <nod> We can always adjust again when we get there; for now a local
> variable sounds fine.
> 
> > I've played with that a bit, but also have been trying to avoid getting
> > too much into that rabbit hole for zero range. My thought was I'd
> > reintroduce it and replace the range_dirty thing if/when it proved
> > useful for multiple operations.
> > 
> > > static inline bool iomap_zero_need_flush(const struct iomap_iter *i)
> > > {
> > > 	const struct iomap *srcmap = iomap_iter_srcmap(i);
> > > 
> > > 	return srcmap->type == IOMAP_HOLE ||
> > > 	       srcmap->type == IOMAP_UNWRITTEN;
> > > }
> > 
> > The factoring looks mostly reasonable, but a couple things bug me that
> > I'd like to see if we can resolve..
> > 
> > One is that this doesn't really indicate whether a flush is needed,
> > because the dirty cache state is a critical part of that logic. I
> > suppose we could rename it (to what?), but it also seems a little odd to
> > have a helper just for mapping type checks.
> 
> I thought about passing range_dirty into iomap_zero_need_flush since
> it's a static inline function, but that just seemed unnecessary.
> 
> > > static inline int iomap_zero_iter_flush(struct iomap_iter *i)
> > > {
> > > 	struct address_space *mapping = i->inode->i_mapping;
> > > 	loff_t end = i->pos + i->len - 1;
> > > 
> > > 	i->iomap.flags |= IOMAP_F_STALE;
> > > 	return filemap_write_and_wait_range(mapping, i->pos, end);
> > > }
> > > 
> > > and then:
> > > 
> > > 	range_dirty = filemap_range_needs_writeback(...);
> > > 
> > > 	while ((ret = iomap_iter(&iter, ops)) > 0) {
> > > 		if (range_dirty && iomap_zero_need_flush(&iter)) {
> > > 			/*
> > > 			 * Zero range wants to skip pre-zeroed (i.e.
> > > 			 * unwritten) mappings, but...
> > > 			 */
> > > 			range_dirty = false;
> > > 			iter.processed = iomap_zero_iter_flush(&iter);
> > > 		} else {
> > > 			iter.processed = iomap_zero_iter(&iter, did_zero);
> > > 		}
> > 
> > The other is that the optimization logic is now split across multiple
> > functions. I.e., iomap_zero_iter() has a landmine if ever called without
> > doing the flush_and_stale() part first (a consideration if
> > iomap_truncate_page() were ever open coded, for example).
> 
> _zero_iter is a static function, let's hope nobody does that.  Though
> you're right, experience tells me that someone will try this
> eventually.
> 

Yeah, it's probably unlikely, but what gave me pause is that I already
had an open coded iomap_truncate_page() experiment (from the v1 thread)
lying around that does pretty much exactly this.
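
The hazard with that sort of open coding is basically the following
(hypothetical, not the actual v1 code, and assuming the split factoring
above):

	/* looks plausible, but never flushes dirty cache over unwritten */
	while ((ret = iomap_iter(&iter, ops)) > 0)
		iter.processed = iomap_zero_iter(&iter, did_zero);

i.e. nothing forces the flush-and-stale before the skip logic runs.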

> That said, I see the merit of having one complete loop body function
> that knows how to handle all iomap types, since the others do that.
> 
> > I wonder if a compromise might be to factor out the whole optimization
> > into a separate helper rather than just the flush part (first via a prep
> > patch), then the higher level loop ends up looking almost the same:
> > 
> > 	while ((ret = iomap_iter(&iter, ops)) > 0) {
> > 		/* special handling for already zeroed mappings */
> > 		if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> > 			iter.processed = iomap_zero_mapping_iter(&iter, &range_dirty);
> > 		else
> > 			iter.processed = iomap_zero_iter(&iter, did_zero);
> > 		}
> > 
> > That doesn't avoid passing the range_dirty pointer, but we just end up
> > passing that instead of did_zero. Also as noted above, it could still be
> > made to go away if the range_dirty check gets pushed down into the
> > iomap_iter() path for more general use.
> > 
> > Anyways those are just my thoughts. I'm of the mind that whatever
> > factoring we do here may have to change if Dave's batched folio
> > lookup/iteration idea pans out for fs' with validation support, so at
> > the end of the day I'll change this to look exactly like you wrote it if
> > it means the zeroing problem gets fixed. Thoughts or preference?
> 
> I'm ok with your original version now.
> 
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> 

Thanks. I'll post a v3 just with Dave's comment updates then.

Brian

> --D
> 
> 
> > Brian
> > 
> > > 	}
> > > 
> > > The logic looks correct and sensible. :)
> > > 
> > > --D
> > > 
> > > >  	return ret;
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(iomap_zero_range);
> > > > -- 
> > > > 2.45.0
> > > > 
> > > > 
> > > 
> > 
> > 
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

Thread overview: 13+ messages
2024-08-28 18:19 [PATCH v2 0/2] iomap: flush dirty cache over unwritten mappings on zero range Brian Foster
2024-08-28 18:19 ` [PATCH v2 1/2] iomap: fix handling of dirty folios over unwritten extents Brian Foster
2024-08-28 22:22   ` Darrick J. Wong
2024-08-29  5:43     ` Christoph Hellwig
2024-08-28 18:19 ` [PATCH v2 2/2] iomap: make zero range flush conditional on unwritten mappings Brian Foster
2024-08-28 22:44   ` Darrick J. Wong
2024-08-29  0:26     ` Dave Chinner
2024-08-29 15:04       ` Brian Foster
2024-08-29 15:03     ` Brian Foster
2024-08-29 17:29       ` Brian Foster
2024-08-29 21:34       ` Darrick J. Wong
2024-08-30 11:58         ` Brian Foster
2024-08-28 20:44 ` [PATCH v2 0/2] iomap: flush dirty cache over unwritten mappings on zero range Josef Bacik
